Generating a Date column in a dataframe - r

I have the following table of quarterly data and want to generate a new column of date type for each row.
Year,Quarter,Sales
2008,1,1.703
2008,2,0.717
2008,3,6.892
2008,4,4.363
2009,1,3.793
2009,2,5.208
2009,3,7.367
2009,4,8.737
2010,1,8.752
2010,2,8.398
This is what I tried:
quarters <- c('-03-31', '-06-30', '-09-30', '-12-31')
gen_date <- function(row) {
  year <- row[1]
  quarter <- row[2]
  date <- paste(toString(year), quarters[quarter], sep='')
  date <- as.Date((date), format="%Y-%m-%d")
  return(date)
}
df$Date <- apply(df, 1, gen_date)
However, the resulting column df$Date is not a date but a plain number:
Year Quarter Sales Date
1 2008 1 1.703 13969
2 2008 2 0.717 14060
3 2008 3 6.892 14152
4 2008 4 4.363 14244
5 2009 1 3.793 14334
6 2009 2 5.208 14425
7 2009 3 7.367 14517
8 2009 4 8.737 14609
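The numbers are the Date values with their class stripped: apply() simplifies the per-row results into a plain vector, which keeps only the underlying count of days since 1970-01-01. A minimal sketch of a vectorised alternative in base R, assuming df has the Year and Quarter columns shown above (the name quarter_ends is illustrative):

```r
# Rebuild a few rows of the question's data
df <- data.frame(Year = c(2008, 2008, 2008, 2008, 2009),
                 Quarter = c(1, 2, 3, 4, 1),
                 Sales = c(1.703, 0.717, 6.892, 4.363, 3.793))

# Month-day suffix of the last day of each quarter
quarter_ends <- c("-03-31", "-06-30", "-09-30", "-12-31")

# paste0() and as.Date() are vectorised, so no apply() is needed
# and the Date class is preserved
df$Date <- as.Date(paste0(df$Year, quarter_ends[df$Quarter]))

class(df$Date)
```

If the numeric column already exists, as.Date(df$Date, origin = "1970-01-01") should convert it back to dates.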

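Try with lubridate. Note that yq() returns the first day of each quarter rather than the last:
library(lubridate)
library(tibble)  # provides tibble(); it is not exported by lubridate
Year <- c(rep(2008, 4), rep(2009, 4), 2010, 2010)
Quarter <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2)
Sales <- c(1.7, 0.7, 6.9, 4.3, 3.79, 5.2, 7.3, 8.7, 8.7, 8.4)
df <- tibble(Year, Quarter, Sales)
df$Date <- yq(paste(df$Year, df$Quarter, sep = "-"))
df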
Year Quarter Sales Date
<dbl> <dbl> <dbl> <date>
1 2008 1.00 1.70 2008-01-01
2 2008 2.00 0.700 2008-04-01
3 2008 3.00 6.90 2008-07-01
4 2008 4.00 4.30 2008-10-01
5 2009 1.00 3.79 2009-01-01
6 2009 2.00 5.20 2009-04-01
7 2009 3.00 7.30 2009-07-01
8 2009 4.00 8.70 2009-10-01
9 2010 1.00 8.70 2010-01-01
10 2010 2.00 8.40 2010-04-01

Try this:
library(lubridate)
dfx <- read.table(text = "Year,Quarter,Sales
2008,1,1.703
2008,2,0.717
2008,3,6.892
2008,4,4.363
2009,1,3.793
2009,2,5.208
2009,3,7.367
2009,4,8.737
2010,1,8.752
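2010,2,8.398", header = TRUE, sep = ",")
# Map each quarter number to the month-day suffix of its last day
dfx$month <- factor(dfx$Quarter)
levels(dfx$month) <- c('-03-31', '-06-30', '-09-30', '-12-31')
dfx$month <- as.character(dfx$month)
# The suffix already starts with "-", so paste without an extra separator
dfx$date <- ymd(paste0(dfx$Year, dfx$month))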
HTH

Related

Creating averages across time periods

I'm a beginner in R. I have the data frame below (with more observations), in which each 'id' is observed in at most three years: 1991, 1999 and 2007.
I want to create a variable avg_ln_rd, by 'id', that averages the current 'ln_rd' with the 'ln_rd' from the previous survey year: with the 1991 value if the observation is from 1999, and with the 1999 value if the observation is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I have already dropped any 'id' that exists for only one of the three years.
My first thought was to create a standalone ln_rd variable for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables in an ifelse call:
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
Still a bit unclear what to do if all years are present in a group but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
         avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
         avg91 = ifelse(year == 1991, avg91, NA),
         avg99 = ifelse(year == 2007, avg99, NA)) %>%
  ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
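The row-shift idea from the question's EDIT can also be sketched in base R. This is a minimal sketch, assuming the previous row within an id is always the previous survey year (an id observed only in 1991 and 2007 would pair those two years); prev_ln_rd is an illustrative name, not a column from the original data:

```r
# A few rows of the question's data
df <- data.frame(id    = c(1013, 1013, 1013, 1021, 1021, 1021),
                 year  = c(1991, 1999, 2007, 1991, 1999, 2007),
                 ln_rd = c(3.51, 5.64, 4.26, 0.899, 0.791, 0.704))

# Sort so that each id's rows are in chronological order
df <- df[order(df$id, df$year), ]

# Shift ln_rd down by one row within each id; the first row of an id gets NA
df$prev_ln_rd <- ave(df$ln_rd, df$id,
                     FUN = function(x) c(NA, head(x, -1)))

# Average of the current and the previous observation
df$avg_ln_rd <- (df$ln_rd + df$prev_ln_rd) / 2
df
```

Because ave() applies the shift separately for each id, no value leaks from one id into the next.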

How to add percentile (/quantile) values to a column in dataframe

My data set has flow rate measurements of a river for every day of the year from 2009 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).
This is a sample of my data set:
> (chitt_brook_wylye_2)
# A tibble: 4,437 x 7
river year season month date flow_rate quality
<chr> <dbl> <chr> <chr> <dttm> <dbl> <chr>
1 chittern_brook 2009 Winter December 2009-12-01 00:00:00 0.059 Good
2 chittern_brook 2009 Winter December 2009-12-02 00:00:00 0.061 Good
3 chittern_brook 2009 Winter December 2009-12-03 00:00:00 0.064 Good
4 chittern_brook 2009 Winter December 2009-12-04 00:00:00 0.068 Good
5 chittern_brook 2009 Winter December 2009-12-05 00:00:00 0.076 Good
6 chittern_brook 2009 Winter December 2009-12-06 00:00:00 0.138 Good
7 chittern_brook 2009 Winter December 2009-12-07 00:00:00 0.592 Good
8 chittern_brook 2009 Winter December 2009-12-08 00:00:00 1.04 Good
9 chittern_brook 2009 Winter December 2009-12-09 00:00:00 1.46 Good
10 chittern_brook 2009 Winter December 2009-12-10 00:00:00 1.7 Good
# ... with 4,427 more rows
I want to find the 95th percentile, 5th percentile, median and mean of each season of every year, with each statistic in a separate column of a new dataframe.
For example:
> (df)
# A tibble: 49 x 2
season_label flow_rate_mean Q95 Q5 flow_rate_median
<chr> <dbl>
1 Winter 2009 0.453 3 2 4
2 Spring 2010 0.519 6 3 4
3 Summer 2010 0.0627 4 3 6
4 Autumn 2010 0.0415 6 2 6
5 Winter 2010 0.0622 8 3 3
6 Spring 2011 0.188 10 3 2
7 Summer 2011 0.0499 2 3 2
8 Autumn 2011 0.0383 2 2 1
9 Winter 2011 0.0461 5 2 7
10 Spring 2012 0.0925 3 2 8
# ... with 39 more rows
I currently have this code which creates the above dataframe with just the first two columns but I would like it to also include 95th percentile, 5th percentile and median. Is this feasible or will I need to do it separately and then combine it into one dataframe?
df <- chitt_brook_wylye_2 %>%
  dplyr::mutate(month = as.numeric(format(date,"%m")),
                year = as.numeric(format(date,"%Y")),
                season_id = (12*year + month) %/% 3) %>%
  dplyr::group_by(season_id) %>%
  dplyr::mutate(season_label = paste(season, min(year))) %>%
  dplyr::group_by(season_id,season_label) %>%
  dplyr::summarise(flow_rate = mean(flow_rate))
Reproducible example and code:
library(dplyr)  # provides %>%
date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)
df <- data.frame(date,season,var) %>% # creating the dataframe
  dplyr::mutate(month = as.numeric(format(date,"%m")),
                year = as.numeric(format(date,"%Y")),
                season_id = (12*year + month) %/% 3) %>% # generating an identifier for every season that exists in the data
  dplyr::group_by(season_id) %>% # grouping by the id
  dplyr::mutate(season_label = paste(min(year),season)) %>%
  dplyr::group_by(season_id,season_label) %>% # season_label is kept in the grouping so it survives the summarise
  dplyr::summarise(var = mean(var)) # computing the mean
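summarise() accepts several named expressions at once, so the percentiles and the median can be computed in the same call as the mean, with no need to combine separate data frames afterwards. A minimal sketch on made-up flow values, assuming dplyr 1.0 or later for the .groups argument (season_stats and the numbers are illustrative):

```r
library(dplyr)

# Made-up daily values covering one winter and one spring
date <- as.Date(c("2009-12-01", "2010-01-01", "2010-02-01",
                  "2010-03-01", "2010-04-01", "2010-05-01"))
season <- c("Winter", "Winter", "Winter", "Spring", "Spring", "Spring")
flow_rate <- c(1, 2, 3, 5, 6, 10)

season_stats <- data.frame(date, season, flow_rate) %>%
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y")),
         season_id = (12 * year + month) %/% 3) %>%
  group_by(season_id) %>%
  mutate(season_label = paste(season, min(year))) %>%
  group_by(season_id, season_label) %>%
  # One output column per statistic; unname() drops the "95%" style names
  summarise(flow_rate_mean   = mean(flow_rate),
            Q95              = unname(quantile(flow_rate, 0.95)),
            Q5               = unname(quantile(flow_rate, 0.05)),
            flow_rate_median = median(flow_rate),
            .groups = "drop")

season_stats
```

quantile() uses its default interpolation method here; pass a different type argument if another percentile definition is required.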

Computing lags but grouping by two categories with dplyr

I want to create var3 using a lag (dplyr package), but the lag should be consistent with the year and the ID: the lagged value should belong to the same ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
library(dplyr)
MyData <-
  data.frame(YEAR = rep(seq(2010,2014),5),
             ID = rep(1:5, each=5),
             var1 = rnorm(n=25,mean=10,sd=3),
             var2 = rnorm(n=25,mean=1,sd=1)
  )
MyData %>%
  group_by(YEAR,ID) %>%
  summarise(var3 = var1 - dplyr::lag(var2)
  )
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have changed your summarise into a mutate for now.)
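Grouping by both YEAR and ID leaves exactly one row per group in this panel, so lag() has no previous row to return and every result is NA. A small sketch, assuming one row per ID and YEAR, that also passes order_by = YEAR so the lag stays correct when rows are not sorted (the object names and data values are made up):

```r
library(dplyr)

panel <- data.frame(YEAR = rep(2010:2012, 2),
                    ID   = rep(1:2, each = 3),
                    var1 = c(10, 11, 12, 20, 21, 22),
                    var2 = c(1, 2, 3, 4, 5, 6))

# One row per (YEAR, ID) group: lag() always returns NA
by_both <- panel %>%
  group_by(YEAR, ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  ungroup()

# Shuffle the rows to show that order_by does not rely on row order
shuffled <- panel[c(3, 1, 2, 6, 4, 5), ]

# Group by ID only and lag along YEAR within each ID
by_id <- shuffled %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup() %>%
  arrange(ID, YEAR)

by_id
```

In an unbalanced panel the previous row of an ID may be more than one year back, so a check on YEAR - lag(YEAR) may be needed before trusting var3.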

How to add a set of values to an existing data frame?

I have a data frame containing three columns: ID, year, growth. The last one contains the growth in millimeters for each year.
Example:
df <- data.frame(ID=rep(c("CHC01", "CHC02", "CHC03"), each=4),
year=rep(2015:2018, 3),
growth=c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
In another data frame, I have three other columns: ID, missing_length, missing_years. Missing length is the estimated length missed in the measurements. Missing years is the number of missing years in df.
estimate <- data.frame(ID=c("CHC01", "CHC02", "CHC03"),
missing_length=c(1.0, 4.4, 3.5),
missing_years=c(1,3,2))
For calculating the growth for each missing year, I tried:
missing <- rep(estimate$missing_length / estimate$missing_years, estimate$missing_years)
Does anyone have any idea of how to deal with this problem?
Thank you very much!
We can do a join and then replace the NA with the calculated value:
library(dplyr)
df %>%
  left_join(estimate, by = "ID") %>%
  group_by(ID) %>%
  transmute(year, growth = replace(growth, is.na(growth),
                                   missing_length[1]/missing_years[1]))
# A tibble: 12 x 3
# Groups: ID [3]
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or with coalesce:
df %>%
  mutate(growth = coalesce(growth, with(estimate,
    setNames(missing_length/missing_years, ID))[as.character(ID)])) %>%
  as_tibble
# A tibble: 12 x 3
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or a similar option in data.table:
library(data.table)
setDT(df)[estimate, growth := fcoalesce(growth,
          missing_length/missing_years), on = .(ID)]
Base R solution. Supposing the tables "df" and "estimate" are sorted by ID (ascending CHC) and we keep your "missing" object, this should work:
df$growth <- replace(df$growth, which(is.na(df$growth)), missing)
Output:
ID year growth
1 CHC01 2015 1.000000
2 CHC01 2016 2.300000
3 CHC01 2017 2.100000
4 CHC01 2018 3.000000
5 CHC02 2015 1.466667
6 CHC02 2016 1.466667
7 CHC02 2017 1.466667
8 CHC02 2018 3.200000
9 CHC03 2015 1.750000
10 CHC03 2016 1.750000
11 CHC03 2017 2.100000
12 CHC03 2018 1.200000
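If the two tables cannot be assumed to be sorted the same way, the per-year estimate can be looked up by ID instead of by position. A minimal base R sketch using the df and estimate objects from the question (per_year, idx and na_rows are illustrative names):

```r
df <- data.frame(ID = rep(c("CHC01", "CHC02", "CHC03"), each = 4),
                 year = rep(2015:2018, 3),
                 growth = c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
estimate <- data.frame(ID = c("CHC01", "CHC02", "CHC03"),
                       missing_length = c(1.0, 4.4, 3.5),
                       missing_years = c(1, 3, 2))

# Estimated growth per missing year, one value per row of estimate
per_year <- estimate$missing_length / estimate$missing_years

# For every row of df, the position of its ID in estimate
idx <- match(df$ID, estimate$ID)

# Fill only the missing measurements with the estimate for that ID
na_rows <- is.na(df$growth)
df$growth[na_rows] <- per_year[idx[na_rows]]
df
```

An ID that is absent from estimate gets NA from match(), so its missing values simply stay NA.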

Calculate the percent occurrence of a variable in multiple groups

Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35), year = rep(1980:2014,times = 1000),month.id = sample(c(1:4,8:10,12),35*1000,replace = T))
The data frame has 1000 locations x 35 years of data for a variable called month.id, which is basically the month of a year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7 and 11 are missing, I just want to add rows with NA for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
Solution using dplyr and tidyr:
# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
df %>%
  group_by(year, month.id) %>%
  # Count occurrences per year & month
  summarise(n = n()) %>%
  # Get percent per month (the yearly total is calculated with sum(n))
  mutate(percent = n / sum(n) * 100) %>%
  # Drop the remaining grouping by year so complete() can expand both columns
  ungroup() %>%
  # Fill in missing months (omit the fill argument to get NA instead of 0)
  complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
  select(year, month.id, percent)
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
A base R solution:
tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
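To cover all years at once and get NA rather than 0 for months that never occur, as in the desired table, one option in base R is prop.table() over a year-by-month table. A minimal sketch using the sample data from the question; treating a zero share as NA is a choice made here to match the requested output, and counts, pct and out are illustrative names:

```r
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

# Year-by-month counts; the factor levels force all 12 months to appear
counts <- table(df$year, factor(df$month.id, levels = 1:12))

# Row-wise shares (margin = 1), i.e. percent of each month within a year
pct <- prop.table(counts, margin = 1) * 100

# Long format with one row per year and month
out <- as.data.frame(pct, stringsAsFactors = FALSE)
names(out) <- c("year", "month", "percent")
out$year <- as.integer(out$year)
out$month <- as.integer(out$month)

# Months that never occur in a year are reported as NA instead of 0
out$percent[out$percent == 0] <- NA
out <- out[order(out$year, out$month), ]
head(out, 12)
```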
