I have two questions in one. I have 20 data frames, one per year (from 2000 to 2020). They all have the same columns. 1) I want to merge them on matching values of a list of variables (columns), so that I can construct a panel. 2) When merging, I also want to rename the columns by adding a suffix indicating the year.
For example, take three data frames:
df1
year_sample birth_date country work_establishment Wage
2014 1995 US X2134 1700
2014 1996 US X26 1232
2014 1992 CANADA X26 2553
2014 1990 FRANCE X4T346 6574
2014 1983 BELGIUM X2E43 1706
2014 1975 US X2134 1000
2014 1969 CHINA XXZT55 996
df2
year_sample birth_date country work_establishment Wage
2015 1995 US X2134 1756
2015 1996 US X26 1230
2015 1992 CANADA X26 2700
2015 1990 FRANCE X4T346 6574
2015 1975 US X2134 1000
2015 1979 GERMANY X35555 2435
df3
year_sample birth_date country work_establishment Wage
2016 1995 US X2134 1750
2016 1996 US X26 1032
2016 1992 CANADA X26 2353
2016 1990 FRANCE X4T346 6574
2016 1955 MALI X2244 1000
2016 1979 GERMANY X35555 2435
If observations have identical values for c(birth_date, country, work_establishment), I will consider them to be the same person. I therefore want:
df_final
id birth_date country work_establishment Wage_2014 Wage_2015 Wage_2016
1 1995 US X2134 1700 1756 1750
2 1996 US X26 1232 1230 1032
3 1992 CANADA X26 2553 2700 2353
4 1990 FRANCE X4T346 6574 6574 6574
I know that if I had just two data frames, I could do:
df_final <- transform(merge(df1,df2, by=c("birth_date", "country", "work_establishment"), suffixes=c("_2014", "_2015")))
But I can't manage to do it for several dataframes at once.
Thank you!
You can put all the data frames in a list.
list_df <- mget(paste0('df', 1:3))
#OR
#list_df <- list(df1, df2, df3)
Then, in each data frame, add a suffix to the 'Wage' column taken from its year_sample value and drop the year column. Finally, use Reduce to merge the data frames into one.
result <- Reduce(
  function(x, y) merge(x, y, by = c("birth_date", "country", "work_establishment")),
  lapply(list_df, function(x) {
    # rename column 5 (Wage) to Wage_<year>, then drop column 1 (year_sample)
    names(x)[5] <- paste('Wage', x$year_sample[1], sep = '_')
    x[-1]
  })
)
result
# birth_date country work_establishment Wage_2014 Wage_2015 Wage_2016
#1 1990 FRANCE X4T346 6574 6574 6574
#2 1992 CANADA X26 2553 2700 2353
#3 1995 US X2134 1700 1756 1750
#4 1996 US X26 1232 1230 1032
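The same idea can be sketched so that it does not depend on column positions and also adds the id column from the desired output. This is only an illustration under assumptions: df_a, df_b, keys and tag_wage are hypothetical names, and the two small data frames are cut-down versions of the sample data.

```r
# Hypothetical miniature inputs, one data frame per year.
df_a <- data.frame(year_sample = 2014, birth_date = c(1995, 1996, 1969),
                   country = c("US", "US", "CHINA"),
                   work_establishment = c("X2134", "X26", "XXZT55"),
                   Wage = c(1700, 1232, 996))
df_b <- data.frame(year_sample = 2015, birth_date = c(1995, 1996, 1979),
                   country = c("US", "US", "GERMANY"),
                   work_establishment = c("X2134", "X26", "X35555"),
                   Wage = c(1756, 1230, 2435))

# Columns that identify a person (assumed to match exactly across years).
keys <- c("birth_date", "country", "work_establishment")

# Rename Wage to Wage_<year> by name, then drop year_sample by name.
tag_wage <- function(d) {
  names(d)[names(d) == "Wage"] <- paste0("Wage_", d$year_sample[1])
  d[setdiff(names(d), "year_sample")]
}

# Successive inner joins: only people present in every year survive.
panel <- Reduce(function(x, y) merge(x, y, by = keys),
                lapply(list(df_a, df_b), tag_wage))
panel$id <- seq_len(nrow(panel))
panel
```

Because merge() defaults to an inner join, anyone missing from a single year is dropped, which matches the desired df_final. Passing all = TRUE to merge() would instead keep such people with NA wages for the missing years.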
I have a database with sales value for individual firms that belong to different industries.
In the example dataset below:
set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
year sale ind
1: 1980 114 A
2: 1980 102 A
3: 1980 130 B
4: 1980 113 B
5: 1981 136 A
6: 1981 148 A
7: 1981 141 B
8: 1981 142 B
9: 1982 124 A
10: 1982 125 A
11: 1982 104 A
12: 1982 126 B
13: 1983 108 A
14: 1983 128 A
15: 1983 140 B
16: 1983 127 B
17: 1984 134 A
18: 1984 107 A
19: 1984 106 A
20: 1984 146 B
The column "ind" represents industry and I have omitted the firm identifiers (no use in this example).
I want an average defined as follows:
For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.
For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.
If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.
If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.
The desired output, thus, will be as follows:
year sale ind mymean
1: 1980 130 B NA
2: 1980 114 A NA
3: 1980 113 B NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 126 B 131.5000
12: 1982 104 A 125.0000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 146 B 135.2000
20: 1984 106 A 124.7143
A data.table solution is much preferred for fast implementation.
Many thanks in advance.
I am not very good with data.table. Here is a tidyverse solution, if that works for you or if you can translate it to data.table:
library(tidyverse)
df %>% group_by(ind, year) %>%
summarise(ds = sum(sale),
dn = n()) %>%
mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
) %>% select(ind, year, mymean = ds) %>%
right_join(df, by = c("ind", "year"))
`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
ind year mymean sale
<chr> <int> <dbl> <int>
1 A 1980 NA 114
2 A 1980 NA 102
3 A 1981 NA 136
4 A 1981 NA 148
5 A 1982 125 124
6 A 1982 125 125
7 A 1982 125 104
8 A 1983 122. 108
9 A 1983 122. 128
10 A 1984 125. 134
11 A 1984 125. 107
12 A 1984 125. 106
13 B 1980 NA 130
14 B 1980 NA 113
15 B 1981 NA 141
16 B 1981 NA 142
17 B 1982 132. 126
18 B 1983 130. 140
19 B 1983 130. 127
20 B 1984 135. 146
You can use zoo's rollapply function to perform this rolling calculation. Note that there are dedicated functions to calculate rolling mean like frollmean in data.table and rollmean in zoo but they lack the argument partial = TRUE present in rollapply. partial = TRUE is useful here since you want to calculate the mean even if the window size is less than 3.
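We can first calculate the mean of sale for each ind and year, then compute a rolling mean of those yearly means with a window size of 3, and finally join the result with the original data frame to get all of its rows back. Note that averaging yearly means equals the average over all firms only when every year in the window has the same number of firms.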
library(data.table)
library(zoo)
df1 <- df[, .(sale = mean(sale)), .(ind, year)]
df2 <- df1[, my_mean := shift(rollapplyr(sale, 3, function(x)
if(length(x) > 1) mean(x, na.rm = TRUE) else NA, partial = TRUE)), ind]
df[df2, on = .(ind, year)]
This can be written using dplyr as :
library(dplyr)
df %>%
group_by(ind, year) %>%
summarise(sale = mean(sale)) %>%
mutate(avg_mean = lag(rollapplyr(sale, 3, partial = TRUE, function(x)
if(length(x) > 1) mean(x, na.rm = TRUE) else NA))) %>%
left_join(df, by = c('ind', 'year'))
Ronak's answer computes the mean of the previous yearly means. A more general approach is to take the mean of all previous values, by rolling the yearly sums and counts separately and then dividing. A data.table solution along those lines is:
library(data.table)
library(roll)
df1 <- df[, .(sum_1 = sum(sale), n=length(sale)), .(ind, year)]
df1[,`:=`(
my_sum = roll_sum(shift(sum_1),3,min_obs = 2),
my_n = roll_sum(shift(n),3,min_obs = 2)
),by=.(ind)]
df1[,`:=`(my_mean=(my_sum/my_n))]
> df[df1[,!c("sum_1","n","my_sum","my_n")] ,on = .(ind, year)]
year sale ind my_mean
1: 1980 130 B NA
2: 1980 113 B NA
3: 1980 114 A NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 104 A 125.0000
12: 1982 126 B 131.5000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 106 A 124.7143
20: 1984 146 B 135.2000
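To see why the sums and counts are rolled separately, here is a small check for industry A in 1983, using the 1980 to 1982 sale values from the example data. The names sales_A, pooled and mean_of_means are only illustrative.

```r
# Industry A sales for the three years preceding 1983 (from the example data).
sales_A <- list(`1980` = c(114, 102),
                `1981` = c(136, 148),
                `1982` = c(124, 125, 104))

# Pooled average: total of all sales divided by the number of firm-years.
pooled <- sum(unlist(sales_A)) / length(unlist(sales_A))

# Average of the yearly averages: each year gets equal weight.
mean_of_means <- mean(sapply(sales_A, mean))

round(c(pooled = pooled, mean_of_means = mean_of_means), 4)
#        pooled mean_of_means
#      121.8571      122.5556
```

The pooled value matches the 121.8571 in the desired output. The two numbers differ because 1982 has three firms while 1980 and 1981 have two each.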
I have a data frame called "data" with the columns "Date", "Month", "Discharge" and "station", and another data frame called "perc" with the columns "Month", "W1_Percentile" and "B1_Percentile". W1_Percentile and B1_Percentile hold the monthly percentile values for each of the gauging stations. I want my final output to have the same columns as data, plus an additional "Percentile" column containing the percentile value for the respective month and gauging station, looked up from perc. What steps should I follow?
Here is the sample of input data:
date <- as.Date(c('1950-03-12','1954-03-23','1991-06-27','1997-09-04','1991-06-27','1987-05-06','1987-05-29','1856-07-08','1993-06-04', '2001-09-19','2001-05-06','2001-05-27'))
month <- c('Mar','Mar','Jun','Sep','Jun','May','May','Jul','Jun','Sep','May','May')
disch <- c(125,1535,1654,154,4654,453,1654,145,423,433,438,6426)
station <- c('W1','W1','W1','W1','W1','W1','B1','B1','B1','B1','B1','B1')
data <- data.frame("Date"= date, "Month" = month,"Discharge"=disch,"station"=station)
Date Month Discharge station
1 1950-03-12 Mar 125 W1
2 1954-03-23 Mar 1535 W1
3 1991-06-27 Jun 1654 W1
4 1997-09-04 Sep 154 W1
5 1991-06-27 Jun 4654 W1
6 1987-05-06 May 453 W1
7 1987-05-29 May 1654 B1
8 1856-07-08 Jul 145 B1
9 1993-06-04 Jun 423 B1
10 2001-09-19 Sep 433 B1
11 2001-05-06 May 438 B1
12 2001-05-27 May 6426 B1
Month <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
W1 <- c(106,313,531.40,164.10,40,23.39,18.30,24,16,16,12,34)
B1 <- c(1330,1550,1948,1880,1260,853.15,680.15,486.10,503,625,738,1070)
perc <- data.frame("Month"=Month,"W1_Percentile"=W1,"B1_Percentile"=B1)
Month W1_Percentile B1_Percentile
1 Jan 106.00 1330.00
2 Feb 313.00 1550.00
3 Mar 531.40 1948.00
4 Apr 164.10 1880.00
5 May 40.00 1260.00
6 Jun 23.39 853.15
7 Jul 18.30 680.15
8 Aug 24.00 486.10
9 Sep 16.00 503.00
10 Oct 16.00 625.00
11 Nov 12.00 738.00
12 Dec 34.00 1070.00
This is how I want the final output to look:
Date Month Discharge station Percentile
1 1950-03-12 Mar 125 W1 531.40
2 1954-03-23 Mar 1535 W1 531.40
3 1991-06-27 Jun 1654 W1 23.39
4 1997-09-04 Sep 154 W1 16.00
5 1991-06-27 Jun 4654 W1 23.39
6 1987-05-06 May 453 W1 40.00
7 1987-05-29 May 1654 B1 1260.00
8 1856-07-08 Jul 145 B1 680.15
9 1993-06-04 Jun 423 B1 853.15
10 2001-09-19 Sep 433 B1 503.00
11 2001-05-06 May 438 B1 1260.00
12 2001-05-27 May 6426 B1 1260.00
We need to first convert your perc data into a long format so that we have the columns we want to add to data, then it's a simple join:
library(tidyr)
library(dplyr)
# make the column names the same as the values in data
names(perc)[2:3] = c("W1", "B1")
# convert to long format
perc_long = gather(perc, key = "station", value = "percentile", W1, B1)
# join
left_join(data, perc_long)
# Joining, by = c("Month", "station")
# Date Month Discharge station percentile
# 1 1950-03-12 Mar 125 W1 531.40
# 2 1954-03-23 Mar 1535 W1 531.40
# 3 1991-06-27 Jun 1654 W1 23.39
# 4 1997-09-04 Sep 154 W1 16.00
# 5 1991-06-27 Jun 4654 W1 23.39
# 6 1987-05-06 May 453 W1 40.00
# 7 1987-05-29 May 1654 B1 1260.00
# 8 1856-07-08 Jul 145 B1 680.15
# 9 1993-06-04 Jun 423 B1 853.15
# 10 2001-09-19 Sep 433 B1 503.00
# 11 2001-05-06 May 438 B1 1260.00
# 12 2001-05-27 May 6426 B1 1260.00
There are many ways to do these operations; it's essentially a combination of two R-FAQs. For additional reference, see:
Reshaping data.frame from wide to long format
How to join (merge) data frames (inner, outer, left, right)
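As one of those other ways, here is a base R sketch that skips the reshape and looks each value up directly by (month, station). It assumes the percentile columns are always named <station>_Percentile. The lookup matrix is a hypothetical helper, and the inputs are abbreviated from the sample data.

```r
# Abbreviated inputs; stringsAsFactors = FALSE keeps Month/station as character.
data <- data.frame(
  Month   = c("Mar", "Jun", "May", "Jul"),
  station = c("W1", "W1", "B1", "B1"),
  stringsAsFactors = FALSE
)
perc <- data.frame(
  Month = c("Mar", "May", "Jun", "Jul"),
  W1_Percentile = c(531.40, 40, 23.39, 18.30),
  B1_Percentile = c(1948, 1260, 853.15, 680.15),
  stringsAsFactors = FALSE
)

# Numeric matrix with months as row names and stations as column names.
lookup <- as.matrix(perc[, c("W1_Percentile", "B1_Percentile")])
rownames(lookup) <- perc$Month

# A two-column character matrix indexes by (row name, column name) pairs.
data$Percentile <- lookup[cbind(data$Month, paste0(data$station, "_Percentile"))]
data
```

With many stations the join on long data is usually easier to read. The matrix lookup is mainly handy when perc is small and already in wide form.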
I have climatic data that were collected over a whole year along an altitude gradient. They are shaped like this:
clim <- read.table(text="alti year month week day meanTemp maxTemp minTemp
350 2011 aug. 31 213 10 14 6
350 2011 aug. 31 214 12 18 6
350 2011 aug. 31 215 10 11 9
550 2011 aug. 31 213 8 10 6
550 2011 aug. 31 214 10 12 8
550 2011 aug. 31 215 8 9 7
350 2011 sep. 31 244 9 10 8
350 2011 sep. 31 245 11 12 10
350 2011 sep. 31 246 10 11 9
550 2011 sep. 31 244 7.5 9 6
550 2011 sep. 31 245 8 10 6
550 2011 sep. 31 246 8.5 9 8", header=TRUE)
I am trying to reshape this data so that there is only one row per altitude, and to calculate the mean values for each month and for the whole year. It would be great if it could be shaped like this:
alti mean_year(meanTemp) mean_year(maxTemp) mean_aug.(meanTemp) mean_aug.(maxTemp) mean_sep.(meanTemp) [...]
350 10.333 12.667 10.667 14.3 10 ...
550 8.333 9.833 8.667 10.333 7.766 ...
Any ideas on how to perform this reshaping and calculation?
You can use data.table and dcast:
library(data.table)
setDT(clim)
merge(
clim[, list("mean_temp_mean_year" = mean(meanTemp), "max_temp_mean_year" = mean(maxTemp)), by = alti]
,
dcast(clim[, list("mean_temp_mean" = mean(meanTemp), "max_temp_mean" = mean(maxTemp)), by = c("alti","month")], alti ~ month, value.var = c("mean_temp_mean","max_temp_mean"))
,
by = "alti")
I've changed the names of some of the variables, and the column order is not exactly what you asked for, but the columns can be reordered or renamed afterwards.
To get the means of the months or years, you can use aggregate followed by reshape.
The two aggregates can be computed separately, and then merge puts them together:
mon <- aggregate(cbind(meanTemp, maxTemp) ~ month + alti, data=clim, FUN=mean)
mon.wide <- reshape(mon, direction='wide', timevar='month', idvar='alti')
yr <- aggregate(cbind(meanTemp, maxTemp) ~ year + alti, data=clim, FUN=mean)
yr.wide <- reshape(yr, direction='wide', timevar='year', idvar='alti')
Each of these .wide sets has part of the data that you want. The only common column is alti, so we can use the merge defaults:
merge(mon.wide, yr.wide)
## alti meanTemp.aug. maxTemp.aug. meanTemp.sep. maxTemp.sep. meanTemp.2011 maxTemp.2011
## 1 350 10.666667 14.33333 10 11.000000 10.333333 12.666667
## 2 550 8.666667 10.33333 8 9.333333 8.333333 9.833333
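For a single variable, the same yearly and monthly means can be cross-checked with tapply. This is only a sketch: clim_small is a hypothetical cut-down copy of the sample data that keeps just alti, month and meanTemp.

```r
# Cut-down copy of the sample data (meanTemp only).
clim_small <- data.frame(
  alti     = rep(c(350, 550, 350, 550), each = 3),
  month    = rep(c("aug.", "aug.", "sep.", "sep."), each = 3),
  meanTemp = c(10, 12, 10,  8, 10, 8,  9, 11, 10,  7.5, 8, 8.5)
)

# One row per altitude: yearly mean first, then one column per month.
by_month <- with(clim_small, tapply(meanTemp, list(alti, month), mean))
by_year  <- with(clim_small, tapply(meanTemp, alti, mean))
wide <- cbind(year = by_year, by_month)
round(wide, 3)
```

The result is a numeric matrix with the altitudes as row names, which agrees with the meanTemp columns of the merged output above.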
Here's another variation of the data.table solution. It casts several value.var columns in one dcast call, which required the development version (v1.9.5) at the time of writing:
require(data.table) # v1.9.5+
setDT(clim)
form = paste("alti", c("year", "month"), sep=" ~ ")
val = c("meanTemp", "maxTemp")
ans = lapply(form, function(x) dcast(clim, x, mean, value.var = val))
Reduce(function(x, y) x[y, on="alti"], ans)
# alti meanTemp_mean_2011 maxTemp_mean_2011 meanTemp_mean_aug. meanTemp_mean_sep. maxTemp_mean_aug. maxTemp_mean_sep.
# 1: 350 10.333333 12.666667 10.666667 10 14.33333 11.000000
# 2: 550 8.333333 9.833333 8.666667 8 10.33333 9.333333
I've looked all over the place, but I can't find where this question has been asked before.
What is a clean way to get this data into a proper zoo series? This version is a copy/paste to make this post easier, but it will always come in the following table form (from a text file). My read.zoo() statement reads the Year as the index but the quarters (Qtr1, Qtr2, etc) are read as column names. I've been trying to figure out a non-garbage way to read the columns as the "quarter" part of the index, but it's sloppy (too sloppy to post). I'm guessing this problem has already been solved, but I can't find it.
> texinp <- "
+ Year Qtr1 Qtr2 Qtr3 Qtr4
+ 1992 566 443 329 341
+ 1993 344 212 133 112
+ 1994 252 252 199 207"
> z <- read.zoo(textConnection(texinp), header=TRUE)
> z
From the as.yearqtr() documentation, the target would look like:
1992 Q1 1992 Q2 1992 Q3 1992 Q4 1993 Q1 1993 Q2 1993 Q3 1993 Q4
566 443 329 341 344 212 133 112
1994 Q1 1994 Q2 1994 Q3 1994 Q4
252 252 199 207
Read in the data using read.zoo and then convert it to a zooreg object with yearqtr time index:
texinp <- "Year Qtr1 Qtr2 Qtr3 Qtr4
1992 566 443 329 341
1993 344 212 133 112
1994 252 252 199 207"
library(zoo)
z <- read.zoo(text = texinp, header=TRUE)
zz <- zooreg(c(t(z)), start = yearqtr(start(z)), freq = 4)
The result looks like this:
> zz
1992 Q1 1992 Q2 1992 Q3 1992 Q4 1993 Q1 1993 Q2 1993 Q3 1993 Q4 1994 Q1 1994 Q2 1994 Q3 1994 Q4
566 443 329 341 344 212 133 112 252 252 199 207
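The key step is c(t(z)): transposing first makes the values come out row by row, that is, in chronological order. The same flattening can be sketched in base R with a plain ts object instead of zoo. This swaps in a different class, not the zooreg result above, and the names zt and quarterly are only illustrative.

```r
texinp <- "Year Qtr1 Qtr2 Qtr3 Qtr4
1992 566 443 329 341
1993 344 212 133 112
1994 252 252 199 207"

# Read the table as an ordinary data frame.
zt <- read.table(text = texinp, header = TRUE)

# Drop the Year column, transpose so that c() walks each year's quarters in
# order, and attach a quarterly time index starting in Q1 of the first year.
quarterly <- ts(c(t(as.matrix(zt[-1]))), start = c(zt$Year[1], 1), frequency = 4)
quarterly
```

This relies on the rows being complete, consecutive years in ascending order. A file with gaps would need the long-format approach instead.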
read.zoo assumes your data has at most one time-index column, so you have to process this yourself. First read it in using read.table
zt <- read.table( textConnection( texinp ), header = TRUE)
then convert it to a "long table" using the melt function from the reshape package:
require(reshape)
zt.m <- melt( zt, id = 'Year', variable_name = 'Qtr')
> zt.m
Year Qtr value
1 1992 Qtr1 566
2 1993 Qtr1 344
3 1994 Qtr1 252
4 1992 Qtr2 443
5 1993 Qtr2 212
6 1994 Qtr2 252
7 1992 Qtr3 329
8 1993 Qtr3 133
9 1994 Qtr3 199
10 1992 Qtr4 341
11 1993 Qtr4 112
12 1994 Qtr4 207
and finally create your desired zoo object:
z <- with( zt.m, zoo( value, as.yearqtr(paste(Year, Qtr), format = '%Y Qtr%q')))
> z
1992 Q1 1992 Q2 1992 Q3 1992 Q4 1993 Q1 1993 Q2 1993 Q3 1993 Q4 1994 Q1 1994 Q2
566 443 329 341 344 212 133 112 252 252
1994 Q3 1994 Q4
199 207