apply hpfilter to grouped variables with NAs using dplyr - r

I am trying to apply the hpfilter to one of the variables in my dataset that has a panel structure (id + year) and then add the filtered series to my dataset. It works perfectly fine as long as I do not have any NAs in one of the variables, but it yields an error if one of the ids has missing values. The reason for this is that the hpfilter function does not work with NAs (it yields only NAs).
Here's a reproducible example:
df1 <- read.table(text="country year X1 X2 W
A 1990 10 20 40
A 1991 12 15 NA
A 1992 14 17 41
A 1993 17 NA 44
B 1990 20 NA 45
B 1991 NA 13 61
B 1992 12 12 67
B 1993 14 10 68
C 1990 10 20 70
C 1991 11 14 50
C 1992 12 15 NA
C 1993 14 16 NA
D 1990 20 17 80
D 1991 16 20 91
D 1992 15 21 70
D 1993 14 22 69
", header=TRUE, stringsAsFactors=FALSE)
My approach was to use the dplyr group_by function to apply the hpfilter by country to variable X1:
library(mFilter)
library(plm)
# Organizing the Data as a Panel
df1 <- pdata.frame(df1, index = c("country","year"))
# Apply hpfilter to X1 and add trend to the sample
df1 <- df1 %>% group_by(country) %>% mutate(X1_trend = mFilter::hpfilter(na.exclude(X1), type = "lambda", freq = 6.25)$trend)
However, this yields the following error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c(11.1695436493374, 12.7688604220353, :
replacement has 15 rows, data has 16
The error occurs because the filtered series is shortened after applying the hp filter (by the NAs).
Since I have a large dataset with many countries it would be really great if there was a workaround, to maybe ignore the NAs when passing the series to the hpfilter, but not removing them. Thank you!

Here is a way to drop NA and calculate trend:
df2 <- df1 %>% group_by(country) %>%
filter(!is.na(X1)) %>%
pdata.frame(., index = c("country","year")) %>%
mutate(X1_trend = mFilter::hpfilter(X1, type = "lambda", freq = 6.25)$trend)
> df2
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1992 12 12 67 14.38218
7 B 1993 14 10 68 13.45663
8 C 1990 10 20 70 12.75429
9 C 1991 11 14 50 12.71858
10 C 1992 12 15 NA 13.35221
11 C 1993 14 16 NA 14.38293
12 D 1990 20 17 80 15.32211
13 D 1991 16 20 91 15.61990
14 D 1992 15 21 70 15.47486
15 D 1993 14 22 69 15.14639
EDIT: To keep missing values in the final output, we do one more operation:
df3 <- merge(df1,df2, by = colnames(df1),all.x = T)
> df3
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1991 NA 13 61 NA
7 B 1992 12 12 67 14.38218
8 B 1993 14 10 68 13.45663
9 C 1990 10 20 70 12.75429
10 C 1991 11 14 50 12.71858
11 C 1992 12 15 NA 13.35221
12 C 1993 14 16 NA 14.38293
13 D 1990 20 17 80 15.32211
14 D 1991 16 20 91 15.61990
15 D 1992 15 21 70 15.47486
16 D 1993 14 22 69 15.14639

Related

reordering database by loop in R - help me

I am trying to reorder a database by loop but it does not work for me. There is too much data to do it one by one.
fact <- rep (1:2 , each = 3)
t1 <- c(2006,2007,2008,2000,2001,2002)
t2 <- c(2007,2008,2009,2001,2002,2004)
var1 <- c(56,52,44,10,32,41)
var2 <- c(52,44,50,32,41,23)
db1 <- as.data.frame(cbind(fact, t1, t2, var1, var2))
db1
fact t1 t2 var1 var2
1 1 2006 2007 56 52
2 1 2007 2008 52 44
3 1 2008 2009 44 50
4 2 2000 2001 10 32
5 2 2001 2002 32 41
6 2 2002 2004 41 23
I need it to stay this way:
factor <- rep (1:2 , each = 4)
t <- c(2006,2007,2008,2009,2000,2001,2002,2004)
var <- c(56,52,44,50,10,32,41,23)
db2 <- as.data.frame(cbind(factor, t, var))
db2
factor t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 1 2009 50
5 2 2000 10
6 2 2001 32
7 2 2002 41
8 2 2004 23
very thanks
dat1 <- as.data.frame(cbind(fact, t1, var1))
names(dat1) <- c("fact", "t", "var")
dat2 <- as.data.frame(cbind(fact, t2, var2))
names(dat2) <- c("fact", "t", "var")
rbind.data.frame(dat1, dat2)
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(dat[,c(1,2,4)], dat[,c(1,3,5)])
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or, as indicated, have a look at the reshape2 package - melt would certainly be of use, e.g.
library(reshape2)
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(melt(dat[,c(1,2,4)], id.vars = c("fact","t"), value.name = "var"),
melt(dat[,c(1,3,5)], id.vars = c("fact","t"), value.name = "var")
)

unlist and merge into a single dataframe in r

I have a list of dataframes that I need to be combined into a single one.
year<-1990:2000
v1<-1:11
v2<-20:30
df1<-data.frame(year,v1)
df2<-data.frame(year,v2)
ldf<-list(df1,df2)
I now want to unlist this dataframe and get
> head(df)
year v1 v2
1 1990 1 20
2 1991 2 21
3 1992 3 22
4 1993 4 23
Note that my question is different from the solution provided in a similar question, where the solution to that question was: `df <- ldply(ldf, data.frame)
Because what I am essentially looking for, is a more automatic way of doing this: df<-merge(df1,df2, by="year")
With more number of list elements, a convenient option is reduce with one of the join functions
library(tidyverse)
ldf %>%
reduce(inner_join, by = "year")
# year v1 v2
#1 1990 1 20
#2 1991 2 21
#3 1992 3 22
#4 1993 4 23
#5 1994 5 24
#6 1995 6 25
#7 1996 7 26
#8 1997 8 27
#9 1998 9 28
#10 1999 10 29
#11 2000 11 30
Is there anything wrong with:
df <- merge(ldf[[1]], ldf[[2]], by="year")
Or for a long list:
df1 <- ldf[[1]]
for (x in 2:length(ldf)) {
df1 <- merge(df1, ldf[[x]])
}
# year v1 v2
# 1 1990 1 20
# 2 1991 2 21
# 3 1992 3 22
# 4 1993 4 23
# 5 1994 5 24
# 6 1995 6 25
# 7 1996 7 26
# 8 1997 8 27
# 9 1998 9 28
# 10 1999 10 29
# 11 2000 11 30

how to get previous matching value

how to get value of "a" if value of b matches with most recent previous value e.g - row$3 of b matches with previous row$1 ,row$6 matches with row$4
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
b = c(30.133,29,30.1223,33,17,33,11,17,14,13.913,14))
year a b *NEW*
2013 10 30.133 NA
2013 11 29 NA
2014 NA 30.1223 10
2014 13 33 NA
2014 22 17 NA
2015 NA 33 13
2015 19 11 NA
2015 NA 17 22
2016 10 14 NA
2016 15 13.913 10
2016 NA 14 15
Thanks
For OPs example case
One way could be is to use duplicated() function.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
b = c(30,29,30,33,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b,fromLast = TRUE)]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 33 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 13
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 NA
# 10 2016 15 14 10
# 11 2016 NA 14 15
General purpose usage
The above solution fails when the duplicates are not in sequential order. As per #DavidArenburg's advice. I have changed the fourth element df$b[4] <- 14. The general solution would require the usage of another handy function order() and should work for different possible cases.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
b = c(30,29,30,14,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# sort the matching column
df <- df[order(df$b),]
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b,fromLast = TRUE)]
# To original order
df <- df[order(as.integer(rownames(df))),]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 14 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 NA
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 13
# 10 2016 15 14 10
# 11 2016 NA 14 15
Here, the solution is based on the base package' functions. I am sure there should other ways of doing this using other packages.

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But didn't have the r-skills to get it to work if the lag is more then one time-unit. Any suggestions?
ADDITION
So what i want is the growth-rate, that is (24/20)-1=0.2 in the example below. Not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 5-period-lagged quarter) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
df3 = copy(df)
df3$Quarter = df3$Quarter - 4
df = merge(df,df3,c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>

In R, sum over all rows above a given row and restarting at new ID?

The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 48
3 1998 10 10
3 1999 12 22
The amount of years and the specific years vary for each Id.
I have a feeling that it's some advanced options in ddply but I have not been able to find the answer. I've also tried working with for/while loops but since these are dreadfully slow in R and my data-set is large, it's not working all that well.
Thanks in advance!
You can use the sumsum function and apply it with ave to all subgroups.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
If your data is large, then ddply will be slow.
data.table is the way to go.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]

Resources