Taking variance of some rows above in panel structure (R data.table) - r

# Example of a panel data
library(data.table)
panel <- data.table(expand.grid(Year = 2017:2020, Individual = c("A", "B", "C")))
panel$value <- rnorm(nrow(panel), 10)  # the value I am interested in
I want to take the variance of the prior two years' values by Individual.
For example, if I wanted to sum the values of the prior two years, I would do something like:
panel[, sum_of_past_2_years := shift(value) + shift(value, 2), Individual]
I thought this would work:
panel[, var(c(shift(value), shift(value, 2))), Individual]
# This doesn't work, of course: it returns one variance per Individual rather than one per row
Ideally, the answer should look like this:
a <- c(NA, NA, var(panel$value[1:2]), var(panel$value[2:3]))
b <- c(NA, NA, var(panel$value[5:6]), var(panel$value[6:7]))
c <- c(NA, NA, var(panel$value[9:10]), var(panel$value[10:11]))
panel[, variance_past_2_years := c(a, b, c)]
# NAs where there are no values for the 2 prior years

You can use frollapply to perform a rolling operation over every 2 values.
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
#     Year Individual     value        var
#  1: 2017          A  9.416218         NA
#  2: 2018          A  8.424868         NA
#  3: 2019          A  8.743061 0.49138739
#  4: 2020          A  9.489386 0.05062333
#  5: 2017          B 10.102086         NA
#  6: 2018          B  8.674827         NA
#  7: 2019          B 10.708943 1.01853361
#  8: 2020          B 11.828768 2.06881272
#  9: 2017          C 10.124349         NA
# 10: 2018          C  9.024261         NA
# 11: 2019          C 10.677998 0.60509700
# 12: 2020          C 10.397105 1.36742220
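
Equivalently, you can compute the rolling variance first and lag the result by one year; a minimal sketch of the same idea (the column name var2 is just illustrative):

# rolling variance over windows of 2, then shifted so each row
# sees the variance of the *prior* two years
panel[, var2 := shift(frollapply(value, 2, var)), Individual]

For a window of n prior years, replace 2 with n.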

Euclidean distance for distinct classes of factors iterated by groups

*Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in the computation (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes, according to the following formula:

\sqrt{\frac{\sum_i (X_i - Y_i)^2}{\left(\sum_i X_i^2\right)\left(\sum_i Y_i^2\right)}}

where X_i represents the number of patents belonging to class i in year t, and Y_i represents the number of patents belonging to class i in the previous year (t-1).
To further illustrate this, consider the following dataset:
df <- data.table(Firm = rep(LETTERS[1:2], each = 6),
                 Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
                 Patent_Number = sample(184785:194785, 12, replace = FALSE),
                 Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1))
> df
    Firm Year Patent_Number Patent_Class
 1:    A 1990        192473           12
 2:    A 1990        193702            5
 3:    A 1991        191889           31
 4:    A 1992        193341           12
 5:    A 1992        189512           31
 6:    A 1993        185582            6
 7:    B 1990        190838           15
 8:    B 1990        189322           15
 9:    B 1991        190620           15
10:    B 1992        193443            3
11:    B 1992        189937            3
12:    B 1993        194146            1
Since year 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to year 1991, the distinct classes for this year (1991) and the previous year (1990) are 31, 5, and 12. Therefore, the above formula is summed over these three distinct classes (there are three distinct i's). So the formula's output will be:

\sqrt{\frac{(1-0)^2 + (0-1)^2 + (0-1)^2}{(1^2)(1^2 + 1^2)}} = \sqrt{3/2} \approx 1.2247450
Following the same calculation and reiterating over firms, the final output should be:
> df
    Firm Year Patent_Number Patent_Class   El_Dist
 1:    A 1990        192473           12        NA
 2:    A 1990        193702            5        NA
 3:    A 1991        191889           31 1.2247450
 4:    A 1992        193341           12 0.7071068
 5:    A 1992        189512           31 0.7071068
 6:    A 1993        185582            6 1.2247450
 7:    B 1990        190838           15        NA
 8:    B 1990        189322           15        NA
 9:    B 1991        190620           15 0.5000000
10:    B 1992        193443            3 1.1180340
11:    B 1992        189937            3 1.1180340
12:    B 1993        194146            1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i){
    # rows belonging to the current year and the previous year
    j <- which(Year %in% (Year[i] - 1:0))
    # class counts: one row per year present, one column per distinct class
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      NA_real_   # no previous year observed
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer/denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
          by = .(Firm),
          .SDcols = c("Year", "Patent_Class")]
head(df)
#    Firm Year Patent_Number Patent_Class   El_Dist
# 1:    A 1990        190948           12        NA
# 2:    A 1990        186156            5        NA
# 3:    A 1991        190801           31 1.2247449
# 4:    A 1992        185226           12 0.7071068
# 5:    A 1992        185900           31 0.7071068
# 6:    A 1993        186928            6 1.2247449
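
For the 7-million-row update: the per-row table() calls are the likely bottleneck, since the same firm-year pair is re-tabulated once for every patent in it. A sketch of a faster approach (not benchmarked; counts, years, and dist_one are illustrative names, not from the question): pre-aggregate the class counts once and compute a single distance per firm-year.

library(data.table)

# patents per Firm/Year/Class, plus the distinct firm-years
counts <- df[, .N, by = .(Firm, Year, Patent_Class)]
years  <- unique(df[, .(Firm, Year)])

# distance between one firm-year and the preceding year; NA if either is absent
dist_one <- function(firm, yr) {
  cur  <- counts[.(firm, yr),     on = .(Firm, Year), nomatch = NULL]
  prev <- counts[.(firm, yr - 1), on = .(Firm, Year), nomatch = NULL]
  if (nrow(cur) == 0L || nrow(prev) == 0L) return(NA_real_)
  m <- merge(cur, prev, by = "Patent_Class", all = TRUE)  # union of classes
  x <- fcoalesce(m$N.x, 0L)  # counts in year t
  y <- fcoalesce(m$N.y, 0L)  # counts in year t - 1
  sqrt(sum((x - y)^2) / (sum(x^2) * sum(y^2)))
}

years[, El_Dist := mapply(dist_one, Firm, Year)]
df[years, El_Dist := i.El_Dist, on = .(Firm, Year)]  # join back by reference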

data.table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the transformed data set to the model along with some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
library(data.table)
library(DescTools)  # for Winsorize
dt <- data.table(id = 1:10, Country = sample(c("Germany", "USA"), 10, replace = TRUE),
                 x = rnorm(10, 1, 10), y = rnorm(10, 1, 10),
                 factor = factor(sample(LETTERS[1:2], 10, replace = TRUE)))
sel.col <- c("x", "y")
dt[, lapply(.SD, Winsorize), .SDcols = sel.col, by = factor]
I would need to call data.table again to merge the original dt with the transformed data and pay attention to the order:
data.table(dt[, .(id, Country), by = factor],
           dt[, lapply(.SD, Winsorize), .SDcols = sel.col, by = factor])
I was hoping that I could include the additional columns within the lapply call:
dt[, .(lapply(.SD, Winsorize), id, Country), .SDcols = sel.col, by = factor]
Are there any other solutions?
Do you just need this?
dt[, c(lapply(.SD, Winsorize), list(id = id, Country = Country)), .SDcols = sel.col, by = factor]
Unfortunately this method gets slow with big data. Apparently this was optimised in some recent update, but it is still very slow.
There is no need to merge; you can assign the columns after the lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
    id Country          x           y factor
 1:  1 Germany  13.116248  -0.4609152      B
 2:  2 Germany  -6.623404  -3.7048052      A
 3:  3     USA -18.027532  22.2946805      A
 4:  4     USA -13.377736   6.2021252      A
 5:  5 Germany -12.585897   0.8255081      B
 6:  6 Germany  -8.816252 -12.1218135      B
 7:  7     USA  -3.459926 -11.5710316      B
 8:  8     USA   3.180706   6.3262951      B
 9:  9 Germany  -5.520637   7.2877123      A
10: 10 Germany  15.857069   8.6422997      A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
    id Country          x           y factor
 1:  1 Germany  11.129140  -0.4609152      B
 2:  2 Germany  -6.623404  -1.7234191      A
 3:  3     USA -17.097573  19.5642043      A
 4:  4     USA -13.377736   6.2021252      A
 5:  5 Germany -11.831968   0.8255081      B
 6:  6 Germany  -8.816252 -12.0116571      B
 7:  7     USA  -3.459926 -11.5710316      B
 8:  8     USA   3.180706   5.2261377      B
 9:  9 Germany  -5.520637   7.2877123      A
10: 10 Germany  11.581528   8.6422997      A
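
If you want to keep dt itself unchanged (the question mentions running several transformation specifications), one option is to assign into a copy; a minimal sketch (dt.w is just an illustrative name):

# copy() prevents modification by reference, so each specification
# can start again from the untransformed data
dt.w <- copy(dt)[, (sel.col) := lapply(.SD, Winsorize), .SDcols = sel.col, by = factor]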

R data.table difference equation (dynamic panel data)

I have a data table with a column v2 containing 'initial values' and a column v1 containing a growth rate. I would like to extrapolate v2 for years past the available value by growing the previous value by the factor v1. In 'time series' notation: v2(t) = v2(t-1) * v1(t), given a v2(0).
The problem is that the year of the initial value may vary by group x in the dataset. In some groups, v2 may be available in multiple years, or not at all. Also, the number of years per group may vary (unbalanced panel). Using the shift function does not help, because it shifts v2 only once and does not reference the previously updated value.
     x year        v1       v2
 1:  a 2012 0.8501072       NA
 2:  a 2013 1.0926093 39.36505
 3:  a 2014 1.2084379       NA
 4:  a 2015 0.8921997       NA
 5:  a 2016 0.8023251       NA
 6:  b 2012 1.1005287       NA
 7:  b 2013 1.0139800       NA
 8:  b 2014 1.1539676       NA
 9:  b 2015 1.2282501       NA
10:  b 2016 0.8052265       NA
11:  c 2012 0.8866425       NA
12:  c 2013 0.9952566 44.30377
13:  c 2014 0.9092020       NA
14:  c 2015 1.0295864 15.04948
15:  c 2016 0.8812966       NA
The value of v2 for x = a, year = 2014 should be 39.36 * 1.208, and in 2015 that answer times 0.89.
The following code, in a set of loops, works and does what I want:
ivec <- unique(DT[, x])
for (i in 1:length(ivec)) {
  tvec <- unique(DT[x == ivec[i], year])
  for (t in 2:length(tvec)) {
    if (is.na(DT[x == ivec[i] & year == tvec[t], v2])) {
      DT[x == ivec[i] & year == tvec[t],
         v2 := DT[x == ivec[i] & year == tvec[t - 1], v2] * v1]
    }
  }
}
Try this:
DT[, v2 := Reduce(`*`, v1[-1], init = v2[1], accumulate = TRUE), by = .(x, cumsum(!is.na(v2)))]
#     x year        v1       v2
#  1: a 2012 0.8501072       NA
#  2: a 2013 1.0926093 39.36505
#  3: a 2014 1.2084379 47.57022
#  4: a 2015 0.8921997 42.44213
#  5: a 2016 0.8023251 34.05239
#  6: b 2012 1.1005287       NA
#  7: b 2013 1.0139800       NA
#  8: b 2014 1.1539676       NA
#  9: b 2015 1.2282501       NA
# 10: b 2016 0.8052265       NA
# 11: c 2012 0.8866425       NA
# 12: c 2013 0.9952566 44.30377
# 13: c 2014 0.9092020 40.28108
# 14: c 2015 1.0295864 15.04948
# 15: c 2016 0.8812966 13.26306
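
The trick is the compound by: cumsum(!is.na(v2)) starts a new subgroup at every observed initial value, and Reduce() then compounds the growth factors within each subgroup. A quick illustration for group a:

v2 <- c(NA, 39.36505, NA, NA, NA)
cumsum(!is.na(v2))
# [1] 0 1 1 1 1
# subgroup 0 (row 1) has no initial value and stays NA;
# subgroup 1 (rows 2-5) compounds forward from 39.36505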

R: Insert and fill missing periods in panel data

I'm trying to learn R coming from Stata, but have run into the following two problems which I cannot seem to find elegant solutions for in R:
1) I have a panel dataset with gaps in my time variable. I would like to expand my time variable to include the gaps despite having no observed data for these rows.
In Stata I would usually go about this by setting my ID and time variables with xtset and then expanding the dataset based on this with tsfill. Is there an equivalently elegant way in R?
2) I would like to fill some of the new, blank cells with data for constant variables.
In Stata I would do this by copying data from previous (relative to my time variable) observations using the l.-prefix; for example using replace Con = l.Con.
In other words I'm asking how to go from something like this:
ID Time Num Con
 1  Jan  10   A
 1  Feb  15   A
 1  May  20   A
 2  Feb  12   B
 2  Mar  14   B
 2  Jun  15   B
To something like this:
ID Time Num Con
 1  Jan  10   A
 1  Feb  15   A
 1  Mar       A
 1  Apr       A
 1  May  20   A
 2  Feb  12   B
 2  Mar  14   B
 2  Apr       B
 2  May       B
 2  Jun  15   B
Hopefully that makes sense. Thanks in advance.
You can try merge from base R or a data.table join:
library(data.table)
DT2 <- setDT(df1)[, {tmp <- match(Time, month.abb)
                     list(Time = month.abb[min(tmp):max(tmp)])}, .(ID, Con)]
setkey(df1[, c(1, 4, 2, 3), with = FALSE], ID, Con, Time)[DT2]
#     ID Con Time Num
#  1:  1   A  Jan  10
#  2:  1   A  Feb  15
#  3:  1   A  Mar  NA
#  4:  1   A  Apr  NA
#  5:  1   A  May  20
#  6:  2   B  Feb  12
#  7:  2   B  Mar  14
#  8:  2   B  Apr  NA
#  9:  2   B  May  NA
# 10:  2   B  Jun  15
NOTE: It may be better to keep the missing values as NA.
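
The same right join can also be written without setkey, using the on= syntax (an equivalent sketch):

df1[DT2, on = .(ID, Con, Time)][, .(ID, Con, Time, Num)]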

How can I pass NA when an attempted selection fails with "attempt to select less than one element"?

I have two data frames, a and b; both have "state" and "year" columns (as well as others). I'm trying to transfer the values of VARX from b to a. I used this:
for(i in seq_len(nrow(a))) {
  a$VARX[[i]] <- b$VARX[[which(b$state == a$state[[i]] & b$year == a$year[[i]])]]
}
I get this error:
Error in b$VARX[[which(b$state == a$state[[i]] & b$year == :
attempt to select less than one element
The problem seems to be that there are some rows in a without a corresponding row in b, so it can't select an element. How can I either return NA in those cases (so that a$VARX[i] = NA) or clean a of all the cases where there is no corresponding row in b?
A loop is not a good choice here. See this working example:
a <- data.frame(state = c("a", "b", "a"), year = c("2012", "2013", "2014"), VARX = c(1, 2, 3))
###################
  state year VARX
1     a 2012    1
2     b 2013    2
3     a 2014    3
###################
b <- data.frame(state = c("a", "b", "c"), year = c("2012", "2013", "2014"), VARX = c(4, 5, 6))
###################
  state year VARX
1     a 2012    4
2     b 2013    5
3     c 2014    6
###################
# merge a and b, keeping all rows of a (all.x = TRUE)
res <- merge(a, b, by = c("state", "year"), all.x = TRUE, sort = FALSE, suffixes = c(".a", ""))
###################
  state year VARX.a VARX
1     a 2012      1    4
2     b 2013      2    5
3     a 2014      3   NA
###################
# drop the original VARX that came from data frame a
subset(res, select = -VARX.a)
###################
  state year VARX
1     a 2012    4
2     b 2013    5
3     a 2014   NA
###################
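
Alternatively, a vectorised match() does what the loop attempted and returns NA wherever a has no counterpart in b (a minimal sketch; key_a and key_b are illustrative helper names):

# build a composite state-year key per row; match() yields NA for rows
# of a with no corresponding row in b, so those get VARX = NA
key_a <- paste(a$state, a$year)
key_b <- paste(b$state, b$year)
a$VARX <- b$VARX[match(key_a, key_b)]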
