R data.table difference equation (dynamic panel data)

I have a data.table with a column v2 holding 'initial values' and a column v1 holding a growth rate. I would like to extrapolate v2 for years past the available value by growing the previous value by the factor v1. In 'time series' notation: v2(t+1) = v2(t)*v1(t), given a v2(0).
The problem is that the year of the initial value may vary by group x in the dataset. In some groups, v2 may be available in multiple years, or not at all. Also, the number of years per group may vary (unbalanced panel). Using the shift function does not help, because it shifts v2 only once and does not reference the previously updated value.
x year v1 v2
1: a 2012 0.8501072 NA
2: a 2013 1.0926093 39.36505
3: a 2014 1.2084379 NA
4: a 2015 0.8921997 NA
5: a 2016 0.8023251 NA
6: b 2012 1.1005287 NA
7: b 2013 1.0139800 NA
8: b 2014 1.1539676 NA
9: b 2015 1.2282501 NA
10: b 2016 0.8052265 NA
11: c 2012 0.8866425 NA
12: c 2013 0.9952566 44.30377
13: c 2014 0.9092020 NA
14: c 2015 1.0295864 15.04948
15: c 2016 0.8812966 NA
The value of v2 for x=a, year=2014 should be 39.37*1.208, and in 2015 that result times 0.892.
The following code, in a set of loops, works and does what I want:
ivec <- unique(DT[, x])
for (i in 1:length(ivec)) {
  tvec <- unique(DT[x == ivec[i], year])
  for (t in 2:length(tvec)) {
    if (is.na(DT[x == ivec[i] & year == tvec[t], v2])) {
      DT[x == ivec[i] & year == tvec[t],
         v2 := DT[x == ivec[i] & year == tvec[(t-1)], v2] * v1]
    }
  }
}

Try this:
DT[, v2:= Reduce(`*`, v1[-1], init=v2[1], acc=TRUE), by=.(x, cumsum(!is.na(v2)))]
# x year v1 v2
# 1: a 2012 0.8501072 NA
# 2: a 2013 1.0926093 39.36505
# 3: a 2014 1.2084379 47.57022
# 4: a 2015 0.8921997 42.44213
# 5: a 2016 0.8023251 34.05239
# 6: b 2012 1.1005287 NA
# 7: b 2013 1.0139800 NA
# 8: b 2014 1.1539676 NA
# 9: b 2015 1.2282501 NA
# 10: b 2016 0.8052265 NA
# 11: c 2012 0.8866425 NA
# 12: c 2013 0.9952566 44.30377
# 13: c 2014 0.9092020 40.28108
# 14: c 2015 1.0295864 15.04948
# 15: c 2016 0.8812966 13.26306
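How this works: cumsum(!is.na(v2)) increments at every non-NA row, so each observed value opens a new run within x, and Reduce() with accumulate (partially matched here as acc=TRUE) carries the running product forward from that run's initial value. A minimal sketch of the same recursion on plain vectors:
v1 <- c(1.2, 0.9, 1.1)                          # growth factors after the initial value
Reduce(`*`, v1, init = 100, accumulate = TRUE)  # 100.0 120.0 108.0 118.8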

Euclidean distance for distinct classes of factors iterated by groups

Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in the computation (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes, according to the following formula:
$$\text{El\_Dist} = \sqrt{\dfrac{\sum_i (X_i - Y_i)^2}{\left(\sum_i X_i^2\right)\left(\sum_i Y_i^2\right)}}$$
where X_i represents the number of patents belonging to class i in year t, Y_i represents the number of patents belonging to class i in the previous year (t-1), and the sum runs over the distinct classes present in either year.
To further illustrate this, consider the following dataset:
df <- data.table(Firm = rep(c(LETTERS[1:2]),each=6), Year = rep(c(1990,1990,1991,1992,1992,1993),2),
Patent_Number = sample(184785:194785,12,replace = FALSE),
Patent_Class = c(12,5,31,12,31,6,15,15,15,3,3,1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since year 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to year 1991, the distinct classes for this year (1991) and the previous year (1990) are 31, 5, and 12. Therefore, the above formula is summed over these three distinct classes (there are three distinct i's), so the formula's output will be:
$$\sqrt{\dfrac{(1-0)^2 + (0-1)^2 + (0-1)^2}{\left(1^2\right)\left(1^2 + 1^2\right)}} = \sqrt{1.5} \approx 1.2247$$
Following the same calculation and iterating over firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i){
    # rows of the current year and of the year before it
    j <- which(Year %in% (Year[i] - 1:0))
    # class counts, one table row per year present
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      NA_real_  # no previous year available
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer/denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
          by = .(Firm),
          .SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
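Regarding the 7-million-row update: calling table() once per row is what gets expensive. A sketch of a join-based alternative that pre-aggregates the class counts once; it assumes, as in the question, that the distance is only defined between consecutive calendar years, and the names cnt, prev, both and res are hypothetical:
library(data.table)
# patents per firm, year and class, computed once
cnt <- df[, .(N = .N), by = .(Firm, Year, Patent_Class)]
# the same counts shifted one year forward, so they align with the following year
prev <- copy(cnt)[, Year := Year + 1][, .(Firm, Year, Patent_Class, N_prev = N)]
both <- merge(cnt, prev, by = c("Firm", "Year", "Patent_Class"), all = TRUE)
both[is.na(N), N := 0L][is.na(N_prev), N_prev := 0L]
res <- both[, .(El_Dist = {
  d2 <- sum(N^2); p2 <- sum(N_prev^2)
  # NA for a firm's first year or after a gap year, as in fEl_Dist
  if (d2 == 0 || p2 == 0) NA_real_ else sqrt(sum((N - N_prev)^2) / (d2 * p2))
}), by = .(Firm, Year)]
df[res, El_Dist := i.El_Dist, on = .(Firm, Year)]  # join back onto patent rows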

Taking variance of some rows above in panel structure (R data.table)

# Example of a panel data
library(data.table)
panel<-data.table(expand.grid(Year=c(2017:2020),Individual=c("A","B","C")))
panel$value<-rnorm(nrow(panel),10) # The value I am interested in
I want to take the variance of the prior two years' values by Individual.
For example, if I were to sum the values of the prior two years, I would do something like:
panel[,sum_of_past_2_years:=shift(value)+shift(value, 2),Individual]
I thought this would work.
panel[,var(c(shift(value),shift(value, 2))),Individual]
# This doesn't work of course
Ideally the answer should look like
a<-c(NA,NA,var(panel$value[1:2]),var(panel$value[2:3]))
b<-c(NA,NA,var(panel$value[5:6]),var(panel$value[6:7]))
c<-c(NA,NA,var(panel$value[9:10]),var(panel$value[10:11]))
panel[,variance_past_2_years:=c(a,b,c)]
# NAs when there is no value for 2 prior years
You can use frollapply to perform a rolling operation over every 2 values.
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
# Year Individual value var
# 1: 2017 A 9.416218 NA
# 2: 2018 A 8.424868 NA
# 3: 2019 A 8.743061 0.49138739
# 4: 2020 A 9.489386 0.05062333
# 5: 2017 B 10.102086 NA
# 6: 2018 B 8.674827 NA
# 7: 2019 B 10.708943 1.01853361
# 8: 2020 B 11.828768 2.06881272
# 9: 2017 C 10.124349 NA
#10: 2018 C 9.024261 NA
#11: 2019 C 10.677998 0.60509700
#12: 2020 C 10.397105 1.36742220
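To see why this gives the prior two years: shift(value) lags the series by one position, so the length-2 window ending at row t contains value[t-2] and value[t-1]. An equivalent two-step spelling (var2 is a hypothetical helper column):
panel[, var2 := frollapply(value, 2, var), Individual]  # variance of rows t-1 and t
panel[, var := shift(var2), Individual]                 # lagged: variance of rows t-2 and t-1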

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns: Product_Name (name of books), Year, and Units (number of units sold in that year). It looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make an R Shiny dashboard with it, where I want to keep the year as a drop-down menu option, for which I want the dataframe in the following format:
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round where the Product Names are row indexes and Years are column indexes, either way goes.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (amongst many others):
dt = data.table(
  Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
  Year = c(rep(2011:2014, 3)),
  Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
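Applied directly to the question's columns, the call would presumably be (assuming the data is stored in df):
library(data.table)
dcast(setDT(df), Year ~ Product_Name, value.var = "Units")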
You can use pivot_wider from the tidyr package. I assume your data is saved in df; you also need the dplyr package for %>% (piping).
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(names_from = Product_Name, values_from = Units)
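With the question's data this should return something like:
# A tibble: 4 x 4
#    Year `A Modest Proposal` `Animal Farm` `Catch 22`
# 1  2011               10000          8000       1000
# 2  2012               11000          9000       2000
# 3  2013               12000         11000       3000
# 4  2014               13000         15000       4000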
Assuming that your dataframe is ordered by Product_Name and by Year, I will generate artificial data similar to your dataframe. Try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output [, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637

How to aggregate a monthly GDP series to quarterly

I'm starting to work with R now, and I'm having trouble aggregating GDP data to quarterly frequency.
The command that I am using is:
library("data.table")
pib<- read.csv("PIB.csv", header = TRUE, sep=";", dec=",")
setDT(pib)
pib
attach(pib)
aggregate(pib, by= PIB.mensal, frequency=4, FUN='sum')
My data is the following:
datareferencia| GDP.month
1: 01/01/2010| 288.980,20
2: 01/02/2010| 285.738,70
3: 01/03/2010| 311.677,40
4: 01/04/2010| 307.106,60
5: 01/05/2010| 316.005,10
6: 01/06/2010| 321.032,90
7: 01/07/2010| 332.472,50
8: 01/08/2010| 334.225,30
9: 01/09/2010| 331.237,00
10: 01/10/2010| 344.965,70
11: 01/11/2010| 356.675,00
12: 01/12/2010| 355.730,60
13: 01/01/2011| 333.330,90
14: 01/02/2011| 335.118,40
15: 01/03/2011| 348.084,20
16: 01/04/2011| 349.255,90
17: 01/05/2011| 366.411,50
18: 01/06/2011| 371.046,10
19: 01/07/2011| 373.334,50
20: 01/08/2011| 377.005,90
21: 01/09/2011| 361.993,50
22: 01/10/2011| 378.843,40
23: 01/11/2011| 389.948,20
24: 01/12/2011| 392.009,40
Can someone help me? I need quarterly totals for both years, 2010 and 2011!
You can use the by argument of data.table to do this. A variable for year and quarter is all you need.
Reading in your data:
pib <- data.table(datareferencia = c("01/01/2010", "01/02/2010", "01/03/2010",
"01/04/2010", "01/05/2010", "01/06/2010",
"01/07/2010", "01/08/2010", "01/09/2010",
"01/10/2010", "01/11/2010", "01/12/2010",
"01/01/2011", "01/02/2011", "01/03/2011",
"01/04/2011", "01/05/2011", "01/06/2011",
"01/07/2011", "01/08/2011", "01/09/2011",
"01/10/2011", "01/11/2011", "01/12/2011") ,
GDP.month = c( 288980.20, 285738.70, 311677.40,
307106.60, 316005.10, 321032.90,
332472.50, 334225.30, 331237.00,
344965.70, 356675.00, 355730.60,
333330.90, 335118.40, 348084.20,
349255.90, 366411.50, 371046.10,
373334.50, 377005.90, 361993.50,
378843.40, 389948.20, 392009.40))
Convert your date column if that is not already done:
pib[, datareferencia := as.IDate(datareferencia, format = "%d/%m/%Y")]
With the year function from data.table you get ... well, the year.
For the quarter, I use integer division %/% on the month, with a little adjustment so that the result is 1 to 4 and not 0 to 3.
pib[, quarter := ((month(datareferencia) - 1) %/% 3) + 1]
pib[, year := year(datareferencia)]
At last you can calculate the sum by year and quarter:
pib[, sum.quarter := sum(GDP.month), by = c("quarter", "year")]
The result:
unique(pib[, list(quarter, year, sum.quarter)])
   quarter year sum.quarter
1:       1 2010    886396.3
2:       2 2010    944144.6
3:       3 2010    997934.8
4:       4 2010   1057371.3
5:       1 2011   1016533.5
6:       2 2011   1086713.5
7:       3 2011   1112333.9
8:       4 2011   1160801.0
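Note that data.table also ships a quarter() helper, so the period variable can be built without the integer-division arithmetic:
pib[, quarter := quarter(datareferencia)]  # 1 to 4, same result as the manual formula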

R data.table Conditional Sum: Cleaner way

This is of course a frequently encountered problem, so I expected many questions here on SO regarding it. However, all the answers I could find were very specific to the question at hand and often proposed workarounds ("you don't have to do this, foobar is much better in this scenario") or non-data.table solutions. Perhaps this is because it should be a no-brainer with data.table.
I have a data.table which contains yearly data on tentgelt and te_med. For each year, I want to know the share of observations for which tentgelt > te_med. This is what I am doing:
# note that nAbove and nBelow do not add up to 1
nAbove <- wages[tentgelt > te_med, list(nAbove = .N), by=list(year)]
nBelow <- wages[tentgelt < te_med, list(nBelow = .N), by=list(year)]
nBelow[nAbove][, list(year, foo=nAbove/(nAbove+nBelow))]
This works, but whenever I see other people's data.table code, it looks much clearer and simpler than my workarounds. Is there a cleaner way to get the following type of output?
year foo
1: 1993 0.2372093
2: 1994 0.1567568
3: 1995 0.8132530
4: 1996 0.1235955
5: 1997 0.1065574
6: 1998 0.3070684
7: 1999 0.1491974
Here's a sample of my data:
year tentgelt te_med
1: 2010 120.95 53.64929
2: 2010 9.99 116.72601
3: 2010 113.52 53.07394
4: 2010 10.27 38.45728
5: 2010 48.58 124.65753
6: 2010 96.38 86.99060
7: 2010 3.46 65.75342
8: 2010 107.52 91.87592
9: 2010 107.52 42.92953
10: 2010 3.46 73.92328
11: 2010 96.38 85.23419
12: 2010 2.25 79.19995
13: 2010 42.32 35.75757
14: 2010 7.94 93.44305
15: 2010 120.95 113.41370
16: 2010 7.94 110.68628
17: 2010 107.52 127.30682
18: 2010 2.25 103.49036
19: 2010 120.95 123.62054
20: 2010 96.38 68.57532
For this sample, the expected output should be:
year V2
1: 2010 0.45
Try this:
wages[, list(foo= sum(tentgelt > te_med)/.N), by = year]
# year foo
# 1: 2010 0.45
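Equivalently, since the mean of a logical vector is the share of TRUE values, the same result can be written as (assuming no NAs in either column):
wages[, .(foo = mean(tentgelt > te_med)), by = year]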
