How to aggregate a GDP series to quarterly frequency in R

I am starting to work with R now, and I am having trouble aggregating monthly GDP data to quarterly frequency.
The command that I am using is:
library("data.table")
pib<- read.csv("PIB.csv", header = TRUE, sep=";", dec=",")
setDT(pib)
pib
attach(pib)
aggregate(pib, by= PIB.mensal, frequency=4, FUN='sum')
My data is the following:
datareferencia| GDP.month
1: 01/01/2010| 288.980,20
2: 01/02/2010| 285.738,70
3: 01/03/2010| 311.677,40
4: 01/04/2010| 307.106,60
5: 01/05/2010| 316.005,10
6: 01/06/2010| 321.032,90
7: 01/07/2010| 332.472,50
8: 01/08/2010| 334.225,30
9: 01/09/2010| 331.237,00
10: 01/10/2010| 344.965,70
11: 01/11/2010| 356.675,00
12: 01/12/2010| 355.730,60
13: 01/01/2011| 333.330,90
14: 01/02/2011| 335.118,40
15: 01/03/2011| 348.084,20
16: 01/04/2011| 349.255,90
17: 01/05/2011| 366.411,50
18: 01/06/2011| 371.046,10
19: 01/07/2011| 373.334,50
20: 01/08/2011| 377.005,90
21: 01/09/2011| 361.993,50
22: 01/10/2011| 378.843,40
23: 01/11/2011| 389.948,20
24: 01/12/2011| 392.009,40
Can someone help me? I need quarterly totals for both years, 2010 and 2011!

You can use the by argument of data.table to do this. A variable for year and quarter is all you need.
Reading in your data:
pib <- data.table(datareferencia = c("01/01/2010", "01/02/2010", "01/03/2010",
"01/04/2010", "01/05/2010", "01/06/2010",
"01/07/2010", "01/08/2010", "01/09/2010",
"01/10/2010", "01/11/2010", "01/12/2010",
"01/01/2011", "01/02/2011", "01/03/2011",
"01/04/2011", "01/05/2011", "01/06/2011",
"01/07/2011", "01/08/2011", "01/09/2011",
"01/10/2011", "01/11/2011", "01/12/2011") ,
GDP.month = c( 288980.20, 285738.70, 311677.40,
307106.60, 316005.10, 321032.90,
332472.50, 334225.30, 331237.00,
344965.70, 356675.00, 355730.60,
333330.90, 335118.40, 348084.20,
349255.90, 366411.50, 371046.10,
373334.50, 377005.90, 361993.50,
378843.40, 389948.20, 392009.40))
Convert your date column if that is not already done:
pib[, datareferencia := as.IDate(datareferencia, format = "%d/%m/%Y")]
With the year() function from data.table you get ... well, the year.
For the quarter I use integer division (%/%) on the month, with a small adjustment so that the result runs from 1 to 4 rather than from 0 to 3.
pib[, quarter := ((month(datareferencia) - 1) %/% 3) + 1]
pib[, year := year(datareferencia)]
At last you can calculate the sum by year and quarter:
pib[, sum.quarter := sum(GDP.month), by = c("quarter", "year")]
The result:
unique(pib[, list(quarter, year, sum.quarter)])
   quarter year sum.quarter
1:       1 2010      886396
2:       2 2010      944145
3:       3 2010      997935
4:       4 2010     1057371
5:       1 2011     1016534
6:       2 2011     1086714
7:       3 2011     1112334
8:       4 2011     1160801
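As a side note, data.table also ships a built-in quarter() helper alongside year() and month(), so (assuming the same pib table as above, with datareferencia already converted via as.IDate) the same grouping can be written without the integer-division arithmetic:
# quarter() maps months 1-12 to quarters 1-4 directly:
pib[, .(sum.quarter = sum(GDP.month)),
    by = .(year = year(datareferencia), quarter = quarter(datareferencia))]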

Related

Taking variance of some rows above in panel structure (R data.table)

# Example of panel data
library(data.table)
panel <- data.table(expand.grid(Year = 2017:2020, Individual = c("A", "B", "C")))
panel$value <- rnorm(nrow(panel), 10)  # the value I am interested in
I want to take the variance of the prior two years' values by Individual.
For example, if I were to sum the values of the prior two years, I would do something like:
panel[,sum_of_past_2_years:=shift(value)+shift(value, 2),Individual]
I thought this would work.
panel[,var(c(shift(value),shift(value, 2))),Individual]
# This doesn't work of course
Ideally the answer should look like
a<-c(NA,NA,var(panel$value[1:2]),var(panel$value[2:3]))
b<-c(NA,NA,var(panel$value[5:6]),var(panel$value[6:7]))
c<-c(NA,NA,var(panel$value[9:10]),var(panel$value[10:11]))
panel[,variance_past_2_years:=c(a,b,c)]
# NAs when there is no value for 2 prior years
You can use frollapply to perform a rolling operation over every 2 values.
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
# Year Individual value var
# 1: 2017 A 9.416218 NA
# 2: 2018 A 8.424868 NA
# 3: 2019 A 8.743061 0.49138739
# 4: 2020 A 9.489386 0.05062333
# 5: 2017 B 10.102086 NA
# 6: 2018 B 8.674827 NA
# 7: 2019 B 10.708943 1.01853361
# 8: 2020 B 11.828768 2.06881272
# 9: 2017 C 10.124349 NA
#10: 2018 C 9.024261 NA
#11: 2019 C 10.677998 0.60509700
#12: 2020 C 10.397105 1.36742220
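For reference, the same rolling variance can be spelled out with two explicit lags, which makes it clear that frollapply(shift(value), 2, var) looks at the two values before the current row (a sketch for illustration only; var2 is a hypothetical column name):
# var2: variance of the values one and two rows back, per Individual
panel[, var2 := mapply(function(a, b) var(c(a, b)),
                       shift(value, 2), shift(value, 1)),
      by = Individual]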

Bridge the last and next non-NA value with intermediate values that grow evenly

What would be a good way to fill the missing NAs in a dataframe column with intermediate values that grow gradually from the last non-NA value to the next non-NA value?
Here is an example: for the column cost, I would like to obtain the column cost_esti, where the cost increases by about $31 each year between 2014 and 2016, bridging the last known cost of $595 to the next known cost of $720.
The code I came up with is lengthy. Is there a more elegant way to do the same?
library(data.table)
data = data.table(year = 2000:2018,
                  cost = c(100, 120, NA, 200, 220, NA, NA, 300, 350, 470, 500,
                           NA, NA, 595, NA, NA, NA, 720, 800))
data[, cost_nas := as.numeric(is.na(cost))]
## consecutive NAs so far for each row:
data[, consecutive_nas_so_far := seq_len(.N), by = rleid(cost_nas)]
data[cost_nas == 0, consecutive_nas_so_far := 0]
# total number of consecutive NAs in the sequence
data[, total_number_of_consec_nas := ifelse(consecutive_nas_so_far > 0 &
                                              shift(consecutive_nas_so_far, 1, type = "lead") == 0,
                                            consecutive_nas_so_far, NA)]
data[cost_nas == 0, total_number_of_consec_nas := 0]
data[, total_number_of_consec_nas := zoo::na.locf(total_number_of_consec_nas, fromLast = TRUE)]
# get last and next known values for cost:
data[, cost_previous := zoo::na.locf(cost)]
data[, cost_following := zoo::na.locf(cost, fromLast = TRUE)]
# apply the formula to calculate the gradual increase from cost_previous to cost_following
data[, cost_esti := round(consecutive_nas_so_far * (cost_following - cost_previous) /
                            (total_number_of_consec_nas + 1) + cost_previous, 0)]
data[is.na(cost_esti), cost_esti := cost]
You can rewrite the data.table operations using zoo::na.locf and data.table::rleid. Add two columns, lastNonNA and nextNonNA, using na.locf. rleid provides a distinct group for each run of consecutive NAs. You can then fill the NAs by interpolating linearly between lastNonNA and nextNonNA.
library(data.table)
library(zoo)
# Data
data = data.table(year = 2000:2018,
                  cost = c(100, 120, NA, 200, 220, NA, NA, 300, 350, 470, 500,
                           NA, NA, 595, NA, NA, NA, 720, 800))
data[, ':='(lastNonNA = na.locf(cost, fromLast = FALSE),
            nextNonNA = na.locf(cost, fromLast = TRUE),
            Group_NA = rleid(is.na(cost)))][
  , ':='(IDX = 1:.N), by = Group_NA][
  , ':='(cost = ifelse(is.na(cost), lastNonNA + IDX * ((nextNonNA - lastNonNA) / (.N + 1)), cost)),
  by = Group_NA][, .(year, cost)]
# year cost
# 1: 2000 100.0000
# 2: 2001 120.0000
# 3: 2002 160.0000 #Filled
# 4: 2003 200.0000
# 5: 2004 220.0000
# 6: 2005 246.6667 #Filled
# 7: 2006 273.3333 #Filled
# 8: 2007 300.0000
# 9: 2008 350.0000
# 10: 2009 470.0000
# 11: 2010 500.0000
# 12: 2011 531.6667 #Filled
# 13: 2012 563.3333 #Filled
# 14: 2013 595.0000
# 15: 2014 626.2500 #Filled
# 16: 2015 657.5000 #Filled
# 17: 2016 688.7500 #Filled
# 18: 2017 720.0000
# 19: 2018 800.0000
What you are asking for in the question is a linear interpolation, and it can be obtained quite easily in R for data with NAs.
In this case the solution would be:
library("imputeTS")
na_interpolation(data, option = "linear")
You could also use option = "spline" or option = "stine"; then the increase wouldn't necessarily be strictly linear.
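If you would rather stay with the zoo package already used above, na.approx performs the same linear interpolation on a single column (a sketch; cost_esti2 is a hypothetical column name, and na.rm = FALSE keeps any leading or trailing NAs in place):
library(zoo)
# linearly interpolate interior NAs in cost against the row index
data[, cost_esti2 := na.approx(cost, na.rm = FALSE)]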

R data.table difference equation (dynamic panel data)

I have a data.table with a column v2 holding 'initial values' and a column v1 holding a growth rate. I would like to extrapolate v2 for years past the available value by growing the previous value by the factor v1. In 'time series' notation: v2(t) = v2(t-1) * v1(t), given some initial v2.
The problem is that the year of the initial value may vary by group x in the dataset. In some groups, v2 may be available in multiple years, or not at all. Also, the number of years per group may vary (unbalanced panel). Using the shift function does not help, because it shifts v2 only once and does not reference the previously updated value.
x year v1 v2
1: a 2012 0.8501072 NA
2: a 2013 1.0926093 39.36505
3: a 2014 1.2084379 NA
4: a 2015 0.8921997 NA
5: a 2016 0.8023251 NA
6: b 2012 1.1005287 NA
7: b 2013 1.0139800 NA
8: b 2014 1.1539676 NA
9: b 2015 1.2282501 NA
10: b 2016 0.8052265 NA
11: c 2012 0.8866425 NA
12: c 2013 0.9952566 44.30377
13: c 2014 0.9092020 NA
14: c 2015 1.0295864 15.04948
15: c 2016 0.8812966 NA
The value of v2 for x=a, year=2014 should be 39.36*1.208, and in 2015 that result times 0.89.
The following code, in a set of loops, works and does what I want:
ivec <- unique(DT[, x])
for (i in 1:length(ivec)) {
  tvec <- unique(DT[x == ivec[i], year])
  for (t in 2:length(tvec)) {
    if (is.na(DT[x == ivec[i] & year == tvec[t], v2])) {
      DT[x == ivec[i] & year == tvec[t],
         v2 := DT[x == ivec[i] & year == tvec[(t - 1)], v2] * v1]
    }
  }
}
Try this: group by x and by the cumulative count of non-NA values of v2 (so that each observed v2 starts a new run), then let Reduce carry the product forward within each run:
DT[, v2 := Reduce(`*`, v1[-1], init = v2[1], accumulate = TRUE), by = .(x, cumsum(!is.na(v2)))]
# x year v1 v2
# 1: a 2012 0.8501072 NA
# 2: a 2013 1.0926093 39.36505
# 3: a 2014 1.2084379 47.57022
# 4: a 2015 0.8921997 42.44213
# 5: a 2016 0.8023251 34.05239
# 6: b 2012 1.1005287 NA
# 7: b 2013 1.0139800 NA
# 8: b 2014 1.1539676 NA
# 9: b 2015 1.2282501 NA
# 10: b 2016 0.8052265 NA
# 11: c 2012 0.8866425 NA
# 12: c 2013 0.9952566 44.30377
# 13: c 2014 0.9092020 40.28108
# 14: c 2015 1.0295864 15.04948
# 15: c 2016 0.8812966 13.26306
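To see what the Reduce call does, here is a minimal standalone illustration with made-up numbers: with accumulate = TRUE it returns every intermediate product, so each value is the previous result multiplied by the next growth factor. The by = .(x, cumsum(!is.na(v2))) clause starts a new run at every observed v2, which is what lets the extrapolation restart from each known value.
# start at 100 and apply two growth factors in turn
Reduce(`*`, c(1.1, 0.9), init = 100, accumulate = TRUE)
# [1] 100 110  99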

How can I aggregate a data.table to quarterly frequency?

My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with data.table, a package I don't understand very well, to be honest.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
The date is given by X.DATA_BASE, where 199407 = July 1994. For each date I have several institutions with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One of the problems is that some institutions enter and leave over time. In the end I want mydata with the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date = rep(c(199401:199412, 199501:199512), 2),
                 firm = rep(c("A", "B"), each = 24),
                 value1 = rnorm(48, 1000, 10),
                 value2 = rnorm(48, 2000, 100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[, list(valueByQuarter = sum(value1) + sum(value2)),
                  by = list(firm,
                            year = date %/% 100,
                            quarter = (date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))
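As a quick sanity check of the quarter arithmetic, months 01, 04, 07 and 12 map to quarters 1 through 4:
(c(199401, 199404, 199407, 199412) %% 100 - 1) %/% 3 + 1
# [1] 1 2 3 4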

R data.table Conditional Sum: Cleaner way

This is of course a very common problem, so I expected many questions here on SO about it. However, all the answers I could find were very specific to the question at hand and often resorted to workarounds ("you don't have to do this, foobar is much better in this scenario") or non-data.table solutions. Perhaps this is because it should be a no-brainer with data.table.
I have a data.table which contains yearly data on tentgelt and te_med. For each year, I want to know the share of observations for which tentgelt > te_med. This is what I am doing:
# note that nAbove and nBelow need not add up to the total number of observations (ties are excluded)
nAbove <- wages[tentgelt > te_med, list(nAbove = .N), by=list(year)]
nBelow <- wages[tentgelt < te_med, list(nBelow = .N), by=list(year)]
nBelow[nAbove][, list(year, foo=nAbove/(nAbove+nBelow))]
This works, but whenever I see other people's data.table code, it looks much clearer and simpler than my workarounds. Is there a cleaner way to get the following type of output?
year foo
1: 1993 0.2372093
2: 1994 0.1567568
3: 1995 0.8132530
4: 1996 0.1235955
5: 1997 0.1065574
6: 1998 0.3070684
7: 1999 0.1491974
Here's a sample of my data:
year tentgelt te_med
1: 2010 120.95 53.64929
2: 2010 9.99 116.72601
3: 2010 113.52 53.07394
4: 2010 10.27 38.45728
5: 2010 48.58 124.65753
6: 2010 96.38 86.99060
7: 2010 3.46 65.75342
8: 2010 107.52 91.87592
9: 2010 107.52 42.92953
10: 2010 3.46 73.92328
11: 2010 96.38 85.23419
12: 2010 2.25 79.19995
13: 2010 42.32 35.75757
14: 2010 7.94 93.44305
15: 2010 120.95 113.41370
16: 2010 7.94 110.68628
17: 2010 107.52 127.30682
18: 2010 2.25 103.49036
19: 2010 120.95 123.62054
20: 2010 96.38 68.57532
For this sample, the expected output should be:
year V2
1: 2010 0.45
Try this:
wages[, list(foo = sum(tentgelt > te_med) / .N), by = year]
# year foo
# 1: 2010 0.45
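Since tentgelt > te_med is a logical vector, the share can be written even more compactly with mean(), which treats TRUE as 1 and FALSE as 0:
wages[, list(foo = mean(tentgelt > te_med)), by = year]
# year foo
# 1: 2010 0.45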
