R data.table Conditional Sum: Cleaner way

This is of course a very common problem, so I expected to find many questions about it here on SO. However, all the answers I could find were either very specific to the question at hand, suggested workarounds (you don't have to do this, foobar is much better in this scenario), or were non-data.table solutions. Perhaps this is because it should be a no-brainer with data.table.
I have a data.table which contains yearly data on tentgelt and te_med. For each year, I want to know the share of observations for which tentgelt > te_med. This is what I am doing:
# note: nAbove and nBelow need not add up to .N (rows with tentgelt == te_med fall in neither group)
nAbove <- wages[tentgelt > te_med, list(nAbove = .N), by=list(year)]
nBelow <- wages[tentgelt < te_med, list(nBelow = .N), by=list(year)]
nBelow[nAbove][, list(year, foo=nAbove/(nAbove+nBelow))]
which works but whenever I see other people's data.table code, it looks much clearer and easier than my workarounds. Is there a cleaner way to get the following type of output?
year foo
1: 1993 0.2372093
2: 1994 0.1567568
3: 1995 0.8132530
4: 1996 0.1235955
5: 1997 0.1065574
6: 1998 0.3070684
7: 1999 0.1491974
Here's a sample of my data:
year tentgelt te_med
1: 2010 120.95 53.64929
2: 2010 9.99 116.72601
3: 2010 113.52 53.07394
4: 2010 10.27 38.45728
5: 2010 48.58 124.65753
6: 2010 96.38 86.99060
7: 2010 3.46 65.75342
8: 2010 107.52 91.87592
9: 2010 107.52 42.92953
10: 2010 3.46 73.92328
11: 2010 96.38 85.23419
12: 2010 2.25 79.19995
13: 2010 42.32 35.75757
14: 2010 7.94 93.44305
15: 2010 120.95 113.41370
16: 2010 7.94 110.68628
17: 2010 107.52 127.30682
18: 2010 2.25 103.49036
19: 2010 120.95 123.62054
20: 2010 96.38 68.57532
For this sample, the expected output should be:
year V2
1: 2010 0.45

Try this
wages[, list(foo= sum(tentgelt > te_med)/.N), by = year]
# year foo
# 1: 2010 0.45
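Since the mean of a logical vector is the share of TRUE values, the same result can be written even more compactly (an equivalent sketch, not from the original answer):
wages[, .(foo = mean(tentgelt > te_med)), by = year]
#    year  foo
# 1: 2010 0.45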

Related

merge by nearest neighbour in group - R

I have two country*year level datasets that cover the same countries but in different years. I would like to merge the two in a way that year is matched with its nearest neighbor, always within country (iso2code).
The first (dat1) looks like this (showing here only the head for AT, but iso2code has multiple different values):
iso2code year elect_polar_lrecon
<chr> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91
5 AT 2014 4.01
6 AT 2019 3.55
The second (dat2) looks like this:
iso2code year affpol
<chr> <dbl> <dbl>
1 AT 2008 2.47
2 AT 2013 2.49
3 DE 1998 2.63
4 DE 2002 2.83
5 DE 2005 2.89
6 DE 2009 2.09
In the end I would like to have something like this (note that the value of affpol for 2008 could be matched with either 2010 or 2006, as it is equally distant from both; if possible, I would go for the most recent date, as below):
iso2code year.1 elect_polar_lrecon year.2 affpol
<chr> <dbl> <dbl> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91 2008 2.47
5 AT 2014 4.01 2013 2.49
6 AT 2019 3.55
Not sure how to do this... I am happy with a tidyverse solution, but really, all help is much appreciated!
As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest, which is available in the data.table package. Additionally, the OP has requested to go for the most recent date if matches are equally distant.
library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
            `:=`(year.2 = i.year, affpol = i.affpol)]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 2008 2.47
4: AT 2010 3.91 NA NA
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
This operation has updated dat1 by reference, i.e., the two additional columns were added without copying the whole data object.
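If modifying dat1 in place is not wanted, the join can target a copy instead (a minimal sketch using data.table's copy()):
library(data.table)
res <- copy(dat1)  # deep copy, so dat1 stays untouched
res[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
    `:=`(year.2 = i.year, affpol = i.affpol)]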
Now, the OP has requested to go for the most recent date if matches are equally distant, but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.
The workaround is to create a helper variable nyear which holds the negative year and to join on this:
setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year],
            roll = "nearest", on = c("iso2code", "nyear"),
            `:=`(year.2 = i.year, affpol = i.affpol)][
            , nyear := NULL]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 NA NA
4: AT 2010 3.91 2008 2.47
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
I figured it out with the help of a friend. I leave it here in case anyone else is looking for a solution. Assuming that the first dataset is to_plot and the second is called to_plot2. Then:
# for a given year and country, return the nearest year present in to_plot
find_nearest_year <- function(p_year, p_code){
  years <- to_plot$year[to_plot$iso2code == p_code]
  nearest_year <- years[1]
  # scan from most recent to oldest; the strict "<" means an equally
  # distant candidate never replaces the current pick
  for (i in sort(years, decreasing = TRUE)) {
    if (abs(i - p_year) < abs(nearest_year - p_year)) {
      nearest_year <- i
    }
  }
  return(nearest_year)
}
to_plot2 <- to_plot2 %>%
  group_by(iso2code, year) %>%
  mutate(matching_year = find_nearest_year(year, iso2code))
merged <- left_join(to_plot, to_plot2, by = c("iso2code", "year" = "matching_year"))
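For reference, the scan inside the helper can be collapsed with which.min (a sketch under the same assumptions; find_nearest_year2 is a hypothetical name):
find_nearest_year2 <- function(p_year, p_code) {
  years <- sort(to_plot$year[to_plot$iso2code == p_code], decreasing = TRUE)
  years[which.min(abs(years - p_year))]  # first minimum wins, i.e. the most recent year on ties
}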

Change name of column after uniqueN function

I am already happy with the results, but want to further tidy up my data by giving the right name to the respective column.
The problem to solve is to report the number of distinct authors included in each year's publications between 2000 and 2010. Here is my code and my result:
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000, uniqueN(Book_Author), by = "Year_Of_Publication"][order(Year_Of_Publication)]
Year_Of_Publication V1
1: 2000 12057
2: 2001 11818
3: 2002 11942
4: 2003 9913
5: 2004 4536
6: 2005 38
7: 2006 3
8: 2008 1
9: 2010 2
The numbers in the result are right, but I want to change the column name V1 to something like "Num_Of_Dif_Auth". I tried the setnames function, but as I don't want to change the underlying dataset it didn't help.
You can use:
library(data.table)
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000,
         .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
         by = Year_Of_Publication][order(Year_Of_Publication)]
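Alternatively, since the j expression returns a new data.table, setnames can be applied to that result without touching books_dt (a sketch based on the code above):
res <- books_dt[Year_Of_Publication %between% c(2000, 2010),
                uniqueN(Book_Author),
                by = Year_Of_Publication][order(Year_Of_Publication)]
setnames(res, "V1", "Num_Of_Dif_Auth")  # renames the result, not books_dt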

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns Product_Name(name of books), Year and Units (number of units sold in that year) which looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make an R Shiny dashboard with this, where I want to keep the year as a drop-down menu option, for which I wanted to have the dataframe in the following format:
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round where the Product Names are row indexes and Years are column indexes, either way goes.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (amongst many others):
dt = data.table(
  Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
  Year = c(rep(2011:2014, 3)),
  Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
You can use pivot_wider from the tidyr package. I assume your data is saved in df; you also need the dplyr package for %>% (piping):
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(names_from = Product_Name, values_from = Units)
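If the years should end up as row names rather than as a column, one option is a sketch like the following (note that tibble::column_to_rownames returns a plain data.frame, since tibbles do not carry row names):
df %>%
  pivot_wider(names_from = Product_Name, values_from = Units) %>%
  tibble::column_to_rownames("Year")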
Assuming that your dataframe is ordered by Product_Name and by year, I will generate artificial data similar to your dataframe. Try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts the reshaping
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output[, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
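A shorter base R route to the same wide table is xtabs, which sums the left-hand side over the cross-classifying factors (a sketch reusing the same uu):
# rows = Col_2 (years), columns = Col_1 (products), cells = sums of resp
xtabs(resp ~ Col_2 + Col_1, data = uu)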

R data.table difference equation (dynamic panel data)

I have a data table with a column v2 with 'initial values' and a column v1 with a growth rate. I would like to extrapolate v2 for years past the available value, by growing the previous value by the factor v1. In 'time series' notation: v2(t) = v2(t-1) * v1(t), given an initial v2(0).
The problem is, the year of the initial value may vary by group x in the dataset. In some groups, v2 may be available in multiple years, or not at all. Also, the number of years per group may vary (unbalanced panel). Using the shift function does not help, because it shifts v2 only once and does not reference the previously updated value.
x year v1 v2
1: a 2012 0.8501072 NA
2: a 2013 1.0926093 39.36505
3: a 2014 1.2084379 NA
4: a 2015 0.8921997 NA
5: a 2016 0.8023251 NA
6: b 2012 1.1005287 NA
7: b 2013 1.0139800 NA
8: b 2014 1.1539676 NA
9: b 2015 1.2282501 NA
10: b 2016 0.8052265 NA
11: c 2012 0.8866425 NA
12: c 2013 0.9952566 44.30377
13: c 2014 0.9092020 NA
14: c 2015 1.0295864 15.04948
15: c 2016 0.8812966 NA
The value of v2 for x=a, year=2014 should be 39.36*1.208, and in 2015 that result times 0.89.
The following code, in a set of loops, works and does what I want:
# loop over groups; for each missing v2, grow the previous year's value by v1
ivec <- unique(DT[, x])
for (i in 1:length(ivec)) {
  tvec <- unique(DT[x == ivec[i], year])
  for (t in 2:length(tvec)) {
    if (is.na(DT[x == ivec[i] & year == tvec[t], v2])) {
      DT[x == ivec[i] & year == tvec[t],
         v2 := DT[x == ivec[i] & year == tvec[(t - 1)], v2] * v1]
    }
  }
}
Try this: grouping by cumsum(!is.na(v2)) starts a new group at each non-missing v2, and Reduce with accumulate = TRUE then builds the running product within each group from that starting value:
DT[, v2 := Reduce(`*`, v1[-1], init = v2[1], accumulate = TRUE),
   by = .(x, cumsum(!is.na(v2)))]
# x year v1 v2
# 1: a 2012 0.8501072 NA
# 2: a 2013 1.0926093 39.36505
# 3: a 2014 1.2084379 47.57022
# 4: a 2015 0.8921997 42.44213
# 5: a 2016 0.8023251 34.05239
# 6: b 2012 1.1005287 NA
# 7: b 2013 1.0139800 NA
# 8: b 2014 1.1539676 NA
# 9: b 2015 1.2282501 NA
# 10: b 2016 0.8052265 NA
# 11: c 2012 0.8866425 NA
# 12: c 2013 0.9952566 44.30377
# 13: c 2014 0.9092020 40.28108
# 14: c 2015 1.0295864 15.04948
# 15: c 2016 0.8812966 13.26306
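To see what the accumulating Reduce is doing, here is the same pattern on a toy vector (a minimal illustration, not part of the original answer):
Reduce(`*`, c(1.2, 0.9), init = 100, accumulate = TRUE)
# [1] 100 120 108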

How to aggregate a monthly GDP series to quarterly in R

I'm starting to work with R now, and I'm having trouble turning monthly GDP data into quarterly data.
The command that I am using is:
library("data.table")
pib <- read.csv("PIB.csv", header = TRUE, sep = ";", dec = ",")
setDT(pib)
pib
attach(pib)
aggregate(pib, by = PIB.mensal, frequency = 4, FUN = "sum")
My data is the following:
datareferencia| GDP.month
1: 01/01/2010| 288.980,20
2: 01/02/2010| 285.738,70
3: 01/03/2010| 311.677,40
4: 01/04/2010| 307.106,60
5: 01/05/2010| 316.005,10
6: 01/06/2010| 321.032,90
7: 01/07/2010| 332.472,50
8: 01/08/2010| 334.225,30
9: 01/09/2010| 331.237,00
10: 01/10/2010| 344.965,70
11: 01/11/2010| 356.675,00
12: 01/12/2010| 355.730,60
13: 01/01/2011| 333.330,90
14: 01/02/2011| 335.118,40
15: 01/03/2011| 348.084,20
16: 01/04/2011| 349.255,90
17: 01/05/2011| 366.411,50
18: 01/06/2011| 371.046,10
19: 01/07/2011| 373.334,50
20: 01/08/2011| 377.005,90
21: 01/09/2011| 361.993,50
22: 01/10/2011| 378.843,40
23: 01/11/2011| 389.948,20
24: 01/12/2011| 392.009,40
Can someone help me? I need quarterly totals for both years, 2010 and 2011!
You can use the by argument of data.table to do this. A variable for the year and one for the quarter are all you need.
Reading in your data:
pib <- data.table(
  datareferencia = c("01/01/2010", "01/02/2010", "01/03/2010",
                     "01/04/2010", "01/05/2010", "01/06/2010",
                     "01/07/2010", "01/08/2010", "01/09/2010",
                     "01/10/2010", "01/11/2010", "01/12/2010",
                     "01/01/2011", "01/02/2011", "01/03/2011",
                     "01/04/2011", "01/05/2011", "01/06/2011",
                     "01/07/2011", "01/08/2011", "01/09/2011",
                     "01/10/2011", "01/11/2011", "01/12/2011"),
  GDP.month = c(288980.20, 285738.70, 311677.40,
                307106.60, 316005.10, 321032.90,
                332472.50, 334225.30, 331237.00,
                344965.70, 356675.00, 355730.60,
                333330.90, 335118.40, 348084.20,
                349255.90, 366411.50, 371046.10,
                373334.50, 377005.90, 361993.50,
                378843.40, 389948.20, 392009.40))
Convert the date column if that is not already done:
pib[, datareferencia := as.IDate(datareferencia, format = "%d/%m/%Y")]
With the year function from data.table you get ... well, the year.
For the quarter, use integer division (%/%) on the zero-based month: (month - 1) %/% 3 yields 0 to 3, and adding 1 turns that into quarters 1 to 4.
pib[, quarter := ((month(datareferencia) - 1) %/% 3) + 1]
pib[, year := year(datareferencia)]
At last you can calculate the sum by year and quarter:
pib[, sum.quarter := sum(GDP.month), by = c("quarter", "year")]
The result:
unique(pib[, list(quarter, year, sum.quarter)])
   quarter year sum.quarter
1:       1 2010      886396
2:       2 2010      944145
3:       3 2010      997935
4:       4 2010     1057371
5:       1 2011     1016534
6:       2 2011     1086714
7:       3 2011     1112334
8:       4 2011     1160801
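Equivalently, aggregating directly in j returns the summary table in one step, without adding a column and calling unique (a sketch on the same pib):
pib[, .(sum.quarter = sum(GDP.month)), by = .(year, quarter)]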
