dplyr group_by and iterative loop calculation - r

I am trying to perform an iterative calculation on grouped data that depends on two previous elements within a group. As a toy example:
library(data.table)
set.seed(100)
df = data.table(ID = rep("A_index1", 9),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
         ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2006 15 2.0
7: A_index1 2007 NA 3.0
8: A_index1 2008 13 1.5
9: A_index1 2009 NA 2.0
The objective is to fill the missing prices using the last available price and an index to adjust. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is defined in the loop below:
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2006 15 2.0 15.00
7: A_index1 2007 NA 3.0 22.50
8: A_index1 2008 13 1.5 13.00
9: A_index1 2009 NA 2.0 17.33
In my actual, much larger data, I will have to apply this function to multiple groups, and speed is a consideration. My attempt is below; I need help to point me in the right direction. I did consider Reduce, but I am not sure how it can incorporate two previous elements within the group.
foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}
df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))

One option with cumprod:
df %>%
  # group the data frame into chunks starting from each non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # for each chunk, multiply the first non-NA price by the cumprod of Index[i] / Index[i-1]
  mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default = first(Index))), 2)) %>%
  ungroup() %>%
  select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Group the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, and each chunk starts with a non-NA Price;
first(Price) * cumprod(Index / lag(Index, default=first(Index))) does the iterative calculation, which is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] until you reach Price_adj[1], i.e. first(Price);
Caveat: may not be very efficient if you have many NA chunks.
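As an aside on the Reduce idea mentioned in the question: a minimal sketch (mine, not part of the original answer and not benchmarked) that carries the previous adjusted price forward with Reduce() and a precomputed vector of Index ratios:
adj_with_reduce <- function(Price, Index) {
  ratio <- Index / dplyr::lag(Index, default = Index[1])  # ratio[i] = Index[i] / Index[i-1], ratio[1] = 1
  out <- Reduce(
    function(prev, i) if (is.na(Price[i])) prev * ratio[i] else Price[i],
    x = seq_along(Price), init = Price[1], accumulate = TRUE
  )
  out[-1]  # drop the init element so the result has length(Price)
}
# usage: df %>% group_by(ID) %>% mutate(Price_adj = round(adj_with_reduce(Price, Index), 2))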
If speed is the primary concern, you could write your function with the Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
  int n = price.size();
  NumericVector adjusted_price(n);
  adjusted_price[0] = price[0];
  for (int i = 1; i < n; i++) {
    if (NumericVector::is_na(price[i])) {
      adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
    } else {
      adjusted_price[i] = price[i];
    }
  }
  return adjusted_price;
}")
Now use the cpp function with dplyr as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
On the small sample data, there's already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
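A small aside (my addition, not from the benchmark above): since the OP's original df is a data.table, the compiled price_adj() helper can also be used with data.table's grouped update directly (call setDT(df) first if df has been rebuilt as a plain data frame or tibble):
setDT(df)[, Price_adj := round(price_adj(Price, Index), 2), by = ID]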

Related

merge by nearest neighbour in group - R

I have two country*year level datasets that cover the same countries but in different years. I would like to merge the two in a way that year is matched with its nearest neighbor, always within country (iso2code).
The first (dat1) looks like this (showing here only the head for AT, but iso2code has multiple different values):
iso2code year elect_polar_lrecon
<chr> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91
5 AT 2014 4.01
6 AT 2019 3.55
The second (dat2) looks like this:
iso2code year affpol
<chr> <dbl> <dbl>
1 AT 2008 2.47
2 AT 2013 2.49
3 DE 1998 2.63
4 DE 2002 2.83
5 DE 2005 2.89
6 DE 2009 2.09
In the end I would like to have something like the following (note that the value of affpol for 2008 could be matched both with 2010 and with 2006, as it is equally distant from both; if possible, I would go for the most recent date, as shown below):
iso2code year.1 elect_polar_lrecon year.2 affpol
<chr> <dbl> <dbl> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91 2008 2.47
5 AT 2014 4.01 2013 2.49
6 AT 2019 3.55
Not sure about how to do this... I am happy for a tidyverse solution, but really, all help is much appreciated!
As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest which is available in the data.table package. Additionally, the OP has requested to go for the most recent date if matches are equally distant.
library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
`:=`(year.2 = i.year, affpol = i.affpol)]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 2008 2.47
4: AT 2010 3.91 NA NA
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
This operation has updated dat1 by reference, i.e., it has added the two additional columns without copying the whole data object.
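If you prefer to leave dat1 untouched, a small variation of the same join (a sketch, same idea) updates a copy instead:
res <- copy(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
                  `:=`(year.2 = i.year, affpol = i.affpol)]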
Now, the OP has requested to go for the most recent date if matches are equally distant but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.
The workaround is to create a helper variable nyear which holds the negative year and to join on this:
setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year],
roll = "nearest", on = c("iso2code", "nyear"),
`:=`(year.2 = i.year, affpol = i.affpol)][
, nyear := NULL]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 NA NA
4: AT 2010 3.91 2008 2.47
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
I figured it out with the help of a friend. I leave it here in case anyone else is looking for a solution. Assuming that the first dataset is to_plot and the second is called to_plot2. Then:
find_nearest_year <- function(p_year, p_code){
  years <- to_plot$year[to_plot$iso2code == p_code]
  nearest_year <- years[1]
  for (i in sort(years, decreasing = TRUE)) {
    if (abs(i - p_year) < abs(nearest_year - p_year)) {
      nearest_year <- i
    }
  }
  return(nearest_year)
}
to_plot2 <- to_plot2 %>%
  group_by(iso2code, year) %>%
  mutate(matching_year = find_nearest_year(year, iso2code))
merged <- left_join(to_plot, to_plot2, by = c("iso2code", "year" = "matching_year"))
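A side note on the dplyr call above (my sketch, equivalent in intent): because find_nearest_year() is not vectorised, group_by(iso2code, year) evaluates it once per group; if each (iso2code, year) pair is unique in to_plot2, the same thing can be written with rowwise():
to_plot2 <- to_plot2 %>%
  rowwise() %>%
  mutate(matching_year = find_nearest_year(year, iso2code)) %>%
  ungroup()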

Euclidean distance for distinct classes of factors iterated by groups

*Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in the computation (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes, according to the following formula:
El_Dist = sqrt( sum_i (X_i - Y_i)^2 / ( sum_i X_i^2 * sum_i Y_i^2 ) )
where X_i represents the number of patents belonging to class i in year t, Y_i represents the number of patents belonging to class i in the previous year (t-1), and the sum runs over the distinct classes i appearing in the two years.
To further illustrate this, consider the following dataset:
library(data.table)
df <- data.table(Firm = rep(LETTERS[1:2], each = 6),
                 Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
                 Patent_Number = sample(184785:194785, 12, replace = FALSE),
                 Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since year 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to year 1991, the distinct classes for this year (1991) and the previous year (1990) are 31, 5, and 12. Therefore, the above formula is summed over these three distinct classes (there are three distinct i's). So the formula's output will be:
sqrt( ((1-0)^2 + (0-1)^2 + (0-1)^2) / ( (1^2 + 0^2 + 0^2) * (0^2 + 1^2 + 1^2) ) ) = sqrt(3/2) = 1.2247450
Following the same calculation and reiterating over firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i){
    j <- which(Year %in% (Year[i] - 1:0))
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      NA_real_
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer/denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
by = .(Firm),
.SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
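Regarding the update about 7 million observations: the sapply() above calls table() once per row, even though the result depends only on the row's Year. A sketch of a variation (mine, untested at that scale) computes one distance per distinct Year within each Firm and then maps it back to the rows:
fEl_Dist_year <- function(Year, PatentClass){
  yrs <- sort(unique(Year))
  d <- sapply(yrs, function(y){
    j <- which(Year %in% (y - 1:0))
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      NA_real_
    } else {
      sqrt(sum((tbl[2, ] - tbl[1, ])^2) / (sum(tbl[2, ]^2) * sum(tbl[1, ]^2)))
    }
  })
  d[match(Year, yrs)]  # map the per-year distances back to the rows
}
setDT(df)[, El_Dist2 := fEl_Dist_year(Year, Patent_Class), by = Firm]  # should match El_Dist above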

Taking variance of some rows above in panel structure (R data.table)

# Example of panel data
library(data.table)
panel <- data.table(expand.grid(Year = c(2017:2020), Individual = c("A", "B", "C")))
panel$value <- rnorm(nrow(panel), 10)  # The value I am interested in
I want to take the variance of the prior two years' value by Individual.
For example, if I were to sum the value of prior two years I would do something like:
panel[,sum_of_past_2_years:=shift(value)+shift(value, 2),Individual]
I thought this would work.
panel[,var(c(shift(value),shift(value, 2))),Individual]
# This doesn't work of course
Ideally the answer should look like
a<-c(NA,NA,var(panel$value[1:2]),var(panel$value[2:3]))
b<-c(NA,NA,var(panel$value[5:6]),var(panel$value[6:7]))
c<-c(NA,NA,var(panel$value[9:10]),var(panel$value[10:11]))
panel[,variance_past_2_years:=c(a,b,c)]
# NAs when there is no value for 2 prior years
You can use frollapply to perform a rolling operation over every 2 values.
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
# Year Individual value var
# 1: 2017 A 9.416218 NA
# 2: 2018 A 8.424868 NA
# 3: 2019 A 8.743061 0.49138739
# 4: 2020 A 9.489386 0.05062333
# 5: 2017 B 10.102086 NA
# 6: 2018 B 8.674827 NA
# 7: 2019 B 10.708943 1.01853361
# 8: 2020 B 11.828768 2.06881272
# 9: 2017 C 10.124349 NA
#10: 2018 C 9.024261 NA
#11: 2019 C 10.677998 0.60509700
#12: 2020 C 10.397105 1.36742220
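An equivalent way to write this (a small variation with the same result) is to shift after the rolling computation:
panel[, var := shift(frollapply(value, 2, var)), Individual]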

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frames, and for each year I want to look up the matching year in data frame 2 and multiply a column from data frame 1 by the number found there. So for example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes the inflation conversion rate (code from #akrun):
library(dplyr)
ref_inf <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf)
inf_data <- inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do is this: for example, the first row of data frame 1 has the year 2001, so I go and find the conversion factor for the year 2001 in data frame 2, which is 1.201035, and then multiply the price in data frame 1 by this conversion rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else commands?
We can do a join on 'year' with 'ref_year' and create the new column by assigning (:=) the product of 'price' and 'final_inf':
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv) -> df
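For completeness, a base R sketch with match() does the same lookup without a join (assuming inf_data already contains the final_inf column computed above):
df$after_conv <- df$price * inf_data$final_inf[match(df$year, inf_data$ref_year)]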

Why does a grouping function work slowly when fitting lm models

I have a data frame with 5 columns and about 12,000,000 rows.
lon lat for_R_WDEP_SOX number year
2 -29.95 30.05 128.44461 1 2000
624002 -29.95 30.05 320.17755 1 2001
1248002 -29.95 30.05 192.20628 1 2002
1872002 -29.95 30.05 325.44336 1 2003
2496002 -29.95 30.05 368.46976 1 2004
3120002 -29.95 30.05 409.80154 1 2005
3744002 -29.95 30.05 265.71161 1 2006
4368002 -29.95 30.05 147.92351 1 2007
4992002 -29.95 30.05 279.87851 1 2008
5616002 -29.95 30.05 136.38370 1 2009
6240002 -29.95 30.05 223.43958 1 2010
6864002 -29.95 30.05 132.92253 1 2011
7488002 -29.95 30.05 112.68416 1 2012
8112002 -29.95 30.05 83.81801 1 2013
8736002 -29.95 30.05 80.33523 1 2014
9360002 -29.95 30.05 71.58231 1 2015
9984002 -29.95 30.05 91.07822 1 2016
10608002 -29.95 30.05 98.69281 1 2017
I am trying to apply a grouping function to it:
gromov_analise_fuction <- function(table)
{
  x <- table$year
  y <- table$for_R_WDEP_SOX
  line <- lm(y ~ x)
  p_value_coef <- summary(line)$coefficients["x", "Estimate"] /
    abs(summary(line)$coefficients["x", "Estimate"]) *
    (1 - summary(line)$coefficients["x", "Pr(>|t|)"])
  k <- summary(line)$coefficients["x", "Estimate"]
  B_K <- summary(line)$coefficients["x", "Estimate"] * 1800 /
    summary(line)$coefficients["(Intercept)", "Estimate"]
  result_vector <- c(p_value_coef, k, B_K)
  return(result_vector)
}
result <- table %>%
  group_by(number) %>%
  do(data.frame(val = gromov_analise_fuction(.)))
It runs for about 30-37 minutes. Please tell me, what is the reason? How should I make this code work faster? As I understand it, I should remove unused vectors and data frames.
library(data.table)
setDT(table)  # convert the data frame to a data.table first
result2 <- table[,
                 data.frame(val = gromov_analise_fuction(.SD)),
                 by = number]
See how much faster data.table is! (BTW, check Hadley Wickham's GitHub repo, he has been re-writing dplyr verbs with the data.table backend to make these faster.)
Unit: nanoseconds
expr min lq mean median uq max neval cld
dplyr 3717853 4262732 5028474.44 4526830 5648242.0 15919477 100 b
data.table 35 39 316.29 41 603.5 3024 100 a
And here are the outputs from your code with dplyr and the data.table way (with the data you provided in your question, I agree with #Roland though when it comes to providing data!):
> result (dplyr)
# A tibble: 3 x 2
# Groups: number [1]
number val
<int> <dbl>
1 1 -0.998
2 1 -13.9
3 1 -0.890
> result2 (data.table)
number val
1: 1 -0.9979470
2: 1 -13.8587730
3: 1 -0.8900289
You can speed up fitting the models significantly, but the slow part is calculating the model summaries. The "low-hanging fruit" is limiting the calls to summary to one call per model, obviously. You might want to create your own function that calculates only the values of interest, i.e., the p-values for the slopes (see there for the calculation of the standard errors). For the coefficients you can just use the function coef which is fast.
Here is an approach that fits the models much faster (but still uses summary):
library(data.table)
n <- 1e5
set.seed(42)
DT <- data.table(x = 1:10, y = rnorm(n), g = rep(seq_len(n/10), each = 10))
#fit each model separately
system.time({
  res <- DT[, .(pslope = summary(lm(y ~ x, data = .SD))$coefficients["x", "Pr(>|t|)"]), by = g]
})
#user system elapsed
#5.89 0.01 6.02
#use the fact that the models have all the same design matrix
system.time({
  DT1 <- dcast(DT, x ~ g, value.var = "y")
  setnames(DT1, make.names(names(DT1)))
  fit <- lm(as.formula(sprintf("cbind(%s) ~ x", paste(names(DT1)[-1], collapse = ","))), data = DT1)
  pslope <- vapply(summary(fit), function(fitsum) fitsum$coefficients["x", "Pr(>|t|)"], FUN.VALUE = 1.0)
})
#user system elapsed
#4.34 0.00 4.42
#same result
all.equal(unname(pslope), res[["pslope"]])
#[1] TRUE
#intercepts
coef(fit)[1,]
#slopes
coef(fit)[2,]
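Building on the remark above about avoiding summary(): a sketch (mine, using standard lm accessors, not benchmarked here) that computes the slope p-values directly from the multi-response fit:
X <- model.matrix(fit)                     # common design matrix (intercept + x)
xtx_inv <- chol2inv(chol(crossprod(X)))    # (X'X)^-1
rdf <- nrow(X) - ncol(X)                   # residual degrees of freedom
sigma2 <- colSums(residuals(fit)^2) / rdf  # residual variance per response
se_slope <- sqrt(sigma2 * xtx_inv[2, 2])   # standard error of each slope
tval <- coef(fit)[2, ] / se_slope
pslope2 <- 2 * pt(-abs(tval), df = rdf)
all.equal(unname(pslope2), unname(pslope)) # should be TRUE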
