Speeding up rowwise comparison in R

I have a data.frame like the following:
id year x y v1
1 2006 12 1 0.8510703
1 2007 12 1 0.5954527
1 2008 12 2 -1.9312854
1 2009 12 1 0.1558393
1 2010 8 1 0.9051487
2 2001 12 2 -0.5480566
2 2002 12 2 -0.7607420
2 2003 3 2 -0.8094283
2 2004 3 2 -0.1732794
I would like to sum up (grouped by id) v1 of consecutive years (so 2010 and 2009, 2009 and 2008, and so on), but only if x and y match. Expected output:
id year res
1 2010 NA
1 2009 NA
1 2008 NA
1 2007 1.4465230
2 2004 -0.9827077
2 2003 NA
2 2002 -1.3087987
The oldest year per id is removed, as there is no preceding year.
I have a slow lapply solution in place but would like to speed things up, as my data is rather large.
Data:
set.seed(1)
dat <- data.frame(id = c(rep(1,5), rep(2,4)), year = c(2006:2010, 2001:2004),
                  x = c(12,12,12,12,8,12,12,3,3), y = c(1,1,2,1,1,2,2,2,2),
                  v1 = rnorm(9))
Current Solution:
require(dplyr)
myfun <- function(dat) {
  do.call(rbind, lapply(rev(unique(dat$year)[-1]), function(z)
    inner_join(dat[dat$year == z, 2:5],
               dat[dat$year == z - 1, 2:5],
               by = c("x", "y")) %>%
      summarise(year = z, res = ifelse(nrow(.) < 1, NA, sum(v1.x, v1.y)))))
}
dat %>% group_by(id) %>% do(myfun(.))

Here is a data.table solution.
datNew <- setDT(dat)[, .(year = year, res = (v1 + shift(v1)) * NA^(x != shift(x) | y != shift(y))),
                     by = id][, .SD[-1], by = id][]
datNew
   id year        res
1:  1 2007 -0.4428105
2:  1 2008         NA
3:  1 2009         NA
4:  1 2010         NA
5:  2 2002 -0.3330393
6:  2 2003         NA
7:  2 2004  1.3141061
(The res values differ from the expected output above because the data printed in the question do not match what set.seed(1) actually generates.)
Here, the j statement contains a list with two elements: the year and a calculation. The calculation sums each value with its lagged value, using shift, and multiplies the result by NA or 1 depending on whether x and y match their lagged values. This is performed by id. The output is fed to a second chain, which drops the first observation of each id, since it has no preceding year.
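The NA^ trick works because any base raised to the power 0 is 1 in R, even NA, while NA^1 is NA:
NA^c(TRUE, FALSE)
# [1] NA  1
So multiplying the summed values by NA^(mismatch) leaves them unchanged where x and y match and turns them into NA where they don't.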
You can adjust the order efficiently using setorder if desired.
setorder(datNew, id, -year)
datNew
   id year        res
1:  1 2010         NA
2:  1 2009         NA
3:  1 2008         NA
4:  1 2007 -0.4428105
5:  2 2004  1.3141061
6:  2 2003         NA
7:  2 2002 -0.3330393

Assuming the years are sorted within each id, as in the example:
dat %>%
  group_by(id) %>%
  mutate(res = v1 + lag(v1),                                   # sum with the lagged value
         res = ifelse(x == lag(x) & y == lag(y), res, NA)) %>% # NA if x and y don't match
  slice(-1)                                                    # drop the first year per id
You can add %>% select(id, year, res) and %>% arrange(id, desc(year)) at the end if you want.
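Putting the optional steps together, the full pipeline might read:
dat %>%
  group_by(id) %>%
  mutate(res = v1 + lag(v1),
         res = ifelse(x == lag(x) & y == lag(y), res, NA)) %>%
  slice(-1) %>%
  select(id, year, res) %>%
  arrange(id, desc(year))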

Related

Data file organized in block per year: add column for year to combine all data

I have a large number of datasets in Excel that I would like to analyze in R. The files are organized in blocks, one per year, which look like:
Group <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA,'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)
df <- cbind(Group, Value)
Group Value
1: 2010 NA
2: Group Value
3: A 1
4: B 2
5: C 9
6: 2011 NA
7: Group Value
8: A 3
9: B 5
10: E 2
11: 2012 NA
12: Group Value
13: A 9
14: B 1
To be able to analyze the data, I would like to automatically add a column for the year so that all data can be combined, as follows:
Year Group Value
1: 2010 A 1
2: 2010 B 2
3: 2010 C 9
4: 2011 A 3
5: 2011 B 5
6: 2011 E 2
7: 2012 A 9
8: 2012 B 1
library(data.table)
dt <- data.table(df)
# a new block starts at each NA in Value; take the year from the
# block's first row, then drop the year and header rows
dt[, Year := Group[1], cumsum(is.na(Value))][Value != 'Value']
Group Value Year
1: A 1 2010
2: B 2 2010
3: C 9 2010
4: A 3 2011
5: B 5 2011
6: E 2 2011
7: A 9 2012
8: B 1 2012
In base R (with df <- data.frame(Group, Value)):
subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN=\(x)x[1])), Value != 'Value')
Group Value Year
3 A 1 2010
4 B 2 2010
5 C 9 2010
8 A 3 2011
9 B 5 2011
10 E 2 2011
13 A 9 2012
14 B 1 2012
Note that the columns above are character. You can use type.convert(new_df, as.is = TRUE), where new_df is the resulting data frame, to convert the columns to their respective classes.
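For example, with new_df standing in for the result above:
new_df <- type.convert(new_df, as.is = TRUE)
sapply(new_df, class)
#       Group       Value        Year
# "character"   "integer"   "integer"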
Here is one method with tidyverse: create the 'Year' column where the 'Group' values are four-digit numbers, filter out the header rows where the value is 'Group', fill the 'Year' column with the previous non-NA values, drop the first row of each block with duplicated, and convert the column types with type.convert.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  mutate(Year = case_when(str_detect(Group, "^\\d{4}$") ~ Group)) %>%
  filter(Group != 'Group') %>%
  fill(Year) %>%
  filter(duplicated(Year)) %>%
  type.convert(as.is = TRUE) %>%
  select(Year, Group, Value)
-output
Year Group Value
1 2010 A 1
2 2010 B 2
3 2010 C 9
4 2011 A 3
5 2011 B 5
6 2011 E 2
7 2012 A 9
8 2012 B 1
data
df <- data.frame(Group, Value)

Remove rows out of a specific year range, without using for-loop in R

I am looking for a way to omit the rows that are not part of a complete run of years between two specific values, without using a for loop. All values in the year column are between 1999 and 2002; however, some runs do not include all the years between these two dates. You can see the initial data as follows:
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
The processed dataset should only include complete consecutive runs from 1999 to 2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get previous data.frame without any problem:
for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
  if(a[i,1] == 1999 & a[i+3,1] == 2002){
    b <- a[i:(i+3),]
  } else {next}
  if(!exists("d")){
    d <- b
  } else {
    d <- rbind(d,b)
  }
}
However, I have more than 1 million rows, and I need to do this without a for loop. Is there a faster way?
You could try this. First we create groups of consecutive years, then we join with the full year range, then we drop any group that is not complete. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- data_frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                 id = c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
df %>%
  mutate(groups = cumsum(c(0, diff(year) != 1))) %>%
  nest(-groups) %>%
  mutate(data = map(data, .f = ~full_join(.x, data_frame(year = 1999:2002), by = "year")),
         drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>%
  unnest() %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
There is a function that can do this automatically. First, install the dplyr package (or tidyverse) with install.packages("dplyr") or install.packages("tidyverse").
Then load the package with library(dplyr).
Then use the filter function: a_filtered <- filter(a, year >= 1999 & year <= 2002).
This should be fast even when there are many rows.
We could also do this by creating a grouping column based on a logical expression checking for year 1999, then filtering by checking that the first 'year' in each group is 1999, the last is 2002, and all the years in between are present for that particular 'grp':
library(dplyr)
a %>%
  group_by(grp = cumsum(year == 1999)) %>%
  filter(dplyr::first(year) == 1999,
         dplyr::last(year) == 2002,
         all(1999:2002 %in% year)) %>%
  ungroup %>%   # to remove the 'grp'
  select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
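Given the 1-million-row constraint, a data.table version of the same grouping idea may also be worth trying. This is a sketch, assuming the years within each run are sorted as in the example:
library(data.table)
# a new run starts at each 1999; keep only complete 1999-2002 runs
setDT(a)[, grp := cumsum(year == 1999)][
  , if (length(year) == 4L && all(year == 1999:2002)) .SD, by = grp][
  , grp := NULL][]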

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand it, the various versions of expand only work on unique values, but my data frame doesn't contain all the values explicitly.
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
You could use tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>%
  gather(key = key, value = year, -id) %>%
  select(-key) %>%
  group_by(id) %>%
  complete(year = full_seq(year, 1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <- data.frame(
  id = c("a", "b"),
  start = c(2002, 2004),
  end = c(2005, 2007)
)
df %>%
  rowwise() %>%
  mutate(year = list(seq(start, end))) %>%
  select(-start, -end) %>%
  unnest()
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
# one year sequence per row; repeat each id to match its sequence length
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))

Fill numeric variable while preserving group

[EDITED to reflect a better example]
Say I have a dataframe like this:
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
> df
x year
1 A 2001
2 A 2004
3 B 2002
4 B 2005
How can I increment year by 1 while preserving x? I would like to fill in year so that the sequence is this:
x year
1 A 2001
2 A 2002
3 A 2003
4 A 2004
5 B 2002
6 B 2003
7 B 2004
8 B 2005
Can anyone recommend a good way of doing this?
@useR recommended this approach:
> data.frame(year = min(df$year):max(df$year)) %>%
full_join(df) %>%
fill(x)
Joining, by = "year"
year x
1 2001 A
2 2002 B
3 2003 B
4 2004 A
5 2005 B
However, that does not match the desired output.
An option using tidyr::complete and dplyr::lead:
library(tidyverse)
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
df %>%
  mutate(nextYear = ifelse(is.na(lead(year)), year, lead(year) - 1)) %>%
  group_by(x) %>%
  complete(year = seq(year, nextYear, by = 1)) %>%
  select(-nextYear) %>%
  as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 B 2004
# 5 B 2005
# 6 B 2006
# 7 C 2007
Edit: the solution for the modified data:
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
library(tidyverse)
df %>%
  group_by(x) %>%
  complete(year = seq(min(year), max(year), by = 1)) %>%
  as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 A 2004
# 5 B 2002
# 6 B 2003
# 7 B 2004
# 8 B 2005
Using base R (with a little help from zoo):
full_df = data.frame(year = min(df$year):max(df$year))
df = merge(df, full_df, all = TRUE)
df = df[order(df$year), ]
df$x = zoo::na.locf(df$x)
df
# year x
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
Using the "tidyverse"
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
library(dplyr)
library(tidyr)
df = df %>%
  mutate(year = factor(year, levels = min(year):max(year))) %>%
  complete(year) %>%
  fill(x) %>%
  mutate(year = as.numeric(as.character(year)))
df
# # A tibble: 7 x 2
# year x
# <dbl> <fctr>
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
We can first split by x, then create a year vector for each x group, join it with that group's data, fill down x, and finally rbind all the group data frames together.
library(dplyr)
library(tidyr)
df %>%
  split(.$x) %>%
  lapply(function(y) data.frame(year = min(y$year):max(y$year)) %>%
           full_join(y) %>%
           fill(x)) %>%
  unname() %>%
  do.call(rbind, .)
Result:
year x
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
Here's a pretty simple base R method with tapply and stack.
stack(tapply(df$year, df["x"], function(x) min(x):max(x)))
Here, tapply splits the year vector by df$x groups and then constructs a sequence from the min to the max year. This returns a named list which is fed to stack to produce the following.
values ind
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
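If you want the original column names and order back, you can relabel the result, for example:
res <- stack(tapply(df$year, df["x"], function(x) min(x):max(x)))
setNames(res[, c("ind", "values")], c("x", "year"))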
If you are curious how you might do this in data.table, it's also pretty straightforward:
library(data.table)
setDT(df)[, .(year=min(year):max(year)), by=x]
which returns
x year
1: A 2001
2: A 2002
3: A 2003
4: A 2004
5: B 2002
6: B 2003
7: B 2004
8: B 2005

Percentage of reoccurring observations by group

I have a firm-employee data set through time that looks like this:
dat <- data.table(firm = c(rep("A", 8), rep("B", 8)),
                  employee = c(1, 2, 3, 4, 1, 2, 3, NA, 5, 6, NA, NA, 5, 6, 7, 8),
                  year = rep(c(rep(1, 4), rep(2, 4)), 2))
firm employee year
A 1 1
A 2 1
A 3 1
A 4 1
A 1 2
A 2 2
A 3 2
A NA 2
B 5 1
B 6 1
B NA 1
B NA 1
B 5 2
B 6 2
B 7 2
B 8 2
I want to calculate, for each firm, the percentage of employees from year==1 who were still working in year==2.
The output should be like this
firm year continued_employees
A 2 0.75
B 2 1
I can do it in a loop for each year, using
sum(employee[year==1] %in% employee[year==2]) / length(employee[year==1])
but I have around 40k firms and 10 years of observations. Any thoughts on how to do it in a dplyr or data.table syntax?
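The answers below all use data.table; for reference, a rough dplyr translation of the same lead/lag idea (a sketch, not benchmarked; dat is the data.table defined above) could look like this:
library(dplyr)
dat %>%
  filter(!is.na(employee)) %>%
  group_by(firm, employee) %>%
  arrange(year, .by_group = TRUE) %>%
  # TRUE if the same employee shows up again the following year
  mutate(cont = lead(year) == year + 1) %>%
  group_by(firm, year = year + 1) %>%
  summarise(continued_employees = sum(cont, na.rm = TRUE) / n(),
            .groups = "drop") %>%
  filter(year <= max(dat$year))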
Here's a not-so-pretty data.table approach you could use for any number of firms and years:
years <- head(sort(unique(dt$year)), -1)
setNames(lapply(years, function(y) {
  dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
    !is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][,
    .(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
#$`Year-1`
# firm continued year
#1: A 0.75 2
#2: B 1.00 2
Since you only have two years in your sample data, you only get a single list element in return.
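With more than two years, the list elements can be stacked into a single table, as the benchmark further down also does. For example (res_list standing in for the named list returned above):
rbindlist(res_list, idcol = "period")  # idcol keeps the list names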
Join with shifted year
This is an approach using a kind of self join with a shifted year:
library(data.table)
options(datatable.print.class = TRUE)
# self join with shifted year
DT[DT[, .(firm, employee, year = year - 1)],
   on = .(firm, employee, year), cont := TRUE][]
# aggregate
DT[!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
  # beautify result
  year <= max(DT$year)]
firm year V1
<char> <num> <num>
1: A 2 0.75
2: B 2 1.00
The first expression modifies DT to indicate continued employees:
firm employee year cont
<char> <num> <num> <lgcl>
1: A 1 1 TRUE
2: A 2 1 TRUE
3: A 3 1 TRUE
4: A 4 1 NA
5: A 1 2 NA
6: A 2 2 NA
7: A 3 2 NA
8: A NA 2 NA
9: B 5 1 TRUE
10: B 6 1 TRUE
11: B NA 1 NA
12: B NA 1 NA
13: B 5 2 NA
14: B 6 2 NA
15: B 7 2 NA
16: B 8 2 NA
Using shift()
Alternatively, the shift() function can be used to compute the cont column. The aggregation part is the same as in the join approach above. shift() requires the data to be ordered by year.
DT[order(year), cont := shift(year, type = "lead") == year + 1, by = .(firm, employee)][
  !is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
  year <= max(DT$year)]
Benchmark
At the time of writing, three approaches have been proposed in addition to the OP's own loop-based attempt:
by docendo discimus
join with shifted year
using shift()
Jean Vuda's answer (the last one below) is not considered in the benchmark, as it is limited to two years.
According to the OP, the production data set consists of 40 k firms and 10 years of data. For a realistic benchmark, a sample data set of similar size is created:
n_firm <- 40000L
max_employee <- 10L
fluctuation_rate <- 0.2
n_year <- 10L
start_year <- 2001L
DT0 <- CJ(firm = sprintf("%06i", seq_len(n_firm)),
          employee = seq_len(max_employee),
          year = seq(start_year, length.out = n_year))
set.seed(123L)
n_row <- nrow(DT0)
DT0[sample.int(n_row, fluctuation_rate * n_row), employee := NA]
The sample data set consists of 4 M rows and can be visualised best after reshaping from long to wide format:
dcast(DT0[!is.na(employee)], firm + employee ~ year)
Using 'year' as value column. Use 'value.var' to override
firm employee 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
<char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 000001 1 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
2: 000001 2 2001 2002 2003 NA 2005 2006 2007 NA 2009 NA
3: 000001 3 2001 2002 NA NA 2005 2006 2007 2008 2009 2010
4: 000001 4 2001 NA NA NA 2005 2006 2007 2008 NA 2010
---
399996: 040000 6 2001 2002 NA 2004 2005 NA NA NA 2009 2010
399997: 040000 7 NA 2002 NA NA 2005 2006 2007 2008 2009 2010
399998: 040000 8 2001 2002 2003 NA NA NA 2007 NA NA 2010
399999: 040000 9 2001 2002 2003 NA 2005 2006 2007 2008 2009 NA
400000: 040000 10 2001 2002 2003 NA NA 2006 2007 2008 2009 2010
For benchmarking, the microbenchmark package is used because a check function can be passed to verify the results are identical:
my_check <- function(values) {
  values <- lapply(values, function(x) x[, dcast(.SD, firm ~ year, value.var = "continued")])
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
The benchmark code:
microbenchmark::microbenchmark(
  dd = {
    dt <- copy(DT0)
    years <- head(sort(unique(dt$year)), -1)
    rbindlist(
      setNames(lapply(years, function(y) {
        dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
          !is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
          , .(continued = mean(V1), year = y+1), by = firm]
      }), paste("Year", years, sep="-"))
    )
  },
  join = {
    DT <- copy(DT0)
    DT[DT[, .(firm, employee, year = year - 1)],
       on = .(firm, employee, year), cont := TRUE][
      !is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
      by = .(firm, year = year + 1)][
      year <= max(DT$year)]
  },
  shift = {
    DT <- copy(DT0)
    DT[order(year), cont := shift(year, type = "lead") == year + 1,
       by = .(firm, employee)][
      !is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
      by = .(firm, year = year + 1)][
      year <= max(DT$year)]
  },
  check = my_check,
  times = 3L
)
The benchmark results show that the join approach is about 4 times faster than the shift approach and about 8 times faster than docendo discimus's approach.
Unit: seconds
expr min lq mean median uq max neval cld
dd 11.756114 11.919959 12.083042 12.083805 12.246506 12.409207 3 c
join 1.054293 1.239829 1.303971 1.425366 1.428810 1.432254 3 a
shift 6.105725 6.105906 6.148136 6.106087 6.169342 6.232596 3 b
Here is a slightly different approach to do it:
dt <- dat[, .(all = list(unique(employee))), by = .(year, firm)]
dt <- dt[, .(year1 = sapply(list(all), `[`, 1),
             year2 = sapply(list(all), `[`, 2)), by = firm]
dt[, uniqueN(mapply(intersect, year1, year2)) / uniqueN(na.omit(unlist(year1))), by = firm]
