Remove discontinuous time points in longitudinal data in R

I have firm-year longitudinal data, but the years are not continuous for some firms. For example:
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1999,2000,2001,1995,1997,2008,2009,2010))
For each firm, I want to keep the observations in the most recent run of continuous years and remove the rest. For example, firm 1 has observations in five years (1990, 1991, 1999, 2000, 2001), and I want to keep (1999, 2000, 2001).
I can think of some awkward approaches to this, but I am wondering if there is an easier way.
Prompted by the comments, I am also wondering whether there is a way to keep the longest continuous block of years instead. For example:
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1992,2000,2001,1995,1997,2008,2009,2010))
The result would keep (1990, 1991, 1992) for firm 1 and (2008, 2009, 2010) for firm 2.

library(data.table)
# d marks each run of consecutive years (it increments whenever the gap to the
# previous year exceeds 1); n is the length of the run each row belongs to
DT2 <- setorder(dt, firm_id, year)[
  , d := cumsum(c(TRUE, diff(year) > 1)), by = .(firm_id)][
  , n := .N, by = .(firm_id, d)]
DT2
# firm_id year d n
# <num> <num> <int> <int>
# 1: 1 1990 1 3
# 2: 1 1991 1 3
# 3: 1 1992 1 3
# 4: 1 2000 2 2
# 5: 1 2001 2 2
# 6: 2 1995 1 1
# 7: 2 1997 2 1
# 8: 2 2008 3 3
# 9: 2 2009 3 3
# 10: 2 2010 3 3
From here, if you want runs of 3 consecutive years or more, then
DT2[ (n > 2), ]
If you want the longest run for each firm_id, then
DT2[, .SD[n == max(n),], by = .(firm_id) ]
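For the original question of keeping only the most recent continuous run per firm, one further option (a small sketch building on DT2 above) is to take the run with the largest d, since d increases with time within each firm:
DT2[, .SD[d == max(d)], by = firm_id]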

R subset data.table by condition, getting nearest value row if missing

I would like to subset a data.table by a condition on Year. Basically, I want the rows of dt that match a given year, per group. However, some groups do not have a complete timeline across all years, so I would like to return the nearest year's data for those groups, so that every group is represented for any year chosen (whether or not that is exactly the requested year).
library(data.table)
# make dummy data
dt <- data.table(Group = c(rep("A", 5), rep("B", 3), rep("C", 5), rep("D", 2)),
                 x = sample(1:10, 15, rep = TRUE),
                 Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
# subset by, e.g., Year == 2015 is fine, but I want a full result for ANY
# year chosen, such as 2012, by using the closest entry in time, per group.
# Attempt;
y <- 2012
dt[Year == which.min(abs(Year - y)), .SD, by = Group]
Empty data.table (0 rows and 3 cols): Group,x,Year
The result in this example should be:
Group x Year
1: A 4 2012
2: B 7 2013
3: C 2 2012
4: D 3 2014
You are close: the use of which.min(abs(Year - y)) is good, but needs to be within the .SD-subsetting in the j portion.
dt[, .SD[which.min(abs(Year - y)),], Group]
# Group x Year
# <char> <int> <int>
# 1: A 5 2012
# 2: B 4 2013
# 3: C 8 2012
# 4: D 5 2014
Reproducible data
set.seed(42)
dt <- data.table(Group = c(rep("A", 5), rep("B", 3), rep("C", 5), rep("D", 2)),
                 x = sample(1:10, 15, rep = TRUE),
                 Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
dt
# Group x Year
# <char> <int> <int>
# 1: A 1 2011
# 2: A 5 2012
# 3: A 1 2013
# 4: A 9 2014
# 5: A 10 2015
# 6: B 4 2013
# 7: B 2 2014
# 8: B 10 2015
# 9: C 1 2011
# 10: C 8 2012
# 11: C 7 2013
# 12: C 4 2014
# 13: C 9 2015
# 14: D 5 2014
# 15: D 4 2015
y <- 2012
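A rolling join with roll = "nearest" is another way to express this nearest-year lookup in data.table; a sketch, with y as above (CJ() builds one lookup row per Group, and x.Year pulls the actual matched year back out of dt):
# nearest-year lookup via a rolling join
dt[CJ(Group = unique(dt$Group), Year = y),
   .(Group, x, Year = x.Year),
   on = .(Group, Year), roll = "nearest"]
This returns the same four rows as the .SD approach above.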

How to speed up a nested for-loop in R

I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in dataset a. However, it is way too slow. I looked into ways of speeding up for-loops, such as vectorization or defining conditions outside the loop, but I'm having a hard time applying those methods since I'm conditioning on two datasets.
Here is my code:
library(dplyr)  # case_when() comes from dplyr

reform_loop <- function(a, b) {
  for (i in 1:nrow(a)) {
    for (j in 1:nrow(b)) {
      if (!is.na(a[i, "treatment"])) { break }
      a[i, "treatment"] <- case_when(
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] >= b[j, "cohort"] &
          a[i, "birth_year"] <= b[j, "upper_cutoff"] ~ 1,
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] < b[j, "cohort"] &
          a[i, "birth_year"] >= b[j, "lower_cutoff"] ~ 0)
    }
  }
  return(a)
}
a$treatment <- NA_real_  # the loop reads treatment before assigning it, so it must exist
a <- reform_loop(a, b)
You can find sample datasets below. Dataset a is individual-level data with birth-year information, and dataset b is country-level data with some country reform information. treatment should be 1 if birth_year is between cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort, in the matching country (so the country_code variables must also match). Anything else should be NA.
#individual level data, birth years
# individual-level data: birth years
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
# country-level reform info with affected cohorts
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))
The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)
Unfortunately, I cannot merge these two datasets since most of the countries in my dataset have more than one reform.
Any ideas on how I can speed up this code? Thank you so much in advance!
This is a range-based non-equi join. As such, this can be done with data.table or fuzzyjoin or sqldf.
data.table
library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
# country_code birth_year treatment
# <num> <num> <int>
# 1: 2 1920 0
# 2: 2 1930 1
# 3: 2 1940 1
# 4: 10 1970 0
# 5: 10 1980 1
# 6: 10 1990 1
# 7: 10 2000 0
# 8: 8 1910 0
sqldf
out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
# country_code birth_year treatment
# 1 2 1920 0
# 2 2 1930 1
# 3 2 1940 1
# 4 10 1970 0
# 5 10 1980 1
# 6 10 1990 1
# 7 10 2000 0
# 8 8 1910 0
fuzzyjoin
fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
# country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1 2 1920 NA NA NA NA NA
# 2 2 1930 2 1928 1948 1938 1
# 3 2 1940 2 1928 1948 1938 1
# 4 10 1970 NA NA NA NA NA
# 5 10 1980 10 1975 1995 1985 1
# 6 10 1990 10 1975 1995 1985 1
# 7 10 2000 NA NA NA NA NA
# 8 8 1910 NA NA NA NA NA
and then you need to clean up the extra columns (and fill 0 for NA).
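For the exact output the question asks for (NA where no reform applies, 0 below cohort and 1 at or above it), the non-equi join can also carry cohort through and derive treatment from it. A sketch, assuming fresh copies of a and b (the data.table steps above modify them by reference):
library(data.table)
setDT(a)
setDT(b)
a[, treatment := NA_integer_]   # stays NA where no reform matches
a[b, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff),
  treatment := as.integer(x.birth_year >= i.cohort)]   # 1 at/after cohort, 0 before
a$treatment
# [1] NA  0  1 NA  0  1 NA NA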

Repeated observations: average per month

I have data on household purchases, with an individual ID for every receipt, over a period in which weeks are coded simply as integers.
I need to count the number of receipts from each household during each 4-week period. (The data span 3 years: the 1st year has 52 weeks, the 2nd 53, the 3rd 48.) Eventually I want the average number of purchases per 4-week period for every household. If the solution involves converting to months and counting monthly, that works as well. The dataset is over 100k rows. I'm quite new to R; all suggestions are very much appreciated!
Household<-c(1,2,3,1,1,2,2,2,3,1,3,3)
Week<-c(201501,201501,201501,201502,201502,201502,201502,201503,201503,201504,201504,201504)
Receipt<-c(111,112,113,114,115,116,117,118,119,120,121,121)
df<-data.frame(Household,Week,Receipt)
This calculates the number of receipts (rows) per household, per 4-week period:
library(data.table)
setDT(df)
n_reciepts <- df[, .N, by = .(Household, period = floor(Week/4))]
# Household period N
# 1: 1 50375 3
# 2: 2 50375 4
# 3: 3 50375 2
# 4: 1 50376 1
# 5: 3 50376 2
Then you just need to average by household over all periods
avg_n_reciepts <- n_reciepts[, .(avg_reciepts = mean(N)), by = Household]
# Household avg_reciepts
# 1: 1 2
# 2: 2 4
# 3: 3 2
You could also do this in one step
df[, .N, by = .(Household, period = floor(Week/4))
][, .(avg_reciepts = mean(N)), by = Household]
# Household avg_reciepts
# 1: 1 2
# 2: 2 4
# 3: 3 2
dplyr equivalent:
library(dplyr)
df %>%
group_by(Household, period = floor(Week/4)) %>%
count %>%
group_by(Household) %>%
summarise(avg_reciepts = mean(n))
# # A tibble: 3 x 2
# Household avg_reciepts
# <dbl> <dbl>
# 1 1 2
# 2 2 4
# 3 3 2
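Since converting to months was mentioned as acceptable, a monthly variant could look roughly like this (a sketch that assumes the Week codes are YYYYWW and, crudely, that week 1 starts on 1 January):
library(data.table)
setDT(df)
# map YYYYWW codes to approximate dates, then count receipts per calendar month
df[, date := as.Date(paste0(Week %/% 100, "-01-01")) + (Week %% 100 - 1) * 7]
df[, .N, by = .(Household, month = format(date, "%Y-%m"))][
  , .(avg_receipts = mean(N)), by = Household]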

Percentage of recurring observations by group

I have a firm-employee data set through time that looks like this:
data.table(firm = c(rep("A", 8), rep("B", 8)),
employee = c(1, 2, 3, 4, 1, 2, 3, NA, 5, 6, NA, NA, 5, 6, 7, 8),
year = c(rep(1, 4), rep(2, 4)))
firm employee_id year
A 1 1
A 2 1
A 3 1
A 4 1
A 1 2
A 2 2
A 3 2
A NA 2
B 5 1
B 6 1
B NA 1
B NA 1
B 5 2
B 6 2
B 7 2
B 8 2
I want to calculate the percentage of employees from year==1 that were still working in year==2, for each firm.
The output should be like this
firm year continued_employees
A 2 0.75
B 2 1
I can do it in a loop for each year, using
sum(employee_id[year==1] %in% employee_id[year==2]) / length(employee_id[year==1])
but I have around 40k firms and 10 years of observations. Any thoughts on how to do it in a dplyr or data.table syntax?
Here's a not-so-pretty data.table approach you could use for any number of firms and years:
years <- head(sort(unique(dt$year)), -1)
setNames(lapply(years, function(y) {
dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
!is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][,
.(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
#$`Year-1`
# firm continued year
#1: A 0.75 2
#2: B 1.00 2
Since you only have two years in your sample data, you only get a single list element in return.
Join with shifted year
This is an approach using a kind of self join with a shifted year:
library(data.table)
options(datatable.print.class = TRUE)
# self join with shifted year
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][]
# aggregate
DT[!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
# beautify result
year <= max(DT$year)]
firm year V1
<char> <num> <num>
1: A 2 0.75
2: B 2 1.00
The first expression modifies DT to indicate continued employees:
firm employee year cont
<char> <num> <num> <lgcl>
1: A 1 1 TRUE
2: A 2 1 TRUE
3: A 3 1 TRUE
4: A 4 1 NA
5: A 1 2 NA
6: A 2 2 NA
7: A 3 2 NA
8: A NA 2 NA
9: B 5 1 TRUE
10: B 6 1 TRUE
11: B NA 1 NA
12: B NA 1 NA
13: B 5 2 NA
14: B 6 2 NA
15: B 7 2 NA
16: B 8 2 NA
Using shift()
Alternatively, the shift() function can be used to compute the cont column. The aggregation part is the same as in the join approach above. shift() requires the data to be ordered by year.
DT[order(year), cont := shift(year, type = "lead") == year + 1, by = .(firm, employee)][
!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
year <= max(DT$year)]
Benchmark
At the time of writing, three approaches have been proposed in addition to the OP's own attempt using loops:
by docendo discimus
join with shifted year
using shift()
The answer of Jean Vuda is not considered in the benchmark as it is limited to 2 years.
According to the OP, the production data set consists of 40 k firms and 10 years of data. For a realistic benchmark, a sample data set of similar size is created:
n_firm <- 40000L
max_employee <- 10L
fluctuation_rate <- 0.2
n_year <- 10L
start_year <- 2001L
DT0 <- CJ(firm = sprintf("%06i", seq_len(n_firm)),
employee = seq_len(max_employee),
year = seq(start_year, length.out = n_year))
set.seed(123L)
n_row <- nrow(DT0)
DT0[sample.int(n_row, fluctuation_rate * n_row), employee := NA]
The sample data set consists of 4 M rows and can be visualised best after reshaping from long to wide format:
dcast(DT0[!is.na(employee)], firm + employee ~ year)
Using 'year' as value column. Use 'value.var' to override
firm employee 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
<char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 000001 1 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
2: 000001 2 2001 2002 2003 NA 2005 2006 2007 NA 2009 NA
3: 000001 3 2001 2002 NA NA 2005 2006 2007 2008 2009 2010
4: 000001 4 2001 NA NA NA 2005 2006 2007 2008 NA 2010
---
399996: 040000 6 2001 2002 NA 2004 2005 NA NA NA 2009 2010
399997: 040000 7 NA 2002 NA NA 2005 2006 2007 2008 2009 2010
399998: 040000 8 2001 2002 2003 NA NA NA 2007 NA NA 2010
399999: 040000 9 2001 2002 2003 NA 2005 2006 2007 2008 2009 NA
400000: 040000 10 2001 2002 2003 NA NA 2006 2007 2008 2009 2010
For benchmarking, the microbenchmark package is used because a check function can be passed to verify the results are identical:
my_check <- function(values) {
values <- lapply(values, function(x) x[, dcast(.SD, firm ~ year, value.var = "continued")])
all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
The benchmark code:
microbenchmark::microbenchmark(
dd = {
dt <- copy(DT0)
years <- head(sort(unique(dt$year)), -1)
rbindlist(
setNames(lapply(years, function(y) {
dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
!is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
, .(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
)
},
join = {
DT <- copy(DT0)
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
shift = {
DT <- copy(DT0)
DT[order(year), cont := shift(year, type = "lead") == year + 1,
by = .(firm, employee)][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
check = my_check,
times = 3L
)
The benchmark results show that the join approach is 4 times faster than the shift approach and 8 times faster than docendo discimus's approach.
Unit: seconds
expr min lq mean median uq max neval cld
dd 11.756114 11.919959 12.083042 12.083805 12.246506 12.409207 3 c
join 1.054293 1.239829 1.303971 1.425366 1.428810 1.432254 3 a
shift 6.105725 6.105906 6.148136 6.106087 6.169342 6.232596 3 b
Here is a slightly different approach to do it:
dt <- dat[, list(all = .(unique(employee))), by = list(year, firm)]
dt <- dt[, list(year1 = sapply(list(all), `[`, 1),
                year2 = sapply(list(all), `[`, 2)), by = firm]
dt[, uniqueN(mapply(intersect, year1, year2)) / uniqueN(na.omit(unlist(year1))), by = firm]
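Since the question also asks about dplyr syntax, a rough dplyr equivalent of the shift() idea could look like this (a sketch, not benchmarked; it assumes the question's data is stored in dat):
library(dplyr)
max_year <- max(dat$year)
dat %>%
  filter(!is.na(employee)) %>%
  group_by(firm, employee) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(cont = lead(year) == year + 1) %>%    # stayed on into the next year?
  group_by(firm, year = year + 1) %>%
  summarise(continued = sum(cont, na.rm = TRUE) / n(), .groups = "drop") %>%
  filter(year <= max_year)                     # drop the artificial year after the last one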

Do a partition over using dplyr in R [duplicate]

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
rank observations in group
I have long data (many rows per person, one row per person-observation), and I basically want a variable that tells me how often the person has been observed so far.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries haven't really helped yet (the problem is hard to phrase).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]
person.obs$n.obs <- 0

# first approach: loop through people and assign range
unp <- unique(person.obs$PERSNR)
unplength <- length(unp)
for (i in 1:unplength) {
  print(unp[i])
  person.obs[which(person.obs$PERSNR == unp[i]), ]$n.obs <-
    1:length(person.obs[which(person.obs$PERSNR == unp[i]), ]$n.obs)
  i <- i + 1
  gc()
}

# second approach: loop through rows and reset counter at new person
pnr <- 0
for (i in 1:length(person.obs[, 2])) {
  if (pnr != person.obs[i, ]$PERSNR) {
    pnr <- person.obs[i, ]$PERSNR
    e <- 0
  }
  e <- e + 1
  person.obs[i, ]$n.obs <- e
  i <- i + 1
  gc()
}
The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave() and seq_along().
foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))
foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Another option using plyr
library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))
person year obs obs2
1 pers1 1999 1 1
2 pers1 2000 2 2
3 pers1 2003 3 3
4 pers2 1998 1 1
5 pers2 2011 2 2
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
both give:
> foo
person year rn
1: pers1 1999 1
2: pers1 2000 2
3: pers1 2003 3
4: pers2 1998 1
5: pers2 2011 2
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
both giving a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]
person year rn
(fctr) (dbl) (int)
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Would by do the trick?
> foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011),obs=c(1,2,3,1,2))
> foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
> by(foo, foo$person, nrow)
foo$person: pers1
[1] 3
------------------------------------------------------------
foo$person: pers2
[1] 2
Another option using aggregate and rank in base R:
foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])
# person year obs
# 1 pers1 1999 1
# 2 pers1 2000 2
# 3 pers1 2003 3
# 4 pers2 1998 1
# 5 pers2 2011 2
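One more base R option is to build the running index with sequence() over the run lengths of person; a sketch, assuming foo is sorted by person and then year:
foo <- foo[order(foo$person, foo$year), ]
foo$obs <- sequence(rle(as.character(foo$person))$lengths)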
