R: subset data.table by condition, getting nearest value row if missing

I would like to subset a data.table by condition "Year". Basically I want the data from the dt that matches a given year, per group. However, some groups do not have a complete time line across all years, and therefore I would like to return the nearest year's data for every group, so there are data for every group present for any year chosen (whether that is exactly the right year, or not).
library(data.table)
# make dummy data
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)),
x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
# subset by, e.g., Year == 2015 is fine, but I want a full result for ANY
# year chosen, such as 2012, by using the closest entry in time, per group.
# Attempt;
y <- 2012
dt[Year == which.min(abs(Year - y)), .SD, by = Group]
Empty data.table (0 rows and 3 cols): Group,x,Year
The result in this example should be;
Group x Year
1: A 4 2012
2: B 7 2013
3: C 2 2012
4: D 3 2014

You are close: which.min(abs(Year - y)) is the right idea, but it returns a row index rather than a year, and in the i part it is evaluated over the whole table before any grouping, which is why your attempt comes back empty. It needs to be used within the .SD-subsetting in the j portion, per group.
dt[, .SD[which.min(abs(Year - y)),], Group]
# Group x Year
# <char> <int> <int>
# 1: A 5 2012
# 2: B 4 2013
# 3: C 8 2012
# 4: D 5 2014
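As a side note (a sketch of another common data.table idiom, not part of the answer above), a rolling join with roll = "nearest" gives the same result; the x. prefix is used so the matched year, rather than the requested one, is reported:
dt[.(unique(dt$Group), y), .(Group, x, Year = x.Year),
   on = .(Group, Year), roll = "nearest"]
# Group x Year
# <char> <int> <int>
# 1: A 5 2012
# 2: B 4 2013
# 3: C 8 2012
# 4: D 5 2014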
Reproducible data
set.seed(42)
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)), x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
dt
# Group x Year
# <char> <int> <int>
# 1: A 1 2011
# 2: A 5 2012
# 3: A 1 2013
# 4: A 9 2014
# 5: A 10 2015
# 6: B 4 2013
# 7: B 2 2014
# 8: B 10 2015
# 9: C 1 2011
# 10: C 8 2012
# 11: C 7 2013
# 12: C 4 2014
# 13: C 9 2015
# 14: D 5 2014
# 15: D 4 2015
y <- 2012

Related

Data file organized in block per year: add column for year to combine all data

I have large quantities of datasets in Excel that I would like to analyze in R. The files are organized in blocks, one per year, and look like this:
Group <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA,'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)
df <- cbind(Group, Value)
Group Value
1: 2010 NA
2: Group Value
3: A 1
4: B 2
5: C 9
6: 2011 NA
7: Group Value
8: A 3
9: B 5
10: E 2
11: 2012 NA
12: Group Value
13: A 9
14: B 1
To be able to analyze the data, I would like to automatically add a column for the year so that all data can be combined, as follows:
Year Group Value
1: 2010 A 1
2: 2010 B 2
3: 2010 C 9
4: 2011 A 3
5: 2011 B 5
6: 2011 E 2
7: 2012 A 9
8: 2012 B 1
library(data.table)
dt <- data.table(df)
# cumsum(is.na(Value)) numbers the year blocks (each year row has Value NA); within each
# block, Group[1] is the year. Value != 'Value' then drops the header and year rows.
dt[, Year := Group[1], cumsum(is.na(Value))][Value != 'Value']
Group Value Year
1: A 1 2010
2: B 2 2010
3: C 9 2010
4: A 3 2011
5: B 5 2011
6: E 2 2011
7: A 9 2012
8: B 1 2012
in Base R:
subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN=\(x)x[1])), Value != 'Value')
Group Value Year
3 A 1 2010
4 B 2 2010
5 C 9 2010
8 A 3 2011
9 B 5 2011
10 E 2 2011
13 A 9 2012
14 B 1 2012
Note that the above columns are character. You can use type.convert(new_df, as.is = TRUE), where new_df is the resultant data frame, to convert the columns to their respective classes.
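For instance, a minimal sketch applied to the base R result above (new_df is just an illustrative name):
new_df <- subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN = \(x) x[1])),
                 Value != 'Value')
new_df <- type.convert(new_df, as.is = TRUE)
sapply(new_df, class)
#       Group       Value        Year
# "character"   "integer"   "integer"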
Here is one method with tidyverse: create the 'Year' column where the 'Group' values are 4-digit numbers, filter out the header rows where the value is 'Group', fill the 'Year' column with the previous non-NA values, remove the year rows themselves (the first occurrence of each Year) with duplicated, and convert the column types with type.convert.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  mutate(Year = case_when(str_detect(Group, "^\\d{4}$") ~ Group)) %>%
  filter(Group != 'Group') %>%
  fill(Year) %>%
  filter(duplicated(Year)) %>%
  type.convert(as.is = TRUE) %>%
  select(Year, Group, Value)
-output
Year Group Value
1 2010 A 1
2 2010 B 2
3 2010 C 9
4 2011 A 3
5 2011 B 5
6 2011 E 2
7 2012 A 9
8 2012 B 1
data
df <- data.frame(Group, Value)

How to count the number of times multiple strings appear in another string list, by group?

I have two data.tables, name and text, and I want to count how many times each name in name occurs in text within the same year, i.e. to generate the data result below. How can I achieve this? I tried lapply and grepl, but both failed. Thanks a lot!
name=data.table(year=c(2018,2019,2020),
name0=list(c("A","B","C"),c("B","C"),c("D","E","F")))
text=data.table(year=c(2018,2018,2019,2019,2020),
text0=list(c("DEF","BG","CG"),c("ART","CWW"),c("DLK","BU","FO"),
c("A45","11B","C23"),c("EIU","CM")))
result=data.table(year=c(2018,2018,2018,2019,2019,2020,2020,2020),
name0=c("A","B","C","B","C","D","E","F"),
count=c(1,1,2,2,1,0,1,0))
A merge on unlisted values will work:
library(data.table)
merge(
  name[, .(name0 = unlist(name0)), by = .(year)],
  text[, .(name0 = unlist(strsplit(unlist(text0), ""))), by = .(year)][, ign := 1],
  by = c("year", "name0"), all.x = TRUE, allow.cartesian = TRUE
)[, .(count = sum(!is.na(ign))), by = .(year, name0)]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0
The ign variable is so that we can force all.x=TRUE yet account for those that were not found in y.
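To see this, the merge(...) part alone, before the final aggregation, returns roughly the following; unmatched names survive the left join with ign = NA, so sum(!is.na(ign)) gives them a count of 0:
# year name0 ign
# <num> <char> <num>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 1
# 4: 2018 C 1
# 5: 2019 B 1
# 6: 2019 B 1
# 7: 2019 C 1
# 8: 2020 D NA
# 9: 2020 E 1
# 10: 2020 F NA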
Slower but perhaps more memory-frugal method:
namelong <- name[, .(name0 = unlist(name0)), by = .(year)]
namelong
# year name0
# <num> <char>
# 1: 2018 A
# 2: 2018 B
# 3: 2018 C
# 4: 2019 B
# 5: 2019 C
# 6: 2020 D
# 7: 2020 E
# 8: 2020 F
func <- function(yr, nm) text[year == yr, sum(grepl(nm, unlist(text0)))]
namelong[, count := do.call(mapply, c(list(FUN=func), unname(namelong)))]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0

Remove discontinuous time points in longitudinal data

I have firm-year longitudinal data, but the years are not continuous for some firms. For example:
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1999,2000,2001,1995,1997,2008,2009,2010))
For each firm, I want to keep the observations in the most recent run of continuous years and remove the others. For example, Firm 1 has observations in (1990, 1991, 1999, 2000, 2001), and I want to keep (1999, 2000, 2001).
I can think of some awkward approaches to solve this, but I am wondering if there is an easier way.
Enlightened by the comments, I am also wondering if there is any way to keep the longest continuous block of years. For example,
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1992,2000,2001,1995,1997,2008,2009,2010))
The result would be the longest continuous block for each firm, i.e. years 1990-1992 for Firm 1 and 2008-2010 for Firm 2.
library(data.table)
DT2 <- setorder(dt, firm_id, year)[
  , d := cumsum(c(TRUE, diff(year) > 1)), by = .(firm_id)][
  , n := .N, by = .(firm_id, d)]
DT2
# firm_id year d n
# <num> <num> <int> <int>
# 1: 1 1990 1 3
# 2: 1 1991 1 3
# 3: 1 1992 1 3
# 4: 1 2000 2 2
# 5: 1 2001 2 2
# 6: 2 1995 1 1
# 7: 2 1997 2 1
# 8: 2 2008 3 3
# 9: 2 2009 3 3
# 10: 2 2010 3 3
From here, if you want runs of 3 consecutive years or more, then
DT2[ (n > 2), ]
If you want the longest run for each firm_id, then
DT2[, .SD[n == max(n),], by = .(firm_id) ]
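With the second sample data set above, that longest-run query returns:
# firm_id year d n
# <num> <num> <int> <int>
# 1: 1 1990 1 3
# 2: 1 1991 1 3
# 3: 1 1992 1 3
# 4: 2 2008 3 3
# 5: 2 2009 3 3
# 6: 2 2010 3 3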

Is there a faster way of processing a sum by more than one level of a factor in R?

I have a data frame composed of four columns: state A; state B; imports of state B from state A; and year. It has 594473 rows.
For each state B, I want to calculate its total imports from all possible partners in a given year and the 4 years before that.
To that end, I created the following for loop, where "j" is state B in a given row, year is the year of that row, and z[,5] is the imports of state B from state A in that year:
# create column for 5 year total imports of b
for (row in 1:nrow(z)) {
  j <- z[row, 3]
  year <- as.numeric(paste(z[row, 1]))
  z[row, 6] <- sum(z[z[, 3] == j & z[, 1] == year, 5]) +
    sum(z[z[, 3] == j & z[, 1] == year - 1, 5]) +
    sum(z[z[, 3] == j & z[, 1] == year - 2, 5]) +
    sum(z[z[, 3] == j & z[, 1] == year - 3, 5]) +
    sum(z[z[, 3] == j & z[, 1] == year - 4, 5])
}
colnames(z)[6] <- "5year_imp_b"
It works, but it takes ages given the size of the data; my computer can't finish it. Is there a faster way to do this?
Sorry if that is not a good question, I'm new to R and programming in general.
Many thanks!
Here is a function that should be relatively quick (you will need to import data.table for it to work):
calculate_rolling_sum <- function(df, date_col, calc_col, id_var, k) {
  # for each row, sum calc_col over the rows of the same id_var group whose
  # date falls within [date - k, date] -- a value-based (not row-based) window
  return(setDT(df)[order(get(date_col)), ][
    , paste(calc_col, "roll_sum", k, sep = "_") :=
        sapply(get(date_col), function(x) sum(get(calc_col)[between(get(date_col), x - k, x)])),
    by = mget(id_var)])
}
Example dataframe:
df <- data.frame(
state_A = c(rep("x", 6), rep("y", 4), rep("z", 6)),
state_B = c(rep("d", 16)),
imports_AB = c(rep(3, 3), rep(4, 4), rep(5, 2), rep(6, 2), rep(9, 3), rep(3, 2)),
yr = c(seq(2000, 2006, 1), seq(2009, 2017, 1))
)
state_A state_B imports_AB yr
1: x d 3 2000
2: x d 3 2001
3: x d 3 2002
4: x d 4 2003
5: x d 4 2004
6: x d 4 2005
7: y d 4 2006
8: y d 5 2009
9: y d 5 2010
10: y d 6 2011
11: z d 6 2012
12: z d 9 2013
13: z d 9 2014
14: z d 9 2015
15: z d 3 2016
16: z d 3 2017
Applying the function over the current year and the past 3 years, and printing the new data frame:
library(data.table)
df_rolling <- calculate_rolling_sum(df, date_col = "yr", calc_col = "imports_AB", id_var = c("state_A", "state_B"), k = 3)
df_rolling[]
state_A state_B imports_AB yr imports_AB_roll_sum_3
1: x d 3 2000 3
2: x d 3 2001 6
3: x d 3 2002 9
4: x d 4 2003 13
5: x d 4 2004 14
6: x d 4 2005 15
7: y d 4 2006 4
8: y d 5 2009 9
9: y d 5 2010 10
10: y d 6 2011 16
11: z d 6 2012 6
12: z d 9 2013 15
13: z d 9 2014 24
14: z d 9 2015 33
15: z d 3 2016 30
16: z d 3 2017 24
What is the advantage of this function over standard rolling functions? For instance, for 2010 it no longer takes 2006 into account, since that year falls outside the requested window.
Usual rolling functions that work purely on row indices would still count it (it is only two rows away within the same group).
This way you don't need to care whether there are gaps between the years, and there is no need to complete the dataset.
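To make that concrete, here is a minimal sketch (made-up mini data echoing the 2006-2009 gap of firm "y" above) contrasting a row-based window with the value-based window used here:
library(data.table)
mini <- data.table(yr = c(2006, 2009, 2010, 2011), imports_AB = c(4, 5, 5, 6))
# row-based window over the last 4 rows: the 2011 window wrongly reaches back to 2006
mini[, row_based := frollsum(imports_AB, 4)]   # NA NA NA 20
# value-based window over the current year and the 3 preceding calendar years
mini[, year_based := sapply(yr, function(x) sum(imports_AB[between(yr, x - 3, x)]))]
# 4 9 10 16 -- 2006 drops out as soon as it falls outside the 4-year span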

Percentage of reoccurring observations by group

I have a firm-employee data set through time that looks like this:
data.table(firm = c(rep("A", 8), rep("B", 8)),
employee = c(1, 2, 3, 4, 1, 2, 3, NA, 5, 6, NA, NA, 5, 6, 7, 8),
year = c(rep(1, 4), rep(2, 4)))
firm employee_id year
A 1 1
A 2 1
A 3 1
A 4 1
A 1 2
A 2 2
A 3 2
A NA 2
B 5 1
B 6 1
B NA 1
B NA 1
B 5 2
B 6 2
B 7 2
B 8 2
I want to calculate, for each firm, the percentage of employees from year==1 that were still working in year==2.
The output should be like this
firm year continued_employees
A 2 0.75
B 2 1
I can do it in a loop for each year, using
sum(employee_id[year==1] %in% employee_id[year==2]) / length(employee_id[year==1])
but I have around 40k firms and 10 years of observations. Any thoughts on how to do it in dplyr or data.table syntax?
Here's a not-so-pretty data.table approach you could use for any number of firms and years:
years <- head(sort(unique(dt$year)), -1)
setNames(lapply(years, function(y) {
  dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
    !is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
    , .(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
#$`Year-1`
# firm continued year
#1: A 0.75 2
#2: B 1.00 2
Since you only have two years in your sample data, you only get a single list element in return.
Join with shifted year
This is an approach using a kind of self join with a shifted year:
library(data.table)
options(datatable.print.class = TRUE)
# self join with shifted year
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][]
# aggregate
DT[!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
# beautify result
year <= max(DT$year)]
firm year V1
<char> <num> <num>
1: A 2 0.75
2: B 2 1.00
The first expression modifies DT to indicate continued employees:
firm employee year cont
<char> <num> <num> <lgcl>
1: A 1 1 TRUE
2: A 2 1 TRUE
3: A 3 1 TRUE
4: A 4 1 NA
5: A 1 2 NA
6: A 2 2 NA
7: A 3 2 NA
8: A NA 2 NA
9: B 5 1 TRUE
10: B 6 1 TRUE
11: B NA 1 NA
12: B NA 1 NA
13: B 5 2 NA
14: B 6 2 NA
15: B 7 2 NA
16: B 8 2 NA
Using shift()
Alternatively, the shift() function can be used to compute the cont column. The aggregation part is the same as in the join approach above. shift() requires the data to be ordered by year.
DT[order(year), cont := shift(year, type = "lead") == year + 1, by = .(firm, employee)][
!is.na(employee), sum(cont, na.rm = TRUE) / .N, by = .(firm, year = year + 1)][
year <= max(DT$year)]
Benchmark
At the time of writing, three approaches have been proposed in addition to OP's own attempt using loops:
by docendo discimus
join with shifted year
using shift()
The answer of Jean Vuda is not considered in the benchmark as it is limited to 2 years.
According to the OP, the production data set consists of 40 k firms and 10 years of data. For a realistic benchmark, a sample data set of similar size is created:
n_firm <- 40000L
max_employee <- 10L
fluctuation_rate <- 0.2
n_year <- 10L
start_year <- 2001L
DT0 <- CJ(firm = sprintf("%06i", seq_len(n_firm)),
employee = seq_len(max_employee),
year = seq(start_year, length.out = n_year))
set.seed(123L)
n_row <- nrow(DT0)
DT0[sample.int(n_row, fluctuation_rate * n_row), employee := NA]
The sample data set consists of 4 M rows and can be visualised best after reshaping from long to wide format:
dcast(DT0[!is.na(employee)], firm + employee ~ year)
Using 'year' as value column. Use 'value.var' to override
firm employee 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
<char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 000001 1 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
2: 000001 2 2001 2002 2003 NA 2005 2006 2007 NA 2009 NA
3: 000001 3 2001 2002 NA NA 2005 2006 2007 2008 2009 2010
4: 000001 4 2001 NA NA NA 2005 2006 2007 2008 NA 2010
---
399996: 040000 6 2001 2002 NA 2004 2005 NA NA NA 2009 2010
399997: 040000 7 NA 2002 NA NA 2005 2006 2007 2008 2009 2010
399998: 040000 8 2001 2002 2003 NA NA NA 2007 NA NA 2010
399999: 040000 9 2001 2002 2003 NA 2005 2006 2007 2008 2009 NA
400000: 040000 10 2001 2002 2003 NA NA 2006 2007 2008 2009 2010
For benchmarking, the microbenchmark package is used because a check function can be passed to verify the results are identical:
my_check <- function(values) {
values <- lapply(values, function(x) x[, dcast(.SD, firm ~ year, value.var = "continued")])
all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
The benchmark code:
microbenchmark::microbenchmark(
dd = {
dt <- copy(DT0)
years <- head(sort(unique(dt$year)), -1)
rbindlist(
setNames(lapply(years, function(y) {
dt[dt[(year == y), .(firm, employee)], on = .(firm, employee)][
!is.na(employee), all(c(y, y+1) %in% year), by = .(employee, firm)][
, .(continued = mean(V1), year = y+1), by = firm]
}), paste("Year", years, sep="-"))
)
},
join = {
DT <- copy(DT0)
DT[.(firm = firm, employee = employee, year = year - 1),
on = .(firm, employee, year), cont := TRUE][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
shift = {
DT <- copy(DT0)
DT[order(year), cont := shift(year, type = "lead") == year + 1,
by = .(firm, employee)][
!is.na(employee), .(continued = sum(cont, na.rm = TRUE) / .N),
by = .(firm, year = year + 1)][
year <= max(DT$year)]
},
check = my_check,
times = 3L
)
The benchmark results show that the join approach is 4 times faster than the shift approach and 8 times faster than docendo discimus's approach.
Unit: seconds
expr min lq mean median uq max neval cld
dd 11.756114 11.919959 12.083042 12.083805 12.246506 12.409207 3 c
join 1.054293 1.239829 1.303971 1.425366 1.428810 1.432254 3 a
shift 6.105725 6.105906 6.148136 6.106087 6.169342 6.232596 3 b
Here is a slightly different approach to do it:
# collect the unique employees per firm and year as a list column
dt <- dat[, list(all = .(unique(employee))), by = list(year, firm)]
# split that list into separate year-1 and year-2 columns per firm
dt <- dt[, list(year1 = sapply(list(all), `[`, 1),
                year2 = sapply(list(all), `[`, 2)), by = firm]
# share of year-1 employees that also appear in year 2
dt[, uniqueN(mapply(intersect, year1, year2)) / uniqueN(na.omit(unlist(year1))), by = firm]
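Run against the question's sample data (with dat holding that data.table), this gives roughly:
# firm V1
# <char> <num>
# 1: A 0.75
# 2: B 1.00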
