data.table joins - Select all columns in the i argument

When joining two data.tables, I can specify which table a column should come from, like
X[Y, i.id] # `id` is taken from Y
My problem is that I have a big table with ~80 columns. Every night a data refresh happens and, according to some parameters, some rows get replaced by a new version of the table (same table, just new data).
current <- data.table(id=1:4, var=1:4, var2=1:4, key="id")
new <- data.table(id=1:4, var=11:14, var2=11:14, key="id")
current[new[c(1,3)], `:=`(var=i.var, var2=i.var2)]
> current
   id var var2
1:  1  11   11
2:  2   2    2
3:  3  13   13
4:  4   4    4
As I said, in my real case I have many more columns, so (besides rbind()ing pieces of the two tables) I wonder how I can select all the columns of the data.table used as the i argument in a join. I could spend half an hour hard-coding all of them, but that wouldn't be maintainable code (in case new columns get added to the tables in the future).

How about constructing the j-expression and just eval'ing it?
nc = names(current)[-1L]              # all columns except the key "id"
nn = paste0("i.", nc)                 # prefix with "i." to refer to columns of the i table
expr = lapply(nn, as.name)            # turn the strings into symbols
setattr(expr, 'names', nc)            # name them after the target columns
expr = as.call(c(quote(`:=`), expr))  # build the call `:=`(var = i.var, var2 = i.var2, ...)
> current[new[c(1,3)], eval(expr)]
> current
##    id var var2
## 1:  1  11   11
## 2:  2   2    2
## 3:  3  13   13
## 4:  4   4    4
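A possible alternative that avoids building the call by hand is to assign with mget(). This is a hedged sketch of the same idea (not from the answer above), assuming a data.table version where mget() is optimised inside j:
nc = names(current)[-1L]                               # columns to update
current[new[c(1,3)], (nc) := mget(paste0("i.", nc))]   # pull i.var, i.var2, ... from the i table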

Related

R: faster merging of data frames

I have just started using R and wrote the following code, but it takes about 40 minutes to process, so I am sure it can be coded in a way that runs faster.
Basically, I have one large data set (crsp_td_net) of about 7 GB in size and a second, smaller data set (ff_35f). Both contain trading dates. What I am trying to do is to fill in trading dates for each company in the first data set.
From my first data set, I create subsets of data based on a company index and then merge each subset with the second data set based on trading dates. The merged data set is appended to the other companies' data, and so on, until at the end I am left with a large data set containing all the initial companies but with the missing trading days filled in.
I am not sure at this point whether the fact that the data frame final keeps growing is causing my loop to run slowly, or whether the loop itself is coded inefficiently. I understand that vectorisation can help speed this up, but I am not sure how to do that here (the size of each company's subset keeps changing). I am also not sure of the best way to use apply, sapply or lapply (if any of these can be used here). I have browsed a few related questions but have not found a way to go about this. I would very much appreciate an alternative snippet of code that makes the code below run faster.
todo <- matrix(numeric(0), 0, 4)
for (i in 1:7396) {
  final <- crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE)
  final <- final %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat =
             ifelse(is.na(cusip8dg_compustat),
                    max(cusip8dg_compustat, na.rm = TRUE),
                    cusip8dg_compustat)) %>%
    mutate(compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp,
           date_crsp_orig) %>%
    distinct()
  todo <- bind_rows(todo, final)
}
Thanks in advance,
Dev
Thank you all for your responses. I was unable to reply in the comment box due to the response length limit, so I am adding to my original post. @P Lapointe: please find a reproducible data set below (I have used integer values instead of actual dates). @eipi10: I think you have understood what I am after, and thanks for the code, but I am not sure if it is missing something as it is prompting for an input (I have all the relevant libraries). @Alistaire: I will indeed face memory problems as I perform more calculations on the original data set; I would be grateful for your suggestions on how to make the loop faster, or for an alternative to it, and for how they would be implemented in the example below.
many thanks
zz <- "compid date_crsp
1 1 2
2 1 3
3 1 5
4 2 3
5 2 7
6 2 9
7 3 3
8 3 5
9 3 7
10 3 8"
crsp_td_net <- read.table(text=zz, header = TRUE)
xx <- "date_crsp
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11"
ff_35f <- read.table(text=xx, header = TRUE)
# I expect my final output to look like this:
yy<-"compid date_crsp
1 1 2
2 1 3
3 1 4
4 1 5
5 2 3
6 2 4
7 2 5
8 2 6
9 2 7
10 2 8
11 2 9
12 3 3
13 3 4
14 3 5
15 3 6
16 3 7
17 3 8"
output_wanted<-read.table(text=yy, header = TRUE)
df <- full_join(crsp_td_net, expand.grid(compid = unique(crsp_td_net$compid), date_crsp=unique(ff_35f$date_crsp)))
todo <- array(numeric(), c(1, 4, 0))
todo <- matrix(numeric(0), 0, 0)
for (i in 1:3) {
  final <- filter(crsp_td_net, compid == i)
  final <- mutate(final, date_crsp_orig = date_crsp)
  final <- merge(final, ff_35f, by = "date_crsp", all = TRUE)
  final <- filter(final, between(date_crsp, min(date_crsp_orig, na.rm = TRUE), max(date_crsp_orig, na.rm = TRUE)))
  final <- arrange(final, date_crsp)
  final <- mutate(final, compid = ifelse(is.na(compid), i, compid))
  final <- select(final, compid, date_crsp)
  final <- distinct(final)
  todo <- bind_rows(todo, final)
}
I have modified the full_join example and it now runs, but it is not doing what I want: merging each compid with the unique trading days so as to fill in the missing trading days in the first data set. I would very much appreciate any suggestion on this.
The loop I wrote above gives me exactly what I want, but I was wondering if there is a faster way to do it, as I will have to loop over 7000 or so compid values to create the large data set todo. This takes about 40 minutes to run, so I wonder if there is a faster way to write this loop or an alternative to it.
Many thanks in advance
crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp
df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid), date_crsp = unique(ff_35f$date_crsp)),
                by = "date_crsp")
df <- df %>% filter(between(date_crsp, min(date_crsp_orig, na.rm = TRUE), max(date_crsp_orig, na.rm = TRUE)))
df <- df %>% filter(!is.na(compid.x)) %>% select(-compid.y) %>% distinct() %>% arrange(compid.x, date_crsp)
Although the OP has asked for a dplyr solution, I can only suggest a solution which uses the foverlaps() function from the data.table package.
The OP has requested to complete the trading dates for each company in crsp_td_net by the trading dates given in ff_35f. Completion means filling up date ranges from a start date to an end date with the given dates. (Note that the OP is using integer values in place of dates.) The given dates can be considered date ranges as well, where each range consists of only one day.
Now, the problem has been paraphrased as finding the overlaps of two sequences of (date) ranges (overlap joins). For this, the foverlaps() function can be used, which is inspired by the findOverlaps() function of Bioconductor's IRanges package but works with non-genomic (i.e., non-integer) ranges as well.
library(data.table)
# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)
# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid]
# turn given dates into date ranges of one day lengths
# by adding an end column equal to the start dates
ff_35f[, end := date_crsp]
# set keys
setkey(comp_date_range, start, end)
setkey(ff_35f, date_crsp, end)
# find all overlapping ranges
temp <- foverlaps(comp_date_range, ff_35f)
# reorder result for convenience and pick desired columns
result <- temp[order(compid, date_crsp), .(compid, date_crsp)]
The result is in line with the expected output:
result
# compid date_crsp
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 1 5
# 5: 2 3
# 6: 2 4
# 7: 2 5
# 8: 2 6
# 9: 2 7
#10: 2 8
#11: 2 9
#12: 3 3
#13: 3 4
#14: 3 5
#15: 3 6
#16: 3 7
#17: 3 8
This can be written more concisely in a single line of code:
foverlaps(
setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid], start, end),
setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
)[order(compid, start), .(compid, date_crsp)]
Note: the OP has replaced dates with integers in the question. The data.table package offers date and time classes with integer storage for fast sorting and grouping (see ?as.IDate).
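For illustration, here is a minimal sketch of that integer-backed storage (my own example, not part of the original answer):
library(data.table)
d <- as.IDate(c("2017-01-02", "2017-01-03"))  # IDate: a Date stored as integer
class(d)   # "IDate" "Date"
typeof(d)  # "integer" -- cheap to sort and group on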
Giving this problem a second thought, I believe it can also be solved at reasonable speed using data.table's non-equi join. (I'm posting this as a separate answer because the approach is quite different from foverlaps().)
library(data.table)
# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)
# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid]
# non equi join: the result contains only rows which fulfill the condition in on = ...
# by = .EACHI executes .SD for each group, returning matching rows for each date
# nomatch = 0 (inner join) skips dates without matching company
temp <- comp_date_range[ff_35f, on = c("start<=date_crsp", "end>=date_crsp"),
.SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE]
# reorder result for convenience and pick desired columns
result <- temp[order(compid, start), .(compid, date_crsp = start)]
The result is in line with the expected output:
result
# compid date_crsp
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 1 5
# 5: 2 3
# 6: 2 4
# 7: 2 5
# 8: 2 6
# 9: 2 7
#10: 2 8
#11: 2 9
#12: 3 3
#13: 3 4
#14: 3 5
#15: 3 6
#16: 3 7
#17: 3 8
Note that there is an implicit assumption that the range of dates given in ff_35f covers the whole range of dates used in crsp_td_net. Otherwise, company trades would drop out of the result.
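If that assumption needs verifying, a quick check along these lines could be run before the join (my own hedged sketch, not part of the original answer):
# stop if ff_35f does not span the full date range of crsp_td_net
stopifnot(min(ff_35f$date_crsp) <= min(crsp_td_net$date_crsp),
          max(ff_35f$date_crsp) >= max(crsp_td_net$date_crsp))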
Benchmark results
At the time of writing, three different solutions had been posted. The OP measured the elapsed times of all three solutions with the 7 GB data set and reported:
1.12 seconds for the foverlaps() solution
1.41 seconds for the non-equi join solution
11.92 seconds for the tidyr/dplyr solution
in comments here and here.
I was quite surprised to find the foverlaps() solution to be faster than the non-equi join, so I ran some benchmarks with varying problem sizes using the microbenchmark package.
The problem size is given by the number of companies. For each company, the trading days are randomly sampled from a selection of 260 "dates" simulating one year without weekends (For details see code below). The data set contains about 130 rows per company on average.
As can be seen from the chart of my own benchmarks (note that both axes are on a log scale),
foverlaps() is somewhat faster than the non-equi join for larger problem sizes, while the non-equi join is the fastest method for smaller problem sizes. tidyr/dplyr is almost always the slowest method and an order of magnitude slower on large problems.
Define function for benchmark runs of problem size n_comp
bm_run <- function(n_comp) {
# define 1 year of trading dates, simulating weekends
ff_35f <- sort(outer(1:5, 7*(0:51), `+`))
# create tradings dates for each company
crsp_td_net <- rbindlist(lapply(seq_len(n_comp), function(i) {
# how many trading dates to sample for actual company?
n_days <- sample(length(ff_35f), 1)
# sample trading dates
data.frame(compid = i,
date_crsp = sort(sample(ff_35f, n_days)))
}))
# coerce to data.frame
setDF(crsp_td_net)
# turn vector of trading dates into data.frame
ff_35f <- data.frame(date_crsp = ff_35f)
# scale down number of repetitions with problem size
n_times <- as.integer(scales::squish(1000*1000 / nrow(crsp_td_net), c(3, 1000)))
print(sprintf("%i companies with a total of %i trading dates, %i runs",
n_comp, nrow(crsp_td_net), n_times))
# do the benchmark runs for this problem size
mb <- microbenchmark::microbenchmark(
foverlaps = {
foverlaps(
setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)),
by = compid], start, end),
setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
)[order(compid, start), .(compid, date_crsp)]
},
non_equi_join = {
setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), by = compid
][setDT(ff_35f), on = c("start<=date_crsp", "end>=date_crsp"),
.SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE
][order(compid, start), .(compid, date_crsp = start)]
},
dplyr = {
setDF(crsp_td_net)
setDF(ff_35f)
crsp_td_net %>%
dplyr::group_by(compid) %>%
dplyr::summarize(date_crsp = list(seq(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
tidyr::unnest() %>%
dplyr::semi_join(ff_35f, by="date_crsp") %>%
dplyr::arrange(compid, date_crsp)
},
times = n_times
)
# return problem size and timings as list
return(list(n_comp, nrow(crsp_td_net), mb))
}
Run benchmark for different problem sizes
library(data.table)
library(magrittr)
# number of companies
n_comp <- outer(c(1,2), 10^(1:4), `*`)
# set seed of RNG for creation of reproducible data
set.seed(1234)
# do benchmark runs with different problem size derived from number of companies
bm <- lapply(n_comp, bm_run)
Prepare data for plotting
# create data.table with benchmark timinings from chunks in returned list
mbl <- rbindlist(lapply(bm, `[[`, i = 3), idcol = "n_row")
# aggregate results
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
# reorder factor levels
mba[, expr := forcats::fct_reorder(expr, -median_time)]
# replace chunk number by number of rows
mba[, n_row := unlist(lapply(bm, `[[`, i = 2))[n_row]]
Create chart
library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = unique(mba$n_row), labels = scales::comma) +
scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()
I adapted your data to use actual dates. In the data, 2017-01-04 and 2017-01-06 are not in the date table. This approach generates a sequence from each company's first to last date; on compid 2 the filling in of the missing dates can be seen. `seq.Date(from=, to=, by=1)` produces the missing dates.
The unnest probably creates a large data frame, so there is some risk of running out of memory, but if you keep these tables down to just compid and date_crsp then it may fit.
semi_join and inner_join should both work - you will want to test which is faster (a hedged inner_join variant is sketched after the output below).
zz <- "compid date_crsp
1 1 2017-01-02
2 1 2017-01-03
3 1 2017-01-05
4 2 2017-01-03
5 2 2017-01-07
6 2 2017-01-09
7 3 2017-01-03
8 3 2017-01-05
9 3 2017-01-07
10 3 2017-01-08"
crsp_td_net <- read.table(text=zz, header = TRUE)
library(lubridate)
crsp_td_net$date_crsp <- ymd(crsp_td_net$date_crsp)
xx <- "date_crsp
1 2017-01-02
2 2017-01-03
3 2017-01-05
4 2017-01-07
5 2017-01-08
6 2017-01-09
7 2017-01-10"
ff_35f <- read.table(text=xx, header = TRUE)
ff_35f$date_crsp <- ymd(ff_35f$date_crsp)
library(dplyr)
library(tidyr)
crsp_td_net_summary <- crsp_td_net %>%
group_by(compid) %>%
summarize(date_crsp = list(seq.Date(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
unnest() %>%
semi_join(ff_35f, by="date_crsp") %>%
arrange(compid, date_crsp)
crsp_td_net_summary
# # A tibble: 12 × 2
# compid date_crsp
# <int> <date>
# 1 1 2017-01-02
# 2 1 2017-01-03
# 3 1 2017-01-05
# 4 2 2017-01-03
# 5 2 2017-01-05
# 6 2 2017-01-07
# 7 2 2017-01-08
# 8 2 2017-01-09
# 9 3 2017-01-03
# 10 3 2017-01-05
# 11 3 2017-01-07
# 12 3 2017-01-08
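For comparison, here is the inner_join variant mentioned above. This is only a sketch I have not benchmarked; since ff_35f has no columns besides date_crsp, the result should match the semi_join version:
crsp_td_net %>%
  group_by(compid) %>%
  summarize(date_crsp = list(seq.Date(from = min(date_crsp), to = max(date_crsp), by = 1))) %>%
  unnest() %>%
  inner_join(ff_35f, by = "date_crsp") %>%
  arrange(compid, date_crsp)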

R: Efficiently find the correct range my value fits in

I have a huge table filled with ranges (2 columns), i.e. min and max:
1 , 5
6 , 10
11 , 15
and so on. I'd like a function that, given a number, returns both boundaries of the matching range. Say I input f(12); I'd like to get back 11 and 15.
I'd like some help with how to build that in R. It has to be efficient because the table is relatively big (1M rows).
Using the new non-equi joins feature available in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
foo <- function(x, y) {
  x = setDT(list(x = x))                        # convert to data.table
  x[y, on = .(x >= V1, x <= V2), nomatch = 0L]  # perform a conditional (non-equi) join
}
foo(12, dt)
# x x.1
# 1: 11 15
where:
dt = fread('1,5\n6,10\n11,15\n')
# V1 V2
# 1: 1 5
# 2: 6 10
# 3: 11 15
See installation instructions for devel version here.
Note that for repetitive calls it might not be efficient, due to the overhead of calling [.data.table each time. Suppose you'd like to get the interval ranges for all the inputs 3, 4, 12, 19 at once; then I'd do it like this:
require(data.table) #v1.9.7+
x = data.table(x=c(3,4,12,19), id = 1:4)
x[dt, on=.(x >= V1, x <= V2), nomatch=0L]
# x id x.1
# 1: 1 1 5
# 2: 1 2 5
# 3: 11 3 15
The first value (id=1) is in [1,5] and so is the second value (id=2). The 3rd is in [11,15]. Fourth is not in the result due to nomatch=0.

Order data.table by a character vector of column names

I'd like to order a data.table by a variable holding the name of a column:
I've tried every combination of `eval`, `get` and `c` without success:
I have colVar = "someColumnName"
I'd like to apply this to: DT[order(colVar)]
data.table has special functions for this purpose which modify your data set by reference instead of copying it to a new object.
You can use either setkey or (in versions >= 1.9.4) setorder, which can also sort in decreasing order.
Note the difference between setkey vs. setkeyv and setorder vs. setorderv: the v suffix indicates that the function takes a character vector of column names (such as a variable containing one) rather than unquoted column names.
Using @andrewzm's data set
dtbl
# x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1
setorderv(dtbl, colVar)[] # or `setkeyv(dtbl, colVar)[]` or `setorderv(dtbl, "y")[]`
# x y
# 1: 5 1
# 2: 4 2
# 3: 3 3
# 4: 2 4
# 5: 1 5
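As mentioned above, setorder/setorderv can also sort in decreasing order; a small sketch with the same dtbl and colVar (order = -1L means descending):
setorderv(dtbl, colVar, order = -1L)[]
#    x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1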
You can use double brackets for data tables:
library(data.table)
dtbl <- data.table(x = 1:5, y = 5:1)
colVar = "y"
dtbl_sorted <- dtbl[order(dtbl[[colVar]])]
dtbl_sorted

How to Enter data for only conditioned rows on data table

I need to put a number on the first (or a random) item in each group.
I do the following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct, but the last line puts 5 on all rows instead of only the conditioned ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
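As a hypothetical sanity check (not from the original answers): with DT keyed by item and week, both approaches should flag the same rows.
idx1 <- DT[, .I[1], by = list(item, week)]$V1           # first row index per group
idx2 <- which(!duplicated(DT, by = c("item", "week")))  # first occurrence per combination
identical(sort(idx1), sort(idx2))  # expected TRUE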

Pairing rows in data.table

I have a data table similar to the one obtained with the following command:
dt <- data.table(
  time = 1:8,
  part = rep(c(1, 1, 2, 2), 2),
  type = rep(c('A', 'B'), 4),
  data = rep(c(runif(1), 0), 4))
Basically, such a table contains two different types of instance (A or B). The time column contains a timestamp for when a request arrived at or left a certain part: if the instance type is A, the timestamp is the arrival time (enter), and if the type is B, the timestamp is the leaving time (exit).
time part type data
1: 1 1 A 0.5842668
2: 2 1 B 0.0000000
3: 3 2 A 0.5842668
4: 4 2 B 0.0000000
5: 5 1 A 0.5842668
6: 6 1 B 0.0000000
7: 7 2 A 0.5842668
8: 8 2 B 0.0000000
I would like to pair A and B instances, and obtain the following data table:
part data enter.time exit.time
1: 1 0.4658239 1 2
2: 1 0.4658239 5 6
3: 2 0.4658239 3 4
4: 2 0.4658239 7 8
I have tried the following:
pair.types <- function(x) {
  a.type <- x[type == 'A']
  b.type <- x[type == 'B']
  return(data.table(
    enter.time = a.type$time,
    exit.time = b.type$time,
    data = a.type$data))
}
dt[, c('enter.time', 'exit.time', 'data') := pair.types(.SD), by = list(part)]
But, that gives me the following, which is not exactly what I want:
time part type data enter.time exit.time
1: 1 1 A 0.3441592 1 2
2: 2 1 B 0.3441592 5 6
3: 3 2 A 0.3441592 3 4
4: 4 2 B 0.3441592 7 8
5: 5 1 A 0.3441592 1 2
6: 6 1 B 0.3441592 5 6
7: 7 2 A 0.3441592 3 4
8: 8 2 B 0.3441592 7 8
It is kind of close, but since the 'type' column is kept, some rows are duplicated. Perhaps I could remove the 'time' and 'type' columns and then drop the second half of the rows, but I am not sure that would work in all cases, and I would like to learn a better way to do this operation.
Assuming your data looks like your example data:
dt[, list(part = part[1],
          data = data[1],
          enter.time = time[1],
          exit.time = time[2]),
   by = as.integer((seq_len(nrow(dt)) + 1)/2)]
#  by = rep(seq(1, nrow(dt), 2), each = 2)]
#  ^^^ a slightly shorter and a little more readable alternative
The idea is very simple - group the rows into groups of 2 (that's the by part), i.e. each group will be one A and one B; then for each group take the first part and the first data, and the enter and exit times are simply the first and second time respectively. This is likely how you'd do it if you followed the by-hand logic, making it easy to read (once you know just a tiny bit about how data.table works).
Another way:
setkey(dt, "type")
dt.out <- cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
                exit.time = dt[J("B"), list(time)]$time)
# part data entry.time exit.time
# 1: 1 0.1294204 1 2
# 2: 2 0.1294204 3 4
# 3: 1 0.1294204 5 6
# 4: 2 0.1294204 7 8
If you want you can now do setkey(dt.out, "part") to get the same order.
The idea: your problem seems a simple "reshaping" one to me. The way I've approached it is first to set type as the key column. Now we can subset the data.table for a specific value of the key column with dt[J("A")]; this returns all the columns for the matching rows. Since you want the time column renamed, I explicitly specify which columns to select using:
dt[J("A"), list(part, data, entry.time = time)]
Of course this also returns the type column (= "A"), which we have to remove. So I've appended [, type := NULL] to remove the type column by reference.
Now we've the first part. All we need is the exit.time. This can be obtained similarly as:
dt[J("B"), list(time)] # I don't name the column here
But this gives a data.table when you need just the time column, which can be accessed by:
dt[J("B"), list(time)]$time
So, while using cbind I name this column as exit.time to get the final result as:
cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
Hope this helps.
