Working with a data.table within a data.table in R

I'm trying to build a column in a data.table by interacting with another data.table, and I'm having trouble referring to variables correctly to do this without a for-loop. Once I enter the second data.table, I can no longer seem to refer to the column in the first data.table correctly.
This is kind of similar to Subsetting a data.table using another data.table
but I believe the merge-style solutions aren't appropriate.
Consider something like
#used numbers instead of dates to not have to deal with formatting, but idea is the same.
dt1 <- data.table(id = c('a', 'b', 'c'), date1 = c(1.1, 5.4, 9.1), amt= '100')
dt2 <- data.table(date2 = c(1.3, 3, 6.4, 10.5),
dt2col = c(1.5, 1.02, 1.005, .99)
)
dt1[result := prod(dt2[date2-(date1)>0,
dt2col
]
)
]
I want the result to be a new column in dt1 which is the product of dt2col when date2 (in dt2) is later than date1 (in dt1) for each specific row in dt1. I think the (date1) part is the problem.
I expect result[1] to be the product of dt2col for all of them, but result[2] to be the product of dt2col for only the dates after '5/4/2018', etc.

Here are some data.table options:
1) Using non-equi joins:
dt1[, result := dt2[dt1, on=.(date2 > date1), prod(dt2col), by=.EACHI]$V1]
dt1
2) Using rolling joins after calculating the cumulative product:
setorder(dt2, -date2)
dt2[, cprod := cumprod(dt2col)]
dt1[dt2, result := cprod, on=.(date1=date2), roll=Inf]
output:
id date1 amt result
1: a 1.1 100 1.522273
2: b 5.4 100 0.994950
3: c 9.1 100 0.990000
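As a sanity check on option 1, the join can be compared with a plain row-by-row computation. This is only a minimal sketch using the example tables from the question; the name expected is introduced here for illustration and is not part of the answer.

```r
library(data.table)

# Example tables from the question
dt1 <- data.table(id = c("a", "b", "c"), date1 = c(1.1, 5.4, 9.1), amt = "100")
dt2 <- data.table(date2 = c(1.3, 3, 6.4, 10.5),
                  dt2col = c(1.5, 1.02, 1.005, 0.99))

# Reference: for each date1, multiply dt2col over the rows with a later date2
# ('expected' is a hypothetical name used only for this check)
expected <- sapply(dt1$date1, function(d) prod(dt2$dt2col[dt2$date2 > d]))

# Option 1: non-equi join, aggregated once per row of dt1 via by = .EACHI
dt1[, result := dt2[dt1, on = .(date2 > date1), prod(dt2col), by = .EACHI]$V1]

# Both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(dt1$result, expected)))
```

The sapply version is easier to read but loops over dt1 in R; the join does the same lookup inside data.table, which is what the question asked for.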

Try this:
dt1[,`:=`(date1 = as.Date.character(date1,format = "%d/%m/%Y"))]
dt2[,`:=`(date2 = as.Date.character(date2,format = "%d/%m/%Y"))]
dt1[,`:=`(inds = lapply(X = date1,function(t){
intersect(x = which(year(t)==year(dt2$date2)),
y = which(as.integer(dt2$date2-t)>0))}))][,result:=
lapply(X = inds,function(t){prod(dt2$dt2col[t])})]
# id date1 amt inds result
#1: a 2018-01-01 100 1,2,3,4 1.522273
#2: b 2018-04-05 100 1,4 1.485
#3: c 2018-01-09 100 1,4 1.485

Related

How to do a special type of lookup join in R data.table?

Suppose there are two tables in R as under:
library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
They look like:
> dt1
a b
1: p 1,2
2: q 1,2,3
3: r 4,5
> dt2
code desc
1: 1 good
2: 2 better
3: 3 best
4: 4 bad
5: 5 worst
The goal is to join dt1 and dt2 in such a way that the result looks like
> result
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Can anyone show how this type of join can be accomplished in R?
That's not really a join but as dt1$b contains convoluted values anyway here is my ugly hack:
dt2[, code := as.character(code)]
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Edit:
The replacement has to be done from the longest to the shortest code (string lengths or number of characters) and desc must not contain any digits.
So, with setorder(dt2, -code) added to the code and the new use case provided by the OP in the comment:
dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
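To see why the replacement order matters, here is a small base R sketch that runs the same loop in ascending and in descending code order. The helper replace_codes is hypothetical and uses gsub with fixed = TRUE in place of stringr::str_replace_all; it is meant as an illustration, not as the answer's implementation.

```r
# Replace each code by its description, in the order given
# ('replace_codes' is a hypothetical helper for this illustration)
replace_codes <- function(x, codes, descs) {
  for (i in seq_along(codes)) {
    x <- gsub(codes[i], descs[i], x, fixed = TRUE)
  }
  x
}

b     <- c("1,21", "23,11,36", "11,36")
codes <- c(1, 11, 21, 23, 36)
descs <- c("good", "better", "best", "bad", "worst")

# Ascending order: "1" is replaced first and also hits the digits of "21" and "11"
asc <- replace_codes(b, as.character(codes), descs)

# Descending order: longer codes are consumed before the shorter ones they contain
o          <- order(-codes)
desc_first <- replace_codes(b, as.character(codes[o]), descs[o])

print(asc)        # first element becomes "good,2good"
print(desc_first) # first element becomes "good,best"
```

The descending run only works because none of the descriptions contain digits, which is exactly the limitation stated in the edit above.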
Edit 2:
According to the OP's comment, the requirement of the ugly hack (no digits in desc) isn't fulfilled in the production data. (As almost always happens when a quick & dirty solution meets real-world data :-) ).
So here is a concise data.table solution which does what all the other answers do as well: split column b, join or look up the matching desc, and recombine:
dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
Using OP's new use case
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
Note that grouping uses both columns a and b for two reasons: 1) convenience (to keep both columns in the final result), 2) in case a is not a unique identifier
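The same three steps (split, look up, recombine) can also be sketched in base R without any join. This assumes the OP's new use case and uses plain vectors instead of data.tables; the names parts, lookup and result are illustrative only.

```r
b    <- c("1,21", "23,11,36", "11,36")
code <- c(1, 11, 21, 23, 36)
desc <- c("good", "better", "best", "bad", "worst")

# 1) split each string into its individual codes
parts <- strsplit(b, ",", fixed = TRUE)

# 2) look up the description of every code (NA if a code is unknown)
lookup <- lapply(parts, function(k) desc[match(as.numeric(k), code)])

# 3) recombine the descriptions into one comma-separated string per row
result <- vapply(lookup, paste, character(1), collapse = ",")

print(result)
```

Because whole codes are matched rather than substrings, this variant does not depend on the replacement order or on desc being free of digits.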
The idea is to turn column b into a list of integer vectors and then subset column desc in dt2 (note that here code is just the row number; otherwise use the function match).
library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>%
.[, b := type.convert(b)]
dt2[dt1, on = c("code == b")] %>%
.[, lapply(.SD, toString), by = "a"]
#> a code desc
#> 1: p 1, 2 good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r 4, 5 bad, worst
Created on 2021-07-27 by the reprex package (v2.0.0)
You can split the string on comma and do a join.
library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
# a desc
# <chr> <chr>
#1 p good, better
#2 q good, better, best
#3 r bad, worst

dplyr into data.table: filter > group by > count

I usually work with dplyr, but I am facing a rather large data set and my approach is very slow. I basically need to filter a df, group it by dates, and count the occurrences within each group.
sample data (turned already everything into data.table)
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited, but I guess we would use setkey() on the date columns and something like res[, `:=`(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group, but I'm not sure how the filter goes in there.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works. The syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up values in x. The symbol .EACHI indicates that aggregation (j=.N) will be done for each row of i.
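A small deterministic example may make the by=.EACHI behaviour easier to follow. The tables months and spells and their values are made up for this sketch (the question's data is random), and the brute-force count is only there as a cross-check.

```r
library(data.table)

# Hypothetical lookup table: one row per month (plays the role of i)
months <- data.table(
  startmonth = as.Date(c("2014-07-01", "2014-08-01")),
  endmonth   = as.Date(c("2014-07-31", "2014-08-31"))
)

# Hypothetical spells to be counted (plays the role of x)
spells <- data.table(
  id    = 1:4,
  start = as.Date(c("2014-07-10", "2014-07-20", "2014-08-05", "2014-08-15")),
  end   = as.Date(c("2014-07-15", "2014-08-10", "2014-08-20", "2014-08-25"))
)

# For each month, count the spells that overlap it
counts <- spells[months, on = .(start <= endmonth, end >= startmonth),
                 allow.cartesian = TRUE, .N, by = .EACHI]

# Cross-check: the same overlap condition, evaluated month by month
brute <- sapply(seq_len(nrow(months)), function(r)
  sum(spells$start <= months$endmonth[r] & spells$end >= months$startmonth[r]))

stopifnot(identical(counts$N, brute))
print(counts)
```

Here July overlaps spells 1 and 2, and August overlaps spells 2, 3 and 4, so N is 2 and 3.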

Operate in data.table column by matching column from second data.table

I am trying to perform a character operation (paste) in a column from one data.table using data from a second data.table.
Since I am also performing other unrelated merge operations before and after this particular code, the row order might change, so I am currently setting the order both before and after this manipulation.
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3)) # N used
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
# without merge
DT1 <- DT1[order(ID)]
DT2 <- DT2[order(ID)]
DT1[, N := paste0(N, "/", DT2$N)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
I know a merge of the two DTs (by definition) would take care of the matching, but this creates extra columns that I need to remove afterwards.
# using merge
DT1 <- merge(DT1, DT2, by = "ID")
DT1[, N := paste0(N.x, "/", N.y)]
DT1[, c("N.x", "N.y") := list(NULL, NULL)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
Is there a more intelligent way of doing this using data.table?
We can use join after converting the 'N' column to character
DT1[DT2, N := paste0(N, "/", i.N), on = .(ID)]
DT1
# ID N
#1: a 4/10
#2: b 1/10
#3: c 3/15
data
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3))
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
DT1[, N:= as.character(N)]
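The prior conversion to character is presumably needed because the update join assigns into the existing numeric column N by reference. A variant that writes to a new column avoids that step; this is a sketch only, and the column name ratio is an assumption made for the example.

```r
library(data.table)

DT1 <- data.table(ID = c("a", "b", "c"), N = c(4, 1, 3))      # N used
DT2 <- data.table(ID = c("b", "a", "c"), N = c(10, 10, 15))   # N total

# Update join: rows are matched on ID, so the row order of DT2 does not matter.
# N is DT1's column; i.N is the column N of the table in i (DT2).
# 'ratio' is a hypothetical new column, so N keeps its numeric type.
DT1[DT2, ratio := paste0(N, "/", i.N), on = .(ID)]

print(DT1)
```

If the original N is no longer needed afterwards, it can be dropped with DT1[, N := NULL].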

Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a, and b with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and join with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or as @CathG mentioned in the comments, for previous versions either merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))

Finding the appropriate interval

Suppose I've several intervals which are subset of real line as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and more straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
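Applied to the intervals and the point from the question, the same non-equi join might look like the following sketch. The table name intervals and the variable hit are chosen for this illustration.

```r
library(data.table)

# Intervals I_1 .. I_4 from the question, one per row
intervals <- data.table(start = c(0, 1.5, 5, 13), end = c(1, 2, 9, 16))

# Row number of the interval with start <= x and end >= x
hit <- intervals[.(x = 6.4), on = .(start <= x, end >= x), which = TRUE]

print(hit)  # row 3, i.e. the interval [5, 9]
```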
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within an interval in DT1. In your case DT2 is a point, not an interval. If it did not exist within any intervals in DT1, it'd return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether you'd want to return all the matching rows or just the first or last, etc.
Since setkey sorts the result, you'll have to add a separate id as follows:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7
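On the algorithm part of the question: if the intervals are sorted by start and do not overlap (true for I_1 to I_4, but an assumption in general), a binary search over the start points is enough. Below is a base R sketch using findInterval; the function name find_interval is hypothetical.

```r
starts <- c(0, 1.5, 5, 13)
ends   <- c(1, 2, 9, 16)

# Assumes intervals are sorted by start and pairwise non-overlapping.
# 'find_interval' is a hypothetical helper, not a data.table or base function.
find_interval <- function(x, starts, ends) {
  i  <- findInterval(x, starts)        # index of the last start <= x (0 if none)
  ok <- i > 0 & x <= ends[pmax(i, 1)]  # x must also lie at or below that interval's end
  ifelse(ok, i, NA_integer_)           # NA when x falls into a gap or outside
}

res <- find_interval(c(6.4, 1.2, -1, 16), starts, ends)
print(res)
```

For the four test points this returns 3, NA, NA and 4: 6.4 lies in [5, 9], 1.2 falls into the gap between I_1 and I_2, -1 lies left of every interval, and 16 is the closed right end of I_4. For overlapping intervals a point can belong to several of them, and the join-based approaches above are the safer choice.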
