Conditionally select multiple items by date range and identifier in data.table - r

I have a data.table containing unit identifiers, a setting identifier, and a date range for which this setting is valid. I need to extract the settings for specific unit identifiers for a specific day. The following minimal working example shows how I would obtain the result.
library(data.table)
settingstable=data.table(UNITID=c(1,1,1,2,2,2,3,4,5,6,6),
STARTDATE=as.POSIXct(c("2018-01-01","2018-02-28","2018-06-01","2018-01-01","2018-04-01","2018-06-01","2018-01-01","2018-01-01","2018-01-01","2018-01-01","2018-05-01")),
ENDDATE=as.POSIXct(c("2018-02-28","2018-05-31","2018-12-31","2018-03-31","2018-05-31","2018-12-31","2018-12-31","2018-12-31","2018-12-31","2018-04-30","2018-12-31")),
SETTINGS=c(1,2,3,4,5,6,7,8,9,10,11))
selectunits=c(2,4,6)
selectdays=as.POSIXct(c("2018-04-02","2018-05-03","2018-02-01"))
resultsettings=NULL
for (i in 1:length(selectunits)) {
resultsettings=rbind(resultsettings,settingstable[UNITID==selectunits[i] & STARTDATE <= selectdays[i] & ENDDATE >= selectdays[i],.(UNITID,SETTINGS)])
}
For large data.tables or a large number of units and days this will be very inefficient. I was hoping that grouping with by=UNITID would work, but it does not: the following fails with a "longer object length is not a multiple of shorter object length" error.
resultsettings=settingstable[UNITID %in% selectunits & STARTDATE <= selectdays & ENDDATE >= selectdays,.(UNITID,SETTINGS),by=UNITID]
How can I improve my code so it runs more efficiently?

You can use a non-equi join:
settingstable[.(u = selectunits, d = selectdays),
on=.(UNITID = u, STARTDATE <= d, ENDDATE >= d),
.(UNITID, SETTINGS)]
UNITID SETTINGS
1: 2 5
2: 4 8
3: 6 10
The syntax is x[i, on=, j].
The list i = .(u = selectunits, d = selectdays) is treated as a table, to be joined to x = settingstable.
The join works by looking up each row of i in x according to on=.
In j, we can transform the result. (Without j, we'd just get the joined table.)
If your on= conditions yield multiple matches, they will all appear in the result. If they leave no matches, SETTINGS and other columns from x will be NA (though this can be tweaked using the nomatch= argument).
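To make the no-match behaviour concrete, here is a minimal sketch on a smaller, made-up table (the names settings and lookup and their values are assumptions for illustration, not the asker's data). It assumes a reasonably recent data.table in which nomatch = NULL is accepted; older versions used nomatch = 0 for the same purpose.

```r
library(data.table)

# Hypothetical settings table: unit 1 has two validity periods, unit 2 has one
settings <- data.table(
  UNITID    = c(1, 1, 2),
  STARTDATE = as.Date(c("2018-01-01", "2018-03-01", "2018-01-01")),
  ENDDATE   = as.Date(c("2018-02-28", "2018-12-31", "2018-12-31")),
  SETTINGS  = c(10, 20, 30)
)

# Hypothetical lookups: unit 3 does not exist in settings
lookup <- data.table(
  u = c(1, 2, 3),
  d = as.Date(c("2018-04-15", "2018-06-01", "2018-06-01"))
)

# Default: every row of i is kept, unmatched rows get NA in SETTINGS
with_na <- settings[lookup,
                    on = .(UNITID = u, STARTDATE <= d, ENDDATE >= d),
                    .(UNITID, SETTINGS)]

# nomatch = NULL: rows of i without a match are dropped
matched_only <- settings[lookup,
                         on = .(UNITID = u, STARTDATE <= d, ENDDATE >= d),
                         .(UNITID, SETTINGS),
                         nomatch = NULL]
```

Which variant is preferable probably depends on whether a missing setting should be visible as NA or silently skipped.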

Related

R: Is there a way to get unique, closest matches with the rows in the same data.table based on multiple columns?

In R, I want to get unique, closest matches for the rows in a data.table which are identified by unique ids based on values in two columns. Here, I provide a toy example and the code I'm using to achieve this.
dt <- data.table(id = letters,
value_1 = as.integer(runif(26,1,20)),
value_2 = as.integer(runif(26,1,10)))
pairs <- data.table()
while(nrow(dt) >= 2){
k <- dt[c(1)]
m <- dt[-1]
t <- m[k, roll = "nearest",on = .(value_1,value_2)]
pairs <- rbind(pairs,t)
dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[,-c(2,3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
id i.id
1 NA a
2 NA b
3 m c
4 v d
5 y e
6 i f
...
Is there a way to do this without the loop? I intend to implement this on a data.table with more than 20 million observations. Clearly, using a loop is extremely inefficient. I was wondering if the roll join command can be run on a copy of the main data.table by introducing an exception condition -- so as not to match the same ids with each other. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest",on = .(value_1,value_2)]
Without the exception, this command merely generates matches of ids with themselves. Also, this does not ensure unique matches.
Thanks!

R conditional calculate date difference

I need a fast way of conditionally finding the difference, in days, between two dates in a data table. I managed to do it with an "ifelse" statement, but it is slow on big objects. Is there a faster, more elegant way of achieving the same, perhaps using data.table commands like ":="? Thx. J.
library(lubridate)
library(data.table)
rm(list = ls())
a <- as.Date(c ("2021-09-27", "2019-10-30", "2021-09-05"))
b <- as.Date(c ("2020-06-14", "2019-09-15", "2020-09-23"))
c <- as.Date(c ("2022-07-12", "2020-09-23", "2021-06-19"))
new <- data.table(leave = a, start = b, end = c)
new$days <- ifelse (
new$leave < new$end,
new$leave - new$start,
new$end - new$start)
So in words: when the leaving date < end of period, subtract start from leave; if leave >= end, subtract start from end. The result, in days, goes into a new column.
Using the pmin() function and data.table's assign operator :=
new[, days := as.numeric(pmin(leave, end)-start)]
Or you could assign it to all rows one way then chain off a subset:
new[, days := as.numeric(end - start)][leave < end, days := as.numeric(leave - start)]
Or take advantage of by and the .GRP special symbol:
new[, days := list(leave-start, end-start)[.GRP], keyby=.(leave>=end)]
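As a quick sanity check, the pmin() version can be compared against the original ifelse() on the question's three rows. This is only a sketch; the column names days_ifelse and days_pmin are made up so that both results can sit side by side.

```r
library(data.table)

# Same three rows as in the question
new <- data.table(
  leave = as.Date(c("2021-09-27", "2019-10-30", "2021-09-05")),
  start = as.Date(c("2020-06-14", "2019-09-15", "2020-09-23")),
  end   = as.Date(c("2022-07-12", "2020-09-23", "2021-06-19"))
)

# Original approach: ifelse() drops the difftime class and returns plain numbers
new[, days_ifelse := ifelse(leave < end, leave - start, end - start)]

# pmin() picks the earlier of leave and end row by row, then one subtraction
new[, days_pmin := as.numeric(pmin(leave, end) - start)]

# Both columns should agree
all.equal(new$days_ifelse, new$days_pmin)
```

The pmin() form does a single vectorised subtraction instead of computing both branches, which is likely where any speed-up comes from.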

Generate group by condition on row value in column R data.table

I want to split a data.table in R into groups based on a condition in the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to for loop across rows)
I have data like this:
library(data.table)
dt1 <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)), group=c(rep(1,50),rep(2,46),rep(3,43)))
dt1[ , group := cumsum(t > 200) + 1]
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
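The difference between the two answers (large value starts a new group vs. large value closes the previous group) is easier to see on a tiny made-up vector. In this sketch the threshold of 100 and the column names group_start and group_end are assumptions for illustration; shift()'s fill argument is used in place of patching the first row afterwards.

```r
library(data.table)

# Hypothetical data: 500 and 900 are the "large" values
dt <- data.table(t = c(1, 2, 500, 3, 4, 900, 5))
threshold <- 100

# Large value is the FIRST row of a new group
dt[, group_start := cumsum(t > threshold) + 1L]

# Large value is the LAST row of the previous group:
# lag the running count by one row, filling the first row with 0
dt[, group_end := shift(cumsum(t > threshold), fill = 0L) + 1L]

dt
```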

R, Create data.frame conditional on colnames and row entries of existing df

I have a follow up to this question.
I am creating a data.frame conditional on the column names and specific row entries of an existing data.frame. Below is how I resolved it using a for loop (thanks to @Roland's suggestion... the real data violated requirements of @eddi's answer), but it has been running on the actual data set (200x500,000+ rows x cols) for more than two hours now...
(The following generated data.frames are very similar to the actual data.)
set.seed(1)
a <- data.frame(year=c(1986:1990),
events=round(runif(5,0,5),digits=2))
b <- data.frame(year=c(rep(1986:1990,each=2,length.out=40),1986:1990),
region=c(rep(c("x","y"),10),rep(c("y","z"),10),rep("y",5)),
state=c(rep(c("NY","PA","NC","FL"),each=10),rep("AL",5)),
events=round(runif(45,0,5),digits=2))
d <- matrix(rbinom(200,1,0.5),10,20, dimnames=list(c(1:10), rep(1986:1990,each=4)))
e <- data.frame(id=sprintf("%02d",1:10), as.data.frame(d),
region=c("x","y","x","z","z","y","y","z","y","y"),
state=c("PA","AL","NY","NC","NC","NC","FL","FL","AL","AL"))
for (i in seq_len(nrow(d))) {
for (j in seq_len(ncol(d))) {
d[i,j] <- ifelse(d[i,j]==0,
a$events[a$year==colnames(d)[j]],
b$events[b$year==colnames(d)[j] &
b$state==e$state[i] &
b$region==e$region[i]])
}
}
Is there a better/faster way to do this?
A simpler way to do it (I think - it does not involve melting, dcasting and merging) is as follows:
First, your a and b arrays should be indexed by year (for a) and by year/state/region (for b):
at = a$events; names(at) = a$year
bt = tapply(b$events,list(b$year,b$state,b$region),function(x) min(x))
# note: I used min(x) in tapply just to be on the safe side, so that the function always returns a scalar
# we now create the result of the more complex case (lookup in b)
ids = cbind(colnames(d)[col(d)],
as.character(e$state[row(d)]),
as.character(e$region[row(d)])
)
vals=bt[ids]; dim(vals)=dim(d)
# and compute your desired result with the ifelse
result = ifelse(d==0,at[colnames(d)[col(d)]],vals)
# and that's it!
This should be faster (avoiding the nested loops), but I haven't profiled that. Let us know how that works for you on the full data.
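The key trick in the answer above is indexing an array with a character matrix, where each row of the matrix names one cell. A minimal two-dimensional sketch with made-up values (the names lookup and keys are hypothetical):

```r
# Hypothetical lookup table indexed by year (rows) and state (columns)
lookup <- tapply(
  c(10, 20, 30, 40),
  list(c("1986", "1986", "1987", "1987"),
       c("NY",   "PA",   "NY",   "PA")),
  min
)

# Each row of 'keys' is one (year, state) pair to look up
keys <- cbind(c("1987", "1986"),
              c("NY",   "PA"))

# A character matrix with one column per dimension selects
# one cell per row, matched against the dimnames
vals <- lookup[keys]
vals
```

The same mechanism extends to the three-dimensional year/state/region array, as long as every key actually appears in the dimnames.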
# This will require a couple of merges,
# but first let's convert the data to long form and extract year as integer
# I convert result to data.table, since that's easier and faster to deal with
# Note: it *is* possible to do the melt/dcast entirely in data.table framework,
# but it's a hassle right now - there is a FR iirc about that
library(reshape2)
library(data.table)
dt = data.table(melt(e))[, year := as.integer(sub('X([0-9]*).*','\\1',variable))]
# set key for merging and merge with b and a
setkey(dt, year, region, state)
dt.result = data.table(a, key = 'year')[
data.table(b, key = c('year', 'region', 'state'))[dt]]
# now we can compute the value we want
dt.result[, final.value := value * events.1 + (!value) * events]
# dcast back
e.result = dcast(dt.result, id + region + state ~ variable,
value.var = 'final.value')

Recursive function in R to find unique rows of a list of data tables

I am working on a function that takes a list of data tables with the same column names as an input and returns a single data table that has the unique rows from each data frame combined using successive rbind as shown below.
The function would be applied on a "very" large data.table (10s of millions of rows), which is why I had to split it up into several smaller data tables and assign them into a list to use recursion. At each step, depending upon the length of the list of data tables (odd or even), I find the unique rows of the data.table at that list index and of the data table at list index x - 1, then rbind the two, assign the result to list index x - 1, and remove list index x.
I must be missing something obvious, because although I can produce the final unique-d data.table when I print it (e.g., print (listelement[[1]])), when I return (listelement[[1]]) I get NULL. It would help if someone could spot what I am missing ... or suggest any other more efficient way to perform this.
Also, instead of having to add each data.table to a list, can I add them as "references" in the list? I believe doing something like list(datatable1, datatable2 ...) would actually copy them?
## CODE
returnUnique2 <- function (alist) {
if (length(alist) == 1) {
z <- (alist[[1]])
print (class(z))
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
}
if (length(alist) %% 2 == 0) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
if (length(alist) %% 2 == 1 && length(alist) > 2) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
}
## OUTPUT with print statement
t1 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t2 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t3 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
tempList <- list(t1, t2, t3)
returnUnique2(tempList)
[1] "list"
[[1]]
col1 col2
1: a 3
2: a 2
3: a 5
4: a 9
5: a 10
6: a 7
7: a 1
8: a 8
9: a 4
10: a 6
Changing the following,
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
to read
return(z)
returns NULL
Thanks in advance.
Please correct me if I misunderstand what you're doing, but it sounds like you have one big data.table and are trying to split it up to run some function on it and would then combine everything back and run a unique on that. The data.table way of doing that would be to use by, e.g.
fn = function(d) {
# do whatever to the subset and return the resulting data.table
# in this case, do nothing
d
}
N = 10 # number of pieces you like
dt[, fn(.SD), by = (seq_len(nrow(dt)) - 1) %/% (nrow(dt)/N)][, seq_len := NULL]
dt = dt[!duplicated(dt)]
Seems like this could be a good use case for a for loop. With many rows the overhead of using a for loop should be relatively small compared to the computation time. I would try combining my data.tables into a list (called ll in my example), then for each one remove duplicated rows, then rbind to the previous data.table with unique rows and then subset by unique rows again.
If you have many duplicated rows in each chunk then this might save some time. Overall I'm not sure how effective it will be, but worth a shot?
# Create empty data.table for results (I have columns x and y in this case)
res <- data.table( x= numeric(0),y=numeric(0))
# loop over all data.tables in a list called 'll'
for( i in 1:length(ll) ){
# rbind the unique rows from the current list element to the results from all previous iterations
res <- rbind( res , ll[[i]][ ! duplicated(ll[[i]]) , ] )
# Keep only unique records at each iteration
res <- res[ ! duplicated(res) , ]
}
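The same accumulate-and-deduplicate loop can also be expressed as a fold with Reduce(). This is only a sketch under the assumption that ll is a non-empty list of data.tables with identical columns; the three small tables below are made up.

```r
library(data.table)

# Hypothetical stand-in for the list 'll' of chunked data.tables
ll <- list(
  data.table(x = c(1, 1, 2), y = c(1, 1, 2)),
  data.table(x = c(2, 3),    y = c(2, 3)),
  data.table(x = c(3, 3, 4), y = c(3, 3, 4))
)

# Start from the de-duplicated first chunk, then fold in the remaining chunks:
# de-duplicate each chunk, append it, and de-duplicate the running result
res <- Reduce(
  function(acc, chunk) unique(rbind(acc, unique(chunk))),
  ll[-1],
  unique(ll[[1]])
)

res
```

In the recursive function from the question, the same idea would need the value of each recursive call to be returned explicitly, for example return(returnUnique2(alist)). Otherwise the function's value is that of its final if statement, which is NULL whenever that condition is FALSE.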
On another note, have you looked at the documentation for data.table? It explicitly states,
Because data.tables are usually sorted by key, tests for duplication
are especially quick.
So you might just be better off running on the entire data.table?
DT[ ! duplicated(DT) , ]
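If the data is already split into a list, a short way to follow this suggestion is to stack the pieces with rbindlist() and call unique() once. A minimal sketch with made-up tables (t1, t2, t3 here are small stand-ins, not the asker's data):

```r
library(data.table)

# Hypothetical chunks with the same columns
t1 <- data.table(col1 = "a", col2 = c(1, 2, 2))
t2 <- data.table(col1 = "a", col2 = c(2, 3))
t3 <- data.table(col1 = "a", col2 = c(3, 3, 4))

# Stack all chunks in one step, then drop duplicated rows once
combined <- unique(rbindlist(list(t1, t2, t3)))

combined
```

Whether this beats chunk-wise de-duplication on tens of millions of rows is something to measure rather than assume.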
Add an id column to each data.table
t1$id=1
t2$id=2
t3$id=3
then combine them all at once and do a unique using by=.
If the data.tables are huge you could use setkey(...) to create an index on id before calling unique.
tall=rbind(t1,t2,t3)
tall[, unique(.SD), by = id]
