R conditional calculate date difference

I need to work out a fast way of conditionally finding the difference, in days, between two dates in a data.table. I managed to do it with an ifelse() statement, but it is slow on big objects. My question: is there a faster, more elegant way of achieving the same, perhaps using data.table idioms like := or something? Thx. J.
library(lubridate)
library(data.table)
rm(list = ls())
a <- as.Date(c("2021-09-27", "2019-10-30", "2021-09-05"))
b <- as.Date(c("2020-06-14", "2019-09-15", "2020-09-23"))
c <- as.Date(c("2022-07-12", "2020-09-23", "2021-06-19"))
new <- data.table(leave = a, start = b, end = c)
new$days <- ifelse(
  new$leave < new$end,
  new$leave - new$start,
  new$end - new$start)
So in words: when the leaving date < end of period, subtract start from leave; but if leave >= end, subtract start from end instead, and return the result, in days, in a new column.

Using the pmin() function and data.table's assignment operator :=
new[, days := as.numeric(pmin(leave, end) - start)]
Or you could assign it to all rows one way then chain off a subset:
new[, days := as.numeric(end - start)][leave < end, days := as.numeric(leave - start)]
Or take advantage of by and the .GRP special symbol:
new[, days := list(leave - start, end - start)[.GRP], keyby = .(leave >= end)]
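As a quick sanity check (days_check is a hypothetical helper column, not part of the original answer), any of the three versions should reproduce the original ifelse() result:
# recompute the original ifelse() logic and compare it with days
new[, days_check := as.numeric(ifelse(leave < end, leave - start, end - start))]
identical(new$days, new$days_check)  # expect TRUE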

Related

Conditionally select multiple items by date range and identifier in data.table

I have a data.table containing unit identifiers, a setting identifier, and a date range for which this setting is valid. I need to extract the settings for specific unit identifiers on a specific day. The following minimal working example shows how I would obtain the result.
library(data.table)
settingstable = data.table(
  UNITID = c(1,1,1,2,2,2,3,4,5,6,6),
  STARTDATE = as.POSIXct(c("2018-01-01","2018-02-28","2018-06-01","2018-01-01","2018-04-01","2018-06-01","2018-01-01","2018-01-01","2018-01-01","2018-01-01","2018-05-01")),
  ENDDATE = as.POSIXct(c("2018-02-28","2018-05-31","2018-12-31","2018-03-31","2018-05-31","2018-12-31","2018-12-31","2018-12-31","2018-12-31","2018-04-30","2018-12-31")),
  SETTINGS = c(1,2,3,4,5,6,7,8,9,10,11))
selectunits = c(2,4,6)
selectdays = as.POSIXct(c("2018-04-02","2018-05-03","2018-02-01"))
resultsettings = NULL
for (i in seq_along(selectunits)) {
  resultsettings = rbind(resultsettings,
    settingstable[UNITID == selectunits[i] & STARTDATE <= selectdays[i] & ENDDATE >= selectdays[i],
                  .(UNITID, SETTINGS)])
}
For large data.tables or large numbers of units and days this will be very inefficient. I was hoping that grouping with by=UNITID would work, but unfortunately this is not possible, as the following results in a "longer object length is not a multiple of shorter object length" error:
resultsettings = settingstable[UNITID %in% selectunits & STARTDATE <= selectdays & ENDDATE >= selectdays, .(UNITID, SETTINGS), by = UNITID]
How can I improve my code so it runs more efficiently?
You can use a non-equi join:
settingstable[.(u = selectunits, d = selectdays),
              on = .(UNITID = u, STARTDATE <= d, ENDDATE >= d),
              .(UNITID, SETTINGS)]
   UNITID SETTINGS
1:      2        5
2:      4        8
3:      6       10
The syntax is x[i, on=, j].
The list i = .(u = selectunits, d = selectdays) is treated as a table, to be joined to x = settingstable.
The join works by looking up each row of i in x according to on=.
In j, we can transform the result. (Without j, we'd just get the joined table.)
If your on= conditions yield multiple matches, they will all appear in the result. If they leave no matches, SETTINGS and other columns from x will be NA (though this can be tweaked using the nomatch= argument).
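For example (a hypothetical lookup, not from the original answer): asking for a unit id that is absent from settingstable would normally return a row with NA settings, while nomatch = NULL drops it instead:
settingstable[.(u = 99, d = as.POSIXct("2018-04-02")),
              on = .(UNITID = u, STARTDATE <= d, ENDDATE >= d),
              .(UNITID, SETTINGS), nomatch = NULL]
# returns an empty data.table instead of a one-row result with NA in SETTINGS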

Conditional group by join in R

I am new to R and rather flummoxed by the following problem. I have two vectors of dates (the vectors are not necessarily aligned, nor of the same length).
I want to find for each date in the first vector the next date in the second vector.
vecA <- as.Date(c('1951-07-01', '1953-01-01', '1957-04-01', '1958-12-01',
                  '1963-06-01', '1965-05-01'))
vecB <- as.Date(c('1952-01-12', '1952-02-01', '1954-03-01', '1958-08-01',
                  '1959-03-01', '1964-03-01', '1966-05-01'))
In SQL I would write something like this, but I cannot find any tips on SO as to how to do this in R.
select vecA.Date, min(vecB.Date)
from vecA inner join vecB
on vecA.Date < vecB.Date
group by vecA.Date
The output should look like this:
Start End
1951-07-01 1952-01-12
1953-01-01 1954-03-01
1957-04-01 1958-08-01
1958-12-01 1959-03-01
1963-06-01 1964-03-01
1965-05-01 1966-05-01
Here's a possible solution using data.table rolling joins
library(data.table)
dt1 <- as.data.table(vecA) ## convert to `data.table` object
dt2 <- as.data.table(vecB) ## convert to `data.table` object
setkey(dt2) # key in order to perform a binary join
res <- dt2[dt1, vecB, roll = -Inf, by = .EACHI] # run the inner join while selecting closest date
setnames(res, c("Start", "End"))
res
# Start End
# 1: 1951-07-01 1952-01-12
# 2: 1953-01-01 1954-03-01
# 3: 1957-04-01 1958-08-01
# 4: 1958-12-01 1959-03-01
# 5: 1963-06-01 1964-03-01
# 6: 1965-05-01 1966-05-01
Alternatively, we can also do:
data.table(vecA=vecB, vecB, key="vecA")[dt1, roll=-Inf]
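In the alternative, the keyed vecA column is just a copy of vecB used for the join, while the second column carries the actual vecB dates as a payload; roll = -Inf then rolls each dt1 date forward to the next available key value, so the payload column ends up holding the true matching vecB date rather than the rolled-over key.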
This code will do what you are asking, but it's not clear what you are trying to accomplish, so this might not be the best way. In essence, the code first sorts both vectors into ascending order. Then, using a for loop, it loops over all the elements in vecA and uses x < vecB to find out which elements in vecB are greater than x.
That is wrapped in which(), which returns the numeric index of each TRUE element of a vector, and then in min(), which gives the smallest such index. This is then used to subset vecB to return the date; it's all wrapped in print so you can see the output of the loop.
This is probably not the best way of doing this, but without more context on your goals it should at least get you started.
> vecA <- vecA[order(vecA)]
> vecB <- vecB[order(vecB)]
> for(x in vecA) {print(vecB[min(which(x < vecB))])}
[1] "1952-01-12"
[1] "1954-03-01"
[1] "1958-08-01"
[1] "1959-03-01"
[1] "1964-03-01"
[1] "1966-05-01"

Using conditional statements in r data.table

I am trying to use data.table to recode a variable based on certain conditions. My original dataset has around 30M records and, after all variable creation, around 130 variables. I used the methods suggested here: conditional statements in data.table (M1), and also here: data.table: Proper way to create a conditional variable when column names are not known? (M2)
My goal is get the equivalent of the below code but something that is applicable using data.table
samp$lf5 <- samp$loadfactor5
samp$lf5 <- with(samp, ifelse(loadfactor5 < 0, 0, lf5))
I will admit that I don't understand .SD and .SDcols very well, so I might be using them wrong. The code and errors from (M1) and (M2) are given below, and the sample dataset is here: http://goo.gl/Jp97Wn
(M1)
samp[,lf5 = if(loadfactor5 <0) 0 else loadfactor5]
Error Message
Error in `[.data.table`(samp, , lf5 = if (loadfactor5 < 0) 0 else loadfactor5) :
unused argument (lf5 = if (loadfactor5 < 0) 0 else loadfactor5)
When I do this:
samp[,list(lf5 = if(loadfactor5 <0) 0 else loadfactor5)]
it gives lf5 as a list but not as part of the samp data.table and does not really apply the condition as lf5 still has values less than 0.
(M2)
Col1 <- "loadfactor5"
Col2 <- "lf5"
setkeyv(samp,Col1)
samp[,(Col2) :=.SD,.SDCols = Col1][Col1<0,(Col2) := .SD, .SDcols = 0]
I get the following error
Error in `[.data.table`(samp, , `:=`((Col2), .SD), .SDCols = Col1) :
unused argument (.SDCols = Col1)
Any insights on how to accomplish this are appreciated. My dataset has 30M records, so I am hoping to use data.table to really cut the run time down.
Thanks,
Krishnan
Answer provided by eddi and included here for the sake of completeness.
samp[, lf5 := ifelse(loadfactor5 < 0, 0, loadfactor5)]
Another way (which I prefer because it's, in my opinion, cleaner):
samp[, lf5 := 0]; samp[loadfactor5 > 0, lf5 := loadfactor5];
I use data.table with a dataset with 90M rows; I am continually amazed at how fast data.table is for operations like the above.
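As an aside (not from the quoted answer), the same truncate-at-zero recode can be written with pmax(), which is vectorized and avoids both ifelse() and the two-step assignment:
# pmax() takes the elementwise maximum of loadfactor5 and 0
samp[, lf5 := pmax(loadfactor5, 0)]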

R : Efficient loop on row with data.table

I am using data.table in R and looping over my table; it's really slow because of my table size. I wonder if someone has an idea on how to speed this up.
I have a set of values that I want to "cluster". Each line has a position, a positive integer. You can load a simple view of that:
library(data.table)
# Here is a toy example
fulltable = c(seq(1, 4)) * c(seq(1, 1000, 10))
fulltable = data.table(pos = fulltable[order(fulltable)])
fulltable$id = 1
So I loop over my rows, and when there is a gap of more than 50 between two positions I change the group:
# here is the main loop
lastposition = fulltable[1]$pos
lastid = fulltable[1]$id
for (i in 2:nrow(fulltable)) {
  if (fulltable[i]$pos - 50 > lastposition) {
    lastid = lastid + 1
    print(lastid)
  }
  fulltable[i]$id = lastid
  lastposition = fulltable[i]$pos
}
Any idea for an efficient way to do this?
# flag the first row of each new group: rows where the gap from the previous position exceeds 50
fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N + 1)]
# all other rows get 1, then cummax() carries each group id down to the rows below it
fulltable[is.na(new_group), new_group := 1]
fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]

R, Create data.frame conditional on colnames and row entries of existing df

I have a follow up to this question.
I am creating a data.frame conditional on the column names and specific row entries of an existing data.frame. Below is how I resolved it using a for loop (thanks to @Roland's suggestion... the real data violated requirements of @eddi's answer), but it has been running on the actual data set (200 x 500,000+ rows/cols) for more than two hours now...
(The following generated data.frames are very similar to the actual data.)
set.seed(1)
a <- data.frame(year = c(1986:1990),
                events = round(runif(5, 0, 5), digits = 2))
b <- data.frame(year = c(rep(1986:1990, each = 2, length.out = 40), 1986:1990),
                region = c(rep(c("x","y"), 10), rep(c("y","z"), 10), rep("y", 5)),
                state = c(rep(c("NY","PA","NC","FL"), each = 10), rep("AL", 5)),
                events = round(runif(45, 0, 5), digits = 2))
d <- matrix(rbinom(200, 1, 0.5), 10, 20,
            dimnames = list(c(1:10), rep(1986:1990, each = 4)))
e <- data.frame(id = sprintf("%02d", 1:10), as.data.frame(d),
                region = c("x","y","x","z","z","y","y","z","y","y"),
                state = c("PA","AL","NY","NC","NC","NC","FL","FL","AL","AL"))
for (i in seq_len(nrow(d))) {
  for (j in seq_len(ncol(d))) {
    d[i,j] <- ifelse(d[i,j] == 0,
                     a$events[a$year == colnames(d)[j]],
                     b$events[b$year == colnames(d)[j] &
                              b$state == e$state[i] &
                              b$region == e$region[i]])
  }
}
Is there a better/faster way to do this?
A simpler way to do it (I think - it does not involve melting, dcasting and merging) is as follows:
First, your a and b tables should be turned into lookups indexed by year (for a) and by year/state/region (for b):
at = a$events; names(at) = a$year
# note: min() in tapply is just to be on the safe side, so that the function always returns a scalar
bt = tapply(b$events, list(b$year, b$state, b$region), function(x) min(x))
# we now create the result of the more complex case (lookup in b)
ids = cbind(colnames(d)[col(d)],
            as.character(e$state[row(d)]),
            as.character(e$region[row(d)]))
vals = bt[ids]; dim(vals) = dim(d)
# and compute the desired result with ifelse
result = ifelse(d == 0, at[colnames(d)[col(d)]], vals)
# and that's it!
This should be faster (avoiding the nested loops), but I haven't profiled it. Let us know how it works for you on the full data.
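The trick that makes this work is indexing an array by a character matrix: each row of ids supplies one (year, state, region) coordinate triple, so bt[ids] returns one value per cell of d. A tiny standalone illustration with made-up values:
arr = array(1:8, dim = c(2, 2, 2),
            dimnames = list(c("1986", "1987"), c("NY", "PA"), c("x", "y")))
coords = rbind(c("1986", "PA", "y"), c("1987", "NY", "x"))
arr[coords] # one element per row of coords: 7 2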
# This will require a couple of merges,
# but first let's convert the data to long form and extract year as integer
# I convert result to data.table, since that's easier and faster to deal with
# Note: it *is* possible to do the melt/dcast entirely in data.table framework,
# but it's a hassle right now - there is a FR iirc about that
library(reshape2)
library(data.table)
dt = data.table(melt(e))[, year := as.integer(sub('X([0-9]*).*','\\1',variable))]
# set key for merging and merge with b and a
setkey(dt, year, region, state)
dt.result = data.table(a, key = 'year')[
data.table(b, key = c('year', 'region', 'state'))[dt]]
# now we can compute the value we want
dt.result[, final.value := value * events.1 + (!value) * events]
# dcast back
e.result = dcast(dt.result, id + region + state ~ variable,
value.var = 'final.value')
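Note that the final.value line is an arithmetic ifelse(): value holds the 0/1 entry from d, so value * events.1 + (!value) * events picks events.1 where value is 1 and events where it is 0. Assuming those merged column names, the spelled-out equivalent would be dt.result[, final.value := ifelse(value == 1, events.1, events)].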
