Lookup observations data based on another table - r

I have 2 tibble data frames that I am trying to reconcile. The first tibble has over a million observations, the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (that has information of a particular event that has occurred), simplified version as follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to create a 3rd column in the first tibble, such that it shows the number of cumulative events at that time of the observation. The 3rd column in the above example would therefore be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.
However, I am not sure how to go about doing this without the use of a loop. The issue is that the lookup_table contains some IDs multiple times. If every ID appeared in the lookup_table only once, I could do:
data$Event_Time <- lookup_table$Event_Time[match(data$ID, lookup_table$ID)]
data <- data %>% mutate(Number = ifelse(Time >= Event_Time, 1, 0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables, but rather to compare the time columns in the lookup_table and data table to obtain my desired column. For example, if I were to write an inefficient loop, it would be:
for (i in seq_len(nrow(data))) {
  # events for this ID that occurred at or before the observation time
  prior <- subset(lookup_table, ID == data$ID[i] & Event_Time <= data$Time[i])
  # cumulative counts never decrease, so the max is the latest count (0 if none)
  data$Number[i] <- if (nrow(prior) > 0) max(prior$Number_of_Cumulative_Events) else 0
}

A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)
setDT(data)[, new := 0L][setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
, new := cumsum(new), by = ID][]
ID Time new
1: 1 23160 0
2: 1 23161 0
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 0
8: 2 24252 0
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
is.na(new), new := 0][
, new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
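Since the question starts from tibbles, the same join-then-cumsum idea can also be sketched with dplyr. This is only an illustration under the assumption that both tables look like the small samples in the question; the object name result is made up for the example.

```r
library(dplyr)

# Sample data copied from the question
data <- tibble(ID   = c(1, 1, 1, 1, 1, 1, 2, 2),
               Time = c(23160, 23161, 23162, 23163, 23164, 23165, 24251, 24252))
lookup_table <- tibble(ID = c(1, 1, 2, 2),
                       Event_Time = c(23162, 23164, 24255, 24280),
                       Event_Indicator = c(1, 1, 1, 0),
                       Number_of_Cumulative_Events = c(1, 2, 1, 1))

result <- data %>%
  # attach the indicator only where an event happened at exactly that time
  left_join(select(lookup_table, ID, Time = Event_Time, Event_Indicator),
            by = c("ID", "Time")) %>%
  # observations without an event get 0
  mutate(Event_Indicator = coalesce(Event_Indicator, 0)) %>%
  group_by(ID) %>%
  arrange(Time, .by_group = TRUE) %>%
  # running total of events within each ID
  mutate(Number = cumsum(Event_Indicator)) %>%
  ungroup() %>%
  select(-Event_Indicator)
```

Like the update on join, this only counts events whose Event_Time matches an observation time exactly; events falling between two observations would need the rolling join instead.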
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
ID Event_Time Number_of_Cumulative_Events
1: 1 23160 NA
2: 1 23161 NA
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 NA
8: 2 24252 NA
(NA's have been left untouched for illustration)
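To get the column asked for in the question, the rolled values can be written back into data and the NAs replaced by zero. The following is a sketch on the sample data from the question; the intermediate name rolled is an assumption made for the example.

```r
library(data.table)

# Sample data copied from the question
data <- data.table(ID   = c(1, 1, 1, 1, 1, 1, 2, 2),
                   Time = c(23160, 23161, 23162, 23163, 23164, 23165, 24251, 24252))
lookup_table <- data.table(ID = c(1, 1, 2, 2),
                           Event_Time = c(23162, 23164, 24255, 24280),
                           Event_Indicator = c(1, 1, 1, 0),
                           Number_of_Cumulative_Events = c(1, 2, 1, 1))

# For each observation, roll the most recent event (Event_Time <= Time) forward;
# the result has one row per row of data, in the same order
rolled <- lookup_table[data, on = .(ID, Event_Time = Time), roll = TRUE]

# Copy the rolled count into data and treat "no event yet" as zero
data[, Number := rolled$Number_of_Cumulative_Events]
data[is.na(Number), Number := 0]
```

This relies on each (ID, Event_Time) pair appearing only once in lookup_table, so that the join returns exactly one row per observation.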

Related

mutate variable by condition using two variables in long format data.table in r

In this data.table:
dt <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c(1,0,0,0,1,0))
dt
id time x
1: 1 1 1
2: 1 2 0
3: 1 3 0
4: 2 1 0
5: 2 2 1
6: 2 3 0
I need the following:
id time x
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 0
5: 2 2 1
6: 2 3 1
that is
if x==1 at time==1 then x=1 at times 2 and 3, by id
if x==1 at time==2 then x=1 at time 3, by id
For the first point (I guess the second one will be similar), I have tried approaches mentioned in similar questions I posted before (here and here), but none work:
dt[x==1[time == 1], x := x[time == 1], id] gives an error
setDT(dt)[, x2:= ifelse(x==1 & time==1, x[time==1], x), by=id] changes x only at time 1 (so, no real change observed)
It would be much easier to work with data.table in wide format, but I keep facing this kind of problem in long format and I don't want to reshape my data all the time
Thank you!
EDIT:
The answer provided by @GregorThomas, dt[, x := cummax(x), by = id], works for the problem that I presented.
Now I ask the same question for a character variable:
dt2 <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c('a','b','b','b','a','b'))
dt2
id time x
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 2 1 b
5: 2 2 a
6: 2 3 b
In the table above, how could the following be done:
if x=='a' at time==1 then x='a' at times 2 and 3, by id
if x=='a' at time==2 then x='a' at time 3, by id
Using the cumulative maximum function cummax:
dt[, x := cummax(x), by = id]
dt
# id time x
# 1: 1 1 1
# 2: 1 2 1
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 1
# 6: 2 3 1
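cummax does not apply to the character variable from the edit, but a similar effect may be obtained by carrying a flag forward with cumsum. This is a sketch rather than part of the original answer, and it assumes the rows are sorted by time within each id:

```r
library(data.table)

dt2 <- data.table(id = c(1, 1, 1, 2, 2, 2), time = rep(1:3, 2),
                  x = c("a", "b", "b", "b", "a", "b"))

# Make sure rows are in time order within each id
setorder(dt2, id, time)

# cumsum(x == "a") becomes positive from the first "a" onwards,
# so every later row in the same id is overwritten with "a"
dt2[, x := ifelse(cumsum(x == "a") > 0, "a", x), by = id]
```

After this, id 1 holds a, a, a and id 2 holds b, a, a.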

Count prior occurrences of value one greater than current value

I am trying to create a column that contains the number of prior occurrences (sorted by date) of a value one greater than the current value. In the example provided here, I manually create the values I want in the column labeled “wanted”, which is equal to the count of the prior occurrences (sorted by "date") of RoundNo that are equal to one greater than the focal row RoundNo. And I need this to be computed separately by group for each individual InvestorID.
So the first row "wanted" value is equal to the count of prior RoundNo of Investor 1 where RoundNo == 3 (aka one larger than the first row's RoundNo of 2). So in this case that would be 0. Similarly for the second row, the "wanted" value is the count of prior RoundNo of Investor 1 where RoundNo == 2 (aka one larger than the second row's RoundNo of 1). So in this case that would be 1. Would appreciate any help. Code example is below. Thanks!
dt = as.data.table(cbind(c(rep(1,7),rep(2,7)),
c("2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01","2021-04-01","2021-10-01",
"2019-01-01","2019-02-01","2019-04-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01"),
c(2,1,2,2,1,3,2,1,2,3,2,1,3,1)))
names(dt) = c("InvestorID","date","RoundNo")
wanted = c(0,1,0,0,3,0,1,0,0,0,1,2,0,2)
dt$wanted = wanted
1) Define a function Count which counts how many elements of its input vector equal one plus the last element. Then use rollapplyr to apply it to successively longer leading sequences of RoundNo.
library(zoo)
Count <- function(x) sum(x == tail(x, 1) + 1)
dt[, wanted := rollapplyr(as.numeric(RoundNo), 1:.N, Count), by = InvestorID]
2) An alternative method is a self left join: the first instance of dt, aliased as a, is left joined to a second instance of dt, aliased as b, matching those b rows that have the same InvestorID and come at or before the a row. Then group by a row and take the appropriate sum over the b rows.
library(sqldf)
sqldf("select a.*, sum(a.RoundNo + 1 == b.RoundNo) wanted
from dt a
left join dt b on a.InvestorID = b.InvestorID and b.rowid <= a.rowid
group by a.rowid")
3) This alternative only uses data.table. Count is from (1).
dt[, wanted := sapply(1:.N, function(i) Count(as.numeric(RoundNo)[1:i])),
by = InvestorID]
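To see what Count does on growing prefixes, here is a small base R sketch using only Investor 1's RoundNo values from the question. The vector name round_no and the result name prefix_counts are made up for the illustration. Note that the question builds dt with cbind, which turns every column into character, which is why the solutions above wrap RoundNo in as.numeric.

```r
# Count how many elements equal one plus the last element
Count <- function(x) sum(x == tail(x, 1) + 1)

# Investor 1's RoundNo values, already in date order
round_no <- c(2, 1, 2, 2, 1, 3, 2)

# Apply Count to the first i elements, for i = 1, 2, ..., length(round_no)
prefix_counts <- sapply(seq_along(round_no),
                        function(i) Count(round_no[seq_len(i)]))
prefix_counts
# 0 1 0 0 3 0 1
```

These values match the first seven entries of the wanted column in the question.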
Another data.table solution using Reduce:
dt[order(date), .(date,
result = lapply(Reduce('c', as.numeric(RoundNo), accumulate = TRUE),
function(x) sum(x == last(x) + 1)),
wanted), by = InvestorID]
InvestorID date result wanted
1: 2 2019-01-01 0 0
2: 2 2019-02-01 0 0
3: 2 2019-04-01 0 0
4: 2 2019-08-01 1 1
5: 2 2019-09-01 2 2
6: 2 2019-10-01 0 0
7: 2 2019-11-01 2 2
8: 1 2019-08-01 0 0
9: 1 2019-09-01 1 1
10: 1 2019-10-01 0 0
11: 1 2019-11-01 0 0
12: 1 2019-12-01 3 3
13: 1 2021-04-01 0 0
14: 1 2021-10-01 1 1

Data.table: sum between irregular date ranges

Surveys and fires occurred at irregular intervals in different burn units.
(srv=1 means a survey was done, fire=1 means a fire occurred)
I want to calculate how many fires were lit between surveys, i.e.,
from the year after the previous survey up to and including the year of the current survey.
nyear = 10
units = 4
set.seed(15)
DT <- data.table(
unit = rep(1:units, each=nyear),
year = 2000:(2000+nyear-1),
srv = rbinom(nyear*units, 1, 0.4),
fire = rbinom(nyear*units, 1, 0.3)
)
DT
I can calculate the years elapsed, but I have to create a new dataset and then join it back to the original dataset. Then I cannot figure out how to sum fires between date ranges.
DT1 <- DT[srv != 0] # Drop years without surveys
DT2 <- DT1[, .(year, elapsed = year - shift(year)), by = "unit"] # Use 'shift' to find years elapsed
DT3 <- DT2[DT, on=.(unit, year)] # join dataset with elapsed time to original dataset
DT3[ , sum(fire), on = .(year >= year, year < year -(elapsed-1)), by="unit"] # Doesn't work
Example output follows, where 'nfire' is what I'm after: in years without surveys it is 'NA'; otherwise it gives the number of fires after the last survey, up to and including the current survey year:
unit year elapsed srv fire nfire
1: 1 2000 NA 1 1 1
2: 1 2001 NA 0 0 NA
3: 1 2002 2 1 1 1
4: 1 2003 1 1 0 0
5: 1 2004 NA 0 0 NA
6: 1 2005 2 1 0 0
7: 1 2006 1 1 0 1
8: 1 2007 NA 0 1 NA
9: 1 2008 2 1 1 2
10: 1 2009 1 1 0 1
11: 2 2000 NA 0 0 NA
12: 2 2001 NA 1 1 NA
The answer of r2evans works:
DT[, grp := rev(cumsum(rev(srv == 1))), by = .(unit)][, nfire := sum(fire), by=.(unit, grp)]
Times when surveys occurred (srv == 1) are placed in reverse order and then summed cumulatively. The reverse ordering ensures that each survey is grouped with the years that preceded it, and the cumulative sum assigns consecutively numbered groups. The outer 'rev' restores the original row order.
The second part of the statement '[, nfire := sum(fire), by=.(unit, grp)]' is an example of chaining--as I understand it, just a way of introducing more operations in a data.table step without cluttering the first part of the statement. The syntax within is reasonably intuitive.
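The grouping trick can be traced by hand on a single unit. The sketch below uses the srv and fire values of unit 1 for 2000 to 2005, as shown in the example output, and base R's ave in place of the data.table grouping. Setting non-survey years to NA is an extra step added here to match the desired output; it is not part of the one-liner above.

```r
# Unit 1, years 2000-2005, copied from the example output
srv  <- c(1, 0, 1, 1, 0, 1)
fire <- c(1, 0, 1, 0, 0, 0)

# Reverse, count surveys cumulatively, reverse back:
# each survey year shares a group id with the non-survey years before it
grp <- rev(cumsum(rev(srv == 1)))
grp
# 4 3 3 2 1 1

# Sum fires within each group, then blank out the years without a survey
nfire <- ave(fire, grp, FUN = sum)
nfire[srv == 0] <- NA
nfire
# 1 NA 1 0 NA 0
```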

How can I test if a value in one row is unique comparing with all previous rows by group, and count number of different value

I am trying to test whether the value in a row is unique compared with all previous rows, by group.
For example, for ID=1, I want to compare the drug in the current row to all previous rows (that is, to rows whose DATE is earlier than the current row's) under ID=1. E.g., in row 2, drug A is the same as in row 1, so EXIST_BEFORE is coded as 1; in row 4, C is unique compared with the previous rows (A, A, B), so it is coded as 0.
A further question: how can I count the number of different drugs prior to the current date? For example, for ID=1, prev_drug for row 4 is 2, because two drugs (A, B) different from drug C occur prior to the DATE of row 4.
ID DATE DRUG EXIST_BEFORE prev_drug
1 2001-01-01 A NA 0
1 2001-02-01 A 1 0
1 2001-03-15 B 0 1
1 2001-04-20 C 0 2
1 2001-05-29 A 1 2
1 2001-05-02 B 1 2
2 2001-03-02 A NA 0
2 2001-03-23 C 0 1
2 2001-04-04 D 0 2
2 2001-05-05 B 0 3
I only know how to compare with the one row above using lag(), but have no idea how to compare with earlier dates for each ID.
For this, try using dplyr. Basically, we can just group on ID and DRUG. For that grouped combination, find the first DATE occurrence using min(). Then, if the date is after that first occurrence, it is a repeat.
library(dplyr)
mydata %>%
group_by(ID, DRUG) %>%
mutate(FIRST_OCCURANCE = min(DATE),
EXIST_BEFORE = DATE > FIRST_OCCURANCE)
ID DATE DRUG EXIST_BEFORE FIRST_OCCURANCE
<int> <date> <chr> <lgl> <date>
1 1 2001-01-01 A FALSE 2001-01-01
2 1 2001-02-01 A TRUE 2001-01-01
3 1 2001-03-15 B FALSE 2001-03-15
4 1 2001-04-20 C FALSE 2001-04-20
5 1 2001-05-29 A TRUE 2001-01-01
6 1 2001-05-02 B TRUE 2001-03-15
7 2 2001-03-02 A FALSE 2001-03-02
8 2 2001-03-23 C FALSE 2001-03-23
9 2 2001-04-04 D FALSE 2001-04-04
10 2 2001-05-05 B FALSE 2001-05-05
I broke it into two variables to show what is going on, but you can also reduce the mutate() line simply to:
mutate(EXIST_BEFORE = DATE > min(DATE))
Alternatively, the rowid() function from the data.table package can be used:
library(data.table)
setDT(DT)[order(DATE), EXIST_BEFORE := pmin(1L, rowid(ID, DRUG) - 1L)]
DT
ID DATE DRUG EXIST_BEFORE
1: 1 2001-01-01 A 0
2: 1 2001-02-01 A 1
3: 1 2001-03-15 B 0
4: 1 2001-04-20 C 0
5: 1 2001-05-29 A 1
6: 1 2001-05-02 B 1
7: 2 2001-03-02 A 0
8: 2 2001-03-23 C 0
9: 2 2001-04-04 D 0
10: 2 2001-05-05 B 0
rowid(ID, DRUG) - 1L counts the number of occurrences of each ID and DRUG combination (a kind of implied grouping), starting at 0. pmin() is used to cap values greater than 1. order(DATE) ensures that rows are processed in date order.
Or, as suggested in Sotos' comment:
setDT(DT)[order(DATE), EXIST_BEFORE := as.integer(duplicated(DRUG)), by = ID][]
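The second part of the question (prev_drug) is not covered by the answers above. One possible sketch, assuming the sample data from the question and no two rows with the same DATE within an ID, counts the distinct earlier drugs that differ from the current one:

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(
  ID   = c(rep(1L, 6), rep(2L, 4)),
  DATE = as.Date(c("2001-01-01", "2001-02-01", "2001-03-15", "2001-04-20",
                   "2001-05-29", "2001-05-02",
                   "2001-03-02", "2001-03-23", "2001-04-04", "2001-05-05")),
  DRUG = c("A", "A", "B", "C", "A", "B", "A", "C", "D", "B")
)

# Sort by date within each ID so "previous rows" means "earlier dates"
setorder(DT, ID, DATE)

# For row i, take the drugs in rows 1..(i-1), drop the current drug,
# and count what is left (setdiff already returns distinct values)
DT[, prev_drug := sapply(seq_len(.N), function(i)
       length(setdiff(DRUG[seq_len(i - 1L)], DRUG[i]))),
   by = ID]
```

The sapply rescans the earlier rows for every row, so this is quadratic in the group size and is meant for illustration rather than for very long groups.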

R data.table subsetting on multiple conditions.

With the below data set, how do I write a data.table call that subsets this table and returns all customer ID's and associated orders for that customer IF that customer has ever purchased SKU 1?
Expected result should return a table that excludes cid 3 and 5 on that condition and every row for customers matching sku==1.
I am getting stuck as I don't know how to write a "contains" statement, == literal returns only sku's matching condition... I am sure there is a better way..
library("data.table")
df<-data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt=as.data.table(df)
This is similar to a previous answer, but here the subsetting works in a more data.table like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator allows us to filter to just those rows whose cid is contained in that vector. So, using the above:
dt[cid %in% matching_cids]
or on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
I would have thought that it was more (?!) data.table to use keys. I couldn't quite work out how to stick the whole lot on a single line, but I think that this would be a bit quicker on large data, because as I understand it (and I may very well be mistaken) this is the only solution presented thus far that avoids vector scanning (which is slow compared to binary search):
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge like so...
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
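Another possibility, not taken from the answers above, is to express the "has ever purchased SKU 1" condition directly per customer with any(). A minimal sketch on the data from the question (the result name res is made up):

```r
library(data.table)

dt <- data.table(cid   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 5, 6, 6),
                 order = c(1, 1, 1, 2, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 9),
                 sku   = c(1, 2, 3, 2, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 1, 2))

# Keep all rows of a customer only if at least one of their rows has sku == 1;
# groups where the condition is FALSE return NULL and are dropped
res <- dt[, if (any(sku == 1)) .SD, by = cid]
```

This keeps customers 1, 2, 4 and 6 and drops customers 3 and 5. It evaluates the condition once per group, which reads clearly but may be slower than the %in% or keyed approaches when there are very many customers.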
