Count prior occurrences of value one greater than current value - r

I am trying to create a column that contains the number of prior occurrences (sorted by date) of a value one greater than the current value. In the example below, I manually create the values I want in the column labeled "wanted": for each row, it is the count of prior occurrences (sorted by "date") of RoundNo equal to one greater than the focal row's RoundNo. This needs to be computed separately by group for each individual InvestorID.
So the first row's "wanted" value is the count of prior RoundNo of Investor 1 where RoundNo == 3 (one larger than the first row's RoundNo of 2), which is 0. Similarly, the second row's "wanted" value is the count of prior RoundNo of Investor 1 where RoundNo == 2 (one larger than the second row's RoundNo of 1), which is 1. Would appreciate any help. Code example is below. Thanks!
library(data.table)

# Note: cbind() coerces all columns to character, so RoundNo is stored as
# character here; the answers below convert it with as.numeric().
dt = as.data.table(cbind(c(rep(1, 7), rep(2, 7)),
                         c("2019-08-01","2019-09-01","2019-10-01","2019-11-01",
                           "2019-12-01","2021-04-01","2021-10-01",
                           "2019-01-01","2019-02-01","2019-04-01","2019-08-01",
                           "2019-09-01","2019-10-01","2019-11-01"),
                         c(2,1,2,2,1,3,2,1,2,3,2,1,3,1)))
names(dt) = c("InvestorID","date","RoundNo")
wanted = c(0,1,0,0,3,0,1,0,0,0,1,2,0,2)
dt$wanted = wanted

1) Define a function Count which, given a vector, counts how many of its elements equal one plus its last element. Then use rollapplyr to apply it to successively longer leading sequences of RoundNo within each InvestorID.
library(zoo)

# Count how many elements of x equal one more than x's last element.
Count <- function(x) sum(x == tail(x, 1) + 1)

# Widths 1:.N give Count growing windows: row i sees rows 1..i of its group.
dt[, wanted := rollapplyr(as.numeric(RoundNo), 1:.N, Count), by = InvestorID]
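As a quick sanity check (not part of the original answer), the computed column can be compared against the hand-built one; this should print TRUE:
all(dt$wanted == wanted)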
2) An alternate method is a self left join: the first instance of dt (aliased a) is left joined to a second instance of dt (aliased b), matching those b rows that are from the same InvestorID and come at or before the a row. Group by a row and take the appropriate sum over the b rows. (rowid is SQLite's implicit row number for each table.)
library(sqldf)

sqldf("select a.*, sum(a.RoundNo + 1 == b.RoundNo) wanted
       from dt a
       left join dt b on a.InvestorID = b.InvestorID and b.rowid <= a.rowid
       group by a.rowid")
3) This alternative only uses data.table. Count is from (1).
dt[, wanted := sapply(1:.N, function(i) Count(as.numeric(RoundNo)[1:i])),
   by = InvestorID]

Another data.table solution using Reduce:
dt[order(date),
   .(date,
     result = lapply(Reduce('c', as.numeric(RoundNo), accumulate = TRUE),
                     function(x) sum(x == last(x) + 1)),
     wanted),
   by = InvestorID]
    InvestorID       date result wanted
 1:          2 2019-01-01      0      0
 2:          2 2019-02-01      0      0
 3:          2 2019-04-01      0      0
 4:          2 2019-08-01      1      1
 5:          2 2019-09-01      2      2
 6:          2 2019-10-01      0      0
 7:          2 2019-11-01      2      2
 8:          1 2019-08-01      0      0
 9:          1 2019-09-01      1      1
10:          1 2019-10-01      0      0
11:          1 2019-11-01      0      0
12:          1 2019-12-01      3      3
13:          1 2021-04-01      0      0
14:          1 2021-10-01      1      1

Related

Counting group size including zero with R's data.table

I have a small (< 10M row) data.table with a variable that takes on integer values. I would like to generate a count of the number of times that the variable takes on each integer value, including zeroes when the variable never takes on that value.
For example, I might have:
dt <- data.table(a = c(1,1,3,3,5,5,5))
My desired output is a data.table with values:
a N
1 2
2 0
3 2
4 0
5 3
This is an extremely basic question, but it is difficult to find data.table-specific answers for it. In my example, we can assume that the minimum value is known (here it is 1), but the maximum value is unknown.
dt[, .N, by = .(a)
][data.table(a = seq(min(dt$a), max(dt$a))), on = .(a)
][is.na(N), N := 0][]
# a N
# <int> <int>
# 1: 1 2
# 2: 2 0
# 3: 3 2
# 4: 4 0
# 5: 5 3
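A minimal alternative sketch (not from the original answer), assuming the values start at 1: base tabulate() counts occurrences of the integers 1..nbins, returning zeros for absent values.
data.table(a = seq_len(max(dt$a)), N = tabulate(dt$a, nbins = max(dt$a)))
#    a N
# 1: 1 2
# 2: 2 0
# 3: 3 2
# 4: 4 0
# 5: 5 3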

Data.table: sum between irregular date ranges

Surveys and fires occurred at irregular intervals in different burn units (srv = 1 means a survey was done, fire = 1 means a fire occurred). I want to calculate how many fires occurred between surveys, i.e., including the year of the survey and going back to the year after the previous survey.
nyear = 10
units = 4
set.seed(15)
DT <- data.table(
  unit = rep(1:units, each = nyear),
  year = 2000:(2000 + nyear - 1),
  srv  = rbinom(nyear * units, 1, 0.4),
  fire = rbinom(nyear * units, 1, 0.3)
)
DT
I can calculate the years elapsed, but I have to create a new dataset and then join it back to the original. Then I cannot figure out how to sum fires between date ranges.
DT1 <- DT[srv != 0]                                               # drop years without surveys
DT2 <- DT1[, .(year, elapsed = year - shift(year)), by = "unit"]  # use 'shift' to find years elapsed
DT3 <- DT2[DT, on = .(unit, year)]                                # join elapsed time back to the original
DT3[, sum(fire), on = .(year >= year, year < year - (elapsed - 1)), by = "unit"]  # doesn't work
Example output follows, where 'nfire' is what I'm after: in years without surveys it is NA; otherwise it gives the number of fires since the last survey, including the current survey year:
    unit year elapsed srv fire nfire
 1:    1 2000      NA   1    1     1
 2:    1 2001      NA   0    0    NA
 3:    1 2002       2   1    1     1
 4:    1 2003       1   1    0     0
 5:    1 2004      NA   0    0    NA
 6:    1 2005       2   1    0     0
 7:    1 2006       1   1    0     1
 8:    1 2007      NA   0    1    NA
 9:    1 2008       2   1    1     2
10:    1 2009       1   1    0     1
11:    2 2000      NA   0    0    NA
12:    2 2001      NA   1    1    NA
The answer by r2evans works:
DT[, grp := rev(cumsum(rev(srv == 1))), by = .(unit)][, nfire := sum(fire), by=.(unit, grp)]
Times when surveys occurred (srv == 1) are placed in reverse order and then summed cumulatively. The reverse ordering ensures that each survey is grouped with the years that preceded it, and the cumulative sum assigns consecutively numbered groups. The outer rev() restores the original row order.
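A small illustration (not from the original answer) of how rev(cumsum(rev(...))) numbers the groups; each survey year closes a group that also contains the non-survey years before it:
srv <- c(1, 0, 1, 1, 0, 1)
rev(cumsum(rev(srv == 1)))
# [1] 4 3 3 2 1 1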
The second part of the statement, [, nfire := sum(fire), by = .(unit, grp)], is an example of chaining: as I understand it, just a way of introducing more operations in a data.table step without cluttering the first part of the statement. The syntax within is reasonably intuitive.

Lookup observations data based on another table

I have 2 tibble data frames that I am trying to reconcile. The first tibble has over a million observations, the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (containing information on a particular event that has occurred); a simplified version follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to create a third column in the first tibble showing the cumulative number of events at the time of each observation. The third column in the above example would therefore be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid looping through the millions of observations to compare each observation's time to the Event_Time in the lookup table, because of computation time.
However, I am not sure how to go about this without a loop. The issue is that the lookup_table contains some IDs multiple times; if all IDs appeared in the lookup_table only once, then I could do:
data$Event_Time <- lookup_table[match(data$ID, lookup_table$ID), "Event_Time"]
data <- data %>% mutate(Number = ifelse(Time >= Event_Time, 1, 0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables, but rather to compare the time columns in the lookup_table and data table to obtain my desired column. For example, if I were to write an inefficient loop function, it would be:
for (i in 1:nrow(data)) {
  sub <- subset(lookup_table, ID == data$ID[i])
  data$Number[i] <- sub[max(which(data$Time[i] >= sub$Event_Time)),
                        "Number_of_Cumulative_Events"]
}
A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)

setDT(data)[, new := 0L][
  setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  , new := cumsum(new), by = ID][]
   ID  Time new
1:  1 23160   0
2:  1 23161   0
3:  1 23162   1
4:  1 23163   1
5:  1 23164   2
6:  1 23165   2
7:  2 24251   0
8:  2 24252   0
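For readers unfamiliar with the update-on-join idiom used above, here is a minimal standalone sketch on toy data (not from the question): x[y, on = ..., col := i.val] modifies matching rows of x by reference and leaves the rest untouched.
x <- data.table(id = 1:3, val = 0L)
y <- data.table(id = 2L, newval = 5L)
x[y, on = .(id), val := i.newval][]
#    id val
# 1:  1   0
# 2:  2   5
# 3:  3   0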
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
  is.na(new), new := 0][
  , new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
   ID Event_Time Number_of_Cumulative_Events
1:  1      23160                          NA
2:  1      23161                          NA
3:  1      23162                           1
4:  1      23163                           1
5:  1      23164                           2
6:  1      23165                           2
7:  2      24251                          NA
8:  2      24252                          NA
(NA's have been left untouched for illustration)
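As a follow-up sketch (not part of the original answer), the NAs from the rolling join can be zeroed afterwards if "no prior event" should count as zero cumulative events:
res <- lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
res[is.na(Number_of_Cumulative_Events), Number_of_Cumulative_Events := 0][]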

How to find the last observation in the next 10 seconds compared to the current row?

I want to find the last observation in the next 10 seconds compared to the current row. Here is the example:
library(dplyr)
library(data.table)
data <- setDT(data.frame(price = c(4, 5, 3, 1, 0),
                         datetime = as.POSIXct(c("2015/06/12 12:10:25",
                                                 "2015/06/12 12:10:27",
                                                 "2015/06/12 12:10:32",
                                                 "2015/06/12 12:10:36",
                                                 "2015/06/12 12:10:38"))))
#   price            datetime
# 1     4 2015/06/12 12:10:25
# 2     5 2015/06/12 12:10:27
# 3     3 2015/06/12 12:10:32
# 4     1 2015/06/12 12:10:36
# 5     0 2015/06/12 12:10:38
This is the desired output:
#   price            datetime next.price
# 1     4 2015/06/12 12:10:25          3
# 2     5 2015/06/12 12:10:27          1
# 3     3 2015/06/12 12:10:32          0
# 4     1 2015/06/12 12:10:36          0
# 5     0 2015/06/12 12:10:38         NA
Here is my code but it does not provide the desired output:
data.new <- data %>%
  rowwise() %>%
  do(data.frame(
    price = .$price,
    datetime = .$datetime,
    next.price = data[data$datetime <= .$datetime + 10, "price", with = FALSE, mult = "last"]
  ))
What do you generally do when you want to create a value for each row based on the subsequent rows? Questions such as: find the max/min in the subsequent rows, etc. Do you use rowwise from dplyr?
IIUC this can be done using rolling joins as well:
# data.table v1.9.6
data[, next.price := data[.(datetime = datetime + 10), price, roll = 10 - 1, on = "datetime"]]
#    price            datetime next.price
# 1:     4 2015-06-12 12:10:25          3
# 2:     5 2015-06-12 12:10:27          1
# 3:     3 2015-06-12 12:10:32          0
# 4:     1 2015-06-12 12:10:36          0
# 5:     0 2015-06-12 12:10:38         NA
This performs a last observation carried forward (LOCF) rolling join. For each value in i = datetime + 10, it finds the row in x = data with the largest datetime <= that value, extracts the corresponding price, and creates a new column. roll = 9 limits how far values can be carried forward, so each row only picks up observations from datetime + 1 through datetime + 10.
From the next version the datetime = won't be needed; it will work with .(datetime + 10) directly.
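For comparison, here is a non-equi join sketch (my own assumption-laden alternative, not from the original answer; it requires data.table >= 1.9.8 and treats the window as (datetime, datetime + 10], which coincides with roll = 9 for whole-second timestamps):
data[, next.price2 := data[data[, .(d1 = datetime, d2 = datetime + 10)],
                           on = .(datetime > d1, datetime <= d2),
                           last(price), by = .EACHI]$V1]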

rolling cumulative sums conditional on missing data

I want to calculate rolling cumulative sums by item in a data.table. Sometimes, data is missing for a given time period.
library(data.table)

set.seed(8)
item  <- c(rep("A", 4), rep("B", 3))
time  <- c(1, 2, 3, 4, 1, 3, 4)
sales <- rpois(7, 5)
DT <- data.table(item, time, sales)
For a rolling window of 2 time periods I want the following output:
   item time sales sales_rolling2
1:    A    1     5              5
2:    A    2     3              8
3:    A    3     7             10
4:    A    4     6             13
5:    B    1     4              4
6:    B    3     6              6
7:    B    4     4             10
Note, that item B has no data at time 2. Thus the result for row 6 just includes the latest observation.
We can use rollsum from library(zoo) to do the rolling sum. Before applying rollsum, we need to create another grouping variable ('indx') based on the 'time' variable, because for item 'B' the time is not continuous (2 is missing). We can use diff to create a logical index from the differences of adjacent elements: if a difference is not 1 it returns TRUE, otherwise FALSE. As the diff output is one element shorter than the column, we pad with TRUE and then take the cumsum to create the 'indx' variable.
library(zoo)
DT[, indx := cumsum(c(TRUE, diff(time) != 1))]
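On the example times this yields the following groups (a quick illustration, not part of the original answer); note how item B's gap at time 2 starts a new group:
cumsum(c(TRUE, diff(c(1, 2, 3, 4, 1, 3, 4)) != 1))
# [1] 1 1 1 1 2 3 3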
In the second step, we use both 'indx' and 'item' as grouping variables and take the rollsum of 'sales' with k = 2. If the group has more than one element (if (.N > 1)), we prepend the first sale to the rolling sums; otherwise we simply return 'sales'. We assign (:=) the result to 'sales_rolling2' and then set 'indx' to NULL, as it is not needed in the expected output.
DT[, sales_rolling2 := if (.N > 1) c(sales[1], rollsum(sales, 2)) else sales,
   by = .(indx, item)][, indx := NULL]
#    item time sales sales_rolling2
# 1:    A    1     5              5
# 2:    A    2     3              8
# 3:    A    3     7             10
# 4:    A    4     6             13
# 5:    B    1     4              4
# 6:    B    3     6              6
# 7:    B    4     4             10
Update
As per #Khashaa's suggestion, roll_sum from library(RcppRoll) can be used more effectively, as it even works when the number of rows is less than 'k'. This way, we can remove the if/else condition from my previous solution. (Full credit to #Khashaa)
library(RcppRoll)
DT[, sales_rolling2 := c(sales[1L], roll_sum(sales, 2)), by = .(indx, item)]
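To see why the if/else becomes unnecessary (an illustration, assuming RcppRoll returns a zero-length vector when the input is shorter than the window): for a one-row group, c(sales[1L], roll_sum(sales, 2)) collapses to just sales[1L].
library(RcppRoll)
roll_sum(c(4), 2)
# numeric(0)
c(4, roll_sum(c(4), 2))
# [1] 4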
