Removing all rows under a specified row in a time series - r

I'm trying to analyze game data, and I need to remove all rows after a specified row.
In the following case I want to remove all rows after the EVENT "Die" for each user. Data is sorted by UID, TIME.HOUR.
df:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
1 8 Run
1 9 Run
2 14 Jump
2 15 Die
2 16 Run
2 17 Run
Expected result:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
2 14 Jump
2 15 Die
I think I'm on the right track with the code below, but I'm struggling with the next step.
args <- which(df$EVENT== "Die")
df[, c(sapply(args, function(x) ???)), by = UID] # seq? range?
Thank you.

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT, then, grouped by 'UID', take a double cumsum of the logical vector (EVENT == "Die") and check whether it is less than 2 to subset the data (.SD):
library(data.table)
setDT(df)[, .SD[cumsum(cumsum(EVENT == "Die")) < 2], UID]
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die
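To see why this works, here is a trace of the double cumsum for UID 1 (values taken from the example data above):
# EVENT                  : Run Run Run Die Run Run
# EVENT == "Die"         ->  0   0   0   1   0   0
# cumsum(EVENT == "Die") ->  0   0   0   1   1   1
# cumsum(cumsum(...))    ->  0   0   0   1   2   3
The double cumsum stays below 2 up to and including the "Die" row and reaches 2 from the first row after it, so those later rows are dropped.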
Or, for a faster approach, get the row index (.I) grouped by 'UID', extract that column ($V1), and use it to subset the data:
setDT(df)[df[, .I[cumsum(cumsum(EVENT == "Die")) < 2], UID]$V1]
Or a modification of @Psidom's approach:
setDT(df)[df[, .I[seq(match("Die", EVENT, nomatch = .N))], UID]$V1]
Or use dplyr
library(dplyr)
df %>%
  group_by(UID) %>%
  slice(seq(match("Die", EVENT, nomatch = n())))
# UID TIME.HOUR EVENT
# <int> <int> <chr>
#1 1 5 Run
#2 1 5 Run
#3 1 6 Run
#4 1 7 Die
#5 2 14 Jump
#6 2 15 Die
If we need a data.frame output, chain with %>% as.data.frame (from @R.S.'s comments).
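For example, a minimal sketch of that chain:
df %>%
  group_by(UID) %>%
  slice(seq(match("Die", EVENT, nomatch = n()))) %>%
  as.data.frame()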

This probably isn't so efficient, but you could do a fancy join:
mdf = df[EVENT == "Die", head(.SD, 1L), by=UID, .SDcols = "TIME.HOUR"]
df[!mdf, on=.(UID, TIME.HOUR > TIME.HOUR)]
UID TIME.HOUR EVENT
1: 1 5 Run
2: 1 5 Run
3: 1 6 Run
4: 1 7 Die
5: 2 14 Jump
6: 2 15 Die
You don't actually need to save the mdf table as a separate object, of course.
How it works
x[!i], where i is another data.table or list of vectors, is an anti-join, telling R to exclude rows of x based on i, similar to how it works with vectors (where i would have to be a logical vector).
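A minimal demonstration of the anti-join on toy tables (the names here are illustrative, not from the question):
x <- data.table(id = c(1, 2, 3), v = c(10, 20, 30))
i <- data.table(id = 2)
x[!i, on = "id"] # drops the rows of x whose id appears in i
#    id  v
# 1:  1 10
# 2:  3 30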
The on=.(UID, TIME.HOUR > TIME.HOUR) option tells R that we're doing a "non-equi join." In the on= spec, the column on the left-hand side of the operator comes from x (here df) and the one on the right-hand side from i (here mdf), so the condition matches rows of df whose TIME.HOUR is strictly greater than the "Die" time.
Combining these two, we're excluding exactly the rows that come after each user's first "Die".
Side note: I'm not sure whether we have a better name for this than "non-equi join".
I borrowed from both @Psidom's and @akrun's answers, using head and an inequality. One (maybe?) advantage here is that head(.SD, 1L) is optimized while head(.SD, expr) is not yet.

Another option: you can use head() with match() (to find the index of the first "Die"):
setDT(df)[, head(.SD, match("Die", EVENT, nomatch = .N)), UID]
# if no match is found within the group, return the whole group
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die
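As a quick illustration of the nomatch fallback (toy vectors, not the question's data):
match("Die", c("Run", "Die", "Run"))         # 2: index of the first "Die"
match("Die", c("Run", "Jump"), nomatch = 2L) # 2: no "Die", so fall back to the group size .N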

Related

Summing sequences in r using data.table

I am trying to sum pieces of a series using data.table in r. The idea is that I define a start index and an end index as columns in the table, then make a third column for "sum of the series between start and end indexes."
series = c(1,2,3,4,5,6)
a = data.table(start=c(1,2,3),end=c(4,5,6))
a[,S := sum(series[start:end])]
a
Expected result:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Actual result:
Warning messages:
1: In start:end : numerical expression has 3 elements: only the first used
2: In start:end : numerical expression has 3 elements: only the first used
> a
start end S
1: 1 4 10
2: 2 5 10
3: 3 6 10
What am I missing here? If I just do a[,S := start+end] the code executes as one would expect.
One option is to loop over the 'start' and 'end' columns with Map, generate the sequence (:) of each pair of elements, take the sum, unlist the resulting list, and assign (:=) it to a new column:
a[, S := unlist(Map(function(x, y) sum(x:y), start, end))]
Output:
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18
The : operator is not vectorized over its operands, i.e. it takes just a single value on either side, which is why the warning appeared: only the first elements of 'start' and 'end' were used.
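A quick illustration of that behaviour:
c(1, 3):4
# [1] 1 2 3 4
# Warning message:
# In c(1, 3):4 : numerical expression has 2 elements: only the first used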
Maybe you can try cumsum as below, which lets you apply vectorized operations within the data.table:
cs <- cumsum(series)
a[, S := cs[end] - c(0, cs)[start]]
which gives
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
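The trick is the prefix-sum identity sum(series[start:end]) == cs[end] - cs[start - 1], where c(0, cs)[start] stands in for cs[start - 1] so that start = 1 also works. For the second row, for instance:
cs <- cumsum(series) # 1 3 6 10 15 21
cs[5] - c(0, cs)[2]  # 15 - 1 = 14 = sum(series[2:5])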
You can use the arithmetic series formula (this works here because series is simply 1:6, so each slice sums consecutive integers):
a[, S := (end - start + 1) * (start + end) / 2]
Gives:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Your code would work if you make it a rowwise operation, so that each start and end represents a single value at a time.
library(data.table)
a[, S := sum(series[start:end]), by = 1:nrow(a)]
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18

Summing the number of times a value appears in either of 2 columns

I have a large data set of around 32 million rows, with information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Here N says that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4 and 5 each appear twice.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
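A trace of the pieces, using the example dt from the question:
m <- melt(dt, id.var = "Tel")   # stacks Origin and Destination into one 'value' column
match(m$value, m$Tel)           # 1 2 3 4 5 3 4 5 NA NA (6 and 7 never occur in Tel)
tabulate(match(m$value, m$Tel)) # 1 1 2 2 2 -> the N column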
Another option is to loop over columns 2 and 3, use %in% to check whether each value of 'Tel' is present, sum the resulting logical vectors elementwise with Reduce and +, and assign (:=) the result to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
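To see what Reduce is adding up (with the example dt):
lapply(dt[, .(Origin, Destination)], function(x) dt$Tel %in% x)
# $Origin
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE
# $Destination
# [1] FALSE FALSE  TRUE  TRUE  TRUE
Adding the two logical vectors elementwise gives N = 1 1 2 2 2. Note that %in% records presence, so each column contributes at most 1 per Tel; that matches this example, where every value occurs at most once per column.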
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but it can be useful to see.
# get a temporary data.table holding the combined Origin and Destination frequencies
temp <- setnames(
  data.table(table(unlist(dt[, .(Origin, Destination)], use.names = FALSE))),
  c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now join it onto the original table:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))

rolling cumulative sums conditional on missing data

I want to calculate rolling cumulative sums by item in a data.table. Sometimes, data is missing for a given time period.
set.seed(8)
item <- c(rep("A",4), rep("B",3))
time <- c(1,2,3,4,1,3,4)
sales <- rpois(7,5)
DT <- data.table(item, time,sales)
For a rolling window of 2 time periods I want the following output:
item time sales sales_rolling2
1: A 1 5 5
2: A 2 3 8
3: A 3 7 10
4: A 4 6 13
5: B 1 4 4
6: B 3 6 6
7: B 4 4 10
Note, that item B has no data at time 2. Thus the result for row 6 just includes the latest observation.
We can use rollsum from library(zoo) to do the rolling sum. Before applying rollsum, I guess we need to create another grouping variable ('indx') based on the 'time' variable, because for item 'B' the time is not continuous, i.e. 2 is missing. We can use diff to create a logical index from the differences of adjacent elements: if a difference is not 1, it returns TRUE, otherwise FALSE. As the diff output is one element shorter than the column, we pad with TRUE and then take the cumsum to create the 'indx' variable.
library(zoo)
DT[, indx:=cumsum(c(TRUE, diff(time)!=1))]
In the second step we use both 'indx' and 'item' as grouping variables and get the rollsum of 'sales' with k = 2, on the condition that the group has more than one element (if (.N > 1)); otherwise we return 'sales' as is. We create 'sales_rolling2' this way and then assign (:=) 'indx' to NULL, as it is not needed in the expected output.
DT[, sales_rolling2 := if (.N > 1) c(sales[1], rollsum(sales, 2)) else sales,
   by = .(indx, item)][, indx := NULL]
# item time sales sales_rolling2
#1: A 1 5 5
#2: A 2 3 8
#3: A 3 7 10
#4: A 4 6 13
#5: B 1 4 4
#6: B 3 6 6
#7: B 4 4 10
Update
As per @Khashaa's suggestion, roll_sum from library(RcppRoll) can be used more effectively, as it works even when the number of rows in a group is less than 'k'. This way we can remove the if/else condition from my previous solution. (Full credit to @Khashaa.)
library(RcppRoll)
DT[, sales_rolling2 := c(sales[1L], roll_sum(sales, 2)), by = .(indx, item)]
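Putting the update together (recreating 'indx' first, since the earlier chain removed it):
library(RcppRoll)
DT[, indx := cumsum(c(TRUE, diff(time) != 1))]
DT[, sales_rolling2 := c(sales[1L], roll_sum(sales, 2)), by = .(indx, item)][, indx := NULL]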

Compute new column based on values in current and following rows with data.table in R

I have already posted this question, unfortunately without any success. In the meantime I discovered the data.table package and tried to balance speed and memory by using the nice 'update by reference' style it offers. I noticed data.table has a pretty lively and helpful community, so I was wondering if someone here can help me further.
The other question I posted goes into more detail; here I would simply like to know how to reference "upcoming" values in the i-expression. That is, I have a data.table with a key, and I want to compute new columns in the j-expression based on subsets of the current and following row values of another column, all grouped by the key.
Something like this:
data[TimeToGo %in% seq(TimeToGo - 1, TimeToGo - 7),
     MinPrice := min(Price),
     by = key(data)]
This would basically go through each value of TimeToGo (like TimeToGo[i]), subset the data.table to rows with TimeToGo between TimeToGo[i]-7 and TimeToGo[i]-1, and pull out the min(Price) from this subset as the new column entry MinPrice (at index i).
This attempt gives back the following error:
Error in seq.default(data$TimeToGo - 1, data$TimeToGo - 7) :
  'from' must be of length 1
I've tried using by = .EACHI and the error persists. On the other hand, a solution using for-loops was very, very slow, so I thought there must be a better way to do this.
I would greatly appreciate any help, opinion, direction, reference for further reading...
Thanks in advance
Use foverlaps:
dt = data.table(time = c(1:4, 10:14), b = 1:9)
dt[, end := time] # necessary for the foverlaps atm
# time b end
#1: 1 1 1
#2: 2 2 2
#3: 3 3 3
#4: 4 4 4
#5: 10 5 10
#6: 11 6 11
#7: 12 7 12
#8: 13 8 13
#9: 14 9 14
intervals = dt[, .(start = time - 1, end = time + 7, idx = .I)]
setkey(intervals, start, end)
foverlaps(dt, intervals, by.x = c('time', 'end'))[, max(b), by = idx]
# idx V1
#1: 1 4
#2: 2 4
#3: 3 5
#4: 4 6
#5: 5 9
#6: 6 9
#7: 7 9
#8: 8 9
#9: 9 9
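For the original MinPrice problem, a non-equi join (data.table 1.9.8+) is another route. A minimal sketch, with hypothetical Key, TimeToGo and Price columns standing in for the question's data:
library(data.table)
data <- data.table(Key      = rep(c("x", "y"), each = 4),
                   TimeToGo = rep(8:5, 2),
                   Price    = c(5, 3, 6, 4, 2, 8, 1, 9))
# one lookup window per row: TimeToGo - 7 up to TimeToGo - 1
lookup <- data[, .(Key, lo = TimeToGo - 7, hi = TimeToGo - 1)]
# for each lookup row, take min(Price) over the rows falling in its window (NA if none do)
data[, MinPrice := data[lookup,
                        on = .(Key, TimeToGo >= lo, TimeToGo <= hi),
                        min(Price), by = .EACHI]$V1]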

Is my way of duplicating rows in data.table efficient?

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.
My approach is as follows: duplicate the annual data for every month and then join the monthly and annual data. Now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way, so some opinions would be great.
Here is an example data.table DT for my annual data and how I currently duplicate rows:
library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each = 3), c(1:3, 1:3), sep = "_"),
                 values = 10:15,
                 startMonth = seq(from = 1, by = 2, length = 6),
                 endMonth = seq(from = 3, by = 3, length = 6))
DT
ID values startMonth endMonth
[1,] a_1 10 1 3
[2,] a_2 11 3 6
[3,] a_3 12 5 9
[4,] b_1 13 7 12
[5,] b_2 14 9 15
[6,] b_3 15 11 18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT, ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
[...]
The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can; just do the following:
#2. Alternative: more intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
This, however, only works because I hardcoded the column names into the list expression. In my real data I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the computed MONTH column as shown above along with all the other columns of DT. .SD seemed to be able to do the trick, but:
DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
So to summarize: I know how it's done, but I was wondering if this is the best way to do it, because I'm still struggling a little with data.table syntax and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What do I miss?
Looking at this, I realized the answer was only possible because ID was a unique key (without duplicates). Here is another answer that handles duplicates. But, by the way, some NAs seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).
library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))
DT[,rep:=1L][c(2,7),rep:=c(2L,3L)] # duplicate row 2 and triple row 7
DT[,num:=1:.N] # to group each row by itself
DT
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 2 1 3
4: 1 3 1 4
5: 2 1 1 5
6: 2 1 1 6
7: 3 2 3 7
DT[,cbind(.SD,dup=1:rep),by="num"]
num x y rep dup
1: 1 1 1 1 1
2: 2 1 1 1 NA # why these NA?
3: 2 1 1 2 NA
4: 3 1 2 1 1
5: 4 1 3 1 1
6: 5 2 1 1 1
7: 6 2 1 1 1
8: 7 3 2 3 1
9: 7 3 2 3 2
10: 7 3 2 3 3
Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD):
DT[rep(num,rep)]
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 1 2 2
4: 1 2 1 3
5: 1 3 1 4
6: 2 1 1 5
7: 2 1 1 6
8: 3 2 3 7
9: 3 2 3 7
10: 3 2 3 7
Note that in this example data the column rep happens to have the same name as the base function rep().
Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.
In the meantime, try :
DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
Also, just to check: have you seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then simply roll-join to it, as sketched below. Your example data has overlapping month ranges, though, which complicates things.
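A minimal rolling-join sketch of that idea, assuming a single non-overlapping startMonth per annual row (hypothetical data):
annual  <- data.table(startMonth = c(1, 5, 9), values = 10:12, key = "startMonth")
monthly <- data.table(startMonth = 1:12)
annual[monthly, roll = TRUE]
# each of the 12 months picks up the most recent preceding annual row:
# months 1-4 get value 10, months 5-8 get 11, months 9-12 get 12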
Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12), where 12 gives 12 'month' rows for each 'year' row.
zexpand <- function(inarray, fact=2, interp=FALSE, ...) {
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be > 0')
  if (yfact < 1) stop('fact[2] must be > 0')
  # new non-loop method, seems to work just ducky
  # column expansion: repeat each element within a row xfact times
  bigtmp <- matrix(rep(t(inarray), each=xfact),
                   nrow(inarray), ncol(inarray)*xfact, byrow=TRUE)
  # row expansion: repeat each row yfact times
  bigx <- t(matrix(rep(bigtmp, each=yfact),
                   ncol(bigtmp), nrow(bigtmp)*yfact, byrow=TRUE))
  return(invisible(bigx))
}
The fastest and most succinct way of doing it (with + 1L because the startMonth:endMonth range is inclusive):
DT[rep(1:nrow(DT), endMonth - startMonth + 1L)]
We can also enumerate the duplicated rows within each group:
dd <- DT[rep(1:nrow(DT), endMonth - startMonth + 1L)]
dd[, nn := 1:.N, by = ID]
dd
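If the goal is the MONTH column from the earlier alternatives, it can be derived from the within-group counter (a sketch reusing dd and nn from above):
dd[, MONTH := startMonth + nn - 1L]
dd[, .(ID, MONTH, values, startMonth, endMonth)]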
