Summing sequences in R using data.table

I am trying to sum pieces of a series using data.table in r. The idea is that I define a start index and an end index as columns in the table, then make a third column for "sum of the series between start and end indexes."
series = c(1,2,3,4,5,6)
a = data.table(start=c(1,2,3),end=c(4,5,6))
a[,S := sum(series[start:end])]
a
Expected result:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Actual result:
Warning messages:
1: In start:end : numerical expression has 3 elements: only the first used
2: In start:end : numerical expression has 3 elements: only the first used
> a
start end S
1: 1 4 10
2: 2 5 10
3: 3 6 10
What am I missing here? If I just do a[,S := start+end] the code executes as one would expect.

An option is to loop over the 'start' and 'end' columns with Map, get the sequence (:) of each pair of elements, take the sum, and unlist the resulting list so it can be assigned (:=) to a new column.
a[, S := unlist(Map(function(x, y) sum(x:y), start, end))]
Output:
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18
The : operator is not vectorized over its operands, i.e. it takes just a single value on either side, and that is why the warning was shown.
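A quick way to see that warning outside of data.table is to call : on two vectors directly at the console (an illustration, not part of the original answer):
c(1, 2, 3):c(4, 5, 6)
# [1] 1 2 3 4
# Warning messages:
# 1: In c(1, 2, 3):c(4, 5, 6) :
#   numerical expression has 3 elements: only the first used
# 2: In c(1, 2, 3):c(4, 5, 6) :
#   numerical expression has 3 elements: only the first used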

Maybe you can try cumsum as below, which lets you apply vectorized operations within data.table. Here cs[end] is the running total of series up to end, and c(0,cs)[start] is the running total up to start - 1, so their difference is the sum of series[start:end].
cs <- cumsum(series)
a[,S := cs[end]-c(0,cs)[start]]
which gives
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18

Since series here is simply 1:6 (each value equals its index), you can use the arithmetic series formula:
a[, S := (end - start + 1) * (start + end) / 2]
Gives:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
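As a quick sanity check of the formula on the second row (start = 2, end = 5), shown here as plain console arithmetic:
(5 - 2 + 1) * (2 + 5) / 2
# [1] 14
sum(series[2:5])
# [1] 14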

Your code would work if you make this a row-wise operation, so that each start and end represents a single value at a time.
library(data.table)
a[,S := sum(series[start:end]), 1:nrow(a)]
a
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18

Related

Find start and end positions/indices of runs/consecutive values

Problem: Given an atomic vector, find the start and end indices of runs in the vector.
Example vector with runs:
x = rev(rep(6:10, 1:5))
# [1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
Output from rle():
rle(x)
# Run Length Encoding
# lengths: int [1:5] 5 4 3 2 1
# values : int [1:5] 10 9 8 7 6
Desired output:
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
The base rle class doesn't appear to provide this functionality, but the class Rle and function rle2 do. However, given how minor the functionality is, sticking to base R seems more sensible than installing and loading additional packages.
There are examples of code snippets (here, here and on SO) which solve the slightly different problem of finding start and end indices for runs which satisfy some condition. I wanted something that would be more general, could be performed in one line, and didn't involve the assignment of temporary variables or values.
Answering my own question because I was frustrated by the lack of search results. I hope this helps somebody!
Core logic:
# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)
# Compute start and end indices of each run (base R only)
end = cumsum(rle_x$lengths)
start = c(1, head(end, -1) + 1)
# Display results
data.frame(start, end)
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
Tidyverse/dplyr way (data frame-centric):
library(dplyr)
rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1, 2, 4, 3))  # To re-order start before end for display
Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
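For example (a small sketch reusing the rle_x, start and end objects from above), to keep only the runs whose value exceeds 8:
keep <- rle_x$values > 8
data.frame(value = rle_x$values, start, end)[keep, ]
#   value start end
# 1    10     1   5
# 2     9     6   9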
A data.table possibility, where .I and .N are used to pick relevant indices, per group defined by rleid runs.
library(data.table)
data.table(x)[ , .(start = .I[1], end = .I[.N]), by = rleid(x)][, rleid := NULL][]
# start end
# 1: 1 5
# 2: 6 9
# 3: 10 12
# 4: 13 14
# 5: 15 15

Removing all rows under a specified row in a time series

I'm trying to analyze game data, but I need to remove all rows after a specified row.
In the following case I want to remove all rows after the EVENT "Die" for each user. Data is sorted by UID, TIME.HOUR.
df:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
1 8 Run
1 9 Run
2 14 Jump
2 15 Die
2 16 Run
2 17 Run
Expected result:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
2 14 Jump
2 15 Die
I think I'm on the right track with the code below, but am struggling with the next step.
args <- which(df$EVENT== "Die")
df[,c(sapply(args, function(x) ???), by = UID] #seq? range?
Thank you.
We can use data.table. Convert the 'data.frame' to a 'data.table', group by 'UID', take a double cumsum of the logical vector (EVENT == "Die"), and check whether it is less than 2 to subset the data.table (.SD).
library(data.table)
setDT(df)[, .SD[cumsum(cumsum(EVENT=="Die"))<2] , UID]
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die
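To see why the double cumsum works, here is what the intermediate vectors look like for UID 1 (an illustration added for clarity, not part of the original answer):
ev <- c("Run", "Run", "Run", "Die", "Run", "Run")   # EVENT for UID 1
cumsum(ev == "Die")                                 # 0 0 0 1 1 1
cumsum(cumsum(ev == "Die"))                         # 0 0 0 1 2 3
cumsum(cumsum(ev == "Die")) < 2                     # TRUE TRUE TRUE TRUE FALSE FALSE
The condition stays TRUE up to and including the "Die" row and turns FALSE only afterwards, which is exactly the subset we want.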
Or a faster approach: get the row indices, extract that column ($V1), and use it to subset the data
setDT(df)[df[, .I[cumsum(cumsum(EVENT=="Die"))<2] , UID]$V1]
Or a modification of @Psidom's approach
setDT(df)[df[, .I[seq(match("Die", EVENT, nomatch = .N))] , UID]$V1]
Or use dplyr
library(dplyr)
df %>%
group_by(UID) %>%
slice(seq(match("Die", EVENT, nomatch = n())))
# UID TIME.HOUR EVENT
# <int> <int> <chr>
#1 1 5 Run
#2 1 5 Run
#3 1 6 Run
#4 1 7 Die
#5 2 14 Jump
#6 2 15 Die
In case we need a data.frame output, chain with %>% as.data.frame() (from @R.S.'s comment).
This probably isn't so efficient, but you could do a fancy join:
mdf = df[EVENT == "Die", head(.SD, 1L), by=UID, .SDcols = "TIME.HOUR"]
df[!mdf, on=.(UID, TIME.HOUR > TIME.HOUR)]
UID TIME.HOUR EVENT
1: 1 5 Run
2: 1 5 Run
3: 1 6 Run
4: 1 7 Die
5: 2 14 Jump
6: 2 15 Die
You don't actually need to save the mdf table as a separate object, of course.
How it works
x[!i], where i is another data.table or a list of vectors, is an anti-join, telling R to exclude rows of x that match i, similar to how it works with vectors (where i would have to be a logical vector).
The on=.(UID, TIME.HOUR > TIME.HOUR) option tells R that we're doing a "non-equi join." In the TIME.HOUR > TIME.HOUR part, the column on the left-hand side of the operator refers to x (here df) and the one on the right-hand side refers to i (here mdf), so a row of df matches when its TIME.HOUR is greater than the "Die" time stored in mdf for that UID.
Combining these two, we're excluding exactly the rows that meet the criteria specified in on=, i.e. everything after the first "Die" within each UID.
Side note: I'm not sure we have a better name for this than "non-equi join", but it is the term the data.table documentation uses.
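As a tiny standalone illustration of the anti-join part (toy tables invented for this example):
library(data.table)
X <- data.table(id = c(1, 2, 3), val = c(10, 20, 30))
Y <- data.table(id = 2)
X[!Y, on = "id"]   # keeps only the rows of X whose id is not in Y (ids 1 and 3)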
I borrowed from both Psidom's and akrun's answers, using head and an inequality join. One (maybe?) advantage here is that head(.SD, 1L) is optimized while head(.SD, expr) is not yet.
Another option: use head() with match() (to find the first "Die" index):
# if no match is found within the group, return the whole group
df[, head(.SD, match("Die", EVENT, nomatch = .N)), UID]
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die

How to calculate difference scores with R?

So I have 2 sets of data, each comparing a specific category, like so:
Category : Solution 1 : Solution 2
1: 5 : 6
2: 7 : 6
3: 4 : 4
4: 8 : 9
How do I calculate the difference scores using R specifically? Somehow I need to load the data, then calculate Solution 1 - Solution 2, I believe.
We could 'read' the dataset using read.table/read.csv with the appropriate delimiter. Based on the example shown, it is :. After the 'data.frame' object is created ('df1'), we can use transform or within to create the 'Diff' column (i.e. the difference of the "Solution" columns).
df1 <- read.table('file.txt', sep=':', strip.white=TRUE, header=TRUE)
transform(df1, Diff= Solution.1-Solution.2)
# Category Solution.1 Solution.2 Diff
#1 1 5 6 -1
#2 2 7 6 1
#3 3 4 4 0
#4 4 8 9 -1
Or
df1$Diff <- with(df1, Solution.1-Solution.2)
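If the data isn't sitting in a file, a minimal sketch (using the values shown in the question) that builds the data frame directly and computes the difference the same way:
df1 <- data.frame(Category   = 1:4,
                  Solution.1 = c(5, 7, 4, 8),
                  Solution.2 = c(6, 6, 4, 9))
df1$Diff <- df1$Solution.1 - df1$Solution.2
df1
#   Category Solution.1 Solution.2 Diff
# 1        1          5          6   -1
# 2        2          7          6    1
# 3        3          4          4    0
# 4        4          8          9   -1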

Joining tables with identical (non-keyed) column names in R data.table

How do you deal with identically named, non-key columns when joining data.tables? I am looking for something equivalent to the table.field notation in SQL.
For instance, let's say I have a table DT that is repopulated with new data for column v every time period. I also have a table DT_HIST that stores entries from previous time periods (t). I want to find the difference between the current and previous time period for each x.
In this case: DT is time period 3, and DT_HIST has time periods 1 and 2:
DT <- data.table(x=c(1,2,3,4),v=c(20,20,35,30))
setkey(DT,x)
DT_HIST <- data.table(x=rep(seq(1,4,1),2),v=c(40,40,40,40,30,25,45,40),t=c(rep(1,4),rep(2,4)))
setkey(DT_HIST,x)
> DT
x v
1: 1 20
2: 2 20
3: 3 35
4: 4 30
> DT_HIST
x v t
1: 1 40 1
2: 1 30 2
3: 2 40 1
4: 2 25 2
5: 3 40 1
6: 3 45 2
7: 4 40 1
8: 4 40 2
I would like to join DT with DT_HIST[t==2,] on x and calculate the difference in v.
Just joining the tables results in columns v and v.1.
> DT[DT_HIST[t==2],]
x v v.1 t
1: 1 20 30 2
2: 2 20 25 2
3: 3 35 45 2
4: 4 30 40 2
However, I can't find a way to refer to the different v columns when doing the join.
> DT[DT_HIST[t==2],list(delta=v-v.1)]
Error in `[.data.table`(DT, DT_HIST[t == 2], list(delta = v - v.1)) :
object 'v.1' not found
> DT[DT_HIST[t==2],list(delta=v-v)]
x delta
1: 1 0
2: 2 0
3: 3 0
4: 4 0
If this is a duplicate, I apologize. I searched and couldn't find a similar question.
Also, I realize that I can simply rename the columns after joining and then run my desired expression, but I want to know if I'm doing this in the completely wrong way.
You can use i.colname to access a column of the data.table given in i. I see you're using an old data.table version. There have been a few changes since then: duplicated joined column names now get an i. prefix instead of a numeric suffix (making it consistent with the i. way of accessing joined columns), and there is no by-without-by by default anymore.
In the latest version (1.9.3), this is what you get:
DT[DT_HIST[t==2],list(delta = v - i.v)]
# delta
#1: -10
#2: -5
#3: -10
#4: -10
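If you also want to keep x alongside the difference (a small extension of the answer above, relying on the keys already set in the question):
DT[DT_HIST[t==2], .(x, delta = v - i.v)]
#    x delta
# 1: 1   -10
# 2: 2    -5
# 3: 3   -10
# 4: 4   -10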

data.table aggregations that return vectors, such as scale()

I have recently been working with much larger datasets and have started learning and migrating to data.table to improve the performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with.
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I want to simply calculate the mean for each group by category, this works easily enough.
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression subtracting the group mean from each value. The grouping appears to be ignored and I get the function/expression applied to each row instead. The following returns all 100 rows instead of 10 rows (one per category).
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help.
zScore <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N = 100), apparently ignoring the grouping. What am I missing in order to get scale() or a custom function to respect the grouping the way mean() did above?
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data table option you gave gives:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
Which is exactly the same data. However, you'd like to also repeat the value column in your result, and rename the V1 variable with something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the columns you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
Where the named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
# [1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
dt
category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as their input.
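If the goal is to keep all the original columns and simply add the scaled values per group, a common data.table idiom (a sketch building on the dt and zScore objects defined earlier in this thread) is to assign by reference with := inside the grouping:
# adds a new column computed per group; dt keeps all of its rows
dt[, zscorebycategory := zScore(value), by = category]
# scale() returns a one-column matrix, so wrap it if you prefer the base function
dt[, zscore_scale := as.vector(scale(value)), by = category]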
