Summary of a data.table using internal order - r

I need a way to calculate the third column of this data.table.
DT=data.table(group=c(1,1,0,0,1,1),x=c(1,1,1,2,2,2),ResultNeeded=c(2,2,3,3,4,4))
I am guessing the following can be modified to get the result I need.
DT[,sum:=sum(x),by=group]. I just don't know how to do it.

In the development version of data.table, v1.9.5, there's a function rleid(), that helps accomplish this in a slightly cleaner way:
require(data.table) ## v1.9.5+
DT[, ans := sum(x), by=rleid(group)]
# group x ResultNeeded ans
# 1: 1 1 2 2
# 2: 1 1 2 2
# 3: 0 1 3 3
# 4: 0 2 3 3
# 5: 1 2 4 4
# 6: 1 2 4 4
rleid() groups consecutive runs of identical values together (named after the base function rle()).
You can install the development version by following the instructions here.
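To see what rleid() actually hands to by, here is a minimal sketch. It assumes a data.table version that exports rleid() (the answer above cites v1.9.5+), and the base R equivalent built from rle() is only my illustration, not part of the original answer:

```r
library(data.table)  # assumes a version that exports rleid()

group <- c(1, 1, 0, 0, 1, 1)

# Run-length ids: the id increases by one each time the value changes
ids <- rleid(group)
print(ids)  # 1 1 2 2 3 3

# Base R equivalent: repeat each run's index by that run's length
runs <- rle(group)
ids_base <- rep(seq_along(runs$lengths), runs$lengths)
print(identical(ids, ids_base))
```

The second run of 1s gets its own id (3), which is why it is summed separately from the first.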

You're close, you just need the correct grouping:
DT[, sum := sum(x), by = cumsum(c(FALSE, diff(group) != 0))]
# group x ResultNeeded sum
#1: 1 1 2 2
#2: 1 1 2 2
#3: 0 1 3 3
#4: 0 2 3 3
#5: 1 2 4 4
#6: 1 2 4 4
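If the grouping expression looks opaque, it may help to build it step by step. This is just base R on the question's group vector; the intermediate variable names are mine:

```r
group <- c(1, 1, 0, 0, 1, 1)

# TRUE wherever a value differs from the one before it
changed <- diff(group) != 0

# diff() is one element shorter than its input, so pad the front;
# the first row never starts a "new" run
starts <- c(FALSE, changed)

# Counting the changes so far gives one id per consecutive run
run_id <- cumsum(starts)
print(run_id)  # 0 0 1 1 2 2
```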

Related

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to be enumerating all columns per group as expected, but class() shows exactly two rows per group! And one of them is data.frame
> dt <- data.table("a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=a]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=x]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data.table, so it is itself a data.table object. Because every data.table is also a data.frame, class(.SD) returns a length-2 character vector for each group, which is confusing if you expect a single row per group.
To avoid the confusion, wrap the result in a list, which forces a single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data.table by definition (.SD is a data.table containing the Subset of x's Data for each group).
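If a list column is more than you need, another option is to collapse the class vector into a single string per group. A small sketch under that assumption (the column name class is arbitrary):

```r
library(data.table)

dt <- data.table(x = 1:3, y = 1:3)

# class(.SD) has two elements, so paste them into one string per group
res <- dt[, .(class = paste(class(.SD), collapse = ", ")), by = x]
print(res)

# inherits() tests one class at a time and returns a single logical
is_df <- inherits(dt, "data.frame")
print(is_df)
```

This gives one row per group, and inherits() is usually the safer test when you only care whether an object is a data.frame.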

data.table and pmin with na.rm=TRUE argument

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post row-by-row operations and updates in data.table) but with a character list of columns using something like the with=FALSE syntax, and with the na.rm=TRUE argument.
DT <- data.table(x = c(1,1,2,3,4,1,9),
y = c(2,4,1,2,5,6,6),
z = c(3,5,1,7,4,5,3),
a = c(1,3,NA,3,5,NA,2))
> DT
x y z a
1: 1 2 3 1
2: 1 4 5 3
3: 2 1 1 NA
4: 3 2 7 3
5: 4 5 4 5
6: 1 6 5 NA
7: 9 6 3 2
I can calculate the minimum across rows using columns directly:
DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]
giving
> DT
x y z a min_val
1: 1 2 3 1 1
2: 1 4 5 3 1
3: 2 1 1 NA 1
4: 3 2 7 3 2
5: 4 5 4 5 4
6: 1 6 5 NA 1
7: 9 6 3 2 2
However, I am trying to do this over a large, automatically generated set of columns, and I want to be able to do it across an arbitrary list of columns stored in a col_names variable: col_names <- c("a","y","z")
I can do this:
DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]
But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as
DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]
but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.
Any ideas?
To get the minimum of each row across the whole dataset, call pmin through do.call and concatenate .SD with list(na.rm=TRUE), so that na.rm is passed to pmin as a named argument.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
# x y z a col_min
#1: 1 2 3 1 1
#2: 1 4 5 3 1
#3: 2 1 1 NA 1
#4: 3 2 7 3 2
#5: 4 5 4 5 4
#6: 1 6 5 NA 1
#7: 9 6 3 2 2
If we want to do this only for a subset of columns whose names are stored in 'col_names', use .SDcols.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))),
.SDcols= col_names]
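As for why the function(x) attempt in the question errors: do.call passes every column as a separate argument, so a one-argument function cannot receive them. A wrapper that takes ... should work instead. A minimal sketch on the question's data, where pmin_na is a hypothetical helper name of mine:

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 4, 1, 9),
                 y = c(2, 4, 1, 2, 5, 6, 6),
                 z = c(3, 5, 1, 7, 4, 5, 3),
                 a = c(1, 3, NA, 3, 5, NA, 2))
col_names <- c("a", "y", "z")

# Hypothetical helper: forwards all columns to pmin() and fixes na.rm
pmin_na <- function(...) pmin(..., na.rm = TRUE)

# .SD holds only the columns named in .SDcols
DT[, col_min := do.call(pmin_na, .SD), .SDcols = col_names]
print(DT$col_min)  # 1 3 1 2 4 5 2
```

Both forms do the same thing; the c(.SD, list(na.rm=TRUE)) version just avoids defining a helper.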

Data.table summary statistics from n first observations per group

I'd like to use data.table to make summary statistics based on only the first n observations found for each group. I have one solution that works below but I have a nagging feeling that this might be written as a one-liner in data.table but I cannot find out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?
The one call solution is
DT[, .(my = mean(y[1:2])), by = grp]
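One caveat with y[1:2]: if a group has fewer than two rows, indexing past the end yields NA and so does the mean. A hedged variant using head(), with n as an illustrative variable of mine:

```r
library(data.table)

DT <- data.table(y = 1:10, grp = rep(1:2, 5))
n <- 2  # number of leading observations to use per group

# head() returns at most n values, so short groups are averaged over
# what they have instead of being padded with NA
res <- DT[, .(my = mean(head(y, n))), by = grp]
print(res)
```

On this data it returns the same means (2 and 3) as the two-step version in the question.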

Replicating rows in data.table by column value

I have a dataset that is structured as following:
data <- data.table(ID=1:10,Tenure=c(2,3,4,2,1,1,3,4,5,2),Var=rnorm(10))
ID Tenure Var
1: 1 2 -0.72892371
2: 2 3 -1.73534591
3: 3 4 0.47007030
4: 4 2 1.33173044
5: 5 1 -0.07900914
6: 6 1 0.63493316
7: 7 3 -0.62710577
8: 8 4 -1.69238758
9: 9 5 -0.85709328
10: 10 2 0.10716830
I need to replicate each row N=Tenure times, e.g. I need to replicate the first row 2 times (since Tenure = 2).
I need my transformed dataset to look like the following:
setkey(data,ID)
print(data[,.(ID=rep(ID,Tenure))][data][, Indx := 1:.N, by=ID])
ID Tenure Var Indx
1: 1 2 -0.7289237 1
2: 1 2 -0.7289237 2
3: 2 3 -1.7353459 1
4: 2 3 -1.7353459 2
5: 2 3 -1.7353459 3
6: 3 4 0.4700703 1
...
...
Is there a more efficient way (a more data.table way) to do this? My way is pretty slow. I was thinking there should be a way to do this using a by-without-by merge using .EACHI?
I don't think using a key/merge is helpful here. Just expand by passing a vector of row indices:
DT <- data[rep(1:.N,Tenure)][,Indx:=1:.N,by=ID]
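Here is the same idea on a small deterministic table so the result is easy to check. The three-row data and its Var values are made up for illustration, and seq_len(.N) is used in place of 1:.N only because it also behaves for empty input:

```r
library(data.table)

# Small stand-in for the question's data (values are invented)
data <- data.table(ID = 1:3, Tenure = c(2, 3, 1), Var = c(0.5, -1.2, 0.3))

# Repeat each row index Tenure times, then number the copies within each ID
DT <- data[rep(seq_len(.N), Tenure)][, Indx := seq_len(.N), by = ID]
print(DT)
```

The result has sum(Tenure) rows, with Indx restarting at 1 for every ID.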
You could try:
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE)[,Indx:=1:.N,by=ID][]
Or
library(dplyr)
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE) %>%
group_by(ID) %>%
mutate(Indx = row_number(Tenure))
Which gives:
ID Tenure Var Indx
1: 1 2 -0.8808717 1
2: 1 2 -0.8808717 2
3: 2 3 0.5962590 1
4: 2 3 0.5962590 2
5: 2 3 0.5962590 3
6: 3 4 0.1197176 1
7: 3 4 0.1197176 2
8: 3 4 0.1197176 3
9: 3 4 0.1197176 4
10: 4 2 -0.2821739 1

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
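The answer above assumes DF already exists. A self-contained sketch, assuming the data frame holds the A and B columns from the question's table:

```r
# Rebuild the question's table without the target column C
DF <- data.frame(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# ave() applies FUN within each level of the grouping vector and returns
# a vector as long as the input, so it can be assigned straight back
DF$C <- ave(DF$A, DF$A, FUN = length)
print(DF)
```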
Here is one approach using data.table that makes use of .N, which the data.table help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4
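The difference between counting and re-inserting the counts is just whether := is used. A short sketch contrasting the two on the same data:

```r
library(data.table)

DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# Without := the result is a summary: one row per group
counts <- DT[, .N, by = A]
print(counts)

# With := the group count is recycled onto every row of the original table
DT[, C := .N, by = A]
print(DT)
```

So no merge back is needed; := by group keeps all the original rows.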
