data.table summary statistics from the first n observations per group

I'd like to use data.table to compute summary statistics based on only the first n observations in each group. I have a solution that works (below), but I have a nagging feeling that it could be written as a one-liner in data.table; I just cannot figure out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?

The one-call solution is to subset y inside j, so that the mean is taken over the first two values of each group:
DT[, .(my = mean(y[1:2])), by = grp]
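One caveat with y[1:2]: if a group has fewer than two rows, indexing past the end of y yields NA, and the mean becomes NA. A minimal sketch, assuming you would rather average whatever rows exist, uses head() instead. The variable n and the extra single-row group 3 are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Illustrative data: groups 1 and 2 as in the question, plus a
# hypothetical group 3 that has only one row
DT <- data.table(y = c(1:10, 100L), grp = c(rep(1:2, 5), 3L))
n <- 2  # number of leading observations to summarise per group

# y[1:n] pads short groups with NA, so group 3 ends up with NA
res_index <- DT[, .(my = mean(y[1:n])), by = grp]

# head(y, n) returns at most n values, so group 3 uses its single value
res_head <- DT[, .(my = mean(head(y, n))), by = grp]
```

Both forms agree whenever every group has at least n rows; they differ only in how short groups are treated.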

Related

Why is class(.SD) on a data.table showing "data.frame"?

colnames() enumerates all columns per group, as expected, but class() returns exactly two rows per group, and one of them is data.frame!
> dt <- data.table("a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=a]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=x]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data: it holds the columns of the data.table for the current group, so it is itself a data.table object. And because every data.table is also a data.frame, class(.SD) returns a length-2 character vector for each group, which is confusing if you expect a single row per group.
To avoid this confusion, wrap the result in another list, which forces a single row per group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD too, because .SD is a data.table by definition (.SD is a data.table containing the Subset of x's Data for each group).
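As a small sketch of the point above: since class(.SD) has two elements, one row per group can be obtained either by collapsing the vector into a single string or by testing class membership directly. The column names cls, is_dt and is_df are illustrative choices, not anything from the original posts.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 1:3)

# Collapse the two-element class vector into one string per group
cls <- dt[, .(class = paste(class(.SD), collapse = ", ")), by = x]

# Test class membership instead of inspecting class() directly
flags <- dt[, .(is_dt = is.data.table(.SD),
                is_df = is.data.frame(.SD)), by = x]
```

Membership tests such as is.data.frame() are usually the safer way to branch on an object's class, because they do not depend on the length or order of the class vector.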

How to set a column to the number of rows based on a condition in the same row?

I have the following data.table
set.seed(1)
DT <- data.table(VAL = sample(c(1, 2, 3), 10, replace = TRUE))
VAL
1: 1
2: 2
3: 2
4: 3
5: 1
6: 3
7: 3
8: 2
9: 2
10: 1
Within each number in VAL I want to:
Count the number of records/rows
Create a row index (counter) of the first, second, third occurrence, etc.
At the end I want the result
VAL COUNT IDX
1: 1 3 1
2: 2 4 1
3: 2 4 2
4: 3 3 1
5: 1 3 2
6: 3 3 2
7: 3 3 3
8: 2 4 3
9: 2 4 4
10: 1 3 3
where "COUNT" is the number of records/rows for each "VAL", and "IDX" is the row index within each "VAL".
I tried to work with which and length using .I:
DT[, list(COUNT = length(VAL == VAL[.I]),
IDX = which(which(VAL == VAL[.I]) == .I))]
but this does not work, because .I is a vector of all row indices rather than the index of the current row, so I guess one must use .I[]. Inside .I[], though, I face the same problem: I do not have the index of the current row. I also know (from reading the data.table FAQ and following the posts here) that looping through rows should be avoided if possible.
So, what's the data.table way?
Using .N...
DT[ , `:=`( COUNT = .N , IDX = 1:.N ) , by = VAL ]
# VAL COUNT IDX
# 1: 1 3 1
# 2: 2 4 1
# 3: 2 4 2
# 4: 3 3 1
# 5: 1 3 2
# 6: 3 3 2
# 7: 3 3 3
# 8: 2 4 3
# 9: 2 4 4
#10: 1 3 3
.N is the number of records in each group, with groups defined by "VAL".
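For reference, a sketch of the same idea with the values typed in explicitly (sample() can return different numbers across R versions even with the same seed, so the VAL column printed above is hard-coded here). It also shows rowid(), which is available in reasonably recent data.table releases and gives the within-group counter without an explicit by; the column name IDX2 is only for comparison.

```r
library(data.table)

# Fixed values matching the VAL column printed in the question
DT <- data.table(VAL = c(1, 2, 2, 3, 1, 3, 3, 2, 2, 1))

DT[, COUNT := .N, by = VAL]          # group size, repeated on every row of the group
DT[, IDX := seq_len(.N), by = VAL]   # running counter within each group
DT[, IDX2 := rowid(VAL)]             # same counter, computed without by
```

seq_len(.N) behaves like 1:.N for non-empty groups and is a common defensive habit, since 1:0 would count backwards.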

data.table and pmin with na.rm=TRUE argument

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post row-by-row operations and updates in data.table) but with a character list of columns using something like the with=FALSE syntax, and with the na.rm=TRUE argument.
DT <- data.table(x = c(1,1,2,3,4,1,9),
y = c(2,4,1,2,5,6,6),
z = c(3,5,1,7,4,5,3),
a = c(1,3,NA,3,5,NA,2))
> DT
x y z a
1: 1 2 3 1
2: 1 4 5 3
3: 2 1 1 NA
4: 3 2 7 3
5: 4 5 4 5
6: 1 6 5 NA
7: 9 6 3 2
I can calculate the minimum across rows using columns directly:
DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]
giving
> DT
x y z a min_val
1: 1 2 3 1 1
2: 1 4 5 3 1
3: 2 1 1 NA 1
4: 3 2 7 3 2
5: 4 5 4 5 4
6: 1 6 5 NA 1
7: 9 6 3 2 2
However, I need to do this over a large, automatically generated set of columns, so I want to compute the minimum across an arbitrary list of columns stored in a col_names variable, e.g. col_names <- c("a","y","z")
I can do this:
DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]
But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as
DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]
but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.
Any ideas?
To get the minimum of each row across the whole dataset, call pmin through do.call on .SD, and pass na.rm=TRUE by concatenating it, as a list, to .SD so that it becomes one more named argument:
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
# x y z a col_min
#1: 1 2 3 1 1
#2: 1 4 5 3 1
#3: 2 1 1 NA 1
#4: 3 2 7 3 2
#5: 4 5 4 5 4
#6: 1 6 5 NA 1
#7: 9 6 3 2 2
If we want to do this only for a subset of columns whose names are stored in 'col_names', use .SDcols.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))),
.SDcols= col_names]
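Put together as a self-contained sketch, using the question's data and col_names <- c("a", "y", "z"). do.call() passes each element of the list as a separate argument to pmin, and the named element na.rm = TRUE is matched to pmin's na.rm parameter.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 4, 1, 9),
                 y = c(2, 4, 1, 2, 5, 6, 6),
                 z = c(3, 5, 1, 7, 4, 5, 3),
                 a = c(1, 3, NA, 3, 5, NA, 2))
col_names <- c("a", "y", "z")

# .SD contains only the columns named in .SDcols; appending
# list(na.rm = TRUE) adds one more named argument to the pmin call
DT[, col_min := do.call(pmin, c(.SD, list(na.rm = TRUE))),
   .SDcols = col_names]
```

Note that x is excluded here, so the result differs from min_val in rows where x held the smallest value.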

Replicating rows in data.table by column value

I have a dataset that is structured as following:
data <- data.table(ID=1:10,Tenure=c(2,3,4,2,1,1,3,4,5,2),Var=rnorm(10))
ID Tenure Var
1: 1 2 -0.72892371
2: 2 3 -1.73534591
3: 3 4 0.47007030
4: 4 2 1.33173044
5: 5 1 -0.07900914
6: 6 1 0.63493316
7: 7 3 -0.62710577
8: 8 4 -1.69238758
9: 9 5 -0.85709328
10: 10 2 0.10716830
I need to replicate each row N = Tenure times, e.g. I need to replicate the first row 2 times (since Tenure = 2).
I need my transformed dataset to look like the following:
setkey(data,ID)
print(data[,.(ID=rep(ID,Tenure))][data][, Indx := 1:.N, by=ID])
ID Tenure Var Indx
1: 1 2 -0.7289237 1
2: 1 2 -0.7289237 2
3: 2 3 -1.7353459 1
4: 2 3 -1.7353459 2
5: 2 3 -1.7353459 3
6: 3 4 0.4700703 1
...
...
Is there a more efficient way (a more data.table way) to do this? My way is pretty slow. I was thinking there should be a way to do this with a by-without-by merge using .EACHI?
I don't think using a key/merge is helpful here. Just expand by passing a vector of row indices:
DT <- data[rep(1:.N,Tenure)][,Indx:=1:.N,by=ID]
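To see why this works, here is a small sketch with hand-picked values (the four-row table and the name expanded are illustrative, not the question's data). rep() with a vector of counts repeats row index i exactly Tenure[i] times, and subsetting by that index vector duplicates the rows.

```r
library(data.table)

# Small illustrative table with fixed values instead of rnorm()
data <- data.table(ID = 1:4,
                   Tenure = c(2, 3, 1, 2),
                   Var = c(0.5, -1.2, 0.3, 2.0))

# rep(seq_len(.N), Tenure) yields 1 1 2 2 2 3 4 4
expanded <- data[rep(seq_len(.N), Tenure)]

# Counter within each ID; assumes ID is unique per original row
expanded[, Indx := seq_len(.N), by = ID]
```

The resulting row count equals sum(Tenure), and no key or join is involved.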
You could try:
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE)[,Indx:=1:.N,by=ID][]
Or
library(dplyr)
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE) %>%
group_by(ID) %>%
mutate(Indx = row_number(Tenure))
Which gives:
ID Tenure Var Indx
1: 1 2 -0.8808717 1
2: 1 2 -0.8808717 2
3: 2 3 0.5962590 1
4: 2 3 0.5962590 2
5: 2 3 0.5962590 3
6: 3 4 0.1197176 1
7: 3 4 0.1197176 2
8: 3 4 0.1197176 3
9: 3 4 0.1197176 4
10: 4 2 -0.2821739 1

Summary of a data.table using internal order

I need a way to calculate the third column of this data.table.
DT=data.table(group=c(1,1,0,0,1,1),x=c(1,1,1,2,2,2),ResultNeeded=c(2,2,3,3,4,4))
I am guessing the following can be modified to get the result I need; I just don't know how to do it.
DT[,sum:=sum(x),by=group]
data.table provides a function, rleid() (introduced in the development version v1.9.5), that accomplishes this in a slightly cleaner way:
require(data.table) ## v1.9.5+
DT[, ans := sum(x), by=rleid(group)]
# group x ResultNeeded ans
# 1: 1 1 2 2
# 2: 1 1 2 2
# 3: 0 1 3 3
# 4: 0 2 3 3
# 5: 1 2 4 4
# 6: 1 2 4 4
rleid() groups consecutive runs of identical values together (named after the base function rle()).
You can install the development version by following the instructions here.
You're close, you just need the correct grouping:
DT[, sum := sum(x), by = cumsum(c(FALSE, diff(group) != 0))]
# group x ResultNeeded sum
#1: 1 1 2 2
#2: 1 1 2 2
#3: 0 1 3 3
#4: 0 2 3 3
#5: 1 2 4 4
#6: 1 2 4 4
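The two answers build the same run identifier in different ways. The sketch below stores both as columns so they can be compared side by side; the column names run_rleid and run_cumsum are chosen here for illustration only.

```r
library(data.table)

DT <- data.table(group = c(1, 1, 0, 0, 1, 1), x = c(1, 1, 1, 2, 2, 2))

# rleid() numbers consecutive runs of equal values starting at 1
DT[, run_rleid := rleid(group)]

# diff() != 0 flags each change point; cumsum() turns the flags
# into a run counter starting at 0
DT[, run_cumsum := cumsum(c(FALSE, diff(group) != 0))]

# Sum x within each run rather than within each distinct group value
DT[, ans := sum(x), by = run_rleid]
```

The diff() version needs a numeric column, whereas rleid() also accepts character or factor columns and several columns at once.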
