Why is class(.SD) on a data.table showing "data.frame"? - r

colnames() seems to be enumerating all columns per group as expected, but class() shows exactly two rows per group! And one of them is data.frame
> dt <- data.table("a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=a]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=x]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e

.SD stands for column Subset of Data.table, thus it is also a data.table object. And because data.table is a data.frame class(.SD) returns a length 2 character vector for each group, making it a little bit confusing if you expect single row for each group.
To avoid such confusion you can just wrap results into another list, enforcing single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame

Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data table by definition (.SD is a data.table containing the Subset of x's Data for each group)

Related

R data.table "j" reference to "by" variables very unintuitive?

I'm just doing the data.table datacamp excercises and there is something which really disturbes my sense for logic.
Somehow columns which are refered to by the "by" operator are treated different to other columns?
The used data table is the following:
DT
x y z
1: 2 1 2
2: 1 3 4
3: 2 5 6
4: 1 7 8
5: 2 9 10
6: 2 11 12
7: 1 13 14
When I enter DT[,sum(x),x] I would expect:
x V1
1: 2 8
2: 1 3
but I get:
x V1
1: 2 2
2: 1 1
for other columns I get the group sum as I would expect it:
> DT[,sum(y),x]
x V1
1: 2 26
2: 1 23
One way to fix this would be to name the grouping variable with a different name
setnames(DT[, sum(x), .(xN=x)], "xN", "x")[]
# x V1
#1: 2 8
#2: 1 3

data.table and pmin with na.rm=TRUE argument

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post row-by-row operations and updates in data.table) but with a character list of columns using something like the with=FALSE syntax, and with the na.rm=TRUE argument.
DT <- data.table(x = c(1,1,2,3,4,1,9),
y = c(2,4,1,2,5,6,6),
z = c(3,5,1,7,4,5,3),
a = c(1,3,NA,3,5,NA,2))
> DT
x y z a
1: 1 2 3 1
2: 1 4 5 3
3: 2 1 1 NA
4: 3 2 7 3
5: 4 5 4 5
6: 1 6 5 NA
7: 9 6 3 2
I can calculate the minimum across rows using columns directly:
DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]
giving
> DT
x y z a min_val
1: 1 2 3 1 1
2: 1 4 5 3 1
3: 2 1 1 NA 1
4: 3 2 7 3 2
5: 4 5 4 5 4
6: 1 6 5 NA 1
7: 9 6 3 2 2
However, I am trying to do this over an automatically generated large set of columns, and I want to be able to do this across this arbitrary list of columns, stored in a col_names variable, col_names <- c("a","y","z')
I can do this:
DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]
But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as
DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]
but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.
Any ideas?
If we need to get the minimum value of each row of the whole dataset, use the pmin, on .SD concatenate the na.rm=TRUE as a list with .SD for the do.call(pmin.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
# x y z a col_min
#1: 1 2 3 1 1
#2: 1 4 5 3 1
#3: 2 1 1 NA 1
#4: 3 2 7 3 2
#5: 4 5 4 5 4
#6: 1 6 5 NA 1
#7: 9 6 3 2 2
If we want only to do this only for a subset of column names stored in 'col_names', use the .SDcols.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))),
.SDcols= col_names]

Assigning an unique identification variable across repeated values

I will create a simple example of some dummy data:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
object <- c(1,1,2,1,1,1,1,2,3,1,1,1,2)
df1 <- data.frame(case, object)
Now for each unique case and object value, I want to create a corresponding unique numerical value (an identifier)
df1$UNIQ_ID <- ........
The end result should take the following values c(1,1,2,3,3,4,4,5,6,7,7,8,9) as when
unique(df1$object[df1$case=='a'])
unique(df1$object[df1$case=='b'])
I have though of using dpylr and group_by(case)
We can use the .GRP from data.table after grouping by 'case' and 'object' on a data.table object (setDT(df1)).
library(data.table)
setDT(df1)[,UNIQ_ID:= .GRP ,.(case, object)]
df1
# case object UNIQ_ID
# 1: a 1 1
# 2: a 1 1
# 3: a 2 2
# 4: b 1 3
# 5: b 1 3
# 6: c 1 4
# 7: c 1 4
# 8: c 2 5
# 9: c 3 6
#10: d 1 7
#11: d 1 7
#12: e 1 8
#13: e 2 9
A base R option would be
grp <- interaction(df1)
as.numeric(factor(grp, levels= unique(grp)))
#[1] 1 1 2 3 3 4 4 5 6 7 7 8 9

Data.table summary statistics from n first observations per group

I'd like to use data.table to make summary statistics based on only the first n observations found for each group. I have one solution that works below but I have a nagging feeling that this might be written as a one-liner in data.table but I cannot find out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?
The one call solution is
DT[, .(my = mean(y[1:2])), by = grp]

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which is described in the help file to "data.table" as .N is an integer, length 1, containing the number of rows in the group.
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4

Resources