R data.table "j" reference to "by" variables very unintuitive? - r

I'm just doing the data.table DataCamp exercises and there is something that really disturbs my sense of logic.
Somehow columns that are referred to in the "by" argument are treated differently from other columns?
The data table used is the following:
DT
x y z
1: 2 1 2
2: 1 3 4
3: 2 5 6
4: 1 7 8
5: 2 9 10
6: 2 11 12
7: 1 13 14
When I enter DT[,sum(x),x] I would expect:
x V1
1: 2 8
2: 1 3
but I get:
x V1
1: 2 2
2: 1 1
for other columns I get the group sum as I would expect it:
> DT[,sum(y),x]
x V1
1: 2 26
2: 1 23

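Within j, a grouping variable is a single value per group rather than the full column, so sum(x) simply returns that value. One way to fix this is to give the grouping variable a different name inside by and rename it afterwards: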
setnames(DT[, sum(x), .(xN=x)], "xN", "x")[]
# x V1
#1: 2 8
#2: 1 3
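Since a grouping column has length 1 inside j, another option is to rebuild the group sum from that single value and the group size .N. The following is only a sketch on the question's data; the column name len_x and the variable names len_by and res are illustrative.

```r
library(data.table)

# Same data as in the question above.
DT <- data.table(x = c(2, 1, 2, 1, 2, 2, 1),
                 y = c(1, 3, 5, 7, 9, 11, 13),
                 z = c(2, 4, 6, 8, 10, 12, 14))

# Inside j, the grouping column x is one value per group,
# while a non-grouping column such as y has one value per row.
len_by <- DT[, .(len_x = length(x), len_y = length(y)), by = x]
print(len_by)

# The group sum of x can therefore be computed as value times group size.
res <- DT[, .(V1 = x * .N), by = x]
print(res)
```

This avoids the rename step, at the cost of only working for sums of the grouping column itself.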

Related

Mixing by and .SDcols in data.table

I am trying to mix by and .SDcols in data.table 1.9.6 from CRAN (also tested on the dev version from GitHub, so it is likely a misunderstanding on my part)
f = function(x){
print(x);
res=data.table(X=x,Y=x*x);
return(res)
}
DT = data.table(x=1:4, y=rep(c('a','b'),2))
DT[,c('A','B'):=lapply(.SD,FUN=f),.SDcols='x',by=y]
I get:
[1] 1 3
Error in `[.data.table`(DT, , `:=`(c("A", "B"), lapply(.SD, FUN = f)), :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
I would expect
x y A B
1: 1 a 1 1
2: 2 b 2 4
3: 3 a 3 9
4: 4 b 4 16
I would have expected the by operation to take place and .SD to be restricted to 'x'. Could someone explain why I am wrong here?
All of the following work. As @Frank pointed out, the problem was the extra level of list nesting introduced by lapply:
DT[,f(.SD[[1]]),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,lapply(.SD, f)[[1]],.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,rbindlist(lapply(.SD, f)),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,sapply(.SD, f),.SDcols='x',by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,mapply(FUN=f, mget('x')),by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
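To get the table the question originally expected (new columns A and B added by reference), the nesting can be avoided by calling the function on the column directly, so that the right-hand side of := is a flat list of two vectors. This is a sketch, not the poster's code: f2 is a hypothetical variant of f that returns a plain list instead of a data.table.

```r
library(data.table)

# Hypothetical variant of f that returns a plain list of two columns.
f2 <- function(x) list(X = x, Y = x * x)

DT <- data.table(x = 1:4, y = rep(c("a", "b"), 2))

# lapply(.SD, f2) would wrap the result once more: a list with one entry per
# .SDcols column, where that entry is itself a list. Calling f2 on the column
# directly yields a flat list of two vectors, which := can assign by group.
DT[, c("A", "B") := f2(x), by = y]
print(DT)
```

Because := assigns by reference, the rows keep their original order rather than being sorted by group.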

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to enumerate all columns per group as expected, but class() shows exactly two rows per group, and one of them is data.frame!
> dt <- data.table(x=1:3, y=1:3, z=1:3, a=1:3, b=1:3, c=1:3, d=1:3, e=1:3)
> dt[, class(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data.table, so it is itself a data.table object. Because every data.table is also a data.frame, class(.SD) returns a character vector of length 2 for each group, which is a little confusing if you expect a single row per group.
To avoid such confusion you can just wrap results into another list, enforcing single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data table by definition (.SD is a data.table containing the Subset of x's Data for each group)
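If the goal is a single row per group without a list column, one option is to test for class membership instead of printing the whole class vector. A small sketch; the column names first_class, is_dt and is_df are illustrative.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 1:3)

# Each expression returns a length-1 value, so there is one row per group.
res <- dt[, .(first_class = class(.SD)[1L],
              is_dt = is.data.table(.SD),
              is_df = is.data.frame(.SD)),
          by = x]
print(res)
```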

Data.table summary statistics from n first observations per group

I'd like to use data.table to make summary statistics based on only the first n observations found for each group. I have one solution that works below but I have a nagging feeling that this might be written as a one-liner in data.table but I cannot find out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?
The one call solution is
DT[, .(my = mean(y[1:2])), by = grp]
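One caveat with y[1:2]: if a group has fewer rows than requested, the index pads with NA. A variant using head() avoids that. This is a sketch; n and the table short are illustrative names that are not part of the original question.

```r
library(data.table)

DT <- data.table(y = 1:10, grp = rep(1:2, 5))
n <- 2  # number of leading observations per group (illustrative)

# head() takes at most n values, so short groups are handled gracefully.
res <- DT[, .(my = mean(head(y, n))), by = grp]
print(res)

# Group 2 below has only one row: y[1:n] yields c(10, NA), head(y, n) yields 10.
short <- data.table(y = c(1, 3, 10), grp = c(1, 1, 2))
idx_version  <- short[, .(my = mean(y[1:n])), by = grp]
head_version <- short[, .(my = mean(head(y, n))), by = grp]
print(idx_version)
print(head_version)
```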

Replace row values in data.table using 'by' and conditions

I am trying to replace certain row values in a column according to conditions in another column, within a grouping.
EDIT: edited to highlight the recursive nature of the problem.
E.g.
DT = data.table(y=rep(c(1,3), each = 3)
,v=as.numeric(c(1,2,4,4,5,8))
,x=as.numeric(rep(c(9:11),each=2)),key=c("y","v"))
DT
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 10
4: 3 4 10
5: 3 5 11
6: 3 8 11
Within each 'y', I then want to replace the value of 'x' in every row whose 'v' equals another 'v' of the same group plus t (e.g. t = 3) with 2222 (or, in reality, the result of a function), giving the following result:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 10
5: 3 5 11
6: 3 8 2222
I have tried the following, but to no avail.
DT[which((v-3) %in% v), x:= 2222, y][]
And it mysteriously (?) results in:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 2222
5: 3 5 2222
6: 3 8 2222
Running:
DT[,print(which((v-3) %in% v)), by =y]
indicates that the indexing within the groups is correct, but I don't understand what happens from there (or why nothing does).
You could try using replace (which could have some overhead because it copies whole x)
DT[, x:=replace(x, which(v %in% (v+3)), 2222), by=y]
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Alternatively, you could create a logical index column and then do the assignment in the next step
DT[,indx:=v %in% (v+3), by=y][(indx), x:=2222, by=y][, indx:=NULL]
DT
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Or slightly modifying your own approach using .I in order to create an index
indx <- DT[, .I[which((v-3) %in% v)], by = y]$V1
DT[indx, x := 2222]
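The result in the question follows from the order of evaluation: the i expression is evaluated once on the whole table, before any grouping, so by = y never restricts which((v-3) %in% v). The sketch below contrasts the global index with the per-group index built from .I; global_idx is an illustrative name.

```r
library(data.table)

DT <- data.table(y = rep(c(1, 3), each = 3),
                 v = as.numeric(c(1, 2, 4, 4, 5, 8)),
                 x = as.numeric(rep(9:11, each = 2)),
                 key = c("y", "v"))

# Evaluated over all rows at once: v = 4 in group 3 matches v = 1 in group 1.
global_idx <- DT[, which((v - 3) %in% v)]
print(global_idx)

# Evaluated per group; .I maps group-local positions back to row numbers.
indx <- DT[, .I[which((v - 3) %in% v)], by = y]$V1
print(indx)

DT[indx, x := 2222]
print(DT)
```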

Unexpected result using unique inside a data.table

Given a data.table (with version 1.9.5)
TEST <- data.table(1:20,rep(1:5,each=4, times=1))
If I run this:
TEST[unique(V2)]
I get this result:
V1 V2
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
Is it really the intended behaviour or a bug?
Or am I just not using it properly?
I was reading the "R book" and in an example they use TEST[unique(Vegetation),] and say it's intended to select a subset of rows unique for the vegetation.
I expected to get something like
V1 V2
1: 1 1
2: 5 2
3: 9 3
4: 13 4
5: 17 5
though I understand that I would need to specify an aggregation criterion.
TEST[,unique(V2)] gives [1] 1 2 3 4 5, so TEST[unique(V2)] is the same as TEST[1:5]. Since TEST[1:5] is supposed to give you the first 5 rows and that's what you get, there is no bug.
To get your expected result, you can do this:
TEST[!duplicated(V2)]
# V1 V2
#1: 1 1
#2: 5 2
#3: 9 3
#4: 13 4
#5: 17 5
or this:
TEST[, V1[1], by = V2]
# V2 V1
#1: 1 1
#2: 2 5
#3: 3 9
#4: 4 13
#5: 5 17
or, as @Arun reminds me, there is now a data.table method for unique:
unique(TEST, by="V2")
# V1 V2
#1: 1 1
#2: 5 2
#3: 9 3
#4: 13 4
#5: 17 5
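The equivalence behind the "no bug" explanation can be checked directly. A short sketch; by_unique and first_rows are illustrative names.

```r
library(data.table)

TEST <- data.table(1:20, rep(1:5, each = 4, times = 1))

# unique(V2) evaluates to 1:5, which i then treats as row numbers.
by_unique <- TEST[unique(V2)]
print(isTRUE(all.equal(by_unique, TEST[1:5])))

# First row of each V2 group, using the data.table method for unique().
first_rows <- unique(TEST, by = "V2")
print(first_rows)
```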
