Mixing by and .SDcols in data.table - r

I am trying to mix by and .SDcols in data.table (CRAN 1.9.6; also tested on the dev version from GitHub, so it is likely a misunderstanding on my part):
f = function(x){
print(x);
res=data.table(X=x,Y=x*x);
return(res)
}
DT = data.table(x=1:4, y=rep(c('a','b'),2))
DT[,c('A','B'):=lapply(.SD,FUN=f),.SDcols='x',by=y]
I get:
[1] 1 3
Error in `[.data.table`(DT, , `:=`(c("A", "B"), lapply(.SD, FUN = f)), :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
I would expect
x y A B
1: 1 a 1 1
2: 2 b 2 4
3: 3 a 3 9
4: 4 b 4 16
I would have expected the by operation to take place and .SD to be replaced by column 'x'. Could someone explain why I am wrong here?

All of the following work. As @Frank pointed out, the problem was the nesting level of the list returned by lapply:
DT[,f(.SD[[1]]),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,lapply(.SD, f)[[1]],.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,rbindlist(lapply(.SD, f)),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,sapply(.SD, f),.SDcols='x',by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,mapply(FUN=f, mget('x')),by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
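For completeness, here is a minimal sketch of how the original := call could be written so that it yields the expected A and B columns. It assumes a version of f without the print() call, and it wraps the result in as.list() so that := receives a plain list of two vectors (one per new column) rather than a list holding a single data.table. The names nested and flat are only there for illustration.

```r
library(data.table)

# Same helper as in the question, minus the print() call
f <- function(x) data.table(X = x, Y = x * x)

DT <- data.table(x = 1:4, y = rep(c("a", "b"), 2))

# lapply over a one-column table gives a list of length 1 (one data.table
# per column), whereas f() applied to the column itself gives two items
nested <- lapply(DT[, .(x)], f)
flat <- as.list(f(DT$x))

# One list item per new column, computed within each group of y
DT[, c("A", "B") := as.list(f(.SD[[1]])), .SDcols = "x", by = y]
```

The difference between length(nested) and length(flat) is the nesting issue mentioned above: := with two names on the left wants two items on the right.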

Related

Is there some way to keep variable names from .SD + .SDcols together with non-.SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (This is why I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
not really sure why.. but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
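Another option that does not rely on the parentheses trick is to name the non-.SD column explicitly before concatenating. This is only a sketch: it uses list(a = a) in place of .(a) and a character vector for .SDcols, both of which are my choices rather than anything from the question, and res is just an illustrative name.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = c(1, 3, 6), a = 1:9, b = 9:1)

# c() keeps the names of both lists: "a" from the explicitly named
# list and "x", "v", "y" from .SD
res <- DT[, c(list(a = a), .SD), .SDcols = c("x", "v", "y")]
names(res)
```

The cost is typing a = a once per extra column, but the resulting names do not depend on how j is parsed.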

Removing rows in a R data.table with NAs in specific columns

I have a data.table with a large number of features. I would like to remove the rows where the values are NAs only for certain features.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
filter(!is.na(lat)) %>%
filter(!is.na(long)) %>%
filter(!is.na(temp)) %>%
filter(!is.na(year)) %>%
filter(!is.na(month)) %>%
filter(!is.na(day)) %>%
filter(!is.na(hour)) %>%
.......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:
We can select those columns, use complete.cases on them to get a logical vector that is TRUE for rows without any NA, and subset with that vector to drop the rows containing NA:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments, if it is a data.table, then subset the colsofinterest and use complete.cases
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
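A small self-contained sketch of the same idea on toy data (the table dt and its values are made up for illustration). It also shows na.omit with the cols argument, which the data.table method of na.omit provides, as a shorter alternative:

```r
library(data.table)

# Toy data: NAs in lat/long should drop a row, an NA in note should not
dt <- data.table(lat  = c(1, NA, 3),
                 long = c(4, 5, NA),
                 note = c(NA, "b", "c"))
colsofinterest <- c("lat", "long")

# Approach from the answer: complete.cases on the selected columns only
kept1 <- dt[complete.cases(dt[, colsofinterest, with = FALSE])]

# Alternative: data.table's na.omit method restricted to some columns
kept2 <- na.omit(dt, cols = colsofinterest)
```

Both keep only the first row here, even though its note is NA, because note is not among the columns checked.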
data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a"&y==1]
x y v
1: a 1 4
> is.na(DT[x=="a"&y==1]$v) <- TRUE # make one item NA
> DT[x=="a"&y==1]
x y v
1: a 1 NA
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[complete.cases(DT)] # note no comma
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT # But that didn't change DT, it only returned a filtered copy
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT <- DT[complete.cases(DT)] # do this assignment to make permanent
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".

between vs inrange in data.table

In R's data.table, when should one choose between %between% and %inrange% for subsetting operations? I've read the help page for ?between and I'm still scratching my head as to the differences.
library(data.table)
X = data.table(a=1:5, b=6:10, c=c(5:1))
> X[b %between% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
> X[b %inrange% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
They look the same to me. Could someone please explain why there exist both operations?
> X
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
Using the example in the comments:
> X[a %between% list(c, b)]
a b c
1: 3 8 3
2: 4 9 2
3: 5 10 1
> X[a %inrange% list(c, b)]
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
between works row by row: for each row it checks whether the value in a satisfies c <= a <= b, using that row's own c and b.
inrange instead treats every (c, b) pair as an interval and checks, for each value in a, whether it falls inside any of those intervals, not just the one on its own row. Here the intervals [5,6], [4,7], [3,8], [2,9] and [1,10] together cover everything from 1 to 10, so every row matches.
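A sketch with disjoint intervals makes the behaviour of inrange easier to see. The vectors below are invented for illustration; between() and inrange() are the function forms of the two operators.

```r
library(data.table)

X <- data.table(a = 1:5, b = 6:10, c = 5:1)

# between: each a is compared with the c and b of its own row
row_wise <- X[a %between% list(c, b)]

# inrange with two disjoint intervals, [0, 2] and [9, 11]
vals  <- c(1, 5, 10)
lower <- c(0, 9)
upper <- c(2, 11)
hits <- inrange(vals, lower, upper)
```

Here 5 lies between the smallest lower bound and the largest upper bound, yet hits is FALSE for it because it is in neither interval, so inrange is not simply testing against one overall min/max range.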

R data.table "j" reference to "by" variables very unintuitive?

I'm just doing the data.table DataCamp exercises and there is something which really disturbs my sense of logic.
Somehow columns which are referred to in "by" are treated differently from other columns?
The used data table is the following:
DT
x y z
1: 2 1 2
2: 1 3 4
3: 2 5 6
4: 1 7 8
5: 2 9 10
6: 2 11 12
7: 1 13 14
When I enter DT[,sum(x),x] I would expect:
x V1
1: 2 8
2: 1 3
but I get:
x V1
1: 2 2
2: 1 1
for other columns I get the group sum as I would expect it:
> DT[,sum(y),x]
x V1
1: 2 26
2: 1 23
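Inside j, a column used in by is not the full group vector: it holds just the single value that identifies the current group (length 1), so sum(x) returns that value. One way to get the group sum is to give the grouping variable a different name, so that x in j refers to the ordinary column again: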
setnames(DT[, sum(x), .(xN=x)], "xN", "x")[]
# x V1
#1: 2 8
#2: 1 3
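The following sketch rebuilds the table from the question and checks this directly. The helper names by_len, scaled and renamed are mine, and x * .N is just one more way to get the same sums given that x is a single value per group:

```r
library(data.table)

DT <- data.table(x = c(2, 1, 2, 1, 2, 2, 1),
                 y = seq(1, 13, by = 2),
                 z = seq(2, 14, by = 2))

# In j, the by column x has length 1, while .N is the group size
by_len <- DT[, .(len = length(x), n = .N), by = x]

# Since x is a single value per group, value * group size gives the sum
scaled <- DT[, .(V1 = x * .N), by = x]

# Grouping under another name leaves x as an ordinary column in j
renamed <- DT[, .(V1 = sum(x)), by = .(xN = x)]
```

Both scaled and renamed contain 8 for the x == 2 group and 3 for the x == 1 group.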

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to be enumerating all columns per group as expected, but class() shows exactly two rows per group! And one of them is data.frame
> dt <- data.table("a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=a]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=x]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data.table, so it is itself a data.table object. And because every data.table is also a data.frame, class(.SD) returns a length-2 character vector for each group, which is a little confusing if you expect a single row per group.
To avoid such confusion, you can wrap the result in another list, which enforces a single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data table by definition (.SD is a data.table containing the Subset of x's Data for each group)
