I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.
So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:
dt <- data.table(
a=1:10,
b=as.factor(letters[1:10]),
c=c(TRUE, FALSE),
d=runif(10, 0.5, 100),
e=c(0,1),
f=as.integer(c(0,1)),
g=as.numeric(1:10),
h=c("cat1", "cat2", "cat3", "cat4", "cat5"))
mean(dt$a)
[1] 5.5
dt[, mean(.SD), .SDcols = "a"]
[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA
dt[, sum(.SD), .SDcols = "a"]
[1] 55
dt[, max(.SD), .SDcols = "a"]
[1] 10
dt[, colMeans(.SD), .SDcols = "a"]
a
5.5
dt[, lapply(.SD, mean), .SDcols = "a"]
a
1: 5.5
Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).
I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.
Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).
My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.
Thanks.
I think the appropriate way of approaching what you want is to just use the standard syntax:
dt[ , lapply(.SD, mean), .SDcols = "a"]
Alternatively, you can pass a variable by name as follows:
col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
You can then generalize this approach to multiple columns as follows:
col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
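As for why mean(.SD) fails: a minimal sketch, assuming .SD behaves inside j like any other data.table, that is, a list of columns. mean() has no method for data frames, so mean.default receives a list and returns NA, whereas sum() and max() do have data frame methods. weighted.mean() without weights computes sum(x)/length(x), and the length of a one-column table is 1, which would explain the 55. The variable names below are illustrative only.

```r
library(data.table)

dt <- data.table(a = 1:10)

# .SD is a one-column data.table (a list), not an atomic vector
sd_copy <- dt[, .SD, .SDcols = "a"]

# Extracting the single column gives mean() a plain vector to work on
via_extract <- dt[, mean(.SD[[1L]]), .SDcols = "a"]

# length() of a one-column table is the number of columns, not rows
n_cols <- dt[, length(.SD), .SDcols = "a"]

print(is.list(sd_copy))
print(via_extract)
print(n_cols)
```

So .SD can be used without an apply function, as long as the function in j accepts a list-like table or you extract the column first.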
I am trying to select some columns from a data.table but getting unexpected results.
For the following, I want to select columns y and z and this works as expected
library(data.table)
dt <- data.table(x=1:4, y=5:8, z=9:6)
dt[, c("y", "z")]
When I try to do this using setdiff it returns nonsense
omit_var <- "x"
dt[, setdiff(c("x","y","z"), omit_var)]
Even though they are equivalent: all.equal(setdiff(c("x","y","z"), omit_var), c("y", "z"))
Why is this happening? I am guessing it is a scoping issue, but can I avoid it while keeping the code similar?
(I realise I can do i <- setdiff(c("x","y","z"), omit_var); dt[,..i])
Following comments from @Roland: "If you call a function in j, data.table can't know that you want to subset. It assumes that you want the return value of the function." So you can use either of
dt[, mget(setdiff(c("x","y","z"), omit_var))]
dt[, setdiff(c("x","y","z"), omit_var), with = FALSE]
@sindri_baldur gave another alternative using .SDcols
dt[, .SD, .SDcols = setdiff(c("x","y","z"), omit_var)]
And more options from @DavidArenburg
dt[, .SD, .SDcols = -omit_var]
If you want to keep all columns except the one in a character string you can use
dt[, !"x"]
Or you could use the .. operator
cols <- setdiff(names(dt), omit_var)
dt[, ..cols]
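The options above can be compared side by side. This is only a sketch using the question's example table; the result variable names (as_vector, with_false, dot_dot, via_sd) are made up for illustration.

```r
library(data.table)

dt <- data.table(x = 1:4, y = 5:8, z = 9:6)
omit_var <- "x"
keep <- setdiff(names(dt), omit_var)

# A function call in j is evaluated and its value returned as-is
as_vector <- dt[, setdiff(c("x", "y", "z"), omit_var)]

# Three ways of telling data.table that the names mean "select columns"
with_false <- dt[, keep, with = FALSE]
dot_dot    <- dt[, ..keep]
via_sd     <- dt[, .SD, .SDcols = keep]

print(class(as_vector))
print(names(with_false))
```

The first call gives back the character vector itself, while the other three return a data.table containing only the kept columns.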
I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
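One way to see the "by reference" part for yourself is to compare the object's memory address before and after the update. The sketch below assumes data.table's exported address() helper and rebuilds the question's example; addr_before and addr_after are illustrative names.

```r
library(data.table)

dat <- data.table(ID = rep(c("A", "B"), each = 5),
                  Quarter = c(1:5, 1:5),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

addr_before <- address(dat)

# Parentheses around cols make := use the names stored in the variable
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

addr_after <- address(dat)

# Same address before and after suggests no copy of the table was made
print(identical(addr_before, addr_after))
print(sapply(dat, class))
```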
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(2e6))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
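The reason for going through as.character() is that as.integer() on a factor returns the internal level codes, not the values printed on screen. A small sketch with a made-up column x:

```r
library(data.table)

dat <- data.table(x = factor(c("10", "20", "10")))

# Converting the factor directly yields the level codes
codes <- as.integer(dat$x)

# Going through character first recovers the original numbers
dat[, x := as.integer(as.character(x))]

print(codes)
print(dat$x)
```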
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]
I'm trying to apply a function to a group of columns in a large data.table without referring to each one individually.
a <- data.table(
a=as.character(rnorm(5)),
b=as.character(rnorm(5)),
c=as.character(rnorm(5)),
d=as.character(rnorm(5))
)
b <- c('a','b','c','d')
with the MWE above, this:
a[,b=as.numeric(b),with=F]
works, but this:
a[,b[2:3]:=data.table(as.numeric(b[2:3])),with=F]
doesn't work. What is the correct way to apply the as.numeric function to just columns 2 and 3 of a without referring to them individually.
(In the actual data set there are tens of columns so it would be impractical)
The idiomatic approach is to use .SD and .SDcols
You can force the LHS of := to be evaluated in the parent frame by wrapping it in ()
a[, (b) := lapply(.SD, as.numeric), .SDcols = b]
For columns 2:3
a[, 2:3 := lapply(.SD, as.numeric), .SDcols = 2:3]
or
mysubset <- 2:3
a[, (mysubset) := lapply(.SD, as.numeric), .SDcols = mysubset]
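Since the question indexes the name vector b, the same idiom also works with a subset of names. A short sketch follows, where target is a hypothetical helper variable and the class check confirms that only columns 2 and 3 changed:

```r
library(data.table)

a <- data.table(
  a = as.character(rnorm(5)),
  b = as.character(rnorm(5)),
  c = as.character(rnorm(5)),
  d = as.character(rnorm(5))
)
b <- c("a", "b", "c", "d")

# Names of the columns to convert (columns 2 and 3)
target <- b[2:3]

# (target) on the left of := is evaluated to the names it holds
a[, (target) := lapply(.SD, as.numeric), .SDcols = target]

col_classes <- sapply(a, class)
print(col_classes)
```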
This is from an observation made while answering this question from @sds here.
First, let me switch on the trace messages for data.table:
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")
Now, suppose one wants to get the sum of all columns grouped by column a, then, we could do:
dt.out <- dt[, lapply(.SD, sum), by = a]
Now, suppose I'd want to add also the number of entries that belong to each group to dt.out, then I normally assign it by reference as follows:
dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]
In this assignment by reference, one of the messages data.table produces is:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
This message comes from assign.c in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines. If necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5. So, it's not a named vector. And I don't understand what this recycled list in RHS is.
But if I do:
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]
Then, I get the message:
Direct plonk of unnamed RHS, no copy.
As I understand this, the RHS has been duplicated in the first case, meaning it's making a copy (shallow/deep, this I don't know). If so, why is this happening?
Even if not, why does assignment by reference behave differently internally between the two forms? Any ideas?
To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?
Update: The expression,
DT[, c(..., lapply(.SD, .), ...), by=.]
has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:
o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).
For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet.
This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.
Where it says NAMED vector it means that in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute or not. The NAMED value in the SEXP structure takes value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.
What would be better is if optimization of j in data.table could handle :
DT[, c(lapply(.SD,sum),.N), by=a]
That works but may be slow. Currently only the simpler form is optimized :
DT[, lapply(.SD,sum), by=a]
To answer the main question: yes, the following :
Direct plonk of unnamed RHS, no copy.
is desirable compared to :
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
Another way to achieve this is :
dt.out[, count := dt[, .N, by=a]$N]
I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
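For completeness, the sums and the group count can also be computed in a single grouped call, which sidesteps the second assignment altogether. This is only a sketch of the c(lapply(.SD, sum), .N) form mentioned above, with the count wrapped in a named list so the column gets a readable name; whether it is optimised depends on the data.table version.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)),
                 b = 1:10, c = 11:20, d = 21:30, key = "a")

# Combine the per-column sums with the group size in one j expression
dt.out <- dt[, c(lapply(.SD, sum), list(count = .N)), by = a]

print(dt.out)
```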