R data.table - How to modify by reference when using .SD?

I'm new to data.table and don't understand how I can modify by reference while also performing an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, that I simply want to keep only the columns whose names contain "group1:". I know it's pretty straightforward to just reassign the result of the operation to the same object like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]

In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which replaces NA values with 0 by reference in the numeric columns selected through .SDcols.
The trick is to store the vector of column names in a variable before using :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
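For Example 1 specifically, assigning the selected columns to themselves leaves the table unchanged. A minimal sketch of one way to keep only the "group1:" columns by reference is to delete the other columns with := NULL. The names drop_cols and addr_before are hypothetical helpers introduced here for illustration, and address() is used only to check that no new object was created.

```r
library(data.table)

DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)

# Columns to keep, and (hypothetical helper name) the columns to drop
cols1 <- grep("group1:", names(DT), value = TRUE)
drop_cols <- setdiff(names(DT), cols1)

# Remember where DT lives so we can check it was modified in place
addr_before <- address(DT)

# Assigning NULL with := removes the columns by reference
DT[, (drop_cols) := NULL]

names(DT)
identical(address(DT), addr_before)
```

For Example 2, note that := only adds, updates, or removes columns of the existing table and cannot change its number of rows, so a reshaping operation such as dcast has to be assigned to a new (or the same) name.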

Related

Nice way to group data in a `data.table` when the new column name is given as a character vector

In other words, my question is about the j argument to data.table when the name of the new column is a character vector. For example:
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = rnorm(6))
agg_col_name <- 'avg'
grouped_dt <- dt[, .(z = mean(y)), by = x]
setnames(grouped_dt, 'z', agg_col_name)
> grouped_dt
x avg
1: 1 -0.2554987
2: 2 -0.4245852
3: 3 -0.4881073
There should be a more elegant way to do the last two statements as one, yes?
Perhaps this is a question about how to create a suitable list for the j argument.
Although it is probably not what you are looking for, you could use setNames inside j, wrapping it around .(z = mean(y)).
library(data.table)
dt[, setNames(.(z = mean(y)), agg_col_name), by = x]
Or use setnames after doing the summary:
setnames(dt[, mean(y), by = x], 'V1', agg_col_name)[]
Output
x avg
1: 1 0.5626526
2: 2 0.3549653
3: 3 -0.2861405
However, as mentioned in the comments, this is easier to do with the dev version of data.table. You can see more about the development of this feature at programming on data.table #4304 (https://github.com/Rdatatable/data.table/pull/4304).
# Latest development version:
data.table::update.dev.pkg()
library(data.table)
dt[, .(z = mean(y)), by = x, env = list(z=agg_col_name)]
# x avg
#1: 1 -0.1640783
#2: 2 0.5375794
#3: 3 0.1539785
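Because the outputs above come from different random draws, they are hard to compare. As a small sketch, the two approaches that do not need the env argument can be checked against each other on fixed data. The y values and the names res1 and res2 are assumptions made here for illustration.

```r
library(data.table)

# Fixed values instead of rnorm() so the result is reproducible
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = c(1, 3, 2, 4, 5, 7))
agg_col_name <- "avg"

# Approach 1: name the list returned by j with setNames()
res1 <- dt[, setNames(list(mean(y)), agg_col_name), by = x]

# Approach 2: aggregate first, then rename by reference with setnames()
res2 <- setnames(dt[, .(z = mean(y)), by = x], "z", agg_col_name)

names(res1)
all.equal(res1, res2)
```

Both results have the columns x and avg, so the choice is mostly a matter of style.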

R data.table: names of .SD not available for assignment

Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]
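As a sketch of the same workaround applied to a transformation rather than a constant, the column names can be computed once and then reused on both sides of :=. This variant uses grep with value = TRUE to get names instead of positions, and the increment by 1L is just an illustrative operation.

```r
library(data.table)

d <- data.table(x = 1:10, y = letters[1:10])

# Compute the target column names up front (names rather than positions)
cols <- grep("^x", names(d), value = TRUE)

# The same vector drives both the LHS of := and .SDcols
d[, (cols) := lapply(.SD, function(v) v + 1L), .SDcols = cols]

d$x
```

Using names rather than positions keeps the assignment valid even if the column order changes between the two lines.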

create a new column in a data.table from group by multiple columns

I'm working on a data.table that includes X and Y columns and I want to create a new column Z which is the number of all records with the same value of (X, Y).
I know the syntax when working with a data.frame:
ddply(df,.(X,Y),nrow)
I tested different syntaxes I found on this forum but they didn't work:
dt[, Z := lapply(.SD,nrow), by="X,Y"] # or
dt[, `:=`(Z = lapply(.SD,nrow)), by="X,Y"]
To be precise, X and Y are numeric.
Starting from
library(data.table)
dt <- data.table(X = c(1, 1, 2), Y = c(1, 1, 2))
The appropriate syntax is
dt[, Z := .N, by = c("X","Y")]
or
dt[, Z := .N, by = .(X,Y)]
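A short sketch of the difference between the two forms may help: with := the count is added to every row of the existing table, while a plain aggregation returns one row per (X, Y) group, which is closer to what ddply(df, .(X, Y), nrow) produces. The name counts is a hypothetical variable used for illustration.

```r
library(data.table)

dt <- data.table(X = c(1, 1, 2), Y = c(1, 1, 2))

# Add the group size to every row, by reference (row count unchanged)
dt[, Z := .N, by = .(X, Y)]

# Aggregate instead: one row per (X, Y) combination
counts <- dt[, .(Z = .N), by = .(X, Y)]

dt
counts
```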

Convert *some* column classes in data.table

I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because the dataset is small). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
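The reason for going through as.character can be seen in a small sketch: calling as.integer directly on a factor returns its internal level codes, not the numbers shown as labels. The toy values and the names f, codes and values are assumptions made here for illustration.

```r
library(data.table)

# Numeric values stored as a factor
f <- factor(c("10", "20", "10"))

codes  <- as.integer(f)                # internal level codes
values <- as.integer(as.character(f))  # the numbers shown as labels

# The same conversion applied by reference to selected columns
dat <- data.table(ID = c("A", "B", "C"), Quarter = factor(c("10", "20", "10")))
cols <- "Quarter"
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]

sapply(dat, class)
```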
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would with a data.frame, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]

Use of .N and .SD in one call

Suppose I have a data.table as follows:
data = data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
I would like to sum the numeric vector, only when the factor vector has more than one entry.
The problem I have will require the use of .SD. I understand that I could create a N field via
data[ , N := .N, by = V1]
and then sum via
data[N > 1, lapply(.SD,sum), by = V1, .SDcols = 2]
However, is there a one step call to do this?
Referencing .SD in the call doesn't return an answer:
data[, lapply(.SD[which(length(.SD)>1)],sum), by = V1, .SDcols = 2]
I would like to understand why this doesn't work. Neither does:
data[, lapply(.SD[which(.N>1)],sum), by = V1, .SDcols = 2]
Thanks!
data <- data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
data[, if(.N > 1) lapply(.SD, sum) else NULL, by=V1]
# V1 V2
# 1: a 3
# 2: b 7
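As for why the attempts in the question behave differently: which(.N > 1) evaluates to either 1 or integer(0), and that value is then used as a row index into .SD, so it selects rows instead of skipping the group. A sketch of the one-step form, with .SDcols given explicitly by name (an assumption here, to make the summed column obvious), relies on j returning NULL for groups that should be dropped. The name res is hypothetical.

```r
library(data.table)

data <- data.table(V1 = c("a", "a", "b", "b", "c"), V2 = c(1, 2, 3, 4, 5))

# An if() without else returns NULL when the condition is FALSE,
# and groups whose j result is NULL are left out of the output
res <- data[, if (.N > 1) lapply(.SD, sum), by = V1, .SDcols = "V2"]

res
```

Group "c" has a single row, so it does not appear in the result.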
