Suppose I have a data.table as follows:
data = data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
I would like to sum the numeric column, but only for groups where the grouping column has more than one entry.
The problem I have will require the use of .SD. I understand that I could create an N column via
data[ , N := .N, by = V1]
and then sum via
data[N > 1, lapply(.SD,sum), by = V1, .SDcols = 2]
However, is there a one step call to do this?
Referencing .SD in the call doesn't return the answer I want:
data[, lapply(.SD[which(length(.SD)>1)],sum), by = V1, .SDcols = 2]
I would like to understand why this doesn't work. Neither does this:
data[, lapply(.SD[which(.N>1)],sum), by = V1, .SDcols = 2]
Thanks!
data <- data.table(c("a","a","b","b","c"),c(1,2,3,4,5))
data[, if(.N > 1) lapply(.SD, sum) else NULL, by=V1]
# V1 V2
# 1: a 3
# 2: b 7
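As for why the two attempts in the question fail, here is a small sketch on the same toy data (the result names res_len, res_n and res_if are made up for illustration). .SD is a list of columns, so length(.SD) counts columns rather than rows, and .SD[which(.N > 1)] is a row subset that keeps either the first row or no rows at all.

```r
library(data.table)

data <- data.table(c("a", "a", "b", "b", "c"), c(1, 2, 3, 4, 5))

# length(.SD) is the number of columns in .SD (one here), so the condition
# is never TRUE and every group sums an empty subset.
res_len <- data[, lapply(.SD[which(length(.SD) > 1)], sum), by = V1, .SDcols = 2]

# which(.N > 1) is 1L for groups with several rows, so .SD[1L] keeps only the
# first row of the group; for single-row groups it is integer(0) (no rows).
res_n <- data[, lapply(.SD[which(.N > 1)], sum), by = V1, .SDcols = 2]

# Returning NULL (an `if` without `else`) drops the group from the result.
res_if <- data[, if (.N > 1) lapply(.SD, sum), by = V1, .SDcols = 2]
```

Only res_if both filters the groups and sums all of their rows; the other two still return one row per group.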
Related
So I'm new to data.table and don't understand how I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, I simply want to choose only the columns which contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation that uses := would. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation dcast that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which replaces NA values with 0 in the numeric columns (selected via .SDcols) by reference.
The trick is to build the vector of column names before using :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
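Note that (cols1) := .SD only rewrites the selected columns in place; it does not remove the others. A minimal sketch, assuming the goal in Example 1 is to end up with only the "group1:" columns without reassigning DT (keep and drop are illustrative names, not from the question):

```r
library(data.table)

DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)

# Columns to keep, and everything else to drop
keep <- grep("group1:", names(DT), value = TRUE)
drop <- setdiff(names(DT), keep)

# Assigning NULL with := deletes columns by reference; DT itself is not copied
if (length(drop) > 0) DT[, (drop) := NULL]

names(DT)
```

For Example 2 the situation is different: := adds, updates or deletes columns but cannot change the number of rows, so a reshape such as dcast still has to be assigned to a new (or the same) name.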
Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
>
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
cols = grep('^x', names(d))
d[ , (cols) := 2:11, .SDcols = cols]
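The same two-line pattern also covers the case where several columns match and the new values depend on the old ones. A minimal sketch, assuming a toy table with two columns starting with "x" (the table and the doubling operation are made up for illustration):

```r
library(data.table)

d <- data.table(x1 = 1:3, x2 = 4:6, y = letters[1:3])

# Build the name vector first, using base grep instead of stringr
cols <- grep("^x", names(d), value = TRUE)

# Then use it on both sides: (cols) as the LHS of :=, and .SDcols to fill .SD
d[, (cols) := lapply(.SD, function(v) v * 2L), .SDcols = cols]
```

Using value = TRUE keeps cols as names rather than positions, which stays valid if columns are later reordered.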
Question 1: line 1 throws an error. Why, and how can I multiply all columns by DT[i,j]?
Question 2: line 2 works but are there better ways to multiply all other columns by one column?
df=data.table(matrix(1:15,3,5))
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")] #line 1
df[ , lapply(.SD, function(x) {x*df[1,"V5"]}), .SDcols = c("V1","V2","V3","V4")] #line 2
As we are multiplying one column with the rest, we can either multiply the Subset of Data (.SD) by that column directly
df[, .SD * V5, .SDcols = V1:V4]
Or with lapply
df[, lapply(.SD, `*`, V5), .SDcols = V1:V4]
Note that in both cases, we are not updating the original dataset columns. For that we need :=
df[, paste0("V", 1:4) := .SD * V5, .SDcols = V1:V4]
In the OP's code, a closing } is missing in line 1:
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")]
^^
It would be
df[, lapply(.SD, function(x) { x* V5 }), .SDcols = V1:V4]
Here, we don't really need the curly braces either. Within the data.table, columns can be referenced by their unquoted names instead of via df$, and .SDcols can be shortened to a range of column names (V1:V4).
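Putting the pieces together, a minimal self-contained sketch of both cases from the question: multiplying by a whole column, and multiplying by a single cell. The result names by_col and by_cell are illustrative, and V5[1] is just one way to pick out a single value (here row 1 of V5).

```r
library(data.table)

df <- data.table(matrix(1:15, 3, 5))   # columns V1..V5

# Element-wise product of V1..V4 with the V5 column (row by row)
by_col <- df[, .SD * V5, .SDcols = V1:V4]

# Product of V1..V4 with one cell: the first value of V5, a plain scalar
by_cell <- df[, .SD * V5[1], .SDcols = V1:V4]
```

Neither call modifies df; both return a new four-column data.table.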
In case of assignment (by reference), with = FALSE can be replaced by wrapping the LHS in parentheses, (). This nice feature does not work when simply subsetting columns without assignment. Of course there are workarounds with .SD/.SDcols or get()/mget(), but it would be nice to subset a column in just the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below are further types of subsetting (there are even more), but they are not
# the same for col and cols, which is important for looping where I don't
# know in advance how many columns I will select.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector; dt[[cols]] does not run
In other words: if dt[ , (col) := 8] runs, as a naive user I expect dt[ , (col)] to run as well. Probably there would be a conflict in [.data.table so that it cannot be implemented?
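For context: without :=, j is an ordinary expression, so (col) simply evaluates to the string stored in col. One form that does behave the same for one or many names is the .. prefix, which as far as I know is available in reasonably recent data.table versions. A minimal sketch (the result names one, many and same are illustrative):

```r
library(data.table)

dt <- data.table(A = 1:3, B = 4:6)
col  <- "A"
cols <- c("A", "B")

# The ".." prefix looks the variable up in the calling scope and treats its
# value as column names; the result is always a data.table.
one  <- dt[, ..col]
many <- dt[, ..cols]

# Equivalent selection with with = FALSE
same <- dt[, cols, with = FALSE]
```

Because both forms return a data.table regardless of how many names are supplied, they are convenient inside loops where the number of columns is not known in advance.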
I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(2e6))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
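The detour through as.character() matters because as.integer() on a factor returns the internal level codes, not the labels. A minimal sketch, assuming a toy table whose ID column holds numbers stored as a factor (the data and the name codes are made up for illustration):

```r
library(data.table)

dat <- data.table(ID = factor(c("10", "20", "30")), value = c(0.1, 0.2, 0.3))
cols <- "ID"

# Direct conversion gives the level codes, which is rarely what is wanted
codes <- as.integer(dat$ID)

# Converting via character recovers the numbers the labels spell out
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
```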
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would with a data.frame, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]