data.table: create new columns with lapply

I have a data.table and want to apply a function to each subset of rows.
Normally one would do this as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single atomic value but a vector of length > 1.
Is there a way to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1=letters[sample(x=2L,size=6,replace=TRUE)],
x2=letters[sample(x=2L,size=6,replace=TRUE)],
y=rep(1:2,3), key="y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance. Also: I would not mind if the result of the function had to have a fixed length.

You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
If you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group
(ie, if x2=c("a", "b") when y=1, but x2=c("b", "b") when y=2),
then the above breaks, because the groups produce different sets of names.
The solution is to convert the columns to factors before counting:
DT[, c("x1", "x2") := lapply(.SD, factor), .SDcols = c("x1", "x2")]
## OR
columnsToConvert <- c("x1", "x2") # or .. <- setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols=columnsToConvert], y=DT[, y])
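To see why factors help, here is a minimal sketch (with hypothetical data DT2 in which level "a" is absent from one group): once a column is a factor, table() reports a zero count for a missing level, so every group yields the same set of names.

```r
library(data.table)

# hypothetical example: level "a" never occurs when y == 2
DT2 <- data.table(x2 = c("a", "b", "b", "b"), y = c(1, 1, 2, 2))

# as a factor, the absent level still gets a (zero) count in each group
DT2[, x2 := factor(x2, levels = c("a", "b"))]
DT2[, as.list(unlist(lapply(.SD, table))), by = y]
#    y x2.a x2.b
# 1: 1    1    1
# 2: 2    0    2
```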

Related

How to use cbind() when columns stored in a character vector?

Let's say I have the below data.table:
DT <- data.table(x=rep(c(1,2),3),y=rep(1,6),z=rep(2,6))
DT
x y z
1: 1 1 2
2: 2 1 2
3: 1 1 2
4: 2 1 2
5: 1 1 2
6: 2 1 2
and I am calculating the rowMeans across columns y,z. Can I use the cbind() if the column names are in a character vector?
colN = c('y','z')
I know that DT[, meanYZ := rowMeans(cbind(y,z))] works. But is there a way to make this work with colN? So, like --
DT[, meanYZ := rowMeans(cbind(colN))]
The preferred option would be to specify the columns of interest through .SDcols and apply rowMeans on the .SD (Subset of Data.table):
library(data.table)
DT[, meanYZ := rowMeans(.SD, na.rm = TRUE), .SDcols = colN]
It can also be done with mget to return a list, cbind it with do.call, and apply rowMeans:
DT[, meanYZ := rowMeans(do.call(cbind, mget(colN)), na.rm = TRUE)]
cbind(y, z) works because y and z are unquoted column names, while the elements of colN are the strings "y" and "z". They need to be either converted to symbols (as.name) and evaluated, or looked up with get, which returns the column values by searching the environment of the dataset.
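A small sketch of the symbol route described above, assuming the same DT and colN as in the question: build the call cbind(y, z) from the strings, then evaluate it inside j.

```r
library(data.table)
DT <- data.table(x = rep(c(1, 2), 3), y = rep(1, 6), z = rep(2, 6))
colN <- c("y", "z")

# turn the strings into symbols and splice them into a cbind() call
expr <- as.call(c(quote(cbind), lapply(colN, as.name)))
expr
# cbind(y, z)

# eval() inside j sees the columns of DT, so the constructed call works
DT[, meanYZ := rowMeans(eval(expr))]
```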
Perhaps you can subset the columns in colN from DT first and then generate a new column for rowMeans, e.g.,
DT[, MeanYZ := rowMeans(.SD[, colN, with = FALSE])]
which gives
x y z MeanYZ
1: 1 1 2 1.5
2: 2 1 2 1.5
3: 1 1 2 1.5
4: 2 1 2 1.5
5: 1 1 2 1.5
6: 2 1 2 1.5

R data.table: keep column when grouping by expression

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row index, extract that column ($V1) and subset the original dataset
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x', 'y' would be different as there was no set.seed
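A reproducible sketch of the .I approach (adding a set.seed so the selected rows are fixed):

```r
library(data.table)
set.seed(1)
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))

# .I collects the row indices of the first and last row per group;
# subsetting Q by those indices keeps every column, including a
res <- Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
res$a
# [1]  1 19  2 20  3 18
```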

calculated columns in new datatable without altering the original

I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10),
e = sample(c("x","y"),10,replace = T),
f=sample(c("t","s"),10,replace = T)
)
I need (for example) a count of negative values in columns 1:4 for each value of e, f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) {  # these are the *by* columns
  for (i in 1:4) {  # these are the columns whose negative values I'm counting
    n <- paste("neg", names(dt[, i, with = FALSE]), "count", "by",
               names(dt[, k, with = FALSE]), sep = "_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with = FALSE])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
which then need to be combined with rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there are only two grouping columns, we could do an rbindlist after grouping by each of them separately:
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
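A variation on the melt idea, checked on a tiny hand-made table (hypothetical data, not the dt from the question): melt the grouping columns e and f into one column of levels, then count negatives per level. This scales to several e/f-like columns by extending measure.vars.

```r
library(data.table)
dt2 <- data.table(a = c(-1, 2, -3, 4), b = c(1, -2, 3, -4),
                  e = c("x", "x", "y", "y"), f = c("t", "s", "t", "s"))

# stack e and f into a single 'level' column, then count negatives per level
melt(dt2, measure.vars = c("e", "f"), value.name = "level")[,
  lapply(.SD, function(x) sum(x < 0)), by = level, .SDcols = c("a", "b")]
#    level a b
# 1:     x 1 1
# 2:     y 1 1
# 3:     t 2 0
# 4:     s 0 2
```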

R - Data.table - Using variable column names in RHS operations

How do I use variable column names on the RHS of := operations? For example, given this data.table "dt", I'd like to create two new columns, "first_y" and "first_z" that contains the first observation of the given column for the values of "x".
dt <- data.table(x = c("one","one","two","two","three"),
y = c("a", "b", "c", "d", "e"),
z = c(1, 2, 3, 4, 5))
dt
x y z
1: one a 1
2: one b 2
3: two c 3
4: two d 4
5: three e 5
Here's how you would do it without variable column names.
dt[, c("first_y", "first_z") := .(first(y), first(z)), by = x]
dt
x y z first_y first_z
1: one a 1 a 1
2: one b 2 a 1
3: two c 3 c 3
4: two d 4 c 3
5: three e 5 e 5
But how would I do this if the "y" and "z" column names are dynamically stored in a variable?
cols <- c("y", "z")
# This doesn't work
dt[, (paste0("first_", cols)) := .(first(cols)), by = x]
# Nor does this
q <- quote(first(as.name(cols[1])))
p <- quote(first(as.name(cols[2])))
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]
I've tried numerous other combinations of quote() and eval() and as.name() without success. The LHS of the operation appears to be working as intended and is documented in many places, but I can't find anything about using a variable column name on the RHS. Thanks in advance.
I'm not familiar with the first function (although it looks like something Hadley would define).
dt[, paste0("first_", cols) := lapply(.SD, head, n = 1L),
by = x, .SDcols = cols]
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
The .SDcols answer is fine for this case, but you can also just use get:
dt[, paste0("first_", cols) := lapply(cols, function(x) get(x)[1]), by = x]
dt
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
Another alternative is the vectorized version - mget:
dt[, paste0("first_", cols) := setDT(mget(cols))[1], by = x]
I can never get "get" or "eval" to work on the RHS when trying to do mathematical operations. Try this if you need to.
Thing_dt[, c(new_col) := Thing_dt[[oldcol1]] * Thing_dt[[oldcol2]]]
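As a quick check of the .SDcols route, here is a sketch on the question's dt using data.table's own first() instead of head(..., 1L) (same result, and first() does exist in data.table):

```r
library(data.table)
dt <- data.table(x = c("one", "one", "two", "two", "three"),
                 y = c("a", "b", "c", "d", "e"),
                 z = c(1, 2, 3, 4, 5))
cols <- c("y", "z")

# first() is applied to each column in .SDcols within each group of x
dt[, paste0("first_", cols) := lapply(.SD, first), by = x, .SDcols = cols]
dt$first_y
# [1] "a" "a" "c" "c" "e"
```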

When grouping by all columns in a data.table, .SD is empty

I'm having trouble getting consistent output in data.table using consistent syntax. See example below
library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# data.table shown below
# x y
1: 1 1
2: 1 1
3: 2 2
4: 2 2
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y
When all columns are used for grouping in by, .SD is empty, so an empty data.table is returned.
When there is another column, .SD contains the columns not being grouped by, and the correct output is returned.
d[, if(.N>1) .SD else NULL, by = x]
# returns
x y
1: 1 1
2: 1 1
3: 2 2
4: 2 2
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
x y t
1: 1 1 1
2: 1 1 2
3: 2 2 3
4: 2 2 4
I'm trying to write code that returns rows appearing more than once, and that works both when the by columns consist of all columns in the data.table and when they do not. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns get repeated in the output:
d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
x y x y
1: 1 1 1 1
2: 1 1 1 1
3: 2 2 2 2
4: 2 2 2 2
Is there a way to make it so d[, if(.N > 1) .SD else NULL, by = colnames] returns the desired output independent of whether the column names grouped by consist of all columns in 'd'? Or do I need to use an if statement and break up the 2 cases?
Here's one approach
setkey(d,x,y)
dnew <- d[d[,.N>1,by=key(d)][(V1),key(d),with=FALSE]]
This
1. sets (x, y) as the key;
2. identifies which (x, y) groups satisfy the criterion; and then
3. selects those groups from d.
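An alternative sketch using row indices instead of .SD, which sidesteps the empty-.SD issue because no columns are removed from the subset:

```r
library(data.table)
d <- data.table(x = c(1, 1, 2, 2, 3), y = c(1, 1, 2, 2, 3))

# .I[.N > 1] returns all row indices of a group when it has > 1 row, none otherwise
dup <- d[d[, .I[.N > 1], by = .(x, y)]$V1]
dup
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2
```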
