I'm having trouble getting consistent output from data.table when using the same syntax. See the example below.
library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# data.table shown below
# x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y
When all columns are used for grouping in by, .SD is empty, so an empty data.table is returned.
When the table has a column that is not in by, .SD contains the non-grouping columns and the correct output is returned.
d[, if(.N>1) .SD else NULL, by = x]
# returns
x y
1: 1 1
2: 1 1
3: 2 2
4: 2 2
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
x y t
1: 1 1 1
2: 1 1 2
3: 2 2 3
4: 2 2 4
I'm trying to write code that returns rows appearing more than once, and that works both when the by columns consist of all columns in the data.table and when they are a proper subset. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns get repeated in the output:
d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
x y x y
1: 1 1 1 1
2: 1 1 1 1
3: 2 2 2 2
4: 2 2 2 2
Is there a way to make d[, if (.N > 1) .SD else NULL, by = colnames] return the desired output regardless of whether the grouping columns consist of all columns in d? Or do I need an if statement to handle the two cases separately?
Here's one approach:
setkey(d, x, y)
dnew <- d[d[, .N > 1, by = key(d)][(V1), key(d), with = FALSE]]
This
sets (x,y) to a key;
identifies which (x,y) groups satisfy the criterion; and then
selects those groups from d.
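The same selection can also be written in one step with .I, which returns row indices in the original table and so sidesteps the empty-.SD issue entirely. A minimal sketch (the helper name dup_groups is made up for illustration):

```r
library(data.table)

# Hypothetical helper: return all rows belonging to groups of `cols`
# that occur more than once. Because .I indexes the original table,
# it works whether or not `cols` covers every column of d.
dup_groups <- function(d, cols) {
  d[d[, .I[.N > 1], by = cols]$V1]
}

d <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2))
dup_groups(d, c("x", "y"))   # all four rows, even with no extra columns

d2 <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2), t = 1:4)
dup_groups(d2, c("x", "y"))  # all four rows, with t retained
```

Note that `.I[.N > 1]` relies on logical recycling: a single TRUE keeps all indices of the group, a single FALSE keeps none.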
Related
Let's say I have the below data.table:
DT <- data.table(x=rep(c(1,2),3),y=rep(1,6),z=rep(2,6))
DT
x y z
1: 1 1 2
2: 2 1 2
3: 1 1 2
4: 2 1 2
5: 1 1 2
6: 2 1 2
and I am calculating the rowMeans across columns y and z. Can I use cbind() if the column names are stored in a character vector?
colN = c('y','z')
I know that DT[, meanYZ := rowMeans(cbind(y,z))] works. But is there a way to make this work with colN? So, like --
DT[, meanYZ := rowMeans(cbind(colN))]
The preferred option is to use .SDcols to specify the columns of interest, then apply rowMeans to .SD (the Subset of Data.table):
library(data.table)
DT[, meanYZ := rowMeans(.SD, na.rm = TRUE), .SDcols = colN]
It can also be done with mget to return a list, cbind the list with do.call, and apply rowMeans:
DT[, meanYZ := rowMeans(do.call(cbind, mget(colN)), na.rm = TRUE)]
cbind(y, z) works because y and z are unquoted, while the elements of colN are the strings "y" and "z". These need to be either converted to symbols (as.name) and evaluated, or looked up with get/mget, which return the column values by searching in the environment of the dataset.
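As a further alternative (assuming data.table >= 1.10.2), the `..` prefix tells data.table that colN refers to a variable in the calling scope rather than a column name:

```r
library(data.table)
DT <- data.table(x = rep(c(1, 2), 3), y = rep(1, 6), z = rep(2, 6))
colN <- c("y", "z")

# DT[, ..colN] selects the columns named in colN, so rowMeans can be
# applied directly to the resulting two-column data.table:
DT[, meanYZ := rowMeans(DT[, ..colN], na.rm = TRUE)]
```

This keeps the whole expression in data.table syntax without do.call or mget.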
Perhaps you can subset the columns in colN from DT first and then compute a new column of row means, e.g.,
DT[, MeanYZ := rowMeans(.SD[, colN, with = FALSE])]
such that
x y z MeanYZ
1: 1 1 2 1.5
2: 2 1 2 1.5
3: 1 1 2 1.5
4: 2 1 2 1.5
5: 1 1 2 1.5
6: 2 1 2 1.5
When grouping by an expression involving a column (e.g. DT[, .SD[c(1, .N)], by = expression(col)]), I want to keep the value of col in .SD.
For example, in the following I group by the remainder of a divided by 3, and keep the first and last observation in each group. However, a is no longer present in .SD:
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row indices, extract that column ($V1), and subset the original dataset:
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' differ from the question because no seed was set.
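With a seed set, the result of the .I approach can be checked reproducibly; the extracted a values match the desired output in the question:

```r
library(data.table)
set.seed(1)

f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))

# .I gives row positions in Q; taking the first and last position per
# group and then subsetting Q keeps the original column a in the result.
idx <- Q[, .I[c(1, .N)], by = f(a)]$V1
Q[idx]   # rows with a = 1, 19, 2, 20, 3, 18
```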
I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) {  # these are the *by* columns
  for (i in 1:4) {  # these are the columns whose negative values I'm counting
    n <- paste("neg", names(dt[, i, with = FALSE]), "count", "by", names(dt[, k, with = FALSE]), sep = "_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with = FALSE])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and need to be rbound to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts for the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient; there should be a better way, especially since my dataset has several e- and f-like columns.
If there are only two grouping columns, we could do an rbindlist after grouping by each of them separately:
rbindlist(list(dt[, lapply(.SD, function(x) sum(x < 0)), .(e), .SDcols = a:d],
               dt[, lapply(.SD, function(x) sum(x < 0)), .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
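The melted route can be sanity-checked against counts computed directly from dt (same dt and seed as in the question):

```r
library(data.table)
set.seed(43)
dt <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10),
                 e = sample(c("x", "y"), 10, replace = TRUE),
                 f = sample(c("t", "s"), 10, replace = TRUE))
cols <- c("a", "b", "c", "d")

# Melting with id.vars = cols stacks e and f into a single `value`
# column, so one grouped aggregation covers both grouping variables:
res <- melt(dt, id.vars = cols)[,
  lapply(.SD, function(x) sum(x < 0)), by = value, .SDcols = cols]

# Each entry of res should equal the count computed directly from dt,
# e.g. res[value == "x", a] versus dt[e == "x", sum(a < 0)].
res
```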
Suppose I have a data.table:
x <- data.table(x=runif(3), group=factor(c('a','b','a'), levels=c('a','b','c')))
I want to know how many rows in x exist for each group:
x[, .N, by="group"]
# group N
# 1: a 2
# 2: b 1
Question: is there some way to force the above by="group" to consider all levels of the factor group?
Notice how, since I don't have any rows with group 'c' in the table, I don't get a row for c.
Desired output:
x[, .N, by="group", ???] # somehow use all levels in `group`
# group N
# 1: a 2
# 2: b 1
# 3: c 0
If you are willing to run through the factor levels by enumerating them in i (rather than by setting by="group"), this will get you the hoped-for result:
setkey(x, "group")
x[levels(group), .N, by=.EACHI]
# group N
# 1: a 2
# 2: b 1
# 3: c 0
With the on= argument (introduced in data.table CRAN version 1.9.6, released 19 Sep 2015) we can achieve the same result as in Josh's answer without setting keys:
x[.(group = levels(group)), on = .(group), .N, by = .EACHI]
group N
1: a 2
2: b 1
3: c 0
or, shorter but less self-explanatory:
x[.(levels(group)), on = .(group = V1), .N, by = .EACHI]
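The key point in both join-based forms is that by = .EACHI computes .N per i-row, and an unmatched level yields .N = 0 rather than being dropped. A minimal check:

```r
library(data.table)
x <- data.table(x = runif(3),
                group = factor(c("a", "b", "a"), levels = c("a", "b", "c")))

# Join the full set of levels onto x; .EACHI aggregates once per i-row,
# so the unobserved level "c" still produces a row with N = 0.
res <- x[.(group = levels(group)), on = "group", .N, by = .EACHI]
res
```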
I have a data.table and want to apply a function to each column, by group.
Normally one would do this as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single atomic value but a vector.
Is there a way to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1 = letters[sample(x = 2L, size = 6, replace = TRUE)],
                 x2 = letters[sample(x = 2L, size = 6, replace = TRUE)],
                 y = rep(1:2, 3), key = "y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance. Also, it would be fine if the result of the function had to have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
If you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present in each group
(i.e., if x2 = c("a", "b") when y == 1, but x2 = c("b", "b") when y == 2),
then the above breaks.
The solution is to make the columns factors before counting.
DT[, c("x1", "x2") := lapply(.SD, factor), .SDcols = c("x1", "x2")]
## OR
columnsToConvert <- c("x1", "x2")  # or .. <- setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols = columnsToConvert], y = DT[, y])
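A quick check of the failure mode and the fix, using constructed data in which x2 has different observed values per group:

```r
library(data.table)

# y == 1 observes x2 = a, b, a; y == 2 observes only b. Without factors,
# table() would return a different number of counts per group and the
# grouped as.list(unlist(...)) step would not line up.
DT <- data.table(x1 = rep("a", 6),
                 x2 = c("a", "b", "a", "b", "b", "b"),
                 y  = rep(1:2, each = 3), key = "y")
DT[, c("x1", "x2") := lapply(.SD, factor), .SDcols = c("x1", "x2")]

DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by = y]
DTCounts   # x2.a is 0 for y == 2 instead of the column being dropped
```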