How do I use variable column names on the RHS of := operations? For example, given this data.table "dt", I'd like to create two new columns, "first_y" and "first_z" that contains the first observation of the given column for the values of "x".
dt <- data.table(x = c("one", "one", "two", "two", "three"),
                 y = c("a", "b", "c", "d", "e"),
                 z = c(1, 2, 3, 4, 5))
dt
x y z
1: one a 1
2: one b 2
3: two c 3
4: two d 4
5: three e 5
Here's how you would do it without variable column names.
dt[, c("first_y", "first_z") := .(first(y), first(z)), by = x]
dt
x y z first_y first_z
1: one a 1 a 1
2: one b 2 a 1
3: two c 3 c 3
4: two d 4 c 3
5: three e 5 e 5
But how would I do this if the "y" and "z" column names are dynamically stored in a variable?
cols <- c("y", "z")
# This doesn't work
dt[, (paste0("first_", cols)) := .(first(cols)), by = x]
# Nor does this
q <- quote(first(as.name(cols[1])))
p <- quote(first(as.name(cols[2])))
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]
I've tried numerous other combinations of quote() and eval() and as.name() without success. The LHS of the operation appears to be working as intended and is documented in many places, but I can't find anything about using a variable column name on the RHS. Thanks in advance.
I'm not familiar with the first function (although it looks like something Hadley would define).
dt[, paste0("first_", cols) := lapply(.SD, head, n = 1L),
   by = x, .SDcols = cols]
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
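As an aside, recent versions of data.table export a first() helper themselves (and dplyr defines one too), so this variant should work as well (assuming a reasonably current data.table):
dt[, paste0("first_", cols) := lapply(.SD, first), by = x, .SDcols = cols]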
The .SDcols answer is fine for this case, but you can also just use get:
dt[, paste0("first_", cols) := lapply(cols, function(x) get(x)[1]), by = x]
dt
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
Another alternative is the vectorized version, mget: mget(cols) fetches the columns as a list, setDT() turns that list into a data.table, and [1] takes its first row within each group:
dt[, paste0("first_", cols) := setDT(mget(cols))[1], by = x]
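As for why the quote() attempts in the question fail: quote(first(as.name(cols[1]))) captures the literal call first(as.name(cols[1])) rather than first(y). Building each call with the column symbol already substituted, e.g. via call(), should make the eval() route work (an untested sketch):
q <- call("first", as.name(cols[1]))  # builds the call first(y)
p <- call("first", as.name(cols[2]))  # builds the call first(z)
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]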
I can never get "get" or "eval" to work on the RHS when trying to do mathematical operations. Try this if you need to.
Thing_dt[, c(new_col) := Thing_dt[[oldcol1]] * Thing_dt[[oldcol2]]]
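For illustration, here is that pattern applied to the dt from the first question; new_col, oldcol1 and oldcol2 are hypothetical names, and note that this computes over whole columns, so it does not combine with by= grouping:
oldcol1 <- "z"; oldcol2 <- "z"; new_col <- "z_squared"  # hypothetical names
dt[, c(new_col) := dt[[oldcol1]] * dt[[oldcol2]]]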
When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row indices, extract that column ($V1), and subset the original dataset:
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' will differ from the question's, as there was no set.seed.
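To make the example reproducible, one could seed the RNG before building Q (the seed value here is arbitrary):
set.seed(42)
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
f <- function(x) x %% 3
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]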
I have two data.tables:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
right_table <- data.table(record = sample(LETTERS, 9))
I would like to replace the numeric entries in left_table with the values on the corresponding row numbers of right_table, e.g., all instances of 4 in left_table are replaced by whatever letter (or set of characters, in my real data) is on row 4 of right_table, and so on.
I have this solution, but it feels a bit cumbersome; surely a simpler solution is possible?
right_table <- data.table(row_n = as.character(seq_along(1:9)), right_table)

for (i in seq_along(left_table)) {
  cols <- colnames(left_table)
  current_col <- cols[i]
  # convert numbers to character to allow := to work for matching records
  left_table[, (current_col) := lapply(.SD, as.character), .SDcols = current_col]
  #right_table[, (current_col) := lapply(.SD, as.character), .SDcols = current_col]
  # set keys for quick joins
  setkeyv(left_table, current_col)
  setkeyv(right_table, "row_n")
  # replace matching records
  left_table[right_table, (current_col) := record]
}
You can create the new columns by fetching the letters from right_table, using the original values as row indices (indices beyond nrow(right_table), here 10 and 11, return NA):
left_table[, c("newa", "newb", "newc") :=
             .(right_table[a, record], right_table[b, record], right_table[c, record])]
# a b c newa newb newc
# 1: 1 4 8 Y A R
# 2: 2 5 9 D B W
# 3: 3 6 10 G K <NA>
# 4: 4 7 11 A N <NA>
Edit:
To make it more generic:
columnNames <- names(left_table)
left_table[, (columnNames) :=
             lapply(columnNames, function(x) right_table[left_table[, get(x)], record])]
There is probably a better way to do this without needing to call left_table inside lapply(); one possibility is sketched below.
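One such variant (an untested sketch) pulls the columns from .SD instead, so left_table is never referenced inside lapply(); as above, out-of-range indices yield NA:
left_table[, (columnNames) := lapply(.SD, function(v) right_table$record[v]),
           .SDcols = columnNames]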
Using mapvalues from plyr:
library(plyr)
corresp <- function(x) mapvalues(x,seq(right_table$record),right_table$record)
left_table[,c(names(left_table)) := lapply(.SD,corresp),.SDcols = names(left_table)]
a b c
1: N K X
2: U Q V
3: Z I 10
4: K G 11
Here is my attempt. When we replace the numeric values with character values we get NAs, as seen in some other answers, so I took another route. First, I created a vector using unlist(). Then I used fifelse() from the data.table package, with foo serving as indices: numbers within range are replaced with characters, while out-of-range numbers (i.e., 10 and 11 in the sample data) are simply converted to character. Finally, I created a matrix, converted it to a data.table object, and assigned the column names.
library(data.table)
foo <- unlist(left_table)
temp <- fifelse(test = foo <= nrow(right_table),
                yes = right_table$record[foo],
                no = as.character(foo))
res <- as.data.table(matrix(data = temp, nrow = nrow(left_table)))
setnames(res, names(left_table))
# a b c
#1: B G J
#2: Y D I
#3: P T 10
#4: G S 11
I think it might be easier to just keep record as a vector and access it via indexing; fcoalesce() then falls back to the original number (as character) whenever the index is out of range:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
# a b c
#1: 1 4 8
#2: 2 5 9
#3: 3 6 10
#4: 4 7 11
set.seed(0L)
right_table <- data.table(record = sample(LETTERS, 9))
record <- right_table$record
#[1] "N" "Y" "D" "G" "A" "B" "K" "Z" "R"
left_table[, names(left_table) := lapply(.SD, function(k) fcoalesce(record[k], as.character(k)))]
left_table
# a b c
# 1: N G Z
# 2: Y A R
# 3: D B 10
# 4: G K 11
I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) {  # these are the *by* columns
  for (i in 1:4) {  # these are the columns whose negative values I'm counting
    n <- paste("neg", names(dt[, i, with = FALSE]), "count", "by", names(dt[, k, with = FALSE]), sep = "_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with = FALSE])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and the two then need an rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there are only two grouping columns, we could do an rbindlist after grouping by them separately:
rbindlist(list(dt[, lapply(.SD, function(x) sum(x < 0)), .(e), .SDcols = a:d],
               dt[, lapply(.SD, function(x) sum(x < 0)), .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names:
rbindlist(lapply(c('e', 'f'), function(x)
  dt[, lapply(.SD, function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
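data.table also ships groupingsets(), which computes several groupings in one call and may be a natural fit here (a sketch, assuming data.table >= 1.10.5; the non-grouping column shows up as NA in each block of rows):
cols <- c("a", "b", "c", "d")
groupingsets(dt, j = lapply(.SD, function(x) sum(x < 0)),
             by = c("e", "f"), sets = list("e", "f"), .SDcols = cols)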
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
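One caveat with grouping by value alone: if e and f ever shared a level, their counts would be pooled. Grouping by the melt-generated variable column as well keeps them apart (a sketch):
melt(dt, id.vars = cols)[,
  lapply(.SD, function(x) sum(x < 0)), by = .(variable, value), .SDcols = cols]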
Say we have this toy data.table example:
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
"V" "GR"
A 1
B 1
C 1
D 2
A 2
I would like to generate all ordered combinations with combn within each subset defined by GR, and build from them a new data.table with an additional column holding the grouping factor.
For example, for GR=1 we have (A,B),(A,C),(B,C)
for GR=2 we have (D,A)
If I create the result manually it would be
cbind(V=c(1,1,1,2),rbind(t(combn(c("A", "B", "C"),2)),t(combn(c( "D","A"),2))))
1 A B
1 A C
1 B C
2 D A
But I would like to do it with data.table easily instead.
These two options don't work:
temp[,cbind(rep(.GRP,.N),as.data.frame(t(combn(V,2)))),by=GR]
temp[,cbind(rep(.BY,.N),as.data.frame(t(combn(V,2)))),by=GR]
This one works, but I don't understand why. I'm afraid it could copy the whole GR vector as-is instead of the proper value.
temp[,.(GR,as.list(as.data.frame((combn(V,2))))),by=GR]
And I guess there should be a shorter way to write it.
This works:
temp[, {v_comb = combn(V, 2); .(v_comb[1,], v_comb[2,])}, by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
In general, I would avoid, whenever possible, reshaping operations within the data.table using cbind(), rep(), as.data.frame() or t()... It takes many trials and errors to figure out the right way, and it produces code that is very hard to maintain.
On the other hand, using code blocks {...} improves the readability of the code.
This uses data.table, though not everything happens inside [] via .BY or .GRP.
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
tempfunc <- function(x) {
  dat <- as.data.table(t(combn(temp[GR == x, V], 2)))
  dat[, GR := x]
  setcolorder(dat, c("GR", "V1", "V2"))
  dat[]
}
rbindlist(lapply(unique(temp$GR), tempfunc))
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Here are two other approaches which also work if there is a group with just one row, e.g., row 6 below:
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A","E"), GR=c(1,1,1,2,2,3))
temp
V GR
1: A 1
2: B 1
3: C 1
4: D 2
5: A 2
6: E 3
Using combinat::combn2:
temp[, as.data.table(combinat::combn2(V)), by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Using a non-equi join (each row is matched with the rows of the same group whose V sorts strictly before it, which yields all the pairs):
temp[, V := factor(V)][temp, on = .(GR, V < V), .(GR, x.V, i.V),
                       nomatch = 0L, allow = TRUE]
GR x.V i.V
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 A D
I have one solution but it seems long and complex too.
temp[,do.call(c, apply(t(combn(V,2)), 2, list)),by=GR]
I've also found that combn is about 10 times slower than specialized packages such as iterpc or combinat:
temp[,do.call(c, apply(combn2(V), 2, list)),by=GR]
You must also first filter out any group having just one row because otherwise it would cause an error.
And this is my final version, which is much faster and needs much less memory:
temp[, .(from = rep(V, (.N - 1):0),
         to = V[unlist(sapply(2:.N, seq, .N, simplify = TRUE))]), by = GR]
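Unpacking that one-liner, here is a commented sketch of the same idea (and, unlike the sapply(2:.N, ...) form, seq_len(.N - 1) degrades gracefully when a group has a single row):
temp[, {
  # repeat each V once for every element that comes after it within the group
  from <- rep(V, (.N - 1):0)
  # for each position i, index the elements at positions (i + 1) .. .N
  idx <- unlist(lapply(seq_len(.N - 1), function(i) seq(i + 1, .N)))
  .(from = from, to = V[idx])
}, by = GR]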
I had a data.table like this:
library(data.table)
dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))
I needed to perform an operation (forecast) on the values for each a, so I decided to put them in a list, like this:
dt <- dt[, x := .(list(b)), by = a][, .SD[1,], by = a, .SDcols = "x"]
Now I wanted to "melt" (that's the thing that comes to mind) dt back into its original form.
I could do it for very few levels of a like this:
dt2 <- rbind(expand.grid(dt[1, a], dt[1, x[[1]]]), expand.grid(dt[2, a], dt[2, x[[1]]]))
but of course, the solution is impractical for more levels of a.
I've tried
dt2 <- dt[, expand.grid(a, x[[1]]), by = a]
which results in
dt2
## a Var1 Var2
## 1: A A 1
## 2: A A 3
## 3: A A 5
## 4: B A 2
## 5: B A 4
## 6: B A 6
It's interesting to note that Var1 doesn't actually follow the expected "A - B" pattern (but at least a remains).
Is there a better approach to achieve this?
EDITS
Expected output will be the result of
dt2[, .(a, Var2)]
Corrected "melt" for "dcast".
You are looking for a way to nest (convert a column from an atomic vector type to a list type) and unnest (the opposite direction) in a data.table way. This is different from reshaping data, which either spreads column values into row headers (dcast) or gathers row headers into column values (melt).
In data.table syntax, you can use list and unlist on the target column to summarize or broadcast it along with group variables:
Say we are starting from:
dt
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
To reproduce what you achieved in your first step, i.e., nesting column b, you can do:
dt_nest <- dt[, .(b = list(b)), a]
dt_nest
# a b
# 1: A 1,3,5
# 2: B 2,4,6
To go the opposite direction, use unlist with the group variable:
dt_nest[, .(b = unlist(b)), a]
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
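If the nested table carried several list columns of equal lengths, the same idea extends via .SD (a sketch; b10 is a made-up second list column):
dt_nest2 <- dt[, .(b = list(b), b10 = list(b * 10)), by = a]
dt_nest2[, lapply(.SD, unlist), by = a]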