Melt data.table according to nested list - r

I had a data.table like this:
library(data.table)
dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))
I needed to perform an operation (forecast) on the values for each a, so I decided to put them in a list, like this:
dt <- dt[, x := .(list(b)), by = a][, .SD[1,], by = a, .SDcols = "x"]
Now I wanted to "melt" (that's the thing that comes to mind) dt back into its original form.
I could do it for very few levels of a like this:
dt2 <- rbind(expand.grid(dt[1, a], dt[1, x[[1]]]), expand.grid(dt[2, a], dt[2, x[[1]]]))
but of course, the solution is impractical for more levels of a.
I've tried
dt2 <- dt[, expand.grid(a, x[[1]]), by = a]
which results in
dt2
## a Var1 Var2
## 1: A A 1
## 2: A A 3
## 3: A A 5
## 4: B A 2
## 5: B A 4
## 6: B A 6
It's interesting to notice that Var1 doesn't actually follow the expected "A - B" pattern (but at least a remains).
Is there a better approach to achieve this?
EDITS
Expected output will be the result of
dt2[, .(a, Var2)]
Corrected "melt" for "dcast".

You are looking for a way to nest (convert a column from an atomic vector to a list column) and unnest (the opposite direction) in data.table. This is different from reshaping, which either spreads a column's values into the header row (dcast) or gathers the headers into a column's values (melt).
In data.table syntax, you can use list and unlist on the target column, grouped by the other variables, to collapse or expand it.
Suppose we start from:
dt
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
To repeat what you have achieved in your first step, i.e. nest column b, you can do:
dt_nest <- dt[, .(b = list(b)), a]
dt_nest
# a b
# 1: A 1,3,5
# 2: B 2,4,6
To go the opposite direction, use unlist with the group variable:
dt_nest[, .(b = unlist(b)), a]
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
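Since the goal in the question was to run an operation on each group's values, a minimal sketch of the full round trip may help. It assumes a stand-in summary (sum, stored in a hypothetical column named total) in place of the forecast, which the question does not show.

```r
library(data.table)

dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))

# Nest: one row per level of a, with b stored as a list column
dt_nest <- dt[, .(b = list(b)), by = a]

# Stand-in for the per-group operation (hypothetical; the real forecast is not shown)
dt_nest[, total := sapply(b, sum)]

# Unnest: expand the list column again, carrying the per-group result along
dt_long <- dt_nest[, .(b = unlist(b)), by = .(a, total)]
dt_long
```

Putting the per-group result in by is one way to keep it next to the expanded values; it only works because total is constant within each level of a.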

Related

How to check if values in individual rows of a data.table are identical

Suppose I have the following data.table:
dt <- data.table(a = 1:2, b = 1:2, c = c(1, 1))
# dt
# a b c
# 1: 1 1 1
# 2: 2 2 1
What would be the fastest way to create a fourth column d indicating that the preexisting values in each row are all identical, so that the resulting data.table will look like the following?
# dt
# a b c d
# 1: 1 1 1 identical
# 2: 2 2 1 not_identical
I want to avoid using the duplicated function and would rather stick to identical or a similar function, even if it means iterating through the items within each row.
uniqueN can be applied grouped by row; comparing its result with 1 gives a logical value that picks the label:
library(data.table)
dt[, d := c("not_identical", "identical")[(uniqueN(unlist(.SD)) == 1) + 1],
   by = 1:nrow(dt)]
Output:
dt
# a b c d
#1: 1 1 1 identical
#2: 2 2 1 not_identical
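Another, likely more efficient, approach is to compare every column with the first one and use rowSums to flag rows with any mismatch. The comparison has to be parenthesized, because + binds tighter than >, and it should be run on the original three columns, before d exists:
dt[, d := c("identical", "not_identical")[1 + (rowSums(.SD[[1]] != .SD) > 0)]]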
Here is another data.table option using var (a row whose values are all equal has zero variance):
dt[, d := ifelse(var(unlist(.SD)) == 0, "identical", "not_identical"), seq(nrow(dt))]
which gives
> dt
a b c d
1: 1 1 1 identical
2: 2 2 1 not_identical
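The answers above all rely on .SD, which also picks up d once it has been added. As a hedged alternative, here is a minimal sketch that names the value columns explicitly and compares each one with the first. The names value_cols, first_col and all_same are made up for illustration, and the sketch assumes the columns contain no NA values.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 1:2, c = c(1, 1))

# Name the value columns explicitly so a previously added d column is never compared
value_cols <- c("a", "b", "c")
first_col <- dt[[value_cols[1]]]

# TRUE for a row only if every value column equals the first one in that row
all_same <- Reduce(`&`, lapply(value_cols, function(nm) dt[[nm]] == first_col))

dt[, d := fifelse(all_same, "identical", "not_identical")]
dt
```

Because the comparison is vectorized over whole columns, it avoids grouping by row, which tends to matter on larger tables.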

Replace values in a data.table based on row values in another table

I have two data.tables:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
right_table <- data.table(record = sample(LETTERS, 9))
I would like to replace the numeric entries in left_table with the values in the corresponding rows of right_table. For example, all instances of 4 in left_table are replaced by whatever letter (or set of characters in my real data) is on row 4 of right_table, and so on.
I have this solution but I feel it's a bit cumbersome and a simpler solution must be possible?
right_table <- data.table(row_n = as.character(seq_along(1:9)), right_table)
for (i in seq_along(left_table)){
cols <- colnames(left_table)
current_col <- cols[i]
# convert numbers to character to allow := to work for matching records
left_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
#right_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
#set key for quick joins
setkeyv(left_table, current_col)
setkeyv(right_table, "row_n")
# replace matching records
left_table[right_table, (current_col) := record]
}
You can create the new columns fetching the letters from right_table using the original variables.
left_table[, c("newa","newb","newc") :=
.(right_table[a,record],right_table[b,record],right_table[c,record])]
# a b c newa newb newc
# 1: 1 4 8 Y A R
# 2: 2 5 9 D B W
# 3: 3 6 10 G K <NA>
# 4: 4 7 11 A N <NA>
Edit:
To make it more generic:
columnNames <- names(left_table)
left_table[, (columnNames) :=
lapply(columnNames, function(x) right_table[left_table[,get(x)],record])]
Although there is probably a better way to do this without needing to call left_table inside lapply()
Using mapvalues from plyr:
library(plyr)
corresp <- function(x) mapvalues(x,seq(right_table$record),right_table$record)
left_table[,c(names(left_table)) := lapply(.SD,corresp),.SDcols = names(left_table)]
a b c
1: N K X
2: U Q V
3: Z I 10
4: K G 11
Here is my attempt. As some of the other answers show, replacing the numeric values by indexing produces NAs for numbers that have no matching row in right_table, so I took a different route. First, I flattened left_table into a vector, foo, with unlist(). Then I used fifelse() from the data.table package: where a value of foo is a valid row number, it is used as an index into right_table$record; otherwise the number is kept, converted to character (10 and 11 in the sample data). Finally, I reshaped the result into a matrix, converted it to a data.table object, and assigned the column names.
library(data.table)
foo <- unlist(left_table)
temp <- fifelse(test = foo <= nrow(right_table),
yes = right_table$record[foo],
no = as.character(foo))
res <- as.data.table(matrix(data = temp, nrow = nrow(left_table)))
setnames(res, names(left_table))
# a b c
#1: B G J
#2: Y D I
#3: P T 10
#4: G S 11
I think it might be easier to just keep record as a vector and access it via indexing:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
# a b c
#1: 1 4 8
#2: 2 5 9
#3: 3 6 10
#4: 4 7 11
set.seed(0L)
right_table <- data.table(record = sample(LETTERS, 9))
record <- right_table$record
#[1] "N" "Y" "D" "G" "A" "B" "K" "Z" "R"
left_table[, names(left_table) := lapply(.SD, function(k) fcoalesce(record[k], as.character(k)))]
left_table
# a b c
# 1: N G Z
# 2: Y A R
# 3: D B 10
# 4: G K 11
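To make the NA behaviour concrete, here is a small sketch of what happens for the values in column c. It hard-codes the letters printed for record above rather than re-sampling them; the names k, looked_up and filled are only for illustration.

```r
library(data.table)

# Same letters as the record vector printed above
record <- c("N", "Y", "D", "G", "A", "B", "K", "Z", "R")
k <- c(8, 9, 10, 11)

# Positions 10 and 11 are past the end of record, so indexing returns NA for them
looked_up <- record[k]

# fcoalesce() fills each NA with the original number, converted to character
filled <- fcoalesce(looked_up, as.character(k))
filled
```

Both arguments to fcoalesce() need to have the same type, which is why k is converted with as.character() first.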

Using .BY, .GRP or other methods to add a multicolumn aggregation with data.table

Say we have this toy data.table example:
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
"V" "GR"
A 1
B 1
C 1
D 2
A 2
I would like to generate all ordered combinations with combn within each subset defined by GR, and collect them in a new data.table with an extra column holding the grouping factor.
For example, for GR=1 we have (A,B),(A,C),(B,C)
for GR=2 we have (D,A)
If I create the result manually it would be
cbind(V=c(1,1,1,2),rbind(t(combn(c("A", "B", "C"),2)),t(combn(c( "D","A"),2))))
1 A B
1 A C
1 B C
2 D A
But I would like to do it with data.table easily instead.
These two options don't work:
temp[,cbind(rep(.GRP,.N),as.data.frame(t(combn(V,2)))),by=GR]
temp[,cbind(rep(.BY,.N),as.data.frame(t(combn(V,2)))),by=GR]
This one works, but I don't understand why. I'm afraid it could copy the whole GR vector as is instead of the proper value.
temp[,.(GR,as.list(as.data.frame((combn(V,2))))),by=GR]
And I guess there should be a shorter way to write it.
This works:
> temp[, {v_comb = combn(V,2); .(v_comb[1,], v_comb[2,])}, by=GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
In general, I would avoid, where possible, reshaping inside data.table with cbind(), rep(), as.data.frame() or t(). It takes a lot of trial and error to get right, and it produces code that is very hard to maintain.
On the other hand, using code blocks {...} improves the readability of the code.
This uses data.table, though not all within [] using .BY or .GRP.
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
tempfunc <- function(x){
dat <- as.data.table(t(combn(temp[GR == x, V], 2)))
dat[, GR := x]
setcolorder(dat, c("GR", "V1", "V2"))
dat[]
}
rbindlist(lapply(unique(temp$GR), tempfunc))
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Here are two other approaches which also work if there is a group with just one row, e.g., row 6 below:
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A","E"), GR=c(1,1,1,2,2,3))
temp
V GR
1: A 1
2: B 1
3: C 1
4: D 2
5: A 2
6: E 3
Using combinat::combn2
temp[, as.data.table(combinat::combn2(V)), by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Using a non-equi join
temp[, V := factor(V)][temp, on = .(GR, V < V), .(GR, x.V, i.V),
nomatch = 0L, allow = TRUE]
GR x.V i.V
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 A D
I have one solution but it seems long and complex too.
temp[,do.call(c, apply(t(combn(V,2)), 2, list)),by=GR]
I've also found that combn is 10 times slower than the equivalents in specialized packages such as iterpc or combinat:
temp[,do.call(c, apply(combinat::combn2(V), 2, list)),by=GR]
You must first filter out any group that has just one row, because otherwise it causes an error.
And this is my final version, which is much faster and needs much less memory:
temp[,.(from=rep(V,(.N-1):0),to=V[unlist(sapply(2:.N, seq, .N, simplify = TRUE))]), by=GR]
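As a sketch of the filtering step mentioned above, the guard can also live inside j. This builds on the code-block answer with combn and assumes the six-row temp that includes the single-row group GR = 3; the names pairs and v_comb are illustrative.

```r
library(data.table)

temp <- data.table(V = c("A", "B", "C", "D", "A", "E"), GR = c(1, 1, 1, 2, 2, 3))

# For a single-row group the if has no else branch, so j returns NULL
# and data.table leaves that group out of the result
pairs <- temp[, if (.N > 1L) {
  v_comb <- combn(V, 2)
  .(V1 = v_comb[1, ], V2 = v_comb[2, ])
}, by = GR]
pairs
```

Without the guard, combn() would be asked for pairs from a single element in group 3 and would stop with an error.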

R - Data.table - Using variable column names in RHS operations

How do I use variable column names on the RHS of := operations? For example, given this data.table "dt", I'd like to create two new columns, "first_y" and "first_z", that contain the first observation of the given column for each value of "x".
dt <- data.table(x = c("one","one","two","two","three"),
y = c("a", "b", "c", "d", "e"),
z = c(1, 2, 3, 4, 5))
dt
x y z
1: one a 1
2: one b 2
3: two c 3
4: two d 4
5: three e 5
Here's how you would do it without variable column names.
dt[, c("first_y", "first_z") := .(first(y), first(z)), by = x]
dt
x y z first_y first_z
1: one a 1 a 1
2: one b 2 a 1
3: two c 3 c 3
4: two d 4 c 3
5: three e 5 e 5
But how would I do this if the "y" and "z" column names are dynamically stored in a variable?
cols <- c("y", "z")
# This doesn't work
dt[, (paste0("first_", cols)) := .(first(cols)), by = x]
# Nor does this
q <- quote(first(as.name(cols[1])))
p <- quote(first(as.name(cols[2])))
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]
I've tried numerous other combinations of quote() and eval() and as.name() without success. The LHS of the operation appears to be working as intended and is documented in many places, but I can't find anything about using a variable column name on the RHS. Thanks in advance.
I'm not familiar with the first function (although it looks like something Hadley would define).
dt[, paste0("first_", cols) := lapply(.SD, head, n = 1L),
by = x, .SDcols = cols]
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
The .SDcols answer is fine for this case, but you can also just use get:
dt[, paste0("first_", cols) := lapply(cols, function(x) get(x)[1]), by = x]
dt
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
Another alternative is the vectorized version - mget:
dt[, paste0("first_", cols) := setDT(mget(cols))[1], by = x]
I can never get "get" or "eval" to work on the RHS when trying to do mathematical operations. Try this if you need to.
Thing_dt[, c(new_col) := Thing_dt[[oldcol1]] * Thing_dt[[oldcol2]]]
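For what it's worth, get() does seem to resolve column names in arithmetic on the RHS in a small case like the one below. This is a minimal sketch on made-up data; col1, col2 and new_col are hypothetical variables, and it assumes none of them shares a name with a column of the table.

```r
library(data.table)

dt <- data.table(x = c("one", "one", "two"), z = c(1, 2, 3), w = c(10, 20, 30))

# Hypothetical variables holding column names
col1 <- "z"
col2 <- "w"
new_col <- "z_times_w"

# get() looks each name up among the columns of dt
dt[, (new_col) := get(col1) * get(col2)]

# The same idea with grouping: per-group sum of the column named in col1
dt[, (paste0("sum_", col1)) := sum(get(col1)), by = x]
dt
```

Unlike indexing the whole table with [[ ]], get() inside j sees only the rows of the current group, so it combines with by.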

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by column a, and that the N column is the number of times each value of a occurs. However, I get this result without using nrow(). Is .N not just a column name? I can't find the documentation with ??".N" in R. I tried to use .K, but it doesn't work. What does .N mean?
Think of .N as a variable holding the number of rows: of the whole table, or of the current group when by is used. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
With by = a, as in your example, .N is the number of rows per group; here it is stored in a new column:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
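As a compact summary, the sketch below puts .N in i, in j, and in a grouped j on a fresh copy of the same table. The result names (last_row, n_total and so on) are only for illustration.

```r
library(data.table)

dt <- data.table(a = LETTERS[c(1, 1:3)], b = 4:7)

# In i: .N is the total number of rows, so this selects row 4
last_row <- dt[.N]

# In j without by: the total row count
n_total <- dt[, .N]

# In j with by: the row count of each group
n_by_group <- dt[, .N, by = a]

# Inside a group, .N indexes that group's last row
last_by_group <- dt[, .SD[.N], by = a]
```

The same symbol therefore gives 4 for the whole table and 2, 1, 1 for the groups A, B and C.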
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols

Resources