How can I introduce dcast into a data.table chain without using piping?

data.table chaining is graceful and intuitive: every step lines up like a machine. But sometimes we have to introduce an operation such as dcast or melt in the middle of a chain.
How can I integrate all of the operations into the []? Simply because it's more graceful, I admit.
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))
DT.1 <- DT[,.N, by = .(B,C)] %>% dcast(B~C)
DT.2 <- DT.1[,.N, by = .(NG)]
# NG N
#1: NA 2
#2: 3 2
#same
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))[,.N, by = .(B, C)] %>%
dcast(B~C) %>% .[,.N, by =.(NG)]
Can I remove the %>% and integrate into the []?
Thanks

What about using .SD to this end:
DT[, .N, by = .(B, C)
][, dcast(.SD, B ~ C)
][, .N, by = .(NG)]
NG N
1: NA 2
2: 3 2
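To check that the [] chain really matches the piped version, here is a minimal self-contained sketch. The names counts, piped and chained are mine, and value.var = "N" is spelled out only to avoid dcast's guessing message; treat it as an illustration rather than the one correct way to write it.

```r
library(data.table)

DT <- data.table(A = rep(letters[1:3], 4),
                 B = rep(1:4, 3),
                 C = rep(c("OK", "NG"), 6))

# The piped form from the question, written as plain nested steps (no %>%)
counts <- DT[, .N, by = .(B, C)]
piped  <- dcast(counts, B ~ C, value.var = "N")[, .N, by = .(NG)]

# Pure [] chain: with no `by`, .SD is the whole result of the previous step,
# so dcast() can be called on it inside j
chained <- DT[, .N, by = .(B, C)
            ][, dcast(.SD, B ~ C, value.var = "N")
            ][, .N, by = .(NG)]

print(chained)
```

Both objects hold the same two rows, so the chain is a drop-in replacement for the pipe here.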

Related

Nice way to group data in a `data.table` when the new column name is given as a character vector

In other words, my question is about the j argument to data.table when the name of the new column is a character vector. For example:
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = rnorm(6))
agg_col_name <- 'avg'
grouped_dt <- dt[, .(z = mean(y)), by = x]
setnames(grouped_dt, 'z', agg_col_name)
> grouped_dt
x avg
1: 1 -0.2554987
2: 2 -0.4245852
3: 3 -0.4881073
There should be a more elegant way to do the last two statements as one, yes?
Perhaps this is a question about how to create suitable list for the j argument.
Although it is probably not what you are looking for, you could use setNames inside j, wrapped around .(z = mean(y)).
library(data.table)
dt[, setNames(.(z = mean(y)), agg_col_name), by = x]
Or use setnames after doing the summary:
setnames(dt[, mean(y), by = x], 'V1', agg_col_name)[]
Output
x avg
1: 1 0.5626526
2: 2 0.3549653
3: 3 -0.2861405
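As a quick sanity check, the two one-liners above can be compared directly. This is only a sketch: the seed and the names via_setNames and via_setnames are assumptions of mine, and list() is used in place of .() purely for explicitness.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is reproducible
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = rnorm(6))
agg_col_name <- "avg"

# Name the aggregate inside j
via_setNames <- dt[, setNames(list(mean(y)), agg_col_name), by = x]

# Rename the default V1 column after aggregating
via_setnames <- setnames(dt[, mean(y), by = x], "V1", agg_col_name)

print(via_setNames)
```

Either way the result has the columns x and avg, so the choice is mostly a matter of taste.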
However, as mentioned in the comments, it is easier to do with the dev version of data.table. You can see more about the development of this feature at [programming on data.table #4304](https://github.com/Rdatatable/data.table/pull/4304).
# Latest development version:
data.table::update.dev.pkg()
library(data.table)
dt[, .(z = mean(y)), by = x, env = list(z=agg_col_name)]
# x avg
#1: 1 -0.1640783
#2: 2 0.5375794
#3: 3 0.1539785

How to do a special type of lookup join in R data.table?

Suppose there are two tables in R as under:
library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
They look like:
> dt1
a b
1: p 1,2
2: q 1,2,3
3: r 4,5
> dt2
code desc
1: 1 good
2: 2 better
3: 3 best
4: 4 bad
5: 5 worst
The goal is to join dt1 and dt2 in such a way that the result looks like
> result
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Can anyone show how this type of join can be accomplished in R ?
That's not really a join but as dt1$b contains convoluted values anyway here is my ugly hack:
dt2[, code := as.character(code)]
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Edit:
The replacement has to be done from the longest to the shortest code (string lengths or number of characters) and desc must not contain any digits.
So, with setorder(dt2, -code) added to the code and the new use case provided by the OP in the comment:
dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
Edit 2:
According to OP's comment, the requirement of the ugly hack (no digits in desc) is not fulfilled in the production data. (As almost always happens when a quick & dirty solution meets real-world data :-) ).
So here is a concise data.table solution which does what all the other answers do as well: split column b, join or look up the matching desc, and recombine:
dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
Using OP's new use case
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
Note that grouping uses both columns a and b for two reasons: 1) convenience (to keep both columns in the final result), 2) in case a is not a unique identifier
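For readers who want the three steps spelled out, here is a minimal sketch using OP's new use case. The intermediate name long and the update join in step 2 are my own choices, not part of the chained one-liner above; it assumes code is numeric in dt2, as in that use case.

```r
library(data.table)

dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1, 11, 21, 23, 36),
                  desc = c("good", "better", "best", "bad", "worst"))

# 1) Split: one row per (a, b, single code). b is a grouping column,
#    so inside j it has length one and [[1L]] picks its pieces.
long <- dt1[, .(code = as.numeric(strsplit(b, ",", fixed = TRUE)[[1L]])),
            by = .(a, b)]

# 2) Look up: update join that adds desc to `long` by reference
long[dt2, on = "code", desc := i.desc]

# 3) Recombine: collapse the descriptions back to one string per original row
result <- long[, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
print(result)
```

Because the codes are matched as whole values rather than as substrings, this does not depend on the replacement order or on desc being free of digits.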
The idea is to turn column b into a list of integer vectors and then look up the matching desc values in dt2 with match (if code were always just the row number, you could index dt2$desc directly).
library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>%
.[, b := type.convert(b)]
dt2[dt1, on = c("code == b")] %>%
.[, lapply(.SD, toString), by = "a"]
#> a code desc
#> 1: p 1, 2 good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r 4, 5 bad, worst
Created on 2021-07-27 by the reprex package (v2.0.0)
You can split the string on comma and do a join.
library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
# a desc
# <chr> <chr>
#1 p good, better
#2 q good, better, best
#3 r bad, worst

R data.table - How to modify by reference when using .SD?

So I'm new to data.table and don't understand how I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, that I simply want to choose only the columns which contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation that uses := would do so. Is this accurate? If that's correct, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation dcast that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
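Below is another example, which replaces NA values with 0 in the numeric columns (selected via .SDcols) by reference.
The trick is to build the vector of column names first and put it, wrapped in parentheses, on the left-hand side of :=.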
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
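To make the "by reference" part visible, here is a small sketch that compares address() before and after each update. The tables DT and DT2 and the names keep_cols, drop_cols and num_cols are illustrative assumptions. The first half also shows one way to get what Example 1 asks for, namely keeping only the "group1:" columns, by deleting the others with := NULL instead of selecting with .SD.

```r
library(data.table)

# Example 1: keep only the "group1:" columns without creating a new object
DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
addr_DT <- address(DT)

keep_cols <- grep("group1:", names(DT), value = TRUE)
drop_cols <- setdiff(names(DT), keep_cols)
DT[, (drop_cols) := NULL]            # removes columns in place

# Fill NA with 0 in chosen numeric columns, again in place
DT2 <- data.table(id = 1:4, u = c(1, NA, 3, NA), v = c(NA, 2, NA, 4))
addr_DT2 <- address(DT2)

num_cols <- c("u", "v")
DT2[, (num_cols) := lapply(.SD, nafill, fill = 0), .SDcols = num_cols]

# Same memory address before and after: no copy of the table was made
print(identical(address(DT), addr_DT))
print(identical(address(DT2), addr_DT2))
```

Note that := can add, update or delete columns, but it cannot change the number of rows, which is why a reshaping call such as dcast in Example 2 still has to be assigned to an object.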

data.table update join by group

I have a specific data.table question: is there a way to do an update join but by group ? Let me give an example:
df1 <- data.table(ID = rep(letters[1:3],each = 3),x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]))
> df2
ID
1: a
2: a
3: b
4: c
5: d
6: e
> df1
ID x
1: a 0.9719153
2: a 0.8897171
3: a 0.7067390
4: b 1.2122764
5: b 1.7441528
6: b 1.3389710
7: c 2.8898255
8: c 2.0388562
9: c 2.3025064
I would like to do something like
df2[df1,plouf := sample(i.x),on ="ID"]
But for each ID group, meaning that plouf would be a sample of the x values for each corresponding ID. The above line of code does not work this way; it samples from the whole x vector:
> df2
ID plouf
1: a 1.3099715
2: a 0.8540039
3: b 2.0767138
4: c 0.6530148
5: d NA
6: e NA
You see that the values of plouf are not the x corresponding to the ID group of df1. I would like that the plouf value is between 0 and 1 for a, 1 and 2 for b, and 2 and 3 for c. I want to sample without replacement.
I tried :
df2[df1,plouf := as.numeric(sample(i.x,.N)),on ="ID",by = .EACHI]
which does not work:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This other attempt seems to be working:
df2$plouf <- df2[df1,on ="ID"][,sample(x,df2[ID == ID2,.N]),by = .(ID2 = ID)]$V1
But I find it hard to read or understand, it could be problematic for more than one grouping variable, and I am not sure it is quite efficient. I am sure there is a nice simple way to write it, but I don't have it. Any idea ?
Another option:
df1[df2[, .N, ID], on=.(ID), sample(x, N), by=.EACHI]
output:
ID V1
1: a 0.2655087
2: a 0.3721239
3: b 1.2016819
4: c 2.6607978
5: d NA
6: e NA
data:
library(data.table)
set.seed(0L)
df1 <- data.table(ID = rep(letters[1:3],each = 3),x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]))
Addressing comment:
library(data.table)
set.seed(0L)
df1 <- data.table(ID = rep(letters[1:3],each = 3),
NAME = rep(LETTERS[1:3],each = 3),
x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]),
NAME = c(LETTERS[1],LETTERS[1:5]))
df2[, ri := rowid(ID, NAME)][
df1[df2[, .N, .(ID, NAME)], on=.(ID, NAME), .(ri=1L:N, VAL=sample(x, N)), by=.EACHI],
on=.(ri, ID, NAME), VAL := VAL]
df2
If it is too repetitive to type ID, NAME, you can use
cols <- c("ID", "NAME")
df2[, ri := rowidv(.SD, cols)][
df1[df2[, .N, cols], on=cols, .(ri=1L:N, VAL=sample(x, N)), by=.EACHI],
on=c("ri", cols), VAL := VAL]
df2
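Here is a single-key version of the same idea with a check that each sampled value stays inside its own group. It is only a sketch: the names need and drawn are mine, and I index with sample.int() instead of calling sample(x, N), an assumption-driven choice that sidesteps sample()'s special handling of a length-one numeric x.

```r
library(data.table)

set.seed(0L)
df1 <- data.table(ID = rep(letters[1:3], each = 3),
                  x  = c(runif(3, 0, 1), runif(3, 1, 2), runif(3, 2, 3)))
df2 <- data.table(ID = c(letters[1], letters[1:5]))

# How many draws each ID needs
need <- df2[, .N, by = ID]

# One draw set per ID (by = .EACHI), without replacement.
# For IDs missing from df1, x is a single NA, so the draw is NA.
drawn <- df1[need, on = .(ID),
             .(plouf = x[sample.int(length(x), N)]), by = .EACHI]

# Number the rows within each ID on both sides, then update df2 by reference
df2[, ri := rowid(ID)]
drawn[, ri := rowid(ID)]
df2[drawn, on = .(ID, ri), plouf := i.plouf]
df2[, ri := NULL]
print(df2)
```

Matching on the within-group row number means the result does not rely on the rows of df2 being ordered by ID.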
Sample with replacement
You can do that like this:
df2[, plouf := df1[df2, on = .(ID),
sample(x, size = 1),
by=.EACHI]$V1]
You can join on the ID variable, but you must specify by=.EACHI so that j is evaluated once for each row of df2 and one value is drawn per row. The $V1 extracts the computed column, which data.table names V1 by default.
Result:
ID sample
1: a 0.042188292
2: a 0.002502247
3: b 1.145714600
4: c 2.541768627
5: d NA
6: e NA
Sample without replacement
It's not pretty but it works:
df2[, plouf := NA_real_]
# temporary table with the number of samples required for each group
temp <- df2[, .N, by = ID]
for (id in temp$ID) {
  # candidate values for this group; skip IDs that are absent from df1
  pool <- df1[ID == id, x]
  if (length(pool) == 0L) next
  n <- temp[ID == id, N]
  # draw n values without replacement and assign them by reference
  df2[ID == id, plouf := pool[sample.int(length(pool), n)]]
}
Thanks to @David Arenburg for help

How do I self join a data.table in a manner like dcast

Suppose I have a data.table in "melted" form where I have a key, an identifier and a value
library(data.table)
library(reshape2)
DT = data.table(X = c(1:5, 1:4), Y = c(rep("A", 5), rep("B", 4)), Z = rnorm(9))
DT2 = data.table(dcast(DT, X~Y))
How can I perform that sort of self join inside data.table?
> DT
X Y Z
1: 1 A -0.19790449
2: 2 A 0.17906116
3: 3 A 0.01821837
4: 4 A 0.17309716
5: 5 A 0.05962474
6: 1 B -0.24629468
7: 2 B 0.92285734
8: 3 B 0.66002573
9: 4 B -1.01403880
> DT2
X A B
1: 1 -0.19790449 -0.2462947
2: 2 0.17906116 0.9228573
3: 3 0.01821837 0.6600257
4: 4 0.17309716 -1.0140388
5: 5 0.05962474 NA
Aside (mostly for Arun):
Here is a solution I already use for melt (was written with help from Matthew D, so he should have this code), that I think replicates melt completely, and is pretty efficient. Dcast on the other hand (or should that be dtcast?) is much harder!
melt.data.table = function(data, id.vars, measure.vars,
variable.name = "variable",
..., na.rm = FALSE, value.name = "value") {
if(missing(id.vars)){
id.vars = setdiff(names(data), measure.vars)
}
if(missing(measure.vars)){
measure.vars = setdiff(names(data), id.vars)
}
dtlist = lapply(measure.vars, function(..colname) {
data[, c(id.vars, ..colname), with = FALSE][, (variable.name) := ..colname]
})
dt = rbindlist(dtlist)
setnames(dt, measure.vars[1], value.name)
if(na.rm){
return(na.omit(dt))
} else {
return(dt)
}
}
Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.
Now you can just do:
dcast.data.table(DT, X~Y)
In case of dcast alone, at the moment, it has to be written out completely (as it's not a S3 generic yet in reshape2). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you'd do normally.
The general idea is this:
setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]
You can name the columns V1 and V2 as A and B using setnames.
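Put together, the idea can be sketched as follows and compared against dcast from a current data.table. The seed and the names wide and ref are assumptions of mine, and the renaming from V1 and V2 relies on Y having exactly the two levels "A" and "B", in that order, as in the question.

```r
library(data.table)

set.seed(42)  # arbitrary seed for reproducibility
DT <- data.table(X = c(1:5, 1:4),
                 Y = c(rep("A", 5), rep("B", 4)),
                 Z = rnorm(9))

# Cross join of all (X, Y) combinations fills the missing (5, "B") cell with NA,
# so every X group has exactly two Z values, in the order A, B
setkey(DT, X, Y)
wide <- DT[CJ(1:5, c("A", "B"))][, as.list(Z), by = X]
setnames(wide, c("V1", "V2"), c("A", "B"))

# Reference result from the built-in reshaping function
ref <- dcast(DT, X ~ Y, value.var = "Z")
print(wide)
```

The manual version only works because the cast is this simple; for anything with more id or value columns the built-in dcast is the safer route.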
But this may not be efficient on large data or when the cast formula is complex. Or rather I should say, it could be much more efficient. We're in the process of finding such an implementation to integrate melt and cast on to data.table. Until then, you could get around this as above.
I'll update this post once we've made significant progress with melt/cast.
