Renaming multiple columns in R data.table - r

This is related to this question from Henrik
Assign multiple columns using := in data.table, by group
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is it something I can do with one call?

From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt2[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]

Related

Merge multiple numeric column as list typed column in data.table [R]

I'm trying to find a way to merge multiple column numeric column as a new list type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I have tried doing append with a list with dt[,d:=list(c(a,b,c))] but it just append everything instead and get the incorrect result
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Do a group by row and place the elements in the list
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
-output
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or another option is paste and strsplit
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or may use transpose
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
dt[, d := purrr::pmap(.SD, ~c(...))]

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow It displays just 2 rows and removes other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works like fluid.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.

Creating a new data table for each row of an existing data table R while avoiding memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops, memory is not an issue on my server up to 50gb of ram, which the following data table would certainly be less than.
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then to allow cartesian joins:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = T]

I found a strange thing (bug?) about 'combn' function and 'data.table' package [all possible combinations by group]

I tried to find all possible combinations by group. I tried to use combn function and data.table package as a below post teaches [(here is the link)](Generate All ID Pairs, by group with data.table in R
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird. I tried to understand this result for about 3 hours but I can't. Isn't it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .( groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I really appreciate it for your teaching.

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

Resources