Merge multiple numeric columns into a list-typed column in data.table [R]

I'm trying to find a way to merge multiple numeric columns into a new list-type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I tried wrapping the columns in a list with dt[,d:=list(c(a,b,c))], but c(a,b,c) concatenates the whole columns instead of working row by row, so I get an incorrect result:
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...

Group by row and place the elements of each row in a list:
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
-output
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Another option is paste and strsplit; note that the list elements are then character rather than numeric:
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or use transpose:
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9

Another option is purrr::pmap:
dt[, d := purrr::pmap(.SD, ~c(...))]
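The first attempt fails because c(a, b, c) receives the three whole columns and concatenates them into one vector of length 9, so nothing is computed per row. The sketch below restates the transpose idea with the columns named in a character vector, so that an already existing d column is not swept into .SD when the line is re-run. The name cols is illustrative and does not come from the original post.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# c(a, b, c) sees whole columns, so it yields a single vector of length 9
length(c(dt$a, dt$b, dt$c))

# 'cols' is an illustrative name for the columns to combine
cols <- c("a", "b", "c")

# transpose() turns the list of 3 columns into a list of 3 row vectors
dt[, d := transpose(unname(as.list(.SD))), .SDcols = cols]

# each element of d is a numeric vector holding one row's values
str(dt$d)
```

Unlike the paste/strsplit route, the list elements stay numeric here.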

Related

Add max column using variable

I am trying to do the same thing as in this question: Add max value to a new column in R. However, I want to pass in a variable instead of the column name directly, so that I don't hard-code the column name into the formula.
Sample code:
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
d
a b
1 1
1 3
2 5
2 9
3 4
3 NA
I can get this:
a b max_b
1 1 3
1 3 3
2 5 9
2 9 9
3 4 4
3 NA 4
By hard-coding it: setDT(d)[, max_b:= max(b, na.rm = TRUE), a], but I would like to do something like this instead:
cn <- "b"
setDT(d)[, paste0("max_", cn):= max(cn, na.rm = T), a]
However, this does not work, because inside max() cn evaluates to the character string rather than to the column. The result is a column named max_b that contains the value "b", because max("b") = "b". I get why this is happening; I just do not know a workaround.
What is a solution to this?
Note: the Stack Overflow question I linked above was marked as a duplicate and closed, but I chose it because I am using its accepted answer in my code. I also do not fully agree that it is a duplicate.
Try setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn))))), a]
# output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
# example with missing values
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
cn <- "b"
setDT(d)[, paste0("max_", cn) := eval(parse(text = max(eval(parse(text = cn)),
na.rm = TRUE))), a]
#output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
6: 3 NA 4
One option is to specify the variable in .SDcols and then apply the function on .SD (Subset of Data.table).
d[, paste0("max_", cn) := lapply(.SD, max, na.rm = TRUE), by = a, .SDcols = cn]
d
# a b max_b
#1: 1 1 3
#2: 1 3 3
#3: 2 5 9
#4: 2 9 9
#5: 3 4 4
#6: 3 NA 4
Another option is to convert the name to a symbol and then evaluate it:
d[, paste0("max_", cn) := max(eval(as.symbol(cn)), na.rm = TRUE), by = a]
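For reuse, the .SDcols approach can be wrapped in a small function. This is only a sketch: add_group_max, cols, and group_cols are hypothetical names and not part of data.table.

```r
library(data.table)

# Hypothetical helper: adds one "max_<col>" column per entry of 'cols',
# computed within the groups given by 'group_cols'. Modifies dt by reference.
add_group_max <- function(dt, cols, group_cols) {
  new_cols <- paste0("max_", cols)
  dt[, (new_cols) := lapply(.SD, max, na.rm = TRUE),
     by = group_cols, .SDcols = cols]
  invisible(dt)
}

d <- data.table(a = c(1, 1, 2, 2, 3, 3), b = c(1, 3, 5, 9, 4, NA))
add_group_max(d, cols = "b", group_cols = "a")
d
```

Because both the target and the grouping columns are passed as character vectors, no column name appears literally inside the data.table call.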

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as a character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using a vector of column names?
You need ?setorderv and its cols argument, which the documentation describes as:
"A character vector of column names of x by which to order"
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
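The two-row result in the question follows from what order() is given: order(keycol) sorts the two strings "x" and "y" themselves and returns c(1, 2), which data.table then uses as row numbers. The sketch below shows this, together with two ways to get a sorted result without modifying DT in place. The names sorted_copy, idx, and sorted_idx are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
keycol <- c("x", "y")

# order() is applied to the strings "x" and "y", not to the columns
order(keycol)

# Sort a copy by reference, leaving DT itself untouched
sorted_copy <- setorderv(copy(DT), keycol)

# Index-based alternative: pass the selected columns to base order()
idx <- do.call(order, unname(as.list(DT[, keycol, with = FALSE])))
sorted_idx <- DT[idx]
```

The copy() call matters only when the original row order must be preserved; otherwise setorderv(DT, keycol) on its own avoids the extra allocation.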

Renaming multiple columns in R data.table

This is related to this question from Henrik: Assign multiple columns using := in data.table, by group.
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is this something I can do in one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]
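As a reproducible sketch, the setNames approach can be checked against a fixed (non-random) version of the data. The group vector below is copied from the printed table in the question. The second form, which names the columns directly in j, is an alternative that does not go through f.

```r
library(data.table)

f <- function(x) list(head(x, 2), tail(x, 2))

# Fixed data matching the table printed in the question (no sampling)
dt <- data.table(
  group = c("b", "b", "a", "b", "a", "b", "a", "a", "b", "b"),
  val = 1:10
)

# Name the list returned by f() before data.table builds the result
dt2 <- dt[, setNames(f(val), c("head", "tail")), by = group]

# Alternative: name the output columns directly inside j
dt3 <- dt[, .(head = head(val, 2), tail = tail(val, 2)), by = group]
dt2
```

Both calls return the groups in order of first appearance (b, then a); use keyby = group if sorted groups are wanted.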

I found a strange thing (bug?) about 'combn' function and 'data.table' package [all possible combinations by group]

I tried to find all possible combinations by group. I used the combn function and the data.table package, as the post below suggests (Generate All ID Pairs, by group with data.table in R).
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. I tried to understand it for about 3 hours, but I can't. Is it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .( groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I would really appreciate your help.
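The output is consistent with how combn is documented: when x is a single positive integer, combn(x, m) treats it as seq_len(x). Group A holds only the id 2, so combn(2, 2) returns the pair (1, 2). Group C holds only the id 4, so combn(4, 2) returns all six pairs from 1:4. The sketch below shows one possible guard, which skips groups with fewer than two ids. Whether such groups should be dropped is an assumption about the desired behaviour.

```r
library(data.table)

dat2 <- data.table(ids = 1:4, groups = c("B", "A", "B", "C"))

# A scalar x is expanded to seq_len(x): combn(4, 2) enumerates pairs of 1:4
ncol(combn(4, 2))

# Only call combn() when a group has at least two ids; groups for which
# j returns NULL are left out of the result
pairs <- dat2[, if (.N >= 2L) as.data.table(t(combn(ids, 2))), by = groups]
pairs
```

With this guard only group B remains, giving the single pair (1, 3).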

R delete non max values in redundant rows

I have a matrix that contains the following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are column names and
a, b, c are row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R, which means:
1) ordering the rows alphabetically by row name,
2) and then, if there are redundant rows (i.e. other rows with the same row name), picking the maximum value among them for each column and deleting the others.
I first used Python for this, but I was wondering if there is a more convenient way to do it in R.
I would appreciate any help.
You can use data.table
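library(data.table)
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
# keyby groups by name and sorts the result alphabetically
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), keyby = "name"]
# move the name column back into the row names so the matrix stays numeric
as.matrix(dt_max, rownames = "name")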
Here's a one-liner using data.table: keep the row names while converting to a data.table, then apply max over all columns using lapply(.SD, ...) by the rn variable (the saved row names).
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
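The answers assume an existing matrix (matrix_in or m) that is never constructed. As a self-contained sketch, the matrix from the question can be built as follows. The name m follows the one-liner, and keyby is used instead of by so that the result is also sorted by row name.

```r
library(data.table)

# Matrix from the question, with duplicated row names
m <- matrix(
  c(1, 3, 2, 5,
    3, 2, 5, 8,
    2, 1, 0, 9,
    4, 2, 1, 3,
    4, 3, 1, 1,
    2, 5, 1, 9),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("a", "b", "a", "a", "c", "b"), c("A", "B", "C", "D"))
)

# Row names become the column rn; take the column-wise max per row name
res <- as.data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), keyby = rn]
res
```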
You can simply use the aggregate function:
aggregate(matrix ~ rownames(matrix), matrix, max)
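A base R variant of the same idea passes the row names through aggregate's by argument rather than through a formula. This is a sketch under the assumption that the matrix is called m and has duplicated row names as in the question. The row names are dropped before the conversion so that as.data.frame does not have to deal with duplicates.

```r
# Matrix from the question, with duplicated row names
m <- matrix(
  c(1, 3, 2, 5,
    3, 2, 5, 8,
    2, 1, 0, 9,
    4, 2, 1, 3,
    4, 3, 1, 1,
    2, 5, 1, 9),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("a", "b", "a", "a", "c", "b"), c("A", "B", "C", "D"))
)

# Keep the values without row names; the names are supplied via 'by'
vals <- m
rownames(vals) <- NULL

# One output row per distinct row name, holding the column-wise maxima
res <- aggregate(as.data.frame(vals), by = list(rn = rownames(m)), FUN = max)
res
```

aggregate returns a data frame with the grouping column first and the groups in sorted order.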
