R data.table merge duplicate rows and concatenate unique values

I am trying to merge duplicate rows using data.table aggregate, but I need to figure out how to concatenate the non-duplicated columns as strings in the output:
dt = data.table(
ensembl_id=c("ENSRNOG00000055068", "ENSRNOG00000055068", "ENSRNOG00000055068"),
hsapiens_ensembl_id=c("ENSG00000196262", "ENSG00000236334", "ENSG00000263353"),
chr=c(14, 14, 14),
start=c(22706901, 22706901, 22706901),
hsapiens_symbol=c("PPIA", "PPIAL4G", "PPIAL4A"),
hsapiens_chr=c(7, 1, 1)
)
dt[, lapply(.SD, paste(...,sep=",")), by=ensembl_id] # <- need magic join/paste function
desired output:
ensembl_id hsapiens_ensembl_id chr start hsapiens_symbol hsapiens_chr
1: ENSRNOG00000055068 ENSG00000196262,ENSG00000236334,ENSG00000263353 14 22706901 PPIA,PPIAL4G,PPIAL4A 7,1,1

We can use the collapse argument of paste instead of sep, and also include 'chr' and 'start' as grouping variables so that they are not concatenated:
library(data.table)
dt[, lapply(.SD, paste, collapse=","), by = .(chr, start, ensembl_id)]
Or more compactly, with toString (note that it separates the values with ", ", i.e. a comma followed by a space):
dt[, lapply(.SD, toString), by = .(chr, start, ensembl_id)]
If there are duplicates within a group, take the unique values before pasting:
dt[, lapply(.SD, function(x) toString(unique(x))), by = .(chr, start, ensembl_id)]
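The difference between the last two variants only shows up when a group contains repeated values. A minimal sketch on a made-up table (the name `toy` and its contents are assumptions for illustration, not the asker's data):

```r
library(data.table)

# Hypothetical toy data: group "g1" contains the symbol "PPIA" twice
toy <- data.table(
  id     = c("g1", "g1", "g1", "g2"),
  chr    = c(14, 14, 14, 3),
  symbol = c("PPIA", "PPIA", "PPIAL4G", "ABC")
)

# Keep every value, duplicates included
all_vals <- toy[, lapply(.SD, toString), by = .(id, chr)]

# Keep each value only once per group
uniq_vals <- toy[, lapply(.SD, function(x) toString(unique(x))), by = .(id, chr)]

print(all_vals)
print(uniq_vals)
```

If the output must not contain a space after each comma, paste(unique(x), collapse = ",") can be used in place of toString(unique(x)).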

Related

R data.table: names of .SD not available for assignment

Often, I want to manipulate several variables in a DT and I need to select the column names based on their names or class.
library(data.table)
library(stringr)
d <- data.table(x = 1:10, y= letters[1:10])
# My usual approach
col <- str_subset(names(d), '^x')
d[, (col) := 2:11]
However, it would be very useful and less verbose to do this:
d[, (names(.SD)) := 2:11, .SDcols = patterns('^x')]
But this throws an error:
Error in `[.data.table`(d, , `:=`((names(.SD)), 2:11), .SDcols = patterns("^x")) :
LHS of := isn't column names ('character') or positions ('integer' or 'numeric')
The column names of .SD are available, though:
> d[, names(.SD), .SDcols = patterns('^x')]
[1] "x"
Why aren't the names of .SD available for assignment on the LHS of :=?
As noted this is not yet possible. The workaround only adds one line of code:
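# select the matching column names once, then assign by reference;
# .SDcols is not needed because j does not use .SD
cols = grep('^x', names(d), value = TRUE)
d[ , (cols) := 2:11]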
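When the new values depend on the old ones, the same pre-computed vector of names can serve both as the LHS of := and as .SDcols. A small sketch, assuming a made-up table with two columns matching the pattern:

```r
library(data.table)

# Hypothetical table with two columns whose names start with "x"
d <- data.table(x1 = 1:3, x2 = 4:6, y = letters[1:3])

# Resolve the pattern to column names once, outside of `[`
cols <- grep("^x", names(d), value = TRUE)

# Parentheses make `:=` evaluate `cols` instead of creating a column called "cols";
# .SDcols restricts .SD to the same columns
d[, (cols) := lapply(.SD, function(v) v * 2L), .SDcols = cols]
print(d)
```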

Multiply columns in a DT by DT[i,j]

Question 1: line 1 throws an error. Why and how to multiply all columns by DT[i,j]?
Question 2: line 2 works but are there better ways to multiply all other columns by one column?
df=data.table(matrix(1:15,3,5))
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")] #line 1
df[ , lapply(.SD, function(x) {x*df[1,"V5"]}), .SDcols = c("V1","V2","V3","V4")] #line 2
As we are multiplying one column with the rest, we can either multiply the Subset of Data.table (.SD) by that column directly
df[, .SD * V5, .SDcols = V1:V4]
Or with lapply
df[, lapply(.SD, `*`, V5), .SDcols = V1:V4]
Note that in both cases, we are not updating the original dataset columns. For that we need :=
df[, paste0("V", 1:4) := .SD * V5, .SDcols = V1:V4]
In the OP's code, a closing } is missing in line 1:
df[ , lapply(.SD, function(x) {x*df$V5), .SDcols = c("V1","V2","V3","V4")]
                                     ^^
It would be
df[, lapply(.SD, function(x) { x* V5 }), .SDcols = V1:V4]
Here, the curly braces are not really needed. Also, within the data.table, columns can be referenced by their unquoted names instead of via df$, and .SDcols accepts a range of column names (V1:V4) as a shorthand.
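To make the two questions concrete, here is a short sketch that multiplies V1 to V4 first by the whole column V5 and then by a single cell of it. Using V5[1] to pick the cell is one possible choice for illustration, not necessarily what the OP intended with df[1,"V5"]:

```r
library(data.table)

df <- data.table(matrix(1:15, 3, 5))  # columns V1..V5, with V5 = 13, 14, 15

# Element-wise product of each of V1..V4 with the whole column V5
by_column <- df[, lapply(.SD, `*`, V5), .SDcols = V1:V4]

# Product of each of V1..V4 with the single value in row 1 of V5;
# V5[1] is a plain scalar, so it is recycled over all rows
by_cell <- df[, lapply(.SD, `*`, V5[1]), .SDcols = V1:V4]

print(by_column)
print(by_cell)
```

Neither call modifies df; both return new tables.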

.SD and .SDcols for the i expression in data.table join

I'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic.
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
But when I try to use that code in the j expression, it just gives me the column names, not the values of the columns. .SD and .SDcols get pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y[names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y)), on=.(zzz)]
Is there an equivalent set of .SD and .SDcols constructs that applies to the i expression? Or do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
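The same idea works when the columns are picked by a pattern, as in the question. A minimal sketch, assuming made-up tables that share a key column zzz and some columns of Y whose names contain "xxx":

```r
library(data.table)

# Hypothetical tables; only the names zzz and xxx_* mirror the question
X <- data.table(zzz = c("a", "b", "c"), v = 1:3)
Y <- data.table(zzz = c("c", "a"),
                xxx_1 = c(30, 10),
                xxx_2 = c("C", "A"),
                other = c(TRUE, FALSE))

# Pick the columns to copy by pattern
cols <- grep("xxx", names(Y), value = TRUE)

# Update join: rows of X matched by Y on zzz receive Y's values;
# the i. prefix refers to columns of the table in the i position (Y)
X[Y, (cols) := mget(paste0("i.", cols)), on = "zzz"]
print(X)
```

Rows of X without a match in Y (here "b") get NA in the new columns, and Y's column other is not copied.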

Use parentheses () to subset a data.table column

In case of assignment (by reference), with = FALSE can be replaced by putting the LHS in parentheses, (). This nice feature does not work when simply subsetting the column without assignment. Of course there are workarounds with .SD/.SDcols or get()/mget(), but it would be nice to subset a column in the same way, with or without assignment.
dt <- data.table(A = 1:3, B = 4:6 )
col <- "A"
cols <- c("A","B")
# assign the old way
dt[, col := 9 , with=FALSE]
dt[, cols := .(9,8), with=FALSE]
# assign the new way
dt[, (col) := 8 ]
dt[, (cols) := .(8,7)]
# But the above syntax does not work for subsetting
dt[, (col)]
dt[, (cols)]
# I know how I can subset col and cols, but that is not the question here,
# e.g.:
dt[, col, with=FALSE]
dt[, cols, with=FALSE]
dt[, .SD, .SDcols=col]
dt[, .SD, .SDcols=cols]
# Below are further types of subsetting (there are even more), but they are
# not the same for col and cols, which is important for looping where I don't
# know in advance how many cols I will call.
dt[, get(col)]
dt[, mget(cols)]
dt[[col]] # Returns a vector; dt[[cols]] does not run
In other words: if dt[ , (col) := 8] runs, as a naive user I expect dt[ , (col)] to run as well. Probably there would be a conflict in `[.data.table`, so that it cannot be implemented?

R change column class by variable

In R I have a data.table df with an integer column X. I want to convert this column from integer to a character.
This is really easy with df[, X:=as.character(X)].
Now for the question, I have the name of the column (X) stored in a variable like this:
col_name <- 'X'
How do I access the column (and convert it to a character column) while only knowing the variable?
I tried numerous things, all yielding nothing useful or a column of NAs. Which syntax will get me the result I want?
library(data.table)
DT <- as.data.table(iris)
col_name <- "Petal.Length"
Use ( to evaluate the LHS of := and use list subsetting to select the column:
DT[, (col_name) := as.character(DT[[col_name]])]
class(DT[[col_name]])
#[1] "character"
Alternatively, we can specify the column in .SDcols and convert it to character
df[, (col_name) := as.character(.SD[[1L]]), .SDcols = col_name]
If there is more than one column, use lapply (with col_names being a character vector of column names)
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
data
df <- data.table(X = as.integer(1:5), Y = LETTERS[1:5])
col_name <- "X"
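The multi-column variant can be tried end to end as follows. This is a sketch on an extended version of the example data; the extra column Z and the vector col_names are assumptions added for illustration:

```r
library(data.table)

# Example data with one additional (hypothetical) integer column Z
df <- data.table(X = 1:5, Y = LETTERS[1:5], Z = 6:10)

# Column names known only at run time
col_names <- c("X", "Z")

# (col_names) is evaluated to get the target columns;
# .SDcols limits .SD to the same columns, so lapply converts exactly those
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]

print(sapply(df, class))
```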
