Rename multiple aggregated columns using data.table in R [duplicate] - r

This question already has answers here:
Calculate multiple aggregations on several variables using lapply(.SD, ...)
(2 answers)
Apply multiple functions to multiple columns in data.table
(5 answers)
Closed 4 years ago.
I have a data frame with many columns, some of which are measure variables. I would like to extract a bunch of summary statistics from the latter using data.table. My problem is the following: how to rename the aggregated columns according to the function that was used?
I want to have an aggregated data.table with column names like: c("measure1_mean", "measure1_sd", "measure2_mean", "measure2_sd", ...)
My code looks like this:
library(data.table)
library(stringr)
dt <- data.table(meas1=1:10,
meas2=seq(5,25, length.out = 10),
meas3=rnorm(10),
groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
lapply(.SD, sd)),
by=groupvar, .SDcols = measure_cols]
# Does not work because of duplicates in rep(measure_cols, 3)
agg_names <- c(measure_cols, paste(rep(c("mean", "sd"), each=length(measure_cols)), measure_cols, sep="_"))
setnames(dt_agg, rep(measure_cols,3), agg_names)
This chunk effectively extracts the statistics but returns columns with identical names. Therefore I cannot use something like setnames(dt, old, new) because duplicates exist in my 'old' vector.
I came across this post: Rename aggregated columns using data.table in R. But I do not like the accepted solution because it relies on column index, and not names, to rename the columns.

library(data.table)
dt <- data.table(meas1=1:10,
meas2=seq(5,25, length.out = 10),
meas3=rnorm(10),
groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
lapply(.SD, sd)),
by=groupvar, .SDcols = measure_cols]
you can create a vector with names... use the each argument to paste the function,names behind the measure_cols.
function.names <- c("mean", "sd")
column.names <- paste0( measure_cols, "_", rep( function.names, each = length( measure_cols ) ) )
setnames( dt_agg, c("groupvar", column.names ))
# groupvar meas1_mean meas2_mean meas3_mean meas1_sd meas2_sd meas3_sd
# 1: A 1.5 6.111111 0.2346044 0.7071068 1.571348 1.6733804
# 2: B 3.5 10.555556 0.5144621 0.7071068 1.571348 0.0894364
# 3: C 5.5 15.000000 -0.5469839 0.7071068 1.571348 2.1689620
# 4: D 7.5 19.444444 -0.3898213 0.7071068 1.571348 1.0007027
# 5: E 9.5 23.888889 0.5569743 0.7071068 1.571348 1.4499413

Related

Paste name of new columns when summarizing data.table [duplicate]

This question already has answers here:
Dynamically add column names to data.table when aggregating
(2 answers)
Dynamic column names in data.table
(3 answers)
Closed 5 years ago.
How to summarize a data.table creating new column whose name comes from a string or character?
reproducible example:
library(data.table)
dt <- data.table(x=rep(c("a","b"),20),y=factor(sample(letters,40,replace=T)), z=1:20)
i <- 15
new_var <- paste0("new_",i)
# my attempt
dt[, .( eval(new_var) = sum( z[which( z <= i)] )), by= x]
# expected result
dt[, .( new_15 = sum( z[which( z <= i)] )), by= x]
> x new_15
> 1: a 128
> 2: b 112
This approach using eval() works fine for creating a new column with := (see this SO questions), but I don't know why it does not work when summarizing a data.table.
One option is setNames
dt[, setNames(.(sum( z[which( z <= i)])), new_var) , by= x]
# x new_15
#1: a 128
#2: b 112

How to reassign values to existing columns of a data.table using lapply? [duplicate]

This question already has answers here:
Update multiple data.table columns elegantly [duplicate]
(4 answers)
Fill in mean values for NA in every column of a data frame [duplicate]
(1 answer)
Closed 5 years ago.
I want to update NAs in numeric columns with median values for that column.
dt <- data.table(
name = c("A","B","C","D","E"),
sex = c("M","F",NA,"F","M"),
age = c(1,2,3,NA,4),
height = c(178.1, 162.1, NA, 169.5, 172.3)
)
Extract the numeric columns
num.cols <- sapply(dt, is.numeric)
num.cols <- names(num.cols)[num.cols]
Check values
median(dt[,age], na.rm = T) # 2.5
median(dt[,height], na.rm = T) #170.9
Use lapply for each num.cols
dt[,lapply(.SD, function(value)
ifelse(is.na(value), median(value, na.rm=TRUE), value)),
.SDcols = num.cols]
Question, I cannot work out how to overwrite the vector with NA with the vector of imputed medians in data.table syntax ?
We can use the na.aggregate from zoo and specify the FUN as median to impute the missing values with median for the selected columns specified in .SDcols and assign (:=) the values to the concerned columns
library(zoo)
dt[, (num.cols) := na.aggregate(.SD, FUN = median),.SDcols = num.cols]
dt
# name sex age height
#1: A M 1.0 178.1
#2: B F 2.0 162.1
#3: C NA 3.0 170.9
#4: D F 2.5 169.5
#5: E M 4.0 172.3

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a vector of the above values split use strsplit. I then loop through the entire database, getting the ith value from those three vectors, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns, and want to do the split only for the 3 columns, we can specify the .SDcols= 1:3 i.e. column index or the actual column names, assign (:=) the output to new columns and subset the columns that are only needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name so may have to rename it/quote it depending on how exactly r has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( af, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)

Select multiple columns in data.table by their numeric indices

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

Resources