We would like to summarize a data table, creating many new variables that result from combining column names and values from the original data.
Here is a reproducible example illustrating the result we would like to achieve, with only two columns for the sake of brevity:
library(data.table)
data('mtcars')
setDT(mtcars)
# Desired output
mtcars[, .(
  acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T),
  acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)
), by= .(cyl, gear)]
Because we want to summarize a lot of columns, we created a function that returns all the strings that we would use to create each summary variable. In this example, we have this:
a <- 'acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T)'
b <- 'acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)'
And here is our failed attempt. Note that the new columns created do not receive the names we want to assign to them.
mtcars[, .(
  eval(parse(text=a)),
  eval(parse(text=b))
), by= .(cyl, gear)]
It seems like the only part that isn't working is the column names: eval(parse(text = a)) evaluates the parsed assignment and returns only the value, so the name is dropped. If you put a and b in a vector and add names to them, you can use lapply() to do the eval(parse()) and keep the names from the vector. I used a regex to get the names, but presumably in your real code you can assign the names from whatever variable you're using to construct the strings in the first place.
The result has many NaNs, but it matches your desired output.
to_make <- c(a, b)
to_make <- setNames(to_make, sub('^(.*) =.*', '\\1', to_make))
mtcars[, lapply(to_make, function(x) eval(parse(text = x)))
, by= .(cyl, gear)]
# cyl gear acm_hp_carb2 acm_wt_am1
# 1: 6 4 NaN 2.747500
# 2: 4 4 76.0 2.114167
# 3: 6 3 107.5 NaN
# 4: 8 3 162.5 NaN
# 5: 4 3 97.0 NaN
# 6: 4 5 102.0 1.826500
# 7: 8 5 NaN 3.370000
# 8: 6 5 NaN 2.770000
You can make one call and eval it:
f = function(...){
  ex = parse(text = sprintf(".(%s)", paste(..., sep=", ")))[[1]]
  print(ex)
  mtcars[, eval(ex), by=.(cyl, gear)]
}
f(a,b)
a2 <- 'acm_hp_carb2 = mean(hp[carb <= 2], na.rm=T)'
b2 <- 'acm_wt_am1 = mean(wt[am == 1], na.rm=T)'
f(a2, b2)
I guess the which() is not needed.
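A quick sketch of why the two forms agree (my addition, not from the answer): which() drops NA indices, while logical indexing keeps them as NA entries, which na.rm = TRUE then removes anyway.
x <- c(1, NA, 3)
flag <- c(TRUE, NA, TRUE)
mean(x[which(flag)], na.rm = TRUE)  # which() silently drops the NA index -> mean(c(1, 3)) = 2
mean(x[flag], na.rm = TRUE)         # logical indexing keeps an NA; na.rm removes it -> 2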
I am trying to dynamically reference and perform operations on a vector in a data frame. I've tried various forms of eval(), parse(), and so on, but they either return the string I provide or throw errors. Does anyone have a solution? As I propose in the pseudo-code below, the solution is presumably to replace DO_SOMETHING() with some other function.
# Example data
mydat <- data.frame(x = rnorm(10))
# Function to add 5 to specified variable in a data frame
add5 <- function(data, var){
  var_ref <- paste0("data$", var)
  out <- DO_SOMETHING(var_ref) + 5
  return(out)
}
add5(mydat, x)         # desired: returns a numeric vector
class(add5(mydat, x))  # desired: "numeric"
If you want to pass unquoted column names, you could use deparse(substitute()) like this:
add5 <- function(data, var){
  out <- data[deparse(substitute(var))] + 5
  return(out)
}
add5(mydat,x)
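Note (my addition): single-bracket indexing as above returns a one-column data frame. If you want a plain numeric vector, matching the class() check in the question, double brackets do that:
add5 <- function(data, var){
  # [[ extracts the column itself rather than a one-column data frame
  data[[deparse(substitute(var))]] + 5
}
add5(mydat, x)         # numeric vector
class(add5(mydat, x))  # "numeric"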
Using dplyr and some non-standard evaluation with curly-curly, we can do:
library(dplyr)
library(rlang)
add5 <- function(data, var){
  data %>% mutate(out = {{var}} + 5)
}
add5(mydat, x)
# x out
#1 1.1604 6.16
#2 0.7002 5.70
#3 1.5868 6.59
#4 0.5585 5.56
#5 -1.2766 3.72
#6 -0.5733 4.43
#7 -1.2246 3.78
#8 -0.4734 4.53
#9 -0.6204 4.38
#10 0.0421 5.04
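As a hedged aside (not part of the original answer): if you also want the new column's name to depend on the argument, recent versions of dplyr/rlang support glue strings on the left-hand side of :=. A minimal sketch, where add5_named is my own name:
add5_named <- function(data, var){
  # "{{var}}_plus5" injects the captured column name into the new column's name
  data %>% mutate("{{var}}_plus5" := {{var}} + 5)
}
add5_named(mydat, x)  # adds a column named x_plus5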
With data.table, unquoting argument names is easy. If you start writing functions using variable names, I recommend using data.table (see a blog post I wrote on the subject).
With one variable, you can use get() to unquote the variable name:
library(data.table)
data <- data.table(x = rnorm(10))
myvar <- "x"
data[, out := get(myvar) + 5]
data
x out
1: -0.30229987 4.697700
2: 0.51658585 5.516586
3: 0.12180432 5.121804
4: 1.53438805 6.534388
5: 0.06213513 5.062135
6: 0.17935070 5.179351
7: 0.70002065 5.700021
8: 0.12067590 5.120676
9: -0.41002931 4.589971
10: 0.45385072 5.453851
Note that I don't need to reassign result because := updates by reference.
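A hedged aside: because := updates in place, the original table is modified. If you need to preserve it, work on an explicit copy first, for example:
data2 <- copy(data)             # deep copy; a plain `<-` would still share the same data
data2[, out := get(myvar) + 5]  # `data` is left untouched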
With several variables, you can use .SD + lapply(). This syntax means "apply something over the Subset of Data (.SD)". The .SDcols argument controls which columns are included in that subset.
This is a very general approach that works in many situations.
data <- data.table(x = rnorm(10), y = rnorm(10))
data[, c('out1','out2') := lapply(.SD, function(x) x + 5), .SDcols = c("x","y")]
data
x y out1 out2
1: 0.91187875 -0.2010539 5.911879 4.798946
2: -0.70906903 0.2074829 4.290931 5.207483
3: -0.52517961 0.2027444 4.474820 5.202744
4: 0.09967933 -1.2315601 5.099679 3.768440
5: -0.40392510 -0.1777705 4.596075 4.822229
6: 0.65891623 0.2394889 5.658916 5.239489
7: 0.76275090 1.5695957 5.762751 6.569596
8: -0.52395704 -0.7083462 4.476043 4.291654
9: 0.52728890 -1.1308284 5.527289 3.869172
10: -1.00418691 -0.5569468 3.995813 4.443053
I could have used this approach with one column (.SDcols = 'x').
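For completeness, that one-column version would look like this (same data as above; out3 is just an arbitrary new name):
data[, out3 := lapply(.SD, function(x) x + 5), .SDcols = "x"]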
I'm using data.table to collapse and aggregate values with a group by. I know a method that works when the column is referenced directly by name, but I need to reference it by column number instead. I know this method:
dt[,X := list(paste(X, collapse = ";")),by = list(Y,Z)]
What I want to do now is:
dt[,names(dt)[1] := list(paste(names(dt)[1], collapse = ";")),by = list(Y,Z)]
But with this code it just writes the literal string "X" on each line.
Here is an example:
X <- c("a","b","c","d","e","f","g")
Y <- c(1,2,3,4,4,6,4)
Z <- c(10,11,23,8,8,1,3)
dt <- data.table(X,Y,Z)
This is the desired output. I need a general approach because I'm trying to do this for multiple columns (I have a data frame with 400 columns):
X Y Z
1: a 1 10
2: b 2 11
3: c 3 23
4: d;e 4 8
5: f 6 1
6: g 4 3
You should wrap names(dt)[1] inside get():
dt[,names(dt)[1] := list(paste(get(names(dt)[1]), collapse = ";")),by = list(Y,Z)]
Additionally, if you want to deduplicate your data, you can use unique(dt).
To apply your function to multiple columns, you can use .SD in combination with lapply(). For example, pasting together the first two columns, grouped by Z:
dt[, lapply(.SD, function(x) paste(x, collapse = ";")), by = list(Z), .SDcols = names(dt)[1:2]]
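And since your real table has 400 columns, it may help that .SD automatically contains every non-grouping column when .SDcols is omitted, so the whole table can be collapsed in one call (a sketch on the example data):
# .SD here holds every column except the by columns Y and Z
dt[, lapply(.SD, paste, collapse = ";"), by = .(Y, Z)]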
I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example the mean (but potentially a custom function), to all values in this subset? I can do it this way, but I wonder if there are better ways more in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how the performance of a matrix approach compares to a data.table approach:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=FALSE)  # a list of 10000 subsets, so lapply() below gets one subset per element
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
Please see the solution posted here for an even more general solution. It may also help:
data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
Part 1. A Step-by-Step Solution
(1.a) Get the data into data.table format:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
(1.b) Then, get the list of column names:
# Get the list of names
sub <- c('spB', 'spF', 'spD')
(1.c) Define the function you want to apply
# Define the function you wish to apply.
# Here, normalize() is just an example function (the question itself used mean()):
normalize <- function(X, X.mean = mean(X, na.rm=T), X.sd = sd(X, na.rm=T)){
  X <- (X - X.mean) / X.sd
  return(X)
}
(1.d) Apply the function:
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
Part 2. Some Advantages of the data.table Approach
Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.
(2.a) Apply functions within groups using "by ="
One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.
In the example here, I assume you have a grouping variable called "Grp".
With by = Grp, the normalization is now done within each group:
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
First, we create a new column called Combined.Data holding the data from all three columns identified in "sub".
Next to each row of the combined data, the appropriate participant ID is repeated in the P.Id column.
In the second step, done in this portion of the code:
[order(P.Id), .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
We create a new column called Transformed to store the normalized values that result from applying the function normalize().
In addition, we can also include the Combined.Data column as well
So, with this single line:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
we subset columns,
collapse data across the subset,
keep track of the identifier for each datum (P.Id) even when collapsed,
apply a transformation on the entire collapsed data, and
end up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, and (3) Combined.Data (the original values).
And the order(P.Id) allows the output to appear meaningfully ordered.
The same would be possible with the matrix approach, but it would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
(2.c) Finally, if you just wish to keep row information as simple row numbers, you can use the .I feature of the data.table package:
pdist[, .(.I, Transformed = normalize(unlist(.SD))), .SDcols = sub]
This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
Part 3. Disadvantage: Time Cost
I re-created the corrected timing comparison shown above, and the data.table solution does take significantly longer:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           unlist(., recursive = FALSE, use.names = FALSE) %>%
           mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
           mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
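A possible mitigation, offered as a sketch only (not benchmarked here, and not from the original post): pay the data.table overhead once by melting the distance table into long format and keying it, then answer each subset query with a single keyed join.
pdistLong <- melt(pdistDT, id.vars = "sp", variable.name = "sp2",
                  variable.factor = FALSE, na.rm = TRUE)
setkey(pdistLong, sp, sp2)
longMethod <- function(pdistLong, sub) {
  # CJ() enumerates all pairs within the subset; the keyed join looks them up,
  # returning NA for pairs that were dropped as NA during the melt
  mean(pdistLong[CJ(sp = sub, sp2 = sub)]$value, na.rm = TRUE)
}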
I am looking to turn a data.frame (or data.table) such as
dt <- data.table(a = c(1,2,4), b = c(NA,3,5), d = c(NA,8,NA))
into something with one column, such as
dt <- data.table(combined = list(list(1,NA,NA), list(2,3,8), list(4,5,NA)))
None of the following work:
dt[,combined := as.list(a,b,d)]
dt[,combined := do.call(list,list(a,b,d))]
dt[,combined := cbind(a,b,d)]
dt[,combined := lapply(list(a,b,d),list)]
Note that this is different from the question here, data.frame rows to a list, which returns a differently shaped object (I think it's just a plain list, with each row as an item in the list, rather than a vector of lists).
You can use purrr::transpose(), which transposes a list of vectors to a list of lists:
dt[, combined := purrr::transpose(.(a,b,d))]
dt
# a b d combined
#1: 1 NA NA <list>
#2: 2 3 8 <list>
#3: 4 5 NA <list>
combined = list(list(1,NA_real_,NA_real_),list(2,3,8),list(4,5,NA_real_))
identical(dt$combined, combined)
# [1] TRUE
If you don't want to use an extra package, you can use data.table::transpose with a little extra effort:
dt[, combined := lapply(transpose(.(a,b,d)), as.list)]
identical(dt$combined, combined)
# [1] TRUE
To make @David's comment more explicit, and to generalize the data.table approach to a standard-evaluation (SE) version that lets you pass column names in as a character vector and avoids hard-coding them, you can do the following (to learn more about SE vs. NSE, you can refer to vignette("nse")):
dt[, combined := lapply(transpose(.SD), as.list), .SDcols = c("a","b","d")]
This makes all the sublists named, but the values still correspond to the combined list:
identical(lapply(dt$combined, setNames, NULL), combined)
# [1] TRUE
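A hedged variant (my addition, not in the original answer): if you want the result to be identical() to combined without stripping names afterwards, you could drop the names before transposing:
# as.list(.SD) gives a plain list of columns; unname() removes the a/b/d names
dt[, combined := lapply(transpose(unname(as.list(.SD))), as.list), .SDcols = c("a","b","d")]
identical(dt$combined, combined)
# [1] TRUE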
If you don't want to use any functions:
dt[, combined := .(.(.SD)), by = 1:nrow(dt)]
# Because you want to transform each row into a list, you can group the data table
# by row id, turn each row into a list, and store the references in a new list,
# which becomes a column in the resulting data.table.
dt$combined
#[[1]]
# a b d
#1: 1 NA NA
#[[2]]
# a b d
#1: 2 3 8
#[[3]]
# a b d
#1: 4 5 NA
Or: dt[, combined := .(.(.(a,b,d))), by = 1:nrow(dt)], which gives you something closer to the exact desired output.
I am struggling a bit with my tables. I am trying to split some variables (using R), but I am having difficulties with one specific column.
My dataset is like this:
test <- data.frame(
  Chrom_no = c(1, 1, 2, 3),
  Region = c('12..13', '22..23', '100', '34..36'),
  Ref = c('AT', 'CG', 'A', 'AAA'),
  Alt = c('TA', 'GA', 'T', 'CGG'),
  Prob = c(99, 98.7, 99, 99.9))
I want to separate all the regions that are grouped together. So far, I have solved it for all the columns but the 'Region' one:
ref2 <- strsplit(as.character(test$Ref), '')
alt2<-strsplit(as.character(test$Alt), '')
test2 <- data.frame(
  Chrom_no = rep(test$Chrom_no, vapply(ref2, FUN = length, FUN.VALUE = integer(1))),
  Region = rep(test$Region, vapply(ref2, FUN = length, FUN.VALUE = integer(1))),
  Ref = unlist(ref2),
  Alt = unlist(alt2),
  Prob = rep(test$Prob, vapply(ref2, FUN = length, FUN.VALUE = integer(1))))
I don't know how to fix that column: e.g. for '12..13', 12 should go with Ref=A and 13 with Ref=T (the first and second characters, respectively). Things get complicated, as some of the entries have 3 characters (with a corresponding range like 22..24), and some will have more.
How could I solve this? I have been looking for a solution for the last couple of days, but I am still not sure how. I apologize if this has already been answered somewhere else.
P.S.: I am aware that in order to strsplit on the 'Region' column I need to use:
'\\..'
as separator.
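For what it's worth, fixed = TRUE sidesteps the regex escaping entirely (the answer below uses the same trick via strsplit's third argument):
strsplit(as.character(test$Region), "..", fixed = TRUE)
# e.g. '12..13' -> c("12", "13"), '100' -> "100"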
If I understand your end goal correctly, you can look into using the "data.table" package. With it, you can set up your problem like the following:
library(data.table)
## Change your data.frame to a data.table
DT <- as.data.table(test)
## Convert the relevant columns to be characters instead of factors
DT[, c("Region", "Ref", "Alt") := lapply(.SD, as.character),
.SDcols = c("Region", "Ref", "Alt")]
DT[, list(Chrom_no = rep(Chrom_no, nchar(Ref)), # Expand the Chrom_no
Region = unlist(lapply( # Split Region and use
strsplit(Region, "..", TRUE), # the result to create
function(x) { # the range of values
x <- as.numeric(x) # needed
if (length(x) > 1) seq(x[1], x[2]) else x
})),
Ref = unlist(strsplit(Ref, "")), # Split Ref
Alt = unlist(strsplit(Alt, "")), # Split Alt
Prob = rep(Prob, nchar(Ref)))] # Expand Prob
# Chrom_no Region Ref Alt Prob
# 1: 1 12 A T 99.0
# 2: 1 13 T A 99.0
# 3: 1 22 C G 98.7
# 4: 1 23 G A 98.7
# 5: 2 100 A T 99.0
# 6: 3 34 A C 99.9
# 7: 3 35 A G 99.9
# 8: 3 36 A G 99.9
The above code can probably be streamlined a bit, but I thought this should be enough to get you started.
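For instance, one possible streamlining (my sketch, untested beyond this example; expand_region() is a helper name I made up) keeps the same logic but does the expansion row by row:
expand_region <- function(r) {
  x <- as.numeric(strsplit(r, "..", fixed = TRUE)[[1]])
  if (length(x) > 1) seq(x[1], x[2]) else x
}
DT[, row := .I]                      # temporary row id, so each row is its own group
DT[, .(Chrom_no = Chrom_no,          # length-1 values are recycled within each group
       Region = expand_region(Region),
       Ref = unlist(strsplit(Ref, "")),
       Alt = unlist(strsplit(Alt, "")),
       Prob = Prob),
   by = row][, row := NULL][]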