R: Dynamically referencing and operating on variables in a data frame

I am trying to dynamically reference and perform operations on a vector in a data frame. I've tried various forms of eval, parse, and so on, but they either return the string I provide or throw errors. Does anyone have a solution? As I propose in the pseudo-code below, the solution is presumably to replace DO_SOMETHING() with some other function.
# Example data
mydat <- data.frame(x = rnorm(10))
# Function to add 5 to the specified variable in a data frame
add5 <- function(data, var){
  var_ref <- paste0("data$", var)
  out <- DO_SOMETHING(var_ref) + 5
  return(out)
}
add5(mydat, x)        # should return mydat$x + 5 as a numeric vector
class(add5(mydat, x)) # should be "numeric"
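For reference, if the column name arrives as a string rather than a bare symbol, a minimal sketch of what DO_SOMETHING() can be replaced by is base R's [[ operator, with no eval/parse needed:
# [[ extracts the column by name, returning the vector itself
add5 <- function(data, var){
  data[[var]] + 5
}
add5(mydat, "x")        # numeric vector: mydat$x + 5
class(add5(mydat, "x")) # "numeric"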

If you want to pass unquoted column names, you could use deparse(substitute()) like this:
add5 <- function(data, var){
  # [[ returns the vector itself; single brackets would return a one-column data frame
  out <- data[[deparse(substitute(var))]] + 5
  return(out)
}
add5(mydat, x)
Using dplyr and non-standard evaluation with curly-curly, we can do:
library(dplyr)
library(rlang)
add5 <- function(data, var){
  data %>% mutate(out = {{ var }} + 5)
}
add5(mydat, x)
# x out
#1 1.1604 6.16
#2 0.7002 5.70
#3 1.5868 6.59
#4 0.5585 5.56
#5 -1.2766 3.72
#6 -0.5733 4.43
#7 -1.2246 3.78
#8 -0.4734 4.53
#9 -0.6204 4.38
#10 0.0421 5.04
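If the column name arrives as a string instead of a bare name, rlang's .data pronoun covers that case; a sketch (add5_chr is just an illustrative name):
add5_chr <- function(data, var){
  # .data[[var]] looks the string up among the data frame's columns
  data %>% mutate(out = .data[[var]] + 5)
}
add5_chr(mydat, "x")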

With data.table, unquoting argument names is easy. If you start writing functions that take variable names as arguments, I recommend using data.table (see a blog post I wrote on the subject).
With one variable, you can use get() to unquote the variable name:
library(data.table)
data <- data.table(x = rnorm(10))
myvar <- "x"
data[, out := get(myvar) + 5]
data
x out
1: -0.30229987 4.697700
2: 0.51658585 5.516586
3: 0.12180432 5.121804
4: 1.53438805 6.534388
5: 0.06213513 5.062135
6: 0.17935070 5.179351
7: 0.70002065 5.700021
8: 0.12067590 5.120676
9: -0.41002931 4.589971
10: 0.45385072 5.453851
Note that I don't need to reassign the result, because := updates the data.table by reference.
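If you do not want the original table modified in place, copy() gives a deep copy to work on; a minimal sketch (add5_dt is an illustrative name):
add5_dt <- function(data, var){
  res <- copy(data)          # deep copy, so := below leaves `data` untouched
  res[, out := get(var) + 5]
  res[]                      # trailing [] so the result prints at the console
}
add5_dt(data, "x")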
With several variables, you can use .SD with lapply(). This syntax applies a function over the Subset of Data (.SD); the .SDcols argument controls which columns are included in that subset.
This is a very general approach that works in many situations.
data <- data.table(x = rnorm(10), y = rnorm(10))
data[, c('out1','out2') := lapply(.SD, function(x) x + 5), .SDcols = c("x","y")]
data
x y out1 out2
1: 0.91187875 -0.2010539 5.911879 4.798946
2: -0.70906903 0.2074829 4.290931 5.207483
3: -0.52517961 0.2027444 4.474820 5.202744
4: 0.09967933 -1.2315601 5.099679 3.768440
5: -0.40392510 -0.1777705 4.596075 4.822229
6: 0.65891623 0.2394889 5.658916 5.239489
7: 0.76275090 1.5695957 5.762751 6.569596
8: -0.52395704 -0.7083462 4.476043 4.291654
9: 0.52728890 -1.1308284 5.527289 3.869172
10: -1.00418691 -0.5569468 3.995813 4.443053
I could have used this approach with one column (.SDcols = 'x').
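Spelled out, that single-column version might look like this (out3 is an illustrative name):
# Same .SD + lapply pattern, restricted to one column
data[, c("out3") := lapply(.SD, function(x) x + 5), .SDcols = "x"]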


Using mutate with a stored list of formulas over specified columns

This is a follow-up to my previous question here, which @ronak_shah was kind enough to answer. I apologize as some of this information may be redundant for anyone who saw that post, but I figured it is best to post a new question rather than modify the previous version.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help, it would be much appreciated!
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
                  b = 6:10,
                  c = 11:15)
## function
func1 <- function(x, y) {x + y}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
                      "y_val" = c(3, 4),
                      stringsAsFactors = FALSE) %>%
  mutate(func = paste0("func1(", col, ",", y_val, ")"))
## desired end result
dat %>%
  mutate(a_new = func1(a, 3),
         c_new = func1(c, 4))
## attempting to generate new columns based on @ronak_shah's answer to my
## previous question, but it fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
                                        dat[col_mod$col], col_mod$func)
We can use pmap from purrr: transmute the columns based on the name from 'col' (..1), the function from 'func' (..3), and the 'y_val' from ..2; assign (:=) the value to a new column whose name is built with paste (or str_c); and bind the columns to the original dataset.
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
  bind_cols(dat, .)
Output:
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to parse the function call as is, i.e. without changing the func column (it stays as func1(a,3) and func1(c,4)), use parse_expr and eval:
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   eval(rlang::parse_expr(..3)))) %>%
  bind_cols(dat, .)
Output:
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map:
# Each row of col_mod supplies one set of arguments; the generated call in
# the func column (z) is parsed and evaluated with `dat` as the environment
dat[paste0(col_mod$col, '_new')] <- do.call(Map, c(f =
  function(x, y, z) eval(parse(text = z), envir = dat), unname(col_mod)))

R loop to standardize variables using data.table

How can I loop over certain variables in order to standardize them? I am trying to set up the code, but it's not working; my idea was to use assign or eval, but neither seems to work. Below is a reproducible example.
if (!require('data.table')) {install.packages('data.table'); library('data.table')}
a <- seq(0,10,1)
b <- seq(99,100,0.1)
dt <- data.table(a,b)
# Expected result
dt[, z_a := ((a - mean(a, na.rm=TRUE)) / sd(a, na.rm=TRUE))]
dt[, z_b := ((b - mean(b, na.rm=TRUE)) / sd(b, na.rm=TRUE))]
# Loop not working
stdvars <- c(a,b)
for (v in stdvars) {
dt[z_v:= ((v-mean(v,na.rm=TRUE))/sd(v,na.rm=TRUE)) ]
}
dt
I would advise against using explicit loops when working with data.table, as its internal functionality is many times more efficient. In particular, you can define a function which you call through lapply over a specified subset (.SD):
standardise = function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)} # Define a standardising function
oldcols = c('a', 'b') # Name of old columns
newcols = paste0('z_', oldcols) # Name of new columns ('z_a' and 'z_b')
dt[, (newcols) := lapply(.SD, standardise), .SDcols = oldcols]
Output:
> dt
a b z_a z_b
1: 0 99.0 -1.5075567 -1.5075567
2: 1 99.1 -1.2060454 -1.2060454
3: 2 99.2 -0.9045340 -0.9045340
4: 3 99.3 -0.6030227 -0.6030227
5: 4 99.4 -0.3015113 -0.3015113
6: 5 99.5 0.0000000 0.0000000
7: 6 99.6 0.3015113 0.3015113
8: 7 99.7 0.6030227 0.6030227
9: 8 99.8 0.9045340 0.9045340
10: 9 99.9 1.2060454 1.2060454
11: 10 100.0 1.5075567 1.5075567
.SD means that you are calling lapply across a Subset of the Data, defined by the .SDcols argument. In this case, we define newcols as the application of the standardise function across the subset oldcols.
There is a built-in function, scale, which standardizes variables.
Missing values are removed when standardizing.
So it is more direct to proceed as follows:
cols <- c("a", "b")
dt[, paste0("z_", cols) := lapply(.SD, scale), .SDcols = cols]
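One caveat: scale() returns a one-column matrix, so the new columns may carry matrix dimensions and attributes; wrapping it in as.vector() keeps them plain numeric, as in this sketch:
# as.vector() drops the dim and attributes that scale() attaches
dt[, paste0("z_", cols) := lapply(.SD, function(x) as.vector(scale(x))), .SDcols = cols]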
An option is to use non-standard evaluation:
for (v in c("a", "b")) {
  eval(substitute(dt[, paste0("z_", v) := (V - mean(V, na.rm=TRUE)) / sd(V, na.rm=TRUE)],
                  list(V = as.name(v))))
}
dt
Or putting it in a function:
f <- function(DT, v) {
  lhs <- paste0("z_", as.list(match.call())$v)
  eval(substitute(
    DT[, (lhs) := (v - mean(v, na.rm=TRUE)) / sd(v, na.rm=TRUE)]))
}
f(dt, a)
f(dt, b)
dt

Pass strings as code to summarize multiple columns with data.table

We would like to summarize a data table to create many new variables that result from combinations of column names and values from the original data.
Here is a reproducible example illustrating the result we would like to achieve, with only two columns for the sake of brevity:
library(data.table)
data('mtcars')
setDT(mtcars)
# Desired output
mtcars[, .(
acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T),
acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)
), by= .(cyl, gear)]
Because we want to summarize a lot of columns, we created a function that returns all the strings we would use to create each summary variable. In this example, we have these:
a <- 'acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T)'
b <- 'acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)'
And here is our failed attempt. Note that the new columns created do not receive the names we want to assign to them.
mtcars[, .(
eval(parse(text=a)),
eval(parse(text=b))
), by= .(cyl, gear)]
It seems like the only part that isn't working is the column names. If you put a and b in a vector and add names to them, you can use lapply to do the eval(parse()) and keep the names from the vector. I used regex to get the names, but presumably in the real code you can assign the names from whatever variable you're using to construct the strings in the first place.
Result has many NaNs but it matches your desired output.
to_make <- c(a, b)
to_make <- setNames(to_make, sub('^(.*) =.*', '\\1', to_make))
mtcars[, lapply(to_make, function(x) eval(parse(text = x)))
       , by= .(cyl, gear)]
# cyl gear acm_hp_carb2 acm_wt_am1
# 1: 6 4 NaN 2.747500
# 2: 4 4 76.0 2.114167
# 3: 6 3 107.5 NaN
# 4: 8 3 162.5 NaN
# 5: 4 3 97.0 NaN
# 6: 4 5 102.0 1.826500
# 7: 8 5 NaN 3.370000
# 8: 6 5 NaN 2.770000
You can make one call and eval it:
f = function(...){
ex = parse(text = sprintf(".(%s)", paste(..., sep=", ")))[[1]]
print(ex)
mtcars[, eval(ex), by=.(cyl, gear)]
}
f(a,b)
a2 <- 'acm_hp_carb2 = mean(hp[carb <= 2], na.rm=T)'
b2 <- 'acm_wt_am1 = mean(wt[am == 1], na.rm=T)'
f(a2, b2)
I guess the which() is not needed.
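That guess checks out: with na.rm=T passed to mean(), subsetting with the logical vector directly gives the same result as going through which(); a quick sanity check:
# Both groupings should agree (which() only differs in how NA indices behave)
all.equal(
  mtcars[, .(v = mean(hp[which(carb <= 2)], na.rm=T)), by = .(cyl, gear)],
  mtcars[, .(v = mean(hp[carb <= 2], na.rm=T)), by = .(cyl, gear)]
)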

Pass a column name as an object and not a string for data.table

I'm using data.table to aggregate by collapsing a column within groups. I know how to do this when I reference the column directly by name, but I need to reference it by position instead. I know this method:
dt[,X := list(paste(X, collapse = ";")),by = list(Y,Z)]
What I want to do now is:
dt[,names(dt)[1] := list(paste(names(dt)[1], collapse = ";")),by = list(Y,Z)]
But with this code, it just writes the literal string "X" on each line.
Here is an example:
X <- c("a","b","c","d","e","f","g")
Y <- c(1,2,3,4,4,6,4)
Z <- c(10,11,23,8,8,1,3)
dt <- data.table(X,Y,Z)
This is the desired output. I need a programmatic approach because I'm trying to do this for multiple columns (I have a data frame with 400 columns):
X Y Z
1: a 1 10
2: b 2 11
3: c 3 23
4: d;e 4 8
5: f 6 1
6: g 4 3
You should wrap names(dt)[1] inside get():
dt[,names(dt)[1] := list(paste(get(names(dt)[1]), collapse = ";")),by = list(Y,Z)]
Additionally, if you want to deduplicate your data, you can use unique(dt).
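Putting the two together reproduces the desired six-row output above; a sketch:
# Collapse X by reference within each (Y, Z) group, then drop duplicate rows
dt[, names(dt)[1] := paste(get(names(dt)[1]), collapse = ";"), by = list(Y, Z)]
unique(dt)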
To apply your functions to multiple columns, you can use .SD in combination with lapply(). For example, pasting together the first two columns, grouped by Z:
dt[, lapply(.SD, function(x) paste(x, collapse=";")), by=list(Z),.SDcols=names(dt)[1:2]]
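And since .SD defaults to every non-grouping column, the same pattern scales to all 400 columns without naming them; a sketch:
# With no .SDcols, .SD covers every column not in `by`
dt[, lapply(.SD, paste, collapse = ";"), by = list(Y, Z)]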

apply function to all values in data.table subset

I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), to all values in this subset? I can do it this way, but I wonder if there are better ways, more in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how the matrix and data.table approaches compare in performance:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
Please see the solution posted here for an even more general solution. It may also help:
data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
Part 1. A Step-by-Step Solution
(1.a) Get the data into Data.Table format:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
(1.b) Then, get the list of column names:
# Get the list of names
sub <- c('spB', 'spF', 'spD')
(1.c) Define the function you want to apply
# Define the function you wish to apply;
# here, normalize is an example standardizing function:
normalize <- function(X, X.mean = mean(X, na.rm=T), X.sd = sd(X, na.rm=T)){
  X <- (X - X.mean) / X.sd
  return(X)
}
(1.d) Apply the function:
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
Part 2. Some Advantages for Data.Table approach
Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.
(2.a) Apply functions within group by using the "by ="
One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.
In the example here, I assume you have a grouping variable called "Grp".
With the by=Grp part, the normalization now happens within each group.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
First, we create a new column called Combined.Data for data in all three columns identified in "sub"
Next to each row of the combined data, the appropriate Participant Id will repeat in column P.Id
In the second step, done in this portion of the code:
[, .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
We can create a new column called Transformed to store the normalized values that result from applying the function normalize().
In addition, we can also include the Combined.Data column.
So, with this single line:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
we subset columns,
collapse data across the subset,
keep track of the identifier for each datum (P.Id) even when collapsed,
apply a transformation on the entire collapsed data, and
end-up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, & (3) Combined.Data (original values).
and, the order(P.Id) allows the output to appear meaningfully ordered.
The same would be possible with matrix approach, but would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
(2.c) Finally, if you just wish to keep row information as simple row.numbers, you can use the .I feature of the data.table package:
pdist[, .(.I, normalize(unlist(.SD))), .SDcols = sub]
This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
Part 3. Disadvantage: Time Cost
I recreated the corrected time comparison shown above, and the data.table solution does take significantly longer:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           unlist(., recursive = FALSE, use.names = FALSE) %>%
           mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
           mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
