R loop to standardize variables using data.table

How can I loop over certain variables in order to standardize them? I am trying to set up the code but it's not working. My idea was to use assign or eval, but neither seems to work. Below is a reproducible example.
if (!require('data.table')) {install.packages('data.table'); library('data.table')}
a <- seq(0,10,1)
b <- seq(99,100,0.1)
dt <- data.table(a,b)
# Expected result
dt[,z_a:= ((a-mean(a,na.rm=TRUE))/sd(a,na.rm=TRUE)) ]
dt[,z_b:= ((b-mean(b,na.rm=TRUE))/sd(b,na.rm=TRUE)) ]
# Loop not working
stdvars <- c(a,b)
for (v in stdvars) {
dt[z_v:= ((v-mean(v,na.rm=TRUE))/sd(v,na.rm=TRUE)) ]
}
dt

I would advise against using explicit loops when working with data.table, as its internal functionality is many times more efficient. In particular, you can define a function which you call through lapply over a specified subset (.SD):
standardise = function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)} # Define a standardising function
oldcols = c('a', 'b') # Name of old columns
newcols = paste0('z_', oldcols) # Name of new columns ('z_a' and 'z_b')
dt[, (newcols) := lapply(.SD, standardise), .SDcols = oldcols]
Output:
> dt
a b z_a z_b
1: 0 99.0 -1.5075567 -1.5075567
2: 1 99.1 -1.2060454 -1.2060454
3: 2 99.2 -0.9045340 -0.9045340
4: 3 99.3 -0.6030227 -0.6030227
5: 4 99.4 -0.3015113 -0.3015113
6: 5 99.5 0.0000000 0.0000000
7: 6 99.6 0.3015113 0.3015113
8: 7 99.7 0.6030227 0.6030227
9: 8 99.8 0.9045340 0.9045340
10: 9 99.9 1.2060454 1.2060454
11: 10 100.0 1.5075567 1.5075567
.SD means that you are calling a lapply across a Subset of the Data, defined by the .SDcols argument. In this case, we define newcols as the application of standardise function across the subset oldcols.
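If an explicit loop over column names is still preferred, as in the question, one possible sketch uses data.table's set() function, which also assigns by reference. The data and the standardise function are taken from the answer above; the loop variable v is only illustrative.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(a = seq(0, 10, 1), b = seq(99, 100, 0.1))

# Standardising function from the answer above
standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Loop over column *names* (strings), not over the column values
for (v in c("a", "b")) {
  # dt[[v]] extracts the column named by v; set() adds z_<v> by reference
  set(dt, j = paste0("z_", v), value = standardise(dt[[v]]))
}
dt
```

The key difference from the failing loop in the question is that the loop runs over quoted names and looks each column up explicitly with dt[[v]].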

There is a built-in function, scale, that standardizes variables.
Missing values are ignored when the mean and standard deviation are computed.
So it would be more direct to proceed as follows:
cols <- c("a", "b")
dt[, paste0("z_", cols) := lapply(.SD, scale), .SDcols = cols]
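One caveat worth checking: scale() returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes rather than a plain vector. A minimal sketch, assuming plain numeric columns are wanted, wraps the call in as.numeric(). The small table with a missing value below is made up for illustration.

```r
library(data.table)

# Illustrative data with one missing value in column a
dt <- data.table(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))
cols <- c("a", "b")

# as.numeric() drops the matrix dimensions and the scaling attributes
dt[, paste0("z_", cols) := lapply(.SD, function(x) as.numeric(scale(x))),
   .SDcols = cols]

# Manual standardisation for comparison; NA is skipped by na.rm = TRUE
manual <- (dt$a - mean(dt$a, na.rm = TRUE)) / sd(dt$a, na.rm = TRUE)
dt
```

The missing value stays NA in the standardised column; it is only left out of the mean and standard deviation.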

An option is to use non-standard evaluation:
for (v in c("a", "b")) {
eval(substitute(dt[, paste0("z_", v) := (V - mean(V, na.rm=TRUE)) / sd(V, na.rm=TRUE)],
list(V=as.name(v))))
}
dt
Or putting it in a function:
f <- function(DT, v) {
lhs <- paste0("z_", as.list(match.call())$v)
eval(substitute(
DT[, (lhs) := (v - mean(v, na.rm=TRUE)) / sd(v, na.rm=TRUE)]))
}
f(dt, a)
f(dt, b)
dt

Related

R: Dynamically referencing and operating on variables in data frame

I am trying to dynamically reference and perform operations on a vector in a data frame. I've tried various forms of eval, parse, and so on, but they either return the string I provide or throw errors. Does anyone have a solution? As I propose in the pseudo-code below, the solution is presumably to replace DO_SOMETHING() with some other function.
# Example data
mydat <- data.frame(x = rnorm(10))
# Function to add 5 to specified variable in a data frame
add5 <- function(data, var){
var_ref <- paste0("data$", var)
out <- DO_SOMETHING(var_ref) + 5
return(out)
}
add5(mydat,x) # returns a numeric vector value of 5
class(add5(mydat,x)) # numeric
If you want to pass unquoted column names, you could use deparse(substitute()) like this:
add5 <- function(data, var){
out <- data[deparse(substitute(var))] + 5
return(out)
}
add5(mydat,x)
Using dplyr and some non-standard evaluation with curly-curly, we can do:
library(dplyr)
library(rlang)
add5 <- function(data, var){
data %>% mutate(out = {{var}} + 5)
}
add5(mydat, x)
# x out
#1 1.1604 6.16
#2 0.7002 5.70
#3 1.5868 6.59
#4 0.5585 5.56
#5 -1.2766 3.72
#6 -0.5733 4.43
#7 -1.2246 3.78
#8 -0.4734 4.53
#9 -0.6204 4.38
#10 0.0421 5.04
With data.table, unquoting argument names is easy. If you start writing functions that take variable names, I recommend using data.table (see a blog post I wrote on the subject).
With one variable, use get to look up the column from its name:
library(data.table)
data <- data.table(x = rnorm(10))
myvar <- "x"
data[, out := get(myvar) + 5]
data
x out
1: -0.30229987 4.697700
2: 0.51658585 5.516586
3: 0.12180432 5.121804
4: 1.53438805 6.534388
5: 0.06213513 5.062135
6: 0.17935070 5.179351
7: 0.70002065 5.700021
8: 0.12067590 5.120676
9: -0.41002931 4.589971
10: 0.45385072 5.453851
Note that I don't need to reassign result because := updates by reference.
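The by-reference behaviour can be sketched as follows. The names alias and snapshot are made up for this illustration; copy() is data.table's function for taking an independent copy.

```r
library(data.table)

data <- data.table(x = c(1, 2, 3))
myvar <- "x"

alias <- data           # plain assignment: another name for the same table
snapshot <- copy(data)  # explicit copy: independent of later changes

# := modifies the table in place, so no reassignment is needed
data[, out := get(myvar) + 5]

names(alias)     # the new column is visible through the alias
names(snapshot)  # the copy still has only the original column
```

This is why copy() is the usual choice when the original table must stay untouched.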
With several variables, use .SD with lapply. This syntax applies a function over the Subset of Data (.SD). The .SDcols argument controls which columns are included in that subset.
This is a very general approach that works in many situations.
data <- data.table(x = rnorm(10), y = rnorm(10))
data[, c('out1','out2') := lapply(.SD, function(x) return(x + 5)), .SDcols = c("x","y")]
data
x y out1 out2
1: 0.91187875 -0.2010539 5.911879 4.798946
2: -0.70906903 0.2074829 4.290931 5.207483
3: -0.52517961 0.2027444 4.474820 5.202744
4: 0.09967933 -1.2315601 5.099679 3.768440
5: -0.40392510 -0.1777705 4.596075 4.822229
6: 0.65891623 0.2394889 5.658916 5.239489
7: 0.76275090 1.5695957 5.762751 6.569596
8: -0.52395704 -0.7083462 4.476043 4.291654
9: 0.52728890 -1.1308284 5.527289 3.869172
10: -1.00418691 -0.5569468 3.995813 4.443053
I could have used this approach with one column (.SDcols = 'x').

Lapply function to list of data.tables by reference silently

I have a list of data.tables and I want to apply a function to each data.table. I have things set up to use := inside an lapply function. Everything works fine and my outputs are updated by reference, but my function also prints to the console. This is part of a much larger project and printing this step to the console is not ideal.
How do I run this 'silently' without printing? Is there a better way to structure the workflow / code?
dt1 <- data.table(a = rnorm(1:10),
b = rnorm(1:10))
dt2 <- data.table(a = rnorm(1:10),
b = rnorm(1:10))
dts <- list(dt1, dt2)
lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
dts
dts now has a c column, but the outputs were displayed in the console. This code chunk is called from another function.
You can use a for loop
for(dt in dts) dt[, ':='(c = a + b)]
You can assign the lapply call which will suppress the output
dts <- lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
We can use walk, which will not print anything to the console
library(purrr)
walk(dts, ~ .x[, `:=`(c = a + b)])
dts
#[[1]]
# a b c
# 1: -0.1069952 0.1115983 0.004603111
# 2: 0.3228771 -0.8400846 -0.517207530
# 3: -1.6072728 -0.2727947 -1.880067477
# 4: 0.1715614 -0.3864995 -0.214938065
# 5: 1.8233350 -1.0786569 0.744678084
# 6: 0.2366026 -0.6166318 -0.380029253
# 7: 0.2373992 0.2251559 0.462555116
# 8: -0.1075611 -1.0418174 -1.149378504
# 9: 1.6742520 -0.5635583 1.110693774
#10: 2.4733842 2.1091365 4.582520731
#[[2]]
# a b c
# 1: -0.8332617 1.67201117 0.83874947
# 2: 1.3688393 1.12168046 2.49051974
# 3: 1.0208642 -1.18482073 -0.16395650
# 4: 0.6784662 2.15979872 2.83826493
# 5: -0.4351644 -0.04629453 -0.48145894
# 6: 1.3133550 -1.03423308 0.27912197
# 7: 1.0143396 -0.84787780 0.16646185
# 8: -0.9622108 0.92338456 -0.03882627
# 9: -0.3106202 1.08886031 0.77824008
#10: 0.7602507 -0.08996701 0.67028370
Or wrap with invisible along with lapply
invisible(lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
}))
Using set:
for (i in seq_along(dts)) set(dts[[i]], j = "c", value = dts[[i]]$a + dts[[i]]$b)

Pass strings as code to summarize multiple columns with data.table

We would like to summarize a data table to create a lot of new variables that result from the combination of columns names and values from the original data.
Here is a reproducible example illustrating the result we would like to achieve, with only two columns for the sake of brevity
library(data.table)
data('mtcars')
setDT(mtcars)
# Desired output
mtcars[, .(
acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T),
acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)
), by= .(cyl, gear)]
Because we want to summarize a lot of columns, we created a function that returns all the strings that we would use to create each summary variable. In this example, we have this:
a <- 'acm_hp_carb2 = mean(hp[which( carb <= 2)], na.rm=T)'
b <- 'acm_wt_am1 = mean(wt[which( am== 1)], na.rm=T)'
And here is our failed attempt. Note that the new columns created do not receive the names we want to assign to them.
mtcars[, .(
eval(parse(text=a)),
eval(parse(text=b))
), by= .(cyl, gear)]
Seems like the only part which isn't working is the column names. If you put a and b in a vector and add names to them, you can use lapply to do the eval(parse and keep the names from the vector. I used regex to get the names, but presumably in the real code you can assign the names as whatever variable you're using to construct the strings in the first place.
Result has many NaNs but it matches your desired output.
to_make <- c(a, b)
to_make <- setNames(to_make, sub('^(.*) =.*', '\\1', to_make))
mtcars[, lapply(to_make, function(x) eval(parse(text = x)))
, by= .(cyl, gear)]
# cyl gear acm_hp_carb2 acm_wt_am1
# 1: 6 4 NaN 2.747500
# 2: 4 4 76.0 2.114167
# 3: 6 3 107.5 NaN
# 4: 8 3 162.5 NaN
# 5: 4 3 97.0 NaN
# 6: 4 5 102.0 1.826500
# 7: 8 5 NaN 3.370000
# 8: 6 5 NaN 2.770000
You can make one call and eval it:
f = function(...){
ex = parse(text = sprintf(".(%s)", paste(..., sep=", ")))[[1]]
print(ex)
mtcars[, eval(ex), by=.(cyl, gear)]
}
f(a,b)
a2 <- 'acm_hp_carb2 = mean(hp[carb <= 2], na.rm=T)'
b2 <- 'acm_wt_am1 = mean(wt[am == 1], na.rm=T)'
f(a2, b2)
I guess the which() is not needed.
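A quick way to check that guess on this data is to compute the summary both ways and compare. This is only a sketch for mtcars, which has no missing values in carb; the result names with_which and without_which are made up here.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Version from the question, with which()
with_which <- dt[, .(acm_hp_carb2 = mean(hp[which(carb <= 2)], na.rm = TRUE)),
                 by = .(cyl, gear)]

# Same summary with plain logical subsetting
without_which <- dt[, .(acm_hp_carb2 = mean(hp[carb <= 2], na.rm = TRUE)),
                    by = .(cyl, gear)]

# identical() treats NaN as equal to NaN, so empty groups compare fine
identical(with_which$acm_hp_carb2, without_which$acm_hp_carb2)
```

If the condition column contained NA, logical subsetting would yield NA elements where which() drops them, but na.rm = TRUE removes those before the mean is taken.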

data.table operations by column name

Suppose I have a data.table
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
I can add new columns like this:
a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
id a b sa sb
1: 1 21 11 43 23
2: 1 22 12 43 23
3: 2 23 13 47 27
4: 2 24 14 47 27
5: 3 25 15 25 15
However, suppose that I have column names instead:
for (n in c("a","b")) {
s <- paste0("s",n)
a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}
What do I do?
You can also do this:
a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")
a[, c("sa", "sb") := lapply(.SD, sum), by = id]
Or slightly more generally:
cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]
This is similar to :
How to generate a linear combination of variables and update table using data.table in a loop call?
but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design decision and set() is unlikely to change in that regard.
I sometimes use the EVAL helper at the end of that answer.
https://stackoverflow.com/a/20808573/403310 Some wince at that approach but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it) you can add a print inside your EVAL helper function.
However, in this simple example you can wrap the LHS of := with brackets to tell data.table to lookup the value (clearer than with=FALSE), and the RHS needs a get().
for (n in c("a","b")) {
s <- paste0("s",n)
a[, (s) := sum(get(n)), by="id"]
}
Edit 2020-02-15 about ..
data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[ , ..n1] and dt[ , ..n2] in the below
Have a look at with in ?data.table:
dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[, n3 := dt[ , n1, with = FALSE ] * dt[ , n2, with = FALSE ], with = FALSE ]
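The .. prefix mentioned in the edit note can be sketched like this, assuming n1 holds a column name as a string (the snippet above does not show how n1 is defined, so the value "a" is an assumption).

```r
library(data.table)

dt <- data.table(id = 1:5, a = 21:25, b = 11:15, key = "id")
n1 <- "a"  # assumed: a variable holding a column name

# ..n1 looks n1 up in the calling scope instead of treating it as a column
sel_dots <- dt[, ..n1]

# Older equivalent spelling
sel_with <- dt[, n1, with = FALSE]
```

Both forms return a one-column data.table containing column a.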
EDIT:
Or you just change the colnames forth and back:
dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
dt[ , dt.names["n3"] := 1L, with = FALSE ]
setnames( dt, dt.names, names(dt.names) )
dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )
which works together with by.
Here is an approach that does the call mangling and avoids any overhead with .SD
# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create named list of names
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <- as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])
# id a b sa sb
# 1: 1 21 11 43 23
# 2: 1 22 12 43 23
# 3: 2 23 13 47 27
# 4: 2 24 14 47 27
# 5: 3 25 15 25 15

Dynamic column names in data.table

I am trying to add columns to my data.table, where the names are dynamic. In addition, I need to use the by argument when adding these columns. For example:
test_dtb <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10,10))
cn <- parse(text = "blah")
test_dtb[ , eval(cn) := mean(a), by = id]
# Error in `[.data.table`(test_dtb, , `:=`(eval(cn), mean(a)), by = id) :
# LHS of := must be a single column name when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
Another attempt:
cn <- "blah"
test_dtb[ , cn := mean(a), by = id, with = FALSE]
# Error in `[.data.table`(test_dtb, , `:=`(cn, mean(a)), by = id, with = FALSE) : 'with' must be TRUE when 'by' or 'keyby' is provided
Update from Matthew:
This now works in v1.8.3 on R-Forge. Thanks for highlighting!
See this similar question for new examples:
Assign multiple columns using data.table, by group
From data.table 1.9.4, you can just do this:
## A parenthesized symbol, `(cn)`, gets evaluated to "blah" before `:=` is carried out
test_dtb[, (cn) := mean(a), by = id]
head(test_dtb, 4)
# a b id blah
# 1: 41 19 1 54.2
# 2: 4 99 2 50.0
# 3: 49 85 3 46.7
# 4: 61 4 4 57.1
See Details in ?:=:
DT[i, (colvector) := val]
[...] NOW PREFERRED [...] syntax. The parens are enough to stop the LHS being a symbol; same as c(colvector)
Original answer:
You were on exactly the right track: constructing an expression to be evaluated within the call to [.data.table is the data.table way to do this sort of thing. Going just a bit further, why not construct an expression that evaluates to the entire j argument (rather than just its left hand side)?
Something like this should do the trick:
## Your code so far
library(data.table)
test_dtb <- data.table(a=sample(1:100, 100),b=sample(1:100, 100),id=rep(1:10,10))
cn <- "blah"
## One solution
expr <- parse(text = paste0(cn, ":=mean(a)"))
test_dtb[,eval(expr), by=id]
## Checking the result
head(test_dtb, 4)
# a b id blah
# 1: 30 26 1 38.4
# 2: 83 82 2 47.4
# 3: 47 66 3 39.5
# 4: 87 23 4 65.2
Expression can be constructed with bquote.
cn <- "blah"
expr <- bquote(.(as.name(cn)):=mean(a))
test_dtb[,eval(expr), by=id]
I believe setnames(DT, c(col.names)) yields the most readable code
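One way to read that suggestion, as a sketch only: compute the column under a fixed placeholder name, then rename it with setnames(). The placeholder name tmp is made up here, and the old/new form of setnames() is used so that only that one column is renamed.

```r
library(data.table)

test_dtb <- data.table(a = 1:6, b = 6:1, id = rep(1:2, 3))
cn <- "blah"

# Step 1: create the grouped column under a fixed placeholder name
test_dtb[, tmp := mean(a), by = id]

# Step 2: rename the placeholder to the dynamic name, by reference
setnames(test_dtb, "tmp", cn)
test_dtb
```

This avoids any expression building, at the cost of a second statement and a temporary name that must not clash with an existing column.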

Resources