data.table lapply with multiple .SDcols

I am trying to use an lapply that takes two lists of columns into account (all numeric) in what seems like it should be a simple line of code, but the output is not as expected.
This is my code:
x <- 50
measure <- c("haz", "waz", "whz", "htcm", "wtkg", "bmi")
new_measure_1.5 <- paste(measure, "1.5", sep = "_")
new_temp_cols <- paste("temp", new_measure_1.5, sep = "_")
new_columns <- paste(new_measure_1.5, "1", sep = "_")
newcols_1.5_months <- function(x, agedays, new_temp_cols, y){
  ifelse(agedays == x, new_temp_cols, y)
}
DT[, (new_columns) := lapply(.SD, function(y) newcols_1.5_months(x, agedays, new_temp_cols, y)), .SDcols = new_columns]
The above code results in new_columns (haz_1.5_1, waz_1.5_1, whz_1.5_1, wtkg_1.5_1, htcm_1.5_1, bmi_1.5_1) holding the names of the columns in the list new_temp_cols (temp_haz_1.5, temp_waz_1.5, temp_whz_1.5, temp_wtkg_1.5, temp_htcm_1.5, temp_bmi_1.5) as opposed to the values they hold in my data table. It seems that R is reading the list as a vector of strings rather than a vector of columns. Why is this?
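To make the failure mode concrete, here is a minimal sketch (with made-up values) of what ifelse() does when handed a character vector of column names:
nm <- c("temp_haz_1.5", "temp_waz_1.5")
ifelse(c(TRUE, FALSE), nm, 0)
#> [1] "temp_haz_1.5" "0"
# nm holds column *names*, so ifelse() recycles the strings themselves as values;
# inside a data.table's j, mget(nm) would return the actual columns as a list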
I've tried using multiple .SDcols, but this doesn't work:
DT[, (new_columns) := lapply(.SD, function(y) newcols_1.5_months(x, agedays, new_temp_cols, y)), .SDcols = c(new_columns, new_temp_cols) ]
Is there a simple fix to this?
EDIT: adding a small subset of dummy data similar to my data table:
measure <- c("haz", "waz")
new_measure_1.5 <- paste(measure, "1.5", sep = "_")
new_temp_cols <- paste("temp", new_measure_1.5, sep = "_")
new_columns <- paste(new_measure_1.5, "1", sep = "_")
anthro <- data.table(
  agedays = c(25,50,53,22,37,50,12,45,50,15,33,50),
  temp_haz_1.5 = c(1.2,1.5,1.7,2.0,4.5,6.7,6.8,6.7,4.5,6.6,8.9,6.7),
  temp_waz_1.5 = c(3.2,1.8,6.7,2.8,3.5,7.7,9.8,1.7,6.9,3.8,0.9,4.7),
  haz_1.5_1 = c(1.2,2.5,4.7,7.0,4.7,6.8,6.3,2.7,5.5,8.6,3.9,6.7),
  waz_1.5_1 = c(6.2,2.5,5.7,7.0,2.5,7.7,8.8,9.7,2.5,4.6,5.9,6.7))

If we have multiple columns, use Map to loop over each corresponding pair of columns and apply the function:
library(data.table)
x <- 50
DT[, (new_columns) := Map(function(u, y)
       newcols_1.5_months(u, DT[['agedays']], x, y),
     .SD[, new_columns, with = FALSE],
     .SD[, new_temp_cols, with = FALSE])]
Perhaps the function can be:
newcols_1.5_months <- function(u, agedays, x, y){
  ifelse(agedays == x, u, y)
}
Data:
DT <- data.table(agedays = c(25,50,53,22,37,50,12,45,50,15,33,50),
                 temp_haz_1.5 = c(1.2,1.5,1.7,2.0,4.5,6.7,6.8,6.7,4.5,6.6,8.9,6.7),
                 temp_waz_1.5 = c(3.2,1.8,6.7,2.8,3.5,7.7,9.8,1.7,6.9,3.8,0.9,4.7),
                 haz_1.5_1 = c(1.2,2.5,4.7,7.0,4.7,6.8,6.3,2.7,5.5,8.6,3.9,6.7),
                 waz_1.5_1 = c(6.2,2.5,5.7,7.0,2.5,7.7,8.8,9.7,2.5,4.6,5.9,6.7))
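A quick way to verify the result (a sketch, assuming the two-column dummy setup above, i.e. measure <- c("haz", "waz")):
# after running the Map() call above, rows with agedays == 50 should
# carry the temp_* values in the *_1 columns:
DT[agedays == 50, .(agedays, temp_haz_1.5, haz_1.5_1, temp_waz_1.5, waz_1.5_1)]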

Related

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

I wrote some code to perform oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are no longer exactly the same. I'm quite happy that it works now as intended, but... it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for hundreds of data.frames which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice (each = 3 keeps the original plus two copies);
# the added duplicates' row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
  sample(c(-1, 1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Then I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount of noise is added (here ±1%). Later on, I combine this data set with the original data set to get my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\\.", row.names(train_oversampled))) %>%
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0, . + (. * flip() * 0.01))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution works perfectly fine with fixed values, but when it is called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found".
Code to replicate:
library(data.table)
train_set <- data.frame(
  x = c(rep(0, 10)),
  y = c(0:9),
  z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for (current_table in train_list) {
  setDT(current_table, keep.rownames = "Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1, 2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
        # scale numeric values by 0.99 or 1.01, i.e. +/- 1% noise
        .SD * sample(c(0.99, 1.01), .N * ncol(.SD), TRUE)),
      .SDcols = cols]
  })
  train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names = FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames = "Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1, 2), function(n) {
  copy(train_set)[,
    c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
      # scale by 0.99 or 1.01 (+/- 1% noise); zeros stay zero under multiplication
      .SD * sample(c(0.99, 1.01), .N * ncol(.SD), TRUE)),
    .SDcols = cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names = FALSE)
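A quick structural check (a sketch, given the toy data above): every original row should now appear once unchanged and twice with noise.
# 2 noised copies + the original rows:
nrow(train_noised) == 3 * nrow(train_set)
#> [1] TRUE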
With data.table version >= 1.12.9, you can pass is.numeric directly to the .SDcols argument, and there may be a shorter way (e.g. (.SD) or names(.SD)) to write the left-hand side of :=.
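For example, a minimal sketch of that newer column selection (assuming data.table >= 1.12.9):
# version-agnostic way to pick numeric columns:
cols <- names(train_set)[sapply(train_set, is.numeric)]
# with >= 1.12.9, .SDcols can filter directly:
train_set[, .SD, .SDcols = is.numeric]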
To address the OP's updated post: the issue is that although each data.frame within the list is converted to a data.table, train_list itself is not updated. You can convert the list elements in place with lapply before the for loop:
library(data.table)
train_set <- data.frame(
  x = c(rep(0, 10)),
  y = c(0:9),
  z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames = "Sample")
for (current_table in train_list) {
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1, 2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep = ".")),
        .SD * sample(c(0.99, 1.01), .N * ncol(.SD), TRUE)),
      .SDcols = cols]
  })
  # bind the noised copies to the current table (not the whole list)
  train_noised <- rbindlist(c(noised_copies, list(current_table)), use.names = FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}

Implement lapply in conjunction with the get() function to vectorize merging of data tables

Question: How to implement the lapply function in conjunction with the get() function to merge a list of data tables?
Objective: For each element of ticker_name, merge the data table called "dt_q_'ticker_name[i]'" with the one called "meta_'ticker_name[i]'" by the common "id" variable:
ticker_name <- c("CTNP", "PB", "SD", "PC", "PE", "TY", "XD")
for (i in 1:length(ticker_name)) {
dt <- get(paste0("dt_q_", ticker_name[i]))
meta <- get(paste0("meta_", ticker_name[i]))
dt <- merge(x = dt, y = meta, by= c("id"))
head(dt)
}
My non-working attempt with lapply:
lapply(
  X = ticker_name,
  FUN =
    merge(x = get(paste0("dt_q_", ticker_name)),
          y = get(paste0("meta_", ticker_name)), by = c("id")
    ))
The error message:
Error in match.fun(FUN) :
c("'merge(x = get(paste0(\"dt_q_\", ticker_name)),
y = get(paste0(\"meta_\", ' is not a function,
character or symbol", "' ticker_name)), by = c(\"id\"))'
is not a function, character or symbol")
We can use mget to return all the objects in a list, and since the corresponding data.tables should be merged, use Map, which can take multiple arguments:
Map(merge, mget(paste0("dt_q_", ticker_name)),
mget(paste0("meta_", ticker_name)),
MoreArgs = list(by = 'id'))
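To see the Map()/mget() pattern end to end, here is a self-contained sketch with hypothetical toy tables (dt_q_CTNP, meta_CTNP, etc. are invented for illustration):
library(data.table)
dt_q_CTNP <- data.table(id = 1:3, price = c(10, 20, 30))
meta_CTNP <- data.table(id = 1:3, name = c("a", "b", "c"))
dt_q_PB <- data.table(id = 1:2, price = c(5, 15))
meta_PB <- data.table(id = 1:2, name = c("d", "e"))
ticker_name <- c("CTNP", "PB")
merged <- Map(merge, mget(paste0("dt_q_", ticker_name)),
              mget(paste0("meta_", ticker_name)),
              MoreArgs = list(by = "id"))
# 'merged' is a named list of merged data.tables, one per ticker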
Or, using lapply, loop through 'ticker_name', paste the corresponding prefix, get the values of the string objects, and merge:
lapply(ticker_name, function(x) merge(get(paste0("dt_q_", x)),
get(paste0("meta_", x)), by = 'id'))
NOTE: In the OP's lapply attempt, it loops over 'ticker_name' (or ticker_list - it's not clear), but then pastes the prefix to the whole 'ticker_name' vector rather than to one element at a time, unlike the for loop, which iterates over the sequence of 'ticker_name'. We can also loop through the sequence:
lapply(seq_along(ticker_name), function(i) {
  dt <- get(paste0("dt_q_", ticker_name[i]))
  meta <- get(paste0("meta_", ticker_name[i]))
  merge(x = dt, y = meta, by = "id")
})

Calculating columns in data.table with variable

I'm trying to calculate columns in a data.table with the calculation passed in via a variable. The following shows what I'm trying to achieve:
dt <- data.table(mpg)
dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
where I want mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl to come from a variable like:
var <- c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl')
dt[, list(manufacturer, model, var)]
I guess there are further questions here, such as what type var should be (c or list) and how dt should be called, via list or c.
Hope somebody has a suggestion as I'm not finding anything on the WWW.
library(ggplot2)
library(data.table)
dt <- data.table(mpg)
# The original calculation
dt1 <- dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
var <- c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl')
# create a string to pass for evaluation
expr <- paste0("`:=`(", paste0(var, collapse = ", "), ")")
dt2 <- dt[, .(manufacturer, model, cty, cyl, hwy)
  ][, eval(parse(text = expr))        # evaluate the expression
  ][, c("cty", "cyl", "hwy") := NULL] # delete unnecessary columns
> print(all.equal(dt1, dt2))
[1] TRUE
A slightly different approach avoids eval(parse(.)) and operates on language objects instead.
Instead of c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl'), it takes just c("cty", "hwy") as input.
library(data.table)
dt = as.data.table(ggplot2::mpg)
r.expected = dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
cyl.ratio.j = function(var){
  substitute(lhs := rhs, list(
    lhs = as.name(paste0("mpg_cyl_", var)),
    rhs = call("/", as.name(var), as.name("cyl"))
  ))
}
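For illustration, the unevaluated `:=` call the helper builds looks roughly like this (the printed form can vary by R version):
cyl.ratio.j("cty")
#> mpg_cyl_cty := cty/cyl
The chained calls below then eval() these language objects in j: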
r = dt[, eval(cyl.ratio.j("cty"))
  ][, eval(cyl.ratio.j("hwy"))
  ][, .SD, .SDcols = c("manufacturer", "model", paste0("mpg_cyl_", c("cty","hwy")))]
all.equal(r.expected, r)
#[1] TRUE

Pass a list of columns to data.table’s CJ as a vector

I have the following code:
main_cols <- c('num', 'let')
dt <- data.table(num = 1:5, let = letters[1:5])
dt
new_dt <- dt[CJ(num = num,
                let = let,
                unique = TRUE),
             on = main_cols]
head(new_dt, 10)
The thing is: I want to pass the columns to cross-join on as a vector. How do I “unpack” main_cols inside the CJ function? Thanks.
I think you'll want to use do.call, as #AnandaMahto suggested:
m = dt[, do.call(CJ, .SD), .SDcols=main_cols]
dt[m, on=main_cols]
You could also create m this way:
m = do.call(CJ, dt[,main_cols,with=FALSE])
If you have repeating values in the columns, use the unique option to CJ:
m = dt[, do.call(CJ, c(.SD, unique=TRUE)), .SDcols=main_cols]
# or
m = do.call(CJ, c(dt[,main_cols,with=FALSE], unique=TRUE))
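As a quick usage check with the example dt above (num = 1:5, let = letters[1:5]), the cross join holds every num/let combination:
m = dt[, do.call(CJ, c(.SD, unique = TRUE)), .SDcols = main_cols]
nrow(m)            # 25: every combination of 5 num values and 5 let values
dt[m, on = main_cols]  # combinations missing from dt get NA in any non-key columns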

Filter out data.table columns based on summary statistics

I often need to filter out columns with a low variance from a data.table. The column names are not known in advance.
dt = data.table(mtcars)
# flag columns whose standard deviation exceeds an arbitrary cutoff of 1:
mask = dt[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
# The columns with the FALSE values in row 1 need to be removed
mask.t = t(mask)
mask.t = which(mask.t)
dt[,mask.t,with=FALSE]
The approach above is clunky. Is there a more elegant way to select only those columns of a data.table for which a column statistic evaluates to TRUE?
These work:
dt[, .SD, .SDcols=unlist(mask)]
dt[, .SD, .SDcols=which(unlist(mask))]
All together now:
variance.filter = function(df) {
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  df = df[, .SD, .SDcols = unlist(mask)]
}
EDIT: in the current development version of data.table (1.12.9), .SDcols accepts a function to filter the columns, so this works:
variance.filter = function(df) {
  df[, .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}
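A quick check of either version against mtcars (a sketch; the sd values come from the standard dataset):
dt = data.table(mtcars)
ncol(dt)                  # 11
ncol(variance.filter(dt)) # 6: only mpg, cyl, disp, hp, qsec and carb have sd > 1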
