Fast ways to subset categorical data in R with multiple conditions - r

I have a large dataset in R (say >40,000 rows and >20 categorical columns) that I repeatedly subset, so I would like to speed this up as much as possible. It needs to be a general function (each categorical column has a discrete number of possible values, say in string format).
Each time I subset, I need to identify the subset of rows that satisfy multiple logical set membership conditions (e.g. >10 conditions). I.e., I need to check several columns and check if values in that column match a certain set membership (hence the use of %in%).
# simple dataset example
library(dplyr)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, r = T))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
I've looked around the internet and SO a lot, but haven't found the discussion of performance I'm looking for. I mostly code using dplyr, so apologies if there's a clear performance improvement here in data.table; I've tried some simple benchmarks between the two (but without using any data.table indexing or etc.) and it's not obvious if one is faster.
Example options I've considered (since I'm not great at data.table, I've excluded data.table options from here):
base_filter <- function(dat) {
for (i in 1:7) {
col_name <- paste0('col', i)
dat <- dat[dat[[col_name]] %in% sample(letters[1:10], size = 4), ]
}
dat
}
dplyr_filter1 <- function(dat) {
for (i in 1:7) {
col_name <- paste0('col', i)
dat <- filter_(dat,
.dots = interp(~ colname %in% vals,
colname = as.name(col_name),
vals = sample(letters[1:10], size = 4)))
}
dat
}
dplyr_filter2 <- function(dat) {
dots_filter <- list()
for (i in 1:7) {
col_name <- paste0('col', i)
dots_filter[[i]] <- interp(~ colname %in% vals,
colname = as.name(col_name),
vals = sample(letters[1:10], size = 4))
}
filter_(dat, .dots = dots_filter)
}
Note: In practice, on my real datasets, dplyr_filter2 actually works fastest. I've also tried dtplyr or converting my data to a data.table, but this seems slower than without.
Note: On the other hand, in practice, the base R function outperforms the dplyr examples when data has fewer rows and fewer columns (perhaps due to copying speed?).
Thus, I'd like to ask SO what the general, most efficient way(s) to subset a categorical dataframe under multiple (set membership) conditions is. And if possible, explain the mechanics for why? Does this answer differ for smaller datasets? Does it depend on copying time or search time?
Useful related links
fast lookup for one key
using hash tables in R for key-value pairs

Understand that you prefer not to use data.table. Just providing some timings for reference below. With indexing, subsetting can be performed much faster and inner join of the 2 tables can also be done easily in data.table.
# simple dataset example
library(dplyr)
library(lazyeval)
set.seed(0L)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, r = T))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
selection <- lapply(1:7, function(n) sample(letters[1:10], size = 4))
base_filter <- function(df) {
for (i in 1:7) {
col_name <- paste0('col', i)
df <- df[df[[col_name]] %in% selection[[i]], ]
}
df
}
dplyr_filter1 <- function(df) {
for (i in 1:7) {
col_name <- paste0('col', i)
df <- filter_(df,
.dots = interp(~ colname %in% vals,
colname = as.name(col_name),
vals = selection[[i]]))
}
df
}
dplyr_filter2 <- function(df) {
dots_filter <- list()
for (i in 1:7) {
col_name <- paste0('col', i)
dots_filter[[i]] <- interp(~ colname %in% vals,
colname = as.name(col_name),
vals = selection[[i]])
}
filter_(df, .dots = dots_filter)
}
library(data.table)
#convert data.frame into data.table
dt <- data.table(dat, key=names(dat)[1:7])
#create the sets of selection
dtSelection <- data.table(expand.grid(selection, stringsAsFactors=FALSE))
library(microbenchmark)
microbenchmark(
base_filter(dat),
dplyr_filter1(dat),
dplyr_filter2(dat),
dt[dtSelection, nomatch=0], #perform inner join between dataset and selection
times=5L)
#Unit: milliseconds
# expr min lq mean median uq max neval
# base_filter(dat) 27.084801 27.870702 35.849261 32.045900 32.872601 59.372301 5
# dplyr_filter1(dat) 23.130100 24.114301 26.922081 24.860701 29.804301 32.701002 5
# dplyr_filter2(dat) 29.641101 30.686002 32.363681 31.103000 31.884701 38.503601 5
# dt[dtSelection, nomatch = 0] 3.626001 3.646201 3.829341 3.686601 3.687001 4.500901 5

In addition to chinsoon12's alternatives, one thing to consider is to avoid subsetting the data.frame in each iteration. So, instead of
f0 = function(x, cond)
{
for(j in seq_along(x)) x = x[x[[j]] %in% cond[[j]], ]
return(x)
}
one alternative is to accumulate a logical vector of whether to include each row in the final subset:
f1 = function(x, cond)
{
i = rep_len(TRUE, nrow(x))
for(j in seq_along(x)) i = i & (x[[j]] %in% cond[[j]])
return(x[i, ])
}
or, another alternative, is to iteratively reduce the amount of comparisons, but by reducing the row indices instead of the data.frame itself:
f2 = function(x, cond)
{
i = 1:nrow(x)
for(j in seq_along(x)) i = i[x[[j]][i] %in% cond[[j]]]
return(x[i, ])
}
And a comparison with data:
set.seed(1821)
dat = as.data.frame(replicate(30, sample(c(letters, LETTERS), 5e5, TRUE), FALSE),
stringsAsFactors = FALSE)
conds = replicate(ncol(dat), sample(c(letters, LETTERS), 48), FALSE)
system.time({ ans0 = f0(dat, conds) })
# user system elapsed
# 3.44 0.28 3.86
system.time({ ans1 = f1(dat, conds) })
# user system elapsed
# 0.66 0.01 0.68
system.time({ ans2 = f2(dat, conds) })
# user system elapsed
# 0.34 0.01 0.39
identical(ans0, ans1)
#[1] TRUE
identical(ans1, ans2)
#[1] TRUE

Related

Quickly apply & operation to pairs of columns in R

Let’s say I have two large data.tables and need to combine their columns pairwise using the & operation. The combinations are dictated by grid (combine dt1 column1 with dt2 column2, etc.)
Right now I'm using a mclapply loop and the script takes hours when I run the full dataset. I tried converting the data to a matrix and using a vectorized approach but that took even longer. Is there a faster and/or more elegant way to do this?
mx1 <- replicate(10, sample(c(T,F), size = 1e6, replace = T)) # 1e6 rows x 10 columns
mx1 <- as.data.table(mx1)
colnames(mx1) <- LETTERS[1:10]
mx2 <- replicate(10, sample(c(T,F), size = 1e6, replace = T)) # 1e6 rows x 10 columns
mx2 <- as.data.table(mx2)
colnames(mx2) <- letters[1:10]
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2)) # the combinations I want to evaluate
out <- new_layer <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) { # <--- mclapply loop
mx1[[col1]] & mx2[[col2]]
}, SIMPLIFY = F)
setDT(out) # convert output into data table
colnames(out) <- paste(grid$col1, grid$col2, sep = "_")
For context, this data is from a gene expression matrix where 1 row = 1 cell
This can be done directly with no mapply: Just ensure that the with argument is FALSE
ie:
mx1[, grid$col1, with = FALSE] & mx2[, grid$col2, with=FALSE]
After some digging around I found a package called bit that is specifically designed for fast boolean operations. Converting each column of my data.table from logical to bit gave me a 100-fold increase in compute speed.
# Load libraries.
library(data.table)
library(bit)
# Create data set.
mx1 <- replicate(10, sample(c(T,F), size = 5e6, replace = T)) # 5e6 rows x 10 columns
colnames(mx1) <- LETTERS[1:10]
mx2 <- replicate(10, sample(c(T,F), size = 5e6, replace = T)) # 5e6 rows x 10 columns
colnames(mx2) <- letters[1:10]
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2)) # combinations I want to evaluate
# Single operation with logical matrix.
system.time({
out <- mx1[, grid$col1] & mx2[, grid$col2]
}) # 26.014s
# Loop with logical matrix.
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1[, col1] & mx2[, col2]
})
}) # 31.914s
# Single operation with logical data.table.
mx1.dt <- as.data.table(mx1)
mx2.dt <- as.data.table(mx2)
system.time({
out <- mx1.dt[, grid$col1, with = F] & mx2.dt[, grid$col2, with = F] # 26.014s
}) # 32.349s
# Loop with logical data.table.
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1.dt[[col1]] & mx2.dt[[col2]]
})
}) # 15.031s <---- SECOND FASTEST TIME, ~2X IMPROVEMENT
# Loop with bit data.table.
mx1.bit <- mx1.dt[, lapply(.SD, as.bit)]
mx2.bit <- mx2.dt[, lapply(.SD, as.bit)]
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1.bit[[col1]] & mx2.bit[[col2]]
})
}) # 0.383s <---- FASTEST TIME, ~100X IMPROVEMENT
# Convert back to logical table.
out <- setDT(out)
colnames(out) <- paste(grid$col1, grid$col2, sep = "_")
out <- out[, lapply(.SD, as.logical)]
There are also special functions like sum.bit and ri that you can use to aggregate data without converting it back to logical.

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

I wrote some code to performed oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are not exactly the same anymore. I'm quite happy that it works now as intended, but...it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for 100s of data.frames which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Than I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount is added (here +- 1 %). Later on, I combine this data set with the original data set and have my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
rownames_to_column(var = "rowname") %>%
filter(grepl("\\.", row.names(train_oversampled))) %>%
rowwise() %>%
mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
ungroup() %>%
column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
setDT(current_table, keep.rownames="Sample")
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(train_set)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to .SDcols argument and maybe a shorter way (e.g. (.SD) or names(.SD)) to pass to the left hand side of :=
address OP's updated post:
The issue is that although each data.frame within the list is converted to a data.table, the train_list is not updated. You can update the list with a left bind before the for loop:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, train_list), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}

R speed up the for loop using apply() or lapply() or etc

I wrote a special "impute' function that replaces the column values that have missing (NA) values with either mean() or mode() based on the specific column name.
The input dataframe is 400,000+ rows and its vert slow , how can i speed up the imputation part using lapply() or apply().
Here is the function , mark section I want optimized with START OPTIMIZE & END OPTIMIZE:
specialImpute <- function(inputDF)
{
discoveredDf <- data.frame(STUDYID_SUBJID=character(), stringsAsFactors=FALSE)
dfList <- list()
counter = 1;
Whilecounter = nrow(inputDF)
#for testing just do 10 iterations,i = 10;
while (Whilecounter >0)
{
studyid_subjid=inputDF[Whilecounter,"STUDYID_SUBJID"]
vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
#was discovered and subset before
if (!is.null(vect))
{
#not subset before
if (length(vect)<1)
{
#subset the dataframe base on regex inputDF$STUDYID_SUBJID
df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)
#START OPTIMIZE
for (i in nrow(df))
{
#impute , add column mean & add to list
#apply(df[,c("y1","y2","y3","etc..")],2,function(x){x[is.na(x)] =mean(x, na.rm=TRUE)})
if (is.na(df[i,"y1"])) {df[i,"y1"] = mean(df[,"y1"], na.rm = TRUE)}
if (is.na(df[i,"y2"])) {df[i,"y2"] =mean(df[,"y2"], na.rm = TRUE)}
if (is.na(df[i,"y3"])) {df[i,"y3"] =mean(df[,"y3"], na.rm = TRUE)}
#impute using mean for CONTINUOUS variables
if (is.na(df[i,"COVAR_CONTINUOUS_2"])) {df[i,"COVAR_CONTINUOUS_2"] =mean(df[,"COVAR_CONTINUOUS_2"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_3"])) {df[i,"COVAR_CONTINUOUS_3"] =mean(df[,"COVAR_CONTINUOUS_3"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_4"])) {df[i,"COVAR_CONTINUOUS_4"] =mean(df[,"COVAR_CONTINUOUS_4"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_5"])) {df[i,"COVAR_CONTINUOUS_5"] =mean(df[,"COVAR_CONTINUOUS_5"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_6"])) {df[i,"COVAR_CONTINUOUS_6"] =mean(df[,"COVAR_CONTINUOUS_6"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_7"])) {df[i,"COVAR_CONTINUOUS_7"] =mean(df[,"COVAR_CONTINUOUS_7"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_10"])) {df[i,"COVAR_CONTINUOUS_10"] =mean(df[,"COVAR_CONTINUOUS_10"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_14"])) {df[i,"COVAR_CONTINUOUS_14"] =mean(df[,"COVAR_CONTINUOUS_14"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_30"])) {df[i,"COVAR_CONTINUOUS_30"] =mean(df[,"COVAR_CONTINUOUS_30"], na.rm = TRUE)}
#impute using mode ordinal & nominal values
if (is.na(df[i,"COVAR_ORDINAL_1"])) {df[i,"COVAR_ORDINAL_1"] =Mode(df[,"COVAR_ORDINAL_1"])}
if (is.na(df[i,"COVAR_ORDINAL_2"])) {df[i,"COVAR_ORDINAL_2"] =Mode(df[,"COVAR_ORDINAL_2"])}
if (is.na(df[i,"COVAR_ORDINAL_3"])) {df[i,"COVAR_ORDINAL_3"] =Mode(df[,"COVAR_ORDINAL_3"])}
if (is.na(df[i,"COVAR_ORDINAL_4"])) {df[i,"COVAR_ORDINAL_4"] =Mode(df[,"COVAR_ORDINAL_4"])}
#nominal
if (is.na(df[i,"COVAR_NOMINAL_1"])) {df[i,"COVAR_NOMINAL_1"] =Mode(df[,"COVAR_NOMINAL_1"])}
if (is.na(df[i,"COVAR_NOMINAL_2"])) {df[i,"COVAR_NOMINAL_2"] =Mode(df[,"COVAR_NOMINAL_2"])}
if (is.na(df[i,"COVAR_NOMINAL_3"])) {df[i,"COVAR_NOMINAL_3"] =Mode(df[,"COVAR_NOMINAL_3"])}
if (is.na(df[i,"COVAR_NOMINAL_4"])) {df[i,"COVAR_NOMINAL_4"] =Mode(df[,"COVAR_NOMINAL_4"])}
if (is.na(df[i,"COVAR_NOMINAL_5"])) {df[i,"COVAR_NOMINAL_5"] =Mode(df[,"COVAR_NOMINAL_5"])}
if (is.na(df[i,"COVAR_NOMINAL_6"])) {df[i,"COVAR_NOMINAL_6"] =Mode(df[,"COVAR_NOMINAL_6"])}
if (is.na(df[i,"COVAR_NOMINAL_7"])) {df[i,"COVAR_NOMINAL_7"] =Mode(df[,"COVAR_NOMINAL_7"])}
if (is.na(df[i,"COVAR_NOMINAL_8"])) {df[i,"COVAR_NOMINAL_8"] =Mode(df[,"COVAR_NOMINAL_8"])}
}#for
#END OPTIMIZE
dfList[[counter]] <- df
#add to discoveredDf since already substed
discoveredDf[nrow(discoveredDf)+1,]<- c(studyid_subjid)
counter = counter +1;
#for debugging to check progress
if (counter %% 100 == 0)
{
print(counter)
}
}
}
Whilecounter = Whilecounter -1;
}#end while
return (dfList)
}
Thanks
It's likely that performance can be improved in many ways, so long as you use a vectorized function on each column. Currently, you're iterating through each row, and then handling each column separately, which really slows you down. Another improvement is to generalize the code so you don't have to keep typing a new line for each variable. In the examples I'll give below, this is handled because continuous variables are numeric, and categorical are factors.
To get straight to an answer, you can replace your code to be optimized with the following (though fixing variable names) provided that your numeric variables are numeric and ordinal/categorical are not (e.g., factors):
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)
cols_to_impute <- names(df) %in% c("names", "of", "columns")
library(purrr)
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)
Below is a detailed comparison of five approaches:
Your original approach using for to iterate on rows; each column then handled separately.
Using a for loop.
Using lapply().
Using sapply().
Using dmap() from the purrr package.
The new approaches all iterate on the data frame by column and make use of a vectorized function called impute, which imputes missing values in a vector with the mean (if numeric) or the mode (otherwise). Otherwise, their differences are relatively minor (except sapply() as you'll see), but interesting to check.
Here are the utility functions we'll use:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
set.seed(13)
data.frame(
con_1 = sample(c(10:20, NA), n, replace = TRUE), # continuous w/ missing
con_2 = sample(c(20:30, NA), n, replace = TRUE), # continuous w/ missing
ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
ord_2 = sample(c(letters, NA), n, replace = TRUE) # ordinal w/ missing
)
}
# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
Now, wrapper functions for each approach:
# Original approach
func0 <- function(d) {
for (i in 1:nrow(d)) {
if (is.na(d[i, "con_1"])) d[i,"con_1"] <- mean(d[,"con_1"], na.rm = TRUE)
if (is.na(d[i, "con_2"])) d[i,"con_2"] <- mean(d[,"con_2"], na.rm = TRUE)
if (is.na(d[i,"ord_1"])) d[i,"ord_1"] <- names(which.max(table(d[,"ord_1"])))
if (is.na(d[i,"ord_2"])) d[i,"ord_2"] <- names(which.max(table(d[,"ord_2"])))
}
return(d)
}
# for loop operates directly on d
func1 <- function(d) {
for(i in seq_along(d)) {
d[[i]] <- impute(d[[i]])
}
return(d)
}
# Use lapply()
func2 <- function(d) {
lapply(d, function(col) {
impute(col)
})
}
# Use sapply()
func3 <- function(d) {
sapply(d, function(col) {
impute(col)
})
}
# Use purrr::dmap()
func4 <- function(d) {
purrr::dmap(d, impute)
}
Now, we'll compare the performance of these approaches with n ranging from 10 to 100 (VERY small):
library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
ORIGINAL = func0(dat),
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
# Plot the results
library(tidyr)
library(ggplot2)
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
It's pretty clear that the original approach is much slower than the new approaches that use the vectorized function impute on each column. What about differences between the new ones? Let's bump up our sample size to check:
ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
Looks like sapply() is not great (as #Martin pointed out). This is because sapply() is doing extra work to get our data into a matrix shape (which we don't need). If you run this yourself without sapply(), you'll see that the remaining approaches are all pretty comparable.
So the major performance improvement is to use a vectorized function on each column. I suggested using dmap at the beginning because I'm a fan of the function style and the purrr package generally, but you can comfortably substitute for whichever approach you prefer.
Aside, many thanks to #Martin for the very useful comment that got me to improve this answer!
If you are going to be working with what looks like a matrix, then use a matrix instead of a dataframe, since indexing into a dataframe, like it was a matrix, is very costly. You might want to extract the numerical values to a matrix for part of your calculations. This can provide a significant increase in speed.
Here is a really simple and fast solution using data.table.
library(data.table)
# name of columns
cols <- c("a", "c")
# impute date
setDT(dt)[, (cols) := lapply(.SD, function(x) ifelse( is.na(x) & is.numeric(x), mean(x, na.rm = T),
ifelse( is.na(x) & is.character(x), names(which.max(table(x))), x))) , .SDcols = cols ]
I haven't compared the performance of this solution to the one provided by #Simon Jackson, but this should be pretty fast.
data from reproducible example
set.seed(25)
dt <- data.table(a=c(1:5,NA,NA,1,1),
b=sample(1:15, 9, replace=TRUE),
c=LETTERS[c(1:6,NA,NA,1)])

working with large lists that become too big for RAM when operated on

Short of working on a machine with more RAM, how can I work with large lists in R, for example put them on disk and then work on sections of it?
Here's some code to generate the type of lists I'm using
n = 50; i = 100
WORD <- vector(mode = "integer", length = n)
for (i in 1:n){
WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
dat <- data.frame(WORD = WORD,
COUNTS = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)
In my actual use case each data frame in the list is unique, unlike the quick example here. I'm aiming for n = 4000 and i = 100,000
This is one example of what I want to do with this list of dataframes:
FUNC <- function(x) {rep(x$WORD, times = x$COUNTS)}
la <- lapply(dat_list, FUNC)
With my actual use case this runs for a few hours, fills up the RAM and most of the swap and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
I see that bigmemory is limited to matrices and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method possible here, how might I get started? I can't get enough out of the documentation to make any progress and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (ie. undergrad computer labs).
UPDATE Followining up on the helpful comments from SimonO101 and Ari, here's some benchmarking comparing dataframes and data.tables, loops and lapply, and with and without gc
# self-contained speed test of untable
n = 50; i = 100
WORD <- vector(mode = "integer", length = n)
for (i in 1:n){
WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
# as data table
library(data.table)
dat_dt <- data.table(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)
# as data frame
dat_df <- data.frame(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)
# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))
# untable
untable <- function(x) rep(x$WORD, times = x$COUNTS)
# preallocate objects for loop to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))
# functions for lapply
df_untable_gc <- function(x) { untable(df[[x]]); if (x%%10) invisible(gc()) }
dt_untable_gc <- function(x) { untable(dt[[x]]); if (x%%10) invisible(gc()) }
# speedtests
library(microbenchmark)
microbenchmark(
for(i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) },
for(i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) },
df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
for(i in 1:length(df)) { df3[[i]] <- untable(df[[i]])},
for(i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]])},
df4 <- lapply(1:length(df), function(i) untable(df[[i]]) ),
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]) ),
times = 10)
And here are the results, without explicit garbage collection, data.table is much faster and lapply slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed - a lot slower! I know that using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use-case and see if it makes any difference. Of course I don't have any data on memory use for any of these functions, which is really my main concern. Seems that there is no function for memory benchmarking equivalent to the timing functions (for windows, anyway).
Unit: milliseconds
expr
for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]) if (i%%10) invisible(gc()) }
for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]) if (i%%10) invisible(gc()) }
df2 <- lapply(1:length(df), function(i) df_untable_gc(i))
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i))
for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) }
for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) }
df4 <- lapply(1:length(df), function(i) untable(df[[i]]))
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))
min lq median uq max neval
37436.433962 37955.714144 38663.120340 39142.350799 39651.88118 10
37354.456809 38493.268121 38636.424561 38914.726388 39111.20439 10
36959.630896 37924.878498 38314.428435 38636.894810 39537.31465 10
36917.765453 37735.186358 38106.134494 38563.217919 38751.71627 10
28.200943 29.221901 30.205502 31.616041 34.32218 10
10.230519 10.418947 10.665668 12.194847 14.58611 10
26.058039 27.103217 27.560739 28.189448 30.62751 10
8.835168 8.904956 9.214692 9.485018 12.93788 10
If you really are going to be using very large data you can use the h5r package to write hdf5 files. You would be writing to and reading from your hard drive on the fly instead of using RAM. I have not used this so I can be of little help on it's general usage, I mention this because I think there's is no tutorial for it. I got this idea by thinking about pytables. Not sure if this solution is appropriate for you.

Efficiently locate group-wise constant columns in a data.frame

How can I efficiently extract group-wise constant columns from a data frame? I've included an plyr implementation below to make precise what I'm trying to do, but it's slow. How can I do it as efficiently as possible? (Ideally without splitting the data frame at all).
base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
base[rep(seq_len(nrow(base)), length = 1e6), ],
c = runif(1e6),
d = runif(1e6)
)
is.constant <- function(x) length(unique(x)) == 1
constant_cols <- function(x) head(Filter(is.constant, x), 1)
system.time(constant <- ddply(df, "group", constant_cols))
# user system elapsed
# 20.531 1.670 22.378
stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)
In my real use case (deep inside ggplot2) there may be an arbitrary number of constant and non-constant columns. The size of the data in the example is about the right order of magnitude.
(Edited to possibly address the issue of consecutive groups with the same value)
I'm tentatively submitting this answer, but I haven't completely convinced myself that it will correctly identify within group constant columns in all cases. But it's definitely faster (and can probably be improved):
constant_cols1 <- function(df,grp){
df <- df[order(df[,grp]),]
#Adjust values based on max diff in data
rle_group <- rle(df[,grp])
vec <- rep(rep(c(0,ceiling(diff(range(df)))),
length.out = length(rle_group$lengths)),
times = rle_group$lengths)
m <- matrix(vec,nrow = length(vec),ncol = ncol(df)-1)
df_new <- df
df_new[,-1] <- df[,-1] + m
rles <- lapply(df_new,FUN = rle)
nms <- names(rles)
tmp <- sapply(rles[nms != grp],
FUN = function(x){identical(x$lengths,rles[[grp]]$lengths)})
return(tmp)
}
My basic idea was to use rle, obviously.
I'm not sure if this is exactly what you are looking for, but it identifies columns a and b.
require(data.table)
is.constant <- function(x) identical(var(x), 0)
dtOne <- data.table(df)
system.time({dtTwo <- dtOne[, lapply(.SD, is.constant), by=group]
result <- apply(X=dtTwo[, list(a, b, c, d)], 2, all)
result <- result[result == TRUE] })
stopifnot(identical(names(result), c("a", "b")))
result
(edit: better answer)
What about something like
is.constant<-function(x) length(which(x==x[1])) == length(x)
This seems to be a nice improvement. Compare the following.
> a<-rnorm(5000000)
> system.time(is.constant(a))
user system elapsed
0.039 0.010 0.048
>
> system.time(is.constantOld(a))
user system elapsed
1.049 0.084 1.125
A bit slower than what hadley suggested above, but I think it should handle the case of equal adjacent groups
findBreaks <- function(x) cumsum(rle(x)$lengths)
constantGroups <- function(d, groupColIndex=1) {
d <- d[order(d[, groupColIndex]), ]
breaks <- lapply(d, findBreaks)
groupBreaks <- breaks[[groupColIndex]]
numBreaks <- length(groupBreaks)
isSubset <- function(x) length(x) <= numBreaks && length(setdiff(x, groupBreaks)) == 0
unlist(lapply(breaks[-groupColIndex], isSubset))
}
The intuition is that if a column is constant groupwise then the breaks in the column values (sorted by the group value) will be a subset of the breaks in the group value.
Now, compare it with hadley's (with small modification to ensure n is defined)
# df defined as in the question
n <- nrow(df)
changed <- function(x) c(TRUE, x[-1] != x[-n])
constant_cols2 <- function(df,grp){
df <- df[order(df[,grp]),]
changes <- lapply(df, changed)
vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}
> system.time(constant_cols2(df, 1))
user system elapsed
1.779 0.075 1.869
> system.time(constantGroups(df))
user system elapsed
2.503 0.126 2.614
> df$f <- 1
> constant_cols2(df, 1)
a b c d f
TRUE TRUE FALSE FALSE FALSE
> constantGroups(df)
a b c d f
TRUE TRUE FALSE FALSE TRUE
Inspired by #Joran's answer, here's similar strategy that's a little faster (1 s vs 1.5 s on my machine)
changed <- function(x) c(TRUE, x[-1] != x[-n])
constant_cols2 <- function(df,grp){
df <- df[order(df[,grp]),]
n <- nrow(df)
changes <- lapply(df, changed)
vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}
system.time(cols <- constant_cols2(df, "group")) # about 1 s
system.time(constant <- df[changed(df$group), cols])
# user system elapsed
# 1.057 0.230 1.314
stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)
It has the same flaws though, in that it won't detect columns that are have the same values for adjacent groups (e.g. df$f <- 1)
With a bit more thinking plus #David's ideas:
constant_cols3 <- function(df, grp) {
# If col == TRUE and group == FALSE, not constant
matching_breaks <- function(group, col) {
!any(col & !group)
}
n <- nrow(df)
changed <- function(x) c(TRUE, x[-1] != x[-n])
df <- df[order(df[,grp]),]
changes <- lapply(df, changed)
vapply(changes[-1], matching_breaks, group = changes[[1]],
FUN.VALUE = logical(1))
}
system.time(x <- constant_cols3(df, "group"))
# user system elapsed
# 1.086 0.221 1.413
And that gives the correct result.
How fast does is.unsorted(x) fail for non-constant x? Sadly I don't have access to R at the moment. Also seems that's not your bottleneck though.

Resources