I'm the maintainer of the seqMeta package and looking for ideas on how to speed up the bottleneck of splitting a large matrix into smaller pieces a large number of times.
Background
The seqMeta package is used to analyze genetic data. So you have a group of subjects (n_subjects) and a number of genetic markers (n_snps). This leads to an n_subjects x n_snps matrix (Z). There is also a data frame that tells you which snps get grouped together for analysis (typically which snps comprise a given gene).
While Z may be large, it is quite sparse. Typically less than 10%, and sometimes around 2%, of the values are non-zero. A sparse matrix representation seems like an obvious choice to save space.
A current project: nsubjects ~15,000 and nsnps ~2 million, with over 200,000 splits.
As the data size continues to grow, I've found the time-limiting factor tends to be the number of groupings, not the actual size of the data. (See the example below: runtime is roughly a linear function of n_splits for the same data.)
Simplified Example
library(Matrix)
set.seed(1)
n_subjects <- 1e3
n_snps <- 1e5
sparsity <- 0.05
n <- floor(n_subjects*n_snps*sparsity)
# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
# create the data frame on how to split
# in the real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 3.21 0.00 3.21
n_splits <- 1000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 6.38 0.00 6.38
n_splits <- 5000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 31.65 0.00 31.66
Question: Is there a faster way to subset a Matrix than '['? Or other approaches I'm missing?
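One direction that may be worth profiling (a minimal sketch of my own, not from the package; it assumes the per-group work only needs the non-zero entries of the selected columns, which holds for simple_fun above but not for the real covariance-style computation): bypass the S4 "[" dispatch on every split by indexing the dgCMatrix slots directly.
# Minimal sketch: use the column-compressed slots of the dgCMatrix instead of "[".
Zc <- as(Z, "CsparseMatrix")  # column-compressed storage: slots @p and @x
p  <- Zc@p                    # 0-based column pointers, length n_snps + 1
x  <- Zc@x                    # non-zero values, stored column by column
col_sum <- function(cols) {
  # positions in @x of the non-zeros belonging to the requested columns
  idx <- unlist(lapply(cols, function(j)
    seq.int(p[j] + 1L, length.out = p[j + 1L] - p[j])), use.names = FALSE)
  sum(x[idx])
}
system.time(
  res2 <- tapply(lkup[, "snps"], lkup[, "gene_name"], FUN = col_sum, simplify = FALSE)
)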
Related
I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I am expecting to return a scalar which is the value of the statistic I am calculating in this function. Below is what I have done so far:
library(sandwich)
# Creating example data #
nrows=10000
ncols=40
n1=20
n2=20
df=data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# Function to evaluate on each row of df #
get_est= function(x){
mod = lm(x~cov$group)
vcov = vcovHC(mod)
coef = as.numeric(mod$coefficients[2])
se = sqrt(as.numeric(diag(vcov)[2]))
stats = coef/se
return(stats)
}
# Applying above function to full data #
t1=Sys.time()
estimates=apply(df, 1, function(x) get_est(x))
t2=Sys.time()-t1
# Time taken by apply function
Time difference of 32.10623 secs
Is there a way to significantly decrease the time taken to run get_est() on the full data? The main reason I need to speed up the computation on a single df is that I have 1000 more data frames with the same dimensions, and I have to apply this function to each row of each of these data frames. To illustrate, below is the broader situation I am dealing with:
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for(i in 1:1000){
df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row of each data frame in df.list #
library(foreach)
library(doParallel)
registerDoParallel(cores = parallel::detectCores() - 1)  # register a parallel backend for %dopar%
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = 'sandwich') %dopar% {
  cov = data.frame(group = as.factor(rep(c(1, 2), c(n1, n2))))
  est = apply(df.list[[j]], 1, function(x) get_est(x))
  return(est)
}
Even after parallelizing it is taking hours to finish. My ultimate objective is to significantly cut down the time to obtain "all.est" which will contain 10000 rows and 1000 columns where each column has the stats estimates for the respective data set. Any help is much appreciated!! Thanks in advance!
Your function get_est uses some "expensive" functions, such as lm, vcovHC, and so on. If you think of the OLS equation,
$$
\hat{\beta} = (X^TX)^{-1}X^Ty,
$$
then you can see that, because the design matrix is constant across rows, the first part $(X^TX)^{-1}X^T$ doesn't change in your simulation. To make use of this, I compute $(X^TX)^{-1}X^T$ once before starting the simulation. This approach then also requires computing the HC3 standard errors manually using the formula
$$
\widehat{\text{Cov}}_{\text{HC3}}(\hat{\beta}) = (X^TX)^{-1}X^T \text{diag} \left[ \frac{e_i^2}{(1-h_{ii})^2} \right] X(X^TX)^{-1}.
$$
Everything except for the residuals is constant across your simulation iterations, so it can be precomputed. Once I implement these tricks, I get a speed-up of roughly a factor of 50.
(Note: lm uses a QR decomposition, which could also be precomputed and reused here. You may get an even bigger speed-up by parallelizing the code.)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov = data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# old function
get_est_old = function(x){
mod = lm(x~cov$group)
vcov = sandwich::vcovHC(mod)
coef = as.numeric(mod$coefficients[2])
se = sqrt(as.numeric(diag(vcov)[2]))
stats = coef/se
return(stats)
}
# new function
# first construct design matrix
X = matrix(c(rep(1, ncols), rep(0, ncols / 2), rep(1, ncols / 2)), ncol = 2)
# these quantities will be used below
inv = solve(crossprod(X)) %*% t(X)
h = diag(X %*% inv)
get_est_new = function(x){
coef = inv %*% x                 # beta_hat = (X'X)^{-1} X' y
resid = x - (X %*% coef)
w = (resid^2 / (1 - h)^2)[, 1]   # HC3 weights e_i^2 / (1 - h_ii)^2
hc3 = inv %*% diag(w) %*% t(inv)
se = sqrt(hc3[2, 2])
coef[2, 1] / se                  # return the t-type statistic for the group coefficient
}
# Applying above function to full data #
system.time({
estimates_old = apply(df, 1, function(x) get_est_old(x))
})
#> user system elapsed
#> 7.876 0.042 7.929
system.time({
estimates_new = apply(df, 1, function(x) get_est_new(x))
})
#> user system elapsed
#> 0.141 0.016 0.158
# check
all.equal(estimates_old, estimates_new)
#> [1] TRUE
Created on 2021-09-04 by the reprex package (v2.0.1)
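Going one step further than the precomputation above (a sketch of my own, reusing X, inv, h, df, and estimates_new from the code block; the intermediate names are mine): because nothing but the residuals varies across rows, the per-row apply() loop can be replaced entirely by a few matrix operations over all 10,000 rows at once.
# Fully vectorised sketch: compute all 10,000 statistics in one pass.
Ym  <- t(as.matrix(df))            # ncols x nrows: one column per row of df
B   <- inv %*% Ym                  # 2 x nrows matrix of coefficients
Res <- Ym - X %*% B                # residuals of every regression
W   <- Res^2 / (1 - h)^2           # HC3 weights, recycled down the columns
# [hc3]_{22} = sum_k inv[2, k]^2 * w_k, so the SE of the group coefficient is:
se2 <- sqrt(colSums(W * inv[2, ]^2))
estimates_vec <- B[2, ] / se2
all.equal(unname(estimates_new), unname(estimates_vec))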
These posts could be of interest:
https://grantmcdermott.com/efficient-simulations-in-r/
https://grantmcdermott.com/simulations-remix-turn-up-the-base/
I would like to check the convergence of Sobol' sensitivity indices, using the sensobol library, by re-computing the sensitivity indices using sub-samples of decreasing size extracted from the original sample.
Here, I present example code using the Ishigami function as the model. Since computing the model output takes a very long time with the model I actually use, I want to avoid recomputing it for different sample sizes, and instead use sub-samples of my overall sample for this check.
I have written code that runs through; however, it seems that the result is not correct as soon as the sample size is not equal to the initial sample size.
Initial set-up
library(sensobol)
# Define settings
matrices <- c("A", "B", "AB", "BA")
N <- 1000
params <- paste("X", 1:3, sep = "")
first <- total <- "azzini"
order <- "first"
R <- 10
type <- "percent"
conf <- 0.95
# Create sample matrix using Sobol' (1967) quasi-random numbers
mat <- sobol_matrices(matrices = matrices, N = N, params = params, order = order, type = "QRN")
# Compute model output using Ishigami function as model
Y <- ishigami_Fun(mat)
Correct Sobol' indices as benchmark result
# Compute and bootstrap Sobol' indices for entire sample N
ind <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = Y,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind)[1:length(params)]
ind[ , (cols):= round(.SD, 3), .SDcols = (cols)]
Check for convergence
Now, to analyze whether convergence is reached, I want to re-compute the sensitivity indices using sub-samples of decreasing size extracted from the original sample.
# function to compute sensitivity indices, depending on the sample size and the model output vector
fct_conv <- function(N, Y) {
# compute how many model runs are performed in the case of the Azzini estimator
nr_model_runs <- 2*N*(length(params)+1) # length(params) = k
# extract sub-sample of model output
y_sub <- Y[1:nr_model_runs]
# compute and bootstrap Sobol' indices
ind_sub <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = y_sub,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind_sub)[1:length(params)]
ind_sub[ , (cols):= round(.SD, 3), .SDcols = (cols)]
return(ind_sub)
}
Let's compare the benchmark result (ind) to two other outputs: Running fct_conv with the full sample (ind_full_sample) and running fct_conv with a very slightly reduced sample (ind_red_sample).
ind_full_sample <- fct_conv(1000, Y)
ind_red_sample <- fct_conv(999, Y)
ind
ind_full_sample
ind_red_sample
It seems that as soon as the sample size is reduced, the result doesn't make sense. Why is that? I'd be glad for any hints or ideas!
The results do not make sense because you are sub-sampling without taking the ordering of the sample matrix into account: Y stacks the model runs of the A, B, AB and BA matrices block by block, so taking only the first elements of Y misaligns the blocks. You need to subset rows consistently across all blocks. Try the following:
# Load the required packages:
library(sensobol)
library(data.table)
library(ggplot2)
# Create function to swiftly check convergence (you do not need bootstrap)
sobol_convergence <- function(Y, N, sample.size, seed = 666) {
dt <- data.table(matrix(Y, nrow = N))
set.seed(seed) # To permit replication
subsample <- unlist(dt[sample(.N, sample.size)], use.names = FALSE)
ind <- sobol_indices(matrices = matrices,
Y = subsample,
N = sample.size,
params = params,
first = first,
total = total,
order = order)
return(ind)
}
# Define sequence of sub-samples at which you want to check convergence
sample.size <- seq(100, 1000, 50) # every 50
# Run function
ind.list <- lapply(sample.size, function(n)
sobol_convergence(Y = Y, N = N, sample.size = n))
# Extract total number of model runs C and results in each run
Cost <- indices <- list()
for(i in 1:length(ind.list)) {
Cost[[i]] <- ind.list[[i]]$C
indices[[i]] <- ind.list[[i]]$results
}
names(indices) <- Cost
# Final dataset
final.dt <- rbindlist(indices, idcol = "Cost")[, Cost:= as.numeric(Cost)]
# Plot results
ggplot(final.dt, aes(Cost, original, color = sensitivity)) +
geom_line() +
labs(x = "Total number of model runs", y = "Sobol' indices") +
facet_wrap(~parameters) +
theme_bw()
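To see why the row-wise sub-sampling above works while taking the first elements of Y does not, note that matrix(Y, nrow = N) lines up, in each row, the runs that come from the same row of the sample matrix. A toy illustration (my own, with placeholder values and k = 3 parameters, i.e. 2*(k + 1) = 8 blocks):
# Toy illustration (placeholder values): column j of the reshaped matrix holds
# the runs of the j-th sample matrix, so sampling rows keeps the A/B/AB/BA runs
# of each sample row together.
toy_N <- 4
toy_Y <- seq_len(toy_N * 8)          # stand-in for the stacked model output
toy_m <- matrix(toy_Y, nrow = toy_N)
toy_m[c(1, 3), ]                     # a sub-sample with blocks still aligned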
I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified, and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package, but the predictions take too long to compute. In researching alternatives I have come across the nn2 function from the RANN package, which performs a nearest neighbour search that is significantly faster than k-nearest neighbours.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is a list of two matrices, one of indices and one of distances. I want to use the indices matrix to work out the most likely grouping of each observation; however, it returns the row number of the observation and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key.
Your scaling can be a real bottleneck when you have more columns (tested on a 200 x 22216 gene expression matrix). My version might not seem that impressive with the iris dataset, but on the larger dataset I get 1.3 sec vs. 32.8 sec execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
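For reference, the simpler base::scale route mentioned above would look something like this (a one-line sketch of my own, not benchmarked in this answer):
# Scale the numeric columns as one matrix instead of column-by-column via mutate_if().
iris.scaled.base <- scale(data.matrix(iris[, 1:4]))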
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
require, character.only=TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
if (!is.null(rows) && !is.null(cols)) {x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) x <- x[, cols, drop = FALSE]
cm <- colMeans(x, na.rm = TRUE)
if (scale) csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE) else
csd <- rep(1, length = length(cm))
if (!center) cm <- rep(0, length = length(cm))
x <- t((t(x) - cm) / csd)
return(x)
}
# your posted version (mostly):
oldv <- function(){
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
category_neighbours <- matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
class <- apply(category_neighbours, 1, function(x) {
x1 <- table(x)
names(x1)[which.max(x1)]})
cbind(iris, class)
}
## my version:
myv <- function(){
iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2]-1)]))
iris.nn2 <- nn2(iris.scaled)
# set self neighbors to NA
iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
# match up categories
category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
# turn category_neighbours into numeric for tabulate
cn <- matrix(as.numeric(factor(category_neighbours, exclude=NULL)),
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
cnl <- levels(factor(category_neighbours, exclude = NULL))
# tabulate frequencies and match up with factor levels
class <- apply(cn, 1, function(x) {
cnl[which.max(tabulate(x, nbins=length(cnl))[!is.na(cnl)])]})
cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a
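A quick usage check (my addition, not part of the benchmark above): compare the nearest-neighbour vote against the recorded species.
# Confusion table of recorded vs. voted species for the faster version.
res <- myv()
table(actual = res$Species, predicted = res$class)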
I am trying to create a single matrix from a list which consists of N unequal matrices...
The reason for doing this is to draw R individual bootstrap samples.
In the example below there are 2 companies, one with 10 and one with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create a data frame of input variables which helps
# to conduct the row-wise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(5, list(), simplify = FALSE);  # R = 5 bootstrap replicates
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = T)));
---------- UNLISTING ---------
Currently, I do it like this, which is incorrect:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this and, if possible, reasonably efficiently?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths, and you would like a bootstrap sample for each company of the same size as its original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
Stop here if you would like to have separate bootstrap samples, e.g. in case you want to estimate separately.
bootdata = do.call(rbind, L2)
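If you also want the R = 5 repeated samples from the question, each already in matrix form, one possible extension (a sketch of my own, not part of the original answer; every column, including the company indicator, is numeric, so as.matrix() is lossless):
# Draw R bootstrap samples, each returned as a plain numeric matrix.
R <- 5
bootMats <- replicate(R, {
  resampled <- lapply(L1, function(s) s[sample(nrow(s), nrow(s), replace = TRUE), ])
  as.matrix(do.call(rbind, resampled))
}, simplify = FALSE)
dim(bootMats[[1]])  # 15 rows (10 + 5 observations), one column per variable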
Best wishes,
Tim
I'm using the R package data.table to read large amounts of data and analyse them. What I was wondering is why selecting rows from a data.table is so much slower than from a matrix.
require(data.table)
## create some random data
n = 1000
p = 1000
set.seed(1)
data.raw <- matrix(rnorm(n*p), nrow = n, ncol = p)
rownames(data.raw) <- lapply(1:n, FUN = function(x, length)paste(sample(c(letters, LETTERS), length, replace=TRUE), collapse=""), length = 10)
colnames(data.raw) <- paste0("X", 1:p)
# do the same thing with a data.table
data.t <- data.table(data.raw)
data.t[, id := rownames(data.raw)]
setkey(data.t, id)
## now select one row after the other in both matrix and data.table
system.time(for(r in rownames(data.raw)) y <- data.raw[r, ])
# user system elapsed
# 0.016 0.000 0.017
system.time(for(r in data.t$id) y <- data.t[r])
# user system elapsed
# 30.580 0.000 30.608
Even for this relatively small example, data.table is extremely slow even though I am using setkey. Is there any way to improve the performance of this?
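A sketch of my own, not part of the original question: the ~30 s comes from the per-call overhead of `[.data.table`, not from the lookups themselves, so fetching all rows in a single keyed join is comparable in speed to the matrix loop.
# One keyed join instead of 1000 single-row lookups.
# data.t is keyed on "id", so .() performs a binary-search join on the key.
system.time(y.all <- data.t[.(rownames(data.raw))])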