Optimize a large number of variable operations and variable ordering - r

I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward:
create a vector of unique combinations (m=3, 4, or 5) from df variable names
transform the vector of combinations into a list of formulas
break up the list of formulas to process into chunks to get around memory limitations
iterate through each chunk performing the formula operation and subset the df to the user specified number of rows (topn)
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table, following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(`:=`(...)):
Apply a list of formulas to R data.table
It appears to me that one of the bottlenecks in my code is in the variable operation step. A previous bottleneck was in the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
library(purrr)
library(stringr)
library(kit)
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
  ## using purrr::map
  # p <- Sys.time()
  ele_length <- length(chunks_run[[chunks_idx]])
  ops_list_temp <- vector("list", length = ele_length)
  ops_list_temp <- purrr::map(
    ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  )
  # (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
  # ## using base lapply
  # p <- Sys.time()
  # ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
  # ops_list_temp <- vector("list", length = ele_length)
  # ops_list_temp <- lapply(
  #   ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  # )
  # (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
  ## number of rows I want to subset from df
  topn <- 250
  ## list to store indices of topn values for each list element
  indices_list <- vector("list", length = length(ops_list_temp))
  ## list to store value of the topn indices for each list element
  values_list <- vector("list", length = length(ops_list_temp))
  ## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
  ## each element in this list should be the length of topn
  indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = TRUE, hasna = FALSE))
  ## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
  ## each element in this list should be the length of topn
  values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
  ## save completed chunk to final list
  chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
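For experimenting with the flow at a small scale, here is a minimal sketch on toy data. It is not the code above: it assumes the predictors are held in a numeric matrix, indexes columns with the integer combinations that combn() returns (so no strings are pasted and split again), and uses base order() where the original uses kit::topn(). The names m, value, idx and res are made up for the illustration, and no timing is claimed for it.

```r
# Toy-sized sketch of the same flow (assumed sizes: 200 rows, 6 predictors)
set.seed(1)
n_rows <- 200
m <- matrix(rnorm(n_rows * 6, 200, 500), nrow = n_rows)  # stands in for the 90 predictor columns
value <- rnorm(n_rows, 200, 500)                         # stands in for df$value
topn <- 5

# combn() on an integer returns a 4 x choose(6, 4) matrix of column indices
idx <- combn(ncol(m), 4)

res <- lapply(seq_len(ncol(idx)), function(k) {
  j <- idx[, k]
  # element-wise product of the four selected columns
  prod_vec <- m[, j[1]] * m[, j[2]] * m[, j[3]] * m[, j[4]]
  # row indices of the topn largest products, in decreasing order
  top_rows <- order(prod_vec, decreasing = TRUE)[seq_len(topn)]
  value[top_rows]
})
```

Each element of res holds the topn entries of value for one combination, which mirrors what values_list stores inside the loop.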

Related

Two-way frequency table followed by matrix multiplication - high running time

I'm new to R. I'm trying to compute, for every pair of columns in a data frame or matrix, the product of a fixed matrix with the pair's two-way frequency table, divided by the sequence length (the number of rows, which is 15). The problem is that the running time increases dramatically on 1K sequences (1K columns), where it takes more than 10 minutes; for 10K it could be more than an hour. The goal is to use it with as many sequences as possible.
mat1 <- matrix(sample(LETTERS),ncol = 100,nrow = 15)
mat2 <- matrix(sample(abs(rnorm(26,0,3))),ncol=26,nrow=26)
rownames(mat2) <- LETTERS
colnames(mat2) <- LETTERS
diag(mat2) <- 0
test_vec <- c()
for (i in seq(ncol(mat1)-1)){
  for(j in seq(i+1,ncol(mat1))){
    s2 <- table(mat1[,i],mat1[,j]) # create 2-way frequency table
    mat2_1 <- mat2
    mat2_1 <- mat2_1[rownames(mat2_1) %in% rownames(s2),
                     colnames(mat2_1) %in% colnames(s2)]
    calc <- ((1/nrow(mat1))*sum(mat2_1*s2))
    test_vec <- append(test_vec,calc)
  }
}
Thanks for the help.
Here is an approach that converts mat1 to a data.table, and converts all the columns to factors, and uses table(..., exclude=NULL)
library(data.table)
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = sapply(g, function(x) sum(table(m[[x[1]]], m[[x[2]]], exclude=NULL)*mat2)/nrow(m))
Check equality:
sum(abs(result-test_vec)>1e-10)
[1] 0
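One way to see what the table-based sum computes: for a pair of columns a and b, sum(table(a, b) * mat2) adds up mat2[a[r], b[r]] over the rows r. The sketch below checks that identity on a small made-up matrix (4 columns instead of 100, letters sampled with replacement; both are assumptions, not the question's data). The names via_table and via_lookup are illustrative.

```r
set.seed(42)
mat1 <- matrix(sample(LETTERS, 15 * 4, replace = TRUE), nrow = 15)
mat2 <- matrix(abs(rnorm(26 * 26, 0, 3)), nrow = 26,
               dimnames = list(LETTERS, LETTERS))
diag(mat2) <- 0

a <- mat1[, 1]
b <- mat1[, 2]

# Approach from the answer: full 26 x 26 frequency table times mat2
via_table <- sum(table(factor(a, levels = LETTERS),
                       factor(b, levels = LETTERS)) * mat2) / nrow(mat1)

# Same quantity by direct lookup: a two-column character matrix
# indexes mat2 by its row and column names, one cell per row of mat1
via_lookup <- sum(mat2[cbind(a, b)]) / nrow(mat1)

all.equal(via_table, via_lookup)
```

Whether the lookup form is faster in practice would need to be measured; it is shown here only to make the computation explicit.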
Here there are 4950 combinations (100*99/2), but the number of combinations will increase quickly as ncol(mat1) increases (as you point out). You might find in that case that a parallelized version works well.
library(doParallel)
library(data.table)
registerDoParallel()
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = foreach(i=1:length(g), .combine=c) %dopar%
sum(table(m[[g[[i]][1]]], m[[g[[i]][2]]], exclude=NULL)*mat2)
result = result/nrow(m)

Submit every similarly named elements of a list of vectors to a function in R

Below, I'm wondering how to use BASE R function quantile() separately across elements in L that are named EFL and ESL?
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
  X <- as.matrix(X)
  tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
  w <- diag(ncol(tab))
  rosum <- rowSums(tab)
  obs_oc <- tab * (t(w %*% t(tab)) - 1)
  obs_c <- colSums(obs_oc)
  max_oc <- tab * (rosum - 1)
  max_c <- colSums(max_oc)
  SA <- obs_c / max_c
  h <- names(SA)
  h[is.na(h)] <- "NA"
  setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
In my previous solution, I used do.call to rbind the list elements into a matrix and then calculated the quantiles for each column of the resulting data.frame.
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when an element is missing from one of the vectors, that approach does not take it into account. To line the values up correctly, you need to fill in the missing entries. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill=TRUE to fill the missing values. It requires each element to be a data.frame/table/list, so I converted each to a data.frame, but before doing so you need to transpose (t()) the data so that each named element gets its own column. It could be written in a single line, but it's easier to read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
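Since the question asks for base R, a sketch without extra packages might look like this. It assumes each element of L is a named numeric vector, as foo() returns; the short list below is made up so that the example runs without downloading the CSV, and flat and by_name are illustrative names.

```r
# Made-up stand-in for L: the second vector lacks ESL, the third has a different order
L <- list(c(EFL = 0.5, ESL = 0.7),
          c(EFL = 0.6),
          c(ESL = 0.9, EFL = 0.4))

# unlist() keeps the inner names because the outer list is unnamed
flat <- unlist(L)

# Group the values by name; missing names simply contribute nothing
by_name <- split(unname(flat), names(flat))

# One set of quantiles per name
lapply(by_name, quantile, probs = c(0.25, 0.5, 0.75))
```

Grouping by name this way does not depend on every vector containing every name, so no filling with NA is needed.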

R Convert loop into function

I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to make into a function. The problem I am having is how to step through the constraint vectors without a loop. Here's what I mean:
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))
## feature ranges by column number
frm <- c(438,563,953,963)
to <- c(548,803,1000,993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")
WL.ps <- list()
for (i in 1:length(frm)){
  ## finds the minimum value within the range constraints and returns the corresponding column name
  WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]],1,which.min)]
  WL.ps[[i]] <- WL
}
new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the 'frm' and 'to' vector values is what I'm having trouble with. How does one go from frm[1] to frm[2] and so on in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function that returns the column name of the minimum value in each row for a particular range of columns. I have used max.col on the negated data instead of apply(df, 1, which.min), since max.col is more efficient than apply. Note that max.col breaks ties at random by default; ties.method = "first" matches the behaviour of which.min.
apply_fun <- function(data, x, y) {
  cols <- x:y
  names(data[cols])[max.col(-data[cols], ties.method = "first")]
}
Apply this function using Map:
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
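To get from the list returned by Map to the new.df of the question, the result can be turned into a data frame and named with nm. The sketch below repeats the question's setup so that it runs on its own; passing ties.method = "first" to max.col is an assumption added here so that ties resolve the same way as with which.min.

```r
## data and ranges from the question
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, replace = TRUE)))
frm <- c(438, 563, 953, 963)
to <- c(548, 803, 1000, 993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")

## column name of the row-wise minimum within columns x:y
apply_fun <- function(data, x, y) {
  cols <- x:y
  names(data[cols])[max.col(-data[cols], ties.method = "first")]
}

## one character vector per (frm, to) pair
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))

## combine into a data frame and attach the feature names
new.df <- setNames(data.frame(WL.ps, stringsAsFactors = FALSE), nm)
```

Map pairs frm[1] with to[1], frm[2] with to[2], and so on, which replaces the explicit index i of the loop.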

apply list of indices to list of dataframes

I need to apply a list of indices to a list of dataframes with a one-to-one mapping. The first element of the list of indices goes to the first dataframe only, and so on. The list of indices applies to the rows of the dataframes.
A list of complementary dataframes also needs to be created by selecting the rows not mentioned in the indices list.
Here is some sample data:
set.seed(1)
A <- data.frame(matrix(rnorm(40,0,1), nrow = 10))
B <- data.frame(matrix(rnorm(40,2,3), nrow = 10))
C <- data.frame(matrix(rnorm(40,3,4), nrow = 10))
dflis <- list(A,B,C)
# Create a sample row index
ix <- lapply(lapply(dflis,nrow), sample, size = 6)
So far I have managed this working but ugly looking code:
dflis.train <- lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
dflis.test <- lapply(seq_along(dflis), function(x) dflis[[x]][-ix[[x]],])
Can someone suggest something better, more elegant?
Use Map/mapply instead of the univariate lapply, so that you can iterate over both objects and apply a function, like:
Map(function(d,r) d[r,], dflis, ix)
Or if you want to be fancy:
Map(`[`, dflis, ix, TRUE)
This matches the result of your original code:
identical(
  Map(function(d,r) d[r,], dflis, ix),
  lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
)
#[1] TRUE
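The complementary list from the question can be built the same way with negative indices. A minimal sketch, repeating the sample data so that it runs on its own:

```r
set.seed(1)
A <- data.frame(matrix(rnorm(40, 0, 1), nrow = 10))
B <- data.frame(matrix(rnorm(40, 2, 3), nrow = 10))
C <- data.frame(matrix(rnorm(40, 3, 4), nrow = 10))
dflis <- list(A, B, C)
# six sampled row indices per data frame
ix <- lapply(lapply(dflis, nrow), sample, size = 6)

# rows listed in ix
dflis.train <- Map(function(d, r) d[r, ], dflis, ix)
# all remaining rows (negative indices drop the listed rows)
dflis.test <- Map(function(d, r) d[-r, ], dflis, ix)
```

Negative indexing only works as intended when every index vector is non-empty; with an empty vector, d[-r, ] would return zero rows instead of all of them.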

Creating list of randomized data-frames with increasing number in names

I want to create 10 (at work: 50,000) random data frames, setting a seed for the sake of reproducibility. The seed should be different for each data frame, and the names should increase from df_01, df_02 ... to df_10. With the help of @akrun's answer I coded a loop like this:
# Number of data-frames to be created
n <- 10
# setting a seed vector
x <- 42
# loop
for (i in 1:10) {
  set.seed(x)
  a <- rnorm(10,.9,.05)
  b <- sample(8:200,10,replace=TRUE)
  c <- rnorm(10,80,30)
  lst <- replicate(i, data.frame(a,b,c), simplify=FALSE)
  x <- x+i
}
# name data-frames
names(lst) <- paste0('df', 1:10)
Now I have my data frames, but it seems I can't get the random generation running. All data are similar. When I replace the lst-line with the following code, at least the seeded randomization works:
print(data.frame(a,b,c))
A crackajack extra would be a hint for leading zeros in the dfs-names in order to sort them.
Any help appreciated, thx!
You get the same results in all your list elements, because you create your list from scratch in every iteration using replicate and replace the previously created one. If you are using a for loop, you do not need replicate.
For the sake of reproducibility I would create a vector of seeds before the loop and then set the seed in each iteration. The leading zeros can be produced using sprintf:
## Number of random data frames to create:
n <- 10
## Sample vector of seeds:
initSeed <- 1234
set.seed(initSeed)
seedVec <- sample.int(n = 1e8, size = n, replace = FALSE)
## loop:
lst <- lapply(1:n, function(i){
  set.seed(seedVec[i])
  a <- rnorm(10,.9,.05)
  b <- sample(8:200,10,replace=TRUE)
  c <- rnorm(10,80,30)
  data.frame(a,b,c)
})
## Set names with leading zeroes (2 digits). If you want
## three digits, change "%02d" to "%03d" etc.
names(lst) <- paste0('df', sprintf("%02d", seq_len(n)))
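For larger n (such as the 50,000 mentioned in the question), the padding width can be derived from n instead of being hard-coded. The sketch below wraps the approach in a hypothetical helper, make_dfs, and uses formatC for the padding; the function name, the init_seed argument and the "df_" prefix are assumptions made for this illustration.

```r
make_dfs <- function(n, init_seed = 1234) {
  # draw one distinct seed per data frame from a single initial seed
  set.seed(init_seed)
  seed_vec <- sample.int(n = 1e8, size = n, replace = FALSE)

  lst <- lapply(seq_len(n), function(i) {
    set.seed(seed_vec[i])
    data.frame(a = rnorm(10, .9, .05),
               b = sample(8:200, 10, replace = TRUE),
               c = rnorm(10, 80, 30))
  })

  # pad with as many digits as n itself has, e.g. 2 for n = 10
  width <- nchar(as.character(as.integer(n)))
  names(lst) <- paste0("df_", formatC(seq_len(n), width = width, flag = "0"))
  lst
}

run1 <- make_dfs(10)
run2 <- make_dfs(10)
identical(run1, run2)   # same seeds, so both runs should agree
names(run1)[c(1, 10)]
```

Because the names have equal width, sorting them alphabetically keeps the data frames in numeric order.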
