R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with columns c("x", "y", "density"), over a range of subsample sizes and replicates. reps = number of replicates; s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({ #start timer
    subsample_bound <- data.frame()
    #Perform subsampling of the full dataframe
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[, ] #temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[, 1])),
                                     rep(reps, length(subsample_replicate[, 1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  }) #end timer
  colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
  subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
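A quick sanity check on the result (my own arithmetic, not from the original post): seq(1, 206, by = 10) gives 21 sizes, so the bound output should have sum(seq(1, 206, by = 10)) * 5 = 10,605 rows.
sum(seq(1, 206, by = 10)) * 5 # 1 + 11 + ... + 201 = 2121 rows per replicate, times 5 reps
nrow(subsample_data)          # should match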
Here's the row subsample function:
# Samples N rows from a dataframe; outputs a dataframe with the same columns
# df: data frame
# N: number of rows to sample
sample.df.rows <- function(N, df, ...) {
  df[sample(nrow(df), N, replace = FALSE, ...), ]
}
It's way too slow, and I've had no luck rewriting it with the apply family. I'll be doing somewhere around 1,000-10,000 replicates for each s_size in 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
EDIT: Sample data to draw from:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}

resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id,
                               MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function:
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
This outputs data, but unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
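The warning happens because ss and id end up with different lengths: with s_size = 206 and int = 5, seq(1, s_size, by = int) has 42 elements, so ss has 42 * reps entries, while id uses times = s_size/int = 41.2, which rep() truncates to 41. A sketch of a fix (my suggestion, not part of the original answer) is to derive both vectors from the same sequence:
resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)
  id <- rep(seq_len(reps), times = length(sizes)) # same length as ss by construction
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}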

I put very little thought into actually optimizing this; I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind or c, you can be very sure the resulting code is essentially the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of it if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30

ss <- rep(seq(1, s_size, by = int), each = reps)
id <- rep(seq_len(reps), times = s_size/int)

foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}

out <- do.call(rbind, mapply(foo, i = ss, j = id,
                             MoreArgs = list(data = dat), SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
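A rough way to check the claimed speed-up yourself, as a sketch on synthetic data (the sizes and column values below are placeholders, not the poster's density.csv):
# fake data standing in for data_xyz; dat is the same table under the answer's name
data_xyz <- data.frame(x = runif(1e4), y = runif(1e4), density = runif(1e4))
dat <- data_xyz

system.time(subsample_loop(s_size = 200, reps = 30, int = 10)) # original loop
system.time(
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
) # the rbind-free version, with ss and id as defined above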

Related

Combining the Results of Several Loops Together

I wrote the following code that generates a single random number, subtracts this random number from some constant, records this result - and then repeats this process 100 times:
# 1 random number
results <- list()
for (i in 1:100) {
  iteration = i
  number_i_1 = mean(rnorm(1, 10, 2))
  difference_i_1 = 10 - number_i_1
  results_tmp = data.frame(iteration, number_i_1, difference_i_1)
  results[[i]] <- results_tmp
}
results_df_1 <- do.call(rbind.data.frame, results)
To do this for 2 random numbers and 3 random numbers - the above code only needs to be slightly modified:
# 2 random numbers
results <- list()
for (i in 1:100) {
  iteration = i
  number_i_2 = mean(rnorm(2, 10, 2))
  difference_i_2 = 10 - number_i_2
  results_tmp = data.frame(number_i_2, difference_i_2)
  results[[i]] <- results_tmp
}
results_df_2 <- do.call(rbind.data.frame, results)

# 3 random numbers
results <- list()
for (i in 1:100) {
  iteration = i
  number_i_3 = mean(rnorm(3, 10, 2))
  difference_i_3 = 10 - number_i_3
  results_tmp = data.frame(number_i_3, difference_i_3)
  results[[i]] <- results_tmp
}
results_df_3 <- do.call(rbind.data.frame, results)
My Question: I would like to repeat this general process 20 times and store all the results in a single data frame. For example (note: the actual data frame would have 20 pairs of such columns):
final_frame = cbind(results_df_1 , results_df_2, results_df_3)
iteration number_i_1 difference_i_1 number_i_2 difference_i_2 number_i_3 difference_i_3
1 1 12.534059 -2.5340585 9.623655 0.3763455 9.327020 0.67298023
2 2 9.893728 0.1062721 10.135650 -0.1356502 10.037904 -0.03790384
3 3 8.895232 1.1047680 9.848402 0.1515981 7.588531 2.41146943
4 4 11.648550 -1.6485504 8.509288 1.4907120 10.294153 -0.29415334
5 5 9.045034 0.9549660 9.351834 0.6481655 11.084067 -1.08406691
6 6 9.230139 0.7698612 8.163164 1.8368356 7.846356 2.15364367
And then make two mean files (note: each of these two files would also have 20 rows):
mean_numbers = data_frame(iterations = c(1:3),
                          mean_number = c(mean(final_frame$number_i_1), mean(final_frame$number_i_2), mean(final_frame$number_i_3)))
mean_differences = data_frame(iterations = c(1:3),
                              mean_differences = c(mean(final_frame$difference_i_1), mean(final_frame$difference_i_2), mean(final_frame$difference_i_3)))
Can someone please show me how to do this?
Your initial objective can be simplified like this:
results <- list()
for (i in seq_len(100)) {
  # Samples from 1 to 20 numbers, averages them
  a <- unlist(lapply(seq_len(20), function(x) mean(rnorm(x, 10, 2))))
  # Names for this vector
  names(a) <- paste0(rep("number_i_", 20), 1:20)
  # Differences
  b <- 10 - a
  # ...and its names
  names(b) <- paste0(rep("diff_i_", 20), 1:20)
  # Creating a 40-column df (there are better structures for this, especially if the final outcome is to separate them)
  c <- as.data.frame(cbind(rbind(a), rbind(b)))
  # Storing in the list
  results[[i]] <- c
}
results_df_3 <- do.call(rbind.data.frame, results)
There are even more elegant ways to write this but it will be enough for you to get there.
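For instance, here is a sketch of one such rewrite (my own, producing the same 100 x 40 shape): build each iteration's row with vapply and let replicate collect the pieces, so nothing is grown inside a loop.
sim_once <- function() {
  means <- vapply(seq_len(20), function(x) mean(rnorm(x, 10, 2)), numeric(1))
  c(setNames(means, paste0("number_i_", 1:20)),
    setNames(10 - means, paste0("diff_i_", 1:20)))
}
results_df_3 <- as.data.frame(do.call(rbind, replicate(100, sim_once(), simplify = FALSE)))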
The format in your last section does not match what you want to achieve. If the goal is a summary of the means for each number of samples taken, like this:
mockfdf <- data.frame(nsamp = 1:20, meanmeans = rnorm(20)) # summarized means go here
mockddf <- data.frame(nsamp = 1:20, diffmeans = rnorm(20)) # summarized differences go here
then you can easily separate the dataframes for means and differences and process each one a lot more conveniently.
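For completeness, a sketch of building those two summary frames from results_df_3 above, selecting columns by the names we assigned:
mean_numbers <- data.frame(nsamp = 1:20,
                           mean_number = colMeans(results_df_3[paste0("number_i_", 1:20)]))
mean_differences <- data.frame(nsamp = 1:20,
                               mean_diff = colMeans(results_df_3[paste0("diff_i_", 1:20)]))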

Randomly subsampling a Seurat object

I've been trying to randomly subsample my Seurat object.
I'm interested in subsampling based on 2 metadata columns: condition and cell type. I have 5 conditions and 5 cell types. The main goal is to have 1000 cells for each cell type in each condition.
I've tried this so far. First, subsetting my Seurat object:
my.list <- list(hipo.c1.neurons = hipo %>%
                  subset(., condition %in% "c1" & group %in% "Neurons"),
                hipo.c1.oligo = hipo %>%
                  subset(., condition %in% "c1" & group %in% "Oligod"),
                ...etc...)
And then subsampling it using the sample function:
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
  x <- x[, sample(ncol(x), 1000, replace = FALSE)]
})
And I get this error, since some objects have fewer than 1000 cells: error in evaluating the argument 'j' in selecting a method for function '[': cannot take a sample larger than the population when 'replace = FALSE'
Then I've tried with this function:
lapply_with_error <- function(X, FUN, ...) {
  lapply(X, function(x, ...) tryCatch(FUN(x, ...),
                                      error = function(e) NULL))
}
But that just gives me empty entries for the objects that have fewer than 1000 cells. What would be the way to skip those objects and leave them as they are, i.e., not sample them at all?
Is there a simpler way to do this, so I don't have to subset all of my objects separately?
I can't say for certain without seeing your data, but could you just add an if statement to the function? It looks like you're sampling column-wise, so check the number of columns: simply return x unchanged when it has fewer columns than you'd like to sample.
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
  if (ncol(x) > 1000) {
    x <- x[, sample(ncol(x), 1000, replace = FALSE)]
  } else {
    x
  }
})
You could make it more flexible if you want to sample something other than 1000.
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
  if (ncol(x) > B) {
    x <- x[, sample(ncol(x), B, replace = FALSE)]
  } else {
    x
  }
})
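As for the "simpler way" part of the question, a possible approach (an untested sketch; split_id is a column name I made up) is to paste the two metadata columns into one and let Seurat's SplitObject() build the list, instead of writing out every subset() by hand:
library(Seurat)
hipo$split_id <- paste(hipo$condition, hipo$group, sep = "_")
my.list <- SplitObject(hipo, split.by = "split_id") # one object per condition/cell-type combination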

for loop inside foreach parallel not populating a dataframe in R

I am having an issue populating a dataframe inside a foreach loop. Suppose I have the following dataframe; its structure is exactly what my real one looks like:
Elec2 <- rep(rep(rep(27:1, each = 81), each = 18), times = 100)
Ind <- rep(1:18, times = 218700)
Cond <- rep(1:3, times = 1312200)
Trial <- rep(rep(1:100, each = 2187), each = 18)
DVAR <- rbeta(3936600, 0.7, 1.5)
data <- cbind(DVAR, Ind, Cond, Trial, Elec1, Elec2)
I am trying to parallelise it with the following code:
distinct_pairs <- data %>%
  select(Elec1, Elec2) %>%
  distinct()

cl <- makeCluster(2) # adjust to your core count; 2 used for the example
registerDoParallel(cl)

output <- foreach(i = 1:nrow(distinct_pairs), .packages = 'glmmTMB',
                  .combine = rbind,
                  .errorhandling = "pass",
                  .verbose = T) %dopar% {
  dep <- distinct_pairs[i, ]
  dat1 <- subset(data, dep$Elec1 == data$Elec1 & dep$Elec2 == data$Elec2)
  df[i, ]$Elec1 <- dep[i, ]$Elec1
  df[i, ]$Elec2 <- dep[i, ]$Elec2
  for (j in 1:18) { # by individual
    dat2 <- subset(dat1, dat1$Ind == j)
    model <- glmmTMB(DVAR ~ Cond, family = beta_family('logit'), data = dat2)
    results <- summary(model)
    est <- results$coefficients$cond[2, 1]
    ste <- results$coefficients$cond[2, 2]
    df[j, ] <- c(est, ste)
  }
  return(df)
}
output <- as.data.frame(output, row.names = FALSE)
As you can see, I am expecting a dataframe with the results of the iterations (est and ste) plus the identification of the electrodes (Elec1 and Elec2). If I run the lines independently, one by one, they seem to work fine, but I cannot make them work the way I expect inside the loop.
The outer loop takes a pair of electrodes: every row in distinct_pairs is one pair, numbered from 1 to 27 for both Elec1 and Elec2.
The problem is that I am unable to get the data from the inner for loop written into the final output dataframe.
I am sure the problem is pretty basic, but I'd appreciate any insight, as I seem to be missing something.
Thanks!
[[UPDATE WITH SOLUTION]]
In case anyone is interested, here is the solution.
output <- foreach(i = 1:10, .packages = 'glmmTMB',
                  .combine = rbind,
                  .errorhandling = "pass",
                  .inorder = TRUE,
                  .verbose = T) %dopar% {
  dat1 <- subset(data, distinct_pairs[i, ]$Elec1 == data$Elec1 & distinct_pairs[i, ]$Elec2 == data$Elec2)
  df <- data.frame('Elec1' = rep(distinct_pairs[i, ]$Elec1, 18),
                   'Elec2' = rep(distinct_pairs[i, ]$Elec2, 18),
                   'est' = rep(NA, 18),
                   'ste' = rep(NA, 18))
  for (j in 1:18) {
    dat2 <- subset(dat1, dat1$Ind == j)
    model <- glmmTMB(DVAR ~ Cond, family = beta_family('logit'), data = dat2)
    results <- summary(model)
    est <- results$coefficients$cond[2, 1]
    ste <- results$coefficients$cond[2, 2]
    df[j, c('est', 'ste')] <- c(est, ste)
  }
  return(df)
}
Which returns exactly what I was looking for:
> head(output)
Elec1 Elec2 est ste
1 1 1 0.034798615 0.03530296
2 1 1 -0.005363760 0.03392442
3 1 1 -0.017349123 0.03404430
4 1 1 -0.034819068 0.03196078
5 1 1 0.002301062 0.03163825
6 1 1 0.003575131 0.03452420
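One housekeeping note (standard doParallel practice, not part of the original post): release the workers once the foreach is done.
stopCluster(cl)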
I am definitely not sure I've understood the problem; could you also provide an Elec1 in your data example?
An idea: foreach might not find df. You could create the data frame at the beginning of your loop, with something like
df <- data.frame('Elec1' = rep(NA, 18), 'Elec2' = rep(NA, 18), 'est' = rep(NA, 18), 'ste' = rep(NA, 18))
and then, below in the for loop, fill it with: df[j, c('est', 'ste')] <- c(est, ste)

Vectorization of a nested for-loop that forms all paired combinations

I thought that the following problem must have been answered, or that a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-column data frame and copies it next to each of the other rows, forming a 6-column data frame with all possible combinations. This works fine, but on a medium-sized data set (800 rows) the loops take forever to complete.
I will demonstrate on a sample data set:
Sdat <- data.frame(
  x = c(10, 20, 30, 40),
  y = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)

compar <- data.frame(matrix(nrow = 0, ncol = 6)) # to contain all combinations
names(compar) <- c("x", "y", "ID", "x", "y", "ID")
N <- nrow(Sdat) # how many different points we have

for (i in 1:N) {
  for (j in 1:N) {
    Temp1 <- Sdat[i, ] # data from 1st point
    Temp2 <- Sdat[j, ] # data from 2nd point
    C <- cbind(Temp1, Temp2)
    compar <- rbind(C, compar)
  }
}
These loops produce exactly the output that I need for further analysis. Any suggestions for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
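Note that the cbind() here yields duplicated column names; if that matters downstream, renaming is straightforward (the names below are my own choice):
names(compar) <- c("x1", "y1", "ID1", "x2", "y2", "ID2")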
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
                     y = rep(Sdat$y, each = N),
                     id = rep(1:N, each = N),
                     x1 = rep(Sdat$x, N),
                     y1 = rep(Sdat$y, N),
                     id_1 = rep(1:N, N))

How to vectorize a function in R

I need some help vectorizing the following code, because I believe it will become more efficient, but I do not know where to begin. I created a loop that goes through z. z has 3 columns and 112,847 rows, which might be one reason it takes a long time. The 3 columns contain numbers that are used in the MACD() function.
library(quantmod)
library(TTR)

# get stock data
getSymbols('LUNA')

# Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)

# Create matrix for returns only
y <- stock

# Create a "MATRIX" by choosing the Adjusted Close
Nudata3 <- stock

# Sharpe Ratio Matrix
SR1 <- matrix(NA, nrow = 1)

# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)

# stores possible combinations into z
z <- expand.grid(i, k, j)
colnames(z) <- c("one", "two", "three")

n = 1
stretches <- length(z[, 1])
while (n < stretches) {
  # go through all the values in "z"
  Nuw <- MACD((stock), nFast = z[n, 1], nSlow = z[n, 2], nSig = z[n, 3], maType = "EMA")
  colnames(Nuw) <- c("MACD", "Signal") # change the col names to create signals
  x <- na.omit(merge((stock), Nuw))
  x$sig <- NA
  # Create trading signals
  sig1 <- Lag(ifelse((x$MACD <= x$Signal), -1, 0)) # short when MACD < SIGNAL
  sig2 <- Lag(ifelse((x$MACD >= x$Signal), 1, 0))  # long when MACD > SIGNAL
  x$sig <- sig1 + sig2
  # calculate Returns
  ret <- na.omit(ROC(Ad(x)) * x$sig)
  colnames(ret) <- c(paste(z[n, 1], z[n, 2], z[n, 3], sep = ","))
  x <- merge(ret, x)
  y <- merge(y, ret) # a MATRIX with RETURNs ONLY
  Nudata3 <- merge(Nudata3, x)
  ANNUAL <- (mean(ret)/sd(ret)) * sqrt(252) # annualized ratio
  Shrat <- ANNUAL # stores the ratio into Shrat
  SR1 <- cbind(SR1, Shrat) # binds all ratios as it loops
  n <- n + 1
}
I would like to know how to vectorize the MACD() call to speed up the process, since stretches is approximately 112,847 and it takes my computer quite some time to get through the loop.
First and foremost, a case-specific optimization: remove the cases where nFast > nSlow, as they don't make sense technically.
Secondly, you are creating objects and copying them over and over again. This is very expensive.
Thirdly, you can perhaps code this better by creating a matrix of signals in one loop and doing the rest of the operations in a vectorized manner.
I would code what you are doing something like this.
Please read the help pages of mapply, do.call, merge and sapply if anything is unclear.
require(quantmod)
getSymbols("LUNA")

# Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)

# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)

# stores possible combinations into z
z <- expand.grid(i, k, j)
IMO, this is where your first optimization should be: keep only the cases where i < k.
z <- z[z[, 1] < z[, 2], ]
This reduces the number of cases from 112,847 to 57,575.
# Calculate only once; no need to recompute this in every iteration.
stockret <- ROC(stock)

getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
  x <- MACD((stock), nFast = nFast, nSlow = nSlow, nSig = nSig, maType = "EMA")
  x <- na.omit(x)
  sig <- Lag(ifelse((x$macd <= x$signal), -1, 0)) + Lag(ifelse((x$macd >= x$signal), 1, 0))
  return(na.omit(stockret * sig))
}

RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[, 1], nSlow = z[, 2], nSig = z[, 3],
                                     MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
getAnnualSharpe <- function(ret) {
  ret <- na.omit(ret)
  return((mean(ret)/sd(ret)) * sqrt(252))
}

SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
Results will be as below. Recovering which column belongs to which combination of i, j, k is straightforward (see the naming sketch at the end).
head(RETURNSLIST[, 1:3])
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10 0.012739026 -0.012739026 0
## 2007-01-11 -0.051959739 0.051959739 0
## 2007-01-12 -0.007968170 -0.007968170 0
## 2007-01-16 -0.007905180 -0.007905180 0
## 2007-01-17 -0.005235614 -0.005235614 0
## 2007-01-18 0.028315920 -0.028315920 0
SHARPELIST
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6
## 0.04939150 -0.07428392 NaN 0.02626382 -0.06789803 -0.22584987 -0.07305477
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9
## -0.05831643 -0.08864845 -0.08221986
system.time({
  RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100, 1], nSlow = z[1:100, 2], nSig = z[1:100, 3],
                                       MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
  SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
})
   user  system elapsed
   2.28    0.00    2.29
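If you want each return series tied back to its (nFast, nSlow, nSig) combination explicitly, one option is to name the columns from z (a sketch of mine, matching the z[1:100, ] subset timed above; use all of z for the full run):
colnames(RETURNSLIST) <- apply(z[1:100, ], 1, paste, collapse = "_")
names(SHARPELIST) <- colnames(RETURNSLIST)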
