R - speed up R version of MATLAB for incremental loop - r

Is there any way to speed up the following R translation of the MATLAB for loop below?
Although this example is small, the real data set may have up to 500,000 rows for SSC, SSL, and dt. Plus this similar operation will have to be applied to about 20 data sets.
SSC <- abs(rnorm(1000))
SSL <- abs(rnorm(1000))
dt <- rep(15, 1000)
for (i in 2:length(SSC))
{
TSSL[[i-1]] <- (SSL[i] + SSL[i-1])/(2*60*dt[i-1])
}
# MATLAB code
# for i=2:length(SSC)
# TSSL(i-1)=(SSL(i) + SSL(i-1))/2*60*dt(i-1);
# end
Thank you.

what is TSSL? You should initialize it.
Also you dont' need a loop here, you should instead use vector operations:
N <- 1000
SSC <- abs(rnorm(N))
SSL <- abs(rnorm(N))
dt <- rep(15, N)
TSSL <- rep(NA,N)
TSSL[1:(N-1)] = (SSL[2:N]+SSL[1:(N-1)])/(2*60*dt[1:(N-1)])
I strongly recommend reading Chapter 3 "Failing to vectorize" of R inferno.

Related

R aborts when using function DIST (110 GB vector)

I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns.
When running the function dist() on the dataset, R aborts. I have also tried it with a Windows pc and the error message I get is "cannot allocate vector of size 110.5 Gb".
My Mac and my Windows pc have 4 GB of RAM.
Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets but it is requireed by a University assignment.
Thank you
The problem can be solved by writing a function to compute the pairwise euclidian distances between columns of the data set, assumed below to be in tabular form. For other distances, a similar function can be written.
dist2 <- function(X){
cmb <- combn(seq_len(ncol(X)), 2)
d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
if(!is.null(colnames(X)))
dimnames(d) <- list(colnames(X), colnames(X))
for(i in seq_len(ncol(cmb))){
ix <- cmb[1, i]
iy <- cmb[2, i]
res <- sqrt(sum((X[, ix] - X[, iy])^2))
d[ix, iy] <- d[iy, ix] <- res
diag(d) <- 0
}
d
}
Now test the function with a data.frame of the dimensions in the question.
set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)
dist2(m)
First and foremost, it would be very nice of you to provide a reprex (reproducible example). Make sure you will do it later.
Speaking about the issue, you can use sample_frac function (if I am not mistaken, this is a function from tidyverse package). For example, sample_frac(your_data, .5) will sample 50% of your dataframe. It will reduce the size of data to be clustered and it will be easier for your laptop.
The other way is to extend the memory.limit(size = n) where n is a number in megabytes.

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Lets say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, lets say that I have 250 data points, I want to run x number of simulations over random samples of the data computing the Sharpe Ratio using the function (PerformanceAnalytics::SharpeRatio) varying the random samples to be lengths 30-240, and then find the average. Keep in mind I want to do this for every column and I'd rather not have to use apply if possible. I'd also like to find something that processes the information rather quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough, performance wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the monte carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after?
This solution does use apply functions though! Because it just makes life easier. If you don't use lapply, the code will expand quickly and distract from achieving the goal quickly (and you risk introducing bugs with longer, messier code; one reason to use apply family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
x <- get(s)
dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
# Assuming subsampling by row, preserving all returns and cross symbol dependence structure at a given timestamp
ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
# sort in order to preserve time ordering?
ii <- sort(ii)
xs <- x[ii, ]
xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
x <- subSample(x, n.sub)
SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

How to control the number of CPUs used by R?

I'm using the R package crossmatch that itself relies on some other R packages ( survival, nbpMatching, MASS) and that in turn import a wide range of more dependencies.
The crossmatch package implements a statistical test on a (potentially) large matrix, that I need to compute very often (within an MCMC algorithm). I've written the following wrapper that computes some preprocessing steps before the actual test is computed (which is the crossmatchtest() in the last line):
# wrapper function to directly call the crossmatch test with a single matrix
# first column of the matrix must be a binary group indicator, following columns are observations
# code is modified from the documentation of the crossmatch package
crossmatchdata <- function(dat) {
# the grouping variable should be in the first column
z = dat[,1]
X = subset(dat, select = -1)
## Rank based Mahalanobis distance between each pair:
# X <- as.matrix(X)
n <- dim(X)[1]
k <- dim(X)[2]
for (j in 1:k) {
X[, j] <- rank(X[, j])
}
cv <- cov(X)
vuntied <- var(1:n)
rat <- sqrt(vuntied / diag(cv))
cv <- diag(rat) %*% cv %*% diag(rat)
out <- matrix(NA, n, n)
icov <- ginv(cv)
for (i in 1:n) {
out[i, ] <- mahalanobis(X, X[i, ], icov, inverted = TRUE)
}
dis <- out
## The cross-match test:
return(crossmatchtest(z, dis))
}
I've noticed that if the matrix is rather small, this test will only use one CPU:
library(MASS)
library(crossmatch)
source("theCodeFromAbove.R")
# create a dummy matrix
m = cbind(c(rep(0, 100), rep(1, 100)))
m = cbind(m, (matrix(runif(100), ncol=10, nrow=20, byrow=T)))
while(TRUE) { crossmatchdata(m) }
as monitored via htop. However, if I'm increasing this matrix, R will use as many cores as are available (at least it looks like this):
# create a dummy matrix
m = cbind(c(rep(0, 1000), rep(1, 1000)))
m = cbind(m, (matrix(runif(100000), ncol=1000, nrow=2000, byrow=T)))
while(TRUE) { crossmatchdata(m) }
I'm fine with this parallelization in general but I would like to be able to manually control the number of cores the R process is using. I've tried options(mc.cores = 4) without success.
Is there any other variable I could set? Or what's the best way of finding the package that's responsible for the use of more than one core?
Let's look at the dependencies:
library(miniCRAN)
tags <- "crossmatch"
dg <- makeDepGraph(tags, enhances = FALSE, suggests = FALSE)
set.seed(1)
plot(dg, legendPosition = c(-1, 1), vertex.size = 20)
That is quite a few dependencies. At a first glance, there is no package for R level parallelization there. That leaves the possibility of packages using parallelization via compiled code. One such package is data.table (there might be others), try if using setDTthreads(1) turns off parallelization.
Of course, you might also have R linked to an optimized BLAS. If that's the case, the parallelization most likely happens there during matrix algebra.
Update:
#Dirk Eddelbuettel just pointed out that packages RhpcBLASctl and OpenMPController allow controlling the number of cores used by the BLAS or OpenMP.
Edit by kartoffelsalat:
The following worked for the issue in the question under Ubuntu 16.04. It did not work under macOS (neither did the package OpenMPController).
library(RhpcBLASctl)
blas_set_num_threads(3)

Parallelizing a double for loop in R

I've been using the parallel package in R to do loops like:
cl <- makeCluster(getOption("cl.cores", 6))
result <- parSapply(cl,1:k,function(i){ ... })
Is there a natural way to parallelize a nested for loop in R using this package? Or perhaps another package? I know there are several ways to implement parallelism in R.
My loop looks something like this. I simplified a bit but it gets the message across:
sup_mse <- matrix(0,nrow=k,ncol=length(sigma))
k <- 100000 #Number of iterations
sigma <- seq(from=0.1,to=10,by=0.2)
for(i in 1:k){
for(j in 1:length(sigma)){
sup<-supsmu(x,y)
sup_mse[i,j] <- mean((m(x)-sup$y)^2)
}
}
Thanks for making the reproducible example! I prefer snowfall for my parallel processing, so here's how it looks in there.
install.packages('snowfall')
require(snowfall)
### wasn't sure what you were using for x or y
set.seed(1001)
x <- sample(seq(1,100),20)
y <- sample(seq(1,100),20)
k <- 100
sigma <- seq(0.1, 10, 0.2)
### makes a local cluster on 4 cores and puts the data each core will need onto each
sfInit(parallel=TRUE,cpus=4, type="SOCK",socketHosts=rep("localhost",4))
sfExport('x','y','k','sigma')
answers <- sfSapply(seq(1,k), function(M)
sapply(seq(1,length(sigma)), function(N)
mean((mean(x)-supsmu(x,y)$y)^2) ## wasn't sure what you mean by m(x) so guessed mean
)
)
sup_mse <- t(answers) ## will give you a matrix with length(sigma) columns and k rows
sfStop()
I remember reading somewhere that you only want to use sfSapply in the outer loops and then use your regular apply functions inside of that loop. Hope this helps!

Calculating a GP correlation matrix outside of a loop

So I am trying to calculate the correlation matrix associated with a Gaussian Process using R and was hoping for some suggestions for doing so without using the triple for-loop I have written below. Mainly I want to try and condense the code for readable purposes and also to speed up calculations.
#Example Data
n = 500
x1 = sample(1:100,n,replace=T)
x2 = sample(1:100,n,replace=T)
x3 = sample(1:100,n,replace=T)
X = cbind(x1,x2,x3)
R = matrix(NA,nrow=n,ncol=n)
for(i in 1:nrow(X)){
for(j in 1:nrow(X)){
temp = 0
for(k in 1:ncol(X)){
temp = -abs(X[i,k]-X[j,k])^1.99 + temp
}
R[i,j] = exp(temp)
}
}
So as n gets large, the code gets much slower. Also worth noting, since this is a correlation matrix, the matrix is syymetric and the diagonal is equal to 1.
It's much faster using this:
y <- t(X)
R <- exp(-sapply(1:ncol(y), function(i) colSums((y-y[,i])^2)))
If you want ot keep your original formula:
R <- exp(-sapply(1:ncol(y), function(i) colSums(abs(y-y[,i])^1.99)))
I'm wondering if you could cut your calculation and looping times in half by changing these two lines? (Actually the timing was improved by more than 50% 14.304 secs improved to 6.234 secs )
1: for(j in 1:nrow(X)){
2: R[i,j] = exp(temp)
To:
1: for(j in i:nrow(X)){
2: R[i,j] = R[j,i]= exp(temp)
Tested:
> all.equal(R, R2)
[1] TRUE
That way you populate the lower triangle without doing any calculations.BTW, what's with the 1.99? This is perhaps a problem more suited to submitting as a C program. The Rcpp package supports this and there are a lot of worked examples on SO. Perhaps a search on: [r] rcpp nested loops

Resources