R store expressions in data.frame - r

I have a matrix of computer generated equations. Each equation is solved to produce a single number. Resulting numeric matrix is then fed to the solve function. For plotting results over a variable, this has to be repeated N times.
I want to speed up the calculations by modifying AST of the equations and simplify them using these functions before evaluating.
The problem I encountered is that I cannot properly store modified equations of types expression or language in matrix or data.frame. For example:
foo <- data.frame(matrix(expression(NA), nrow = 100, ncol = 100))
# does not work
# apply(foo, MARGIN = c(1,2), function(x) {expression(1+1)})
for (i in c(1:100)) {for (j in c(1:100)) {foo[i,j] <- expression(1+1)}}
Resulting data.frame foo is 3.1Mb even for the shortest expression. Real equations are even bigger and have awful subset time. Is there a way to store these types efficiently?

The best way to deal with this I've found so far is to store expressions in function. I create a function:
out <- function() {
m <- matrix(NA, nrow = 100, ncol = 100)
}
Then I append expressions to it:
n <- 3
for (i in c(1:100)) {
for (j in c(1:100)) {
body(out)[[n]] <- substitute(m[i, j] <- f+i+j, list(i=i, j=j))
n <- n + 1
}
n <- n + 1
}
Finally, I append the return statement and add arguments to the function:
body(out)[[length(body(out))+1]] <- quote(m)
formals(out) <- alist(f=)
To speed things up for the repeated evaluation of the resulted function, I compile it to bytecode:
outc <- cmpfun(out)
Upon calling this function returns a numeric matrix of executed equations, so no subsetting is needed. It is really big though. out and outc are both 56Mb.

Related

Speed of Daisy Function

I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing which is quite similar mathematically to the Euclidean distance function. However, when I time my function compared to that implemented in the daisy function from the cluster package, I find quite a significant difference in speed, with daisy performing much better. Given that (I'm assuming) a dissimilarity measure would require O(n x p) time due to the need to compare each object to itself over all variables (where n is number of objects and p is number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation. I present the code I have used both to implement and test below. I have tried looking through the r source code for the implementation of the daisy function, but I found it difficult to understand. I found no nested for loop. Any help with understanding why this function performs so fast and how I could possibly modify my code to have similar speed would be very highly appreciated.
euclidean <- function (df){
no_obj <- nrow(df)
dist <- array(0, dim = c(no_obj, no_obj))
for (i in 1:no_obj){
for (j in 1:no_obj){
dist_v <- 0
if(i != j){
for (v in 1:ncol(df)){
dist_v <- dist_v + sqrt((df[i,v] - df[j,v])^2)
}
}
dist[i,j] <- dist_v
}
}
return(dist)
}
data("iris")
tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))
tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))
one option:
euclidean3 <- function(df) {
require(data.table)
n <- nrow(df)
i <- CJ(1:n, 1:n) # generate all row combinations
dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2)) # loop over columns
dv <- rowSums(dl) # sum values of columns
d <- matrix(dv, n, n) # fill in matrix
d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
Largest bottleneck in your code is selecting data.frame elements in loop (df[j,v])). Maybe changing it to matrix also could improver speed. I believe there could be more performant approach on stackoverflow, you just need to search by correct keywords...

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using Rstudio
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However: the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2)=nc*(nc-1)/2 (where nc is the number of columns. When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
I have a few caveats about this approach:
you won't be able to handle a probelm with 50000 columns unless you have at least 10G of memory, because of the issue described above.
the biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard)
I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1,X) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
y=NULL, X=NULL, off=0) {
if (!is.null(weights)) {
if (!inherits(weights, "formula"))
stop("`weights' must be a formula")
w <- model.frame(weights, data)[[1]]
} else w <- NULL
if (is.null(X)) {
tt <- terms(formula)
mf <- model.frame(tt, data)
if (is.null(off <- model.offset(mf)))
off <- 0
mm <- model.matrix(tt, mf)
y <- model.response(mf) - off
} else {
## model matrix specified directly
if (is.null(y)) stop("both y and X must be specified")
mm <- X
tt <- NULL
}
qr <- biglm:::bigqr.init(NCOL(mm))
qr <- biglm:::update.bigqr(qr, mm, y, w)
rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
"assign"), terms = tt, n = NROW(mm), names = colnames(mm),
weights = weights)
if (sandwich) {
p <- ncol(mm)
n <- nrow(mm)
xyqr <- bigqr.init(p * (p + 1))
xx <- matrix(nrow = n, ncol = p * (p + 1))
xx[, 1:p] <- mm * y
for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
xyqr <- update(xyqr, xx, rep(0, n), w * w)
rval$sandwich <- list(xy = xyqr)
}
rval$df.resid <- rval$n - length(qr$D)
class(rval) <- "biglm"
rval
}

Matrix computation with for loop

I am newcomer to R, migrated from GAUSS because of the license verification issues.
I want to speed-up the following code which creates n×k matrix A. Given the n×1 vector x and vectors of parameters mu, sig (both of them k dimensional), A is created as A[i,j]=dnorm(x[i], mu[j], sigma[j]). Following code works ok for small numbers n=40, k=4, but slows down significantly when n is around 10^6 and k is about the same size as n^{1/3}.
I am doing simulation experiment to verify the bootstrap validity, so I need to repeatedly compute matrix A for #ofsimulation × #bootstrap times, and it becomes little time comsuming as I want to experiment with many different values of n,k. I vectorized the code as much as I could (thanks to vector argument of dnorm), but can I ask more speed up?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}
Your method can be put into a function like this
A.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
A <- matrix(NA,n,k)
for(j in 1:k) A[,j] <- dnorm(x,mu[j],sig[j])
A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will have the columns of the desired matrix stacked. and then the matrix to be returned can be formed from that vector just by specifying the number of rows an columns. See the following function
B.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
z <- dnorm(rep(x,times=k),rep(mu,each=n),rep(sig,each=n))
B <- matrix(z,nrow=n,ncol=k)
B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments B.fill is quite slow for large values of n.
The reason lies in the construct rep(...,each=...).
So is there a way to speed A.fill.
I tested this function:
C.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
sapply(1:k,function(j) dnorm(x,mu[j],sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.

Working with multiple cores and sparse matrices in R

I am working on a project that requires large matrices with a larger number of zeros. Unfortunately, as some of these matrices can have more than 1e10 elements, working with the "standard" R matrices is not an option, due to RAM constraints. Also, I need to work on multiple cores, as the computation can take quite a long time and really shouldn't.
So far, I have been working with the foreach package, and converted the results (which come in standard matrices) to sparse matrices afterwards. I can't help but think that there must be a smarter way.
Here is a minimal example of what I have been doing so far:
cl <- makeSOCKcluster(8)
registerDoSNOW(cl)
Mat <- foreach(j=1:length(lambda), .combine='cbind') %dopar% {
replicate(iter, rpois(n=1, lambda[j]))
}
Mat <- Matrix(Mat, sparse=TRUE)
stopCluster(cl)
The lambdas are all quite small, so that only every 5th element or so is different from zero, making it sensible to store the results in a sparse matrix.
Unfortunately, it has now become necessary to increase the number of iterations from 1e6 to at least 1e7, so that the matrix that is produced by the foreach loop is too large to be stored on 8GB of RAM. What I now want to do is split up the tasks into steps that each have 1e6 iterations, and combine these into a single, sparse matrix.
I now have the following as an idea:
library(Matrix)
library(snow)
cl <- makeSOCKcluster(8)
iter <- 1e6
steps <- 1e5
numsteps <- iter / steps
draws <- function(x, lambda, steps){
replicate(n=steps, rpois(n=1, lambda=lambda))
}
for(i in 1:numsteps){
Mat <- Matrix(0, nrow=steps, ncol=96, sparse=TRUE)
Mat <- Matrix(
parApply(cl=cl, X=Mat, MARGIN=2, FUN=draws, lambda=0.2, steps=steps)
, sparse = TRUE)
if(!exists("fullmat")) fullmat <- Mat else fullmat <- rBind(fullmat, Mat)
rm(Mat)
}
stopCluster(cl)
It works fine, but I had to fix lambda to some value. For my application, I need the values in the ith row to come from a poisson distribution with mean equal to the ith element of the lambda vector. This obviously worked fine in the foreach loop., but I have yet to find a way to make it work in an apply loop.
My questions are:
Is it possible to have the apply function "know" on which row it is operating and pass a corresponding argument to a function?
Is there a way to work with foreach and sparse matrices without the need of creating a standard matrix and converting it into a sparse one in the next step?
If none of the above, is there a way for me to manually assign tasks to slave processes of R - that is, could I specifically tell a process to work on column 1, another to work on column 2 and so on, each creating a sparse vector and only combining these in the last step.
I was able to find a solution to my problem.
In my case, I am able to define a unique ID for each of the columns, and can address the parameters by that. The following code should illustrate what I mean:
library(snow)
library(Matrix)
iter <- 1e6
steps <- 1e5
# define a unique id
SZid <- seq(from=1, to=10, by=1)
# in order to have reproducible code, generate random parameters
SZlambda <- replicate(runif(n=1, min=0, max=.5))
SZmu <- replicate(runif(n=1, min=10, max=15))
SZsigma <- replicate(runif(n=1, min=1, max=3))
cl <- makeSOCKcluster(8)
clusterExport(cl, list=c("SZlambda", "SZmu", "SZsigma"))
numsteps <- iter / steps
MCSZ <- function(SZid, steps){ # Monte Carlo Simulation
lambda <- SZlambda[SZid]; mu <- SZmu[SZid]; sigma <- SZsigma[SZid];
replicate(steps, sum(rlnorm(meanlog=mu, sdlog=sigma,
n = rpois(n=1, lambda))
))
}
for (i in 1:numsteps){
Mat <- Matrix(
parSapply(cl, X=SZid, FUN=MCSZ, steps=steps), sparse=TRUE)
if(!exists("LossSZ")) LossSZ <- Mat else LossSZ <- rBind(LossSZ, Mat)
rm(Mat)
}
stopCluster(cl)
The trick is to apply the function not over the matrix, but over a vector of unique ids that line up with the indices of the parameters.

Fill matrix with loop

I am trying to create a matrix n by k with k mvn covariates using a loop.
Quite simple but not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
x [[i]]= mvrnorm(n,mu,sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use ?mvrnorm, from the MASS package. (For more info, see here.)
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R, just generate as many values as you like and add them directly using the data= argument in the matrix() function. If you really were committed to using a loop, you should probably use a double loop, so that you are looping over the columns, and within each loop, looping over the rows. (Note that this is a very inefficient way to code in R--although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
for(i in 1:n){
x[i,j] = rnorm(1, mu, sigma)
}
}
define the matrix first
E<-matrix(data=0, nrow=10, ncol=10);
run two loops to iterate i for rows and j for columns, mine is a exchangeable correlation structure
for (i in 1:10)
{
for (j in 1:10)
{
if (i==j) {E[i,j]=1}
else {E[i,j]=0.6}
}
};
A=c(2,3,4,5);# In your case row terms
B=c(3,4,5,6);# In your case column terms
x=matrix(,nrow = length(A), ncol = length(B));
for (i in 1:length(A)){
for (j in 1:length(B)){
x[i,j]<-(A[i]*B[j])# do the similarity function, simi(A[i],B[j])
}
}
x # matrix is filled
I was thinking in my problem perspective.

Resources