Memory problems while using lm.circular() in R

I am trying to run a circular regression in R using the circular package. My dataset is fairly large: ~85,000 rows and 6 variables. When I try to fit the model, I get an error message reading "Error: cannot allocate vector of size 53.3 Gb." I am more of a statistician than a programmer, so I can't figure out how to fix this; it also seems odd that such a large allocation is being requested, since my dataset is not that large. I have attached a fictional dataset and the code below. Thank you.
library(circular)
set.seed(12)
n = 80000
df <- data.frame(y  = rnorm(n, 2, .2),
                 x1 = rnorm(n, 100, 2),
                 x2 = rnorm(n, 0, 1),
                 x3 = rnorm(n, 9, .2),
                 x4 = rnorm(n, 0, 1),
                 x5 = rnorm(n, 1, .1))
y <- circular(df$y, type = "angles", units = "radians")
x <- model.matrix(y ~ ., data = df)
m1 <- lm.circular(y = y, x = x, type = "c-l", init = c(1, .01, .5, .5, .5, .5))

The implementation tries to set up some diagonal matrices of size n x n using
A <- diag(k * A1(k), nrow = n)
g.p <- diag(apply(x, 1, function(row, betaPrev) 2/(1 + (t(betaPrev) %*% row)^2),
                  betaPrev = betaPrev), nrow = n)
(in circular:::LmCircularclRad) without using any sparse-matrix tricks. For your example, each of those matrices would take about 50 GB of memory, and that allocation fails.
I don't think there's anything you can do to avoid this, other than finding (or suggesting to the maintainers) a more efficient way to carry out the required calculations. Linear algebra involving diagonal matrices can usually be done with much less memory, but you'll have to look closely at this code to see whether that's the case here.
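To make that last point concrete, here is a tiny illustration (not a drop-in fix for lm.circular) of why a diagonal matrix usually never needs to be formed explicitly:
# Multiplying by a diagonal matrix is just row-wise scaling,
# so the full n x n matrix never has to be allocated.
n <- 5
d <- runif(n)                       # the diagonal entries
X <- matrix(rnorm(n * 3), n, 3)
dense  <- diag(d, nrow = n) %*% X   # allocates an n x n matrix
scaled <- d * X                     # same result, only O(n) extra memory
all.equal(dense, scaled)
#> [1] TRUE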

Related

How to speed up row-wise computations on a data.frame using alternative functions to apply family?

I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I expect the function to return a scalar, which is the value of the statistic I am calculating. Below is what I have done so far:
library(dfadjust)
library(MASS)

# Creating example data
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov = data.frame(group = as.factor(rep(c(1, 2), c(n1, n2))))

# Function to evaluate on each row of df
get_est = function(x) {
  mod = rlm(x ~ cov$group)
  fit = dfadjustSE(mod)
  coef = fit$coefficients[2, 1]
  se = fit$coefficients[2, 4]
  stats = coef / se
  return(stats)
}
# Applying above function to full data
t1 = Sys.time()
estimates = apply(df, 1, function(x) get_est(x))
t2 = Sys.time() - t1
t2
# Time taken by the apply() call:
#> Time difference of 37.10623 secs
Is there a way to significantly decrease the time taken to run get_est() on the full data? The main reason I need to speed up the computation on a single data frame is that I have 1,000 more data frames with the same dimensions, and I have to apply this function to each row of each of them. To illustrate, below is the broader situation I am dealing with:
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for (i in 1:1000) {
  df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row of each data frame in df.list
library(foreach)
library(doParallel)
registerDoParallel()   # register a parallel backend for %dopar% (adjust the number of workers as needed)
pcks = c('MASS', 'dfadjust')
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = pcks) %dopar% {
  cov = data.frame(group = as.factor(rep(c(1, 2), c(n1, n2))))
  est = apply(df.list[[j]], 1, function(x) get_est(x))
  return(est)
}
Even after parallelizing, it takes hours to finish. My ultimate objective is to significantly cut down the time to obtain all.est, which will contain 10,000 rows and 1,000 columns, where each column holds the estimates for the respective data set. Any help is much appreciated! Thanks in advance!
By doing some preprocessing of the data, we can adjust the rlm call so there is less overhead:
# Build the design matrix once, outside the function.
# NOTE: `y` here must be a numeric group indicator; it is not defined in the
# question's code (something like as.numeric(cov$group)).
x3 <- as.matrix(cbind(1L, y))
colnames(x3) <- c('(Intercept)', 'x')
w <- rep(1, nrow(x3))
get_est <- function(x) {
  # matrix interface of rlm: design matrix first, response second,
  # which skips the formula/model.frame overhead
  mod <- rlm(x3, x, weights = w, w = w)
  fit <- dfadjustSE(mod)
  coef <- fit$coefficients[2, 1]
  se <- fit$coefficients[2, 4]
  stats <- coef / se
  return(stats)
}
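For reference, the adjusted function is then called exactly as before (this assumes df, cov and hence x3 and w are set up as above):
# Time the adjusted function on the same data (same call pattern as the original)
t1 <- Sys.time()
estimates <- apply(df, 1, get_est)
Sys.time() - t1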
I got 12 seconds instead of 18 seconds for the initial approach on my machine, roughly a 33% improvement.
For larger speedups I would suggest looking into the rlm and dfadjustSE functions and trying to optimize them for your specific needs (removing unnecessary checks, etc., since you are calling these functions millions of times). But that will probably be quite time-consuming, and better performance is not guaranteed. Maybe there are other packages with similar but faster functions?

How to accomplish replicated calculation and plot in subset dataset?

I have simulated data created like this:
library(MASS)   # for mvrnorm()
average_vector = c(0, 0, 25)
sigma_matrix = matrix(c(4, 1, 0,
                        1, 8, 0,
                        0, 0, 9), nrow = 3, ncol = 3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n = 20000, mu = average_vector, Sigma = sigma_matrix))
names(data0) = c("hard", "smartness", "age")
set.seed(13579)
data0$final = 0.5 * data0$hard + 0.2 * data0$smartness + (-0.1) * data0$age +
  rnorm(n = dim(data0)[1], mean = 90, sd = 6)
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step I ran into an issue: I want to run a regression model on the 50 selected people (1,000 times), record/store the point estimates of "hard" from model 4, which is given by
model4 = lm(formula = final ~ hard + smartness + age, data = data0), and then plot the variation around the line of 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n  <- nrow(data0)
k  <- 1000   # number of resamples
ns <- 50     # sample size per resample
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for (i in seq_len(k)) {   # loop over the k resamples (not the n rows)
  params[[i]] <- coef(lm(formula = final ~ hard + smartness + age,
                         data = data0[samples[, i], ]))
}
# merge params into columns
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"], ylab = "Estimate of 'hard'")
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.
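If a distribution view of the variation around the true value 0.5 is preferred over the scatter plot above, a histogram of the stored estimates works too (a small addition of mine, reusing the params matrix built above):
# Distribution of the 1,000 estimates of the "hard" coefficient
hist(params[, "hard"], breaks = 30,
     main = "Estimates of 'hard' across 1,000 samples of 50",
     xlab = "Coefficient estimate")
abline(v = 0.5, col = "red", lwd = 2)   # true value used to generate the data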

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using RStudio.
# create the sample data frame
# (in my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow = 100, ncol = 50000))
DF[, ] <- rnorm(100 * 50000)
colnames(DF) <- c("y", paste0("x", 1:49999))

# get names of covariates
my_xvars <- colnames(DF)[2:ncol(DF)]

# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR:
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model by selecting fewer variables is not the solution I am looking for right now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of length choose(nc, 2) = nc*(nc-1)/2, where nc is the number of columns. When I try this with 50,000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3 Gb are required when nc is 25,000). The code below runs on my laptop when nc <- 10000.
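As a quick sanity check on those numbers (my own back-of-the-envelope arithmetic, not part of the original code; rbar_gb is just an illustrative helper):
# Memory needed for a length choose(nc, 2) vector of doubles (8 bytes each)
rbar_gb <- function(nc) choose(nc, 2) * 8 / 2^30
rbar_gb(25000)   # ~2.3 Gb
rbar_gb(50000)   # ~9.3 Gb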
I have a few caveats about this approach:
you won't be able to handle a problem with 50,000 columns unless you have at least 10 GB of memory, because of the issue described above.
biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard).
I have no idea if the p >> n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10,000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails, or if there's a way to work around the problem, or if it's fundamental (so that you would need nr > nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr * nc), nrow = nr))
respvars <- paste0("x", seq(nc - 1))
names(DF) <- c("y", respvars)

# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars, response = "y"))

source("my_biglm.R")
rr <- my_biglm(y = DF[, 1], X = as.matrix(DF[, -1]))
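Since the returned object is given class "biglm", the usual accessors should still work on it (my note, not part of the original answer):
# Inspect the fit; with only nr = 10 rows, most of the nc - 1 coefficients come back NA
length(coef(rr))
head(coef(rr))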
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y = NULL, X = NULL, off = 0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula"))
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        ## standard biglm route: build the model matrix from the formula
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf)))
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, "assign"),
                 terms = tt, n = NROW(mm), names = colnames(mm),
                 weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        ## the ::: prefixes are needed here too, since these internals are not exported
        xyqr <- biglm:::bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}

Looking for ways to optimize R's sparse.model.matrix

I have sparse data problems that generally require computing a sparse model matrix. The matrix I end up with contains ~95% zeroes, usually because of factors that get one-hot encoded, plus interactions involving those sparse dummy vectors.
require(Matrix)
require(data.table)
require(magrittr)

n = 500000
p = 10
x.matrix = matrix(rnorm(n * p), n, p)
colnames(x.matrix) = sprintf("n%s", 1:p)

x.categorical = data.table(
  c1 = sample(LETTERS, n, replace = TRUE),
  c2 = sample(LETTERS, n, replace = TRUE),
  c3 = sample(LETTERS, n, replace = TRUE)
)

# cBind() is defunct in recent versions of Matrix; plain cbind() of data.tables works
x = cbind(data.table(x.matrix), x.categorical)

myformula = "~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8 + n9 + n10 +
             c1 + c1*c2 + c3 + n1:c1"

mm  = model.matrix(myformula %>% as.formula, x)
mm2 = sparse.model.matrix(myformula %>% as.formula, x)
I have found that the performance of sparse.model.matrix on a sparse problem is worse than that of model.matrix (normally used for dense problems). This is revealed by RStudio's profiling tools.
In the profiling output, sparse.model.matrix takes much more time than model.matrix and uses almost the same amount of memory. In some problems I have found sparse.model.matrix to be up to 10x slower than model.matrix when working with data that should be sparse.
Are there better ways to create the sparse matrix? I have searched quite a lot and have not found any. Alternatively, I would be interested in finding other approaches, or in tips on how to implement a smarter version of sparse.model.matrix from scratch, perhaps using Rcpp or data.table functions.
The source of the problem is in sparse2int, although I don't quite understand what it is for, and there are a few "FIXME"s still left in the code.
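Not part of the answer above, but to illustrate what a hand-rolled alternative could look like: the one-hot block for a single factor can be built directly with Matrix::sparseMatrix, skipping the formula machinery entirely (a minimal sketch using the c1 column from the question's data):
library(Matrix)
# One-hot encode a single factor column as a sparse matrix:
# row i gets a 1 in the column of its level, everything else stays zero.
f <- factor(x.categorical$c1)
one_hot <- sparseMatrix(i = seq_along(f),
                        j = as.integer(f),
                        x = 1,
                        dims = c(length(f), nlevels(f)),
                        dimnames = list(NULL, paste0("c1", levels(f))))
dim(one_hot)   # n rows, one column per level
Note that this keeps all levels (no reference level is dropped), unlike the default treatment contrasts used by model.matrix; interaction blocks could be assembled from such columns in the same spirit.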

Separating circles using kernel PCA

I am trying to reproduce a simple example of using kernel PCA. The objective is to separate out the points from two concentric circles.
Creating the data:
library(dplyr)   # for the pipe and select(); not shown in the original snippet
circle <- data.frame(radius = rep(c(0, 1), 500) + rnorm(1000, sd = 0.05),
                     phi = runif(1000, 0, 2 * pi),
                     group = rep(c("A", "B"), 500))
#
circle <- transform(circle,
                    x = radius * cos(phi),
                    y = radius * sin(phi),
                    z = rnorm(length(radius))) %>%
  select(group, x, y, z)
TFRAC = 0.75
#
train <- sample(1:1000, TFRAC * 1000)
circle.train <- circle[train,]
circle.test <- circle[-train,]
> head(circle.train)
    group         x          y        z
491     A -0.034216 -0.0312062  0.70780
389     A  0.052616  0.0059919  1.05942
178     B -0.987276 -0.3322542  0.75297
472     B -0.808646  0.3962935 -0.17829
473     A -0.032227  0.0027470  0.66955
346     B  0.894957  0.3381633  1.29191
I have split the data into training and testing sets because I intend (once I get this working!) to test the resulting model.
In principle, kernel PCA should allow me to separate the two classes. Other discussions of this example have used the Radial Basis Function (RBF) kernel, so I adopted it too. In R, kernel PCA is implemented in the kernlab package.
library(kernlab)
circle.kpca <- kpca(~ ., data = circle.train[, -1], kernel = "rbfdot",
                    kpar = list(sigma = 10), features = 1)
I requested only the first component and specified the RBF kernel. This is the result:
There has definitely been a major transformation of the data, but the result is not what I was expecting (a nice, clean separation of the two classes). I have tried fiddling with the value of the parameter sigma and, although the results do vary dramatically, I still didn't get what I was expecting. I assume that sigma is related to the parameter gamma mentioned here, possibly via the relationship given here (without the negative sign?).
I'm pretty sure that I am making a naive rookie error here and I would really appreciate any pointers which would get me onto the right track.
Thanks,
Andrew.
Try sigma = 20; I think you will get the answer you are looking for. The sigma in kernlab is actually what is usually referred to as gamma for the RBF kernel, so it is inversely related to the usual Gaussian bandwidth.
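A minimal sketch of what applying that suggestion could look like, reusing circle.train from the question (the plotting code and colour choices are mine, not from the answer):
library(kernlab)
# Refit with the suggested sigma = 20 and look at the first kernel principal component
circle.kpca <- kpca(~ ., data = circle.train[, -1], kernel = "rbfdot",
                    kpar = list(sigma = 20), features = 1)
proj <- rotated(circle.kpca)   # projections of the training points
plot(proj[, 1], col = ifelse(circle.train$group == "A", "blue", "red"),
     ylab = "First kernel principal component")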
