Memory-efficient representation of model.matrix - r

Assume we have a large data.table object with model variables:
library(data.table)
library(magrittr)
library(pryr)
library(caret)
# 10 million draws reshaped into 1,000,000 rows x 10 numeric columns
df <- rnorm(10000000, 0, 1) %>% matrix(., ncol = 10) %>% as.data.table
df[, factor_vars := LETTERS[sample(1:26, 1000000, replace = TRUE)]]
df[, factor_vars2 := LETTERS[sample(1:5, 1000000, replace = TRUE)]]
I'm looking for an efficient way of building a model matrix from the data. At the moment the best way I've found is by using caret::dummyVars, in the following manner:
dd_object <- dummyVars(~ -1 + V1 + V2 + V3 + V4 + V5 + V6 + V7 +
                         V8 + V9 + V10 + I(as.character(factor_vars)) +
                         I(as.character(factor_vars2)),
                       data = df)
Note that this creates a very convenient object for exporting and recreating without the original data.
object_size(dd_object)
R> 17.3 kB
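For example (a minimal sketch; the file name is arbitrary), the recipe can be saved on its own and reused on fresh data later:
saveRDS(dd_object, "dd_object.rds")            # tiny file, carries no training data
dd_restored <- readRDS("dd_object.rds")
MM_new <- predict(dd_restored, newdata = df)   # rebuild the model matrix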
On the other hand, just like base::model.matrix, the predicted output retains the inefficiencies of a dense matrix when dealing with many zeroes, i.e.:
MM1 <- predict(dd_object, newdata = df)
object_size(MM1)
R> 392 MB
object_size(df)
R> 96 MB
Note that the sizes can blow up very easily as more dummy variables are added; this is just for demonstration purposes.
My question: I want to use the same model-matrix object with various well-known modelling packages (glm, glmnet, xgboost, etc.). The sparse matrix representation from the Matrix package does sound nice and efficient, but not every package is able to work with it, and the as.matrix(.) transformation is a pain in that case.
Are there any known solutions for my case? I'm looking for something with greater efficiency than the base matrix (possibly like sparse matrices) and the capability of forming a storable model.matrix object just like caret::dummyVars is able to do.
The desired workflow could be something along the lines of
fread %>% predict(dummyVars_object, newdata = .) %>% predict(some_Model, newdata = .)
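For comparison, here is a minimal sketch of the sparse route weighed above (using Matrix::sparse.model.matrix on the same df; exact savings depend on how many of the columns are dummies):
library(Matrix)
# sparse equivalent of the dense model matrix, intercept dropped as before
MM_sparse <- sparse.model.matrix(~ . - 1, data = df)
object_size(MM_sparse)  # smaller than the dense MM1; savings grow with more dummies
# packages that insist on a dense input force the conversion back:
# MM_dense <- as.matrix(MM_sparse)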

Related

Memory problems while using lm.circular()

I am trying to run circular regression in R using the circular package. My dataset is somewhat large, ~85000 rows and 6 variables. When I try to run the model, I get an error message reading "Error: cannot allocate vector of size 53.3 Gb." I am more of a statistician than a programmer, so I can't figure out how to fix this; it seems odd that it is requesting such a large memory allocation, as my dataset is not that large. I have attached a fictional dataset and code below. Thank you.
library(circular)
set.seed(12)
n = 80000
df <- data.frame(y  = rnorm(n, 2, .2),
                 x1 = rnorm(n, 100, 2),
                 x2 = rnorm(n, 0, 1),
                 x3 = rnorm(n, 9, .2),
                 x4 = rnorm(n, 0, 1),
                 x5 = rnorm(n, 1, .1))
y <- circular(df$y, type = "angles", units = "radians")
x <- model.matrix(y ~., data = df)
m1 <- lm.circular(y = y, x = x, type = "c-l", init = c(1,.01,.5,.5,.5,.5))
The implementation tries to set up some diagonal matrices of size n x n using
A <- diag(k * A1(k), nrow = n)
g.p <- diag(apply(x, 1, function(row, betaPrev) 2 / (1 + (t(betaPrev) %*% row)^2),
                  betaPrev = betaPrev), nrow = n)
(in circular:::LmCircularclRad) without using any sparse matrix tricks. A dense n x n matrix of doubles needs 8 * n^2 bytes, so for your example (n = 80000) each of those matrices would take roughly 50 GB of memory, and that allocation fails.
I don't think there's anything you can do to avoid this, other than suggesting a more efficient way to carry out the required calculations. Usually linear algebra using diagonal matrices can be done with much less memory use, but you'll have to look closely at this code to see if that's the case here.
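To illustrate that last point (the general trick only, not a patch to the circular package): multiplying by a diagonal matrix never requires materializing the full n x n object.
# scaling rows by a vector equals multiplying by diag(v),
# but needs O(n) instead of O(n^2) memory
n <- 5
v <- rnorm(n)
X <- matrix(rnorm(n * 3), n, 3)
dense <- diag(v) %*% X    # allocates an n x n matrix first
cheap <- v * X            # same result, no n x n allocation
all.equal(dense, cheap)   # TRUE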

Looking for ways to optimize R's sparse.model.matrix

I have sparse data problems that generally require computing a sparse model matrix. The matrix I should receive in the end contains ~95% zeroes. This is usually due to factors that get one-hot encoded (becoming sparse) and to interactions taken with these sparse vectors.
require(Matrix)
require(data.table)
require(magrittr)
n = 500000
p = 10
x.matrix = matrix(rnorm(n * p), n, p)
colnames(x.matrix) = sprintf("n%s", 1:p)
x.categorical = data.table(
  c1 = sample(LETTERS, n, replace = TRUE),
  c2 = sample(LETTERS, n, replace = TRUE),
  c3 = sample(LETTERS, n, replace = TRUE)
)
# cBind() from Matrix is deprecated; plain cbind() on data.tables works here
x = cbind(as.data.table(x.matrix), x.categorical)
myformula = "~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8 + n9 + n10 +
             c1 + c1*c2 + c3 + n1:c1"
mm  = model.matrix(myformula %>% as.formula, x)
mm2 = sparse.model.matrix(myformula %>% as.formula, x)
I have found that the performance of sparse.model.matrix on a sparse problem is worse than that of model.matrix (normally used for dense problems). This is revealed by RStudio's profiling tools: sparse.model.matrix takes much more time than model.matrix and uses almost the same amount of memory. In some problems I have found sparse.model.matrix to be up to 10x slower than model.matrix when working with data that should be sparse.
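A quick way to reproduce the comparison without the profiler (plain system.time from base R; numbers will of course vary by machine):
system.time(model.matrix(myformula %>% as.formula, x))
system.time(sparse.model.matrix(myformula %>% as.formula, x))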
Are there better ways to create the sparse matrix? I have searched quite a lot and have not found any. Alternatively, I would be interested in finding other approaches, or in getting tips on how to implement a smarter version of sparse.model.matrix from scratch, perhaps using Rcpp or data.table functions.
The source of the problem is in sparse2int, although I don't quite understand what it is for, and there are a few "FIXME"s still left in the code.
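As a starting point for a from-scratch version, here is a sketch for a single factor only (no interactions or contrasts): one-hot columns can be built directly with Matrix::sparseMatrix from triplet indices.
library(Matrix)
f <- factor(x.categorical$c1)
onehot <- sparseMatrix(i = seq_along(f),     # row index of each observation
                       j = as.integer(f),    # column index = factor level
                       x = 1,
                       dims = c(length(f), nlevels(f)),
                       dimnames = list(NULL, levels(f)))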

correlation matrix of a bunch of categorical variables in R

I have about 20 variables about different cities, labeled "Y" or "N" and stored as factors. The variables are things like "has co-op". I want to find some correlations and possibly use the corrplot package to display the connections between all these variables. But for some reason I cannot coerce the variables into a form that corrplot or even cor() accepts, so that I can get them into a matrix. I tried:
M <- cor(model.matrix(~.-1,data=mydata[c(25:44)]))
but the results in corrplot came out really weird. Does anyone have a fast way to turn a bunch of Y/N answers into a correlation matrix? Thanks!
You can use the sjp.corr function (graphical output) or the sjt.corr function (tabular output), both from the sjPlot package.
DF <- data.frame(v1 = sample(c("Y","N"), 100, TRUE),
                 v2 = sample(c("Y","N"), 100, TRUE),
                 v3 = sample(c("Y","N"), 100, TRUE),
                 v4 = sample(c("Y","N"), 100, TRUE),
                 v5 = sample(c("Y","N"), 100, TRUE))
DF[] <- lapply(DF,as.integer)
library(sjPlot)
sjp.corr(DF)
sjt.corr(DF)
[The resulting plot and the table (rendered in the RStudio viewer pane) are not reproduced here.]
You can use many parameters to modify the appearance of the plot or table; see some examples here.
For binary variables, you might consider cross tabs (the table function in R).
However, getting the correlation matrix is pretty straightforward:
# example data
set.seed(1)
DF <- data.frame(x = sample(c("Y","N"), 100, TRUE),
                 y = sample(c("Y","N"), 100, TRUE))
# how to get correlation
DF[] <- lapply(DF,as.integer)
cor(DF)
#            x          y
# x  1.0000000 -0.0369479
# y -0.0369479  1.0000000
# visualize it
library(corrplot)
corrplot(cor(DF))
When you convert to integer in this example, "N" becomes 1 and "Y" becomes 2, because factor levels are sorted alphabetically by default; if your factors were created with explicit levels, the mapping may differ. To check the mapping for your data, try lapply(DF, levels) before converting to integer.
To me, the plot makes sense. If you have questions about the statistical interpretation of correlations in this context, you should consider having a look at http://stats.stackexchange.com
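For the cross-tab route mentioned above, table works directly on a pair of columns (run it before the as.integer conversion if you want to keep the Y/N labels):
# counts of each combination of the two variables
table(DF$x, DF$y)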

Generating multiple datasets, applying a function, and outputting multiple datasets

Here is my problem, just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and collect the corresponding output in a single dataset or in multiple datasets (whichever is possible)...
My example follows, although in reality I need to generate a large number of variables and datasets:
seed <- round(runif(10) * 1000000)
datagen <- function(x){
  set.seed(x)
  var <- rep(1:3, c(rep(3, 3)))
  yvar <- rnorm(length(var), 50, 10)
  matrix <- matrix(sample(1:10, c(10 * length(var)), replace = TRUE), ncol = 10)
  mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply(seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function: this will perform an ANOVA on every component data frame and
# output probability coefficients...
anovp <- function(x){
  ind <- 3:ncol(x)
  out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
  pval <- out$coefficients[,4][2]
  pval <- do.call(rbind, pval)
}
plist <- lapply(gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working; I tried different options but could not figure it out, so I finally decided to bother the experts, sorry for that...
My questions are:
(1) Is this possible to handle such situation in this way or there are other alternatives to handle such multiple datasets created?
(2) If this is right way, how can I do it?
Thank you for attention and I will appreciate your help...
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix, since those names mask the existing functions var and matrix, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns it doesn't do what you think it does: with 2 columns, for example, it evaluates to 3:2, which counts down to c(3, 2).
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so inside the function x is already a data frame; gdt[x] then fails with the error above.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust it to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
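If the goal is the p-value of the first predictor (my guess at what the [,4][2] indexing was reaching for), the summary method supplies the matrix the original code seems to expect:
# Pr(>|t|) is column 4 of summary()$coefficients; row 2 is the first predictor
pvals <- sapply(models, function(model) summary(model)$coefficients[2, 4])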

Looping to extract coefficients from multiply imputed mer objects

I am having a hard time wrapping my head around this problem. I have a list, results4 which contains 5 elements, all of which are mer objects from the zelig package. The mer objects are the result of ls.mixed regressions on each of five imputed datasets. I am trying to combine the results using Rubin's Rules for Multiple Imputation.
I can extract the coefficients and standard errors using summary(results4[[1]])@coefs, which returns a 16x3 matrix (16 variables, each with a point estimate, standard error, and t-statistic).
I am trying to loop over the five sets of results and automate the process of combining the point estimates and standard errors, but unfortunately I seem to be staring at it with no solution arising. Any suggestions?
The code that produces the mer objects follows (variable names changed):
for (i in 1:5) {
  results4[i] <- zelig(DV ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 +
                         V9 + V10 + V11 + V12 + V13 + V14 + V15 + tag(1 | L2),
                       data = as.data.frame(w4[,,i]), model = "ls.mixed", REML = FALSE)
}
I'm not going to take the time to code up the multiple-imputation rules (someone who wants the credit can take what I show here and build on it), but I think you should be able to do what you want by building a 16x3x5 array containing the results:
resultsList <- lapply(results4, function(x) summary(x)@coefs)
library(abind)
resultsArr <- abind(resultsList, along = 3)
and then using apply appropriately across the margins.
There's probably a plyr-based solution as well.
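For instance (a sketch of the apply step only, not the full Rubin's rules), averaging each estimate/SE/t cell across the five imputations looks like:
# collapse the third (imputation) margin, keeping the 16x3 layout
pooledMeans <- apply(resultsArr, c(1, 2), mean)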
You could also do this less fancily by just defining the array up front and filling it in as you go:
sumresults <- array(dim = c(16, 3, 5))
for (...) {
  ...
  sumresults[,,i] <- summary(results4[[i]])@coefs
}
