Fit many linear models in R with identical design matrices [duplicate]

This question already has an answer here: Fitting a linear model with multiple LHS.
For a neuroimaging application, I'm trying to fit many linear models by least squares in R (standard call to lm). Imagine I have a design matrix X. This design matrix will be the same across all of the models. The data (Y) that is being fit will change, and as a result so will all of the fit parameters (e.g. betas, p-values, residuals, etc).
At present, I'm just sticking it in a for loop, so it's doing hundreds of thousands of calls to lm. It seems like there's got to be a better way.
I believe the most computationally expensive piece is the matrix inversion. It looks like this gets handled with a Fortran call in lm.fit.
If I were doing this regression by hand, I'd do the matrix inversion, then just multiply it by the various datasets. In fact, I've coded up a function to do that when I have well-behaved design matrices (e.g. all continuously valued covariates). However, I really like all of the work that lm does, like recoding my factors appropriately, etc, and the output of lm is really nice, too.
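For concreteness, that by-hand approach amounts to a single factorization reused across every dataset, something like this (X is the design matrix, Y a matrix with one dataset per column):
# Solve (X'X) B = (X'Y) for all response columns at once
betas <- solve(crossprod(X), crossprod(X, Y))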
Is there any way to have my cake and eat it too? Namely, to get the friendliness of lm while efficiently fitting many models with identical design matrices?

Yes, there is a better way. We have been writing example replacement functions, fastLm(), based on external C/C++ code from Armadillo, GSL and Eigen, in the packages RcppArmadillo, RcppGSL and RcppEigen.
By far the largest amount of time is spent setting up the model matrix and deparsing the formula. You can read the source of lm(), or maybe ours in fastLm(), and see how to do this parsing just once. Keep the right-hand side, and then loop over your different y vectors. Which fitting function you use matters less. I like fastLm() from RcppArmadillo, but hey, I also wrote it :)
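A minimal sketch of that idea (dat, Y, x1, x2 and g are hypothetical placeholders for your covariates and response matrix; lm.fit is base R's low-level fitter):
# Parse the formula and build the model matrix once;
# model.matrix handles factor recoding, just as lm would
X <- model.matrix(~ x1 + x2 + g, data = dat)
# Reuse it for every response; lm.fit skips all formula handling
fits <- lapply(seq_len(ncol(Y)), function(j) lm.fit(X, Y[, j]))
fits[[1]]$coefficients   # estimates for the first response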

From the help page for lm:
If ‘response’ is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
So it would seem that a simple approach would be to combine all the different y vectors into a matrix and pass that as the response in a single call to lm. For example:
(fit <- lm(cbind(Sepal.Width, Sepal.Length) ~ Petal.Width + Petal.Length + Species, data = iris))
summary(fit)                                    # one summary per response column
summary(fit)[[2]]                               # summary for the second response alone
coef(summary(fit)[[2]])                         # its coefficient table
sapply(summary(fit), function(x) x$r.squared)   # R-squared for every response

I do not know of a better way using lm, but you may want to consider the function lsfit. Although simpler, with fewer bells and whistles, the syntax lsfit(X, y) allows y to be not just a vector of response values but also a matrix. A single call to lsfit then fits all columns of y by regressing them on the same design matrix X. Quite fast and handy.
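A quick illustration with simulated data (the names and sizes here are made up for the example):
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)   # shared design matrix
Y <- matrix(rnorm(100 * 5), 100, 5)   # five response vectors
fits <- lsfit(X, Y)                   # an intercept is added by default
fits$coefficients                     # one column of estimates per response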

Related

Keras predict repeated columns

I have a question about keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[attached image: data structure]
I don't know if anyone has suggestions on a method that would let predict run quickly and use less memory. To predict, I create matrices according to the table shown, each 200,000 x 200, and then use sapply to predict over all of them. However, even though predict is fast for each matrix, building the matrices is slow, so the whole run takes two or three times as long, and that is without counting the sapply step. I wonder if keras has a "smart" way to exploit the fact that in each matrix the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems to be used only for training. I already have the model and just need to predict.
Thank you so much everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
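A rough sketch of what that might look like (assuming a trained model and a numeric input matrix x; the batch size is arbitrary, and I believe recent versions of the keras package accept a dataset directly in predict, but check the tfdatasets documentation):
library(keras)
library(tfdatasets)
# Stream the input in batches rather than materializing large copies
ds <- tensor_slices_dataset(x) %>% dataset_batch(8192)
preds <- predict(model, ds)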

Model matrices in R, for mixed models

I'm writing my own software package for a certain type of mixed models in R and would like an easy and efficient way to form design matrices. The fixed part ("X") is not a problem, I use model.matrix, or sparse.model.matrix from the Matrix package if I want it sparse.
I'm looking for an equivalent function, preferably in a well tested package, to create the random effects design matrix ("Z").
For example, if I fit an lmer / glmer model using the lme4 package, I can get the desired matrix by calling getME(fitted_object, "Zt"), but I don't want to fit a model just to get the "Zt" or Z matrix. Ideally, I would like to call (e.g.) model.matrix.random(formula) and get it directly, for any valid formula.
In response to a discussion in the comments, ideas about how such a function could be programmed are also welcome, but I was hoping there would be a well tested implementation already. I tried looking through the source code for lme4 to see how they do it but wasn't able to find the relevant code.
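One possibility, offered as a sketch rather than a definitive answer: lme4's modular fitting API includes lFormula(), which does all the formula processing without fitting a model, and its output contains the sparse random-effects design matrix:
library(lme4)
# Build all design structures from the formula, but do not fit
lf <- lFormula(Reaction ~ Days + (Days | Subject), data = sleepstudy)
Zt <- lf$reTrms$Zt   # sparse, transposed random-effects matrix
Z  <- t(Zt)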

Formula interface for glmnet

In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, eg sensible (?) treatment of factors, missing values, putting variables into the correct order, etc.
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions before I do so:
Are there any issues that complicate using the formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and wide datasets maybe requiring sparse model matrices.)
Is there any existing package that does this?
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from Github: https://github.com/Hong-Revo/glmnetUtils
Or in R, using devtools::install_github:
install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)
From the readme:
Some quality-of-life functions to streamline the process of fitting elastic net models with glmnet, specifically:
glmnet.formula provides a formula/data frame interface to glmnet.
cv.glmnet.formula does a similar thing for cv.glmnet.
Methods for predict and coef for both the above.
A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in parallel.
Methods for plot, predict and coef for the above.
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) roughly a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
Update Oct-2016
I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.
Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.
The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.
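For illustration, a minimal usage sketch (use.model.frame is, if I recall the package docs correctly, the argument that switches between the two code paths described above):
library(glmnetUtils)
# Default: model matrix built term by term, no baseline level dropped
mod <- glmnet(mpg ~ cyl + disp + hp, data = mtcars)
# Opt back into the standard model.frame machinery if preferred
cvmod <- cv.glmnet(mpg ~ ., data = mtcars, use.model.frame = TRUE)
predict(mod, newdata = mtcars)   # predictions across the lambda path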
Update Apr-2017
After a few hiccups, this is finally on CRAN.

Is there a way to input a covariance matrix (or something like that) into lme4 in R?

I have a very large data set that I extract from a data warehouse. To download the data set to the box where I want to run lme4 takes a long time. I would like to know if I could process the data into a covariance matrix, download that data (which is much smaller), and use that as the data input to lme4. I have done something similar to this for multiple regression models using SAS, and am hoping I can create this type of input for lme4.
Thanks.
I don't know of any way to use the observed covariance matrix to fit an lmer model. But if the goal is to reduce data set size in order to speed up analysis, there may be simpler approaches. For example, if you don't need the conditional modes of the random effects, and you have a very large sample size, then you might try fitting the model to progressively larger subsets of the data until the estimates of the fixed effects and the covariance matrix of the random effects 'stabilize'. This approach has worked well in my experience, and has been discussed by others:
http://andrewgelman.com/2012/04/hierarchicalmultilevel-modeling-with-big-data/
Here's a relevant quotation:
"Related to the “multiple model” approach are simple approximations that speed the computations. Computers are getting faster and faster—but models are getting more and more complicated! And so these general tricks might remain important. A simple and general trick is to break the data into subsets and analyze each subset separately. For example, break the 85 counties of radon data randomly into three sets of 30, 30, and 25 counties, and analyze each set separately." Gelman and Hill (2007), p.547.

Fitting a binormal distribution in R

As the title says, I have some data that is roughly binormally distributed, and I would like to find its two underlying components.
I am fitting the data distribution with the sum of two normal densities with means m1 and m2 and standard deviations s1 and s2, scaled by weights such that w1 + w2 = 1.
I can do this using the vglm function of the VGAM package, for example:
fitRes <- vglm(mydata ~ 1,
               mix2normal1(equalsd = FALSE, iphi = w, imu1 = m1, imu2 = m2,
                           isd1 = s1, isd2 = s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data in a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried nls and mle, which look much faster, but mostly failed to get good fits (though I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose constraints with those functions (w in [0;1] and w1+w2=1).
2) How do I automagically choose good starting parameters (I know this is a $1 million question, but you never know, maybe someone has the answer)? Right now I have a little interface that lets me choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of using the x values corresponding to the 3rd and 4th quartiles of y as starting values for the two means. Do you think that would be reasonable?
First things first:
did you try to search for fit mixture model on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
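For example, the mixtools package (one of many options from that literature, mentioned here only as a suggestion) fits a two-component normal mixture by EM and can choose its own starting values:
library(mixtools)
fit <- normalmixEM(mydata, k = 2)   # starting values are optional
fit$lambda   # estimated weights (sum to 1)
fit$mu       # component means
fit$sigma    # component standard deviations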
