Speed up estimation of overlapping additive models (mgcv) - r

I have some set of variables and I'm fitting many (hundreds of thousands) additive models, each of which includes a subset of all the variables. The dependent variable is the same in every case, and some of the models overlap or are nested. Not all of the independent variables have to enter the model nonparametrically. For clarity, I might have a set of variables {x1,x2,x3,x4,x5} and estimate:
a) y=c+f(x1)+f(x2),
b) y=c+x1+f(x2),
c) y=c+f(x1)+f(x2)+x3, etc.
I'm wondering if there is anything I can do to speed up the gam estimation in this case? Is there anything that is being calculated over and over again that I could calculate once and supply to the function?
What I have already tried:
Memoization since the models repeat exactly from time to time.
Reluctantly switched from thin plate regression splines to cubic regression splines (quite a significant improvement).
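Roughly, the memoised cubic-spline fits look like this (a minimal sketch; dat, x1, x2 are stand-ins for my actual variables):
library(mgcv)
library(memoise)
# cache fits keyed on the formula text, since the same model recurs
fit_one <- memoise(function(f) gam(as.formula(f), data = dat))
m_a <- fit_one("y ~ s(x1, bs = 'cr') + s(x2, bs = 'cr')")
m_b <- fit_one("y ~ x1 + s(x2, bs = 'cr')")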
The mgcv guide says:
The user can retain most of the advantages of the t.p.r.s. approach by supplying a reduced set of covariate values from which to obtain the basis - typically the number of covariate values used will be substantially smaller than the number of data, and substantially larger than the basis dimension, k.
This caused quite a noticeable improvement with smaller models, e.g. 5 smooths, but not with larger models, e.g. 10 smooths. In fact, in the latter case, it often caused the estimation to take (potentially much) longer.
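For reference, this is roughly how I supplied the reduced covariate set, via gam's knots argument (a sketch; the 200-point quantile grid and the variable names are arbitrary):
library(mgcv)
xk <- lapply(dat[, c("x1", "x2")],
             function(v) quantile(v, probs = seq(0, 1, length.out = 200)))
b <- gam(y ~ s(x1, bs = "tp", k = 10) + s(x2, bs = "tp", k = 10),
         data = dat, knots = xk)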
What I'd like to try but don't know if it's possible:
One obvious thing that repeats itself in both, say, y=c+f(x1)+f(x2) and y=c+x1+f(x2), is the calculation of the basis for f(x2). If I were to use the same knots every time, how (if it's possible at all) could I precalculate the basis for every variable and then supply that to mgcv? Would you expect this to bring a significant time improvement?
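For what it's worth, this is the kind of precalculation I have in mind, using mgcv's smoothCon() (a sketch only -- I don't know whether the result can be fed back into gam in a way that actually saves time):
library(mgcv)
# construct the basis for f(x2) once, outside of any model fit
sm_x2 <- smoothCon(s(x2, bs = "cr", k = 10), data = dat)[[1]]
X_x2 <- sm_x2$X  # basis columns for f(x2)
S_x2 <- sm_x2$S  # associated penalty matrices (a list)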
Is there anything else you'd recommend?

Related

Downsizing a lm object for plotting

I'd like to use check_model() from {performance} but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?
# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)
# checking model assumptions
performance::check_model(model)
Created on 2022-08-23 by the reprex package (v2.0.1)
Alternative: Is downsizing OK? In an ML workflow I'd downsample for tuning, feature selection and feature engineering, for example. But I don't know whether that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity on a downsized sample and then estimate the coefficients on the full sample?).
Speeding up check_model
The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:
For models with many observations, or for more complex models in general, generating the plot might become very slow. One reason might be that the underlying graphic engine becomes slow for plotting many data points. In such cases, setting the argument show_dots = FALSE might help. Furthermore, look at the check argument and see if some of the model checks could be skipped, which also increases performance.
Accordingly, you can turn off the per-point dots with check_model(model, show_dots = FALSE). You can also restrict which checks are run (reducing computation time) by skipping those you are not interested in. For example, you could get only samples from the posterior predictive distribution with check_model(model, check = "pp_check").
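For example (both arguments are documented in ?check_model; model is the lm fit from the question):
# skip drawing the individual data points, which is the slow part for large data
performance::check_model(model, show_dots = FALSE)
# or run only the posterior predictive check
performance::check_model(model, check = "pp_check")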
Implications of Downsampling
Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and post-estimation summaries conditioning on the data will change. Just how much it will change depends on variability of your observations and sample size. With millions of observations, it's probably unlikely to change much -- but maybe some rare data patterns can heavily influence your results during (post)-estimation.
Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, but your mileage may vary because the models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective, it would also be puzzling to do so: the residuals are part of estimation -- if you can fit the model, you can presumably also show the residuals in various ways.
That being said, these plots are themselves approximations to the distribution of interest (i.e. you're implicitly estimating test statistics with some of these plots), and since the central limit theorem applies, things would look roughly the same if you cut out some observations, provided your data are sufficiently large.
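If you want a rough sense of how much a subsample moves things for your data, one quick check is to refit on a random fraction and compare the coefficients (a sketch; dat stands for your full data, the formula is a placeholder, and the 10% fraction is arbitrary):
set.seed(1)
model_full <- lm(y ~ x1 + x2, data = dat)        # your actual formula here
idx <- sample(nrow(dat), size = 0.1 * nrow(dat))
model_sub <- update(model_full, data = dat[idx, ])
cbind(full = coef(model_full), subsample = coef(model_sub))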

Can I trust a full glmer model that converges ONLY with bobyqa and with contrast sum coding?

I am using R 3.2.0 with lme4 version 1.1.8 to run a mixed effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum), which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is: why is this happening, and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data is not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data is not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.
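A sketch of that comparison (the formula, data and random-slope structure are placeholders for yours):
library(lme4)
m_bobyqa  <- glmer(resp ~ A * B + (1 + A | participant) + (1 + A | item),
                   data = dat, family = binomial,
                   control = glmerControl(optimizer = "bobyqa"))
m_default <- glmer(resp ~ A * B + (1 + A | participant) + (1 + A | item),
                   data = dat, family = binomial)
# similar fixed effects and log-likelihoods suggest the bobyqa fit is reasonable
cbind(bobyqa = fixef(m_bobyqa), default = fixef(m_default))
c(logLik(m_bobyqa), logLik(m_default))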

PLM in R with time invariant variable

I am trying to analyze panel data that includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random rather than a fixed effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base the decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions hold. Computationally, pooled OLS will also give you estimates for time-invariant variables.
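For illustration, the usual comparison looks something like this with plm (variable names taken from the question; note that the "within" model drops the time-invariant C):
library(plm)
fixed  <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "within", data = data)
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "random", data = data)
pooled <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "pooling", data = data)
coef(pooled)  # pooled OLS also returns an estimate for the time-invariant C
# Hausman test on the coefficients the two models share; a small p-value
# is evidence against the random effects specification
phtest(fixed, random)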

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data takes an extremely long time (I can't run it on a subset of more than 50k observations).
I can think of two main problems that are slowing down the calculation
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors that have a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on such predictors.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor's levels according to their means (slide 20). This is my Prof's recommendation, and in R it would amount to using an ordered factor (see the sketch below).
Finally, you need to be careful about the influence of this categorical predictor. For example, one thing I know you can do with the randomForest package is to set the mtry parameter to a lower number. This controls the number of variables the algorithm looks through for each split. When it is set lower, your categorical predictor will appear less often relative to the rest of the variables. This will speed up estimation and let the decorrelation built into the randomForest method keep you from overfitting on your categorical variable.
I'd also recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for having low computational requirements.
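As a sketch of the first two recommendations (dat, y and the 16-level factor f are placeholders; the choice of 4 bins is arbitrary):
# mean of the response within each factor level
lev_means <- tapply(dat$y, dat$f, mean)
# ordered factor, with levels sorted by their mean response
dat$f_ord <- factor(dat$f, levels = names(sort(lev_means)), ordered = TRUE)
# or collapse into, say, 4 bins of levels with similar mean response
dat$f_bin <- cut(lev_means[as.character(dat$f)], breaks = 4)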

Profiling SVM (e1071) in R

I am new to R and SVMs and I am trying to profile the svm function from the e1071 package. However, I can't find any large dataset that lets me get a good range of profiling results by varying the size of the input data. Does anyone know how to work svm out? Which dataset should I use? Any particular parameters to svm that make it work harder?
I copy some commands that I am using to test the performance. Perhaps this makes it easier to see what I am trying to do:
# loading libraries
library(class)
library(e1071)
# I've been using golubEsets (more examples available)
library(golubEsets)
# get the data: matrix 7129x38
data(Golub_Train)
n <- exprs(Golub_Train)
# duplicate rows (to make the dataset larger)
n <- rbind(n, n)
# take training sample labels as a vector
samplelabels <- as.vector(Golub_Train@phenoData@data$ALL.AML)
# calculate svm and profile it
Rprof("svm.out")
svmmodel1 <- svm(x = t(n), y = samplelabels, type = "C", kernel = "radial", cross = 10)
Rprof(NULL)
I keep increasing the dataset by duplicating rows and columns, but I hit the memory limit before I manage to make svm work harder...
In terms of "working SVM out" -- what will make SVM work "harder" is a more complex model that is not easily separable, higher dimensionality, and a larger, denser dataset.
SVM performance degrades with:
Dataset size increases (number of data points)
Sparsity decreases (fewer zeros)
Dimensionality increases (number of attributes)
Non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)
Varying Parameters
There are parameters you can change to make SVM take longer. Of course these parameters affect the quality of the solution you will get and may not make sense to use.
Using C-SVM, varying C will result in different runtimes. (The similar parameter in nu-SVM is nu) If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer run times because you are essentially telling SVM you want a narrow-margin solution which fits tightly to the data and that will take much longer to compute when the data doesn't easily separate.
Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.
The other parameters are kernel parameters, and if you vary them to increase the complexity of calculating the kernel then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel, but will force a much more complex model, for which SVM may take much longer to find the optimal solution.
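For instance, reusing the call from the question, a rough way to see how runtime scales with C (cost is svm()'s name for C; the grid of values is arbitrary):
for (C in c(0.01, 1, 100)) {
  elapsed <- system.time(
    svm(x = t(n), y = factor(samplelabels), type = "C-classification",
        kernel = "radial", cost = C)
  )["elapsed"]
  cat("cost =", C, "took", elapsed, "seconds\n")
}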
Datasets to Use:
The UCI Machine Learning Repository is a great source of datasets.
The MNIST handwriting recognition dataset is a good one to use -- you can randomly select subsets of the data to create increasingly large datasets. Keep in mind that the data at the link contains all digits; SVM is of course binary, so you would have to reduce the data to just two digits or do some kind of multi-class SVM.
You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector to a hyperplane, then generate datapoints and determine which side of the hyperplane each falls on to label it. Add some randomness by allowing points within a certain distance of the hyperplane to sometimes be labeled differently. Increase the complexity by increasing that overlap between classes. Or generate some number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges. The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, increase the number of squares, increase the dimensions, and increase the number of datapoints. You will have to use a non-linear kernel for that, of course.
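A minimal sketch of the linear recipe above, plus a checkerboard labelling (sizes and the noise band are arbitrary):
set.seed(1)
n_pts <- 5000
X <- matrix(runif(2 * n_pts, -1, 1), ncol = 2)
w <- c(1, -2)                        # normal vector of the separating hyperplane
margin <- drop(X %*% w)
y <- ifelse(margin > 0, 1, -1)
near <- abs(margin) < 0.1            # points close to the hyperplane
y[near] <- sample(c(-1, 1), sum(near), replace = TRUE)  # occasional label noise
# the classic non-linear case: checkerboard labels on a 4x4 grid
y_cb <- ifelse((floor(2 * (X[, 1] + 1)) + floor(2 * (X[, 2] + 1))) %% 2 == 0, 1, -1)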
