Downsizing an lm object for plotting - r

I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?
# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)
# checking model assumptions
performance::check_model(model)
Created on 2022-08-23 by the reprex package (v2.0.1)
Alternative: Is downsizing OK? In an ML workflow I'd downsample for tuning, feature selection and feature engineering, for example. But I don't know if that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity on a downsized sample and then estimate the coefficients on the full sample?).

Speeding up check_model
The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:
For models with many observations, or for more complex models in general, generating the plot might become very slow. One reason might be that the underlying graphic engine becomes slow for plotting many data points. In such cases, setting the argument show_dots = FALSE might help. Furthermore, look at the check argument and see if some of the model checks could be skipped, which also increases performance.
Accordingly, you can turn off the dot-per-observation default with check_model(model, show_dots = FALSE). You can also restrict the output to the specific checks you are interested in, which reduces computation time. For example, you could get only the posterior predictive check with check_model(model, check = "pp_check").
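Putting those two options together on the model from the question (a minimal sketch; both arguments are the ones named in the documentation quoted above):
library(performance)
# drop the per-observation dots, which are the slow part with millions of points
check_model(model, show_dots = FALSE)
# run only the check(s) you actually need, e.g. the posterior predictive check
check_model(model, check = "pp_check")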
Implications of Downsampling
Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and any post-estimation summaries that condition on the data will change. Just how much they change depends on the variability of your observations and the sample size. With millions of observations it's probably unlikely to change much, but some rare data patterns can still heavily influence your results during (post-)estimation.
Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, but your mileage may vary because the models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective, it would also be puzzling to do so: the residuals are part of estimation -- if you can fit the model, you can presumably also show the residuals in various ways.
That being said, these plots are themselves approximations to the actual distribution of interest (i.e. you're implicitly estimating test statistics with some of these plots), and since the central limit theorem applies, things would look roughly the same if you cut out some observations, given that your data are sufficiently large.
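If the plotting itself is the bottleneck, one workaround (a minimal sketch, not a feature of check_model(); the subset size of 10,000 is arbitrary) is to keep the full-data fit and only thin the points you actually draw:
set.seed(1)
# fit on the full data, then plot residuals for a random subset of observations
idx <- sample(nobs(model), size = min(1e4, nobs(model)))
plot(fitted(model)[idx], resid(model)[idx],
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)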

Related

Extracting normal-distributed subset from a dataset in R

Working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset where at least one desired variable will be normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and there are no other reasons against using a particular method (such as logistic regression) on your data, you might want to study the nature of the "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as always, carry out a residual analysis (Normal Q-Q plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions; a short sketch of these checks follows after this list. Check whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each individual variable is (you might well have e.g. bimodally distributed data if there are differences between groups). Look for observations that could be regarded as outliers, study them (see e.g. here), and if justified remove them from the final dataset before re-fitting the linear model without them.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
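Here is a minimal sketch of those residual checks, assuming a fitted model object called fit (the object name and formula are hypothetical); base R's plot() method for lm/glm objects produces the standard diagnostic plots:
# e.g. fit <- glm(outcome ~ ., data = dat, family = binomial)  # hypothetical model
par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted (Tukey-Anscombe), Normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))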
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups and check for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
# read in the posted data (no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# pairwise scatterplots, histograms and correlations, in two batches of columns
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

Speed up estimation of overlapping additive models (mgcv)

I have some set of variables and I'm fitting many (hundreds of thousands) additive models, each of which includes a subset of all the variables. The dependent variable is the same in every case, and some of the models overlap or are nested. Not all of the independent variables have to enter the model nonparametrically. For clarity, I might have a set of variables {x1,x2,x3,x4,x5} and estimate:
a) y=c+f(x1)+f(x2),
b) y=c+x1+f(x2),
c) y=c+f(x1)+f(x2)+x3, etc.
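For concreteness, a minimal sketch of what fits (a) and (b) look like in mgcv (the data frame name dat and the cubic regression spline basis are my assumptions, matching the switch to "cr" mentioned below):
library(mgcv)
# model (a): both terms enter as smooths
m_a <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr"), data = dat)
# model (b): x1 enters linearly, x2 as a smooth
m_b <- gam(y ~ x1 + s(x2, bs = "cr"), data = dat)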
I'm wondering if there is anything I can do to speed up the gam estimation in this case? Is there anything that is being calculated over and over again that I could calculate once and supply to the function?
What I have already tried:
Memoization since the models repeat exactly from time to time.
Reluctantly switched from thin plate regression splines to cubic regression splines (quite a significant improvement).
The mgcv guide says:
The user can retain most of the advantages of the t.p.r.s. approach by supplying a reduced set of covariate values from which to obtain the basis - typically the number of covariate values used will be substantially smaller than the number of data, and substantially larger than the basis dimension, k.
This caused quite a noticeable improvement with smaller models, e.g. 5 smooths, but not with larger models, e.g. 10 smooths. In fact, in the latter case, it often caused the estimation to take (potentially much) longer.
What I'd like to try but don't know if it's possible:
One obvious thing that repeats itself in both, say, y=c+f(x1)+f(x2) and y=c+x1+f(x2), is the calculation of the basis for f(x2). If I were to use the same knots every time, how (if it's possible at all) could I precalculate the basis for every variable and then supply that to mgcv? Would you expect this to bring a significant time improvement?
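I don't know of a supported way to hand a fully precomputed basis back to gam(), but you can at least pin the knot locations for each variable so the "cr" basis for f(x2) is built from the same knots in every model. A hedged sketch (names, k and the knot grid are illustrative):
library(mgcv)
# fix the knot locations for x2 once, so every model builds its "cr" basis on them
kn <- list(x2 = seq(min(dat$x2), max(dat$x2), length.out = 10))
m1 <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr", k = 10), data = dat, knots = kn)
m2 <- gam(y ~ x1 + s(x2, bs = "cr", k = 10), data = dat, knots = kn)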
Is there anything else you'd recommend?

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
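A sketch of what a glmulti call might look like, using the built-in swiss data from the step() example further down; the argument names and values here are my reading of the package documentation and are not tested, so check ?glmulti:
library(glmulti)
# exhaustive search over main-effects models, ranked by an information criterion
# (argument names/values assumed from the package docs, not verified)
res <- glmulti(Fertility ~ ., data = swiss,
               level = 1,           # main effects only
               method = "h",        # exhaustive search; "g" would use the genetic algorithm
               crit = "aicc",       # ranking criterion
               fitfunction = "lm")
summary(res)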
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
# fit the full model with all predictors, then let step() select terms by AIC
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
# the sequence of models visited during the stepwise search
slm1$anova
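MASS::stepAIC() is used the same way on the fitted model, for example:
library(MASS)
# stepwise AIC selection, allowing terms to be both dropped and re-added
slm2 <- stepAIC(lm1, direction = "both")
summary(slm2)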

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year*Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year*Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms in the glmer call above). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
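For reference, the corrected call would look like this (a sketch only; whether Year/Treatment is the right random-effect structure is the substantive question above):
library(lme4)
model <- glmer(Number ~ Year*Treatment + (1 | Year/Treatment),
               data = data, family = poisson)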
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they do give the "residual std error" and the "residual sum of squares".
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modelling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data are binary (i.e. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
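For example (a minimal sketch with hypothetical variables y and x in a data frame df):
# count response: Poisson GLM
fit_pois <- glm(y ~ x, data = df, family = poisson)
# binary (0/1) response: binomial GLM, i.e. logistic regression
fit_bin <- glm(y ~ x, data = df, family = binomial)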
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
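A minimal sketch of those two checks for a fitted model object fit (here using raw residuals against fitted values for simplicity; fitted() and residuals() also work on nls fits):
r <- residuals(fit)
# residuals vs fitted values: look for funnel shapes (non-constant variance)
plot(fitted(fit), r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points should lie close to the reference line
qqnorm(r)
qqline(r)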
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
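For two nested nls fits (the formulas and starting values below are purely illustrative), anova() carries out the corresponding extra-sum-of-squares F test:
# hypothetical nested models: the second adds an offset parameter c
fit_small <- nls(y ~ a * exp(b * x), data = df, start = list(a = 1, b = 0.1))
fit_big   <- nls(y ~ a * exp(b * x) + c, data = df, start = list(a = 1, b = 0.1, c = 0))
anova(fit_small, fit_big)  # F test for the extra parameter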
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.

Resources