I am using a generalized additive model built using the bam() function in the mgcv R package, to predict probabilities for a binary response.
I seem to get slightly different predictions for the same input data, depending on the make-up of the newdata table provided, and don't understand why.
The model was built using a formula like this:
library(mgcv)
model <- bam(response ~ categorical_predictor1 + s(continuous_predictor, bs = 'tp'),
             data = data,
             family = "binomial",
             select = TRUE,
             discrete = TRUE,
             nthreads = 16)
I have several more categorical and continuous predictors, but to save space I only mention two in the above formula.
I then predict like this:
predictions <- predict(model,
                       newdata = newdata,
                       type = "response")
I want to make predictions for about 2.5 million rows, but during my testing I predicted for a subset of 250,000.
Each time I use the model to predict for that subset (i.e. newdata = subset) I get the same outputs - this is reproducible. However, if I predict for that same subset within the full table of 2.5 million rows (i.e. newdata = full_data), I get slightly different predictions for those 250,000 rows than when I predict them separately.
I always thought that each row is predicted in turn, based only on the predictors provided, so I can't understand why the predictions change with the context of the newdata. This does not happen if I predict using a standard glm, or a random forest, so I assume it's something specific to GAMs or the mgcv package.
Sorry, I haven't been able to provide a reproducible example - my datasets are large, and I'm not sure if the same thing would happen with a small example dataset.
From the predict.bam help:
"When discrete=TRUE the prediction data in newdata is discretized in the same way as is done when using discrete fitting methods with bam. However the discretization grids are not currently identical to those used during fitting. Instead, discretization is done afresh for the prediction data. This means that if you are predicting for a relatively small set of prediction data, or on a regular grid, then the results may in fact be identical to those obtained without discretization. The disadvantage to this approach is that if you make predictions with a large data frame, and then split it into smaller data frames to make the predictions again, the results may differ slightly, because of slightly different discretization errors."
You likely can't switch to gam or use discrete=FALSE because you need the speed. But you'll have to deal with some small differences in exchange. From the help, it sounds like you might be able to minimize this by choosing the subsets carefully, but you won't be able to eliminate it completely.
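If exact reproducibility across different newdata splits matters more than prediction speed, one option (a sketch using the object names from the question; it will be slower on 2.5 million rows) is to turn off discretization at prediction time:
predictions <- predict(model,
                       newdata = full_data,
                       type = "response",
                       discrete = FALSE)   # regular prediction, no discretization error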
Unfortunately, I ran into convergence (and singularity) issues when fitting my GLMM models in R. When I tried the same analysis in SPSS, I got no such warning message and the results are only slightly different. Does that mean I can interpret the results from SPSS without worries? Or do I have to test for singularity/convergence issues to be sure?
You have two questions. I will answer both.
First Question
Does it mean I can interpret the results from SPSS without worries?
You do not want to do this. The reason is that mixed models have a very specific parameterization; the original lme4 article by its authors includes a table of the common model-formula syntax and what each specification means.
Each of these specifications carries assumptions about what your model is saying. If, for example, you run a model with random intercepts only, you are assuming that the slopes do not vary across groups. If you include correlated random slopes and random intercepts, you are assuming that there is a relationship between the slopes and intercepts, which may be positive or negative. If you present these results as-is without knowing why the model produced them, you may fail to describe your data accurately.
The reason, as highlighted by one of the comments, is that SPSS runs off defaults whereas R requires you to specify the model explicitly. I'm not surprised that the model failed to converge in R but not in SPSS, given that SPSS assumes no correlation between random slopes and intercepts. That kind of model is more likely to converge than a correlated model, because the constraints of the correlated model make it much harder to fit. However, without knowing how you modeled your data, it is impossible to know exactly what the differences are. If you edit your question with those details, it can be answered more directly; but just know that SPSS and R do not specify these models the same way.
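For illustration, here is a sketch of those random-effects structures in lme4 syntax; y, x, group and dat are placeholders, not names from your analysis:
library(lme4)
# Random intercepts only: slopes are assumed not to vary across groups
m1 <- lmer(y ~ x + (1 | group), data = dat)
# Correlated random intercepts and slopes (the usual explicit R specification)
m2 <- lmer(y ~ x + (1 + x | group), data = dat)
# Uncorrelated random intercepts and slopes (closer to SPSS's default
# variance-components structure)
m3 <- lmer(y ~ x + (1 + x || group), data = dat)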
Second Question
Or do I have to test for singularity/convergence issues to be sure?
SPSS and R both have singularity checks as a default (check this page as an example). If your model fails to converge, you should drop it and use an alternative model (usually one with a simpler random-effects structure, or refit with a different optimizer).
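As a sketch, assuming a glmer() fit called fit (use lmerControl() instead for lmer() models), some standard checks and remedies in lme4 look like this:
library(lme4)
isSingular(fit, tol = 1e-4)          # TRUE if the random-effects estimate is singular
# Refit with a different optimizer if you get convergence warnings
fit2 <- update(fit, control = glmerControl(optimizer = "bobyqa"))
# Or refit with every available optimizer and compare the estimates
summary(allFit(fit))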
Maybe someone can help me with this question. I conducted a follow-up study and obviously now have to face missing data. I am considering how best to impute the missing data for my multilevel model (MLM) in R (e.g. some participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing level-1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
library(mitml)   # panImpute() is provided by the mitml package, which wraps pan
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet, I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be so great! Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to use the imputation code for every model, or do I use it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
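As a rough sketch only (x_level1 and cluster_id are placeholder names; the details are in the mice documentation), a level-1 variable can be imputed with the multilevel method "2l.pan", with the cluster variable flagged as -2 in the predictor matrix:
library(mice)
# Dry run to get the default method vector and predictor matrix
ini  <- mice(data, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
meth["x_level1"] <- "2l.pan"          # multilevel imputation via the pan model
pred["x_level1", "cluster_id"] <- -2  # -2 marks the cluster/grouping variable
imp <- mice(data, method = meth, predictorMatrix = pred, m = 5, seed = 1)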
Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to use the imputation code for every model, or do I use it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically, that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects and only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run each model on all of the imputed datasets;
pool the results (typically using Rubin's rules).
So you will need to do the last two steps twice, to end up with two sets of pooled results - one for each model. The software should provide functionality for doing all of this.
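Continuing the mice sketch above (again with placeholder names y, time and id; pooling lmer fits this way requires the broom.mixed package to be installed), the fit-and-pool steps for the two candidate models might look like:
library(lme4)
# Fit each candidate model on every imputed dataset, then pool (Rubin's rules)
fit_ri <- with(imp, lmer(y ~ time + (1 | id)))         # random intercepts only
fit_rs <- with(imp, lmer(y ~ time + (1 + time | id)))  # random intercepts and slopes
summary(pool(fit_ri))
summary(pool(fit_rs))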
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.
I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If only a few observations skew the distribution of individual variables, and there are no other reasons against using a particular method (such as logistic regression) on your data, you might want to study the nature of those "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. Check whether the residuals are normally distributed - the normality of the model residuals is the actual assumption in linear regression, not that each variable is normally distributed (of course, you might have, say, bimodally distributed data if there are differences between groups). See whether there are observations that could be regarded as outliers, study them (see e.g. here), and if justifiable remove them from the final dataset before re-fitting the model without them. A minimal sketch of these diagnostics is given after this list.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
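As the promised sketch of those diagnostics (y, x1, x2 and dat are placeholders, not your actual variables):
# Fit the intended model, e.g. a logistic regression
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
# Standard base-R diagnostics: residuals vs fitted (Tukey-Anscombe),
# Q-Q plot of residuals, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
# Observations with a large Cook's distance are worth a closer look
which(cooks.distance(fit) > 4 / nrow(dat))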
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
# Read in the shared data (header = FALSE because there are no column names)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Scatterplot matrices with histograms on the diagonal and pairwise correlations
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality. The resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year * Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...) In general you shouldn't include the same term in both the random and the fixed effects (although there is one exception - if Year has more than a few values and there are multiple observations per year you can include it as a continuous covariate in the fixed effects and a grouping factor in the random effects - so this might be correct).
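For illustration only, a sketch of that exception, with Year entering the fixed effects as a centred continuous trend and the random effects as a grouping factor (whether this makes sense depends on how many distinct years you have):
library(lme4)
data$Year_num <- as.numeric(as.character(data$Year))   # in case Year is a factor
data$Year_c   <- data$Year_num - mean(data$Year_num)   # centre to help convergence
model2 <- glmer(Number ~ Year_c * Treatment + (1 | Year),
                family = poisson, data = data)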
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
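For example, a sketch of such a comparison between two nested glmer fits (placeholder formulas; both models should be fitted by the same likelihood-based method for the test to be meaningful):
library(lme4)
m_full    <- glmer(Number ~ Year * Treatment + (1 | Year), family = poisson, data = data)
m_reduced <- glmer(Number ~ Year + Treatment + (1 | Year), family = poisson, data = data)
anova(m_reduced, m_full)   # likelihood ratio (chi-squared) test of the interaction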
This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing for about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I want to make sure I incorporate the known error in that estimation model. The standard error of the intercept and of each coefficient has been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject with a missing value you impute the waist circumference by first using the published model (with BMI, age, and gender) to compute the expected value, and then adding simulated random noise: draw the coefficients from normal distributions with the published estimates as means and the published standard errors as standard deviations, and add residual noise using the residual standard deviation from the publication (you'll have to read through the paper to find these values). Once you've filled in every missing value, perform whatever statistical computation you want to run, and save the estimates and standard errors. Now create a second completed dataset from your original data in the same way; since the noise is random, the imputed values will differ from those in the first dataset. Run the computation again and save the results, which will be a little different from the first imputed dataset because the imputed values contain random noise. Repeat this a number of times. Finally, combine the results across imputations using Rubin's rules: the pooled variance is the average squared standard error (within-imputation variance) plus the between-imputation variance of the estimates, inflated by a factor of 1 + 1/M; its square root is a standard error that incorporates the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: at the low level, each iteration uses the published model to create a completed dataset with noisy imputed values for the missing data, which gives you an estimate and standard error; at the high level you repeat the process to obtain a sample of such results, which you then pool to get your overall estimate and its standard error.
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.
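Here is a rough sketch of that scheme; every name and number is a placeholder (the coefficients, standard errors, and residual SD must come from the published model, 'exposure' stands in for whatever predictor your own analysis is about, and mydata is your data frame):
# --- placeholders: replace with the published values and your own data/model ---
pub_coef    <- c(intercept = 20, bmi = 2.0, age = 0.1, male = 5)   # published estimates
pub_se      <- c(intercept = 2,  bmi = 0.1, age = 0.02, male = 1)  # published std. errors
sigma_resid <- 6                                                   # published residual SD
M   <- 20                              # number of imputations
est <- se <- numeric(M)
for (m in 1:M) {
  d    <- mydata                       # data frame with missing waist values
  miss <- is.na(d$waist)
  # Draw coefficients from their published sampling distributions,
  # then add residual noise to the predicted waist circumferences
  b  <- rnorm(length(pub_coef), mean = pub_coef, sd = pub_se)
  mu <- b[1] + b[2] * d$bmi[miss] + b[3] * d$age[miss] + b[4] * d$male[miss]
  d$waist[miss] <- mu + rnorm(sum(miss), 0, sigma_resid)
  # Run the analysis of interest on the completed data and store the results
  fit    <- lm(waist ~ bmi + age + male + exposure, data = d)
  est[m] <- coef(fit)["exposure"]
  se[m]  <- summary(fit)$coefficients["exposure", "Std. Error"]
}
# Pool with Rubin's rules: within-imputation variance plus
# between-imputation variance (inflated by 1 + 1/M)
W <- mean(se^2)
B <- var(est)
pooled_est <- mean(est)
pooled_se  <- sqrt(W + (1 + 1/M) * B)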