R: Linear regression model does not work very well - r

I'm using R to fit a linear regression model and then I use this model to predict values but it does not predict very well boundary values. Do you know how to fix it?
ZLFPS is:
ZLFPS<-c(27.06,25.31,24.1,23.34,22.35,21.66,21.23,21.02,20.77,20.11,20.07,19.7,19.64,19.08,18.77,18.44,18.24,18.02,17.61,17.58,16.98,19.43,18.29,17.35,16.57,15.98,15.5,15.33,14.87,14.84,14.46,14.25,14.17,14.09,13.82,13.77,13.76,13.71,13.35,13.34,13.14,13.05,25.11,23.49,22.51,21.53,20.53,19.61,19.17,18.72,18.08,17.95,17.77,17.74,17.7,17.62,17.45,17.17,17.06,16.9,16.68,16.65,16.25,19.49,18.17,17.17,16.35,15.68,15.07,14.53,14.01,13.6,13.18,13.11,12.97,12.96,12.95,12.94,12.9,12.84,12.83,12.79,12.7,12.68,27.41,25.39,23.98,22.71,21.39,20.76,19.74,19.49,19.12,18.67,18.35,18.15,17.84,17.67,17.65,17.48,17.44,17.05,16.72,16.46,16.13,23.07,21.33,20.09,18.96,17.74,17.16,16.43,15.78,15.27,15.06,14.75,14.69,14.69,14.6,14.55,14.53,14.5,14.25,14.23,14.07,14.05,29.89,27.18,25.75,24.23,23.23,21.94,21.32,20.69,20.35,19.62,19.49,19.45,19,18.86,18.82,18.19,18.06,17.93,17.56,17.48,17.11,23.66,21.65,19.99,18.52,17.22,16.29,15.53,14.95,14.32,14.04,13.85,13.82,13.72,13.64,13.5,13.5,13.43,13.39,13.28,13.25,13.21,26.32,24.97,23.27,22.86,21.12,20.74,20.4,19.93,19.71,19.35,19.25,18.99,18.99,18.88,18.84,18.53,18.29,18.27,17.93,17.79,17.34,20.83,19.76,18.62,17.38,16.66,15.79,15.51,15.11,14.84,14.69,14.64,14.55,14.44,14.29,14.23,14.19,14.17,14.03,13.91,13.8,13.58,32.91,30.21,28.17,25.99,24.38,23.23,22.55,20.74,20.35,19.75,19.28,19.15,18.25,18.2,18.12,17.89,17.68,17.33,17.23,17.07,16.78,25.9,23.56,21.39,20.11,18.66,17.3,16.76,16.07,15.52,15.07,14.6,14.29,14.12,13.95,13.89,13.66,13.63,13.42,13.28,13.27,13.13,24.21,22.89,21.17,20.06,19.1,18.44,17.68,17.18,16.74,16.07,15.93,15.5,15.41,15.11,14.84,14.74,14.68,14.37,14.29,14.29,14.27,18.97,17.59,16.05,15.49,14.51,13.91,13.45,12.81,12.6,12,11.98,11.6,11.42,11.33,11.27,11.13,11.12,11.11,10.92,10.87,10.87,28.61,26.4,24.22,23.04,21.8,20.71,20.47,19.76,19.38,19.18,18.55,17.99,17.95,17.74,17.62,17.47,17.25,16.63,16.54,16.39,16.12,21.98,20.32,19.49,18.2,17.1,16.47,15.87,15.37,14.89,14.52,14.37,13.96,13.95,13.72,13.54,13.41,13.39,13.24,13.07,12.96,12.95,27.6,25.68,24.56,23.52,22.41,21.69,20.88,20.35,20.26,19.66,19.19,19.13,19.11,18.89,18.53,18.13,17.67,17.3,17.26,17.26,16.71,19.13,17.76,17.01,16.18,15.43,14.8,14.42,14,13.8,13.67,13.33,13.23,12.86,12.85,12.82,12.75,12.61,12.59,12.59,12.45,12.32)
QPZL<-c(36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16)
ZLDBFSAO<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
My model is:
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO)
results3 <- coef(summary(fit32))
first3<-as.numeric(results3[1])
second3<-as.numeric(results3[2])
third3<-as.numeric(results3[3])
fourth3<-as.numeric(results3[4])
fifth3<-as.numeric(results3[5])
#inverse model used for prediction of FPS
f1 <- function(x) {first3 +second3*x +third3*x^2 + fourth3*1}
You can see my dataset here. This dataset contains the values that I have to predict. The FPS variation per QP is heterogenous. See dataset. I added a new column.
The fitted dataset is a different one.
To test the model just write exp(f1(selected_QP)) where selected QP varies from 16 to 36. See the given dataset for QP values and the FPS value that the model should predict.
You can run the model online here.
When I'm using QP values in the middle, let's say between 23 and 32 the model predicts the FPS value pretty well. Otherwise, the prediction has big error value.

Regarding the linear regression model I should use Weighted Least Squares as a Solution to Heteroskedasticity of the fitted dataset. For references, see here, here and here.
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO, weights=1/(1+0.5*QPZL^2))
The other code remains the same. This model gives me lower prediction error than the previous.

Related

Optimizing a GAM for Smoothness

I am currently trying to generate a general additive model in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there good a way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave one out cross validation (LOOCV), and I am not sure how to make sure that gam() does this in the MGCV package. Any help on either of these problems uld be greatly appreciated. Thank you.
I've run this to generate a GAM, but it overfits the data.
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response~ linearpredictor+ predictor2+ predictor3, data = data[2:5])
for(i in 1:100000){
PotentialGAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5], sp=c(combinations2[i,]$Var1,combinations2[i,]$Var2))
if (AIC(PotentialGAM,BestGAM)$df[1] <= 10 & AIC(PotentialGAM,BestGAM)$AIC[1] < AIC(PotentialGAM,BestGAM)$AIC[2]){
BestGAM <<- PotentialGAM
listNumber <- i
}
}
You are fitting your GAM using generalised cross validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross validation (OCV; what you also call LOOCV) when estimating GAMs. Note that GCV is the same as OCV on a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix), and while when fitting with GCV {mgcv} doesn't actually need to do the rotation and the expected GCV score isn't affected by the rotation, GCV is just OCV (wood 2017, p. 260)
It has been shown that GCV can undersmooth (resulting in more wiggly models) as the objective function (GCV profile) can become flat around the optimum. Instead it is preferred to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be likely to presume gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?)
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that using GCV smoothness selection and for a particular definition of "smoothness" (in this case it is squared second derivatives of the estimated smooths, integrated over the range of the covariates, and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma 1.4 is often used for example, which means that each EDF costs 40% more in the GCV criterion.
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 288 GAMs through the use of the influence matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A))

Is there a way to include an autocorrelation structure in the gam function of mgcv?

I am building a model using the mgcv package in r. The data has serial measures (data collected during scans 15 minutes apart in time, but discontinuously, e.g. there might be 5 consecutive scans on one day, and then none until the next day, etc.). The model has a binomial response, a random effect of day, a fixed effect, and three smooth effects. My understanding is that REML is the best fitting method for binomial models, but that this method cannot be specified using the gamm function for a binomial model. Thus, I am using the gam function, to allow for the use of REML fitting. When I fit the model, I am left with residual autocorrelation at a lag of 2 (i.e. at 30 minutes), assessed using ACF and PACF plots.
So, we wanted to include an autocorrelation structure in the model, but my understanding is that only the gamm function and not the gam function allows for the inclusion of such structures. I am wondering if there is anything I am missing and/or if there is a way to deal with autocorrelation with a binomial response variable in a GAMM built in mgcv.
My current model structure looks like:
gam(Response ~
s(Day, bs = "re") +
s(SmoothVar1, bs = "cs") +
s(SmoothVar2, bs = "cs") +
s(SmoothVar3, bs = "cs") +
as.factor(FixedVar),
family=binomial(link="logit"), method = "REML",
data = dat)
I tried thinning my data (using only every 3rd data point from consecutive scans), but found this overly restrictive to allow effects to be detected due to my relatively small sample size (only 42 data points left after thinning).
I also tried using the prior value of the binomial response variable as a factor in the model to account for the autocorrelation. This did appear to resolve the residual autocorrelation (based on the updated ACF/PACF plots), but it doesn't feel like the most elegant way to do so and I worry this added variable might be adjusting for more than just the autocorrelation (though it was not collinear with the other explanatory variables; VIF < 2).
I would use bam() for this. You don't need to have big data to fit a with bam(), you just loose some of the guarantees about convergence that you get with gam(). bam() will fit a GEE-like model with an AR(1) working correlation matrix, but you need to specify the AR parameter via rho. This only works for non-Gaussian families if you also set discrete = TRUE when fitting the model.
You could use gamm() with family = binomial() but this uses PQL to estimate the GLMM version of the GAMM and if your binomial counts are low this method isn't very good.

How to rectify heteroscedasticity for multiple linear regression model

I'm fitting a multiple linear regression model with 6 predictiors (3 continuous and 3 categorical). The residuals vs. fitted plot show that there is heteroscedasticity, also it's confirmed by bptest().
summary of sales_lm
rediduals vs. fitted plot
Also I calculated the sqrt for my train data and test data, as showed below:
sqrt(mean(sales_train_lm_pred-sales_train$SALES)^2)
2 3533.665
sqrt(mean(sales_test_lm_pred-sales_test$SALES)^2)
2 3556.036
I tried to fit glm() model, but still didn't rectify heteroscedasticity.
glm.test3<-glm(SALES~.,weights=1/sales_fitted$.resid^2,family=gaussian(link="identity"), data=sales_train)
resid vs. fitted plot for glm.test3
it looks weird.
glm.test3 plot
Could you please help me what should I do next?
Thanks in advance!
That you observe heteroscedasticity for your data means that the variance is not stationary. You can try the following:
1) Apply the one-parameter Box-Cox transformation (of the which the log transform is a special case) with a suitable lambda to one or more variables in the data set. The optimal lambda can be determined by looking at its log-likelihood function. Take a look at MASS::boxcox.
2) Play with your feature set (decrease, increase, add new variables).
2) Use the weighted linear regression method.

Obtaining predicted (i.e. expected) values from the orm function (Ordinal Regression Model) from rms package in R

I've run a simple model using orm (i.e. reg <- orm(formula = y ~ x)) and I'm having trouble understanding how to get predicted values for Y. I've never worked with models that use multiple intercepts. I want to know for each and every value of Y in my dataset what the predicted value from the model would be. I tried predict(reg, type="mean") and this produced values that are close to the predicted values from an OLS regression, but I'm not sure if this is what I want. I really just want something analogous to OLS where you can obtain the E(Y) given a set of predictors. If possible, please provide code I can run to do this with a very short explanation.

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1 lme1 lmer1 gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm = model.matrix(terms(glmm1), newdat) # Error in model.frame.default(object,
# data, xlev = xlev) : object is not a matrix
newdat$pcount = mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented on http://glmm.wikidot.com/faq

Resources