How to rectify heteroscedasticity for multiple linear regression model - r

I'm fitting a multiple linear regression model with 6 predictors (3 continuous and 3 categorical). The residuals vs. fitted plot shows heteroscedasticity, and bptest() confirms it.
summary of sales_lm
residuals vs. fitted plot
I also computed a square-root error measure for my training and test data, as shown below:
sqrt(mean(sales_train_lm_pred-sales_train$SALES)^2)
2 3533.665
sqrt(mean(sales_test_lm_pred-sales_test$SALES)^2)
2 3556.036
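A note on the two expressions above: as parenthesised, they square the mean of the errors, so positive and negative errors cancel before the root is taken. The conventional RMSE averages the squared errors first. A minimal sketch with made-up numbers, where pred and actual are hypothetical stand-ins for sales_train_lm_pred and sales_train$SALES:

```r
# Hypothetical predictions and observed values (not the sales data)
pred   <- c(10, 12, 9, 15)
actual <- c(11, 10, 9, 18)

# As written in the question: the absolute value of the mean error
abs_mean_error <- sqrt(mean(pred - actual)^2)

# Conventional RMSE: square each error, average, then take the root
rmse <- sqrt(mean((pred - actual)^2))
```

The two quantities can differ a lot, so the train/test figures are only comparable to other RMSE values if they are computed the second way.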
I tried fitting a weighted glm() model, but it still didn't remove the heteroscedasticity.
glm.test3<-glm(SALES~.,weights=1/sales_fitted$.resid^2,family=gaussian(link="identity"), data=sales_train)
residuals vs. fitted plot for glm.test3
It looks weird.
glm.test3 plot
Could you please advise what I should do next?
Thanks in advance!

Observing heteroscedasticity in your data means that the variance of the errors is not constant. You can try the following:
1) Apply the one-parameter Box-Cox transformation (of which the log transform is a special case) with a suitable lambda to one or more variables in the data set. The optimal lambda can be determined by looking at its log-likelihood function. Take a look at MASS::boxcox.
2) Play with your feature set (decrease, increase, add new variables).
3) Use the weighted linear regression method.
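Options 1) and 3) could be sketched roughly as follows. The data are simulated and the names (dat, x1, y) are hypothetical. The weights 1/x1^2 assume the error standard deviation is proportional to x1, which is true here only because the data were generated that way. Note that MASS::boxcox profiles lambda for the response variable.

```r
library(MASS)  # ships with R; provides boxcox()

set.seed(1)
# Hypothetical data: the error spread grows with x1
n   <- 200
x1  <- runif(n, 1, 10)
y   <- 2 + 3 * x1 + rnorm(n, sd = 0.5 * x1)
dat <- data.frame(x1, y)

fit <- lm(y ~ x1, data = dat)

# Option 1: profile log-likelihood of the Box-Cox parameter for the response
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood

# Option 3: weighted least squares, weights proportional to 1 / variance
fit_wls <- lm(y ~ x1, data = dat, weights = 1 / x1^2)
```

With real data the variance structure is unknown, so the weights usually have to be estimated or motivated by a plot of the residuals against each predictor.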

Related

Backtransform LMM output in R

When performing linear mixed models, I have had to square-root(log) transform the data to achieve a normal distribution. Having performed the LMMs, I now want to plot the results onto a graph, but on the original scale i.e. not square-root(log) transformed.
Apparently I can use my raw (untransformed data) on a graph, and to create the predicted regression line I can use the coefficients from my LMM output to get backtransformed predicted y-values for each of my x values. This is where I'm stuck - I have no idea how to do this. Can anyone help?
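A minimal sketch of the idea described above, assuming a square-root transform. To keep it self-contained, lm() stands in for the fixed-effects part of the mixed model, and the data are simulated. With a log transform, exp() would replace the squaring step.

```r
set.seed(42)
# Hypothetical data; lm() is a stand-in for the fixed effects of an LMM
x <- seq(1, 20, length.out = 60)
y <- (1 + 0.4 * x + rnorm(60, sd = 0.3))^2
fit_sqrt <- lm(sqrt(y) ~ x)

# Predict on the transformed scale over a grid of x values
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
pred_sqrt <- predict(fit_sqrt, newdata = grid)

# Undo the square root by squaring the predictions
grid$pred <- pred_sqrt^2

# Raw data with the back-transformed line on the original scale
plot(x, y)
lines(grid$x, grid$pred)
```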

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
emm <- emmeans(mod, "treat")
emm ### marginal means
pairs(emm) ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the logit transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.
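To make the distinction between odds ratios and probability differences concrete, here is a small base-R sketch with made-up counts. It does not use emmeans and gives point estimates only, with no standard errors. The data frame mydata and its values are hypothetical.

```r
# Hypothetical counts for two treatment groups
mydata <- data.frame(treat = factor(c("A", "B")),
                     heads = c(30, 45),
                     tails = c(70, 55))
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())

# Predictions on the logit (link) scale for each group
eta <- predict(mod, newdata = data.frame(treat = factor(c("A", "B"))),
               type = "link")

# Back-transforming a difference of logits gives an odds ratio
odds_ratio <- unname(exp(eta[1] - eta[2]))

# Back-transforming first and then differencing gives a difference of probabilities
p <- plogis(eta)
prob_diff <- unname(p[1] - p[2])
```

The first quantity corresponds to what summary(pairs(emm), type = "response") reports, the second to pairs(regrid(emm)).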

Address unequal variance between groups before applying contrasts for a linear model? (r)

My Goal: I have an ordinal factor variable (5 levels) to which I would like to apply contrasts to test for a linear trend. However, the factor groups have heterogeneity of variance.
What I've done: Upon recommendation, I used lmRob() from the robust package to create a robust linear model, then applied the contrasts.
# assign the codes for a linear contrast of 5 groups, save as object
contrast5 <- contr.poly(5)
# set contrast property of sf1 to contain the weights
contrasts(SCI$sf1) <- contrast5
# fit and save a robust model (exhaustive instead of subsampling)
robmod.sf1 <- lmRob(ICECAP_A ~ sf1, data = SCI, nrep = Exhaustive)
summary.lmRob(robmod.sf1)
My problem: I have since been reading that robust regression is more suited to address outliers, and not heterogeneity of variance. (bottom of https://stats.idre.ucla.edu/r/dae/robust-regression/_ ) This UCLA page (among others) suggests the sandwich package to get heteroskedastic-consistent (HC) standard errors (such as in https://thestatsgeek.com/2014/02/14/the-robust-sandwich-variance-estimator-for-linear-regression-using-r/ ).
But these examples use a series of functions/calls to generate output that gives you the HC that could be used to calculate confidence intervals, t-values, p-values etc.
My thinking is that if I use vcovHC(), I could get the HC standard errors, but they would not be 'applied' to the model or stored as a property of it, so I couldn't pass the model (with the HC errors) through a function to apply the contrasts that I ultimately want. I hope I am not conflating two separate concepts, but surely if a function addresses/down-weights outliers, that should at least somewhat address unequal variances as well?
Can anyone confirm whether my reasoning is sound (so that I should stay with lmRob())? Or suggest how I could just correct my standard errors and still apply the contrasts?
vcovHC is the right function to deal with heteroscedasticity. HC stands for heteroscedasticity-consistent estimator. It will not downweight outliers in the estimates of the model effects, but it will calculate the CIs and p-values differently to accommodate the impact of such outlying observations. lmRob does downweight outlying values and does not handle heteroscedasticity.
See more here:
https://stats.stackexchange.com/questions/50778/sandwich-estimator-intuition/50788#50788
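A rough sketch of how the contrasts and the HC standard errors fit together: the polynomial contrasts are part of an ordinary lm() fit, and the sandwich covariance only replaces the standard errors used to test them. The HC0 estimator is written out in base R so the sketch runs without extra packages. The data are simulated stand-ins for SCI, and only the variable names come from the question.

```r
set.seed(7)
# Hypothetical stand-in for the SCI data: 5 groups with unequal variances
sf1 <- factor(rep(1:5, each = 30))
ICECAP_A <- 0.5 + 0.05 * as.integer(sf1) +
  rnorm(150, sd = 0.05 * as.integer(sf1))
SCI <- data.frame(sf1, ICECAP_A)
contrasts(SCI$sf1) <- contr.poly(5)   # columns .L, .Q, .C, ^4

mod <- lm(ICECAP_A ~ sf1, data = SCI)

# HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X     <- model.matrix(mod)
e     <- resid(mod)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)             # each row of X scaled by its residual
vc_hc <- bread %*% meat %*% bread

# Contrast estimates tested with heteroscedasticity-consistent standard errors
est  <- coef(mod)
se   <- sqrt(diag(vc_hc))
tval <- est / se
pval <- 2 * pt(abs(tval), df = df.residual(mod), lower.tail = FALSE)
round(cbind(Estimate = est, HC0_SE = se, t = tval, p = pval), 4)

# With the packages installed, this is expected to give the same table:
# lmtest::coeftest(mod, vcov. = sandwich::vcovHC(mod, type = "HC0"))
```

The row named sf1.L is the linear trend. vcovHC() also offers small-sample variants such as "HC3", which may be preferable with few observations per group.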

Plot residuals vs predicted response in R

Is plotting residuals vs. predicted response equivalent to plotting residuals vs. fitted values?
If so, would the plot be produced by plot(lm) and plot(predict(lm)), where lm is the linear model?
Am I correct?
Maybe a little off-topic, but as an addition: the ggfortify package might come in handy. It is super easy to use, like this:
library(ggfortify)
autoplot(mod3)
It yields the main diagnostic plots you need to judge whether or not your model violates the lm assumptions.
Yes, the fitted values are the predicted responses on the training data, i.e. the data used to fit the model, so plotting residuals vs. predicted response is equivalent to plotting residuals vs. fitted.
As for your second question, the plot would be obtained by plot(lm), but before that you have to run par(mfrow = c(2, 2)). This is because plot(lm) outputs 4 plots, one of which is the one you want, i.e. the residuals vs. fitted plot. The command above divides the output screen into four panels, so each plot is shown in its own panel. The plot you are looking for will appear in the top left.
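A short sketch of both points, using the built-in cars data purely for illustration (mod is a hypothetical model name):

```r
# Built-in cars data, used only as an example
mod <- lm(dist ~ speed, data = cars)

# On the training data, fitted values and predictions coincide
same <- all.equal(fitted(mod), predict(mod))

# Residuals vs. fitted drawn by hand
plot(fitted(mod), resid(mod), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Only the first diagnostic panel of plot(), i.e. residuals vs. fitted
plot(mod, which = 1)
```

Using which = 1 avoids having to split the screen with par(mfrow = c(2, 2)) when only this one plot is needed.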

R: Linear regression model does not work very well

I'm using R to fit a linear regression model and then use it to predict values, but it does not predict boundary values very well. Do you know how to fix it?
ZLFPS is:
ZLFPS<-c(27.06,25.31,24.1,23.34,22.35,21.66,21.23,21.02,20.77,20.11,20.07,19.7,19.64,19.08,18.77,18.44,18.24,18.02,17.61,17.58,16.98,19.43,18.29,17.35,16.57,15.98,15.5,15.33,14.87,14.84,14.46,14.25,14.17,14.09,13.82,13.77,13.76,13.71,13.35,13.34,13.14,13.05,25.11,23.49,22.51,21.53,20.53,19.61,19.17,18.72,18.08,17.95,17.77,17.74,17.7,17.62,17.45,17.17,17.06,16.9,16.68,16.65,16.25,19.49,18.17,17.17,16.35,15.68,15.07,14.53,14.01,13.6,13.18,13.11,12.97,12.96,12.95,12.94,12.9,12.84,12.83,12.79,12.7,12.68,27.41,25.39,23.98,22.71,21.39,20.76,19.74,19.49,19.12,18.67,18.35,18.15,17.84,17.67,17.65,17.48,17.44,17.05,16.72,16.46,16.13,23.07,21.33,20.09,18.96,17.74,17.16,16.43,15.78,15.27,15.06,14.75,14.69,14.69,14.6,14.55,14.53,14.5,14.25,14.23,14.07,14.05,29.89,27.18,25.75,24.23,23.23,21.94,21.32,20.69,20.35,19.62,19.49,19.45,19,18.86,18.82,18.19,18.06,17.93,17.56,17.48,17.11,23.66,21.65,19.99,18.52,17.22,16.29,15.53,14.95,14.32,14.04,13.85,13.82,13.72,13.64,13.5,13.5,13.43,13.39,13.28,13.25,13.21,26.32,24.97,23.27,22.86,21.12,20.74,20.4,19.93,19.71,19.35,19.25,18.99,18.99,18.88,18.84,18.53,18.29,18.27,17.93,17.79,17.34,20.83,19.76,18.62,17.38,16.66,15.79,15.51,15.11,14.84,14.69,14.64,14.55,14.44,14.29,14.23,14.19,14.17,14.03,13.91,13.8,13.58,32.91,30.21,28.17,25.99,24.38,23.23,22.55,20.74,20.35,19.75,19.28,19.15,18.25,18.2,18.12,17.89,17.68,17.33,17.23,17.07,16.78,25.9,23.56,21.39,20.11,18.66,17.3,16.76,16.07,15.52,15.07,14.6,14.29,14.12,13.95,13.89,13.66,13.63,13.42,13.28,13.27,13.13,24.21,22.89,21.17,20.06,19.1,18.44,17.68,17.18,16.74,16.07,15.93,15.5,15.41,15.11,14.84,14.74,14.68,14.37,14.29,14.29,14.27,18.97,17.59,16.05,15.49,14.51,13.91,13.45,12.81,12.6,12,11.98,11.6,11.42,11.33,11.27,11.13,11.12,11.11,10.92,10.87,10.87,28.61,26.4,24.22,23.04,21.8,20.71,20.47,19.76,19.38,19.18,18.55,17.99,17.95,17.74,17.62,17.47,17.25,16.63,16.54,16.39,16.12,21.98,20.32,19.49,18.2,17.1,16.47,15.87,15.37,14.89,14.52,14.37,13.96,13.95,13.72,13.54,13.41,13.39,13.24,13.07,12.96,12.95,27.6,25.68,24.56,23.52,22.41,21.69,20.88,20.35,20.26,19.66,19.19,19.13,19.11,18.89,18.53,18.13,17.67,17.3,17.26,17.26,16.71,19.13,17.76,17.01,16.18,15.43,14.8,14.42,14,13.8,13.67,13.33,13.23,12.86,12.85,12.82,12.75,12.61,12.59,12.59,12.45,12.32)
QPZL<-c(36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16)
ZLDBFSAO<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
My model is:
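fit32 <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO)
# coefficient table; with single-index access, elements 1 to 4 are the estimates
results3 <- coef(summary(fit32))
first3 <- as.numeric(results3[1])   # intercept
second3 <- as.numeric(results3[2])  # QPZL
third3 <- as.numeric(results3[3])   # QPZL^2
fourth3 <- as.numeric(results3[4])  # ZLDBFSAO
# note: the model has only four coefficients, so results3[5] is the intercept's standard error
fifth3 <- as.numeric(results3[5])
# model on the log scale with ZLDBFSAO fixed at 1; exp() of the result gives the predicted FPS
f1 <- function(x) {first3 + second3*x + third3*x^2 + fourth3*1}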
You can see my dataset here. This dataset contains the values that I have to predict. The FPS variation per QP is heterogenous. See dataset. I added a new column.
The fitted dataset is a different one.
To test the model just write exp(f1(selected_QP)) where selected QP varies from 16 to 36. See the given dataset for QP values and the FPS value that the model should predict.
You can run the model online here.
When I use QP values in the middle of the range, say between 23 and 32, the model predicts the FPS value pretty well. Outside that range, the prediction error is large.
Regarding the linear regression model, I should use weighted least squares as a solution to the heteroskedasticity of the fitted dataset. For references, see here, here and here.
fit32 <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO, weights = 1/(1 + 0.5*QPZL^2))
The other code remains the same. This model gives me lower prediction error than the previous.
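As an aside, the hand-written prediction function can be replaced by predict(), which applies the coefficients itself and lets ZLDBFSAO vary. The sketch below assumes the variables live in a data frame. The data frame d is simulated and hypothetical, shaped like the question's data (QP from 16 to 36, two settings), and is not the real dataset.

```r
set.seed(3)
# Hypothetical data shaped like the question's: QP 16..36, two settings
d <- expand.grid(QPZL = 16:36, ZLDBFSAO = 1:2, rep = 1:5)
d$ZLFPS <- exp(2 + 0.002 * d$QPZL^2 - 0.05 * d$QPZL - 0.25 * d$ZLDBFSAO +
               rnorm(nrow(d), sd = 0.05))

# Same model form and weights as in the post
fit <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO, data = d,
          weights = 1 / (1 + 0.5 * QPZL^2))

# Let predict() apply the coefficients, then undo the log with exp()
new <- data.frame(QPZL = c(16, 26, 36), ZLDBFSAO = 1)
fps_hat <- exp(predict(fit, newdata = new))

# Same numbers from the coefficients, written out by hand
b <- unname(coef(fit))
fps_manual <- exp(b[1] + b[2] * new$QPZL + b[3] * new$QPZL^2 +
                  b[4] * new$ZLDBFSAO)
```

Working from a data frame also makes it easy to compare predictions at the boundary QP values (16 and 36) against held-out observations.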
