How to resolve heteroskedasticity in Multiple Linear Regression in R

I'm fitting a multiple linear regression model. I used the bptest function to test for heteroscedasticity, and the result was significant (p < 0.05).
How can I resolve the issue of heteroscedasticity?

Try using a different type of linear regression:
Ordinary Least Squares (OLS) for homoscedasticity.
Weighted Least Squares (WLS) for heteroscedasticity without correlated errors.
Generalized Least Squares (GLS) for heteroscedasticity with correlated errors.
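For example, a minimal sketch in R of the latter two options (the data here are made up, with the error spread growing with x; lm and nlme::gls are the real functions):

```r
set.seed(1)
x <- 1:100
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)  # error sd grows with x

ols <- lm(y ~ x)                       # OLS ignores the changing variance
wls <- lm(y ~ x, weights = 1 / x^2)    # WLS: weight by inverse variance

library(nlme)                          # ships with R
gls_fit <- gls(y ~ x, weights = varPower(form = ~ x))
```

With WLS the weights must be supplied; gls with varPower instead estimates the variance function from the data.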

Welcome to SO, Arun.
Personally, I don't think heteroskedasticity is something you "solve". Rather, it's something you need to allow for in your model.
You haven't given us any of your data, so let's assume that the variance of your residuals increases with the magnitude of your predictor. A simple approach to handling this is to transform the data so that the variance is constant; one way might be to log-transform your response, which may give you a more constant variance. But it also transforms your model: the errors are now additive on the log scale rather than on the original scale.
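As an illustration with simulated data (not the asker's): when the error is multiplicative, a log transform of the response roughly stabilises the residual spread:

```r
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(1 + 0.5 * x + rnorm(200, sd = 0.3))  # multiplicative (lognormal) error

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# residual spread in the lower vs upper half of fitted values
spread <- function(fit) {
  r <- residuals(fit); f <- fitted(fit)
  c(low = sd(r[f < median(f)]), high = sd(r[f >= median(f)]))
}
spread(fit_raw)  # high >> low: heteroscedastic
spread(fit_log)  # roughly equal after the transform
```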
Alternatively, you might have two groups of observations that you want to compare with a t-test, but the variance in one group is larger than in the other. That's a different sort of heteroskedasticity. There are variants of the standard "pooled variance" t-test that can handle that.
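For the two-group case, one such variant is Welch's t-test, which happens to be R's default (made-up groups for illustration):

```r
set.seed(1)
g1 <- rnorm(30, mean = 5, sd = 1)
g2 <- rnorm(30, mean = 6, sd = 4)   # much larger variance

# Welch's t-test does not assume equal variances
t.test(g1, g2)                      # var.equal = FALSE is the default
# the classical pooled-variance test would be t.test(g1, g2, var.equal = TRUE)
```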
I realise this isn't an answer to your question in the conventional sense. I would have made it a comment, but I knew before I started that I'd need more words than a comment would let me have.

Related

Coefficient value of covariate in Cox PH model is too big

I am trying to fit a Cox PH model with time-varying covariates in R, using the coxph function from the survival package. Estimation ran without any trouble, but the coefficient of one covariate is far too large, namely 2.5e+32.
I can't work out the reason for this problem or how to tackle it. The variable is non-stationary and the proportional-hazards assumption is violated. Could either of these facts cause such a large coefficient?
More information would help in framing your problem.
Anyway, I doubt non-proportionality is to blame. It would imply that you have some outliers heavily biasing your coefficient beyond reasonable expectations. You could give this a quick look by plotting the output of cox.zph.
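A quick sketch of that check, using the lung data that ships with the survival package (your own model and covariates will of course differ):

```r
library(survival)
fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)
zp <- cox.zph(fit)   # per-covariate test of proportional hazards
print(zp)            # small p-values flag violations
plot(zp)             # look for trends or outliers in the Schoenfeld residuals
```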
Another possible explanation is that this rather depends on the unit of measure you used to define your covariate. Can the magnitude of the coefficient be meaningfully interpreted? If so, you could simply re-scale/standardise/log-transform that covariate to obtain a 'more manageable' coefficient (if this is theoretically appropriate).
This could also be due to the so called 'complete separation', which has been discussed here and here.

Understand Regression results

I have a set of numerical features that describe a phenomenon at different time points. In order to evaluate the individual performance of each feature, I perform a linear regression with a leave one out validation, and I compute the correlations and errors to evaluate the results.
So for a single feature, it would be something like:
Input: Feature F = {F_t1, F_t2, ... F_tn}
Input: Phenomenon P = {P_t1, P_t2, ... P_tn}
Linear Regression of P according to F, plus leave one out.
Evaluation: Compute correlations (linear and spearman) and errors (mean absolute and root mean squared)
For some of the variables, both correlations are really good (> 0.9), but when I take a look at the predictions, I realize that they are all really close to the average of the values to predict, so the errors are large.
How is that possible?
Is there a way to fix it?
As a technical detail, I use the Weka linear regression with the option "-S 1" in order to disable feature selection.
This seems to happen because the relationship you want to regress is not linear, while you are using a linear approach. In that case it is possible to have good correlations and poor errors at the same time. It does not mean that the regression is wrong or really poor, but you have to be careful and investigate further.
In any case, a non-linear approach that minimizes the errors and maximizes the correlation is the way to go.
Moreover, outliers can also cause this problem.
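A toy example of why this can happen: correlation is invariant to shifting and scaling, so predictions squashed towards the mean can correlate perfectly with the targets while still being far from them (numbers invented):

```r
y    <- c(10, 20, 30, 40, 50)
pred <- mean(y) + 0.1 * (y - mean(y))   # predictions hug the average

cor(y, pred)                # exactly 1
sqrt(mean((y - pred)^2))    # RMSE is about 12.7, i.e. large
```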

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 predictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.
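A rough sketch of the LASSO route, assuming the glmnet package is installed (the data here are simulated, with only 2 of the 20 predictors truly relevant):

```r
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)    # only 2 of 20 predictors matter

cv <- cv.glmnet(X, y, alpha = 1)       # alpha = 1 -> LASSO, lambda by CV
coef(cv, s = "lambda.1se")             # most coefficients shrunk to exactly 0
```

Unlike the manual AIC search, this fits one regularisation path and lets cross-validation pick the penalty.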

How to deal with heteroscedasticity in OLS with R

I am fitting a standard multiple regression with OLS method. I have 5 predictors (2 continuous and 3 categorical) plus 2 two-way interaction terms. I did regression diagnostics using residuals vs. fitted plot. Heteroscedasticity is quite evident, which is also confirmed by bptest().
I don't know what to do next. First, my dependent variable is reasonably symmetric (I don't think I need to try transformations of my DV). My continuous predictors are also not highly skewed. I want to use weights in lm(); however, how do I know what weights to use?
Is there a way to automatically generate weights for performing weighted least squares, or are there other ways to go about it?
One obvious way to deal with heteroscedasticity is the estimation of heteroscedasticity-consistent standard errors. Most often they are referred to as robust or White standard errors.
You can obtain robust standard errors in R in several ways. The following page describes one possible and simple way to obtain robust standard errors in R:
https://economictheoryblog.com/2016/08/08/robust-standard-errors-in-r
However, sometimes there are more subtle and often more precise ways to deal with heteroscedasticity. For instance, you might encounter grouped data and find yourself in a situation where error variances are heterogeneous across your dataset, but homogeneous within groups (clusters). In this case you might want to apply clustered standard errors. See the following link to calculate clustered standard errors in R:
https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r
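Concretely, assuming the sandwich and lmtest packages are installed, the usual recipe looks like this (mtcars is just a stand-in for your data, and clustering by cyl is purely illustrative):

```r
library(lmtest)
library(sandwich)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# heteroscedasticity-consistent (White/HC) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))

# clustered standard errors (clustering by cyl here, for illustration only)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ cyl))
```

The coefficients are identical to summary(fit); only the standard errors, t-values, and p-values change.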
What is your sample size? I would suggest that you make your standard errors robust to heteroskedasticity, but that you do not worry about heteroskedasticity otherwise. The reason is that with or without heteroskedasticity, your parameter estimates are unbiased (i.e. they are fine as they are). The only thing that is affected (in linear models!) is the variance-covariance matrix, i.e. the standard errors of your parameter estimates will be affected. Unless you only care about prediction, adjusting the standard errors to be robust to heteroskedasticity should be enough.
See e.g. here how to do this in R.
Btw, for your solution with weights (which is not what I would recommend), you may want to look into ?gls from the nlme package.
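For completeness, a minimal gls sketch with simulated data (varPower lets gls estimate the variance function instead of you supplying fixed weights):

```r
library(nlme)
set.seed(1)
x <- 1:100
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)  # variance grows with x

fit <- gls(y ~ x, weights = varPower(form = ~ x))
summary(fit)$tTable   # coefficients with model-based standard errors
```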

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they do give "residual std error" and "residual sum of squares".
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
R-squared is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be poisson distributed and a generalized linear model (glm) in the poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
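These two checks are a couple of lines in R; shown here with lm on the built-in cars data for brevity (for an nls fit, use residuals(fit) in place of rstandard):

```r
fit <- lm(dist ~ speed, data = cars)

# standardized residuals vs fitted values: look for a fan shape
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals
qqnorm(rstandard(fit)); qqline(rstandard(fit))
```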
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
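For two nested fits, anova() computes exactly this kind of F comparison; illustrated with lm on simulated data for brevity (anova works the same way on nested nls fits):

```r
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 3 * x + 0.2 * x^2 + rnorm(50)

m1 <- lm(y ~ x)             # fewer parameters
m2 <- lm(y ~ x + I(x^2))    # one extra parameter
anova(m1, m2)               # F-test: does the extra term earn its keep?
```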
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
