R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently checking the assumptions of the model: constant variance of the residuals and absence of auto-correlation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod) 
Durbin Watson test for auto-correlation: durbinWatsonTest(lmMod)
I found examples that test either one independent variable at a time:
example for Breusch-Pagan test - one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test - multiple independent variables:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
2. If the answer to 1 is yes, how can I determine which variable is causing the heteroscedasticity or auto-correlation so that I can fix it, given that each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the tests would have to be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been fitted. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will differ from that of a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df)
Thank you very much in advance for your help!

Here is a first attempt at an answer.
To the first question: yes, you can use the Breusch-Pagan test and the Durbin-Watson test for multivariate models. (However, I have always used dwtest() instead of durbinWatsonTest().)
Also note that dwtest() checks only for first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or auto-correlation. However, if you encounter these problems, one possible solution is to use robust standard errors: coeftest(model, vcov = NeweyWest) for autocorrelation, or coeftest(model, vcov = vcovHC) for heteroscedasticity. Loading the AER package makes all of these available (coeftest() comes from lmtest, NeweyWest() and vcovHC() from sandwich).
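A minimal sketch of the workflow described above, run on simulated data. The data frame, the choice of three predictors, and the way the noise grows with var1 are assumptions made for illustration, not part of the original question. The auxiliary regression is added as one possible way to get a hint about which predictor is related to the changing variance; it is a diagnostic aid, not a definitive answer to question 2.

```r
library(lmtest)    # bptest(), dwtest(), coeftest()
library(sandwich)  # vcovHC(), NeweyWest()

set.seed(1)
n  <- 200
df <- data.frame(var1 = runif(n, 1, 10), var2 = rnorm(n), var3 = rnorm(n))
# Simulated response (assumption): the error spread grows with var1
df$depend_var <- 1 + 2 * df$var1 - df$var2 + 0.5 * df$var3 +
  rnorm(n, sd = df$var1)

lmMod <- lm(depend_var ~ var1 + var2 + var3, data = df)

# Both tests accept the whole multi-predictor model and return one p-value
bp <- bptest(lmMod)   # studentized Breusch-Pagan test (the default)
dw <- dwtest(lmMod)   # Durbin-Watson test, first-order autocorrelation only
bp
dw

# Auxiliary regression behind the studentized Breusch-Pagan test:
# squared residuals regressed on the same predictors. Its coefficient
# table hints at which predictor the residual variance moves with.
aux <- lm(resid(lmMod)^2 ~ var1 + var2 + var3, data = df)
summary(aux)$coefficients

# Same coefficient estimates, robust standard errors
coeftest(lmMod, vcov = vcovHC(lmMod))     # heteroscedasticity-consistent
coeftest(lmMod, vcov = NeweyWest(lmMod))  # also allows for autocorrelation
```

The robust versions leave the estimated coefficients unchanged and only replace the standard errors, so they address inference rather than the fit itself.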

Related

R: Using relative importance (relaimpo package) to build a linear model for prediction?

I have a huge dataset and I'm trying to build a good predictive linear model using the relaimpo package.
Using the calc.relimp function with type="lmg", I get an output of the relative importance of each variable. Although the proportion of variance explained by the model is only 52%, I want to go ahead and build a linear model using these variables.
Is there a way to build a lm model using these variables and somehow take into account the relative importance values into the model?
I'm not too familiar with this and was thinking maybe something along the lines of weighting each variable based on its relative importance value...?
I'm not a statistician, so I won't give you any Greek symbols, but I think you are confusing a few things.
As you correctly say, the relative importances based on the LMG method are more or less some sort of variance decomposition in case of correlated predictor variables, i.e. it tells you how much of your variance in the model is explained by which predictor.
However, this has nothing to do with the lm function and its estimation itself. In fact, the R² of your lm model is exactly what you get by summing up the relative importances from calc.relimp (as long as they are not rescaled to sum to 100%).
There is no way to tell the lm function to pay more attention to a certain predictor during prediction/estimation.
What you probably want to do is an elastic net (which is a combination of LASSO and RIDGE regression), which basically does what you want, i.e. it shrinks the impact of "unimportant"/small predictors and emphasizes the impact of important/large predictors: https://en.wikipedia.org/wiki/Elastic_net_regularization (Lasso and Ridge regression are linked in the Wikipedia article).
I think this one here is the original package from Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al.: https://cran.r-project.org/web/packages/glmnet/index.html
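A minimal sketch of an elastic net fit with glmnet, on simulated data. The predictor matrix, the response, and the choice alpha = 0.5 are assumptions made for illustration; in practice alpha would usually be tuned as well.

```r
library(glmnet)

set.seed(42)
n <- 200
p <- 10
# glmnet expects a numeric predictor matrix, not a formula
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("var", 1:p)))
# Simulated response (assumption): only var1 and var2 carry signal
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)

# alpha = 0 is ridge, alpha = 1 is lasso; values in between mix both penalties.
# cv.glmnet() picks the penalty strength lambda by cross-validation.
cvfit <- cv.glmnet(x, y, alpha = 0.5)

# Coefficients at the more strongly regularized of the two usual lambda choices
coefs <- coef(cvfit, s = "lambda.1se")
coefs

# Predictions for new observations (here simply the first five rows)
pred <- predict(cvfit, newx = x[1:5, ], s = "lambda.1se")
```

Note that the penalty shrinks coefficients based on the data, not on the relative importance values from calc.relimp; the two approaches are separate.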

GLM & LM assumptions and interpretation

I am currently running some linear models and lmer (with replicate as a random effect) for continuous data and a glm and glmer (again, replicate as a random effect) for count data.
I was wondering if a lm, lmer, glm and glmer all need the data to be normally distributed and if not, do I need an alternative test?
Also, I have run a glm and looked at the pairwise differences, but when reporting the results I don't know what to report other than P<0.001, as my glm output doesn't give me much more. Thanks!

How to interpret Result of two way anova test in R?

m1 <- lm(AmountSpent~Catalogs*Salary,data=d)
summary(m1)
m2<-lm(AmountSpent~Catalogs+Salary,data=d)
summary(m2)
anova(m2,m1,test="Chisq")
The output is as follows
Which is the better model according to the test? Is the order in which we pass the models to the function important? Please explain the statistical concept behind this test.
The Chi-square test checks whether the reduction in the residual sum of squares between the nested linear models is statistically significant. From your R output you can see that adding a term to the regression resulted in a statistically better model (the one with the lower RSS value, Model 2).
It is usual to start the comparison with the simpler model and then add terms; however, the docs also mention that this is up to the user.
You should take a look at the docs of the anova.lm function (?anova.lm).
For comparing models that are not nested, use the AIC or BIC criteria instead.
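A minimal sketch of the comparison on simulated data, since the original output is not shown. The data frame d and the size of the interaction effect are assumptions made for illustration; only base R is used.

```r
set.seed(7)
n <- 300
d <- data.frame(Catalogs = sample(c(6, 12, 18, 24), n, replace = TRUE),
                Salary   = runif(n, 20000, 120000))
# Simulated response (assumption): the effect of Catalogs depends on Salary
d$AmountSpent <- 100 + 10 * d$Catalogs + 0.01 * d$Salary +
  0.0005 * d$Catalogs * d$Salary + rnorm(n, sd = 100)

m2 <- lm(AmountSpent ~ Catalogs + Salary, data = d)  # additive model
m1 <- lm(AmountSpent ~ Catalogs * Salary, data = d)  # adds the interaction

# Simpler model first: row 2 of the table then reports the drop in RSS,
# the degrees of freedom used, and the p-value for the added term
cmp <- anova(m2, m1, test = "Chisq")
cmp

# Information criteria also work for models that are not nested
AIC(m2, m1)
BIC(m2, m1)
```

A small p-value in the second row suggests that the interaction term reduces the residual sum of squares by more than chance alone would explain.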

can we get probabilities the same way that we get them in logistic regression through random forest?

I have a data structure with binary 0-1 variable (click & Purchase; click & not-purchase) against a vector of the attributes. I used logistic regression to get the probabilities of the purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression? or is it Random Forest classification with type='prob' in R which gives the probability of categorical variable?
It won't give you the same result, since the structure of the two methods is different. Logistic regression is given by an explicit linear specification, whereas RF is a collective vote from multiple independent/random trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here are the major differences between the two:
RF will give a fit that is more robust against noise, outliers, overfitting, multicollinearity, etc., which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what's going on with the input data, RF is a good start.
Logistic regression will be good if you know the data well and how to properly specify the equation, or if you want to engineer how the fit/prediction works. The explicit form of the GLM specification allows you to do that.
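A minimal sketch of getting class probabilities from both methods, on simulated data. The variable names x1, x2 and purchase are hypothetical. It assumes the randomForest package and uses classification (a factor response) with predict(..., type = "prob"), which is one of the two options raised in the question.

```r
library(randomForest)

set.seed(123)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Simulated 0/1 outcome (assumption); stored as a factor so that
# randomForest() runs classification rather than regression
df$purchase <- factor(rbinom(n, 1, plogis(-0.5 + 1.5 * df$x1 - df$x2)))

# Logistic regression: fitted probabilities of purchase = 1
logit <- glm(purchase ~ x1 + x2, data = df, family = binomial)
p_glm <- predict(logit, newdata = df, type = "response")

# Random forest: fraction of trees voting for each class.
# Without newdata, predict() returns out-of-bag vote fractions.
rf   <- randomForest(purchase ~ x1 + x2, data = df, ntree = 500)
p_rf <- predict(rf, type = "prob")[, "1"]

# The two sets of probabilities are related but not identical
cor(p_glm, p_rf)
```

The random forest values are vote fractions rather than probabilities from a fitted model equation, so they are usually comparable to the logistic regression output but not on exactly the same footing.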

caret: Linear Model parameter estimates (mean and s.e.) via LOOCV

I am wondering if there is a way to extract out all of the lm parameter estimates from the results of the cross-validation training runs from caret::train() with an lm model.
I have a gist of the R code I used to do some of this checking, where I directly access the train() output object and get the data.frame indexes used in each cross-validation run. But I was wondering if there were already functions that access this for me, because I would think that 1.) if it was a good or reasonable idea, the functionality would be there, or 2.) if the functionality is not there, my desire to do this may not be a good idea.
As a second part of the question, you can see in the gist that when I compute the mean and standard error of the single linear parameter over all of the cross-validation parameter estimates, that the mean of the CV parameter estimates and the linear model on a fit of the entire training data set match up well, but the standard error from the CV estimates is much smaller than that from the estimate from the single lm run on the whole training set. Is that expected, or am I computing/thinking about that wrong?
EDIT: I think the second part can be found by reading this answer.
Thanks in advance,
Matt
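The computation described in the second part can be sketched in base R, without caret, on simulated data. The data and the manual leave-one-out loop are assumptions for illustration and do not use or reproduce caret's internals. The sketch suggests why the spread across folds comes out so small: each leave-one-out fit shares all but one observation with every other fit, so the estimates barely differ. The jackknife formula rescales that spread into something comparable to the usual standard error.

```r
set.seed(99)
n  <- 50
df <- data.frame(x = rnorm(n))
df$y <- 2 + 3 * df$x + rnorm(n)   # simulated data (assumption)

# Single fit on the whole training set
full    <- lm(y ~ x, data = df)
se_full <- summary(full)$coefficients["x", "Std. Error"]

# Refit n times, leaving out one row each time (what LOOCV does)
loo_slopes <- sapply(seq_len(n), function(i) {
  coef(lm(y ~ x, data = df[-i, ]))[["x"]]
})

mean_loo <- mean(loo_slopes)   # close to coef(full)[["x"]]
sd_loo   <- sd(loo_slopes)     # spread across folds, much smaller than se_full

# Jackknife standard error: rescales the fold-to-fold spread
se_jack <- sqrt((n - 1) / n * sum((loo_slopes - mean_loo)^2))

c(full = coef(full)[["x"]], mean_loo = mean_loo)
c(se_full = se_full, sd_loo = sd_loo, se_jack = se_jack)
```

So the raw standard deviation of the fold estimates is not an estimate of the coefficient's standard error; it needs the jackknife rescaling before the two are comparable.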
