Has the Heteroskedasticity been resolved? - r

I fitted a linear regression model with two continuous variables, income and expense. Income is the independent variable and expense is the dependent variable. I initially suspected heteroskedasticity from the spread of the data and then ran a Breusch-Pagan test on the fitted model, which gave p-value < 2.2e-16. Since this is below the 0.05 significance level, I rejected the null hypothesis of homoskedasticity and concluded that heteroskedasticity is present.
To try to correct the heteroskedasticity, I applied a Box-Cox transformation to the dependent variable using the following code:
library(MASS) # provides boxcox()
lmodI <- lm(expense ~ income, data = newexcel) # my original model
boxcox(lmodI, lambda = seq(0, 0.5, 0.1)) # found the ideal lambda value to be 0.35
newexcel$Yprime <- newexcel$expense^0.35 # added the transformed variable "Yprime" to the data frame
lmodINew <- lm(Yprime ~ income, data = newexcel) # created the new linear model
I then decided to compare the old model to the new to see if I had corrected the heteroskedasticity - creating the following diagnostic plots:
original model:
new model:
I also ran the Breusch-Pagan test on the new model and the p-value stayed the same, at p-value < 2.2e-16. This, and the fact that I couldn't see much of a difference between the two sets of diagnostic plots, has confused me, as I expected the transformation to fix the heteroskedasticity.
I expected the p-value for the new model to be higher than 0.05 so I couldn't reject the null hypothesis and thus have homoskedasticity. Have I done something wrong during the box-cox transformation?

From your plots it seems you have a couple of hundred observations. Remember that the Breusch-Pagan statistic is essentially the number of observations times R-squared, where the R-squared comes from the auxiliary regression of the squared residuals on the regressors (see eqn. [8.16] in Wooldridge 2015). If n is large, the test has enough power to reject the null hypothesis even when the departure from homoskedasticity is small, so a tiny p-value on its own does not tell you how severe the problem is.
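To make the "n times R-squared" description concrete, here is a minimal sketch in base R on simulated data. The variable names, sample size, and variance structure are made up for illustration and are not taken from the question. It computes the LM form of the statistic by hand from the auxiliary regression.

```r
# Simulated data (hypothetical): the error standard deviation grows with x
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressor
aux <- lm(resid(fit)^2 ~ x)

# LM statistic: n times the R-squared of the auxiliary regression
lm_stat <- n * summary(aux)$r.squared

# Under the null it is approximately chi-squared with df = number of regressors
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

lm_stat
p_value
```

Because n multiplies the R-squared directly, the same auxiliary R-squared gives a larger statistic as the sample grows. That is why it helps to look at the diagnostic plots alongside the p-value.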

Related

Error in glsEstimate(object, control = control) : computed "gls" fit is singular, rank 19

This is my first time asking in the forums; I couldn't find a solution in other answers.
I'm just starting to learn R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between insect species (SP) and temperature (T), the explanatory variables, and the femur area of the resulting adult (Femur.area), the response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
library(nlme) # provides gls() and varPower()
modelo_varPower <- gls(Femur.area ~ T*SP,
                       weights = varPower(),
                       data = Datos)
I get the following errors...
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro test of normality, could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model with another response variable and had no errors. Everything I can find in the forums has to do with repeated sampling over a period of time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of what the table looks like in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area~SP*T,data=Datos)) you'll see some NA values for the missing interactions). gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than Datos.)
dd3 <- transform(na.omit(dd), SPT=droplevels(interaction(SP,T)))
library(nlme)
gls(Femur.area~SPT,weights=varPower(form=~fitted(.)),data=dd3)
If you want the main effects and the interaction term and the power-law variance that's possible, but it's harder.
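The missing-cell behaviour can be reproduced with a small simulated data set. This is only a sketch: the data below are made up, temperature is treated as a factor (an assumption, matching the use of interaction(SP,T) above), and plain lm() is used so that only base R is needed.

```r
# Hypothetical balanced design: 3 species x 3 temperatures, 5 replicates each
set.seed(42)
dd <- expand.grid(SP = factor(c("A", "B", "C")),
                  T = factor(c(20, 25, 30)),
                  rep = 1:5)
dd$Femur.area <- rnorm(nrow(dd), mean = 10)

# Remove one cell entirely, so species C is never observed at 30 degrees
dd <- subset(dd, !(SP == "C" & T == "30"))

# Two-way model: lm() fits it but reports NA for the inestimable interaction
fit_two_way <- lm(Femur.area ~ SP * T, data = dd)
coef(fit_two_way)

# One-way model on the observed SP x T combinations only
dd$SPT <- droplevels(interaction(dd$SP, dd$T))
fit_one_way <- lm(Femur.area ~ SPT, data = dd)
coef(fit_one_way)
```

The two-way fit has one NA coefficient, for the empty cell. The one-way fit has a coefficient for each of the eight observed combinations and is full rank, which is what gls() needs.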

What post-hoc test should be used for a glmer model with a continuous and a categorical predictor variable?

I'm a bit of a newbie with stats and R, so I need some direction to find a suitable post-hoc test for my glmer model.
The model has a binary dependent variable (absent/present), and the predictors are interaction terms between a continuous variable (e.g. temp) and a categorical variable (species, n=3). Only the interaction terms, rather than the continuous variables on their own, produce significant results when an anova is run on the model. Species by itself has a large effect because one species is much rarer than the others. I'm trying to tease apart how the presence of these species varies across pH and between species.
I've tried lsmeans with Tukey, emmeans, and Firth's bias-reduced logistic regression (logistf). I ran the effects function on the interaction terms, so I had a rough expectation of what a post-hoc test could show, but the results from logistf were not what I expected. emmeans and Tukey both gave the same results and ignored the continuous variable, I assume because it's not a factor.
When I run Firth's regression, it produces chi-squared values that are infinite or p-values that are astronomically small, even though what I saw through effects suggested no significant difference. With the interaction term, I can't tell whether there truly is an effect of the environmental variable or whether the significant effect is due to the difference between species. Based on what I have seen of the logistf function, I didn't think it would produce a chi-squared score. Is this an issue with my code, or is it because of my data?
If I wasn't clear enough about something please let me know and if anyone has any suggestions or advice, they would be massively appreciated. Thanks!
The model and test code I used are below:
###glmer model
Large<-glmer(Abs.Pres~ Species:Q.Depth+Species:Conductivity+Species:Temp+Species:pH+Species:DO.P+(1|QID),
nAGQ=0,
family=binomial,
data=Stacked_Pref)
anova(Large)
Output:
Analysis of Variance Table
                     npar  Sum Sq Mean Sq F value
Species:Q.Depth         3 234.904  78.301 78.3014
Species:Conductivity    3  32.991  10.997 10.9970
Species:Temp            3  39.001  13.000 13.0004
Species:pH              3  25.369   8.456  8.4562
Species:DO.P            3  34.930  11.643 11.6434
###Firths
Lp<-logistf(Abs.Pres~Species:pH, data=Stacked_Pref, contrasts.arg=list(pH="contr.treatment", Species="contr.sum"))
> Lp
logistf(formula = Abs.Pres ~ Species:pH, data = Stacked_Pref,
contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood
                         coef   se(coef) lower 0.95 upper 0.95    Chisq            p
(Intercept)         1.9711411 0.57309880  0.8552342  3.1015114 12.09107 5.066380e-04
SpeciesGoby:pH     -0.3393185 0.07146049 -0.4804047 -0.2003108 23.31954 1.371993e-06
SpeciesMosquito:pH -0.3001385 0.07127771 -0.4408186 -0.1614419 18.24981 1.937453e-05
SpeciesRFBE:pH     -0.4771393 0.07232469 -0.6200179 -0.3365343 45.73750 1.352096e-11
Likelihood ratio test=267.0212 on 3 df, p=0, n=3945
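One point that may help separate the two explanations: a formula with only Species:pH terms has a single shared intercept, so differences in overall prevalence between species can only be absorbed by the species-specific slopes. The sketch below illustrates this on simulated data in which pH has no effect by construction. All values are hypothetical, and plain glm() is used instead of glmer() or logistf() to keep it to base R, so it is an illustration of the idea rather than a re-analysis of the model above.

```r
# Hypothetical data: species differ in prevalence, pH has NO effect
set.seed(123)
n <- 3000
Species <- factor(sample(c("Goby", "Mosquito", "RFBE"), n, replace = TRUE))
pH <- runif(n, 6, 9)
base_logit <- c(Goby = 1, Mosquito = 0, RFBE = -2)[as.character(Species)]
Abs.Pres <- rbinom(n, 1, plogis(base_logit))

m_null    <- glm(Abs.Pres ~ 1, family = binomial)
m_slopes  <- glm(Abs.Pres ~ Species:pH, family = binomial)  # shared intercept
m_species <- glm(Abs.Pres ~ Species, family = binomial)
m_full    <- glm(Abs.Pres ~ Species * pH, family = binomial)

# Slopes-only model against the null: picks up the species differences
lr_slopes_only <- anova(m_null, m_slopes, test = "Chisq")
p_slopes_only <- lr_slopes_only$`Pr(>Chi)`[2]

# Does pH add anything once species has its own intercept?
lr_ph_given_species <- anova(m_species, m_full, test = "Chisq")
p_ph_given_species <- lr_ph_given_species$`Pr(>Chi)`[2]

p_slopes_only
p_ph_given_species
```

If the second comparison is the question of interest, including the Species main effect in the model before interpreting the interaction terms may be worth trying.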

How to test significance of polynomial (linear) trends among groups with unequal variances?

I am testing for a linear trend among several groups. However, my data violate the assumption of equal variance among groups (tested with Levene's test for homogeneity of variance).
In SPSS, along with the significance of linear trend assuming equal variance, there is automatic output for significance where equal variances are not assumed. What 'test' or 'adjustment' is being done? Can I do this in R, and how?
Image of SPSS output: (https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSZRs3EM3wJz5raHhav-LLBTmTyfLJO0z4xHDEzI-3uI15BoBQ5)
I'm struggling to find out what exactly SPSS is doing. Could it be some sort of Welch correction?
# Test homogeneity of variance (leveneTest() is in the car package)
library(car)
leveneTest(ICECAP_A ~ SFMental_f, data = SCI)
# p < 0.001, so we reject the null of homogeneity of variance.
# Use the built-in contr.poly() function: tell R to use a polynomial contrast matrix for 5 levels/groups
contrasts(SCI$SFMental_f) <- contr.poly(n = 5)
# Fit the ANOVA
anova.SFMental <- aov(ICECAP_A ~ SFMental_f, data = SCI)
# Print the output, splitting out the linear trend
summary.aov(anova.SFMental, split = list(SFMental_f = list("Linear" = 1)))
Now I have the significance for linear trend. How do I get the significance if we do NOT assume equal variances?
It seems that SPSS applies a correction based on the Welch-Satterthwaite equation. Thanks to Andy Field for the tip. I could not find a direct R equivalent, so I constructed the contrasts in the usual way and ran a robust model with lmRob() instead.
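For reference, a contrast test with Welch-Satterthwaite degrees of freedom can be computed by hand in base R. The sketch below is an assumption about what the "equal variances not assumed" row corresponds to, not a confirmed reproduction of the SPSS output. The data are simulated and every variable name is hypothetical.

```r
# Hypothetical data: 5 groups with unequal sizes and unequal variances
set.seed(2024)
group <- factor(rep(1:5, times = c(20, 25, 30, 25, 20)))
sds <- c(1, 1.5, 2, 3, 4)
y <- rnorm(length(group), mean = as.integer(group), sd = sds[as.integer(group)])

# Linear polynomial contrast weights for 5 groups
w <- contr.poly(5)[, 1]

m <- tapply(y, group, mean)    # group means
v <- tapply(y, group, var)     # group variances
n <- tapply(y, group, length)  # group sizes

# Contrast estimate and its standard error without pooling the variances
est <- sum(w * m)
parts <- w^2 * v / n
se <- sqrt(sum(parts))

# Welch-Satterthwaite approximation to the degrees of freedom
df_ws <- sum(parts)^2 / sum(parts^2 / (n - 1))

t_stat <- est / se
p_value <- 2 * pt(abs(t_stat), df = df_ws, lower.tail = FALSE)

c(estimate = est, se = se, t = t_stat, df = df_ws, p = p_value)
```

The approximate degrees of freedom always fall between the smallest group's n - 1 and the pooled residual degrees of freedom, so the result is never more liberal than the equal-variance test in terms of df.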

R: Linear regression model does not work very well

I'm using R to fit a linear regression model and then use it to predict values, but it does not predict boundary values very well. Do you know how to fix this?
ZLFPS is:
ZLFPS<-c(27.06,25.31,24.1,23.34,22.35,21.66,21.23,21.02,20.77,20.11,20.07,19.7,19.64,19.08,18.77,18.44,18.24,18.02,17.61,17.58,16.98,19.43,18.29,17.35,16.57,15.98,15.5,15.33,14.87,14.84,14.46,14.25,14.17,14.09,13.82,13.77,13.76,13.71,13.35,13.34,13.14,13.05,25.11,23.49,22.51,21.53,20.53,19.61,19.17,18.72,18.08,17.95,17.77,17.74,17.7,17.62,17.45,17.17,17.06,16.9,16.68,16.65,16.25,19.49,18.17,17.17,16.35,15.68,15.07,14.53,14.01,13.6,13.18,13.11,12.97,12.96,12.95,12.94,12.9,12.84,12.83,12.79,12.7,12.68,27.41,25.39,23.98,22.71,21.39,20.76,19.74,19.49,19.12,18.67,18.35,18.15,17.84,17.67,17.65,17.48,17.44,17.05,16.72,16.46,16.13,23.07,21.33,20.09,18.96,17.74,17.16,16.43,15.78,15.27,15.06,14.75,14.69,14.69,14.6,14.55,14.53,14.5,14.25,14.23,14.07,14.05,29.89,27.18,25.75,24.23,23.23,21.94,21.32,20.69,20.35,19.62,19.49,19.45,19,18.86,18.82,18.19,18.06,17.93,17.56,17.48,17.11,23.66,21.65,19.99,18.52,17.22,16.29,15.53,14.95,14.32,14.04,13.85,13.82,13.72,13.64,13.5,13.5,13.43,13.39,13.28,13.25,13.21,26.32,24.97,23.27,22.86,21.12,20.74,20.4,19.93,19.71,19.35,19.25,18.99,18.99,18.88,18.84,18.53,18.29,18.27,17.93,17.79,17.34,20.83,19.76,18.62,17.38,16.66,15.79,15.51,15.11,14.84,14.69,14.64,14.55,14.44,14.29,14.23,14.19,14.17,14.03,13.91,13.8,13.58,32.91,30.21,28.17,25.99,24.38,23.23,22.55,20.74,20.35,19.75,19.28,19.15,18.25,18.2,18.12,17.89,17.68,17.33,17.23,17.07,16.78,25.9,23.56,21.39,20.11,18.66,17.3,16.76,16.07,15.52,15.07,14.6,14.29,14.12,13.95,13.89,13.66,13.63,13.42,13.28,13.27,13.13,24.21,22.89,21.17,20.06,19.1,18.44,17.68,17.18,16.74,16.07,15.93,15.5,15.41,15.11,14.84,14.74,14.68,14.37,14.29,14.29,14.27,18.97,17.59,16.05,15.49,14.51,13.91,13.45,12.81,12.6,12,11.98,11.6,11.42,11.33,11.27,11.13,11.12,11.11,10.92,10.87,10.87,28.61,26.4,24.22,23.04,21.8,20.71,20.47,19.76,19.38,19.18,18.55,17.99,17.95,17.74,17.62,17.47,17.25,16.63,16.54,16.39,16.12,21.98,20.32,19.49,18.2,17.1,16.47,15.87,15.37,14.89,14.52,14.37,13.96,13.95,13.72,13.54,13.41,13.39,13.24,13.07,12.96,12.95,27.6,25.68,24.56,23.52,22.41,21.69,20.88,20.35,20.26,19.66,19.19,19.13,19.11,18.89,18.53,18.13,17.67,17.3,17.26,17.26,16.71,19.13,17.76,17.01,16.18,15.43,14.8,14.42,14,13.8,13.67,13.33,13.23,12.86,12.85,12.82,12.75,12.61,12.59,12.59,12.45,12.32)
QPZL<-c(36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16)
ZLDBFSAO<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
My model is:
fit32 <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO)
# coef() returns the four estimates: intercept, QPZL, QPZL^2, ZLDBFSAO
results3 <- coef(fit32)
first3 <- as.numeric(results3[1])
second3 <- as.numeric(results3[2])
third3 <- as.numeric(results3[3])
fourth3 <- as.numeric(results3[4])
# fitted model on the log scale with ZLDBFSAO fixed at 1; apply exp() to get FPS
f1 <- function(x) {first3 + second3*x + third3*x^2 + fourth3*1}
You can see my dataset here. This dataset contains the values that I have to predict. The FPS variation per QP is heterogeneous. See dataset. I added a new column.
The fitted dataset is a different one.
To test the model just write exp(f1(selected_QP)) where selected QP varies from 16 to 36. See the given dataset for QP values and the FPS value that the model should predict.
You can run the model online here.
When I use QP values in the middle of the range, say between 23 and 32, the model predicts the FPS value pretty well. Outside that range, the prediction error is large.
For the regression itself, I should use weighted least squares to deal with the heteroskedasticity of the fitted dataset. For references, see here, here and here.
fit32 <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO, weights = 1/(1 + 0.5*QPZL^2))
The other code remains the same. This model gives me lower prediction error than the previous.
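As a side note, predictions can also be obtained with predict() rather than a hand-built function, which avoids indexing coefficients by position. The following is a minimal sketch on a small simulated data set, not the data above. The data frame d, its generating formula, and new_qp are all hypothetical, and the weights are the ones from the weighted fit above.

```r
# Hypothetical stand-in data with the same column names as in the question
set.seed(7)
qp <- rep(16:36, times = 4)
sao <- rep(c(1, 2), each = 21, times = 2)
fps <- exp(2 + 0.002 * (qp - 16)^2 - 0.2 * (sao - 1) +
             rnorm(length(qp), sd = 0.03))
d <- data.frame(QPZL = qp, ZLDBFSAO = sao, ZLFPS = fps)

# Ordinary and weighted least squares fits on the log scale
fit_ols <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO, data = d)
fit_wls <- lm(log(ZLFPS) ~ poly(QPZL, 2, raw = TRUE) + ZLDBFSAO, data = d,
              weights = 1 / (1 + 0.5 * QPZL^2))

# Predict at the boundaries and in the middle, then back-transform with exp()
new_qp <- data.frame(QPZL = c(16, 26, 36), ZLDBFSAO = 1)
pred_ols <- exp(predict(fit_ols, newdata = new_qp))
pred_wls <- exp(predict(fit_wls, newdata = new_qp))

cbind(new_qp, pred_ols, pred_wls)
```

Comparing the two prediction columns at QP 16 and 36 against held-out values is one way to check whether the weighting actually helps at the boundaries.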

same regression, different statistics (R v. SAS)?

I ran the same probit regression in SAS and R and while my coefficient estimates are (essentially) equivalent, the reported test statistics are different. Specifically, SAS reports test statistics as t-statistics whereas R reports test statistics as z-statistics.
I checked my econometrics text and found (with little elaboration) that it reports probit results in terms of t statistics.
Which statistic is appropriate? And why does R differ from SAS?
Here's my SAS code:
proc qlim data=DavesData;
model y = x1 x2 x3/ discrete(d=probit);
run;
quit;
And here's my R code:
> model.1 <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"))
> summary(model.1)
Just to answer a little bit (it's seriously off topic, and the question should in fact be closed): neither the t-statistic nor the z-statistic is meaningful. They're related though, as Z is just the standard normal distribution and T is an adapted "close-to-normal" distribution that takes into account the fact that your sample is limited to n cases.
Now, both the z and the t statistic provide the significance for the null hypothesis that the respective coefficient is equal to zero. The standard error on the coefficients, used for that test, is based on the residual error. Using the link function, you practically transform your response in such a way that the residuals become normal again, whereas in fact the residuals represent the difference between the observed and the estimated proportion. Due to this transformation, calculation of the degrees of freedom for the T-statistic isn't useful anymore and hence R assumes the standard normal distribution for the test statistic.
Both results are asymptotically equivalent; R will just give slightly smaller p-values. It's a matter of debate, but if you look at proportion difference tests, they're also always done using the standard normal approximation (Z-test).
Which brings me back to the point that neither of these values actually has any meaning. If you want to know whether or not a variable has a significant contribution, with a p-value that actually says something, you use a Chi-squared test like the Likelihood Ratio test (LR), Score test or Wald test. In R, anova() gives you the likelihood ratio test; SAS also gives you the other two. But all three tests are essentially equivalent, and if they differ seriously it's time to look again at your data.
eg in R :
anova(model.1,test="Chisq")
For SAS : see the examples here for use of contrasts, getting the LR, Score or Wald test
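To see how small the difference between the two reference distributions is in practice, here is a sketch on simulated data. The data and coefficients are made up, and using the residual degrees of freedom for the t version is an assumption, since the question does not say which degrees of freedom SAS uses.

```r
# Hypothetical probit data with the same variable names as in the question
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.3 + 0.8 * x1 - 0.5 * x2))

model.1 <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"))
est <- coef(summary(model.1))

# Same statistic (estimate / standard error), two reference distributions
stat <- est[, "z value"]
p_z <- 2 * pnorm(abs(stat), lower.tail = FALSE)
p_t <- 2 * pt(abs(stat), df = df.residual(model.1), lower.tail = FALSE)

cbind(statistic = stat, p_z = p_z, p_t = p_t)

# Sequential likelihood ratio tests for each term
lr <- anova(model.1, test = "Chisq")
lr
```

The t-based p-values are never smaller than the normal-based ones, and with a couple of hundred residual degrees of freedom the two columns are nearly identical.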

Resources