Regression model produces low coefficient but high R-squared? - r

I'm doing research on right wing radicalization factors. Since I'm using survey data and have to account for oversampling, I'm using the survey package. Furthermore, I log-transformed the dependent variable to adjust for its right-skewed distribution. These are the relevant lines of code:
dat_wght <- svydesign(ids= ~1, data=dat, weights =~wghtpew)
mod1 <-svyglm(log(right) ~ religiosity, design = dat_wght)
For model 1, I get this regression output:
Call:
svyglm(formula = log(right) ~ religiosity,
design = dat_wght)
Survey design:
svydesign(ids = ~1, data = dat, weights = ~wghtpew)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.493398 0.016111 154.763 < 2e-16 ***
religiosity 0.016750 0.004091 4.094 4.43e-05 ***
Since it is a log model, I interpreted the coefficient as follows: if the independent variable religiosity increases by 1 unit, the dependent variable right increases by (exp(0.01675)-1)*100 = 1.69%. This would mean that there is indeed some kind of correlation, but the effect is very small. Is that correct so far?
Furthermore, I want to calculate R-squared. The svyglm model doesn't provide R-squared directly. However, you kindly pointed out that I can calculate it myself with:
total_var <-svyvar(~right, dat_wght)
resid_var_mod1 <- summary(mod1)$dispersion
rsq_mod1 <- 1-resid_var_mod1/total_var
rsq_mod1
However, the result I get is:
variance SE
right 0.99407 0.0028
How can that be? If the effect is very small, my model apparently isn't suited to explaining the variation in the dependent variable. R-squared should therefore also be very low and much closer to 0 than to 1, shouldn't it? Why is it so high then? Did I interpret my coefficients wrongly? Did I make any mistakes along the way?
I'm really grateful for every kind of advice! Thanks :)

Ok, it's simpler than everyone thought. Your R^2 calculation is just wrong.
You compute the residual variance of log(right) and divide it by the population variance of right. You need to use the population variance of log(right) as well.
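A minimal sketch of the scale mismatch, using simulated data and plain lm() rather than the survey design (the variable names mirror the question, but the data and effect sizes are made up for illustration). With the survey objects from the question, the analogous change would presumably be to take the variance of the logged variable, e.g. svyvar(~log(right), dat_wght), instead of svyvar(~right, dat_wght).

```r
set.seed(1)
n <- 500

# Simulated stand-ins for the survey variables (illustrative only)
religiosity <- sample(0:10, n, replace = TRUE)
right <- exp(2.5 + 0.017 * religiosity + rnorm(n, sd = 0.5))

fit <- lm(log(right) ~ religiosity)

# Residual variance lives on the log scale
resid_var <- summary(fit)$sigma^2

# Mismatched scales: log-scale residual variance vs. raw-scale total variance
rsq_wrong <- 1 - resid_var / var(right)

# Matching scales: both variances refer to log(right)
rsq_ok <- 1 - resid_var / var(log(right))

c(wrong = rsq_wrong, ok = rsq_ok)
```

Because var(right) is far larger than var(log(right)), the mismatched ratio is pushed towards 1 no matter how weak the predictor is. The matched version here coincides with the adjusted R-squared reported by summary(fit).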

Related

Is it normal that glmer returns no variance of intercept?

I am trying to run a multi-level model to account for the fact that votes for a country's presidential elections may be nested within groups (depending of voters' mother tongues, places of residence etc.). In order to do so, I use the glmer function of the lme4 package.
m1<-glmer(vote_DPP ~ 1 + (1 | county_city),
family = binomial(link="logit"), data = d3)
Here, my vote variable is binary, representing whether people vote for a given party (1) or not (0). Since I believe results may change depending on people's state of residence, I want to allow intercepts to vary across states. However, I see no variation of intercept when I run my code.
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: vote_DPP ~ 1 + (1 | county_city)
Data: d3
AIC BIC logLik deviance df.resid
1746.7918 1757.2001 -871.3959 1742.7918 1343
Random effects:
Groups Name Std.Dev.
county_city (Intercept) 0.2559
Number of obs: 1345, groups: county_city, 17
Fixed Effects:
(Intercept)
0.5937
What puzzles me here is the complete absence of variance column. I have seen other forums on the web regarding problems with variance = 0, but I cannot seem to find anything about the complete disappearance of this column (which makes me think it's probably something very simple I missed). First time posting in here, and quite a beginner in R and Stats, so any help would be appreciated :)
If you're concerned about whether the variance is zero, that's equivalent to checking whether the standard deviation is zero (similarly for "is the std dev/variance small?", although in this case they will be on different scales). Furthermore, if the std dev/variance is zero or nearly zero, you should get a "singular fit" message as well.
Roland's comment is correct that summary() will print both the standard deviation and the variance by default. You can ask for both (or either) in the output of print() as well by specifying the ranef.comp (random effect component) argument:
library(lme4)
gm1 <- glmer(incidence/size ~ period + (1|herd),
data = cbpp,
weights = size,
family = binomial)
print(gm1, ranef.comp = c("Std.Dev.", "Variance"))
## ...
## Random effects:
## Groups Name Std.Dev. Variance
## herd (Intercept) 0.6421 0.4123
## ...
(You can similarly modify which components are shown in the summary printout: for example if you only want to see the variance, you can specify print(summary(gm1), ranef.comp = c("Variance")).)
For a bit more context: the standard deviation and variance are essentially redundant information (the standard errors of the estimates of the random effects are not shown because they can be unreliable estimates of uncertainty in this case). Which form is more useful depends on the application: standard deviations are easier to compare to the corresponding fixed effects, while variances can sometimes be used to draw conclusions about the partitioning of variance across effects (although doing this is more complicated than in the classic linear, balanced ANOVA case).
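If the numbers are needed programmatically rather than in the printout, one option is to pull them from VarCorr(). A small sketch on the cbpp data that ships with lme4; the model is written with a cbind() response here, which should be equivalent to the proportion-plus-weights form used above.

```r
library(lme4)

# Binomial GLMM with a random intercept per herd (cbpp ships with lme4)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# One row per variance component:
# 'vcov' holds the variance, 'sdcor' the standard deviation
vc <- as.data.frame(VarCorr(gm1))
vc[, c("grp", "var1", "vcov", "sdcor")]

# The two columns carry the same information
all.equal(vc$vcov, vc$sdcor^2)
```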

Newey West and paired t test to correct for autocorrelation

I have been looking extensively for the following.
Is there a way to use the Newey-West (1994) estimator for a paired t-test in R?
t.test() gives me correct t values, but I then want to correct them for autocorrelation, which does not seem possible.
With coeftest() there is a way to use the Newey-West correction, but only for an independent t-test, not a paired t-test!
x <-rnorm(100)
k <-rnorm(100)
t.test(x,k, paired=TRUE)
Now let's assume I know there is autocorrelation in my data (x and k) and therefore I want to use the Newey West estimator to correct for that.
Anyone any idea to do that with t.test?
Alternatively one can use the following:
fit<-lm(x~k)
coeftest(fit)
This is an independent t test. Anyone any idea how to make a paired t test with coeftest()?
Next, one can embed NeweyWest estimator in the t test to adjust for autocorrelation.
coeftest(fit, df=Inf, vcov=NeweyWest)
Again, I want to do this with a paired t test.
If anyone has any insight, please let me know.
Notice any similarities? I still have not figured out the issues you are concerned about, but I don't really agree that the regression corresponds to an independent t-test. With an independent t-test there is no item-by-item pairing, whereas in regression there most definitely is such a pairing.
t.test(x,k, paired=TRUE)
#--------------------
Paired t-test
data: x and k
t = -0.6008, df = 99, p-value = 0.5493
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3746782 0.2005112
sample estimates:
mean of the differences
-0.0870835
#------------------
library(lmtest)
fit<-lm( (x-k) ~ 1)
lmtest::coeftest(fit)
#------------------
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.087083 0.144941 -0.6008 0.5493
You can perform a paired t-test in regression settings by taking the difference in outcomes for each pair and regressing it on the vector of ones. Once you have cast the problem this way, it is easy to use standard functionalities from "lmtest" library in R including the NeweyWest estimator. A code snippet follows:
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()
fit <- lm(x - k ~ 1)
coeftest(fit, df = Inf, vcov = NeweyWest)
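To see the approach end to end, here is a hedged sketch on simulated data in which one series is autocorrelated (the AR coefficient and sample size are arbitrary choices for illustration). It checks that the intercept-only regression reproduces the paired t statistic, then applies the Newey-West covariance.

```r
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()

set.seed(42)
n <- 200

# Simulated series; x is autocorrelated, so the paired differences are too
x <- as.numeric(arima.sim(list(ar = 0.6), n = n))
k <- rnorm(n)

# Paired t-test recast as a regression of the differences on a constant
fit <- lm(I(x - k) ~ 1)

naive <- coeftest(fit)                                # ordinary standard error
hac   <- coeftest(fit, df = Inf, vcov = NeweyWest)    # Newey-West standard error

# The naive regression t value matches the paired t-test statistic
t_paired <- unname(t.test(x, k, paired = TRUE)$statistic)
all.equal(t_paired, unname(naive[1, "t value"]))

naive
hac
```

The point estimate is identical in both tables; only the standard error, and hence the test statistic and p-value, changes.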

How to get an estimate and confidence interval for a contrast in R with offset

I've got a Poisson GLM model fitted in R, looking something like this:
Fit <- glm(Outcome ~ Exposure + Var1 + offset(log(persontime)), family = poisson, data = G)
Where Outcome will end up being a rate, the Exposure is a continuous variable, and Var1 is a factor with three levels.
It's easy enough from the output of that:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.6998 0.1963 -29.029 < 2e-16
Exposure 4.7482 1.0793 4.399 1.09e-05
Var1Thing1 -0.2930 0.2008 -1.459 0.144524
Var1Thin 1.0395 0.2037 5.103 3.34e-07
Var1Thing3 0.7722 0.2201 3.508 0.000451
to get the estimate for a one-unit increase in Exposure. But a one-unit increase isn't actually particularly meaningful; an increase of 0.025 is far more likely. Getting an estimate for that isn't particularly difficult either, but I'd like a confidence interval along with the estimate. My intuition is that I need to use the contrast package, but the following generated an error:
diff <- contrast(Fit,list(Exposure=0.030,Var1="Thing1"),list(Exposure=0.005,Type="Thing1"))
"Error in offset(log(persontime)) : object 'persontime' not found"
Any idea what I'm doing wrong?
You want to use the confint function (which in this case will call the MASS:::confint.glm method), as in:
confint(Fit)
Since the standard errors in the model scale linearly with linear changes in the scale of the variable 'Exposure', you can simply multiply the confidence interval by the difference in scale to get the confidence interval for a smaller 'unit' change.
Dumb example:
Let's say you want to test the hypothesis that people fall down more often when they've had more alcohol. You test this by randomly serving individuals varying amounts of alcohol (which you measure in ml) and counting the number of times each person falls down. Your model is:
Fit <- glm(falls ~ alcohol_ml,data=myData, family=poisson)
and the coef table is
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.6998 0.1963 -29.029 < 2e-16
Alcohol_ml 4.7482 1.0793 4.399 1.09e-05
and the confidence interval for alcohol is 4-6 (just to keep it simple). Now a colleague asks you to give the confidence interval in ounces. All you have to do is scale the confidence interval by the conversion factor (29.5735 ml per ounce), as in:
c(4,6) * 29.5735 # effect per ounce alcohol [notice multiplication is used to rescale here]
Alternatively, you could re-scale your data and re-fit the model:
myData$alcohol_oz <- myData$alcohol_ml / 29.5735 #[notice division is used to rescale here]
Fit <- glm(falls ~ alcohol_oz,data=myData, family=poisson)
or you could re-scale your data right in the model:
#[again notice that division is used here]
Fit <- glm(falls ~ I(alcohol_ml/29.5735),data=myData, family=poisson)
Either way, you will get the same confidence intervals on the new scale.
Back to your example: if your units of Exposure are so large that you are unlikely to observe such a change within an individual and a smaller change is more easily interpreted, just re-scale your variable 'Exposure' (as in myData$Exposure_newScale = myData$Exposure / 0.030, so that Exposure_newScale is in multiples of 0.030) or rescale the confidence intervals using either of these methods.
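A rough sketch of both routes for the 0.025 increase from the question, on simulated data (the data frame G, its values, and the "true" coefficients below are invented for illustration). It uses Wald intervals from confint.default() so that the linear rescaling is easy to verify; profile-likelihood intervals from confint() can be rescaled in the same way.

```r
set.seed(1)
n <- 400

# Simulated stand-in for the question's data (illustrative only)
G <- data.frame(
  Exposure   = runif(n, 0, 0.1),
  Var1       = factor(sample(c("Thing1", "Thing2", "Thing3"), n, replace = TRUE)),
  persontime = runif(n, 200, 2000)
)
G$Outcome <- rpois(n, G$persontime * exp(-5.7 + 4.7 * G$Exposure))

Fit <- glm(Outcome ~ Exposure + Var1 + offset(log(persontime)),
           family = poisson, data = G)

delta <- 0.025  # the increase of interest

# Route 1: rescale the coefficient and its Wald interval (log-rate scale)
est <- coef(Fit)["Exposure"] * delta
ci  <- confint.default(Fit)["Exposure", ] * delta

# Rate ratio for a 0.025 increase in Exposure, with its interval
exp(c(estimate = unname(est), ci))

# Route 2: rescale the variable inside the model and refit
Fit2 <- glm(Outcome ~ I(Exposure / delta) + Var1 + offset(log(persontime)),
            family = poisson, data = G)
ci_refit <- confint.default(Fit2)[2, ]

# Both routes give the same interval
all.equal(unname(ci), unname(ci_refit), tolerance = 1e-6)
```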

OLS estimation with AR(1) term

For reasons that I cannot explain (because I can't, not because I don't want to), a process used at my office requires running some regressions on Eviews.
The equation specification used on Eviews is:
dependent_variable c independent_variable ar(1)
Furthermore, the process used is "NLS and ARMA."
I don't use Eviews but, as I understand it, that equation means an OLS regression with a constant, one independent variable and an AR(1) term.
I tried running this in R:
result <- lm(df$dependent[2:48] ~ df$independent[1:47] + df$dependent[1:47])
Where df is a data.frame containing the dependent and independent variables (both spanning 48 observations).
Am I doing it right? Because the parameter estimations, while similar, are different in Eviews. Different enough that I cannot use them.
I've thoroughly searched the internet for what this means. I've read up on ARIMA and ARMAX models but I don't think that this is it. I'm sorry but I'm not that knowledgeable on statistics. By the way, estimating ARMAX models seems very complicated and is done by ML, not LS, so I'm really hoping that's not it.
EDIT: I had to edit the model indexes again because I messed them up, again.
You need the arima function; see ?arima.
Example with some data
y <- lh # lh is Luteinizing Hormone in Blood Samples in datasets package (Base)
set.seed(001)
x <- rnorm(length(y), 100, 10)
arima(y, order = c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)
Coefficients:
ar1 intercept x
0.5810 1.8821 0.0053
s.e. 0.1153 0.6991 0.0068
sigma^2 estimated as 0.195: log likelihood = -29.08, aic = 66.16
See ?arima to find help about its arguments.
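As far as I understand the EViews specification, "y c x ar(1)" describes a regression whose error term follows an AR(1) process, which is a different model from adding the lagged dependent variable as a regressor in lm(). A small sketch on simulated data (all names and values are made up) fitting both; method = "CSS" is used on the assumption that conditional sum of squares is closer in spirit to a least-squares fit than the default, and it may still not reproduce the EViews numbers exactly.

```r
set.seed(1)
n <- 200

# Simulated data: linear regression with AR(1) errors (illustrative only)
x <- rnorm(n, 100, 10)
u <- as.numeric(arima.sim(list(ar = 0.6), n = n))
y <- 2 + 0.5 * x + u

# Regression with AR(1) errors: y_t = c + b * x_t + u_t, u_t = rho * u_{t-1} + e_t
fit_ar <- arima(y, order = c(1, 0, 0), xreg = x, method = "CSS")
coef(fit_ar)

# Lagged-dependent-variable regression: y_t = c + b * x_t + g * y_{t-1} + e_t
fit_lag <- lm(y[-1] ~ x[-1] + y[-n])
coef(fit_lag)
```

The two sets of coefficients answer different questions, so they are not expected to agree.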

Calculation of R^2 value for a non-linear regression

I would first like to say, that I understand that calculating an R^2 value for a non-linear regression isn't exactly correct or a valid thing to do.
However, I'm in a transition period of performing most of our work in SigmaPlot over to R and for our non-linear (concentration-response) models, colleagues are used to seeing an R^2 value associated with the model to estimate goodness-of-fit.
SigmaPlot calculates the R^2 using 1-(residual SS/total SS), but in R I can't seem to extract the total SS (residual SS are reported in summary).
Any help in getting this to work would be greatly appreciated as I try and move us into using a better estimator of goodness-of-fit.
Cheers.
Instead of extracting the total SS, I've just calculated them:
test.mdl <- nls(ctrl.adj ~ a/(1 + (conc.calc/x0)^b),
data = dataSet,
start = list(a = 100, b = 10, x0 = 40), trace = TRUE)
# R^2 as 1 - residual SS / total SS
1 - (deviance(test.mdl)/sum((dataSet$ctrl.adj - mean(dataSet$ctrl.adj))^2))
I get the same R^2 as when using SigmaPlot, so all should be good.
So the total variation in y is (n-1)*var(y), and the amount not explained by your model is sum(residuals(fit)^2), so do something like 1-(sum(residuals(fit)^2)/((n-1)*var(y)))
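A self-contained sketch of that calculation on simulated concentration-response data (the data, parameter values, and starting values are invented for illustration). It also checks that the two ways of writing each sum of squares agree.

```r
set.seed(1)

# Simulated concentration-response data (illustrative only)
conc <- rep(c(1, 5, 10, 20, 40, 80, 160, 320), each = 3)
resp <- 100 / (1 + (conc / 40)^2) + rnorm(length(conc), sd = 3)
d <- data.frame(conc = conc, resp = resp)

fit <- nls(resp ~ a / (1 + (conc / x0)^b), data = d,
           start = list(a = 95, b = 1.8, x0 = 35))

# Residual SS: identical to deviance(fit) for an nls fit
rss <- sum(residuals(fit)^2)

# Total SS: identical to (n - 1) * var(y)
tss <- sum((d$resp - mean(d$resp))^2)

r2 <- 1 - rss / tss
r2
```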
