I am very confused about the Zelig package, in particular the sim function.
What I want to do is estimate a logistic regression on a subset of my data and then compute fitted values for the remaining observations to see how well the model performs out of sample. Some sample code follows:
library(Zelig)
library(data.table)

data(turnout)
turnout <- data.table(turnout)
# Shuffle the data
turnout <- turnout[sample(.N, 2000)]
# Create a sample for the regression
turnout_sample <- turnout[1:1800, ]
# Create a hold-out sample for out-of-sample testing
turnout_sample2 <- turnout[1801:2000, ]
# Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1)
Model:
Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9394 -1.2933 0.7049 0.7777 1.0718
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.028874 0.186446 0.155 0.876927
age 0.011830 0.003251 3.639 0.000274
racewhite 0.633472 0.142994 4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom
AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
# Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1, fn = NULL, data = turnout_sample2)
# Simulate
s.out1 <- sim(z.out1, x = x.out1)
# Get the fitted (expected) values
fitted <- s.out1$getqi("ev")
What I don't understand is that the list fitted now contains 1000 values, all of which lie between 0.728 and 0.799.
1. Why are there 1000 values when what I am trying to estimate is the fitted value of 200 observations?
2. And why are the observations so closely grouped?
I hope someone can help me with this.
Best regards
The first question: from the signature of sim, sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...), you can see that the default number of simulations is 1000. If you want 200, set num = 200.
However, in the documentation example you are following, sim actually generates (simulates) the probability that a person will vote given certain covariate values (either taken from the data passed to setx or fixed at a specific value, as in setx(z.out, race = "white")), rather than one fitted value per held-out observation.
So in your case you get 1000 simulated probability values between 0.728 and 0.799, which is exactly what you are supposed to get.
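If what you are really after is one fitted probability for each of the 200 held-out rows rather than simulation draws, a simpler route is to bypass sim and predict from the underlying glm fit. A minimal sketch, assuming the Zelig 5 API in which from_zelig_model() (used in another answer further down) returns the wrapped glm object:
# Sketch: pull out the underlying glm fit and predict on the hold-out sample
m <- from_zelig_model(z.out1)
fitted_probs <- predict(m, newdata = turnout_sample2, type = "response")
length(fitted_probs)   # 200: one fitted probability per held-out observation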
Please find reprex below:
library(tidyverse)
# Work days for January from 2010 - 2018
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
sale = c(1205,2111,2452,2054,2440,1212,1211,2111))
# Apply linear regression
model = lm(sale ~ work_days, data)
summary(model)
Call:
lm(formula = sale ~ work_days, data = data)
Residuals:
Min 1Q Median 3Q Max
-677.8 -604.5 218.7 339.0 645.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2643.82 5614.16 0.471 0.654
work_days -38.05 268.75 -0.142 0.892
Residual standard error: 593.4 on 6 degrees of freedom
Multiple R-squared: 0.00333, Adjusted R-squared: -0.1628
F-statistic: 0.02005 on 1 and 6 DF, p-value: 0.892
Could you please help me understand the coefficients? Does every work day decrease the sale by 38.05?
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
sale = c(1212,1211,2111,1205,2111,2452,2054,2440))
model = lm(sale ~ work_days, data)
summary(model)
Call:
lm(formula = sale ~ work_days, data = data)
Residuals:
Min 1Q Median 3Q Max
-686.8 -301.0 -8.6 261.3 599.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6220.0 4555.9 -1.365 0.221
work_days 386.6 218.1 1.772 0.127
Residual standard error: 481.5 on 6 degrees of freedom
Multiple R-squared: 0.3437, Adjusted R-squared: 0.2343
F-statistic: 3.142 on 1 and 6 DF, p-value: 0.1267
Does this mean that every workday increases sales by 387?
How about the negative intercept?
Similar questions, but I couldn't apply the learnings:
Interpreting regression coefficients in R
Interpreting coefficients from Logistic Regression from R
Linear combination of regression coefficients in R
Could you please help me understand the coefficients? Does every work day decrease the sale by 38.05?
Yes and no. Given only these 8 data points, the best-fitting regression line has a negative slope of -38.05, which appears counterintuitive.
However, you need to take the standard error of this -38.05 value into account, which is 268.75. So the result translates into: "in this sample the slope looks negative, but it might just as well be positive; anything between -38.05 - 2*268.75 and -38.05 + 2*268.75 is a reasonable guess." So do not extrapolate from this small sample to anything beyond this sample.
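That back-of-the-envelope interval can also be computed directly with confint (base R, not part of the original post); with only 6 residual degrees of freedom the exact t multiplier is about 2.45 rather than 2, but the conclusion is the same:
# Re-fit the first model so the object name is unambiguous
d1 <- data.frame(work_days = c(20, 21, 22, 20, 20, 22, 21, 21),
                 sale = c(1205, 2111, 2452, 2054, 2440, 1212, 1211, 2111))
m_first <- lm(sale ~ work_days, data = d1)
confint(m_first)   # the 95 % interval for work_days clearly spans zero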
Also look at
Multiple R-squared: 0.00333
This means that less than 1% of the sample variance is explained by this regression. Do not take it too seriously, and do not try to read too much into numbers from such a small sample.
Every workday increases the sales by 387? How about the negative intercept?
Judging only from the small sample you investigated, it looks like every workday increased sales by 387. However, the standard error is high, so you cannot tell whether additional workdays increase or decrease sales outside of this small sample. The whole model is not significant, so nobody should claim that this model is better than pure guessing.
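"Not significant" here refers to the overall F test printed in the summary; it can also be extracted programmatically (standard summary.lm components, not shown in the original post):
f <- summary(model)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
# about 0.127 for this second data set, i.e. not significant at the usual 5 % level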
How about the negative intercept?
You forced the computer to fit a linear model. That model will happily compute nonsensical values, such as "what would sales be if they were a linear function of work days and a month had zero (or a negative number of) workdays?". You could of course force R to fit a linear model in which zero workdays lead to zero sales, and this brings us back on topic. Forcing R to fit a model through the point (0, 0) uses the following syntax (note that the column is called sale, not sales):
model <- lm(sale ~ work_days - 1, data = data)
The intercept of the regression line is interpreted as the predicted sale when work_days is equal to zero. If the predictor (work_days in this case) can never be zero, then the intercept has no meaningful interpretation. The slope estimate of -38.05 can be interpreted as: for each additional work day, the predicted sale decreases by 38.05.
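A quick numerical check of that slope interpretation, reusing the m_first refit of the first model from the sketch above (base R predict and coef):
# The difference between predictions one work day apart equals the slope
predict(m_first, newdata = data.frame(work_days = 21)) -
  predict(m_first, newdata = data.frame(work_days = 20))
coef(m_first)["work_days"]   # both are about -38.05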
Disclaimer: I am very new to glm binomial.
However, this sounds very basic to me, yet glm returns something that is either incorrect or that I don't know how to interpret.
First I was using my primary data and was getting errors; then I tried to replicate the error with simulated data and I see the same thing: I define two columns, indep and dep, and the glm results do not make sense, to me at least...
Any help would be really appreciated. I have a second question on handling NAs in my glm, but first I wish to take care of this :(
set.seed(100)
x <- rnorm(24, 50, 2)   # values for group 0: mean 50
y <- rnorm(24, 25, 2)   # values for group 1: mean 25
j <- c(rep(0, 24), rep(1, 24))
d <- data.frame(dep = as.factor(j), indep = c(x, y))
mod <- glm(dep ~ indep, data = d, family = binomial)
summary(mod)
Which returns:
Call:
glm(formula = dep ~ indep, family = binomial, data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.001e-06 -7.612e-07 0.000e+00 2.110e-08 1.160e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 92.110 168306.585 0.001 1
indep -2.409 4267.658 -0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.6542e+01 on 47 degrees of freedom
Residual deviance: 3.9069e-10 on 46 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
What is happening? I see the warning but in this case these two groups are really separated...
[Barplot of the random data]
I want to compute a logit regression for rare events. I decided to use the Zelig package (relogit function) to do so.
Usually, I use stargazer to extract and save regression results. However, there seem to be compatibility issues with these two packages (Using stargazer with Zelig).
I now want to extract the following information from the Zelig relogit output:
Coefficients, z values, p values, number of observations, log likelihood, AIC
I have managed to extract the p-values and coefficients, but I failed at the rest. I am sure these values must be accessible somehow, because they are reported in the summary() output (although I did not manage to store the summary output as a usable R object). The summary cannot be processed in the same way as a regular glm summary (https://stats.stackexchange.com/questions/176821/relogit-model-from-zelig-package-in-r-how-to-get-the-estimated-coefficients).
A reproducible example:
##Initiate package, model and data
require(Zelig)
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit")
##Call summary on output (reports in console most of the needed information)
summary(z.out1)
##Storing the summary fails and only produces a useless object
summary(z.out1) -> z.out1.sum
##Some of the output I can access as follows
z.out1$get_coef() -> z.out1.coeff
z.out1$get_pvalue() -> z.out1.p
z.out1$get_se() -> z.out1.se
However, I did not find similar commands for the other elements, such as z values, AIC, etc. Since they are shown in the summary() output, they should be accessible somehow.
The summary call result:
Model:
Call:
z5$zelig(formula = conflict ~ major + contig + power + maxdem +
mindem + years, data = mid)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0742 -0.4444 -0.2772 0.3295 3.1556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.535496 0.179685 -14.111 < 2e-16
major 2.432525 0.157561 15.439 < 2e-16
contig 4.121869 0.157650 26.146 < 2e-16
power 1.053351 0.217243 4.849 1.24e-06
maxdem 0.048164 0.010065 4.785 1.71e-06
mindem -0.064825 0.012802 -5.064 4.11e-07
years -0.063197 0.005705 -11.078 < 2e-16
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3979.5 on 3125 degrees of freedom
Residual deviance: 1868.5 on 3119 degrees of freedom
AIC: 1882.5
Number of Fisher Scoring iterations: 6
Next step: Use 'setx' method
Use from_zelig_model to get at the deviance, AIC, and similar quantities:
m <- from_zelig_model(z.out1)
m$aic
...
Z values are the coefficients divided by their standard errors:
z.out1$get_coef()[[1]]/z.out1$get_se()[[1]]
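The remaining quantities can be pulled from the converted object m created above; this sketch assumes the object returned by from_zelig_model accepts the usual glm-style extractor functions:
AIC(m)        # Akaike information criterion
logLik(m)     # log likelihood
nobs(m)       # number of observations
deviance(m)   # residual deviance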
I am running a mixed model using lme4 in R:
full_mod3 = lmer(logcptplus1 ~ logdepth*logcobb + (1|fyear) + (1|flocation),
                 data = cpt, REML = TRUE)
summary:
Formula: logcptplus1 ~ logdepth * logcobb + (1 | fyear) + (1 | flocation)
Data: cpt
REML criterion at convergence: 577.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.7797 -0.5431 0.0248 0.6562 2.1733
Random effects:
Groups Name Variance Std.Dev.
fyear (Intercept) 0.2254 0.4748
flocation (Intercept) 0.1557 0.3946
Residual 0.9663 0.9830
Number of obs: 193, groups: fyear, 16; flocation, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.3949 1.2319 3.568
logdepth 0.2681 0.4293 0.625
logcobb -0.7189 0.5955 -1.207
logdepth:logcobb 0.3791 0.2071 1.831
I have used the effects package in R to calculate and extract the 95% confidence intervals and standard errors for the model output, so that I can examine the relationship between the predictor variable of interest and the response variable while holding the secondary predictor variable (logdepth) constant at its median (2.5) in the data set:
gm = 4.3949 + 0.2681*depth_median - 0.7189*logcobb_range +
     0.3791*(depth_median*logcobb_range)
ef2 = effect("logdepth*logcobb", full_mod3,
             xlevels = list(logcobb = seq(log(0.03268), log(0.37980), length.out = 200)))
I have attempted to bootstrap the 95% CIs using code from here. However, I need to calculate the 95% CIs only for the median depth (2.5). Is there a way to specify this in the confint() call so that I can calculate the CIs needed to visualize the bootstrapped results as in the plot above?
confint(full_mod3,method="boot",nsim=200,boot.type="perc")
You can do this by specifying a custom function:
library(lme4)
?confint.merMod
FUN: bootstrap function; if ‘NULL’, an internal function that returns the fixed-effect parameters as well as the random-effect parameters on the standard deviation/correlation scale will be used. See ‘bootMer’ for details.
So FUN can be a prediction function (?predict.merMod) that uses a newdata argument that varies and fixes appropriate predictor variables.
An example with built-in data (not quite as interesting as yours since there's a single continuous predictor variable, but I think it should illustrate the approach clearly enough):
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
pframe <- data.frame(Days=seq(0,20,by=0.5))
## predicted values at population level (re.form=NA)
pfun <- function(fit) {
predict(fit,newdata=pframe,re.form=NA)
}
set.seed(101)
cc <- confint(fm1,method="boot",FUN=pfun)
Picture:
par(las=1,bty="l")
matplot(pframe$Days,cc,lty=2,col=1,type="l",
xlab="Days",ylab="Reaction")
I'm running the following code:
library(lme4)
library(nlme)
nest.reg2 <- glmer(SS ~ (bd|cond), family = "binomial",
data = combined2)
coef(nest.reg2)
summary(nest.reg2)
Which produces the following output:
coefficients
$cond
bd (Intercept)
LL -1.014698 1.286768
no -3.053920 4.486349
SS -5.300883 8.011879
summary
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: SS ~ (bd | cond)
Data: combined2
AIC BIC logLik deviance df.resid
1419.7 1439.7 -705.8 1411.7 1084
Scaled residuals:
Min 1Q Median 3Q Max
-8.0524 -0.8679 -0.4508 1.0735 2.2756
Random effects:
Groups Name Variance Std.Dev. Corr
cond (Intercept) 33.34 5.774
bd 13.54 3.680 -1.00
Number of obs: 1088, groups: cond, 3
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3053 0.1312 -2.327 0.02 *
My question is: how do I test the significance of each of the coefficients for this model? The summary() function seems to provide a p-value only for the intercept, not for the coefficients.
When I try anova(nest.reg2) I get nothing, just:
Analysis of Variance Table
Df Sum Sq Mean Sq F value
I've tried the solutions proposed here (How to obtain the p-value (check significance) of an effect in a lme4 mixed model?) to no avail.
To clarify, the cond variable is a factor with three levels (SS, no, and LL), and I believe that the coef command produces coefficients for the continuous bd variable at each of those levels, so what I'm trying to do is test the significance of those coefficients.
There are several issues here.
The main one is that you can really only do significance testing on fixed-effect coefficients, and you have coded your model with no fixed effects. You might be looking for
glmer(SS ~ bd + (1|cond), ...)
which will model the overall (population-level) distinctions among the levels of bd and include variation in the intercept among levels of cond.
If you have multiple levels of bd represented in each cond group, then you can in principle also allow for variation in treatment effects among cond groups:
glmer(SS ~ bd + (bd|cond), ...)
However, you have another problem: three groups (i.e., levels of cond) aren't really enough, in practice, to estimate variability among groups. That's why you're seeing a correlation of -1.00 in your output, which indicates a singular fit (e.g., see here for more discussion).
Therefore, another possibility would be to go ahead and treat cond as a fixed effect (adjusting the contrasts on cond so that the main effect of bd is estimated as the average across groups rather than as the effect in the baseline level of cond):
glm(SS ~ bd*cond, contrasts = list(cond = contr.sum), ...)
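A minimal sketch of the first suggestion, showing where the per-coefficient tests then appear; it assumes combined2 contains a binary SS, the predictor bd, and the three-level factor cond, as described in the question:
# random intercept for cond, fixed effect for bd (lme4 already loaded above)
m1 <- glmer(SS ~ bd + (1 | cond), family = binomial, data = combined2)
summary(m1)$coefficients   # Wald z tests and p-values for the fixed effects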