I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS, I got this table as a result.
I used the gee and geepack packages in R to run the same analysis and got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
            Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)   1.0541     0.4099   2.572      0.1328    7.937
square        1.1811     0.8321   1.419      0.4095    2.884
round         0.7072     0.5670   1.247      0.1593    4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
            Estimate Std.err  Wald Pr(>|W|)
(Intercept)    1.054   0.133 63.00  2.1e-15 ***
square         1.181   0.410  8.32   0.0039 **
round          0.707   0.159 19.70  9.0e-06 ***
---
I would like to recreate the SPSS table exactly (not the numbers, since I am using a subset of the original dataset), but I do not know how to obtain all of these quantities.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Lower transformed
uWald=exp(upperWald)) # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns could be modified to suit your needs
  Estimate Std.err   Wald  Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1   1.0541  0.1328 62.997 2.109e-15    0.7938     1.314  1   2.869 2.212 3.723
2   1.1811  0.4095  8.318 3.925e-03    0.3784     1.984  1   3.258 1.460 7.270
3   0.7072  0.1593 19.704 9.042e-06    0.3949     1.019  1   2.028 1.484 2.772
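For readers who want to see where each column comes from, here is a minimal base-R sketch that rebuilds the same quantities from the estimates and robust standard errors printed above. It is an illustration rather than a replacement for the pipeline: the data frame name wald and the term column are assumptions added here so the rows stay labelled, and 1.96 is used as the normal quantile to match the code above.

```r
# Estimates and robust standard errors copied from the geeglm output above
wald <- data.frame(
  term     = c("(Intercept)", "square", "round"),
  Estimate = c(1.0541, 1.1811, 0.7072),
  Std.err  = c(0.1328, 0.4095, 0.1593)
)

z <- 1.96  # approximate 97.5% standard normal quantile

# Wald chi-square statistic, its degrees of freedom and p-value
wald$Wald <- (wald$Estimate / wald$Std.err)^2
wald$df   <- 1
wald$p    <- pchisq(wald$Wald, df = wald$df, lower.tail = FALSE)

# Wald confidence limits on the coefficient scale
wald$lowerWald <- wald$Estimate - z * wald$Std.err
wald$upperWald <- wald$Estimate + z * wald$Std.err

# Exponentiated estimate and exponentiated confidence limits
wald$ExpBeta <- exp(wald$Estimate)
wald$lWald   <- exp(wald$lowerWald)
wald$uWald   <- exp(wald$upperWald)

print(wald, digits = 4)
```

Because the inputs are rounded to four decimals, the recomputed Wald statistics can differ from the printed ones in the last digits.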
This is an example of what I want:
The problem is that when I check the summary of my averaged model below (I use the conditional average part), the significant terms do not match the plot. For example, water should not be significant according to the summary, yet it appears significant in the plot below. I don't know how to resolve this.
Also, should dwplot be used to plot these coefficients? I think dwplot is generally a good method.
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
cond((Int)) 1.920090 0.162094 0.162496 11.816 <2e-16 ***
cond(semi_habitats) 0.186583 0.097901 0.098116 1.902 0.0572 .
cond(SHDI) -0.129630 0.151271 0.151428 0.856 0.3920
cond(tree) -0.162590 0.124409 0.124601 1.305 0.1919
zi((Int)) 1.500601 0.128244 0.128614 11.668 <2e-16 ***
cond(water) -0.044410 0.081564 0.081655 0.544 0.5865
cond(YEAR2019) 0.085849 0.197617 0.197803 0.434 0.6643
cond(other_crop) 0.043594 0.099623 0.099674 0.437 0.6618
cond(maize) 0.052875 0.096160 0.096256 0.549 0.5828
cond(cotton) 0.014107 0.060640 0.060688 0.232 0.8162
cond(wheat) -0.031901 0.095976 0.096014 0.332 0.7397
cond(pesticide_June) -0.008873 0.051123 0.051208 0.173 0.8624
cond(buildings) -0.001023 0.015340 0.015378 0.066 0.9470
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
cond((Int)) 1.92009 0.16209 0.16250 11.816 <2e-16 ***
cond(semi_habitats) 0.19525 0.09131 0.09156 2.133 0.0330 *
cond(SHDI) -0.24078 0.12546 0.12581 1.914 0.0556 .
cond(tree) -0.20019 0.10737 0.10765 1.860 0.0629 .
zi((Int)) 1.50060 0.12824 0.12861 11.668 <2e-16 ***
cond(water) -0.13278 0.09032 0.09056 1.466 0.1426
cond(YEAR2019) 0.37262 0.25029 0.25093 1.485 0.1376
cond(other_crop) 0.17128 0.13086 0.13101 1.307 0.1911
cond(maize) 0.15044 0.10785 0.10809 1.392 0.1640
cond(cotton) 0.12061 0.13636 0.13654 0.883 0.3771
cond(wheat) -0.27238 0.11467 0.11494 2.370 0.0178 *
cond(pesticide_June) -0.12006 0.14837 0.14877 0.807 0.4197
cond(buildings) -0.03181 0.07962 0.07985 0.398 0.6904
I fitted the models with glmmTMB, did the model averaging with dredge and model.avg from the MuMIn package, and saved the result directly to an rds file.
LCWJuneLS1avg.rds = readRDS("LCWJuneLS1avg.rds")
mA <-(LCWJuneLS1avg.rds) #pulling out model averages #
df1<-as.data.frame(mA$coefmat.subset) #selecting full model coefficient averages
CI <- as.data.frame(confint(LCWJuneLS1avg.rds, full=T)) # get confidence intervals for full model
df1$CI.min <-CI$`2.5 %` #pulling out CIs and putting into same df as coefficient estimates
df1$CI.max <-CI$`97.5 %`# order of coeffients same in both, so no mixups; but should check anyway
setDT(df1, keep.rownames = "coefficient") #put rownames into column
names(df1) <- gsub(" ", "", names(df1)) # remove spaces from column headers
ggplot(data=df1[-1,], aes(x=coefficient, y=Estimate))+ #again, excluding intercept because estimates so much larger
geom_hline(yintercept=0, color = "red",linetype="dashed", lwd=1.5)+ #add dashed line at zero
geom_errorbar(aes(ymin=Estimate-AdjustedSE, ymax=Estimate+AdjustedSE), colour="blue", #adj SE
width=0, lwd=1.5) +
coord_flip()+ # flipping x and y axes
geom_point(size=8)+theme_classic(base_size = 20)+ ylab("Coefficient")
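One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the error bars in the plotting code are drawn as Estimate plus or minus one Adjusted SE, not as a 95% interval (the CI.min and CI.max columns are computed but not used in the plot). A one-standard-error bar is much narrower than a 95% interval, so a term can look "significant" in the plot while its p-value is above 0.05. The numbers below are the water row of the conditional average table; the variable names are illustrative.

```r
# Water row from the conditional average table above
est    <- -0.13278
adj_se <-  0.09056

# What the plot draws: estimate +/- 1 adjusted standard error
bar_1se <- est + c(-1, 1) * adj_se

# An approximate 95% Wald interval: estimate +/- 1.96 adjusted standard errors
ci_95 <- est + c(-1, 1) * qnorm(0.975) * adj_se

# Does each interval exclude zero?
excludes_zero <- function(interval) interval[1] > 0 || interval[2] < 0

print(round(bar_1se, 4))
print(round(ci_95, 4))
print(c(one_se = excludes_zero(bar_1se), ci95 = excludes_zero(ci_95)))
```

Under this reading, the one-standard-error bar for water stays below zero while the approximate 95% interval crosses it, which is consistent with the p-value of 0.1426 in the table.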
Here is my dataset:
file.name : LCWJuneLS1avg.rds
https://drive.google.com/open?id=1C3vzpA17Ewfu5ZXWp-BNLE0_VEgkeq6b
When I run a logistic regression and use the predict() function, and then calculate the probability manually with the formula p=1/(1+e^-(b0+b1*x1...)), I cannot get the same answer. What could be the reason?
>test[1,]
loan_status loan_Amount interest_rate period sector sex age grade
10000 0 608 41.72451 12 Online Shop Female 44 D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: **0.05478904**
But when I calculate like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
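I also calculated it with the insignificant coefficients included, but the answer is still not the same. What could be the reason?
Your calculation leaves out the (Intercept), -1.1542256, and the gradeD3 coefficient, 0.9201400, and the whole linear predictor has to go inside the exponent: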
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
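The same calculation can be written with named terms and plogis(), base R's inverse logit, which makes it easier to see which coefficients enter the linear predictor. This is a minimal sketch using only the terms that apply to the test row (the applicant is Female, so sexMale contributes 0, and the grade is D3); the vector names are illustrative.

```r
# Coefficients that apply to test[1, ]: sex is Female (sexMale = 0), grade is D3
coefs <- c(
  intercept     = -1.1542256,
  interest_rate = -0.0479764603,
  age           = -0.0139099563,
  gradeD3       =  0.9201400
)

# Matching values for the first test row (1 for the intercept and the D3 dummy)
x <- c(
  intercept     = 1,
  interest_rate = 41.72451,
  age           = 44,
  gradeD3       = 1
)

# Linear predictor b0 + b1*x1 + ..., then the inverse logit
eta  <- sum(coefs * x)
prob <- plogis(eta)   # same as 1 / (1 + exp(-eta))

print(prob)
```

Building the predictor from a named vector also guards against the sign and parenthesis slips that are easy to make in a hand-typed formula.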
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I leave out the time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
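To illustrate why the dummy-variable regression and the two-way within estimator should agree on a balanced panel, here is a minimal base-R sketch on made-up data. It demeans each variable by region and by year, adds back the overall mean, and compares the slopes with the lm fit that uses factor dummies. The data frame panel and the helper within2 are hypothetical names introduced for this example, and the shortcut formula in within2 relies on the panel being balanced.

```r
set.seed(1)
n_region <- 20
n_year   <- 10

# Balanced panel: every region is observed in every year
panel <- expand.grid(region = 1:n_region, year = 1:n_year)
panel$a <- rnorm(nrow(panel))
panel$b <- rnorm(nrow(panel))
panel$y <- 2 * panel$a - 1.5 * panel$b +
  rnorm(n_region)[panel$region] +   # region effect
  rnorm(n_year)[panel$year] +       # year effect
  rnorm(nrow(panel))                # idiosyncratic noise

# Two-way within transformation (valid as written for balanced panels)
within2 <- function(v) {
  v - ave(v, panel$region) - ave(v, panel$year) + mean(v)
}

# Least-squares dummy-variable fit
lsdv <- lm(y ~ a + b + factor(region) + factor(year), data = panel)

# Regression on the demeaned variables, without an intercept
demeaned <- data.frame(y = within2(panel$y),
                       a = within2(panel$a),
                       b = within2(panel$b))
fe <- lm(y ~ a + b - 1, data = demeaned)

print(coef(lsdv)[c("a", "b")])
print(coef(fe))
```

The slopes coincide, but the standard errors reported by lm on the demeaned data use the wrong degrees of freedom, which is one reason to prefer plm or felm for the actual fit. If the two approaches disagree on real data, an unbalanced panel or a mis-specified index would be the first things to check.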
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the PLM package in R gives the same results for the first-difference models:
library(plm)
modelfd <- plm(lrent~lpop + lavginc + pctstu,
data=data,model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent~lpop + lavginc + pctstu, data=data, model =
"within", effect="time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
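One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: with two time periods, first differencing reproduces the fixed-effects estimates when the fixed effects are for the cross-sectional unit and a period dummy is included, whereas effect="time" in the call above requests time effects only. The base-R example below uses simulated data (not the rental file) and assumes a two-period panel; all variable names are hypothetical.

```r
set.seed(42)
n    <- 64                          # number of units, two periods each
unit <- rep(1:n, each = 2)
year <- rep(c(0, 1), times = n)     # period dummy: 0 = first year, 1 = second
x    <- rnorm(2 * n)
y    <- 1 + 0.4 * year + 0.3 * x + rnorm(n)[unit] + rnorm(2 * n, sd = 0.1)

# First-difference regression (the intercept picks up the period effect)
dy <- y[year == 1] - y[year == 0]
dx <- x[year == 1] - x[year == 0]
fd <- lm(dy ~ dx)

# Fixed effects for the unit: demean within unit, keep the period dummy
demean <- function(v) v - ave(v, unit)
fe <- lm(demean(y) ~ demean(x) + demean(year) - 1)

fd_coef <- unname(coef(fd))              # intercept, slope
fe_coef <- unname(coef(fe))[c(2, 1)]     # period effect, slope

print(rbind(first_difference = fd_coef, fixed_effects = fe_coef))
```

In this sketch the slope and the period effect match between the two fits. The standard errors from lm on the demeaned data are not comparable without a degrees-of-freedom correction, so that part of the exercise is better checked with plm itself.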
I have a logistic regression model, for which I have been using the rms package. The model fits best using a log term for tn1, and for clinical interpretation I’m using log2. I ran the model using lrm from the rms package, and then to double check, I ran it using glm. The initial coefficients are the same:
h <- lrm(formula = outcomehosp ~ I(log2(tn1 + 0.001)) + apscore_ad +
emsurg + corrapiidiag, data = d, x = TRUE, y = TRUE)
Coef S.E. Wald Z Pr(>|Z|)
Intercept -3.4570 0.3832 -9.02 <0.0001
tn1 0.0469 0.0180 2.60 0.0093
apscore_ad 0.1449 0.0127 11.44 <0.0001
emsurg 0.0731 0.3228 0.23 0.8208
f <- glm(formula = outcomehosp ~ apscore_ad + emsurg + corrapiidiag +
I(log2(tn1 + 0.001)), family = binomial(), data = tn24)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45699 0.38315 -9.023 < 2e-16
I(log2(tn1 + 0.001)) 0.04690 0.01804 2.600 0.00932
apscore_ad 0.14487 0.01267 11.438 < 2e-16
emsurg 0.07310 0.32277 0.226 0.82082
However when I try to get the odds ratios, they are noticeably different for tn1 between the two models, and this doesn’t seem to be a difference of log2 transformation.
summary(h)
Effects Response : outcomehosp
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
tn1 0 0.21 0.21 0.362120 0.15417 6.5300e-02 0.673990
Odds Ratio 0 0.21 0.21 1.436400 NA 1.0675e+00 1.962100
apscore_ad 14 25.00 11.00 1.593600 0.15631 1.3605e+00 1.961000
Odds Ratio 14 25.00 11.00 4.921400 NA 3.8981e+00 7.106600
emsurg 0 1.00 1.00 0.073103 0.33051 -5.8224e-01 0.734860
Odds Ratio 0 1.00 1.00 1.075800 NA 5.5865e-01 2.085200
exp(f$coefficients)
(Intercept) 0.03152467
apscore_ad 1.15589222
emsurg 1.07584115
I(log2(tn1 + 0.001)) 1.04802
Would anyone be able to explain what the rms package is calculating the odds ratio of? Many thanks.
The tn1 effect from summary(h) is the change in the log odds when tn1 goes from 0 to 0.21 -- the inter-quartile range. See ?summary.rms.
So, the effect from the first row of summary(h) is 0.36212 = (log2(0.211)-log2(0.001))*.0469, and the odds ratio in the second row is exp(0.36212) = 1.4364.
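The arithmetic can be checked directly in base R. This is a minimal sketch using the rounded coefficient and the Low and High values printed in the question; the variable names are illustrative.

```r
# Coefficient for I(log2(tn1 + 0.001)) from the lrm / glm output
beta_tn1 <- 0.0469

# Low and High values of tn1 used by summary(h)
tn1_low  <- 0
tn1_high <- 0.21

# Change in the transformed predictor between the two values
delta_log2 <- log2(tn1_high + 0.001) - log2(tn1_low + 0.001)

# Effect on the log-odds scale and the corresponding odds ratio
effect     <- beta_tn1 * delta_log2
odds_ratio <- exp(effect)

# Odds ratio per one-unit increase in log2(tn1 + 0.001), i.e. per doubling
or_per_doubling <- exp(beta_tn1)

print(c(effect = effect, odds_ratio = odds_ratio,
        or_per_doubling = or_per_doubling))
```

The two odds ratios answer different questions: exp(coefficient) is the change per doubling of tn1 + 0.001, while the summary(h) row is the change across the whole 0 to 0.21 range, which spans several doublings.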
I would like to fit a mixed effect model that allows me to account for unequal variances across different geos. Specifically, I would like to predict response as a function of a fixed effect X with geo as the random effect.
Here are what the data look like:
X response geo
1 4 5.521461 other
2 4 5.164786 other
3 4 5.164786 other
4 6 3.401197 other
5 5 4.867534 other
6 4 5.010635 other
Unique values for the geo column:
[1] "other" "Atlanta-Sandy Springs-Marietta, GA" "Chicago-Naperville-Joliet, IL-IN-WI" "Dallas-Fort Worth-Arlington, TX"
[5] "Houston-Sugar Land-Baytown, TX" "Los Angeles-Long Beach-Santa Ana, CA" "Miami-Fort Lauderdale-Pompano Beach, FL" "Phoenix-Mesa-Glendale, AZ"
Here is the model that I've attempted:
> lme0 <- lme(response ~ factor(predictor) , random = ~1|factor(geo), data = HC_hired)
> summary(lme0)
Linear mixed-effects model fit by REML
Data: HC_hired
AIC BIC logLik
54770.69 54836.3 -27377.34
Random effects:
Formula: ~1 | factor(geo)
(Intercept) Residual
StdDev: 0.08689381 0.66802
Fixed effects: response ~ factor(predictor)
Value Std.Error DF t-value p-value
(Intercept) 4.255531 0.04410213 26918 96.49264 0.0000
factor(predictor)2 0.022986 0.03336742 26918 0.68889 0.4909
factor(predictor)3 0.166341 0.03221410 26918 5.16361 0.0000
factor(predictor)4 0.299172 0.03194177 26918 9.36618 0.0000
factor(predictor)5 0.378645 0.03249053 26918 11.65402 0.0000
factor(predictor)6 0.472583 0.03664732 26918 12.89543 0.0000
Correlation:
(Intr) fct()2 fct()3 fct()4 fct()5
factor(predictor)2 -0.660
factor(predictor)3 -0.683 0.903
factor(predictor)4 -0.689 0.912 0.945
factor(predictor)5 -0.679 0.897 0.930 0.940
factor(predictor)6 -0.603 0.795 0.824 0.832 0.819
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-4.7047458 -0.3424262 0.1883132 0.7045260 2.1949313
Number of Observations: 26931
Number of Groups: 8
My issue is that the output does not specify a random effect for each level of geo. What is the correct model specification to do this? I've tried many permutations of the formula without luck. Any comments on the overall process are also welcome. Many thanks in advance!
RESPONSE TO COMMENT (coercing geo to factor does not change output):
HC_hired$geo <- as.factor(HC_hired$geo)
lme0 <- lme(response ~ factor(predictor) , random = ~1|factor(geo), data = HC_hired)
summary(lme0)