This is an example of what I want:
The problem is that when I checked the summary of my averaged model below (I use the conditional average part), the significant terms did not match the plot. For example, water should not be significant according to the summary, yet it looks clearly significant in the plot below. I don't know how to resolve this.
Also, should we use dwplot to plot them? I think dwplot is generally a good method.
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
cond((Int)) 1.920090 0.162094 0.162496 11.816 <2e-16 ***
cond(semi_habitats) 0.186583 0.097901 0.098116 1.902 0.0572 .
cond(SHDI) -0.129630 0.151271 0.151428 0.856 0.3920
cond(tree) -0.162590 0.124409 0.124601 1.305 0.1919
zi((Int)) 1.500601 0.128244 0.128614 11.668 <2e-16 ***
cond(water) -0.044410 0.081564 0.081655 0.544 0.5865
cond(YEAR2019) 0.085849 0.197617 0.197803 0.434 0.6643
cond(other_crop) 0.043594 0.099623 0.099674 0.437 0.6618
cond(maize) 0.052875 0.096160 0.096256 0.549 0.5828
cond(cotton) 0.014107 0.060640 0.060688 0.232 0.8162
cond(wheat) -0.031901 0.095976 0.096014 0.332 0.7397
cond(pesticide_June) -0.008873 0.051123 0.051208 0.173 0.8624
cond(buildings) -0.001023 0.015340 0.015378 0.066 0.9470
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
cond((Int)) 1.92009 0.16209 0.16250 11.816 <2e-16 ***
cond(semi_habitats) 0.19525 0.09131 0.09156 2.133 0.0330 *
cond(SHDI) -0.24078 0.12546 0.12581 1.914 0.0556 .
cond(tree) -0.20019 0.10737 0.10765 1.860 0.0629 .
zi((Int)) 1.50060 0.12824 0.12861 11.668 <2e-16 ***
cond(water) -0.13278 0.09032 0.09056 1.466 0.1426
cond(YEAR2019) 0.37262 0.25029 0.25093 1.485 0.1376
cond(other_crop) 0.17128 0.13086 0.13101 1.307 0.1911
cond(maize) 0.15044 0.10785 0.10809 1.392 0.1640
cond(cotton) 0.12061 0.13636 0.13654 0.883 0.3771
cond(wheat) -0.27238 0.11467 0.11494 2.370 0.0178 *
cond(pesticide_June) -0.12006 0.14837 0.14877 0.807 0.4197
cond(buildings) -0.03181 0.07962 0.07985 0.398 0.6904
I fitted the models with glmmTMB, did the model averaging with dredge and model.avg from the MuMIn package, and saved the result directly to an rds file.
library(MuMIn) # provides the confint() method for model-averaging objects
library(data.table) # setDT()
library(ggplot2)
LCWJuneLS1avg.rds <- readRDS("LCWJuneLS1avg.rds")
mA <- LCWJuneLS1avg.rds # the saved model-averaging object
df1 <- as.data.frame(mA$coefmat.subset) # coefmat.subset is the conditional-average coefficient table
CI <- as.data.frame(confint(LCWJuneLS1avg.rds, full = TRUE)) # full = TRUE requests confidence intervals for the full average
df1$CI.min <- CI$`2.5 %` # pulling out CIs and putting them into the same df as the coefficient estimates
df1$CI.max <- CI$`97.5 %` # order of coefficients is the same in both, so no mix-ups; but should check anyway
setDT(df1, keep.rownames = "coefficient") # put rownames into a column
names(df1) <- gsub(" ", "", names(df1)) # remove spaces from column headers
ggplot(data = df1[-1, ], aes(x = coefficient, y = Estimate)) + # excluding the intercept because its estimate is so much larger
geom_hline(yintercept = 0, color = "red", linetype = "dashed", lwd = 1.5) + # add dashed line at zero
geom_errorbar(aes(ymin = Estimate - AdjustedSE, ymax = Estimate + AdjustedSE), colour = "blue", # bars are +/- 1 adjusted SE
width = 0, lwd = 1.5) +
coord_flip() + # flipping x and y axes
geom_point(size = 8) + theme_classic(base_size = 20) + ylab("Coefficient")
Here is my dataset:
file.name : LCWJuneLS1avg.rds
https://drive.google.com/open?id=1C3vzpA17Ewfu5ZXWp-BNLE0_VEgkeq6b
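One thing that may be worth checking, sketched here with numbers copied from the (conditional average) table above: the plotting code draws bars of Estimate plus or minus one adjusted SE, which is a lot narrower than a 95% interval. A term such as water can then have a bar that misses zero even though its p-value is 0.14. The snippet is only an illustration in base R; the data frame `tab` and its column names are made up for the example, and the 95% interval uses the plain normal approximation, which is an assumption and not necessarily what confint() returns for your object.

```r
# Estimates and adjusted SEs copied from the (conditional average) table
tab <- data.frame(
  term     = c("semi_habitats", "water", "wheat"),
  estimate = c(0.19525, -0.13278, -0.27238),
  adj_se   = c(0.09156, 0.09056, 0.11494)
)

z <- qnorm(0.975)  # about 1.96

# Bars as drawn in the plot: +/- 1 adjusted SE
tab$se_lo <- tab$estimate - tab$adj_se
tab$se_hi <- tab$estimate + tab$adj_se

# Approximate 95% interval: +/- z * adjusted SE
tab$ci_lo <- tab$estimate - z * tab$adj_se
tab$ci_hi <- tab$estimate + z * tab$adj_se

# Does each kind of bar exclude zero?
tab$se_bar_excludes_0 <- tab$se_lo > 0 | tab$se_hi < 0
tab$ci_excludes_0     <- tab$ci_lo > 0 | tab$ci_hi < 0

print(tab[, c("term", "se_bar_excludes_0", "ci_excludes_0")])
```

If that is what is going on, mapping ymin = CI.min and ymax = CI.max in geom_errorbar(), with the estimates and intervals both taken from the same kind of average (full or conditional), should make the plot agree with the summary.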
Say I have a series of GAMs that I would like to average together using MuMIn. How do I go about interpreting the results of the averaged smoothers? Why are there numbers after each smoother term?
library(glmmTMB)
library(mgcv)
library(MuMIn)
data("Salamanders") # glmmTMB data
# mgcv gams
gam1 <- gam(count ~ spp + s(cover) + s(DOP), data = Salamanders, family = tw, method = "ML")
gam2 <- gam(count ~ mined + s(cover) + s(DOP), data = Salamanders, family = tw, method = "ML")
gam3 <- gam(count ~ s(Wtemp), data = Salamanders, family = tw, method = "ML")
gam4 <- gam(count ~ mined + s(DOY), data = Salamanders, family = tw, method = "ML")
# MuMIn model average
summary(model.avg(gam1, gam2, gam3, gam4))
And an excerpt from the results...
Model-averaged coefficients:
(full average)
Estimate Std. Error
(Intercept) -1.32278368618846586812765053764451295137405 0.16027398202204409805027296442858641967177
minedno 2.22006553885311141982583649223670363426208 0.19680444996609294805445244946895400062203
s(cover).1 0.00096638939252485735100645092288118576107 0.05129736767981037115493592182247084565461
s(cover).2 0.00360413985630353601863351542533564497717 0.18864911049300209233692271482141222804785
s(cover).3 0.00034381902619062468381624930735540601745 0.01890820689958183642431777116144075989723
s(cover).4 -0.00248365164684107844403349041328965540743 0.12950622739175629560826052966149291023612
s(cover).5 -0.00089826079366626997504963192398008686723 0.04660149540411069601919535898559843190014
s(cover).6 0.00242197856572917875894734862640689243563 0.12855093144749979439112053114513400942087
s(cover).7 -0.00032596616013735266745646179664674946252 0.02076865732570042782922925539423886220902
s(cover).8 0.00700001172809289889942263584998727310449 0.36609857217759655956257347497739829123020
s(cover).9 -0.17150069832114492318630993850092636421323 0.17672571419517621449379873865836998447776
s(DOP).1 0.00018839994220792031023870016781529557193 0.01119134546418791391342306695833030971698
s(DOP).2 -0.00081869157242861999301819508900734945200 0.04333670935815417402103832955617690458894
s(DOP).3 -0.00021538789478326670289408395486674407948 0.01164171952980479901595955993798270355910
s(DOP).4 0.00043433676942596419591827161532648915454 0.02463278659589070856972270462392771150917
This is a little easier to read if you don't print so many digits (see below):
Each smooth term is parameterized using multiple coefficients (9 by default, because the default basis dimension k = 10 loses one coefficient to the identifiability constraint), which is why we have multiple s(whatever).xxx coefficients.
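A small sketch of where those numbered coefficients come from, using simulated data (the variables `x` and `y` and the model names are invented for the illustration). It only relies on mgcv, which ships with standard R installations.

```r
library(mgcv)

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

m_default <- gam(y ~ s(x), data = d)         # default basis dimension (k = 10)
m_small   <- gam(y ~ s(x, k = 5), data = d)  # smaller basis

names(coef(m_default))  # intercept plus s(x).1 ... s(x).9
length(coef(m_small))   # intercept plus 4 smooth coefficients
```

The individual s(x).i values are weights on basis functions, so they have no direct interpretation on their own; the fitted curve is what matters.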
It's not clear to me what you want to do with the model-averaged results. It's usually best to make model-averaged predictions rather than trying to interpret model-averaged coefficients, which has some pitfalls ... There is a predict() method for objects of class "averaging" (which is what model.avg() returns).
For further questions about interpretation you might want to ask on CrossValidated ...
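To make the idea of model-averaged predictions concrete, here is a hand-rolled sketch in base R. It is not MuMIn's implementation: it uses plain AIC weights (MuMIn ranks by AICc by default), ordinary lm fits, and the built-in mtcars data purely as a stand-in. In practice, calling predict() on the averaging object should be all you need.

```r
# Candidate models on a built-in data set (mtcars is only a stand-in)
mods <- list(
  m1 = lm(mpg ~ wt, data = mtcars),
  m2 = lm(mpg ~ wt + hp, data = mtcars),
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# Akaike weights: relative support for each model, summing to 1
aic   <- sapply(mods, AIC)
delta <- aic - min(aic)
w     <- exp(-delta / 2) / sum(exp(-delta / 2))

# One prediction per model for the same new observation
newdat <- data.frame(wt = 3, hp = 110, qsec = 18)
preds  <- vapply(mods, function(m) unname(predict(m, newdata = newdat)),
                 numeric(1))

# Model-averaged prediction: weighted mean of the per-model predictions
avg_pred <- sum(w * preds)
avg_pred
```

Averaging on the prediction scale sidesteps the question of what a coefficient means when the term is absent from some of the models.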
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -1.323e+00 1.603e-01 1.606e-01 8.239 <2e-16 ***
minedno 2.220e+00 1.968e-01 1.971e-01 11.263 <2e-16 ***
s(cover).1 9.664e-04 5.130e-02 5.130e-02 0.019 0.985
s(cover).2 3.604e-03 1.886e-01 1.887e-01 0.019 0.985
s(cover).3 3.438e-04 1.891e-02 1.891e-02 0.018 0.985
s(cover).4 -2.484e-03 1.295e-01 1.295e-01 0.019 0.985
s(cover).5 -8.983e-04 4.660e-02 4.660e-02 0.019 0.985
s(cover).6 2.422e-03 1.286e-01 1.286e-01 0.019 0.985
s(cover).7 -3.260e-04 2.077e-02 2.078e-02 0.016 0.987
s(cover).8 7.000e-03 3.661e-01 3.661e-01 0.019 0.985
s(cover).9 -1.715e-01 1.767e-01 1.768e-01 0.970 0.332
s(DOP).1 1.884e-04 1.119e-02 1.120e-02 0.017 0.987
s(DOP).2 -8.187e-04 4.334e-02 4.334e-02 0.019 0.985
s(DOP).3 -2.154e-04 1.164e-02 1.164e-02 0.018 0.985
s(DOP).4 4.343e-04 2.463e-02 2.464e-02 0.018 0.986
s(DOP).5 -1.737e-04 1.019e-02 1.020e-02 0.017 0.986
s(DOP).6 -3.224e-04 1.790e-02 1.790e-02 0.018 0.986
s(DOP).7 2.991e-07 5.739e-04 5.750e-04 0.001 1.000
s(DOP).8 -1.756e-03 9.557e-02 9.559e-02 0.018 0.985
s(DOP).9 1.930e-02 5.630e-02 5.639e-02 0.342 0.732
s(DOY).1 5.189e-08 3.378e-04 3.384e-04 0.000 1.000
When I run a logistic regression and use the predict() function, I get a different answer from the one I calculate manually with the formula p = 1/(1 + e^-(b0 + b1*x1 + ...)). What could be the reason?
>test[1,]
loan_status loan_Amount interest_rate period sector sex age grade
10000 0 608 41.72451 12 Online Shop Female 44 D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: 0.05478904
But when I calculate like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
I calculated also with insignificant coefficients but answer still is not the same. What could be the reason?
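You also have to include the (Intercept), -1.1542256, and the gradeD3 coefficient, 0.9201400, because the test row has grade D3. The sexMale coefficient drops out because the row is Female, the reference level. Note also that in your formula the age term sits outside the exponent.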
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
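The same calculation can be written so that each coefficient is paired explicitly with its value in the design row, which makes a forgotten intercept or dummy easier to spot. This is just a sketch: the vectors `b` and `x` are typed in by hand from the output above, not extracted from Final_Model.

```r
# Coefficients that apply to test[1, ] (Female, grade D3), from the glm output
b <- c("(Intercept)" = -1.1542256,
       interest_rate = -0.0479764603,
       age           = -0.0139099563,
       gradeD3       =  0.9201400)

# Matching design row: 1 for the intercept, 1 for the gradeD3 dummy
x <- c("(Intercept)" = 1, interest_rate = 41.72451, age = 44, gradeD3 = 1)

eta <- sum(b * x)  # linear predictor
p   <- plogis(eta) # inverse logit, same as 1 / (1 + exp(-eta))
p
```

With the fitted model at hand, predict(Final_Model, newdata = test[1,], type = "link") should return the same eta, which is a quick way to check a manual calculation.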
I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS I got this table as a result.
I used the packages gee and geepack in R to run the same analysis and got these results:
#gee
summary(gee(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
#geepack
summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate the SPSS table exactly (not the results, as I use a subset of the original dataset), but I do not know how to obtain all of these results.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round,data = ex, id = ideo,
corstr = "independence"))) %>%
mutate(lowerWald = Estimate-1.96*Std.err, # Lower Wald CI
upperWald=Estimate+1.96*Std.err, # Upper Wald CI
df=1,
ExpBeta = exp(Estimate)) %>% # Transformed estimate
mutate(lWald=exp(lowerWald), # Lower transformed
uWald=exp(upperWald)) # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns could be modified to suit your needs.
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772
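For anyone who wants to keep the coefficient names as row names (they show up as 1, 2, 3 in the output above), here is a base R sketch of the same arithmetic. The estimates and standard errors are typed in from the geeglm output rather than extracted from a fitted model, and the column names are only a guess at a convenient layout, not a reproduction of the SPSS table.

```r
# Estimates and robust standard errors copied from the geeglm summary
est <- c("(Intercept)" = 1.0541, square = 1.1811, round = 0.7072)
se  <- c(0.1328, 0.4095, 0.1593)

z <- qnorm(0.975)  # about 1.96

tab <- data.frame(
  B       = est,
  Std.err = se,
  lower   = est - z * se,   # lower 95% Wald limit
  upper   = est + z * se,   # upper 95% Wald limit
  Wald    = (est / se)^2,   # Wald chi-square statistic
  df      = 1
)
tab$p        <- pchisq(tab$Wald, df = tab$df, lower.tail = FALSE)
tab$ExpB     <- exp(tab$B)      # transformed estimate
tab$ExpLower <- exp(tab$lower)  # transformed lower limit
tab$ExpUpper <- exp(tab$upper)  # transformed upper limit

print(round(tab, 3))
```

Because the inputs are rounded to four decimals, the last digit can differ slightly from the table above.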
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a number of uniquely identified regions. I would like to run a regression that includes both regional (region in the code below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
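To see what the two-way within estimator does in the balanced case, here is a sketch in base R: subtract the region means and the year means from each variable, add back the grand mean, and run OLS on the transformed data without any dummies. The helper name `demean2` and the simulated region and year effects are made up for the illustration, and the shortcut is exact only for balanced panels.

```r
set.seed(1)
n_region <- 40
n_year   <- 50
df <- data.frame(
  region = rep(seq_len(n_region), each = n_year),
  year   = rep(seq_len(n_year), times = n_region)
)
df$a <- rnorm(nrow(df))
df$b <- rnorm(nrow(df))
# Outcome with region and year effects built in
df$y <- 2 * df$a - 1.5 * df$b +
  rnorm(n_region)[df$region] + rnorm(n_year)[df$year] + rnorm(nrow(df))

# Dummy-variable (LSDV) version, as in the lm() call above
lsdv <- lm(y ~ a + b + factor(region) + factor(year), data = df)

# Two-way within transformation for a balanced panel
demean2 <- function(x, g1, g2) x - ave(x, g1) - ave(x, g2) + mean(x)
dm <- data.frame(
  y = demean2(df$y, df$region, df$year),
  a = demean2(df$a, df$region, df$year),
  b = demean2(df$b, df$region, df$year)
)
within_fit <- lm(y ~ 0 + a + b, data = dm)

rbind(lsdv = coef(lsdv)[c("a", "b")], within = coef(within_fit))
```

Only the coefficients should be compared here: the demeaned regression does not know about the absorbed dummies, so its residual degrees of freedom, and therefore its standard errors, are off.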
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python):
import statsmodels.formula.api as smf
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
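For what it is worth, the claim in part (iv) can be checked on simulated data without plm. With exactly two periods per unit, first differencing and the unit fixed-effects estimator with a second-period dummy give the same slope and the same standard error. The sketch below is base R only; the variables `id`, `t2`, `x` and `y` are invented and have nothing to do with the rental file, so it illustrates the equivalence and does not diagnose the call above.

```r
set.seed(42)
n     <- 64                            # units, two periods each
id    <- rep(seq_len(n), each = 2)
t2    <- rep(0:1, times = n)           # second-period dummy
alpha <- rep(rnorm(n), each = 2)       # unit effect, correlated with x
x     <- rnorm(2 * n) + alpha
y     <- 0.4 * t2 + 0.3 * x + alpha + rnorm(2 * n, sd = 0.1)

# First differences (rows are ordered period 0, period 1 within each unit)
dy <- y[t2 == 1] - y[t2 == 0]
dx <- x[t2 == 1] - x[t2 == 0]
fd <- lm(dy ~ dx)

# Unit fixed effects via dummies, plus the period dummy
fe <- lm(y ~ x + t2 + factor(id))

# Slope and its standard error under both estimators
rbind(fd = coef(summary(fd))["dx", 1:2],
      fe = coef(summary(fe))["x", 1:2])
```

Note that the fixed effects here are over units (factor(id)), with time entering only as a dummy. In plm, effect = "time" asks for time effects instead of individual effects, so it may be worth comparing against effect = "individual" with a period dummy added to the formula.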
I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.
An example
Suppose you had some data for 300 movie audience members from three different cities:
Chicago
Milwaukee
Dayton
And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.
Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)
Here's what that might look like in R:
> score <- rnorm(n = 300, mean = -50, sd = 10)
> city <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
score city
1 -64.57515 Chicago
2 -50.51050 Milwaukee
3 -56.51409 Dayton
4 -45.55133 Chicago
5 -47.88686 Milwaukee
6 -51.22812 Dayton
Great. So let's run a quick linear model, assign it to lm1, and get its summary output:
> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)
Call:
lm(formula = score ~ city, data = spider.man.3.sucked)
Residuals:
Min 1Q Median 3Q Max
-29.8515 -6.1090 -0.4745 6.0340 26.2616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.3621 0.9630 -53.337 <2e-16 ***
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693, Adjusted R-squared: -0.004023
F-statistic: 0.4009 on 2 and 297 DF, p-value: 0.6701
What's bugging me
The part I want to highlight is this:
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:
Dayton 1.1892 1.3619 0.873 0.383
Milwaukee 0.8288 1.3619 0.609 0.543
Two questions in one
So,
1) What's controlling the format of the output for linear model summary rows, and
2) Can/should I change it?
The extractor function for that component of a summary object is coef. Does this provide the means to control your output acceptably?
summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
# Estimate Std. Error t value Pr(>|t|)
# Dayton 0.8133 1.485 0.5478 0.5842
# Milwaukee 0.3891 1.485 0.2621 0.7934
(No random seed was set so cannot match your values.)
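If you want something more targeted than sub("^city", ...), a sketch that uses the factor levels stored in the fitted model (the xlevels component) to strip only names that are exactly variable plus level. `strip_factor_prefix` is a made-up helper name, the data are re-simulated with an arbitrary seed, and only the coefficient matrix is relabelled, not the model itself.

```r
# Relabel rows of the coefficient table: "cityDayton" -> "Dayton", etc.
strip_factor_prefix <- function(mod) {
  cf <- coef(summary(mod))
  for (v in names(mod$xlevels)) {
    full <- paste0(v, mod$xlevels[[v]])  # e.g. "cityChicago", "cityDayton"
    hit  <- rownames(cf) %in% full       # rows belonging to this factor
    rownames(cf)[hit] <- substring(rownames(cf)[hit], nchar(v) + 1)
  }
  cf
}

set.seed(3)
score <- rnorm(n = 300, mean = -50, sd = 10)
city  <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
lm1   <- lm(score ~ city)

print(strip_factor_prefix(lm1), digits = 4)
```

Numeric predictors are left alone because they never appear in xlevels. With several factors that share level names (for example two Yes/No variables) the stripped labels become ambiguous, so this is best kept for display only.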
For 1) it appears to happen inside model.matrix.default() and inside internal R compiled code for that matter.
It might be difficult to change this easily. The obvious way would be to write your own wrapper around model.matrix.default() that calls the original and updates the column names afterwards, but I haven't tried or tested this.
Here is a hack
# RUN REGRESSION
data(tips, package = "reshape2") # the tips data set ships with reshape2
lm1 = lm(tip ~ total_bill + sex + day, data = tips)
# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
mydf = mod$model
# PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
vars = names(mod$model)[-1]
eachlen = sapply(mydf[, vars, drop = FALSE], function(x)
ifelse(is.numeric(x), 1, length(unique(x)) - 1))
vars = rep(vars, eachlen)
# REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
coefs = names(mod$coefficients)[-1] # use the model passed in, not the global lm1
coefs2 = stringr::str_replace(coefs, vars, "")
names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)
return(mod)
}
summary(remove_factors(lm1))
This gives
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95588 0.27579 3.47 0.00063 ***
total_bill 0.10489 0.00758 13.84 < 2e-16 ***
Male -0.03844 0.14215 -0.27 0.78706
Sat -0.08088 0.26226 -0.31 0.75806
Sun 0.08282 0.26741 0.31 0.75706
Thur -0.02063 0.26975 -0.08 0.93910
However, it is not always advisable to do this, as you can see from running the same hack on a different regression. It is not clear what the Yes in the last row stands for. By default R writes it as smokerYes to make its meaning clear. So use with caution.
lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05182 0.29315 3.59 0.00040 ***
total_bill 0.10569 0.00763 13.86 < 2e-16 ***
Male -0.03769 0.14217 -0.27 0.79114
Sat -0.12636 0.26648 -0.47 0.63582
Sun 0.00407 0.27959 0.01 0.98841
Thur -0.09283 0.27994 -0.33 0.74048
Yes -0.13935 0.14422 -0.97 0.33489