I have a question about Poisson GLM and formula representation:
Considering a data set:
p <- read.csv("https://raw.githubusercontent.com/Leprechault/PEN-533/master/bradysia-greenhouse.csv")
Without considering the interaction:
m1 <- glm(bradysia ~ area + mes, family="quasipoisson", data=p)
summary(m1)
#(Intercept) 4.36395 0.12925 33.765 < 2e-16 ***
#areaCV -0.19696 0.12425 -1.585 0.113
#areaMJC -0.71543 0.08553 -8.364 3.11e-16 ***
#mes -0.08872 0.01970 -4.503 7.82e-06 ***
The final formula is: bradysia = exp(4.36395 - 0.19696*areaCV - 0.71543*areaMJC - 0.08872*mes), where areaCV and areaMJC are 0/1 indicator variables and area CS is the reference level (both indicators equal 0).
Considering the interaction:
m2 <- glm(bradysia ~ area*mes, family="quasipoisson", data=p)
summary(m2)
#(Intercept) 4.05682 0.15468 26.227 < 2e-16 ***
#areaCV 0.15671 0.35219 0.445 0.6565
#areaMJC 0.54132 0.31215 1.734 0.0833 .
#mes -0.03943 0.02346 -1.680 0.0933 .
#areaCV:mes -0.05724 0.05579 -1.026 0.3052
#areaMJC:mes -0.22609 0.05576 -4.055 5.57e-05 **
The final formula is bradysia = exp(?????). Any help, please?
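For reference, writing m2 out the same way (a sketch: areaCV and areaMJC are 0/1 indicators with CS as the reference level, and mes = 6 is just an illustrative value), checked against predict():
# bradysia = exp(4.05682 + 0.15671*areaCV + 0.54132*areaMJC - 0.03943*mes
#                - 0.05724*areaCV*mes - 0.22609*areaMJC*mes)
nd  <- data.frame(area = c("CS", "CV", "MJC"), mes = 6)
predict(m2, newdata = nd, type = "response")   # the model's own predictions
b   <- coef(m2)
cv  <- c(0, 1, 0); mjc <- c(0, 0, 1)           # indicator values for the three rows
exp(b["(Intercept)"] + b["areaCV"]*cv + b["areaMJC"]*mjc + b["mes"]*6 +
    b["areaCV:mes"]*cv*6 + b["areaMJC:mes"]*mjc*6)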
I understand from this question here that the coefficients are the same whether we use an lm regression with as.factor() or a plm regression with fixed effects.
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
However, the R-squared values differ significantly. Which one is correct, and how does the interpretation change between the two models? In my case, the R-squared is much larger for the plm specification and is even negative for the lm + factor one.
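For reference, the two statistics are not computed on the same variation: lm() measures fit against the raw response, so the variation absorbed by the region and year dummies counts as explained, while plm's within R-squared is computed after demeaning. A minimal sketch of that equivalence, using the simulated df above (by the Frisch-Waugh theorem, the within fit can be reproduced by regressing the demeaned variables on each other):
yw <- residuals(lm(y ~ factor(region) + factor(year), data = df))
aw <- residuals(lm(a ~ factor(region) + factor(year), data = df))
bw <- residuals(lm(b ~ factor(region) + factor(year), data = df))
summary(lm(yw ~ aw + bw))$r.squared   # close to the within R-squared reported by plm
summary(model.a)$r.squared            # lm's R-squared on the undemeaned response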
I estimate a Heckman selection model (heckit) using the sampleSelection package.
The model looks as follows:
library(sampleSelection)
Heckman = heckit(AgencyTRACE ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) +
                   log(daystomaturity) + EoW + dMon + EoM + VIX_95_Dummy + quarter,
                 Avg_Spread_Choi ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) +
                   log(daystomaturity) + VIX_95_Dummy + TresholdHYIG_II,
                 data = heckmandata, method = "2step")
The summary generates a probit selection equation and an outcome equation - see below:
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2019085 observations (1915401 censored and 103684 observed)
26 free parameters (df = 2019060)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038164 0.043275 0.882 0.378
SizeCat2 0.201571 0.003378 59.672 < 2e-16 ***
SizeCat3 0.318331 0.008436 37.733 < 2e-16 ***
log(Amt_Issued) -0.099472 0.001825 -54.496 < 2e-16 ***
log(daysfromissuance) 0.079691 0.001606 49.613 < 2e-16 ***
log(daystomaturity) -0.036434 0.001514 -24.066 < 2e-16 ***
EoW 0.021169 0.003945 5.366 8.04e-08 ***
dMon -0.003409 0.003852 -0.885 0.376
EoM 0.008937 0.007000 1.277 0.202
VIX_95_Dummy1 0.088558 0.006521 13.580 < 2e-16 ***
quarter2019.2 -0.092681 0.005202 -17.817 < 2e-16 ***
quarter2019.3 -0.117021 0.005182 -22.581 < 2e-16 ***
quarter2019.4 -0.059833 0.005253 -11.389 < 2e-16 ***
quarter2020.1 -0.005230 0.004943 -1.058 0.290
quarter2020.2 0.073175 0.005080 14.406 < 2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.29436 6.26019 7.395 1.41e-13 ***
SizeCat2 -25.63433 0.79836 -32.109 < 2e-16 ***
SizeCat3 -34.25275 1.48030 -23.139 < 2e-16 ***
log(Amt_Issued) -0.38051 0.39506 -0.963 0.33547
log(daysfromissuance) 0.02452 0.34197 0.072 0.94283
log(daystomaturity) 7.92338 0.24498 32.343 < 2e-16 ***
VIX_95_Dummy1 -2.34875 0.89133 -2.635 0.00841 **
TresholdHYIG_II1 10.36993 1.07267 9.667 < 2e-16 ***
Multiple R-Squared:0.0406, Adjusted R-Squared:0.0405
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -23.8204 3.6910 -6.454 1.09e-10 ***
sigma 68.5011 NA NA NA
rho -0.3477 NA NA NA
Now I'd like to estimate a value using the outcome equation. I'd like to predict Spread_Choi_All using the following data:
newdata = data.frame(SizeCat = as.factor(1),
                     Amt_Issued = 50 * 1000000,
                     daysfromissuance = 5 * 365,
                     daystomaturity = 5 * 365,
                     VIX_95_Dummy = as.factor(0),
                     TresholdHYIG_II = as.factor(0))
SizeCat is a categorical/factor variable with the value 1, 2 or 3.
I have tried various ways, e.g.
predict(Heckman, part = "outcome", newdata = newdata)
I aim to predict a value (with the data from newdata) using the outcome equation (including the invMillsRatio). Is there a way to predict a value from the outcome equation?
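In case it helps, here is a minimal sketch of doing the two-step prediction by hand; the helper function and its arguments are hypothetical, and the regressor vectors must be laid out in the same order as the coefficient vectors reported by summary(Heckman) (the selection vector also needs the selection-only regressors EoW, dMon, EoM and quarter):
# Sketch: conditional prediction E[y | x, observed] from a 2-step Heckman fit.
# gamma_hat: probit (selection) coefficients, beta_hat: outcome coefficients,
# beta_imr: estimated coefficient on the inverse Mills ratio,
# x_sel / x_out: regressor values for the new observation, including the leading 1.
predict_heckit_outcome <- function(gamma_hat, x_sel, beta_hat, x_out, beta_imr) {
  z   <- sum(gamma_hat * x_sel)            # selection-equation linear predictor
  imr <- dnorm(z) / pnorm(z)               # inverse Mills ratio
  sum(beta_hat * x_out) + beta_imr * imr   # outcome prediction incl. selection term
}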
I've made up an example to illustrate my problem. Imagine I have a dataset and I fit a generalized linear model with a Gamma-distributed response.
library(MASS)
df <- read.csv('test.csv')
model <- glm(formula = y ~ method * site + year + 0,
family=Gamma(link = "log"), data = df)
And I get something that looks like this:
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
methodM0 3.89533 0.13670 28.496 < 2e-16 ***
methodM1 5.63965 0.20940 26.933 < 2e-16 ***
methodM2 -55.854107 73.982453 -0.755 0.450
methodM3 -55.731730 73.986509 -0.753 0.451
siteS1 -0.002872 0.098226 -0.029 0.977
siteS2 0.060892 0.107795 0.565 0.572
siteS3 -0.016239 0.102258 -0.159 0.874
year 0.030813 0.036743 0.839 0.402
methodM1:siteS1 -0.030616 0.144592 -0.212 0.832
methodM2:siteS1 -0.030632 0.144663 -0.212 0.832
methodM3:siteS1 0.064179 0.145593 0.441 0.659
methodM1:siteS2 -0.146505 0.152012 -0.964 0.335
methodM2:siteS2 -0.039610 0.148024 -0.268 0.789
methodM3:siteS2 -0.202881 0.150406 -1.349 0.178
methodM1:siteS3 NA NA NA NA
methodM2:siteS3 0.081617 0.144040 0.567 0.571
methodM3:siteS3 -0.064155 0.147771 -0.434 0.664
The table is the result of made-up numbers, but the point is that the interaction between method M1 and site S3 gives NA. How can I set up the GLM so it does not estimate that particular interaction, remove that interaction after fitting, or set those NA values in the model to 0?
Update
@jared_mamrot gave an answer that pointed to this related question, which is very similar:
s <- source("http://pastebin.com/raw.php?i=EcMEVqUC")$value
lm(income ~ age + cit * prof, data=s)
That example uses lm rather than glm, but I found that update didn't even seem to fix it when I ran the accepted answer to the related problem.
model1 <- lm(income ~ age + cit * prof, data=s)
model2 <- update(model1, . ~ . - citforeign:profofficial)
Looking at model1, we have
> model1
Call:
lm(formula = income ~ age + cit * prof, data = s)
Coefficients:
(Intercept) age citwest citforeign
2205.231 -3.825 74.871 30.066
profblue-collar profofficial citwest:profblue-collar citforeign:profblue-collar
-189.146 -147.332 27.792 -60.223
citwest:profofficial citforeign:profofficial
-122.220 NA
And looking at model2, we get exactly the same output:
> model1
Call:
lm(formula = income ~ age + cit * prof, data = s)
Coefficients:
(Intercept) age citwest citforeign
2205.231 -3.825 74.871 30.066
profblue-collar profofficial citwest:profblue-collar citforeign:profblue-collar
-189.146 -147.332 27.792 -60.223
citwest:profofficial citforeign:profofficial
-122.220 NA
As you can see, update doesn't seem to remove the NA.
Can you use update? E.g.
model1 <- glm(formula = y ~ method * site + year + 0,
family=Gamma(link = "log"), data = df)
model2 <- update(model1, . ~ . - methodM1:siteS3)
(per Remove some factor-interaction terms from estimation / https://www.r-bloggers.com/using-the-update-function-during-variable-selection/)
Edit
Here is a way to use the update method on the example data posted at Remove some factor-interaction terms from estimation
s <- source("http://pastebin.com/raw.php?i=EcMEVqUC")$value
model1 <- glm(income ~ age + cit * prof, data=s)
model2 <- update(model1, . ~ . - cit:prof)
summary(model1)
summary(model2)
Edit 2
If you don't want to use update, you could try dropping the interaction (see below), but I don't know how this will affect the validity of the model or whether it is advisable.
s <- source("http://pastebin.com/raw.php?i=EcMEVqUC")$value
model1 <- glm(income ~ age + cit * prof, data=s)
model2 <- glm(income ~ model.matrix(model1)[,1:9], data=s)
summary(model1)
summary(model2)
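Another option along the same lines (a sketch, not part of the original answer): rebuild the design matrix and drop the aliased columns, i.e. those whose coefficients came back NA, then refit.
mm     <- model.matrix(model1)            # design matrix of the original fit
keep   <- !is.na(coef(model1))            # columns with estimable coefficients
model3 <- glm(income ~ mm[, keep] - 1, data = s)  # -1: mm already contains the intercept
summary(model3)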
When I run a logistic regression and use the predict() function, and when I manually calculate with the formula p = 1/(1 + exp(-(b0 + b1*x1 + ...))), I do not get the same answer. What could be the reason?
>test[1,]
loan_status loan_Amount interest_rate period sector sex age grade
10000 0 608 41.72451 12 Online Shop Female 44 D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: 0.05478904
But when I calculate like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
I also calculated it including the insignificant coefficients, but the answer is still not the same. What could be the reason?
You also have an (Intercept) of -1.1542256 and, since this observation has grade D3, a gradeD3 coefficient of 0.9201400 that must be included:
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
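Equivalently, a small sketch that pulls the coefficients from the fitted object instead of retyping them (this assumes the fitted model is called Final_Model, as in the predict() call above; sexMale contributes nothing here because the observation is Female):
cf  <- coef(Final_Model)
eta <- cf["(Intercept)"] + cf["interest_rate"] * 41.72451 +
       cf["age"] * 44 + cf["gradeD3"]
plogis(eta)   # same as 1/(1 + exp(-eta)), giving 0.05478904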
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not include time fixed effects and use the argument effect = 'individual' instead.
What's the deal here? Am I missing something? Are there any other R packages that allow running the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made-up data. You can also use felm from the lfe package to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
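For completeness, a quick programmatic check (a sketch) that the three fits return the same slope estimates:
all.equal(as.numeric(coef(model.a)[c("a", "b")]), as.numeric(coef(model.b)), tolerance = 1e-6)
all.equal(as.numeric(coef(model.b)), as.numeric(coef(model.c)), tolerance = 1e-6)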
I'm doing computer exercises in R from Wooldridge (2012), Introductory Econometrics. Specifically, Chapter 14, Computer Exercise 1 (the data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041).
I estimated the model in first differences (in Python):
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu, data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something in the R package.
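For what it's worth, here is a sketch of a fixed-effects specification that should line up with the first-difference estimates, assuming the panel identifiers in the rental file are city and year and the 1990 dummy is named y90; with only two time periods, the city (individual-effects) within estimator and the first-difference estimator give identical slope estimates, whereas effect = "time" demeans by year rather than by city:
library(plm)
pdata   <- pdata.frame(data, index = c("city", "year"))
modelfe <- plm(lrent ~ y90 + lpop + lavginc + pctstu,
               data = pdata, model = "within", effect = "individual")
summary(modelfe)   # slopes should match the first-difference estimates above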