Predict a value using the "output equation" of a heckit-model (sampleSelection) - r

I estimate a Heckman selection model using the heckit() function from the sampleSelection package.
The model looks as follows:
library(sampleSelection)
Heckman = heckit(AgencyTRACE ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + EoW + dMon + EoM + VIX_95_Dummy + quarter,
                 Avg_Spread_Choi ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + VIX_95_Dummy + TresholdHYIG_II,
                 data = heckmandata, method = "2step")
The summary generates a probit selection equation and an outcome equation - see below:
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2019085 observations (1915401 censored and 103684 observed)
26 free parameters (df = 2019060)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038164 0.043275 0.882 0.378
SizeCat2 0.201571 0.003378 59.672 < 2e-16 ***
SizeCat3 0.318331 0.008436 37.733 < 2e-16 ***
log(Amt_Issued) -0.099472 0.001825 -54.496 < 2e-16 ***
log(daysfromissuance) 0.079691 0.001606 49.613 < 2e-16 ***
log(daystomaturity) -0.036434 0.001514 -24.066 < 2e-16 ***
EoW 0.021169 0.003945 5.366 8.04e-08 ***
dMon -0.003409 0.003852 -0.885 0.376
EoM 0.008937 0.007000 1.277 0.202
VIX_95_Dummy1 0.088558 0.006521 13.580 < 2e-16 ***
quarter2019.2 -0.092681 0.005202 -17.817 < 2e-16 ***
quarter2019.3 -0.117021 0.005182 -22.581 < 2e-16 ***
quarter2019.4 -0.059833 0.005253 -11.389 < 2e-16 ***
quarter2020.1 -0.005230 0.004943 -1.058 0.290
quarter2020.2 0.073175 0.005080 14.406 < 2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.29436 6.26019 7.395 1.41e-13 ***
SizeCat2 -25.63433 0.79836 -32.109 < 2e-16 ***
SizeCat3 -34.25275 1.48030 -23.139 < 2e-16 ***
log(Amt_Issued) -0.38051 0.39506 -0.963 0.33547
log(daysfromissuance) 0.02452 0.34197 0.072 0.94283
log(daystomaturity) 7.92338 0.24498 32.343 < 2e-16 ***
VIX_95_Dummy1 -2.34875 0.89133 -2.635 0.00841 **
TresholdHYIG_II1 10.36993 1.07267 9.667 < 2e-16 ***
Multiple R-Squared:0.0406, Adjusted R-Squared:0.0405
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -23.8204 3.6910 -6.454 1.09e-10 ***
sigma 68.5011 NA NA NA
rho -0.3477 NA NA NA
Now I'd like to estimate a value using the outcome equation. I'd like to predict Spread_Choi_All using the following data:
newdata = data.frame(SizeCat=as.factor(1),
Amt_Issued=50*1000000,
daysfromissuance=5*365,
daystomaturity=5*365,
VIX_95_Dummy=as.factor(0),
TresholdHYIG_II=as.factor(0))
SizeCat is a categorical/factor variable with the value 1, 2 or 3.
I have tried various ways, e.g.
predict(Heckman, part ="outcome", newdata = newdata)
I aim to predict a value (with the data from newdata) using the outcome equation (incl. the invMillsRatio). Is there a way to predict a value from the outcome equation?
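One way to sanity-check any predict() result is to plug the numbers into the two-step formula by hand. The following is a minimal sketch, assuming the conditional mean E[y | selected] = x'beta + beta_lambda * dnorm(z'gamma) / pnorm(z'gamma), with coefficients copied from the summary above. Since newdata has no values for EoW, dMon, EoM and quarter, the sketch assumes they are 0 or at their reference level; that is an assumption made for illustration, not part of the question.

```r
# Coefficients copied from the summary above. SizeCat = 1, VIX_95_Dummy = 0 and
# TresholdHYIG_II = 0 are reference levels, so their dummies contribute 0.
# Order: intercept, log(Amt_Issued), log(daysfromissuance), log(daystomaturity)
gamma <- c(0.038164, -0.099472, 0.079691, -0.036434)  # probit selection equation
beta  <- c(46.29436, -0.38051, 0.02452, 7.92338)      # outcome equation
beta_lambda <- -23.8204                                # invMillsRatio coefficient

# Assumption: EoW = dMon = EoM = 0 and quarter at its reference level
x <- c(1, log(50 * 1000000), log(5 * 365), log(5 * 365))

z_gamma <- sum(gamma * x)                    # probit linear predictor
imr     <- dnorm(z_gamma) / pnorm(z_gamma)   # inverse Mills ratio
xb      <- sum(beta * x)                     # outcome equation without correction
y_cond  <- xb + beta_lambda * imr            # expected outcome given selection

c(linear_part = xb, inv_mills = imr, conditional = y_cond)
```

The linear_part entry is the outcome equation alone; the conditional entry adds the selection correction through the invMillsRatio term.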

Related

Using predict() and calculating manually in Logistic regression in R does not match. What is the reason?

When I run logistic regression and use predict() function and when I manually calculate with formula p=1/(1+e^-(b0+b1*x1...)) I cannot get the same answer. What could be the reason?
>test[1,]
      loan_status loan_Amount interest_rate period      sector    sex age grade
10000           0         608      41.72451     12 Online Shop Female  44    D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: **0.05478904**
But when I calculate like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
I also calculated it with the insignificant coefficients included, but the answer is still not the same. What could be the reason?
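Your manual calculation leaves out two terms: the (Intercept), -1.1542256, and the gradeD3 coefficient, 0.9201400, which applies because this row has grade D3. (sex is Female, the reference level, so sexMale contributes nothing.)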
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
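To make it harder to drop a term, the same calculation can be written with named coefficients and plogis(). This is a minimal sketch that uses only the coefficients that are non-zero for this row; the vector names coefs and x_row are illustrative.

```r
# Coefficients from the glm output that apply to test[1,]
coefs <- c("(Intercept)" = -1.1542256,
           interest_rate = -0.0479764603,
           age           = -0.0139099563,
           gradeD3       = 0.9201400)

# Matching design row: 1 for the intercept, the raw values, 1 for the gradeD3 dummy
x_row <- c("(Intercept)" = 1, interest_rate = 41.72451, age = 44, gradeD3 = 1)

eta  <- sum(coefs * x_row[names(coefs)])  # linear predictor
prob <- plogis(eta)                       # same as 1 / (1 + exp(-eta))
prob
```

Indexing x_row by names(coefs) ensures every coefficient is paired with its own predictor value, whatever order the vectors were typed in.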

Is there a way to extract $R^2$ of the coeftest() function?

I have a question about the function coeftest(). I am trying to figure out where I can get the R-squared from the results of this function. I fitted a standard multiple linear regression as follows:
Wetterstation.lm <- lm(temp~t+temp_auto+dum.jan+dum.feb+dum.mar+dum.apr+dum.may+dum.jun+dum.aug+dum.sep+dum.oct+dum.nov+dum.dec+
dum.jan*t+dum.feb*t+dum.mar*t+dum.apr*t+dum.may*t+dum.jun*t+dum.aug*t+dum.sep*t+dum.oct*t+dum.nov*t+dum.dec*t)
Upfront I defined each of these variables separately and my results were the following:
> summary(Wetterstation.lm)
Call:
lm(formula = temp ~ t + temp_auto + dum.jan + dum.feb + dum.mar +
dum.apr + dum.may + dum.jun + dum.aug + dum.sep + dum.oct +
dum.nov + dum.dec + dum.jan * t + dum.feb * t + dum.mar *
t + dum.apr * t + dum.may * t + dum.jun * t + dum.aug * t +
dum.sep * t + dum.oct * t + dum.nov * t + dum.dec * t)
Residuals:
Min 1Q Median 3Q Max
-10.9564 -1.3214 0.0731 1.3621 9.9312
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.236e+00 9.597e-02 33.714 < 2e-16 ***
t 1.206e-05 3.744e-06 3.221 0.00128 **
temp_auto 8.333e-01 2.929e-03 284.503 < 2e-16 ***
dum.jan -3.550e+00 1.252e-01 -28.360 < 2e-16 ***
dum.feb -3.191e+00 1.258e-01 -25.367 < 2e-16 ***
dum.mar -2.374e+00 1.181e-01 -20.105 < 2e-16 ***
dum.apr -1.582e+00 1.142e-01 -13.851 < 2e-16 ***
dum.may -7.528e-01 1.106e-01 -6.809 9.99e-12 ***
dum.jun -3.283e-01 1.106e-01 -2.968 0.00300 **
dum.aug -2.144e-01 1.094e-01 -1.960 0.05005 .
dum.sep -8.009e-01 1.103e-01 -7.260 3.96e-13 ***
dum.oct -1.752e+00 1.123e-01 -15.596 < 2e-16 ***
dum.nov -2.622e+00 1.181e-01 -22.198 < 2e-16 ***
dum.dec -3.287e+00 1.226e-01 -26.808 < 2e-16 ***
t:dum.jan 2.626e-06 5.277e-06 0.498 0.61877
t:dum.feb 2.479e-06 5.404e-06 0.459 0.64642
t:dum.mar 1.671e-06 5.277e-06 0.317 0.75145
t:dum.apr 1.357e-06 5.320e-06 0.255 0.79872
t:dum.may -3.173e-06 5.276e-06 -0.601 0.54756
t:dum.jun 2.481e-06 5.320e-06 0.466 0.64098
t:dum.aug 5.998e-07 5.298e-06 0.113 0.90985
t:dum.sep -5.997e-06 5.321e-06 -1.127 0.25968
t:dum.oct -5.860e-06 5.277e-06 -1.110 0.26683
t:dum.nov -4.215e-06 5.320e-06 -0.792 0.42820
t:dum.dec -2.526e-06 5.277e-06 -0.479 0.63217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.12 on 35744 degrees of freedom
Multiple R-squared: 0.9348, Adjusted R-squared: 0.9348
F-statistic: 2.136e+04 on 24 and 35744 DF, p-value: < 2.2e-16
Now I was trying to adjust for heteroskedasticity and autocorrelation using the function coeftest() and vcovHAC as follows:
library("lmtest")
library("sandwich")
Wetterstation.lm.HAC <- coeftest(Wetterstation.lm, vcov = vcovHAC)
The results of these are that:
> Wetterstation.lm.HAC
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2356e+00 7.8816e-02 41.0529 < 2.2e-16 ***
t 1.2059e-05 2.7864e-06 4.3280 1.509e-05 ***
temp_auto 8.3334e-01 2.9798e-03 279.6659 < 2.2e-16 ***
dum.jan -3.5505e+00 1.1843e-01 -29.9789 < 2.2e-16 ***
dum.feb -3.1909e+00 1.2296e-01 -25.9507 < 2.2e-16 ***
dum.mar -2.3741e+00 1.0890e-01 -21.8002 < 2.2e-16 ***
dum.apr -1.5821e+00 9.5952e-02 -16.4881 < 2.2e-16 ***
dum.may -7.5282e-01 8.8987e-02 -8.4599 < 2.2e-16 ***
dum.jun -3.2826e-01 8.2271e-02 -3.9899 6.622e-05 ***
dum.aug -2.1440e-01 7.7966e-02 -2.7499 0.005964 **
dum.sep -8.0094e-01 8.4456e-02 -9.4835 < 2.2e-16 ***
dum.oct -1.7519e+00 9.2919e-02 -18.8538 < 2.2e-16 ***
dum.nov -2.6224e+00 1.0028e-01 -26.1504 < 2.2e-16 ***
dum.dec -3.2873e+00 1.1393e-01 -28.8546 < 2.2e-16 ***
t:dum.jan 2.6256e-06 5.2429e-06 0.5008 0.616517
t:dum.feb 2.4790e-06 5.5284e-06 0.4484 0.653850
t:dum.mar 1.6713e-06 4.8632e-06 0.3437 0.731107
t:dum.apr 1.3567e-06 4.5670e-06 0.2971 0.766423
t:dum.may -3.1734e-06 4.2970e-06 -0.7385 0.460209
t:dum.jun 2.4809e-06 4.1490e-06 0.5979 0.549880
t:dum.aug 5.9983e-07 4.0379e-06 0.1485 0.881910
t:dum.sep -5.9975e-06 4.1675e-06 -1.4391 0.150125
t:dum.oct -5.8595e-06 4.4635e-06 -1.3128 0.189265
t:dum.nov -4.2151e-06 4.6555e-06 -0.9054 0.365263
t:dum.dec -2.5257e-06 4.9871e-06 -0.5065 0.612539
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But I want to add the R-squared to a table that summarizes my results, and I cannot figure out how to get it. Could anyone tell me where I can get this information from? Maybe I am just too dumb, but as I am quite new to R I would be happy for any help I could get.
Thank you very much in advance.
Simple answer: no, there is not, and there is no reason to do this. The coeftest() function works with the values of the model you give it: it takes the model's coefficients with coef() (stats4::coef in its source).
It would only be possible to extract an R^2 if the function were designed to return one. The object returned by lmtest::coeftest() is a coefficient table (estimates, standard errors, t values and p-values), so there is no R^2 in it to extract.
To sum up: lmtest::coeftest() reuses the values of the lm fit, so the R^2 does not change and can be taken from summary(Wetterstation.lm).
Background on the standard errors: summary.lm uses a slightly different method to calculate them. In its source code you find:
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
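So this brings us to the following: lm uses chol2inv(), i.e. an inverse computed from a Cholesky-type triangular factor (I also think R uses this by default).
coeftest(), meanwhile, takes the square root of the diagonal of the variance-covariance matrix supplied through its vcov. argument, here vcovHAC, which accounts for heteroskedasticity and autocorrelation. Without that argument, vcov() extracts the variance-covariance matrix of the given model (like lm).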
se <- vcov.
se <- sqrt(diag(se))
I personally think the authors of lm use the Cholesky decomposition for speed reasons, since you don't have to invert the matrix explicitly, and that the authors of the lmtest package were not aware of this. But this is just a guess.
Since the t values are calculated in both packages with the estimated values and the standard error like this:
tval <- as.vector(est)/se
and the p-value is based on the tval:
2*pt(abs(tval), rdf, lower.tail = FALSE)
all the differences come from the different standard errors. Please be aware that the estimates are identical because coeftest() just inherits them from the model.
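Because the R^2 belongs to the lm fit rather than to the coeftest() table, it can be read from the model summary and placed in the results table next to the robust standard errors. A minimal sketch on the built-in cars data, which stands in for Wetterstation.lm here:

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in for Wetterstation.lm

s <- summary(fit)
r2     <- s$r.squared       # multiple R-squared
adj_r2 <- s$adj.r.squared   # adjusted R-squared

c(r.squared = r2, adj.r.squared = adj_r2)
```

The same two components exist on summary(Wetterstation.lm), so the values do not have to be copied from the printed output.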

How do I return the middle of an object

I am running a regression with state*year covariates for all U.S. states. I need to access the ones for N.J., which are probably only a few states down from where my output stops. How do I access the middle of the object while maintaining the structure of the output so that I can identify the corresponding state coefficients? Here is my code thus far:
g <- as.Formula(fatalityrate~sb_useage+speed65+speed70+ba08+drinkage21+log(income)+age)
g <- update(g, . ~ . + year)
g <- update(g, . - fatalityrate + sb_useage ~ . - sb_useage + primary + secondary + factor(state)*year)
sb <- plm(g, data = pnl_sb, model = "within", effect = "twoways", index = c("fips","year"))
sb_c <- coeftest(sb, vcovHC)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
speed65 -4.3506e-02 1.4132e-15 -3.0785e+13 < 2.2e-16 ***
speed70 2.7920e-02 1.7666e-15 1.5805e+13 < 2.2e-16 ***
ba08 1.3878e-01 7.9562e-15 1.7443e+13 < 2.2e-16 ***
drinkage21 -5.2707e-01 1.4493e-14 -3.6368e+13 < 2.2e-16 ***
log(income) 1.0611e-01 3.5164e-14 3.0175e+12 < 2.2e-16 ***
age 1.7745e-01 1.5497e-14 1.1450e+13 < 2.2e-16 ***
primary 4.2674e-02 9.3908e-15 4.5443e+12 < 2.2e-16 ***
secondary -1.9863e-01 6.4951e-15 -3.0581e+13 < 2.2e-16 ***
year1984:factor(state)AL 3.2376e-01 1.0903e-14 2.9695e+13 < 2.2e-16 ***
year1985:factor(state)AL 6.5355e-01 9.5593e-15 6.8368e+13 < 2.2e-16 ***
year1986:factor(state)AL 4.7362e-01 7.2047e-15 6.5737e+13 < 2.2e-16 ***
year1987:factor(state)AL 2.8150e-01 3.1782e-15 8.8574e+13 < 2.2e-16 ***
year1988:factor(state)AL -1.4605e-01 1.1052e-14 -1.3214e+13 < 2.2e-16 ***
.
.
.
I thought to use the head or tail option but I have not been able to find documentation on using these commands to access the middle of a dataframe. Also I need to be able to identify the state associated with the coefficient, and head/tail command only returns the coefficient.
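If the table behaves like a matrix whose row names are the coefficient names, as the printed output suggests, rows can be selected by name rather than by position. This is a minimal sketch on a small hand-made matrix; the values and the "NJ" label are hypothetical stand-ins, not results from the model.

```r
# Hypothetical stand-in for the coefficient table (values are made up)
tab <- matrix(
  c(0.32, 0.65, 0.11, 0.27,
    0.01, 0.01, 0.02, 0.02),
  ncol = 2,
  dimnames = list(
    c("year1984:factor(state)AL", "year1985:factor(state)AL",
      "year1984:factor(state)NJ", "year1985:factor(state)NJ"),
    c("Estimate", "Std. Error")
  )
)

# Keep only rows whose name contains the state term; drop = FALSE keeps the
# matrix layout (row names and columns) even if a single row matches.
nj <- tab[grepl("factor(state)NJ", rownames(tab), fixed = TRUE), , drop = FALSE]
nj
```

Assuming sb_c supports the same matrix-style indexing, replacing tab with sb_c in the last step would return the N.J. rows with their names attached.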

Print (display) reference category in R's lm summary?

How can you print the reference category used when a categorical/nominal variable is entered into a linear model. Here's an example:
summary(lm(data = iris, Sepal.Length ~ Species))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
Here's what I'd like:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
**Reference: Speciessetosa**
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
If there is a way to make this work generally (when there are multiple categorical predictors, then each reference group is easily identifiable), that would be most excellent. And if there is a way to make the formatting particularly clear, that would be doubly excellent (I'm not beholden to the example formatting above).
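You can specify that your intercept is zero. Every level then gets its own coefficient, so none is hidden as the reference; note that the estimates become the group means rather than differences from the reference level.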
summary(lm(Sepal.Length ~ Species + 0, data = iris))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#Speciessetosa 5.0060 0.0728 68.76 <2e-16 ***
#Speciesversicolor 5.9360 0.0728 81.54 <2e-16 ***
#Speciesvirginica 6.5880 0.0728 90.49 <2e-16 ***
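Another option that keeps the usual intercept is to look up the reference levels from the fitted model. A minimal sketch, assuming R's default treatment contrasts, under which the first level of each factor is the reference:

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# xlevels lists the levels of every factor predictor; with default treatment
# contrasts the first level is the reference category.
ref <- vapply(fit$xlevels, function(lv) lv[1], character(1))
ref
```

Because xlevels has one entry per factor predictor, this also covers models with several categorical variables.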

How to read an item from summary in R

I am using Aparch/Garch model (library: "fGarch") and want to read (& use later) the objects like AIC, t-values of the coefficients in the summary of the model fit. How can I do this?
m3<-(garchFit(~arma(1,0)+aparch(1,1), cond.dist= "sged" ,data=t2, trace=FALSE))
summary(m3)
Title:
GARCH Modelling
Call:
garchFit(formula = ~arma(1, 0) + aparch(1, 1), data = t2, cond.dist = "sged",
trace = FALSE)
Mean and Variance Equation:
data ~ arma(1, 0) + aparch(1, 1)
[data = t2]
Conditional Distribution:
sged
Coefficient(s):
mu ar1 omega alpha1 gamma1 beta1 delta skew shape
0.00063936 0.07745422 0.00116542 0.24170185 0.19179650 0.74430731 1.11902269 1.06401615 1.23013925
Std. Errors:
based on Hessian
Error Analysis:
Estimate Std. Error t value Pr(>|t|)
mu 0.0006394 0.0004789 1.335 0.181828
ar1 0.0774542 0.0256070 3.025 0.002489 **
omega 0.0011654 0.0003097 3.763 0.000168 ***
alpha1 0.2417019 0.0368264 6.563 5.26e-11 ***
gamma1 0.1917965 0.0699436 2.742 0.006104 **
beta1 0.7443073 0.0383066 19.430 < 2e-16 ***
delta 1.1190227 0.2569665 4.355 1.33e-05 ***
skew 1.0640162 0.0295095 36.057 < 2e-16 ***
shape 1.2301392 0.0592616 20.758 < 2e-16 ***
Information Criterion Statistics:
AIC BIC SIC HQIC
-4.835325 -4.803583 -4.835395 -4.823503
I think you'll have to extract those from the output of garchFit, not its summary. Start by looking at:
> attributes(m3)
Then you can access something like the t-values stored in the fit slot by doing
> m3@fit$tval
