Suppose I have a formula y~x1+x2+I( (x1==x2)*x3 ) and estimate a linear model:
summary(lm(y~x1+x2+I( (x1==x2)*x3 ), data=some_data ))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6027 1.6069 2.242 0.0662 .
x1 1.8685 1.9769 0.945 0.3811
x2 2.6041 2.0286 1.284 0.2466
I((x1 == x2) * x3) 0.5666 1.5456 0.367 0.7265
Besides creating a new 'data.frame', is there any way to modify the formula so that the summary table becomes:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6027 1.6069 2.242 0.0662 .
x1 1.8685 1.9769 0.945 0.3811
x2 2.6041 2.0286 1.284 0.2466
some_name 0.5666 1.5456 0.367 0.7265
Using within you could hack something up.
Instead of:
summary(lm(Sepal.Length ~ I(Sepal.Width * Petal.Length), data=iris))
#...
#(Intercept) 4.252934 0.069396 61.28 <2e-16 ***
#I(Sepal.Width * Petal.Length) 0.142483 0.005632 25.30 <2e-16 ***
#...
You can use:
summary(lm(Sepal.Length ~ newvar, data=
within(iris, newvar <- Sepal.Width * Petal.Length)))
#...
#(Intercept) 4.252934 0.069396 61.28 <2e-16 ***
#newvar 0.142483 0.005632 25.30 <2e-16 ***
#...
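If the only goal is a friendlier label in the printed table, another option is to rename the row of the coefficient matrix after fitting, without touching the data. This is a minimal sketch, assuming the term of interest is the second row of the table; `fit`, `s` and the label `some_name` are illustrative names only.

```r
# Fit with the I() term in the formula, as in the question.
fit <- lm(Sepal.Length ~ I(Sepal.Width * Petal.Length), data = iris)

# summary() stores the coefficient table as a matrix; its row names are
# what gets printed, so renaming a row only changes the label.
s <- summary(fit)
rownames(s$coefficients)[2] <- "some_name"

print(s)
```

This only changes how the summary is displayed. The fitted object itself still carries the original term name, so anything that works from `fit` (for example `predict`) is unaffected.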
Related
I estimate a heckit model using the heckit function from the sampleSelection package.
The model looks as follows:
library(sampleSelection)
Heckman = heckit(AgencyTRACE ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + EoW + dMon + EoM + VIX_95_Dummy + quarter,
                 Avg_Spread_Choi ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + VIX_95_Dummy + TresholdHYIG_II,
                 data=heckmandata, method = "2step")
The summary generates a probit selection equation and an outcome equation - see below:
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2019085 observations (1915401 censored and 103684 observed)
26 free parameters (df = 2019060)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038164 0.043275 0.882 0.378
SizeCat2 0.201571 0.003378 59.672 < 2e-16 ***
SizeCat3 0.318331 0.008436 37.733 < 2e-16 ***
log(Amt_Issued) -0.099472 0.001825 -54.496 < 2e-16 ***
log(daysfromissuance) 0.079691 0.001606 49.613 < 2e-16 ***
log(daystomaturity) -0.036434 0.001514 -24.066 < 2e-16 ***
EoW 0.021169 0.003945 5.366 8.04e-08 ***
dMon -0.003409 0.003852 -0.885 0.376
EoM 0.008937 0.007000 1.277 0.202
VIX_95_Dummy1 0.088558 0.006521 13.580 < 2e-16 ***
quarter2019.2 -0.092681 0.005202 -17.817 < 2e-16 ***
quarter2019.3 -0.117021 0.005182 -22.581 < 2e-16 ***
quarter2019.4 -0.059833 0.005253 -11.389 < 2e-16 ***
quarter2020.1 -0.005230 0.004943 -1.058 0.290
quarter2020.2 0.073175 0.005080 14.406 < 2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.29436 6.26019 7.395 1.41e-13 ***
SizeCat2 -25.63433 0.79836 -32.109 < 2e-16 ***
SizeCat3 -34.25275 1.48030 -23.139 < 2e-16 ***
log(Amt_Issued) -0.38051 0.39506 -0.963 0.33547
log(daysfromissuance) 0.02452 0.34197 0.072 0.94283
log(daystomaturity) 7.92338 0.24498 32.343 < 2e-16 ***
VIX_95_Dummy1 -2.34875 0.89133 -2.635 0.00841 **
TresholdHYIG_II1 10.36993 1.07267 9.667 < 2e-16 ***
Multiple R-Squared:0.0406, Adjusted R-Squared:0.0405
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -23.8204 3.6910 -6.454 1.09e-10 ***
sigma 68.5011 NA NA NA
rho -0.3477 NA NA NA
Now I'd like to estimate a value using the outcome equation. I'd like to predict Spread_Choi_All using the following data:
newdata = data.frame(SizeCat=as.factor(1),
Amt_Issued=50*1000000,
daysfromissuance=5*365,
daystomaturity=5*365,
VIX_95_Dummy=as.factor(0),
TresholdHYIG_II=as.factor(0))
SizeCat is a categorical/factor variable with the value 1, 2 or 3.
I have tried various ways, e.g.
predict(Heckman, part ="outcome", newdata = newdata)
I aim to predict a value (with the data from newdata) using the outcome equation (incl. the invMillsRatio). Is there a way to predict a value from the outcome equation?
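One thing to note: the inverse Mills ratio is a function of the selection (probit) linear predictor, so it needs values for every selection-equation regressor. The newdata above has no EoW, dMon, EoM or quarter, so the invMillsRatio term cannot be evaluated from it alone. The sketch below shows the arithmetic on simulated data with a hand-rolled two-step fit in base R. It is not the sampleSelection implementation, and every variable name in it (`sel`, `x`, `z`, `imr`, and so on) is made up for illustration.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)                      # regressor in both equations
z <- rnorm(n)                      # regressor in the selection equation only
u <- rnorm(n)                      # selection error
e <- 0.5 * u + rnorm(n)            # outcome error, correlated with u
sel <- (0.5 + x + z + u) > 0       # TRUE when the outcome is observed
y <- ifelse(sel, 1 + 2 * x + e, NA)
d <- data.frame(sel, y, x, z)

# Step 1: probit for selection, then the inverse Mills ratio
# lambda = dnorm(index) / pnorm(index) on the probit linear predictor.
probit <- glm(sel ~ x + z, family = binomial(link = "probit"), data = d)
index <- predict(probit)           # type = "link" is the default
d$imr <- dnorm(index) / pnorm(index)

# Step 2: outcome regression on the observed rows, with the IMR added.
outcome <- lm(y ~ x + imr, data = d, subset = sel)

# Prediction for a new observation: the IMR must be rebuilt from the
# selection equation first, so newdata needs x AND z.
new <- data.frame(x = 0.3, z = -0.2)
new_index <- predict(probit, newdata = new)
new$imr <- dnorm(new_index) / pnorm(new_index)

pred_cond <- predict(outcome, newdata = new)               # incl. IMR term
pred_uncond <- pred_cond - coef(outcome)["imr"] * new$imr  # x'beta only
```

Here `pred_cond` is the expected outcome given that the observation is selected, and `pred_uncond` is the linear part without the correction term. In the heckit output above, the invMillsRatio coefficient plays the role of `coef(outcome)["imr"]`.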
I have a question about Poisson GLM and formula representation:
Considering a data set:
p <- read.csv("https://raw.githubusercontent.com/Leprechault/PEN-533/master/bradysia-greenhouse.csv")
Without considering the interaction:
m1 <- glm(bradysia ~ area + mes, family="quasipoisson", data=p)
summary(m1)
#(Intercept) 4.36395 0.12925 33.765 < 2e-16 ***
#areaCV -0.19696 0.12425 -1.585 0.113
#areaMJC -0.71543 0.08553 -8.364 3.11e-16 ***
#mes -0.08872 0.01970 -4.503 7.82e-06 ***
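The final formula is: bradysia = exp(4.36395 - 0.19696*CV - 0.71543*MJC - 0.08872*mes), where CV and MJC are 0/1 indicators for the area levels and the remaining area level is the reference.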
Considering the interaction:
m2 <- glm(bradysia ~ area*mes, family="quasipoisson", data=p)
summary(m2)
#(Intercept) 4.05682 0.15468 26.227 < 2e-16 ***
#areaCV 0.15671 0.35219 0.445 0.6565
#areaMJC 0.54132 0.31215 1.734 0.0833 .
#mes -0.03943 0.02346 -1.680 0.0933 .
#areaCV:mes -0.05724 0.05579 -1.026 0.3052
#areaMJC:mes -0.22609 0.05576 -4.055 5.57e-05 ***
With the interaction, what is the final formula, bradysia = exp(?????)? Any help, please?
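Under R's default treatment contrasts, each coefficient multiplies its column of the design matrix, and an interaction column is the product of the dummy and mes. Reading the m2 table that way gives:

bradysia = exp(4.05682 + 0.15671*CV + 0.54132*MJC - 0.03943*mes - 0.05724*CV*mes - 0.22609*MJC*mes)

where CV and MJC are 0/1 indicators and the remaining area level is the reference. The sketch below checks this reading against predict() on simulated data, since the CSV is not loaded here. The level names CS, CV, MJC and the data frame `p_sim` are assumptions for illustration.

```r
set.seed(42)
# Simulated stand-in for the real data: three areas, months 1..10.
p_sim <- data.frame(
  area = factor(rep(c("CS", "CV", "MJC"), each = 40)),
  mes  = rep(1:10, 12)
)
p_sim$bradysia <- rpois(nrow(p_sim), lambda = exp(4 - 0.05 * p_sim$mes))

m <- glm(bradysia ~ area * mes, family = quasipoisson, data = p_sim)
b <- coef(m)

# One new observation: area MJC in month 6.
new <- data.frame(area = factor("MJC", levels = levels(p_sim$area)), mes = 6)
CV  <- as.numeric(new$area == "CV")    # 0/1 dummy for area CV
MJC <- as.numeric(new$area == "MJC")   # 0/1 dummy for area MJC

# The formula written out term by term, then back-transformed with exp().
by_hand <- exp(b["(Intercept)"] + b["areaCV"] * CV + b["areaMJC"] * MJC +
               b["mes"] * new$mes +
               b["areaCV:mes"] * CV * new$mes +
               b["areaMJC:mes"] * MJC * new$mes)

from_predict <- predict(m, newdata = new, type = "response")
all.equal(unname(by_hand), unname(from_predict))
```

For the reference area both dummies are zero, so the formula reduces to exp(intercept + slope*mes); for the other areas the interaction coefficients shift the slope on mes.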
How can you print the reference category used when a categorical/nominal variable is entered into a linear model? Here's an example:
summary(lm(data = iris, Sepal.Length ~ Species))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
Here's what I'd like:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
**Reference: Speciessetosa**
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
If there is a way to make this work generally (when there are multiple categorical predictors, then each reference group is easily identifiable), that would be most excellent. And if there is a way to make the formatting particularly clear, that would be doubly excellent (I'm not beholden to the example formatting above).
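You can specify that your intercept is zero. With a single categorical predictor, every level then gets its own coefficient (its group mean) instead of a difference from the reference level.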
summary(lm(Sepal.Length ~ Species + 0, data = iris))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#Speciessetosa 5.0060 0.0728 68.76 <2e-16 ***
#Speciesversicolor 5.9360 0.0728 81.54 <2e-16 ***
#Speciesvirginica 6.5880 0.0728 90.49 <2e-16 ***
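To keep the intercept and still report the baseline, the fitted object already stores the levels of each factor predictor in `fit$xlevels`. A minimal sketch, assuming the default treatment contrasts (where the first level is the reference); `fit` and `ref_levels` are illustrative names.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# xlevels is a named list: one character vector of levels per factor
# predictor. Under treatment contrasts the first level is the baseline.
ref_levels <- vapply(fit$xlevels, function(lv) lv[1], character(1))

print(coef(summary(fit)))
# One "Reference:" line per factor, in the same name+level style
# that R uses for coefficient labels.
cat(sprintf("Reference: %s%s\n", names(ref_levels), ref_levels), sep = "")
```

Because `xlevels` has one entry per factor, the same two lines work with several categorical predictors. With other contrast schemes (for example sum or polynomial contrasts) there is no single reference level, so the first level should not be read that way.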
I am trying to see in practice what was explained here: what happens to the coefficients once the labels are switched. However, I am not getting what I expected. Here is my attempt:
I am using the natality public-use data given as an example in "Practical Data Science with R", where the outcome is a logical variable, atRisk, with levels FALSE and TRUE, that classifies newborn babies as at risk or not.
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results in the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis where I am expecting to see a change in the signs of the above reported coefficients, however, I am getting exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] and only afterwards changed sdata$atRisk <- factor(sdata$atRisk), but your model is using the train dataset, whose levels DID NOT get changed.
Instead you can do:
y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
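A related point: for a factor response, a binomial glm treats the first level as "failure" and all others as "success". Assigning new names with `levels(f) <- ...` only relabels the existing levels and keeps their order, so the fit would stay the same even if it were applied to train. Reordering the levels, or negating the response as above, is what flips the signs. A small sketch on simulated data; the data frame `d` and its columns are made up for illustration.

```r
set.seed(7)
d <- data.frame(x = rnorm(500))
d$atRisk <- rbinom(500, 1, plogis(-1 + 0.8 * d$x)) == 1   # logical outcome

m_orig <- glm(atRisk ~ x, data = d, family = binomial(link = "logit"))

# Relabelling: same underlying codes, only the names are swapped.
d$relabelled <- factor(d$atRisk)              # levels: "FALSE", "TRUE"
levels(d$relabelled) <- c("TRUE", "FALSE")    # old FALSE is now called "TRUE"
m_relabelled <- glm(relabelled ~ x, data = d, family = binomial(link = "logit"))

# Reordering: "TRUE" becomes the first level, so FALSE is now the "success".
d$reordered <- factor(d$atRisk, levels = c("TRUE", "FALSE"))
m_reordered <- glm(reordered ~ x, data = d, family = binomial(link = "logit"))

rbind(original   = coef(m_orig),
      relabelled = coef(m_relabelled),
      reordered  = coef(m_reordered))
```

The relabelled fit should reproduce the original coefficients, while the reordered fit should give the same magnitudes with opposite signs, matching the `!atRisk` output above.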
I have a list named "mylist" that contains gam outputs. The summary of its first element is the following:
> summary(mylist[[1]][[1]])
Family: quasipoisson
Link function: log
Formula:
cardva ~ s(trend, k = 11 * 6, fx = T, bs = "cr") + s(temp_01, k = 6, fx = F, bs = "cr") + rh_01 + as.factor(dow) + s(fluepi, k = 4, fx = F, bs = "cr") + as.factor(holiday) + Lag(pm1010, 0)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1584139 0.0331388 95.309 < 2e-16 ***
rh_01 0.0005441 0.0004024 1.352 0.17639
as.factor(dow)2 0.0356757 0.0127979 2.788 0.00533 **
as.factor(dow)3 0.0388823 0.0128057 3.036 0.00241 **
as.factor(dow)4 0.0107302 0.0129014 0.832 0.40561
as.factor(dow)5 0.0243382 0.0128705 1.891 0.05867 .
as.factor(dow)6 0.0277954 0.0128360 2.165 0.03040 *
as.factor(dow)7 0.0275593 0.0127373 2.164 0.03053 *
as.factor(holiday)1 0.0444349 0.0147219 3.018 0.00255 **
Lag(pm1010, 0) -0.0010816 0.0042891 -0.252 0.80091
After unlisting the list I have extracted the coefficients of the linear terms for the first list:
> head(plist)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1584139271 0.0331388386 95.3085280 0.000000000
rh_01 0.0005441175 0.0004024202 1.3521128 0.176392590
as.factor(dow)2 0.0356757100 0.0127979429 2.7876128 0.005327293
as.factor(dow)3 0.0388823055 0.0128056733 3.0363343 0.002405504
as.factor(dow)4 0.0107302325 0.0129013816 0.8317119 0.405606249
as.factor(dow)5 0.0243382447 0.0128704711 1.8910143 0.058672841
as.factor(dow)6 0.0277953708 0.0128359850 2.1654256 0.030396240
as.factor(dow)7 0.0275592574 0.0127372874 2.1636677 0.030531063
as.factor(holiday)1 0.0444348611 0.0147218816 3.0182868 0.002553265
Lag(pm1010, 0) -0.0010816252 0.0042890866 -0.2521808 0.800910389
My question is: is it possible to include the name of the dependent variable (in this example, cardva) as part of the plist?
What I want to achieve is (output deliberately reduced)
cardva Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1584139271 0.0331388386 95.3085280 0.000000000
rh_01 0.0005441175 0.0004024202 1.3521128 0.176392590
as.factor(dow)2 0.0356757100 0.0127979429 2.7876128 0.005327293
or
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1584139271 0.0331388386 95.3085280 0.000000000
rh_01 0.0005441175 0.0004024202 1.3521128 0.176392590
as.factor(dow)7 0.0275592574 0.0127372874 2.1636677 0.030531063
as.factor(holiday)1 0.0444348611 0.0147218816 3.0182868 0.002553265
cardva_Lag(pm1010, 0) -0.0010816252 0.0042890866 -0.2521808 0.800910389
Two options: Name the nodes of the list so they would then be printed as:
names(plist)[1] <- 'cardva'
plist[1]
$cardva
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1584139271 0.0331388386 95.3085280 0.000000000
rh_01 0.0005441175 0.0004024202 1.3521128 0.176392590
as.factor(dow)2 0.0356757100 0.0127979429 2.7876128 0.005327293
as.factor(dow)3 0.0388823055 0.0128056733 3.0363343 0.002405504
as.factor(dow)4 0.0107302325 0.0129013816 0.8317119 0.405606249
as.factor(dow)5 0.0243382447 0.0128704711 1.8910143 0.058672841
as.factor(dow)6 0.0277953708 0.0128359850 2.1654256 0.030396240
as.factor(dow)7 0.0275592574 0.0127372874 2.1636677 0.030531063
as.factor(holiday)1 0.0444348611 0.0147218816 3.0182868 0.002553265
Lag(pm1010, 0) -0.0010816252 0.0042890866 -0.2521808 0.800910389
Or:
temp <- plist[[1]]
rownames(temp)[nrow(temp)] <- paste0( "cardva_", rownames(temp)[nrow(temp)] )
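With many models, the response name can be pulled from each model's formula rather than typed by hand. The sketch below uses two lm fits on mtcars as stand-ins for the gam objects; `models`, `plist` and `plist_prefixed` are illustrative names, and it assumes `formula()` returns a two-sided formula for the model class in use.

```r
# Stand-ins for the fitted models (the question uses gam objects).
models <- list(
  lm(mpg  ~ wt + hp, data = mtcars),
  lm(qsec ~ wt + hp, data = mtcars)
)

# Coefficient tables, one per model.
plist <- lapply(models, function(m) coef(summary(m)))

# Option 1: name each list element after its response variable.
# formula(m)[[2]] is the left-hand side of the model formula.
names(plist) <- vapply(models, function(m) deparse(formula(m)[[2]]),
                       character(1))

# Option 2: prefix every row name with the response name instead.
plist_prefixed <- Map(function(tab, nm) {
  rownames(tab) <- paste(nm, rownames(tab), sep = "_")
  tab
}, plist, names(plist))

plist$mpg
rownames(plist_prefixed$qsec)
```

To prefix only the last row, as in the second option above, apply the same `paste` to `rownames(tab)[nrow(tab)]` inside the function.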