Calculating AIC for a fixed effects logit model from the bife package in R

I would like to ask how to calculate information criteria such as AIC for the fixed effects logit model from the
bife package.
The basic summary output does NOT include AIC. However, looking at: Goodness-of-fit for fixed effect logit model using 'bife' package
the AIC criterion was computed there, yet my summary output includes neither AIC nor the log-likelihood.
library(bife)
dta <- bife::psid
mod_logit <- bife(LFP ~ AGE + I(INCH / 1000) + KID1 + KID2 + KID3 | ID, data = dta, bias_corr = "ana")
summary(mod_logit)

If you check the bife source code, AIC was computed in earlier versions, at least up to version 0.5. You might be using the current version 0.6, in which AIC is no longer included.
If you do not mind using the older version, try the following:
Remove the current version from your library.
Download version 0.5 from the CRAN archive: https://cran.r-project.org/src/contrib/Archive/bife/
Install it from the local file: install.packages("D:\\bife_0.5.tar.gz", repos = NULL, type = "source"), assuming it is stored on the D: drive.
Or:
require(devtools)
install_version("bife", version = "0.5", repos = "http://cran.us.r-project.org")
If it installs successfully, run the following to get output with AIC included:
library(bife)
dta = bife::psid
mod_logit <- bife(LFP ~ AGE + I(INCH / 1000) + KID1 + KID2 + KID3 | ID, data = dta, bias_corr = "ana")
summary(mod_logit)
#> ---------------------------------------------------------------
#> Fixed effects logit model
#> with analytical bias-correction
#>
#> Estimated model:
#> LFP ~ AGE + I(INCH/1000) + KID1 + KID2 + KID3 | ID
#>
#> Log-Likelihood= -3045.505
#> n= 13149, number of events= 9516
#> Demeaning converged after 5 iteration(s)
#> Offset converged after 3 iteration(s)
#>
#> Corrected structural parameter(s):
#>
#> Estimate Std. error t-value Pr(> t)
#> AGE 0.033945 0.012990 2.613 0.00898 **
#> I(INCH/1000) -0.007630 0.001993 -3.829 0.00013 ***
#> KID1 -1.052985 0.096186 -10.947 < 2e-16 ***
#> KID2 -0.509178 0.084510 -6.025 1.74e-09 ***
#> KID3 -0.010562 0.060413 -0.175 0.86121
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> AIC= 9023.011 , BIC= 19994.7
#>
#>
#> Average individual fixed effects= 0.0122
#> ---------------------------------------------------------------
Created on 2020-01-09 by the reprex package (v0.3.0)
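If you would rather stay on the current version, you can also compute the criteria by hand from the log-likelihood, counting the structural parameters plus one fixed effect per individual. This is only a sketch, not part of the bife API: it assumes your version exposes the log-likelihood via logLik() (if it does not, pull the log-likelihood out of the fitted object directly).
ll <- as.numeric(logLik(mod_logit))                     # log-likelihood of the fit
k  <- length(coef(mod_logit)) + length(unique(dta$ID))  # 5 structural parameters + 1461 fixed effects
aic <- -2 * ll + 2 * k               # should reproduce the AIC printed by version 0.5
bic <- -2 * ll + log(nrow(dta)) * k  # should reproduce the printed BIC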

Related

Regression and Prediction using R

I want to perform the following task using the fastfood dataset from the openintro package in R.
a) Create a regression predicting whether or not a restaurant is McDonalds or Subway
based on calories, sodium, and protein (McDonalds should be 1, Subway 0).
Save the coefficients to Q2.
b) Use data from only restaurants with between 50 and 60 items in the
data set. Predict total fat from cholesterol, total carbs, vitamin A, and restaurant.
Remove any non-significant predictors and run again.
Assign the strongest standardized regression coefficient to Q5.
Here's my code.
library(tidyverse)
library(openintro)
library(lm.beta)
fastfood <- openintro::fastfood
head(fastfood)
#Solving for part (a)
fit_1 <- lm(I(restaurant %in% c("Subway", "Mcdonalds")) ~ calories + sodium + protein, data = fastfood)
Q2 <- round(summary(fit_1)$coefficients,2)
#Solving for part (b)
newdata <- fastfood[ which(fastfood$item>=50 & fastfood$item <= 60), ]
df = sort(sample(nrow(newdata), nrow(data)*.7))
newdata_train<-data[df,]
newdata_test<-data[-df,]
fit_5 <- lm(I(total_fat) ~ cholesterol + total_carb + vit_a + restaurant, data = newdata)
prediction_5 <- predict(fit_5, newdata = newdata_test)
Q5 <- lm.beta(fit_5)
But I'm not getting the desired results.
The first question requires logistic regression rather than linear regression, since the aim is to predict a binary outcome. The most sensible way to do this is, as the question suggests, to remove all the restaurants except McDonald's and Subway, then create a new binary variable to mark which rows are McDonald's and which aren't:
library(dplyr)
fastfood <- openintro::fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Subway")) %>%
  mutate(is_mcdonalds = restaurant == "Mcdonalds")
The logistic regression is done like this:
fit_1 <- glm(is_mcdonalds ~ calories + sodium + protein,
             family = "binomial", data = fastfood)
And your coefficients are obtained like this:
Q2 <- round(coef(fit_1), 2)
Q2
#> (Intercept) calories sodium protein
#> -1.24 0.00 0.00 0.06
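As a side note (not required by the assignment), these coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read:
# Odds ratios: multiplicative change in the odds of being McDonald's
# per one-unit increase in each predictor
round(exp(coef(fit_1)), 2)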
The second question requires that you filter out any restaurants with more than 60 or fewer than 50 items:
fastfood <- openintro::fastfood %>%
  group_by(restaurant) %>%
  filter(n() >= 50 & n() <= 60)
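As a quick optional check that the filter behaved as intended, count the rows per remaining restaurant; each count should fall between 50 and 60:
# Rows per restaurant after filtering
fastfood %>% count(restaurant)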
We now fit the described regression and examine it to look for non-significant regressors:
fit_2 <- lm(total_fat ~ cholesterol + vit_a + total_carb + restaurant,
            data = fastfood)
summary(fit_2)
#>
#> Call:
#> lm(formula = total_fat ~ cholesterol + vit_a + total_carb + restaurant,
#> data = fastfood)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -24.8280 -2.9417 0.9397 5.1450 21.0494
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.20102 2.08029 -0.577 0.564751
#> cholesterol 0.26932 0.01129 23.853 < 2e-16 ***
#> vit_a 0.01159 0.01655 0.701 0.484895
#> total_carb 0.16327 0.03317 4.922 2.64e-06 ***
#> restaurantMcdonalds -4.90272 1.94071 -2.526 0.012778 *
#> restaurantSonic 6.43353 1.89014 3.404 0.000894 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.611 on 125 degrees of freedom
#> (34 observations deleted due to missingness)
#> Multiple R-squared: 0.8776, Adjusted R-squared: 0.8727
#> F-statistic: 179.2 on 5 and 125 DF, p-value: < 2.2e-16
We note that vit_a is non-significant and drop it from our model:
fit_3 <- update(fit_2, . ~ . - vit_a)
Now we get the standardized coefficients and round them:
coefs <- round(coef(lm.beta::lm.beta(fit_3)), 2)
and Q5 will be the maximum value of these coefficients:
Q5 <- coefs[which.max(coefs)]
Q5
#> cholesterol
#> 0.82
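One caveat, as a hedged aside: which.max(coefs) picks the largest signed coefficient. If "strongest" is read as largest in absolute value, you could instead drop the intercept and compare magnitudes:
# Alternative reading of "strongest": largest absolute standardized coefficient
coefs_noint <- coefs[names(coefs) != "(Intercept)"]
Q5_abs <- coefs_noint[which.max(abs(coefs_noint))]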
Created on 2022-02-26 by the reprex package (v2.0.1)

R Fixest Package: IV Estimation Without Further Exogenous Variables

I intend to run instrumental variable regressions with fixed effects using the fixest package's feols function. However, I am having issues with the syntax specifying an estimation without further exogenous controls.
Consider the following example:
# Load package
require("fixest")
# Load data
df <- airquality
I would like to do something like the following, i.e. explaining the outcome via the instrumented endogenous variable and fixed effects:
feols(Temp | Month + Day | Ozone ~ Wind, df)
This, however, produces an error:
The dependent variable is a constant. Estimation cannot be done.
It only works, when I add further exogenous covariates (as in the documentation's examples):
feols(Temp ~ Solar.R | Month + Day | Ozone ~ Wind, df)
How do I fix this? How do I run the estimation without further controls, such as Solar.R in this case?
Note: I post this on Stack Overflow rather than Cross Validated because the question relates to a coding syntax issue, and not to the econometric techniques underlying the estimations.
Actually, there seems to be a misunderstanding about how to write the formula.
The syntax is: Dep_var ~ Exo_vars | Fixed-effects | Endo_vars ~ Instruments.
The Fixed-effects and Endo_vars ~ Instruments parts are optional. On the other hand, the Exo_vars part must always be present, even if it contains only the intercept.
Knowing that, the following works:
base = iris
names(base) = c("y", "x1", "x_endo", "x_inst", "fe")
feols(y ~ 1 | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Standard-errors: Standard
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.345900 0.08096 53.679 < 2.2e-16 ***
#> fit_x_endo 0.398477 0.01964 20.289 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.404769 Adj. R2: 0.757834
#> F-test (1st stage): stat = 1,882.45 , p < 2.2e-16 , on 1 and 148 DoF.
#> Wu-Hausman: stat = 3.9663, p = 0.048272, on 1 and 147 DoF.
# Same with fixed-effect
feols(y ~ 1 | fe | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Fixed-effects: fe: 3
#> Standard-errors: Clustered (fe)
#> Estimate Std. Error t value Pr(>|t|)
#> fit_x_endo 0.900061 0.117798 7.6407 0.016701 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.333489 Adj. R2: 0.833363
#> Within R2: 0.57177
#> F-test (1st stage): stat = 44.77 , p = 4.409e-10, on 1 and 146 DoF.
#> Wu-Hausman: stat = 0.001472, p = 0.969447 , on 1 and 145 DoF.
Getting back to the original example:
feols(Temp | Month + Day | Ozone ~ Wind, df) means that the dependent variable will be Temp | Month + Day | Ozone, with the pipe here meaning logical OR, which evaluates to 1 for all observations. Hence the error message about the dependent variable being a constant.
To fix it and obtain an appropriate behavior, use feols(Temp ~ 1 | Month + Day | Ozone ~ Wind, df).
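For completeness, here is the corrected call applied to the original data, as a sketch; recent fixest versions also let you inspect the first stage through summary()'s stage argument:
library(fixest)
df <- airquality
# Intercept-only exogenous part, then fixed effects, then endo ~ instrument
est <- feols(Temp ~ 1 | Month + Day | Ozone ~ Wind, df)
summary(est, stage = 1)  # first-stage regression of Ozone on Wind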

Log-odds formula

I am interested in calculating the log-odds of the relationship between a continuous predictor and a dichotomous outcome, in order to graphically evaluate the linearity assumption for a logistic regression model. Does anyone know a formula for this? My key issue is that I am unsure how to calculate an event rate for each level of the continuous predictor (i.e. number with the outcome / total observations at that level).
Thank you!
Let's simulate some data to show how this can be done.
Imagine we are testing a new electrical product, and we test at a variety of temperatures to see whether temperature affects failure rate.
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
                 failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
So we have two columns: the temperature, and a dichotomous column of 1 (failed) and 0 (passed).
We can get a rough measure of the relationship between temperature and failure rate just by cutting our data frame into 5 degree bins:
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)
We can now plot the proportion of devices that failed within each 5 degree temperature band:
library(ggplot2)
ggplot(df, aes(x = temp_range, y = failed)) + stat_summary()
#> No summary function supplied, defaulting to `mean_se()`
We can see that the probability of failure appears to go up linearly with temperature.
Now, if we get the proportion of failures in each bin, we can take these as estimates of the probability of failure. This allows us to calculate the log odds of failure within each bin:
counts <- table(df$temp_range, df$failed)
probs <- counts[,2]/rowSums(counts)
logodds <- log(probs/(1 - probs))
temp_range <- seq(2.5, 97.5, 5)
logit_df <- data.frame(temp_range, probs, logodds)
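Equivalently, the same per-bin event rates and log odds can be computed in a single dplyr pipeline (a sketch using the same df as above):
library(dplyr)
logit_df2 <- df %>%
  group_by(temp_range) %>%
  summarise(probs = mean(failed),                # event rate within the bin
            logodds = log(probs / (1 - probs)),  # log odds from that rate
            temp_mid = mean(temperature))        # bin midpoint for plotting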
So now, we can plot the log odds. Here, we will make our x axis continuous by taking the mid point of each bin as the x co-ordinate. We can then draw a linear regression through our points:
p <- ggplot(logit_df, aes(temp_range, logodds)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "black", linetype = 2, se = FALSE)
p
#> `geom_smooth()` using formula 'y ~ x'
and in fact carry out a linear regression:
summary(lm(logodds ~ temp_range))
#>
#> Call:
#> lm(formula = logodds ~ temp_range)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.70596 -0.20764 -0.06761 0.18100 1.31147
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.160639 0.207276 -10.42 4.70e-09 ***
#> temp_range 0.046025 0.003591 12.82 1.74e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.463 on 18 degrees of freedom
#> Multiple R-squared: 0.9012, Adjusted R-squared: 0.8957
#> F-statistic: 164.2 on 1 and 18 DF, p-value: 1.738e-10
We can see that the linear assumption is reasonable here.
What we have just done is like a crude form of logistic regression. Let's now do it properly:
model <- glm(failed ~ temperature, data = df, family = binomial())
summary(model)
#>
#> Call:
#> glm(formula = failed ~ temperature, family = binomial(), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1854 -0.8514 0.4672 0.8518 2.0430
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.006197 0.159997 -12.54 <2e-16 ***
#> temperature 0.043064 0.002938 14.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1383.4 on 999 degrees of freedom
#> Residual deviance: 1096.0 on 998 degrees of freedom
#> AIC: 1100
#>
#> Number of Fisher Scoring iterations: 3
Notice how close the coefficients are to our hand-crafted model.
Now that we have this model, we can plot its predictions over our crude linear estimate:
mod_df <- data.frame(temp_range = 1:100,
                     logodds = predict(model, newdata = list(temperature = 1:100)))
p + geom_line(data = mod_df, colour = "red", linetype = 3, size = 2)
#> `geom_smooth()` using formula 'y ~ x'
Pretty close.
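If you prefer to read the fitted model on the probability scale rather than in log odds, predict() with type = "response" applies the inverse logit for you:
# Predicted probability of failure at a few temperatures
predict(model, newdata = data.frame(temperature = c(10, 50, 90)),
        type = "response")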
Created on 2020-06-19 by the reprex package (v0.3.0)

How does R lm choose contrasts with interaction between a categorical and continuous variables?

If I run lm with a formula like Y ~ X1 + X2:X1 + X3:X1, where X1 is continuous and X2, X3 are categorical, I get a contrast for both levels of X2, but not X3.
The pattern is that the first categorical interaction gets both levels but the second does not.
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
df <- data.frame(Frivolousness = sample(1:100, 50, replace = TRUE))
df %<>% mutate(
  Personality = sample(c("Bad", "Good"), 50, replace = TRUE),
  Timing = ifelse(Frivolousness %% 2 == 0 & runif(50) > 0.2, "Early", "Late")
)
df %<>% mutate(
  Enchantedness = 11 +
    ifelse(Personality == "Good", 0.23, -0.052) * Frivolousness -
    1.3 * ifelse(Personality == "Good", 1, 0) +
    10 * rnorm(50)
)
df %<>% mutate(
  Personality = factor(Personality, levels = c("Bad", "Good")),
  Timing = factor(Timing, levels = c("Early", "Late"))
)
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Timing:Frivolousness +
#> Personality:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate TimingEarly:Frivolousness
#> -1.41757 -0.05796
#> TimingLate:Frivolousness PersonalityGood:Frivolousness
#> -0.07433 0.33410
lm(Enchantedness ~ Personality + Timing + Personality:Frivolousness+ Timing:Frivolousness , df)
#>
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Personality:Frivolousness +
#> Timing:Frivolousness, data = df)
#>
#> Coefficients:
#> (Intercept) PersonalityGood
#> 15.64118 -10.99518
#> TimingLate PersonalityBad:Frivolousness
#> -1.41757 -0.05796
#> PersonalityGood:Frivolousness TimingLate:Frivolousness
#> 0.27614 -0.01636
Created on 2020-02-15 by the reprex package (v0.3.0)
I think the reason it is dropped is that there would be perfect collinearity if it were included. You really should have Frivolousness as a regressor on its own as well. Then you will see that R provides the result for just one level of both interactions.
You get this kind of weird behavior because you are missing the main term, Frivolousness. If you do:
set.seed(111)
## run your data frame stuff
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.74223 5.31189
TimingLate TimingEarly:Frivolousness
12.47243 0.19090
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.09496 0.17383
lm(Enchantedness ~ Personality + Timing + Frivolousness + Timing:Frivolousness + Personality:Frivolousness, df)
Coefficients:
(Intercept) PersonalityGood
-1.7422 5.3119
TimingLate Frivolousness
12.4724 0.1909
TimingLate:Frivolousness PersonalityGood:Frivolousness
-0.2859 0.1738
In your model, the interaction term TimingLate:Frivolousness means the change in the slope of Frivolousness when Timing is Late. Since the main effect of Frivolousness is not estimated in the first model, R estimates the slope for TimingEarly (the reference level) instead. Hence you can see that the coefficient for TimingEarly:Frivolousness in the first model and for Frivolousness in the second are the same.
As you can see, the TimingLate:Frivolousness coefficients are very different between the two models. In your case, I think it doesn't make sense to fit only the interaction term without the main effect, because it is hard to interpret or model.
You can roughly check what is the slope for different groups of timing and the model with all terms gives a good estimate:
library(broom)
df %>% group_by(Timing) %>% do(tidy(lm(Enchantedness ~ Frivolousness, data = .)))
# A tibble: 4 x 6
# Groups: Timing [2]
Timing term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 Early (Intercept) 6.13 6.29 0.975 0.341
2 Early Frivolousness 0.208 0.0932 2.23 0.0366
3 Late (Intercept) 11.5 5.35 2.14 0.0419
4 Late Frivolousness -0.00944 0.107 -0.0882 0.930
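To see the collinearity point concretely (a sketch using the simulated df above): the two Timing interaction columns of the design matrix sum exactly to the Frivolousness column, which is why R must drop one of them once the main effect is present:
# The two interaction columns add up to Frivolousness itself --
# the perfect collinearity that forces R to drop a level
mm <- model.matrix(~ Timing:Frivolousness, df)
all.equal(unname(mm[, "TimingEarly:Frivolousness"] +
                 mm[, "TimingLate:Frivolousness"]),
          as.numeric(df$Frivolousness))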

Reorder x axis using plot_model() from sjPlot

I have run a binomial logistic regression model in R using the lme4 package. Now, I want to plot the estimated marginal means for the model, so I have installed the sjPlot package and I have used the plot_model() function.
My x axis includes three categories corresponding to three different groups: "L1", "HS", and "L2". I want to have the three categories in that precise order. However, when I plot the model, I get "HS" before "L1", because the labels appear in alphabetical order. I would like to change the order of those two labels, and I know how to do that in a data frame, but not when plotting a model with this function. Any ideas on how to reorder my x axis using the sjPlot package?
You can change the order of the coefficients using the order.terms argument. Note that the numbers for this argument correspond to the positions of the terms in the summary output. Example:
library(sjPlot)
library(sjlabelled)
data(efc)
efc <- as_factor(efc, c161sex, e42dep, c172code)
m <- lm(neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code, data = efc)
plot_model(m, auto.label = F)
summary(m)
#>
#> Call:
#> lm(formula = neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code,
#> data = efc)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.5411 -2.0797 -0.5183 1.3256 19.1412
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 17.65938 0.82864 21.311 < 2e-16 ***
#> pos_v_4 -0.66552 0.05163 -12.890 < 2e-16 ***
#> c12hour 0.01134 0.00270 4.201 2.95e-05 ***
#> e42dep2 0.84189 0.47605 1.768 0.077355 .
#> e42dep3 1.73616 0.47118 3.685 0.000244 ***
#> e42dep4 3.10107 0.50470 6.144 1.26e-09 ***
#> c172code2 0.12894 0.28832 0.447 0.654844
#> c172code3 0.69876 0.36649 1.907 0.056922 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.27 on 810 degrees of freedom
#> (90 observations deleted due to missingness)
#> Multiple R-squared: 0.2981, Adjusted R-squared: 0.292
#> F-statistic: 49.15 on 7 and 810 DF, p-value: < 2.2e-16
# according to summary, order of coefficients:
# 1=pos_v_4, 2=c12hour, 3=e42dep2, 4=e42dep3, ...
plot_model(m, auto.label = F, order.terms = c(1,2,4,5,3,6,7))
Created on 2019-05-08 by the reprex package (v0.2.1)
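Applied to the question's setting, a hedged sketch (mydata and group here are hypothetical stand-ins, since the original data and model are not shown): setting the factor levels before fitting also makes the terms, and hence the plot, come out in the desired L1, HS, L2 order:
# Hypothetical: relevel the grouping factor before fitting the model
mydata$group <- factor(mydata$group, levels = c("L1", "HS", "L2"))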
