Regression in first differences with packages `plm` and `broom`

I'm estimating a linear regression in first-differences straight out of the textbook:
James H. Stock and Mark W. Watson, Introduction to Econometrics, Pearson, 4th Edition.
Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer, Introduction to Econometrics with R, https://www.econometrics-with-r.org/10-rwpd.html
Data from package AER, https://cran.r-project.org/web/packages/AER/AER.pdf
I'm estimating the model in first differences with the plm() function from the plm package and extracting residuals with the augment() function from the broom package. I'm getting an error message and suspect I may not be using the "fd" option correctly and/or may be misusing augment(). A similar attempt with model="pooling" works fine. Help appreciated!
library(AER)
data(Fatalities)
Fatalities$fatality <- Fatalities$fatal / Fatalities$pop * 10000
library(plm)
library(broom)
plm.pool <- plm(fatality ~ beertax, data=Fatalities, model="pooling")
tidy(plm.pool) # ok
augment(plm.pool) # ok
plm.fd <- plm(fatality ~ beertax, data=Fatalities,
index=c("state", "year"),
model="fd")
tidy(plm.fd) # looks ok
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.00314 0.0119 -0.263 0.792
2 beertax 0.0137 0.285 0.0480 0.962
augment(plm.fd) # not ok
Error in `$<-.data.frame`(`*tmp*`, ".resid", value = c(`2` = 0.219840293582125, :
replacement has 288 rows, data has 336
In addition: Warning message:
In get(.Generic)(e1, e2) :
longer object length is not a multiple of shorter object length
EDIT: A WORKAROUND
So I suspect the problem has something to do with the fact that the model returned by plm and the residuals do not have the same number of rows:
length(row.names(plm.fd$model)) is 336
length(names(plm.fd$residuals)) is 288.
Can someone tell me if the following is the correct way to get residuals and fitted values from the first-difference estimation?
data.frame(".rownames" = row.names(plm.fd$model), plm.fd$model) %>%
left_join(data.frame(".rownames" = names(resid(plm.fd)),
".fitted" = fitted(plm.fd),
".resid" = resid(plm.fd)
)) -> Fatalities.augmented
head(Fatalities.augmented)
.rownames fatality beertax .fitted .resid
1 1 2.12836 1.539379 NA NA
2 2 2.34848 1.788991 0.0034166261 0.219840294
3 3 2.33643 1.714286 -0.0010225479 -0.007890716
4 4 2.19348 1.652542 -0.0008451287 -0.138968054
5 5 2.66914 1.609907 -0.0005835833 0.479380363
6 6 2.71859 1.560000 -0.0006831177 0.053269973
References:
https://cran.r-project.org/web/packages/plm/plm.pdf
https://cran.r-project.org/web/packages/broom/broom.pdf
Reference for Edit:
Using broom::augment Panel data models

This is due to broom::augment_columns() not special-casing the first-difference (FD) panel model: the function assumes the residuals of the FD model have the same length as the predicted values.
More concretely, this line is the culprit: ret$.resid <- residuals0(x) (https://github.com/tidymodels/broom/blob/069c21e903174fcf5d491091b7c347a9fdcd2999/R/utilities.R#L256)
FD models compress the data, so the number of residuals is lower than the number of input observations. You can see this in the summary output:
summary(panel3) # FD model
Oneway (individual) effect First-Difference Model
[...]
Balanced Panel: n = 90, T = 7, N = 630
Observations used in estimation: 540
[...]
While 630 observations enter the model, only 540 transformed observations remain after the FD transformation, because one observation per group (the individual dimension) is lost: 630 - 90 = 540. The same arithmetic explains the error in the question: the Fatalities panel has 48 states observed over 7 years (336 observations), so 336 - 48 = 288 residuals.
broom::augment_columns() wants to put the predicted values (630) and the residuals (540) into the same data frame; this is bound to fail. If the broom developers wanted to support FD models, they could pad the residuals with NA (e.g., set the first row of each individual to NA).
My suggestion is to make the developers/maintainer of broom aware of this issue (and maybe of this post). plm's FD panel models can be identified via plm_object$args$model == "fd".
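For reference, a minimal base-R sketch of that NA-padding idea, applied to the question's plm.fd model (the helper name pad_fd_residuals is made up for illustration):
# Rows dropped by the FD transformation get NA in .fitted and .resid
pad_fd_residuals <- function(mod) {
  frame <- data.frame(.rownames = row.names(mod$model), mod$model)
  res <- data.frame(.rownames = names(resid(mod)),
                    .fitted = as.numeric(fitted(mod)),
                    .resid = as.numeric(resid(mod)))
  merge(frame, res, by = ".rownames", all.x = TRUE, sort = FALSE)
}
head(pad_fd_residuals(plm.fd))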

Related

Using emmeans with brms

I regularly use emmeans to calculate custom contrasts across a wide range of statistical models. One of its strengths is its versatility: it is compatible with a huge range of packages. I have recently discovered that emmeans is compatible with the brms package, but am having trouble getting it to work. I will conduct an example multinomial logistic regression analysis using a dataset provided here. I will also conduct the same analysis in another package (nnet) to demonstrate what I need.
library(brms)
library(nnet)
library(emmeans)
library(foreign) # needed for read.dta()
# read in data
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
The data set contains variables on 200 students. The outcome variable is prog, program type, a three-level categorical variable (general, academic, vocation). The predictor variable is socioeconomic status, ses, a three-level categorical variable. Now to conduct the analysis via the nnet package:
# first relevel so 'academic' is the reference level
ml$prog2 <- relevel(ml$prog, ref = "academic")
# run test in nnet
test_nnet <- multinom(prog2 ~ ses,
data = ml)
Now run the same test in brms
# run test in brms (note: will take 30 - 60 seconds)
test_brm <- brm(prog2 ~ ses,
data = ml,
family = "categorical")
I will not print the output of the two models, but the coefficients are roughly equivalent in both.
Now to create an emmeans object that will allow us to conduct pairwise tests:
# pass into emmeans
rg_nnet <- ref_grid(test_nnet)
em_nnet <- emmeans(rg_nnet,
specs = ~prog2|ses)
# regrid to get coefficients as logit
em_nnet_logit <- regrid(em_nnet,
transform = "logit")
em_nnet_logit
# output
# ses = low:
# prog2 prob SE df lower.CL upper.CL
# academic -0.388 0.297 6 -1.115 0.3395
# general -0.661 0.308 6 -1.415 0.0918
# vocation -1.070 0.335 6 -1.889 -0.2519
#
# ses = middle:
# prog2 prob SE df lower.CL upper.CL
# academic -0.148 0.206 6 -0.651 0.3558
# general -1.322 0.252 6 -1.938 -0.7060
# vocation -0.725 0.219 6 -1.260 -0.1895
#
# ses = high:
# prog2 prob SE df lower.CL upper.CL
# academic 0.965 0.294 6 0.246 1.6839
# general -1.695 0.363 6 -2.582 -0.8072
# vocation -1.986 0.403 6 -2.972 -0.9997
#
# Results are given on the logit (not the response) scale.
# Confidence level used: 0.95
So now we have our lovely emmeans() object that we can use to perform a vast array of different comparisons.
However, when I try to do the same thing with the brms object, I don't even get past the first step of converting the brms object into a reference grid before I get an error message
# do the same for brm
rg_brm <- ref_grid(test_brm)
Error : The select parameter is not predicted by a linear formula. Use the 'dpar' and 'nlpar' arguments to select the parameter for which marginal means should be computed.
Predicted distributional parameters are: 'mugeneral', 'muvocation'
Predicted non-linear parameters are: ''
Error in ref_grid(test_brm) :
Perhaps a 'data' or 'params' argument is needed
Obviously, and unsurprisingly, there are some steps I am not aware of to get the Bayesian software to play nicely with emmeans. Clearly there are some extra parameters I need to specify at some stage of the process but I'm not sure if these need to be specified in brms or in emmeans. I've searched around the web but am having trouble finding a simple but thorough guide.
Can anyone who knows how, help me to get the brms model into an emmeans object?
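For what it's worth, here is a hedged sketch of what the error message seems to be asking for: with family = "categorical", each non-reference outcome category has its own distributional parameter (here 'mugeneral' and 'muvocation'), and the dpar argument tells ref_grid()/emmeans() which one to marginalize over. Treat this as a starting point rather than a verified recipe:
# One reference grid per distributional parameter (assumption: dpar is
# accepted here, as the error message suggests)
rg_brm_general <- ref_grid(test_brm, dpar = "mugeneral")
em_brm_general <- emmeans(rg_brm_general, specs = ~ses)
# repeat with dpar = "muvocation" for the other non-reference category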

How to compute marginal effects of a multinomial logit model created with the nnet package?

I have a multinomial logit model created with the nnet R package, using the multinom command. The dependent variable has three categories/choice options. I am modelling the probability of selecting a certain irrigation type (no irrigation, surface irrigation, drip irrigation) based on farmer characteristics.
I would like to estimate marginal effects, i.e. by how much does the probability of selecting irrigation type Y change when I increase independent variable X by one unit? I have tried doing this with the margins package (marginal_effects), but this gives only 1 value per observation in the dataset. I was expecting three values, since I want the marginal effect for each of the three irrigation types.
Does someone know if there is a better R package to use for this? Or whether I am doing something wrong with the margins packages? Thank you.
You can use the marginaleffects package to do that (disclaimer: I am the maintainer). Please note the warning.
library(nnet)
library(marginaleffects)
mod <- multinom(factor(cyl) ~ hp + mpg, data = mtcars, trace = FALSE)
mfx <- marginaleffects(mod, type = "probs")
## Warning in sanity_model_specific.multinom(model, ...): The standard errors
## estimated by `marginaleffects` do not match those produced by Stata for
## `nnet::multinom` models. Please be very careful when interpreting the results.
summary(mfx)
## Average marginal effects
## type Group Term Effect Std. Error z value Pr(>|z|) 2.5 %
## 1 probs 6 hp 2.792e-04 0.000e+00 Inf < 2.22e-16 2.792e-04
## 2 probs 6 mpg -1.334e-03 0.000e+00 -Inf < 2.22e-16 -1.334e-03
## 3 probs 8 hp 2.396e-05 1.042e-126 2.298e+121 < 2.22e-16 2.396e-05
## 4 probs 8 mpg -2.180e-04 1.481e-125 -1.472e+121 < 2.22e-16 -2.180e-04
## 97.5 %
## 1 2.792e-04
## 2 -1.334e-03
## 3 2.396e-05
## 4 -2.180e-04
##
## Model type: multinom
## Prediction type: probs
The marginaleffects package should work in theory, but my example fails to run because of size restrictions (meaning I don't have enough RAM for the 1.5 GB vector it tries to allocate). It's not even that large a dataset, which is odd.
If you use marginal_effects() (from the margins package) on multinomial models, it only displays the output for a default category. You have to manually set each category you want to see, clean up the output with broom, and then combine the pieces some other way. It's clunky, but it can work.
marginal_effects(model, category = 'cat1')
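A rough sketch of that manual loop (the category labels are hypothetical placeholders for the three irrigation types, and `model` is the fitted multinom object):
library(margins)
library(dplyr)
cats <- c("none", "surface", "drip")            # hypothetical labels
ame_by_cat <- lapply(cats, function(ct) {
  me <- marginal_effects(model, category = ct)  # per-observation effects
  data.frame(category = ct, t(colMeans(me)))    # average marginal effects
})
bind_rows(ame_by_cat)                           # one combined table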

R function to get R squared and p value for the regression

First off, I'll start by saying that I do not know how to use R, but I need to do reduced major axis regression for the two variables below, and the lmodel2 function can do this. I learnt the code below to obtain the intercept and slope of the regression equation. However, I don't get an R squared or a p-value for the regression. How would I obtain those? The code I have used so far is below.
Tensor_force=c(1.72,1.48,1.37,0.81,0.75,0.96,0.96,0.78,0.54,0.67,0.75,0.66,0.4)
Stapedius_force=c(0.8,0.58,1.07,0.82,0.77,0.98,0.99,0.98,0.92,1.06,1.19,1.32,1.18)
library(lmodel2)
lmodel2(Stapedius_force ~ Tensor_force,,"relative", "relative",0)
library(lmodel2)
mod1 <- lmodel2(Stapedius_force ~ Tensor_force,,"relative", "relative",0)
#> No permutation test will be performed
The lmodel2 object created by the lmodel2() function is basically a list that is printed in a reader-friendly format. You can access individual values by subsetting that list. If you type mod1$ and look at the autocomplete options, you'll see the names of the available elements.
rsquare and P.param are the names you are looking for:
mod1$rsquare
#> [1] 0.2905577
mod1$P.param # 2 tailed
#> [1] 0.0573178
mod1$P.param/2 # 1 tailed
#> [1] 0.0286589
If you are working with several different iterations of the same model, the broom package and its glance() function are very useful, as glance() extracts the model quality measurements as a data.frame/tibble so you can easily conduct further analysis on these values. See https://broom.tidymodels.org/ to learn more.
library(broom)
glance(mod1)
#> # A tibble: 1 x 5
#> r.squared theta p.value H nobs
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0.291 28.2 0.0573 0.0986 13
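As a hedged illustration of that workflow, you could fit a second variant of the model and stack the glance() rows (the "interval" arguments are just one plausible alternative listed in ?lmodel2):
library(purrr)
mod2 <- lmodel2(Stapedius_force ~ Tensor_force, , "interval", "interval", 0)
map_dfr(list(relative = mod1, interval = mod2), glance, .id = "model")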

randomForest in R: can fit model and use it for predictions without error, but tuneRF gives diff length error

Just messing around with UCI heart disease data: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data. Data is of the format:
A tibble: 6 x 14
age sex cp trestbps chol fbs restecg thalach exang oldpeak
<dbl> <dbl> <dbl> <int> <int> <dbl> <int> <int> <int> <dbl>
1 63 1 3 145 233 1 0 150 0 2.3
2 41 0 1 130 204 0 0 172 0 1.4
Growing/fitting the tree on the training set works great, as does using it for predictions on the test set. However, tuneRF gives the error:
Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
length of response must be the same as predictors
It's R 3.5.0 and randomForest 4.6-14.
Some notes you'll see in the code:
1) the tuneRF command is using subsets of the same dataset, so the class labels are the same
2) the "target" response variable has been converted to factor before training/test partitioning
I have a feeling it is related to the way I am subsetting, that the results are lists instead of dataframes, maybe? But I used the same approach for the earlier steps without error. I found an SO question regarding this before, but can't find it in my history/google now. Even if I could find it, I don't understand how it applies, since I used the same method of subsetting before without any problem.
Script:
library(tidyverse)
library(randomForest)
I've added the Hungarian data after imputing the missing values (I don't want to use the response for imputation) by running:
hungar_heart <- cbind(impute(hungar_heart[,-14]),hungar_heart[,14])
I then add colnames to hungar_heart and add it to cleveland data:
hungar_heart<-setNames(hungar_heart, c("age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","target"))
heart_total<-rbind(heart_data,hungar_heart)
heart_total$target <- as.factor(heart_total$target)
#Partition new combined dataset into training and test sets after setting seed (123)
set.seed(123)
indicator <- sample(2, nrow(heart_total), replace = TRUE, prob = c(.7,.3))
train <- heart_total[indicator==1,]
test <- heart_total[indicator==2,]
#Fit random forest to training set, using default values to start.
forest <- randomForest(target~., data=train)
#Use trained model on test set
predict_try <- predict(forest, test)
#so far so good. now tuneRF gives error:
tune_RF <- tuneRF(train[,-14],train[,14],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 300,
improve = 0.05)
Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
length of response must be the same as predictors
In addition: Warning message:
In randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
The response has five or fewer unique values. Are you sure you want to do regression?
#FWIW, length:
length(train[,-14])
[1] 13
length(train[,14])
[1] 1
I think it's probably just some uniqueness I didn't expect from my subsetting method.
Thanks
Great - figured this out thanks to some help.
I should have explicitly included in my OP that I was using dplyr.
Turns out, although randomForest() and predict() on that random forest work fine on tibbles, tuneRF() (or maybe tuneRF() after the way I subsetted) expects a data frame and will throw an error otherwise. With a tibble, train[,14] returns a one-column tibble rather than a vector, which is why length(train[,14]) is 1 and why tuneRF complains that the response length doesn't match the predictors.
Very simple fix:
train <- as.data.frame(train)
before the tuneRF line.
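Alternatively, a sketch of the same fix from the other direction: pass the response as a plain vector, since with a tibble train[,14] stays a one-column tibble while train[[14]] gives the underlying factor:
# predictors as a data frame, response as a vector
tune_RF <- tuneRF(as.data.frame(train[,-14]), train[[14]],
                  stepFactor = 0.5,
                  plot = TRUE,
                  ntreeTry = 300,
                  improve = 0.05)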

How to properly set contrasts in R

I have been asked to see if there is a linear trend in 3 groups of data (5 points each) by using ANOVA and linear contrasts. The 3 groups represent data collected in 2010, 2011 and 2012. I want to use R for this procedure and I have tried both of the following:
contrasts(data$groups, how.many=1) <- contr.poly(3)
contrasts(data$groups) <- contr.poly(3)
Both ways seem to work fine but give slightly different answers in terms of their p-values. I have no idea which is correct and it is really tricky to find help for this on the web. I would like help figuring out what is the reasoning behind the different answers. I'm not sure if it has something to do with partitioning sums of squares or whatnot.
Both approaches differ with respect to whether a quadratic polynomial is used.
For illustration purposes, have a look at this example, both x and y are a factor with three levels.
x <- y <- gl(3, 2)
# [1] 1 1 2 2 3 3
# Levels: 1 2 3
The approach without how.many creates a contrast matrix for a quadratic polynomial, i.e., with a linear (.L) and a quadratic trend (.Q). The 3 means: create the contrasts of the polynomial of degree 3 - 1 = 2.
contrasts(x) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L .Q
# 1 -7.071068e-01 0.4082483
# 2 -7.850462e-17 -0.8164966
# 3 7.071068e-01 0.4082483
# Levels: 1 2 3
In contrast, the variant with the argument how.many = 1 results in a polynomial of first order (i.e., a linear trend only): only 1 contrast is created.
contrasts(y, how.many = 1) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L
# 1 -7.071068e-01
# 2 -7.850462e-17
# 3 7.071068e-01
# Levels: 1 2 3
If you're interested in the linear trend only, the how.many = 1 variant seems more appropriate for you.
Changing the contrasts you ask for changes the degrees of freedom of the model. If one model requests linear and quadratic contrasts, and a second specifies only, say, the linear contrast, then the second model has an extra residual degree of freedom: this increases the power to test the linear hypothesis (at the cost of preventing the model from fitting the quadratic trend).
Using the full (nlevels - 1) set of contrasts creates an orthogonal set which explores the full set of (independent) response configurations. Cutting back to just one prevents the model from fitting one configuration (in this case the quadratic component, which our data in fact possess).
To see how this works, use the built-in dataset mtcars, and test the (confounded) relationship of gears to gallons. We'll hypothesize that the more gears the better (at least up to some point).
df = mtcars # copy the dataset
df$gear = as.ordered(df$gear) # make an ordered factor
Ordered factors default to polynomial contrasts, but we'll set them here to be explicit:
contrasts(df$gear) <- contr.poly(nlevels(df$gear))
Then we can model the relationship.
m1 = lm(mpg ~ gear, data = df);
summary.lm(m1)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.6733 0.9284 22.267 < 2e-16 ***
# gear.L 3.7288 1.7191 2.169 0.03842 *
# gear.Q -4.7275 1.4888 -3.175 0.00353 **
#
# Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
# F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
Note we have F(2,29) = 10.9 for the overall model and p=.038 for our linear effect with an estimated extra 3.7 mpg/gear.
Now let's only request the linear contrast, and run the "same" analysis.
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.317 1.034 20.612 <2e-16 ***
# gear.L 5.548 1.850 2.999 0.0054 **
# Multiple R-squared: 0.2307, Adjusted R-squared: 0.205
# F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401
The linear effect of gear is now bigger (5.5 mpg) and p << .05. A win? Except the overall model fit is now significantly worse: the variance accounted for is now just 23% (it was 43%)! Why becomes clear if we plot the relationship:
plot(mpg ~ gear, data = df) # view the relationship
So, if you're interested in the linear trend but also expect (or are unsure about) additional levels of complexity, you should also test these higher polynomials: the quadratic or, in general, trends up to nlevels - 1.
Note too that in this example the physical mechanism is confounded: we've forgotten that the number of gears is confounded with automatic vs. manual transmission, and also with weight, and sedan vs. sports car.
If someone wants to test the hypothesis that 4 gears is better than 3, they could answer this question :-)
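A hedged sketch of how one might take up that invitation with emmeans (used earlier on this page), treating gear as an unordered factor and ignoring the confounds just mentioned:
library(emmeans)
m_gear <- lm(mpg ~ factor(gear), data = mtcars)
pairs(emmeans(m_gear, ~ gear))  # includes the direct 4-vs-3 comparison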
