Standardized Betas for Panel Data in R

I have a panel dataset and I'm running a fixed effects regression. My dependent variable is CDS Spreads and I have 7 independent variables which are macroeconomic variables (GDP, Inflation etc) and then I have ratings data for three agencies which is the eighth independent variable, so I basically run three separate regressions for each rating agency:
plm(CDS ~ GDP+Inflation+...+S&PRating, data, model="within")
plm(CDS ~ GDP+Inflation+...+FTSE, data, model="within")
plm(CDS ~ GDP+Inflation+...+Moodys, data, model="within")
I want to compare the magnitude of the effect of the three agencies' ratings on CDS spreads, both against each other and against the rest of the independent variables, but the three agencies use different rating scales. I want to standardize the coefficients. How do I do that for panel data? "lm.beta" from the "QuantPsyc" package is not giving accurate results: it changes the signs of the coefficients, and an earlier post suggested that a z-transformation is not advisable for panel data. Can you please suggest a way to make a meaningful comparison of the results?
Thanks!

If you run three different regressions you cannot formally test differences in coefficients. It might be more informative to:
Standardize the scores (a z-score transformation could be a good start; here is some more info: https://datascience.stackexchange.com/questions/1240/methods-for-standardizing-normalizing-different-rank-scales)
Stack your data, so that each observation shows up three times. The score will be in the same variable.
Create 3 new variables, where you interact the score with a dummy indicator of the score agency. Your data will look like this:
rbind(
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 1, score_moody = 0, score_ftse = 0),
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 0, score_moody = 2, score_ftse = 0),
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 0, score_moody = 0, score_ftse = 3)
)
  iso3 year gdp score_sp score_moody score_ftse
1  USA 2001  13        1           0          0
2  USA 2001  13        0           2          0
3  USA 2001  13        0           0          3
Estimate your model:
plm(CDS ~ GDP+Inflation+...+ score_sp + score_moody + score_ftse, data, model="within")
Now you can simply compare coefficients with t-tests.
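The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy panel, its column names (sp, moody, ftse, cds) and the simulated values are hypothetical, and the within estimator is mimicked with lm() plus country dummies, which gives the same slope estimates. Because every observation appears three times in the stacked data, the plain standard errors used here are probably too small, so clustering by country is worth considering before reading much into the test.

```r
set.seed(1)

# Hypothetical toy panel: 4 countries x 6 years, one rating per agency
base <- expand.grid(iso3 = c("USA", "DEU", "FRA", "ITA"), year = 2001:2006,
                    stringsAsFactors = FALSE)
base$gdp   <- rnorm(nrow(base))
base$sp    <- rnorm(nrow(base), mean = 10, sd = 3)   # agency scales differ
base$moody <- rnorm(nrow(base), mean = 50, sd = 15)
base$ftse  <- rnorm(nrow(base), mean = 5,  sd = 1)
base$cds   <- 100 - 2 * base$gdp - 1.5 * as.numeric(scale(base$sp)) +
  rnorm(nrow(base), sd = 5)

# Step 1: put the three ratings on a common (z-score) scale
agencies <- c("sp", "moody", "ftse")
base[agencies] <- lapply(base[agencies], function(v) as.numeric(scale(v)))

# Steps 2 and 3: stack the data three times; each copy keeps one agency's
# score and sets the other two score columns to zero
stacked <- do.call(rbind, lapply(agencies, function(a) {
  out <- base[c("iso3", "year", "gdp", "cds")]
  for (b in agencies) {
    out[[paste0("score_", b)]] <- if (b == a) base[[b]] else 0
  }
  out
}))

# Fixed effects via country dummies (same slopes as model = "within")
fit <- lm(cds ~ gdp + score_sp + score_moody + score_ftse + factor(iso3),
          data = stacked)

# t-test for H0: coefficient on score_sp equals coefficient on score_moody
b <- coef(fit)
V <- vcov(fit)
est_diff <- unname(b["score_sp"] - b["score_moody"])
se_diff  <- sqrt(V["score_sp", "score_sp"] + V["score_moody", "score_moody"] -
                   2 * V["score_sp", "score_moody"])
t_stat  <- est_diff / se_diff
p_value <- 2 * pt(-abs(t_stat), df = df.residual(fit))
c(difference = est_diff, se = se_diff, t = t_stat, p = p_value)
```

The same contrast can be tested for any other pair of score columns by swapping the coefficient names.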
If you want to see how coefficients on the "control" variables change when you use a different score then your data will look like this:
  iso3 year score_sp score_moody score_ftse gdp_sp gdp_moody gdp_ftse
1  USA 2001        1           0          0     13         0        0
2  USA 2001        0           2          0      0        13        0
3  USA 2001        0           0          3      0         0       13
Now you can use a t-test to check whether the coefficient on gdp is bigger when using one particular score (if that is what you are interested in).

Related

Panel regression - Estimators

I am trying to do a panel regression in R.
pdata <- pdata.frame(NEW, index = c("Year"))
And:
R1 <- plm(Market_Cap ~ GDP_growthR + Volatility_IR + FDI
+ Savings_rate, data=pdata, model="between")
However when I want to use the within (or random) estimator, I got the following error:
Error in plm.fit(data, model, effect, random.method, random.models, random.dfcor, : empty model
But, when I use the between estimator, everything is fine. Do you have any explanation and suggestion?
Thank you!
You should heed the advice in the comments.
I addressed a version of the OP's question on CV. If the structure of the data is the same, then you're only observing one cross-sectional unit over time. In your setting, you're observing a single country over many years. If your data was a true panel dataset, you would be observing more than one country over at least two years. For example, I will simulate a small panel data frame.
library(dplyr)
library(plm)
set.seed(12345)
panel <- tibble(
  country = c(rep("Spain", 5), rep("France", 5), rep("Croatia", 5)),
  year = rep(2016:2020, 3),               # each country is observed over 5 years
  x = rnorm(15),                          # sample 15 random deviates (5 per country)
  y = sample(c(10000:100000), size = 15)  # sample incomes (range: 10,000 - 100,000)
) %>%
  mutate(
    # country dummies (Spain is the omitted reference country)
    France = ifelse(country == "France", 1, 0),
    Croatia = ifelse(country == "Croatia", 1, 0),
    # year dummies (2016 is the omitted reference year)
    y_2017 = ifelse(year == 2017, 1, 0),
    y_2018 = ifelse(year == 2018, 1, 0),
    y_2019 = ifelse(year == 2019, 1, 0),
    y_2020 = ifelse(year == 2020, 1, 0)
  )
Inside of the mutate() function I appended dummies for all countries and all years, excluding one country and one year. In your other question, you estimate time fixed effects. Software invariably drops one year to avoid collinearity. You don't need to append the dummies, but they are helpful for explication purposes. Here is a classic panel data frame:
# Panel - varies across two dimensions (country + time)
# 3 countries observed over 5 years for a total of 15 country-year observations
# A tibble: 15 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Spain 2016 0.586 81371 0 0 0 0 0 0
2 Spain 2017 0.709 10538 0 0 1 0 0 0
3 Spain 2018 -0.109 26893 0 0 0 1 0 0
4 Spain 2019 -0.453 71363 0 0 0 0 1 0
5 Spain 2020 0.606 43308 0 0 0 0 0 1
6 France 2016 -1.82 42544 1 0 0 0 0 0
7 France 2017 0.630 88187 1 0 1 0 0 0
8 France 2018 -0.276 91368 1 0 0 1 0 0
9 France 2019 -0.284 65563 1 0 0 0 1 0
10 France 2020 -0.919 22061 1 0 0 0 0 1
11 Croatia 2016 -0.116 80390 0 1 0 0 0 0
12 Croatia 2017 1.82 48623 0 1 1 0 0 0
13 Croatia 2018 0.371 93444 0 1 0 1 0 0
14 Croatia 2019 0.520 79582 0 1 0 0 1 0
15 Croatia 2020 -0.751 33367 0 1 0 0 0 1
As @DaveArmstrong correctly noted, you should specify the panel indexes. First, we specify a panel data frame, then we estimate the model.
pdata <- pdata.frame(panel, index = c("year", "country"))
random <- plm(y ~ x, model = "random", data = pdata)
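A one-way random effects model is fit. Note that plm reads the first name in index as the individual dimension and the second as the time dimension, so with the ordering above year is treated as the individual (hence n = 5, T = 3 in the output). The call to summary() will produce the following (abridged output):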
Call:
plm(formula = y ~ x, data = pdata, model = "random")
Balanced Panel: n = 5, T = 3, N = 15
Effects:
var std.dev share
idiosyncratic 685439601 26181 0.819
individual 151803385 12321 0.181
theta: 0.2249
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-49380 -17266 6221 17759 32442
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 58308.0 8653.7 6.7380 1.606e-11 ***
x 7777.0 8808.9 0.8829 0.3773
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But your data does not have this structure, hence the error message. In fact, your data is similar to carving out one country from this panel. For example, suppose we winnowed down the data frame to Croatian observations only. The following code takes a subset of the previous data frame:
croatia_only <- panel %>%
filter(country == "Croatia") # grab only the observations from Croatia
Here, longitudinal variation only exists for one country. In other words, by restricting attention to Croatia, we cannot exploit the variation across countries; we only have variation in one dimension! The resulting data frame looks like the following:
# Time Series - varies across one dimension (time)
# A tibble: 5 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Croatia 2016 -0.116 80390 0 1 0 0 0 0
2 Croatia 2017 1.82 48623 0 1 1 0 0 0
3 Croatia 2018 0.371 93444 0 1 0 1 0 0
4 Croatia 2019 0.520 79582 0 1 0 0 1 0
5 Croatia 2020 -0.751 33367 0 1 0 0 0 1
Now I will re-estimate a random effects model with one country:
pdata <- pdata.frame(croatia_only, index = c("year", "country"))
random_croatia <- plm(y ~ x , model = "random", data = pdata)
This should reproduce your error message (i.e., empty model). Note, you only have variation within one country! As you correctly noted, a "between-effects" model is estimable, but not for reasons you might presume. A "between effects" model averages over all years within a country, then it runs ordinary least squares on the 'averaged' data. In your setting, taking the average over your time series results in a country mean. And since you only observe one country, then you only have one observation. Such a model is inestimable. However, you can 'pool' together all of your yearly observations for one country and run a linear model instead. That is what you're doing. To test this out using one country, try comparing the "between" model with the "pooling" model. They should produce identical estimates of x.
# Run this using the croatia_only data frame
summary(plm(y ~ x , model = "between", data = pdata))
summary(plm(y ~ x , model = "pooling", data = pdata))
It should be painfully obvious now, but model = "pooling" is equivalent to running lm().
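The equivalence can also be checked without plm. The sketch below is a base R illustration under two assumptions: the five-row data frame is a hypothetical stand-in for a single country, and year plays the role of the grouping index (as in the pdata.frame() call above), so every group holds exactly one row. A between estimator is ordinary least squares on group means, and with one row per group the group means are just the raw data.

```r
set.seed(12345)

# Hypothetical single-country series: 5 yearly observations
one_country <- data.frame(
  year = 2016:2020,
  x = rnorm(5),
  y = sample(10000:100000, size = 5)
)

# "Between" step: average x and y within each group (here, each year)
group_means <- aggregate(cbind(x, y) ~ year, data = one_country, FUN = mean)

# OLS on the group means versus OLS on the raw (pooled) rows
between_fit <- lm(y ~ x, data = group_means)
pooled_fit  <- lm(y ~ x, data = one_country)

# Identical coefficients, because each group mean is a single observation
all.equal(coef(between_fit), coef(pooled_fit))
```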
If you want me to tie this into your previous post, try estimating a linear model with separate dummies for all years as covariates. You will quickly discover that you have no residual degrees of freedom, which is exactly the problem outlined in your other post.
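A minimal sketch of that degrees-of-freedom check, again on a hypothetical five-row series for one country (the values are simulated, not the asker's data):

```r
set.seed(12345)

# Hypothetical single-country series: one row per year
one_country <- data.frame(
  year = 2016:2020,
  x = rnorm(5),
  y = sample(10000:100000, size = 5)
)

# One dummy per year (minus the reference year) plus x: more parameters
# than the 5 observations can identify
fit_dummies <- lm(y ~ x + factor(year), data = one_country)

# Residual degrees of freedom: observations minus the rank of the design matrix
df.residual(fit_dummies)

# At least one coefficient cannot be estimated and is reported as NA
coef(fit_dummies)
```

With nothing left over after the year dummies, the fit reproduces the data exactly and no standard errors can be computed.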
In sum, I would look for data from other countries. Once you do that, you can use plm() for all it's worth.

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals for these 30 variables in 3 different groups.
Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings, give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, which I would sum up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I do not know how to combine the 30 loading factors into a score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
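To connect this with the question about using the loadings directly, here is a hedged sketch that rebuilds the PC1 scores by hand and then splits individuals into three rank-based groups. It reuses the five example rows; the group labels and the choice of equal-width rank bins are assumptions made for illustration, and since the sign of a principal component is arbitrary, which end counts as "top" has to be decided by inspecting the loadings.

```r
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)

vars <- setdiff(names(df), "merge_id")   # numeric inputs only
PCA <- prcomp(df[, vars], scale. = TRUE, center = TRUE)

# A score is the standardized data multiplied by the loadings (rotation):
# each individual's PC1 value is a loading-weighted sum of its z-scores
Z <- scale(df[, vars])
scores_manual <- Z %*% PCA$rotation
all.equal(unname(scores_manual[, 1]), unname(PCA$x[, 1]))

# Rank individuals on PC1 and cut the ranks into three groups
df$PC1 <- PCA$x[, 1]
df$group <- cut(rank(df$PC1, ties.method = "first"), breaks = 3,
                labels = c("bottom", "middle", "top"))
df[, c("merge_id", "PC1", "group")]
```

Working on ranks rather than raw scores avoids duplicated break points when several individuals share the same PC1 value, as B and D do here.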
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.

Average Marginal Effects in R with complex interaction terms

I am using R to compute the linear regression on the following model, as well as find the marginal effects of age on pizza at specific points (20,30,40,50,55).
mod6.22c <- lm(pizza ~ age + income + age*income +
I((age*age)*income), data = piz4)
The problem I am running into is that the margins command does not see interaction terms that are inserted into the lm with I((age x age) x income). The margins command will only produce accurate average marginal effects when the interaction terms are in the form variable1 x variable1. I also can't create a new variable in my table (table$newvariable <- table$variable1^2), because the margins command won't identify newvariable as related to variable1.
This has been fine up until now, where my interaction terms have only been a quadratic, or an xy interaction, but now I am at a point where I need to calculate the average marginal effects with the interaction term AGE^2xINCOME included in the model, but the only way I can seem to get the summary lm output to be correct is by using I(age^2*(income)) or by creating a new variable in my table. As stated before, the margins command can't read I(age^2*(income)), and if I create a new variable, the margins command doesn't recognize the variables are related, and the average marginal effects produced are incorrect.
The error I am receiving:
> summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)),
variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
I appreciate any help in advance.
Summary of data:
Pizza is annual expenditure on pizza, female, hs, college and grad are dummy variables, income is in thousands of dollars per year, age is years old.
> head(piz4)
pizza female hs college grad income age agesq
1 109 1 0 0 0 19.5 25 625
2 0 1 0 0 0 39.0 45 2025
3 0 1 0 0 0 15.6 20 400
4 108 1 0 0 0 26.0 28 784
5 220 1 1 0 0 19.5 25 625
6 189 1 1 0 0 39.0 35 1225
Libraries used:
library(data.table)
library(dplyr)
library(margins)
tldr
This works:
mod6.22 <- lm(pizza ~ age + income + age*income, data = piz4)
summary(margins(mod6.22, at = list(age= c(20,30,40,50,55)), variables = "income"))
factor age AME SE z p lower upper
income 20.0000 4.5151 1.5204 2.9697 0.0030 1.5352 7.4950
income 30.0000 3.2827 0.9049 3.6276 0.0003 1.5091 5.0563
income 40.0000 2.0503 0.4651 4.4087 0.0000 1.1388 2.9618
income 50.0000 0.8179 0.7100 1.1520 0.2493 -0.5736 2.2095
income 55.0000 0.2017 0.9909 0.2036 0.8387 -1.7403 2.1438
This doesn't work:
mod6.22c <- lm(pizza ~ age + income + age*income + I((age * age)*income), data = piz4)
summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)), variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
How do I get margins to read my interaction variable I((age*age)*income)?
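One possible workaround, sketched here rather than offered as a fix for margins itself, is to compute the marginal effect of income by hand. In this model the derivative with respect to income is b_income + b_age:income * age + b_age2income * age^2, which is a linear combination of coefficients, so its standard error follows directly from vcov(). The data below are simulated (piz_sim is a hypothetical stand-in for piz4), and the term is written as I(age^2 * income) so that the coefficient name is predictable.

```r
set.seed(42)

# Hypothetical stand-in for piz4
n <- 300
piz_sim <- data.frame(
  age = sample(18:60, n, replace = TRUE),
  income = runif(n, 10, 100)
)
piz_sim$pizza <- 100 + 2 * piz_sim$age + 5 * piz_sim$income -
  0.05 * piz_sim$age * piz_sim$income -
  0.0005 * piz_sim$age^2 * piz_sim$income + rnorm(n, sd = 20)

fit <- lm(pizza ~ age * income + I(age^2 * income), data = piz_sim)

# d(pizza)/d(income) = b_income + b_age:income * age + b_I * age^2
ages <- c(20, 30, 40, 50, 55)
terms_needed <- c("income", "age:income", "I(age^2 * income)")
G <- cbind(1, ages, ages^2)            # one row of weights per age

b <- coef(fit)[terms_needed]
V <- vcov(fit)[terms_needed, terms_needed]

me <- drop(G %*% b)                    # marginal effect of income at each age
se <- sqrt(diag(G %*% V %*% t(G)))     # standard error of the linear combination

data.frame(age = ages, dydx = me, se = se, z = me / se,
           p = 2 * pnorm(-abs(me / se)))
```

Because the derivative does not depend on income, the effect at a given age is the same for every observation, so it coincides with the average marginal effect at that age.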

How to account for categorical variables in calculating a risk score from regression model?

I have a data set which has a number of variables that I'd like to use to generate a risk score for getting a disease.
I have created a basic version of what I'm trying to do.
The dataset looks like this:
ID DISEASE_STATUS AGE SEX LOCATION
1 1 20 1 FRANCE
2 0 22 1 GERMANY
3 0 24 0 ITALY
4 1 20 1 GERMANY
5 1 20 0 ITALY
So the model I ran was:
glm(disease_status ~ age + sex + location, data=data, family=binomial(link='logit'))
The beta values produced by this model were as follows:
bage = −0.193
bsex = −0.0497
blocation= 1.344
To produce a risk score, I want to multiply the values for each individual by the beta values, eg:
risk score = (-0.193 * 20 (age)) + (-0.0497 * 1 (sex)) + (1.344 * ??? (location))
However, what value would I use to multiply the beta score for location by?
Thank you!
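A sketch of how the categorical term usually enters the score, on simulated data (the data frame dat and its values are hypothetical): R expands a factor with k levels into k - 1 dummy columns, each with its own coefficient, so there is normally one beta per non-reference location rather than a single blocation. The value multiplied by each location beta is then 1 if the individual is in that location and 0 otherwise.

```r
set.seed(1)

# Hypothetical data with the same variables as in the question
n <- 200
dat <- data.frame(
  age = sample(20:70, n, replace = TRUE),
  sex = rbinom(n, 1, 0.5),
  location = factor(sample(c("FRANCE", "GERMANY", "ITALY"), n, replace = TRUE))
)
dat$disease_status <- rbinom(
  n, 1, plogis(-2 + 0.03 * dat$age + 0.5 * (dat$location == "ITALY"))
)

fit <- glm(disease_status ~ age + sex + location, data = dat,
           family = binomial(link = "logit"))

# FRANCE is the reference level; GERMANY and ITALY each get a coefficient
coef(fit)

# The design matrix holds the 0/1 dummies that multiply those coefficients
X <- model.matrix(fit)
head(X)

# Risk score on the log-odds scale: intercept + sum of beta * value
risk_score <- drop(X %*% coef(fit))

# Same numbers as the built-in linear predictor
all.equal(unname(risk_score), unname(predict(fit, type = "link")))
```

predict(fit, type = "response") converts the same score into a predicted probability if that is easier to interpret.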

Check if a variable is time invariant in R

I searched for an answer to my question but only found one for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command so that it only contains individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think that there are some mistakes in the survey (for example, sex was not entered correctly and so is not the same across all rows for the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df is composed up to 40k observations.
If you know another way to check if a variable is constant in R I will be glad.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
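A small base R sketch of the same check that also lists the offending ids. The example data frame is hypothetical, with id 1 deliberately recorded as both F and M.

```r
# Hypothetical example: id 1 is recorded as both "F" and "M"
df <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008, 2008, 2010),
  id   = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex  = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# Number of distinct sex values per id
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))

# TRUE if at least one id has more than one value
any(n_sex > 1)

# The ids that are not constant, and their rows
bad_ids <- names(n_sex)[n_sex > 1]
bad_ids
df[df$id %in% bad_ids, ]
```

The same pattern works for any other variable that should be time invariant, such as year of birth.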
We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Id 1 is both F and M
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes==2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
