I am trying to do a panel regression in R.
pdata <- pdata.frame(NEW, index = c("Year"))
R1 <- plm(Market_Cap ~ GDP_growthR + Volatility_IR + FDI
+ Savings_rate, data=pdata, model="between")
However, when I use the within (or random) estimator, I get the following error:
Error in, model, effect, random.method, random.models, random.dfcor, : empty model
But, when I use the between estimator, everything is fine. Do you have any explanation and suggestion?
You should heed the advice in the comments.
I addressed a version of the OP's question on CV. If the structure of the data is the same, then you're only observing one cross-sectional unit over time. In your setting, you're observing a single country over many years. If your data was a true panel dataset, you would be observing more than one country over at least two years. For example, I will simulate a small panel data frame.
library(dplyr)  # tibble(), mutate(), and the %>% pipe

panel <- tibble(
  country = c(rep("Spain", 5), rep("France", 5), rep("Croatia", 5)),
  year = rep(2016:2020, 3), # each country is observed over 5 years
  x = rnorm(15), # sample 15 random deviates (5 per country)
  y = sample(c(10000:100000), size = 15) # sample incomes (range: 10,000 - 100,000)
) %>%
  mutate(
    # country dummies (Spain is the omitted reference country)
    France = ifelse(country == "France", 1, 0),
    Croatia = ifelse(country == "Croatia", 1, 0),
    # year dummies (2016 is the omitted reference year)
    y_2017 = ifelse(year == 2017, 1, 0),
    y_2018 = ifelse(year == 2018, 1, 0),
    y_2019 = ifelse(year == 2019, 1, 0),
    y_2020 = ifelse(year == 2020, 1, 0)
  )
Inside of the mutate() function I appended dummies for all countries and all years, excluding one country and one year. In your other question, you estimate time fixed effects. Software invariably drops one year to avoid collinearity. You don't need to append the dummies, but they are helpful for explication purposes. Here is a classic panel data frame:
# Panel - varies across two dimensions (country + time)
# 3 countries observed over 5 years for a total of 15 country-year observations
# A tibble: 15 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Spain 2016 0.586 81371 0 0 0 0 0 0
2 Spain 2017 0.709 10538 0 0 1 0 0 0
3 Spain 2018 -0.109 26893 0 0 0 1 0 0
4 Spain 2019 -0.453 71363 0 0 0 0 1 0
5 Spain 2020 0.606 43308 0 0 0 0 0 1
6 France 2016 -1.82 42544 1 0 0 0 0 0
7 France 2017 0.630 88187 1 0 1 0 0 0
8 France 2018 -0.276 91368 1 0 0 1 0 0
9 France 2019 -0.284 65563 1 0 0 0 1 0
10 France 2020 -0.919 22061 1 0 0 0 0 1
11 Croatia 2016 -0.116 80390 0 1 0 0 0 0
12 Croatia 2017 1.82 48623 0 1 1 0 0 0
13 Croatia 2018 0.371 93444 0 1 0 1 0 0
14 Croatia 2019 0.520 79582 0 1 0 0 1 0
15 Croatia 2020 -0.751 33367 0 1 0 0 0 1
As @DaveArmstrong correctly noted, you should specify the panel indexes. First, we specify a panel data frame, then we estimate the model.
pdata <- pdata.frame(panel, index = c("year", "country"))
random <- plm(y ~ x, model = "random", data = pdata)
A one-way random effects model is fit. The call to summary() will produce the following (abridged output):
plm(formula = y ~ x, data = pdata, model = "random")
Balanced Panel: n = 5, T = 3, N = 15
var share
idiosyncratic 685439601 26181 0.819
individual 151803385 12321 0.181
theta: 0.2249
Min. 1st Qu. Median 3rd Qu. Max.
-49380 -17266 6221 17759 32442
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 58308.0 8653.7 6.7380 1.606e-11 ***
x 7777.0 8808.9 0.8829 0.3773
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But your data does not have this structure, hence the error message. In fact, your data is similar to carving out one country from this panel. For example, suppose we winnowed down the data frame to Croatian observations only. The following code takes a subset of the previous data frame:
croatia_only <- panel %>%
filter(country == "Croatia") # grab only the observations from Croatia
Here, longitudinal variation only exists for one country. In other words, by restricting attention to Croatia, we cannot exploit the variation across countries; we only have variation in one dimension! The resulting data frame looks like the following:
# Time Series - varies across one dimension (time)
# A tibble: 5 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Croatia 2016 -0.116 80390 0 1 0 0 0 0
2 Croatia 2017 1.82 48623 0 1 1 0 0 0
3 Croatia 2018 0.371 93444 0 1 0 1 0 0
4 Croatia 2019 0.520 79582 0 1 0 0 1 0
5 Croatia 2020 -0.751 33367 0 1 0 0 0 1
Now I will re-estimate a random effects model with one country:
pdata <- pdata.frame(croatia_only, index = c("year", "country"))
random_croatia <- plm(y ~ x , model = "random", data = pdata)
This should reproduce your error message (i.e., empty model). Note, you only have variation within one country! As you correctly noted, a "between-effects" model is estimable, but not for reasons you might presume. A "between effects" model averages over all years within a country, then it runs ordinary least squares on the 'averaged' data. In your setting, taking the average over your time series results in a country mean. And since you only observe one country, then you only have one observation. Such a model is inestimable. However, you can 'pool' together all of your yearly observations for one country and run a linear model instead. That is what you're doing. To test this out using one country, try comparing the "between" model with the "pooling" model. They should produce identical estimates of x.
# Run this using the croatia_only data frame
summary(plm(y ~ x , model = "between", data = pdata))
summary(plm(y ~ x , model = "pooling", data = pdata))
It should be painfully obvious now, but model = "pooling" is equivalent to running lm().
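The averaging behind the "between" estimator can be sketched in base R. The data frame and variable names below are made up for illustration; the point is only that averaging by country leaves one row, while averaging by year (which is what happens when year is the first panel index) leaves the data unchanged, so OLS on the "means" coincides with pooled OLS.

```r
# Hypothetical single-country time series (values are illustrative only)
set.seed(1)
croatia <- data.frame(
  country = "Croatia",
  year    = 2016:2020,
  x       = rnorm(5),
  y       = rnorm(5, mean = 50000, sd = 10000)
)

# Averaging with country as the unit: a single row of means,
# so there is nothing left to estimate a slope from
by_country <- aggregate(cbind(y, x) ~ country, data = croatia, FUN = mean)
nrow(by_country)

# Averaging with year as the unit: every "group" holds one row,
# so the group means are just the original observations
by_year <- aggregate(cbind(y, x) ~ year, data = croatia, FUN = mean)

# OLS on the year "means" therefore matches pooled OLS on the raw data
fit_between <- lm(y ~ x, data = by_year)
fit_pooled  <- lm(y ~ x, data = croatia)
all.equal(coef(fit_between), coef(fit_pooled))
```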
If you want me to tie this into your previous post, try estimating a linear model with separate dummies for all years as covariates. You will quickly discover that you have no residual degrees of freedom, which is exactly the problem outlined in your other post.
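A minimal sketch of that check, using a hypothetical five-year series for one country (the data and names are illustrative, not your data set). With one observation per year, the year dummies alone fit the data perfectly, so nothing is left over to estimate the slope on x.

```r
# Hypothetical data: one country, one observation per year
set.seed(2)
one_country <- data.frame(year = 2016:2020, x = rnorm(5), y = rnorm(5))

# Intercept plus four year dummies already use up all five observations
fit_dummies <- lm(y ~ factor(year) + x, data = one_country)

df.residual(fit_dummies)  # 0 residual degrees of freedom
coef(fit_dummies)["x"]    # NA: x cannot be separated from the year dummies
```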
In sum, I would look for data from other countries. Once you do that, you can use plm() for all it's worth.


Formula with interaction terms in event-study designs using R

I am estimating what's often called the "event-study" specification of a difference-in-differences model in R. Basically, we observe treated and control units over time and estimate a two-way fixed effects model with parameters for the "effect" of being treated in each time period (omitting one period, usually the one before treatment, as the reference period). I am struggling with how to compactly specify this model with R formulas.
For example, here is the model...
library(dplyr)  # tibble(), mutate(), %>%
library(lfe)    # felm(); assumed from the "| id + time" fixed-effects syntax

N <- 100
df <- tibble(
  id = rep(1:N, 5),
  treat = id >= ceiling(N / 2),
  time = rep(1:5, each = N),
  x = rnorm(5 * N)
)
# produce an outcome variable
df <- df %>% mutate(
  y = x - treat * (time == 5) + time + rnorm(5 * N)
)
# easily recover the parameters with the true model...
summary(felm(y ~ x + I(treat * (time == 5)) | id + time, data = df))
Now, I want to do an event-study design using period 4 as the baseline because treatment happens in period 5. We expect coefficients near zero on the other pre-periods (1–3) and a negative treatment effect for the treated in the treated period (time == 5).
df$timefac <- factor(df$time, levels = c(4, 1, 2, 3, 5))
summary(felm(y ~ x + treat * timefac | id + time, data = df))  # lfe::felm() assumed
That looks good, but produces lots of NAs because several of the coefficients are absorbed by the unit and time effects. Ideally, I can specify the model without those coefficients...
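# create dummy for each time period for treated units
# (base R model.matrix() stands in for dummy(), whose package the post does not name)
tdum <- model.matrix(~ 0 + factor(time), data = df)
colnames(tdum) <- paste0("time", 1:5)
df <- bind_cols(df, as_tibble(tdum))
df <- df %>% mutate_at(vars(time1:time5), ~ . * treat)
# estimate model, manually omitting one dummy (lfe::felm() assumed)
summary(felm(y ~ x + time1 + time2 + time3 + time5 | id + time, data = df))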
Now, the question is how to specify this model in a compact way. I thought the following would work, but it produces very unpredictable output...
summary(felm(y ~ x + treat:timefac | id + time, data = df))  # lfe::felm() assumed
With the above, R does not use period 4 as the reference period and sometimes chooses to include the interaction with untreated rather than treated. The output is...
Estimate Std. Error t value Pr(>|t|)
x 0.97198 0.05113 19.009 < 2e-16 ***
treatFALSE:timefac4 NA NA NA NA
treatTRUE:timefac4 -0.19607 0.28410 -0.690 0.49051
treatFALSE:timefac1 NA NA NA NA
treatTRUE:timefac1 -0.07690 0.28572 -0.269 0.78796
treatFALSE:timefac2 NA NA NA NA
treatTRUE:timefac2 NA NA NA NA
treatFALSE:timefac3 0.15525 0.28482 0.545 0.58601
treatTRUE:timefac3 NA NA NA NA
treatFALSE:timefac5 0.97340 0.28420 3.425 0.00068 ***
treatTRUE:timefac5 NA NA NA NA
Is there a way to specify this model without having to manually produce dummies and interactions for treated units for every time period?
If you know Stata, I'm essentially looking for something as easy as:
areg y x i.treat##ib4.time, absorb(id)
(Note how simple it is to tell Stata to treat the variable as categorical — the i prefix —without making dummies for time and also indicate that period 4 should be the base period — the b4 prefix.)
The package fixest performs fixed-effects estimations (like lfe) and includes utilities to deal with interactions. The function i (or interact) is what you're looking for.
Here is an example where the treatment is interacted with the year and year 5 is dropped out:
est_did = feols(y ~ x1 + i(treat, period, 5) | id + period, base_did)
#> OLS estimation, Dep. Var.: y
#> Observations: 1,080
#> Fixed-effects: id: 108, period: 10
#> Standard-errors: Clustered (id)
#> Estimate Std. Error t value Pr(>|t|)
#> x1 0.973490 0.045678 21.312000 < 2.2e-16 ***
#> treat:period::1 -1.403000 1.110300 -1.263700 0.206646
#> treat:period::2 -1.247500 1.093100 -1.141200 0.254068
#> treat:period::3 -0.273206 1.106900 -0.246813 0.805106
#> treat:period::4 -1.795700 1.088000 -1.650500 0.099166 .
#> treat:period::6 0.784452 1.028400 0.762798 0.445773
#> treat:period::7 3.598900 1.101600 3.267100 0.001125 **
#> treat:period::8 3.811800 1.247500 3.055500 0.002309 **
#> treat:period::9 4.731400 1.097100 4.312600 1.8e-05 ***
#> treat:period::10 6.606200 1.120500 5.895800 5.17e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-likelihood: -2,984.58 Adj. R2: 0.48783
The nice thing is that you can plot the interacted coefficients out of the estimation to have a quick visual representation of the results (if you find the graph too sober, no worries you can customize almost everything in it):
If you don't want to use fixest for estimation, you can still use the function i to create interactions. Its syntax is i(var, f, ref, drop, keep): it interacts the variable var with a dummy variable for each value in f. You can select which values of f to retain with the arguments ref, drop and keep. drop well... drops values from f and ref is the same as drop, but the references are shown in the coefplot (while the values in drop don't appear in the graph).
Here's an example of what i does:
head(with(base_did, i(treat, period, keep = 3:7)))
#> treat:period::3 treat:period::4 treat:period::5 treat:period::6 treat:period::7
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 1 0 0 0 0
#> 4 0 1 0 0 0
#> 5 0 0 1 0 0
#> 6 0 0 0 1 0
head(with(base_did, i(treat, period, drop = 3:7)))
#> treat:period::1 treat:period::2 treat:period::8 treat:period::9 treat:period::10
#> 1 1 0 0 0 0
#> 2 0 1 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
You can find more information in the fixest documentation.
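One caveat, hedged: releases of fixest published after this answer (0.9.0 onward, as far as I can tell) reordered the arguments of i() so that the factor comes first and the interacted variable second, and they added iplot() for plotting the interaction coefficients. Under that assumption, the same model would be written roughly as follows, and the coefficient names would change accordingly.

```r
library(fixest)

# base_did ships with fixest: 108 units observed over 10 periods
data(base_did, package = "fixest")

# Newer argument order: factor first, interacted variable second, reference third
est_did <- feols(y ~ x1 + i(period, treat, ref = 5) | id + period, data = base_did)

# Coefficient names now look like "period::1:treat" instead of "treat:period::1"
names(coef(est_did))

# Event-study plot of the interaction coefficients
iplot(est_did)
```

If the call in the answer errors or gives unexpected names on your installation, checking the installed version with packageVersion("fixest") is a reasonable first step.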
You can define a new time factor so that untreated observations are coded as the omitted time category. Writing it to its own column (time_fac) leaves time intact for the fixed effects.
df <- df %>%
  mutate(time_fac = ifelse(treat == 0, 4, time),
         time_fac = factor(time_fac, levels = c(4, 1, 2, 3, 5)))
Then, you can use time_fac without interactions and get a regression table with no NAs.
summary(felm(y ~ x + time_fac | id + time, data = df))  # lfe::felm() assumed
Estimate Std. Error t value Pr(>|t|)
x 0.98548 0.05028 19.599 < 2e-16 ***
time_fac1 -0.01335 0.27553 -0.048 0.961
time_fac2 -0.10332 0.27661 -0.374 0.709
time_fac3 0.24169 0.27575 0.876 0.381
time_fac5 -1.16305 0.27557 -4.221 3.03e-05 ***
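The recoding trick can also be checked without any fixed-effects package. The sketch below is an illustration under assumptions: it simulates data in the same spirit as the question and swaps in plain lm() with unit and period dummies, which is fine at this size but is not what the answer itself ran.

```r
# Simulated panel in the spirit of the question: 100 units, 5 periods,
# treatment effect of -1 for treated units in period 5
set.seed(42)
N  <- 100
df <- data.frame(id = rep(1:N, 5), time = rep(1:5, each = N), x = rnorm(5 * N))
df$treat <- df$id >= ceiling(N / 2)
df$y <- df$x - df$treat * (df$time == 5) + df$time + rnorm(5 * N)

# Event-time factor: untreated rows are folded into the reference period 4,
# so each remaining level is a "treated x period" dummy
df$time_fac <- factor(ifelse(df$treat, df$time, 4), levels = c(4, 1, 2, 3, 5))

# Unit and period fixed effects entered as ordinary dummies
fit <- lm(y ~ x + time_fac + factor(id) + factor(time), data = df)

# The four event-study coefficients, none of them dropped as NA
est <- coef(fit)[paste0("time_fac", c(1, 2, 3, 5))]
est
```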

How to plot different tests conclusions

I'm trying to plot the results of different tests I ran on my database regarding the connection between different groups.
Here's the dataframe structure
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI
1 2020-01 Dan U17 Male None None Jewish Unprofessional workers 2 0
2 2020-01 Dan U17 Male None None Muslims Sales and costumer service 1 0
3 2020-01 Dan U17 Female None None Other Undefined 1 0
4 2020-01 Dan 18-24 Male None None Jewish Production and construction 1 0
5 2020-01 Dan 18-24 Male None None Jewish Academic degree 1 0
6 2020-01 Dan 18-24 Male None None Jewish Practical engineers and technicians 1 0
ACU NACU NewSeekers NewFiredSeekers
1 0 2 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 1 1
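I reduced it based on the relevant test. For example, for the t-test I did:
dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  # aggregation step missing from the post; a per-group sum is assumed from the output below
  summarise(NewSeekers = sum(NewSeekers), .groups = "drop")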
Month District NewSeekers
<chr> <chr> <int>
1 2020-01 Dan 6551
2 2020-01 Jerusalem 3589
3 2020-01 North 6154
4 2020-01 Sharon 4131
5 2020-01 South 4469
6 2020-02 Dan 5529
and then performed a t-test:
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
Here are all the tests I did for each group (t-test for new seekers vs. district, Wilcoxon for age vs. new seekers, and ANOVA for occupation vs. new seekers).
I'm looking for a graphic way to show the result of each test.
If you have any idea, please help
# t test for district vs new seekers
# sorting
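dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  # aggregation step missing from the post; a per-group sum is assumed
  summarise(NewSeekers = sum(NewSeekers), .groups = "drop")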
# performing a t test on the mini table we created
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
# results
Welch Two Sample t-test
data: NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-119952.3 200737.3
sample estimates:
mean in group Dan mean in group South
74608.25 34215.75
#wilcoxon test
# filtering Cdata to New seekers based on month and age
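age.newseek <- Cdata %>%
  group_by(Month, Age) %>%
  # aggregation step missing from the post; a per-group sum is assumed
  summarise(NewSeekers = sum(NewSeekers), .groups = "drop")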
#performing a wilcoxon test on the subset
wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))
# Results
Wilcoxon rank sum exact test
data: NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0
# ANOVA test
# Sorting occupation and month by new seekers
occu.newseek <- Cdata %>%
  group_by(Month, Occupation) %>%
  # aggregation step missing from the post; a per-group sum is assumed
  summarise(NewSeekers = sum(NewSeekers), .groups = "drop")
## Make Occupation a factor
occu.newseek$Occupation <- as.factor(occu.newseek$Occupation)
## Get the occupation group means and standard deviations
## (the object name was lost from the post; group_stats is a placeholder)
group_stats <- aggregate(
  x = occu.newseek$NewSeekers, # Specify data column
  by = list(occu.newseek$Occupation), # Specify group indicator
  FUN = function(x) c('mean' = mean(x), 'sd' = sd(x))
)
## Run one way ANOVA test
anova_one_way <- aov(NewSeekers ~ Occupation, data = occu.newseek)
summary(anova_one_way)
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
## Check the mean differences across the groups
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21))

Standardized Betas for Panel Data in R

I have a panel dataset and I'm running a fixed effects regression. My dependent variable is CDS Spreads and I have 7 independent variables which are macroeconomic variables (GDP, Inflation etc) and then I have ratings data for three agencies which is the eighth independent variable, so I basically run three separate regressions for each rating agency:
plm(CDS ~ GDP+Inflation+...+S&PRating, data, model="within")
plm(CDS ~ GDP+Inflation+...+FTSE, data, model="within")
plm(CDS ~ GDP+Inflation+...+Moodys, data, model="within")
I want to compare the magnitude of the effect of the three agencies on CDS spreads, and also compare it with the rest of the independent variables, but the three agencies rate on different scales. I want to standardize the coefficients. How do I do that for panel data? "lm.beta" from the "QuantPsyc" package is not giving accurate results: it changes the signs of the coefficients, and an earlier post suggested that a z-transformation is not advisable for panel data. Can you please suggest a way to make a meaningful comparison of the results?
If you run three different regressions you cannot formally test differences in coefficients. It might be more informative to:
Standardize the scores (a standard normal transformation could be a good start).
Stack your data, so that each observation shows up three times. The score will be in the same variable.
Create 3 new variables, where you interact the score with a dummy indicator of the score agency. Your data will look like this:
rbind(
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 1, score_moody = 0, score_ftse = 0),
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 0, score_moody = 2, score_ftse = 0),
  data.frame(iso3 = "USA", year = 2001, gdp = 13, score_sp = 0, score_moody = 0, score_ftse = 3)
)
  iso3 year gdp score_sp score_moody score_ftse
1  USA 2001  13        1           0          0
2  USA 2001  13        0           2          0
3  USA 2001  13        0           0          3
Estimate your model:
plm(CDS ~ GDP+Inflation+...+ score_sp + score_moody + score_ftse, data, model="within")
Now you can simply compare coefficients with t-tests.
If you want to see how coefficients on the "control" variables change when you use a different score then your data will look like this:
iso3 year score_sp score_moody score_ftse gdp_sp gdp_moody gdp_ftse
1 USA 2001 1 0 0 13 0 0
2 USA 2001 0 2 0 0 13 0
3 USA 2001 0 0 3 0 0 13
Now you can use t-test to check whether the coefficient on gdp is bigger when using one particular score (if that is what you are interested in).
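A rough sketch of the stack-and-compare idea follows. Everything in it is an assumption made for illustration: the data are simulated, only two agencies are shown, plain lm() replaces plm() to keep it dependency-free, and the helper stack_one() is hypothetical. The difference between two coefficients is tested from coef() and vcov().

```r
set.seed(7)
n <- 200

# Hypothetical wide data: one row per country-year, ratings already standardized
wide <- data.frame(id = 1:n, gdp = rnorm(n),
                   sp = as.numeric(scale(rnorm(n))),
                   moody = as.numeric(scale(rnorm(n))))
wide$cds <- 1 + 0.5 * wide$gdp - 0.8 * wide$sp - 0.3 * wide$moody + rnorm(n)

# Hypothetical helper: one copy of the data per agency, with only that
# agency's score switched on and the other set to zero
stack_one <- function(d, agency) {
  out <- d[, c("id", "gdp", "cds")]
  out$score_sp    <- if (agency == "sp") d$sp else 0
  out$score_moody <- if (agency == "moody") d$moody else 0
  out
}
stacked <- rbind(stack_one(wide, "sp"), stack_one(wide, "moody"))

fit <- lm(cds ~ gdp + score_sp + score_moody, data = stacked)

# t-test of H0: coefficient on score_sp equals coefficient on score_moody
b <- coef(fit)
V <- vcov(fit)
diff_est <- b[["score_sp"]] - b[["score_moody"]]
diff_se  <- sqrt(V["score_sp", "score_sp"] + V["score_moody", "score_moody"] -
                 2 * V["score_sp", "score_moody"])
t_stat <- diff_est / diff_se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
c(difference = diff_est, t = t_stat, p = p_val)
```

Because each original observation appears more than once in the stacked data, the default standard errors are probably too small; clustering on the original observation would be a sensible refinement.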

In R, how do I create multiple time series graphs for multiple sites and parameters in one script/code section?

I'm trying to create hundreds of plots (one plot for each site/program) for time series data (annual water quality averages over the years). I need to plot each site/program combination with each parameter (ANC, DOC, etc). Years run from 1990 to 2017. When a parameter isn't collected, it's stored as a 0, but I can change this to a null if needed. I'd like to do this as one code/script to save time.
I'm more familiar with ggplot2, but I'm open to other packages as well.
Here's an example of my data:
PROGRAM <fctr> SITE <fctr> YEAR <int> ANC <dbl> DOC <dbl> NO3 <dbl> SBC_ALL <dbl> SBC.0 <dbl> SO4 <dbl>
1 VTSSS VT64 1992 66.753 0 0.803 94.890 0 8.135
2 VTSSS VT64 1998 64.778 0 0.803 92.783 0 8.230
3 VTSSS VT43 1995 58.925 0 0.698 91.945 0 8.730
4 VTSSS VT43 1990 63.240 0 0.885 91.618 0 8.733
5 VTSSS VT43 1992 66.043 0 0.825 93.873 0 8.758
6 VTSSS VT43 2004 66.298 0 0.688 95.358 0 8.813
I've tried ggplot, but it only comes up with an empty graph. I'm thinking I need a graphing package where I can subset by PROGRAM and SITE.
I've tried:
ggplot(data = mydata, mapping = aes(x = "year", y = "anc", color = "site")) +
geom_line() +
Expected Results: Want to have a plot for site: VT43 and parameter: ANC over time (1990 to 2017). VT43 and DOC. VT43 and NO3. Etc....
Actual Results:
Empty graph with year on the x axis, anc on the y axis, and no dots or lines for sites.

Predict() with nested multinomial logit models

I'm using the mlogit package in R to create a nested multinomial logit model of healthcare provider choice given choice data I have. The data look like this:
ID RES weight age wealth educ married urban partnerAge totalChildren survivingChildren anyANC
1.0 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.1 2468158 TRUE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.2 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.3 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
2.0 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.1 14233860 TRUE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.2 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.3 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
outlier50Km optout alt spa mes dist bobs cobs Q fees chid educSec
1.0 0 -1 0 Home Home 0.000 0.0000000 0.000 0.00 0 1 0
1.1 0 -1 1 Health center Public 13.167 0.4898990 NA 0.64 0 1 0
1.2 0 -1 2 Health center Public 30.596 0.5202020 NA 0.56 0 1 0
1.3 0 -1 3 District hospital Public 41.164 0.7171717 0.825 0.88 0 1 0
2.0 0 -1 0 Home Home 0.000 0.0000000 0.000 0.00 0 2 0
2.1 0 -1 1 Health center Mission 14.756 0.7676768 NA 0.64 1 2 0
2.2 0 -1 2 Health center Public 41.817 0.3787879 NA 0.56 0 2 0
2.3 0 -1 3 District hospital Public 50.419 0.7171717 0.825 0.88 0 2 0
where spa, mes, dist, bobs, cobs, Q, and fees are characteristics of the provider and the remaining variables specific to the individual. These data are in long format, meaning each individual has four rows, reflecting her four choices (alt = 0:3), with RES being the response variable.
An un-nested model behaves appropriately
f.full <- RES ~ 0 + dist + Q + bobs + fees + spa | 0 + age + wealth + educSec + married + urban + totalChildren + survivingChildren
# the model object's name was lost from the post; ns1 is a placeholder
ns1 <- mlogit(formula = f.full, data = data, weights = weight)
predict(ns1, data[1:8,])
0 1 2 3
[1,] 0.1124429 0.7739403 0.06893341 0.04468343
[2,] 0.4465272 0.3107375 0.11490317 0.12783210
By all measures of model fit, however, a nested model is better than an un-nested one. The nested model gives me coefficients appropriately:
ns2 <- mlogit(formula = f.full, nests = list(home = "0", useCare = c("1", "2", "3")), data = data, weight = weight, un.nest.el = TRUE)
mlogit(formula = f.full, data = data, weights = weight, nests = list(home = "0",
useCare = c("1", "2", "3")), un.nest.el = TRUE)
Frequencies of alternatives:
0 1 2 3
0.094378 0.614216 0.194327 0.097079
bfgs method
23 iterations, 0h:0m:13s
g'(-H)^-1g = 9.51E-07
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
dist -0.0336233 0.0040136 -8.3773 < 2.2e-16 ***
Q 0.1780058 0.0768181 2.3172 0.0204907 *
bobs -0.0695695 0.0505795 -1.3754 0.1689925
fees -0.8488132 0.1001928 -8.4718 < 2.2e-16 ***
But, I get the following error if I try to predict on a single individual:
predict(ns2, data[1:4,])
Error in apply(Gl, 1, sum) : dim(X) must have a positive length
and a different error if I try to predict on more than one individual:
predict(ns2, data[1:8,])
Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[5,5] = 0
Any help would be vastly appreciated.

