Including non-linearity in a fixed effects model in plm - R

I am trying to build a fixed effects regression with the plm package in R. I am using country level panel data with year and country fixed effects.
My problem concerns two explanatory variables: one is an interaction term of two variables and the other is a squared term of one of the variables.
The model is basically:
y = x1 + x1^2 + x2 + x1*x2 + ... + xn, with all variables in log form
It is central to the model to include the squared term, but when I run the regression it always gets excluded because of "singularities", as x1 and x1^2 are obviously correlated.
Meaning the regression works and I get estimates for my variables, just not for x1^2 and x1*x2.
How do I circumvent this?
library(plm)
fe_reg <- plm(log(y) ~ log(x1) + log(x2) + log(x2^2) + log(x1*x2) + dummy,
              data = df,
              index = c("country", "year"),
              model = "within",
              effect = "twoways")
summary(fe_reg)
I have tried defining the interaction and squared terms as vectors, which helped with the interaction term but not the squared term.
library(dplyr)
df1.pd <- df1 %>% mutate_at(c('x1'), ~(scale(.) %>% as.vector))
df1.pd <- df1.pd %>% mutate_at(c('x2'), ~(scale(.) %>% as.vector))
I am pretty new to R, so apologies if this not a very well structured question.

You just found two properties of the logarithm function:
log(x^2) = 2 * log(x)
log(x*y) = log(x) + log(y)
Then, obviously, log(x) is collinear with 2*log(x) and one of the two collinear variables is dropped from the estimation. Same for log(x*y) and log(x) + log(y).
So, the model you want to estimate is not estimable by linear regression methods. You might want to consider data transformations other than the log, or use the original (untransformed) variables.
See also the reproducible example below, where I just used log(x^2) = 2*log(x). Linear dependence can be detected, e.g., via the function detect.lindep from package plm (see also below). The dropping of coefficients from the estimation also hints at collinear columns in the model matrix. At times, linear dependence only appears after the data transformations involved in the estimation function; for an example with the within transformation, see the Examples section of the help page ?detect.lindep.
library(plm)
data("Grunfeld")
pGrun <- pdata.frame(Grunfeld)
pGrun$lvalue <- log(pGrun$value) # log(x)
pGrun$lvalue2 <- log(pGrun$value^2) # log(x^2) == 2 * log(x)
mod <- plm(inv ~ lvalue + lvalue2 + capital, data = pGrun, model = "within")
summary(mod)
#> Oneway (individual) effect Within Model
#>
#> Call:
#> plm(formula = inv ~ lvalue + lvalue2 + capital, data = pGrun,
#> model = "within")
#>
#> Balanced Panel: n = 10, T = 20, N = 200
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -186.62916 -20.56311 -0.17669 20.66673 300.87714
#>
#> Coefficients: (1 dropped because of singularities)
#> Estimate Std. Error t-value Pr(>|t|)
#> lvalue 30.979345 17.592730 1.7609 0.07988 .
#> capital 0.360764 0.020078 17.9678 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 2244400
#> Residual Sum of Squares: 751290
#> R-Squared: 0.66525
#> Adj. R-Squared: 0.64567
#> F-statistic: 186.81 on 2 and 188 DF, p-value: < 2.22e-16
detect.lindep(mod) # run on the model
#> [1] "Suspicious column number(s): 1, 2"
#> [1] "Suspicious column name(s): lvalue, lvalue2"
detect.lindep(pGrun) # run on the data
#> [1] "Suspicious column number(s): 6, 7"
#> [1] "Suspicious column name(s): lvalue, lvalue2"

Related

How to determine degrees of freedom for t stat with quantile regression and bootstrapped standard errors in R

I am using R to conduct a quantile regression with bootstrapped standard errors to test if one variable is higher than a second variable at the 5th, 50th, and 95th percentiles of the distributions. The output does not include degrees of freedom for the t stat. How can I calculate this?
summary(rq(data$var1~data$var2, tau=.05), se="boot")
summary(rq(data$var1~data$var2, tau=.5), se="boot")
Assuming you used the library quantreg, if you were to call rq() by itself, you get the degrees of freedom.
It looks like you're fairly new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. This includes the libraries you used and sample data, like the output from dput(head(dataObject)). Check it out: making R reproducible questions.
Capturing the degrees of freedom, in this case, should be relatively easy.
Here, the total degrees of freedom is simply the number of observations. The residual degrees of freedom are the number of observations minus the number of estimated parameters in the formula (intercept included).
The degrees of freedom for each t-statistic is the number of variables that are represented for that t-statistic (typically one).
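For instance, with the mtcars example used below (mpg ~ wt), this works out as:
nrow(mtcars)      # 32 total degrees of freedom
nrow(mtcars) - 2  # 30 residual degrees of freedom (intercept + slope estimated)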
If you call the regression directly (instead of nested in the summary function), it gives you information about the degrees of freedom, as well. That being said, if you don't run the model independently, it is more difficult to test the assumptions that the data must meet for the analysis. Lastly, in this form, you can't test the model for overfitting, either.
library(quantreg)
data(mtcars)
(fit <- rq(mpg ~ wt, data = mtcars, tau = .05))
# Call:
# rq(formula = mpg ~ wt, tau = 0.05, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.561538 -6.515837
#
# Degrees of freedom: 32 total; 30 residual
(fit2 <- rq(mpg ~ wt, data = mtcars, tau = .5))
# Call:
# rq(formula = mpg ~ wt, tau = 0.5, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 34.232237 -4.539474
#
# Degrees of freedom: 32 total; 30 residual
summary(fit, se = "boot")
#
# Call: rq(formula = mpg ~ wt, tau = 0.05, data = mtcars)
#
# tau: [1] 0.05
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 37.56154 5.30762 7.07690 0.00000
# wt -6.51584 1.58456 -4.11208 0.00028
summary(fit2, se = "boot")
#
# Call: rq(formula = mpg ~ wt, tau = 0.5, data = mtcars)
#
# tau: [1] 0.5
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 34.23224 3.20718 10.67362 0.00000
# wt -4.53947 1.04645 -4.33798 0.00015
I would like to point out that se = "boot" doesn't appear to be doing anything here. Additionally, you can run both tau settings in the same model call; the quantreg package has several tools for comparing the models when they are fit together.
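As a sketch of that last point (not part of the original answer): rq() accepts a vector of quantiles, and the two single-quantile fits from above can be compared with anova():
# fit both quantiles in one call
fit_both <- rq(mpg ~ wt, data = mtcars, tau = c(0.05, 0.5))
summary(fit_both, se = "boot")
# compare the two single-quantile fits (test of slope equality across quantiles)
anova(fit, fit2)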

How to get within R squared from plm FE regression?

I regress monthly stocks returns on a set of firm characteristics using the plm package.
library(plm)
set.seed(1)
id=rep(1:10,each=10); t=rep(1:10,10); industry=rep(1:2,each=50); return=rnorm(100); x=rnorm(100)
data=data.frame(id,t,industry,return,x)
In a first step, I want to include time fixed effects. The following two formulas give the same coefficients for x but different R-squares. The first model estimates the overall R-squared, while the second model gives the within R-squared.
reg1=plm(return~x+factor(t),model="pooling",index=c("id","t"),data=data)
summary(reg1)$r.squared
reg2=plm(return~x,model="within",index=c("id","t"),data=data,effect="time")
summary(reg2)$r.squared
In a second step, I now want to include both time and industry fixed effects. I obtain coefficients by this formula:
reg3=plm(return~x+factor(t)+factor(industry),model="pooling",index=c("id","t"), data=data)
Unfortunately, I cannot use the "within" model as in reg2 because industry is not one of my index variables. Is there another way to calculate the within R-squared for reg3?
This is not a direct answer to your question, because I am not sure plm can do this. (It might, but I can't figure it out.) However, if you are mainly estimating fixed effects models, then I can warmly recommend the fixest package, which is super fast and offers a convenient formula syntax to specify fixed effects and interactions. Here's a simple example:
library(fixest)
library(modelsummary)
dat = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv")
models = list(
  feols(wage ~ emp | year, data = dat),
  feols(wage ~ emp | firm, data = dat),
  feols(wage ~ emp | firm + year, data = dat)
)
modelsummary(models)
              Model 1     Model 2     Model 3
emp            -0.039      -0.120      -0.047
              (0.003)     (0.064)     (0.042)
Num.Obs.         1031        1031        1031
R2              0.039       0.868       0.896
R2 Adj.         0.030       0.847       0.879
R2 Within       0.012       0.016       0.003
R2 Pseudo
AIC            6474.0      4687.7      4455.6
BIC            6523.4      5384.0      5191.4
Log.Lik.    -3226.988   -2202.833   -2078.818
Std.Errors   by: year    by: firm    by: firm
FE: year            X                       X
FE: firm                        X           X
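The within R-squared reported in the table can also be pulled directly from a fitted fixest model; a quick sketch (assuming fixest's r2() helper):
r2(models[[3]], type = "wr2")  # within R2 of the firm + year model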
To include time and industry effects (in addition to the individual effects), just use a two-ways within model and include any further fixed effects via + factor(eff) in the formula.
For your example, this would be:
reg3 <- plm(return ~ x + factor(industry), model="within", effect = "twoways", index=c("id","t"), data = data)
summary(reg3)
# Twoways effects Within Model
#
# Call:
# plm(formula = return ~ x + factor(industry), data = data, effect = "twoways",
# model = "within", index = c("id", "t"))
#
# Balanced Panel: n = 10, T = 10, N = 100
#
# Residuals:
# Min. 1st Qu. Median 3rd Qu. Max.
# -1.84660 -0.61135 0.06318 0.57474 2.06264
#
# Coefficients:
# Estimate Std. Error t-value Pr(>|t|)
# x 0.050906 0.112408 0.4529 0.6519
#
# Total Sum of Squares: 68.526
# Residual Sum of Squares: 68.35
# R-Squared: 0.0025571
# Adj. R-Squared: -0.23434
# F-statistic: 0.20509 on 1 and 80 DF, p-value: 0.65187
summary(reg3)$r.squared
# rsq adjrsq
# 0.002557064 -0.234335633
However, note that for your toy example data, the variable industry is collinear after the fixed effects transformation and, thus, drops out of the estimation (see ?detect.lindep for an explanation and another example). Check via, e.g.:
detect.lindep(reg3)
# [1] "Suspicious column number(s): 2"
# [1] "Suspicious column name(s): factor(industry)2"
Or via:
alias(reg3)
# Model :
# [1] "return ~ x + factor(industry)"
#
# Complete :
# [,1]
# factor(industry)2 0

R -plm - error within and random effects models (pooling, between & first differences work)

I have a problem with the within and random effects methods (they don't work), while the pooling, between, and first differences estimators work fine.
I have the same problem as in R - Error in class(x) - plm - only within and random effects models.
Here is the link to my data: https://www.dropbox.com/s/8tgeyhxeb0wrdri/my_data.xlsx?raw=1 (there are some financial measures and GDP growth for some countries)
My code:
library(readxl)
library(plm)
proba <- read_excel("my_data.xlsx")
attach(proba)
Y <- cbind(GDP_growth)
X <- cbind(gfdddi01, gfdddi02, gfdddi04, gfdddi05)
pdata<-pdata.frame(proba,index=c("id","year"))
##POOLED OLS estimator
pooling<-plm(Y~X,data=pdata,model="pooling")
summary(pooling)
##BETWEEN ESTIMATOR
between<-plm(Y~X,data=pdata,model="between")
summary(between)
#FIRST DIFFERENCES ESTIMATOR
firstdiff<-plm(Y~X,data=pdata,model="fd")
summary(firstdiff)
#FIXED EFFECT OR WITHIN ESTIMATOR
fixed <-plm(Y~X,data=pdata,model="within")
summary(fixed)
#RANDOM EFFECTS ESTIMATOR
random<- plm(Y~X,data=pdata,model="random")
summary(random)
The error message I get:
Error in class(x) <- setdiff(class(x), "pseries") : invalid to set the class to matrix unless the dimension attribute is of length 2 (was 0)
What can be wrong?
Do not use variables from the environment (like you have done with Y and X - there is no need to create those). Rather, in the formula argument of plm, use the variable names as they occur in your data pdata:
#FIXED EFFECT OR WITHIN ESTIMATOR
fixed <-plm(GDP_growth ~ gfdddi01 + gfdddi02 + gfdddi04 + gfdddi05, data = pdata, model ="within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = GDP_growth ~ gfdddi01 + gfdddi02 + gfdddi04 + gfdddi05,
## data = pdata, model = "within")
##
## Balanced Panel: n = 17, T = 41, N = 697
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -18.89148 -1.17470 0.12701 1.48874 20.70109
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## gfdddi01 -0.0066663 0.0153800 -0.4334 0.6648
## gfdddi02 0.0051626 0.0153343 0.3367 0.7365
## gfdddi04 -0.0245573 0.0150069 -1.6364 0.1022
## gfdddi05 -0.0049627 0.0073786 -0.6726 0.5014
##
## Total Sum of Squares: 5421.5
## Residual Sum of Squares: 5366.8
## R-Squared: 0.010095
## Adj. R-Squared: -0.019192
## F-statistic: 1.72352 on 4 and 676 DF, p-value: 0.14296
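The random effects model from the question can be specified the same way; a sketch with the same variable names:
#RANDOM EFFECTS ESTIMATOR
random <- plm(GDP_growth ~ gfdddi01 + gfdddi02 + gfdddi04 + gfdddi05,
              data = pdata, model = "random")
summary(random)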

tuple variable in r regression model

I'm trying to use Time of the Day as an independent variable in my model. As time is a circular variable, I'm transforming it to (sin(pi * hour / 12), cos(pi * hour / 12)).
I googled around and I still don't know how to create a column in R with the (sin, cos) formatted vector/tuple values. I don't know if the models lm, glm, glm.nb (MASS) and glmer (lme4) can support this kind of data.
Excuse me for being a novice here. If vector-type variables should not be included in a regression model, I'll go to Cross Validated (stats) for suggestions on dealing with circular variables. Please help and share your experience, thanks!
This is an excellent question.
Internally, R uses matrices to fit models to data. Instead of thinking about your model as a function of tuples, you need to generate a block matrix; this is how splines are implemented, for example.
In your example, one column contains sin(x) and the other cos(x), and the block is tagged with a class attribute; the functions that manage this object internally are makepredictcall and predict.
Any models using the standard model.frame / model.matrix processing should be compatible with this.
sincos <- function(x, period = 168/2/pi) {
  structure(cbind(`_sin` = sin(x/period),
                  `_cos` = cos(x/period)),
            class = "sincos",
            period = period)
}
Here we set the class and period as attributes.
makepredictcall.sincos <- function(var, call) {
  if (as.character(call)[1L] != "sincos")
    return(call)
  call["period"] <- attr(var, "period")
  call
}
Set the period as needed on the call.
predict.sincos <- function(object, newx, ...) {
  if (missing(newx))
    return(object)
  sincos(newx, period = attr(object, "period"))
}
Invoke our function using the period from the fit model.
Here is a short example using lm:
#FAKE DATA EXAMPLE
N <- 1000
hr <- sample(168, N, replace = TRUE)
Y = 5 + sinpi(hr * 2/168) + cospi(hr * 2/168) + rnorm(N)
lm(Y~sinpi(hr*2/168)+cospi(hr*2/168))
#>
#> Call:
#> lm(formula = Y ~ sinpi(hr * 2/168) + cospi(hr * 2/168))
#>
#> Coefficients:
#> (Intercept) sinpi(hr * 2/168) cospi(hr * 2/168)
#> 5.0078 1.0243 0.9637
Our custom function matches exactly:
lm(Y~sincos(hr))
#>
#> Call:
#> lm(formula = Y ~ sincos(hr))
#>
#> Coefficients:
#> (Intercept) sincos(hr)_sin sincos(hr)_cos
#> 5.0078 1.0243 0.9637
Other functions will also be able to tell that the two columns are a single term in the model:
anova(lm(Y~sincos(hr)))
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> sincos(hr) 2 1051.05 525.53 553.24 < 2.2e-16 ***
#> Residuals 997 947.06 0.95
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
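To see the makepredictcall/predict machinery pay off, here is a small sketch (my addition, with a new object name fit_sc): predictions on new data automatically reuse the period stored at fit time.
fit_sc <- lm(Y ~ sincos(hr))
predict(fit_sc, newdata = data.frame(hr = c(1, 84, 168)))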
I'll add this to the stackoverflow package in the next few days, should anyone else find this helpful.

Custom regression equation in R

I have a set of data in R and I want to run a regression to test for correlation using custom coefficients.
Example:
x = lm(a ~ b + c + d, data=data, weights=weights)
That gives me coefficients for b, c, and d, but I just want to give b, c, and d my own coefficients and find, for example, the r^2. How would I do so?
Let's assume your predetermined coefficients are a numeric vector named vec, whose first element is the intercept followed by the coefficients for b, c, and d, and that none of a, b, c, d are factors or character vectors:
#edit ... add a sum() function
(x = lm(a ~ 1, data = data,
        offset = apply(data[c("b", "c", "d")], 1, function(x) sum(c(1, x) * vec))))
This should produce a model that has the specified estimates. You will probably need to do this:
summary(x)
As always... if you want tested code, then provide a dataset for testing. With the mtcars dataframe:
m1 = lm(mpg ~ carb + wt, data=mtcars)
vec <- coef(m1)
(x = lm(mpg ~ 1, data = mtcars,
        offset = apply(mtcars[c("carb", "wt")], 1,
                       function(x) { sum(c(1, x) * vec) })))
Call:
lm(formula = mpg ~ 1, data = mtcars, offset = apply(mtcars[c("carb",
"wt")], 1, function(x) {
sum( c(1, x) * vec)
}))
Coefficients:
(Intercept)
-7.85e-17
So the offset model (with the coefficients used in the offset) is essentially an exact fit to the m1 model.
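As a quick check (my addition, not part of the original answer), the fitted values of the two models should agree up to numerical noise:
# sketch: fitted values of the offset model reproduce those of m1
all.equal(unname(fitted(x)), unname(fitted(m1)))  # expected to be TRUE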
@BondedDust's method will be more efficient in the long run, but just for illustration, here's a simple example of how to create your own function to calculate R-squared for any regression coefficients you choose. We'll use the mtcars data set, which is built into R.
Assume a regression model that predicts "mpg" using the independent variables "carb" and "wt". a, b, and c are the three regression parameters that we need to provide to the function.
# Function to calculate R-squared
R2 = function(a, b, c) {
  # Calculate the residual sum of squares from the regression model
  SSresid = sum(((a + b*mtcars$carb + c*mtcars$wt) - mtcars$mpg)^2)
  # Calculate the total sum of squares
  SStot = sum((mtcars$mpg - mean(mtcars$mpg))^2)
  # Calculate and return the R-squared for the regression model
  return(1 - SSresid/SStot)
}
Now let's run the function. First let's see if our function matches the R-squared calculated by lm. We'll do this by creating a regression model in R, then we'll use the coefficients from that model and calculate the R-squared using our function and see if it matches the output from lm:
# Create regression model
m1 = lm(mpg ~ carb + wt, data=mtcars)
summary(m1)
Call:
lm(formula = mpg ~ carb + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5206 -2.1223 -0.0467 1.4551 5.9736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.7300 1.7602 21.435 < 2e-16 ***
carb -0.8215 0.3492 -2.353 0.0256 *
wt -4.7646 0.5765 -8.265 4.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.839 on 29 degrees of freedom
Multiple R-squared: 0.7924, Adjusted R-squared: 0.7781
F-statistic: 55.36 on 2 and 29 DF, p-value: 1.255e-10
From the summary, we can see that the R-squared is 0.7924. Let's see what we get from the function we just created. All we need to do is feed our function the three regression coefficients listed in the summary above. We can hard-code those numbers, or we can extract the coefficients from the model object m1 (which is what I've done below):
R2(coef(m1)[1], coef(m1)[2], coef(m1)[3])
[1] 0.7924425
Now let's calculate the R-squared for other choices of the regression coefficients:
a = 37; b = -1; c = -3.5
R2(a, b, c)
[1] 0.5277607
a = 37; b = -2; c = -5
R2(a, b, c)
[1] 0.0256494
To check lots of values of a parameter at once, you can, for example, use sapply. The code below will return the R-squared for values of c ranging from -7 to -3 in increments of 0.1 (with the other two parameters set to the values returned by lm):
sapply(seq(-7,-3,0.1), function(x) R2(coef(m1)[1],coef(m1)[2],x))
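If it helps readability, the grid of c values can be attached as names to the result (a small sketch, not in the original answer):
# label each R-squared with the value of c that produced it
setNames(sapply(seq(-7, -3, 0.1), function(x) R2(coef(m1)[1], coef(m1)[2], x)),
         seq(-7, -3, 0.1))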
