Related
I have estimated the following negative binomial regression model with group fixed effects in Stata. The data are time series cross sectional. The panelvar is group and the timevar is time.
tsset group time
xtnbreg y x1 x2 x3 x4 x5, fe
I want to replicate these findings in R. To do this, I have tried these 4 models:
nb1 <- femlm(y ~ x1 + x2 + x3 + x4 + x5 | group, panel.id = ~group + time, family = "negbin", data = mydata)
nb2 <- fenegbin(y ~ x1 + x2 + x3 + x4 + x5 | group, panel.id = ~group + time, data = mydata)
nb3 <- glm.nb(y ~ x1 + x2 + x3 + x4 + x5 + factor(group), data = mydata)
nb4 <- glmmadmb(y ~ x1 + x2 + x3 + x4 + x5 + factor(group), data = mydata, family = "nbinom")
The results produced by nb1-4 are all identical, but different from the results produced by xtnbreg in Stata. The coefficients, standard errors, and p-values are all substantively different.
I have tried replicating a standard negative binomial regression in Stata and R and have been able to do so successfully.
Does anyone have any idea what's going on here? I have reviewed related posts on this forum (such as this one: is there an R function for Stata's xtnbreg?) and have not found any answers.
SOLVED (mostly): The R code that reproduces the results generated by xtnbreg, fe in Stata:
nb5 <- pglm(y ~ x1 + x2 + x3 + x4 + x5, family = negbin, data = mydata, effect = "individual", model = "within", index = "group")
I found the solution on RPubs: https://rpubs.com/cuborican/xtpoisson.
I still do not know why this works, only that it does. I suspect that Ben is correct and it has something to do with estimating conditional vs unconditional ML. If anyone knows for sure, please share.
The initial model I run is the following:
play1 <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) + (x1 + x2 + x3 + x4 + x5 + log(x6) || order), data=data.play0, family=binomial, control=glmerControl(optCtrl=list(maxfun=1e6)))
This doesn't converge; I get a warning. I rerun it with starting values, like this:
play1.b <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) + (x1 + x2 + x3 + x4 + x5 + log(x6) || order), data=data.play0, family=binomial, start=list(fixef=fixef(mm1.play0), theta=getME(mm1.play0, "theta")), control=glmerControl(optCtrl=list(maxfun=1e6)))
It still doesn't converge. The standard deviations for x3, x4 and x5 are close to zero, so I drop those random effects and continue:
play1.c <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + log(x6) || order), data=data.play0,
family=binomial, start=list(fixef=fixef(play0),
theta=getME(play0, "theta")[-c(5,7)]),
control=glmerControl(optCtrl=list(maxfun=1e6)))
Then I get the error message I described before:
Error in getStart(start, lower = rho$lower, pred = rho$pp, "theta") :
incorrect number of theta components (!=4)
I have tried different combinations for the vector passed to c(), but I still get the same message. Yes, everything is numeric.
You didn't give a reproducible example, so I made one up:
library("lme4")
set.seed(101)
dd <- as.data.frame(matrix(rnorm(3000),ncol=6,
dimnames=list(NULL,paste0("x",1:6))))
dd$order <- factor(sample(1:25,size=500,replace=TRUE))
dd$x6 <- abs(dd$x6)
form <- choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + x3 + x4 + x5 + log(x6) || order)
dd$choice <- simulate(form[-2],
newdata=dd,
newparams=list(beta=rep(1,7),
theta=rep(1,7)),
family=binomial,
weights=rep(1,500))[[1]]
Now fit some models:
mm.play1 <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + x3 + x4 + x5 + log(x6) || order),
data=dd, family=binomial,
control=glmerControl(optCtrl=list(maxfun=1e6),
optimizer="nloptwrap"))
mm.play1b <- update(mm.play1, start=list(fixef=fixef(mm.play1),
                                         theta=getME(mm.play1, "theta")))
I tweaked your code a little bit (I used the "nloptwrap" optimizer for a little extra speed):
th <- getME(mm.play1, "theta")[c(1,2,3,7)]
mm.play1c <- update(mm.play1,
. ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + log(x6) || order),
start=list(fixef=fixef(mm.play1),
theta=th))
This works (no time to explain further at the moment, but basically you have to make sure the vector lengths match ...)
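To make the length-matching logic concrete, here is a base-R sketch. The theta names below are hypothetical (invented for the example); in a real fit you would take them from names(getME(model, "theta")). The point is that selecting components by name is safer than hard-coding positions:

```r
# Hypothetical named theta vector from a full model with 7 random-effect terms;
# in practice: th_full <- getME(model, "theta")
th_full <- c("order.(Intercept)" = 0.9, "order.x1" = 0.5, "order.x2" = 0.4,
             "order.x3" = 0.01, "order.x4" = 0.02, "order.x5" = 0.01,
             "order.log(x6)" = 0.7)
# The reduced model keeps 4 random-effect terms (intercept, x1, x2, log(x6)),
# so its start vector must have exactly 4 components; select them by name
keep <- c("order.(Intercept)", "order.x1", "order.x2", "order.log(x6)")
th_start <- th_full[keep]
length(th_start)  # 4, matching the reduced model
```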
I want to run a linear regression for the same outcome on a number of covariates, minus one covariate in each model. I have looked at the example on this page, but it did not provide what I wanted.
Sample data
a <- data.frame(y = c(30,12,18), x1 = c(7,6,9), x2 = c(6,8,5),
x3 = c(4,-2,-3), x4 = c(8,3,-3), x5 = c(4,-4,-2))
m1 <- lm(y ~ x1 + x4 + x5, data = a)
m2 <- lm(y ~ x2 + x4 + x5, data = a)
m3 <- lm(y ~ x3 + x4 + x5, data = a)
How could I run these models in a shorter way, without repeating the same covariates again and again?
Following this example you could do this:
lapply(1:3, function(i) {
  lm(as.formula(sprintf("y ~ x%i + x4 + x5", i)), a)
})
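An equivalent sketch using reformulate() avoids building the formula string by hand; it takes the term labels and the response as character vectors:

```r
# Data frame from the question
a <- data.frame(y = c(30, 12, 18), x1 = c(7, 6, 9), x2 = c(6, 8, 5),
                x3 = c(4, -2, -3), x4 = c(8, 3, -3), x5 = c(4, -4, -2))
# Build one formula per focal covariate, always keeping x4 and x5
fits <- lapply(paste0("x", 1:3), function(v) {
  lm(reformulate(c(v, "x4", "x5"), response = "y"), data = a)
})
formula(fits[[1]])  # y ~ x1 + x4 + x5
```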
Suppose I have a response variable and a data containing three covariates (as a toy example):
y = c(1,4,6)
d = data.frame(x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
I want to fit a linear regression to the data:
fit = lm(y ~ d$x1 + d$x2 + d$x3)
Is there a way to write the formula, so that I don't have to write out each individual covariate? For example, something like
fit = lm(y ~ d)
(I want each variable in the data frame to be a covariate.) I'm asking because I actually have 50 variables in my data frame, so I want to avoid writing out x1 + x2 + x3 + etc.
There is a special identifier that one can use in a formula to mean all the variables: the . identifier.
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
You can also do things like this, to use all variables but one (in this case x3 is excluded):
mod <- lm(y ~ . - x3, data = d)
Technically, . means all variables not already mentioned in the formula. For example
lm(y ~ x1 * x2 + ., data = d)
where . would only reference x3 as x1 and x2 are already in the formula.
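If you want to see which terms the . will expand to before fitting anything, you can ask terms() to expand the formula against the data frame (a small base-R check using the toy data above):

```r
# Toy data as above, with y included in the data frame
d <- data.frame(y = c(1, 4, 6),
                x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))
# terms() expands the dot against the columns of d, excluding the response
tt <- terms(y ~ ., data = d)
attr(tt, "term.labels")
# "x1" "x2" "x3"
```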
A slightly different approach is to create your formula from a string. In the formula help page you will find the following example:
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
Then if you look at the generated formula, you will get :
R> fmla
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
Yes, of course: just add the response y as the first column of the data frame and call lm() on it:
d2 <- data.frame(y, d)
> d2
y x1 x2 x3
1 1 4 3 4
2 4 -1 9 -4
3 6 3 8 -2
> lm(d2)
Call:
lm(formula = d2)
Coefficients:
(Intercept) x1 x2 x3
-5.6316 0.7895 1.1579 NA
Note the NA coefficient for x3: with only three observations, there are not enough degrees of freedom to estimate all four parameters. Also, assignment with <- is generally recommended over = in R.
An extension of juba's method is to use reformulate, a function which is explicitly designed for such a task.
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
reformulate(xnam, "y")
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
For the example in the OP, the easiest solution here would be
# add y variable to data.frame d
d <- cbind(y, d)
reformulate(names(d)[-1], names(d)[1])
y ~ x1 + x2 + x3
or
mod <- lm(reformulate(names(d)[-1], names(d)[1]), data = d)
Note that adding the dependent variable to the data.frame in d <- cbind(y, d) is preferred not only because it allows for the use of reformulate, but also because it allows for future use of the lm object in functions like predict.
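For example, a minimal sketch with the toy data from the question (note that with only three observations the x3 coefficient is NA, so predict() will warn about a rank-deficient fit; with realistic data it would not):

```r
y <- c(1, 4, 6)
d <- data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))
d <- cbind(y, d)
mod <- lm(reformulate(names(d)[-1], names(d)[1]), data = d)
# The formula and data travel with the model object, so predict()
# can be applied directly to new data:
pred <- predict(mod, newdata = data.frame(x1 = 0, x2 = 1, x3 = 2))
```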
I built this solution because reformulate does not handle variable names that contain whitespace.
add_backticks = function(x) {
  paste0("`", x, "`")
}

x_lm_formula = function(x) {
  paste(add_backticks(x), collapse = " + ")
}

build_lm_formula = function(x, y) {
  if (length(y) > 1) {
    stop("y needs to be just one variable")
  }
  as.formula(
    paste0("`", y, "`", " ~ ", x_lm_formula(x))
  )
}
# Example
df <- data.frame(
y = c(1,4,6),
x1 = c(4,-1,3),
x2 = c(3,9,8),
x3 = c(4,-4,-2)
)
# Model Specification
columns = colnames(df)
y_cols = columns[1]
x_cols = columns[2:length(columns)]
formula = build_lm_formula(x_cols, y_cols)
formula
# output
# `y` ~ `x1` + `x2` + `x3`
# Run Model
lm(formula = formula, data = df)
# output
Call:
lm(formula = formula, data = df)
Coefficients:
(Intercept) x1 x2 x3
-5.6316 0.7895 1.1579 NA
You can also check the package leaps, and in particular its regsubsets() function, for model selection. As stated in the documentation:
Model selection by exhaustive search, forward or backward stepwise, or sequential replacement