Error in glmer incorrect number of components - r

The initial model I run is the following:
play1 <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) + (x1 + x2 + x3 + x4 + x5 + log(x6) || order), data=data.play0, family=binomial, control=glmerControl(optCtrl=list(maxfun=1e6)))
This doesn't converge; I get a warning. I then rerun it with starting values taken from a previous fit:
play1.b <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) + (x1 + x2 + x3 + x4 + x5 + log(x6) || order), data=data.play0, family=binomial, start=list(fixef=fixef(mm1.play0), theta=getME(mm1.play0, "theta")), control=glmerControl(optCtrl=list(maxfun=1e6)))
It still doesn't converge. The standard deviations of the random effects for x3, x4 and x5 are close to zero, so I drop those random effects and continue:
play1.c <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + log(x6) || order), data=data.play0,
family=binomial, start=list(fixef=fixef(play0),
theta=getME(play0, "theta")[-c(5,7)]),
control=glmerControl(optCtrl=list(maxfun=1e6)))
Then I get the following error message:
Error in getStart(start, lower = rho$lower, pred = rho$pp, "theta") :
incorrect number of theta components (!=4)
I tried different index combinations in the c() vector, but I still get the same message. All the variables are numeric.

You didn't give a reproducible example, so I made one up:
library("lme4")
set.seed(101)
dd <- as.data.frame(matrix(rnorm(3000),ncol=6,
dimnames=list(NULL,paste0("x",1:6))))
dd$order <- factor(sample(1:25,size=500,replace=TRUE))
dd$x6 <- abs(dd$x6)
form <- choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + x3 + x4 + x5 + log(x6) || order)
dd$choice <- simulate(form[-2],
newdata=dd,
newparams=list(beta=rep(1,7),
theta=rep(1,7)),
family=binomial,
weights=rep(1,500))[[1]]
Now fit some models:
mm.play1 <- glmer(choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + x3 + x4 + x5 + log(x6) || order),
data=dd, family=binomial,
control=glmerControl(optCtrl=list(maxfun=1e6),
optimizer="nloptwrap"))
mm.play1b <- update(mm.play1,start=list(fixef=fixef(mm.play1),
theta=getME(mm.play1, "theta")))
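I tweaked your code a little (I used the "nloptwrap" optimizer for a bit of extra speed). For the reduced model, keep only the theta components of the random effects that remain (intercept, x1, x2 and log(x6), i.e. positions 1, 2, 3 and 7):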
th <- getME(mm.play1, "theta")[c(1,2,3,7)]
mm.play1c <- update(mm.play1,
. ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
(x1 + x2 + log(x6) || order),
start=list(fixef=fixef(mm.play1),
theta=th))
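This works. No time to explain further at the moment, but basically you have to make sure the vector lengths match: the reduced model has four random-effect terms for order (intercept, x1, x2 and log(x6)), so the theta starting vector must have exactly four components, which is what the (!=4) in the error message refers to.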
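If it helps, one way to check how many theta components a model expects before fitting it is to build the model structure with glFormula() and count the initial theta values. This is only a sketch on made-up data: dd is simulated along the lines of the example above, the random 0/1 response is a placeholder, and n_theta() is a helper name introduced here purely for illustration.

```r
library(lme4)

# Made-up data, similar to the reproducible example above
set.seed(101)
dd <- as.data.frame(matrix(rnorm(3000), ncol = 6,
                           dimnames = list(NULL, paste0("x", 1:6))))
dd$order  <- factor(sample(1:25, size = 500, replace = TRUE))
dd$x6     <- abs(dd$x6)
dd$choice <- rbinom(500, size = 1, prob = 0.5)  # placeholder response, not fitted

full    <- choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
  (x1 + x2 + x3 + x4 + x5 + log(x6) || order)
reduced <- choice ~ x1 + x2 + x3 + x4 + x5 + log(x6) +
  (x1 + x2 + log(x6) || order)

# Hypothetical helper: number of theta parameters implied by a formula.
# glFormula() only parses the formula and data; it does not fit the model.
n_theta <- function(f) {
  length(glFormula(f, data = dd, family = binomial)$reTrms$theta)
}

n_theta(full)     # 7 components
n_theta(reduced)  # 4 components
```

Whatever you pass as start$theta for the reduced model should have the second length.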

Related

How can I see the effect sizes of my predictor variables on both my dependent variables in a system of equations?

I originally had two formulas that were expected to have the same predictor variables (X-variables) and the same control variables (C-variables):
Y1 = X1 + X2 + X3 + C1 + C2 + C3
Y2 = X1 + X2 + X3 + C1 + C2 + C3
However, Y1 could also be used as a predictor in the regression for Y2. So my new formulas became:
Y1 = X1 + X2 + X3 + C1 + C2 + C3
Y2 = Y1 + X1 + X2 + X3 + C1 + C2 + C3
I tried to model this as a system of equations in R. I understand that "X1 + X2 + X3 + C1 + C2 + C3" should not be in the formula for Y2 as they are already in the formula for Y1. The problem is that I am interested in the effect of the predictor variables on Y1 as well as on Y2 (in order to reject or support my hypotheses), but the output from my system of equations only gives the effect of Y1 on Y2. How can I account for the effect of Y1 on Y2 and still see the effects of X1, X2 and X3 in both equations?
I have used the following formula:
model_1 <- ivreg(Y2 ~ Y1 + C1
|. - Y1 + X1 + X2 + X3 + C2 + C3, data = df)
This gives the following output:

avoid repeatedly writing model formula when fitting a number of linear regression models

I'd like to run a number of similar linear regression models in R, such as
lm(y ~ x1 + x2 + x3 + x4 + x5, data = df)
lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df)
lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = df)
How can I assign part of this to a "base" formula, to avoid repeating it many times? This would be the base:
y ~ x1 + x2 + x3 + x4 + x5
Then how can I do something like the following (obviously not working)?
lm(base + x6, data = df)
Searching on Stack Overflow I realized that I could make a data frame with only variables of interest and use . to shorten the model formula, but I wonder if this could be avoided.
You can update a model formula with update.formula. For example:
base <- y ~ x1 + x2 + x3 + x4 + x5
update.formula(base, . ~ . + x6)
#y ~ x1 + x2 + x3 + x4 + x5 + x6
Here is a string-based version if you want to provide the new variable name as a character string:
## `deparse` turns a model formula into a string
formula(paste(deparse(base), "x6", sep = " + "))
In fact, you can even update your model directly
fit <- lm(base, data = df); update.default(fit, . ~ . + x6)
This idea that updates the whole model worked the best. Only update() was needed in my case.
I wrote update.default and update.formula explicitly so that you know which help pages to look up with ? for the documentation.
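To tie this back to the three models in the question, here is a minimal sketch that builds each formula from the base with update() and fits them in one go. The data frame is simulated only so the example runs on its own; the names forms and fits are made up for illustration.

```r
# Simulated stand-in for the asker's data frame (assumption: numeric columns x1..x7 and y)
set.seed(1)
df <- as.data.frame(matrix(rnorm(700), ncol = 7,
                           dimnames = list(NULL, paste0("x", 1:7))))
df$y <- rnorm(100)

base <- y ~ x1 + x2 + x3 + x4 + x5

# Extend the base formula instead of retyping it
forms <- list(
  base,
  update(base, . ~ . + x6),
  update(base, . ~ . + x6 + x7)
)

# Fit one lm per formula
fits <- lapply(forms, lm, data = df)

# Number of coefficients per model (intercept included)
sapply(fits, function(f) length(coef(f)))
```

Keeping the formulas in a list also makes it easy to compare the fits afterwards, for example with anova(fits[[1]], fits[[2]], fits[[3]]).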

How to run the pgmm command?

Please see a sample of my data, and my pgmm code, and let me know if I am using the correct syntax.
Y1 is my dependent variable, and X* with C* variables are my independent and control variables. I am trying to run the dynamic GMM model with 2 year lags, but this is the first time that I am using PGMM and I am not sure if this is the correct syntax.
Sample Data
I am trying to run the pgmm command below:
country <- pdata.frame(country, index = c('Co_Code', 'YEAR'))
model.gmm <- Y1 ~ lag(X1, 2) + lag(X2, 2) + lag(X3, 2) + lag(X7, 2) +
lag(X6, 2) + lag(X4, 2) + lag(X5, 2) + lag(X8, 2) + lag(X9, 2) +
lag(X10, 2) + lag(C1, 2) + lag(C2, 2) + lag(C3, 2) + lag(C6, 2) + lag(C7, 2)
gmm.form = update.formula(model.gmm, . ~ . | lag(Y1, 2))
gmm.form[[3]] <- gmm.form[[3]][[2]]
gmm.fit <- pgmm(gmm.form, data = country, effect = "twoways", model =
"twosteps")
summary(gmm.fit)
Edit: I've also generated the code below:
gmm.fit <- pgmm(Y1 ~ X1 + X2 + X3 + X6 + X7 + X4 + X5 + X8 + X9 + X10 +
C1 + C2 + C3 + C6 |lag(X1, 2) + lag(X2, 2) + lag(X3, 2) + lag(X7, 2) +
lag(X6, 2) + lag(X4, 2) + lag(X5, 2) + lag(X8, 2) + lag(X9, 2) +
lag(X10, 2) + lag(C1, 2) + lag(C2, 2) + lag(C3, 2) + lag(C6, 2), data =
country, effect = "twoways", model = "twosteps")
Yes, your updated version appears correct for what you describe. You may prefer using dynformula; the basic structure is:
gmm.form <- dynformula(Y1~ X + C, lag.form=list(2,2,2))
And this easily generalises for multiple X and C:
gmm.form <- dynformula(Y1~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 +X10 + C1 + C2
+ C3 + C4 + C5 +C6, lag.form=list(rep(2,17)))
This command means you will be including up to and including 2 lags for all the variables (noting that the first in the lag.form list above is Y1 - dynformula will automatically put the lags of Y1 on the right hand side of the equation).
[Edit: I note you haven't specified instruments. Seeing your data, for standard dynamic panel approach of lagged Y, I'd put gmm.inst=~Y1,gmm.lag=list(c(3,99))]

concise way of making an R formula [duplicate]

Suppose I have a response variable and a data containing three covariates (as a toy example):
y = c(1,4,6)
d = data.frame(x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
I want to fit a linear regression to the data:
fit = lm(y ~ d$x1 + d$x2 + d$x3)
Is there a way to write the formula, so that I don't have to write out each individual covariate? For example, something like
fit = lm(y ~ d)
(I want each variable in the data frame to be a covariate.) I'm asking because I actually have 50 variables in my data frame, so I want to avoid writing out x1 + x2 + x3 + etc.
There is a special identifier that one can use in a formula to mean all the variables: the . identifier.
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
You can also do things like this, to use all variables but one (in this case x3 is excluded):
mod <- lm(y ~ . - x3, data = d)
Technically, . means all variables not already mentioned in the formula. For example
lm(y ~ x1 * x2 + ., data = d)
where . would only reference x3 as x1 and x2 are already in the formula.
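If you want to see what . expands to without fitting anything, one option is to ask terms() for the term labels. A small sketch using the toy data from above (the labels_* variable names are only for illustration):

```r
d <- data.frame(y = c(1, 4, 6), x1 = c(4, -1, 3),
                x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# terms() needs the data argument in order to expand the dot
labels_all  <- attr(terms(y ~ ., data = d), "term.labels")
labels_drop <- attr(terms(y ~ . - x3, data = d), "term.labels")
labels_int  <- attr(terms(y ~ x1 * x2 + ., data = d), "term.labels")

labels_all   # every column except the response
labels_drop  # the same, with x3 removed
labels_int   # main effects of x1, x2, x3 plus the x1:x2 interaction
```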
A slightly different approach is to create your formula from a string. In the formula help page you will find the following example :
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
Then if you look at the generated formula, you will get :
R> fmla
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
Yes of course, just add the response y as first column in the dataframe and call lm() on it:
d2<-data.frame(y,d)
> d2
y x1 x2 x3
1 1 4 3 4
2 4 -1 9 -4
3 6 3 8 -2
> lm(d2)
Call:
lm(formula = d2)
Coefficients:
(Intercept) x1 x2 x3
-5.6316 0.7895 1.1579 NA
Also, as far as I know, assignment with <- is recommended over = in R.
An extension of juba's method is to use reformulate, a function which is explicitly designed for such a task.
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
reformulate(xnam, "y")
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
For the example in the OP, the easiest solution here would be
# add y variable to data.frame d
d <- cbind(y, d)
reformulate(names(d)[-1], names(d[1]))
y ~ x1 + x2 + x3
or
mod <- lm(reformulate(names(d)[-1], names(d[1])), data=d)
Note that adding the dependent variable to the data.frame in d <- cbind(y, d) is preferred not only because it allows for the use of reformulate, but also because it allows for future use of the lm object in functions like predict.
I built this solution because reformulate does not handle variable names that contain white space unless they are wrapped in backticks.
add_backticks = function(x) {
paste0("`", x, "`")
}
x_lm_formula = function(x) {
paste(add_backticks(x), collapse = " + ")
}
build_lm_formula = function(x, y){
if (length(y)>1){
stop("y needs to be just one variable")
}
as.formula(
paste0("`",y,"`", " ~ ", x_lm_formula(x))
)
}
# Example
df <- data.frame(
y = c(1,4,6),
x1 = c(4,-1,3),
x2 = c(3,9,8),
x3 = c(4,-4,-2)
)
# Model Specification
columns = colnames(df)
y_cols = columns[1]
x_cols = columns[2:length(columns)]
formula = build_lm_formula(x_cols, y_cols)
formula
# output
# `y` ~ `x1` + `x2` + `x3`
# Run Model
lm(formula = formula, data = df)
# output
# Call:
# lm(formula = formula, data = df)
# Coefficients:
# (Intercept)           x1           x2           x3
#     -5.6316       0.7895       1.1579           NA
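As a possible shortcut, the backticks can also be added to the term labels first and then handed to reformulate(). This is just a sketch on a made-up data frame whose column names contain spaces; term_labels is an illustrative name.

```r
# Made-up data with non-syntactic column names (check.names = FALSE keeps the spaces)
df <- data.frame(y = c(1, 4, 6, 3, 5),
                 `x 1` = c(4, -1, 3, 2, 0),
                 `x 2` = c(3, 9, 8, 1, 4),
                 check.names = FALSE)

# Wrap every predictor name in backticks before building the formula
term_labels <- sprintf("`%s`", setdiff(names(df), "y"))
f <- reformulate(term_labels, response = "y")
f

fit <- lm(f, data = df)
coef(fit)
```

The design choice here is simply to quote the names up front, so the formula is still built by reformulate() rather than by pasting the whole string together by hand.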
You can also check the leaps package, in particular the regsubsets() function, for model selection. As stated in the documentation:
Model selection by exhaustive search, forward or backward stepwise, or sequential replacement
