Fitting a multivariate polynomial of generic degree in R without having to write the explicit formula - r

I would like to fit a multivariate polynomial of arbitrary degree, in an arbitrary number of variables, to some data. The number of variables can be high (for example 40) and the code should work for different numbers of variables (e.g., 10, 20, 40, etc.), so it's not possible to write out the formula explicitly. For a degree 1 polynomial (i.e., the classic linear model), the solution is trivial: suppose I have my data in the dataframe df, then
mymodel <- lm(y ~ ., data = df)
Unfortunately I don't know of a similar compact formula when the polynomial is of arbitrary degree. Can you help me?

This combines both options from my earlier posting (interactions and polynomial terms) in a hypothetical situation where the column names look like "X1", "X2", ...., "X30". You would take out the terms() call which is just in there to demonstrate that it was successful:
terms( as.formula(
paste(" ~ (", paste0("X", 1:30 , collapse="+"), ")^2", "+",
paste( "poly(", paste0("X", 1:30), ", degree=2)",
collapse="+"),
collapse="")
) )
You could use an expression like names(dfrm)[!names(dfrm) %in% "y"] instead of the inner paste0 calls.
Note that the interaction terms are constructed by the R formula machinery through the (...)^2 mechanism, which does not create squared terms but rather all of the two-way interactions:
as.formula(
paste(" ~ (", paste0("X", 1:30 , collapse="+"), ")^2", "+", paste( "poly(", paste0("X", 1:30), ", degree=2)", collapse="+"), collapse="")
)
#----output----
~(X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 +
X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20 + X21 + X22 +
X23 + X24 + X25 + X26 + X27 + X28 + X29 + X30)^2 + poly(X1,
degree = 2) + poly(X2, degree = 2) +
poly(X3, degree = 2) +
poly(X4, degree = 2) + poly(X5, degree = 2) + poly(X6, degree = 2) +
poly(X7, degree = 2) + poly(X8, degree = 2) + poly(X9, degree = 2) +
poly(X10, degree = 2) + poly(X11, degree = 2) + poly(X12,
degree = 2) + poly(X13, degree = 2) + poly(X14, degree = 2) +
poly(X15, degree = 2) + poly(X16, degree = 2) + poly(X17,
degree = 2) + poly(X18, degree = 2) + poly(X19, degree = 2) +
poly(X20, degree = 2) + poly(X21, degree = 2) + poly(X22,
degree = 2) + poly(X23, degree = 2) + poly(X24, degree = 2) +
poly(X25, degree = 2) + poly(X26, degree = 2) + poly(X27,
degree = 2) + poly(X28, degree = 2) + poly(X29, degree = 2) +
poly(X30, degree = 2)
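As a minimal sketch of the names()-based variant mentioned above, the helper below builds the same kind of formula from whatever columns a data frame has. The function name make_quadratic_formula, its response argument, and the toy data frame dfrm are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): two-way interactions
# plus a poly() term per predictor, built from the data frame's column names.
make_quadratic_formula <- function(data, response = "y", degree = 2) {
  preds <- setdiff(names(data), response)
  inter <- paste0("(", paste(preds, collapse = " + "), ")^2")
  polys <- paste0("poly(", preds, ", degree = ", degree, ")")
  as.formula(paste(response, "~", paste(c(inter, polys), collapse = " + ")))
}

# Toy data, assumed only for this example
set.seed(42)
dfrm <- data.frame(y = rnorm(50), X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50))

f <- make_quadratic_formula(dfrm)

# Inspect the expanded terms: main effects, poly() terms, two-way interactions
term_labels <- attr(terms(f), "term.labels")
term_labels
```

One thing to keep in mind with this construction: X1 and the first column of poly(X1, degree = 2) carry the same linear information, so lm() can be expected to report NA for one coefficient of each such pair.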

You can use this function, makepoly, which generates a formula with polynomial terms based on a formula and a data frame.
makepoly <- function(form, data, degree = 1) {
  # Expand the formula (including ".") against the data to get the predictor names
  mt <- terms(form, data = data)
  tl <- attr(mt, "term.labels")
  # Wrap each predictor in poly() and reattach the original response
  reformulate(paste0("poly(", tl, ", ", degree, ")"),
              response = form[[2]])
}
A test data set:
set.seed(1)
df <- data.frame(y = rnorm(10),
x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
Create the formula and run the regression:
form <- makepoly(y ~ ., df, degree = 2)
# y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2)
lm(form, df)
#
# Call:
# lm(formula = form, data = df)
#
# Coefficients:
#  (Intercept)  poly(x1, 2)1  poly(x1, 2)2  poly(x2, 2)1
#       0.1322        0.1445       -5.5757       -5.2132
# poly(x2, 2)2  poly(x3, 2)1  poly(x3, 2)2
#       4.2297        0.7895        3.9796
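makepoly applies poly() to each predictor separately, so the resulting model has no cross terms such as x1*x2. If a full polynomial in all variables is wanted, one possible approach (an alternative, not what the answer above does) is to pass a matrix to poly(), which then builds all monomials up to the given total degree. The data and the names X, basis and fit below are made up for illustration.

```r
# Toy data, assumed only for this sketch
set.seed(1)
df <- data.frame(y = rnorm(50),
                 x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

# Predictor matrix: every column except the response
X <- as.matrix(df[setdiff(names(df), "y")])

# With a matrix argument, poly() returns all terms up to total degree 2
# (x1, x1^2, x2, x1*x2, ...); raw = TRUE uses plain powers
basis <- poly(X, degree = 2, raw = TRUE)
ncol(basis)  # choose(3 + 2, 2) - 1 = 9 columns for 3 variables

fit <- lm(df$y ~ basis)
length(coef(fit))  # intercept + 9 polynomial terms
```

The number of columns grows quickly: with 40 variables and degree 2 it is choose(42, 2) - 1 = 860, so this is only practical when there are many more observations than that.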

Related

How long should a Monte Carlo bootstrap power analysis simulation in R take? Is it potentially hours? (1000 reps, 1000 bootstraps)

I am using a Monte Carlo simulation to run a power analysis for a longitudinal mediation model. I'm using the power.boot function from the bmem package (lavaan).
I checked the code with only 5 reps/5 bootstrap to make sure it worked and it did.
Then I ran the code with 1000 reps, 1000 bootstrap as the package documentation recommends.
It's been over an hour now and it's still running - is this normal? How long is too long?
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-power.boot(powermodel1, indirect, N, nrep=1000, nboot=1000, parallel = 'multicore'))
summary(bootstrap)
Unfortunately it looks like it will take a while. Extrapolating from timings of smaller runs, I estimate about 8 hours on my system:
library(bmem)
powermodel1 <-'
x2 ~ start(.6)*x1 + x*x1
x3 ~ start(.6)*x2 + x*x2
m2 ~ start(.15)*x1 + a*x1 + start(.3)*m1 + m*m1
m3 ~ start(.15)*x2 + a*x2 + start(.3)*m2 + m*m2
y2 ~ start(.5)*m1 + b*m2 + start(.3)*y1 + y*y1
y3 ~ start(.5)*m2 + b*m2 + start(.3)*y2 + y*y2 + start(0.05)*x1 + c*x1
x1 ~~ start(.15)*m1
x1 ~~ start(.15)*y1
y1 ~~ start(.5)*m1
'
indirect <- 'ab:=a*b'
N<-200
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 10, nboot = 10, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 30, nboot = 30, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 60, nboot = 60, parallel = 'multicore'))
system.time(bootstrap<-bmem::power.boot(powermodel1, indirect, N, nrep = 100, nboot = 100, parallel = 'multicore'))
library(tidyverse)
# Load the times from above into a dataframe
benchmark <- tibble(bootstraps = c(10, 30, 60, 100),
times = c(4.021, 30.122, 121.103, 311.236))
# Plot the points and fit a curve
ggplot(benchmark, aes(x = bootstraps, y = times)) +
geom_point() +
geom_smooth(se = FALSE, span = 5)
# Fit a model
fit <- lm(times ~ poly(bootstraps, 2, raw = TRUE), data = benchmark)
newtimes <- data.frame(bootstraps = seq(100, 1000, length = 4))
# Predict the time it will take for larger bootstrap/rep values
predict(fit, newdata = newtimes)
> 1 2 3 4
> 311.6829 4568.3812 13789.6754 27975.5655
# Convert from seconds to hours
print(27975.5655/60/60)
>[1] 7.77099
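As a rough cross-check on the extrapolation, assuming the total work grows in proportion to nrep × nboot (an assumption about how the function scales, not a documented property of bmem), the measured 100 × 100 timing can be scaled directly. The variable names here are illustrative.

```r
# Measured above: 100 reps x 100 bootstraps took about 311 seconds
t_100 <- 311.236

# Assumption: run time is proportional to nrep * nboot
scale_factor <- (1000 * 1000) / (100 * 100)  # 100 times more work

# Estimated run time for nrep = 1000, nboot = 1000, in hours
est_hours <- t_100 * scale_factor / 3600
est_hours
```

This lands in the same range as the quadratic fit (between 8 and 9 hours instead of just under 8), which suggests that a run lasting over an hour is nothing unusual here.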

How to run the pgmm command?

Please see a sample of my data, and my pgmm code, and let me know if I am using the correct syntax.
Y1 is my dependent variable, and X* with C* variables are my independent and control variables. I am trying to run the dynamic GMM model with 2 year lags, but this is the first time that I am using PGMM and I am not sure if this is the correct syntax.
Sample Data
I am trying to run the pgmm command below:
country <- pdata.frame(country, index = c('Co_Code', 'YEAR'))
model.gmm <- Y1 ~ lag(X1, 2) + lag(X2, 2) + lag(X3, 2) + lag(X7, 2) +
lag(X6, 2) + lag(X4, 2) + lag(X5, 2) + lag(X8, 2) + lag(X9, 2) +
lag(X10, 2) + lag(C1, 2) + lag(C2, 2) + lag(C3, 2) + lag(C6, 2) + lag(C7, 2)
gmm.form = update.formula(model.gmm, . ~ . | lag(Y1, 2))
gmm.form[[3]] <- gmm.form[[3]][[2]]
gmm.fit <- pgmm(gmm.form, data = country, effect = "twoways", model =
"twosteps")
summary(gmm.fit)
Edit: I've also generated the code below:
gmm.fit <- pgmm(Y1 ~ X1 + X2 + X3 + X6 + X7 + X4 + X5 + X8 + X9 + X10 +
C1 + C2 + C3 + C6 |lag(X1, 2) + lag(X2, 2) + lag(X3, 2) + lag(X7, 2) +
lag(X6, 2) + lag(X4, 2) + lag(X5, 2) + lag(X8, 2) + lag(X9, 2) +
lag(X10, 2) + lag(C1, 2) + lag(C2, 2) + lag(C3, 2) + lag(C6, 2), data =
country, effect = "twoways", model = "twosteps")
Yes, your updated version appears correct for what you describe. You may prefer using dynformula; the basic structure is:
gmm.form <- dynformula(Y1~ X + C, lag.form=list(2,2,2))
And this easily generalises for multiple X and C:
gmm.form <- dynformula(Y1~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 +X10 + C1 + C2
+ C3 + C4 + C5 +C6, lag.form=list(rep(2,17)))
This command includes lags up to and including 2 for all the variables (note that the first entry in the lag.form list above is for Y1; dynformula automatically puts the lags of Y1 on the right-hand side of the equation).
[Edit: I note you haven't specified instruments. Seeing your data, for the standard dynamic panel approach with lagged Y as instruments, I'd put gmm.inst=~Y1,gmm.lag=list(c(3,99))]
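For reference, here is a sketch of the multi-part formula interface, using the EmplUK data set that ships with plm rather than the asker's data (it follows the Arellano and Bond example from the pgmm help page). The part before | holds the regressors, including lags of the dependent variable; the part after | names the GMM instruments.

```r
library(plm)

# Example panel data shipped with plm (firm-level UK employment data)
data("EmplUK", package = "plm")

# Regressors before "|", GMM instruments (lags 2 to 99 of log(emp)) after it
z1 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1)
           + log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
           data = EmplUK, effect = "twoways", model = "twosteps")

summary(z1, robust = FALSE)
```

In terms of the question, this would mean lagged Y1 terms appear among the regressors and a lag range of Y1 appears after |, but which lags are appropriate depends on the data and the model being estimated.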

Sending variable name vector to glm inside R function [duplicate]

Suppose I have a response variable and a data frame containing three covariates (as a toy example):
y = c(1,4,6)
d = data.frame(x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
I want to fit a linear regression to the data:
fit = lm(y ~ d$x1 + d$x2 + d$x3)
Is there a way to write the formula, so that I don't have to write out each individual covariate? For example, something like
fit = lm(y ~ d)
(I want each variable in the data frame to be a covariate.) I'm asking because I actually have 50 variables in my data frame, so I want to avoid writing out x1 + x2 + x3 + etc.
There is a special identifier that one can use in a formula to mean all the variables: the . identifier.
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
You can also do things like this, to use all variables but one (in this case x3 is excluded):
mod <- lm(y ~ . - x3, data = d)
Technically, . means all variables not already mentioned in the formula. For example
lm(y ~ x1 * x2 + ., data = d)
where . would only reference x3 as x1 and x2 are already in the formula.
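To see what . expands to without fitting anything, terms() can be called with the data frame. This is a small sketch using the toy data from above; the variable names all_terms, drop_x3 and with_inter are only illustrative.

```r
d <- data.frame(y = c(1, 4, 6),
                x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# terms() expands "." against the data, which shows what lm() will fit
all_terms  <- attr(terms(y ~ ., data = d), "term.labels")
drop_x3    <- attr(terms(y ~ . - x3, data = d), "term.labels")
with_inter <- attr(terms(y ~ x1 * x2 + ., data = d), "term.labels")

all_terms   # "x1" "x2" "x3"
drop_x3     # "x1" "x2"
with_inter  # x1, x2, x3 and the x1:x2 interaction
```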
A slightly different approach is to create your formula from a string. In the formula help page you will find the following example:
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
Then if you look at the generated formula, you will get:
R> fmla
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
Yes of course, just add the response y as first column in the dataframe and call lm() on it:
d2 <- data.frame(y, d)
> d2
  y x1 x2 x3
1 1  4  3  4
2 4 -1  9 -4
3 6  3  8 -2
> lm(d2)
Call:
lm(formula = d2)
Coefficients:
(Intercept)           x1           x2           x3
    -5.6316       0.7895       1.1579           NA
Also, my information about R points out that assignment with <- is recommended over =.
An extension of juba's method is to use reformulate, a function which is explicitly designed for such a task.
## Create a formula for a model with a large number of variables:
xnam <- paste("x", 1:25, sep="")
reformulate(xnam, "y")
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
x22 + x23 + x24 + x25
For the example in the OP, the easiest solution here would be
# add y variable to data.frame d
d <- cbind(y, d)
reformulate(names(d)[-1], names(d[1]))
y ~ x1 + x2 + x3
or
mod <- lm(reformulate(names(d)[-1], names(d[1])), data=d)
Note that adding the dependent variable to the data.frame in d <- cbind(y, d) is preferred not only because it allows for the use of reformulate, but also because it allows for future use of the lm object in functions like predict.
I built this solution because reformulate does not handle variable names that contain white space.
add_backticks = function(x) {
paste0("`", x, "`")
}
x_lm_formula = function(x) {
paste(add_backticks(x), collapse = " + ")
}
build_lm_formula = function(x, y){
if (length(y)>1){
stop("y needs to be just one variable")
}
as.formula(
paste0("`",y,"`", " ~ ", x_lm_formula(x))
)
}
# Example
df <- data.frame(
y = c(1,4,6),
x1 = c(4,-1,3),
x2 = c(3,9,8),
x3 = c(4,-4,-2)
)
# Model Specification
columns = colnames(df)
y_cols = columns[1]
x_cols = columns[2:length(columns)]
formula = build_lm_formula(x_cols, y_cols)
formula
# output (a formula object; backticks are dropped for syntactic names)
# y ~ x1 + x2 + x3
# Run Model
lm(formula = formula, data = df)
# output
# Call:
# lm(formula = formula, data = df)
#
# Coefficients:
# (Intercept)           x1           x2           x3
#     -5.6316       0.7895       1.1579           NA
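The example above uses ordinary names, where the backticks make no difference. Here is a minimal sketch of the case they are meant for, with made-up column names that contain spaces; the survey data frame and its columns are hypothetical, and the helper is a compact restatement of the one above so that the block runs on its own.

```r
# Compact version of the helper above, redefined so the example is self-contained
build_lm_formula <- function(x, y) {
  stopifnot(length(y) == 1)
  as.formula(paste0("`", y, "` ~ ", paste0("`", x, "`", collapse = " + ")))
}

# Hypothetical data with spaces in the column names
set.seed(1)
survey <- data.frame(rnorm(20), rnorm(20), rnorm(20), check.names = FALSE)
names(survey) <- c("test score", "hours studied", "hours slept")

# First column is the response, the rest are predictors
f <- build_lm_formula(names(survey)[-1], names(survey)[1])
all.vars(f)

# The backticked names are resolved against the data frame's columns
fit <- lm(f, data = survey)
names(coef(fit))
```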
You can check the package leaps, and in particular the function regsubsets(), for model selection. As stated in the documentation:
Model selection by exhaustive search, forward or backward stepwise, or sequential replacement

Resources