Creating complicated formulas with a function - r

I am using the following function to make formulas. I pass it vectors of variable names, and the function puts everything in the right place and excludes duplicate variable names:
formula <- function(depvar, indepvars, instruments=NULL, othervars=NULL) {
x <- c(indepvars, instruments, othervars)
totvars <- unique(x)
totvars <- totvars[!totvars %in% depvar]
formula <- as.formula(
paste(depvar, paste(totvars, collapse = " + "), sep = " ~ "))
return(formula)
}
indepvars <- c("indepvarA", "indepvarB", "indepvarC")
instruments <- c("IV_A", "IV_B")
# lm
formula("depvar", indepvars)
# 1st stage - IV's for indepvarC
formula("indepvarC", indepvars, instruments)
However, I want the option to write a more complicated formula (an ivreg formula), namely:
depvar ~ instrumentedvar + indepvars | instrumentvars + indepvars
I have been trying the following:
formula <- function(depvar, indepvars, instruments=NULL, instrumentedvar=NULL, othervars=NULL, twostage=NULL) {
x <- c(indepvars, instruments, othervars)
totvars <- unique(x)
totvars <- x[!x %in% depvar]
if (is.null(twostage)) {
formula <- as.formula(
paste(depvar, paste(totvars, collapse = " + "), sep = " ~ "))
} else {
totvarsB <- totvars[!totvars %in% instrumentedvar]
totvarsB <- c(as.character(totvarsB), as.character(instruments))
formula <- as.formula(
paste(depvar, paste(paste(totvars, collapse = " + "), paste("|", paste(totvarsB, collapse = " + " )), sep = " ~ ")))
}
return(formula)
}
indepvars <- c("indepvarA", "indepvarB", "indepvarC")
instruments <- c("IV_A", "IV_B")
instrumentedvar <- "indepvarC"
formula("indepvarC", indepvars, instruments, twostage=1)
But I cannot seem to get it right.

Define reform which takes a vector of names and outputs a string in which they are connected with plus signs. Then use sprintf to generate the final string and convert that using as.formula:
reform <- function(x) paste(x, collapse = " + ")
makeFo <- function(lhs, rhs1, rhs2 = NULL, env = parent.frame()) {
s <- sprintf("%s ~ %s", lhs, reform(c(rhs1, rhs2)))
if (!missing(rhs2)) s <- sprintf("%s | %s", s, reform(rhs2))
as.formula(s, env = env)
}
# test
makeFo("y", c("x1", "x2"))
## y ~ x1 + x2
makeFo("y", c("x1", "x2"), c("u1", "u2"))
## y ~ x1 + x2 + u1 + u2 | u1 + u2
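The same idea can be adapted to the exact layout asked for in the question (depvar ~ instrumentedvar + indepvars | instrumentvars + indepvars). The following is only a sketch: the helper name make_iv_formula and its argument names are assumptions for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the answer above): builds an ivreg-style
# two-part formula from character vectors of variable names.
make_iv_formula <- function(depvar, indepvars, instrumentedvar, instruments,
                            env = parent.frame()) {
  # Exogenous regressors: drop duplicates, the outcome and the instrumented variable.
  exog <- setdiff(unique(indepvars), c(depvar, instrumentedvar))
  rhs <- paste(c(instrumentedvar, exog), collapse = " + ")
  ivs <- paste(c(unique(instruments), exog), collapse = " + ")
  as.formula(sprintf("%s ~ %s | %s", depvar, rhs, ivs), env = env)
}

iv_fo <- make_iv_formula(
  depvar = "depvar",
  indepvars = c("indepvarA", "indepvarB", "indepvarC"),
  instrumentedvar = "indepvarC",
  instruments = c("IV_A", "IV_B")
)
iv_fo
## depvar ~ indepvarC + indepvarA + indepvarB | IV_A + IV_B + indepvarA + indepvarB
```

setdiff() is used so that the instrumented variable never appears after the bar, which is the part the attempt in the question was struggling with.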

Related

Variable parts formula for Shiny

In this part of my Shiny app, I'll do a 'linear model' (lm()) regression, using the variables the user selects. There are three inputs:
input$lmTrendFun is a selectInput(), with the options c("Linear", "Exponential", "Logarithmic", "Quadratic", "Cubic"):
selectInput("lmTrendFun", "Select the model for your trend line.",
choices = c("Linear", "Exponential", "Logarithmic", "Quadratic", "Cubic"))
The second input is input$lmDep, and it's a selectInput() too. I first created an updateSelectInput() inside an observe() reactive function, so the choices are the column names from the imported tibble.
The third input is input$lmInd and it's a checkboxGroupInput(), the choices being all the column names other than the one that's already selected as input$lmDep.
From that I want this output: the lm() (or rather, summary.lm() or summary(lm())) result for those variables. If I knew which they were, it would be simple:
if(input$lmTrendFun == "Linear"){
form <- yname ~ x1 + x2
}else if(input$lmTrendFun == "Exponential"){
form <- yname~ exp(x1) + exp(x2)
}else if(input$lmTrendFun == "Logarithmic"){
form <- yname~ log(x1) + log(x2)
}else if(input$lmTrendFun == "Quadratic"){
form <- yname ~ poly(x1, 2) + poly(x2, 2)
}else if(input$lmTrendFun == "Cubic"){
form <- yname ~ poly(x1, 3) + poly(x2, 3)
}
[...]
lm(form, data = .)
where the data (.) has the columns yname, x1 and x2.
However, I don't. So I believe I need some more generic function that can create the formula. How can this be done?
formulizer <- function() as.formula(paste0( input$lmDep, "~", switch(input$lmTrendFun,
Linear = paste0(input$lmInd, collapse=" + "),
Exponential = paste0("exp(", input$lmInd,")", collapse=" + "),
Logarithmic = paste0("log(", input$lmInd,")", collapse=" + "),
Quadratic = paste0("poly(", input$lmInd,", 2)", collapse=" + "),
Cubic = paste0("poly(", input$lmInd,", 3)", collapse=" + ") )))
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Linear", lmDep="Vp")
> formulizer()
Vp ~ V1 + V2 + V3 + V4 + V5
<environment: 0x7fad1cf63d48>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Logarithmic", lmDep="Vp")
> formulizer()
Vp ~ log(V1) + log(V2) + log(V3) + log(V4) + log(V5)
<environment: 0x7fad01e694d0>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Quadratic", lmDep="Vp")
> formulizer()
Vp ~ poly(V1, 2) + poly(V2, 2) + poly(V3, 2) + poly(V4, 2) +
poly(V5, 2)
<environment: 0x7fad01f51d20>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Cubic", lmDep="Vp")
> formulizer()
Vp ~ poly(V1, 3) + poly(V2, 3) + poly(V3, 3) + poly(V4, 3) +
poly(V5, 3)
<environment: 0x7fad01f59690>
Consider switch with vectorized paste0 to build terms with transformations and then pass terms into reformulate. Adjust below inputs to actual Shiny variables:
dep_term <- ...
ind_terms <- ...
form <- switch(input$lmTrendFun,
Linear = reformulate(ind_terms, response=dep_term),
Exponential = reformulate(paste0("exp(", ind_terms, ")"), response=dep_term),
Logarithmic = reformulate(paste0("log(", ind_terms, ")"), response=dep_term),
Quadratic = reformulate(paste0("poly(", ind_terms, ", 2)"), response=dep_term),
Cubic = reformulate(paste0("poly(", ind_terms, ", 3)"), response=dep_term)
)
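To try the reformulate approach outside Shiny, here is a small sketch that uses plain arguments in place of the input values. The function name build_trend_formula and the use of the built-in mtcars data are assumptions made for illustration only.

```r
# Hypothetical wrapper: 'dep', 'ind' and 'trend' stand in for
# input$lmDep, input$lmInd and input$lmTrendFun.
build_trend_formula <- function(dep, ind, trend) {
  terms <- switch(trend,
    Linear      = ind,
    Exponential = paste0("exp(", ind, ")"),
    Logarithmic = paste0("log(", ind, ")"),
    Quadratic   = paste0("poly(", ind, ", 2)"),
    Cubic       = paste0("poly(", ind, ", 3)"),
    stop("unknown trend: ", trend)  # default branch for unexpected choices
  )
  reformulate(terms, response = dep)
}

fo <- build_trend_formula("mpg", c("wt", "hp"), "Quadratic")
fo
## mpg ~ poly(wt, 2) + poly(hp, 2)
fit <- lm(fo, data = mtcars)
```

Building the term labels first and calling reformulate once keeps the response handling in a single place, so every branch uses the same dependent variable.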

R Step function looks for data in global environment, not inside defined function

I have a problem with forward stepwise regression, and my understanding is that I am not passing the Data argument correctly.
I have the function:
ForwardStep <- function(df,yName, Xs, XsMin) {
Data <- df[, c(yName,Xs)]
fit <- glm(formula = paste(yName, " ~ ", paste0(XsMin, collapse = " + ")),
data = Data, family = binomial(link = "logit") )
ScopeFormula <- list(lower = paste(yName, " ~ ", paste0(XsMin, collapse = " + ")),
upper = paste(yName, " ~ ", paste0(Xs, collapse = " + ")))
result <- step(fit, direction = "forward", scope = ScopeFormula, trace = 1 )
return(result)
}
When I try to run it with following arguments
df <- data.frame(Y= rep(c(0,1),25),time = rpois(50,2), x1 = rnorm(50, 0,1),
x2 = rnorm(50,.5,2), x3 = rnorm(50,0,1))
yName = "Y"
Xs <- c("x1","x2","x3")
XsMin <- 1
res <- ForwardStep(df,yName,Xs,XsMin)
I am getting an Error:
Error in is.data.frame(data) : object 'Data' not found
But if I first define Data in Global Env it works perfectly fine.
Data <- df[, c(yName,Xs)]
res <- ForwardStep(df,yName,Xs,XsMin)
I guess that my use of the step function is wrong, but I don't know exactly how to do it the right way.
You need to realize that formulas always have an associated environment, see help("formula"). One should never pass text to the formula parameter of model functions, never ever. If you do that, you will encounter scoping issues sooner or later. Usually, I'd recommend computing on the language instead, but you can also create the formulas from text in the correct scope:
ForwardStep <- function(df,Yname, Xs, XsMin) {
Data <- df[, c(Yname,Xs)]
f1 <- as.formula(paste(Yname, " ~ ", paste0(XsMin, collapse = " + ")))
fit <- glm(formula = f1,
data = Data, family = binomial(link = "logit") )
f2 <- as.formula(paste(Yname, " ~ ", paste0(XsMin, collapse = " + ")))
f3 <- as.formula(paste(Yname, " ~ ", paste0(Xs, collapse = " + ")))
ScopeFormula <- list(lower = f2,
upper = f3)
step(fit, direction = "forward", scope = ScopeFormula, trace = 1)
}
df <- data.frame(Y= rep(c(0,1),25),time = rpois(50,2), x1 = rnorm(50, 0,1),
x2 = rnorm(50,.5,2), x3 = rnorm(50,0,1))
YName = "Y"
Xs <- c("x1","x2","x3")
XsMin <- 1
res <- ForwardStep(df,YName,Xs,XsMin)
#Start: AIC=71.31
#Y ~ 1
#
# Df Deviance AIC
#<none> 69.315 71.315
#+ x1 1 68.661 72.661
#+ x3 1 68.797 72.797
#+ x2 1 69.277 73.277
(Public service announcement: step-wise regression is a garbage generator. There are better statistical techniques available.)
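The point about formula environments can be checked directly. The sketch below is an illustration with made-up names (fit_inside, local_data): it shows that a formula created with as.formula inside a function carries that function's frame as its environment, which is where a locally defined data frame lives.

```r
# Hypothetical demo function; all names here are illustrative only.
fit_inside <- function() {
  local_data <- data.frame(y = c(1, 2, 3, 4), x = c(2, 4, 5, 8))
  fo <- as.formula("y ~ x")  # formula is created inside this function
  list(
    formula   = fo,
    # The formula's environment is this function's own frame ...
    same_env  = identical(environment(fo), environment()),
    # ... so objects defined locally can be found through it.
    sees_data = exists("local_data", envir = environment(fo), inherits = FALSE)
  )
}

res <- fit_inside()
res$same_env
## [1] TRUE
res$sees_data
## [1] TRUE
```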

How to apply a function that contains multiple regression on 2 by 2 variables in df by category in R?

My df is something like this:
ind1 <- rnorm(99)
ind2 <- rnorm(99)
ind3 <- rnorm(99)
ind4 <- rnorm(99)
ind5 <- rnorm(99)
dep <- rnorm(99, mean=ind1)
group <- rep(c("A", "B", "C"), each=33)
df <- data.frame(dep, group, ind1, ind2, ind3, ind4, ind5)
This is a function that combines different types of regression equations.
functions <- function(x, y) {
eq1 <- lm(dep ~ x + y)
eq2 <- lm(dep ~ I(x*y))
eq3 <- lm(log(dep) ~ I(log(x+1)^2) + I(log(y+1)^2))
list <- list(eq1, eq2, eq3)
names(list) <- paste0("mod", 1:3)
return(list)
}
Then I applied this function to ind1 and ind2 variables by group as below.
out <- lapply(split(df, df$group), function(x) functions(x$ind1, x$ind2))
lapply(out, summary)
But there is an error saying that variable lengths differ (found for 'x'). How can I fix this error?
Thanks in advance!
Here, the issue is that the formula needs the column names instead of the values:
functions <- function(dat, x, y) {
form1 <- paste0("dep ~ ", x, " + ", y)
form2 <- paste0("dep ~ ", "I(", x, "*", y, ")")
form3 <- paste0("log(dep) ~", "I(log(", x, "+1)^2) + I(log(", y, "+1)^2)")
eq1 <- lm(form1, data = dat)
eq2 <- lm(form2, data = dat)
eq3 <- lm(form3, data = dat)
list1 <- list(eq1, eq2, eq3)
names(list1) <- paste0("mod", 1:3)
return(list1)
}
out <- lapply(split(df, df$group), function(x) functions(x, "ind1", "ind2"))
Also, as the output is a nested list, we may need to go inside the inner list to extract the summary:
lapply(out, function(x) lapply(x, summary))
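If the goal is to run a model for every pair of predictors ("2 by 2") within each group, one possible extension is to generate the pairs of column names with combn and build each formula with reformulate. This is only a sketch on a smaller simulated data frame; fit_pair, pairs and fits are hypothetical names, and only the first model type (dep ~ x + y) is shown.

```r
set.seed(1)
df <- data.frame(
  dep   = rnorm(30),
  group = rep(c("A", "B", "C"), each = 10),
  ind1  = rnorm(30), ind2 = rnorm(30), ind3 = rnorm(30)
)

# All unordered pairs of predictor names, as a list of length-2 character vectors.
pairs <- combn(c("ind1", "ind2", "ind3"), 2, simplify = FALSE)

# Hypothetical helper: fit dep ~ x + y for one pair of column names.
fit_pair <- function(dat, pair) lm(reformulate(pair, response = "dep"), data = dat)

# One list of fitted models per group, named after the pair used.
fits <- lapply(split(df, df$group), function(g) {
  out <- lapply(pairs, fit_pair, dat = g)
  names(out) <- vapply(pairs, paste, character(1), collapse = "_")
  out
})

names(fits$A)
## [1] "ind1_ind2" "ind1_ind3" "ind2_ind3"
```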

Passing a character vector of variables into selection() formula

When I estimate a linear model in R via the lm() function, it is possible to pass a character vector of variables into the lm() formula (e.g. as described here or here). However, if I apply the same method to the selection() function of the sampleSelection package, the following error appears:
Error in detectModelType(selection, outcome) :
argument 'selection' must be a formula in function 'selection()'
Question: Is there a way to pass a character vector of variables into the selection() formula?
Below, you can find a reproducible example, which illustrates the problem:
# Example data
N <- 1000
y <- rnorm(N, 2000, 200)
y_prob <- c(rep(0, N / 2), rep(1, N / 2)) == 1
x1 <- y + rnorm(N, 0, 300)
x2 <- y + rnorm(N, 0, 300)
x3 <- y + rnorm(N, 0, 300)
x4 <- y + rnorm(N, 0, 300)
x5 <- y + rnorm(N, 0, 300)
y[1:(N / 2)] <- 0
data <- data.frame(y, x1, x2, x3, x4, x5, y_prob)
x_vars <- colnames(data)[colnames(data) %in% c("y", "y_prob") == FALSE]
# Estimate linear model via lm() --> works without any problems
lm(paste("y", "~", paste(x_vars, collapse = " + ")))
# Estimate Heckman model via selection()
library("sampleSelection")
# Passing of vector does not work
selection(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + ")),
paste("y", "~", paste(x_vars[3:5], collapse = " + ")), data)
# Formula has to be written manually
selection(y_prob ~ x1 + x2 + x3 + x4, y ~ x3 + x4 + x5, data)
Wrap your paste calls with as.formula:
selection(as.formula(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + "))),
as.formula(paste("y", "~", paste(x_vars[3:5], collapse = " + "))), data)
Call:
selection(selection = as.formula(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + "))), outcome = as.formula(paste("y", "~", paste(x_vars[3:5], collapse = " + "))), data = data)
Coefficients:
S:(Intercept) S:x1 S:x2 S:x3 S:x4 O:(Intercept) O:x3 O:x4 O:x5 sigma
-1.936e-01 -5.851e-05 7.020e-05 5.475e-05 2.811e-05 2.905e+02 2.286e-01 2.437e-01 2.165e-01 4.083e+02
rho
1.000e+00
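The difference comes down to the class of the object being passed: paste returns a character string, whereas the error message above shows that selection() insists on a formula. The sketch below needs no extra packages and only inspects the objects; reformulate is shown as an alternative way to build the same kind of formula, which is a suggestion and not something the answer above uses.

```r
x_vars <- paste0("x", 1:5)

# paste() yields plain text, not a formula.
sel_txt <- paste("y_prob", "~", paste(x_vars[1:4], collapse = " + "))
class(sel_txt)
## [1] "character"

# as.formula() converts the text into a formula object.
sel_fo <- as.formula(sel_txt)
class(sel_fo)
## [1] "formula"

# reformulate() builds a formula directly from a vector of term labels.
out_fo <- reformulate(x_vars[3:5], response = "y")
out_fo
## y ~ x3 + x4 + x5
```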

remove spaces added with using as.formula

I create a formula from text, but the resulting formula contains unwanted spaces. Is there a way to prevent this?
# text in which variables for formula are stored
mainEffectText ="age sex bmi cohort"
interactionText="age*cohort"
# get text into R
mainEffects <- read.table(textConnection(mainEffectText))
mainEffects <- t(mainEffects)
mainEffects <- as.character(mainEffects[,1])
interactions <- read.table(textConnection(interactionText))
interactions <- t(interactions)
interactions <- as.character(interactions[,1])
# put all terms into one vector
allTerms <- c(mainEffects,interactions)
# create formula
form <- as.formula(paste("Surv(time, event) ~ ",paste(allTerms, collapse=" +")))
This gives the following with spaces added to the interaction terms:
form
# Surv(time, event) ~ age + sex + bmi + cohort + age * cohort
The spaces are not really a problem, but in any case try this:
# inputs
mainEffectText <- "age sex bmi cohort"
interactionText <- "age*cohort"
lhs <- "Surv(time, event)"
both <- paste(mainEffectText, interactionText)
paste0(gsub(" ", "", lhs), "~", gsub(" ", "+", both))
giving:
"Surv(time,event)~age+sex+bmi+cohort+age*cohort"
If the inputs are of this form (lhs is as above), then use:
# inputs
mainEffects <- c("age", "sex", "bmi", "cohort")
interactions <- "age*cohort"
both <- paste(paste(mainEffects, collapse = " "),
paste(interactions, collapse = " "))
paste0(gsub(" ", "", lhs), "~", gsub(" ", "+", both))
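It may help to see that the spaces are not stored in the formula at all: a formula is a language object, and the spacing only appears when R converts it back to text for printing. A small sketch of that, where form and compact are illustrative names:

```r
form <- as.formula("Surv(time, event) ~ age + sex + bmi + cohort + age*cohort")

# Convert back to text and strip all whitespace to get a compact string.
compact <- gsub("\\s+", "", paste(deparse(form), collapse = ""))
compact
## [1] "Surv(time,event)~age+sex+bmi+cohort+age*cohort"

# Parsing the compact string gives a formula that prints the same way,
# so the spacing in the text makes no difference to the formula itself.
identical(deparse(as.formula(compact)), deparse(form))
## [1] TRUE
```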
