Concatenation of formulae in R

Function constrain of the R package diversitree takes a list of formulae as input.
formulae <- list(lambda1 ~ lambda0, mu1 ~ mu0, q10 ~ q01)
constrain(lik, formulae=formulae)
I would like to pass these formulae via a decision tree and concatenate them as needed.
f1 <- "lambda1 ~ lambda0"
f2 <- "mu1 ~ mu0"
f3 <- "q10 ~ q01"
How do I arrive at the above-shown list formulae?
Unsuccessful attempt:
formulae <- as.formula(paste(f1,f2,f3, collapse=","))
EDIT 1:
I do not know the precise number of formulae a priori; it is determined via a decision tree. The number of individual formulae (i.e., f1, f2, f3, etc.) that go into the variable formulae should thus not be hard-coded.

You can use:
formulae = list(as.formula(f1),as.formula(f2),as.formula(f3))
If you originally have all string formulae in a vector, say f <- c(f1, f2, f3), you may use
lapply(f, as.formula)
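Since the number of formulae is not known in advance, the character vector can be grown while the decision tree is walked and converted once at the end. A minimal sketch, in which the flags use_lambda, use_mu and use_q are hypothetical stand-ins for whatever the decision tree decides:

```r
# Hypothetical outcomes of the decision tree (assumed names, not part of diversitree)
use_lambda <- TRUE
use_mu     <- FALSE
use_q      <- TRUE

# Collect only the constraints that were selected
f <- character(0)
if (use_lambda) f <- c(f, "lambda1 ~ lambda0")
if (use_mu)     f <- c(f, "mu1 ~ mu0")
if (use_q)      f <- c(f, "q10 ~ q01")

# Convert every string to a formula; the result is a plain list of formulae
formulae <- lapply(f, as.formula)
```

The resulting list has the same shape as the hand-written one in the question, so it should be usable in constrain(lik, formulae=formulae) regardless of how many constraints were selected.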

Related

Best way to tell if a formula contains a random effect?

I have a list of formulas that I would like to fit in a loop using a function. Some of these formulas are random effects models and others are straightforward linear models. I want the function to detect whether the model contains a random effect and if so, use lmer() to fit the model. Otherwise, it should use lm(). Any suggestions on how to check this condition (other than converting the formula to a string and checking for parentheses)? At this stage, they have the same class so I can't just check that. I could also use error handling to catch when lmer() returns an error from a model without a random effect and reroute towards regular lm(), but this also seems unnecessarily messy.
Example below:
fit_models <- function(formula_list) {
models <- list()
for(ii in seq_along(formula_list)) {
if(formula_list[[ii]] is lmer) { # Enter condition here
print("lmer")
} else {
print("lm")
}
}
}
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
fit_models(formulas)
I would say
length(lme4::findbars(f))>0
should reliably detect formulas containing a random-effects component (in the lme4 sense).
The documentation of findbars describes it as follows: "From the right hand side of a formula for a mixed-effects model, determine the pairs of expressions that are separated by the vertical bar operator."
This is (implicitly) the test that's done in the lme4 code, here ...
The symbols in formulas don't have inherent meanings. A function can reinterpret the symbols to mean whatever they like. So just because there is a "|", that doesn't mean necessarily that that's a formula that has a random effect. That's just how lmer chose to interpret that symbol.
Given that formulas are basically just ordered collections of unevaluated symbols, there's not much more you can do on the formula itself than check whether a particular symbol is present. Rather than a straight-up character conversion, you could use all.names. So something like
f2 <- formula(y ~ 1 + x + (1 + x | z))
all.names(f2)
# [1] "~" "y" "+" "+" "x" "(" "|" "+" "x" "z"
"|" %in% all.names(f2)
# [1] TRUE
This won't be fooled if you have something like formula(`a|b` ~ x) where a|b is a (terrible) column name.
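The all.names check can be wrapped in a small helper and used to pick the fitting function. This is only a sketch: has_bar is a hypothetical name, and the extra test for "||" assumes you also want to catch lme4's double-bar syntax for uncorrelated random effects. No model is fitted here; the code only records which function would be chosen.

```r
# TRUE if the formula uses `|` (or `||`) as an operator anywhere
has_bar <- function(f) any(c("|", "||") %in% all.names(f))

f1 <- y ~ x
f2 <- y ~ 1 + x + (1 + x | z)
f3 <- `a|b` ~ x   # `a|b` is a (terrible) column name here, not an operator

# Decide per formula which fitting function would be used
fitter <- vapply(list(f1, f2, f3),
                 function(f) if (has_bar(f)) "lmer" else "lm",
                 character(1))
```

Inside fit_models, the same test could replace the placeholder condition, with the "lmer" branch calling lme4::lmer() and the other branch calling lm().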
You can just convert the formula to a character and look for the pipe operator |:
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
sapply(formulas, function(x) any(grepl("\\|", as.character(x))))
#> [1] FALSE TRUE

How to run a for loop to run regressions by dummy variables

I have the following code:
reg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d2 + d3 + d4, df)
Where all x_i are continuous variables and d_i are mutually exclusive dummy variables (d1 is present but excluded to avoid perfect multicollinearity). Rather than including the dummy variables, I want to run a separate regression on the observations where each dummy variable == 1. I wish to achieve this through a loop in the following form:
dummylist <- list("d1", "d2", "d3", "d4")
for(i in dummylist){
if(i==1){
ireg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2, df)
} else {
Unsure what to put here
}
}
My three(?) questions are:
in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."? and,
included in the code above, what would I put after the -else- statement?
This all begs the question, is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?
Sorry if this is too much, please let me know if it is and I can cut it down or separate into multiple questions. I could not find a similar question, probably as I am rather new to running loops in R and don't know what to look for.
in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."?
Short: No
In R there are many data types. One of the more versatile ones is the list, which can store any type of object. Alternatively, one could create an environment to store the fitted models in, but that is a bit overkill.
If you know roughly how many elements should be in your list, the easiest is to initialize it prior to your loop as
n <- 3
regList <- vector(mode = "list", length = n)
# Optional naming:
#names(regList) <- c("d1 reg", "d2 reg", "d3 reg")
In your loop you then fill in your list iteratively:
for(i in seq_along(regList)){
regList[[i]] <- lm(...)
}
what would I put after the -else- statement?
It is not entirely clear what you want here. Perhaps you want to add the separate dummy variables to the formula one after another. For this, the simplest approach is likely to save your formula and update it iteratively.
form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
for(i in seq_along(regList)){
#paste0 combine strings. ". ~ . + d1" means take the formula and add the element d1
form <- update(form, as.formula(paste0(". ~ . + d", i)))
regList[[i]] <- lm(form, data = df)
}
or maybe you are actually trying to run separate regressions on the subset where d[i] == 1. This can actually be done with lm itself
form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
d <- list(d1, d2, d3)
for(i in seq_along(regList)){
#Using the subset argument
regList[[i]] <- lm(form, data = df, subset = which(d[[i]] == 1))
#Alternatively:
#regList[[i]] <- lm(form, data = subset(df, d[[i]] == 1))
}
Disclaimer: It is not entirely clear whether d1, d2, d3 are columns of df. If they are, refer to them through df by name instead of building the list d:
regList[[i]] <- lm(form, data = df, subset = df[[paste0("d", i)]] == 1)
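Putting the subset approach together on made-up data may make it easier to follow. Everything about the data below is an assumption for illustration (simulated values, four dummies stored as columns d1 to d4 of df); only the formula and the idea of one regression per dummy come from the question.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the real data (assumed structure)
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df <- transform(df, x1_sq = x1^2, x2_sq = x2^2, x1x2 = x1 * x2)
df$Y <- 1 + df$x1 + 0.5 * df$x2 + rnorm(n)

# Mutually exclusive dummies d1..d4 derived from a single grouping variable
group <- sample(1:4, n, replace = TRUE)
for (k in 1:4) df[[paste0("d", k)]] <- as.integer(group == k)

form    <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
dummies <- paste0("d", 1:4)

# One regression per dummy, each on the rows where that dummy equals 1
regList <- lapply(dummies, function(dn) lm(form, data = df[df[[dn]] == 1, ]))
names(regList) <- dummies
```

Individual fits are then available by name, for example summary(regList$d2), which avoids creating separate objects such as d1reg and d2reg.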
is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?
In this case it is not clear that it is the right approach, although it is not wrong in all circumstances either. Here it just doesn't serve a clear purpose. Also note that i in dummylist takes the values "d1", "d2", "d3", "d4", because the names were quoted rather than the variables themselves being placed in the list, so i == 1 is never TRUE.
Another thing to address is whether you transformed the variables yourself before running the regression. R's formula interface lets you do this directly in the formula, and doing so helps you avoid mistakes such as testing a lower-order term while a higher-order term or interaction involving it is still in the model (unless that really is what you want). For example, I assume x1_sq = x1^2. Maybe d1, d2, d3 are all derived from a single variable d? In these cases you should use the original variables as shown below:
lm(formula = Y ~ poly(x1, 2, raw = TRUE) + poly(x2, 2, raw = TRUE) + x1:x2, data = df ) #+d if d1, d2, d3 is part of the formula
Here poly(x1, 2, raw = TRUE) is the second-order polynomial in x1, and raw = TRUE gives the terms as x1 + I(x1^2) rather than the orthogonal representation.
If you do this, the output of drop1, anova etc. will treat each polynomial as a single term instead of testing the first-order variable separately from its second-order counterpart.
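To illustrate the point about raw polynomials, the sketch below fits the same quadratic model twice on simulated data (the data and the names dat, fit_manual and fit_poly are made up for the example) and compares coefficients and the number of model terms.

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(50))
dat$Y <- 2 + dat$x1 - 0.5 * dat$x1^2 + rnorm(50)

# Quadratic written out by hand versus via poly(..., raw = TRUE)
fit_manual <- lm(Y ~ x1 + I(x1^2), data = dat)
fit_poly   <- lm(Y ~ poly(x1, 2, raw = TRUE), data = dat)

# The estimated coefficients agree (names differ, so they are dropped)
same_coefs <- isTRUE(all.equal(unname(coef(fit_manual)),
                               unname(coef(fit_poly))))

# The hand-written version has two model terms, the poly version only one
n_terms <- c(manual = length(attr(terms(fit_manual), "term.labels")),
             poly   = length(attr(terms(fit_poly), "term.labels")))
```

Because the polynomial is a single term in the second fit, functions that work term by term handle x1 and its square together.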

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
+... +t45*year.matrix[,45] + g*(x^d))
nl_model = gnls(nl_exp, start=list(t=0.5, g=0.01, d=0.1))
where y is the binary response variable, year.matrix is a matrix of 45 columns (indicating 45 different years) and x is the independent variable. The parameters to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns an error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own - it is completely meaningless and the model actually doesn't work because it has bad data coverage but it gets the point across:
y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)
start.values <- c(rep(0.5, 45), 0.01, 0.1) #you could also use setNames here and do this all in one row but that gets really messy
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)
nl_exp2 <- as.formula(paste0("y ~ ", paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"), collapse = " + "), " + g*(x^d)"))
library(nlme)  # gnls() is provided by the nlme package
gnls(nl_exp2, start=start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.
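The paste construction is easier to inspect with a small number of years. The sketch below uses k = 3 instead of 45 (k is an assumed illustration parameter) and builds the named start list with setNames in one step; it only constructs the formula and the starting values and does not call gnls.

```r
k <- 3  # number of year columns; 45 in the question

# One "t_i*year.matrix[,i]" piece per year, plus the nonlinear part
year_terms <- paste0("t", 1:k, "*year.matrix[,", 1:k, "]")
rhs        <- paste(c(year_terms, "g*(x^d)"), collapse = " + ")
nl_exp     <- as.formula(paste("y ~", rhs))

# Named list of starting values: t1..tk, then g and d
start.values <- as.list(setNames(c(rep(0.5, k), 0.01, 0.1),
                                 c(paste0("t", 1:k), "g", "d")))
```

Printing nl_exp shows the full expanded formula, which is a quick way to verify the pasted string before passing it to gnls together with start.values.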

passing multiple arguments via mapply to a formula function

I would like to run two loess regressions. The data is provided as a list which contains two elements. Each element contains a pair of columns (x and y for the regression) on which I would like to run the loess regression. I would like to do so with the apply family, specifically mapply. However, loess takes a formula expression y ~ x, and I believe you cannot directly reference x and y in the formula the way you would pass variables to a non-formula function via mapply.
X <- c(3,4,3,2,3,4,5,6,7,7,6,5,4,3,3,5,3,6,3,5,6,3,6,3,4,5,5,4,3,4,5,3,5,5,4)
Y <- c(3,2,1,3,4,2,1,2,3,5,4,3,2,1,1,3,4,5,6,7,6,5,4,3,2,3,4,3,4,2,4,3,NA,NA,NA)
mydata<-data.frame(X,Y)
L <- seq(1:length(mydata))
n <- function(x) length(na.omit(mydata[,x]))
n <- lapply(L,n)
# sequence each (Variable time)
x <- function(x) seq(1:n[[x]])
x <- lapply(L,x)
y <- function(x) na.omit(mydata[,x])
y <- lapply(L,y)
# create a list with pairs of columns each
Data <- function (p) data.frame(y[[p]], x[[p]])
Data <- lapply(L,Data)
# In a written function you would do the following (where p corresponds to the sequence of L and d is the number of columns in each element) and use mapply to pass the arguments
d <- seq(1:length(mydata))
p <- seq(1:length(Data))
W1<-expand.grid(p=p,d=d)
# However the formula framework y ~ x for loess does not allow to pass multiple variable arguments to x and y and use mapply to do so which I would like to do as to automate this process. I wrote in the same format as x and y variables will be passed to a non formula function which does not work.
mapply(function(p,d) ((y ~ x, span = 0.75, degree = 2,parametric = FALSE, drop.square = FALSE, normalize = FALSE,family = c("gaussian")),W1$p,W1$d)
I am wondering how could I pass using the mapply function the different variables to the loess function.
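One possible approach, sketched under assumptions: if every list element is a data frame whose columns are literally named x and y, the formula y ~ x can stay fixed and the data frame itself becomes the argument that mapply varies, through the data argument of loess. The vector spans is a hypothetical second argument, added only to show several arguments being passed in parallel.

```r
X <- c(3,4,3,2,3,4,5,6,7,7,6,5,4,3,3,5,3,6,3,5,6,3,6,3,4,5,5,4,3,4,5,3,5,5,4)
Y <- c(3,2,1,3,4,2,1,2,3,5,4,3,2,1,1,3,4,5,6,7,6,5,4,3,2,3,4,3,4,2,4,3,NA,NA,NA)
mydata <- data.frame(X, Y)

# One data frame per original column: y = observed values, x = time index
Data <- lapply(mydata, function(col) {
  y <- as.numeric(na.omit(col))
  data.frame(x = seq_along(y), y = y)
})

# Hypothetical per-element span values, passed alongside the data frames
spans <- c(0.75, 0.75)

# The formula is constant; mapply varies the data frame and the span
fits <- mapply(function(d, s) loess(y ~ x, data = d, span = s, degree = 2),
               Data, spans, SIMPLIFY = FALSE)
```

With SIMPLIFY = FALSE the result stays a list of loess objects, one per element of Data, so fitted(fits[[1]]) or predict(fits[[2]]) can be applied to each.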

Converting R formula format to mathematical equation

When we fit a statistical model in R, say
lm(y ~ x, data=dat)
We use R's special formula syntax: "y~x"
Is there something that converts from such a formula to the corresponding equation? In this case it could be written as:
y = B0 + B1*x
This would be very useful! For one, because with more complicated formulae I don't trust my translation. Second, in scientific papers written with R/Sweave/knitr, sometimes the model should be reported in equation form and for fully reproducible research, we'd like to do this in automated fashion.
Just had a quick play and got this working:
# define a function to take a linear regression
# (anything that supports coef() and terms() should work)
expr.from.lm <- function (fit) {
# the terms we're interested in
con <- names(coef(fit))
# current expression (built from the inside out)
expr <- quote(epsilon)
# prepend expressions, working from the last symbol backwards
for (i in length(con):1) {
if (con[[i]] == '(Intercept)')
expr <- bquote(beta[.(i-1)] + .(expr))
else
expr <- bquote(beta[.(i-1)] * .(as.symbol(con[[i]])) + .(expr))
}
# add in response
expr <- bquote(.(terms(fit)[[2]]) == .(expr))
# convert to expression (for easy plotting)
as.expression(expr)
}
# generate and fit dummy data
df <- data.frame(iq=rnorm(10), sex=runif(10) < 0.5, weight=rnorm(10), height=rnorm(10))
f <- lm(iq ~ sex + weight + height, df)
# plot with our expression as the title
plot(resid(f), main=expr.from.lm(f))
There is a lot of freedom in what the variables are called and in whether you actually want the coefficient values in there as well, but this seems good for a start.
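For a plain-text equation rather than a plot title, the term labels of the formula can be pasted together directly. A rough sketch: eq_from_formula is a hypothetical helper, it works at the level of model terms, and it therefore does not expand factors into one coefficient per level.

```r
# Build a string such as "y = B0 + B1*x" from a model formula
eq_from_formula <- function(f) {
  tt    <- terms(f)
  labs  <- attr(tt, "term.labels")
  parts <- paste0("B", seq_along(labs), "*", labs)
  # Add the intercept coefficient unless it was removed from the formula
  if (attr(tt, "intercept") == 1) parts <- c("B0", parts)
  paste(deparse(f[[2]]), "=", paste(parts, collapse = " + "))
}

eq_simple   <- eq_from_formula(y ~ x)
eq_interact <- eq_from_formula(y ~ a * b)
```

For y ~ x this reproduces the form given in the question, and for y ~ a * b it makes the expanded main effects and the a:b interaction explicit.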
