User defined functions as formula input - r

Built-in functions in R can be used in formula objects, for example
reg1 = lm(y ~ log(x), data = data1)
How can I write my functions such that they can be used in formula objects?
fnMyFun = function(x) {
return(x^2)
}
reg2 = lm(y ~ fnMyFun(x), data = data1)

What you've got certainly works. One caveat is that different modelling functions handle formulas in different ways. I think that as long as your function returns something that model.matrix can make sense of, you'll be fine. That means:
The function is vectorised, i.e. given a vector of length N, it returns a result that is also of length N.
It returns an atomic vector or matrix (not a list, and not of type raw).
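To illustrate the two conditions above, here is a minimal sketch using the built-in mtcars data. The helper names sq and sq_and_cube are made up for this example and are not from the question; one returns a vector, the other a two-column matrix.

```r
# Hypothetical helpers, for illustration only
sq <- function(x) x^2                                   # vectorised: length N in, length N out
sq_and_cube <- function(x) cbind(sq = x^2, cube = x^3)  # returns an N x 2 numeric matrix

# A vector-valued function contributes one column to the model matrix
fit_vec <- lm(mpg ~ sq(wt), data = mtcars)

# A matrix-valued function contributes one column per matrix column
fit_mat <- lm(mpg ~ sq_and_cube(wt), data = mtcars)

names(coef(fit_vec))
names(coef(fit_mat))
```

The first fit should give the same estimates as lm(mpg ~ I(wt^2), data = mtcars), since sq(wt) and I(wt^2) produce the same column.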

Related

Programmatically detect function calls in R formulae, e.g. y ~ x + log(z), and surround them in backticks

Let me explain my goal first because while the title expresses my strategy, I don't think it is likely to be the only way to solve the problem.
I have an R function to which I pass fitted model objects, like those from lm, and the function extracts the model frame, saves that as a data frame, standardizes the variables in the new data frame, then refits the model with the standardized variables to ease the interpretation of the model's coefficients.
Example code without wrapping it in a function:
mod <- lm(mpg ~ wt, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale))
standardized_mod <- update(mod, data = new_data)
Now a summary of standardized_mod by virtue of being fitted with standardized data will give standardized coefficients.
This isn't the most efficient way of doing things, I admit, since I could do something like multiplying the estimates and SEs by each variable's standard deviation. But in the context of the function, I'm trying to be more flexible; this gets less straightforward when working with survey package objects and the like. I also use the same logic to fit models with interaction terms for simple slopes analysis. But this is beside the main point of the question; I just want to offer some explanation to avoid getting bogged down with "there's other ways to standardize coefficients" responses. I'm more interested in this general problem with formulae than the specific application.
The solution above falls apart when a function is applied to any of the variables. For example,
mod <- lm(mpg ~ log(wt), data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
standardized_mod <- update(mod, data = new_data)
This will break on update(mod, data = new_data), because lm is going to look for a column called wt to apply log to in new_data, which only has columns called mpg and log(wt).
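The failure described above can be reproduced in a few lines. This is only a sketch: it wraps the update call in try() so the script keeps running, and it assumes there is no object called wt in the calling environment (otherwise lm would silently pick that up instead of failing).

```r
mod <- lm(mpg ~ log(wt), data = mtcars)
new_data <- data.frame(lapply(model.frame(mod), FUN = scale), check.names = FALSE)

# The model frame stores the transformed variable under the name "log(wt)"
names(new_data)

# Re-evaluating mpg ~ log(wt) against new_data needs a column called wt,
# which does not exist there, so the refit errors out
res <- try(update(mod, data = new_data), silent = TRUE)
inherits(res, "try-error")
```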
What I would like to do is manipulate the model formula in such a way that it goes from mpg ~ log(wt) to mpg ~ `log(wt)`. Of course, if it were just log I was worried about, I might be able to get something really hacky going to address it. But I'd like to be able to do the same regardless of the function in the formula, like if it's poly or some such.
Here are some solutions I've considered:
Instead of update, re-fit the model with lm directly and use the . for the RHS of the formula. This would work for some cases, but has big drawbacks, too. This will ignore any interaction terms in the original formula or other arithmetic uses of the formula from the original model. It also won't fix the problem if the function was applied to the LHS of the formula in the original model.
Use some kind of convoluted regex matching to isolate terms that appear to be functions on the basis of being right before (, but as a general rule I'm fearful of using string manipulation since it may fail in confusing ways. I'm not completely ruling this route out, but I haven't wrapped my head around how to do it safely and am not sure how to match terms with functions without accidentally capturing other parts of the formula.
I've tried messing around with the terms object and trying to use that as a way to use update on the formula itself, but haven't had much luck figuring out how to edit the terms object in the right ways.
We can avoid having to re-create the formula as follows. mm0 holds the model matrix columns except for the intercept. Scaling it gives mm0_std. Now fit the new standardized lm:
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
response <- mod$model[1]
mm0 <- model.matrix(mod)[, -1]
mm0_std <- scale(mm0)
mod_std <- lm(cbind(response, mm0_std))
If you do want the formula this will give it:
formula(mod_std)
## mpg ~ `log(wt)` + qsec + `log(wt):qsec`
## <environment: 0x000000000b1988c8>
I've thought of another potential solution as well, but I've not extensively tested it and it uses regex, which is in my understanding not the most R way of doing things.
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
We have the usual start, above.
Now I pull the variable names from the terms object.
vars <- as.character(attributes(terms(mod))$variables)
vars <- vars[-1] # gets rid of "list"
And save the full formula as a string.
char_form <- as.character(deparse(formula(mod)))
Now I iterate through the variables and use regex to surround each one in backticks. This gets around the trickier regex I was worried about for detecting which variables had functions applied.
for (var in vars) {
backtick_name <- paste("`", var, "`", sep = "")
char_form <- gsub(var, backtick_name, char_form, fixed = TRUE)
}
If I want to specify a variable not to standardize, like the outcome variable, I can exclude it from the vars vector programmatically. For instance, I can do this:
response <- as.character(formula(mod))[2]
vars <- vars[vars != response]
Of course, we can remove the response by dropping the first item in the list, but the above is for demonstrative purposes.
Now I can refit the model with the new data and new formula.
new_model <- update(mod, formula = as.formula(char_form), data = new_data)
In this narrow case, I don't really need to use update since I have all I need for lm. But if I was starting with a glm object or some other model, other user-supplied arguments like family are preserved.
Note: Weights and offsets can be problematic here, but it's not an intractable problem. I think the most straightforward thing to do is explicitly exclude columns named "(weights)" and "(offset)" from the model frame before scaling, then cbinding it back together afterwards. Then the user can use conditionals or some such to decide when to supply weights = `(weights)` and offset = `(offset)` arguments to update.
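A regex-free variation on the same idea is to walk the formula as a language object and replace every function call that is not a formula operator with a single backticked name. The following is a sketch only, not a tested general solution: the helper name backtick_calls and the operator list formula_ops are assumptions made for this example, and terms such as offset(x) or I(x^2) would be turned into plain column names too, losing any special meaning.

```r
# Operators that are part of formula syntax and should be kept as-is
# (assumed list; extend it if your modelling function uses others, e.g. "|")
formula_ops <- c("~", "+", "-", "*", ":", "/", "^", "%in%", "(")

# Recursively replace calls such as log(wt) with the symbol `log(wt)`
backtick_calls <- function(expr) {
  if (!is.call(expr)) return(expr)            # symbols and constants are left alone
  fn <- as.character(expr[[1]])
  if (length(fn) == 1 && fn %in% formula_ops) {
    for (i in seq_along(expr)[-1]) {
      expr[[i]] <- backtick_calls(expr[[i]])  # recurse into the operands
    }
    expr
  } else {
    # Any other call becomes one name, matching the model frame's column name
    as.name(paste(deparse(expr, width.cutoff = 500L), collapse = ""))
  }
}

mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
new_data <- data.frame(lapply(model.frame(mod), FUN = scale), check.names = FALSE)

new_form <- formula(mod)
new_form[[2]] <- backtick_calls(new_form[[2]])  # left-hand side
new_form[[3]] <- backtick_calls(new_form[[3]])  # right-hand side

new_model <- update(mod, formula. = new_form, data = new_data)
all.vars(new_form)
```

Because the formula is edited as a parse tree rather than as a string, a variable whose name happens to be a substring of another term cannot be matched by accident.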

Using list of LM estimates as stargazer input

I'm trying to use stargazer on several LM estimates at once, say "OLS1",...,"OLS5".
I would usually insert them as separate arguments at the beginning of the stargazer input. What I'm looking for is a way to input them all with a list that contains them all, being one argument. Something like
stargazer(list,...)
stargazer arguments explanation states that
one or more model objects (for regression analysis tables) or data frames/vectors/matrices (for summary statistics, or direct output of content). They can also be included as lists (or even lists within lists).
I was wondering what is the correct way to gather LM estimates in a list so that this would work. When I just save the results in a list I get the following error
Error in list.of.objects[[i]] : subscript out of bounds
I will mention that I create the objects storing the estimates using assign, e.g.:
assign(some_string,lm(...))
So what I have is a string, called some_string, and I want to put the LM result named by some_string inside a list. Using get doesn't help with that.
EDIT: I think you want mget
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
assign("string_1", lm(Y ~ X))
assign("string_2", lm(Y ~ X))
my_list <- mget(x = c("string_1", "string_2"))
stargazer(my_list)
works for me?
library(stargazer)
Y <- rnorm(100)
X <- rnorm(100)
fit_1 <- lm(Y ~ X)
fit_2 <- lm(Y ~ X)
stargazer(list(fit_1, fit_2))
Did you name your list list? Maybe it's grabbing the function instead.
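Another option is to skip assign and mget entirely and build a named list of fits in one step. This is a sketch with simulated data; the names dat, specs, fits, X1 and X2 are invented for the example.

```r
set.seed(1)
dat <- data.frame(Y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100))

# One formula per model, named the way the fitted models should be named
specs <- list(OLS1 = Y ~ X1, OLS2 = Y ~ X1 + X2)

# lapply keeps the names, so fits is a named list of lm objects
fits <- lapply(specs, lm, data = dat)
names(fits)
```

Since the stargazer documentation quoted above says model objects can be supplied as a list, fits can then be passed as a single argument in the same way as my_list.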

How is formula specified in R's lm function

I have a function in R, let's say
myfunction <- function(formula,data)
Among other things, the function contains a call to lm(). Formula should include the covariates, and should be specified as
formula = x1 + x2 + ... + x_n
Data contains columns Z and W, where the response
y=data$Z/data$W
I only want to have formula including the covariates, since the function modifies the response variable for each iteration.
The call for lm() should then work with
lm(y~formula,data=data)
Why would you do that? It is cleaner to pass the whole formula to myfunction, for instance:
myfunction <- function(formula,data) {
  data = data*2 # this is an example of data manipulation
  lm(formula=formula, data=data)
}
Then use myfunction as you would use lm.
If you REALLY want to create complexity (for nothing?), you can also use the fact that lm will coerce a string passed as its formula argument into a proper formula object:
myfunction2 <- function(formula2,data) {
data = data*2 # this is an example of data manipulation
lm(formula=paste0("y~",formula2), data=data)
}
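If the covariates really must be passed separately from the response, one way that avoids string pasting is to accept a one-sided formula and add the response with update. This is a sketch under assumptions: the function name myfunction3, the argument name rhs, and the example data frame d are invented here, and y is computed as Z/W as described in the question.

```r
myfunction3 <- function(rhs, data) {
  data$y <- data$Z / data$W   # response built inside the function
  f <- update(rhs, y ~ .)     # turns ~ x1 + x2 into y ~ x1 + x2
  lm(f, data = data)
}

# Hypothetical data with the columns the question describes
d <- data.frame(Z = mtcars$mpg, W = mtcars$wt, x1 = mtcars$hp, x2 = mtcars$qsec)
fit <- myfunction3(~ x1 + x2, d)
names(coef(fit))
```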

Using apply to loop over different datasets in a regression

I found this way of looping over variables in an lm() when the variable names are stored as characters (http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm):
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})
My first question is: Is there a more efficient/faster way?
What if I want to loop over different data instead of looping over variables?
Example:
reg1 <- lm(a~b, data=dataset1)
reg2 <- lm(a~b, data=dataset2)
Can I apply something similar to the code shown above? Using the substitute function for the data did not work.
Thank You!
The substitute in your example is used to construct the formula. If you want to apply lm to a number of data.frames, use:
lapply(list(dataset1, dataset2), lm, formula = a ~ b)
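As a runnable illustration of the same pattern, the sketch below splits the built-in mtcars data into one data frame per cylinder count and fits the same formula to each. The names datasets and fits are chosen for this example only.

```r
# One data frame per value of cyl; the list is named "4", "6", "8"
datasets <- split(mtcars, mtcars$cyl)

# formula is passed by name, so each data frame is matched to lm's data argument
fits <- lapply(datasets, lm, formula = mpg ~ wt)

# One column of coefficients per dataset
sapply(fits, coef)
```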

Working with dataframe variable names passed as arguments to the R function

I want to fit a statistical model (e.g. a linear one) for arbitrary variables from the dataframe passed as arguments to a function.
E.g.
myLm = function(x,y) {
fit = lm(y~x, data=mtcars)}
myLm("cyl","mpg")
But this does not work.
How do I do that correctly?
You have to pass a formula object to the lm function, but you are giving character strings as input to myLm.
You can change your function in this way, building the formula from the strings (response y on the left, predictor x on the right) and returning the fit:
myLm = function(x,y) {
  lm(as.formula(paste(y,"~",x)), data=mtcars)
}
myLm("cyl","mpg")
You still give character input, but the strings are converted to a formula inside your function myLm.
Why don't you pass a formula object directly:
myLm <- function(f)
lm(f, data=mtcars)
myLm(mpg ~ cyl)
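If the variable names have to stay character strings, base R's reformulate builds the formula from them directly. A minimal sketch; the function name myLm2 and its data argument are illustrative and not part of the question.

```r
myLm2 <- function(x, y, data = mtcars) {
  # reformulate("cyl", response = "mpg") gives mpg ~ cyl
  lm(reformulate(x, response = y), data = data)
}

fit <- myLm2("cyl", "mpg")
names(coef(fit))
```

Here x may also be a character vector of several predictors, in which case they are joined with +.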
