Programmatically detect function calls in R formulae, e.g. y ~ x + log(z), and surround them in backticks

Let me explain my goal first, because while the title expresses my strategy, I don't think it's the only way to solve the problem.
I have an R function to which I pass fitted model objects, like those from lm. The function extracts the model frame, saves it as a data frame, standardizes the variables in the new data frame, then refits the model with the standardized variables to ease the interpretation of the model's coefficients.
Example code without wrapping it in a function:
mod <- lm(mpg ~ wt, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale))
standardized_mod <- update(mod, data = new_data)
Now a summary of standardized_mod by virtue of being fitted with standardized data will give standardized coefficients.
This isn't the most efficient way of doing things, I admit, since I could do something like multiplying the estimates and SEs by each variable's standard deviation. But in the context of the function, I'm trying to be more flexible; this gets less straightforward when working with survey package objects and the like. I also use the same logic to fit models with interaction terms for simple slopes analysis. But this is beside the main point of the question; I just want to offer some explanation to avoid getting bogged down with "there are other ways to standardize coefficients" responses. I'm more interested in this general problem with formulae than in the specific application.
The solution above falls apart when a function is applied to any of the variables. For example,
mod <- lm(mpg ~ log(wt), data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
standardized_mod <- update(mod, data = new_data)
This will break on update(mod, data = new_data), because lm is going to look for a column called wt to apply log to in new_data, which only has columns called mpg and log(wt).
What I would like to do is manipulate the model formula in such a way that it goes from mpg ~ log(wt) to mpg ~ `log(wt)`. Of course, if it were just log I was worried about, I might be able to get something really hacky going to address it. But I'd like to be able to do the same regardless of the function in the formula, like if it's poly or some such.
Here are some solutions I've considered:
1. Instead of update, re-fit the model with lm directly and use . for the RHS of the formula. This would work in some cases, but it has big drawbacks, too: it ignores any interaction terms or other arithmetic in the original formula, and it won't fix the problem if the function was applied to the LHS of the original formula.
2. Use some kind of convoluted regex matching to isolate terms that appear to be functions on the basis of sitting right before a (. As a general rule I'm wary of string manipulation, since it can fail in confusing ways. I'm not completely ruling this route out, but I haven't worked out how to do it safely, or how to match function terms without accidentally capturing other parts of the formula.
3. Manipulate the terms object and use update on the formula itself. I've messed around with this, but haven't had much luck figuring out how to edit the terms object in the right ways. (A rough sketch of this idea appears below.)
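As a lightly tested sketch of that third idea (my own attempt, not a verified solution): walk the formula as a language object, and replace any sub-call whose deparsed form matches a model-frame variable with a single backticked symbol.
backtick_vars <- function(expr, varnames) {
  # Replace a call like log(wt) with the single symbol `log(wt)`.
  txt <- paste(deparse(expr), collapse = " ")
  if (is.call(expr) && txt %in% varnames) {
    return(as.name(txt))
  }
  if (is.call(expr)) {
    for (i in seq_along(expr)[-1]) {
      expr[[i]] <- backtick_vars(expr[[i]], varnames)
    }
  }
  expr
}

mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
vars <- vapply(as.list(attr(terms(mod), "variables"))[-1], deparse, character(1))
backtick_vars(formula(mod), vars)
## mpg ~ `log(wt)` * qsec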

We can avoid having to re-create the formula. Below, mm0 is the model matrix minus the intercept column; scale that, giving mm0_std. Then compute the new standardized lm:
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
response <- mod$model[1]                  # response column of the model frame
mm0 <- model.matrix(mod)[, -1]            # model matrix minus the intercept
mm0_std <- scale(mm0)                     # standardize every column
mod_std <- lm(cbind(response, mm0_std))   # lm on a data frame: first column ~ rest
If you do want the formula this will give it:
formula(mod_std)
## mpg ~ `log(wt)` + qsec + `log(wt):qsec`
## <environment: 0x000000000b1988c8>
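As a sanity check (my addition): since the response is left unscaled, the slopes of mod_std should equal the original slopes multiplied by each column's standard deviation.
all.equal(unname(coef(mod_std)[-1]),
          unname(coef(mod)[-1]) * apply(mm0, 2, sd))
## [1] TRUE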

I've thought of another potential solution as well, though I haven't tested it extensively and it uses regex, which as I understand it is not the most R-like way of doing things.
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
We have the usual start, above.
Now I pull the variable names from the terms object.
vars <- as.character(attributes(terms(mod))$variables)
vars <- vars[-1] # gets rid of "list"
And save the full formula as a string.
char_form <- as.character(deparse(formula(mod)))
Now I iterate through the variables and use regex to surround each one in backticks. This gets around the trickier regex I was worried about with regard to detecting which variables had functions applied.
for (var in vars) {
  backtick_name <- paste("`", var, "`", sep = "")
  char_form <- gsub(var, backtick_name, char_form, fixed = TRUE)
}
If I want to specify a variable not to standardize, like the outcome variable, I can exclude it from the vars vector programmatically. For instance, I can do this:
response <- as.character(formula(mod))[2]
vars <- vars[vars != response]
Of course, we can remove the response by dropping the first item in the list, but the above is for demonstrative purposes.
Now I can refit the model with the new data and new formula.
new_model <- update(mod, formula = as.formula(char_form), data = new_data)
In this narrow case, I don't really need to use update since I have all I need for lm. But if I was starting with a glm object or some other model, other user-supplied arguments like family are preserved.
Note: Weights and offsets can be problematic here, but it's not an intractable problem. I think the most straightforward thing to do is explicitly exclude columns named "(weights)" and "(offset)" from the model frame before scaling, then cbinding it back together afterwards. Then the user can use conditionals or some such to decide when to supply weights = `(weights)` and offset = `(offset)` arguments to update.
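A minimal sketch of that exclusion idea (untested across model types; it assumes the special columns are literally named "(weights)" and "(offset)" in the model frame):
mf <- model.frame(mod)
special <- names(mf) %in% c("(weights)", "(offset)")
scaled <- data.frame(lapply(mf[!special], scale), check.names = FALSE)
new_data <- cbind(scaled, mf[special])   # re-attach the unscaled special columns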

Related

How do I generate spline bases from a character vector of response variables?

I am working on a problem where I need to fit many additive models of the form y ~ s(x), where the response y is constant whereas the predictor x varies between models. I am using mgcv::smoothCon() to set up the bases, and lm() to fit the models. The reason I do this, rather than calling gam() directly, is that I need the unpenalized fits. My problem is that smoothCon() requires its object argument to be unquoted, e.g., s(x), and I wonder how I can generate such unquoted arguments from a character vector of variable names.
A minimal example can be illustrated using the mtcars dataset. The following snippet shows what I am able to do at the moment:
library(mgcv)
# Variables for which I want to create a smooth term s(x)
responses <- c("mpg", "disp")
# At the moment, this is the only solution which I am able to make work
bs <- list(
  smoothCon(s(mpg), data = mtcars),
  smoothCon(s(disp), data = mtcars)
)
It would be nicer to be able to generate bs using some functional programming approach. I imagine something like this, where foo() is my missing link:
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(foo(x), data = mtcars))
I have tried noquote() and as.symbol(), but both fail.
responses <- c("mpg", "disp")
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(noquote(x), data = mtcars))
#> Error: $ operator is invalid for atomic vectors
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(as.symbol(x), data = mtcars))
#> Error: object of type 'symbol' is not subsettable
We can do this by converting each string to a language object, evaluating it, and then applying smoothCon:
library(tidyverse)
out <- paste0("s(", responses, ")") %>%
  map(~ rlang::parse_expr(.x) %>%
        eval %>%
        smoothCon(., data = mtcars))
identical(out, bs)
#[1] TRUE
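For what it's worth, the same parse-and-eval idea works in base R without tidyverse (a sketch using str2lang(), available in R >= 3.6):
out2 <- lapply(paste0("s(", responses, ")"),
               function(x) smoothCon(eval(str2lang(x)), data = mtcars))
identical(out2, bs)
#[1] TRUE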
Why don't you try it like this?
smoothCon(s(get("disp")), data = mtcars)
Then, instead of disp, you give the name of the variable you prefer. You can even put this within a loop or any other construct you like.

Concise way to reference many columns matching regex, within a model formula

In certain datasets I have a large number of related variables which I want to use within a model such as lm, randomForest, xgboost, etc. It is not viable to type all of these out manually, but I can identify the columns with a regex match based on a common prefix, e.g. 'fruit_'.
I want to construct a formula with many terms along the lines of
outcome ~ fruit_banana + fruit_apple + fruit_pear + ....
However, referencing these in a formula is more complicated than I would expect. I have a method which works, but it feels cumbersome, and I am wondering if there is a more concise method.
Note, I am looking for a solution which doesn't involve manipulating the data frame itself, since in real cases I often want to add or drop other variables to the model quickly. Secondly, I do not want to reference columns by position, as this would not be reliable.
Example
In the dataset below I have one variable named 'outcome' which I wish to predict using 20 variables all beginning with 'a_', and I want to ignore the other 10 columns.
Create toy dataset:
df = data.frame(outcome = rnorm(100), matrix(rbinom(30*100, 1, 0.2), ncol = 30))
colnames(df)[2:21] = paste0('a_', 1:20)
My current method to construct the formula:
frm = as.formula(paste0("outcome ~ ",
                        paste(colnames(df)[grep('a_', colnames(df))], collapse = ' + ')))
lm(frm, data = df)
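One possibly more concise variant (my suggestion, using base R's reformulate(); note the ^ anchor so only columns whose names start with the prefix match):
vars <- grep("^a_", colnames(df), value = TRUE)   # columns with the 'a_' prefix
frm <- reformulate(vars, response = "outcome")    # builds outcome ~ a_1 + a_2 + ...
lm(frm, data = df)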

How to write a function to check model assumptions for a linear model in R?

I'm making a lot of models in R and trying to check the model assumptions for all of them. It would be awesome if I could write a function to do it all in one go, but it doesn't seem to be working.
I have:
assumptionfunction <- function(y, modelobject){
  plot(x)
  plot(y, x$residuals)
  qqnorm(x$residuals)
}
And I'm getting lots of errors.
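The immediate bug is that the body references x, which is never defined; the arguments are y and modelobject. A minimal corrected sketch (assuming modelobject is an lm fit):
assumptionfunction <- function(y, modelobject) {
  plot(modelobject)                  # standard lm diagnostic plots
  plot(y, residuals(modelobject))    # response vs. residuals
  qqnorm(residuals(modelobject))     # normal Q-Q plot of residuals
}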
Instead of creating your own function, you can use an existing one. The beautiful check_model() function from the performance package does just that:
library(performance)
library(see)
model <- lm(mpg ~ wt * cyl + gear, data = mtcars)
check_model(model)
If you insist on using some objective tests, there is the gvlma package.
library(gvlma)
gvlma(model)
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

                      Value p-value                Decision
Global Stat        1.770046  0.7780 Assumptions acceptable.
Skewness           0.746520  0.3876 Assumptions acceptable.
Kurtosis           0.003654  0.9518 Assumptions acceptable.
Link Function      0.927065  0.3356 Assumptions acceptable.
Heteroscedasticity 0.092807  0.7606 Assumptions acceptable.
Now, if you don't like gvlma because it doesn't explicitly name the tests used, and gives Skewness and Kurtosis but not overall normality from, say, Shapiro-Wilk, I made a convenience function. It gets all test names and assumptions at once, with the total number of assumptions that are not respected. You can take it and modify it to suit your needs.
# Load the function:
source("https://raw.githubusercontent.com/RemPsyc/niceplots/master/niceAssFunction.R")
View(niceAss(model))
Interpretation: (p) values < .05 imply assumptions are not respected.
Diagnostic is how many assumptions are not respected for a given model or variable.
Applied to a list of models:
# Define our dependent variables
(DV <- names(mtcars[-1]))
# Make list of all formulas
(formulas <- paste(DV, "~ mpg"))
# Make list of all models
models.list <- sapply(X = formulas, FUN = lm, data = mtcars, simplify = FALSE, USE.NAMES = TRUE)
# Make diagnostic table
(ass.table <- do.call("rbind", lapply(models.list, niceAss)))
# Use the Viewer for better results
View(ass.table)

Panel regression error in R

I am running an unbalanced panel regression.
The dependent variable is Gross.
The independent variables are DEX, GRW, Debt and Life.
Time is Year.
Grouping is Country.
I have successfully executed the following commands:
tino=read.delim("clipboard")
tino
summary(tino)
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
install.packages("plm")
library('plm')
pandata<-plm.data(tino)
tino
summary(pandata)
summary(Dep)
summary(Ind)
However, when I run the command below for results, I get an error.
pooling <- plm(Dep ~ Ind, data = pandata, model = "pooling")
It gives the error below:
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data,: invalid type (list) for variable 'Ind'
Please help.
Thanks
Without access to your data, it is impossible to confirm that this will work, but I am going to try to point out several issues in your code that are likely contributing to the error.
This line is fine:
tino=read.delim("clipboard")
Here is where you start to make errors:
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
with() is typically used to create new vectors out of a data.frame. All it does is allow you to drop the $ notation for referencing variables in a data.frame, nothing else. From reading your code, you may be thinking that with() modifies the tino object, which it does not.
Further, when you want to construct a data.frame for use in a regression model, you want all of the right-hand and left-hand side variables in one data.frame or matrix rather than separating them. This is because most modelling functions operate using a "formula" and data argument, which are passed to model.frame() to preprocess the data before modelling.
This means you presumably want to do something like the following, skipping all of the above:
pandata <- plm.data(tino, index = c("Country", "Year"))
pooling <- plm(Gross ~ DEX + GRW + Debt + Life, data = pandata, model = "pooling")
summary(pooling)
If you have a lot of right-hand side variables, you can subset your data.frame, with something like:
pandata2 <- plm.data(tino[ , c('Gross', 'DEX', 'GRW' , 'Debt', 'Life')], index = c("Country", "Year"))
pooling2 <- plm(Gross ~ ., data = pandata2, model = "pooling")
using the . notation as a shorthand for "all other columns in the data."

"Vectorizing" this for-loop in R? (suppressing interaction main effects in lm)

When interactions are specified in lm, R includes main effects by default, with no option to suppress them. This is usually appropriate and convenient, but there are certain instances (within estimators, ratio LHS variables, among others) where this isn't appropriate.
I've got this code that fits a log-transformed variable to a response variable, independently within subsets of the data.
Here is a silly yet reproducible example:
id = as.factor(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,6,7,7,8,8,8,9,9,9,9,10))
x = rexp(length(id))
y = rnorm(length(id))
logx = log(x)
data = data.frame(id,y,logx)
for (i in data$id){
  sub = subset(data, id==i)                #This splits the data by id
  m = lm(y~logx-1, data=sub)               #This gives me the linear (log) fit for one of my id's
  sub$x.tilde = log(1+3)*m$coef            #This linearizes it and gives me the expected value for x=3
  data$x.tilde[data$id==i] = sub$x.tilde   #This puts it back into the main dataset
  data$tildecoeff[data$id==i] = m$coef     #This saves the coefficient (I use it elsewhere for plotting)
}
I want to fit a model like the following:
Y = B(X*id) + e
with no intercept and no main effect of id. As you can see from the loop, I'm interested in the expectation of Y when X = 3, constraining the fit through the origin (because Y is a (logged) ratio of Y[X=something]/Y[X=0]).
But if I specify
m = lm(Y~X*as.factor(id)-1)
there is no means of suppressing the main effects of id. I need to run this loop several hundred times in an iterative algorithm, and as a loop it is far too slow.
The other upside of de-looping this code is that it'll be much more convenient to get prediction intervals.
(Please, I don't need pious comments about how leaving out main effects and intercepts is improper -- it usually is, but I can promise that it isn't in this instance).
Thanks in advance for any ideas!
I think you want
m <- lm(y ~ 0 + logx : as.factor(id))
see R-intro '11.1 Defining statistical models; formulae'
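Applied to the question's loop, that single fit can replace the whole iteration (a sketch, assuming the data frame from the example above, where id is already a factor; the coefficients come out in level order, so the level index maps each row to its slope):
m <- lm(y ~ 0 + logx:id, data = data)    # one slope per id, no main effects
slopes <- coef(m)[as.integer(data$id)]   # look up each row's id-specific slope
data$tildecoeff <- slopes
data$x.tilde <- log(1 + 3) * slopes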
