Best practice to avoid conflict with user-supplied variable names in R

I'm trying to create an R function for a package that will take user data and (the right-hand side of) a formula, do some processing, and return a model. However, I'm having trouble when the user data or formula contains variables with the same names as the ones I use internally. Here is a reproducible example.
(Note that updating the formula's environment is required to keep R from looking in the user's R_GlobalEnv for my variable y.)
# R Version 3.6.2
my_function <- function(user_data, user_formula){
  y <- as.numeric(user_data[,1] > mean(user_data[,1]))
  my_formula <- update.formula(user_formula, y ~ .)
  environment(my_formula) <- environment()
  my_model <- lm(my_formula, data = user_data, model = TRUE)
  return(my_model)
}
some_data <- data.frame(x1 = c(1,2,3,3))
some_formula <- response ~ x1
my_function(some_data, some_formula)
The above is what I want to run, and it works as long as there isn't a variable named "y" in user_formula or user_data. But when user_data contains a variable with that name, the model uses that variable instead of mine.
some_data <- data.frame(x1 = c(1,2,3,3), y = c(6,7,5,6))
some_formula <- response ~ x1 + y
my_function(some_data, some_formula)$model
# y x1
# 1 6 1
# 2 7 2
# 3 5 3
# 4 6 3
# Warning messages:
# 1: In model.matrix.default(mt, mf, contrasts) :
# the response appeared on the right-hand side and was dropped
# 2: In model.matrix.default(mt, mf, contrasts) :
# problem with term 2 in model.matrix: no columns are assigned
I tried forcing R to search the function's environment for y by using get(),
my_function <- function(user_data, user_formula){
y <- as.numeric(user_data[,1] > mean(user_data[,1]))
e1 <- environment()
my_formula <- update.formula(user_formula, get("y", e1) ~ .)
environment(my_formula) <- environment()
my_model <- lm(my_formula, data = user_data, model = TRUE)
return(my_model)
}
some_data <- data.frame(x1 = c(1,2,3,3), y = c(6,7,5,6))
some_formula <- response ~ x1 + y
my_function(some_data, some_formula)$model
# get("y", e1) x1 y
# 1 0 1 6
# 2 0 2 7
# 3 1 3 5
# 4 1 3 6
But this also fails if the user data has a variable with the same name as my internal environment name,
some_data <- data.frame(x1 = c(1,2,3,3), y = c(6,7,5,6), e1 = c(1,2,3,4))
some_formula <- response ~ x1 + y + e1
my_function(some_data, some_formula)$model
# Error in get("y", e1) : invalid 'envir' argument
What is the proper way to avoid overlapping my internal variables with user-supplied variable names? I'd prefer a method for base R if possible.

Per docs of lm, the data argument handles variables in formula in two ways that are NOT mutually exclusive:
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
Specifically, the y vector computed inside the function collides with a potential y column in the data frame. As you noticed, and as the docs above emphasize, lm uses the y column in data before the y vector in the formula's environment.
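That lookup order can be checked directly with model.frame(), which lm calls internally. This is only a minimal sketch; the objects d, y and z are made up for illustration and are not from the question.

```r
# Illustrative objects (not from the question)
d <- data.frame(x = 1:4, y = c(1, 2, 3, 4))
y <- c(10, 20, 30, 40)   # same name as a column in d
z <- c(10, 20, 30, 40)   # no column of this name in d

# y exists in d, so the column is used and the outer vector is ignored
mf_y <- model.frame(y ~ x, data = d)
mf_y$y

# z is not in d, so it is looked up in environment(formula)
mf_z <- model.frame(z ~ x, data = d)
mf_z$z
```

The first frame holds the column values from d, the second holds the values of the free-standing vector z.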
To avoid the naming conflict, consider storing the computed values in the data frame under a placeholder name such as response or dependent_variable, and use tryCatch to emit a message if the user supplies a data frame that already has a column with that name. This approach also lets you avoid resetting environment(formula).
my_function <- function(user_data, user_formula){
  tryCatch({
    user_data$response <- as.numeric(user_data[,1] > mean(user_data[,1]))
    my_formula <- update.formula(user_formula, response ~ .)
    my_model <- lm(my_formula, data = user_data, model = TRUE)
    return(my_model)
  }, warning = function(w) message("Please rename column 'response' in your data frame"),
  error = function(e) print(e)
  )
}
Output
some_data <- data.frame(x1 = c(1,2,3,3))
some_formula <- response ~ x1
my_function(some_data, some_formula)
# Call:
# lm(formula = my_formula, data = user_data, model = TRUE)
# Coefficients:
# (Intercept) x1
# -0.7273 0.5455
some_data <- data.frame(x1 = c(1,2,3,3), y = c(6,7,5,6), e1 = c(1,2,3,4))
some_formula <- response ~ x1 + y + e1
my_function(some_data, some_formula)
# Call:
# lm(formula = my_formula, data = user_data, model = TRUE)
# Coefficients:
# (Intercept) x1 y e1
# 1.667e+00 3.850e-16 -3.333e-01 3.333e-01
some_data <- data.frame(x1 = c(1,2,3,3), y = c(6,7,5,6), response=c(1,1,1,1))
some_formula <- response ~ x1 + y + response
my_function(some_data, some_formula)
# Please rename column 'response' in your data frame
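If asking the user to rename a column is not acceptable, one possible variation is to pick a response name that is guaranteed not to clash. The sketch below is an assumption on my part rather than part of the answer above: my_function_safe and the ".response" stem are hypothetical names, and the approach relies on base R's make.unique() to append a numeric suffix when the stem is already taken.

```r
# Hypothetical variant of my_function that picks an unused response name
my_function_safe <- function(user_data, user_formula) {
  # Names already used by the data or mentioned in the formula
  taken <- union(names(user_data), all.vars(user_formula))
  # make.unique() adds a numeric suffix to repeated entries, so the last
  # element is a name that does not occur in `taken`
  resp <- tail(make.unique(c(taken, ".response")), 1)
  user_data[[resp]] <- as.numeric(user_data[, 1] > mean(user_data[, 1]))
  # Swap in the new left-hand side, keeping the user's right-hand side
  my_formula <- update.formula(user_formula, as.formula(paste(resp, "~ .")))
  lm(my_formula, data = user_data, model = TRUE)
}

some_data <- data.frame(x1 = c(1, 2, 3, 3), y = c(6, 7, 5, 6))
fit <- my_function_safe(some_data, response ~ x1 + y)
names(fit$model)
```

The trade-off is that the response column in the returned model frame carries the generated name, so downstream code should read it from the model object instead of hard-coding it.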

Related

scoping/non-standard evaluation issue in glm's formula in a function in R

I have a function that computes a table and a model (and more...):
fun <- function(x, y, formula = y ~ x, data = NULL) {
out <- list()
out$tab <- table(x, y)
out$mod <- glm(formula = formula,
family = binomial,
data = data)
out
}
In the formula, I need to use x and y as provided in the function call (e.g. x = DF1$x and y = DF1$y) and variables from another data frame (e.g. a and b from DF2). It fails with my naive function:
fun(x = DF1$x,
y = DF1$y,
formula = y ~ x + a + b,
data = DF2)
# Error in eval(predvars, data, env) : object 'y' not found
How can I make glm search x and y from the function environment? I guess this issue is related to non-standard evaluation and/or scoping, but I have no idea how to fix it.
Data for the example:
smp <- function(x = c(TRUE, FALSE),
size = 1e2) {
sample(x = x,
size = size,
replace = TRUE)
}
DF1 <- data.frame(x = smp(),
y = smp())
DF2 <- data.frame(a = smp(x = LETTERS),
b = smp(x = LETTERS))
Why not just add x and y into data in the function?
fun <- function(x, y, formula = y ~ x, data = NULL) {
  # Only compare against nrow(data) when data is supplied; nrow(NULL) is NULL
  if(length(x) != length(y) ||
     (!is.null(data) && length(x) != nrow(data)))
    stop("x, y and data need to be the same length.\n")
  data$x <- x
  data$y <- y
  out <- list()
  out$tab <- table(x, y)
  out$mod <- glm(formula = formula,
                 family = binomial,
                 data = data)
  out
}
fun(x = DF1$x,
y = DF1$y,
formula = y ~ x + a + b,
data = DF2)
# $tab
# y
# x FALSE TRUE
# FALSE 27 29
# TRUE 21 23
#
# $mod
# Call: glm(formula = formula, family = binomial, data = data)
#
# Coefficients:
# (Intercept) xTRUE aB aC aD aE aF aG aH aI aJ
# 3.2761 -1.8197 0.3409 -93.9103 -2.0697 20.6813 -41.5963 -1.1078 18.5921 -1.0857 -36.5442
# aK aL aM aN aO aP aQ aR aS aT aU
# -0.5730 -92.5513 -3.0672 22.8989 -53.6200 -0.9450 0.4626 -3.0672 0.3570 -22.8857 1.8867
# aV aW aX aY aZ bB bC bD bE bF bG
# 2.5307 19.5447 -90.5693 -134.0656 -2.5943 -1.2333 20.7726 110.6790 17.1022 -0.5279 -1.2537
# bH bI bJ bK bL bM bN bO bP bQ bR
# -21.7750 114.0199 20.3766 -42.5031 41.1757 -24.3553 -2.0310 -25.9223 -2.9145 51.2537 70.2707
# bS bT bU bV bW bX bY bZ
# -4.7728 -3.7300 -2.0333 -0.3906 -0.5717 -4.0728 0.8155 -4.4021
#
# Degrees of Freedom: 99 Total (i.e. Null); 48 Residual
# Null Deviance: 138.5
# Residual Deviance: 57.73 AIC: 161.7
#
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
#
@DaveArmstrong's answer, which was already accepted, is correct. This answer explains why there was an error in the original version of the code.
@Thomas quoted the docs in a comment saying
If not found in data, the variables are taken from environment(formula), typically the environment from which glm is called.
The word "typically" is key here. The exact rule is that the environment attached to the formula is the one where the formula expression is first evaluated, because ~ is actually a function. It attaches the evaluation environment to the formula object, and that's the one that stays with it as you pass the object around.
If you run glm(y ~ x), the formula is evaluated wherever you call that, so that's the "typical" case.
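A small sketch of that rule, using a hypothetical helper make_formula() that is not part of the question: the formula remembers the environment in which `~` was evaluated, not the one it is later used from.

```r
# Hypothetical helper that builds a formula inside its own frame
make_formula <- function() {
  local_y <- c(0, 1, 0, 1)   # exists only in this call's frame
  local_y ~ x
}

here <- environment()        # the environment this code is running in
f_here <- y ~ x              # `~` evaluated here
f_inside <- make_formula()   # `~` evaluated inside make_formula()

identical(environment(f_here), here)       # TRUE
identical(environment(f_inside), here)     # FALSE
# The function's frame travels with the formula, so local_y is reachable
exists("local_y", envir = environment(f_inside), inherits = FALSE)  # TRUE
```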
In your example, you created the formula object when you called
fun(x = DF1$x,
y = DF1$y,
formula = y ~ x + a + b,
data = DF2)
That means the global environment (where you made this call) is attached to the formula, and there's no y there, so you got the error.
If you had used the default formula = y ~ x by calling
fun(x = DF1$x,
y = DF1$y,
data = DF2)
with no formula argument, it would work, because default arguments are evaluated in the evaluation frame of the function that uses them. Since fun() has local variables x and y created by the arguments, that would be fine.
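The difference between a default and a supplied formula can be seen with a toy function. show_env() is a made-up name for this sketch; it only reports whether the formula's environment is the function's own evaluation frame.

```r
# Hypothetical helper: is the formula's environment this call's frame?
show_env <- function(formula = y ~ x) {
  frame <- environment()
  identical(environment(formula), frame)
}

default_case <- show_env()         # default argument, evaluated inside the function
supplied_case <- show_env(y ~ x)   # supplied argument, evaluated in the caller
default_case    # TRUE
supplied_case   # FALSE
```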
You also asked why data = NULL would work in @DaveArmstrong's function. He added x and y to it using
data$x <- x
data$y <- y
If you start with data = NULL, the first line changes it to a list containing x and the second line adds a y component, so you end up with a list containing x and y and that's fine for data in glm().
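The coercion from NULL to a list is easy to check on its own; the values below are arbitrary.

```r
data <- NULL
data$x <- c(TRUE, FALSE, TRUE)    # NULL becomes a list with component x
data$y <- c(FALSE, FALSE, TRUE)   # y is added as a second component
str(data)
is.list(data)
```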

How to expand dots from model.frame with a character string in R

Given data and a variable name:
seed = 1253
dat = data.frame(x = c(1:4, NA), y = rnorm(5), extra = rnorm(5))
var = "extra"
I would like to create a model.frame with all three variables, when only two are specified in formula. This could be done with expanding dots as:
model.frame(y ~ x, dat, var = extra)
# y x (var)
# 1 1.0447865 1 1.4039139
# 2 1.8088280 2 -0.1656416
# 3 0.9614491 3 -0.8215288
# 4 -1.6359538 4 1.0751587
However, I need to be able to add columns to a model.frame from a character string. My attempt:
model.frame(y ~ x, dat, var = var)
returns an error message:
Error in model.frame.default(y ~ x, dat, var = var) :
variable lengths differ (found for '(var)')
How can I add additional variables to a model.frame from a character vector of column names? Alternatively, is it possible to expand model.response and model.matrix with variables that are not present in the formula?
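One possible approach, offered as a sketch and not as an accepted answer from this thread, is to build the formula itself from the character vector with base R's reformulate() and then call model.frame() on it. Note the assumption: the extra variable becomes an ordinary term of the formula instead of a "(var)" column, so this is not a drop-in replacement if the column must stay out of the model terms.

```r
set.seed(1253)
dat <- data.frame(x = c(1:4, NA), y = rnorm(5), extra = rnorm(5))
var <- "extra"

# Build y ~ x + extra from the character vector, then extract the frame
full_formula <- reformulate(c("x", var), response = "y")
mf <- model.frame(full_formula, data = dat)
names(mf)
```

As in the original call, the row with the missing x is dropped by the default na.action.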

R formula of type y ~ fn(type="x")+fn(type="z"), disregard the second occurrence of fn?

I have a model-building function where the formula can contain some functions, and I would like it to work so that if the user inputs the function several times, only the first occurrence is used, with a warning. For example, in lm, if we use the same variable twice, the second one is dropped:
y<-1:3
x<-1:3
lm(y~x+x)
Call:
lm(formula = y ~ x + x)
Coefficients:
(Intercept) x
0 1
This works because the terms function used in model.frame removes variables with identical names. But in my case I'm working with functions inside the formula that don't necessarily have identical argument lists, and I would like this behaviour to extend so that the arguments of these functions wouldn't matter:
model(y~x+fn("x"))
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("x")) #identical function calls
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("z")) #function with different argument value
Error in attr(all_terms, "variables")[[1 + ind_fn]] :
subscript out of bounds
Here is an example function (highly simplified) I used above:
model <- function(formula, data) {
#the beginning is pretty much copied from lm function
mf <- match.call(expand.dots = FALSE)
mf <- mf[c(1L, match(c("formula", "data"), names(mf), 0L))]
mf[[1L]] <- as.name("model.frame")
mf$na.action <- as.name("na.pass")
all_terms <- if (missing(data)){
terms(formula, "fn")
} else terms(formula, "fn", data = data)
#find the position of the function call in the formula
ind_fn <- attr(all_terms, "specials")$fn
#update the formula by removing the "fn" part
if(!is.null(ind_fn)){
fn_term <- attr(all_terms, "variables")[[1 + ind_fn]]
formula <- update( formula, paste(". ~ .-", deparse(fn_term,
width.cutoff = 500L, backtick = TRUE)))
mf$formula<-formula
}
# build y and X
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
mt <- attr(mf, "terms")
X <- model.matrix(mt, mf)
#if fn was in formula do something with it
if (!is.null(ind_fn)){
foobar<-function(type=c("x","z")){
if(type=="x"){
rep(1,nrow(X))
} else rep(0,nrow(X))
}
fn_term[[1]]<-as.name("foobar")
temp<-eval(fn_term)
X<-cbind(X,temp)
}
X
}
I could check the names of the specials (the function calls) and rename them to be identical to the first occurrence, but I was wondering if there is a cleverer way of dealing with this?
I wasn't able to get your code to work correctly, but assuming I've understood your task, perhaps something like this accomplishes what you're after.
f <- y ~ x + fn("x") + fn("z") + z + fn('a')
# get list of terms
vars <- as.list(attr(terms(f), 'variables'))
# get those terms that are duplicate calls
redundant <- vars[sapply(vars, is.call) & duplicated(sapply(vars, function(x) as.list(x)[[1]]))]
# remove the duplicate calls from the formula
update(f, paste(". ~ .", paste(sapply(redundant, deparse), collapse='-'), sep='-'))
# y ~ x + fn("x") + z

Is there a function or package which will simulate predictions for an object returned from lm()?

Is there a single function, similar to "runif", "rnorm" and the like which will produce simulated predictions for a linear model? I can code it on my own, but the code is ugly and I assume that this is something someone has done before.
slope = 1.5
intercept = 0
x = as.numeric(1:10)
e = rnorm(10, mean=0, sd = 1)
y = slope * x + intercept + e
df = data.frame(x = x, y = y) # df was not defined in the original snippet
fit = lm(y ~ x, data = df)
newX = data.frame(x = as.numeric(11:15))
What I'm interested in is a function that looks like the line below:
sims = rlm(1000, fit, newX)
That function would return 1000 simulations of y values, based on the new x variables.
Showing that Gavin Simpson's suggestion of modifying stats:::simulate.lm is a viable one.
## Modify stats:::simulate.lm by inserting some tracing code immediately
## following the line that reads "ftd <- fitted(object)"
trace(what = stats:::simulate.lm,
tracer = quote(ftd <- list(...)[["XX"]]),
at = list(6))
## Prepare the data and 'fit' object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Define new covariate values and compute their predicted/fitted values
newX <- 8:1
newFitted <- predict(fit, newdata = data.frame(x = newX))
## Pass in fitted via the argument 'XX'
simulate(fit, nsim = 4, XX = newFitted)
# sim_1 sim_2 sim_3 sim_4
# 1 11.0910257 11.018211 10.95988582 13.398902
# 2 12.3802903 10.589807 10.54324607 11.728212
# 3 8.0546746 9.925670 8.14115433 9.039556
# 4 6.4511230 8.136040 7.59675948 7.892622
# 5 6.2333459 3.131931 5.63671024 7.645412
# 6 3.7449859 4.686575 3.45079655 5.324567
# 7 2.9204519 3.417646 2.05988078 4.453807
# 8 -0.5781599 -1.799643 -0.06848592 0.926204
That works, but this is a cleaner (and likely better) approach:
## A function for simulating at new x-values
simulateX <- function(object, nsim = 1, seed = NULL, X, ...) {
object$fitted.values <- predict(object, X)
simulate(object = object, nsim = nsim, seed = seed, ...)
}
## Prepare example data and a fit object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Supply new x-values in a data.frame of the form expected by
## the newdata= argument of predict.lm()
newX <- data.frame(x = 8:1)
## Try it out
simulateX(fit, nsim = 4, X = newX)
# sim_1 sim_2 sim_3 sim_4
# 1 11.485024 11.901787 10.483908 10.818793
# 2 10.990132 11.053870 9.181760 10.599413
# 3 7.899568 9.495389 10.097445 8.544523
# 4 8.259909 7.195572 6.882878 7.580064
# 5 5.542428 6.574177 4.986223 6.289376
# 6 5.622131 6.341748 4.929637 4.545572
# 7 3.277023 2.868446 4.119017 2.609147
# 8 1.296182 1.607852 1.999305 2.598428
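For comparison, the same idea can be written out by hand. This is a rough sketch under the assumption of an ordinary unweighted lm fit: rlm_sketch is a hypothetical name echoing the rlm() call the question wished for, and the draws add normal noise with the residual standard error to the point predictions, ignoring uncertainty in the estimated coefficients.

```r
# Hypothetical helper: draw nsim response vectors at the rows of newdata
rlm_sketch <- function(nsim, object, newdata) {
  mu <- predict(object, newdata = newdata)   # point predictions
  s <- sigma(object)                         # residual standard error
  # One column per simulation, one row per row of newdata
  replicate(nsim, mu + rnorm(length(mu), sd = s))
}

set.seed(1)
x <- 1:10
df <- data.frame(x = x, y = 1.5 * x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
sims <- rlm_sketch(1000, fit, data.frame(x = 11:15))
dim(sims)
```

The result is a matrix with one row per new x value and one column per simulation.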

formula error inside function

I want to use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem? Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following code works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
That is why basehaz cannot find the variable that is local to the function.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x, envir = .GlobalEnv) # remove the global copy created by assign(), not the local argument
}
BaseFun(fmla)
Other solutions may involve dealing with the environments more directly.
I'm following up on @moli's comment to @aatrujillob's answer. They were helpful, so I thought I would explain how they solved things for me in a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function.
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
Documentation for the model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!
