How can I fit a linear model inside a user defined function in R; error non-numeric argument when specifying response variable? - r

I am trying to de-clutter some scripts by creating functions to complete repetitive tasks in R. One task I complete repeatedly is fitting a linear model to a set of data and creating predictions from that linear model fit. The data I am working with is concentration and flow data from streams, and flow is always the explanatory variable but the response variable changes and therefore I would like to include it as a function input. However, I receive a "non-numeric argument to mathematical function" error when I run the function. I have tried both with and without quotes since the lm() call does not require quotes but that results in the classis "object 'myobject' not found". Here's a simple example.
Update
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter2_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(as.formula(paste('log(', parameter, ') ~ log(', flow, ')')), data=modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
Original Example Error
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter1_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(log(parameter)~log(flow), data = modeldata)
newflow <- data.frame(flow = seq(0, maxflow, flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)

There are 3 issues here. The main one is that log(parameter) in your lm formula does not get substituted for the variable passed in as parameter. That means lm is literally looking for a column called parameter in your data, which doesn't exist. You can fix this by creating a formula with the name substituted in. Although doing this with strings is the most commonly used method to do this, it is a bit more efficient and safer to use substitute. This also allows you to pass your column name without quotes.
The second issue is that the arguments maxflow and flowint should probably be pred_maxflow and pred_flowint to match your function parameters.
Thirdly, using the <<- operator to write to a variable in the calling frame is bad practice. R users expect functions not to have such side effects, and know to store the output of function calls to variables under their control. Only in very rare circumstances should this be done within the function.
Putting all this together, we have:
regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
f <- `[[<-`(x ~ log(flow), 2, substitute(log(parameter)))
mod <- lm(f, data = modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
predict(mod, newdata = newflow, interval = 'prediction')
}
And we would call the function like this:
preds <- regr_func(modeldata = dat,
parameter = parameter1_conc,
pred_maxflow = 20,
pred_flowint = 0.001)
resulting in:
head(preds)
#> fit lwr upr
#> 1 Inf NaN NaN
#> 2 3.365491 2.188942 4.542041
#> 3 3.312636 2.219223 4.406049
#> 4 3.281717 2.236294 4.327140
#> 5 3.259780 2.248073 4.271488
#> 6 3.242765 2.256998 4.228531
Created on 2022-06-03 by the reprex package (v2.0.1)

Related

Can't give a subset when using randomForest inside a function

I'm wanting to create a function that uses within it the randomForest function from the randomForest package. This takes the "subset" argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling the randomForest function in another defined function, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
test.data = data.frame(
type = rep(NA, times = 500),
x = runif(500),
y = runif(500),
z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(test.data$type)] = "B"
train.data$type = as.factor(test.data$type)
# define the training range
training.range = sample(500)[1:300]
# formula to use
tr_form = formula(type ~ x + y + z)
# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
p = randomForest(
formula = form,
data = all_data,
subset = tr_subset,
na.action = na.omit
)
return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest function, and tr_subset is removed from the train_rf function, this code runs fine, however the whole data set is used for training!
It should be noted that using the subset argument in randomForest when not defined in another function works completely fine, and is the intended method for the function, as described in the vignette linked above.
I know in the mean time I could just define another training set that has just the row numbers required, and train using all of that, but is there a reason why my original code doesn't work please?
Thanks.
EDIT: I conjecture that, as subset() is a base R function, R is getting confused and thinking you're wanting to use the base R function rather than defining an argument of the randomForest function. I'm not an expert, though, so I may be wrong.

R C5.0 Undefined columns selected when using a formula stored in a variable

I'm using the C50 R Package for predicting with decision trees.
I have the following code:
library("partykit")
library("C50")
//Creates a sample data frame
data <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
//This is the variable I want to predict
variable <- "ID"
//First I convert the column to factor
data[, variable] <- factor(data[, variable])
//Then I create the formula
formula <- as.formula(paste(variable, " ~ ."))
//And finally I fit the model with the formula
model <- C5.0(formula, data=data, trials=10)
Until this point everything is ok, the problem comes here, when I try to plot the tree:
png(filename = paste("test.png"), width = 800, height = 60)
plot(model) //This line throws the error: Error in `[.data.frame`(mf, rsp) : undefined columns selected
dev.off()
But if I change the line:
model <- C5.0(formula, data=data, trials=10)
To:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok.
After doing a bit of debugging I have this adicional info:
Inside C5.0 function there is this code:
call <- match.call()
If I look inside call this is what I get:
C5.0.formula(formula = formula, data = data, trials = 10)
But if the call to C5.0 is:
model <- C5.0(ID ~ ., data=data, trials=10)
Then the call object is this:
C5.0.formula(formula = ID ~ ., data = data, trials = 10)
This may seem normal, but debugging the plot() function I've seen that in some point the function as.party(x, trial = trial) is called, where x is the C5.0 object. Inside as.party() function there is another call, the model.frame(obj) call, where obj is the C5.0 object, and here is the problem. Inside the model.frame() function I found this line:
rsp <- strsplit(paste(formula$call[2]), " ")[[1]][1]
Remember? The error had a reference to that rsp variable. And the problem is that formula$call has 2 different values. If I make the first call to C.5 like this one:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok as formula$call contains C5.0.formula(formula = ID ~ ., data = data, trials = 10) and rap is ID so the next call to:
tmp <- mf[rsp]
Is executed without problems (mf is the initial data frame).
But with the call:
C5.0.formula(formula = formula, data = data, trials = 10)
The formula$call object contains:
C5.0.formula(formula = formula, data = data, trials = 10)
And rsp is "formula" so the line:
tmp <- mf[rsp]
Fails as there is no "formula" column in the data frame.
Is this the expected behavior? If it is, there is no way I can call C5.0 with a formula stored in a variable?
I need to do the call that way in order to test the algorithm with many different formulas.
Any help would be appreciated. Thank you all in advance.
Unfortunately it is a bug. See issue 8 on github. Looks like Max Kuhn (developer of C50) hasn't had the time to look into it, since the bug report is from August. You might want to attach your issue there as well. That might bring it back to the attention of the developer.

Inner functions pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two rows of the leaps function in question goes:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to statitical learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it’s similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to a very unwieldy output, it puts the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))

Why is gam::step.gam returning NULL with forward selection?

I have a large Generalized Additive Model (GAM) made from 10K observations with ~ 100 variables. Building the model with forward stepwise selection results in an object of class "NULL". Why might this be and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
),
" + ",
paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)
I suspect the issue is with the mygam object.
Explanation
If you read the help(step.gam) it has this paragraph in the explanation of scope argument:
The supplied model ‘object’ is used as the starting model,
and hence there is the requirement that one term from each of
the term formulas be present in ‘formula(object)’. This also
implies that any terms in ‘formula(object)’ not contained
in any of the term formulas will be forced to be present in
every model considered. The function ‘gam.scope’ is helpful
for generating the scope argument for a large model.
In essence this says that the first argument passed to step.gam function (mygam in this case) will have a formula and that formula will be used as a starting model for the stepwise procedure.
Since here we have forward stepwise - it cannot start from the full model, because in that case there is nothing left to add.
Exploring The Code
This idea is reinforced if we look at the code. The code of step.gam function has this loop that runs in case of forward selection.
if (forward) {
trial <- items
trial[i] <- trial[i] + 1
if (trial[i] <= term.lengths[i] && !get.visit(trial,
visited)) {
visited <- cbind(visited, trial)
tform.vector <- form.vector
tform.vector[i] <- scope[[i]][trial[i]]
form.list = c(form.list, list(list(trial = trial,
form.vector = tform.vector, which = i)))
}
}
Notice that the loop executes only when the inner if statement is TRUE. And that if statement seems to check if you have potential variables in your scope (term.length) that are not yet in your model (items, trial). If you don't - the loop skips.
Since in your case the loop never executes it doesn't form the return object and the procedure returns NULL.
The Solution
Given all the above - the solution is to not start with the complete formula when using forward selection method. Here for the demonstration I will be using the intercept-only model as a starting model:
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)
The last line is the only change that needs to be made. Everything else is the same as in the original post:
myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")
And now the procedure works.

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your formula, lm.cars, in was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
Another problem that can cause this is if one passes a character (string vector) to lm instead of a formula. vectors have no environment, and so when lm converts the character to a formula, it apparently also has no environment instead of being automatically assigned the local environment. If one then uses an object as weights that is not in the data argument data.frame, but is in the local function argument, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one used the name weights for the weights variable. In this case, since lm can't find the weights object, it defaults to the function weights() from base thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.

Resources