Why is gam::step.gam returning NULL with forward selection? - r

I have a large Generalized Additive Model (GAM) made from 10K observations with ~ 100 variables. Building the model with forward stepwise selection results in an object of class "NULL". Why might this be and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
),
" + ",
paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)

I suspect the issue is with the mygam object.
Explanation
If you read the help(step.gam) it has this paragraph in the explanation of scope argument:
The supplied model ‘object’ is used as the starting model,
and hence there is the requirement that one term from each of
the term formulas be present in ‘formula(object)’. This also
implies that any terms in ‘formula(object)’ not contained
in any of the term formulas will be forced to be present in
every model considered. The function ‘gam.scope’ is helpful
for generating the scope argument for a large model.
In essence this says that the first argument passed to step.gam function (mygam in this case) will have a formula and that formula will be used as a starting model for the stepwise procedure.
Since here we have forward stepwise - it cannot start from the full model, because in that case there is nothing left to add.
Exploring The Code
This idea is reinforced if we look at the code. The code of step.gam function has this loop that runs in case of forward selection.
if (forward) {
trial <- items
trial[i] <- trial[i] + 1
if (trial[i] <= term.lengths[i] && !get.visit(trial,
visited)) {
visited <- cbind(visited, trial)
tform.vector <- form.vector
tform.vector[i] <- scope[[i]][trial[i]]
form.list = c(form.list, list(list(trial = trial,
form.vector = tform.vector, which = i)))
}
}
Notice that the loop executes only when the inner if statement is TRUE. And that if statement seems to check if you have potential variables in your scope (term.length) that are not yet in your model (items, trial). If you don't - the loop skips.
Since in your case the loop never executes it doesn't form the return object and the procedure returns NULL.
The Solution
Given all the above - the solution is to not start with the complete formula when using forward selection method. Here for the demonstration I will be using the intercept-only model as a starting model:
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)
The last line is the only change that needs to be made. Everything else is the same as in the original post:
myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")
And now the procedure works.

Related

How can I fit a linear model inside a user defined function in R; error non-numeric argument when specifying response variable?

I am trying to de-clutter some scripts by creating functions to complete repetitive tasks in R. One task I complete repeatedly is fitting a linear model to a set of data and creating predictions from that linear model fit. The data I am working with is concentration and flow data from streams, and flow is always the explanatory variable but the response variable changes and therefore I would like to include it as a function input. However, I receive a "non-numeric argument to mathematical function" error when I run the function. I have tried both with and without quotes since the lm() call does not require quotes but that results in the classis "object 'myobject' not found". Here's a simple example.
Update
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter2_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(as.formula(paste('log(', parameter, ') ~ log(', flow, ')')), data=modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
Original Example Error
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter1_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(log(parameter)~log(flow), data = modeldata)
newflow <- data.frame(flow = seq(0, maxflow, flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
There are 3 issues here. The main one is that log(parameter) in your lm formula does not get substituted for the variable passed in as parameter. That means lm is literally looking for a column called parameter in your data, which doesn't exist. You can fix this by creating a formula with the name substituted in. Although doing this with strings is the most commonly used method to do this, it is a bit more efficient and safer to use substitute. This also allows you to pass your column name without quotes.
The second issue is that the arguments maxflow and flowint should probably be pred_maxflow and pred_flowint to match your function parameters.
Thirdly, using the <<- operator to write to a variable in the calling frame is bad practice. R users expect functions not to have such side effects, and know to store the output of function calls to variables under their control. Only in very rare circumstances should this be done within the function.
Putting all this together, we have:
regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
f <- `[[<-`(x ~ log(flow), 2, substitute(log(parameter)))
mod <- lm(f, data = modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
predict(mod, newdata = newflow, interval = 'prediction')
}
And we would call the function like this:
preds <- regr_func(modeldata = dat,
parameter = parameter1_conc,
pred_maxflow = 20,
pred_flowint = 0.001)
resulting in:
head(preds)
#> fit lwr upr
#> 1 Inf NaN NaN
#> 2 3.365491 2.188942 4.542041
#> 3 3.312636 2.219223 4.406049
#> 4 3.281717 2.236294 4.327140
#> 5 3.259780 2.248073 4.271488
#> 6 3.242765 2.256998 4.228531
Created on 2022-06-03 by the reprex package (v2.0.1)

Include an object within a function only if it exists

I have a loop that needs to be executed; within which are 6 models. The objects that those models are stored in then need to get passed into a function that executes an AIC analysis. However, sometimes one of the models does not work, which then breaks the code for the AIC function because it does not recognize whatever model that failed because it was not stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(dataformula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
create df to fill with best models from AIC analyses
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
for (i in 1:length(my.list)){
mod.1 = lm(y ~ x, data = my.list[[i]])
mod.2 = lm(y ~ x + z, data = my.list[[i]])
mod.3 = lm(y ~ i, data = my.list[[i]])
bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1]#bestmodel is 1st row
}
print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the game function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations. I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor".
Use aes_string instead of aes when you pass a column name as string.
plot <- ggplot(data = test, mapping = aes_string(x=predictor, y=ViolentCrimesPerPop))
For gam function:: Example which is copied from gam function's documentation. I have used vector, scalar is even easier. Its just using paste with a collapse parameter.
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

Inner functions pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two rows of the leaps function in question goes:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to statitical learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it’s similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to a very unwieldy output, it puts the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))

entering variable name as argument of a function

I'm trying to make a function that will run and compare a set of models given a dataset and a variable name (essentially to be able to change just one model set and have them apply to all relevant dependent variables--selecting an a priori modelset to compare rather than using a data-dredging existing function like glmulti). A simple example:
RunModelset<- function(dataset, response)
{
m1 <- lm(formula=response ~ 1, data=dataset)
m2 <- lm(formula=response ~ 1 + temperature, data=dataset)
comp <- AICctab(m1,m2, base = T, weights = T, nobs=length(data))
return(comp)
}
If I manually enter a specific variable name within the function, it runs the models correctly. However, using the code above and entering a text value for the response argument doesn't work:
RunModel(dataset=MyData,response="responsevariablename")
yields an error: invalid type (NULL) for variable 'dataset$response', which I interpret to mean it isn't finding the column I'm telling it to use. My problem must be in how R inserts a text value as an argument in the function.
How do I enter the response variable name so R knows that "formula=response ~" becomes "formula=dataset$responsevariablename ~"?
ETA Working answer based on this solution:
RunModel<- function(dataset, response)
{
resvar <- eval(substitute(response),dataset)
m1 <- lm(formula=resvar ~ 1, data=dataset)
m2 <- lm(formula=resvar ~ 1 + R.biomass, data=dataset)
comp <- AICctab(m1,m2, base = T, weights = T, nobs=length(data))
return(comp)
}
RunModel(dataset=MyData,response=responsevariablename)
NB - this didn't work when I had quotes on the response argument.
You should be able to use match.call() to achieve this.
See this post

Resources