R C5.0 Undefined columns selected when using a formula stored in a variable - r

I'm using the C50 R Package for predicting with decision trees.
I have the following code:
library("partykit")
library("C50")
//Creates a sample data frame
data <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
//This is the variable I want to predict
variable <- "ID"
//First I convert the column to factor
data[, variable] <- factor(data[, variable])
//Then I create the formula
formula <- as.formula(paste(variable, " ~ ."))
//And finally I fit the model with the formula
model <- C5.0(formula, data=data, trials=10)
Until this point everything is ok, the problem comes here, when I try to plot the tree:
png(filename = paste("test.png"), width = 800, height = 60)
plot(model) //This line throws the error: Error in `[.data.frame`(mf, rsp) : undefined columns selected
dev.off()
But if I change the line:
model <- C5.0(formula, data=data, trials=10)
To:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok.
After doing a bit of debugging I have this adicional info:
Inside C5.0 function there is this code:
call <- match.call()
If I look inside call this is what I get:
C5.0.formula(formula = formula, data = data, trials = 10)
But if the call to C5.0 is:
model <- C5.0(ID ~ ., data=data, trials=10)
Then the call object is this:
C5.0.formula(formula = ID ~ ., data = data, trials = 10)
This may seem normal, but debugging the plot() function I've seen that in some point the function as.party(x, trial = trial) is called, where x is the C5.0 object. Inside as.party() function there is another call, the model.frame(obj) call, where obj is the C5.0 object, and here is the problem. Inside the model.frame() function I found this line:
rsp <- strsplit(paste(formula$call[2]), " ")[[1]][1]
Remember? The error had a reference to that rsp variable. And the problem is that formula$call has 2 different values. If I make the first call to C.5 like this one:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok as formula$call contains C5.0.formula(formula = ID ~ ., data = data, trials = 10) and rap is ID so the next call to:
tmp <- mf[rsp]
Is executed without problems (mf is the initial data frame).
But with the call:
C5.0.formula(formula = formula, data = data, trials = 10)
The formula$call object contains:
C5.0.formula(formula = formula, data = data, trials = 10)
And rsp is "formula" so the line:
tmp <- mf[rsp]
Fails as there is no "formula" column in the data frame.
Is this the expected behavior? If it is, there is no way I can call C5.0 with a formula stored in a variable?
I need to do the call that way in order to test the algorithm with many different formulas.
Any help would be appreciated. Thank you all in advance.

Unfortunately it is a bug. See issue 8 on github. Looks like Max Kuhn (developer of C50) hasn't had the time to look into it, since the bug report is from August. You might want to attach your issue there as well. That might bring it back to the attention of the developer.

Related

Make a model matrix if missing the response variable and where matrix multiplication recreates the predict function

I want to create a model matrix for a test dataset which is missing the response variable, and where I can perfectly replicate the results of calling predict() on the model if building predictions using matrix multiplication. See code below for example.
I have code which can do this (again, see below for example), but it requires that I create a placeholder response variable in my test data. This doesn't seem very clean, and I'm wondering if there's a way to get the code to work without this workaround.
# Make data, fit model
set.seed(1); df_train = data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test = data.frame(x = rnorm(10), z = rnorm(10))
fit = lm(y ~ poly(x) + poly(z), data = df_train)
# Make model matrices. Get error for the test data as 'y' isnt found
mm_train = model.matrix(terms(fit), df_train)
mm_test = model.matrix(terms(fit), df_test) #"Error in eval(predvars, data, env) : object 'y' not found"
# Make fake y variable for test data then build model matrix. I want to know if there's a less hacky way to do this
df_test$y = 1
mm_test = model.matrix(terms(fit), df_test)
# Check predict and matrix multiplication give identical results on test data. NB this is not the case if contstructing the model matrix using (e.g.) mm_test = model.matrix(formula(fit)[-2], df_test) for the reason outlined here https://stackoverflow.com/questions/59462820/why-are-predict-lm-and-matrix-multiplication-giving-different-predictions.
preds_1 = round(predict(fit, df_test), 5)
preds_2 = round(mm_test %*% fit$coefficients, 5)
all(preds_1 == preds_2) #TRUE

How can I fit a linear model inside a user defined function in R; error non-numeric argument when specifying response variable?

I am trying to de-clutter some scripts by creating functions to complete repetitive tasks in R. One task I complete repeatedly is fitting a linear model to a set of data and creating predictions from that linear model fit. The data I am working with is concentration and flow data from streams, and flow is always the explanatory variable but the response variable changes and therefore I would like to include it as a function input. However, I receive a "non-numeric argument to mathematical function" error when I run the function. I have tried both with and without quotes since the lm() call does not require quotes but that results in the classis "object 'myobject' not found". Here's a simple example.
Update
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter2_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(as.formula(paste('log(', parameter, ') ~ log(', flow, ')')), data=modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
Original Example Error
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter1_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(log(parameter)~log(flow), data = modeldata)
newflow <- data.frame(flow = seq(0, maxflow, flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
There are 3 issues here. The main one is that log(parameter) in your lm formula does not get substituted for the variable passed in as parameter. That means lm is literally looking for a column called parameter in your data, which doesn't exist. You can fix this by creating a formula with the name substituted in. Although doing this with strings is the most commonly used method to do this, it is a bit more efficient and safer to use substitute. This also allows you to pass your column name without quotes.
The second issue is that the arguments maxflow and flowint should probably be pred_maxflow and pred_flowint to match your function parameters.
Thirdly, using the <<- operator to write to a variable in the calling frame is bad practice. R users expect functions not to have such side effects, and know to store the output of function calls to variables under their control. Only in very rare circumstances should this be done within the function.
Putting all this together, we have:
regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
f <- `[[<-`(x ~ log(flow), 2, substitute(log(parameter)))
mod <- lm(f, data = modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
predict(mod, newdata = newflow, interval = 'prediction')
}
And we would call the function like this:
preds <- regr_func(modeldata = dat,
parameter = parameter1_conc,
pred_maxflow = 20,
pred_flowint = 0.001)
resulting in:
head(preds)
#> fit lwr upr
#> 1 Inf NaN NaN
#> 2 3.365491 2.188942 4.542041
#> 3 3.312636 2.219223 4.406049
#> 4 3.281717 2.236294 4.327140
#> 5 3.259780 2.248073 4.271488
#> 6 3.242765 2.256998 4.228531
Created on 2022-06-03 by the reprex package (v2.0.1)

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
predict function looks for a variable called speed but when you subset it with $ sign it has no name anymore.
so, this variant of prediction works;
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
or keep the original name as is;
predict(cars.gb, newdata = cars_new['speed'])

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the game function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations. I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor".
Use aes_string instead of aes when you pass a column name as string.
plot <- ggplot(data = test, mapping = aes_string(x=predictor, y=ViolentCrimesPerPop))
For gam function:: Example which is copied from gam function's documentation. I have used vector, scalar is even easier. Its just using paste with a collapse parameter.
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

Why is gam::step.gam returning NULL with forward selection?

I have a large Generalized Additive Model (GAM) made from 10K observations with ~ 100 variables. Building the model with forward stepwise selection results in an object of class "NULL". Why might this be and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
),
" + ",
paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)
I suspect the issue is with the mygam object.
Explanation
If you read the help(step.gam) it has this paragraph in the explanation of scope argument:
The supplied model ‘object’ is used as the starting model,
and hence there is the requirement that one term from each of
the term formulas be present in ‘formula(object)’. This also
implies that any terms in ‘formula(object)’ not contained
in any of the term formulas will be forced to be present in
every model considered. The function ‘gam.scope’ is helpful
for generating the scope argument for a large model.
In essence this says that the first argument passed to step.gam function (mygam in this case) will have a formula and that formula will be used as a starting model for the stepwise procedure.
Since here we have forward stepwise - it cannot start from the full model, because in that case there is nothing left to add.
Exploring The Code
This idea is reinforced if we look at the code. The code of step.gam function has this loop that runs in case of forward selection.
if (forward) {
trial <- items
trial[i] <- trial[i] + 1
if (trial[i] <= term.lengths[i] && !get.visit(trial,
visited)) {
visited <- cbind(visited, trial)
tform.vector <- form.vector
tform.vector[i] <- scope[[i]][trial[i]]
form.list = c(form.list, list(list(trial = trial,
form.vector = tform.vector, which = i)))
}
}
Notice that the loop executes only when the inner if statement is TRUE. And that if statement seems to check if you have potential variables in your scope (term.length) that are not yet in your model (items, trial). If you don't - the loop skips.
Since in your case the loop never executes it doesn't form the return object and the procedure returns NULL.
The Solution
Given all the above - the solution is to not start with the complete formula when using forward selection method. Here for the demonstration I will be using the intercept-only model as a starting model:
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)
The last line is the only change that needs to be made. Everything else is the same as in the original post:
myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")
And now the procedure works.

Resources