Error message when using lrm() and validate() from the rms package - r

I am trying to use validate() from the rms package but get an error. More specifically, I fit an ordinal logistic model using lrm() and then assess my results with validate().
Please find some code below:
library(data.table)
library(rms)
DT2 <- data.table(internal_model_rating_number = c(5,5,5,5,6,6,5,5,5,5),
ratio = c(2.0665194,1.2264998,1.0333628,0.6936382,-0.1883890,
-0.2349949,-0.5062086,-0.5204016,-0.4401635,-0.5824366))
lrm.model <- lrm(internal_model_rating_number ~ ratio,
x = T,
y = T,
data = DT,
maxit = 1000)
validate(lrm.model, group=internal_model_rating_number, B=200, bw=T)
Note that the group argument when running the code for my actual data (not displayed) is necessary, because some categories of the target variable occur very rarely.
The validate() call returns the error:
Error in predab.resample(fit, method = method, fit = lrmfit, measure = discrim, :
object 'internal_model_rating_number' not found
Do you know how I can solve this error?

I think the grouping variable has to be qualified with the data set it comes from, since validate() does not seem to look it up inside the model's data.
Try this:
validate(lrm.model, group = DT2$internal_model_rating_number, B=200, bw=T)
You also had a small typo: use data = DT2 instead of data = DT.
If you don't want to specify the data again, you can use attach(DT2) and then run your model.
library(data.table)
library(rms)
DT2 <- data.table(internal_model_rating_number = c(5,5,5,5,6,6,5,5,5,5),
ratio = c(2.0665194,1.2264998,1.0333628,0.6936382,-0.1883890,
-0.2349949,-0.5062086,-0.5204016,-0.4401635,-0.5824366))
attach(DT2)
lrm.model <- lrm(internal_model_rating_number ~ ratio,
x = T,
y = T,
data = DT2,
maxit = 1000)
validate(lrm.model, group = internal_model_rating_number, B=200, bw=T)
Once you're done, if you'd like to detach the data, use detach(DT2)
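To see why the bare column name fails, here is a minimal base R sketch. count_groups() is a hypothetical stand-in for any function that receives group as an ordinary argument, which is what the error from predab.resample suggests validate() does; it is not part of rms, and the small data frame is made up for illustration.

```r
# Hypothetical stand-in: evaluates `group` like any ordinary argument,
# i.e. in the caller's environment, not inside a data set.
count_groups <- function(group) table(group)

# Made-up data, for illustration only.
df <- data.frame(rating = c(5, 5, 6, 6, 5))

# Fails: `rating` is a column of df, not an object the caller can see.
res_bare <- tryCatch(count_groups(rating),
                     error = function(e) conditionMessage(e))

# Works: qualify the column with its data set ...
res_dollar <- count_groups(df$rating)

# ... or evaluate the whole call inside the data with with().
res_with <- with(df, count_groups(rating))
```

The with() form is an alternative to attach() that does not leave the data on the search path afterwards.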

Related

Extracting the relative influence from a gbm.fit object

I am trying to extract the relative influence of each variable from a gbm.fit object but it is coming up with the error below:
> summary(boost_cox, plotit = FALSE)
Error in data.frame(var = object$var.names[i], rel.inf = rel.inf[i]) :
row names contain missing values
The boost_cox object itself is fitted as follows:
boost_cox = gbm.fit(x = x,
y = y,
distribution="coxph",
verbose = FALSE,
keep.data = TRUE)
I have to use the gbm.fit function rather than the standard gbm function due to the large number of predictors (26k+)
I have now solved this issue myself.
The relative.influence() function can be used and works for objects created using both gbm() and gbm.fit(). However, it does not provide the plots as in the summary() function.
I hope this helps anyone else looking in the future.
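Since relative.influence() returns numbers rather than a plot, the plot can be rebuilt by hand. The sketch below is an assumption-laden illustration: rel_inf is a made-up named vector standing in for the real result, and the variable names gene_a, gene_b and gene_c are hypothetical.

```r
# In practice the vector would come from something like:
#   rel_inf <- gbm::relative.influence(boost_cox, n.trees = 100)
# (100 is the gbm.fit() default for n.trees). Made-up values are used here.
rel_inf <- c(gene_a = 12.5, gene_b = 0, gene_c = 37.5)

# Drop unused predictors and sort from most to least influential.
rel_inf <- sort(rel_inf[rel_inf > 0], decreasing = TRUE)

# Rescale so the influences sum to 100.
rel_pct <- 100 * rel_inf / sum(rel_inf)

# Horizontal bar chart, most influential variable on top.
barplot(rev(rel_pct), horiz = TRUE, las = 1,
        xlab = "Relative influence (%)")
```

With 26k+ predictors it is probably worth plotting only the first few entries, for example head(rel_pct, 20).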

R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?
predict() looks for a variable called speed, but when you extract the column with $ you get a bare vector that no longer carries that name.
So this variant works:
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
Or keep the original column name by subsetting with single brackets:
predict(cars.gb, newdata = cars_new['speed'])
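The difference between the two ways of subsetting can be checked in base R. This sketch uses lm() as a stand-in for gamboost() so that it runs without mboost; the names v, d1, d2, p1 and p2 are only for illustration.

```r
# Base R stand-in: lm() instead of gamboost(), same cars data.
fit <- lm(dist ~ speed, data = cars)

v  <- cars$speed                       # bare numeric vector, column name lost
d1 <- cars["speed"]                    # one-column data frame named "speed"
d2 <- data.frame(speed = cars$speed)   # rebuilt data frame with the same name

names(v)    # NULL
names(d1)   # "speed"

# Both data frames carry the predictor name, so both work as newdata.
p1 <- predict(fit, newdata = d1)
p2 <- predict(fit, newdata = d2)
all.equal(p1, p2)
```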

Can't give a subset when using randomForest inside a function

I want to create a function that calls the randomForest function from the randomForest package. randomForest takes a "subset" argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling randomForest inside another function that I define, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
train.data = data.frame(
type = rep(NA, times = 500),
x = runif(500),
y = runif(500),
z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(train.data$type)] = "B"
train.data$type = as.factor(train.data$type)
# define the training range
training.range = sample(500)[1:300]
# formula to use
tr_form = formula(type ~ x + y + z)
# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
p = randomForest(
formula = form,
data = all_data,
subset = tr_subset,
na.action = na.omit
)
return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest call and tr_subset is removed from the train_rf function, the code runs fine, but the whole data set is then used for training!
It should be noted that using the subset argument in randomForest when not defined in another function works completely fine, and is the intended method for the function, as described in the vignette linked above.
I know that in the meantime I could build a separate training set containing only the required rows and train on all of it, but is there a reason why my original code doesn't work?
Thanks.
EDIT: I conjecture that, as subset() is a base R function, R is getting confused and thinking you're wanting to use the base R function rather than defining an argument of the randomForest function. I'm not an expert, though, so I may be wrong.
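A likely explanation, offered as a sketch rather than a confirmed diagnosis: the eval(substitute(subset), data, env) in the error message suggests that subset is looked up first in the data and then in the environment of the formula, not in the body of train_rf where tr_subset lives. Base R's lm() resolves subset the same way, so it is used below as a stand-in that runs without the randomForest package. The names fit_subset_arg and fit_indexed are hypothetical.

```r
set.seed(42)
dat  <- data.frame(x = runif(100), y = runif(100))
form <- y ~ x                      # created outside the function, as in the question
rows <- sample(nrow(dat), 60)

# Fails: `tr_subset` is searched for in the data and then in the formula's
# environment, never in the frame of fit_subset_arg().
fit_subset_arg <- function(form, all_data, tr_subset) {
  lm(form, data = all_data, subset = tr_subset)
}
bad <- tryCatch(fit_subset_arg(form, dat, rows), error = function(e) e)
conditionMessage(bad)

# Works: index the rows before the modelling function sees the data.
fit_indexed <- function(form, all_data, tr_subset) {
  lm(form, data = all_data[tr_subset, , drop = FALSE])
}
good <- fit_indexed(form, dat, rows)
nobs(good)
```

Assuming randomForest behaves like lm() here, the same row indexing, data = all_data[tr_subset, ], should let train_rf train on the intended rows only.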

R C5.0 Undefined columns selected when using a formula stored in a variable

I'm using the C50 R Package for predicting with decision trees.
I have the following code:
library("partykit")
library("C50")
# Creates a sample data frame
data <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
# This is the variable I want to predict
variable <- "ID"
# First I convert the column to factor
data[, variable] <- factor(data[, variable])
# Then I create the formula
formula <- as.formula(paste(variable, " ~ ."))
# And finally I fit the model with the formula
model <- C5.0(formula, data=data, trials=10)
Until this point everything is ok, the problem comes here, when I try to plot the tree:
png(filename = paste("test.png"), width = 800, height = 60)
plot(model) # This line throws the error: Error in `[.data.frame`(mf, rsp) : undefined columns selected
dev.off()
But if I change the line:
model <- C5.0(formula, data=data, trials=10)
To:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok.
After doing a bit of debugging I have this additional info:
Inside C5.0 function there is this code:
call <- match.call()
If I look inside call this is what I get:
C5.0.formula(formula = formula, data = data, trials = 10)
But if the call to C5.0 is:
model <- C5.0(ID ~ ., data=data, trials=10)
Then the call object is this:
C5.0.formula(formula = ID ~ ., data = data, trials = 10)
This may seem normal, but while debugging the plot() function I saw that at some point the function as.party(x, trial = trial) is called, where x is the C5.0 object. Inside as.party() there is another call, model.frame(obj), where obj is the C5.0 object, and this is where the problem lies. Inside the model.frame() function I found this line:
rsp <- strsplit(paste(formula$call[2]), " ")[[1]][1]
Remember? The error had a reference to that rsp variable. And the problem is that formula$call has 2 different values. If I make the first call to C5.0 like this one:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok, as formula$call contains C5.0.formula(formula = ID ~ ., data = data, trials = 10) and rsp is "ID", so the next call to:
tmp <- mf[rsp]
Is executed without problems (mf is the initial data frame).
But with the call:
C5.0.formula(formula = formula, data = data, trials = 10)
The formula$call object contains:
C5.0.formula(formula = formula, data = data, trials = 10)
And rsp is "formula" so the line:
tmp <- mf[rsp]
Fails as there is no "formula" column in the data frame.
Is this the expected behavior? If it is, is there no way to call C5.0 with a formula stored in a variable?
I need to do the call that way in order to test the algorithm with many different formulas.
Any help would be appreciated. Thank you all in advance.
Unfortunately it is a bug. See issue 8 on github. Looks like Max Kuhn (developer of C50) hasn't had the time to look into it, since the bug report is from August. You might want to attach your issue there as well. That might bring it back to the attention of the developer.
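Until the bug is fixed, one possible workaround is to splice the formula into the call before it is evaluated, so that the stored call contains the formula itself rather than the name of the variable holding it. The sketch below shows the mechanism with base R's lm() as a stand-in; whether the same trick makes plot() work for a C5.0 model is an assumption that has not been tested here.

```r
f <- as.formula("dist ~ speed")

# Ordinary call: the stored call only remembers the variable name.
m1 <- lm(f, data = cars)
deparse(m1$call$formula)   # "f"

# bquote() substitutes the value of f into the call before evaluation.
m2 <- eval(bquote(lm(.(f), data = cars)))
deparse(m2$call$formula)   # "dist ~ speed"

# Hypothetical equivalent for the question's model (untested assumption):
# model <- eval(bquote(C5.0(.(formula), data = data, trials = 10)))
```

Both fits are identical as models; only the recorded call differs, which is the part the plotting code parses.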

Solve for x value given y = 0 in a random effects model in r

I have a random effects model that looks at the mean bias of the results from a numerical water temperature model given changes to a parameter, roughness. Roughness is the x variable, mean_bias is the y variable, and the location in the stream (1:9) is the random effect variable:
lmer_mb <- lmer(mean_bias ~ roughness + (1|location), data = W3T, REML=FALSE)
I've tried using approx() to find the roughness value when mean_bias = 0:
xval <- approx(roughness = lmer_mb$fitted, mean_bias = lmer_mb$roughness, xout = 0)$mean_bias
But I keep getting the error:
Error in approx(roughness = lmer_mb$fitted, mean_bias =
lmer_mb$roughness, : unused arguments (roughness = lmer_mb$fitted,
mean_bias = lmer_mb$roughness)
I also want to plot the xval (once I figure it out) on my plot, and was going to adapt the code that I found in another question on stackoverflow:
xval <- approx(x = fit$fitted.values, y = x, xout = 30)$y
Am I on the right track?
I found the approach I needed to solve this problem. I needed to load the merTools package (plus dplyr, which provides the filter() used below):
library(merTools)
library(dplyr)
Then I used the wiggle function to generate a new data frame around the area where I wanted to solve for x:
opt2 <- wiggle(data=W3T, var="roughness", values = seq(0.05,0.1,0.001))
Then, I filtered opt2 for the values I wanted and turned them into a list:
sub.opt2 <- filter(opt2,mean_bias == "0")
sub.opt3 <- list(sub.opt2)
And finally, found the average roughness using the draw() function:
opt.roughness <- draw(lmer_mb, type = 'average', varList = sub.opt3)
opt.roughness
My solution was based on the code I found here:
https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html
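For a model that is linear in roughness, the zero crossing of the population-level prediction can also be computed directly from the fixed effects as minus the intercept divided by the slope. The sketch below is a simplified illustration under stated assumptions: it uses lm() and made-up data in place of the lmer fit and W3T, and the names x_zero and x_interp are hypothetical. For the real model, fixef(lmer_mb) would play the role of coef(fit).

```r
# Made-up data standing in for W3T (illustrative values only).
roughness <- seq(0.05, 0.10, by = 0.01)
mean_bias <- c(-0.5, -0.3, -0.1, 0.1, 0.3, 0.5)

fit <- lm(mean_bias ~ roughness)
b   <- coef(fit)   # for an lmer fit, fixef(lmer_mb) would be used instead

# Solve 0 = intercept + slope * x for x.
x_zero <- unname(-b["(Intercept)"] / b["roughness"])

# approx() only accepts arguments named x and y, which is why the call
# in the question fails with "unused arguments".
x_interp <- approx(x = fitted(fit), y = roughness, xout = 0)$y

x_zero
x_interp
```

The value can then be marked on an existing plot with abline(v = x_zero).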
