R: how to make predictions using gamboost

library(mboost)
### a simple two-dimensional example: cars data
cars.gb <- gamboost(dist ~ speed, data = cars, dfbase = 4,
control = boost_control(mstop = 50))
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))
> predict(cars.gb, newdata = cars_new$speed)
Error in check_newdata(newdata, blg, mf) :
‘newdata’ must contain all predictor variables, which were used to specify the model.
I fit a model using the example on the help(gamboost) page. I want to use this model to predict on a new dataset, cars_new, but encountered the above error. How can I fix this?

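The predict function looks for a variable named speed in newdata. Subsetting with $ returns a bare numeric vector, so the column name is lost.
Wrapping the vector in a data frame with the right column name works:
predict(cars.gb, newdata = data.frame(speed = cars_new$speed))
Alternatively, subset with single brackets, which returns a one-column data frame and keeps the original name: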
predict(cars.gb, newdata = cars_new['speed'])
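The difference between the two subsetting forms can be checked directly in base R. The sketch below uses lm() as a stand-in for gamboost() so that it runs without mboost; that substitution is an assumption made for illustration, although the naming behaviour of $ and [ does not depend on the model.

```r
# Base-R sketch; lm() stands in for gamboost() (an assumption for illustration).
set.seed(1)
cars_new <- cars + rnorm(nrow(cars))

vec <- cars_new$speed                      # bare numeric vector: the column name is gone
df1 <- cars_new["speed"]                   # one-column data frame: the name is kept
df2 <- data.frame(speed = cars_new$speed)  # name restored by hand

names(vec)  # NULL
names(df1)  # "speed"

# Both data frames carry a column called speed, so predict() can match it.
cars_lm <- lm(dist ~ speed, data = cars)
pred1 <- predict(cars_lm, newdata = df1)
pred2 <- predict(cars_lm, newdata = df2)
```

Both calls to predict() receive a data frame with a speed column, so they should return the same values.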

Related

Caret train function for multiple data frames as function

There was a similar question to mine more than 6 years ago, and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for a big R project at the moment, and I'm wondering whether I can wrap the model training function train() of the caret package so that it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, an R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean, and it would certainly save work.
By writing predictor_iris <- "Species", you store a character string in predictor_iris. Inside lda_ex, the formula predictor ~ . then refers to the function argument predictor, which holds that string, and not to the Species column. I would therefore expect train() to fail on the formula, since it is asked to predict a single string from vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center","scale"))
  return(model)
}
The key difference is that we now pass in the whole formula instead of only the name of the variable to predict. That way, we avoid the string-related problem.
library(caret) # Remember to load the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
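If the response really has to be supplied as a string, the formula can also be built from that string with base R's reformulate() (or as.formula(paste(...))). The sketch below is a minimal illustration, not the answer's method: fit_by_name() is a hypothetical helper, and lm() on mtcars stands in for caret::train() so that the example runs without caret.

```r
# Base-R sketch: build the formula from a response name given as a string.
# fit_by_name() is a hypothetical helper; lm() stands in for caret::train().
fit_by_name <- function(data, response) {
  f <- reformulate(".", response = response)  # e.g. "mpg" becomes mpg ~ .
  lm(f, data = data)
}

fit <- fit_by_name(mtcars, "mpg")
formula(fit)
```

With caret loaded, the same formula object could presumably be passed as the first argument to train() inside lda_ex, keeping the string-based interface from the question.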

Predict using multiple variables in R

I have a slight problem with my R coursework.
I have created the following dataset:
Now I'm going to plot the values based on this dataset using the following command:
plot(x ~ Group.1, data = jarelmaks_vaikelaen23mean,
xlab = "Vanus", ylab = "PD", main = "Järelmaks ja väikelaen")
After that, I'm creating a glm model using the following command. The difference is that now I'm using the original dataset (the values of the dependent variable are 1/0).
GLM command:
jarelmaks_vaikelaen23_mudel <- glm(Default ~ Vanus.aastates + Toode,
family = binomial(link = 'logit'), data = jarelmaks_vaikelaen_23)
Now, I'm trying to predict the values using my model.
predict(jarelmaks_vaikelaen23_mudel,data.frame(Vanus.aastates=x),type = "resp")
Unfortunately, I get the following error message:
Error in data.frame(Vanus.aastates = x) : object 'x' not found
Can you give me some ideas on how to solve this problem, or explain how this predict() command works?
When you provide a data-frame to the predict function's newdata argument, the data-frame should have column names that match the variables used as independent variables in your model-fitting step. That is, your predict call should look like
predict(
  jarelmaks_vaikelaen23_mudel,
  newdata = data.frame(
    Vanus.aastates = SOMETHING,
    Toode = SOMETHING_ELSE
  ),
  type = "response"
)
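As a concrete illustration of that pattern, here is a minimal sketch on simulated data. The data frame loans and its columns age, product and default are hypothetical stand-ins for the original dataset, which is not shown in the question.

```r
# Simulated stand-in data; all names here are hypothetical.
set.seed(42)
n <- 200
loans <- data.frame(
  age = round(runif(n, 18, 70)),
  product = factor(sample(c("hire_purchase", "small_loan"), n, replace = TRUE))
)
loans$default <- rbinom(n, 1, plogis(1 - 0.05 * loans$age))

fit <- glm(default ~ age + product,
           family = binomial(link = "logit"), data = loans)

# newdata needs one column per predictor, with the same names as in the formula.
new_obs <- data.frame(
  age = c(25, 40, 60),
  product = factor("small_loan", levels = levels(loans$product))
)
p <- predict(fit, newdata = new_obs, type = "response")
```

Here p holds one predicted probability per row of new_obs. Leaving out either column, or giving it a different name, would make predict() fail or silently look for the variable elsewhere.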

R C5.0 Undefined columns selected when using a formula stored in a variable

I'm using the C50 R Package for predicting with decision trees.
I have the following code:
library("partykit")
library("C50")
# Create a sample data frame
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   var1 = c('a', 'b', 'c', 'd', 'e'),
                   var2 = c(1, 1, 0, 0, 1))
# This is the variable I want to predict
variable <- "ID"
# First I convert the column to a factor
data[, variable] <- factor(data[, variable])
# Then I create the formula
formula <- as.formula(paste(variable, " ~ ."))
# And finally I fit the model with the formula
model <- C5.0(formula, data=data, trials=10)
Until this point everything is ok, the problem comes here, when I try to plot the tree:
png(filename = paste("test.png"), width = 800, height = 60)
plot(model) # This line throws the error: Error in `[.data.frame`(mf, rsp) : undefined columns selected
dev.off()
But if I change the line:
model <- C5.0(formula, data=data, trials=10)
To:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok.
After doing a bit of debugging I have this additional info:
Inside C5.0 function there is this code:
call <- match.call()
If I look inside call this is what I get:
C5.0.formula(formula = formula, data = data, trials = 10)
But if the call to C5.0 is:
model <- C5.0(ID ~ ., data=data, trials=10)
Then the call object is this:
C5.0.formula(formula = ID ~ ., data = data, trials = 10)
This may seem normal, but while debugging the plot() function I saw that at some point the function as.party(x, trial = trial) is called, where x is the C5.0 object. Inside as.party() there is another call, model.frame(obj), where obj is the C5.0 object, and this is where the problem lies. Inside the model.frame() function I found this line:
rsp <- strsplit(paste(formula$call[2]), " ")[[1]][1]
Remember? The error had a reference to that rsp variable. The problem is that formula$call can take 2 different values. If I make the first call to C5.0 like this one:
model <- C5.0(ID ~ ., data=data, trials=10)
Everything is ok, as formula$call contains C5.0.formula(formula = ID ~ ., data = data, trials = 10) and rsp is "ID", so the next call to:
tmp <- mf[rsp]
Is executed without problems (mf is the initial data frame).
But with the call:
C5.0.formula(formula = formula, data = data, trials = 10)
The formula$call object contains:
C5.0.formula(formula = formula, data = data, trials = 10)
And rsp is "formula" so the line:
tmp <- mf[rsp]
Fails as there is no "formula" column in the data frame.
Is this the expected behavior? If it is, is there no way I can call C5.0 with a formula stored in a variable?
I need to do the call that way in order to test the algorithm with many different formulas.
Any help would be appreciated. Thank you all in advance.
Unfortunately it is a bug. See issue 8 on github. Looks like Max Kuhn (developer of C50) hasn't had the time to look into it, since the bug report is from August. You might want to attach your issue there as well. That might bring it back to the attention of the developer.
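The mechanism described in the question can be reproduced in base R without C50. In the sketch below, capture_call() is a hypothetical stand-in that only records its call with match.call(), as the question reports C5.0 doing. The do.call() variant shows one way to get the formula itself, rather than the variable name, into the recorded call; whether that is enough to make plot() work on a real C5.0 model is an assumption that has not been verified here.

```r
# capture_call() is a hypothetical stand-in that records its own call.
capture_call <- function(formula, data, ...) match.call()

d <- data.frame(ID = factor(1:5), var2 = c(1, 1, 0, 0, 1))
fml <- as.formula("ID ~ .")

call_var    <- capture_call(fml, data = d)     # formula passed via a variable
call_inline <- capture_call(ID ~ ., data = d)  # formula written inline
# do.call() substitutes the value of fml into the call before it is recorded.
call_docall <- do.call(capture_call, list(fml, data = quote(d)))

deparse(call_var$formula)     # "fml"
deparse(call_inline$formula)  # "ID ~ ."
deparse(call_docall$formula)  # "ID ~ ."
```

The first result is the name of the variable, which is what later gets mistaken for a column name; the other two carry the actual formula text.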

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but svm() expects its data to be a matrix (you are passing trainSet as data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that in a matrix all columns share the same data type: all numeric or all categorical. If this is true for your data, as.matrix() may be fine.
In practice, people more often want model.matrix() or sparse.model.matrix() (from the Matrix package), which creates dummy columns for categorical variables while keeping a single column for each numeric variable. The result is still a matrix.
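The difference between as.matrix() and model.matrix() on mixed data can be seen in a small base-R sketch. The data frame toy and its columns are made up for illustration.

```r
# Made-up mixed-type data: one numeric column and one factor.
toy <- data.frame(
  x1 = c(1.5, 2.0, 3.5, 4.0),
  colour = factor(c("red", "blue", "red", "green"))
)

m_plain <- as.matrix(toy)                # everything is coerced to character
X <- model.matrix(~ . - 1, data = toy)   # numeric matrix with dummy columns

is.character(m_plain)
is.numeric(X)
colnames(X)
```

as.matrix() falls back to a character matrix as soon as one column is not numeric, whereas model.matrix() expands the factor into 0/1 indicator columns and keeps x1 as it is.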

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation =
"LOO", jackknife = TRUE)
It was then identified that a model with 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates are the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the below code, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which gives it the class AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
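The same effect can be reproduced in base R. The sketch below uses lm() as a stand-in for plsr() (an assumption for illustration; all object names are made up). When the predictor matrix lives in the workspace instead of in the data frame, predict() cannot find it in newdata and falls back to the workspace copy, so it returns the fitted values.

```r
# Base-R sketch; lm() stands in for plsr(), all names are made up.
set.seed(1)
n <- 20
spec_mat <- matrix(rnorm(n * 2), ncol = 2)
d <- data.frame(y = rnorm(n), spec = I(spec_mat))  # matrix stored as one column
train <- d[1:10, ]
test  <- d[11:20, ]

# Problem pattern: the predictor matrix is a separate workspace object.
spec_train <- spec_mat[1:10, ]
fit_outside <- lm(y ~ spec_train, data = train)
p_outside <- predict(fit_outside, newdata = test)   # test has no spec_train
same_as_fitted <- isTRUE(all.equal(unname(p_outside),
                                   unname(fitted(fit_outside))))

# Working pattern: the matrix is a column of the data frame itself.
fit_inside <- lm(y ~ spec, data = train)
p_inside <- predict(fit_inside, newdata = test)     # uses the rows of test
```

In this sketch same_as_fitted comes out TRUE, which mirrors the behaviour described in the question, while p_inside is computed from the test rows.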
