I want to use caret's train() function to investigate xgboost results
# open file with train data
trainy <- read.csv('')
# open file with test data
test <- read.csv('')
# we don't need the ID column
##### Removing IDs
trainy$ID <- NULL
test.id <- test$ID
test$ID <- NULL
##### Extracting TARGET
trainy.y <- trainy$TARGET
trainy$TARGET <- NULL
# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4, 6, 8, 10),
gamma = 1
)
# pack the training control parameters
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
# train the model for each parameter combination in the grid,
# using CV to evaluate
xgb_train_1 = train(
x = as.matrix(trainy),
y = as.factor(trainy.y),
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1,
method = "xgbTree"
)
I see this error:
Error in train.default(x = as.matrix(trainy), y = as.factor(trainy.y), trControl = xgb_trcontrol_1, :
At least one of the class levels is not a valid R variable name;
I have looked at other cases but still can't understand what I should change. R is quite different from Python for me for now.
As far as I can see, I should do something with the y class variable, but what and how exactly? Why didn't the as.factor() function work?
I solved this issue; hope it will help other novices.
I needed to transform all the data to factor type, like this:
trainy[] <- lapply(trainy, factor)
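For reference, the message refers to the outcome's factor levels: with classProbs = TRUE and twoClassSummary, caret needs class levels that are valid R variable names, which "0" and "1" are not. A minimal sketch of that adjustment, assuming TARGET is coded 0/1 as above:
# recode the outcome so its levels are valid R variable names
trainy.y <- factor(trainy.y, levels = c(0, 1), labels = c("no", "yes"))
# or, if trainy.y is already a factor: levels(trainy.y) <- make.names(levels(trainy.y))
# then pass y = trainy.y to train() instead of as.factor(trainy.y)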
I am trying to return the ROC curves for a test dataset using the MLeval package.
# Load data
train <- readRDS(paste0("Data/train.rds"))
test <- readRDS(paste0("Data/test.rds"))
# Create factor class
train$class <- ifelse(train$class == 1, 'yes', 'no')
# Set up control function for training
ctrl <- trainControl(method = "cv",
number = 5,
returnResamp = 'none',
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T,
verboseIter = F)
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 18000,
shrinkage = 0.01,
n.minobsinnode = 4)
# Build using a gradient boosted machine
set.seed(5627)
gbm <- train(class ~ .,
data = train,
method = "gbm",
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
# Predict results -
pred <- predict(gbm, newdata = test, type = "prob")[,"yes"]
roc <- evalm(data.frame(pred, test$class))
I have used the following post, ROC curve for the testing set using Caret package, to try to plot the ROC for the test data using MLeval, and yet I get the following error message:
MLeval: Machine Learning Model Evaluation
Input: data frame of probabilities of observed labels
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
Can anyone please help? Thanks.
Please provide a reproducible example with sample data so we can replicate the error and test for solutions (i.e., we cannot access train.rds or test.rds).
Nevertheless, the change below may fix your issue: evalm() expects the predicted probabilities for both classes together with the observed labels, so pass the full probability data frame rather than the "yes" column only.
pred <- predict(gbm, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))
I am running a glmnet model in caret on the built-in infert dataset, e.g.,
infert_y <- factor(infert$case) %>% plyr::revalue(c("0"="control", "1"="case"))
infert_x <- subset(infert, select=-case)
new.x <- model.matrix(~., infert_x)
# Create cross-validation folds:
myFolds <- createFolds(infert_y, k = 10)
# Create reusable trainControl object:
myControl_categorical <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
model_glmnet_pca <- train(
x = new.x,
y = infert_y,
metric = "ROC",
method = "glmnet",
preProcess=c("zv", "nzv","medianImpute", "center", "scale", "pca"),
trControl = myControl_categorical,
tuneGrid= expand.grid(alpha= seq(0, 1, length = 20),
lambda = seq(0.0001, 1, length = 100))
)
But when I try to get the coefficients:
bestlambda <- model_glmnet_pca$results$lambda[model_glmnet_pca$results$ROC == max(model_glmnet_pca$results$ROC)]
coef(model_glmnet_pca, s=bestlambda)
returns:
NULL
I tried:
coef.glmnet(model_glmnet_pca, s=bestlambda)
which returns:
Error in predict.train(object, s = s, type = "coefficients", exact = exact, :
type must be either "raw" or "prob"
But surely when I'm calling coef() my "type" argument is set to "coefficients"? If I try
coef.glmnet(model_glmnet_pca, s=bestlambda, type="prob")
it returns:
Error in predict.train(object, s = s, type = "coefficients", exact = exact, :
formal argument "type" matched by multiple actual arguments
I am very confused, can anyone point out what I'm doing wrong?
To get the coefficients from the best model, pull them from the underlying glmnet fit; the caret train object has no coefficients of its own, which is why coef() returned NULL. You can use:
coef(model_glmnet_pca$finalModel, model_glmnet_pca$finalModel$lambdaOpt)
See e.g. this link on using regularised regression models with caret.
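As a quick sanity check (a sketch, using the model object from the question), the lambda stored on the final glmnet fit should match the lambda reported in bestTune:
model_glmnet_pca$bestTune                 # alpha and lambda selected by train()
model_glmnet_pca$finalModel$lambdaOpt     # the same lambda, stored on the glmnet fit
coef(model_glmnet_pca$finalModel, s = model_glmnet_pca$finalModel$lambdaOpt)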
I am trying to predict the times table by training a neural network. However, I can't really work out how the preProcess argument works in the train function in caret.
In the docs, it says:
The preProcess class can be used for many operations on predictors, including centering and scaling.
When we set preProcess like below,
tt.cv <- train(product ~ .,
data = tt.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
linear.output = TRUE,
algorithm = 'backprop',
preProcess = 'range',
learningrate = 0.01)
Does it mean that the train function preprocesses (normalizes) the training data passed, in this case tt.train?
After the training is done, when we are trying to predict, do we pass normalized inputs to the predict function or are inputs normalized in the function because we set the preProcess parameter?
# Do we do
predict(tt.cv, tt.test)
# or
predict(tt.cv, tt.normalized.test)
And from the quote above, it seems that outputs are not normalized this way during training. How do we go about normalizing outputs? Or do we just normalize the training data beforehand, like below, and then pass it to the train function?
preProc <- preProcess(tt, method = 'range')
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
The whole code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
times = 1,
p = 0.7,
list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
savePredictions = TRUE)
tune.grid <- expand.grid(layer1 = 8,
layer2 = 0,
layer3 = 0)
tt.cv <- train(product ~ .,
data = tt.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
algorithm = 'backprop',
learningrate = 0.01,
stepmax = 100000,
preProcess = c('center', 'scale'),
lifesign = 'minimal',
threshold = 0.01)
I am trying to tune hyperparameters of xgboost for a classification problem using the caret library. As there were a lot of factors in my data set and xgboost likes numerical data, I created dummy variables using feature hashing, but when I run caret's train I get an error.
# Using feature hashing to convert all the factor variables to dummies
objTrain_hashed = hashed.model.matrix(~., data=train1[,-27], hash.size=2^15, transpose=FALSE)
# created a dense matrix which is normally accepted by the xgboost method in R
# hoping I could pass it to caret as well
dmodel <- xgb.DMatrix(objTrain_hashed[, ], label = train1$Walc)
xgb_grid_1 = expand.grid(
nrounds = 500,
max_depth = c(5, 10, 15),
eta = c(0.01, 0.001, 0.0001),
gamma = c(1, 2, 3),
colsample_bytree = c(0.4, 0.7, 1.0),
min_child_weight = c(0.5, 1, 1.5)
)
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 3,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
xgb_train1 <- train(Walc ~.,dmodel,method = 'xgbTree',trControl = xgb_trcontrol_1,
metric = 'accuracy',tunegrid = xgb_grid_1)
I am getting the following error:
Error in as.data.frame.default(data) :
cannot coerce class ""xgb.DMatrix"" to a data.frame
Any suggestions on how I can proceed?
This is because you are passing dmodel into the last part of your code. Try passing objTrain_hashed, which is a matrix, not an xgb.DMatrix; caret's train() cannot coerce an xgb.DMatrix to a data frame.
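A minimal sketch of that change, using the x/y interface and reusing the grid and control objects from the question (note that classProbs/twoClassSummary in xgb_trcontrol_1 assume a two-class outcome with valid level names):
xgb_train1 <- train(x = as.matrix(objTrain_hashed),
                    y = factor(train1$Walc),
                    method = "xgbTree",
                    trControl = xgb_trcontrol_1,
                    tuneGrid = xgb_grid_1)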
How about sparse.model.matrix() instead of hashed.model.matrix?
It works on my PC.
Don't transform to an xgb.DMatrix(); pass the plain sparse.model.matrix() result straight to the train() function, like:
model_data <- sparse.model.matrix(Y ~ ., raw_data)
and
xgb_train1 <- train(Y ~ ., model_data, ...)
Hope it works. Thank you.
I have a multiclass problem. For example, we can take the mtcars dataset and try to predict the number of cylinders, cyl.
data(mtcars)
I want to use xgboost and fit it using the caret package. For that I create a grid of hyperparameters using
xgb_grid_param = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4),
gamma = 0,
colsample_bytree =1,
min_child_weight =1
)
I can create training control parameters as
xgb_tr_ctrl = trainControl(
method = "cv",
number = 5,
repeats =2,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all",
allowParallel = TRUE
)
When I then try to run the train function in caret using:
model <- train(factor(cyl)~., data = mtcars, method = "xgbTree",
trControl = xgb_grid_param, tuneGrid=xgb_grid_param)
I get the error:
Error in trControl$classProbs && any(classLevels != make.names(classLevels)) :
invalid 'x' type in 'x && y'
How do I fix this error, and how do I instruct xgbTree to use mlogloss to optimize the learning?
For another method I could solve "invalid 'x' type in 'x && y'" by setting the label attribute as the last column of the data frame/matrix.
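Note also, as a sketch only: in the train() call above, trControl is given xgb_grid_param (the tuning grid) rather than xgb_tr_ctrl, so trControl$classProbs is NULL and NULL && ... raises exactly this "invalid 'x' type in 'x && y'" error. With the control object passed instead, the call would look like:
model <- train(factor(cyl) ~ ., data = mtcars, method = "xgbTree",
               trControl = xgb_tr_ctrl, tuneGrid = xgb_grid_param)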