My training data has 87620 rows and 5 columns. My test data has the same number of rows and columns. When I use a CART model to predict the "Defaults" (that is the target variable), my model works and provides me with predictions.
When I use a validation data set that has 6 columns and only 19561 rows, and does not have the Defaults variable, and then proceed to use the
loans_training$Default <- as.factor(loans_training$Default)#Make the default variable categorical
loans_test$Default <- as.factor(loans_test$Default)#Make the default variable categorical
loans_training$term <- as.factor(loans_training$term)
loans_test$term <- as.factor(loans_test$term)
#Standardize datasets
preprocess.train.z <- preProcess(loans_training[1:5], method = c("center", "scale"))
loans_train.z <- predict(preprocess.train.z,loans_training[1:5])
preprocess.test.z <- preProcess(loans_test[1:5], method = c("center", "scale"))
loans_test.z <- predict(preprocess.test.z,loans_test[1:5])
(22417 * 2.3) + 22417
#Resampling subroutine
rare.record.indices <- which(loans_train.z$Default == "1")
rare.indices.resampled <- sample(x = rare.record.indices,size = 51559, replace = TRUE)
rare.records.resampled <- loans_train.z[rare.indices.resampled,]
loans_train.3.3x <- rbind(loans_train.z, rare.records.resampled)
#Develop 3.3x CART model
TC <- trainControl(method = "CV", number = 10)
fit.CART.3.3x <- train(Default ~ ., data = loans_train.3.3x, method = "rpart", trControl = TC)
testsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_test.z)
table(loans_test.z$Default, testsetpreds.CART3.3x)
loans_validation$grade <- as.character(loans_validation$grade)#Make the grade variable categorical
loans_validation$term <- as.factor(loans_validation$term)#Make the term variable categorical
loans_validation$Index <- as.factor(loans_validation$Index)#Make the Index variable categorical
#Standardize dataset
preprocess.validation.z <- preProcess(loans_validation[1:6], method = c("center", "scale"))
loans_validation.z <- predict(preprocess.validation.z,loans_validation[1:6])
#Predict Defaults using Cart
validationsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_validation.z)
Any help would be greatly appreciated :)
How would I apply this to the validation data set?
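One possible approach, not confirmed anywhere in this thread, is to learn the centering and scaling parameters on the training set only and reuse them on the validation set, feeding the model only the predictor columns it was trained on. Below is a minimal base R sketch of that idea. The data frames and the column names loan_amnt and int_rate are hypothetical stand-ins, not the real loans data.

```r
# Toy stand-ins for the training and validation sets (hypothetical data).
set.seed(42)
train_df <- data.frame(loan_amnt = rnorm(100, mean = 10000, sd = 3000),
                       int_rate  = rnorm(100, mean = 12, sd = 4))
valid_df <- data.frame(Index     = 1:20,  # ID column, not a predictor
                       loan_amnt = rnorm(20, mean = 10000, sd = 3000),
                       int_rate  = rnorm(20, mean = 12, sd = 4))

# Learn centering and scaling parameters from the training set only.
num_cols <- c("loan_amnt", "int_rate")
mu  <- colMeans(train_df[, num_cols])
sdv <- apply(train_df[, num_cols], 2, sd)

# Apply the training parameters to the validation predictors,
# leaving the ID column out of the model input.
valid_z <- as.data.frame(scale(valid_df[, num_cols], center = mu, scale = sdv))

# The validation predictors now match the training predictors by name and order.
names(valid_z)
```

With caret, the analogous step would presumably be to call predict() with the preProcess object fitted on the training data, rather than fitting a new preProcess object on the validation data.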
I'm trying to get prediction intervals from a random forest model that has a categorical response variable. Ideally, I would like to see how confident the model is for classifying an observation into a given response category.
On the last line of my code you'll see a predict() that works when the interval = argument is not included. When I include the "interval =" I get an error. Any idea on how to get prediction intervals for the output?
# Load libraries
library(caret)
library(data.table)
library(ggplot2) # provides the diamonds data set
# Set seed
# Load the necessary data
df.0 <- data.table::as.data.table(ggplot2::diamonds) # := below needs a data.table
# set up the cross-validation parameters
control <- trainControl(method = "repeatedcv",
number = 10)
metric <- "Accuracy"
mtry <- seq(from = 1,
to = length(unique(df.0$cut)),
by = 1)
tunegrid <- expand.grid(mtry = mtry)
# Add rownames so we can use as index
df.0[, indexNum := .I]
trainer <- df.0[ ,.SD[sample(x = .N, size = (.N * 0.9))], by = cut] # Pull 90% of each cut into training
tester <- df.0[!trainer, on = c("indexNum")]
# Remove index number
tester <- tester[, ":=" (indexNum = NULL)]
trainer <- trainer[, ":=" (indexNum = NULL)]
# build a model and assess its accuracy via 10-fold cross validation
rf_mod <- train(
  x = trainer[, .(x, y, z, depth, table)],
  y = trainer$cut,
  method = "rf",
  metric = "Accuracy",
  tuneGrid = tunegrid
)
# check out which mtry value was best
# test the model against the test data
cut_pred <- predict(rf_mod, newdata = tester[, .(x, y, z, depth, table)], interval = "prediction")
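As far as I know, the interval argument belongs to predict() methods for regression-type models such as lm. For a categorical response, a common way to see how confident the model is for each class is to request class probabilities instead. Here is a minimal sketch, assuming the caret and randomForest packages are installed; it uses iris rather than diamonds to keep it small, and the tuning values are arbitrary.

```r
library(caret)  # assumes the randomForest package is also installed

set.seed(1)
idx     <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
trainer <- iris[idx, ]
tester  <- iris[-idx, ]

# Small random forest tuned over two mtry values with 5-fold CV.
rf_mod <- train(x = trainer[, 1:4], y = trainer$Species,
                method = "rf",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(mtry = 1:2))

# Class probabilities: one row per observation, one column per class.
cut_prob <- predict(rf_mod, newdata = tester[, 1:4], type = "prob")
head(cut_prob)
```

Each row gives the share of trees voting for each class, so a row close to 1 for a single class indicates a confident classification.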
I want to extract predictions for new, unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error "Error: $ operator is invalid for atomic vectors". How should the input parameters be structured to use this function?
Here is my reproducible code:
dat <-
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# recursive feature elimination caret
model <- caret::train(x = all_preds,
y = response,
method ="rf",
ntree = 250,
metric = "RMSE",
trControl = contr)
# validation set
vali <- dat[which(dat$set == "validation"), ]
# not working
caret::extractPrediction(models = model, testX = vali[,-c(3,5,1)], testY = vali[,1])
caret::extractPrediction(models = model, testX = vali, testY = vali)
# works without problems
caret::predict.train(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. The models argument expects a list of models rather than a single model instance, so I passed list(my_rf = model) instead of model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])
My dataset contains 5851 observations, and is split into a train (3511 observations) and test (2340 observations) set. I now want to train a model using KNN, with two variables. I want to do 10-fold CV, repeated 5 times, using ROC metric and the one-standard error rule and the variables are preprocessed. The code is shown below.
ctrl_repcvSE <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
summaryFunction = twoClassSummary, classProbs = TRUE,
selectionFunction = "oneSE")
tune_grid <- expand.grid(k = 45:75)
mod4 <- train(purchased ~ total_policies + total_contrib,
data = mhomes_train, method = "knn",
trControl= ctrl_repcvSE, metric = "ROC",
tuneGrid = tune_grid, preProcess = c("center", "scale"))
The problem is that I have already tried many different ranges of K (e.g., K = 10:20, 30:40, 50:60, 150:160) as well as different tuning lengths. Every time, the output says that the chosen value of K is the last one in the range; for example, for K = 70:80 the chosen value is K = 80. This suggests I should look further, because if the chosen value is always the largest K in the range, there may be better values above it. How do I eventually find the right one?
The assignment only specifies: For k-nearest neighbours, explore reasonable values of k using the total_policies and total_contrib variables only.
Welcome to Stack Overflow. Your question isn't easy to answer.
For k-nearest neighbours I use another function, knn3, which is part of the caret package.
I'll give an example using the iris dataset. We try to get the accuracy of our model for different values for k and plot those accuracies.
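library(caret)      # knn3, createDataPartition, confusionMatrix
library(data.table) # data.table
library(tidyr)      # gather, %>%
library(ggplot2)    # qplot
dt <- iris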
# converting and scaling data ----
dt$Species <- dt$Species %>% as.factor()
dt$Sepal.Length <- dt$Sepal.Length %>% scale()
dt$Sepal.Width <- dt$Sepal.Width %>% scale()
dt$Petal.Length <- dt$Petal.Length %>% scale()
dt$Petal.Width <- dt$Petal.Width %>% scale()
# remove in the real run ----
# split data into train and test - 3:1 ----
train_index <- createDataPartition(dt$Species, p = 0.75, list = FALSE)
train <- dt[train_index, ]
test <- dt[-train_index, ]
# values to check for k ----
K_VALUES <- 20:1
test_acc <- numeric(0)
train_acc <- numeric(0)
# calculate different models for each value of k ----
for (x in K_VALUES){
  model <- knn3(Species ~ ., data = train, k = x)
  # accuracy on the test set
  pred_test <- predict(model, test, type = "class")
  pred_test_acc <- confusionMatrix(table(pred_test,
                                         test$Species))$overall["Accuracy"]
  test_acc <- c(test_acc, pred_test_acc)
  # accuracy on the training set
  pred_train <- predict(model, train, type = "class")
  pred_train_acc <- confusionMatrix(table(pred_train,
                                          train$Species))$overall["Accuracy"]
  train_acc <- c(train_acc, pred_train_acc)
}
data <- data.table(x = K_VALUES, train = train_acc, test = test_acc)
# plot a validation curve ----
plot_data <- gather(data, "type", "value", -x)
g <- qplot(x = x,
y = value,
data = plot_data,
color = type,
geom = "path",
xlim = c(max(K_VALUES),min(K_VALUES)-1))
Now pick a k that gives good accuracy on your test data. That's the value you're looking for.
Disclaimer: this is simplified, but the approach should help you solve your problem.
I run PCA on the training split of my dataset and, after removing irrelevant columns, project the test split onto the resulting components.
data <- read.csv('bottom10.csv')
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix objects.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
And after I run this, there is a warning but the training can still run.
xgboost: label will be ignored
I can predict the train dataset using the model but when I try to predict test dataset there will be an error.
xgb_pred <- predict(, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
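One plausible reading of the error, not confirmed in this thread, is that traincom keeps only PC1 to PC500 while testcom keeps every component, so the two matrices carry different feature names. Below is a minimal base R sketch of projecting test data onto the same leading components. The matrices are random stand-ins, and k = 4 plays the role of the 500 components in the question.

```r
set.seed(1)
# Hypothetical stand-ins for the training and test feature matrices.
X_train <- matrix(rnorm(200 * 10), ncol = 10,
                  dimnames = list(NULL, paste0("f", 1:10)))
X_test  <- matrix(rnorm(50 * 10), ncol = 10,
                  dimnames = list(NULL, paste0("f", 1:10)))

pca <- prcomp(X_train)
k <- 4  # number of components to keep (500 in the question)

train_com <- pca$x[, 1:k]
# predict() centers the new data with pca$center and multiplies by pca$rotation;
# subsetting to 1:k keeps the same components as the training matrix.
test_com <- predict(pca, newdata = X_test)[, 1:k]

# Both matrices now carry the same feature names.
colnames(train_com)
colnames(test_com)
```

Keeping the column subset identical on both sides means any model fitted on train_com sees the same feature names at prediction time.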
I am trying to evaluate a machine learning model in R. In general, training the model does not work well.
# # Logistic regression multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using repeated cross-validation to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
classProbs = TRUE)
fit_glmnet_phy <- train (Methane~.,
data = trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
The program always stops at the fit_glmnet_phy <- train (Methane~., .. command and shows:
Metric Accuracy not applicable for regression models
I have no idea what causes this error.
I have also attached the type of the Methane variable.
Try normalizing the input columns and converting the output column to a factor. This helped me resolve a similar issue.
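The factor conversion matters because caret::train() treats a numeric outcome as a regression problem, for which the Accuracy metric is not defined. Below is a minimal base R sketch of the conversion. The toy data frame and the level labels are hypothetical; the real subs_phy data may be coded differently.

```r
# Hypothetical stand-in for subs_phy: Methane stored as numeric codes.
toy <- data.frame(x1 = c(0.2, 1.5, 3.1, 0.7, 2.2, 2.9),
                  Methane = c(0, 1, 2, 0, 1, 2))

is.numeric(toy$Methane)  # TRUE: train() would treat this as regression

# Map the codes to a factor with syntactically valid level names
# (valid names matter when classProbs = TRUE is set in trainControl()).
toy$Methane <- factor(toy$Methane, levels = c(0, 1, 2),
                      labels = c("low", "medium", "high"))

is.factor(toy$Methane)   # TRUE: train() would now treat this as classification
levels(toy$Methane)
```

After this step, metric = "Accuracy" applies because the outcome is a factor.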