My training data has 87620 rows and 5 columns. My test data has the same number of rows and columns. When I use a CART model to predict the "Defaults" (that is the target variable), my model works and provides me with predictions.
When I apply the model to a validation data set that has 6 columns and only 19561 rows, and does not contain the Defaults variable, and then run
View(validationsetpreds.CART3.3x)
I get the attached picture:
[Image: Validationsetpreds picture]
When I run the same command on the test set predictions, I get the following:
[Image: Testsetpreds picture]
set.seed(123)
loans_training$Default <- as.factor(loans_training$Default)#Make the default variable categorical
loans_test$Default <- as.factor(loans_test$Default)#Make the default variable categorical
loans_training$term <- as.factor(loans_training$term)
loans_test$term <- as.factor(loans_test$term)
#Standardize datasets
library(psych)
library(caret)
preprocess.train.z <- preProcess(loans_training[1:5], method = c("center", "scale"))
preprocess.train.z
loans_train.z <- predict(preprocess.train.z,loans_training[1:5])
describe(loans_train.z)
View(loans_train.z)
summary(loans_train.z$Default)
preprocess.test.z <- preProcess(loans_test[1:5], method = c("center", "scale"))
preprocess.test.z
loans_test.z <- predict(preprocess.test.z,loans_test[1:5])
describe(loans_test.z)
View(loans_test.z)
summary(loans_train.z$Default)
(22417 * 2.3) + 22417
#Resampling subroutine
rare.record.indices <- which(loans_train.z$Default == "1")
rare.indices.resampled <- sample(x = rare.record.indices,size = 51559, replace = TRUE)
rare.records.resampled <- loans_train.z[rare.indices.resampled,]
loans_train.3.3x <- rbind(loans_train.z, rare.records.resampled)
table(loans_train.3.3x$Default)
#Develop 3.3x CART model
TC <- trainControl(method = "CV", number = 10)
fit.CART.3.3x <- train(Default ~ ., data = loans_train.3.3x, method = "rpart", trControl = TC)
fit.CART.3.3x$resample
testsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_test.z)
table(loans_test.z$Default, testsetpreds.CART3.3x)
testsetpreds.CART3.3x
#Predictions
set.seed(123)
loans_validation$grade <- as.character(loans_validation$grade)#Make the grade variable categorical
loans_validation$term <- as.factor(loans_validation$term)#Make the term variable categorical
loans_validation$Index <- as.factor(loans_validation$Index)#Make the Index variable categorical
#Standardize dataset
library(psych)
library(caret)
preprocess.validation.z <- preProcess(loans_validation[1:6], method = c("center", "scale"))
preprocess.validation.z
loans_validation.z <- predict(preprocess.validation.z,loans_validation[1:6])
#Predict Defaults using Cart
validationsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_validation.z)
View(validationsetpreds.CART3.3x)
Any help would be greatly appreciated :)
How would I apply this to the validation data set?
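One thing that may be worth checking is whether the validation data goes through the same preprocessing as the training data. The sketch below is not the loans data: it uses the built-in iris data as a stand-in, and the names train_df, new_df, pp and fit are made up for illustration. It assumes the new data has the same predictor columns (names and types) as the training data; extra columns such as an index would be dropped before predicting.

```r
library(caret)

set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[idx, ]
# Stand-in for a validation set: same predictors, no target column
new_df <- iris[-idx, names(iris) != "Species"]

# Estimate centering/scaling on the training predictors only
pp <- preProcess(train_df[, 1:4], method = c("center", "scale"))
train_z <- train_df
train_z[, 1:4] <- predict(pp, train_df[, 1:4])

fit <- train(Species ~ ., data = train_z, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Reuse the training preProcess object on the new data, then predict
new_z <- predict(pp, new_df)
preds <- predict(fit, newdata = new_z)

# One prediction per row of the new data
head(data.frame(new_df, predicted = preds))
```

Reusing the preProcess object fitted on the training data keeps the new data on the same scale the model was trained on, instead of re-estimating means and standard deviations from the validation rows.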
I'm trying to get prediction intervals from a random forest model that has a categorical response variable. Ideally, I would like to see how confident the model is for classifying an observation into a given response category.
On the last line of my code you'll see a predict() that works when the interval = argument is not included. When I include the "interval =" I get an error. Any idea on how to get prediction intervals for the output?
# Load libraries
library(data.table)
library(randomForest)
library(caret)
# Set seed
set.seed(123)
# Load the necessary data
df.0 <- diamonds
setDT(df.0)
# set up the cross-validation parameters
control <- trainControl(method = "repeatedcv",
                        number = 10)
metric <- "Accuracy"
mtry <- seq(from = 1,
            to = length(unique(df.0$cut)),
            by = 1)
tunegrid <- expand.grid(mtry = mtry)
# Add rownames so we can use as index
df.0[, indexNum := .I]
trainer <- df.0[ ,.SD[sample(x = .N, size = (.N * 0.9))], by = cut] # Pull 90% of each cut into training
tester <- df.0[!trainer, on = c("indexNum")]
# Remove index number
tester <- tester[, ":=" (indexNum = NULL)]
trainer <- trainer[, ":=" (indexNum = NULL)]
# build a model and assess its accuracy via 10-fold cross validation
rf_mod <-
  train(
    x = trainer[, .(x, y, z, depth, table)],
    y = trainer$cut,
    method = "rf",
    metric = metric,
    tuneGrid = tunegrid,
    trControl = control  # use the cross-validation settings defined above
  )
# check out which mtry value was best
plot(rf_mod)
# test the model against the test data
cut_pred <- predict(rf_mod, newdata = tester[, .(x, y, z, depth, table)], interval = "prediction")
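For a categorical response, one possible way to see how confident the model is would be to ask for class probabilities rather than an interval. As far as I know, interval = is an argument of predict() for linear models, not for random forests. The sketch below uses the built-in iris data and a small forest to keep it quick; rf_small and probs are names made up for this example.

```r
library(caret)
library(randomForest)

set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.9, list = FALSE)

# Small random forest on a toy data set (ntree is passed on to randomForest)
rf_small <- train(x = iris[idx, 1:4],
                  y = iris$Species[idx],
                  method = "rf",
                  ntree = 50,
                  tuneGrid = expand.grid(mtry = 2),
                  trControl = trainControl(method = "cv", number = 5))

# type = "prob" returns one column per class instead of a single label
probs <- predict(rf_small, newdata = iris[-idx, 1:4], type = "prob")
head(probs)
```

Each row holds the share of trees voting for each class, so it reads as a rough confidence per category rather than a formal prediction interval.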
I want to extract the predictions for new, unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error Error: $ operator is invalid for atomic vectors. How should the input parameters be structured to use this function?
Here is my reproducible code:
library(caret)
dat <- as.data.frame(ChickWeight)
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
set.seed(1)
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# train a random forest with caret
set.seed(1)
model <- caret::train(x = all_preds,
                      y = response,
                      method = "rf",
                      ntree = 250,
                      metric = "RMSE",
                      trControl = contr)
# validation set
vali <- dat[which(dat$set == "validation"), ]
# not working
caret::extractPrediction(models = model, testX = vali[,-c(3,5,1)], testY = vali[,1])
caret::extractPrediction(models = model, testX = vali, testY = vali)
# works without problems
caret::predict.train(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. Basically, the models argument expects a list of models, not a single model instance. So I passed list(my_rf = model) instead of just model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])
My dataset contains 5851 observations and is split into a train (3511 observations) and a test (2340 observations) set. I now want to train a model using KNN with two variables. I want to do 10-fold CV, repeated 5 times, using the ROC metric and the one-standard-error rule, with the variables preprocessed (centered and scaled). The code is shown below.
set.seed(44780)
ctrl_repcvSE <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             summaryFunction = twoClassSummary, classProbs = TRUE,
                             selectionFunction = "oneSE")
tune_grid <- expand.grid(k = 45:75)
mod4 <- train(purchased ~ total_policies + total_contrib,
              data = mhomes_train, method = "knn",
              trControl = ctrl_repcvSE, metric = "ROC",
              tuneGrid = tune_grid, preProcess = c("center", "scale"))
The problem is that I have already tried many different ranges for K (e.g., K = 10:20, 30:40, 50:60, 150:160) as well as different tuning lengths. Every time, the output says that the chosen value of K is the last one in the range; for example, for K = 70:80, the chosen value is K = 80. This suggests I should look further, because if the largest K in the range is chosen, there may be better values of K above 80. How can I eventually find the right one?
The assignment only specifies: For k-nearest neighbours, explore reasonable values of k using the total_policies and total_contrib variables only.
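One thing that might explain the pattern is the selection rule itself. With selectionFunction = "oneSE", caret picks the simplest model within one standard error of the best one, and as far as I understand caret's k-NN setup, a larger k counts as the simpler model. A rough way to check is to fit the same coarse, wide grid twice, once with "best" and once with "oneSE", and compare. The sketch below uses the built-in iris data with Accuracy instead of the mhomes data with ROC; fit_knn and k_grid are names made up for this example.

```r
library(caret)

# Hypothetical wide, coarse grid instead of a narrow window of k values
k_grid <- expand.grid(k = seq(1, 41, by = 4))

fit_knn <- function(selection) {
  set.seed(44780)  # same resampling folds for both runs
  train(Species ~ ., data = iris, method = "knn",
        preProcess = c("center", "scale"),
        tuneGrid = k_grid,
        trControl = trainControl(method = "cv", number = 5,
                                 selectionFunction = selection))
}

mod_best  <- fit_knn("best")
mod_onese <- fit_knn("oneSE")

# Resampled performance per k, with its standard deviation
mod_best$results[, c("k", "Accuracy", "AccuracySD")]

# k chosen by each selection rule
c(best = mod_best$bestTune$k, oneSE = mod_onese$bestTune$k)
```

If the performance curve is fairly flat over the grid, the one-standard-error rule can keep landing on the largest k supplied, so plotting the whole curve (for example with plot(mod_best)) over a wide range is probably more informative than shifting a narrow window upwards.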
Welcome to Stack Overflow. Your question isn't easy to answer.
For k-nearest neighbours I use a different function, knn3, which is part of the caret package.
I'll give an example using the iris dataset. We compute the accuracy of the model for different values of k and plot those accuracies.
library(data.table)
library(tidyverse)
library(scales)
library(caret)
dt <- as.data.table(iris)
# converting and scaling data ----
dt$Species <- dt$Species %>% as.factor()
dt$Sepal.Length <- dt$Sepal.Length %>% scale()
dt$Sepal.Width <- dt$Sepal.Width %>% scale()
dt$Petal.Length <- dt$Petal.Length %>% scale()
dt$Petal.Width <- dt$Petal.Width %>% scale()
# remove in the real run ----
set.seed(1234567)
# split data into train and test - 3:1 ----
train_index <- createDataPartition(dt$Species, p = 0.75, list = FALSE)
train <- dt[train_index, ]
test <- dt[-train_index, ]
# values to check for k ----
K_VALUES <- 20:1
test_acc <- numeric(0)
train_acc <- numeric(0)
# calculate different models for each value of k ----
for (x in K_VALUES){
  model <- knn3(Species ~ ., data = train, k = x)
  pred_test <- predict(model, test, type = "class")
  pred_test_acc <- confusionMatrix(table(pred_test,
                                         test$Species))$overall["Accuracy"]
  test_acc <- c(test_acc, pred_test_acc)
  pred_train <- predict(model, train, type = "class")
  pred_train_acc <- confusionMatrix(table(pred_train,
                                          train$Species))$overall["Accuracy"]
  train_acc <- c(train_acc, pred_train_acc)
}
data <- data.table(x = K_VALUES, train = train_acc, test = test_acc)
# plot a validation curve ----
plot_data <- gather(data, "type", "value", -x)
g <- qplot(x = x,
           y = value,
           data = plot_data,
           color = type,
           geom = "path",
           xlim = c(max(K_VALUES), min(K_VALUES) - 1))
print(g)
Now look for a k that gives good accuracy on your test data. That's the value you're looking for.
Disclaimer: this is simplified, but the approach should help you solve your problem.
I use PCA on my divided train dataset and project the test dataset to the results after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
After I run this, there is a warning, but the training still runs.
xgboost: label will be ignored
I can predict on the training dataset using the model, but when I try to predict on the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
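One thing that may be worth checking here is whether the training and test matrices end up with the same columns: in the code above, traincom keeps only components 1:500 while testcom keeps every component. The sketch below uses base R only and the built-in iris data instead of bottom10.csv; n_comp, train_com and test_com are names made up for this example. It projects the test rows with predict() on the prcomp object and keeps the same components on both sides.

```r
set.seed(1)
X <- as.matrix(iris[, 1:4])
in_train <- sample(nrow(X), 120)

# Fit PCA on the training rows only
pca <- prcomp(X[in_train, ])

n_comp <- 2  # hypothetical number of components to keep

# Keep the SAME components for training and test
train_com <- pca$x[, 1:n_comp, drop = FALSE]
test_com  <- predict(pca, newdata = X[-in_train, ])[, 1:n_comp, drop = FALSE]

# The manual projection gives the same scores as predict()
manual <- scale(X[-in_train, ], center = pca$center, scale = FALSE) %*%
  pca$rotation[, 1:n_comp, drop = FALSE]

# Column names must match for the downstream model
identical(colnames(train_com), colnames(test_com))
```

With matching column names and counts on both matrices, a model trained on train_com should at least receive the features it expects when predicting on test_com.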
I am trying to evaluate my machine learning model in R. In general, training the model does not work well.
# # Logistic regression multiclass
for (i in 1:30) {
  # split data into training/test
  trainPhyIndex <- createDataPartition(subs_phy$Methane, p = 10/17, list = FALSE)
  trainingPhy <- subs_phy[trainPhyIndex,]
  testingPhy <- subs_phy[-trainPhyIndex,]
  # Pre-process predictor values
  trainXphy <- trainingPhy[, names(trainingPhy) != "Methane"]
  preProcValuesPhy <- preProcess(x = trainXphy, method = c("center", "scale"))
  # using repeated cross-validation to avoid over-fitting
  fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
                                      number = 10,
                                      repeats = 4,
                                      savePredictions = "final",
                                      classProbs = TRUE
  )
  fit_glmnet_phy <- train(Methane ~ .,
                          trainingPhy,
                          method = "glmnet",
                          tuneGrid = expand.grid(
                            .alpha = 0.1,
                            .lambda = 0.00023),
                          metric = "Accuracy",
                          trControl = fitControlPhyGLMNET)
  pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
  # Get the confusion matrix to see accuracy value
  u <- union(pred_glmnet_phy, testingPhy$Methane)
  t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
  accu_glmnet_phy <- confusionMatrix(t)
  # accu_glmnet_phy <- confusionMatrix(pred_glmnet_phy, testingPhy$Methane)
  glmnetstatsPhy[(nrow(glmnetstatsPhy) + 1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stops at the fit_glmnet_phy <- train(Methane ~ ., ...) command and shows
Metric Accuracy not applicable for regression models
I have no idea what this error means.
I have also attached the type of the Methane column:
[Image: type of the Methane column]
Try normalizing the input columns and converting the output column to a factor. This helped me resolve a similar issue.
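To illustrate the factor part of that suggestion, here is a minimal sketch. It assumes the outcome is stored as numeric codes, which is a guess since the Methane column is not shown. It uses the built-in iris data and rpart rather than the original data and glmnet, and the labels low/mid/high are made up for this example.

```r
library(caret)

set.seed(1)
# Hypothetical data: class outcome stored as numeric codes 0/1/2
dat <- iris
dat$Species <- as.numeric(dat$Species) - 1

ctrl <- trainControl(method = "cv", number = 5)

# Numeric outcome: caret treats the task as regression and rejects Accuracy
err <- tryCatch(
  train(Species ~ ., data = dat, method = "rpart",
        metric = "Accuracy", trControl = ctrl),
  error = function(e) conditionMessage(e)
)
print(err)

# Factor outcome: caret treats the task as classification
dat$Species <- factor(dat$Species, labels = c("low", "mid", "high"))
fit <- train(Species ~ ., data = dat, method = "rpart",
             metric = "Accuracy", trControl = ctrl)
fit$modelType
```

If classProbs = TRUE is used, as in the question, the factor levels also need to be valid R names, so labels like "low" are safer than "0" or "1".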