r studio caret train factor has no data - r

I'm working with a dataset that I'm training with the caret package. My class variable has 7 levels, whose labels I created from the dataset documentation. It turns out that one of the levels has no data at all in the dataset, and I get the following error: Error in train.default(x, y, weights = w, ...) : One or more factor levels in the outcome has no data: 'vwnfp'. The easy fix would be to drop that level, and that should work, but I'm wondering whether the caret package has a parameter that can handle this kind of situation. I did try adding na.action = 'na.omit'. I also wonder whether the preProcess argument can handle this, but I have never used preProcess before and my attempts were unsuccessful. Here is my code to train the model:
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(Type_of_glass ~ ., data = data.train, method = 'rpart',
trControl = fit.control, metric = 'Accuracy', maximize = TRUE,
tuneGrid = grid, na.action = 'na.omit')
The dataset is in the following url... http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data
This is the code I'm utilizing to manipulate the dataset...
# Loading dataset and transform
data <- read.csv(file = 'data.csv',
                 header = FALSE)
colnames(data) <- c('Id', 'Ri', 'Na', 'Ma', 'Al',
                    'Si', 'K', 'Ca', 'Ba', 'Fe',
                    'Type_of_glass')
str(data)
data <- subset(data, select = -Id)
data$Type_of_glass <- factor(data$Type_of_glass,
                             levels = c(1, 2, 3, 4, 5, 6, 7),
                             labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp',
                                        'c', 't', 'h'))
str(data)
# Splitting training and test dataset
set.seed(2)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -Type_of_glass)
I don't want to remove the level manually because in production, after training, the unseen dataset is passed through the model as is. How can I handle this situation in the dataset?
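One possible approach, sketched here with toy data rather than the real glass file, is to drop the empty level from the training outcome only and keep the documented label set around for reporting. Note that na.action deals with rows containing missing values, not with factor levels that have no observations. The vectors type.codes and pred below are hypothetical stand-ins, not part of the original code.

```r
# Toy stand-in for the glass outcome: code 4 ('vwnfp') never occurs.
all.labels <- c('bwfp', 'bwnfp', 'vwfp', 'vwnfp', 'c', 't', 'h')
type.codes <- c(1, 2, 3, 5, 6, 7, 1, 2, 7)          # hypothetical class codes
y <- factor(type.codes, levels = 1:7, labels = all.labels)

n.empty <- sum(y == 'vwnfp')     # number of observations in the empty level

# Drop levels without observations before the outcome reaches train().
y.train <- droplevels(y)
levels(y.train)

# Predictions can later be mapped back onto the full documented label set.
pred <- factor(c('bwfp', 'h'), levels = levels(y.train))   # stand-in for predict() output
pred.full <- factor(as.character(pred), levels = all.labels)
levels(pred.full)
```

Applied to the code above, this would amount to something like data.train$Type_of_glass <- droplevels(data.train$Type_of_glass) before calling train(). The unseen production data needs no change, because it carries no outcome column.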

Related

Getting error "invalid type (list) for variable" when running multiple models in a for loop: how to specify outcome/predictors?

For a study I am working on I need to create bootstrapped datasets and inverse probability weights for each dataset and then run a series of models for each of these datasets/weights. I am attempting to do this with a nested for-loop where the first part of the loop creates the weights and the nested loop runs a series of models, each with different outcome variables and/or predictors. I am running about 80 models for each bootstrapped dataset, hence the need for a more automated way to do this. Below is an example of what I am doing with some mock data:
# Creation of mock data
data <- data.frame("Severity" = as.factor(c(rep("None", 25), rep("Mild", 25), rep("Moderate", 25), rep("Severe", 25))), "Severity2" = as.factor(c(rep("None", 40), rep("Mild", 20), rep("Moderate", 20), rep("Severe", 20))), "Weight" = rnorm(100, mean = 160, sd = 30), "Age" = rnorm(100, mean = 40, sd = 7), "Gender" = as.factor(rbinom(100, size = 1, prob = 0.5)), "Tested" = as.factor(rbinom(100, size = 1, prob = 0.4)))
# Set untested rows to NA in place; ifelse() would turn the factor into integer codes
data$Severity[data$Tested == 0] <- NA
data$Severity2[data$Tested == 0] <- NA
data$Severity <- ordered(data$Severity, levels = c("None", "Mild", "Moderate", "Severe"))
data$Severity2 <- ordered(data$Severity2, levels = c("None", "Mild", "Moderate", "Severe"))
# Creating bootstrapped datasets
nboot <- 2
set.seed(10)
boot.samples <- lapply(1:nboot, function(i) {
data[base::sample(1:nrow(data), replace = TRUE),]
})
# Create empty list to store results later
coefs <- list()
# Setting up the outcomes/predictors of each of the models I will run
mod1 <- list("outcome" = "Severity", "preds" = c("Weight","Age"))
mod2 <- list("outcome" = "Severity2", "preds" = c("Weight", "Age", "Gender"))
models <- list(mod1, mod2)
# Running the for-loop
for(i in 1:length(boot.samples)) {
  # Setting up weight creation
  null <- glm(formula = Tested ~ 1, family = "binomial", data = boot.samples[[i]])
  full <- glm(formula = Tested ~ Age, family = "binomial", data = boot.samples[[i]])
  step <- step(null, k = 2, direction = "forward", scope=list(lower = null, upper = full), trace = 0)
  pd.combined <- stats::predict(step, type = "response")
  numer.combined <- glm(Tested ~ 1, family = "binomial",
                        data = boot.samples[[i]])
  pn.combined <- stats::predict(numer.combined, type = "response")
  # Creating stabilized weights
  boot.samples[[i]]$ipw <- ifelse(boot.samples[[i]]$Tested==0, ((1-pn.combined)/(1-pd.combined)), (pn.combined)/(pd.combined))
  # Now running each model and storing the coefficients
  for(j in 1:length(models)) {
    outcome <- models[[j]][[1]] # Set the outcome name
    predictors <- models[[j]][[2]] # Set the predictor names
    model_results <- polr(boot.samples[[i]][,outcome] ~ boot.samples[[i]][, predictors], weights = boot.samples[[i]]$ipw, method = c("logistic"), Hess = TRUE) # Run the model
    coefs[[j]] <- model_results$coefficients # Store regression model coefficients in list
  }
}
The portion for creating the IPW weights works just fine, but I keep getting an error for the modeling portion that reads:
"Error in model.frame.default(formula = boot.samples[[i]][, outcome] ~ :
invalid type (list) for variable 'boot.samples[[i]][, predictors]'"
Based on the question asked and answered here: Error in model.frame.default ..... : invalid type (list) for variable, I know that the issue is with how I'm calling the outcomes and predictors in the model. I've tried lots of different ways to handle this, to no avail. I need to specify the outcome and predictors as I do because in my actual models the outcomes and predictors change with each model. Any ideas on how to deal with this would be greatly appreciated!
I've tried something like setting outcome <- boot.samples[[i]][,outcome] outside of the model and then just calling outcome in the model, but that gives me the same error.
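A common way around this kind of error is to build the formula from the character names and hand the data frame to the model through data =, rather than subsetting the data frame inside the formula. The following is only a sketch under assumptions: the data frame d and the models list are hypothetical mock objects, the weights are left out for brevity, and reformulate() is the base R helper that turns character vectors into a formula.

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

set.seed(1)
n <- 200
# Hypothetical mock data, not the data from the question
d <- data.frame(
  Severity = factor(sample(c("None", "Mild", "Moderate", "Severe"), n, replace = TRUE),
                    levels = c("None", "Mild", "Moderate", "Severe"), ordered = TRUE),
  Weight = rnorm(n, mean = 160, sd = 30),
  Age = rnorm(n, mean = 40, sd = 7)
)

# Each model is described by names only
models <- list(
  list(outcome = "Severity", preds = c("Weight", "Age")),
  list(outcome = "Severity", preds = "Age")
)

coefs <- lapply(models, function(m) {
  f <- reformulate(m$preds, response = m$outcome)   # e.g. Severity ~ Weight + Age
  fit <- polr(f, data = d, method = "logistic", Hess = TRUE)
  coef(fit)                                         # slope coefficients only
})
```

Inside the bootstrap loop, d would be boot.samples[[i]], and a weights column stored in that same data frame could presumably be passed by name (weights = ipw), since it is then looked up in data as well.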

How to implement knn based on weights

I would like to implement the weighted knn algorithm, but I don't know how to do it. Even though I could use kknn, I suppose it can also be done with knn. The train function in caret has a "weights" option, but I can't find the solution. Any suggestions?
I use the following code in R :
library(caret)
library(corrplot)
glass <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
col.names=c("","RI","Na","Mg","Al","Si","K","Ca","Ba","Fe","Type"))
str(glass)
head(glass)
glass_1<- glass[,-7]
glass_2<- glass_1[,-7]
head(glass_2)
glass<- glass_2
standard.features <- scale(glass[,2:8])
data <- cbind(standard.features,glass[9])
anyNA(data)
head(data)
corrplot(cor(data))
data$Type<-factor(data$Type)
inTraining <- createDataPartition(data$Type, p = .7, list = FALSE, times =1 )
training <- data[ inTraining,]
testing <- data[-inTraining,]
prop.table(table(training$Type))
prop.table(table(testing$Type))
dim(training); dim(testing);
summary(data)
fitControl <- trainControl(## 5-fold CV
method = "cv",
number = 5,
## repeated ten times
#repeats = 5)
)
#k_value <- expand.grid(kmax = 3, distance = 2, kernel = "optimal")
k_value <- expand.grid(k = 3)
set.seed(825)
knn_Fit <- train(Type ~ ., data = training, weights = ????,
method = "knn", tuneGrid = k_value,
trControl = fitControl)
## This last option is actually one
## for gbm() that passes through
#verbose = FALSE)
knn_Fit
knn_Fit$finalModel
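As far as I know, the weights argument of train() holds case weights and only affects models that accept them, so it is probably not the mechanism for weighting neighbours by distance. The commented-out grid above (kmax, distance, kernel) matches caret's "kknn" method, which is one route. To show what distance weighting means, here is a minimal base R sketch in which each of the k nearest neighbours votes with weight 1/d. The function weighted_knn and its arguments are hypothetical, and iris is used only to keep the example self-contained.

```r
# Distance-weighted k-NN: closer neighbours get a larger vote (1 / distance).
weighted_knn <- function(train_x, train_y, test_x, k = 5, eps = 1e-8) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  pred <- apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))              # Euclidean distance to each training row
    nn <- order(d)[seq_len(k)]                            # indices of the k nearest neighbours
    votes <- tapply(1 / (d[nn] + eps), train_y[nn], sum)  # summed weight per class
    votes[is.na(votes)] <- 0                              # classes absent among the neighbours
    names(votes)[which.max(votes)]
  })
  factor(pred, levels = levels(train_y))
}

set.seed(825)
idx <- sample(nrow(iris), 100)        # training rows
x <- scale(iris[, 1:4])               # standardised features
pred <- weighted_knn(x[idx, ], iris$Species[idx], x[-idx, ], k = 3)
accuracy <- mean(pred == iris$Species[-idx])
```

The eps term only guards against division by zero when a test point coincides with a training point. Other kernels could replace 1/d without changing the structure.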

A function to fit a random forest model and return the results of specified data

Consider the following data frame:
dat1 <- data.frame(Loc = rep(c("NY","MA","FL","GA"), each = 1000),
Region = rep(c("a","b","c","d"),each = 1000),
ID = rep(c(1:10), each=200),
var1 = rnorm(1000),
var2=rnorm(1000),
var3=rnorm(1000))
Loc and Region are two grouping variables for ID. Assume I have several other data frames like dat1. I am trying to write a function that will automatically fit a random forest model to the data. I want to specify the dataframe, grouping variable, and columns that I want it to use.
I have tried variants of the following functions, but keep getting error messages that say Error in get(dat, envir = .GlobalEnv) : invalid first argument when I try to run them
library(caret)
library(randomForest)
rand.f <- function(dat,groupvar,cols){
model <- train(groupvar ~ paste0(cols,collapse = "+"), data = dat, method = "rf", trControl = trainControl("cv", number = 10), importance = T)
c.e <- model$finalModel$confusion[, "class.error"]
print(c.e)
}
rand.f(dat="dat1", groupvar = "Region", cols = 5:6)
model$bestTune
##################
rand.f <- function(dat,groupvar,cols){
model <- train(get(dat, envir=.GlobalEnv)[,groupvar] ~ paste0(cols,collapse = "+"), data = dat, method = "rf", trControl = trainControl("cv", number = 10), importance = T)
c.e <- model$finalModel$confusion[, "class.error"]
print(c.e)
}
rand.f(dat="dat1", groupvar = "Region", cols = 5:6)
model$bestTune
what am I doing wrong?
The following should be working:
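rand.f <- function(dat, outcome, cols){
  # cols is now an argument instead of being looked up in the global environment
  model <- train(x = dat[, cols, drop = FALSE]
                 , y = factor(dat[, outcome]) # make sure the outcome is a factor for classification
                 , method = "rf"
                 , trControl = trainControl("cv", number = 2)
                 , importance = TRUE)
  c.e <- model$finalModel$confusion[, "class.error"]
  return(c.e)
}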
Here cols can be given as column numbers or as a vector of column names, e.g.
cols <- colnames(dat1)[5:6]
Note that I renamed the 'grouping' variable as it is a bit unclear what the grouping variable should be in this context. I have renamed it as outcome that is to be predicted to highlight what this stands for. If you did indeed try to predict the region, you can ignore this comment.
If you do want to trigger this function for different groups in your data, i.e. separate forests for different subsets, then you would best do that outside of this function.
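As for why the original attempts fail: a formula is not evaluated, so groupvar and the paste0() call are taken literally instead of being replaced by their values. The small base R sketch below illustrates this and shows reformulate() as one way to build a formula from character vectors. The column names are taken from dat1, while f.bad and f.good are hypothetical names.

```r
cols <- c("var1", "var2")

# The formula keeps the symbols as written; nothing is substituted.
f.bad <- groupvar ~ paste0(cols, collapse = "+")
all.vars(f.bad)     # the model would look for variables named "groupvar" and "cols"

# reformulate() builds the intended formula from character vectors.
f.good <- reformulate(cols, response = "Region")
f.good              # Region ~ var1 + var2
all.vars(f.good)
```

A formula built this way could be passed to train() together with data = dat, as an alternative to the x/y interface used above.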

Running h2o Grid search on R

I am running h2o grid search on R. The model is a glm using a gamma distribution.
I have defined the grid using the following settings.
hyper_parameters = list(alpha = c(0, .5), missing_values_handling = c("Skip", "MeanImputation"))
h2o.grid(algorithm = "glm", # Setting algorithm type
grid_id = "grid.s", # Id so retrieving information on iterations will be easier later
x = predictors, # Setting predictive features
y = response, # Setting target variable
training_frame = data, # Setting training set
validation_frame = validate, # Setting validation frame
hyper_params = hyper_parameters, # Setting alpha values for iterations
remove_collinear_columns = T, # Parameter to remove collinear columns
lambda_search = T, # Setting parameter to find optimal lambda value
seed = 1234, # Setting to ensure reproducible results
keep_cross_validation_predictions = F, # Setting to save cross validation predictions
compute_p_values = F, # Calculating p-values of the coefficients
family = 'gamma', # Distribution type used
standardize = T, # Standardizing continuous variables
nfolds = 2, # Number of cross-validations
fold_assignment = "Modulo", # Specifying fold assignment type to use for cross validations
link = "log")
When I run the above script, I get the following error:
Error in hyper_names[[index2]] : subscript out of bounds
Can you please help me find where the error is?
As discussed in the comments, it is difficult to tell what causes the error without sample data and code. The out-of-bounds error could arise because the code is trying to access a value that does not exist in the input, so it could come from any of the inputs to h2o.grid(). I would check the columns and rows in the training and validation data sets. The hyperparameters from the question run fine with family="binomial".
The code below runs fine with glm(). I have made several assumptions: (1) family=binomial is used instead of family=gamma, based on the sample data created, (2) the response y is binary, (3) the train and test split ratio, (4) the predictors are limited to three independent variables (x1, x2, x3), and (5) there is one binary response variable (y).
Import libraries
library(h2o)
library(h2oEnsemble)
Create sample data
x1 <- abs(100*rnorm(100))
x2 <- 10+abs(100*rnorm(100))
x3 <- 100+abs(100*rnorm(100))
#y <- ronorm(100)
y <- floor(runif(100,0,1.5))
df <- data.frame(x1, x2, x3,y)
df$y <- ifelse(df$y==1, 'yes', 'no')
df$y <- as.factor(df$y)
head(df)
Initialize h2o
h2o.init()
Prepare data in required h2o format
df <- as.h2o(df)
y <- "y"
x <- setdiff( names(df), y )
df<- df[ df$y %in% c("no", "yes"), ]
h2o.setLevels(df$y, c("no","yes") )
# Split data into train and validate sets
data <- h2o.splitFrame( df, ratios = c(.6, 0.15) )
names(data) <- c('train', 'valid', 'test')
data$train
Set parameters
grid_id <- 'glm_grid'
hyper_parameters <- list( alpha = c(0, .5, 1),
lambda = c(1, 0.5, 0.1, 0.01),
missing_values_handling = c("Skip", "MeanImputation"),
tweedie_variance_power = c(0, 1, 1.1,1.8,1.9,2,2.1,2.5,2.6,3, 5, 7),
#tweedie_variance_power = c(0, 1, 1.1,1.8,1.9,2,2.1,2.5,2.6,3, 5, 7),
seed = 1234
)
Fit h2o.grid()
h2o.grid(
algorithm = "glm",
#grid_id = grid_id,
hyper_params = hyper_parameters,
training_frame = data$train,
validation_frame = data$valid,
x = x,
y = y,
lambda_search = TRUE,
remove_collinear_columns = T,
keep_cross_validation_predictions = F,
compute_p_values = F,
standardize = T,
nfolds = 2,
fold_assignment = "Modulo",
family = "binomial"
)

Caret: undefined columns selected

I have been trying to get the code below to run in caret, but I get the following error. Can anyone tell me how to troubleshoot it?
Error in [.data.frame(data, , lvls[1]) : undefined columns selected
library(tidyverse)
library(caret)
mydf <- iris
mydf <- mydf %>%
mutate(tgt = as.factor(ifelse(Species == 'setosa','Y','N'))) %>%
select(everything(), -Species)
trainIndex <- createDataPartition(mydf$tgt, p = 0.75, times = 1, list = FALSE)
train <- mydf[trainIndex,]
test <- mydf[-trainIndex,]
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
fit_log <- train(tgt~.,
data = train,
method = "glm",
trControl = fitControl,
family = "binomial")
You need to use classProbs = TRUE in your control function. The ROC curve is based on the class probabilities, and the error comes from the summary function not finding those columns.
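The suggestion above can be sketched as the following control object. This only shows the changed trainControl() call from the question, with classProbs = TRUE added, and does not refit the model.

```r
library(caret)

# Same settings as in the question, plus classProbs = TRUE so that
# twoClassSummary receives the class probability columns it needs.
fitControl <- trainControl(method = 'repeatedcv',
                           number = 10,
                           repeats = 10,
                           allowParallel = TRUE,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
```

This object would then be passed to the original train() call through trControl, typically together with metric = "ROC". Class probabilities require factor levels that are valid R names, which holds for 'Y' and 'N' here.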
Use data = data.frame(xxxxx), as in the example below:
fit.cart <- train(Condition~., data = data.frame(trainset), method="rpart", metric=metric, trControl=control)
