H2O R: How to handle levels not trained on when predicting?

I have a set of CSVs with a result column to train, and a set of test CSVs without the result column.
library(h2o)
h2o.init()
train <- read.csv(train_file, header=T)
train.h2o <- as.h2o(train)
y <- "Result"
x <- setdiff(names(train.h2o), y)
model <- h2o.deeplearning(x = x,
                          y = y,
                          training_frame = train.h2o,
                          model_id = "my_model",
                          epochs = 5000,
                          hidden = c(50),
                          stopping_rounds = 5,
                          stopping_metric = "misclassification",
                          stopping_tolerance = 0.001,
                          seed = 1)
test <- read.csv(test_file, header=T)
test.h2o <- as.h2o(test)
pred <- h2o.predict(model,test.h2o)
When I try to predict the outcome with test data, I get a bunch of errors like:
1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'ColumnName' has levels not trained on: [ABCD, BCDE]
H2O is supposed to be able to handle data present in the test set but not seen during training; I found posts online saying that it does. But it is not working for me.
How can I avoid these errors, and predict a value for the test data?

There are two approaches you can try:
Use factors instead of characters
Before feeding the data into a machine learning function, you can combine your train and test data and convert the character variables to factors.
The unique values are then recorded as level information, even if you split the combined data again later.
library(h2o)
h2o.init()
#using dummy data as combined training and testing data
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
#assuming GLEASON is the character variable, convert it to a factor
prostate.hex$GLEASON <- h2o.asfactor(prostate.hex$GLEASON)
#split data such that 0,4,5,8 only in test set, and not in train set.
h2o.test <- prostate.hex[prostate.hex$GLEASON %in% c("0","4","5","8"),]
h2o.train <- prostate.hex[!prostate.hex$GLEASON %in% c("0","4","5","8"),]
#train model
model <- h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS","GLEASON"),
                 training_frame = h2o.train, family = "binomial", nfolds = 0)
#predict without error
pred <- predict(model,h2o.test)
Use one-hot encoding explicitly
H2O's machine learning functions provide internal encoding methods (via the categorical_encoding parameter), including one-hot encoding, which turns a character variable into many 1/0 indicator variables.
Instead of relying on this implicitly, you can apply the encoding yourself before training. Levels that do not exist in the training data then never enter the model, and new levels in the test data are simply not used for prediction.
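For illustration, here is a minimal sketch of explicit one-hot encoding in base R, reusing the prostate data and the GLEASON column from the example above; model.matrix() is one of several ways to do this.
#read the raw CSV so we can encode before converting to an H2O frame
prostate <- read.csv(prostatePath)
prostate$GLEASON <- factor(prostate$GLEASON)
#"~ GLEASON - 1" drops the intercept so every level gets its own 0/1 column
gleason_dummies <- model.matrix(~ GLEASON - 1, data = prostate)
#bind the indicator columns back and drop the original column
prostate_encoded <- cbind(prostate[, setdiff(names(prostate), "GLEASON")],
                          gleason_dummies)
head(prostate_encoded)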

Related

KNN function in R producing NA/NaN/Inf in foreign function call (arg 6) error

I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
library(class)
data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting:
Error in knn(train = training_data, test = testing_data, cl = outcome, k = 10) : NA/NaN/Inf in foreign function call (arg 6)
I checked and training_data, testing_data, and outcome all look correct; the issue seems to be with the knn call itself.
The issue is with your data and the knn function you are using: it cannot handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
  mutate(Seeded = as.numeric(as.factor(Seeded)) - 1) %>%
  mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season has no natural ordering. A better approach is to treat it as a set of dummy variables instead; see the sketch after the link below.
See this link for examples:
R - convert from categorical to numeric for KNN
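As a rough illustration of the dummy-variable route (assuming the Season column from the snippet above), something like this could replace the as.numeric() conversion:
#one 0/1 column per season instead of a single fake-ordered number
season_dummies <- model.matrix(~ Season - 1, data = data)
data_encoded <- cbind(data[, setdiff(names(data), "Season")], season_dummies)
#then split data_encoded into training/testing sets and call knn() as before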

Make predictions on new data after training the GLM Lasso model

I have trained a classification model on 13,000 labeled rows with lasso in R's glmnet library. I have checked the accuracy and it looks decent; now I want to make predictions for the rest of the dataset, which is 300,000 rows. My approach is to label the remaining rows using the trained model, though I'm not sure that is the most effective strategy for approximate labeling.
But, when I'm trying to label rest of the data, I'm running into this error:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even if I break the dataset into chunks of 5,000 rows for prediction, I still get the same error.
Here's my code:
library(glmnet)
#the subset of original dataset
data.text <- data.text_filtered %>% filter(!label1 == "NA")
#Quanteda corpus
data_corpus <- corpus(data.text$text, docvars = data.frame(labels = data.text$label1))
set.seed(1234)
dataShuffled <- corpus_sample(data_corpus, size = 12845)
dataDfm <- dfm_trim( dfm(dataShuffled, verbose = FALSE), min_termfreq = 10)
#model to train the classifier
lasso <- cv.glmnet(x = dataDfm[1:10000,], y = trainclass[1:10000],
                   alpha = 1, nfolds = 5, family = "binomial")
#plot the lasso plot
plot(lasso)
#predictions
dataPreds <- predict(lasso, dataDfm[10000:12845,], type="class")
(movTable <- table(dataPreds, docvars(dataShuffled, "labels")[10000:12845]))
#make predictions on the rest of the dataset, which has 300,000 rows
data.text_NAs <- data.text_filtered %>% filter(label1 == "NA")
data_NADfm <- dfm_trim( dfm(corpus(data.text_NAs$text), verbose = FALSE), min_termfreq = 10)
data.text_filtered <- data.text_filtered %>%
  mutate(label = predict(lasso, as.matrix(data_NADfm), type="class", s="lambda.1se"))
Thanks much for any help.
The problem lies in as.matrix(data_NADfm): it converts the dfm into a dense matrix, which is too large to handle.
Solution: keep it sparse. Either remove the as.matrix() wrapper or, if predict() does not accept a raw dfm, coerce it to a plain sparse matrix (from the Matrix package) using as(data_NADfm, "dgCMatrix"). This works because both cv.glmnet() and its predict() method accept sparse matrix inputs.
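A minimal sketch of the sparse route, reusing lasso, dataDfm, and data_NADfm from the question; note that glmnet also needs the new dfm to have exactly the training features, which quanteda's dfm_match() can enforce:
library(quanteda)
library(Matrix)
#align the new dfm's features with the columns the model was trained on
data_NADfm <- dfm_match(data_NADfm, features = featnames(dataDfm))
#coerce to a plain sparse matrix; nothing is ever densified
new_x <- as(data_NADfm, "dgCMatrix")
preds_NA <- predict(lasso, newx = new_x, type = "class", s = "lambda.1se")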

R: variable has different number of levels in the node and in the data

I want to use bnlearn for a classification task with the Naive Bayes algorithm.
I use this data set for my tests, where three variables are continuous (V2, V4, V10) and the others are discrete. As far as I know, bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized. For now I want to convert all the features into factors. However, I ran into some problems. Here is sample code:
dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)
# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted, testSet)
...
For this code I get an error message while calling predict()
'V1' has different number of levels in the node and in the data.
And when I remove V1 from the training set, I get the same error for the V2 variable. However, the error disappears when I factorize the whole dataset first, dataSet[] <- lapply(dataSet, as.factor), and only then split it into training and test sets.
So what is an elegant solution for this? In real-world applications the test and train sets can come from different sources. Any ideas?
The issue appears to be caused by the fact that my train and test datasets had different factor levels. I solved it by using rbind to combine the two dataframes (train and test), applying as.factor to get the full set of levels for the complete dataset, and then slicing the factorized dataframe back into separate train and test datasets.
train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted, test)
I hope this can be helpful.
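Since the question notes that train and test sets may come from different sources, an alternative sketch that avoids rbind() is to impose the training levels on the test set directly (using the trainSet/testSet names from the question); test values unseen in training then become NA:
for (col in names(trainSet)) {
  #reuse the levels learned from the training data for each test column
  testSet[[col]] <- factor(testSet[[col]], levels = levels(trainSet[[col]]))
}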

Why am I getting empty string as an extra factor in my target class?

I'm writing code to test a bunch of machine learning models on a test data set. Some of the rows in my target class have empty strings, so I wrote some code to get rid of these rows.
data <- read.csv("ML17-TP2-train.csv", header = TRUE)
filtered_data <- data[!(data$gender==" " | data$gender==""),]
train_data <- filtered_data[1:1200, c(3,4,6,7,8)]
test_data <- filtered_data[15001:17000, c(3,4,6,7,8)]
I then used the mlr package to train and test a machine learning model:
#create the task
nb.task <- makeClassifTask(id = "NaiveBayes", data = nb.data, target = "gender")
#create the learner
nb.learner <- makeLearner("classif.naiveBayes", predict.type = "prob", fix.factors.prediction = TRUE)
#train the learner
nb.trained <- train(nb.learner, nb.task)
#predict
nb.predict <- predict(nb.trained, newdata = test_data)
#get the auc
performance(nb.predict, measures = auc)
I was getting an NA value when I tried checking the AUC
> performance(nb.predict, measures = auc)
auc
NA
When I checked the factor levels of my target:
test.gender <- as.factor(nb.data$gender)
I noticed that I had 3 levels: the two I was expecting plus a third, the empty string "". I've checked my data in Excel, deleted all of the variables in my environment, and rerun my code from scratch. I even tried deleting all of the records except for 2, and I still get told that I have 3 levels.
What am I doing that is causing an extra factor to be introduced into my code?
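One likely cause (an assumption, since it depends on gender having been read in as a factor, e.g. with stringsAsFactors = TRUE): subsetting a factor keeps its unused levels, so the "" level survives the row filter even though no remaining row contains it. droplevels() discards it:
filtered_data <- data[!(data$gender == " " | data$gender == ""), ]
#drop factor levels that no longer occur in the filtered data
filtered_data$gender <- droplevels(filtered_data$gender)
levels(filtered_data$gender)  #should now show only the two expected classes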

R object is not a matrix

I am new to R and am trying to save my SVM model. I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
  cat("Running SVM = ", timesRun, " result = ")
  trainSet = as.data.frame(data[, 1:(ncol(data)-1)])
  trainClasses = as.factor(data[, ncol(data)])
  model = svm(trainSet, trainClasses, type = "C-classification",
              kernel = KERNEL, degree = DEGREE, coef0 = 1, cost = 1,
              cachesize = 10000, cross = 10)
  accAll = model$accuracies
  cat(mean(accAll), "/", sd(accAll), "\n")
  results10foldAll = rbind(results10foldAll, c(mean(accAll), sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but in the svm.model call the function expects data to be a matrix (you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). But beware that a matrix requires all columns to have the same data type: all numeric or all categorical/factor. If that is true for your data, as.matrix() may be fine.
In practice, people more often want model.matrix() or sparse.model.matrix() (from the Matrix package), which creates dummy columns for categorical variables while keeping a single column for each numeric variable; the result is still a matrix.
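A minimal sketch of the model.matrix() route, assuming the full data frame (here called data) still contains the type label column used in the formula:
library(e1071)
#factor columns become 0/1 dummy columns; numeric columns pass through
X <- model.matrix(type ~ . - 1, data = data)
y <- as.factor(data$type)
svm.model <- svm(x = X, y = y, type = "C-classification",
                 kernel = "polynomial", scale = FALSE)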
