R: variable has different number of levels in the node and in the data

I want to use bnlearn for a classification task with the Naive Bayes algorithm.
I use this data set for my tests, where 3 variables are continuous (V2, V4, V10) and the others are discrete. As far as I know, bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized. For now I want to convert all the features into factors. However, I ran into some problems. Here is a sample of the code:
dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)
# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted, testSet)
...
For this code I get an error message when calling predict():
'V1' has different number of levels in the node and in the data.
When I remove V1 from the training set, I get the same error for the V2 variable. However, the error disappears when I factorize the complete data set first, dataSet[] <- lapply(dataSet, as.factor), and only then split it into training and test sets.
So what is an elegant solution for this? In real-world applications the test and train sets can come from different sources. Any ideas?

The issue appears to be caused by the train and test datasets having different factor levels. I solved it by using rbind to combine the two data frames (train and test), applying as.factor to get the full set of levels for the complete dataset, and then slicing the factorized data frame back into separate train and test sets.
train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted, test)
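If combining the two data frames isn't practical, for instance because the train and test sets really do come from different sources, an alternative sketch (assuming both sets have the same columns) is to re-level the test factors using the levels learned from the training data; test values unseen in training become NA:
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- Map(function(te, tr) factor(te, levels = levels(tr)),
                 testSet, trainSet)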
I hope this can be helpful.

Related

Error: `data` and `reference` should be factors with the same levels (random forest)

This is code I am writing for an assignment. I cannot seem to get a confusion matrix for the predictions; please help me troubleshoot the code or make any necessary recommendations.
library(caret)          # createDataPartition(), confusionMatrix()
library(randomForest)   # tuneRF(), randomForest()
set.seed(1234)
test_index1 <- createDataPartition(water_potability3$Potability, p = 0.1, list = FALSE)
water_potability_train <- water_potability3[test_index1, -c(4, 6:9)]
water_potability3_test <- water_potability3[!1:nrow(water_potability3) %in% test_index1, -c(4, 6:9)]
trf <- tuneRF(x = water_potability_train[, 1:4], y = water_potability_train$Potability)
(mintree <- trf[which.min(trf[, 2]), 1])
rf_model <- randomForest(x = water_potability_train[, -5], y = water_potability_train$Potability,
                         mtry = mintree, importance = TRUE)
plot(rf_model, main = "")
varImpPlot(rf_model, main = "")
preds_rf <- predict(rf_model, water_potability3_test[, -5])
table(preds_rf, water_potability3_test$Potability)
confusionMatrix(preds_rf, water_potability3_test$Potability)
Every time I create a confusion matrix I get the error "Error: `data` and `reference` should be factors with the same levels".
As you don't share a dataset that allows me to reproduce the error, I'm gonna have a guess and provide the solution I would have used myself. If this doesn't work for you, please provide some data and perhaps explain what the Potability column contains :-)
When randomly splitting the data into training and test partitions, you risk not having observations from every class in both partitions. E.g. if you have 10 classes, you might only have 8 of them in the smaller test partition. And then when your model predicts one of the two other classes that were available in the training partition, the two factors have different levels.
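For example, this is a minimal way to reproduce the error with caret (hypothetical levels, purely to illustrate the mismatch):
library(caret)
pred <- factor("a", levels = c("a", "b"))
truth <- factor("a", levels = c("a", "b", "c"))
confusionMatrix(pred, truth)
# Error: `data` and `reference` should be factors with the same levels.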
So I use partition() from groupdata2 with the cat_col argument, which ensures each class is represented in both partitions (if possible). Then I use confusion_matrix() from cvms, as it allows different levels in the two factors.
library(groupdata2)
library(cvms)
library(randomForest)  # for tuneRF() and randomForest() below
set.seed(1234)
# Create list with two partitions
# where the ratios of classes in Potability are similar
parts <- partition(water_potability3[, -c(4, 6:9)],
                   p = 0.1, cat_col = "Potability")
# Extract the two partitions
water_potability3_test <- parts[[1]]
water_potability_train <- parts[[2]]  # name matches the modeling code below
# The modeling (haven't changed anything here)
trf <- tuneRF(x = water_potability_train[, 1:4],
              y = water_potability_train$Potability)
(mintree <- trf[which.min(trf[, 2]), 1])
rf_model <- randomForest(
  x = water_potability_train[, -5],
  y = water_potability_train$Potability,
  mtry = mintree,
  importance = TRUE
)
preds_rf <- predict(rf_model, water_potability3_test[, -5])
# Create confusion matrix
conf_mat <- cvms::confusion_matrix(
  targets = water_potability3_test$Potability,
  predictions = preds_rf
)
# The basic confusion matrix table
conf_mat$Table
# Or as a plot
plot_confusion_matrix(conf_mat)
You may also check out cvms::evaluate() that has additional evaluation metrics.
Read more
More on the groupdata2 train/test partitioning functionality here:
https://cran.rstudio.com//web/packages/groupdata2/vignettes/cross-validation_with_groupdata2.html
More on the cvms confusion matrix functionality here:
https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html

H2O R How to handle levels not trained on when predicting?

I have a set of CSVs with a result column to train, and a set of test CSVs without the result column.
library(h2o)
h2o.init()
train <- read.csv(train_file, header=T)
train.h2o <- as.h2o(train)
y <- "Result"
x <- setdiff(names(train.h2o), y)
model <- h2o.deeplearning(x = x,
                          y = y,
                          training_frame = train.h2o,
                          model_id = "my_model",
                          epochs = 5000,
                          hidden = c(50),
                          stopping_rounds = 5,
                          stopping_metric = "misclassification",
                          stopping_tolerance = 0.001,
                          seed = 1)
test <- read.csv(test_file, header=T)
test.h2o <- as.h2o(test)
pred <- h2o.predict(model,test.h2o)
When I try to predict the outcome with test data, I get a bunch of errors like:
1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'ColumnName' has levels not trained on: [ABCD, BCDE]
H2O used to be able to handle levels present in the test data but not seen during training. I found some posts online that say it does, but it is not working for me.
How can I avoid these errors, and predict a value for the test data?
There are two methods you can try:
Use factors instead of characters
Before feeding the data into the machine learning function, you can combine your train and test data and convert the character variables to factors.
That way, all unique values are recorded as factor levels even if you split the combined data later.
library(h2o)
h2o.init()
#using dummy data as combined training and testing data
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
#assuming GLEASON is the character variable, and transform it to factor
prostate.hex$GLEASON <- h2o.asfactor(prostate.hex$GLEASON)
#split data such that 0,4,5,8 only in test set, and not in train set.
h2o.test <- prostate.hex[prostate.hex$GLEASON %in% c("0","4","5","8"),]
h2o.train <- prostate.hex[!prostate.hex$GLEASON %in% c("0","4","5","8"),]
#train model
model <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS", "GLEASON"),
                 training_frame = h2o.train, family = "binomial", nfolds = 0)
#predict without error
pred <- predict(model,h2o.test)
Use one-hot encoding explicitly
The h2o machine learning functions provide internal encoding methods (via the categorical_encoding parameter), including one-hot encoding, which turns a character variable into many 1/0 integer variables.
Instead of relying on this implicitly, you can apply the encoding explicitly: levels that don't exist in the training data are not used in the model, and new levels in the test data simply contribute nothing to the prediction; see the sketch below.
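A minimal sketch in base R of what explicit encoding could look like, assuming plain train and test data frames with a character column GLEASON (the onehot() helper is hypothetical, just for illustration); test rows with unseen levels simply get all-zero indicator columns:
lvls <- sort(unique(train$GLEASON))  # levels observed in training only
onehot <- function(x, lvls) {
  m <- sapply(lvls, function(l) as.integer(x == l))
  colnames(m) <- paste0("GLEASON_", lvls)
  m
}
train_enc <- cbind(train[names(train) != "GLEASON"], onehot(train$GLEASON, lvls))
test_enc <- cbind(test[names(test) != "GLEASON"], onehot(test$GLEASON, lvls))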

R object is not a matrix

I am new to R and am trying to save my SVM model. I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
Where the last column is my label
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for (timesRun in 1:SVMtimes) {
  cat("Running SVM = ", timesRun, " result = ")
  trainSet = as.data.frame(data[, 1:(ncol(data)-1)])
  trainClasses = as.factor(data[, ncol(data)])
  model = svm(trainSet, trainClasses, type = "C-classification",
              kernel = KERNEL, degree = DEGREE, coef0 = 1, cost = 1,
              cachesize = 10000, cross = 10)
  accAll = model$accuracies
  cat(mean(accAll), "/", sd(accAll), "\n")
  results10foldAll = rbind(results10foldAll, c(mean(accAll), sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm() call expects data to be a matrix (where you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by user5196900, you need a matrix to run svm(). However, beware that a matrix means all columns have the same datatype, either all numeric or all categorical/factors. If this is true for your data, as.matrix() may be fine.
In practice, more often than not, people want model.matrix() or sparse.model.matrix() (from package Matrix), which gives dummy columns for categorical variables while keeping a single column for each numerical variable. But it is still a matrix; a sketch follows below.
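A minimal sketch of that approach, assuming data is a mixed data frame with the label type in its last column:
# expand factors into dummy columns; numeric columns pass through unchanged
X <- model.matrix(~ . - 1, data = data[, -ncol(data), drop = FALSE])
y <- as.factor(data[, ncol(data)])
svm.model <- e1071::svm(x = X, y = y, type = "C-classification",
                        kernel = "polynomial", scale = FALSE)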

QDA | lengths of training and test data sets | How to split data in training and test data?

In QDA (Quadratic Discriminant Analysis), do I need to keep the lengths of the training and test data sets exactly the same? If not, how do you compute a confusion matrix in such cases?
Here's pseudo data.
If I keep the training and test data sets at different lengths, it gives an error (using RStudio):
"Error in table(pred, true) : all arguments must have the same length".
I tried removing NAs using na.omit() on both data sets as well as on pred and true, and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set exactly in half, half as training and half as test, it worked perfectly after na.omit() on pred and true.
The following is the code used for both approaches. In approach 2, with the data split into equal halves, it worked perfectly fine.
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
I want to know if there is any method by which we can get a confusion matrix when the data set is unequally divided into training and test sets.
Thanks!
Are you plugging in your training inputs instead of your test set input data to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting incorrectly. Looking at your code, that is exactly what happens: the model is fit with qda(train_vif$Awareness ~ train_vif$Zone), so predict(zone_qda, test_vif) cannot find a column named train_vif$Zone in test_vif, falls back to the training vectors from the formula's environment, and returns 399 fitted values instead of 47 predictions. Fit the model with a data argument, qda(Awareness ~ Zone, data = train_vif), and then predict on test_vif. There is no reason you shouldn't be able to get a confusion matrix using test data of a different size than your training data.
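A minimal sketch of the corrected calls (assuming MASS is loaded and the columns used have no missing values):
library(MASS)
# fit with a data argument so predict() can resolve the variables in newdata
zone_qda <- qda(Awareness ~ Zone, data = train_vif, na.action = na.omit)
zone_pred <- predict(zone_qda, newdata = test_vif)
# pred and true now both have the test-set length
zone_aware <- table(zone_pred$class, test_vif$Awareness)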

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like, in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model comprising 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn), which goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
                 height = rpois(56, 10),
                 fbm = rpois(56, 10),
                 nitrogen = rpois(56, 10),
                 carbon = rpois(56, 10),
                 chl = rpois(56, 10),
                 ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10,
                 validation = "LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train, ])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
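As a quick sanity check that the predictions really come from the held-out rows (using the objects above):
dim(res)        # 36 1 1: one prediction per held-out observation
sum(!DF$train)  # 36, the number of rows in the test partition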

Resources