Why am I getting an empty string as an extra factor level in my target class? - R

I'm writing code to test a bunch of machine learning models on a test data set. Some of the rows in my target class have empty strings, so I wrote some code to get rid of these rows.
data <- read.csv("ML17-TP2-train.csv", header = TRUE)
filtered_data <- data[!(data$gender==" " | data$gender==""),]
train_data <- filtered_data[1:1200, c(3,4,6,7,8)]
test_data <- filtered_data[15001:17000, c(3,4,6,7,8)]
I then used the mlr package to train and test a machine learning model:
#create the task
nb.task <- makeClassifTask(id = "NaiveBayes", data = nb.data, target = "gender")
#create the learner
nb.learner <- makeLearner("classif.naiveBayes", predict.type = "prob", fix.factors.prediction = TRUE)
#train the learner
nb.trained <- train(nb.learner, nb.task)
#predict
nb.predict <- predict(nb.trained, newdata = test_data)
#get the auc
performance(nb.predict, measures = auc)
I was getting an NA value when I tried checking the AUC
> performance(nb.predict, measures = auc)
auc
NA
When I checked the levels of the target factor
test.gender <- as.factor(nb.data$gender)
I noticed that it told me I had 3 levels: the two that I was expecting plus a third, the empty string "". I've checked my data in Excel, I've deleted all of the variables in my environment and rerun my code from scratch. I even tried deleting all of the records except for 2, and I still get a message telling me that I have 3 levels.
What am I doing that is causing an extra factor to be introduced into my code?
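For reference, one thing worth checking (a minimal sketch, assuming gender was read in as a factor; the level names below are placeholders): subsetting a data frame in R does not drop unused factor levels, so "" can survive the row filter as an unused level, and droplevels() removes it.
# filtering rows keeps all original levels, including ""
levels(filtered_data$gender)
# e.g. ""  "female"  "male"
# drop the unused "" level after filtering
filtered_data$gender <- droplevels(filtered_data$gender)
levels(filtered_data$gender)
# "female"  "male"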

Related

R: training and tuning a random forest classifier with hardware constraints

I'm sorry if this is not appropriate to ask here, but please forgive a noobie.
I'm training a random forest classifier with multiple classes (8) using R caret on my experimental data, on my desktop with 32 GB RAM and a 4-core CPU. However, I keep getting complaints from RStudio that it cannot allocate a vector of 9 GB, so I have to reduce the training set all the way down to 1% of the data just to run k-fold CV and some grid search. As a result my model accuracy is ~50% and the resulting selected features aren't very good at all; only 2 out of 8 classes are being distinguished somewhat reliably. Of course it could be that I don't have any good features, but I want to at least train and tune my model on a decently sized training set first. What solutions could help? Is there anywhere I can upload my data and train the model somewhere else? I'm so new that I don't know if something like a cloud-based service could help me. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, so maybe it is my bad coding that screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#As I want to build a RF model to classify drug treatments
# make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find nearZerovarance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
#note: findCorrelation indexes into df.Cor = cor(df[-1]), so shift by 1 when dropping from df
filtered.df <- df[, -(highlyCor + 1)]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test
#Here is my problem, I want to use 80% of the data for training
#but on my computer I can only manage about 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used mtry 1:12 on a previous run, but I wanted to test more, limited again by the machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(x = training_set[, -1], y = as.factor(training_set[, 1]),
               method = "rf", trControl = control, metric = "Accuracy",
               maximize = TRUE, importance = TRUE, ntree = 800,
               tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")
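As a rough sketch of one memory-saving direction (the downSample() call and the ranger backend are assumptions on my part, not something tried above): caret also supports ranger, a much more memory-efficient random forest implementation than randomForest, and caret::downSample() shrinks the training set per class without dropping whole classes.
library(caret)
library(ranger)
# down-sample so every class has as many rows as the rarest class;
# this keeps all classes while cutting the training set (and RAM use) sharply
balanced <- downSample(x = training_set[, -1],
                       y = as.factor(training_set[, 1]),
                       yname = "treatmentsum")
control  <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
# ranger's tuning grid uses mtry, splitrule and min.node.size
tunegrid <- expand.grid(mtry = 1:20, splitrule = "gini", min.node.size = 1)
model <- train(treatmentsum ~ ., data = balanced,
               method = "ranger", trControl = control,
               metric = "Accuracy", tuneGrid = tunegrid,
               num.trees = 800, importance = "impurity")
print(model)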

R: variable has different number of levels in the node and in the data

I want to use bnlearn for a classification task with Naive Bayes algorithm.
I use this data set for my tests, where 3 variables are continuous (V2, V4, V10) and the others are discrete. As far as I know bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized. For now I want to convert all the features into factors. However, I ran into some problems. Here is a sample of the code:
dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)
# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted , testSet)
...
For this code I get an error message while calling predict()
'V1' has different number of levels in the node and in the data.
And when I remove V1 from the training set, I get the same error for the V2 variable. However, the error disappears when I factorize the whole data set first, dataSet[] <- lapply(dataSet, as.factor), and only then split it into training and test sets.
So what is the elegant solution for this? In real-world applications test and train sets can come from different sources. Any ideas?
The issue appears to be caused by the fact that my train and test data sets had different factor levels. I solved it by using rbind to combine the two data frames (train and test), applying as.factor to get the full set of levels for the complete data set, and then slicing the factorized data frame back into separate train and test sets.
train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted , test)
I hope this can be helpful.
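An alternative that avoids re-binding the data (a sketch only, assuming the same V1..V25 layout): build each test factor with exactly the levels observed in training.
train[] <- lapply(train, as.factor)
# give each test column the training levels; test values never seen in training become NA
test[] <- Map(function(te, tr) factor(te, levels = levels(tr)), test, train)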

H2O R How to handle levels not trained on when predicting?

I have a set of CSVs with a result column to train, and a set of test CSVs without the result column.
library(h2o)
h2o.init()
train <- read.csv(train_file, header=T)
train.h2o <- as.h2o(train)
y <- "Result"
x <- setdiff(names(train.h2o), y)
model <- h2o.deeplearning(x = x,
                          y = y,
                          training_frame = train.h2o,
                          model_id = "my_model",
                          epochs = 5000,
                          hidden = c(50),
                          stopping_rounds = 5,
                          stopping_metric = "misclassification",
                          stopping_tolerance = 0.001,
                          seed = 1)
test <- read.csv(test_file, header=T)
test.h2o <- as.h2o(test)
pred <- h2o.predict(model,test.h2o)
When I try to predict the outcome with test data, I get a bunch of errors like:
1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'ColumnName' has levels not trained on: [ABCD, BCDE]
I found some posts online saying H2O is able to handle levels present in the test data that were not seen during training, but it is not working for me.
How can I avoid these errors, and predict a value for the test data?
There are two methods you can try:
Use factors as opposed to characters
Before feeding data into the machine learning function, you can combine your train and test data and convert the character variables to factors.
That way every unique value is recorded as a level, even if you split the combined data again later.
library(h2o)
h2o.init()
#using dummy data as combined training and testing data
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
#assuming GLEASON is the character variable, and transform it to factor
prostate.hex$GLEASON <- h2o.asfactor(prostate.hex$GLEASON)
#split data such that 0,4,5,8 only in test set, and not in train set.
h2o.test <- prostate.hex[prostate.hex$GLEASON %in% c("0","4","5","8"),]
h2o.train <- prostate.hex[!prostate.hex$GLEASON %in% c("0","4","5","8"),]
#train model
model <- h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS","GLEASON"), training_frame = h2o.train,
family = "binomial", nfolds = 0)
#predict without error
pred <- predict(model,h2o.test)
Use one-hot encoding explicitly
I know that h2o's machine learning functions provide internal encoding methods (via the categorical_encoding parameter), including one-hot encoding, which turns a categorical variable into many 1/0 indicator variables.
Instead of relying on this implicitly, you can apply it explicitly. Levels that do not exist in training then contribute no columns to the model, and new levels in the test data simply encode as all zeros and are not used for prediction.
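A minimal sketch of the explicit approach (assuming train and test are plain data frames with a categorical column, here reusing the GLEASON name from the example above, before converting to H2OFrames):
# one-hot encode a column using ONLY the levels observed in training
encode_onehot <- function(df, col, lvls) {
  f <- factor(as.character(df[[col]]), levels = lvls)   # unseen values become NA
  m <- sapply(lvls, function(l) as.integer(!is.na(f) & f == l))
  colnames(m) <- paste(col, lvls, sep = "_")
  cbind(df[setdiff(names(df), col)], as.data.frame(m))
}
train_levels <- unique(as.character(train$GLEASON))
train_enc <- encode_onehot(train, "GLEASON", train_levels)
test_enc  <- encode_onehot(test,  "GLEASON", train_levels)  # new test levels -> all-zero row
train.h2o <- as.h2o(train_enc)
test.h2o  <- as.h2o(test_enc)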

QDA | lengths of training and test data sets | How to split data into training and test sets?

In QDA (Quadratic Discriminant Analysis), do I need to keep the lengths of the training and test data exactly the same? If not, how do you build a confusion matrix in such cases?
Here's pseudo data.
Because if I keep the training and test data sets at different lengths, it gives an error (using RStudio):
"Error in table(pred, true) : all arguments must have the same length"
I tried removing NAs using na.omit() on both data sets as well as on pred and true, and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set exactly in half, half as training and half as test, it worked perfectly after na.omit() on pred and true.
The following is the code used for both approaches. In approach 2, with the data split into equal halves, it worked fine.
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
I want to know if there is any method by which we can get a confusion matrix when the data set is unequally divided into training and test sets.
Thanks!
Are you plugging in your training inputs instead of your test set input data to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting on the wrong data. In approach 1 the formula train_vif$Awareness ~ train_vif$Zone hard-codes the training columns, so predict() cannot find those names in test_vif and falls back to the training rows, which is likely why pred has length 399 instead of 47. There is no reason you shouldn't be able to get a confusion matrix using test data of a different size than your training data.
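A sketch of the usual fix (column and object names taken from the question; treat the NA handling as an assumption): fit with a plain formula and a data argument, then predict with newdata, and the confusion matrix works regardless of the train/test size ratio.
library(MASS)
# refer to columns by name so predict() uses newdata correctly
zone_qda  <- qda(Awareness ~ Zone, data = train_vif, na.action = na.omit)
zone_pred <- predict(zone_qda, newdata = test_vif)
# one prediction per test row, so the lengths match even with unequal splits
zone_aware <- table(pred = zone_pred$class, true = test_vif$Awareness)
accur <- mean(zone_pred$class == test_vif$Awareness, na.rm = TRUE)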

randomForest Predict error from test set

I am running into an error with the R package randomForest: after I split the data into training and testing sets using caret, I get this error when I try to predict:
Error in predict.randomForest(randomForestFit, type = "response", newdata = testing$GEN) :
  number of variables in newdata does not match that in the training data
I split the train and test sets from the exact same file. There are no NA or missing values in any of the data. Below is my full code, but I do not think there is an error there. I am at a loss as to why this error is occurring. Any ideas would be greatly appreciated!
library(caret)
require(foreign)
set.seed(825)
data <- read.spss("C:/MODEL_SAMPLE.sav",use.value.labels=TRUE, to.data.frame = TRUE)
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]
library(randomForest)
library(foreach)
library(doParallel)
cl <- makeCluster(8)   # parallel backend for %dopar%, stopped with stopCluster(cl) below
registerDoParallel(cl)
start.time <- Sys.time()
randomForestFit <- foreach(ntree = rep(63, 8), .combine = combine,
                           .packages = 'randomForest') %dopar%
  randomForest(training[-201],
training$GEN,
mtry = 40,
ntree=ntree,
verbose = TRUE,
importance = TRUE,
keep.forest=TRUE,
do.trace = TRUE)
randomForestFit
predict = predict(randomForestFit, type="response", newdata=testing$GEN)
stopCluster(cl)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Without the data, it's hard for anyone to say what the problem is exactly.
Three suggestions:
First, check the SPSS file for stray characters in the data.
Second, check that the options to read.spss are set correctly, especially reencode and use.missings (the default is use.missings = to.data.frame); the latter controls whether values flagged as missing in SPSS are converted to NA.
Third, use str(df) and summary(df), and make sure your factor variables, including the response, are actually factors. Apply as.numeric(as.character()) to numeric data in the data frame; this will generate NA values wherever there are expressions like VALUE! or #NA (see the short sketch after these suggestions).
You could also export to csv from SPSS and do the above again.
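A small illustrative sketch of that third check (the AGE column name is just a placeholder):
str(data)
summary(data)
# coerce a column that should be numeric; junk entries such as "VALUE!" or "#NA" become NA
age_num <- as.numeric(as.character(data$AGE))
sum(is.na(age_num))   # how many values failed to convert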
The key is the following
:number of variables in newdata does not match that in the training data
I therefore guess that the training and test data are different, in particular the column names. Maybe it breaks at this line?
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
To better understand the problem, you might have to post 3 rows of the training and test data set (with column names!).
I hope this helps!
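One more hedged guess based on the code shown: predict.randomForest expects newdata to be a data frame containing the same predictor columns used for training, but the call above passes only the single vector testing$GEN. A minimal sketch of the corrected prediction step (column 201 assumed to be GEN, as in the training call):
# pass the full test frame minus the response so the predictors match training
prediction <- predict(randomForestFit, newdata = testing[-201], type = "response")
# compare predictions with the true labels
table(predicted = prediction, actual = testing$GEN)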
