Evaluating Weka classifier J48 with missing values in the test set (R, RWeka)

I get an error when evaluating a simple test set with evaluate_Weka_classifier. I'm trying to learn how the R-to-Weka interface works with RWeka, but I still don't get this.
library("RWeka")
iris_input <- iris[1:140,]
iris_test <- iris[-(1:140),]
iris_fit <- J48(Species ~ ., data = iris_input)
evaluate_Weka_classifier(iris_fit, newdata = iris_test, numFolds=5)
No problems here, as we would expect (it is of course a contrived test: no random holdout data, etc.). But now I want to simulate missing data (a lot of it), so I set Petal.Width to missing:
iris_test$Petal.Width <- NA
evaluate_Weka_classifier(iris_fit, newdata = iris_test, numFolds=5)
Which gives the error:
Error in .jcall(evaluation, "S", "toSummaryString", complexity) :
java.lang.IllegalArgumentException: Can't have more folds than instances!
Edit: This error suggests that I don't have enough instances, but I have 10.
Edit: If I use write.arff, the test set can be exported and read into Weka. After changing Petal.Width {} to Petal.Width numeric, so that the two files' headers match exactly, the evaluation works in Weka.
Is this a thinking error? From reading Data Mining: Practical Machine Learning Tools and Techniques, evaluating with missing values seems legitimate. Maybe I just have to tell RWeka that I want to use fractional instances when a split tests a missing attribute?
Thanks!

The issue is that you need to tell J48() what to do with missing values.
library(RWeka)
?J48
# pertinent output
J48(formula, data, subset, na.action,
    control = Weka_control(), options = NULL)
na.action tells R what to do with missing values. If you follow up on na.action, you will find that "The ‘factory-fresh’ default is na.omit". Under that setting every row containing an NA is dropped, so your 10 test rows (each with Petal.Width missing) reduce to 0 instances, which is fewer than the 5 folds, hence the error.
Instead of leaving na.action at the default na.omit, I changed it as follows,
iris_fit <- J48(Species ~ ., data = iris_input, na.action = NULL)
and it works like a charm!
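Putting it together, a minimal end-to-end sketch of the fix, using the same split as in the question:
library(RWeka)
iris_input <- iris[1:140, ]
iris_test <- iris[-(1:140), ]
iris_test$Petal.Width <- NA    # simulate the missing values

# Keep NA rows by disabling R's na.omit preprocessing; J48 itself can
# handle missing attribute values (fractional instances at each split).
iris_fit <- J48(Species ~ ., data = iris_input, na.action = NULL)
evaluate_Weka_classifier(iris_fit, newdata = iris_test, numFolds = 5)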

Related

R implementation of kohonen SOMs: prediction error due to data type.

I have been trying to run the example code for supervised Kohonen SOMs from https://clarkdatalabs.github.io/soms/SOM_NBA. When I tried to predict test set data I got the following error:
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing)
Error in FUN(X[[i]], ...) :
Data type not allowed: should be a matrix or a factor
I tried newdata = as.matrix(NBA.testing) but it did not help. Neither did as.factor().
Why does it happen? And how can I fix that?
You should pass one more argument to the predict function, whatmap, and set its value to 1. whatmap tells predict() which data layer of the supervised map the newdata corresponds to: layer 1 holds the player statistics used as predictors, layer 2 the positions.
The code would be like:
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing, whatmap = 1)
To verify the prediction result, you can check it against the true positions using:
table(NBA$Pos[-training_indices], pos.prediction$predictions[[2]], useNA = 'always')
The result may differ from the tutorial's, since the tutorial does not call the set.seed() function. I suggest calling set.seed() with an arbitrary number somewhere before the training phase.
For simplicity, put it once at the very top of your script, e.g.
set.seed(12345)
This guarantees a reproducible result from your model the next time you re-run your script.
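For instance, a minimal sketch combining both suggestions (the object names come from the tutorial and are assumed to be defined as there):
set.seed(12345)    # fix the RNG before any training code runs
# ... build NBA.SOM3 exactly as in the tutorial ...
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing, whatmap = 1)
table(NBA$Pos[-training_indices],
      pos.prediction$predictions[[2]], useNA = 'always')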
Hope that helps.

Error in Bagging with party::cforest

I'm trying to bag conditional inference trees following the advice of Kuhn et al. in 'Applied Predictive Modeling', Ch. 8:
Conditional inference trees can also be bagged using the cforest function in the party package if the argument mtry is equal to the number of predictors:
library(party)
# The mtry parameter should be the number of predictors (the
# number of columns minus 1 for the outcome).
bagCtrl <- cforest_control(mtry = ncol(trainData) - 1)
baggedTree <- cforest(y ~ ., data = trainData, controls = bagCtrl)
Note there may be a typo in the above code (and also in the package's help file), as discussed here:
R package 'partykit' unused argument in ctree_control
However, when I try to replicate this code using a data frame (trainData in the code above is also a data frame) with more than one independent/predictor variable, I get an error, although it works for just one independent variable.
Some dummy code to reproduce it:
library(party)
df = data.frame(y = runif(5000), x = runif(5000), z = runif(5000))
bagCtrl <- cforest_control(mtry = ncol(df) - 1)
baggedTree_cforest <- cforest(y ~ ., data = df, control = bagCtrl)
The error message is:
Error: $ operator not defined for this S4 class
Thanks for any help.
As suggested, posting my comment from above as an answer, since it is a general R 'trick' when something that should work doesn't and several libraries are loaded:
what solved it was adding the party namespace explicitly to the function call, so party::cforest() instead of just cforest(). I've also got library(partykit) loaded in my actual program, which has a cforest() function too, and the error could be stemming from there, though both functions are essentially the same.
caret::train() is another example where this often pops up.
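A minimal sketch of the fix applied to the dummy code above (the namespace qualification is the key change; controls is the spelling of party's argument):
library(party)
library(partykit)    # both packages export cforest(), so an unqualified call can resolve to the wrong one

df <- data.frame(y = runif(5000), x = runif(5000), z = runif(5000))
bagCtrl <- party::cforest_control(mtry = ncol(df) - 1)

# Qualify the call so party's cforest() is used rather than partykit's:
baggedTree_cforest <- party::cforest(y ~ ., data = df, controls = bagCtrl)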

H2O: Deep learning object not found in function 'predict' for argument 'model'

I'm just testing out h2o, in particular its deep learning capabilities, since I've heard great things about it. So far I've been using the following code:
library(h2o)
library(caret)
data("iris")
# Initiate H2O --------------------
h2o.removeAll() # Clean up. Just in case H2O was already running
h2o.init(nthreads = -1, max_mem_size="22G") # Start an H2O cluster with all threads available
# Get training and tournament data -------------------
a <- createDataPartition(iris$Species, list=FALSE)
training <- iris[a,]
test <- iris[-a,]
# Convert target to factor -------------------
target <- as.factor(iris$Species)
feature_names <- names(train)[1:(ncol(train)-1)]
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
prob <- test[, "id", drop = FALSE]
model_dl <- h2o.deeplearning(x = feature_names, y = "target", training_frame = train_h2o, stopping_metric = "logloss")
h2o.logloss(model_dl)
pred_dl <- predict(model_dl, newdata = tourn_h2o)
prob <- cbind(prob, as.data.frame(pred_dl$p1, col.names = "dl"))
write.table(prob[, c("id", "dl")], paste0(model_dl@model_id, ".csv"), sep = ",", row.names = FALSE, col.names = c("id", "probability"))
The relevant part is really that last line, where I got the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Object 'DeepLearning_model_R_1494350691427_70' not found in function: predict for argument: model
Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.
EDIT: With the updated code I get the error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for DeepLearning model: DeepLearning_model_R_1494428751150_1. Details: ERRR on field: _train: Training data must have at least 2 features (incl. response).
ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.
I assume this has to do with the way the Iris dataset is being read in.
Answer to the first question: your original error message sounds like one you can get when things get out of sync. E.g. maybe you had two sessions running at once, and removed the model in one session; the other session wouldn't know its variables are now out of date. H2O allows multiple connections, but they have to be co-operative. (Flow - see next paragraph - counts as a second session.)
Unless you can make a reproducible example, shrug, put it down to gremlins, and start a new session. Or go and look at the data/models in Flow (a web server always running on 127.0.0.1:54321) and see if something is no longer there.
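As a quick check from R itself, h2o.ls() lists the keys currently held by the cluster, so you can see whether the model id is still there:
library(h2o)
h2o.init()    # connects to the already-running cluster, if any
h2o.ls()      # look for your model id, e.g. DeepLearning_model_R_...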
For your EDIT question: your model is a regression model, but you are trying to use logloss, so you thought you were doing classification. This is caused by not having set the target variable to be a factor. Your current as.factor() line is on the wrong data, in the wrong place. It should go after your as.h2o() lines:
train_h2o <- as.h2o(training) #Typo fix
test_h2o <- as.h2o(test)
feature_names <- names(training)[1:(ncol(training)-1)] #typo fix
y = "Species" #The column we want to predict
train_h2o[,y] <- as.factor(train_h2o[,y])
test_h2o[,y] <- as.factor(test_h2o[,y])
And then make the model with:
model_dl <- h2o.deeplearning(x = feature_names, y = y, training_frame = train_h2o, stopping_metric = "logloss")
Get predictions with:
pred_dl <- predict(model_dl, newdata = test_h2o) #Typo fix
And compare the predictions with the correct answers using:
cbind(test[, y], as.data.frame(pred_dl$predict))
(BTW, H2O always detects the Iris data set columns as numeric vs. factor perfectly, so the above as.factor() lines are not needed; your error message must've been on your original data.)
StackOverflow advice: test your reproducible example, in full, and copy and paste in that exact code, with the exact error message that code gives you. Your code had numerous small typos, e.g. train in places, training in others. createDataPartition() was not given; I assumed a = sample(nrow(iris), 0.8*nrow(iris)). test has no "id" column.
Other H2O advice:
Run h2o.removeAll() after h2o.init(). It was giving you an error message if run before. (Personally I avoid that function - it is the kind of thing that gets left in a production script by mistake...)
Consider importing your data into h2o earlier, and using h2o.splitFrame() to split it. I.e. avoid doing things in R that H2O can easily handle.
Avoid having your data in R, at all, if you can. Prefer importFile() over as.h2o().
The thinking behind the last two points is that H2O will scale beyond the memory of one machine, while R won't. It is also less confusing than trying to keep track of the same thing in two places.
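A minimal sketch of that flow (the file name is hypothetical; any importFile-readable source works):
library(h2o)
h2o.init(nthreads = -1)
h2o.removeAll()                               # if used at all, call it after h2o.init()
data_h2o <- h2o.importFile("my_data.csv")     # hypothetical path; prefer this over as.h2o()
splits <- h2o.splitFrame(data_h2o, ratios = 0.8, seed = 2382)
train_h2o <- splits[[1]]
test_h2o <- splits[[2]]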
I had the same issue but could resolve it quite easily.
My error occurred because I read in an H2O object before initialising the H2O cluster. I had trained an H2O model, saved it, shut down the cluster, then loaded the model back in and only afterwards initialized the cluster again.
Before reading in the H2O object, you should already have initialized the cluster with h2o.init().
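In code, the working order is (the path is hypothetical):
library(h2o)
h2o.init()                                        # 1. start/connect to the cluster first
model <- h2o.loadModel("/path/to/saved_model")    # 2. only then load the saved model back in
# predict(), h2o.logloss() etc. will now find the model in the cluster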

Error with gamsel R Package

I'm trying to use the gamsel R package to fit a sparse generalized additive model, and I can't seem to get it to work on real data. When I run it on synthetic data as described in the package documentation, everything works well:
library(gamsel)
data=gendata(n=500,p=12,k.lin=3,k.nonlin=3,deg=8,sigma=0.5)
attach(data)
bases=pseudo.bases(X,degree=10,df=6)
gamsel.out=gamsel(X,y,bases=bases)
But when I run on real data, I get the following error:
library(gamsel)
X = as.matrix(read.csv("X.csv"),header=FALSE)
y = as.matrix(read.csv("y.csv"),header=FALSE)
gam_fit = gamsel(X,y)
Error in if (abs((df - current.df)/df) < 1e-04 | iterations == 1) return(list(lambda = lambda, :
  missing value where TRUE/FALSE needed
You can access sample data files that will reproduce this result here. Any thoughts about how to fix this error?

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two inputs to confusionMatrix disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the predict call inside confusionMatrix. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method = "glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up modelFit, using the formula method. In that case you are specifying y = training2$hold1yes0no and x = everything else in trainPC2. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
This specifies y = training2$hold1yes0no and x = trainPC2.
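As a hedged sketch, the x/y version with a length check before building the confusion matrix (variable names as in the question; note that trainTR2 was built from testing2):
modelFit <- train(x = trainPC2, y = training2$hold1yes0no, method = "glm")
preds <- predict(modelFit, newdata = trainTR2)
length(preds) == nrow(testing2)   # if FALSE, rows were dropped (e.g. NAs) before scoring
confusionMatrix(data = preds, reference = testing2$hold1yes0no)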
