randomForest Predict error from test set - r

I am running into an error with the R package randomForest: after I split the data into training and testing sets using caret, predict fails with this error:
Error in predict.randomForest(randomForestFit, type = "response", newdata =testing$GEN)
:number of variables in newdata does not match that in the training data
I split the file between train and test from the exact same file. There are no N/A or missing values in any of the data. Below is my full code, but I do not think there is an error there. I am at a loss as to why this error is occurring. Any ideas would be greatly appreciated!
library(caret)
require(foreign)
set.seed(825)
data <- read.spss("C:/MODEL_SAMPLE.sav",use.value.labels=TRUE, to.data.frame = TRUE)
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]
library(randomForest)
library(foreach)
start.time <- Sys.time()
randomForestFit <- foreach(ntree=rep(63, 8), .combine=combine, .packages='randomForest') %dopar%
randomForest(training[-201],
training$GEN,
mtry = 40,
ntree=ntree,
verbose = TRUE,
importance = TRUE,
keep.forest=TRUE,
do.trace = TRUE)
randomForestFit
predict = predict(randomForestFit, type="response", newdata=testing$GEN)
stopCluster(cl)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Without the data, it's hard for anyone to say exactly what the problem is.
Three suggestions:
First, check the SPSS file for stray characters in the data.
Second, check that the read.spss options are set correctly, especially reencode = NA and use.missings = to.data.frame (their defaults). The latter controls whether SPSS user-defined missing values are converted to NA.
Third, use str(df) and summary(df) (or table(x, useNA = "ifany") for a single column) and make sure your factor variables, including the response, are actually factors. Apply as.numeric(as.character()) to numeric columns in the data frame; this will generate NA values if there are entries like VALUE! or #NA in the data.
You could also export to csv from SPSS and do the above again.
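A minimal sketch of the third check, on a made-up data frame (the `score` column and its stray values are assumptions for illustration; only `GEN` comes from the question):

```r
# Hypothetical data frame: 'score' should be numeric but contains stray text.
df <- data.frame(
  GEN   = c("a", "b", "a", "b"),
  score = c("1.5", "2.0", "#NA", "VALUE!"),
  stringsAsFactors = FALSE
)

str(df)  # a column that should be numeric but is not is the warning sign

# Make the response a factor so the model treats the task as classification.
df$GEN <- as.factor(df$GEN)

# Coerce to numeric; entries that are not numbers become NA.
score_num <- suppressWarnings(as.numeric(as.character(df$score)))

# Rows that failed to convert and need cleaning at the source.
bad_rows <- which(is.na(score_num))
print(df[bad_rows, ])
```

Running the same coercion on each supposedly numeric column of the real data would show which rows need attention.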

The key is the following
:number of variables in newdata does not match that in the training data
I therefore guess that the training and test data are different, in particular the column names. Maybe it breaks at this line?
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
To better understand the problem, you might have to post 3 rows of the training and test data set (with column names!).
I hope this helps!
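One thing visible in the posted code itself: the forest is trained on training[-201] (all predictor columns), while predict() is given newdata = testing$GEN, a single vector holding the response. That alone would produce this message. The sketch below uses iris as a stand-in, since the original data is not available, and passes the same predictor columns at prediction time:

```r
library(randomForest)

set.seed(825)
# iris stands in for the SPSS data; Species (column 5) plays the role of GEN.
idx      <- sample(nrow(iris), size = round(0.75 * nrow(iris)))
training <- iris[idx, ]
testing  <- iris[-idx, ]

# Train on every column except the response.
fit <- randomForest(x = training[-5], y = training$Species, ntree = 50)

# newdata must contain the same predictor columns used for training,
# not the response vector.
pred <- predict(fit, newdata = testing[-5], type = "response")

# Compare predictions with the held-out response.
print(table(predicted = pred, actual = testing$Species))
```

In the question's code the equivalent call would be predict(randomForestFit, newdata = testing[-201]), assuming column 201 is GEN.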

Related

Error : 'data' must be a data.frame, environment, or list

#define training and testing sets
set.seed(555)
train <- df2[1:800, c("charges")]
y_test <- df2[801:nrow(df2), c("charges")]
test <- df2[801:nrow(df2), c("age","bmi","children","smoker")]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))
I don't know why I get this error. I have already tried a number of things but am still stuck here.
When you executed:
train <- df2[1:800, c("charges")]
You created an R atomic vector, not a data frame: selecting a single column with [ drops the result to a vector. The class of the result would not be a list unless you also added the drop=FALSE parameter:
train <- df2[1:800, c("charges"), drop=FALSE]
That should fix that error although the lack of any data prevents any of us from determining whether further errors might arise. Actually, I'm pretty sure you did not want that train object to be just a single column since your model obviously expected other columns. Try this instead:
set.seed(555)
train <- df2[1:800, ]
test <- df2[801:nrow(df2), ]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - test$charges)^2))
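The difference that drop makes can be checked on a small made-up data frame (the values below are invented; only the column names follow the question):

```r
# Toy stand-in for df2; values are made up for illustration.
df2 <- data.frame(
  age     = c(19, 33, 45, 52),
  bmi     = c(27.9, 22.7, 30.1, 25.3),
  charges = c(1680, 2100, 8200, 9900)
)

# Selecting a single column drops the result to a plain vector.
as_vector <- df2[1:3, c("charges")]

# With drop = FALSE the result stays a one-column data frame.
as_frame <- df2[1:3, c("charges"), drop = FALSE]

print(class(as_vector))
print(class(as_frame))
print(is.data.frame(as_vector))
```

Only the second form can be passed as the data argument of a modelling function, and even then the model would still need the predictor columns.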

KNN function in R producing NA/NaN/Inf in foreign function call (arg 6) error

I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting: NA/NaN/Inf in foreign function call (arg 6)
I checked and found that training_data, testing_data, and outcome all look correct, the issue seems to only be with the knn model.
The issue is with your data and the knn function you are using: it can't handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
mutate(Seeded = as.numeric(as.factor(Seeded))-1) %>%
mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season is not ordered naturally. A better approach would be to instead treat it as a set of dummies.
See this link for examples:
R - convert from categorical to numeric for KNN
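A rough sketch of the dummy-variable approach with base R's model.matrix and class::knn. The data frame is invented; only the column names Seeded and Season come from the answer above, and Rain is a hypothetical numeric predictor.

```r
library(class)

set.seed(2)
seasons <- c("AUTUMN", "SPRING", "SUMMER", "WINTER")
# Invented data: Seeded is the outcome, Season a categorical predictor,
# Rain a hypothetical numeric predictor.
data <- data.frame(
  Seeded = factor(sample(c("S", "U"), 40, replace = TRUE), levels = c("S", "U")),
  Season = factor(sample(seasons, 40, replace = TRUE), levels = seasons),
  Rain   = runif(40, 0, 10)
)

# One 0/1 indicator column per Season level; "- 1" removes the intercept
# so that no level is absorbed into a baseline.
X <- model.matrix(~ Season + Rain - 1, data = data)

part  <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
train <- X[part == 1, , drop = FALSE]
test  <- X[part == 2, , drop = FALSE]

# The outcome goes in cl only; it is not part of the feature matrices.
model <- knn(train = train, test = test, cl = data$Seeded[part == 1], k = 3)
print(table(predicted = model, actual = data$Seeded[part == 2]))
```

Because knn is distance-based, numeric predictors such as Rain would normally be scaled before being combined with 0/1 indicators.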

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not appropriate to ask here, but please forgive a newbie.
I'm training a random forest multi-class (8 classes) classifier with R caret on my experimental data, using my desktop with 32GB RAM and a 4-core CPU. However, RStudio constantly complains that it cannot allocate a vector of 9GB, so I have to reduce the training set all the way down to 1% of the data just to run k-fold CV and some grid search. As a result my model accuracy is ~50% and the selected features aren't very good at all. Only 2 out of 8 classes are being distinguished somewhat reliably. Of course it could be that I don't have any good features, but I want to at least train and tune my model on a decent-sized training set first. What solutions could help? Is there anywhere I can upload my data and train elsewhere? I'm so new that I don't know whether something cloud-based could help me. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, in case it is my bad coding that screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#As I want to build a RF model to classify drug treatments
# make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero-variance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test
#Here is my problem, I want to use 80% of the data for training
#but in my computer, I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used ,mtry 1:12 to run, but I wanted to test more, limited again by machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(training_set[,-1],as.factor(training_set[,1]), data=training_set, method = "rf", trControl = control,metric= "Accuracy", maximize = TRUE ,importance = TRUE, type="classification", ntree =800,tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")

R: variable has different number of levels in the node and in the data

I want to use bnlearn for a classification task with Naive Bayes algorithm.
I use this data set for my tests, where 3 variables are continuous (V2, V4, V10) and the others are discrete. As far as I know, bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized. For now I want to convert all the features into factors. However, I came across some problems. Here is some sample code:
dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)
# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted , testSet)
...
For this code I get an error message while calling predict()
'V1' has different number of levels in the node and in the data.
And when I remove V1 from the training set, I get the same error for the V2 variable. However, the error disappears when I factorize first, dataSet[] <- lapply(dataSet, as.factor), and only then split it into training and test sets.
So what is an elegant solution for this? In real-world applications the test and train sets can come from different sources. Any ideas?
The issue appears to be caused by the fact that my train and test datasets had different factor levels. I solved this issue by using the rbind command to combine the two different dataframes (train and test), applying as.factor to get the full set of factors for the complete dataset, and then slicing the factorized dataframe back into separate train and test datasets.
train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted , test)
I hope this can be helpful.
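When the two sets really do come from different sources, an alternative to rbind is to fix the test set's factor levels to those seen in training. The following is a sketch of that swapped-in approach on invented data, not the method from the answer above; values in the test set that never occurred in training become NA and need separate handling.

```r
# Invented train/test frames whose columns have different observed values.
train <- data.frame(V1 = c("A11", "A12", "A14"), V25 = c("1", "2", "1"))
test  <- data.frame(V1 = c("A12", "A12", "A13"), V25 = c("2", "1", "1"))

# Factorize the training set first.
train[] <- lapply(train, as.factor)

# Give each test column exactly the levels of the matching training column.
test[] <- Map(function(col, ref) factor(col, levels = levels(ref)),
              test, train[names(test)])

print(levels(test$V1))
print(identical(levels(test$V1), levels(train$V1)))
print(which(is.na(test$V1)))  # test values unseen in training
```

This keeps the training data independent of whatever appears in the test set, at the cost of having to decide what to do with the resulting NA entries.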

MXnet odd error

This is my first ANN so I imagine that there might be a lot of things done wrong here. I don't follow
I'm trying to predict species of flowers from iris data set provided in R language but I get following error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
invalid 'dimnames' given for data frame
My code:
require(mxnet)
train <- iris[1:130,]
test <- iris[131:150,]
train.data <- as.data.frame(train[-5])
train.label <- data.frame(model.matrix(data=train,object =~Species-1))
test.data <- as.data.frame(test[-5])
test.label <- data.frame(model.matrix(data=test,object =~Species-1))
var1 <- mx.symbol.Variable("data")
layer0 <- mx.symbol.FullyConnected(var1, num.hidden=3)
cat.out <- mx.symbol.SoftmaxOutput(layer0)
net.model <- mx.model.FeedForward.create(cat.out,
array.layout = "auto",
X=train.data,
y=train.label,
eval.data = list(data=test.data,label=test.label),
num.round = 20,
array.batch.size = 20,
learning.rate=0.1,
momentum=0.9,
eval.metric = mx.metric.accuracy)
UPDATE:
I managed to get rid of this error by specifying which column to use in the labels (train.label[,1] and test.label[,1]).
However, now I'm training my net to predict just one of my binary variables, while I have 3 (one for each species).
I had the same problem, turned out that:
train.data should be a matrix
train.label should be a numeric vector
Check these two and hopefully it should work.
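Applied to the question's iris split, that conversion might look like the sketch below. It only prepares the inputs with base R and does not call mxnet; the zero-based class coding is an assumption about what the softmax output expects.

```r
train <- iris[1:130, ]
test  <- iris[131:150, ]

# Features as numeric matrices rather than data frames.
train.data <- data.matrix(train[-5])
test.data  <- data.matrix(test[-5])

# One numeric label vector covering all three species, coded 0, 1, 2
# (assumption: class indices start at 0), instead of three dummy columns.
train.label <- as.numeric(train$Species) - 1
test.label  <- as.numeric(test$Species) - 1

str(train.data)
print(table(train.label))
```

A single label vector of this kind also covers the case raised in the update, where picking one dummy column meant predicting only one species.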
I had a similar problem but during the prediction step. It turns out that my features were in a Data Frame which was causing the issue. Once I converted the data frame into a matrix, the issue went away.
pred.values = stats::predict(model,as.matrix(features))
instead of
pred.values = stats::predict(model,features)
So, the features need to be a matrix both during training and during the process of making predictions.
