I am trying to run a confusion matrix in R for my decision tree model but get the following error:
"Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length"
I don't understand why it won't run.
dtree_test <- rpart(writeoff ~ education+employ_status+residential_status+loan_amount+loan_length+
net_income,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
plotcp(dtree_test)
dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$writeoff, dtree_test.pred,
dnn=c("Actual", "Predicted"))
dtree_test.perf
confusionMatrix(predict(dtree_test.pruned, testnew, type="class"),train$writeoff)
The last line is:
confusionMatrix(predict(dtree_test.pruned, testnew, type="class"),train$writeoff)
which makes predictions for the dataset testnew but compares them with the response from the dataset train.
Also, in rpart(...) you have data=testnew, but perhaps you meant to use your training data to fit the model?
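A minimal sketch of the corrected workflow, assuming your training data frame is called train (as in your confusionMatrix call) and has the same columns as testnew; the cp value is taken from your question, but in practice you would read it off the cptable of the model fit on train:
library(rpart)
library(caret)

# Fit and prune on the training data, then evaluate on the test data
dtree <- rpart(writeoff ~ education + employ_status + residential_status +
                 loan_amount + loan_length + net_income,
               method = "class", data = train,
               parms = list(split = "information"))
dtree.pruned <- prune(dtree, cp = 0.01639344)  # take cp from dtree$cptable / plotcp() for this fit

# Predict on testnew and compare against testnew's own labels
# (writeoff should be a factor with the same levels as the predictions)
pred <- predict(dtree.pruned, testnew, type = "class")
confusionMatrix(pred, testnew$writeoff)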
I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
library(class)  # provides knn()

data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting:
I checked, and training_data, testing_data, and outcome all look correct; the issue seems to be with the knn call itself.
The issue is with your data and the knn function you are using; it can't handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
mutate(Seeded = as.numeric(as.factor(Seeded))-1) %>%
mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season has no natural ordering. A better approach would be to treat it as a set of dummy variables instead (see the sketch after the link below).
See this link for examples:
R - convert from categorical to numeric for KNN
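For instance, here is a minimal sketch of the dummy-variable approach, reusing the part split from above and assuming the outcome is still the second column, as in your code:
library(class)  # provides knn()

# Convert character columns to factors so model.matrix() can expand them
data[] <- lapply(data, function(x) if (is.character(x)) factor(x) else x)

# Keep the outcome separate and dummy-encode every predictor (drop the intercept column)
outcome    <- data[, 2]
predictors <- model.matrix(~ ., data = data[, -2])[, -1]

model <- knn(train = predictors[part == 1, ],
             test  = predictors[part == 2, ],
             cl    = outcome[part == 1],
             k     = 10)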
I am trying to solve the Digit Recognizer competition on Kaggle and I ran into this error.
I loaded the training data and scaled its values by dividing by the maximum pixel value, which is 255. After that, I am trying to build my model.
Here goes my code:
library(e1071)  # provides svm()

Given_Training_data <- get(load("Given_Training_data.RData"))
Given_Testing_data <- get(load("Given_Testing_data.RData"))
Maximum_Pixel_value = max(Given_Training_data)
Tot_Col_Train_data = ncol(Given_Training_data)
training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)]/Maximum_Pixel_value
testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)]/Maximum_Pixel_value
label_training_data <- Given_Training_data$label
final_training_data <- cbind(label_training_data, training_data_adjusted)
smp_size <- floor(0.75 * nrow(final_training_data))
set.seed(100)
training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
training_data1 <- final_training_data[training_ind, ]
train_no_label1 <- as.data.frame(training_data1[,-1])
train_label1 <-as.data.frame(training_data1[,1])
svm_model1 <- svm(train_label1,train_no_label1) #This line is throwing an error
Error : Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please kindly share your thoughts. I am not looking for an answer, but rather some idea that guides me in the right direction, as I am in a learning phase.
Thanks.
Update to the question:
trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
svm_model2 <- svm(trainlabel1,trainnolabel1,scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
svm(x, y = NULL, scale = TRUE, type = NULL, ...)
...
Arguments:
...
x: a data matrix, a vector, or a sparse matrix (an object of class Matrix provided by the Matrix package, of class matrix.csr provided by the SparseM package, or of class simple_triplet_matrix provided by the slam package).
y: a response vector with one label for each row/component of x. Can be either a factor (for classification tasks) or a numeric vector (for regression).
Therefore, the main problems are that your call to svm switches the data matrix and the response vector, and that you are passing the response vector as an integer, which results in a regression model. Furthermore, you are passing the response vector as a single-column data frame, which is not how svm expects it. Hence, if you change the call to:
svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (where the values in the respective column of the training data matrix are all identical) in the training data, since these will not influence the classification.
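A quick sketch of that, using the object names from above:
# Keep only columns that are not constant in the training data
keep <- sapply(train_no_label1, function(x) length(unique(x)) > 1)
train_no_label1 <- train_no_label1[, keep, drop = FALSE]

svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))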
I don't think you need to scale the data manually, since svm will do it by default, unlike most neural network packages.
You can also use the formula interface of svm instead of the matrix and vector, which is
svm(result~.,data = your_training_set)
In your case, I guess you want to make sure the result is treated as a factor, because you want a label like 1, 2, 3, not something like 1.5467, which would be regression.
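A minimal sketch of that; your_training_set, your_test_set, and the result column are placeholder names, not from your data:
library(e1071)

# Make sure the label is a factor so svm() does classification, not regression
your_training_set$result <- as.factor(your_training_set$result)
svm_model <- svm(result ~ ., data = your_training_set)
pred <- predict(svm_model, your_test_set)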
I can debug it if you can share the data: Given_Training_data.RData
I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research, it appears that this error occurs when there are missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read that missing values may be introduced during caret's re-sampling routine, which I suspect is what's happening.
In an attempt to solve problem 1, I tried pre-processing the data during re-sampling in caret by removing zero-variance and near-zero-variance predictors and imputing missing values with caret's knn imputation method, preProcess(c('zv','nzv','knnImpute')), but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say, I checked and confirmed that the input data sets are indeed matrices, so I don't understand why I get this second error.
The code follows:
library(caret)
library(dplyr)  # for select()

x.train <- predict(dummyVars(class ~ ., data = train.transformed),train.transformed)
y.train <- as.matrix(select(train.transformed,class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE,FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
number = 10,
repeats = 3,
search = 'random',
adaptive = list(min = 5, alpha = 0.05,
method = "gls", complete = TRUE),
allowParallel = TRUE)
fit.vbmp.01 <- train(
x = (x.train),
y = (y.train),
method = 'vbmpRadial',
trControl = adaptive_trctrl,
preProcess(c('zv','nzv','knnImpute')),
tuneGrid = vbmp.grid)
The only difference between the code for problems (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary,
- There are no missing values in the data
- Both x.train and y.train are definitely matrices
- I tried using a standard 'repeatedcv' method instead of 'adaptive_cv' in trainControl, with the same exact outcome
- Forgot to mention that the outcome class has 3 levels
Does anyone have any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr
I had the same problem with my data; after some digging I found that I had some Inf (infinite) values in one of the columns.
After filtering them out (df <- df %>% filter(!is.infinite(variable))), the computation ran without error.
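A quick way to see which columns are affected (a sketch, assuming your data is in a data frame called df):
# Count non-finite values (Inf, -Inf, NaN, NA) per numeric column
sapply(df, function(x) if (is.numeric(x)) sum(!is.finite(x)) else 0L)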
I am trying to create a plot for my model built with svm from the e1071 package.
My code to build the model, make predictions, and build the confusion matrix is:
library(e1071)  # svm()
library(caret)  # confusionMatrix()

ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions, and train.factor.list and test.factor.list hold the true labels for each set. The train and test sets are both DocumentTermMatrix objects.
Then I tried to plot my data with:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but I got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix looks fine to me, even without using the formula parameter in the svm function.
Without runnable code and data, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify the two dimensions to visualize in your plot call:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)
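(predictor1 and predictor2 are placeholders for two of your actual predictor names.) For example, a sketch on the built-in iris data, which has four predictors, so a formula is required:
library(e1071)

m <- svm(Species ~ ., data = iris)

# Pick two dimensions to visualize; the remaining predictors are held constant via 'slice'
plot(m, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))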
I want to use a naive Bayes classifier to make some predictions.
So far I can make the prediction with the following (sample) code in R:
library(klaR)
library(caret)
Faktor <- sample(LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
alter<-abs(rnorm(10000,30,5))
HF<-abs(rnorm(10000,1000,200))
Diffalq<-rnorm(10000)
Geschlecht<-sample(c("Mann","Frau", "Firma"),10000,replace=TRUE)
data<-data.frame(Faktor,alter,HF,Diffalq,Geschlecht)
set.seed(5678)
flds<-createFolds(data$Faktor, 10)
train<-data[-flds$Fold01 ,]
test<-data[flds$Fold01 ,]
features <- c("HF","alter","Diffalq", "Geschlecht")
formel<-as.formula(paste("Faktor ~ ", paste(features, collapse= "+")))
nb<-NaiveBayes(formel, train, usekernel=TRUE)
pred<-predict(nb,test)
test$Prognose<-as.factor(pred$class)
Now I want to improve this model by feature selection. My real data has about 100 features.
So my question is: what would be the best way to select the most important features for naive Bayes classification?
Is there any paper for reference?
I tried the following line of code, but unfortunately it did not work:
rfe(train[, 2:5],train[, 1], sizes=1:4,rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
EDIT: It gives me the following error message
Fehler in { : task 1 failed - "nicht-numerisches Argument für binären Operator"
Calls: rfe ... rfe.default -> nominalRfeWorkflow -> %op% -> <Anonymous>
Since this message is in German: it translates to "non-numeric argument to binary operator". You may want to reproduce this on your machine.
How can I adjust the rfe() call to get recursive feature elimination working?
This error appears to be due to the ldaFuncs. Apparently they do not like factors when using matrix input. The main problem can be re-created with your test data using
mm <- ldaFuncs$fit(train[2:5], train[,1])
ldaFuncs$pred(mm,train[2:5])
# Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) :
# non-numeric argument to binary operator
And this only seems to happen if you include the factor variable. For example
mm <- ldaFuncs$fit(train[2:4], train[,1])
ldaFuncs$pred(mm,train[2:4])
does not return the same error (and appears to work correctly). Again, this only appears to be a problem when you use the matrix syntax. If you use the formula/data syntax, you don't have the same problem. For example
mm <- ldaFuncs$fit(Faktor ~ alter + HF + Diffalq + Geschlecht, train)
ldaFuncs$pred(mm,train[2:5])
appears to work as expected. This means you have a few different options. Either you can use the rfe() formula syntax like
rfe(Faktor ~ alter + HF + Diffalq + Geschlecht, train, sizes=1:4,
rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
Or you could expand the dummy variables yourself with something like
train.ex <- cbind(train[,1], model.matrix(~.-Faktor, train)[,-1])
rfe(train.ex[, 2:6],train.ex[, 1], ...)
But this doesn't keep track of which dummy columns belong to the same factor, so it's not ideal.
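Filling in the settings from your original call, a sketch of that second option could look like this, keeping the outcome as a factor instead of binding it into the matrix; the same caveat about grouped dummies still applies:
library(caret)

# Dummy-expand the predictors and keep the outcome separate as a factor
x.ex <- model.matrix(~ . - Faktor, train)[, -1]
y    <- train$Faktor

rfe(x.ex, y, sizes = 1:4,
    rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))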