knncat error "Some classes have only one member" in R

I'm trying to run a KNN analysis on auto data with the knncat function from the knncat package. My training set has around 700,000 observations. The following happens when I run the analysis. I removed NAs with complete.cases() while reading the data in, but I'm not sure how to fix these errors or what they mean.
kdata.training = kdataf[ind==1,]
kdata.test = kdataf[ind==2,]
kdata_pred = knncat(train = kdata.training, test = kdata.test, classcol = 4)
Error in knncat(train = kdata.training, test = kdata.test, classcol = 4) :
Some classes have only one member. Check "classcol"
When I attempt to run a small subsection of the training and test sets (200 and 70 observations, respectively), I get a different error:
kdata_strain = kdata.training[1:200,]
kdata_stest = kdata.test[1:70,]
kdata_pred = knncat(train = kdata_strain, test = kdata_stest, classcol = 4)
Error in knncat(train = kdata_strain, test = kdata_stest, classcol = 4) :
Some factor has empty levels
Here is str() called on kdataf, the data frame from which the above data was sampled:
str(kdataf)
'data.frame': 1159712 obs. of 9 variables:
$ vehicle_sales_price: num 13495 11999 14499 12495 14999 ...
$ week_number: Factor w/ 27 levels "1","2","3","4",..: 11 10 13 10 10 9 18 10 10 10 ...
$ county: Factor w/ 219 levels "Anderson","Andrews",..: 49 49 49 49 49 49 49 49 49 49 ...
$ ownership_code : Factor w/ 23 levels "1","2","3","4",..: 11 11 3 1 11 11 11 11 11 11 ...
$ X30_days_late : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ X60_days_late : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ penalty : num 0 0 55.3 0 0 ...
$ processing_time : int 28 24 32 29 19 20 63 27 28 24 ...
$ transaction_code : Factor w/ 2 levels "TITLE","WDTA": 2 2 2 2 2 2 2 2 2 2 ...
The seed was set to 1234 and the ratio of training to test data was 2:1.

First, I know very little about R, so take my answer with a grain of salt.
I had the same problem, and it made no sense because there were no NAs. At first I thought the cause was strange characters like ', /, etc. in my data. But no: the knncat algorithm works with those characters once I add the following lines of code after defining my training set (I use data.table because my data are huge):
library(data.table)  # for fread()
write.csv(train, file = "train.csv")
train <- fread("train.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)
train[, V1 := NULL]  # drop the row-name column added by write.csv
Then there are no more 'Some factor has empty levels' or 'Some classes have only one member. Check "classcol"' messages.
I know this is not a real solution to the problem, but at least you can finish your work.
Hope it helps.
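A more direct fix, if you would rather not round-trip through a CSV, is to drop the unused factor levels yourself after subsetting. This is a sketch on toy data using base R's droplevels(); apply the same idea to your class column (whatever classcol points at):

```r
# Toy reproduction: subsetting a data frame keeps the full level set,
# so a factor can end up with levels that have zero rows.
df    <- data.frame(cls = factor(c("a", "a", "b", "c")), x = 1:4)
train <- df[df$cls != "c", ]
nlevels(train$cls)        # still 3: level "c" is now empty

# droplevels() removes the unused levels
train <- droplevels(train)
nlevels(train$cls)        # 2

# For "some classes have only one member", keep classes seen > 1 time
counts <- table(train$cls)
train  <- droplevels(train[train$cls %in% names(counts[counts > 1]), ])
nlevels(train$cls)        # 1: only "a" remains
```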

Related

R round correlate function from corrr package

I'm creating a correlation table using the correlate function in the corrr package. Here is my code:
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The "non-numeric variable(s)" part of this error message doesn't make sense to me: when I look at the structure, I see only integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?
correlate() returns a tibble whose first column, term, is character (that is what the error is complaining about); round only the numeric columns with across():
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson") %>%
mutate(across(where(is.numeric), round, digits = 2))
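One caveat: passing extra arguments through across() like this (round, digits = 2) was deprecated in dplyr 1.1.0; the anonymous-function spelling is the current recommendation. A sketch on toy data:

```r
library(dplyr)

d <- data.frame(term = c("a", "b"), r = c(0.12345, -0.6789))

# Anonymous function instead of `round, digits = 2`
# (\(x) is R >= 4.1 syntax; use function(x) round(x, 2) on older R)
d %>% mutate(across(where(is.numeric), \(x) round(x, 2)))
```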
Alternatively, we can use:
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
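Keep in mind that options(digits = 2) only changes how numbers are printed, globally and for the rest of the session; the underlying values are untouched. A quick base-R check:

```r
x <- 0.123456
options(digits = 2)
print(x)             # displays 0.12
x == 0.123456        # TRUE: the stored value is unchanged
options(digits = 7)  # restore the default
```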

Single input data.frame instead of testset.csv fails in randomForest code in R

I have written an R script which runs successfully and predicts output, but only when a CSV with multiple entries is passed as input to the classifier.
training_set = read.csv('finaldata.csv')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-5],
y = training_set$Song,
ntree = 50)
test_set = read.csv('testSet.csv')
y_pred = predict(classifier, newdata = test_set)
The above code runs successfully, but instead of giving the classifier 10+ inputs, I want to pass a single-row data.frame. That works with other classifiers, so why not this one?
So the following code doesn't work and throws an error:
y_pred = predict(classifier, data.frame(Emot="happy",Pact="Walking",Mact="nothing",Session="morning"))
Error in predict.randomForest(classifier, data.frame(Emot = "happy", :
Type of predictors in new data do not match that of the training data.
I even tried keeping a single entry in testinput.csv; it still throws the same error. How do I solve it? This code is the back end of another program of mine, and I want to pass only a single entry as the test input. All the variables are factors in both the training and test sets. Help appreciated.
PS: Previous solutions to same error, didn't help me.
str(test_set)
'data.frame': 1 obs. of 5 variables:
$ Emot : Factor w/ 1 level "fear": 1
$ Pact : Factor w/ 1 level "Bicycling": 1
$ Mact : Factor w/ 1 level "browsing": 1
$ Session: Factor w/ 1 level "morning": 1
$ Song : Factor w/ 1 level "Dusk Till Dawn.mp3": 1
str(training_set)
'data.frame': 1052 obs. of 5 variables:
$ Emot : Factor w/ 8 levels "anger","contempt",..: 4 7 6 6 4 3 4 6 4 6 ...
$ Pact : Factor w/ 5 levels "Bicycling","Driving",..: 1 2 2 2 4 3 1 1 3 4 ...
$ Mact : Factor w/ 6 levels "browsing","chatting",..: 1 6 1 4 5 1 5 6 6 6 ...
$ Session: Factor w/ 4 levels "afternoon","evening",..: 3 4 3 2 1 3 1 1 2 1 ...
$ Song : Factor w/ 101 levels "Aaj Ibaadat.mp3",..: 29 83 47 72 29 75 77 8 30 53 ...
OK, this worked, though it's an odd solution: equalize the factor levels of the training and test sets. The following code binds the first row of the training set onto the test set and then deletes it:
test_set <- rbind(training_set[1, ] , test_set)
test_set <- test_set[-1,]
Done! It works for a single input as well as a single-entry .csv file, without the model throwing an error.
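The rbind trick works because the bound training row stamps the training factor levels onto the test set. The same effect can be achieved directly by rebuilding each test factor with the training levels; a sketch on toy data:

```r
train <- data.frame(Emot = factor(c("happy", "fear", "anger")))
test  <- data.frame(Emot = factor("happy"))   # only 1 level

# Re-level the test factor against the training data
test$Emot <- factor(test$Emot, levels = levels(train$Emot))

identical(levels(test$Emot), levels(train$Emot))  # TRUE
```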

How to reveal the content of a built-in dataset?

Many R packages come with built-in datasets (such as Vehicle in mlbench and churn in C50). We can use the data() function to load them. Sometimes I want to check the structure and content of a dataset in order to construct a new one for further analysis, but the View() function often fails at this job. summary() works in some cases, but if you call summary(churn), all you get is an error: Error in summary(churn) : object 'churn' not found.
Is there any common method to reveal part of a built-in dataset?
Despite the fact that churn.RData is in the ../data/ directory of the C50 library, loading it shows that there is no 'churn' object in it. There are, however, both 'churnTest' and 'churnTrain' datasets, and you can see their structure with str():
load('/path/to/my/current_R/Resources/library/C50/data/churn.RData')
ls(patt='churn')
#[1] "churnTest" "churnTrain"
str(churnTest)
'data.frame': 1667 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 12 27 36 33 41 13 29 19 25 44 ...
$ account_length : int 101 137 103 99 108 117 63 94 138 128 ...
$ area_code : Factor w/ 3 levels "area_code_408",..: 3 3 1 2 2 2 2 1 3 2 ...
$ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ voice_mail_plan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 2 ...
# snipped remainder of output
You would also have gotten some useful output from:
data(package="C50")
I get a panel that pops up with:
Data sets in package ‘C50’:
churnTest (churn) Customer Churn Data
churnTrain (churn) Customer Churn Data
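So the usual pattern for inspecting a packaged dataset is data() to load it (note the object names may differ from the dataset name, as with churn here), followed by str() or head(); a sketch assuming the C50 package is installed:

```r
library(C50)
data(churn)         # loads churnTrain and churnTest into the workspace
str(churnTrain)     # structure: column names, types, factor levels
head(churnTest)     # first six rows
```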

How to apply Naive Bayes model to new data

I asked a question on this this morning, but I am deleting that and posting here with better wording.
I created my first machine learning model using train and test data. I returned a confusion matrix and saw some summary stats.
I would now like to apply the model to new data to make predictions but I don't know how.
Context: predicting monthly "churn" cancellations. The target variable is churned, and it has two possible labels: "churned" and "not churned".
head(tdata)
months_subscription nvk_medium org_type churned
1 25 none Community not churned
2 7 none Sports clubs not churned
3 28 none Sports clubs not churned
4 18 unknown Religious congregations and communities not churned
5 15 none Association - Professional not churned
6 9 none Association - Professional not churned
Here's me training and testing:
library("klaR")
library("caret")
# import data
test_data_imp <- read.csv("tdata.csv")
# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]
#training
rn_train <- sample(nrow(tdata),
floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)
# testing
predictions <- predict(model, test)
confusionMatrix(test$churned, predictions$class)
Everything up till here works fine.
Now I have new data, structured and laid out the same way as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I was expecting a new column cbind-ed on that holds the predicted class for each record.
I tried this:
## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]
actual_predictions <- predict(model, pdata)
#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)
# output
head(predicted_data)
which threw these errors:
actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 3
How can I apply my model to the new data? I'd like a new data frame with an added column that holds the predicted class.
**Following a comment, here are head() and str() of the new data for prediction:**
head(pdata)
months_subscription nvk_medium org_type churned
1 26 none Community not churned
2 8 none Sports clubs not churned
3 30 none Sports clubs not churned
4 19 unknown Religious congregations and communities not churned
5 16 none Association - Professional not churned
6 10 none Association - Professional not churned
> str(pdata)
'data.frame': 6433 obs. of 4 variables:
$ months_subscription: int 26 8 30 19 16 10 3 5 14 2 ...
$ nvk_medium : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
$ org_type : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
$ churned : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...
This is most likely caused by a mismatch between the encoding of factors in the training data (tdata in your case) and in the new data passed to the predict function (pdata): typically, the test data contains factor levels that are not present in the training data. Consistency in the encoding of the features must be enforced by you, because the predict function will not check it. I therefore suggest that you double-check the levels of nvk_medium and org_type in the two data frames.
The error message:
Error in object$tables[[v]][, nd] : subscript out of bounds
is raised when evaluating a given feature (the v-th feature) in a data point, where nd is the numeric value of the factor corresponding to that feature. You also have warnings indicating that the posterior probabilities for all classes in observations 1, 2, and 3 are zero, but it is not clear whether this is also related to the encoding of the factors.
To reproduce the error that you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to that in your data:
# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')]
# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]
with:
> str(Data_train)
'data.frame': 656 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
$ Age : num 35 28 34 28 29 28 28 28 45 28 ...
$ Sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...
> str(Data_test)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
Then everything goes as expected:
model <- NaiveBayes(Survived ~ ., data = Data_train)
# This will work
pred_1 <- predict(model, Data_test)
> str(pred_1)
List of 2
$ class : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:657] "6" "7" "8" "9" ...
.. ..$ : chr [1:2] "0" "1"
However, if the encoding is not consistent, e.g.:
# Mess things up, by "displacing" the factor values (i.e., 'Nothing'
# will now be encoded as number 5, which was not present in the
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
as.character(Data_test_2$Title),
levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)
> str(Data_test_2)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
then:
> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
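Given that, a sketch of the fix: rebuild each test factor from the training levels before calling predict(), so unseen values become NA instead of an out-of-range level code (toy example in base R):

```r
train_title <- factor(c("Mr", "Mrs", "Miss", "Nothing"))
test_title  <- factor(c("Mr", "Dr"),
                      levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))

# Re-encode with only the training levels; "Dr", which the model never
# saw, becomes NA rather than a subscript the model cannot look up
aligned <- factor(as.character(test_title), levels = levels(train_title))
as.character(aligned)   # "Mr" NA
```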

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (e1071 package in R). Here is the information about my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as follows:
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
I.e., the example below shows my.test$foo only has 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
That's correct: the train data contains 2 blank values for embarked. Because of this, there is one extra categorical level for the blanks, and that is why you are getting this error:
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is the blank.
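If the blank level is indeed the culprit, one option is to treat the blanks as missing and drop the now-empty level before fitting; a base-R sketch:

```r
emb <- factor(c("S", "C", "", "S", "Q"))
levels(emb)           # "" is a real level: ""  "C" "Q" "S"

emb[emb == ""] <- NA  # treat blanks as missing values
emb <- droplevels(emb)
levels(emb)           # "C" "Q" "S"
```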
I encountered the same problem today. It turned out that the svm model in the e1071 package can only use rows as the objects, meaning one row is one sample rather than one column. If you use columns as samples and rows as variables, this error will occur.
Probably your data is good (no new levels in the test data), and you just need a small trick; then prediction works fine.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick is from R Random Forest - type of predictors in new data do not match.
I encountered this problem today, used the above trick, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one thing you can do is explicitly include only the columns you feel add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by excluding columns (like ticket number) that contribute no relevant data.
Another issue that broke my code was that I had forgotten to make some of my independent variables factors.
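On that last point, a quick base-R way to find character columns and convert them all to factors (a sketch on toy data):

```r
df <- data.frame(sex = c("male", "female"), age = c(22, 38),
                 stringsAsFactors = FALSE)
sapply(df, is.factor)               # sex: FALSE (still character)

chr     <- sapply(df, is.character)
df[chr] <- lapply(df[chr], factor)  # convert every character column
sapply(df, is.factor)               # sex: TRUE
```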
