"No missing values are allowed" error with kNN in R
I have a data set of 45212 rows and 17 columns, and I want to predict the class label in the last column using the kNN algorithm. As far as I can tell everything is correct, but I always get this error:
"Error in knn(train = data_train, test = data_test, cl = data_train_labels, :
no missing values are allowed"
Here is my code:
> data_train <-data[1:25000,]
> data_test <-data[25001:45212,]
> data_train_labels <- data[1:25000, 17]
> data_test_labels <- data[25001:45212, 17]
> install.packages("class")
> library(class)
> data_test_pred <- knn(train=data_train, test=data_test, cl=data_train_labels, k=10)
Here is what my data set looks like:
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no
41,admin.,divorced,secondary,no,270,yes,no,unknown,5,may,222,1,-1,0,unknown,no
I think your problem is the factor (categorical) columns in your data. The knn documentation says that it uses Euclidean distance, which does not make sense for factors.
Here is a possible solution if you really want to use knn. You can get a distance matrix between the points using daisy in the cluster package, which handles mixed numeric and categorical columns. There are several implementations of knn in R, but I don't know of one that accepts a distance matrix. You could either write your own (not so difficult) or map the distance matrix to a Euclidean space using cmdscale, then use knn on the projected space.
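The daisy, cmdscale and knn route can be sketched roughly as follows. This is only a minimal illustration: the toy data frame, its values, the choice of Gower dissimilarity, the two projection dimensions and k = 3 are all assumptions made for the example, not something taken from the question's data.

```r
library(class)    # knn()
library(cluster)  # daisy()

# Toy data shaped like the question's data set (all values are made up).
toy <- data.frame(
  age     = c(58, 44, 33, 47, 33, 35, 28, 42),
  job     = factor(c("management", "technician", "entrepreneur", "blue-collar",
                     "management", "management", "technician", "entrepreneur")),
  balance = c(2143, 29, 2, 1506, 1, 231, 447, 2),
  y       = factor(c("no", "no", "yes", "no", "yes", "no", "no", "yes"))
)

# Use only the predictors, never the label, to build the dissimilarities.
features <- toy[, c("age", "job", "balance")]

# Gower dissimilarity copes with mixed numeric and factor columns.
d <- daisy(features, metric = "gower")

# Classical multidimensional scaling: embed the points in 2 dimensions.
coords <- cmdscale(d, k = 2)

# Ordinary Euclidean knn on the projected coordinates.
train_idx <- 1:6
pred <- knn(train = coords[train_idx, , drop = FALSE],
            test  = coords[-train_idx, , drop = FALSE],
            cl    = toy$y[train_idx],
            k     = 3)
```

Note that the projection here is computed on the training and test rows together, so new rows cannot be scored later without recomputing it. For a data set with tens of thousands of rows the full distance matrix may also be too large to hold in memory.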
I believe that your mistake is: data_train <-data[1:25000,]
You are including your header that you have not normalized. I was able to reproduce the same error. But when I changed to data_train <-data[2:25000,] it ran fine.
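A quick way to check whether non-numeric columns are involved is to inspect the column types before calling knn. The sketch below uses a small made-up data frame (toy and its values are assumptions, not the asker's data). It shows that text which cannot be parsed as a number becomes NA when coerced, which is consistent with the error message.

```r
# Toy rows modelled on the question's data (hypothetical subset of columns).
toy <- data.frame(
  age     = c(58, 44, 33),
  job     = c("management", "technician", "entrepreneur"),
  balance = c(2143, 29, 2),
  y       = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

# Flag the columns that are already numeric.
is_num <- vapply(toy, is.numeric, logical(1))

# Coercing text to numbers produces NA (with a warning, suppressed here).
coerced <- suppressWarnings(as.numeric(toy$job))

# One option: keep only the numeric predictors and put them on a common scale.
x <- scale(toy[, is_num])
```

Dropping the categorical columns loses information, so this is only a starting point. Dummy-coding them is another common option.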
Related
KNN in R -- All arguments must have the same length, test.X is empty
I'm trying to perform KNN in R on a dataframe, following 3-way classification for vehicle types (car, boat, plane), using columns such as mpg, cost as features.
To start, when I run:
    knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then knn.pred returns
    factor(0)
    Levels: car boat plane
And table(knn.pred,VehicleType.All) returns
    Error in table(knn.pred, VehicleType.All) : all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
    train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
    train.X = cbind(DATA$mpg,DATA$cost)[train,]
    summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
    test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
    [,1] [,2]
Apologies for such a long question and I'm probably not including all relevant info. If anyone has any idea what's going wrong here or why my test.X isn't loading through any data I'd appreciate it!
Without any information about your data, it is hard to guess where the problem is. You should post a minimal reproducible example, or at least dput your data or part of it. However, here I show two methods for training a knn model, using two different packages (class and caret), with the built-in mtcars dataset.
With class:
    library(class)
    data("mtcars")
    str(mtcars)
    mtcars$gear <- as.factor(mtcars$gear)
    ind <- sample(1:nrow(mtcars),20)
    train.X <- mtcars[ind,]
    test.X <- mtcars[-ind,]
    train.VehicleType <- train.X[,"gear"]
    VehicleType.All <- test.X[,"gear"]
    knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
    table(knn.pred,VehicleType.All)
With caret:
    library(caret)
    ind <- createDataPartition(mtcars$gear,p=0.60,list=F)
    train.X <- mtcars[ind,]
    test.X <- mtcars[-ind,]
    # control is defined here but not passed to train() below, so train() uses its default resampling
    control <-trainControl(method = "cv",number = 10)
    grid <- expand.grid(k=2:10)
    knn.pred <- train(gear~.,data=train.X,method="knn",tuneGrid=grid)
    pred <- predict(knn.pred,test.X[,-10])
    cm <- confusionMatrix(pred,test.X$gear)
The caret package allows performing cross-validation for parameter tuning during model fitting in a straightforward way. By default, train performs 25 bootstrap resamples to find the best value of k among the values I've supplied in the grid object.
From your example, it seems that your test object is empty, so the result of knn is a 0-length vector. Probably your problem is in the data reading. However, a better way to subset your DATA can be this:
    # instead of
    train.X = cbind(DATA$mpg,DATA$cost)[train,]
    # you should do:
    train.X <- DATA[train,c("mpg","cost")]
    test.X <- DATA[!train,c("mpg","cost")]
However, I do not understand what variable DATA$Values is. At first I thought it was the outcome, but this line confused me a lot:
    train=(DATA$Values<=200)
You can work on these examples to catch your error on your own. If you can't, post an example that reproduces your situation.
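To make the empty test set concrete, here is a small sketch with a made-up DATA frame (the values and the 100 threshold are assumptions for illustration). If every row satisfies the condition, the logical mask is all TRUE and its negation selects nothing. It also shows why a logical mask should be negated with ! rather than with -, which is a common slip.

```r
# Hypothetical stand-in for the asker's DATA (values are invented).
DATA <- data.frame(
  Values = c(10, 50, 120, 180, 200, 90),
  mpg    = c(21.0, 22.8, 18.7, 14.3, 24.4, 19.2),
  cost   = c(5, 7, 9, 11, 6, 8)
)

# Every Values entry is <= 200, so the mask is TRUE for all rows.
train <- DATA$Values <= 200

# Negating an all-TRUE mask selects no rows: the test set is empty.
test_empty <- DATA[!train, c("mpg", "cost")]

# Using - on a logical mask turns TRUE into -1, which only drops row 1.
test_wrong <- DATA[-train, c("mpg", "cost")]

# A threshold that actually splits the rows gives a non-empty test set.
train2  <- DATA$Values <= 100
train.X <- DATA[train2, c("mpg", "cost")]
test.X  <- DATA[!train2, c("mpg", "cost")]
```

Checking table(train) or nrow(test.X) right after building the split is a cheap way to catch this before calling knn.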
R Caret knnImpute for partially NA rows
I'm trying to run some code to preprocess my data for machine learning in Caret. One step I'm having a lot of trouble with is KNN imputation. When I run the following block of code:
    library(caret)
    traindf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
    testdf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
    for(i in 1:7){
      traindf[i,i] <- NA #generates NA's in every row
    }
    impute_model <- preProcess(traindf, method = c('knnImpute')) #this line is problematic
    imputed_train <- predict(impute_model, traindf)
    imputed_test <- predict(impute_model, testdf)
I get an error:
    Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, : Cannot find more nearest neighbours than there are points
From some research, I believe this is due to the fact that the kNN imputation implementation Caret uses discards rows with any NA's. In my dataset, NA's are scattered throughout such that this would result in all rows being discarded for imputation purposes. Instead I would like to keep these partially NA rows and still use them for imputation.
I know of one package that does this: https://www.rdocumentation.org/packages/impute/versions/1.46.0/topics/impute.knn. However, this one doesn't override predict, so I can't use it easily to impute the test set as well like in the above example.
Does anyone have suggestions on how I can get this partial-NA KNN imputation working with Caret?
Kaggle Digit Recognizer Using SVM (e1071): Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty
I am trying to solve the Digit Recognizer competition on Kaggle and I ran into this error. I loaded the training data and adjusted its values by dividing by the maximum pixel value, which is 255. After that, I am trying to build my model. Here is my code:
    Given_Training_data <- get(load("Given_Training_data.RData"))
    Given_Testing_data <- get(load("Given_Testing_data.RData"))
    Maximum_Pixel_value = max(Given_Training_data)
    Tot_Col_Train_data = ncol(Given_Training_data)
    training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)]/Maximum_Pixel_value
    testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)]/Maximum_Pixel_value
    label_training_data <- Given_Training_data$label
    final_training_data <- cbind(label_training_data, training_data_adjusted)
    smp_size <- floor(0.75 * nrow(final_training_data))
    set.seed(100)
    training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
    training_data1 <- final_training_data[training_ind, ]
    train_no_label1 <- as.data.frame(training_data1[,-1])
    train_label1 <-as.data.frame(training_data1[,1])
    svm_model1 <- svm(train_label1,train_no_label1) #This line is throwing an error
The error:
    Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please kindly share your thoughts. I am not looking for an answer but rather some idea that guides me in the right direction, as I am in a learning phase. Thanks.
Update to the question:
    trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
    trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
    svm_model2 <- svm(trainlabel1,trainnolabel1,scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
    svm(x, y = NULL, scale = TRUE, type = NULL, ...)
    ...
    Arguments:
    ...
    x   a data matrix, a vector, or a sparse matrix (object of class Matrix provided by the Matrix package, or of class matrix.csr provided by the SparseM package, or of class simple_triplet_matrix provided by the slam package).
    y   a response vector with one label for each row/component of x. Can be either a factor (for classification tasks) or a numeric vector (for regression).
Therefore, the main problems are that your call to svm is switching the data matrix and the response vector, and that you are passing the response vector as integer, resulting in a regression model. Furthermore, you are also passing the response vector as a single-column data frame, which is not exactly how you are supposed to do it. Hence, if you change the call to:
    svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (where the values in the respective column of the training data matrix are all identical) in the training data, since these will not influence the classification.
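Removing constant features can be done in a couple of lines of base R. The sketch below uses a tiny invented feature table (the pixel0 to pixel3 names and values are assumptions for illustration) and keeps only the columns that take more than one distinct value.

```r
# Hypothetical feature table: pixel0 and pixel2 never vary.
x <- data.frame(
  pixel0 = c(0, 0, 0, 0),
  pixel1 = c(0, 0.5, 1, 0.2),
  pixel2 = c(1, 1, 1, 1),
  pixel3 = c(0.1, 0.9, 0.3, 0.3)
)

# A column is constant when it has exactly one distinct value.
is_constant <- vapply(x, function(col) length(unique(col)) == 1, logical(1))

# Keep the informative columns; drop = FALSE preserves the data frame shape.
x_reduced <- x[, !is_constant, drop = FALSE]
```

The same is_constant mask, computed on the training data, should then be applied to the test data so both have identical columns.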
I don't think you need to scale the data manually, since svm itself will do it, unlike most neural network packages. You can also use the formula interface of svm instead of the matrix and vector arguments:
    svm(result~.,data = your_training_set)
In your case, I guess you want to make sure the result is used as a factor, because you want a label like 1, 2, 3, not a value like 1.5467, which would be regression. I can debug it if you can share the data: Given_Training_data.RData
predict in caret ConfusionMatrix is removing rows
I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a general linear model on some data and, when I run it through the confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the confusionMatrix predict. I discovered this by doing the following:
    a <- table(testing2$hold1yes0no)
    a[1]+a[2]
    1543
    b <- table(predict(modelFit,trainTR2))
    dim(b)
    [1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows? My code is below:
    set.seed(2382)
    inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
    training2 <- HOLD[inTrain2,]
    testing2 <- HOLD[-inTrain2,]
    preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
    trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
    trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
    modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
    confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
    modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.
Use of randomforest() for classification in R?
I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
    training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
    training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
    training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
    1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) : <= this is not relevant for factors (roughly translated)
    2: In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making. Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector. So it first had to be converted as a data frame. So I needed to add the following line: training <- as.data.frame(training) Problem solved!
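To illustrate why the conversion matters: sapply() over the columns of a data frame returns a matrix, and a matrix cannot hold a factor column, so assigning a factor into it does not keep the factor. The sketch below uses a made-up training.temp (names and values are assumptions for the example).

```r
# Hypothetical input: every column stored as text.
training.temp <- data.frame(
  a     = c("1.5", "2.0", "3.5"),
  Class = c("0", "1", "0"),
  stringsAsFactors = FALSE
)

# sapply() simplifies the per-column results into a numeric matrix.
m <- sapply(training.temp, as.numeric)

# Assigning a factor into a matrix column leaves the matrix numeric.
m[, "Class"] <- factor(m[, "Class"])
still_numeric <- !is.factor(m[, "Class"])

# Converting to a data frame first lets the column really become a factor.
training <- as.data.frame(sapply(training.temp, as.numeric))
training$Class <- factor(training$Class)
```

With training as a data frame and Class stored as a factor, a classifier that decides between regression and classification from the response type sees a categorical response.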
First, your coercion to a factor is not working because of syntax errors. Second, you should always use indexing when specifying a RF model. Here are changes in your code that should make it work.
    # keep a data frame (not a matrix) so that a column can hold a factor
    training <- as.data.frame(sapply(training.temp,as.numeric))
    training[,"Class"] <- as.factor(training[,"Class"])
    training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"], importance=TRUE, do.trace=100)
    # You can also coerce to a factor directly in the model statement
    training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]), importance=TRUE, do.trace=100)