My KNN model - Vector to data.frame issue - r

I need your help this time. I'm working on my KNN model (looking for their probabilities).
predictions<- knn(x_training, x_testing, y_training, k = 5, prob = TRUE)
However, I'd like to get a dataframe with it. When I applied the data.frame function, I get only 0/1 (whether is true or false) but not the probability.

It is a little difficult to understand your question, you should try to give us a reproducible example with code. It is much easier for us as a community to answer your questions efficiently if you do.
Here is an example:
library(class)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
predictions <- knn(train, test, cl, k = 3, prob=TRUE)
I believe you are running into trouble because you are trying to coerce the output of the KNN function into a dataframe. However, the KNN output has the probabilities as an attribute of the data.
So you need to access the probabilities using the attr() function. For more info type:
?attr
into the R console.
To achieve your desired outcome you need to do this:
data.frame(Value=predictions,Prob=attr(predictions,"prob"))

Related

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a dataframe, following 3-way classification for vehicle types (car, boat, plane), using columns such as mpg, cost as features.
To start, when I run:
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred,VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg,DATA$cost)[train,]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question and I'm probably not including all relevant info. If anyone has any idea what's going wrong here or why my test.X isn't loading through any data I'd appreciate it!
Without any info on your data, it is hard to guess where the problem is. You should post a minimal reproducible example
or at least dput your data or part of it. However here I show 2 methods for training a knn model, using 2 different package (class, and caret) with the mtcars built-in dataset.
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear)
ind <- sample(1:nrow(mtcars),20)
train.X <- mtcars[ind,]
test.X <- mtcars[-ind,]
train.VehicleType <- train.X[,"gear"]
VehicleType.All <- test.X[,"gear"]
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
table(knn.pred,VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear,p=0.60,list=F)
train.X <- mtcars[ind,]
test.X <- mtcars[-ind,]
control <-trainControl(method = "cv",number = 10)
grid <- expand.grid(k=2:10)
knn.pred <- train(gear~.,data=train.X,method="knn",tuneGrid=grid)
pred <- predict(knn.pred,test.X[,-10])
cm <- confusionMatrix(pred,test.X$gear)
the caret package allows performing cross-validation for parameters tuning during model fitting, in a straightforward way. By default train perform a 25 rep bootstrap cross-validation to find the best value of k among the values I've supplied in the grid object.
From your example, it seems that your test object is empty so the result of knn is a 0-length vector. Probably your problem is in the data reading. However, a better way to subset your DATA can be this:
#insetad of
train.X = cbind(DATA$mpg,DATA$cost)[train,]
#you should do:
train.X <- DATA[train,c("mpg","cost")]
test.X <- DATA[-train,c("mpg","cost")]
However, I do not understand what variable is DATA$Values, Firstly I was thinking it was the outcome, but, this line confused me a lot:
train=(DATA$Values<=200)
You can work on these examples to catch your error on your own. If you can't post an example that reproduces your situation.

Understanding how to use nnet in R

This is my first attempt using a machine learning paradigm in R. I'm using a planet data set (url: https://www.kaggle.com/mrisdal/open-exoplanet-catalogue) and I simply want to predict a planet's size based on the size of its Sun. This is the code I currently have, using nnet():
library(nnet)
#Organize data:
cols_to_keep = c(1,4,21)
full_data <- na.omit(read.csv('Planet_Data.csv')[, cols_to_keep])
#Split data:
train_data <- full_data[sample(nrow(full_data), round(nrow(full_data)/2)),]
rownames(train_data) <- 1:nrow(train_data)
test_data <- full_data[!rownames(full_data) %in% rownames(data1),]
rownames(test_data) <- 1:nrow(test_data)
#nnet
nnet_attempt <- nnet(RadiusJpt~HostStarRadiusSlrRad, data=train_data, size=0, linout=TRUE, skip=TRUE, maxNWts=10000, trace=FALSE, maxit=1000, decay=.001)
nnet_newdata <- predict(nnet_attempt, newdata=test_data)
nnet_newdata
When I print nnet_newdata I get a value for each row in my data, but I don't really understand what these values mean. Is this a proper way to use the nnet() package to predict a simple regression?
Thanks
When predict is called for an object with class nnet you will get, by default, the raw output from the nnet model applied to your new dataset. If, instead, yours is a classification problem, you can use type = "class".
See here.

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id=1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for(i in id){
knn_colon <- knn(train = newColon_train[newColon_train$id!=i,], test = newColon_train[newColon_train$id==i,],cl = newColon_train[newColon_train$id!=i,]$Y)
loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNNit gives me 40 predictions (i.e. the 40 train set predictions), however, I would like it to give me the 62 predictions (all of my n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id!=i,],
test = newColon_test[newColon_test$id==i,],
cl = newColon_train[newColon_train$id!=i,]$Y)
This is caused by KNN being an non-parametric, instance based model: the data itself is the model, hence "training" is just holding the data for "later" prediction and does not require any computationally intensive model fitting procedure. Consequently it is unproblematic to call the training procedure multiple times to apply it to multiple test sets.
But be aware that the idea of CV is to only evaluate on the left partition each time, so looking at all samples is probably not what you want to do. And, instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework instead, which provides APIs for partitioning, resampling, etc. all in one, therefore is pretty convenient in such tasks.

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (i.e. business, entertainment) of an article based on type.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to separately load the labels/classes/categories, the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1780:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
However, as #ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Predictive model decision tree

I want to build a predictive model using decision tree classification in R. I used this code:
library(rpart)
library(caret)
DataYesNo <- read.csv('DataYesNo.csv', header=T)
summary(DataYesNo)
worktrain <- sample(1:50, 40)
worktest <- setdiff(1:50, worktrain)
DataYesNo[worktrain,]
DataYesNo[worktest,]
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- “YesNo”
tree <- rpart(YesNo~Var1+Var2+Var3+Var4+Var5,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
summary(tree)
plot(tree)
text(tree)
I got just one root (Var3) and two leafs (yes, no). I'm not sure about this result. How can I get the confusion matrix, accuracy, sensitivity, and specificity?
Can I get them with the caret package?
If you use your model to make predictions on your test set, you can use confusionMatrix() to get the measures you're looking for.
Something like this...
predictions <- predict(tree, worktest)
cmatrix <- confusionMatrix(predictions, worktest$YesNo)
print(cmatrix)
Once you create a confusion matrix, other measures can also be obtained - I don't remember them at the moment.
According to your example, the confusion matrix can be obtained as following.
fitted <- predict(tree, DataYesNo[worktest, c(input,target)])
actual <- DataYesNo[worktest, c(target)]
confusion <- table(data.frame(fitted = fitted, actual = actual))

Resources