Finding k-nearest neighbors in R with knn() from the class package

This is homework.
I have two matrices, one for training and one for testing.
The data has two columns which shall be used for classification and a third column with the known class. Both matrices have the third column.
     [,1] [,2] [,3]
[1,]  6.4 0.32    2
[2,]  4.8 0.34    0
[3,]  4.9 0.25    2
[4,]  7.2 0.32    1
where the integers are the classes (0-2).
The dimensions of my datasets are 100 x 3 for the training set and 38 x 3 for the test set.
I have tried to use the knn() function from the class library.
knn() takes the following arguments: (train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
I have tried to use my data sets directly, but then I get the error: "'train' and 'class' have different lengths"
I have tried a few things, but I am stuck after some hours now. At the moment I have this code in my editor:
cl <- t(factor(c(rep("0",1), rep("1",1), rep("2",1))))
k <- knn(train, test, cl)
but it does not work. Can someone help me?
I want to run the function with 3 different k-values and find the accuracy of each. After that I will 5-fold cross-validate the best k.

As the documentation states, cl is the factor of true classifications of the training set, i.e. your y variable (the third column of your training set).
This means that the function should be as follows:
cl <- factor(c(2,0,2,1)) #or alternatively factor(train[,3])
k <- knn(train[,c(1,2)], test[,c(1,2)], cl)
As you can see, the y variable (the column with the classes) is not included in the train and test arguments; the column is only supplied, as a factor, through the cl argument.
The error you received occurred because the number of rows of the training set must equal the length of the cl factor, and your factor had only 3 elements (you thought you only needed to specify the levels of the factor there).
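A hedged sketch of the k comparison mentioned in the question (assuming train and test are the 100 x 3 and 38 x 3 matrices described above, with the class in column 3; the three k values are arbitrary placeholders):
library(class)
cl_train <- factor(train[, 3])
cl_test <- factor(test[, 3])
for (k in c(1, 3, 5)) {  # placeholder k values
  pred <- knn(train[, 1:2], test[, 1:2], cl_train, k = k)
  cat("k =", k, "accuracy =", mean(as.character(pred) == as.character(cl_test)), "\n")
}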

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69
Clustering table:
   1    2    3    4
   0 2271  896   91
Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667
Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM): a probabilistic model that estimates the mean / variance of the clusters and also the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is built from the probabilities of each data point under each cluster; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if a data point lies around the shared mean of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can simply keep only the clusters that have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans was missing in the original snippet; assuming fuzzy k-means via KMeans_rcpp
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
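A minimal sketch of that route (function names taken from the ClusterR docs; the choice of 3 components is my assumption):
library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
pr <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                  gmm_fit$covariance_matrices, gmm_fit$weights)
head(pr$cluster_labels)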

'mtry' in 'rfcv()' function of R's 'randomForest' library

I would like to use cross-validation to determine the number of variables to try in a random forest method. I don't understand how to use the mtry argument in the rfcv() function.
I have 6 predictors in my dataset. I want to use mtry = 6,5,4,3,2,1, i.e., every possible m value, and cross-validate with 5-fold CV.
I believe this can be done with rfcv() function of randomForest package. I am running the code:
rf_cv<- rfcv(training_x,training_y,cv.fold=5, mtry=function(p) max(1, p-1))
However, calling rf_cv$n.var gives me:
[1] 6 3 1
So this method does not apply mtry as I was hoping, since I asked it to reduce the number of variables by 1 each time.
How can I try every number of variables, applying 5-fold cross-validation for each?
I checked this post; however, it is not completely related, since it discusses the default of mtry.
The post you referenced explains how step determines which numbers of variables get tested. So in your case p = 6, and since you did not change step or scale:
p <- 6; step <- 0.5
k <- floor(log(p, base = 1/step))
n.var <- round(p * step^(0:(k - 1)))
n.var
[1] 6 3
And if n.var does not include 1, rfcv goes ahead and appends it for you, which gives you 6, 3, 1. So if you want to try every number of variables, set mtry to identity, step to -1, and scale to anything but "log" (the code doesn't actually check for other named options):
rf_cv=rfcv(matrix(rnorm(100*6),ncol=6),rnorm(100),cv.fold=3,
mtry=identity,scale="new",step=-1)
rf_cv$n.var
[1] 6 5 4 3 2 1
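To then pick the best size, inspect the cross-validated error that rfcv() returns for each value of n.var:
rf_cv$error.cv
# named vector of cross-validated error, one entry per n.var
rf_cv$n.var[which.min(rf_cv$error.cv)]  # number of variables with lowest CV error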

Creating a prediction function for kmeans in R

I want to create a predict function which predicts the cluster to which an observation belongs:
data(iris)
mydata=iris
m=mydata[1:4]
train=head(m,100)
xNew=head(m,10)
rownames(train)<-1:nrow(train)
norm_eucl=function(train)
train/apply(train,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(train)
result=kmeans(m_norm,3,30)
predict.kmean <- function(cluster, newdata)
{
simMat <- m_norm(rbind(cluster, newdata),
sel=(1:nrow(newdata)) + nrow(cluster))[1:nrow(cluster), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.kmean(m_norm, x[result$cluster, ], xNew)
After that I get the error:
Error in predict.kmean(m_norm, x[result$cluster, ], xNew) :
unused argument (xNew)
I understand that I am doing something wrong in the function (I'm just learning to write them), but I can't see where exactly.
In fact, I want to adapt the similar function from apcluster (I had seen a similar topic, but for apcluster):
predict.apcluster <- function(s, exemplars, newdata)
{
simMat <- s(rbind(exemplars, newdata),
sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
How can I do this?
Rather than trying to replicate something, let's come up with our own function. For a given vector x, we want to assign a cluster using some prior k-means output. Given how k-means algorithm works, what we want is to find which cluster's center is closest to x. That can be done as
predict.kmeans <- function(x, newdata)
apply(newdata, 1, function(r) which.min(colSums((t(x$centers) - r)^2)))
That is, we go over newdata row by row and compute the corresponding row's distance to each of the centers and find the minimal one. Then, e.g.,
head(predict(result, train / sqrt(rowSums(train^2))), 3)
# 1 2 3
# 2 2 2
all.equal(predict(result, train / sqrt(rowSums(train^2))), result$cluster)
# [1] TRUE
which confirms that our predicting function assigned all the same clusters to the training observations. Then also
predict(result, xNew / sqrt(rowSums(xNew^2)))
# 1 2 3 4 5 6 7 8 9 10
# 2 2 2 2 2 2 2 2 2 2
Notice also that I'm calling simply predict rather than predict.kmeans. That is because result is of class kmeans and a right method is automatically chosen. Also notice how I normalize the data in a vectorized manner, without using apply.
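You can see that dispatch directly:
class(result)
# [1] "kmeans"
# so a plain predict(result, ...) call finds our predict.kmeans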

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulties understanding how the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a matrix where each row has 3 columns corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?
The pROC package is not really designed to handle this case, where you get multiple predictions (as probabilities for each class). Typically you would assess one class at a time, e.g. P(class = 1):
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
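For completeness, a sketch of the one-vs-rest route mentioned above, using a plain roc() per class (the 0/1 recoding is mine):
library(pROC)
# ROC for class 1 vs the rest; repeat with columns 2 and 3
roc1 <- roc(as.numeric(testing.logist$cut.rank == 1), mnm.predict.test.probs[, 1])
auc(roc1)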

Help using predict() for kernlab's SVM in R?

I am trying to use the kernlab R package to do Support Vector Machines (SVM). For my very simple example, I have two pieces of training data. A and B.
(A and B are of type matrix - they are adjacency matrices for graphs.)
So I wrote a function which takes A+B and generates a kernel matrix.
> km
[,1] [,2]
[1,] 14.33333 18.47368
[2,] 18.47368 38.96053
Now I use kernlab's ksvm function to generate my predictive model. Right now, I'm just trying to get the darn thing to work - I'm not worried about training error, etc.
So, Question 1: Am I generating my model correctly? Reasonably?
# y are my classes. In this case, A is in class "1" and B is in class "-1"
> y
[1] 1 -1
> model2 = ksvm(km, y, type="C-svc", kernel = "matrix");
> model2
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
[1] " Kernel matrix used as input."
Number of Support Vectors : 2
Objective Function Value : -0.1224
Training error : 0
So far so good. We created our custom kernel matrix, and then we created a ksvm model using that matrix. We have our training data labeled as "1" and "-1".
Now to predict:
> A
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 0 0 0
> predict(model2, A)
Error in as.matrix(Z) : object 'Z' not found
Uh-oh. This is okay. Kind of expected, really. "Predict" wants some sort of vector, not a matrix.
So lets try some things:
> predict(model2, c(1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, c(1,1,1,1))
Error in as.matrix(Z) : object 'Z' not found
> predict(model2, km)
Error in as.matrix(Z) : object 'Z' not found
Some of the above tests are nonsensical, but that is my point: no matter what I do, I just can't get predict() to look at my data and do a prediction. Scalars don't work, vectors don't work. A 2x2 matrix doesn't work, nor does a 3x3 matrix.
What am I doing wrong here?
(Once I figure out what ksvm wants, then I can make sure that my test data can conform to that format in a sane/reasonable/mathematically sound way.)
If you think about how the support vector machine might "use" the kernel matrix, you'll see that you can't really do this in the way you're trying (as you've seen :-)
I actually struggled a bit with this when I first was using kernlab + a kernel matrix ... coincidentally, it was also for graph kernels!
Anyway, let's first realize that since the SVM doesn't know how to calculate your kernel function, it needs to have these values already calculated between your new (testing) examples, and the examples it picks out as the support vectors during the training step.
So, you'll need to calculate the kernel matrix for all of your examples together. You'll later train on some and test on the others by removing rows + columns from the kernel matrix when appropriate. Let me show you with code.
We can use the example code in the ksvm documentation to load our workspace with some data:
library(kernlab)
example(ksvm)
You'll need to hit return a few (2) times in order to let the plots draw, and let the example finish, but you should now have a kernel matrix in your workspace called K. We'll need to recover the y vector that it should use for its labels (as it has been trampled over by other code in the example):
y <- matrix(c(rep(1,60),rep(-1,60)))
Now, pick a subset of examples to use for testing
holdout <- sample(1:ncol(K), 10)
From this point on, I'm going to:
Create a training kernel matrix named trainK from the original K kernel matrix.
Create an SVM model from my training set trainK
Use the support vectors found from the model to create a testing kernel matrix testK ... this is the weird part. If you look at the code in kernlab to see how it uses the support vector indices, you'll see why it's being done this way. It might be possible to do this another way, but I didn't see any documentation/examples on predicting with a kernel matrix, so I'm doing it "the hard way" here.
Use the SVM to predict on these features and report accuracy
Here's the code:
trainK <- as.kernelMatrix(K[-holdout,-holdout]) # 1
m <- ksvm(trainK, y[-holdout], kernel='matrix') # 2
testK <- as.kernelMatrix(K[holdout, -holdout][,SVindex(m), drop=F]) # 3
preds <- predict(m, testK) # 4
sum(sign(preds) == sign(y[holdout])) / length(holdout) # == 1 (perfect!)
That should just about do it. Good luck!
Responses to comment below
what does K[-holdout,-holdout] mean? (what does the "-" mean?)
Imagine you have a vector x, and you want to retrieve elements 1, 3, and 5 from it, you'd do:
x.sub <- x[c(1,3,5)]
If you want to retrieve everything from x except elements 1, 3, and 5, you'd do:
x.sub <- x[-c(1,3,5)]
So K[-holdout,-holdout] returns all of the rows and columns of K except for the rows and columns we want to hold out.
What are the arguments of your as.kernelMatrix - especially the [,SVindex(m),drop=F] argument (which is particularly strange because it looks like that entire bracket is a matrix index of K)?
Yeah, I inlined two commands into one:
testK <- as.kernelMatrix(K[holdout, -holdout][,SVindex(m), drop=F])
Now that you've trained the model, you want to give it a new kernel matrix with your testing examples. K[holdout,] would give you only the rows which correspond to the testing examples in K, and all of the columns of K.
SVindex(m) gives you the indexes of your support vectors from your original training matrix -- remember, those rows/cols have holdout removed. So for those column indices to be correct (i.e. reference the correct sv column), I must first remove the holdout columns.
Anyway, perhaps this is more clear:
testK <- K[holdout, -holdout]
testK <- testK[,SVindex(m), drop=FALSE]
Now testK only has the rows of our testing examples and the columns that correspond to the support vectors. testK[1,1] will have the value of the kernel function computed between your first testing example, and the first support vector. testK[1,2] will have the kernel function value between your 1st testing example and the second support vector, etc.
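A quick sanity check on the shape (using the objects from the code above):
dim(testK)
# length(holdout) rows, and one column per support vector, i.e. length(SVindex(m))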
Update (2014-01-30) to answer comment from @wrahool
It's been a while since I've played with this, so the particulars of kernlab::ksvm are a bit rusty, but in principle this should be correct :-) ... here goes:
what is the point of testK <- K[holdout, -holdout] - aren't you removing the columns that correspond to the test set?
Yes. The short answer is that if you want to predict using a kernel matrix, you have to supply a matrix that has one row per new example and one column per support vector. For each row of the matrix (the new example you want to predict on), the values in the columns are simply the kernel function evaluated between that example and the corresponding support vector.
The call to SVindex(m) returns the index of the support vectors given in the dimension of the original training data.
So, first doing testK <- K[holdout, -holdout] gives me a testK matrix with the rows of the examples I want to predict on, and the columns are from the same examples (dimension) the model was trained on.
I further subset the columns of testK by SVindex(m) to only give me the columns which (now) correspond to my support vectors. Had I not done the first [, -holdout] selection, the indices returned by SVindex(m) may not correspond to the right examples (unless all N of your testing examples are the last N columns of your matrix).
Also, what exactly does the drop = FALSE condition do?
It's a bit of defensive coding to ensure that after the indexing operation is performed, the object that is returned is of the same type as the object that was indexed.
In R, if you index only one dimension of a 2D (or higher) object, you are returned an object of the lower dimension. I don't want to pass a numeric vector into predict because it wants a matrix.
For instance
x <- matrix(rnorm(50), nrow=10)
class(x)
[1] "matrix"
dim(x)
[1] 10 5
y <- x[, 1]
class(y)
[1] "numeric"
dim(y)
NULL
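With drop = FALSE the matrix structure is preserved:
y2 <- x[, 1, drop = FALSE]
class(y2)
[1] "matrix"
dim(y2)
[1] 10 1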
The same will happen with data.frames, etc.
First off, I have not used kernlab much. But simply looking at the docs, I do see working examples for the predict.ksvm() method. Copying and pasting, and omitting the prints to screen:
## example using the promotergene data set
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1],20)
genetrain <- promotergene[-ind, ]
genetest <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class~., data=genetrain, kernel="rbfdot",
             kpar=list(sigma=0.015), C=70, cross=4, prob.model=TRUE)
## predict gene type probabilities on the test set
genetype <- predict(gene,genetest,type="probabilities")
That seems pretty straightforward: use random sampling to generate a training set genetrain and its complement genetest, then fit via ksvm and call the predict() method using the fit and new data in a matching format. This is very standard.
You may find the caret package by Max Kuhn useful. It provides a general evaluation and testing framework for a variety of regression, classification and machine learning methods and packages, including kernlab, and contains several vignettes plus a JSS paper.
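For instance, a minimal caret sketch (the resampling settings are illustrative, not from the original answer):
library(caret)
library(kernlab)
data(promotergene)
# method = "svmRadial" wraps kernlab's ksvm; 4-fold CV mirrors the cross=4 above
fit <- train(Class ~ ., data = promotergene, method = "svmRadial",
             trControl = trainControl(method = "cv", number = 4))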
Steve Lianoglou is right.
In kernlab it is a bit weird: when predicting, it requires the input kernel matrix between each test example and the support vectors. You need to find this matrix yourself.
For example, a test matrix [n x m], where n is the number of test samples and m is the number of support vectors in the learned model (ordered in the sequence of SVindex(model)).
Example code
trmat <- as.kernelMatrix(kernels[trainidx,trainidx])
tsmat <- as.kernelMatrix(kernels[testidx,trainidx])
#training
model = ksvm(x=trmat, y=trlabels, type = "C-svc", C = 1)
#testing
thistsmat = as.kernelMatrix(tsmat[,SVindex(model)])
tsprediction = predict(model, thistsmat, type = "decision")
kernels is the input kernel matrix. trainidx and testidx are ids for training and test.
Build the labels yourself from the elements of the solution. Use this alternate predictor method, which takes a ksvm model (m) and the data in the original training format (d):
predict.alt <- function(m, d){
  sign(d[, m@SVindex] %*% m@coef[[1]] - m@b)
}
K is a kernelMatrix for training. For validation's sake, if you run predict.alt on the training data you will notice that the alternate predictor's values track the fitted values returned by ksvm (up to the -1/1 vs 0/1 label coding), while the native predictor behaves in an unexpected way:
library(dplyr)  # for sample_n
aux <- data.frame(fit=kout@fitted, native=predict(kout, K), alt=predict.alt(m=kout, d=as.matrix(K)))
sample_n(aux, 10)
    fit native alt
1     0      0  -1
100   1      0   1
218   1      0   1
200   1      0   1
182   1      0   1
87    0      0  -1
183   1      0   1
174   1      0   1
94    1      0   1
165   1      0   1
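To confirm that correspondence numerically (assuming the aux data frame from above):
table(fit = aux$fit, alt = aux$alt)
# fit 0 should pair only with alt -1, and fit 1 only with alt 1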
