Creating a prediction function for k-means in R

I want to create a predict function which predicts which cluster an observation belongs to.
data(iris)
mydata=iris
m=mydata[1:4]
train=head(m,100)
xNew=head(m,10)
rownames(train)<-1:nrow(train)
norm_eucl=function(train)
train/apply(train,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(train)
result=kmeans(m_norm,3,30)
predict.kmean <- function(cluster, newdata)
{
  simMat <- m_norm(rbind(cluster, newdata),
                   sel=(1:nrow(newdata)) + nrow(cluster))[1:nrow(cluster), ]
  unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.kmean(m_norm, x[result$cluster, ], xNew)
After that I get the error
Error in predict.kmean(m_norm, x[result$cluster, ], xNew) :
unused argument (xNew)
I understand that I am doing something wrong in the function, because I'm just learning this, but I can't understand where exactly.
In fact I want to adapt a similar function from apcluster (I had seen a similar topic, but for apcluster):
predict.apcluster <- function(s, exemplars, newdata)
{
  simMat <- s(rbind(exemplars, newdata),
              sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
  unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
How can I do this?

Rather than trying to replicate something, let's come up with our own function. For a given vector x, we want to assign a cluster using some prior k-means output. Given how the k-means algorithm works, what we want is to find which cluster's center is closest to x. That can be done as
predict.kmeans <- function(x, newdata)
apply(newdata, 1, function(r) which.min(colSums((t(x$centers) - r)^2)))
That is, we go over newdata row by row and compute the corresponding row's distance to each of the centers and find the minimal one. Then, e.g.,
head(predict(result, train / sqrt(rowSums(train^2))), 3)
# 1 2 3
# 2 2 2
all.equal(predict(result, train / sqrt(rowSums(train^2))), result$cluster)
# [1] TRUE
which confirms that our predicting function assigned all the same clusters to the training observations. Then also
predict(result, xNew / sqrt(rowSums(xNew^2)))
# 1 2 3 4 5 6 7 8 9 10
# 2 2 2 2 2 2 2 2 2 2
Notice also that I'm simply calling predict rather than predict.kmeans. That is because result is of class kmeans and the right method is chosen automatically. Also notice how I normalize the data in a vectorized manner, without using apply.
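Since new observations have to be normalized exactly like the training data before calling predict, it can be convenient to wrap the two steps together. A minimal sketch (the helper names here are my own, not part of the answer above):
# normalize each row to unit Euclidean length, as was done for the training data
normalize_rows <- function(m) m / sqrt(rowSums(m^2))
# hypothetical convenience wrapper around the predict.kmeans method defined above
assign_cluster <- function(kmeans_fit, newdata)
  predict(kmeans_fit, normalize_rows(as.matrix(newdata)))
assign_cluster(result, xNew)
# should agree with predict(result, xNew / sqrt(rowSums(xNew^2)))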

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations), especially one with a mean nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using for example modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model which estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is the sum of the probabilities of each data point belonging to each cluster; you can check this out in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's try to predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members, as sketched below.
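For instance, a quick way to see which components actually receive observations (a small sketch using the fit object from above):
counts <- table(factor(fit$classification, levels = seq_len(fit$G)))
counts              # the empty component shows up with a count of 0
which(counts > 0)   # the clusters you would keep for interpretation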
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical means, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans is assumed to hold a prior fuzzy k-means fit; its definition is not shown in this excerpt
plot(NULL, xlim=range(data), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
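As a rough sketch of that alternative (argument and component names are taken from my reading of the ClusterR documentation, so please double-check ?GMM and ?predict_GMM before relying on them):
library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)   # fit a 3-component univariate GMM
pred <- predict_GMM(as.matrix(data$value),
                    gmm_fit$centroids, gmm_fit$covariance_matrices, gmm_fit$weights)
table(pred$cluster_labels)   # hard assignments derived from the component probabilities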

Run DBSCAN against grouped coordinates

I'm attempting to run DBSCAN against some grouped coordinates in order to get sub-clusters. I've clustered some spatial data and I'd now like to further divide these regions according to the density of points within them. I think DBSCAN is probably the best way to do this.
My issue is that I can't figure out how to run DBSCAN against each cluster separately and then output the cluster assignment as a new column. Here's some sample data:
library(dplyr)
library(dbscan)
# Create sample data
df <- data.frame(
"ID"=1:200,
"X"=c(1.0083,1.3166,1.3072,1.1311,1.2984,1.2842,1.1856,1.3451,1.1932,1.0926,1.2464,1.3197,1.2331,1.2996,1.3482,
1.1944,1.2800,1.3051,1.4471,0.9068,1.3150,1.1846,1.0232,1.0005,1.0640,1.3177,1.1015,0.9598,1.0354,1.2203,
0.8388,0.8655,1.3387,1.0133,1.0106,1.1753,1.3200,1.0139,1.1511,1.3508,1.2747,1.3681,1.1074,1.2735,1.2245,
0.9695,1.3250,1.0537,1.2020,1.3093,0.9268,1.3244,1.2626,1.3123,1.2819,1.1063,0.8759,1.0063,1.0173,1.0187,
1.2396,1.0241,1.2619,1.2682,1.0008,1.0827,1.3639,1.3099,1.0004,0.8886,1.2359,1.1370,1.2783,1.0803,1.1918,
1.1156,1.3313,1.1205,1.0776,1.3895,1.3559,0.8518,1.1315,1.3521,1.2281,1.2589,0.9974,1.1487,1.4204,0.9998,
1.0154,1.0098,0.8851,1.0252,0.9331,1.2197,1.0084,1.2303,1.2808,1.3125,0.5500,0.6694,0.3301,0.3787,0.6492,
0.6568,0.6773,0.3769,0.6237,0.7265,0.5509,0.3579,0.7201,0.2631,0.3881,0.7596,0.3343,0.7049,0.3430,0.2951,
0.5483,0.7699,0.3806,0.6555,0.2524,0.4030,0.6329,0.5006,0.2701,0.0822,0.5442,0.5233,0.7105,0.5660,0.3962,
0.3187,0.3143,0.5673,0.3731,0.7310,0.6376,0.4864,0.8865,0.3352,0.7540,0.0690,0.7983,0.6990,0.4090,0.5658,
0.5636,0.5420,0.7223,0.6146,0.5648,0.2711,0.3422,0.7214,0.2196,0.2848,0.6496,0.7907,0.7418,0.7825,0.4550,
0.4361,0.7417,0.2661,0.8978,0.7875,0.2343,0.3853,0.6874,0.7761,0.2905,0.6092,0.5329,0.6189,0.0684,0.5726,
0.5740,0.7060,0.4609,0.3568,0.7037,0.2874,0.6200,0.7149,0.5100,0.7059,0.2520,0.3105,0.6870,0.7888,0.3674,
0.6514,0.7271,0.6679,0.3752,0.7067),
"Y"=c(-1.2547,-1.1499,-1.1803,-1.0626,-1.2877,-1.1151,-1.0958,-1.1339,-1.0808,-1.5461,-1.0775,-1.1431,-1.0499,
-1.1521,-1.1675,-1.0963,-1.1407,-1.1916,-1.1229,-1.2297,-1.1308,-1.0341,-1.3071,-1.2370,-1.5043,-1.1154,
-1.5452,-1.0349,-1.5412,-1.0348,-1.3620,-1.3776,-1.1830,-1.2552,-1.2354,-1.0838,-1.1214,-1.2396,-1.4937,
-1.0793,-1.1857,-1.0679,-1.5425,-1.1633,-1.1620,-1.0838,-1.0750,-1.3493,-1.4155,-1.1354,-1.0615,-1.1494,
-1.1620,-1.1582,-1.1800,-1.5230,-1.3019,-1.2484,-1.5490,-1.2435,-1.0487,-1.2330,-1.1234,-1.0924,-1.0702,
-1.0446,-1.1077,-1.1144,-1.2170,-1.2715,-1.1537,-1.5077,-1.1305,-1.3396,-1.2107,-1.5458,-1.1482,-1.1224,
-1.3690,-1.2058,-1.1685,-1.3400,-1.5033,-1.2152,-1.3805,-1.1439,-1.5183,-1.4288,-1.1252,-1.2330,-1.2511,
-1.5429,-1.3333,-1.1851,-1.1367,-1.3952,-1.1240,-1.2113,-1.1632,-1.1965,-0.9917,-0.7416,-0.7729,-1.1279,
-0.9323,-0.9372,-0.7013,-1.1746,-0.9191,-0.9356,-0.7873,-1.1957,-0.9838,-0.5825,-1.0738,-0.9302,-0.7713,
-0.9407,-0.7774,-0.8160,-0.9861,-1.0440,-0.9896,-0.6478,-0.8865,-1.0601,-1.0640,-0.9898,-0.5989,-0.7375,
-0.7689,-0.9799,-0.9147,-1.1048,-0.9735,-0.8591,-0.7913,-1.0085,-0.7231,-0.9688,-0.9272,-0.9395,-0.9494,
-0.7859,-1.0817,-0.7262,-0.9915,-0.9329,-1.0953,-1.0425,-1.0806,-1.0132,-0.8514,-1.0785,-1.1109,-0.8542,
-1.0849,-0.9665,-0.5940,-0.6145,-0.7830,-0.9601,-0.8996,-0.7717,-0.7447,-1.0406,-1.0067,-0.5710,-0.9839,
-1.0594,-0.7069,-1.1202,-0.9705,-1.0100,-0.6377,-1.0632,-0.9450,-0.9163,-0.7865,-1.0090,-1.1005,-1.0049,
-0.8042,-1.0781,-0.6829,-0.5962,-1.0759,-0.7918,-0.9732,-0.7353,-0.5615,-1.2002,-0.9295,-0.9944,-1.1570,
-0.9524,-0.9257,-0.9360,-1.1328,-0.7661),
"cluster"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2))
# How do you run DBSCAN against the points within each cluster?
I first thought I'd try to use the group_by function in dplyr but DBSCAN requires a data matrix input and group_by doesn't work for matrices.
matrix <- as.matrix(df[, -1])
set.seed(1234)
db = matrix %>%
group_by(cluster) %>%
dbscan(matrix, 0.4, 4)
#Error in UseMethod("group_by_") :
# no applicable method for 'group_by_' applied to an object of class "c('matrix', 'double', 'numeric')"
I've also tried using by() but get duplicate results for each cluster grouping, which isn't right:
by(data = df, INDICES = df$cluster, FUN = function(x) {
out <- dbscan(as.matrix(df[, c(2:3)]),eps=.0215,minPts=4)
})
#df$cluster: 1
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
#--------------------------------------------------------------------------
#df$cluster: 2
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
Can anyone point me in the right direction?
To be clear, dbscan::dbscan works fine on data.frame objects. You do not need to convert to a matrix. It returns an object that includes a cluster vector of the same length as the number of records in your input. The issue is that dplyr exposes variables to other functions as individual vectors, rather than as data.frame or matrix objects. You are free to do something like:
df %>%
group_by(cluster) %>%
mutate(
dbscan_cluster = dbscan::dbscan(
data.frame(X, Y),
eps = 0.0215,
minPts = 4
)[["cluster"]]
)
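Assigning the result back and cross-tabulating is a quick sanity check that each original cluster got its own sub-clustering (DBSCAN labels noise points as 0):
res <- df %>%
  group_by(cluster) %>%
  mutate(
    dbscan_cluster = dbscan::dbscan(data.frame(X, Y), eps = 0.0215, minPts = 4)[["cluster"]]
  )
table(res$cluster, res$dbscan_cluster)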
dplyr is not necessary; by() also works, you just need to supply a generic function rather than one that references the source object directly. Note that your data must already be ordered by cluster.
df$dbscan_cluster <- unlist(
by(
df,
INDICES = df$cluster,
function(x) dbscan::dbscan(x[,c(2,3)], eps = 0.0215, minPts = 4)[["cluster"]]
)
)
However, you can still get garbage results if you don't have a good way to pick your epsilon. You might consider using dbscan::optics instead.
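As a starting point for eps, one common heuristic is the sorted k-nearest-neighbour distance plot provided by the dbscan package; a quick sketch using the sample data from the question:
library(dbscan)                        # already loaded above
kNNdistplot(df[, c("X", "Y")], k = 4)  # k = minPts; look for the "knee" in the curve
abline(h = 0.0215, lty = 2)            # compare against the eps used above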

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulties understanding what the multiclass.roc parameters should look like.
Here a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?
The pROC package is not really designed to handle this case where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1) predictions:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
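If you do want one ROC curve per class, the one-vs-rest loop hinted at above could be sketched like this (assuming the columns of the probability matrix are named after the class levels):
library(pROC)
rocs <- lapply(levels(testing.logist$cut.rank), function(cls) {
  roc(response  = as.numeric(testing.logist$cut.rank == cls),  # this class vs. the rest
      predictor = mnm.predict.test.probs[, cls])
})
sapply(rocs, auc)  # one AUC per class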
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
window.index = (i - window.len + 1) : i
fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
style.weights[i,] = fit$coefficients
style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want:
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
models[[i]] <-
lm(formula=x~., data=data.frame(x=dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
lapply(1:100,
function(i) lm(formula=x~.,
data=data.frame(x=dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]
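Once the models are in a list, the same apply idiom lets you pull out whatever you need from all of them at once, for example:
# coefficients and R-squared for every model
coefs     <- lapply(models2, coef)
r.squared <- sapply(models2, function(m) summary(m)$r.squared)
head(r.squared)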

Local prediction modelling approach in R

I am trying to develop a local model (PLSR) which predicts a query sample from a model built on its 10 most similar samples, using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second-to-last line). The model is actually predicting something ("prd"), but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly - many thanks in advance for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also, x[[i]] isn't doing what you think it is - you usually need to supply a data frame to these modelling functions, though in this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is the convention in R when you either don't supply newdata or the object you supply makes no sense and doesn't contain the covariates required to generate predictions.
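Put back into the loop from the question, that fix would look something like this (a sketch that simply stores one prediction per query sample):
prd <- numeric(nrow(eDm))
for (i in 1:nrow(eDm)) {
  kni    <- head(order(eDm[, i]), 11)[-1]                          # 10 most similar samples
  pls1   <- plsr(y[kni, ] ~ x[kni, ], ncomp = 5, validation = "CV")
  prd[i] <- predict(pls1, ncomp = 5, newdata = x[i, , drop = FALSE])
}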
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose: to select the rows of the input data to fit the model with. You get nicer output from the models; the labels use y instead of y[kni, ], etc. Plus, this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ], which is not easy to do, for good reason!
