Better way to calculate Euclidean distance with R

I am trying to calculate the Euclidean distance for the iris dataset. Basically, I want to calculate the distance between each pair of objects. I have code that works as follows:
for (i in 1:nrow(iris)) {
  for (j in 1:nrow(iris)) {
    m[i, j] <- sqrt((iris[i, 1] - iris[j, 1])^2 +
                    (iris[i, 2] - iris[j, 2])^2 +
                    (iris[i, 3] - iris[j, 3])^2 +
                    (iris[i, 4] - iris[j, 4])^2)
  }
}
Although this works, I don't think this is a good way to write R-style code. I know that R has a built-in function to calculate Euclidean distances. Without using the built-in function, I would like to know better code (faster and with fewer lines) that does the same as mine.

The part inside the loop can be written as
m[i, j] <- sqrt(sum((iris[i, 1:4] - iris[j, 1:4])^2))
I'd keep the nested loop; there is nothing wrong with that here.
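Putting that together, a minimal sketch restricted to the four numeric columns (so the Species factor is left out):
m <- matrix(0, nrow(iris), nrow(iris))
for (i in 1:nrow(iris)) {
  for (j in 1:nrow(iris)) {
    m[i, j] <- sqrt(sum((iris[i, 1:4] - iris[j, 1:4])^2))
  }
}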

Or stay with the standard package stats:
m <- dist(iris[, 1:4])
This gives you an object of class dist, which stores the lower triangle (all you need) compactly. You can get an ordinary full symmetric matrix if, e.g., you want to look at some of its elements:
> as.matrix(m)[1:5,1:5]
1 2 3 4 5
1 0.0000000 0.5385165 0.509902 0.6480741 0.1414214
2 0.5385165 0.0000000 0.300000 0.3316625 0.6082763
3 0.5099020 0.3000000 0.000000 0.2449490 0.5099020
4 0.6480741 0.3316625 0.244949 0.0000000 0.6480741
5 0.1414214 0.6082763 0.509902 0.6480741 0.0000000
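If you want to avoid both the explicit loop and dist(), the same matrix can be computed with a little matrix algebra. A minimal sketch using the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y (m2 is just an illustrative name; pmax guards against tiny negative values from floating-point rounding):
X  <- as.matrix(iris[, 1:4])
sq <- rowSums(X^2)
m2 <- sqrt(pmax(outer(sq, sq, "+") - 2 * tcrossprod(X), 0))
# all.equal(m2, as.matrix(dist(iris[, 1:4])), check.attributes = FALSE) should be TRUE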

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model: it estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The (log-)likelihood of the model sums, over the data points, the log of the mixture density Σ_k π_k · N(x | μ_k, σ_k²); you can check mclust's publication for the details.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if a data point lies around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking, for each row, the column with the maximum probability. So, using the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
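You can confirm that equivalence yourself by taking the row-wise argmax of the probability matrix (a quick check on the same 1350:1400 inputs):
apply(predict(fit, 1350:1400)$z, 1, which.max)  # all 2, which should match $classification above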
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can simply keep only the clusters that actually received members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with spherical (equal-variance) components, consider using fuzzy k-means:
library(ClusterR)
# fit_kmeans was not defined in the excerpt; e.g. fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
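For example, a minimal sketch of that route; the call below is how I recall ClusterR's GMM() interface (a data matrix plus the number of components), so double-check it against the package documentation:
library(ClusterR)
gmm <- GMM(as.matrix(data$value), gaussian_comps = 3)  # 3 components chosen only for illustration
str(gmm)  # inspect the fitted means, variances and mixing weights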

Jaro-Winkler's difference between packages

I am using fuzzy matching to clean up medication data input by users, and I am using Jaro-Winkler's distance. I was testing which package's Jaro-Winkler implementation was faster when I noticed that the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1- stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
I am assuming it has to do with the weights, and I know I am using the defaults on both. However, if someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!
Documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf
Tucked away in the documentation for stringdist is the following:
The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.
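To make that concrete, here is the arithmetic for "advil" vs "advi", worked out under the quoted definition (4 matching characters out of lengths 5 and 4, no transpositions, common prefix l = 4, p = 0.1):
jaro <- (1/3) * (4/5 + 4/4 + 4/4)  # Jaro similarity, about 0.9333
d    <- 1 - jaro                   # Jaro distance, about 0.0667
d_jw <- d - 4 * 0.1 * d            # Winkler correction: d - l * p * d = 0.04
1 - d_jw                           # 0.96, matching jarowinkler("advil", "advi")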
However, in stringdist::stringdist, p = 0 by default. Hence:
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"),
method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
In fact that value is hard-coded in the source of RecordLinkage::jarowinkler.

Set time series vector lengths equal (resize/rescale them using linear interpolation)

I have a huge dataset of time series represented as vectors (no time labels available). Due to errors in the measuring process their lengths (as reported by length()) vary slightly (~10%), but each of them definitely describes a time interval of exactly two minutes. I would like to rescale/resize them and then calculate some statistics between them (so I need time series of equal lengths).
I need a very fast approach, and linear interpolation is a perfectly good choice for me because speed is the priority.
A simple example, rescaling a vector of length 5 into a vector of length 10:
input <- 0:4 # should be rescaled/resized into :
output <- c(0, .444, .888, 1.333, 1.777, 2.222, 2.666, 3.111, 3.555, 4)
I think the fastest approach is to create a matrix w ('w' for weights) with dimensions length(output) x length(input), so that w %*% input gives output (as a matrix object). If that is indeed the fastest way, how can I create such matrices w efficiently?
I think this could be enough:
resize <- function (input, len) approx(seq_along(input), input, n = len)$y
For example:
> resize(0:4, 10)
[1] 0.0000000 0.4444444 0.8888889 1.3333333 1.7777778 2.2222222 2.6666667 3.1111111 3.5555556 4.0000000
> resize( c(0, 3, 2, 1), 10)
[1] 0.000000 1.000000 2.000000 3.000000 2.666667 2.333333 2.000000 1.666667 1.333333 1.000000
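If you do want the weight-matrix formulation from the question (so that w %*% input gives the output), here is a hedged sketch; interp_matrix is a hypothetical helper, not an existing function:
interp_matrix <- function(n_in, n_out) {
  pos  <- seq(1, n_in, length.out = n_out)  # target positions on the input index scale
  lo   <- pmin(floor(pos), n_in - 1)        # left neighbour index (clamped so lo + 1 exists)
  frac <- pos - lo                          # weight given to the right neighbour
  w <- matrix(0, n_out, n_in)
  w[cbind(seq_len(n_out), lo)]     <- 1 - frac
  w[cbind(seq_len(n_out), lo + 1)] <- frac
  w
}
w <- interp_matrix(5, 10)
as.vector(w %*% 0:4)  # same values as resize(0:4, 10)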

Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset and then plot the number of principal components vs classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. SVM).
EDIT: My question is how to feed the output of PCA into a classifier.
data looks like this (cleaned data)
uncleaned original data looks like this
I will show you a small example of how to use the kpca function of the kernlab package.
I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set to show you how.
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the PCA you need to do:
kpc <- kpca(~., data=df[,-1], kernel="rbfdot", kpar=list(sigma=0.2), features=4)
which is the same way as you use it (with df in place of your data). However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably pick 100 principal components and then choose the first n of them according to the highest eigenvalues. Let's see this in my random example after running the previous code:
In order to see the eigen values:
> kpc@eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to choose the n components that have the highest values; a value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the svm below.)
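One quick way to do that check (a small helper computation, not part of the original answer) is to look at the cumulative share of variance among the components kpca returned:
round(cumsum(kpc@eig) / sum(kpc@eig), 3)  # keep enough components to reach, say, 0.9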
In order to access the principal components i.e. the PCA output you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix with 4 columns, i.e. the value of the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, and that is why you access the slot with @ in kpc@pcv.
You then need to use the above matrix to feed an SVM in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
And that's it! A very good explanation of PCA can be found here, in case you or anyone else reading this wants to find out more.
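Finally, since the question asks for a plot of the number of principal components vs classification accuracy, here is a hedged sketch of that loop on the toy data above (plain training accuracy only; on the real data you would substitute cross-validation for an honest estimate):
accs <- sapply(1:ncol(kpc@pcv), function(k) {
  pcs <- kpc@pcv[, 1:k, drop = FALSE]          # first k principal components
  fit <- svm(pcs, as.factor(y))                # e1071 SVM on those components
  mean(as.character(predict(fit, pcs)) == as.character(y))
})
plot(seq_along(accs), accs, type = "b",
     xlab = "number of principal components", ylab = "classification accuracy")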

Plotting results of several Bland-Altman analyses

I compared several diagnostic methods to a gold standard using Bland-Altman plots. Now I would like to graphically represent the difference in agreement between each method and the gold standard in one single plot. I'm trying to plot the means, confidence intervals and variances derived from the various Bland-Altman plots as horizontal boxplots, but I don't know how to do that.
I have a dataframe like this:
Method LCL mean UCL var
A -5 4 15 27
B -9 2 13 33
C -8 4 16 36
Thank you very much for your help!
Corrado
You need to realize that a "true" boxplot is a specific type of plot based on non-parametric statistics, none of which you have offered. If you want to call it something else, you are free to do so, and you can use the bxp function to do the plotting. You need to create a matrix with 5 rows and 3 columns holding the values for the whisker and box parameters. Perhaps you are thinking that the variance could be used to construct a standard deviation?
dat <- read.table(text="Method LCL mean UCL var
A -5 4 15 27
B -9 2 13 33
C -8 4 16 36
", header=TRUE)
dat$sdpd <- dat$mean + dat$var^0.5
dat$sdmd <- dat$mean - dat$var^0.5
dat
#------
Method LCL mean UCL var sdpd sdmd
1 A -5 4 15 27 9.196152 -1.196152
2 B -9 2 13 33 7.744563 -3.744563
3 C -8 4 16 36 10.000000 -2.000000
#----------
bxpm <- with(dat, t(matrix(c(LCL, sdmd, mean, sdpd, UCL), 3,5)))
bxpm
#----------
[,1] [,2] [,3]
[1,] -5.000000 -9.000000 -8
[2,] -1.196152 -3.744563 -2
[3,] 4.000000 2.000000 4
[4,] 9.196152 7.744563 10
[5,] 15.000000 13.000000 16
bxp(list(stats=bxpm, names=dat$Method),
    main="Not a real boxplot\nPerhaps a double dynamite plot?")
I can't provide you with working R code, as you didn't supply raw data (which are needed for boxplots), and it is not clear what you want to display: nothing indicates where your gold standard comes into play in the given aggregated data (are these repeated measurements with different instruments?), unless the reported means stand for the difference between the ith method and the reference method (in which case I don't see how you could use a boxplot). A basic plot of your data might look like this:
dfrm <- data.frame(method=LETTERS[1:3], lcl=c(-5,-9,-8),
mean=c(4,2,4), ucl=c(15,13,16), var=c(27,33,36))
# I use stripchart to avoid axis relabeling and casting of factor to numeric
# with default plot function
stripchart(mean ~ seq(1,3), data=dfrm, vertical=TRUE, ylim=c(-10,20),
group.names=levels(dfrm$method), pch=19)
with(dfrm, arrows(1:3, mean-lcl, 1:3, mean+lcl, angle=90, code=3, length=.1))
abline(h=0, lty=2)
However, I can recommend you to take a look at the MethComp package which will specifically help you in comparing several methods to a gold standard, with or without replicates, as well as in displaying results. The companion textbook is
Carstensen, B. Comparing Clinical Measurement Methods. John Wiley
& Sons Ltd 2010
Have you tried using R's boxplot() command?
I think by default it assumes you are supplying the raw data and specifying a factor with which to segment the data. It will compute its own bounds for the box, which may or may not correspond to what you are using. If you want to be able to easily fine-tune R graphics, and you have a little bit of time to learn, check out Hadley Wickham's ggplot2. It's powerful, flexible, and pretty!
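For example, a minimal ggplot2 sketch of the aggregated summary (reusing the dfrm from the answer above, so treat it as illustrative only):
library(ggplot2)
ggplot(dfrm, aes(x = method, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = lcl, ymax = ucl), width = 0.1) +
  geom_hline(yintercept = 0, linetype = "dashed")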
Good luck!
