How to get Principal Component Data in PAM in R - r

I create a graph using autoplot function using mtcars data and get graph like this
here my code:
library(cluster)
library(NbClust)
library(ggplot2)
library(ggfortify)
x <- mtcars
number.cluster <- NbClust(x, distance = "euclidean", min.nc = 1, max.nc = 5, method = "complete", index = "ch")
best.cluster <- as.numeric(number.cluster$Best.nc[1])
x.pam <- pam(x, best.cluster)
autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
my question is how do i get PC1 & PC2 data Coordinate based on this graph?
thank you

You can use layer_data() to get the data used for a ggplot object:
p <- autoplot(x.pam, data = x, frame = T) + ggtitle("PAM MTCARS")
layer_data(p, 1L) # coordinates of all points
layer_data(p, 2L) # coordinates of points that contribute to polygons

Your entire process is flawed. First you use complete linkage to estimate the number of clusters; but rather than using the "best" clustering found you then cluster again with PAM instead.
You use Euclidean distance, but in Euclidean space k-means will usually work better than PAM - PAM shines when you don't have Euclidean geometry and cannot use k-means.
And then you want to use this PCA plot, which is heavily distorted (almost the entire variance is in the first component, the y axis is visualizing pretty much random deviation). Just use PCA if you want these coordinates, not reconstruct this from the plot.

Related

Calculation of allowed space within monte carlo simulated data of 3 variables (cube in 3D coordinates)

I´m working on the topic of calculating the robust working range of a process. For this purpose I´m building models from DOE data and simulating data with a monte carlo approach. Filtering the data with a criteria for the response leads to a allowed space (see plots for better visualization).
In the example below, there are 3 variables and the goal is to calculate the biggest possible square (in parallel with the axis) within the allowed room. This would describe the working range of the process. The coding is just to get every variable in the same range (-1 to 1).
library(tidyverse)
library(MASS)
library(ggplot2)
library(gridExtra)
library(rgl)
df<-data.frame(
X1=runif(100,0,2),
X2=runif(100,10,30),
X3=runif(100,5,75))%>%
mutate(Y1=2*X1-2*X2+X3)
f1<-Y1~X1+X2+X3
model1<- lm(f1, data=df)
m.c <- NULL
n=10000
for (k in 1:n)
{
X1=runif(1,0,2)
X2=runif(1,10,30)
X3=runif(1,5,75)
m.c = rbind(m.c, data.frame(X1, X2, X3))
}
m.c_coded<-m.c%>%
mutate(predict1=predict(model1, newdata = .))%>%
mutate(X1=(X1-1/1))%>%
mutate(X2=(X2-20)/10)%>%
mutate(X3=(X3-40)/35)
Space<- m.c_coded%>%
filter(predict1<=0)
p1<-ggplot(Space)+
geom_point(aes(X1, X2))+
xlim(-1,1)+
ylim(-1,1)
p2<-ggplot(Space)+
geom_point(aes(X1, X3))+
xlim(-1,1)+
ylim(-1,1)
p3<-ggplot(Space)+
geom_point(aes(X2, X3))+
xlim(-1,1)+
ylim(-1,1)
grid.arrange(arrangeGrob(p1,p2,p3, nrow = 1), nrow = 1)
MODR_plot3D<-plot3d( x=Space$X1, y=Space$X2, z=Space$X3, type = "p",
xlim = (c(-1,1)), ylim(c(-1,1)), zlim = (c(-1,1))
)
There are specialized programms for that (DOE software) which can calculate this so called Design-space, but I want to implement it in my R skript. Sadly I do not have any idea, how I can calculate the position (edges) of this square. My approach would be to find the maximum distance to the surface on (center of the square).
Does anyone an idea how I can calculate this cube in a proper way? If possible I want to extend this also for the n-dimensional room.

Density based clustering that allows user to specify number of clusters

I have data that consists of roughly 100,000 points on a 2-d graph. Each point has X and Y coordinates. I'm looking for an algorithm that will cluster these points based on density but I want to specify the number of clusters.
I originally tried K-Means since this would allow me to specify the number of clusters. However, my data naturally "clumps" into ridges. K-Means would inevitably bisect some of these ridges. DBSCAN seems like a better fit simply due to the shape of my data, but with DBSCAN I can't specify the number of clusters I'd like.
Essentially what I'm trying to find is an algorithm that will optimally cluster the graph into N groups based on density. Where N is supplied by me. At this point I don't care where it's implemented (R, Python, FORTRAN...).
Any direction you can provide would be much appreciated.
In an area of high density, the points tend to be close together, so clustering on the (euclidian) distance may give similar results (not always).
For example, with these three normals in 2 dimensions:
x1 <- mnormt::rmnorm(200, c(10,10), matrix(c(20,0,0,.1), 2, 2))
x2 <- mnormt::rmnorm(100, c(10,20), matrix(c(20,0,0,.1), 2, 2))
x3 <- mnormt::rmnorm(300, c(23, 15), matrix(c(.1,0,0,35), 2, 2))
xx <- rbind(x1, x2, x3)
plot(xx, col=rep(c("grey10","pink2", "green4"), times=c(200,100,300)))
We can apply different clustering algorithms:
# hierarchical
clustering <- hclust(dist(xx,
method = "euclidian"),
method = "ward.D")
h.cl <- cutree(clustering, k=3)
# K-means and dbscan
k.cl <- kmeans(xx, centers = 3L)
d.cl <- dbscan::dbscan(xx, eps = 1)
And we see on this particular example, the hierarchical clustering and DBSCAN produced similar results, whereas K-means cut one of the clusters in a wrong way.
opar <- par(mfrow=c(3,1), mar = c(1,1,1,1))
plot(xx, col = k.cl$cluster, main="K-means")
plot(xx, col = d.cl$cluster, main="DBSCAN")
plot(xx, col = h.cl, main="Hierarchical")
par(opar)
Of course, there is no guarantee this will work on your particular data.

R, SOM, Kohonen Package, Outlier Detection

With SOM I experimented a little. First I used MiniSOM in Python but I was not impressed and changed to the kohonen package in R, which offers more features than the previous one. Basically, I applied SOM for three use cases: (1) clustering in 2D with generated data, (2) clustering with more-dimensional data: built-in wine data set, and (3) outlier detection. I solved all the three use cases but I would like to raise a question in connection with the outlier detection I applied. For this purpose I used the vector som$distances, which contains a distance for each row of the input data set. The values with excelling distances can be outliers. However, I do not know how this distance is computed. The package description (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) states for this metric : "distance to the closest unit".
Could you please tell how this distance is computed?
Could you please comment the outlier detection I used? How would you have done it? (In the generated data set it really finds the outliers. In
the real wine data set there are four relatively excelling values among the 177 wine sorts. See
the charts below. The idea that crossed my mind to use bar charts for depicting this I really like.)
Charts:
Generated data, 100 point in 2D in 5 distinct clusters and 2
outliers (Category 6 shows the outliers):
Distances shown for all the 102 data points, the last two ones are
the outliers which were correctly identified. I repeated the test
with 500, and 1000 data points and added solely 2 outliers. The
outliers were also found in those cases.
Distances for the real wine data set with potential outliers:
The row id of the potential outliers:
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
it produces the following output:
index value
59 59 12.22916
110 110 13.41211
121 121 15.86576
158 158 11.50079
My annotated code snippet:
data(wines)
scaled_wines <- scale(wines)
# creating and training SOM
som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
summary(som.wines)
#looking for outliers, dist = distance to the closest unit
som.wines$distances
len <- length(som.wines$distances)
index_in_vector <- c(1:len)
df_wine<-data.frame(cbind(index_in_vector, som.wines$distances))
colnames(df_wine) <-c("index", "value")
po <-ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som.wines$distances") + xlab("Number of Rows in the Data Set")
plot(po)
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
Further Code Samples
With regard to the discussion in the comments I also post the code snippets asked for. As far as I remember, the code lines responsible for clustering I constructed based on samples I found in the description of the Kohonen package (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). However, I am not completely sure, it was more than a year ago. The code is provided as is without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend to compare it with t-SNE on the wine data set (data(wines) available in R). Moreover, implement the heat-maps to demonstrate how the data with regard to individual variables are located. (In the case of the above example with 2 variables it is not important but it would be nice for the wine data set).
Data Generation with Five Clusters and 2 Outliers and Plotting
library(stats)
library(ggplot2)
library(kohonen)
generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
num_of_points_per_cluster <- num_of_points/num_of_clusters
cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
arr<-array()
standard_dev_y <- 6000
standard_dev_x <- 2
# for reproducibility setting the random generator
set.seed(10)
for (i in 1:num_of_clusters){
centroid_y <- runif(1, min=10000, max=200000)
centroid_x <- runif(1, min=20, max=70)
cat(sprintf("centroid_x = %s \n, centroid_y = %s", centroid_x, centroid_y ))
vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
cluster <- cbind(cluster, i)
arr <- rbind(arr, cluster)
}
if(outliers){
#adding two outliers
arr <- rbind(arr, c(10000, 30, 6))
arr <- rbind(arr, c(150000, 70, 6))
}
colnames(arr) <-c("y", "x", "Cluster")
# WA to remove the first NA row
arr <- na.omit(arr)
return(arr)
}
scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
df <- data.frame(data_in)
colnames(df) <-c("y", "x", "Cluster")
pl <- ggplot(data=df, aes(x = x,y=y)) + geom_point(aes(color=factor(df[, couloring_base_indx])))
pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
print(pl)
}
##################
# generating data
data <- generate_data(100, 5, TRUE)
print(data)
scatter_plot_data(data, couloring_base_indx<-3, "Original Clusters without Outliers \n 102 Points")
Preparation, Clustering and Plotting
I used the hierarchical clustering approach with the Kohonen Map (SOM).
normalising_data <- function(data){
# normalizing data points not the cluster identifiers
mtrx <- data.matrix(data)
umtrx <- scale(mtrx[,1:2])
umtrx <- cbind(umtrx, factor(mtrx[,3]))
colnames(umtrx) <-c("y", "x", "Cluster")
return(umtrx)
}
train_som <- function(umtrx){
# unsupervised learning
set.seed(7)
g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
#map<-som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
map<-som(umtrx[, 1:2], grid=g)
summary(map)
return(map)
}
plot_som_data <- function(map){
par(mfrow=c(3,2))
# to plot some charactristics of the SOM map
plot(map, type='changes')
plot(map, type='codes', main="Mapping Data")
plot(map, type='count')
plot(map, type='mapping') # how many data points are held by each neuron
plot(map, type='dist.neighbours') # the darker the colours are, the closer the point are; the lighter the colours are, the more distant the points are
#to switch the plot config to the normal
par(mfrow=c(1,1))
}
plot_disstances_to_the_closest_point <- function(map){
# to see which neuron is assigned to which value
map$unit.classif
#looking for outliers, dist = distance to the closest unit
map$distances
max(map$distances)
len <- length(map$distances)
index_in_vector <- c(1:len)
df<-data.frame(cbind(index_in_vector, map$distances))
colnames(df) <-c("index", "value")
po <-ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som$distances") + xlab("Number of Rows in the Data Set")
plot(po)
return(df)
}
###################
# unsupervised learning
umtrx <- normalising_data(data)
map<-train_som(umtrx)
plot_som_data(map)
#####################
# creating the dendogram and then the clusters for the neurons
dendogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
plot(dendogram)
clusters <- cutree(dendogram, 7)
clusters
length(clusters)
#visualising the clusters on the map
par(mfrow = c(1,1))
plot(map, type='dist.neighbours', main="Mapping Data")
add.cluster.boundaries(map, clusters)
Plots with the Clusters
You can also create nice heat-maps for selected variables but I had not implemented them for clustering with 2 variables it does not really make sense. If you implement it for the wine data set, please add the code and the charts to this post.
#see the predicted clusters with the data set
# 1. add the vector of the neuron ids to the data
mapped_neurons <- map$unit.classif
umtrx <- cbind(umtrx, mapped_neurons)
# 2. taking the predicted clusters and adding them the the original matrix
# very good description of the apply functions:
# https://www.guru99.com/r-apply-sapply-tapply.html
get_cluster_for_the_row <- function(x, cltrs){
return(cltrs[x])
}
predicted_clusters <- sapply (umtrx[,4], get_cluster_for_the_row, cltrs<-clusters)
mtrx <- cbind(mtrx, predicted_clusters)
scatter_plot_data(mtrx, couloring_base_indx<-4, "Predicted Clusters with Outliers \n 100 points")
See the predicted clusters below in case there were outliers and in case there were not.
I am not quite sure though, but I often find that the distance measurement of two objects reside in a particular dimensional space uses mostly Euclidean distance. For example, two points A and B in a two dimensional space having location of A(x=3, y=4) and B(x=6, y=8) are 5 distance unit apart. It is a result of performing calculation of squareroot((3-6)^2 + (4-8)^2). This is also applied to the data whose greater dimension, by adding trailing power of two of the difference of the two point's value in a particular dimension. If A(x=3, y=4, z=5) and B(x=6, y=8, z=7) then the distance is squareroot((3-6)^2 + (4-8)^2 + (5-7)^2), and so on. In kohonen, I think that after the model has finished the training phase, the algorithm then calculates the distances of each datum to all nodes and then assign it to the nearest node (a node which has the smallest distance to it). Eventually, the values inside the variable 'distances' returned by the model is the distance of every datum to its nearest node. One thing to note from your script is that the algorithm does not measure the distance directly from the original property values that the data have, because they have been scaled prior to feeding the data to the model. The distance measurement is applied to the scaled version of the data. The scaling is a standard procedure to eliminate the dominance of a variable on top of another.
I believe that your method is acceptable, because the values inside the 'distances' variable are the distance of each datum to its nearest node. So if a value of the distance between a datum and its nearest node is high, then this also means: the distance of the datum to other nodes are obviously much much higher.

Simplifying 3D points. R

I need to work with 3D data (spatial) very long tables with for coumns:
x, y, z, Value
There are too many data to be plotted with scatterplot3d or similar (rgl, lattice...)
I would like to reduce the number of data.
One idea could be to sample.
But I'd like to know how to reduce the data, getting new points that summarize the nearby points.
Is there any package to do it and work with this kind of data?
Something like creating a predefined 3D grid and averaging the points in each grid.
But I don't know whether it's better to choose the new points equidistants or just get their coordinates averaging the old ones locally. Or even weighting their final contribution with the distance to the new point.
Other issues:
The "optimal" grid could be tilted, but I don't know it beforehand.
I don't know if the grid should be extended a little bit beyond the data nor how much.
PD: I don't want to create surfaces nor wireframes nor adjust anything.
PD: I've checked spatial packages but as far as I see they are useful for data on a surface, such as the earth, but without height.
To reduce the size of the data set, have you thought about using a clustering methods such as kmeans or hierarchical clustering (hclust). These methods could reduce your data set down to a reasonable size. Be aware, if your data set is large enough these methods could still be too computational time consuming.
Seems like you might benefiit from fitting some sort of model to your data and then displaying the prediction on a resolution of your choice.
Here is an example of fitting with a GAM model:
library(sinkr) # https://github.com/marchtaylor/sinkr
library(mgcv)
library(rgl)
# make data ---------------------------------------------------------------
n <- 1000
x <- runif(n, min=-10, max=10)
y <- runif(n, min=-10, max=10)
z <- runif(n, min=-10, max=10)
value <- (-0.01*x^3 + -0.2*y^2 + -0.3*z^2) * rlnorm(n, 0, 0.1)
# fit model (GAM) ---------------------------------------------------------
fit <- gam(value ~ s(x) + s(y) + s(z))
plot.gam(fit, pages = 1)
This visualization is already helpful in understanding the 3d pattern of value, but you could also predict the values to a new grid. To visualize the prediction in 3d, the rgl package might be useful:
# predict to new grid -----------------------------------------------------
grd <- expand.grid(
x=seq(min(x), max(x),,10),
y=seq(min(y), max(y),,10),
z=seq(min(z), max(z),,10)
)
grd$value <- predict.gam(fit, newdata = grd)
# plot prediction with rgl ------------------------------------------------
# original data
plot3d(x, y, z, col=val2col(value, col=jetPal(100)))
rgl.snapshot("original.png")
# interpolated data
plot3d(grd$x, grd$y, grd$z, col=val2col(grd$value, col=jetPal(100)), alpha=0.5, size=5)
rgl.snapshot("points.png")
spheres3d(grd$x, grd$y, grd$z, col=val2col(grd$value, col=jetPal(100)), alpha=0.3, radius=1)
rgl.snapshot("spheres.png")
I've found the way to do it.
I'll post an example, just in case it's useful for others.
I write only two dimensions (and only working on the coordinates) to make it clear, but it can be generalized to higher dimensions and summarizing the functions at every coordinate).
set.seed(1)
xx <- runif(30,0,100); yy <- runif(30,0,100)
datos <- data.frame(xx,yy) #sample data
plot(xx,yy,pch=20) # 2D plot to visualize it.
n <- 4 # Same number of splits on every axis. Simple example.
rango <- function(ii){(max(ii)-min(ii))+0.000001}
renorm<- function(jj) {trunc(n*(jj-min(jj))/rango(jj))+1}
result <- aggregate(cbind(xx,yy)~renorm(xx) + renorm(yy),datos, mean)
points(result$xx,result$yy,pch=20, col="red")
abline(v=( min(xx) + (rango(xx)/n)*0:n) )
abline(h=( min(yy) + (rango(yy)/n)*0:n) )
Everything could be modified with na.rm=T
Maybe there are a simpler solutions with split, cut, dplyr, data.table, tapply...
I like this way more than fixing the new points coordinates at the center of every subregion because if you have only 1 point it keeps its original coordinates.
The +0.000000001 is to avoid the last point to move to a subregion further.
The full solution would have been:
aggregate(cbind(xx,yy,zz, Value)~renorm(xx)+renorm(yy)+renorm(zz),datos, mean)
And it could be further improved by weighting distances.

spatial clustering in R (simple example)

I have this simple data.frame
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
The idea is to find the spatial clusters based on the distance
First, I plot the map (lon,lat) :
plot(data$lon,data$lat)
so clearly I have three clusters based in the distance between the position of points.
For this aim, I've tried this code in R :
d= as.matrix(dist(cbind(data$lon,data$lat))) #Creat distance matrix
d=ifelse(d<5,d,0) #keep only distance < 5
d=as.dist(d)
hc<-hclust(d) # hierarchical clustering
plot(hc)
data$clust <- cutree(hc,k=3) # cut the dendrogram to generate 3 clusters
This gives :
Now I try to plot the same points but with colors from clusters
plot(data$x,data$y, col=c("red","blue","green")[data$clust],pch=19)
Here the results
Which is not what I'm looking for.
Actually, I want to find something like this plot
Thank you for help.
What about something like this:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
km <- kmeans(cbind(lat, lon), centers = 3)
plot(lon, lat, col = km$cluster, pch = 20)
Here's a different approach. First it assumes that the coordinates are WGS-84 and not UTM (flat). Then it clusters all neighbors within a given radius to the same cluster using hierarchical clustering (with method = single, which adopts a 'friends of friends' clustering strategy).
In order to compute the distance matrix, I'm using the rdist.earth method from the package fields. The default earth radius for this package is 6378.388 (the equatorial radius) which might not be what one is looking for, so I've changed it to 6371. See this article for more info.
library(fields)
lon = c(31.621785, 31.641773, 31.617269, 31.583895, 31.603284)
lat = c(30.901118, 31.245008, 31.163886, 30.25058, 30.262378)
threshold.in.km <- 40
coors <- data.frame(lon,lat)
#distance matrix
dist.in.km.matrix <- rdist.earth(coors,miles = F,R=6371)
#clustering
fit <- hclust(as.dist(dist.in.km.matrix), method = "single")
clusters <- cutree(fit,h = threshold.in.km)
plot(lon, lat, col = clusters, pch = 20)
This could be a good solution if you don't know the number of clusters (like the k-means option), and is somewhat related to the dbscan option with minPts = 1.
---EDIT---
With the original data:
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
data=data.frame(lat,lon)
dist <- rdist.earth(data,miles = F,R=6371) #dist <- dist(data) if data is UTM
fit <- hclust(as.dist(dist), method = "single")
clusters <- cutree(fit,h = 1000) #h = 2 if data is UTM
plot(lon, lat, col = clusters, pch = 20)
As you have a spatial data to cluster, so DBSCAN is best suited for you data.
You can do this clustering using dbscan() function provided by fpc, a R package.
library(fpc)
lat<-c(1,2,3,10,11,12,20,21,22,23)
lon<-c(5,6,7,30,31,32,50,51,52,53)
DBSCAN <- dbscan(cbind(lat, lon), eps = 1.5, MinPts = 3)
plot(lon, lat, col = DBSCAN$cluster, pch = 20)

Resources