Visualization for high-dimensional points in R

I have a centroid, e.g., A, and 100 other points. All of these points are high-dimensional, e.g., 1000 dimensions. Is there a way to visualize these points in a two-dimensional space in terms of their distances from A?

A common (though simple) way to visualize high-dimensional points in low dimensional space is to use some form of multi-dimensional scaling:
dat <- matrix(runif(1000*99),99,1000)
#Combine with "special" point
dat <- rbind(rep(0.1,1000),dat)
out <- cmdscale(dist(dat),k = 2)
#Plot everything, highlighting our "special" point
plot(out)
points(out[1,1],out[1,2],col = "red")
You can also check out isoMDS or sammon in the MASS package for other implementations in R.
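For example, a non-metric alternative using isoMDS could look like this (just a sketch, reusing dat from the block above):
library(MASS)
# Non-metric MDS on the same distance matrix
nm <- isoMDS(dist(dat), k = 2)
plot(nm$points)
points(nm$points[1, 1], nm$points[1, 2], col = "red")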

The distance (by which I assume you mean the norm of the difference vector) is only a single value per point, so you can calculate these norms and show them on a 1D plot; for a 2D plot you'll need a second parameter.
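For instance, reusing the simulated dat from the answer above (row 1 plays the role of the centroid A), a sketch of that 1D view:
# Euclidean distance from each point to the centroid A (row 1 of dat)
d_to_A <- sqrt(colSums((t(dat) - dat[1, ])^2))
# A simple 1D view of those distances
stripchart(d_to_A, method = "jitter", pch = 20, xlab = "distance to A")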

Related

DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
The algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The resulting kNN distance plot shows the 'knee' used to identify the eps value. Since there are 17 attributes to be considered for clustering, I have taken k = 17 * 2 = 34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius of your center points (given by the epsilon parameter)
Decrease the minimum number of points (minPts) to define a center point.
I would start by decreasing the minPts parameter, since it seems very high; if DBSCAN cannot find that many points within the given radius, it will not group any points into a cluster.
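For instance (a sketch only; the eps and minPts values below are placeholders that would still need tuning against the kNN distance plot):
# Re-check the kNN distance plot for a smaller k, then re-run DBSCAN
dbscan::kNNdistplot(data1, k = 10)
db2 <- dbscan::dbscan(data1, eps = 2, minPts = 10)
db2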
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster, while k-means will simply partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, you may have very skewed variables that should be transformed first. You could also try non-linear embedding before clustering.
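As a rough sketch of those last suggestions (it assumes you kept an unscaled copy of the cleaned data, here called data1_raw, with non-negative columns; the eps value is a placeholder):
# Log-transform skewed, non-negative variables before scaling, then cluster again
data1_log <- scale(log1p(data1_raw))
dbscan::kNNdistplot(data1_log, k = 34)
db_log <- dbscan::dbscan(data1_log, eps = 1.5, minPts = 34)
db_log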

Coordinates of greatest kernel density for trivariate KDE

I have calculated the kernel density of a 3-column matrix in R using the following code:
library(ks)   # kde() comes from the ks package
ss <- read.table("data.csv", header = TRUE, sep = ",")
x <- ss[, 1]
y <- ss[, 2]
z <- ss[, 3]
ssmat <- cbind(x, y, z)
rp <- kde(ssmat)
plot(rp)
What I need now are the (x,y,z) coordinates of the point of maximum kernel density. Based on an answer provided on the R-help list, I understand that the kde() function estimates the joint density of the three variables as a fourth dimension, which is represented in the 3D plot by shading to indicate areas of greater point density. So in effect I am trying to locate the maximum value of this "fourth" dimension. I suspect that this is a relatively simple problem, but I haven't been able to find the answer. Any ideas?
You can extract the max value from the info returned from kde. To see all the stuff returned, use str(rp).
## Get the indices
inds <- which(abs(rp$estimate - max(rp$estimate)) < 1e-10, arr.ind=T)
xyz <- mapply(function(a, b) a[b], rp$eval.points, inds)
## Add it to the 3D plot (points3d() comes from the rgl package)
library(rgl)
plot(rp)
points3d(x=xyz[1], y=xyz[2], z=xyz[3], size=20, col="blue")
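An equivalent way to get the same indices, assuming the maximum is unique, is which.max() combined with arrayInd():
## Same idea via which.max/arrayInd (assumes a unique maximum)
inds2 <- arrayInd(which.max(rp$estimate), dim(rp$estimate))
xyz2 <- mapply(function(a, b) a[b], rp$eval.points, inds2)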

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation compared to the reference points
# (data should be very similar, but with a little noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield a figure showing the reference points ("starry sky") as open circles, the start position of the "Polaroid" in red, its corrected position in blue, and the dotted translation vector connecting them.
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find the position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history), which used optim() to minimize the sum of distances between points, only works in the limited circumstance where the point distribution used as reference data is in the middle of the point field. The solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm: it calculates the difference between each point in the field and a single point of the reference data, and then determines how many of the remaining reference points fall within a minimum threshold (needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for (i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[, 1]
  y2 <- my_coords_start[, 2]
  x1 <- ref_coords[, 1] + (x2[1] - ref_coords[i, 1])
  y1 <- ref_coords[, 2] + (y2[1] - ref_coords[i, 2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist(cbind(c(x1, x2), c(y1, y2)), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1 * 1:length(x1), ]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best so far, store a new dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i, 1])
    dy <- (y2[1] - ref_coords[i, 2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.

spatial filtering by proximity in R

I have occurrence points for a species, and I'd like to remove potential sampling bias (where some regions might have a much greater density of points than others). One way to do this would be to select a maximal subset of points that are no less than a certain distance X from each other. Essentially, I would prevent points from being too close to each other.
Are there any existing R functions to do this? I've searched through various spatial packages, but haven't found anything, and can't figure out exactly how to implement this myself.
An example occurrence point dataset can be downloaded here.
Thanks!
I've written a new version of this function that no longer really follows rMaternII.
The input can either be a SpatialPoints, SpatialPointsDataFrame or matrix object.
Seems to work well, but suggestions welcome!
filterByProximity <- function(xy, dist, mapUnits = F) {
  # xy can be either a SpatialPoints or SPDF object, or a matrix
  # dist is in km if mapUnits = F, in map units otherwise
  if (!mapUnits) {
    d <- spDists(xy, longlat = T)
  }
  if (mapUnits) {
    d <- spDists(xy, longlat = F)
  }
  diag(d) <- NA
  close <- (d <= dist)
  diag(close) <- NA
  closePts <- which(close, arr.ind = T)
  discard <- matrix(nrow = 2, ncol = 2)
  if (nrow(closePts) > 0) {
    while (nrow(closePts) > 0) {
      if ((!paste(closePts[1, 1], closePts[1, 2], sep = '_') %in% paste(discard[, 1], discard[, 2], sep = '_')) &
          (!paste(closePts[1, 2], closePts[1, 1], sep = '_') %in% paste(discard[, 1], discard[, 2], sep = '_'))) {
        discard <- rbind(discard, closePts[1, ])
        closePts <- closePts[-union(which(closePts[, 1] == closePts[1, 1]), which(closePts[, 2] == closePts[1, 1])), ]
      }
    }
    discard <- discard[complete.cases(discard), ]
    return(xy[-discard[, 1], ])
  }
  if (nrow(closePts) == 0) {
    return(xy)
  }
}
Let's test it:
require(rgeos)
require(sp)
pts <- readWKT("MULTIPOINT ((3.5 2), (1 1), (2 2), (4.5 3), (4.5 4.5), (5 5), (1 5))")
pts2 <- filterByProximity(pts,dist=2, mapUnits=T)
plot(pts)
axis(1)
axis(2)
apply(as.data.frame(pts),1,function(x) plot(gBuffer(SpatialPoints(coords=matrix(c(x[1],x[2]),nrow=1)),width=2),add=T))
plot(pts2,add=T,col='blue',pch=20,cex=2)
There is also an R package called spThin that performs spatial thinning on point data. It was developed for reducing the effects of sampling bias for species distribution models, and does multiple iterations for optimization. The function is quite easy to implement---the vignette can be found here. There is also a paper in Ecography with details about the technique.
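If I recall the interface correctly (please check the vignette to confirm), a call looks roughly like the sketch below; the occ_df data frame and its column names are assumptions about your data:
library(spThin)
# Sketch: keep occurrences at least 10 km apart; "LAT", "LONG" and "SPECIES"
# are assumed column names in an occurrence data frame occ_df
thinned <- thin(loc.data = occ_df,
                lat.col = "LAT", long.col = "LONG", spec.col = "SPECIES",
                thin.par = 10, reps = 10,
                locs.thinned.list.return = TRUE,
                write.files = FALSE, write.log.file = FALSE)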
Following Josh O'Brien's advice, I looked at spatstat's rMaternI function, and came up with the following. It seems to work pretty well.
The distance is in map units. It would be nice to incorporate one of R's distance functions that always returns distances in meters, rather than input units, but I couldn't figure that out...
require(spatstat)
require(maptools)
occ <- readShapeSpatial('occurrence_example.shp')
filterByProximity <- function(occ, dist) {
  pts <- as.ppp.SpatialPoints(occ)
  d <- nndist(pts)
  z <- which(d > dist)
  return(occ[z, ])
}
occ2 <- filterByProximity(occ,dist=0.2)
plot(occ)
plot(occ2,add=T,col='blue',pch=20)
Rather than removing data points, you might consider spatial declustering. This involves giving points in clusters a lower weight than outlying points. The two simplest ways to do this involve a polygonal segmentation, like a Voronoi diagram, or some arbitrary grid. Both methods will weight points in each region according to the area of the region.
For example, if we take the points in your test (1,1),(2,2),(4.5,4.5),(5,5),(1,5) and apply a regular 2-by-2 mesh, where each cell is three units on a side, then the five points fall into three cells. The points ((1,1),(2,2)) falling into the cell [0,3]X[0,3] would each have weights 1/( no. of points in current cell TIMES tot. no. of occupied cells ) = 1 / ( 2 * 3 ). The same thing goes for the points ((4.5,4.5),(5,5)) in the cell (3,6]X(3,6]. The "outlier", (1,5) would have a weight 1 / ( 1 * 3 ). The nice thing about this technique is that it is a quick way to generate a density based weighting scheme.
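A minimal sketch of that grid weighting in R, using the five points and the 1 / (points in cell * occupied cells) rule from the example above:
pts <- data.frame(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5))
# Assign each point to a 3-unit-wide grid cell
cell <- paste(ceiling(pts$x / 3), ceiling(pts$y / 3))
n_in_cell <- table(cell)
n_cells <- length(n_in_cell)
# Weight = 1 / (points in the same cell * number of occupied cells)
pts$w <- 1 / (as.numeric(n_in_cell[cell]) * n_cells)
pts   # weights 1/6, 1/6, 1/6, 1/6, 1/3, as in the example above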
A polygonal segmentation involves drawing a polygon around each point and using the area of that polygon to calculate the weight. Generally, the polygons completely cover the entire region, and the weights are calculated as the inverse of the area of each polygon. A Voronoi diagram is usually used for this, but polygonal segmentations may be calculated using other techniques, or may be specified by hand.
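And a sketch of the polygon-based version using the deldir package, continuing from the pts data frame in the grid sketch above (the 0-6 bounding window is an assumption; weights are the inverse tile areas):
library(deldir)
# Voronoi (Dirichlet) tessellation, clipped to an assumed 0-6 window
vd <- deldir(pts$x, pts$y, rw = c(0, 6, 0, 6))
# Dirichlet tile areas are in the summary; weight = 1 / area
w_voronoi <- 1 / vd$summary$dir.area
w_voronoi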

What techniques exists in R to visualize a "distance matrix"?

I wish to present a distance matrix in an article I am writing, and I am looking for good visualization for it.
So far I came across balloon plots (I used them here, but I don't think they will work in this case), heatmaps (here is a nice example, but they don't seem to allow presenting the numbers in the table; correct me if I am wrong. Maybe half the table in colors and half with numbers would be cool), and lastly correlation ellipse plots (here is some code and an example; using a shape is cool, but I am not sure how to apply it here).
There are also various clustering methods but they will aggregate the data (which is not what I want) while what I want is to present all of the data.
Example data:
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist(nba[1:20, -1])
I am open for ideas.
You could also use force-directed graph drawing algorithms to visualize a distance matrix, e.g.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist_m <- as.matrix(dist(nba[1:20, -1]))
dist_mi <- 1/dist_m # one over, as qgraph takes similarity matrices as input
library(qgraph)
jpeg('example_forcedraw.jpg', width=1000, height=1000, unit='px')
qgraph(dist_mi, layout='spring', vsize=3)
dev.off()
Tal, this is a quick way to overlay text on a heatmap. Note that this relies on image rather than heatmap, as the latter offsets the plot, making it more difficult to put the text in the correct position.
To be honest, I think this graph shows too much information, making it a bit difficult to read; you may want to print only specific values.
Also, another quick option is to save your graph as a PDF, import it into Inkscape (or similar software), and manually add the text where needed.
Hope this helps
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dst <- dist(nba[1:20, -1])
dst <- data.matrix(dst)
dim <- ncol(dst)
image(1:dim, 1:dim, dst, axes = FALSE, xlab="", ylab="")
axis(1, 1:dim, nba[1:20,1], cex.axis = 0.5, las=3)
axis(2, 1:dim, nba[1:20,1], cex.axis = 0.5, las=1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", dst), cex=0.6)
A Voronoi Diagram (a plot of a Voronoi Decomposition) is one way to visually represent a Distance Matrix (DM).
They are also simple to create and plot using R--you can do both in a single line of R code.
If you're not familiar with this aspect of computational geometry, the relationship between the two (VD & DM) is straightforward, though a brief summary might be helpful.
Distance Matrices--i.e., a 2D matrix showing the distance between a point and every other point, are an intermediate output during kNN computation (i.e., k-nearest neighbor, a machine learning algorithm which predicts the value of a given data point based on the weighted average value of its 'k' closest neighbors, distance-wise, where 'k' is some integer, usually between 3 and 5.)
kNN is conceptually very simple--each data point in your training set is in essence a 'position' in some n-dimensional space, so the next step is to calculate the distance between each point and every other point using some distance metric (e.g., Euclidean, Manhattan, etc.). While the training step--i.e., constructing the distance matrix--is straightforward, using it to predict the value of new data points is practically encumbered by the data retrieval--finding the closest 3 or 4 points from among several thousand or several million scattered in n-dimensional space.
Two data structures are commonly used to address that problem: kd-trees and Voronoi decompositions (aka "Dirichlet tessellation").
A Voronoi decomposition (VD) is uniquely determined by a distance matrix--i.e., there's a 1:1 map; so indeed it is a visual representation of the distance matrix, although again, that's not their purpose--their primary purpose is the efficient storage of the data used for kNN-based prediction.
Beyond that, whether it's a good idea to represent a distance matrix this way probably depends most of all on your audience. To most, the relationship between a VD and the antecedent distance matrix will not be intuitive. But that doesn't make it incorrect--if someone without any statistics training wanted to know if two populations had similar probability distributions and you showed them a Q-Q plot, they would probably think you haven't engaged their question. So for those who know what they are looking at, a VD is a compact, complete, and accurate representation of a DM.
So how do you make one?
A Voronoi decomp is constructed by selecting (usually at random) a subset of points from within the training set (this number varies by circumstances, but if we had 1,000,000 points, then 100 is a reasonable number for this subset). These 100 data points are the Voronoi centers ("VC").
The basic idea behind a Voronoi decomp is that rather than having to sift through the 1,000,000 data points to find the nearest neighbors, you only have to look at these 100; once you find the closest VC, your search for the actual nearest neighbors is restricted to just the points within that Voronoi cell. Next, for each data point in the training set, calculate the VC it is closest to. Finally, for each VC and its associated points, calculate the convex hull--conceptually, just the outer boundary formed by that VC's assigned points that are farthest from the VC. This convex hull around the Voronoi center forms a "Voronoi cell." A complete VD is the result of applying those three steps to each VC in your training set, which gives you a complete tessellation of the surface.
To calculate a VD in R, use the tripack package. The key function is 'voronoi.mosaic' to which you just pass in the x and y coordinates separately--the raw data, not the DM--then you can just pass voronoi.mosaic to 'plot'.
library(tripack)
plot(voronoi.mosaic(runif(100), runif(100), duplicate="remove"))
You may want to consider looking at a 2-d projection of your matrix (Multi Dimensional Scaling). Here is a link to how to do it in R.
Otherwise, I think you are on the right track with heatmaps. You can add in your numbers without too much difficulty. For example, building off of the Learn R example:
library(ggplot2)
library(plyr)
library(arm)
library(reshape2)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
nba.m <- melt(nba)
nba.m <- ddply(nba.m, .(variable), transform,
rescale = rescale(value))
(p <- ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "white",
high = "steelblue")+geom_text(aes(label=round(rescale,1))))
A dendrogram based on a hierarchical cluster analysis can be useful:
http://www.statmethods.net/advstats/cluster.html
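For example, with the same NBA data as above (a quick sketch):
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
# Hierarchical clustering of the same distance matrix, shown as a dendrogram
hc <- hclust(dist(nba[1:20, -1]), method = "ward.D2")
plot(hc, labels = nba[1:20, 1], main = "Clustering of the distance matrix")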
A 2-D or 3-D multidimensional scaling analysis in R:
http://www.statmethods.net/advstats/mds.html
If you want to go into 3+ dimensions, you might want to explore ggobi / rggobi:
http://www.ggobi.org/rggobi/
In the book "Numerical Ecology" by Borcard et al. 2011 they used a function called *coldiss.r *
you can find it here: http://ichthyology.usm.edu/courses/multivariate/coldiss.R
it color codes the distances and even orders the records by dissimilarity.
another good package would be the seriation package.
Reference:
Borcard, D., Gillet, F. & Legendre, P. (2011) Numerical Ecology with R. Springer.
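As a quick sketch of the seriation option: dissplot() draws a reordered, shaded dissimilarity matrix (using the NBA distances from the question).
library(seriation)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
d <- dist(nba[1:20, -1])
# Reordered dissimilarity plot; dark blocks along the diagonal suggest clusters
dissplot(d)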
A solution using Multidimensional Scaling
data = read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
dst = tcrossprod(as.matrix(data[,-1]))
dst = matrix(rep(diag(dst), 50L), ncol = 50L, byrow = TRUE) +
matrix(rep(diag(dst), 50L), ncol = 50L, byrow = FALSE) - 2*dst
library(MASS)
mds = isoMDS(dst)
#remove {type = "n"} to see dots
plot(mds$points, type = "n", pch = 20, cex = 3, col = adjustcolor("black", alpha = 0.3), xlab = "X", ylab = "Y")
text(mds$points, labels = rownames(data), cex = 0.75)
