3-D point matching/clustering in R

I would like to match points in 3-dimensional space.
Therefore, I am using the Hungarian Method described in this question: Finding the best matching pairwise points from 2 vectors
Here is my example using R:
# packages
library(rgl)
library(clue)
library(plyr)
library(fields)
set.seed(1)
a <- c(rep(2,7), 3,4,5,6,3,4,5,6,7,7,7,7,7,7) # x values
b <- c(rep(3,7),3,3,3,3, 3,3,3,3,3,3,3,3,3,3) # y values
c <- c(seq(1,7),1,1,1,1,7,7,7,7,1,2,3,4,5,6) # z values
# transform the points
set.seed(2)
a1 <- a + seq(1,length(a))
b1 <- b + 8
c1 <- c + 9
# plot the data
plot3d(a,b,c, col="red", pch=16,size=10)
plot3d(a1,b1,c1, lwd=10, col="blue", pch=16,size=10, add=TRUE)
# run the Hungarian Method
A <- cbind(a,b,c)
B <- cbind(a1,b1,c1)
distances <- rdist(A,B) # calculate Euclidean Distance between points
min.dist <- solve_LSAP(distances) # minimizing the sum of distance
min.dist.num <- as.numeric(min.dist)
# plot the minimized lines between point sets
for (ii in 1:dim(B)[1]) {
  D <- c(A[ii,1], B[min.dist.num[ii],1])
  R <- c(A[ii,2], B[min.dist.num[ii],2])
  W <- c(A[ii,3], B[min.dist.num[ii],3])
  segments3d(D, R, W, col=2, lwd=1)
}
# calculate the share of points matched correctly (in percent)
sum(1:dim(B)[1] == min.dist.num) / dim(B)[1] * 100
The problem here is that only 5% of the points are matched correctly (see last line of the code). In my view, the main trouble is that the algorithm does not take the structure of the object (a square) into account.
Question: Is there any method that performs better for this sample data?
In my original data, the spatial structure of the points is far more complicated: I have a cloud of data, and within this cloud there are multiple subfigures.
I am primarily seeking a solution in R, but other implementations (e.g. MATLAB, Excel, Java) are also welcome.

Related

How to visualize a distance matrix on a map by the thickness or color of the lines connecting adjacent localities in R?

Suppose I have two datasets: (1) a data frame with the coordinates of localities, each with an ID; and (2) a linguistic distance matrix reflecting the linguistic distance between these localities.
# My data are similar to this structure
# dataframe
id <- c("A","B","C","D","E")
x_coor <- c(0.5,1,1,1.5,2)
y_coor <- c(5.5,3,7,6.5,5)
my.data <- data.frame(id = id, x_coor = x_coor, y_coor = y_coor)
# linguistic distance matrix
           A          B          C          D
B 308.298557
C 592.555483 284.256926
D 141.421356 449.719913 733.976839
E 591.141269 282.842712   1.414214 732.562625
Now, I want to visualize the linguistic distance between every two sites on a map by the thickness or color of the lines connecting adjacent localities in R, like this:
[Example image: localities on a map connected by lines whose thickness/color encodes linguistic distance.]
My idea is to generate the Delaunay triangulation with the deldir or tripack package in R.
# generate delaunay triangulation
library(deldir)
de <- deldir(my.data$x_coor, my.data$y_coor)
plot.deldir(de, wlines="triang", col='blue', wpoints="real", cex=0.1)
text(my.data$x_coor,my.data$y_coor,my.data$id)
This is the plot:
[Plot: the Delaunay triangulation of the five localities.]
My question is: how can I reflect the linguistic distance in the thickness or color of the triangle edges? Is there any better method?
Thank you very much!
What you want to do in respect of the line widths can be done "fairly
easily" by the deldir package. You simply call plot.deldir() with the
appropriate value of "lw" (line width).
At the bottom of this answer is a demonstration script "demo.txt" which shows how to do this in the case of your example. In particular this script shows
how to obtain the appropriate value of lw from the "linguistic distance
matrix". I had to make some adjustments in the way this matrix was
presented. I.e. I had to convert it into a proper matrix.
I have rescaled the distances to lie between 0 and 10 to obtain the
corresponding values of the line widths. You might wish to rescale in a different manner.
In respect of colours, there are two issues:
(1) It is not at all clear how you would like to map the "linguistic
distances" to colours.
(2) Unfortunately the code for plot.deldir() is written in a very
kludgy way, whence the "col" argument to segments() cannot be
appropriately passed on in the same manner that the "lw" argument can.
(I wrote the plot.deldir() code a long while ago, when I knew far less about
R programming than I know now! :-))
I will adjust this code and submit a new version of deldir to CRAN
fairly soon.
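In the meantime, a possible workaround for colours (a sketch only, not part of the demo script, and it reuses the objects ldm, dxy and demoDat constructed in the demo script below) is to draw the triangulation edges yourself with segments(), mapping distances to a palette:
# Sketch: colour the Delaunay edges by linguistic distance.
ind  <- as.matrix(dxy$delsgs[, c("ind1","ind2")])
dvec <- ldm[ind]                                       # distance for each edge
cols <- heat.colors(10)[cut(dvec, 10, labels=FALSE)]   # bin distances into 10 colours
plot(dxy, col=0)                                       # empty frame (same kludge as below)
with(dxy$delsgs, segments(x1, y1, x2, y2, col=cols, lwd=2))
with(demoDat, text(x, y, id))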
#
# Demo script
#
# Present the linguistic distances in a useable way.
vldm <- c(308.298557, 592.555483, 284.256926, 141.421356, 449.719913,
          733.976839, 591.141269, 282.842712, 1.414214, 732.562625)
ldm <- matrix(nrow=5,ncol=5)
ldm[row(ldm) > col(ldm)] <- vldm
ldm[row(ldm) <= col(ldm)] <- 0
ldm <- (ldm + t(ldm))/2
rownames(ldm) <- LETTERS[1:5]
colnames(ldm) <- LETTERS[1:5]
# Set up the example data. It makes life much simpler if
# you denote the "x" and "y" coordinates by "x" and "y"!!!
id <- c("A","B","C","D","E")
x_coor <- c(0.5,1,1,1.5,2)
y_coor <- c(5.5,3,7,6.5,5)
# Eschew nomenclature like "my.data". Such nomenclature
# is Micro$oft-ese and is an abomination!!!
demoDat <- data.frame(id = id, x = x_coor, y = y_coor)
# Form the triangulation/tessellation.
library(deldir)
dxy <- deldir(demoDat)
# Plot the triangulation with line widths proportional
# to "linguistic distances". Note that plot.deldir() is
# a *method* for plot, so you do not have to (and shouldn't)
# type the ".deldir" in the plotting command.
plot(dxy,col=0) # This, and plotting with "add=TRUE" below, is
# a kludge to dodge around spurious warnings.
ind <- as.matrix(dxy$delsgs[,c("ind1","ind2")])
lwv <- ldm[ind]
lwv <- 10*lwv/max(lwv)
plot(dxy, wlines="triang", col='grey', wpoints="none",
     lw=lwv, add=TRUE)  # lwv has already been rescaled above
with(demoDat,text(x,y,id,col="red",cex=1.5))

Coordinates from distance matrix in R

Is there a package to convert a distance matrix to a set of coordinates?
I have gone through the question below; I was hoping there would be a package for this.
Finding the coordinates of points from distance matrix
I have considered Sammon's projection for this, but from what I understand it is an optimizer and only gets you an approximate solution. I think there should be an algorithm that yields a unique solution.
Multidimensional scaling (MDS) aims to project the distance matrix of your data into a lower dimension k (here the desired k = 2), while trying to preserve the distances between data points:
# Multidimensional scaling
library(MASS)
set.seed(1)
labels <- as.factor(sample(LETTERS[1:5], 100, replace=TRUE))
dat <- mvrnorm(n=100, mu = c(1:4), Sigma=matrix(1:16, ncol=4)) + as.numeric(labels)^2
#> dim(dat)
#[1] 100 4
# Euclidean distance matrix (100x100)
d <- dist(dat)
# Classical MDS for distance matrix d
# http://en.wikipedia.org/wiki/Multidimensional_scaling
mds <- cmdscale(d, k = 2)
x <- mds[,1]
y <- mds[,2]
plot(x,y, col=rainbow(5)[as.numeric(labels)], pch=16, main="MDS for object 'dat'")
legend("topright", legend=unique(labels), col=rainbow(5)[unique(as.numeric(labels))], pch=16)
Further reading: https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-components-analysis-and-multidimensional
Look up an algorithm called multidimensional scaling (MDS). An implementation in R is the cmdscale function from the stats package:
Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities.
The documentation also has an example where a distance matrix is turned into two vectors of x and y coordinates, then plotted.
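A minimal sketch of that approach, along the lines of the example in ?cmdscale (it uses the eurodist road-distance matrix that ships with R; the sign flip on y just orients north upward):
loc <- cmdscale(eurodist, k = 2)  # recover coordinates from the distance matrix
x <- loc[, 1]
y <- -loc[, 2]                    # flip so north points up
plot(x, y, type = "n", asp = 1, xlab = "", ylab = "")
text(x, y, rownames(loc), cex = 0.7)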

Finding the best matching pairwise points from 2 vectors

I have 2 lists with X,Y coordinates of points.
List 1 contains more points than list 2.
The task is to find pairs of points such that the overall Euclidean distance is minimized.
I have working code, but I don't know if this is the best approach, and I would like hints on what I can improve, either in the result (a better algorithm for finding the minimum) or in speed, because the lists have about 2000 elements each.
The rounding in the sample vectors is there so that some points end up with identical distances.
With the "rdist" function all distances are generated in "distances". Then the minimum in the matrix is used to link two points ("dist_min"). All distances of these two points are then replaced by NA, and the loop continues by searching for the next minimum until every point of list 2 has a partner from list 1.
At the end I have added a plot for visualization.
require(fields)
set.seed(1)
x1y1.data <- matrix(round(runif(200*2),2), ncol = 2) # generate 1st set of points
x2y2.data <- matrix(round(runif(100*2),2), ncol = 2) # generate 2nd set of points
distances <- rdist(x1y1.data, x2y2.data)
dist_min <- matrix(data=NA, nrow=ncol(distances), ncol=7) # prepare the results matrix with 7 columns
for (i in 1:ncol(distances)) {
  inds <- which(distances == min(distances, na.rm=TRUE), arr.ind=TRUE)
  dist_min[i,1] <- inds[1,1] # row of point (use 1st element of inds if points have the same distance)
  dist_min[i,2] <- inds[1,2] # column of point (use 1st element of inds if points have the same distance)
  dist_min[i,3] <- distances[inds[1,1], inds[1,2]] # distance of the pair
  dist_min[i,4] <- x1y1.data[inds[1,1],1] # X coordinate of 1st point
  dist_min[i,5] <- x1y1.data[inds[1,1],2] # Y coordinate of 1st point
  dist_min[i,6] <- x2y2.data[inds[1,2],1] # X coordinate of 2nd point
  dist_min[i,7] <- x2y2.data[inds[1,2],2] # Y coordinate of 2nd point
  distances[inds[1,1],] <- NA # remove row (fill with NA) where the minimum was found
  distances[,inds[1,2]] <- NA # remove column (fill with NA) where the minimum was found
}
# plot 1st set of points
# print mean distance as measure for optimization
plot(x1y1.data,col="blue",main="mean of min_distances",sub=mean(dist_min[,3],na.rm=TRUE))
points(x2y2.data,col="red") # plot 2nd set of points
segments(dist_min[,4],dist_min[,5],dist_min[,6],dist_min[,7]) # connect pairwise according found minimal distance
This is a fundamental problem in combinatorial optimization known as the assignment problem. One approach to solving the assignment problem is the Hungarian algorithm which is implemented in the R package clue:
require(clue)
# recompute the distance matrix: the loop above overwrote it with NAs
distances <- rdist(x1y1.data, x2y2.data)
sol <- solve_LSAP(t(distances))
We can verify that it outperforms the naive solution:
mean(dist_min[,3])
# [1] 0.05696033
mean(sqrt(
  (x2y2.data[,1] - x1y1.data[sol, 1])^2 +
  (x2y2.data[,2] - x1y1.data[sol, 2])^2))
#[1] 0.05194625
And we can construct a similar plot to the one in your question:
plot(x1y1.data,col="blue")
points(x2y2.data,col="red")
segments(x2y2.data[,1], x2y2.data[,2], x1y1.data[sol, 1], x1y1.data[sol, 2])

Generating multidimensional data

Does R have a package for generating random numbers in multi-dimensional space? For example, suppose I want to generate 1000 points inside a cuboid or a sphere.
I have some functions for hypercube and n-sphere selection that generate data frames with Cartesian coordinates and guarantee a uniform distribution through the hypercube or n-sphere for an arbitrary number of dimensions:
GenerateCubiclePoints <- function(nrPoints, nrDim, center=rep(0,nrDim), l=1){
  x <- matrix(runif(nrPoints*nrDim, -1, 1), ncol=nrDim)
  x <- as.data.frame(
    t(apply(x*(l/2), 1, '+', center))
  )
  names(x) <- make.names(seq_len(nrDim))
  x
}
This generates points in a cube/hypercube of nrDim dimensions with a given center, where l is the length of one side.
For an n-sphere with nrDim dimensions you can do something similar, where r is the radius:
GenerateSpherePoints <- function(nrPoints, nrDim, center=rep(0,nrDim), r=1){
  # generate the polar coordinates!
  x <- matrix(runif(nrPoints*nrDim, -pi, pi), ncol=nrDim)
  x[,nrDim] <- x[,nrDim]/2
  # recalculate them to cartesians
  sin.x <- sin(x)
  cos.x <- cos(x)
  cos.x[,nrDim] <- 1 # see the formula for n-spheres
  y <- sapply(1:nrDim, function(i){
    if (i == 1){
      cos.x[,1]
    } else {
      cos.x[,i]*apply(sin.x[,1:(i-1),drop=F], 1, prod)
    }
  })*sqrt(runif(nrPoints,0,r^2))
  y <- as.data.frame(
    t(apply(y, 1, '+', center))
  )
  names(y) <- make.names(seq_len(nrDim))
  y
}
In 2 dimensions, these give the following, from this code:
T1 <- GenerateCubiclePoints(10000,2,c(4,3),5)
T2 <- GenerateSpherePoints(10000,2,c(-5,3),2)
op <- par(mfrow=c(1,2))
plot(T1)
plot(T2)
par(op)
Also check out the copula package. This will generate data within a cube/hypercube with uniform margins, but with correlation structures that you set. The generated variables can then be transformed to represent other shapes, but still with relations other than independent.
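For instance, a small sketch with the copula package (the 0.7 correlation parameter is just an illustrative choice):
library(copula)
nc <- normalCopula(0.7, dim = 3)  # Gaussian copula, exchangeable correlation 0.7
u  <- rCopula(1000, nc)           # 1000 points in the unit cube, uniform margins
pairs(u, cex = 0.3)               # the dependence is visible in each panel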
If you want more complex shapes but are happy with uniform and independent points within the shape, then you can just do rejection sampling: generate data within a cube that contains your shape, test whether the points are within your shape, reject them if not, and keep doing this until there are enough points.
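A minimal rejection-sampling sketch, using the unit ball in 3 dimensions as the example shape:
n <- 1000
pts <- matrix(nrow = 0, ncol = 3)
while (nrow(pts) < n) {
  cand <- matrix(runif(3*n, -1, 1), ncol = 3)  # propose points in the enclosing cube
  pts  <- rbind(pts, cand[rowSums(cand^2) <= 1, , drop = FALSE])  # keep those inside the ball
}
pts <- pts[1:n, ]  # trim the overshoot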
A couple of years ago, I made a package called geozoo. It is available on CRAN.
install.packages("geozoo")
library(geozoo)
It has many different functions to produce objects in N-dimensions.
p = 4
n = 1000
# Cube with points on its faces.
# A 3D version would be a box with solid walls and a hollow interior.
cube.face(p)
# Hollow sphere
sphere.hollow(p, n)
# Solid cube
cube.solid.random(p, n)
cube.solid.grid(p, 10) # evenly spaced points
# Solid Sphere
sphere.solid.random(p, n)
sphere.solid.grid(p, 10) # evenly spaced points
One of my favorite ones to watch animate is a cube with points along its edges, because it was one of the first objects that I made. It also gives you a sense of distance between vertices.
# Cube with points along its edges.
cube.dotline(4)
Also, check out the website: http://streaming.stat.iastate.edu/~dicook/geometric-data/. It contains pictures and downloadable data sets.
Hope it meets your needs!
Cuboid:
df <- data.frame(
x = runif(1000),
y = runif(1000),
z = runif(1000)
)
head(df)
x y z
1 0.7522104 0.579833314 0.7878651
2 0.2846864 0.520284731 0.8435828
3 0.2240340 0.001686003 0.2143208
4 0.4933712 0.250840233 0.4618258
5 0.6749785 0.298335804 0.4494820
6 0.7089414 0.141114804 0.3772317
Sphere:
df <- data.frame(
radius = runif(1000),
inclination = 2*pi*runif(1000),
azimuth = 2*pi*runif(1000)
)
head(df)
radius inclination azimuth
1 0.1233281 5.363530 1.747377
2 0.1872865 5.309806 4.933985
3 0.2371039 5.029894 6.160549
4 0.2438854 2.962975 2.862862
5 0.5300013 3.340892 1.647043
6 0.6972793 4.777056 2.381325
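If Cartesian coordinates are needed, the standard spherical-to-Cartesian formulas apply (a follow-up sketch; note this answer draws the inclination from [0, 2*pi), whereas the textbook convention restricts it to [0, pi]):
x <- df$radius * sin(df$inclination) * cos(df$azimuth)
y <- df$radius * sin(df$inclination) * sin(df$azimuth)
z <- df$radius * cos(df$inclination)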
Here is one way to do it.
Say we hope to generate a bunch of 3d points of the form y = (y_1, y_2, y_3)
Sample X from multivariate Gaussian with mean zero and covariance matrix R.
(x_1, x_2, x_3) ~ Multivariate_Gaussian(mu = [0, 0, 0], R = [[r_11, r_12, r_13], [r_21, r_22, r_23], [r_31, r_32, r_33]])
You can find a function that generates multivariate Gaussian samples in an R package (e.g. mvrnorm in MASS).
Take the Gaussian cdf of each covariate: (phi(x_1), phi(x_2), phi(x_3)), where phi is the Gaussian cdf of our variables, i.e. phi(x_1) = Pr[x <= x_1]. By the probability integral transform, these values (phi(x_1), phi(x_2), phi(x_3)) = (u_1, u_2, u_3) will each be uniformly distributed on [0, 1].
Then take the inverse cdf of each uniformly distributed marginal, i.e.
(F^{-1}(u_1), F^{-1}(u_2), F^{-1}(u_3)) = (y_1, y_2, y_3), where F is the marginal cdf of the distribution you are trying to sample from.
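A short sketch of these three steps in R (the covariance matrix R and the exponential target marginal are illustrative assumptions):
library(MASS)
R <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.5,
              0.2, 0.5, 1.0), ncol = 3)
x <- mvrnorm(1000, mu = rep(0, 3), Sigma = R)  # step 1: correlated Gaussians
u <- pnorm(x)                                  # step 2: probability integral transform
y <- qexp(u, rate = 1)                         # step 3: inverse cdf of the target marginal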

Generate random sequence and plot in R

I would like to generate a random sequence composed of 3000 points that follows a normal distribution with mean c and standard deviation d. But I would like these 3000 points to lie in the range [a, b].
Can you tell me how to do it in R?
If I plot this sequence with the 3000 generated points on the Y-axis, how should I generate the points corresponding to the X-axis?
You can do this using standard R functions like this:
c <- 1
d <- 2
a <- -2
b <- 3.5
ll <- pnorm(a, c, d)
ul <- pnorm(b, c, d)
x <- qnorm( runif(3000, ll, ul), c, d )
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
The pnorm function is used to find the limits to use for the uniform distribution; data is then generated from the uniform and transformed back to the normal.
This is even simpler using the distr package:
library(distr)
N <- Norm(c,d)
N2 <- Truncate(N, lower=a, upper=b)
plot(N2)
x <- r(N2)(3000)
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
Note that in both cases the mean of the result is not c and the sd is not d. If you want the mean and sd of the resulting truncated data to be c and d, then you need the parent distribution (before truncating) to have different values (a higher sd; the mean depends on the truncation points). Finding those values would be a good homework problem for a math/stat theory course. If that is what you really need, then add a comment or edit the question to say so specifically.
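A quick numerical check of that caveat (a sketch, reusing c = 1, d = 2, a = -2, b = 3.5 from above):
set.seed(1)
x <- qnorm(runif(1e5, pnorm(-2, 1, 2), pnorm(3.5, 1, 2)), 1, 2)
mean(x)  # slightly below c = 1, since the truncation is asymmetric
sd(x)    # clearly below d = 2, since the tails are cut off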
If you want to generate the data from the untruncated normal, but only plot the data within the range [a,b] then just use the ylim argument to plot:
plot( rnorm(3000, c, d), ylim=c(a,b) )
Generating a random sequence of numbers from any probability distribution is very easy in R. To do this for the normal distribution specifically:
c = 1
d = 2
x <- rnorm(3000, c, d)
Clipping the values in x so that they're only within a given range is kind of a strange thing to want to do with a sample from the normal distribution. Maybe what you really want to do is sample a uniform distribution.
a = 0
b = 3
x2 <- runif(3000, a, b)
As for how to plot the distribution, I'm not sure I follow your question. You can plot a density estimate for the sample with this code:
plot(density(x))
But, if you want to plot this data as a scatter plot of some sort, you actually need to generate a second sample of numbers.
If I plot this sequence with the 3000 generated points on the Y-axis, how should I generate the points corresponding to the X-axis?
If you just generate your points, like JoFrhwld said with
y <- rnorm(3000, 1, 2)
Then
plot(y)
will automatically plot them, using the vector indices as the x-axis.
To plot the normal density curve itself over the range [a, b]:
a = -2; b = 3
plot(dnorm, xlim = c(a, b))
