Generating multidimensional data in R

Does R have a package for generating random numbers in multi-dimensional space? For example, suppose I want to generate 1000 points inside a cuboid or a sphere.

I have some functions for hypercube and n-sphere sampling that generate data frames with Cartesian coordinates and aim for a uniform distribution through the hypercube or n-sphere for an arbitrary number of dimensions:
GenerateCubiclePoints <- function(nrPoints, nrDim, center = rep(0, nrDim), l = 1) {
  x <- matrix(runif(nrPoints * nrDim, -1, 1), ncol = nrDim)
  x <- as.data.frame(
    t(apply(x * (l / 2), 1, '+', center))
  )
  names(x) <- make.names(seq_len(nrDim))
  x
}
This samples points in a cube/hypercube of nrDim dimensions around a given center, with l the length of one side.
For an n-sphere with nrDim dimensions, you can do something similar, where r is the radius:
GenerateSpherePoints <- function(nrPoints, nrDim, center = rep(0, nrDim), r = 1) {
  # generate the polar coordinates!
  x <- matrix(runif(nrPoints * nrDim, -pi, pi), ncol = nrDim)
  x[, nrDim] <- x[, nrDim] / 2
  # recalculate them to Cartesians
  sin.x <- sin(x)
  cos.x <- cos(x)
  cos.x[, nrDim] <- 1  # see the formula for n-spheres
  y <- sapply(1:nrDim, function(i) {
    if (i == 1) {
      cos.x[, 1]
    } else {
      cos.x[, i] * apply(sin.x[, 1:(i - 1), drop = FALSE], 1, prod)
    }
  }) * sqrt(runif(nrPoints, 0, r^2))
  y <- as.data.frame(
    t(apply(y, 1, '+', center))
  )
  names(y) <- make.names(seq_len(nrDim))
  y
}
In two dimensions, these give a filled square and a filled disc of points, generated by this code:
T1 <- GenerateCubiclePoints(10000, 2, c(4, 3), 5)
T2 <- GenerateSpherePoints(10000, 2, c(-5, 3), 2)
op <- par(mfrow = c(1, 2))
plot(T1)
plot(T2)
par(op)
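A caveat: the sqrt() radial scaling above is exactly right in two dimensions; in higher dimensions both the angles and the radius need reweighting for exact uniformity. A common exact alternative is to normalize Gaussian vectors and scale the radius by U^(1/nrDim); a minimal sketch:
# Sketch: uniform sampling inside an n-ball via normalized Gaussian
# directions; runif(n)^(1/nrDim) gives the correct radial density.
GenerateSpherePointsGaussian <- function(nrPoints, nrDim, center = rep(0, nrDim), r = 1) {
  g <- matrix(rnorm(nrPoints * nrDim), ncol = nrDim)
  g <- g / sqrt(rowSums(g^2))             # uniform directions on the sphere
  rad <- r * runif(nrPoints)^(1 / nrDim)  # radii for a uniform ball fill
  y <- sweep(g * rad, 2, center, '+')
  setNames(as.data.frame(y), make.names(seq_len(nrDim)))
}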

Also check out the copula package. It will generate data within a cube/hypercube with uniform margins, but with correlation structures that you set. The generated variables can then be transformed to represent other shapes, while keeping relations other than independence.
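For instance, a minimal sketch with the copula package (the dimension and the 0.5 correlation are arbitrary illustrative choices):
library(copula)
# 1000 points in the unit cube with uniform margins and
# a Gaussian dependence structure (pairwise correlation 0.5)
nc <- normalCopula(0.5, dim = 3)
u <- rCopula(1000, nc)
pairs(u)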
If you want more complex shapes but are happy with uniform and independent points within the shape, you can just do rejection sampling: generate data within a cube that contains your shape, test whether the points are within your shape, reject them if not, and keep going until you have enough points, as in the sketch below.
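A minimal sketch of that idea, here with a unit ball as the target shape (the in_shape test is the only piece to swap out for another shape):
# Rejection sampling: draw from the bounding cube, keep points
# that pass the shape test, repeat until we have enough.
sample_shape <- function(n, nrDim, in_shape) {
  kept <- matrix(numeric(0), ncol = nrDim)
  while (nrow(kept) < n) {
    cand <- matrix(runif(n * nrDim, -1, 1), ncol = nrDim)
    kept <- rbind(kept, cand[in_shape(cand), , drop = FALSE])
  }
  kept[seq_len(n), ]
}
# Example: 1000 points inside the unit ball in 3 dimensions
pts <- sample_shape(1000, 3, function(m) rowSums(m^2) <= 1)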

A couple of years ago, I made a package called geozoo. It is available on CRAN.
install.packages("geozoo")
library(geozoo)
It has many different functions to produce objects in N-dimensions.
p = 4
n = 1000
# Cube with points on its faces.
# A 3D version would be a box with solid walls and a hollow interior.
cube.face(p)
# Hollow sphere
sphere.hollow(p, n)
# Solid cube
cube.solid.random(p, n)
cube.solid.grid(p, 10) # evenly spaced points
# Solid Sphere
sphere.solid.random(p, n)
sphere.solid.grid(p, 10) # evenly spaced points
One of my favorite ones to watch animated is a cube with points along its edges, because it was one of the first objects that I made. It also gives you a sense of the distance between vertices.
# Cube with points along its edges.
cube.dotline(4)
Also, check out the website: http://streaming.stat.iastate.edu/~dicook/geometric-data/. It contains pictures and downloadable data sets.
Hope it meets your needs!

Cuboid:
df <- data.frame(
  x = runif(1000),
  y = runif(1000),
  z = runif(1000)
)
head(df)
x y z
1 0.7522104 0.579833314 0.7878651
2 0.2846864 0.520284731 0.8435828
3 0.2240340 0.001686003 0.2143208
4 0.4933712 0.250840233 0.4618258
5 0.6749785 0.298335804 0.4494820
6 0.7089414 0.141114804 0.3772317
Sphere:
df <- data.frame(
  radius = runif(1000),
  inclination = 2 * pi * runif(1000),
  azimuth = 2 * pi * runif(1000)
)
head(df)
radius inclination azimuth
1 0.1233281 5.363530 1.747377
2 0.1872865 5.309806 4.933985
3 0.2371039 5.029894 6.160549
4 0.2438854 2.962975 2.862862
5 0.5300013 3.340892 1.647043
6 0.6972793 4.777056 2.381325
Note: edited to include code for sphere
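If Cartesian coordinates are needed, the spherical coordinates above convert with the usual formulas; a sketch (note that uniformly sampled radius and angles, as above, do not give a uniform fill of the ball, and the inclination would conventionally lie in [0, pi]):
# Spherical -> Cartesian conversion (physics convention:
# inclination measured from the z-axis, azimuth in the x-y plane)
df$x <- df$radius * sin(df$inclination) * cos(df$azimuth)
df$y <- df$radius * sin(df$inclination) * sin(df$azimuth)
df$z <- df$radius * cos(df$inclination)
head(df[, c("x", "y", "z")])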

Here is one way to do it.
Say we hope to generate a bunch of 3D points of the form y = (y_1, y_2, y_3).
Sample X from a multivariate Gaussian with mean zero and covariance matrix R:
(x_1, x_2, x_3) ~ MultivariateGaussian(mu = [0, 0, 0], R = [[r_11, r_12, r_13], [r_21, r_22, r_23], [r_31, r_32, r_33]])
You can find a function that generates multivariate Gaussian samples in an R package (e.g. mvrnorm in MASS).
Take the Gaussian CDF of each covariate: (phi(x_1), phi(x_2), phi(x_3)), where phi is the Gaussian CDF of our variables, i.e. phi(x_1) = Pr[x <= x_1]. By the probability integral transform, these values (phi(x_1), phi(x_2), phi(x_3)) = (u_1, u_2, u_3) will each be uniformly distributed on [0, 1].
Then take the inverse CDF of each uniformly distributed marginal. In other words, take the inverse CDF of u_1, u_2, u_3:
(F^{-1}(u_1), F^{-1}(u_2), F^{-1}(u_3)) = (y_1, y_2, y_3), where F is the marginal CDF of the distribution you are trying to sample from.
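A minimal sketch of those three steps, assuming exponential marginals and an illustrative correlation matrix:
library(MASS)
# Step 1: correlated multivariate Gaussian samples
R <- matrix(c(1, 0.6, 0.3,
              0.6, 1, 0.5,
              0.3, 0.5, 1), nrow = 3)
x <- mvrnorm(n = 1000, mu = rep(0, 3), Sigma = R)
# Step 2: Gaussian CDF -> uniform margins on [0, 1]
u <- pnorm(x)
# Step 3: inverse CDF of the target marginal (here exponential)
y <- qexp(u, rate = 1)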

Related

3-d point matching/clustering in R

I would like to match points in 3-dimensional space.
Therefore, I am using the Hungarian Method described in this question: Finding the best matching pairwise points from 2 vectors
Here is my example using R:
# packages
library(rgl)
library(clue)
library(plyr)
library(fields)
set.seed(1)
a <- c(rep(2,7), 3,4,5,6,3,4,5,6,7,7,7,7,7,7) # x values
b <- c(rep(3,7),3,3,3,3, 3,3,3,3,3,3,3,3,3,3) # y values
c <- c(seq(1,7),1,1,1,1,7,7,7,7,1,2,3,4,5,6) # z values
# transform the points
set.seed(2)
a1 <- a + seq(1,length(a))
b1 <- b + 8
c1 <- c + 9
# plot the data
plot3d(a,b,c, col="red", pch=16,size=10)
plot3d(a1,b1,c1, lwd=10, col="blue", pch=16,size=10, add=TRUE)
# run the Hungarian Method
A <- cbind(a,b,c)
B <- cbind(a1,b1,c1)
distances <- rdist(A,B) # calculate Euclidean Distance between points
min.dist <- solve_LSAP(distances) # minimizing the sum of distance
min.dist.num <- as.numeric(min.dist)
# plot the minimized lines between point sets
for (ii in 1:dim(B)[1]) {
  D <- c(A[ii, 1], B[min.dist.num[ii], 1])
  R <- c(A[ii, 2], B[min.dist.num[ii], 2])
  W <- c(A[ii, 3], B[min.dist.num[ii], 3])
  segments3d(D, R, W, col = 2, lwd = 1)
}
# calculate the share of points that is matched correctly
sum(1:dim(B)[1] == min.dist.num) / dim(B)[1] * 100
The problem here is that only 5% of the points are matched correctly (see last line of the code). In my view, the main trouble is that the algorithm does not take the structure of the object (a square) into account.
Question: Is there any method that performs better for this sample data?
In my original data, the dimensional structure of the points is way more complicated. I have a cloud of data and within this cloud there are multiple subfigures.
I am primarily seeking a solution in R, but other implementations (e.g. MATLAB, Excel, Java) are also welcome.

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into another, larger, set of 2D coordinates. Imagine, for example, having a Polaroid photo of a small piece of the starry sky with you out at night, and wanting to hold it up in a position so that it matches the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should be very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start <- apply(my_coords_final, 2, function(x) x - min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above yields the "starry sky" (open circles), the start position of the "Polaroid" (red), and its corrected position (blue).
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find the position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So the previous approach I posted (see edit history), using optim() to minimize the sum of distances between points, will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. A solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm: it calculates the shift that aligns a single point of the sampled data with each point of the reference data in turn, and then counts how many of the remaining sampled points fall within a minimum distance threshold of the reference data (the threshold is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for (i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[, 1]
  y2 <- my_coords_start[, 2]
  x1 <- ref_coords[, 1] + (x2[1] - ref_coords[i, 1])
  y1 <- ref_coords[, 2] + (y2[1] - ref_coords[i, 2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist(cbind(c(x1, x2), c(y1, y2)), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1 * 1:length(x1), ]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best then calculate a new
  ## dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i, 1])
    dy <- (y2[1] - ref_coords[i, 2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
  x = ref_coords[, 1] + dx,
  y = ref_coords[, 2] + dy,
  cex = 1.5, col = "red"
)
I'm very interested to know if anyone can solve this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.

How to find the smallest ellipse covering a given fraction of a set of points in R?

I'm wondering: is there some function/clever way to find the smallest ellipse covering a given fraction of a set of 2D points in R? By smallest I mean the ellipse with the smallest area.
Clarification: I'm fine with an approximately correct solution if the number of points is large (as I guess an exact solution would have to try all combinations of subsets of points).
This question might sound like a duplicate of Ellipse containing percentage of given points in R, but the way that question is phrased, the resulting answer does not produce the smallest ellipse. For example, using the solution given there:
require(car)
x <- runif(6)
y <- runif(6)
dataEllipse(x,y, levels=0.5)
The resulting ellipse is clearly not the smallest ellipse containing half of the points, which, I guess, would be a small ellipse covering the three points up in the top-left corner.
I think I have a solution which requires two functions: cov.rob from the MASS package and ellipsoidhull from the cluster package. cov.rob(xy, quantile.used = 50, method = "mve") finds approximately the "best" 50 points, out of the total number of 2D points in xy, that are contained in the minimum-volume ellipse. However, cov.rob does not directly return this ellipse but rather some other ellipse estimated from the best points (the goal being to robustly estimate the covariance matrix). To find the actual minimum ellipse we can give the best points to ellipsoidhull, which finds the minimum ellipse, and we can use predict.ellipsoid to get the coordinates of the path defining the hull of the ellipse.
I'm not 100% certain this method is the easiest and/or that it works 100% (it feels like it should be possible to avoid the second step of using ellipsoidhull, but I haven't figured out how). It seems to work on my toy example, at least...
Enough talking, here is the code:
library(MASS)
library(cluster)
# Using the same six points as in the question
xy <- cbind(x, y)
# Finding the 3 points in the smallest ellipse (not finding
# the actual ellipse though...)
fit <- cov.rob(xy, quantile.used = 3, method = "mve")
# Finding the minimum volume ellipse that contains these three points
best_ellipse <- ellipsoidhull( xy[fit$best,] )
plot(xy)
# The predict() function returns a 2d matrix defining the coordinates of
# the hull of the ellipse
lines(predict(best_ellipse), col="blue")
Looks pretty good! You can also inspect the ellipse object for more info
best_ellipse
## 'ellipsoid' in 2 dimensions:
## center = ( 0.36 0.65 ); squared ave.radius d^2 = 2
## and shape matrix =
## x y
## x 0.00042 0.0065
## y 0.00654 0.1229
## hence, area = 0.018
Here is a handy function which adds an ellipse to an existing base graphics plot:
plot_min_ellipse <- function(xy, points_in_ellipse, color = "blue") {
fit <- cov.rob(xy, quantile.used = points_in_ellipse, method = "mve")
best_ellipse <- ellipsoidhull( xy[fit$best,] )
lines(predict(best_ellipse), col=color)
}
Let's use it on a larger number of points:
x <- runif(100)
y <- runif(100)
xy <- cbind(x, y)
plot(xy)
plot_min_ellipse(xy, points_in_ellipse = 50)
This sounds very much like a 2D confidence region. Try http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/ellipsoidhull.html. You will probably need to run it on each combination of N points, then choose the smallest result, as in the sketch below.
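A sketch of that brute-force suggestion using the xy matrix from the answer above; it is only feasible for small point sets, since the number of subsets explodes combinatorially (volume() from cluster returns the area of a 2D ellipsoid object):
library(cluster)
# Exhaustively try every subset of k points and keep the
# smallest-area enclosing ellipse; only viable for small n.
smallest_ellipse <- function(xy, k) {
  combos <- combn(nrow(xy), k)
  best <- NULL
  for (j in seq_len(ncol(combos))) {
    e <- ellipsoidhull(xy[combos[, j], ])
    if (is.null(best) || volume(e) < volume(best)) best <- e
  }
  best
}
lines(predict(smallest_ellipse(xy, 3)), col = "red")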

Coordinates from distance matrix in R

Is there a package to convert a distance matrix to a set of coordinates?
I have gone through the question below; I was hoping there would be a package for this.
Finding the coordinates of points from distance matrix
I have considered Sammon's projection for this but, from what I understand, it is an optimizer and gets you an optimum solution. I think there should be an algorithm to get a unique solution for this.
Multidimensional scaling (MDS) aims to project the distance matrix of your data to a lower dimension k (with the desired k = 2 in your case), while trying to preserve the distances between data points:
# Multidimensional scaling
library(MASS)
set.seed(1)
labels <- as.factor(sample(LETTERS[1:5], 100, replace=TRUE))
dat <- mvrnorm(n=100, mu = c(1:4), Sigma=matrix(1:16, ncol=4)) + as.numeric(labels)^2
#> dim(dat)
#[1] 100 4
# Euclidean distance matrix (100x100)
d <- dist(dat)
# Classical MDS for distance matrix d
# http://en.wikipedia.org/wiki/Multidimensional_scaling
mds <- cmdscale(d, k = 2)
x <- mds[,1]
y <- mds[,2]
plot(x,y, col=rainbow(5)[as.numeric(labels)], pch=16, main="MDS for object 'dat'")
legend("topright", legend=unique(labels), col=rainbow(5)[unique(as.numeric(labels))], pch=16)
Further reading: https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-components-analysis-and-multidimensional
Look up an algorithm called multidimensional scaling (MDS). An implementation in R is the cmdscale function from the stats package:
Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities.
The documentation also has an example where a distance matrix is turned into two vectors of x and y coordinates, then plotted.
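As a minimal sketch along the lines of that documentation example, using the built-in eurodist distance matrix (the sign of each recovered axis is arbitrary):
# Recover 2D coordinates from the built-in distance matrix eurodist
loc <- cmdscale(eurodist, k = 2)
plot(loc[, 1], -loc[, 2], type = "n", asp = 1,
     xlab = "", ylab = "", main = "cmdscale(eurodist)")
text(loc[, 1], -loc[, 2], labels = rownames(loc), cex = 0.7)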

spatial distribution of points, R

What would be an easy way to generate 3 different spatial distributions of points (N = 20 points) using R? For example: 1) random, 2) uniform, and 3) clustered, on the same space (a 50 x 50 grid).
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 in each direction: put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there.
2) As you suggest, you could use runif ... but I'd have assumed from your question that you actually wanted points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined": depending on how you want to define things, there are many ways you might do it. E.g. if it's on a grid, you could sample points in such a way that points close to (but not exactly on) already-sampled points have a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and make them gravitate toward their neighbors. Or you could generate a few cluster centers (4-10, say) and then scatter points about those centers, as in the sketch below. Or you could do any of a hundred other things.
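A minimal sketch of the cluster-centers idea (the number of centers and the spread are arbitrary choices):
set.seed(42)
n_centers <- 4
n_points <- 20
# Scatter points around a few random cluster centers on a 50 x 50 space
centers <- data.frame(x = runif(n_centers, 0, 50), y = runif(n_centers, 0, 50))
idx <- sample(n_centers, n_points, replace = TRUE)
clustered <- data.frame(
  x = centers$x[idx] + rnorm(n_points, sd = 2),
  y = centers$y[idx] + rnorm(n_points, sd = 2)
)
plot(clustered, xlim = c(0, 50), ylim = c(0, 50))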
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50 x 50
x1 <- 1:50 - 0.5
x2 <- x1
grid <- expand.grid(x1, x2)
names(grid) <- c("x1", "x2")
# make the grid a spatial object
coordinates(grid) <- ~ x1 + x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n = 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n = 20, type = 'regular')
Third: clustered at distances of up to 2 from a random location (points can fall outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n = 1, type = 'random'))
# randomly select 20 distances between 0 and 2
n.point <- 20
h <- runif(n.point, 0, 2)
# empty data frame for the offsets
dxy <- data.frame(matrix(nrow = n.point, ncol = 2))
# pick a random direction for each point and turn the distances
# into x/y offsets from the randomly selected origin
angle <- runif(n = n.point, min = 0, max = 2 * pi)
dxy[, 1] <- h * sin(angle)
dxy[, 2] <- h * cos(angle)
cluster <- data.frame(x = rep(NA, 20), y = rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster) <- ~ x + y
plot(grid)
plot(cluster, add = TRUE, col = 'green')
plot(random.pt, add = TRUE, col = 'red')
plot(regular.pt, add = TRUE, col = 'blue')
