Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?
An inefficient method, which probably wouldn't do too well on a big dataset, is to repeatedly draw random subsets and keep the best one (here, picking the 20 points out of 1000 that are most spread out):
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000) {
  # draw a random subset of n points and keep the one with the largest
  # mean interpoint distance seen so far
  subset <- xy[sample(1:nrow(xy), n), ]
  avg <- mean(dist(subset))
  if (avg > bestavg) {
    bestavg <- avg
    bestSet <- subset
  }
}
This code, based on @Pascal's code, iteratively drops the point with the smallest row sum in the distance matrix (i.e. the point closest to all the others) until only n points remain.
m2 <- function(xy, n){
  subset <- xy
  alldist <- as.matrix(dist(subset))
  while (nrow(subset) > n) {
    # drop the point with the smallest total distance to all other points
    cdists <- rowSums(alldist)
    closest <- which(cdists == min(cdists))[1]
    subset <- subset[-closest, ]
    alldist <- alldist[-closest, -closest]
  }
  return(subset)
}
Run both on a Gaussian cloud, where m1 is @Pascal's function:
> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)
See who did best by looking at the sum of the interpoint distances:
> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975
Method 2 wins! And compare with a random sample of 20 points:
> sum(dist(xy[sample(1000,20),]))
[1] 349.3905
which does pretty poorly as expected.
So what's going on? Let's plot:
> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)
Method 1 generates the red points, which are spread fairly evenly over the cloud. Method 2 produces the blue points, which almost trace out the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
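As a quick sanity check in one dimension, a minimal sketch (reusing the m2() defined above; the particular numbers are arbitrary) shows the same tendency:
# Rough 1-D illustration: put the points on a horizontal line and reuse m2().
# Repeatedly dropping the point with the smallest distance row sum tends to
# leave values near the extremes of the range.
set.seed(42)
z <- cbind(rnorm(100), 0)
range(z[, 1])
m2(z, 5)[, 1]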
Using a bimodal pattern of initial points also illustrates this:
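The code that generated the bimodal cloud xy2 isn't shown here, so, as an assumption only, one plausible way to build such a data set is:
# Hypothetical reconstruction of the bimodal input: two Gaussian clusters
# side by side; m1s2 and m2s2 below would then be m1(xy2, 20) and m2(xy2, 20).
xy2 <- rbind(cbind(rnorm(500, -2), rnorm(500)),
             cbind(rnorm(500, 2), rnorm(500)))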
Again, method 2 produces a much larger total interpoint distance than method 1, and both do better than random sampling:
> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34
Following @Spacedman's suggestion, I have written a function that drops a point from the closest pair until the desired number of points remains. It seems to work well; however, it slows down pretty quickly as the number of points grows.
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20
subset <- xy
alldist <- as.matrix(dist(subset))
diag(alldist) <- NA                 # ignore self-distances
alldist[upper.tri(alldist)] <- NA   # each pair only needs to appear once
while (nrow(subset) > n) {
  # find the closest remaining pair and drop one of its two points
  closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
  subset <- subset[-closest[1, 1], ]
  alldist <- alldist[-closest[1, 1], -closest[1, 1]]
}
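For comparison with the methods above, the same criterion and plot can be applied to the result (a quick usage sketch):
# Total interpoint distance of the surviving points, and where they sit
sum(dist(subset))
plot(xy, asp = 1)
points(subset, col = "blue", pch = 19)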
I am working on spike trains, and my code to generate spike trains like this for 20 trials is written below. (The image shown is representative of 5 trials.)
fr <- 100            # firing rate in Hz
dt <- 1/1000         # bin width in seconds (1 ms)
duration <- 2        # duration in seconds
nBins <- 2000        # duration / dt bins
nTrials <- 20        # number of simulated trials
MyPoissonSpikeTrain <- function(p, fr = 100) {
  p <- runif(nBins)              # note: the argument p is overwritten here
  q <- ifelse(p < fr*dt, 1, 0)   # spike with probability fr*dt in each bin
  return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x = -1, y = -1, xlab = "time (s)", ylab = "Trial",
     main = "Spike trains",
     ylim = c(0.5, nTrials + 1), xlim = c(0, duration))
for (i in 1:nTrials) {
  clip(x1 = 0, x2 = duration, y1 = (i - 0.2), y2 = (i + 0.4))
  abline(h = i, lwd = 1/4)
  abline(v = dt * which(SpikeMat[i, ] == 1))
}
Each trial has spikes occurring at random time points. What I am now working towards is picking a single random sample time point that applies to all 20 trials, and then, for each trial, getting the length of the inter-spike interval that this point falls into. My code to get the spike times and inter-spike intervals for a trial is:
# Inter-spike intervals for trial i (spike times are in seconds)
ISI <- function(i) {
  spike_times <- dt * which(SpikeMat[i, ] == 1)
  ISI1vec <- diff(spike_times)
  return(ISI1vec)
}
Then you call ISI(i) for whichever trial you wish to see the inter-spike interval vector for. A visual representation of what I want is:
I want a vector that holds, for each trial, the length of the interval that this point falls into. I also want to look at its distribution, but that is for later. Can anybody help me figure out how to code this? Any help is appreciated, even if it is just a pointer on how to start or where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) {
  idx <- which(SpikeMat[i, ] == 1)
  setNames(idx, seq_along(idx))
})
Grab random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains, for each trial, the distances to the two spikes nearest to RT: each element of the list is a named vector whose values are the distances to RT and whose names are the spikes' positions within that trial's spike vector. nearest_columns shows the original time point (column number in SpikeMat) of each of those spikes.
bookend_values <- function(vec) {
  # vec = RT - spike positions: positive entries are spikes before RT,
  # negative entries are spikes after RT
  lower_val <- head(sort(vec[sign(vec) == 1]), 1)        # nearest spike before RT
  upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)  # nearest spike after RT
  return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
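The question also mentions looking at the distribution of these interval lengths; as a rough follow-up sketch (assuming the objects above; lengths are in 1 ms bins, i.e. column indices of SpikeMat):
# Length of the interval containing RT, per trial, and a quick look at its distribution
interval_lengths <- sapply(distances, sum)
hist(interval_lengths, xlab = "interval length (bins)", main = "ISI containing RT")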
I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires the input to be a matrix, so I used this code first:
totalm <- data.matrix(total)
Followed by this code (Davies-Bouldin index):
clusternumber <- 0
max_cluster_number <- 30
# Davies-Bouldin algorithm
library(clusterCrit)
smallest <- 99999
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, b)
    cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if (intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a) {
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
The error comes from cl <- as.numeric(cl). The result of a call to kmeans is an object: a list containing various pieces of information about the model.
Run ?kmeans
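For illustration, here is a rough sketch of what that list contains, using the totalm matrix from above (the exact clustering will vary because k-means is random):
cl <- kmeans(totalm, centers = 2)
names(cl)
# "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
cl$cluster   # integer vector of cluster assignments - this is what intCriteria() needs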
I would also recommend adding nstart = 20 to your kmeans call. k-means clustering is a random process; this will run the algorithm 20 times for each number of centers and keep the best fit.
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, centers = b, nstart = 20)
    # cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if (intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a) {
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing max_cluster_number to 4, since the reproducible data set is small)
EDIT: Integer error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example
> identical(total,totalm)
[1] TRUE
The accCost() and costDistance() functions from the R gdistance package produce different values when going from source coordinate A to destination coordinate B. Shouldn't the cost accumulation value at B be equivalent to the costDistance value from A to B, given the same anisotropic transition matrix and given that both functions use Dijkstra's algorithm?
If not, what is the fundamental difference between the calculations? If so, what accounts for the different values produced by the code below? In the example, the A-to-B costDistance is 0.13 hours, while accCost gives 0.11 hours at point B. My other tests suggest that accCost is consistently less than costDistance, and considerably so over long distances. The code is based on the example provided in the accCost documentation.
require(gdistance)
r <- raster(system.file("external/maungawhau.grd", package="gdistance"))
altDiff <- function(x){x[2] - x[1]}
hd <- transition(r, altDiff, 8, symm=FALSE)
slope <- geoCorrection(hd)
adj <- adjacent(r, cells=1:ncell(r), pairs=TRUE, directions=8)
speed <- slope
speed[adj] <- 6 * 1000 * exp(-3.5 * abs(slope[adj] + 0.05))#1000 to convert to a common spatial unit of meters
Conductance <- geoCorrection(speed)
A <- matrix(c(2667670, 6479000),ncol=2)
B <- matrix(c(2667800, 6479400),ncol=2)
ca <- accCost(Conductance,fromCoords=A)
extract(ca,B)
costDistance(Conductance,fromCoords=A,toCoords=B)
There should be no difference. The current version of accCost has a small bug that arises from a change in the igraph package.
For the moment, please see if this function solves the problem.
setMethod("accCost", signature(x = "TransitionLayer", fromCoords = "Coords"),
def = function(x, fromCoords)
{
fromCoords <- .coordsToMatrix(fromCoords)
fromCells <- cellFromXY(x, fromCoords)
if(!all(!is.na(fromCells))){
warning("some coordinates not found and omitted")
fromCells <- fromCells[!is.na(fromCells)]
}
tr <- transitionMatrix(x)
tr <- rBind(tr,rep(0,nrow(tr)))
tr <- cBind(tr,rep(0,nrow(tr)))
startNode <- nrow(tr) #extra node to serve as origin
adjP <- cbind(rep(startNode, times=length(fromCells)), fromCells)
tr[adjP] <- Inf
adjacencyGraph <- graph.adjacency(tr, mode="directed", weighted=TRUE)
E(adjacencyGraph)$weight <- 1/E(adjacencyGraph)$weight
shortestPaths <- shortest.paths(adjacencyGraph, v=startNode, mode="out")[-startNode]
result <- as(x, "RasterLayer")
result <- setValues(result, shortestPaths)
return(result)
}
)
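With this method definition sourced, re-running the comparison from the question (a usage sketch reusing Conductance, A and B from above) should now give matching values:
ca <- accCost(Conductance, fromCoords = A)
extract(ca, B)
costDistance(Conductance, fromCoords = A, toCoords = B)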
This issue has been resolved in gdistance 1.2-1.
I'm new to this site. I was wondering if anyone had experience with turning a list of grid coordinates (shown in the example code below as df) into an adjacency matrix. I've written a function that can handle the job for very small data sets, but the run time increases rapidly as the data set grows (I think 800 pixels would take about 25 hours). That is because of the nested for loops, but I don't know how to get around them.
## Dummy Data
x <- c(1,1,2,2,2,3,3)
y <- c(3,4,2,3,4,1,2)
df <- as.data.frame(cbind(x,y))
df
## Here's what it looks like as an image
a <- c(NA,NA,1,1)
b <- c(NA,1,1,1)
c <- c(1,1,NA,NA)
image <- cbind(a,b,c)
f <- function(m) t(m)[,nrow(m):1]
image(f(image))
## Here's my adjacency matrix function that's slowwwwww
adjacency.coordinates <- function(x, y) {
  df <- as.data.frame(cbind(x, y))
  colnames(df) <- c("V1", "V2")
  df <- df[with(df, order(V1, V2)), ]
  adj.mat <- diag(1, dim(df)[1])
  for (i in 1:dim(df)[1]) {
    for (j in 1:dim(df)[1]) {
      if ((df[i, 1] - df[j, 1] == 0 & abs(df[i, 2] - df[j, 2]) == 1) |
          (df[i, 2] - df[j, 2] == 0 & abs(df[i, 1] - df[j, 1]) == 1)) {
        adj.mat[i, j] <- 1
      }
    }
  }
  return(adj.mat)
}
## Here's the adjacency matrix
adjacency.coordinates(x,y)
Does anyone know of a way to do this that will work well on a set of coordinates a couple of thousand pixels long? I've tried converting to a SpatialGridDataFrame and working from there, but it doesn't produce the correct adjacency matrix. Thank you so much for your time.
While I thought igraph might be the way to go here, I think you can do it more simply like:
result <- apply(df, 1, function(pt)
  (pt["x"] == df$x & abs(pt["y"] - df$y) == 1) |
  (abs(pt["x"] - df$x) == 1 & pt["y"] == df$y)
)
diag(result) <- 1
And avoid the loopiness and get the same result:
> identical(adjacency.coordinates(x,y),result)
[1] TRUE
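As an alternative sketch (not the approach above), the same adjacency structure can be read off a Manhattan distance matrix, since two grid cells are orthogonal neighbours exactly when their Manhattan distance is 1; this also copes with a few thousand points:
# Orthogonal neighbours are the pairs at Manhattan distance 1
adj <- as.matrix(dist(df, method = "manhattan")) == 1
diag(adj) <- TRUE   # logical matrix; use adj * 1 if 0/1 values are needed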
First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}
which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)),
                      dim = c(ncol(group), 1 + nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}
Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the first version (can anyone explain why?), but it is still not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}
but this seems to be about twice as slow. I also tried using dist on each pair, but that is slower too.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
PS: I need this to implement k-means by hand (and I really do need to implement it myself; it is part of an assignment). I believe I will only need Euclidean distance, but I am not sure yet, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does all the computation in less than one second.
Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)
Runs in:
   user  system elapsed
  0.100   0.008   0.108
on my laptop.
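To finish the assignment step from the question, a short sketch (note that dists holds squared Euclidean distances, which is fine for finding the nearest center):
# Index of the nearest center for each of the n points
S <- max.col(-dists)   # or: apply(dists, 1, which.min)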
rdist() is an R function from the {fields} package that can quickly calculate the distances between two sets of points given as matrices.
https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html
Usage:
library(fields)
#generating fake data
n <- 5
m <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)
rdist(x, y)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
[2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
[3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
[4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
[5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
[6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
[7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
[8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
[9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
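Applied to the objects in the question, a usage sketch might look like:
# n x K matrix of Euclidean distances, then the nearest center per point
D <- rdist(data, centers)
S <- apply(D, 1, which.min)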
You may want to have a look into the apply functions.
For instance, this code
for (j in 1:K) {
  d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
can easily be replaced by something like
dt <- data[i,]
d <- apply(centers, 1, function(x) sqrt(sum((x - dt)^2)))
You can definitely optimise it further, but I hope you get the idea.
dist is fast because it is vectorized and calls internal C functions.
Your loop code could be vectorized in many ways.
For example, to compute the distances between data and centers you could use outer:
diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
This gives you an n x K matrix of distances, and it should be much faster than the loop.
Then you can use max.col to find the maximum in each row (see its help page; there are some nuances when there are several maxima). X must be negated because we are searching for the minimum.
CL <- max.col(-X)
To be efficient in R you should vectorize as much as possible; loops can often be replaced by vectorized substitutes. Check the help for rowSums (which also covers rowMeans, colSums and colMeans), pmax and cumsum. You could also search SO for examples, e.g.
https://stackoverflow.com/search?q=[r]+avoid+loop
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}
You can try it, like:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
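For the nearest-center search in the question, one way to use it (a usage sketch with the question's data and centers objects):
# For each data point, the index of the closest center
S <- apply(data, 1, function(p) which.min(euc.dist(centers, p)))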