Clustering based on connectivity of points - r

I have 1 million records of lat/long (5 digits of precision) and Route. I want to cluster those data points.
I don't want to use standard k-means clustering as I am not sure how many clusters there are (I tried the elbow method but was not convinced).
Here is my logic -
1) I want to reduce the precision of the lat/longs from 5 digits to 3 digits.
2) Lat/longs that are within +/- 0.001 of each other should then be clustered into one cluster, and the centroid of each cluster calculated.
But in doing so I am unable to find a good algorithm and an R script to implement this idea.
Can anyone please help me with the above problem?
Thanks,

Clustering can be done based on connected components.
All points that are within +/- 0.001 distance of each other can be connected, so we get a graph containing subgraphs, each of which may be a single point or a series of connected points (connected components).
The connected components can then be found and their centroids calculated.
Two packages are required for this task:
1. deldir, to form a triangulation of the points, specify which points are adjacent to each other, and calculate the distances between them.
2. igraph, to find the connected components.
library(deldir)
library(igraph)
coords <- data.frame(lat = runif(1000000),long=runif(1000000))
#round to 3 digits
coords.r <- round(coords,3)
#remove duplicates
coords.u <- unique(coords.r)
# create triangulation of points; depending on the data this may take a while and consume a lot of memory
triangulation <- deldir(coords.u$long, coords.u$lat)
#compute the L1 (Manhattan) distance between adjacent points
distances <- abs(triangulation$delsgs$x1 - triangulation$delsgs$x2) +
  abs(triangulation$delsgs$y1 - triangulation$delsgs$y2)
#keep only edges no longer than 0.001 (the 0.0011 threshold guards against floating-point error after rounding)
edge.list <- as.matrix(triangulation$delsgs[distances < .0011, 5:6])
if (length(edge.list) == 0) { #there is no edge whose length is less than .0011
  coords.clustered <- coords.u
} else { # find connected components
  #renumber the vertex indices in the edge list so that, for example,
  # 9 5
  # 5 7
  #is reformatted to
  # 3 1
  # 1 2
  sorted <- sort(c(edge.list), index.return = TRUE)
  run.length <- rle(sorted$x)
  indices <- rep(1:length(run.length$lengths), times = run.length$lengths)
  edge.list.reformatted <- edge.list
  edge.list.reformatted[sorted$ix] <- indices
  #create graph from list of edges
  graph.struct <- graph_from_edgelist(edge.list.reformatted, directed = FALSE)
  # cluster based on connected components
  clust <- components(graph.struct)
  #computation of centroids
  coords.connected <- coords.u[run.length$values, ]
  centroids <- data.frame(lat = tapply(coords.connected$lat, factor(clust$membership), mean),
                          long = tapply(coords.connected$long, factor(clust$membership), mean))
  #combine clustered points with unclustered points
  coords.clustered <- rbind(coords.u[-run.length$values, ], centroids)
  # round the data and remove possible duplicates
  coords.clustered <- round(coords.clustered, 3)
  coords.clustered <- unique(coords.clustered)
}
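For reference, the much simpler grid-binning the question describes (round to 3 decimals, then average the points that share a cell) can be sketched as below, reusing coords from above. Note it is not equivalent to connected components, since two points only 0.0011 apart can still fall into different cells; the names cells and centroids.grid are illustrative.
# group points by their rounded (3-decimal) cell and average each cell
cells <- round(coords, 3)
centroids.grid <- aggregate(coords,
                            by = list(cell.lat = cells$lat, cell.long = cells$long),
                            FUN = mean)
head(centroids.grid)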

Related

R function for creating discs around each point in a pattern, then counting number of points in each disc [spatial]

I am attempting to create a disc for each point in a pattern; each disc will have the same radius. Then for each disc, I want to count the number of points falling within the disc. Each pattern has 100-400 points. I have written code to do this, but it is quite slow. The code is below. I cannot provide the shapefile and points as that would be very difficult, but I could create some dummy data if need be.
W <- as.owin(shape)
#Converts created .shp file into a "window"
#in which everything is plotted and calculated
SPDF <- SpatialPointsDataFrame(P[,1:2], P)
#Converts data frame to spatial points data frame
SP <- as(SPDF, "SpatialPoints") #Converts SPDF to spatial points
SP1 <- as.ppp(coordinates(SP), W)
SP2 <- as.ppp(SP1)
attr(SP1, "rejects")
attr(SP2, "rejects")
aw <- area.owin(W) #Area, in pixels squared, of leaf window created earlier
#awm <- aw * (meas)^2 * 100 #Area of window in millimeters squared
# (meas, the mm-per-pixel scale, must be defined and the line above
# uncommented before TD below can be computed)
# Trichome_Density_Count-----------------------------------------------------------------------------------------------
TC <- nrow(P) #Counts number of rows in the XY data points file,
#this is the number of trichomes from ImageJ
TD <- TC/awm #Trichome density, trichomes per mm^2
#SPDF2 <- as.SpatialPoints.ppp(SP2)
#kg <- knn.graph(SPDF2, k = 1)
#Creates the lines connecting each NND pairwise connection
#dfkg <- data.frame(kg) #Converts lines into a data frame
#dfkgl <- dfkg$length
meanlength <- 78
discstest <- discs(SP2, radii = meanlength,
                   separate = TRUE, mask = FALSE, trim = FALSE,
                   delta = NULL, npoly = NULL)
#Function creates discs for each trichome
#Using nearest neighbor lengths as radii
#NEED TO ADD CLIPPING
ratiolist <- c()
for (i in 1:length(discstest)) {
  ow2sp <- owin2SP(discstest[[i]])
  leafsp <- owin2SP(W)
  tic("gIntersection")
  intersect <- rgeos::gIntersection(ow2sp, leafsp)
  Sys.sleep(1)
  toc()
  tic("over")
  res <- as.data.frame(sp::over(SP, intersect, returnList = FALSE))
  Sys.sleep(1)
  toc()
  res[is.na(res)] <- 0
  newowin <- as.owin(intersect)
  circarea <- area.owin(newowin)
  trichactual <- sum(res)
  trichexpect <- (TC / aw) * circarea
  ratio <- trichactual / trichexpect
  ratiolist[[i]] <- ratio
}
If I understand you correctly, you want to loop through each point and check how many points fall within a disc of radius R centred at that point. This can be done very efficiently in spatstat with the function closepaircounts:
closepaircounts(SP2, r = meanlength)
This simply returns a vector with the number of points contained in the disc of radius r for each point in SP2.
I have just tried this for 100,000 points, where each point on average had almost 3000 other points in the disc around it, and it took 8 seconds on my laptop. If you have many more points, or in particular if the disc radius is so big that each disc contains many more points, the calculation may become very slow.
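A minimal self-contained sketch (with simulated points standing in for SP2, so the window, radius and counts are purely illustrative):
library(spatstat)
set.seed(1)
X <- runifpoint(10000, win = square(10))  # stand-in for SP2
counts <- closepaircounts(X, r = 0.5)     # number of other points within distance 0.5 of each point
summary(counts)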

How to calculate the number of vertices contracted into one graph?

I have a few large igraph objects that represent social networks. All nodes have various attributes, among them sector, which is a factor variable. I have contracted this large network into a smaller one where vertices represent groups and edges carry the sum of the individual edges in the original network. The label attribute in the second network represents the sector attribute in the first.
groupnet <- contract(g, as.integer(as.factor(V(g)$sector)), "ignore")
E(groupnet)$weight <- 1
groupnet <- simplify(groupnet, edge.attr.comb = list(weight = "sum"))
V(groupnet)$label <- levels(as.factor(V(g)$sector))
I would like to add another attribute to the second object, V(groupnet)$groupsize, that gives the number of original vertices that were contracted into each vertex of groupnet. I tried the following code, but it did not work:
V(groupnet)$groupsize <- length(V(g)$sector[V(g)$sector == V(groupnet)$label])
How can I do this properly?
table() could be helpful here. Try out:
set.seed(1234)
library(igraph)
g <- make_ring(1000)
V(g)$sector <- factor(sample(LETTERS, 1000, replace = TRUE))
V(g)$sector
## contracted network
groupnet <- contract(g, as.integer(as.factor(V(g)$sector)), "ignore")
E(groupnet)$weight <- 1
V(groupnet)$label <- levels(as.factor(V(g)$sector))
## number of original vertices that were contracted into groupnet
# the trick is to see that table(V(g)$sector) gives the number of vertices per sector,
# and that its output is arranged in the same order as V(groupnet)
table(V(g)$sector)
V(groupnet)
# solution
V(groupnet)$groupsize <- as.numeric(table(V(g)$sector))
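As a quick sanity check (assuming the code above has been run), the group sizes should sum to the number of vertices in the original graph:
sum(V(groupnet)$groupsize) == vcount(g)  # should be TRUE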

Solving Chinese Postman algorithm with eulerization

I would like to solve the Chinese Postman problem in a graph where an Eulerian cycle does not exist. So basically I am looking for a path in a graph which visits every edge exactly once, and starts and ends at the same node. An undirected graph will have an Euler cycle if and only if it is connected and every node has even degree. Obviously my graph doesn't.
I found out that eulerization (making a graph Eulerian) could solve my question LINK. Can anyone suggest a script to add duplicate edges to a graph so that the resulting graph has no vertices of odd degree (and thus does have an Euler circuit)?
Here is my example:
require(igraph)
require(graph)
require(eulerian)
require(GA)
g1 <- graph(c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8, 3,6, 6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10, 10,12, 11,12), directed = FALSE)
mat <- get.adjacency(g1)
mat <- as.matrix(mat)
rownames(mat) <- LETTERS[1:12]
colnames(mat) <- LETTERS[1:12]
g2 <- as(graphAM(adjMat=mat), "graphNEL")
hasEulerianCycle(g2)
Fun problem.
The graph you suggest in the code above can be given duplicate edges that enable an Eulerian cycle to be created. The function I provide below tries to add the minimum number of duplicate edges, but will also readily break the graph structure by adding new links if it has to.
You can run:
eulerian.g1 <- make.eulerian(g1)$graph
Check what the function did to your graph with:
make.eulerian(g1)$info
Bear in mind that:
This is not the only graph structure where duplicates added to the original g1 graph can form an Eulerian cycle. Imagine, for example, my function looping over the vertices of the graph backwards instead.
Your graph already has an even number of vertices of odd degree, and each such vertex has a neighbour of odd degree to pair it with. This function therefore works well for your particular example data.
The function could fail to produce a graph using only duplicates, even in graphs where Eulerian cycles are possible with correctly added duplicates, since it always connects a node to the first of its neighbours with odd degree. If this is something you'd absolutely like to get around, an MCMC approach would be the way to go.
Here's my function in a full script that you can source out of the box:
library(igraph)
# You asked about this graph
g1 <- graph(c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8, 3,6, 6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10, 10,12, 11,12), directed = FALSE)
# Make a random graph and drop isolated vertices
# (note: this does not strictly guarantee a connected graph)
connected.erdos.renyi.game <- function(n, m){
  graph <- erdos.renyi.game(n, m, "gnm", directed = FALSE)
  delete_vertices(graph, which(degree(graph) == 0))
}
# This is a random graph
g2 <- connected.erdos.renyi.game(n=12, m=16)
make.eulerian <- function(graph){
  # Carl Hierholzer (1873) showed that an Eulerian cycle exists for graphs that are
  # 1) connected, and 2) contain only vertices of even degree. Based on this proof,
  # the possibility of an Eulerian cycle existing in a graph can be tested by checking
  # these two conditions.
  #
  # This function assumes a connected graph.
  # It adds edges to a graph to ensure that every node eventually has an even degree.
  # It tries to maintain the structure of the graph by primarily adding duplicates of
  # already existing edges, but can also add "structurally new" edges if the structure
  # of the graph does not allow otherwise.

  # save output
  info <- c("Broken" = FALSE, "Added" = 0, "Successful" = TRUE)

  # Is a number even?
  is.even <- function(x){ x %% 2 == 0 }

  # Graphs with an even number of odd-degree vertices will more easily converge
  # to Eulerian.
  # Should we even out the number of odd-degree vertices?
  search.for.even.neighbor <- !is.even(sum(!is.even(degree(graph))))

  # Loop to add edges, but never to change nodes that have been set to have even degree
  for(i in V(graph)){
    set.j <- NULL
    # Neighbors of i with an odd number of edges are good candidates for new edges
    uneven.neighbors <- !is.even(degree(graph, neighbors(graph, i)))
    if(!is.even(degree(graph, i))){
      # This node needs a new connection. That edge e(i,j) needs an appropriate j:
      if(sum(uneven.neighbors) == 0){
        # There is no neighbor of i that has odd degree. We will
        # have to break the graph structure and connect nodes that
        # were not connected before:
        if(sum(!is.even(degree(graph))) > 0){
          # Only break the structure if it's absolutely necessary
          # to force the graph into a structure where an Eulerian
          # cycle exists:
          info["Broken"] <- TRUE
          # Find candidates for j amongst any odd-degree nodes
          uneven.candidates <- !is.even(degree(graph, V(graph)))
          # Suggest a new edge between i and any node with odd degree
          if(sum(uneven.candidates) != 0){
            set.j <- V(graph)[uneven.candidates][[1]]
          }else{
            # No candidate with odd degree exists!
            # If all edges except the last have even degrees, this
            # function will fail to make the graph Eulerian:
            info["Successful"] <- FALSE
          }
        }
      }else{
        # A "structurally duplicated" edge may be formed between i and one of
        # the odd-degree nodes that is already connected to it.
        # Suggest a new edge between i and its first neighbor with odd degree
        set.j <- neighbors(graph, i)[uneven.neighbors][[1]]
      }
    }else if(search.for.even.neighbor == TRUE & is.null(set.j)){
      # This only happens once (probably) at the beginning of the loop when
      # treating graphs that have an odd number of odd-degree vertices.
      # It creates a duplicate between a node and one of its evenly
      # degreed neighbors (if possible)
      info["Added"] <- info["Added"] + 1
      set.j <- neighbors(graph, i)[!uneven.neighbors][[1]]
      # Never do this again if a j is correctly set
      if(!is.null(set.j)){ search.for.even.neighbor <- FALSE }
    }
    # Add a new edge to alter degrees in the desired direction
    # (set.j may still be NULL, in which case no edge is added)
    if(!is.null(set.j)){
      # i may not link to itself
      if(i != set.j){
        graph <- add_edges(graph, edges = c(i, set.j))
        info["Added"] <- info["Added"] + 1
      }
    }
  }
  # return the graph together with the bookkeeping info
  list("graph" = graph, "info" = info)
}
# Look at what we did
eulerian <- make.eulerian(g1)
eulerian$info
g <- eulerian$graph
par(mfrow=c(1,2))
plot(g1)
plot(g)
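As a quick check (assuming the script above has run), every vertex of g should now have even degree, which together with connectedness is exactly Hierholzer's condition for an Eulerian cycle:
all(degree(g) %% 2 == 0)  # should be TRUE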

Find nearest neighbour of points with the same value when comparing 2 different data sets in R

I have 2 data frames (df1 and df2) that consist of three columns: x coordinate, y coordinate, and category (with 5 levels A-E). So I essentially have 2 sets of point data, with each point assigned to a category,
e.g.
X Y Cat
1 1.5 A
2 1.5 B
3.3 1.9 C
etc...
(although both of my data frames have 100s of points in them)
I would like to find the nearest neighbour of the same category for each point in my first data frame (df1) from the second data frame (df2).
I've used nncross from the package spatstat to find the nearest neighbour in df2 for each point in df1, and then to list out each of these distances, as follows:
# Convert the dataframes to ppp objects
df1.ppp <- ppp(df1$X,df1$Y,c(0,10),c(0,10),marks=df1$Cat)
df2.ppp <- ppp(df2$X,df2$Y,c(0,10),c(0,10),marks=df2$Cat)
# Produce an output that lists the distance from each point in df1 to its nearest neighbour in df2
out<-nncross(X=df1.ppp,Y=df2.ppp,what=c("dist","which"))
But I am struggling to work out how to use the category labels stored in the ppp objects (as defined by marks) to find the nearest neighbour from the same category. I am sure it should be fairly straightforward, but if anyone has any suggestions, or an alternative method that achieves the same result, I would be really grateful.
First some artificial data to work with:
library(spatstat)
# Artificial data similar to the question
set.seed(42)
X1 <- rmpoint(100, win = square(10), types = factor(LETTERS[1:5]))
X2 <- rmpoint(100, win = square(10), types = factor(LETTERS[1:5]))
Then a simple solution (but it loses id info):
# Separate patterns for each type:
X1list <- split(X1)
X2list <- split(X2)
# For each point in X1 find nearest neighbour of same type in X2:
out <- list()
for(i in 1:5){
  out[[i]] <- nncross(X1list[[i]], X2list[[i]], what = c("dist","which"))
}
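The per-type result tables can be stitched back together if needed; for example (illustrative, assuming the loop above, and using the fact that split() names the list by mark level):
names(out) <- names(X1list)             # one result table per category
sapply(out, function(o) mean(o$dist))   # mean same-type NN distance per category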
Finally, an ugly solution which recovers the id of the neighbour:
# Make separate marks for pattern 1 and 2 and collect into one pattern
marks(X1) <- factor(paste0(marks(X1), "1"))
marks(X2) <- factor(paste0(marks(X2), "2"))
X <- superimpose(X1, X2)
# For each point get the nearest neighbour of each type from both X1 and X2
# (both dist and index)
nnd <- nndist(X, by = marks(X))
nnw <- nnwhich(X, by = marks(X))
# Type to look for. I.e. the mark with 1 and 2 swapped
# (with 0 as intermediate step)
type <- marks(X)
type <- gsub("1", "0", type)
type <- gsub("2", "1", type)
type <- gsub("0", "2", type)
# Result
rslt <- cbind(as.data.frame(X), dist = 0, which = 0)
for(i in 1:nrow(rslt)){
  rslt$dist[i] <- nnd[i, type[i]]
  rslt$which[i] <- nnw[i, type[i]]
}
# Separate results
rslt1 <- rslt[1:npoints(X1),]
rslt2 <- rslt[npoints(X1) + 1:npoints(X2),]
rslt1$which <- rslt1$which - npoints(X1)
I also had another go at tackling this, using the package geosphere to create a distance matrix from my original data frames, and found quite a simple way to solve it. (Note that distm treats the columns as longitude/latitude and returns geodesic distances, so this is only appropriate if X and Y really are geographic coordinates.)
# load geosphere library
library("geosphere")
#create a distance matrix between all points in the 2 dataframes
dist<-distm(df1[,c('X','Y')],df2[,c('X','Y')])
# find the nearest neighbour to each point
df1$nearestneighbor <- apply(dist,1,min)
# create a distance matrix where only the distances between points of the same category are recorded
sameCat <- outer(df1$Cat, df2$Cat, "!=")
dist2 <- dist + ifelse(sameCat, Inf, 0)
# find the nearest neighbour of the same category
df1$closestmatch <- apply(dist2,1,min)
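If you also need to know which point in df2 is the nearest same-category neighbour (not just how far away it is), the same masked matrix can be queried with which.min; closestmatch.id is an illustrative name:
# index of the nearest same-category point in df2 for each row of df1
df1$closestmatch.id <- apply(dist2, 1, which.min)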

Finding the best matching pairwise points from 2 vectors

I have 2 lists with X,Y coordinates of points.
List 1 contains more points than list 2.
The task is to find pairs of points such that the overall Euclidean distance is minimized.
I have working code, but I don't know if this is the best way, and I would like to get a hint on what I can improve for the result (a better algorithm to find the minimum) or for speed, because the lists have about 2000 elements each.
The round in the sample vectors is there so that some points have equal distances.
With the rdist function all distances are generated in distances. Then the minimum of the matrix is used to link 2 points (dist_min). All distances of these 2 points are then replaced by NA, and the loop continues by searching for the next minimum until every point of list 2 has a partner from list 1.
At the end I have added a plot for visualization.
require(fields)
set.seed(1)
x1y1.data <- matrix(round(runif(200*2),2), ncol = 2) # generate 1st set of points
x2y2.data <- matrix(round(runif(100*2),2), ncol = 2) # generate 2nd set of points
distances <- rdist(x1y1.data, x2y2.data)
dist_min <- matrix(data = NA, nrow = ncol(distances), ncol = 7) # prepare resulting matrix with 7 columns
for(i in 1:ncol(distances))
{
  inds <- which(distances == min(distances, na.rm = TRUE), arr.ind = TRUE)
  dist_min[i,1] <- inds[1,1] # row of point (use 1st element of inds if points have the same distance)
  dist_min[i,2] <- inds[1,2] # column of point (use 1st element of inds if points have the same distance)
  dist_min[i,3] <- distances[inds[1,1], inds[1,2]] # distance of point
  dist_min[i,4] <- x1y1.data[inds[1,1],1] # X1 coordinate of 1st point
  dist_min[i,5] <- x1y1.data[inds[1,1],2] # Y1 coordinate of 1st point
  dist_min[i,6] <- x2y2.data[inds[1,2],1] # X2 coordinate of 2nd point
  dist_min[i,7] <- x2y2.data[inds[1,2],2] # Y2 coordinate of 2nd point
  distances[inds[1,1],] <- NA # remove row (fill with NA) where minimum was found
  distances[,inds[1,2]] <- NA # remove column (fill with NA) where minimum was found
}
# plot 1st set of points
# print mean distance as measure for optimization
plot(x1y1.data,col="blue",main="mean of min_distances",sub=mean(dist_min[,3],na.rm=TRUE))
points(x2y2.data,col="red") # plot 2nd set of points
segments(dist_min[,4],dist_min[,5],dist_min[,6],dist_min[,7]) # connect pairs according to the minimal distances found
This is a fundamental problem in combinatorial optimization known as the assignment problem. One approach to solving it is the Hungarian algorithm, which is implemented in the R package clue:
require(clue)
distances <- rdist(x1y1.data, x2y2.data) # recompute, since the greedy loop above overwrote this matrix with NAs
sol <- solve_LSAP(t(distances))
(solve_LSAP requires a cost matrix with no more rows than columns, hence the transpose: each of the 100 points in x2y2.data is assigned a distinct partner among the 200 points in x1y1.data.)
We can verify that it outperforms the naive solution:
mean(dist_min[,3])
# [1] 0.05696033
mean(sqrt(
(x2y2.data[,1] - x1y1.data[sol, 1])^2 +
(x2y2.data[,2] - x1y1.data[sol, 2])^2))
#[1] 0.05194625
And we can construct a similar plot to the one in your question:
plot(x1y1.data,col="blue")
points(x2y2.data,col="red")
segments(x2y2.data[,1], x2y2.data[,2], x1y1.data[sol, 1], x1y1.data[sol, 2])
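For reference, sol behaves like a permutation vector: sol[i] is the row of x1y1.data matched to the i-th point of x2y2.data, which is how it is indexed above.
head(as.vector(sol))  # row indices into x1y1.data for the first few points of x2y2.data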
