How to build a graph using random distance matrix? - r

I want to generate random transportation graphs in R with n vertices.
The graphs should be bipartite and connected (there is a path between any two vertices, though usually not a direct one), without loops.
In addition, the edge weights should be random and strictly positive.
I tried the following:
n=6 # number of vertices
F <- erdos.renyi.game(n, p.or.m=0.5, directed=FALSE)
m=ecount(F)
min = 1 # 1 km
max = 50 # 50 km
F <- set.edge.attribute(F, name="distance", value=runif(m , min , max))
plot(F, layout=layout.fruchterman.reingold)
distances(F)
The problem is that I did not get the expected distances in the graph distance matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 2 1 1 2 1
[2,] 2 0 2 1 1 1
[3,] 1 2 0 1 1 1
[4,] 1 1 1 0 2 1
[5,] 2 1 1 2 0 1
[6,] 1 1 1 1 1 0
I expected uniform random values between 1 and 50.
Also, I'm not sure whether the graph could contain a loop if I run this code again in the future.
I hope my questions are clear.
Thank you for any help!

The solution:
library(igraph)
n <- 6 # number of vertices
F <- erdos.renyi.game(n, p.or.m = 0.5, directed = FALSE) # create the random graph F
m <- ecount(F) # number of edges produced by erdos.renyi.game()
min <- 1 # 1 km, lower bound for the edge distances
max <- 50 # 50 km, upper bound for the edge distances
F <- set.edge.attribute(F, name = "distance", value = runif(m, min, max)) # draw the edge distances at random
plot(F, layout = layout.fruchterman.reingold) # plot the graph F
distances(F) # unweighted: number of edges on the shortest path between i and j
distances(F, weights = E(F)$distance) # weighted: total distance along the shortest path between i and j
distances(F, v = 1, to = 6, weights = E(F)$distance) # the same for a single pair of vertices (e.g. 1 to 6)
get.all.shortest.paths(F, 1, to = V(F)) # list of all shortest paths starting at vertex 1
graph.maxflow(F, 1, 6) # maximum flow between vertices 1 and 6
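The question also asks about connectivity and loops, which the code above does not enforce. The following is a minimal sketch rather than part of the original answer. It assumes igraph 1.0 or later (for sample_gnp() and is_connected()), and the resampling loop is just one simple way to obtain a connected graph. It also stores the random values in an edge attribute named "weight", which distances() picks up by default, so no weights argument is needed. It does not address the bipartite requirement.

```r
library(igraph)

set.seed(1)
n <- 6  # number of vertices

# Resample until the graph is connected.
# loops = FALSE (the default) means no vertex is linked to itself.
repeat {
  g <- sample_gnp(n, p = 0.5, directed = FALSE, loops = FALSE)
  if (is_connected(g)) break
}

# Random, strictly positive edge lengths between 1 and 50 (km).
# An edge attribute called "weight" is used by distances() automatically.
E(g)$weight <- runif(ecount(g), min = 1, max = 50)

d <- distances(g)  # weighted shortest-path distance matrix
```

Because every edge is at least 1 km long and the graph is connected, all off-diagonal entries of d are finite and at least 1.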

Related

Confusion matrix using table in k-means and hierarchical clustering

I have some problems calculating a confusion matrix. I have created three sets of points from multivariate normal distributions:
library('MASS')
library('ggplot2')
library('reshape2')
library("ClusterR")
library("cluster")
library("dplyr")
library ("factoextra")
library("dendextend")
library("circlize")
mu1<-c(1,1)
mu2<-c(1,-9)
mu3<-c(-7,-2)
sigma1<-matrix(c(1,1,1,2), nrow=2, ncol=2, byrow = TRUE)
sigma2<-matrix(c(1,-1,-1,2), nrow=2, ncol=2, byrow = TRUE)
sigma3<-matrix(c(2,0.5,0.5,0.3), nrow=2, ncol=2, byrow = TRUE)
simulation1<-mvrnorm(100,mu1,sigma1)
simulation2<-mvrnorm(100,mu2,sigma2)
simulation3<-mvrnorm(100,mu3,sigma3)
X<-rbind(simulation1,simulation2,simulation3)
colnames(X)<-c("x","y")
X<-data.frame(X)
I have also constructed clusters using k-means clustering and hierarchical clustering with k = 3 clusters:
# k-means clustering
k<-3
B<-kmeans(X, centers = k, nstart = 10)
x_cluster = data.frame(X, group=factor(B$cluster))
ggplot(x_cluster, aes(x, y, color = group)) + geom_point()
# hierarchical clustering
single<-hclust(dist(X), method = "single")
clusters2<-cutree(single, k = 3)
fviz_cluster(list (data = X, cluster=clusters2))
How can I calculate a confusion matrix for the full dataset (X) using table() in both of these cases?
Using your data, insert set.seed(42) just before you create sigma1 so that we have a reproducible example. Then after you created X:
X.df <- data.frame(Grp=rep(1:3, each=100), x=X[, 1], y=X[, 2])
k <- 3
B <- kmeans(X, centers = k, nstart = 10)
table(X.df$Grp, B$cluster)
#
# 1 2 3
# 1 1 0 99
# 2 0 100 0
# 3 100 0 0
Original group 1 is identified as cluster 3, with one specimen assigned to cluster 1. Original group 2 is assigned to cluster 2 and original group 3 is assigned to cluster 1. The cluster numbers themselves are irrelevant. The classification is perfect if each row and each column contains all of its values in a single cell. In this case only 1 specimen was misplaced.
single <- hclust(dist(X), method = "single")
clusters2 <- cutree(single, k = 3)
table(X.df$Grp, clusters2)
# clusters2
# 1 2 3
# 1 99 1 0
# 2 0 0 100
# 3 0 100 0
The results are the same, but the cluster numbers are different. One specimen from the original group 1 was assigned to the same group as the group 3 specimens. To compare these results:
table(Kmeans=B$cluster, Hierarch=clusters2)
# Hierarch
# Kmeans 1 2 3
# 1 0 101 0
# 2 0 0 100
# 3 99 0 0
Notice that each row/column contains only one cell that is nonzero. The two cluster analyses agree with one another even though the cluster designations differ.
D <- lda(Grp~x + y, X.df)
table(X.df$Grp, predict(D)$class)
#
# 1 2 3
# 1 99 0 1
# 2 0 100 0
# 3 0 0 100
Linear discriminant analysis tries to predict the group of each specimen given the values of x and y. Because of this, the predicted labels are not arbitrary and the correct predictions all fall on the diagonal of the table. This is what is usually described as a confusion matrix.
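Since cluster numbers are arbitrary, one way to get a diagonal table for a clustering result as well is to relabel each cluster with the true group that contributes most of its members. The following is a rough sketch on made-up, well-separated data (not the data from the question). The names truth, map and relabeled are illustrative, and majority-vote relabelling is an assumption of this sketch, not something the answer above prescribes.

```r
set.seed(42)

# Three well-separated groups of 100 points each (made-up data)
truth <- rep(1:3, each = 100)
X <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2),
           matrix(rnorm(200, mean = 20), ncol = 2))

cl <- kmeans(X, centers = 3, nstart = 25)$cluster

# Rows are true groups, columns are arbitrary cluster numbers
tab <- table(truth, cl)

# For each cluster (column), find the true group (row) with the most members
map <- apply(tab, 2, which.max)

# Replace each cluster number by its majority group
relabeled <- unname(map[cl])

conf_mat <- table(truth, relabeled)   # correct assignments now sit on the diagonal
accuracy <- mean(relabeled == truth)  # share of correctly assigned points
```

If two clusters map to the same true group the relabelling is no longer one-to-one, which is itself a sign that the clustering did not recover the groups.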

conditional which.min function

I have two sets of data: one holds the coordinates of machines, the other the coordinates of repair stores.
I have a working model that assigns each machine to its nearest store. However, one store ends up with only 1 machine while another has 7 machines assigned to it.
What I want is to add a condition so that each store is assigned at least 2 machines but no more than 4.
library(geosphere)
library(ggplot2)
#machine Locations
machine.x <- c(-122.37, -111.72, -111.87, -112.05, -87.17, -86.57, -86.54, -88.04, -86.61, -88.04, -86.61)
machine.y <- c(37.56, 35.23, 33.38, 33.57, 30.36, 30.75, 30.46, 30.68, 30.42, 30.68, 30.42)
machines <- data.frame(machine.x, machine.y)
#store locations
store.x <- c(-121.98, -112.17, -86.57)
store.y <- c(37.56, 33.59, 30.75)
stores <- data.frame(store.x, store.y)
centers<-data.frame(x=stores$store.x, y=stores$store.y)
pts<-data.frame(x=(machines$machine.x), y=(machines$machine.y))
#allocate space
distance<-matrix(-1, nrow = length(pts$x), ncol= length(centers$x))
#calculate the dist matrix - the define centers to each point
#columns represent centers and the rows are the data points
dm<-apply(data.frame(1:length(centers$x)), 1, function(x){ replace(distance[,x], 1:length(pts$x), distGeo(centers[x,], pts))})
#find the column with the smallest distance
closestcenter<-apply(dm, 1, which.min)
#color code the original data for verification
colors<-c(stores)
#create a scatter plot of assets color coded by which fe they belong to
plot(pts, col=closestcenter, pch=9)
So what I want is for each group to have a minimum count of 2 and a maximum count of 4. I tried adding an if/else statement to the closestcenter variable, but it did not come close to working the way I expected, and I have looked around online but cannot find any way to add a counting condition to the which.min call.
Note: My actual data set has several thousand machines and over 100 stores.
Let M be an 11 x 3 zero-one matrix where M[i,j] = 1 if machine i is assigned to store j and 0 otherwise. Each row of M must sum to 1 (every machine gets exactly one store), each column must sum to between 2 and 4 inclusive, and among all such matrices we want one that minimizes the total distance sum(M * dm). This gives the 0-1 linear program shown below. In it, A is constructed so that A %*% c(M) is the same as rowSums(M), and B so that B %*% c(M) is the same as colSums(M).
library(lpSolve)
k <- 3
n <- 11
dir <- "min"
objective.in <- c(dm)
A <- t(rep(1, k)) %x% diag(n)
B <- diag(k) %x% t(rep(1, n))
const.mat <- rbind(A, B, B)
const.dir <- c(rep("==", n), rep(">=", k), rep("<=", k))
const.rhs <- c(rep(1, n), rep(2, k), rep(4, k))
res <- lp(dir, objective.in, const.mat, const.dir, const.rhs, all.bin = TRUE)
res
## Success: the objective function is 9025807
soln <- matrix(res$solution, n, k)
and this solution:
> soln
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 1 0
[5,] 0 1 0
[6,] 0 0 1
[7,] 0 0 1
[8,] 1 0 0
[9,] 0 0 1
[10,] 0 1 0
[11,] 0 0 1
or in terms of the vector of store numbers assigned to each machine:
c(soln %*% (1:k))
## [1] 1 1 2 2 2 3 3 1 3 2 3
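The claim that A %*% c(M) reproduces rowSums(M) and B %*% c(M) reproduces colSums(M) can be checked on a small example. This is only an illustrative sketch: the 4 x 2 assignment matrix M below is made up, while A and B are built exactly as in the answer.

```r
n <- 4  # machines (rows)
k <- 2  # stores (columns)

# A made-up assignment: machines 1 and 3 go to store 1, machines 2 and 4 to store 2
M <- matrix(c(1, 0, 1, 0,
              0, 1, 0, 1), nrow = n, ncol = k)

# Same Kronecker-product construction as in the linear program
A <- t(rep(1, k)) %x% diag(n)  # n x (n*k): adds up the k entries of each row of M
B <- diag(k) %x% t(rep(1, n))  # k x (n*k): adds up the n entries of each column of M

# c(M) stacks the columns of M, which is the variable order lp() sees
row_totals <- drop(A %*% c(M))
col_totals <- drop(B %*% c(M))

stopifnot(all(row_totals == rowSums(M)),
          all(col_totals == colSums(M)))
```

This works because c(M) unrolls M column by column, so the first n variables belong to store 1, the next n to store 2, and so on.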

R, Create K-nearest neighbors weights in a Matrix

I have a 2-column data frame corresponding to X and Y cartesian coordinates of a sample of 500 georeferenced observations.
I want to generate a weight matrix W where each element is equal to:
* 1 if observation j is one of the k nearest neighbors of observation i, and
* 0 otherwise.
Suppose we have this data frame:
df=as.data.frame(cbind(x=rnorm(500), y=rnorm(500)))
Suppose k = 20. How can I create this matrix in R?
Using CRAN's FastKNN package... Let's say you have your distance matrix of 5 * 5 as follows:
library(FastKNN)
df <- as.data.frame(cbind(x = rnorm(5), y=rnorm(5)))
dist_mat <- as.matrix(dist(df, method = "euclidean", upper = TRUE, diag=TRUE))
## Let's say k = 2...
k <- 2
nrst <- lapply(1:nrow(dist_mat), function(i) k.nearest.neighbors(i, dist_mat, k = k))
## Build w: start from an all-zero matrix, then flag each neighbour with 1
w <- matrix(0, nrow = nrow(dist_mat), ncol = ncol(dist_mat))
for (i in seq_along(nrst)) for (j in nrst[[i]]) w[i, j] <- 1
So my df looked like this:
> df
x y
1 -0.2109351 -0.315256132
2 0.5172415 0.003352551
3 1.5700413 -0.737475081
4 -0.2699282 -0.198414683
5 1.3997493 -0.241382737
And my w ended up looking like this:
> w
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 0 1 0
[2,] 1 0 0 1 0
[3,] 0 1 0 0 1
[4,] 1 1 0 0 0
[5,] 0 1 1 0 0
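The same weight matrix can also be built without an extra package. This is a minimal base-R sketch, not taken from the answer above: it assumes Euclidean distance, a data set small enough for the full distance matrix to fit in memory, and it breaks ties by row order. The sample size of 50 and k = 5 are arbitrary illustration values.

```r
set.seed(1)
df <- data.frame(x = rnorm(50), y = rnorm(50))
k <- 5

# Full pairwise Euclidean distance matrix
d <- as.matrix(dist(df))

# An observation must not count as its own neighbour
diag(d) <- Inf

# W[i, j] = 1 if j is among the k nearest neighbours of i, else 0
W <- matrix(0, nrow = nrow(d), ncol = ncol(d))
for (i in seq_len(nrow(d))) {
  nearest <- order(d[i, ])[seq_len(k)]
  W[i, nearest] <- 1
}
```

Each row of W sums to k, but W is generally not symmetric, because j being a nearest neighbour of i does not imply the reverse.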

Initialize kmeans, *vector* initial centroids, R

In this post there is a method to initialize the centers for the K-means algorithm in R. However, the data used therein is scalar (i.e. numbers).
A variation on this question: what if the data has multiple dimensions? In that case, the new centers should be vectors, so start should be a vector of vectors. I tried something like:
C1<- c(1,2)
C2<- c(4,-5)
to have my two initial centers, and then use
kmeans(dat, c(C1,C2))
but it didn't work. I also tried cbind() instead of c(). Same result...
You expand the matrix start to have one row per cluster and one column per variable (dimension): the number of rows is the number of clusters you are attempting to identify, and the number of columns is the number of variables in the data set.
Here is an extension of the post you linked to, expanding the example to 3 dimensions (variables), x, y, and z:
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0 , 5)),
y = rnorm(99, mean = c(-5, 0, 5)),
z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)
Now we need to specify cluster centres for each of our three clusters. This is done via a matrix as before:
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)
> start
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
Here, the important thing to note is that the clusters are in rows. The columns hold the coordinates of each cluster centre in the corresponding dimension. Hence for cluster 1 we are specifying that the centroid is at (-5,-5,-5).
Calling kmeans()
kmeans(dat, start)
results in it picking groups very close to our initial starting points (as it should for this example):
> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
Clustering vector:
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Within cluster sum of squares by cluster:
[1] 117.78043 77.65203 77.00541
(between_SS / total_SS = 93.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
It is worth noting here the output for the cluster centres:
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
This layout is exactly the same as the matrix start.
You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:
c1 <- c(-5, -5, -5)
c2 <- c( 0, 0, 2)
c3 <- c( 5, 5, -4)
start2 <- rbind(c1, c2, c3)
> start2
[,1] [,2] [,3]
c1 -5 -5 -5
c2 0 0 2
c3 5 5 -4
Or
start3 <- matrix(c(-5, -5, -5,
0, 0, 2,
5, 5, -4), ncol = 3, nrow = 3, byrow = TRUE)
> start3
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
If those are more comfortable for you.
The key thing to remember is that variables are in columns, cluster centres in the rows.
## Your centers
C1 <- c(1, 2)
C2 <- c(4, -5)
## Simulate some data with groups around these centers
library(MASS)
set.seed(0)
dat <- rbind(mvrnorm(100, mu=C1, Sigma = matrix(c(2,3,3,10), 2)),
mvrnorm(100, mu=C2, Sigma = matrix(c(10,3,3,2), 2)))
clusts <- kmeans(dat, rbind(C1, C2)) # get clusters with your center starting points
## Look at them
plot(dat, col=clusts$cluster)
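To see why c(C1, C2) from the question does not work while rbind(C1, C2) does, it helps to look at the shape of each object. This is a small sketch on made-up data drawn with rnorm() (so it needs no extra packages). The names bad and good are illustrative, and the expectation that kmeans() rejects the flat vector reflects the behaviour of stats::kmeans as I understand it.

```r
C1 <- c(1, 2)
C2 <- c(4, -5)

# Made-up two-dimensional data scattered around the two centres
set.seed(0)
dat <- rbind(cbind(rnorm(100, C1[1]), rnorm(100, C1[2])),
             cbind(rnorm(100, C2[1]), rnorm(100, C2[2])))

bad  <- c(C1, C2)      # flat vector of length 4: which value belongs to which centre is lost
good <- rbind(C1, C2)  # 2 x 2 matrix: one row per centre, one column per variable

# kmeans() needs as many columns in the centres as in the data
attempt <- try(kmeans(dat, centers = bad), silent = TRUE)
failed  <- inherits(attempt, "try-error")

fit <- kmeans(dat, centers = good)
```

A quick check such as ncol(good) == ncol(dat) before calling kmeans() catches this kind of mistake early.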

Computing pairwise distances between a set of intervals

Let's say I have a set of closed linear intervals represented by this matrix:
interval.mat = matrix(c(1,2,3,5,4,6,8,9), byrow = TRUE, ncol = 2)
where interval.mat[,1] are the interval start points and interval.mat[,2] are their corresponding end points.
I'm looking for an efficient way to produce a matrix holding all the pairwise distances between the intervals (this example matrix is a toy; in reality my matrix contains a few thousand intervals). The distance between a pair of intervals should be the start of the interval with the larger end minus the end of the interval with the smaller end. For example, the distance between intervals c(1,2) and c(3,5) should be 3 - 2 = 1, since the second interval ends after the first one. If the intervals overlap, the distance should be 0; for example, for c(3,5) and c(4,6) the distance is 0.
So, the pairwise distance matrix for the intervals above would be:
> matrix(c(0,1,2,6,1,0,0,3,2,0,0,2,6,3,2,0), byrow = TRUE, nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 0 1 2 6
[2,] 1 0 0 3
[3,] 2 0 0 2
[4,] 6 3 2 0
Here's an Rcpp solution. It will be fast and memory efficient (for details see below).
First let's define a helper function which calculates all the pairwise distances. If n is the number of intervals to consider, we have n*(n-1)/2 unique pairs of intervals (we don't compare an interval with itself, of course, as that distance is 0).
library('Rcpp')
library('inline')
cppFunction("
   NumericVector distint_help(NumericMatrix x) {
      int n = x.nrow(); // number of rows
      NumericVector out(n*(n-1)/2); // result numeric vector
      int k = 0;
      for (int i=0; i<n-1; ++i) {
         for (int j=i+1; j<n; ++j) {
            if (x(i,0) >= x(j,1))
               out[k++] = x(i,0)-x(j,1);
            else if (x(j,0) > x(i,1))
               out[k++] = x(j,0)-x(i,1);
            else
               out[k++] = 0.0;
         }
      }
      return out;
   }
")
The above function returns a numeric vector with the calculated distances. Let's try to mimic the output of the built-in dist function (check out the result of x <- dist(interval.mat); unclass(x)).
Now the main function:
distint <- function(interval) {
stopifnot(is.numeric(interval), is.matrix(interval), ncol(interval) == 2)
res <- distint_help(interval) # use Rcpp to calculate the distances
# return the result similar to the one of dist()
structure(res, class='dist', Size=nrow(interval), Diag=FALSE, Upper=FALSE)
}
distint(interval.mat)
## 1 2 3
## 2 1
## 3 2 0
## 4 6 3 2
The above object may be converted to an "ordinary" square matrix:
as.matrix(distint(interval.mat))
## 1 2 3 4
## 1 0 1 2 6
## 2 1 0 0 3
## 3 2 0 0 2
## 4 6 3 2 0
Unless the distance matrix is sparse (there are many many zeros), the above solution is storage efficient.
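The storage claim can be illustrated with the built-in dist() on made-up data: a "dist" object keeps only the n*(n-1)/2 values below the diagonal, whereas the full square matrix keeps n^2. The size n = 1000 here is an arbitrary choice for the sketch.

```r
n <- 1000
x <- matrix(runif(2 * n), ncol = 2)

d_compact <- dist(x)               # stores the lower triangle only
d_full    <- as.matrix(d_compact)  # stores every pair twice, plus the zero diagonal

n_compact <- length(d_compact)     # n * (n - 1) / 2 values
n_full    <- length(d_full)        # n^2 values

size_compact <- object.size(d_compact)
size_full    <- object.size(d_full)
```

For a few thousand intervals the compact form therefore needs roughly half the memory of the square matrix.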
A benchmark:
test <- matrix(runif(1000), ncol=2)
library('microbenchmark')
library(proxy)
f <- function(x,y) max(min(x)-max(y),0)
microbenchmark(distint(test), as.matrix(dist(test, method=f)), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## distint(test) 1.584548 1.615146 1.650645 3.071433 3.164231 10
## as.matrix(dist(test, method = f)) 455.300974 546.438875 551.596582 599.977164 609.418194 10
You can use the proxy package, which has a dist(...) method that accepts a user-defined distance function. Note that loading this library will mask the standard dist(...) function (from the stats package).
library(proxy)
f <- function(x,y) max(min(x)-max(y),0)
as.matrix(dist(interval.mat,method=f))
# 1 2 3 4
# 1 0 1 2 6
# 2 1 0 0 3
# 3 2 0 0 2
# 4 6 3 2 0
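For a few thousand intervals, a vectorised base-R version is another option. This is a sketch under the assumption that the full n x n matrix fits in memory; it is not taken from either answer above. It uses outer() to form every "start minus end" gap and then keeps the larger of the two directions, floored at 0, so the result does not depend on the order of the rows.

```r
interval.mat <- matrix(c(1, 2, 3, 5, 4, 6, 8, 9), byrow = TRUE, ncol = 2)

starts <- interval.mat[, 1]
ends   <- interval.mat[, 2]

# gap[i, j] = start of interval i minus end of interval j
gap <- outer(starts, ends, "-")

# Distance = the positive gap in whichever direction it exists, else 0 (overlap)
d <- pmax(gap, t(gap), 0)
d
##      [,1] [,2] [,3] [,4]
## [1,]    0    1    2    6
## [2,]    1    0    0    3
## [3,]    2    0    0    2
## [4,]    6    3    2    0
```

At most one of gap[i, j] and gap[j, i] can be positive for valid intervals, which is why taking the element-wise maximum gives the required distance.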
