I am working with centered longitude (x) and latitude (y) data. My goal is to cluster the connected locations.
Two locations on earth, (x1,y1) and (x2,y2), are said to be connected if earth_distance((x1,y1),(x2,y2)) < 15 kilometers.
I am using the distHaversine function in R, to calculate earth distance.
Here is some sample data,
x=c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241, 1.126053, 1.181824 ,1.377892, 5.869881, 5.925270, 5.909721)
and
y=c(1.333368,1.304790,1.347332,1.318743,1.332676,1.375229,1.572287,1.544174,2.371105,2.337032,2.383415)
also
distance <- distHaversine(c(x,y))
I wish to find the different clusters formed by the connected sets of points (each connected set of points forms a cluster).
I looked at How to cluster points and plot, but I could not solve my problem.
Any reference, suggestion or answer will be very much appreciated.
Maybe this. First make some coordinates:
> x=c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241, 1.126053, 1.181824 ,1.377892, 5.869881, 5.925270)
> y=c(1.333368, 1.304790, 1.347332, 1.318743, 1.332676, 1.375229, 1.572287, 1.544174, 2.371105 ,2.337032, 2.383415)
Make into a data frame
> xy = data.frame(x=x,y=y)
Now use outer to loop over all pairs of rows and columns to compute a full distance matrix. This does twice as much work as is really necessary since it computes i to j and j to i for all i and j. Anyway, it gets us a distance matrix:
> dmat = outer(1:nrow(xy), 1:nrow(xy), function(i,j)distHaversine(xy[i,],xy[j,]))
Now we want a connectivity matrix, which is any pair closer than 15,000 metres:
> cmat = dmat < 15000
Now we use the igraph package to build a connectivity graph object:
> require(igraph)
> cgraph = graph.adjacency(cmat)
You can plot this to see the cluster formation, but note these are not plotted in your x-y space:
> plot(cgraph)
Now to get the connected clusters:
> clusters(cgraph)
$membership
[1] 1 1 1 1 1 1 2 2 3 4 4
$csize
[1] 6 2 1 2
$no
[1] 4
Which you can add to your data frame thus:
> xy$cluster = clusters(cgraph)$membership
> xy
x y cluster
1 1.000000 1.333368 1
2 1.055672 1.304790 1
3 1.038712 1.347332 1
4 1.094459 1.318743 1
5 1.133179 1.332676 1
6 1.116241 1.375229 1
7 1.126053 1.572287 2
8 1.181824 1.544174 2
9 1.377892 2.371105 3
10 5.869881 2.337032 4
11 5.925270 2.383415 4
And plot:
> plot(xy$x,xy$y,col=xy$cluster)
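For what it's worth, graph.adjacency and clusters are older igraph names; recent igraph releases document graph_from_adjacency_matrix and components instead. Here is a sketch of the same steps under that assumption, using distm from geosphere to build the distance matrix in one call. Treat it as one possible variant, not the only way to do it.

```r
library(geosphere)  # distm(), distHaversine()
library(igraph)     # graph_from_adjacency_matrix(), components()

x <- c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241,
       1.126053, 1.181824, 1.377892, 5.869881, 5.925270)
y <- c(1.333368, 1.304790, 1.347332, 1.318743, 1.332676, 1.375229,
       1.572287, 1.544174, 2.371105, 2.337032, 2.383415)
xy <- data.frame(x = x, y = y)

# Full pairwise Haversine distance matrix in metres (columns are lon, lat)
dmat <- distm(as.matrix(xy), fun = distHaversine)

# Connected if closer than 15 km; stored as 0/1 rather than TRUE/FALSE
cmat <- (dmat < 15000) * 1

# Undirected graph; diag = FALSE drops the self-loops from the zero diagonal
cgraph <- graph_from_adjacency_matrix(cmat, mode = "undirected", diag = FALSE)

# Cluster id of each point
xy$cluster <- components(cgraph)$membership
```

Since connectivity is symmetric here, an undirected graph seems the more natural fit, and the membership vector should match the one shown above.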
I currently have a data frame storing separate x,y,z coordinates from an accelerometer sensor (with timestamps), but want to perform vector operations on it.
Test data (I actually have thousands of rows, and a timestamp column to be preserved)
x <- c(1,3,1,0,3)
y <- c(2,4,8,8,9)
z <- c(0,1,1,2,0)
df <- data.frame(x,y,z)
proj <- function(a,b) {
as.double((a %*% b) / (b %*% b)) * b
}
v = c(1,2,3)
I want to mutate df (or create a new data frame?) by applying proj(_,v) to each row.
I have tried something along the lines of mutate(projected = proj(c(x,y,z), v)), but it doesn't work; I am probably misusing it.
What is the best way to achieve this? Should I instead be using a list of vectors to store the coordinates?
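Your proj(a,b) function takes two inputs, and proj(c(x,y,z),v) does pass two. Note, however, that inside mutate, c(x,y,z) concatenates the three whole columns into one long vector rather than forming one vector per row, so it cannot be multiplied with v. Or did I misunderstand?
The following does run, although it projects the x column onto the y column rather than each row onto v:
dplyr::mutate(projected = proj(x,y), df) resulting in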
x y z projected
1 1 2 0 0.4279476
2 3 4 1 0.8558952
3 1 8 1 1.7117904
4 0 8 2 1.7117904
5 3 9 0 1.9257642
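If the aim is to project each row (x, y, z) onto v, one possible sketch is to go through a matrix and apply proj row by row. This assumes the coordinate columns are named x, y and z as in the test data; px, py and pz are made-up names for the output columns.

```r
proj <- function(a, b) {
  as.double((a %*% b) / (b %*% b)) * b
}

df <- data.frame(x = c(1, 3, 1, 0, 3),
                 y = c(2, 4, 8, 8, 9),
                 z = c(0, 1, 1, 2, 0))
v <- c(1, 2, 3)

# Work on the coordinate columns only, so other columns (e.g. a timestamp) stay untouched
m <- as.matrix(df[, c("x", "y", "z")])

# apply() returns one column per input row, hence the transpose
projected <- t(apply(m, 1, proj, b = v))
colnames(projected) <- c("px", "py", "pz")

result <- cbind(df, projected)
# First row: (1,2,0) . v = 5 and v . v = 14, so the projection is 5/14 * v
```

Keeping the result as three numeric columns avoids list columns, which tend to be awkward with thousands of rows.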
I have a directed igraph graph and want to fetch all the cycles. The girth function works but only returns the smallest cycle. Is there a way in R to fetch all the cycles of length greater than 3 in a graph (no vertex pointing to itself and no loops)?
It is not directly a function in igraph, but of course you can code it up. To find a cycle, you start at some node, go to some neighboring node and then find a simple path back to the original node. Since you did not provide any sample data, I will illustrate with a simple example.
Sample data
## Sample graph
library(igraph)
set.seed(1234)
g = erdos.renyi.game(7, 0.29, directed=TRUE)
plot(g, edge.arrow.size=0.5)
Finding Cycles
Let me start with just one node and one neighbor. Node 2 connects to Node 4. So some cycles may look like 2 -> 4 -> (Nodes other than 2 or 4) -> 2. Let's get all of the paths like that.
v1 = 2
v2 = 4
lapply(all_simple_paths(g, v2,v1, mode="out"), function(p) c(v1,p))
[[1]]
[1] 2 4 2
[[2]]
[1] 2 4 3 5 7 6 2
[[3]]
[1] 2 4 7 6 2
We see that there are three cycles starting at 2 with 4 as the second node. (I know that you said length greater than 3. I will come back to that.)
Now we just need to do that for every node v1 and every neighbor v2 of v1.
Cycles = NULL
for(v1 in V(g)) {
for(v2 in neighbors(g, v1, mode="out")) {
Cycles = c(Cycles,
lapply(all_simple_paths(g, v2,v1, mode="out"), function(p) c(v1,p)))
}
}
This gives 17 cycles in the whole graph. There are two issues though that you may need to look at depending on how you want to use this. First, you said that you wanted cycles of length greater than 3, so I assume that you do not want the cycles that look like 2 -> 4 -> 2. These are easy to get rid of.
LongCycles = Cycles[which(sapply(Cycles, length) > 3)]
LongCycles has 13 cycles, having eliminated the 4 short cycles:
2 -> 4 -> 2
4 -> 2 -> 4
6 -> 7 -> 6
7 -> 6 -> 7
But that list points out the other problem. There are still some cycles that you might think of as duplicates. For example:
2 -> 7 -> 6 -> 2
7 -> 6 -> 2 -> 7
6 -> 2 -> 7 -> 6
You might want to weed these out. To get just one copy of each cycle, you can always choose the vertex sequence that starts with the smallest vertex number. Thus,
LongCycles[sapply(LongCycles, min) == sapply(LongCycles, `[`, 1)]
[[1]]
[1] 2 4 3 5 7 6 2
[[2]]
[1] 2 4 7 6 2
[[3]]
[1] 2 7 6 2
This gives just the distinct cycles.
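As a sanity check, the same three steps can be tried on a graph small enough to verify by hand. The toy graph below is my own made-up example (not from the question): it has one 3-vertex cycle 1 -> 2 -> 3 -> 1 and one back-and-forth pair 1 -> 2 -> 1.

```r
library(igraph)

# Hypothetical toy graph: edges 1->2, 2->3, 3->1 and 2->1
g_toy <- make_graph(c(1,2, 2,3, 3,1, 2,1), directed = TRUE)

# Step 1: for every vertex v1 and out-neighbor v2, collect simple paths v2 -> v1
cycles <- NULL
for (v1 in V(g_toy)) {
  for (v2 in neighbors(g_toy, v1, mode = "out")) {
    paths <- all_simple_paths(g_toy, v2, v1, mode = "out")
    cycles <- c(cycles, lapply(paths, function(p) c(v1, as.integer(p))))
  }
}

# Step 2: drop the back-and-forth cycles such as 1 -> 2 -> 1
long <- cycles[sapply(cycles, length) > 3]

# Step 3: keep only the rotation that starts at its smallest vertex
distinct <- long[sapply(long, min) == sapply(long, `[`, 1)]
```

Here cycles should hold five entries (two short ones plus three rotations of the triangle), and distinct should be left with the single sequence 1 2 3 1.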
Addition regarding efficiency and scalability
I am providing a much more efficient version of the code that I
originally provided. However, it is primarily for the purpose of
arguing that, except for very simple graphs, you will not be able
to produce all cycles.
Here is some more efficient code. It eliminates checking many
cases that either cannot produce a cycle or will be eliminated
as a redundant cycle. In order to make it easy to run the tests
that I want, I made it into a function.
## More efficient version
FindCycles = function(g) {
Cycles = NULL
for(v1 in V(g)) {
if(degree(g, v1, mode="in") == 0) { next }
GoodNeighbors = neighbors(g, v1, mode="out")
GoodNeighbors = GoodNeighbors[GoodNeighbors > v1]
for(v2 in GoodNeighbors) {
TempCyc = lapply(all_simple_paths(g, v2,v1, mode="out"), function(p) c(v1,p))
TempCyc = TempCyc[which(sapply(TempCyc, length) > 3)]
TempCyc = TempCyc[sapply(TempCyc, min) == sapply(TempCyc, `[`, 1)]
Cycles = c(Cycles, TempCyc)
}
}
Cycles
}
However, except for very simple graphs, there is a combinatorial
explosion of possible paths, and so finding all possible cycles is
completely impractical. I will illustrate this with graphs much smaller
than the one that you mention in the comments.
First, I will start with some small graphs where the number of edges
is approximately twice the number of vertices. Code to generate my
examples is below but I want to focus on the number of cycles, so I
will just start with the results.
## ecount ~ 2 * vcount
Nodes Edges Cycles
10 21 15
20 41 18
30 65 34
40 87 424
50 108 3433
55 117 22956
But you report that your data has approximately 5 times as
many edges as vertices. Let's look at some examples like that.
## ecount ~ 5 * vcount
Nodes Edges Cycles
10 48 3511
12 61 10513
14 71 145745
With this as the growth of the number of cycles, using 10K nodes
with 50K edges seems to be out of the question. BTW, it took several
minutes to compute the example with 14 vertices and 71 edges.
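To get a feel for that growth, one upper bound can be worked out by hand. In a complete directed graph on n vertices (every ordered pair connected, no self-loops), choosing k vertices and arranging them in a ring gives choose(n, k) * (k - 1)! distinct cycles on k vertices. A minimal sketch, with count_cycles_complete being just an illustrative name:

```r
# Number of simple directed cycles (2 or more vertices) in a complete
# directed graph on n vertices: sum over k of choose(n, k) * (k - 1)!
count_cycles_complete <- function(n) {
  k <- 2:n
  sum(choose(n, k) * factorial(k - 1))
}

count_cycles_complete(3)  # 3 two-cycles + 2 triangles = 5
count_cycles_complete(4)  # 6 + 8 + 6 = 20

# Growth for a few larger n
sapply(c(6, 8, 10, 12, 14), count_cycles_complete)
```

Real graphs are sparser and have fewer cycles than this bound, but the factorial term suggests why the counts in the tables climb so quickly.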
For reproducibility, here is how I generated the above data.
set.seed(1234)
g10 = erdos.renyi.game(10, 0.2, directed=TRUE)
ecount(g10)
length(FindCycles(g10))
set.seed(1234)
g20 = erdos.renyi.game(20, 0.095 , directed=TRUE)
ecount(g20)
length(FindCycles(g20))
set.seed(1234)
g30 = erdos.renyi.game(30, 0.056 , directed=TRUE)
ecount(g30)
length(FindCycles(g30))
set.seed(1234)
g40 = erdos.renyi.game(40, 0.042 , directed=TRUE)
ecount(g40)
length(FindCycles(g40))
set.seed(1234)
g50 = erdos.renyi.game(50, 0.038 , directed=TRUE)
ecount(g50)
length(FindCycles(g50))
set.seed(1234)
g55 = erdos.renyi.game(55, 0.035 , directed=TRUE)
ecount(g55)
length(FindCycles(g55))
##########
set.seed(1234)
h10 = erdos.renyi.game(10, 0.55, directed=TRUE)
ecount(h10)
length(FindCycles(h10))
set.seed(1234)
h12 = erdos.renyi.game(12, 0.46, directed=TRUE)
ecount(h12)
length(FindCycles(h12))
set.seed(1234)
h14 = erdos.renyi.game(14, 0.39, directed=TRUE)
ecount(h14)
length(FindCycles(h14))
Use this example data to see what I mean
tag <- as.character(c(1,2,3,4,5,6,7,8,9,10))
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
radius <- (size*40)
x <- c(9,4,25,14,28,19,9,22,10,2)
y <- c(36,7,15,16,22,24,39,20,34,9)
data <- data.frame(tag, species, size, radius, x, y)
# Plot the points using qplot (from ggplot2, which is part of the tidyverse)
qplot(x, y, data = data) +
geom_point(aes(colour = species, size = size))
Now that you can see the plot, what I want to do is for each individual “species A” point, I’d like to identify the largest point within a radius of size*40.
For example, in the bottom left of the plot you can see that species A (tag 2) would produce a radius large enough to contain the close species D point.
However, the species A point on the far right-hand-side of the plot (tag 3) would produce a radius large enough to contain both of the close species B and species C points, in which case I’d want some sort of output that identifies the largest individual within the species A radius.
I’d like to know what I can run (if anything) on this data set to find the largest “within radius” point for each species A point and get an output like this:
Species A point ---- Largest point within radius
Species A tag 1 ----- Species C tag 9
Species A tag 2 ----- Species D tag 10
Species A tag 3 ----- Species B tag 5
Species A tag 4 ----- Species C tag 8
I've used spatstat and the CTFS package to make some plots in the past, but I can't figure out how to "find largest neighbor within radius". Perhaps I can tackle this in ArcMap? Also, this is just a small example dataset. Realistically, I will want to find the "largest neighbor within radius" for thousands of points.
Any help or feedback would be greatly appreciated.
The following finds, for each point, the largest point of a different species (reported as a species and tag pair) within that point's radius.
all_df <- data # avoid reusing the name `data`, which is also a base R function
res_df <- data.frame()
for (j in 1 : nrow(all_df)) {
  # candidates: every point whose species differs from that of point j
  df <- subset(all_df, species != species[j])
  # indices (within df) of candidates inside the radius of point j;
  # read x, y and radius from all_df rather than from global vectors
  ind <- which((df$x - all_df$x[j])^2 + (df$y - all_df$y[j])^2 < all_df$radius[j]^2)
  # find the max `size` among those candidates
  max_size <- max(df$size[ind])
  # all positions within ind that share max_size
  max_inds <- which(df$size[ind] == max_size)
  # pick the last one if there is more than one with max_size
  new_ind <- ind[max_inds[length(max_inds)]]
  # append the result to the output data.frame
  res_df <- rbind(res_df, data.frame(org_sp = all_df$species[j],
                                     org_tag = all_df$tag[j],
                                     res_sp = df$species[new_ind],
                                     res_tag = df$tag[new_ind]))
}
res_df
# org_sp org_tag res_sp res_tag
# 1 A 1 C 9
# 2 A 2 D 10
# 3 A 3 B 5
# 4 A 4 C 8
# 5 B 5 A 3
# 6 B 6 C 8
# 7 B 7 C 9
# 8 C 8 B 5
# 9 C 9 B 7
# 10 D 10 A 2
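For thousands of points, a variant built on a precomputed distance matrix may be easier to work with. This is only a minimal sketch, assuming Euclidean distances on the x and y columns and keeping the loop's tie rule (the last candidate wins); largest_neighbor and largest_tag are illustrative names, not part of any package.

```r
data <- data.frame(
  tag     = as.character(1:10),
  species = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  size    = c(0.10, 0.20, 0.25, 0.30, 0.30, 0.15, 0.15, 0.20, 0.15, 0.15),
  x       = c(9, 4, 25, 14, 28, 19, 9, 22, 10, 2),
  y       = c(36, 7, 15, 16, 22, 24, 39, 20, 34, 9),
  stringsAsFactors = FALSE
)
data$radius <- data$size * 40

# All pairwise Euclidean distances, computed once
d <- as.matrix(dist(data[, c("x", "y")]))

largest_neighbor <- function(j) {
  # other-species points strictly inside the radius of point j
  cand <- which(d[j, ] < data$radius[j] & data$species != data$species[j])
  if (length(cand) == 0) return(NA_character_)  # nothing within reach
  best <- cand[data$size[cand] == max(data$size[cand])]
  data$tag[best[length(best)]]                  # last one wins on ties
}

data$largest_tag <- vapply(seq_len(nrow(data)), largest_neighbor, character(1))
data[data$species == "A", c("tag", "largest_tag")]
```

The distance matrix needs memory proportional to the square of the number of points, so for very large data sets a spatial index or a neighbor search from a package such as spatstat would probably be the better route.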
I'm very new to R and trying to calculate the adjacent vertices of a graph, which is obtained from deleting certain nodes from an original graph.
However, the output of the result doesn't match with the plot of the graph.
For example:
library(igraph)
g <- make_ring(8)
g <- add_edges(g, c(1,2, 2,7, 3,6, 4,5, 8,2, 6,2))
V(g)$label <- 1:8
plot(g)
h <- delete.vertices(g, c(1,2))
plot(h)
If I compute adjacent_vertices(h, 6), I get 5.
However, I want the output to be 3, 5, 7, as the plot shows. The problem lies in the fact that it doesn't know I'm trying to find the adjacent vertices of the node labelled 6.
Could someone please help? Thanks.
The issue here is that when you delete the vertices, the remaining vertices are renumbered, so their indices now run from 1 to 6:
> V(h)
+ 6/6 vertices:
[1] 1 2 3 4 5 6
To find the neighbors using the original vertex numbers, you could then simply offset the values by the number of vertices removed (here offset = 2; this only works because the deleted vertices were the lowest-numbered ones), e.g.:
> neighbors(h, 6 - offset) + offset
+ 3/6 vertices:
[1] 3 5 7
A better approach, however, would be to refer to the vertex labels instead of using the indices:
> V(g)$label
[1] 1 2 3 4 5 6 7 8
> V(h)$label
[1] 3 4 5 6 7 8
> V(h)[V(h)$label == 6]
+ 1/6 vertex:
[1] 4
To get the neighbors of your vertex of interest, you can modify your code to look like:
> vertex_of_interest <- V(h)[V(h)$label == 6]
> neighbors(h, vertex_of_interest)$label
[1] 3 5 7
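Another option, as far as I know, is the name vertex attribute: igraph lets you refer to vertices by name, and the names stay attached when other vertices are deleted. A short sketch under that assumption, storing the original numbers as character names:

```r
library(igraph)

g <- make_ring(8)
g <- add_edges(g, c(1,2, 2,7, 3,6, 4,5, 8,2, 6,2))

# Symbolic names stay with their vertices when others are removed
V(g)$name <- as.character(1:8)

# Delete by name, then look up neighbors by name
h <- delete_vertices(g, c("1", "2"))
nb <- neighbors(h, "6")$name
nb
```

Here nb should contain the names 3, 5 and 7, without any index arithmetic.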
I would like to do a simulation in R. I would like to set up a loop using some large number of trials. Specifically, I would like to use a normal distribution with known mean and standard deviation, and N = 9. I would like to set up a counter which counts the number of times one of the replicates goes below (or above) a certain value. I would also like to see a histogram of the data generated.
Not a big fan of loops, so I'd do something like this:
func<-function(n){
counter=0
x<-rnorm(1,0,1)
if(x>2|x<(-2)) counter<-1
return(c(n,x,counter))
}
n=1:1000
sum(do.call(rbind,lapply(n,func))[,3])
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 41
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 39
The call do.call(rbind,lapply(n,func)) also gives you the actual data you need to make the histogram of the numbers created:
dat<-data.frame(do.call(rbind,lapply(n,func)))
names(dat)<-c("n","x","counter")
head(dat)
n x counter
1 1 -0.6591145 0
2 2 1.8163984 0
3 3 -2.0291848 1
4 4 0.3309398 0
5 5 -0.8214298 0
6 6 0.5275238 0
Try something along these lines.
#in this structure each row in the matrix is a sim rep
sim.data<- matrix(rnorm(9*1000,0,1),1000,9)
#this counts number of observations below threshold for each rep
below <- apply(sim.data, 1, function(x) sum(x<0.60))
hist(below)
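Putting the pieces together, here is one possible sketch that counts the trials in which at least one of the 9 values lands beyond a cutoff. The mean, standard deviation, cutoff and number of trials below are placeholder values I made up; swap in your own.

```r
set.seed(42)        # for reproducibility

n_trials  <- 10000  # assumed number of simulated trials
n         <- 9      # observations per trial, as in the question
mu        <- 0      # assumed known mean
sigma     <- 1      # assumed known standard deviation
threshold <- 2      # assumed cutoff, applied on both sides of the mean

# One row per trial, one column per observation
sims <- matrix(rnorm(n_trials * n, mu, sigma), nrow = n_trials, ncol = n)

# TRUE for every trial where any observation is farther than threshold from mu
hit <- apply(sims, 1, function(r) any(abs(r - mu) > threshold))
n_hits <- sum(hit)
n_hits / n_trials   # estimated probability of at least one exceedance

# Histogram of all generated values
hist(sims, main = "Simulated values", xlab = "x")
```

For a one-sided count, replace the condition with any(r < threshold) or any(r > threshold).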