igraph: summarize each node's neighbours characteristics - r

With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I come up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
n <- ego_size(g, nodes = i, order = 1, mindist = 1)
node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}

Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object, and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it s the neighbor of multiple nodes.
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN

Related

turning a weighted edgelist into an unweighted in r

I have data on every interaction that could and did happen at a university club weekly social hour
id1 id2 timestalked date
1 2 1 1/1/2010
1 3 0 1/1/2010
...
100 2 4 1/8/2010
...
I want to first load this in as a directed graph for the entire time period for visualization. For the weighted matrix I did.
library(igraph);
el <- read.csv("el.csv", header = TRUE);
G <- graph.data.frame(el,directed=TRUE);
A <- as_adjacency_matrix(G,type="both",names=TRUE,sparse=FALSE,attr="timestalked");
I thought removing attr="timestalked" would turn the weights > 0 into 1 but that does not seem to work
library(igraph);
el <- read.csv("el.csv", header = TRUE);
G_unweight <- graph.data.frame(el,directed=TRUE);
A_unweight <- as_adjacency_matrix(G_unweight,type="both",names=TRUE,sparse=FALSE)
as_adjacency_matrix() doesn't provide any argument to control weights. Note that it just provides the number of edges between nodes from the graph.
To turn the weighted edgelist into an unweighted one, try this
A <- as_adjacency_matrix(G, type = "both", names = TRUE, sparse = FALSE)
A[A > 1] <- 1
Note that you can also use the graph_from_adjacency_matrix() function to create an unweighted igraph graph from the adjacency matrix by specifying weighted = NULL.

R igraph: shortest path extraction

This is the first time I am working with graphs and R igraph package and I need some help with processing graph objects.
What I want to achieve:
From a given contact matrix extract shortest confident path between nodes. By confident I mean that edge weights are higher then neighbouring edges.
Examples:
A
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 1 1 5
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 1 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 5 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is one cluster (clique?) between B-D-E (ie., egde weights between those nodes are high). However, as there is weight between A and F I am also getting cluster there, even though edge weight is low (only 5).
Question A: How to extract only those clusters that have high edge weight? I can transform those contacts to 0 with m[which(m <= 5)] <- 0, but I hope that there is more "mathy" solution for this in igraph package.
B
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 5 1 1
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 5 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 1 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is cluster between B-D-E, but as there is low weight between A and B - A is also connected to this cluster.
Question B: How to not assign nodes to a cluster if edge weight is low?
This is my first question here, if you need clarification or better examples I will improve my questions.
First, it is good to know that when looking up paths, igraph understands weights as costs, i.e. on edges with higher weight it costs more to travel, so it will consider shorter the paths with lower sum weight. It is easy to turn this into the opposite, just take the reciprocal of your weights (1 / E(ig)$weight). Between 2 vertices there might be only one shortest path, but sometimes there are more equally short paths. You can look up all of them (all_shortest_paths), or tell igraph to return only one of the shortests for each pairs of vertices (shortest_paths). Each call of these methods returns the paths from one selected vertex, to have the paths between all pairs, you need to call these once for each vertex (ok, at an undirected graph, it is enough to call for half of the vertices). To formulate what I explained until this point:
spaths <- lapply(V(ig),
function(v){
all_shortest_paths(ig, v,
weight = 1 / E(ig)$weight
)
}
)
Here spaths will be a list of lists, access the paths from C to all the vertices like this:
spaths$C$res
[[1]]
+ 2/6 vertices, named:
[1] C A
[[2]]
+ 2/6 vertices, named:
[1] C B
[[3]]
+ 1/6 vertex, named:
[1] C
[[4]]
+ 2/6 vertices, named:
[1] C D
[[5]]
+ 2/6 vertices, named:
[1] C E
[[6]]
+ 2/6 vertices, named:
[1] C F
spaths$C$res[[2]] # this is the path from `C` to `B`,
# a vector of 2 vertices
Note, the third element is actually from C to itself, you can either ignore it, or provide a vector of all other vertices to the to parameter of all_shortest_paths. Also, in your example all shortest paths will be of length 1, but if I set for example the weight of B--E to 1 instead of 100, we see that the method works, and from B to E the shortest path will be B-D-E.
Regarding your second question, here it is not completely clear what do you want to achieve, especially how do you get these clusters? If you want to find communities, i.e. more closely connected group of vertices, taking into account also the edge weights, there are many methods for this, all those named cluster_[...] or community.[...] in igraph. For example, if we run the fastgreedy method on your graph, it will detect the cluster you mentioned:
fg <- fastgreedy.community(ig, weights = E(ig)$weight)
IGRAPH clustering fast greedy, groups: 2, mod: 0.059
+ groups:
$`1`
[1] "A" "C" "F"
$`2`
[1] "B" "D" "E"
So here we have the B, D, E cluster, what is connected with higher weight edges. If we run the same method without weights, all the vertices will belong to one group (fastgreedy.community(ig, weights = NULL)). Note, at community detection, igraph understands weights as strength, so vertices connected with higher weight edges more likely to cluster together, this is kind of opposite like how it works at calculating paths.

Depth of a node in partykit

I am building a tree using the partykit R package, and I am wondering if there is a simple, efficient way to determine the depth number at each internal node. For example, the root node would have depth 0, the first two kid nodes have depth 1, the next kid nodes have depth 2, and so forth. This will eventually be used to calculate the minimal depth of a variable. Below is a very basic example (taken from vignette("constparty", package="partykit")):
library("partykit")
library("rpart")
data("Titanic", package = "datasets")
ttnc<-as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ ., data = ttnc)
ttncTree<-as.party(rp)
plot(ttncTree)
#This is one of my many attempts which does NOT work
internalNodes<-nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)]
depth(ttncTree)-unlist(nodeapply(ttncTree, ids=internalNodes, FUN=function(n){depth(n)}))
In this example, I want to output something similar to:
nodeid = 1 2 4 7
depth = 0 1 2 1
I apologize if my question is too specific.
Here's a possible solution which should be efficient enough as usually the trees have no more than several dozens of nodes.
I'm ignoring node #1, as it is always 0 an hence no point neither calculating it or showing it (IMO)
Inters <- nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)][-1]
table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(ttncTree, from = x)))))
# 2 4 7
# 1 2 1
I had to revisit this problem recently. Below is a function to determine the depth of each node. I count the depth based on the number of times a vertical line | appears running the print.party() function.
library(stringr)
idDepth <- function(tree) {
outTree <- capture.output(tree)
idCount <- 1
depthValues <- rep(NA, length(tree))
names(depthValues) <- 1:length(tree)
for (index in seq_along(outTree)){
if (grepl("\\[[0-9]+\\]", outTree[index])) {
depthValues[idCount] <- str_count(outTree[index], "\\|")
idCount = idCount + 1
}
}
return(depthValues)
}
> idDepth(ttncTree)
1 2 3 4 5 6 7 8 9
0 1 2 2 3 3 1 2 2
There definitely seems to be a simpler, faster solution, but this is faster than using the intersect() function. Below is an example of the computation time for a large tree (around 1,500 nodes)
# Compare computation time for large tree #
library(mlbench)
set.seed(470174)
dat <- data.frame(mlbench.friedman1(5000))
rp <- rpart(as.formula(paste0("y ~ ", paste(paste0("x.", 1:10), collapse=" + "))),
data=dat, control = rpart.control(cp = -1, minsplit=3, maxdepth = 10))
partyTree <- as.party(rp)
> length(partyTree) #Number of splits
[1] 1503
>
> # Intersect() computation time
> Inters <- nodeids(partyTree)[-nodeids(partyTree, terminal = TRUE)][-1]
> system.time(table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(partyTree, from = x))))))
user system elapsed
22.38 0.00 22.44
>
> # Proposed computation time
> system.time(idDepth(partyTree))
user system elapsed
2.38 0.00 2.38

Get closest point based on coordinates

I'd like to snap single points to other points based on their distance in R.
In detail, I have a bunch of points defined by X and Y coordinate pairs.
In addition I have single different points, that I want to snap to
the closest neighbors (euclidean distance).
# target points
df <- data.frame(X=c(1,2,2,2,3,4),Y=c(1,2,2,3,3,4))
# points that need snapping
point1 <- data.frame(X=2.1, Y=2.3)
point2 <- data.frame(X=2.5, Y=2.5)
plot(df)
points(point1$X,point1$Y,pch=20,col="red")
points(point2$X,point2$Y,pch=20,col="blue")
But how to proceed with the snapping of the points?
How can I snap the points and assign new coordinate pairs to the single points?
Is there a simple function available in R? Or do I need to apply
the dist() function to obtain a distance matrix and search for
the closest distance? Maybe there is a more straight forward way.
Thats how it should look like:
1) snap to the closest (euclidean distance) point (clear solution for point 1)
point1$X_snap <- 2
point1$Y_snap <- 2
2) if two or more points similarily close than
snap to that which is more "north-east"
a) snap first to the one which is more north (Y direction)
b) if there are more than one that a similarly distant in Y direction snap
to the one that is more east
point2$X_snap <- 3
point2$Y_snap <- 3
For a graphical illustration how the results should look like
#plot snapped points:
points(point1$X_snap,point1$Y_snap,pch=8,col="red")
points(point2$X_snap,point2$Y_snap,pch=8,col="blue")
It's possible using dist:
dist(rbind(point1,df))
1 2 3 4 5
2 1.7029386
3 0.3162278 1.4142136
4 0.3162278 1.4142136 0.0000000
5 0.7071068 2.2360680 1.0000000 1.0000000
6 1.1401754 2.8284271 1.4142136 1.4142136 1.0000000
7 2.5495098 4.2426407 2.8284271 2.8284271 2.2360680
6
2
3
4
5
6
7 1.4142136
So the row with the minimum value (distance) in the first column identifies the point in df which is closest to point1 . In your example, you have a repeated location. Repeat for each of your point_x .
I found another solution using the matchpt() function
from Biobase (Bioconductor):
# target points
df <- data.frame(X=c(1,2,2,2,3,4),Y=c(1,2,2,3,3,4))
# points that need snapping
point1 <- data.frame(X=2.1, Y=2.3)
point2 <- data.frame(X=2.5, Y=2.5)
snap <- function(df,point){
require(Biobase)
d <- matchpt(as.matrix(df),
as.matrix(data.frame(X=point$X+0.0001,Y=point$Y+0.0001))) # to the "northwest" criteria correct
min_row <- as.numeric(rownames(d[d$distance==min(d$distance),]))
point$X_snap <- unique(df[min_row,"X"])
point$Y_snap <- unique(df[min_row,"Y"])
point
}
snap(df,point2)
You might also want to try the RANN package for Fast Nearest Neighbour Search:
# your data
df <- data.frame(X=c(1,2,2,2,3,4),Y=c(1,2,2,3,3,4))
pts <- data.frame(X=c(2.1, 2.5), Y=c(2.3, 2.5))
library(RANN)
# for each point in pts, find the nearest neighbor from df
closest <- RANN::nn2(data = df, query = pts, k = 1)
# argument k sets the number of nearest neighbours, here 1 (the closest)
closest
# $nn.idx
# [,1]
# [1,] 3
# [2,] 5
#
# $nn.dists
# [,1]
# [1,] 0.3162278
# [2,] 0.7071068
# Get coordinates of nearest neighbor
pts$X_snap <- df[closest$nn.idx, "X"]
pts$Y_snap <- df[closest$nn.idx, "Y"]
pts
# X Y X_snap Y_snap
# 1 2.1 2.3 2 2
# 2 2.5 2.5 3 3
I would put the criteria (distance, "southness", "westness") in a dataframe and then sort this dataframe along these criteria:
# input data
df <- data.frame(X=c(1,2,2,2,3,4),Y=c(1,2,2,3,3,4))
point1 <- data.frame(X=2.1, Y=2.3)
point2 <- data.frame(X=2.5, Y=2.5)
df.res[with(df.res, order(dst, dy, dx)), ]
# function that sorts all potential snapping points according to distance, "westness", "southness"
snap.xy <- function(point, other.points) {
df.res <- data.frame(X = other.points$X, # to later access the coordinates to snap to
Y = other.points$Y, # dto
dx <- point$X - other.points$X, # "westness" (the higher, the more "west")
dy <- point$Y - other.points$Y, # "southness"
dst = sqrt(dx^2 + dy^2)) # distance
# print(df.res[with(df.res, order(dst, dy, dx)), ]) # just for checking the results
return(df.res[with(df.res, order(dst, dy, dx)), ][1,c("X", "Y")]) # return only the X/Y coordinates
}
# examples
snap.xy(point1, df) # 2/2
snap.xy(point2, df) # 3/3
snap.xy(point2, df)$X # 3
snap.xy(point2, df)$Y # 3

Create Spatial Data in R

I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find to be usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3. 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0 ,0), (10,0), (0,0), and (0,10), that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20), and K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, or number of species, or total biomass
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
p1.x <- ((p1-1) %% 10) *10+5 ## %% is modulo operator
p1.y <- ((p1-1) %/% 10)*10+5 ## %/% is integer division
## sort out sub-positions
p2.x <- ifelse(pos5 <=2,2.5,7.5) ## {1,2} vs {3,4} values
p2.y <- ifelse(pos5 %%2 ==1 ,2.5,7.5) ## odd {1,3} vs even {2,4} values
c(p1.x+p2.x,p1.y+p2.y)
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept, I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right. But you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names
Similar to what #Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs, and you attach data to, etc. In that case just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's given in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# c(2.5,2.5) specifies the distance in any direction from the cell center
grd <- GridTopology(c(1,1), c(2.5,2.5), c(20,20)))
grd <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(polys)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys#data
polys <- SpatialPolygonsDataFrame(grd,
data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
row.names=getSpPPolygonsIDSlots(grd)))

Resources