Suppose I have the following data frames:
df <- data.frame(dev = c("A","A","B","B","C","C","C"),
                 proj = c("W","X","Y","X","W","X","Z"))
types <- data.frame(proj = c("W","X","Y","Z"),
                    type = c("blue","orange","orange","blue"))
> df
  dev proj
1   A    W
2   A    X
3   B    Y
4   B    X
5   C    W
6   C    X
7   C    Z
> types
  proj   type
1    W   blue
2    X orange
3    Y orange
4    Z   blue
I would like to turn these into the following network
The nodes are the unique entries in proj. For nodes u,v, there is an arc from u to v if u and v share an element from dev. The data is a list of developers and projects that each developer has worked on, and I would like to form a network which connects projects that have a developer in common. Each project is of a particular type, and that information would need to be encoded in the graph (I did this in this toy example via colour).
From this graph what I need is the degree of each node, as well as one or more measures of centrality. In particular I need the closeness centrality of each node, as well as a modified version of closeness centrality which measures the centrality within each type. So my end goal is to obtain a table like this:
proj  degree  closeness_centrality  type_centrality
W     2       0.75                  1
X     3       1                     1
Y     1       0.60                  1
Z     2       0.75                  1
For reference, the closeness centrality of a node u is defined as C(u)=(N-1)/(sum over all nodes v of the distance from u to v), where N is the number of nodes in the graph and the distance from u to v is the length of the shortest u-v-path. The type centrality is C(T,u)=|T-u|/(sum over all nodes v in T of the distance from u to v) where T is the set of all nodes of a given type, and |T-u| is the size of T with u excluded (so either |T| or |T|-1 depending on the type of u).
One of the big challenges is that my actual df has almost 300,000 rows and this graph will have around 155,000 vertices. The average degree will be very low though so I think that it is doable.
My questions are:
Is R the best tool to be using for this? Are there good packages for performing these types of calculations on graphs?
What is the best way to store this kind of data? Should I form an adjacency matrix, or something else?
Any insight or tips at all would be well appreciated; as an economics major I'm kind of in over my head comp-sci-wise here.
Thanks!
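One way to approach this in R, sketched here on the toy data only, is to build a developer-by-project incidence matrix, multiply it by its own transpose to find the projects that share a developer, and hand the result to igraph. The attribute name proj_type and the choice to measure type centrality against each node's own type are assumptions of this sketch, not something fixed by the question. The dense matrices and the full distance matrix are fine for four projects, but would likely have to be replaced by a sparse edge list and per-vertex distance queries at 155,000 vertices.

```r
library(igraph)

df <- data.frame(dev = c("A","A","B","B","C","C","C"),
                 proj = c("W","X","Y","X","W","X","Z"))
types <- data.frame(proj = c("W","X","Y","Z"),
                    type = c("blue","orange","orange","blue"))

# Developer-by-project incidence matrix (rows: developers, columns: projects)
inc <- unclass(table(df$dev, df$proj))

# Two projects are adjacent when at least one developer worked on both
adj <- (crossprod(inc) > 0) * 1
diag(adj) <- 0

g <- graph_from_adjacency_matrix(adj, mode = "undirected")
V(g)$proj_type <- types$type[match(V(g)$name, types$proj)]

# Shortest-path distances between all pairs of projects
d <- distances(g)

# same[u, v] is TRUE when v has the same type as u and v is not u
same <- outer(V(g)$proj_type, V(g)$proj_type, "==")
diag(same) <- FALSE

result <- data.frame(
  proj = V(g)$name,
  degree = unname(igraph::degree(g)),
  closeness_centrality = unname(closeness(g, normalized = TRUE)),
  type_centrality = unname(rowSums(same) / rowSums(d * same)),
  row.names = NULL
)
print(result)
```

Calling igraph::degree with the namespace prefix avoids surprises if another package that also exports degree (sna, for example) is loaded in the same session.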
I am doing some basic network analysis using networks from the R package "networkdata". To this end, I use the package "igraph" as well as "sna". However, I realised that the results of descriptive network statistics vary depending on the package I use. Most variation is not too grave but the average degree of my undirected graph halved as soon as I switched from "sna" to "igraph".
library(networkdata)
n_1 <- covert_28
library(igraph)
library(sna)
n_1_adjmat <- as_adjacency_matrix(n_1)
n_1_adjmat2 <- as.matrix(n_1_adjmat)
mean(sna::degree(n_1_adjmat2, cmode = "freeman")) # [1] 23.33333
mean(igraph::degree(n_1, mode = "all")) # [1] 11.66667
This doesn't happen in case of my directed graph. Here, I get the same results regardless of using "sna" or "igraph".
Is there any explanation for this phenomenon? And if so, is there anything I can do in order to prevent this from happening?
Thank you in advance!
This is explained in the documentation for sna::degree.
indegree of a vertex, v, corresponds to the cardinality
of the vertex set N^+(v) = {i in V(G) : (i,v) in E(G)};
outdegree corresponds to the cardinality of the vertex
set N^-(v) = {i in V(G) : (v,i) in E(G)}; and total
(or “Freeman”) degree corresponds to |N^+(v)| + |N^-(v)|.
(Note that, for simple graphs,
indegree=outdegree=total degree/2.)
A simpler example than yours makes it clear.
library(igraph)
library(sna)
g = make_ring(3)
plot(g)
AM = as.matrix(as_adjacency_matrix(g))
sna::degree(AM)
[1] 4 4 4
igraph::degree(g)
[1] 2 2 2
Vertex 1 has links to both vertices 2 and 3. These count in the
in-degree and also count in the out-degree, so
Freeman = in + out = 2 + 2 = 4
The "Note" in the documentation states this.
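The double counting can be reproduced in base R without either package. This is only a sketch of the arithmetic, not of how sna or igraph compute degree internally; the variable names are made up for the illustration. As far as I know, sna::degree also takes a gmode argument, and passing gmode = "graph" should make it treat the matrix as undirected, but check the documentation for your version.

```r
# Adjacency matrix of the 3-vertex ring (a triangle).
# It is symmetric because the graph is undirected.
AM <- matrix(c(0, 1, 1,
               1, 0, 1,
               1, 1, 0), nrow = 3, byrow = TRUE)

indegree  <- colSums(AM)   # edges arriving at each vertex
outdegree <- rowSums(AM)   # edges leaving each vertex

# Freeman (total) degree adds both, so each undirected edge is counted twice
freeman <- indegree + outdegree

# Undirected degree counts each incident edge once
undirected_degree <- rowSums(AM)

print(freeman)            # 4 4 4
print(undirected_degree)  # 2 2 2
```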
I have a symmetric matrix which I modified a bit:
The above matrix is symmetric, except that I have also added values on the diagonal (I explain their purpose below).
The matrix records how many times each person (A, B, C, D, E) worked with each other person on a publication, e.g. B and C worked together 3 times, and A and E worked together 4 times. The diagonal values record how many publications a person worked on in total, e.g. B worked on 4 publications (either alone or with someone else) and C worked on 3 publications.
Now I want to draw a network graph in R that describes the relations between the persons in terms of edge thickness and node size, e.g. the graph should look like this:
In the graph, the size of a node's circle depends on the number of publications that person worked on, e.g. circle B is the largest because its diagonal value is the maximum, and A & E are the smallest because they have the lowest diagonal values. The edge thickness between two nodes depends on how many times they worked together, e.g. the edge between A & E is the thickest because they worked together 4 times, while the edge between B & C is thinner because they worked together 3 times.
I can express the relation between two persons through edge thickness, but including the diagonal values is causing problems for me. Is it possible to do this in R? Any leads would be highly appreciated.
You can do this with the igraph package. Because the diagonal means something different from the other entries in the matrix, I have separated the matrix into two pieces, the diagonal and the rest.
Your data
SM = as.matrix(read.table(text="A B C D E
1 2 1 1 4
2 4 3 2 1
1 3 3 1 2
1 2 1 2 1
4 1 2 1 1",
header=TRUE))
rownames(SM) = colnames(SM)
library(igraph)
AM = SM
diag(AM) = 0
D = diag(SM)
g = graph_from_adjacency_matrix(AM,
mode = "undirected",
weighted = TRUE)
plot(g,
edge.width=E(g)$weight,
vertex.size = 10+3*D)
Show that if the edge set of a graph G(V,E) with n nodes
can be partitioned into 2 trees,
then there is at least one vertex of degree less than 4 in G.
I have tried to prove this by contradiction.
Assume that all vertices of the graph G have degree >= 4.
Assume the edge set of G is partitioned into two trees T1 and T2.
From these assumptions, the only observation I could make is that for every vertex v in G,
the degree of v must be greater than or equal to 2 in either T1 or T2.
I don't know how to proceed from here. Please help.
If my approach to this problem is wrong, please provide a different solution.
You started with a good approach. Let's assume every vertex in G has degree at least 4, and assume the edge set of G is partitioned into two trees T1 and T2.
We know that a tree on k vertices has k-1 edges. Each of T1 and T2 spans at most n vertices (where n = |V|), so each has at most n-1 edges -> combined, G has at most 2n-2 edges -> |E| <= 2n-2.
On the other hand, we know that d(v) >= 4 for each v in G, and that the sum of the degrees in a graph equals 2|E| (each edge contributes 2 to the sum). Therefore 2*|E| >= 4*n, so |E| >= 2*n.
Contradiction -> there has to be at least one vertex with degree less than 4.
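The counting argument can be written out compactly. Here n = |V|, and n_1, n_2 <= n denote the numbers of vertices spanned by T_1 and T_2 (these two symbols are introduced only for this sketch).

```latex
\begin{align*}
|E| &= |E(T_1)| + |E(T_2)| = (n_1 - 1) + (n_2 - 1) \le 2n - 2, \\
2|E| &= \sum_{v \in V} d(v) \ge 4n \quad\Longrightarrow\quad |E| \ge 2n, \\
2n &\le |E| \le 2n - 2 \quad \text{(contradiction)}.
\end{align*}
```

So the assumption d(v) >= 4 for all v cannot hold, and some vertex has degree at most 3.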
I was given a homework assignment about a week ago. The thing is, I don't understand what my teacher taught, but he gave us the homework anyway...
A = {a,b,s}, B = {b,h,t}, C = {a,t,s}, D = {h,t,s}, E = {a,b}, F = {b,t,s}
How do I create a minimal vertex coloring in which A, B, C, D, E and F are the vertices?
I do know how to color vertices, but I don't know how to create the graph from the given sets. Any help? I tried looking on the internet but didn't come across a question like this.
If the graph is to be interpreted in such a way that the vertices A, B, C, D, E, F are meant to be connected if and only if they intersect, an optimal coloring has 5 colors.
The resulting graph is almost the complete graph on 6 vertices: {D,E} is the only edge that is missing, since D and E are the only two sets with no element in common (E and F, for instance, share b). It contains the complete graph on 5 vertices as the subgraph induced by {A,B,C,D,F}. Consequently, no vertex coloring can use fewer than 5 colors. The coloring
F : 1
A : 2
B : 3
C : 4
D : 5
E : 5
is a 5-coloring of the graph, and it is therefore optimal.
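The intersection graph and a candidate coloring can be checked mechanically. The sketch below uses base R only; the helper is_proper and the variable names are made up for the illustration, and the coloring tested is one in which the single non-adjacent pair shares a color.

```r
sets <- list(A = c("a","b","s"), B = c("b","h","t"), C = c("a","t","s"),
             D = c("h","t","s"), E = c("a","b"),     F = c("b","t","s"))

# Two vertices are adjacent when their sets share at least one element
adj <- sapply(sets, function(x)
  sapply(sets, function(y) length(intersect(x, y)) > 0))
diag(adj) <- FALSE

# Non-adjacent pairs, each listed once
missing <- which(!adj & upper.tri(adj), arr.ind = TRUE)
missing_pairs <- paste(rownames(adj)[missing[, "row"]],
                       colnames(adj)[missing[, "col"]], sep = "-")

# {A,B,C,D,F} induces a complete graph when all 5 * 4 ordered pairs are adjacent
clique <- c("A", "B", "C", "D", "F")
has_k5 <- sum(adj[clique, clique]) == 5 * 4

# A coloring is proper when no edge joins two vertices of the same color
is_proper <- function(colour) !any(adj & outer(colour, colour, "=="))

colouring <- c(A = 2, B = 3, C = 4, D = 5, E = 5, F = 1)

print(missing_pairs)
print(has_k5)
print(is_proper(colouring))
```

The clique gives the lower bound of 5 colors, and a proper coloring that uses 5 colors gives the matching upper bound.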
I have a directed network where 50 nodes have a degree of 3 and another 50 have a degree of 10.
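# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("graph")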
#load graph and make the specified graph
library(graph)
degrees=c(rep(3,50),rep(10,50))
names(degrees)=paste("node",seq_along(degrees)) #nodes must be names
x=randomNodeGraph(degrees)
#verify graph
edges=edgeMatrix(x)
edgecount=table(as.vector(edges))
table(edgecount)
This is a directed network where the total degree is made up from both indegree and outdegree.
I would like to have a network where every indegree is also an outdegree and vice versa
so for example if node 1 has an edge to node 5 then node 5 also needs to have an edge to node 1. My main goal is to preserve the degree distribution, i.e. 50 with degree of 3 and 50 with degree of 10.
Simply setting the graph to be undirected seems to do it:
x2 <- x
edgemode(x2) <- "undirected"
edges <- edgeMatrix(x2)
edgecount <- table(as.vector(edges))
table(edgecount)
This gives the same results as your code.
Also, an undirected graph always has an edge from 5 to 1 if there is an edge from 1 to 5; a single edge satisfies this property.
Paul Shannon suggests the following:
library(graph)
library(igraph)
degrees=c(rep(3,50),rep(10,50))
g <- igraph.to.graphNEL(degree.sequence.game(degrees))
table(graph::degree(g))
This gives the same results as your code.
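If the graphNEL conversion is not needed, the same idea can be sketched with igraph alone, using sample_degseq, the current name for degree.sequence.game. The seed is arbitrary and only there for reproducibility. With the default method the generated graph may contain self-loops and multiple edges; the method argument offers alternatives if a simple graph is required.

```r
library(igraph)

degrees <- c(rep(3, 50), rep(10, 50))

# An undirected degree sequence must have an even sum (every edge has two ends)
stopifnot(sum(degrees) %% 2 == 0)

set.seed(1)                   # arbitrary seed, for reproducibility only
g <- sample_degseq(degrees)   # undirected, since no in-degree sequence is supplied

print(is_directed(g))
print(table(igraph::degree(g)))
```

Because the graph is undirected, every edge between two nodes is automatically an edge in both directions, which is the reciprocity the question asks for.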