Duplicate Vertex IDs R - r

I am trying to make a network visualization for calling activity from a manager to store locations. The only problem is I keep getting the error "Duplicate Vertex IDs". I need to have multiple of the same vertex IDs as one manager has called more than one store. How do I get around this?
My edges data is organized as follows:
from to weight
12341 1 5
12341 2 4
23435 1 3
My node data includes only the from column:
from
12341
12341
23435
This was the code I tried to run:
MANAGER_LOC <- graph_from_data_frame(d = edges, vertices = nodes,
directed = TRUE)

You are getting the duplicate vertex ID error because you need to reference unique node data in vertices = . You could use unique(nodes), but this will give you another error, because nodes 1 and 2 you are referencing in your adjacency list data are not included in your nodes data.
Your node data cannot only include unique values from column edges$from, it must include all unique values from edges$from and edges$to, because you are passing adjacency list data to the graph_from_data_frame() function.
So in edges$to you also need to reference vertices by their names as in edges$from, e.g. 12341 or 23435.
Here is some R-Code, maybe including what you are trying to achieve.
#graph from your data frame
MANAGER_LOC <- graph_from_data_frame(
d = edges
,vertices = unique(c(edges$from, edges$to))
,directed = TRUE);
#plot also includes vertices 1 and 2
plot(
x = MANAGER_LOC
,main = "Plot from your edges data");
#plot from your data assuming you are referencing an id in edges$to
MANAGER_LOC <- graph_from_data_frame(
d = merge(
x = edges
,y = data.frame(
to_vertice_id = 1:length(unique(edges$from))
,to_vertice = unique(edges$from))
,by.x = "to"
,by.y = "to_vertice_id"
,all.x = T)[,c("from","to_vertice","weight")]
,vertices = unique(edges$from)
,directed = TRUE);
#plot does not include vertices 1 and 2
plot(
x = MANAGER_LOC
,main = "Plot assuming vertice ID
reference in edges$to");
#plot from your data assuming you are referencing the xth value of edges$from in edges$to
MANAGER_LOC <- graph_from_data_frame(
d = merge(
x = edges
,y = data.frame(
to_vertice_ref = 1:nrow(edges)
,to_vertice = edges$from)
,by.x = "to"
,by.y = "to_vertice_ref"
,all.x = T)[,c("from","to_vertice","weight")]
,vertices = unique(edges$from)
,directed = TRUE);
#plot does not include vertices 1 and 2
plot(
x = MANAGER_LOC
,main = "Plot assuming edges$from
reference in edges$to");

Related

Assign value to all upstream nodes in R igraph based on downstream value

I'm out of my depth when it comes to network graphs, but I have a table of ~6300 From/To links similar to the data frame df given below. Each vertex has a binary property called status.
What I would like to do is determine all of the vertices that are upstream of a vertex where status = 1, how would i do this in igraph? I've looked at data.tree but my data are not necessarily a single-root tree and "loops" are possible.
In the example below, this would mean that vertices Z, R, S, M, and K should have status = 1 (i.e. be orange in the plot), as they are upstream of Q, L, I, respectively.
library(igraph)
df <- data.frame(from = c("D","B","A","Q","Z","L","M","R","S","T","U","H","I","K"),
to = c("O","D","B","B","Q","O","L","Q","R","O","T","J","J","I"),
stringsAsFactors = FALSE
)
vertices <- data.frame(vertex = unique(c(df[,1], c(df[,2]))),
status = c(0,0,0,1,0,1,0,0,0,0,0,1,
1,0,0,0))
g <- graph_from_data_frame(df, vertices = vertices, directed = T)
plot(g, vertex.color = vertex_attr(g, "status"))
You can use subcomponent with mode='in'.

Manipulating edgelist in a network graph in R

I have an edgelist that has the following columns, from and to. This represents the edges between nodes.
from = c("10009", "10009", "10009", "10009", "10011", "10011", ...)
to = c("23908", "230908", "230908", "230908", "230514", "230514", ...)
edgelist = data.frame(from, to)
nodes = c("10009", "10011", "230908", "230514" ...)
I then created a network object, converted to graph object, to calculate its centrality measures:
library(network)
library(qgraph)
library(igraph)
network_el = network(edgelist, vertex.attr = nodes, directed=T) #network object
g = asIgraph(network_el) #convert to graph object
centrality = centrality_auto(g) #calculate Centrality
df = data.frame(centrality$edge.betweenness.centrality) #extract edge betweenness centrality into a dataframe
This gives me a dataframe with the columns c("from", "to", "centrality"). However, the "from" and "to" are no longer the original node names listed in edgelist. They have been converted into a different ID, starting from 1...to the last row.
#my current results
from = c("1","2","3","4"...)
to = c("6", "100", "204", ...)
edge.betweenness.centrality = c(4653193, 20188105, ...)
How do I merge back the original node names? I need to identify the actual "from" and "to" (i.e., the node data), such as:
#my desired results
from = c("10009", "10009", "10009", "10009", "10011"...) #rather than 1,2,3..
to = c("23908", "230908", "230908", "230908", "235014",...)
edge.betweenness.centrality = c(4653193, 20188105, ...)
I think this works - the node IDs are now back in place!
#Assign as dictionary
dict <- data.frame(name = V(g)$vertex.names)
dict <- data.frame (row.names(dict), dict)
#Replace with dictionary values
df$from <- with(dict, name[match(df$from, row.names.dict.)])
df$to <- with(dict, name[match(df$to, row.names.dict.)])
Here's an example using igraph:
library(igraph)
We create an example graph with explicit vertex names:
g <- sample_pa(length(letters), m=2, directed=F)
V(g)$name <- letters
Compute edge betweenness and save it to an edge attribute:
E(g)$eb <- edge_betweenness(g)
Now the dataframe you asked for, with the original vertex names:
> as_data_frame(g)
from to eb
1 a b 12.831818
2 a c 16.524242
3 b c 25.700000
4 c d 30.812879
5 b d 12.464394
...
...

R: generate a sequence of graphs (scale-free networks)

I'm generating a sequence of scale-free networks, in which I can add and remove edges according to a value sampled from a uniform distribution. The following code works, but throws a warning once in awhile (every 10 runs or so). The warning is:
Warning message:
In data.table::data.table(...) :
Item 1 is of size 64 but maximum size is 66 (recycled leaving remainder of 2 items)
I've seen this question, but I don't really understand the answer and if it applies in my situation.
The code is:
library(igraph)
create_graph_sequence = function(num_nodes, num_timesteps) {
keep_graphs <- vector(mode="list", length=num_timesteps)
proportions = runif(2)
cat('proportions are: ', proportions)
prop_add = proportions[1] #Let both follow a uniform distribution
prop_del= proportions[2]
min_num_edges = ceiling(num_nodes/2)
g <- barabasi.game(num_nodes, power=1.2, directed=TRUE, algorithm="psumtree")
keep_graphs[[1]] = g
for(i in 2:num_timesteps) {
print(i)
edgelist = get.edgelist(keep_graphs[[i-1]]) #(g)
#Add and remove edges per time step.
add_edge_to_graph = function() {
#Do not allow creation of loops! If farm a ships to farm b, then farm b cannot ship to farm a.
#Do not allow self-loops! If farm a is in the network, it cannot ship to farm a.
reverse_edgelist = cbind(edgelist[,2], edgelist[,1])
self_edgelist = cbind(seq(1:num_nodes), seq(1:num_nodes))
total_edges_not_to_repeat = rbind(edgelist, reverse_edgelist, self_edgelist)
#Find two nodes that are not in the current edgelist.
#1: get a (num_nodes)*2 matrix of possible edges
possible_edges_1 = rep(seq(1:num_nodes), each=num_nodes)
possible_edges_2 = rep(seq(1:num_nodes), num_nodes)
possible_edges = cbind(possible_edges_1, possible_edges_2)
possible_edges = data.matrix(possible_edges)
DT1 <- data.table(possible_edges)
DT2 <- data.table(cbind(total_edges_not_to_repeat, 0), key=paste0("V", seq(len=ncol(total_edges_not_to_repeat))))
setnames(DT2, c(head(names(DT2), -1L), 'found'))
da <- DT2[DT1, list(found=ifelse(is.na(found), 0, 1))]
#Append found to the possible_edges
dt1 <- cbind(DT1, da)
#randomly select *prop_add* rows that have '0' in the found column and add the edges
dt1 = data.matrix(dt1)
select_0 = dt1[dt1[, "found"]==0,]
new_edge_row = sample(nrow(select_0), ceiling(nrow(edgelist)*prop_add))
new_edges = select_0[new_edge_row, 1:2] #possible_edges[new_edge_row,]
#While not all new_edges fit the bill: are self-loops, create loops with other farms, etc.
#take a new sample.
new_edges_df = as.data.frame(new_edges, by_row=False)
tentr_df = as.data.frame(total_edges_not_to_repeat, by_row=True)
while(any(do.call(paste0,new_edges_df) %in% do.call(paste0, tentr_df))) {
new_edge_row = sample(nrow(select_0), ceiling(nrow(edgelist)*prop_add))
new_edges = select_0[new_edge_row, 1:2]
new_edges_df = as.data.frame(new_edges)
tentr_df = as.data.frame(total_edges_not_to_repeat)
}
new_edges = t(as.matrix(new_edges_df)) #for some reason matrix -> df -> matrix transposes. Transposing back.
#g2 = g
print('i-1 is')
print(i-1)
#print(keep_graphs[[i-1]])
g2 = keep_graphs[[i-1]]
for(i in 1:nrow(new_edges)) {
addthisedge = c(new_edges[i,][[1]], new_edges[i,][[2]])
g2 = add_edges(g2, edges = addthisedge)
}
return(g2)
}
delete_edge_from_graph = function() {
#Randomly select a second proportion *prop_del* rows to delete
#(include the edges from the new graph - this means the number of edges DOES NOT remain constant;
#if prop_add = prop_del then yes, stays constant;
#if prop_del > prop_add then going to have a graph that gets smaller over time.)
#BUT if the number to remove results in a graph with unacceptably few edges, then remove no edges.
g2_edgelist = get.edgelist(g2)
edges_to_remove = sample(nrow(g2_edgelist), floor(nrow(g2_edgelist)*prop_del))
edgefunctiong2 = E(g2)
if(nrow(g2_edgelist) - length(edgefunctiong2[edges_to_remove]) < min_num_edges) {
print('g2')
print(g2)
return(g2)
} else {
g2 = g2 - edgefunctiong2[edges_to_remove]
print('g2')
print(g2)
return(g2)
}
}
g2 = add_edge_to_graph()
g2 = delete_edge_from_graph()
keep_graphs[[i]] = g2
}
return(keep_graphs)
}
You can call this MWE by:
kept_graphs = create_graph_sequence(8, 3)
To create a sequence of 3 graphs with 8 nodes. It shouldn't take more than a few seconds to run on a basic laptop.
How can I get rid of this warning message?
The problem is in the line
dt1 <- cbind(DT1, da)
It sometimes happens that DT1 has 64 rows, while da has 66. That needs a warning because it doesn't look like a typical recycling (e.g., when one vector is (1,2,3,4,5,6) and we give another (2,3) expecting recycling to make it (2,3,2,3,2,3)).
Since it's not obvious what the function does, I'm sure it's better if you fix the issue yourself. To replicate it, do set.seed(123) before calling the function, and add if(i == 3) browser() right before dt1 <- cbind(DT1, da).

SpatialLines to iGraph conversion simplifies topology

I'm trying to convert from spatialLinesDataFrame to igraph object, and think I may be losing information I want to keep. Fairly new to igraph so please bear with me. The example below illustrates:
# create sldf object
require(sp); require(igraph); require(shp2graph)
d = data.frame(x = c(0,80,100,0,-20,-8,0,3,-10,-5,80,75),
y = c(0,-10,5,0,14,33,0,-4,-10,-12,-10,5),
grp = c(1,1,1,2,2,2,3,3,3,3,4,4))
sl = SpatialLines(list(
Lines(list(Line(d[d$grp == 1,1:2]),
Line(d[d$grp == 4,1:2])), ID=1),
Lines(Line(d[d$grp == 2,1:2]), ID=2),
Lines(Line(d[d$grp == 3,1:2]), ID=3))
)
sldf = SpatialLinesDataFrame(sl, iris[1:3,])
plot(sldf)
Now convert to igraph and plot:
read_sldf = readshpnw(sldf, ELComputed = T)
g = nel2igraph(read_sldf[[2]], read_sldf[[3]], weight=read_sldf[[4]])
plot(g)
Am I right that the off-branch of the first spdf row (sldf[1,]) has been lost? Calling as_edgelist(g) returns 3 rows not 4.
Just change those options in readshpnw:
# create sldf object
require(sp); require(igraph); require(shp2graph)
d = data.frame(x = c(0,80,100,0,-20,-8,0,3,-10,-5,80,75),
y = c(0,-10,5,0,14,33,0,-4,-10,-12,-10,5),
grp = c(1,1,1,2,2,2,3,3,3,3,4,4))
sl = SpatialLines(list(
Lines(list(Line(d[d$grp == 1,1:2]),
Line(d[d$grp == 4,1:2])), ID=1),
Lines(Line(d[d$grp == 2,1:2]), ID=2),
Lines(Line(d[d$grp == 3,1:2]), ID=3))
)
sldf = SpatialLinesDataFrame(sl, iris[1:3,])
plot(sldf)
nodes = readshpnw(sldf, ELComputed = TRUE, Detailed = TRUE, ea.prop = names(sldf))
g = nel2igraph(nodes[[2]], nodes[[3]])
plot(g)

Graph Layout in igraph

I am trying to gerate gexf file using igraph but unfortunatly I have a problem with layout. How can I solve it to get a good graph like second one.
First image created with 1000 nodes but second one with 500
gD <- simplify(graph.data.frame(dataSet, directed=FALSE))
# Print number of nodes and edges
#vcount(gD)
#ecount(gD)
############################################################################################
# Calculate some node properties and node similarities that will be used to illustrate
# different plotting abilities
# Calculate degree for all nodes
degAll <- igraph::degree(gD, v = V(gD), mode = "all")
# Calculate betweenness for all nodes
#tnet::betweenness_w(data.frame(V1 = dataSet$V1,V2=dataSet$V2, V3 =dataSet$V3 ))
betAll <- igraph::betweenness(gD, v = V(gD),normalized=TRUE)
betAll <- betAll
#options("scipen"=100000000)
#betweenness()
#betAll.norm <- betAll / 100000000000000
betAll.norm <-betAll
#betAll.norm <- (betAll - min(betAll))/(max(betAll) - min(betAll))
rm(betAll)
# Calculate Dice similarities between all pairs of nodes
dsAll <- similarity.dice(gD, vids = V(gD), mode = "all")
############################################################################################
# Add new node/edge attributes based on the calculated node properties/similarities
newdataSet1 <- data.frame(username = dataSet$V1,gender = dataSet$V4)
newdataSet2 <- data.frame(username = dataSet$V2,gender = dataSet$V4)
newdataSet<-rbind(newdataSet1,newdataSet2)
genderdata<-subset(newdataSet,!duplicated(newdataSet$username))
gD <- set.vertex.attribute(gD, "degree", index = V(gD), value = degAll)
gD <- set.vertex.attribute(gD, "betweenness", index = V(gD), value = betAll.norm)
gD <- set.vertex.attribute(gD, "gender", index = V(gD), value = genderdata$gender)
newdataSet1 <- data.frame(username = dataSet$V1,sentiment = dataSet$V5)
newdataSet2 <- data.frame(username = dataSet$V2,sentiment = dataSet$V5)
newdataSet<-rbind(newdataSet1,newdataSet2)
sentimentdata<-subset(newdataSet,!duplicated(newdataSet$username))
gD <- set.vertex.attribute(gD, "sentiment", index = V(gD), value = sentimentdata$sentiment)
# Check the attributes
# summary(gD)
F1 <- function(x) {data.frame(V4 = dsAll[which(V(gD)$name == as.character(x$V1)), which(V(gD)$name == as.character(x$V2))])}
cl = createCluster(6, export = list("F1","dsAll","gD"), lib = list("igraph","plyr"))
system.time(dataSet.ext <- ddply(dataSet, .variables=c("V1", "V2", "V3"), function(x) data.frame(F1(x)),.parallel = TRUE))
#res = ddply(dat, .(category), bla, .parallel = TRUE)
stopCluster(cl)
gD <- set.edge.attribute(gD, "weight", index = E(gD), value = 0)
gD <- set.edge.attribute(gD, "similarity", index = E(gD), value = 0)
# The order of interactions in gD is not the same as it is in dataSet or as it is in the edge list,
# and for that reason these values cannot be assigned directly
#length(E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$weight )
E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$weight <- as.numeric(dataSet.ext$V3)
E(gD)[as.character(dataSet.ext$V1) %--% as.character(dataSet.ext$V2)]$similarity <- as.numeric(dataSet.ext$V4)
# Check the attributes
# summary(gD)
####################################
# Print network in the file format ready for Gephi
# This requires rgexf package
# Create a dataframe nodes: 1st column - node ID, 2nd column -node name
nodes_df <- data.frame(ID = c(1:vcount(gD)), NAME = V(gD)$name)
# Create a dataframe edges: 1st column - source node ID, 2nd column -target node ID
edges_df <- as.data.frame(get.edges(gD, c(1:ecount(gD))))
# Define node and edge attributes - these attributes won't be directly used for network visualization, but they
# may be useful for other network manipulations in Gephi
#
# Create a dataframe with node attributes: 1st column - attribute 1 (degree), 2nd column - attribute 2 (betweenness)
nodes_att <- data.frame(DEG = V(gD)$degree, BET = V(gD)$betweenness,gender = V(gD)$gender,sentement = V(gD)$sentiment)
#
# Create a dataframe with edge attributes: 1st column - attribute 1 (weight), 2nd column - attribute 2 (similarity)
edges_att <- data.frame(WGH = E(gD)$weight, SIM = E(gD)$similarity)
# Define node/edge visual attributes - these attributes are the ones used for network visualization
#
# Calculate node coordinate - needs to be 3D
#nodes_coord <- as.data.frame(layout.fruchterman.reingold(gD, weights = E(gD)$similarity, dim = 3, niter = 10000))
# We'll cheat here, as 2D coordinates result in a better (2D) plot than 3D coordinates
nodes_coord <- as.data.frame(layout.fruchterman.reingold(gD))
nodes_coord <- cbind(nodes_coord, rep(0, times = nrow(nodes_coord)))
#
# Calculate node size
# We'll interpolate node size based on the node betweenness centrality, using the "approx" function
uniqueNess<-unique(V(gD)$betweenness)
approxVals <- approx(c(1, 5), n = length(unique(V(gD)$betweenness)))
# And we will assign a node size for each node based on its betweenness centrality
nodes_size <- sapply(V(gD)$betweenness, function(x) approxVals$y[which(sort(unique(V(gD)$betweenness)) == x)])
#
# Define node color
# We'll interpolate node colors based on the node degree using the "colorRampPalette" function from the "grDevices" library
# This function returns a function corresponding to a collor palete of "bias" number of elements
F2 <- colorRampPalette(c("#F5DEB3", "#FF0000"), bias = length(unique(V(gD)$degree)), space = "rgb", interpolate = "linear")
# Now we'll create a color for each degree
colCodes <- F2(length(unique(V(gD)$degree)))
#################test parallel####################
cl = createCluster(6, export = list("F2","dsAll","gD","colCodes"), lib = list("igraph","plyr"))
system.time(nodes_col<-parSapply(cl,V(gD)$degree, function(x) colCodes[which(sort(unique(V(gD)$degree)) == x)]))
#res = ddply(dat, .(category), bla, .parallel = TRUE)
stopCluster(cl)
#############################
# And we will assign a color for each node based on its degree
#nodes_col <- sapply(V(gD)$degree, function(x) colCodes[which(sort(unique(V(gD)$degree)) == x)])
# Transform it into a data frame (we have to transpose it first)
nodes_col_df <- as.data.frame(t(col2rgb(nodes_col, alpha = FALSE)))
# And add alpha (between 0 and 1). The alpha from "col2rgb" function takes values from 0-255, so we cannot use it
nodes_col_df <- cbind(nodes_col_df, alpha = rep(1, times = nrow(nodes_col_df)))
# Assign visual attributes to nodes (colors have to be 4dimensional - RGBA)
nodes_att_viz <- list(color = nodes_col_df, position = nodes_coord, size = nodes_size)
# Assign visual attributes to edges using the same approach as we did for nodes
F2 <- colorRampPalette(c("#FFFF00", "#006400"), bias = length(unique(E(gD)$weight)), space = "rgb", interpolate = "linear")
colCodes <- F2(length(unique(E(gD)$weight)))
#################test parallel####################
cl = createCluster(12, export = list("F2","dsAll","gD","colCodes"), lib = list("igraph","plyr"))
system.time(edges_col<-parSapply(cl,E(gD)$weight, function(x) colCodes[which(sort(unique(E(gD)$weight)) == x)]))
stopCluster(cl)
#############################
#edges_col <- sapply(E(gD)$weight, function(x) colCodes[which(sort(unique(E(gD)$weight)) == x)])
edges_col_df <- as.data.frame(t(col2rgb(edges_col, alpha = FALSE)))
edges_col_df <- cbind(edges_col_df, alpha = rep(1, times = nrow(edges_col_df)))
edges_att_viz <-list(color = edges_col_df)
# Write the network into a gexf (Gephi) file
#write.gexf(nodes = nodes_df, edges = edges_df, nodesAtt = nodes_att, edgesWeight = E(gD)$weight, edgesAtt = edges_att, nodesVizAtt = nodes_att_viz, edgesVizAtt = edges_att_viz, defaultedgetype = "undirected", output = "lesmis.gexf")
# And without edge weights
write.gexf(nodes = nodes_df, edges = edges_df, nodesAtt = nodes_att, edgesAtt = edges_att, nodesVizAtt = nodes_att_viz, edgesVizAtt = edges_att_viz, defaultedgetype = "undirected", output = "arctic.gexf")

Resources