Cleaning a graph with R package "igraph" - r

I need to "clean" a graph in R. By cleaning, I mean that I need to delete all nodes which are not linked with a specific one. For instance, if my graph has 4 nodes, with these edges:
1 to 3
1 to 2
4 to 2
I want to keep only the nodes linked with node 1, plus node 1 itself; that is, I need to delete node 4.
Is there any way with igraph to build an algorithm that can do this for a very big graph (say, more than 1000 nodes and 1 000 000 edges)?

Use subcomponent and induced.subgraph:
edges_df <- data.frame(from = c(1, 1, 4), to = c(3, 2, 2))
g1 <- graph.data.frame(edges_df, directed = TRUE)
g2 <- induced.subgraph(g1, subcomponent(g1, "1", mode = "out"))
As for the "big" graphs: 1000 is not so big. On my laptop:
system.time({
g3 <- graph.full(n = 1000, directed = TRUE)
g4 <- induced.subgraph(g3, subcomponent(g3, "1", mode = "out"))
})
# user system elapsed
# 0.47 0.10 0.57
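Which nodes count as "linked" depends on the mode argument of subcomponent. The sketch below reuses the question's three edges and compares mode = "out" (follow edge directions away from node 1) with mode = "all" (ignore directions). It uses the newer function names graph_from_data_frame and induced_subgraph; the helper keep() is only an illustrative name.

```r
library(igraph)

# Same toy graph as in the question: 1 -> 3, 1 -> 2, 4 -> 2
edges_df <- data.frame(from = c(1, 1, 4), to = c(3, 2, 2))
g <- graph_from_data_frame(edges_df, directed = TRUE)

# Hypothetical helper: names of the vertices kept for a given mode
keep <- function(mode) {
  V(induced_subgraph(g, subcomponent(g, "1", mode = mode)))$name
}

reach_out <- keep("out")  # vertices reachable by following edge directions
reach_all <- keep("all")  # vertices reachable when directions are ignored
```

With mode = "out", node 4 is dropped because there is no directed path from 1 to 4; with mode = "all", node 4 stays because it shares the neighbour 2 with node 1.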

Related

Word graph in R using Igraph

I have a simple problem. I have 2 text documents and I want to make a graph of each document with igraph or a similar library. I actually want to make one large graph that combines the subgraphs of the two documents. I tried the following code:
> library(tm)
> Topic1 = c("I love Pakistan")
> Topic2 = c("Pakistan played well")
> src = data.frame(Topic1,Topic2)
> mycorpus = Corpus(VectorSource(src))
> tdm = as.matrix(TermDocumentMatrix(mycorpus))
Now I don't know what to do next.
The first graph, for Topic1, will have 3 nodes and 3 edges; similarly, the second graph, for Topic2, will have 3 nodes and 3 edges. Now I want to merge these two graphs into one. The large graph will then have 5 nodes and 6 edges, where the node Pakistan will have 4 edges.
Can anybody help me?
Finally, I found the solution myself. First, we make a graph of the terms from Topic1, using every term that has a frequency greater than 0.
library(igraph)
tdm = as.matrix(TermDocumentMatrix(mycorpus))
x = names(tdm[,1][tdm[,1]>0])
k = t(combn(x,2))
g = graph_from_edgelist(k,directed = FALSE)
plot(g)
x2 = names(tdm[,2][tdm[,2]>0])
k2 = t(combn(x2,2))
g2 = graph_from_edgelist(k2,directed = FALSE)
plot(g2)
E1 = get.edgelist(g)
E2 = get.edgelist(g2)
E3 = rbind(E1,E2)
g3 = graph_from_edgelist(E3,directed = FALSE)
plot(g3)
g3 = simplify(g3,remove.multiple = TRUE, remove.loops = TRUE)
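As a possible alternative to binding the edge lists by hand, igraph's union() merges named graphs by vertex name. The sketch below is self-contained and skips the tm step: it splits each text on whitespace and lower-cases the words, which is an assumption made for illustration (tm's default tokenisation may keep a different set of terms). The helper doc_graph() is a hypothetical name.

```r
library(igraph)

docs <- c("I love Pakistan", "Pakistan played well")

# Hypothetical helper: fully connect the distinct words of one text
doc_graph <- function(text) {
  terms <- unique(tolower(strsplit(text, "\\s+")[[1]]))
  graph_from_edgelist(t(combn(terms, 2)), directed = FALSE)
}

g_a <- doc_graph(docs[1])
g_b <- doc_graph(docs[2])

# Named graphs are merged by vertex name, so "pakistan" becomes one shared node
merged <- igraph::union(g_a, g_b)

vcount(merged)              # number of nodes in the combined graph
ecount(merged)              # number of edges in the combined graph
degree(merged, "pakistan")  # edges attached to the shared node
```

Because union() matches vertices by name and keeps each edge once, no separate simplify() call should be needed here.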

igraph: summarize each node's neighbours characteristics

With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I came up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
n <- ego_size(g, nodes = i, order = 1, mindist = 1)
node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}
Expanding a data structure with rbind in a loop can get quite slow in R, because at every step R needs to allocate space for the new object and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times if it is the neighbor of multiple nodes.
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
   name degree av_degr_nei
1     1      0         NaN
2     2      1   2.0000000
3     3      2   1.0000000
4     4      1   1.0000000
5     5      2   1.0000000
6     6      1   1.0000000
7     7      3   0.6666667
8     8      1   0.0000000
9     9      1   0.0000000
10   10      0         NaN
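For larger graphs, the per-node function call can possibly be avoided altogether with one matrix product: multiplying the adjacency matrix by the degree vector gives the sum of neighbour degrees for every node at once. This is a sketch under the assumption of a simple undirected graph; it uses sample_gnp, the newer name for erdos.renyi.game, and reports the plain mean, so subtract 1 to match the "degree minus 1" convention used above.

```r
library(igraph)

set.seed(123)
g <- sample_gnp(10, 0.2)

deg <- degree(g)
A <- as_adjacency_matrix(g, sparse = FALSE)

# Row i of A picks out the neighbours of node i, so A %*% deg
# is the sum of the degrees of each node's neighbours
nb_degree_sum <- as.vector(A %*% deg)

# Dividing by the node's own degree (its neighbour count) gives the mean;
# isolated nodes yield 0 / 0 = NaN
mean_nb_degree <- nb_degree_sum / deg
```

For very large graphs a dense matrix may not fit in memory; keeping the default sparse = TRUE and using the Matrix package for the product would be the natural variation.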

fixing nodes when plotting networks

mynet is a network object with 93 vertices and three vertex attributes: sex, indegree, and outdegree. Another network object, simnet, is a simulated version of the network. The nodes and degree distributions are the same, but some edges have been rewired.
I plot them side by side...
par(mfrow=c(1,2))
plot(mynet, vertex.col="sex", main="mynet")
plot(simnet, vertex.col="sex", main="simnet")
...and get the following result:
This would be much more useful if I could fix the node location in both plots, as it would make the differences in edges very clear. Is there a way to do this with the base plot() function? If not, what is the simplest way to do this without manually entering coordinates for each node?
There is a way to do this by setting the layout in advance of plotting and using the same layout for both plots. We can do this using the names of the nodes since these are the same nodes between each graph. The approach is a little hacky but seems to work. Example code below:
library(igraph)
# Make some fake networks
set.seed(42)
df1 <- data.frame(e1 = sample(1:5, 10, replace = TRUE),
                  e2 = sample(1:5, 10, replace = TRUE))
df2 <- data.frame(e1 = sample(1:5, 10, replace = TRUE),
                  e2 = sample(1:5, 10, replace = TRUE))
# the original
g1 <- graph_from_data_frame(df1, directed = F)
# the 'simulations'
g2 <- graph_from_data_frame(df2, directed = F)
# set up the plot
par(mfrow=c(1,2))
# we set the layout
lo <- layout_with_kk(g1)
# this is a matrix of positions. Positions
# refer to the order of the nodes
head(lo)
#>             [,1]        [,2]
#> [1,] -0.03760207  0.08115827
#> [2,]  1.06606602  0.35564140
#> [3,] -1.09026110  0.28291157
#> [4,] -0.90060771 -0.72591181
#> [5,]  0.67151585 -1.82471026
V(g1)
#> + 5/5 vertices, named, from 418e4e6:
#> [1] 5 2 4 3 1
# If the layout has names for the rows then we can
# use those names to fiddle with the order
row.names(lo) <- names(V(g1))
# plot with layout
plot(g1, layout = lo)
# plot with layout but reorder the layout to match the order
# in which nodes appear in g2
plot(g2, layout = lo[names(V(g2)), ])
Created on 2018-11-15 by the reprex package (v0.2.1)
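The same idea in a smaller, deterministic form: compute the layout once, label its rows with vertex names, and index it by each graph's own vertex order. The vertex names and edges below are made up for illustration; the second graph deliberately stores its vertices in reverse order to show that the lookup by name still lines the nodes up.

```r
library(igraph)

nodes <- data.frame(name = c("a", "b", "c", "d", "e"))

# Two graphs on the same node set, with different edges and vertex order
g1 <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c", "d"), to = c("b", "c", "d", "e")),
  directed = FALSE, vertices = nodes)
g2 <- graph_from_data_frame(
  data.frame(from = c("e", "a", "c", "b"), to = c("a", "c", "e", "d")),
  directed = FALSE, vertices = nodes[5:1, , drop = FALSE])

# Lay out the first graph once and label the rows by vertex name
lo <- layout_with_kk(g1)
rownames(lo) <- V(g1)$name

# Reorder the rows to follow the vertex order of the second graph
lo2 <- lo[V(g2)$name, ]

par(mfrow = c(1, 2))
plot(g1, layout = lo, main = "g1")
plot(g2, layout = lo2, main = "g2")
```

This only works when both graphs contain exactly the same vertex names; a vertex missing from the first graph would have no row in the layout matrix.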

Depth of a node in partykit

I am building a tree using the partykit R package, and I am wondering if there is a simple, efficient way to determine the depth number at each internal node. For example, the root node would have depth 0, the first two kid nodes have depth 1, the next kid nodes have depth 2, and so forth. This will eventually be used to calculate the minimal depth of a variable. Below is a very basic example (taken from vignette("constparty", package="partykit")):
library("partykit")
library("rpart")
data("Titanic", package = "datasets")
ttnc<-as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ ., data = ttnc)
ttncTree<-as.party(rp)
plot(ttncTree)
#This is one of my many attempts which does NOT work
internalNodes<-nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)]
depth(ttncTree)-unlist(nodeapply(ttncTree, ids=internalNodes, FUN=function(n){depth(n)}))
In this example, I want to output something similar to:
nodeid = 1 2 4 7
depth = 0 1 2 1
I apologize if my question is too specific.
Here's a possible solution, which should be efficient enough, as trees usually have no more than a few dozen nodes.
I'm ignoring node #1, as its depth is always 0 and hence there is no point in either calculating or showing it (IMO).
Inters <- nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)][-1]
table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(ttncTree, from = x)))))
# 2 4 7
# 1 2 1
I had to revisit this problem recently. Below is a function to determine the depth of each node. I count the depth as the number of times a vertical line | appears on the node's line in the output of print.party().
library(stringr)
idDepth <- function(tree) {
  outTree <- capture.output(tree)
  idCount <- 1
  depthValues <- rep(NA, length(tree))
  names(depthValues) <- 1:length(tree)
  for (index in seq_along(outTree)) {
    if (grepl("\\[[0-9]+\\]", outTree[index])) {
      depthValues[idCount] <- str_count(outTree[index], "\\|")
      idCount <- idCount + 1
    }
  }
  return(depthValues)
}
> idDepth(ttncTree)
1 2 3 4 5 6 7 8 9
0 1 2 2 3 3 1 2 2
There is almost certainly a simpler, faster solution, but this is faster than using the intersect() function. Below is a comparison of computation times for a large tree (around 1,500 nodes):
# Compare computation time for large tree #
library(mlbench)
set.seed(470174)
dat <- data.frame(mlbench.friedman1(5000))
rp <- rpart(as.formula(paste0("y ~ ", paste(paste0("x.", 1:10), collapse=" + "))),
data=dat, control = rpart.control(cp = -1, minsplit=3, maxdepth = 10))
partyTree <- as.party(rp)
> length(partyTree) # Number of nodes
[1] 1503
>
> # Intersect() computation time
> Inters <- nodeids(partyTree)[-nodeids(partyTree, terminal = TRUE)][-1]
> system.time(table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(partyTree, from = x))))))
user system elapsed
22.38 0.00 22.44
>
> # Proposed computation time
> system.time(idDepth(partyTree))
user system elapsed
2.38 0.00 2.38
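Another option, which avoids parsing printed output, is to walk the tree recursively and pass the depth down to each kid node. The sketch below assumes partykit's node accessors node_party(), kids_node(), id_node() and is.terminal(); node_depths() is a hypothetical helper name, and the data preparation repeats the question's example.

```r
library(partykit)
library(rpart)

# Rebuild the example tree from the question
data("Titanic", package = "datasets")
ttnc <- as.data.frame(Titanic)
ttnc <- ttnc[rep(seq_len(nrow(ttnc)), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
ttncTree <- as.party(rpart(Survived ~ ., data = ttnc))

# Hypothetical helper: named vector of depths, one entry per node id
node_depths <- function(node, depth = 0) {
  out <- setNames(depth, id_node(node))
  if (!is.terminal(node)) {
    for (kid in kids_node(node)) {
      out <- c(out, node_depths(kid, depth + 1))
    }
  }
  out
}

depths <- node_depths(node_party(ttncTree))

# Keep only the internal nodes, as asked in the question
inner <- setdiff(nodeids(ttncTree), nodeids(ttncTree, terminal = TRUE))
depths[as.character(inner)]
```

Each node is visited exactly once, so the cost should grow linearly with the number of nodes; growing the result with c() is simple but could be replaced by a preallocated vector for very large trees.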

how to get all terminal nodes - weight & response prediction 'ctree' in r

Here's what I can use to list the weights for all terminal nodes, but how can I add some code to get the response prediction as well as the weight for each terminal node ID?
Say I want my output to look like this:
--
Below is what I have so far to get the weights:
nodes(airct, unique(where(airct)))
Thank you
The BinaryTree is a big S4 object, so it is sometimes difficult to extract data from it.
But the plot method for BinaryTree objects has an optional panel function of the form function(node) for plotting the terminal nodes. So when you plot, you can get the node information.
Here I use the plot function to extract the information and, even better, the gridExtra package to convert each terminal node to a table.
library(party)
library(gridExtra)
set.seed(100)
lls <- data.frame(N = gl(3, 50, labels = c("A", "B", "C")),
a = rnorm(150) + rep(c(1, 0,150)),
b = runif(150))
pond= sample(1:5,150,replace=TRUE)
tt <- ctree(formula=N~a+b, data=lls,weights = pond)
output.df <- data.frame()
innerWeights <- function(node){
  dat <- data.frame(x = node$nodeID,
                    y = sum(node$weights),
                    z = paste(round(node$prediction, 2), collapse = ' '))
  grid.table(dat,
             cols = c('ID', 'Weights', 'Prediction'),
             h.even.alpha = 1,
             h.odd.alpha = 1,
             v.even.alpha = 0.5,
             v.odd.alpha = 1)
  output.df <<- rbind(output.df, dat) # note the use of <<-
}
plot(tt, type='simple', terminal_panel = innerWeights)
data
  ID Weights     Prediction
1  4      24  0.42 0.5 0.08
2  5      17 0.06 0.24 0.71
3  6      24    0.08 0 0.92
4  7     388 0.37 0.37 0.26
Here's what I found; it works fine, with a bit of extra information. I just want to post it here in case someone needs it in the future.
y <- do.call(rbind, nodes(tt, unique(where(tt))))
write.table(y, 'clipboard', sep='\t')
@agstudy, let me know what you think.
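To get just the three columns from the question without plotting, the list returned by nodes() can probably be summarised directly. The sketch below assumes each node is a list with the elements weights and prediction (the same fields the panel function above reads) and uses the airquality data to build a tree named airct, as in the question; res and term_nodes are illustrative names.

```r
library(party)

# Example tree; the question's airct is assumed to be built along these lines
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)

# IDs of the terminal nodes, and the corresponding node objects
term_ids <- sort(unique(where(airct)))
term_nodes <- nodes(airct, term_ids)

# One row per terminal node: ID, total case weight, and prediction
res <- data.frame(
  ID = term_ids,
  Weights = sapply(term_nodes, function(n) sum(n$weights)),
  Prediction = sapply(term_nodes, function(n) {
    paste(round(n$prediction, 2), collapse = " ")
  })
)
res
```

For a classification tree the prediction holds one probability per class, which is why the values are pasted into a single string here.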
