I have an output.csv file containing the adjacency list of a graph, in the following format.
Every line starts with the source node (an integer), followed by the nodes it is connected to. The nodes are separated from each other and from the source node by a single space (' ').
A snapshot looks as follows:
0 2 5 8
1 2 7 4 6
2 0 1
3 4 7 8
4 1 3
I want to read this into an adjacency list format and use it to plot in igraph. What is the simplest way to do this? Thanks.
Your data is not a proper adjacency list, because it is missing the lists for 5-8. So I just removed these vertices from your list.
Igraph has a function to create a graph from an adjacency list, so you just need to read in the data, and create the graph from the adjacency list with graph.adjlist. Here is one way to do it, not necessarily the simplest:
## magrittr for the %>% pipes
library(magrittr)
library(igraph)
## sample data
text <- "0 2\n1 2 4\n2 0 1\n3 4\n4 1 3"
## read in as lines, replace textConnection(text) with your file name
lines <- readLines(textConnection(text))
g <- lines %>%
  strsplit(split = " ") %>%     # 1
  lapply(as.numeric) %>%        # 2
  lapply(extract, -1) %>%       # 3
  lapply(add, 1) %>%            # 4
  graph.adjlist(mode = "all")   # 5
g
#> IGRAPH U--- 5 4 --
#> + edges:
#> [1] 1--3 2--3 2--5 4--5
Some explanation for the long pipe steps:
1. We split the lines at single spaces.
2. Convert them to numeric.
3. Drop the first number from each line; it is not needed for graph.adjlist.
4. Add one to all numbers, since igraph vertex ids start at one and yours seem to start at zero.
5. Call graph.adjlist to create an undirected graph.
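If dropping vertices 5-8 is not acceptable, one alternative is to turn the lines into an edge list instead, which does not need every vertex to have its own line. The following is only a base R sketch under that assumption: the sample lines are copied from the question, and the variable names (parts, edges) are made up for illustration.

```r
# Sample lines copied from the question; for a real file use readLines("output.csv")
lines <- c("0 2 5 8", "1 2 7 4 6", "2 0 1", "3 4 7 8", "4 1 3")

# Split each line into integers: the first is the source, the rest are its neighbours
parts <- lapply(strsplit(lines, " ", fixed = TRUE), as.integer)

# Build one (from, to) row per listed neighbour
edges <- do.call(rbind, lapply(parts, function(p) {
  cbind(from = p[1], to = p[-1])
}))

# Treat edges as undirected: order each pair, then drop duplicates such as 0-2 / 2-0
edges <- unique(t(apply(edges, 1, sort)))
colnames(edges) <- c("from", "to")

# All vertex ids that occur anywhere, including those without their own line
vertices <- sort(unique(as.vector(edges)))
```

After adding one to the ids, a matrix like this should be usable with igraph's graph_from_edgelist(edges + 1, directed = FALSE), though that step is not run here.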
I have an igraph network that contains two types of nodes, one set that describes my points/nodes of interest (NOI) and another set that act as barriers (B) in my network. Now I'd like to measure the total length of all edges that are connected starting from a specific NOI until a barrier is approached.
Here a short example using a ring-shape in igraph:
set.seed(123)
g <- make_ring(10) %>%
set_edge_attr("weight", value = rnorm(10,100,20))%>%
set_vertex_attr("barrier", value = c(0,0,1,0,0,1,0,0,1,0))%>%
set_vertex_attr("color", value = c("green","green","red",
"green","green","red",
"green","green","red","green"))
For example, when starting from node 1 (an NOI, green), all edges up to nodes 9 and 3 are reachable (nodes 9 and 3 are barriers B and block further traversal). Thus the total connected length of edges for NOI 1 is the sum of the lengths/weights of edges 1--2, 2--3, 1--10 and 10--9. The same value holds for node 10 as the starting node. In the end I am interested in a list/dataframe of all NOI and the total length of their reachable network. How best to proceed in R using igraph? Is there a built-in function in igraph?
Here's one possible strategy. First, I set a name for each node so it will be preserved during graph transformations
V(g)$name = seq.int(vcount(g))
Now I drop all the barriers and split the graph up into separate connected components of nodes of interest; all nodes in a component share the same length.
gd <- g %>% induced_subgraph(V(g)[V(g)$barrier==0]) %>% decompose()
Then we can write a helper function that takes a subgraph, finds all the incident edges for its nodes in the original graph, extracts the weights, and sums them up
get_connected_length <- function(x) {
incident_edges(g, V(g)$name %in% V(x)$name) %>% do.call("c", .) %>% unique() %>% .$weight %>% sum()
}
Now we apply the function to each of the subgraphs and extract the node names
n <- gd %>% Map(function(x) V(x)$name, .)
w <- gd %>% Map(get_connected_length, .)
And we can combine that data all together in a matrix
do.call("rbind", Map(cbind, n, w))
# [,1] [,2]
# [1,] 1 361.5366
# [2,] 2 361.5366
# [3,] 10 361.5366
# [4,] 4 335.1701
# [5,] 5 335.1701
# [6,] 7 318.2184
# [7,] 8 318.2184
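Since the question asks for a list/dataframe, the same two lists can also be combined into a data frame rather than a matrix. This is only a sketch: n and w are hard-coded here from the output printed above so that the snippet runs on its own, and the column names node and total_length are made up for illustration.

```r
# Node names per connected component, and the summed edge weight per component
# (values copied from the printed matrix above)
n <- list(c(1, 2, 10), c(4, 5), c(7, 8))
w <- list(361.5366, 335.1701, 318.2184)

# Repeat each component's total once per node in that component
result <- data.frame(
  node = unlist(n),
  total_length = rep(unlist(w), times = lengths(n))
)

# Sort by node id for easier lookup
result <- result[order(result$node), ]
result
```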
To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1grams and combine those with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used in natural language processing [nlp] predictions. Carrying around all the 1grams would be pointless because only the top ten [or twenty] will ever get used with the simple back-off model I'm using.
I wasn't able to find a quanteda::dfm(corpusText, . . .) parameter that instructs it to only return the top ## features. So based on comments from package author #KenB in other threads, I'm using the dfm_select/remove functions to extract the top ten 1grams, and based on the "quanteda dfm join" search result "concatenate dfm matrices in 'quanteda' package", I'm using the rbind.dfmSparse??? function to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see if I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at thus far.
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = F, stringsAsFactors = F)
/eoq
For extracting the top 10 unigrams, this strategy will work just fine:
1. Sort the dfm by the (default) decreasing order of overall feature frequency, which you have already done, but then add a step to slice out the first 10 columns.
2. Combine this with the 2- to 5-gram dfm using cbind() (not rbind()), because features are columns in a dfm and documents are rows.
That should do it:
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
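To see why cbind() rather than rbind() is the right join, here is a small sketch in which plain base R matrices stand in for dfm objects (an assumption made only so the example runs without quanteda). Rows are documents and columns are features, so adding more features for the same document means adding columns. The toy feature names and counts are illustrative.

```r
# Toy stand-ins for document-feature matrices of the same single document
unigrams <- matrix(c(2, 2, 1), nrow = 1,
                   dimnames = list("text1", c("some", "corpus", "no")))
bigrams <- matrix(c(2, 1), nrow = 1,
                  dimnames = list("text1", c("some_corpus", "of_no")))

# Keep the first two unigram columns (the "top n" slice), then add the bigram columns
combined <- cbind(unigrams[, 1:2, drop = FALSE], bigrams)

dim(combined)       # still one document, now with four features
colnames(combined)
```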
Your example code includes some use of data.table, although this does not appear in the question. In v0.99 we have added a new function textstat_frequency() which produces a "long"/"tidy" format of frequencies in a data.frame that might be helpful:
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
This should be straightforward, but I want to obtain the number of mutual edges associated with all the vertices in my graph:
library(igraph)
ed <- data.frame(from = c(1,1,2,3,3), to = c(2,3,1,1,2))
ver <- data.frame(id = 1:3)
gr <- graph_from_data_frame(d = ed,vertices = ver, directed = T)
plot(gr)
I know I can use which_mutual for edges, but is there an equivalent command for getting something like this:
# vertex edges no_mutual
# 1 2 2
# 2 1 1
# 3 2 1
UPDATE: Corrected inconsistencies in output table as pointed out by emilliman5
Here's a one-liner solution:
> table(unlist(strsplit(attr(E(gr)[which_mutual(gr)],"vnames"),"\\|")))/2
1 2 3
2 1 1
It relies on the edge sequence carrying the vertex names of each edge in its "vnames" attribute as a "|"-separated string. Splitting on "|" and tabulating gives a count of every vertex that appears in a mutual edge; each vertex appears twice per mutual pair (once for each direction), so divide by two.
If there's a less hacky way of getting vertex names from an edgelist, I'm sure Gabor knows it.
Here's that trick in more detail:
For your graph gr:
> E(gr)
+ 5/5 edges (vertex names):
[1] 1->2 1->3 2->1 3->1 3->2
You can get vertexes for edges thus:
> attr(E(gr),"vnames")
[1] "1|2" "1|3" "2|1" "3|1" "3|2"
So my one-liner subsets that edge list by the mutuality criterion, then manipulates the strings.
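The same count can also be sketched directly on the edge data frame from the question, without going through the "vnames" strings. This is only a base R illustration, assuming there are no self-loops or duplicated edges; the helper names (key, rev_key, is_mutual) are made up.

```r
# Edge data frame copied from the question
ed <- data.frame(from = c(1, 1, 2, 3, 3), to = c(2, 3, 1, 1, 2))

# An edge a->b is mutual when the reverse edge b->a is also present
key <- paste(ed$from, ed$to)
rev_key <- paste(ed$to, ed$from)
is_mutual <- rev_key %in% key

# Each mutual pair contributes one outgoing mutual edge to each of its two
# vertices, so counting mutual edges by source vertex counts each pair once per vertex
ids <- sort(unique(c(ed$from, ed$to)))
no_mutual <- table(factor(ed$from[is_mutual], levels = ids))
no_mutual
```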
I am not sure how well this will scale, but it gets the job done. Your expected table has some inconsistencies so I did the best I could, i.e. vertex 2 only has one originating edge not 2.
mutual_edges <- lapply(V(gr), function(x) which_mutual(gr, es = E(gr)[from(x) | to(x)]))
df <- data.frame(Vertex=names(mutual_edges),
Edges=unlist(lapply(V(gr), function(x) length(E(gr)[from(x)]) )),
no_mutual=unlist(lapply(mutual_edges, function(x) sum(x)/2)))
df
# Vertex Edges no_mutual
#1 1 2 2
#2 2 1 1
#3 3 2 1
I would like to extract the hierarchical structure of the nodes of a dendrogram or cluster.
For example in the next example:
library(dendextend)
dend15 <- c(1:5) %>% dist %>% hclust(method = "average") %>% as.dendrogram
dend15 %>% plot
The nodes are classified according to their position in the dendrogram (see figure below)
(Figure extracted from the dendextend package's tutorial)
I would like to get all the nodes on the path from each final leaf up to the root, as in the following output:
(the labels are ordered from left to right and from bottom to top)
hierarchical structure
leaf_1: 3-2-1
leaf_2: 4-2-1
leaf_3: 6-5-1
leaf_4: 8-7-5-1
leaf_5: 9-7-5-1
Thanks in advance,
First I find, for every node, the set of leaves in the subtree below it. In your example, there are 9 nodes.
subtrees <- partition_leaves(dend15)
leaves <- subtrees[[1]] # assume top node is used by all subtrees
I make a helper function that finds the route for a leaf (every node whose subtree contains it), and apply it to all leaves.
pathRoutes <- function(leaf) {
which(sapply(subtrees, function(x) leaf %in% x))
}
paths <- lapply(leaves, pathRoutes)
The raw output in list form, where each list element is the structure for an end node / leaf
> paths
[[1]]
[1] 1 2 3
[[2]]
[1] 1 2 4
[[3]]
[1] 1 5 6
[[4]]
[1] 1 5 7 8
[[5]]
[1] 1 5 7 9
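The paths come out ordered from root to leaf, whereas the question lists them from leaf to root as dash-separated strings. A minimal sketch of that last formatting step follows; paths is hard-coded from the output above so the snippet runs without dendextend, and the leaf_ prefix simply mirrors the question.

```r
# Root-to-leaf node ids per leaf, copied from the printed output above
paths <- list(c(1, 2, 3), c(1, 2, 4), c(1, 5, 6), c(1, 5, 7, 8), c(1, 5, 7, 9))

# Reverse each path to leaf-to-root order and join with dashes
formatted <- vapply(paths, function(p) paste(rev(p), collapse = "-"), character(1))
names(formatted) <- paste0("leaf_", seq_along(paths))
formatted
```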
I am trying to construct a function which shouldn't be hard in terms of programming but I am having some difficulties to conceptualize it. Hope you'll be able to understand my problem better than me!
I'd like a function that takes a single list of vectors as argument. Something like
arg1 = list(c(1,2), c(2,3), c(5,6), c(1,3), c(4,6), c(6,7), c(7,5), c(5,8))
The function should output a matrix with two columns (or a list of two vectors or something like that) where one column contains letters and the other numbers. One can think of the argument as a list of the positions/values that should be placed in the same group. If in the list there is the vector c(5,6), then the output should contain somewhere the same letters next to the values 5 and 6 in the number column. If there are the three following vectors c(1,2), c(2,3) and c(1,3), then the output should contain somewhere the same letters next to the value 1, 2 and 3 in the number column.
Therefore if we enter the object arg1 in the function it should return:
myFun(arg1)
number_column letters_column
1 A
2 A
3 A
5 B
6 B
7 B
4 C
6 C
5 D
8 D
(the order is not important. The letter E should not be present before the letter D has been used)
Therefore the function has constructed 2 groups of 3 (A:[1,2,3] and B:[5,6,7]) and 2 groups of 2 (C:[4,6] and D:[5,8]). Note that one position or number can be in several groups.
Please let me know if something is unclear in my question! Thanks!
As I wrote in the comments, it appears that you want a data frame that lists the maximal cliques of a graph given a list of vectors that define the edges.
require(igraph)
## create a matrix where each row is an edge
argmatrix <- do.call(rbind, arg1)
## create an igraph object from the matrix of edges
gph <- graph.edgelist(argmatrix, directed = FALSE)
## returns a list of the maximal cliques of the graph
mxc <- maximal.cliques(gph)
## creates a data frame of the output
dat <- data.frame(number_column = unlist(mxc),
group_column = rep.int(seq_along(mxc),times = sapply(mxc,length)))
## converts group numbers to letters
## ONLY USE if max(dat$group_column) <= 26
dat$group_column <- LETTERS[dat$group_column]
# number_column group_column
# 1 5 A
# 2 8 A
# 3 5 B
# 4 6 B
# 5 7 B
# 6 4 C
# 7 6 C
# 8 3 D
# 9 1 D
# 10 2 D
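If there could be more than 26 cliques, LETTERS runs out and the conversion above would produce NA labels. One possible workaround, sketched here as an assumption rather than anything igraph provides, is a small helper that continues spreadsheet-style (A to Z, then AA, AB, and so on). The name group_labels is hypothetical.

```r
# Hypothetical helper: spreadsheet-style labels A..Z, AA, AB, ... for 1..n
group_labels <- function(n) {
  vapply(seq_len(n), function(i) {
    out <- character(0)
    while (i > 0) {
      r <- (i - 1) %% 26            # position of the current letter (0 = A)
      out <- c(LETTERS[r + 1], out) # prepend, since we work from the last letter
      i <- (i - 1) %/% 26           # move on to the next more significant letter
    }
    paste(out, collapse = "")
  }, character(1))
}

labels <- group_labels(28)
labels[c(1, 26, 27, 28)]
```

With such a helper, the last step could become dat$group_column <- group_labels(max(dat$group_column))[dat$group_column].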