node attribute csv igraph - r

I have a large number of adjacency matrices in CSV format, exported from Excel. I also have a large number of CSV files with vertex attribute data.
I have linked them in SNA, but igraph goes further functionally, so I am looking to move to it. However, I cannot manage to build the graph together with its attributes.
I am looking to set up some code that will be a workhorse for doing a range of plots.
Although there seem to be many ways to link these two data sets, this seemed the simplest:
To make the adjacency matrix in the CSV a data frame (cut down for missing vertex data) I use:
m <- read.table(header=TRUE, check.names=FALSE, textConnection("
2 3 4 5 6 7
2 0 1 1 0 1 0
3 1 0 0 0 1 0
4 0 0 0 0 0 0
5 1 0 1 0 0 1
6 0 0 0 0 0 0
7 1 1 0 1 0 0
"))
In the case of having both vertex and row names in the original file, the imported attributes file has both vertex names and 'row.names' which correspond to the node names. Hex.ed[1,1] gives the value of the attribute for the first node in the m network, i.e. node 2:
Hex.ed <- read.table(header=TRUE, textConnection("
HH Emo Extra Aggr Consci OTE
2 3.3750 3.0000 3.0000 3.0000 3.0625 3.4375
3 3.5625 2.9375 3.0625 3.0000 3.3125 3.6250
4 3.2500 2.8750 3.7500 3.2500 3.8750 3.5000
5 3.6875 3.1250 3.3750 3.5625 3.6250 3.3125
6 3.3125 3.0000 3.3125 3.8750 3.2500 3.6875
7 3.8125 3.2500 3.5625 2.8750 3.6875 3.4375
"))
g <- graph.data.frame(m, directed=TRUE, vertices=Hex.ed)
However, I get the error: Error in graph.data.frame(m, directed = TRUE, vertices = Hex.ed) : Duplicate vertex names

I get a different error message:
Error in graph.data.frame(m, directed = TRUE, vertices = Hex.ed) :
Some vertex names in edge list are not listed in vertex data frame
but this is possibly because you ran your complete data set rather than the example in the question.
Anyway, graph.data.frame does not use adjacency matrices. From the docs at http://igraph.sourceforge.net/doc/R/graph.data.frame.html:
... the first two columns of d are used as a symbolic edge list and
additional columns as edge attributes. The names of the attributes are
taken from the names of the columns.
If you cared about reading the manual you would have seen an example at the bottom.
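To make the distinction concrete, here is a minimal base-R sketch that turns the adjacency matrix from the question into the kind of two-column symbolic edge list the documentation describes. The column names from and to and the helper variable idx are chosen here for illustration only.

```r
# Adjacency matrix from the question; the first column becomes the row names.
m <- as.matrix(read.table(header = TRUE, check.names = FALSE, text = "
2 3 4 5 6 7
2 0 1 1 0 1 0
3 1 0 0 0 1 0
4 0 0 0 0 0 0
5 1 0 1 0 0 1
6 0 0 0 0 0 0
7 1 1 0 1 0 0
"))

# Every non-zero cell [i, j] is one directed edge from row i to column j.
idx <- which(m != 0, arr.ind = TRUE)
edges <- data.frame(from = rownames(m)[idx[, "row"]],
                    to   = colnames(m)[idx[, "col"]])
edges
```

A data frame of this shape, rather than m itself, is what graph.data.frame expects as its first argument.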
If you have an adjacency matrix, then you can use graph.adjacency to create the graph, and then add the vertex attributes one by one:
g <- graph.adjacency(as.matrix(m))
for (i in seq_len(ncol(Hex.ed))) {
g <- set.vertex.attribute(g, colnames(Hex.ed)[i], value=Hex.ed[,i])
}
g
# IGRAPH DN-- 6 11 --
# + attr: name (v/c), HH (v/n), Emo (v/n), Extra (v/n), Aggr (v/n),
# Consci (v/n), OTE (v/n)
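Note that the loop above assigns attribute values by position, so it relies on the rows of Hex.ed being in the same order as the columns of m. The sketch below makes that alignment explicit by matching on names. It uses graph_from_adjacency_matrix and set_vertex_attr, which are the names these functions carry in igraph 1.0 and later; the idx variable is introduced here for illustration.

```r
library(igraph)

m <- read.table(header = TRUE, check.names = FALSE, text = "
2 3 4 5 6 7
2 0 1 1 0 1 0
3 1 0 0 0 1 0
4 0 0 0 0 0 0
5 1 0 1 0 0 1
6 0 0 0 0 0 0
7 1 1 0 1 0 0
")
Hex.ed <- read.table(header = TRUE, text = "
HH Emo Extra Aggr Consci OTE
2 3.3750 3.0000 3.0000 3.0000 3.0625 3.4375
3 3.5625 2.9375 3.0625 3.0000 3.3125 3.6250
4 3.2500 2.8750 3.7500 3.2500 3.8750 3.5000
5 3.6875 3.1250 3.3750 3.5625 3.6250 3.3125
6 3.3125 3.0000 3.3125 3.8750 3.2500 3.6875
7 3.8125 3.2500 3.5625 2.8750 3.6875 3.4375
")

g <- graph_from_adjacency_matrix(as.matrix(m), mode = "directed")

# Look up each vertex name among the row names of the attribute table,
# and stop early if any vertex has no attribute row.
idx <- match(V(g)$name, rownames(Hex.ed))
stopifnot(!anyNA(idx))

for (a in colnames(Hex.ed)) {
  g <- set_vertex_attr(g, a, value = Hex.ed[idx, a])
}
g
```

With real files where some vertices lack attribute data, the stopifnot line fails before any attribute is assigned to the wrong vertex.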

Related

Assign column name based on key in preceding column

I am importing a data set from a CSV file.
The top two lines are key-value pairs:
UTC BSP AWA TWA
1.  2.  3.  4.
Below this, each key number sits in its own column, followed by a column with the corresponding data. For example:
1 41223 2 0 3 045 4 026
1 41224 2 0 3 052 4 035
1 41225 2 0 3 087 4 040
1 41226 2 0 3 023 4 041
2 0 3 052 4 082
I want it to become
UTC   BSP AWA TWA
41223 0   045 026
I tried putting the key value pairs in a list
test <- read.csv("my.csv",
nrows=2, header=FALSE)
key <- as.character(test[1,])
value <- as.numeric(test[2,])
mylist <- list()
for(i in 1:length(key)){
mylist[key[i]] <- value[i]
}
I was then trying to match the key in each preceding column against the values in the list, but I have not managed to do this.
Any help much appreciated
Thanks
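One possible approach, sketched in base R under the assumption that every data row lists the key numbers in the same order, so the keys in the first data row are enough to name the columns. The objects key and value mirror those built in the question; raw, codes and out are names introduced here, and only the first two complete data rows are used.

```r
# Header information as the question's code would extract it.
key   <- c("UTC", "BSP", "AWA", "TWA")
value <- c(1, 2, 3, 4)

# Data rows alternate: key number, reading, key number, reading, ...
# Reading everything as character keeps leading zeros such as "045".
raw <- read.table(colClasses = "character", text = "
1 41223 2 0 3 045 4 026
1 41224 2 0 3 052 4 035
")

# Odd columns hold the key numbers, even columns hold the readings.
codes <- as.numeric(unlist(raw[1, c(TRUE, FALSE)]))
out   <- raw[, c(FALSE, TRUE)]

# Name each reading column by looking its key number up in the header.
names(out) <- key[match(codes, value)]
out
```

If the key order can differ from row to row, the lookup would have to be done per row instead of once for the whole table.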

Vertex Labels in igraph R

I am using igraph to plot a non directed force network.
I have a dataframe of nodes and links as follows:
> links
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363
> nodes
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8
I plot these using igraph as follows:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2 )
On this plot, igraph has used the proxy indexes for source and target as vertex labels.
I want to use the real IDs, given in my links table as sourceID and targetID.
So, for:
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
This would show as:
(1450552) ----- 0.6245 ----- (1519842)
Instead of:
(3) ----- 0.6245 ----- (4)
(Note that the proxy indexes are zero indexed in the links dataframe, and one indexed in the nodes dataframe. This offset by 1 is necessary for igraph plotting).
I know I need to somehow match or map the proxy indexes to their corresponding name within the nodes dataframe. However, I am at a loss, as I do not know the order in which igraph plots labels.
How can I achieve this?
I have consulted the following questions to no avail:
Vertex Labels in igraph with R
how to specify the labels of vertices in R
R igraph rename vertices
You can specify the labels like this:
library(igraph)
gg <- graph.data.frame(
links,directed=FALSE,
vertices = rbind(
setNames(links[,c(1,4)],c("id","label")),
setNames(links[,c(2,5)], c("id","label"))))
plot(gg, vertex.color = 'lightblue', edge.label=links$value,
vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2 )
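You could also pass
merge(rbind(
setNames(links[,c(1,4)],c("id","label")),
setNames(links[,c(2,5)], c("id","label"))),
nodes,
by.x="label", by.y="name")[, c("id","label","group","size")]
to the vertices argument if you needed the other node attributes. The column reordering at the end matters: merge() puts the key column (label) first, whereas graph.data.frame takes the vertex names from the first column of vertices.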
Data:
links <- read.table(header=TRUE, text="
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363")
nodes <- read.table(header=TRUE, text="
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8")
It appears I was able to repurpose the answer to this question to achieve this:
r igraph - how to add labels to vertices based on vertex id
The key was to use the vertex.label argument within plot() and select a subset of nodes$name.
For our index we can use the default vertex names that igraph assigns automatically. To extract these, you can type V(gg)$name.
Within plot(gg) we can then write:
vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name
# 1 Convert to numeric
# 2 Add 1 for offset between proxy links index and nodes index
# 3 Select subset of nodes with above as row index. Return name column
As full code:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2, vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name)
With the data above, this gave a plot with the real IDs as vertex labels.
The easiest solution would be to reorder the columns of links, because according to the documentation:
"If vertices is NULL, then the first two columns of d are used as a symbolic edge list and additional columns as edge attributes."
Hence, your code will give the correct output after running:
links <- links[,c(4,5,3)]
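A short sketch of this reordering approach, which additionally attaches the nodes table so that group and size become vertex attributes. It uses graph_from_data_frame, the igraph 1.0+ name for graph.data.frame; the edges variable is introduced here for illustration.

```r
library(igraph)

links <- read.table(header = TRUE, text = "
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363")
nodes <- read.table(header = TRUE, text = "
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8")

# The first two columns define the edges; value becomes an edge attribute.
edges <- links[, c("sourceID", "targetID", "value")]

# The first column of `vertices` supplies the vertex names (the real IDs).
gg <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

V(gg)$name   # real IDs instead of proxy indexes
E(gg)$value  # edge values, in the same order as the rows of `edges`
```

Calling plot(gg, edge.label = E(gg)$value) on this graph labels the vertices with the IDs, because plot uses the name attribute as the default label.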

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
For G.test(x) to work when x is a matrix, it must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function, I can run G.test on each row, provided my data set contains only two columns of numbers. I want to look only at the treated and control counts, for example, but I'm not sure how to omit columns so that G.test ignores the headers and the other columns. I've tried using the following, but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than printed as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
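On the second part of the question (collecting G, df and p-values in a table and summing them per treatment), here is a rough base-R sketch. To keep it runnable without the package, it uses a hand-written stand-in, g_gof, which computes a goodness-of-fit G statistic against equal expected proportions; it is not RVAideMemoire's G.test, and its results are only illustrative. If G.test returns a standard htest object (an assumption worth checking), the same table could be filled from its statistic, parameter and p.value elements. As an aside, an "object 'G.test' not found" error usually means the package has not been attached with library(RVAideMemoire).

```r
# First six rows of `bio`, as printed in the question.
bio <- read.table(header = TRUE, text = "
Date Trt Treated Control Dead DeadinC AliveinC
23Ap citol 1 3 1 0 13
23Ap cital 1 5 3 1 6
23Ap gerol 0 3 0 0 9
23Ap mix 0 5 0 0 8
23Ap cital 0 5 1 0 13
23Ap cella 0 5 0 1 4
")

# Stand-in G statistic for one vector of counts (hypothetical helper).
g_gof <- function(counts) {
  expected <- rep(sum(counts) / length(counts), length(counts))
  # A zero count contributes nothing to the sum.
  terms <- ifelse(counts > 0, counts * log(counts / expected), 0)
  G  <- 2 * sum(terms)
  df <- length(counts) - 1
  c(G = G, df = df, p.value = pchisq(G, df, lower.tail = FALSE))
}

# One row of results per row of data.
per_row <- t(apply(bio[, c("Treated", "Control")], 1, g_gof))
res <- data.frame(Trt = bio$Trt, per_row)

# Sum G and df within each treatment, then recompute the p-value.
totals <- aggregate(cbind(G, df) ~ Trt, data = res, FUN = sum)
totals$p.value <- pchisq(totals$G, totals$df, lower.tail = FALSE)
totals
```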

Phylogenetic tree

I am trying to build a phylogenetic tree based on pairwise data of genes. Below is a subset of my data (test.txt). The tree does not have to be constructed on the basis of any DNA sequences; the gene names can simply be treated as words.
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B
Below is my code in R
library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
My figure is attached here
I have a question about how they are clustered. The pairs
17 AR ADRA1B
18 AR ADRA1A
and
2 ADRA1B ADK
3 ADRA1A ADK
should be clustered closely because they have one gene in common, so 17 and 2 should be together, and 18 and 3.
Should I use another method, if I am wrong in using this one (Euclidean distance)?
Should I convert my data to a matrix of rows and columns, where gene1 is the x-axis and gene2 is the y-axis, with each cell filled by 1 or 0? (Basically, 1 would mean they are paired, and 0 that they are not.)
Updated Code :
table=table(tab$gene1, tab$gene2)
d <- dist(table,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
However, with this I get only the genes from the gene1 column and not those from the gene2 column. The figure below is exactly what I want, but it should include the genes from the gene2 column as well.
There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty, no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability of their common occurrence, and something like a principal component analysis (PCA) may be preferred, but I am far from being an expert on (hierarchical) clustering.
Before you can use the dist function, you have to bring your data into an appropriate format:
# convert test data into suitable format
gene.names <- sort(unique(c(tab[,"gene1"],tab[,"gene2"])))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
lapply(seq_len(nrow(tab)),function(x) gene.matrix[x,match(tab[x,c("gene1","gene2")],colnames(gene.matrix))]<<-1)
The obtained gene.matrix has the shape:
ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,] 1 0 1 0 0 0 1 0
[2,] 2 0 1 0 0 1 0 0
[3,] 3 0 1 0 1 0 0 0
[4,] 4 0 0 0 0 0 0 0
...
So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):
d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
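As an alternative to filling the matrix through lapply and <<-, the same 0/1 layout can be built with matrix indexing. This is a sketch on the first seven rows of the question's data only; inc, genes and rows are names introduced here. It also uses method = "ward.D", which is what current versions of R call the method referred to as "ward" above.

```r
# First seven rows of test.txt from the question.
tab <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
")

genes <- sort(unique(c(tab$gene1, tab$gene2)))
inc <- matrix(0L, nrow = nrow(tab), ncol = length(genes),
              dimnames = list(tab$ID, genes))

# A two-column index matrix addresses one (row, column) cell per line.
rows <- seq_len(nrow(tab))
inc[cbind(rows, match(tab$gene1, genes))] <- 1L
inc[cbind(rows, match(tab$gene2, genes))] <- 1L

d   <- dist(inc, method = "euclidean")
fit <- hclust(d, method = "ward.D")
```

The resulting fit can be passed to plot() or to as.phylo() from ape in the same way as above.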
It may be a good idea to read up on the differences between distance measures such as euclidean and manhattan. For instance, the euclidean distance between the individuals with ID=1 and ID=2 is:
euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)
whereas the manhattan distance is
manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
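To check these two formulas numerically, the rows for ID=1 and ID=2 can be typed in by hand from the gene.matrix output shown earlier. Only the seven gene columns visible in that output are used; the elided columns are zero for both individuals, so they do not change either distance. The variable x is introduced here for illustration.

```r
genes <- c("ACHE", "ADK", "ADORA1", "ADRA1A", "ADRA1B", "ADRA1D", "ADRA2A")
x <- rbind("1" = c(0, 1, 0, 0, 0, 1, 0),   # ID=1: ADK and ADRA1D
           "2" = c(0, 1, 0, 0, 1, 0, 0))   # ID=2: ADK and ADRA1B
colnames(x) <- genes

dist(x, method = "euclidean")  # sqrt(2): the rows differ in two cells
dist(x, method = "manhattan")  # 2
```

Two individuals that share one gene are therefore always closer than two that share none, for which the euclidean distance is 2 and the manhattan distance is 4.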

Number of different states/events in a sequence with TraMineR

I'm interested in counting the number of different states present in each sequence of my dataset. For sake of simplicity, I'll use a TraMineR example:
starting from this sequence:
1230 D-D-D-D-A-A-A-A-A-A-A-D
then extracting the distinct successive states with the seqdss function, I obtain:
1230 D-A-D
Is there a function to extract the overall number of different states in the sequence only accounting for presence of a state and not its potential repetition along the sequence? In other words, for the case described above I would like to obtain a vector containing for this sequence the value 2 (event A and event D) instead of 3 (1 event A + 2 events D).
Thank you.
You can compute the number of distinct states by first computing the state distribution of each sequence using seqistatd and then summing the number of non-zero elements in each row of the matrix returned by seqistatd. I illustrate below using the biofam data:
library(TraMineR)
data(biofam)
bf.seq <- seqdef(biofam[,10:25])
## longitudinal distributions
bf.ldist <- seqistatd(bf.seq)
n.states <- apply(bf.ldist,1,function(x) sum(x != 0))
## displaying results
bf.ldist[1:3,]
0 1 2 3 4 5 6 7
1167 9 0 0 1 0 0 6 0
514 1 10 0 1 0 0 4 0
1013 7 5 0 1 0 0 3 0
n.states[1:3]
1167 514 1013
3 4 4
I might be missing something here, but it looks like you're after unique.
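A minimal sketch of the unique idea in base R, applied to sequences written as plain strings (copied from the actcal example data). It deliberately leaves TraMineR out; seqs and n_distinct are names introduced here.

```r
seqs <- c("2848" = "B-B-B-B-B-B-B-B-B-B-B-B",
          "1230" = "D-D-D-D-A-A-A-A-A-A-A-D",
          "654"  = "C-C-C-C-C-C-C-C-C-B-B-B")

# Split each sequence into its states and count the distinct ones.
n_distinct <- vapply(strsplit(seqs, "-", fixed = TRUE),
                     function(states) length(unique(states)),
                     integer(1))
n_distinct
# 2848 1230  654
#    1    2    2
```

For sequence 1230 this gives 2 (A and D), which is the value asked for in the question.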
Your expected result is not clear (maybe because you describe it in English and not in pseudocode). I guess you are looking for table to count the states per subject. Here I am using the actcal data provided with the TraMineR package:
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal,13:24)
head(actcal.seq )
Sequence
2848 B-B-B-B-B-B-B-B-B-B-B-B
1230 D-D-D-D-A-A-A-A-A-A-A-D
2468 B-B-B-B-B-B-B-B-B-B-B-B
654 C-C-C-C-C-C-C-C-C-B-B-B
6946 A-A-A-A-A-A-A-A-A-A-A-A
1872 D-B-B-B-B-B-B-B-B-B-B-B
Now applying table to the 4th row for example:
tab <- table(unlist(actcal.seq[4,]))
tab[tab>0]
B C
3 9
