Using ape to display BEAST ancestral state reconstructions - ape-phylo

I have run ancestral state reconstructions using BEAST, which gives me a Nexus file like this:
#NEXUS
Begin taxa;
Dimensions ntax=93;
Taxlabels
adan1251
blag1240-nule
wers1238-marit
;
End;
Begin trees;
Translate
1 adan1251,
2 blag1240-nule,
3 wers1238-marit
;
tree STATE_0 = ((1[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792,2[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.01067010801410265,3[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.03310620306359057)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.022661511629175332;
tree STATE_1 = ((1[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:1.02243609504948792,2[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.01067010801410265,3[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.03310620306359057)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.022661511629175332;
tree STATE_2 = ((1[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:2.02243609504948792,2[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.01067010801410265,3[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.03310620306359057)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.022661511629175332;
tree STATE_3 = ((1[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:3.02243609504948792,2[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.01067010801410265,3[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.03310620306359057)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.022661511629175332;
tree STATE_4 = ((1[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:4.02243609504948792,2[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.02243609504948792)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.01067010801410265,3[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.03310620306359057)[&recon_lexicon:cooked rice="00000000000001",recon_lexicon:mountain="000000000001",recon_lexicon:to die="00001",recon_lexicon:wall="00000001"]:0.022661511629175332;
End;
(except with 20 times as many taxa, 2000 times as many trees, and trees that actually differ.)
I would like to visualize the reconstructions for the lexical items at the internal and tip nodes. It seems that ape may be a good tool for this: it can be scripted, it can read the Nexus file (I tried read.nexus("filename.nex"), and the resulting str looks reasonable), and judging from http://ape-package.ird.fr/ape_screenshots.html it can display the reconstructions in a nice format:
How do I get ape to construct something like this thermo tree from data given in the comments ([&...]) of 10000 different Newick trees, after constructing some kind of consensus tree from the raw data?

After looking at some of the documentation for the ape package, it looks like you would be interested in the nodelabels() function.
After you have plotted your tree, you simply need to pass a vector whose length equals the number of internal nodes in your tree, with each value representing the probability of occupying one of the two character states. Then simply plot the probabilities on the nodes with the thermo option.
From the help files:
library(ape)  # provides bird.orders, plot.phylo() and nodelabels()
data(bird.orders)
plot(bird.orders, "c", use.edge.length = FALSE, font = 1)
nodelabels(thermo = runif(22), cex = .8)  # one value per internal node (bird.orders has 22)
If you have more than two states, you must create a matrix with the number of columns equal to the number of your states and the number of rows equal to the number of internal nodes. Each row holds the relative probabilities of the states at that node and should add up to 1.
Extended example with 3 states:
thermo <- matrix(c(.6, .3, .1), nrow = 22, ncol = 3, byrow = TRUE)  # byrow = TRUE so every row is (.6, .3, .1); matrix() fills column-wise by default
plot(bird.orders, "c", use.edge.length = FALSE, font = 1)
nodelabels(thermo = thermo, cex = .8)
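To go from the BEAST sample to such a plot, the missing piece is the per-node probability matrix. Note that ape's read.nexus() discards the [&...] comments, so the reconstructions have to be parsed out of the file separately. The sketch below uses a random placeholder for that matrix and shows how the remaining pieces (reading the tree sample, building a consensus, attaching the thermometers) could fit together:
library(ape)

trees <- read.nexus("filename.nex")  # multiPhylo; the [&...] annotations are dropped
cons <- consensus(trees, p = 0.5)    # majority-rule consensus topology

# Placeholder: one row per internal node of `cons`, one column per state,
# rows summing to 1 (e.g. the frequency of each reconstructed state
# across the sampled trees).
probs <- matrix(runif(cons$Nnode * 2), ncol = 2)
probs <- probs / rowSums(probs)

plot(cons, "c", use.edge.length = FALSE, font = 1)
nodelabels(thermo = probs, cex = .8)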

Related

Problem implementing a BFS algorithm in R

I'm trying to implement a breadth-first search (BFS) algorithm in R. I know about the graph::bfs function and do_bfs from DiagrammeR. I think my problem is in the "for" loop of the bfsg function.
The input would be a graph like the following:
1
2 3
4 5 6 7
The output should be the visiting order; in this case, starting from 1, that is 1, 2, 3, 4, 5, 6, 7.
library(igraph)
library(foreach)
library(flifo)
library(digest)
# devtools::install_github("rdpeng/queue")
These packages seemed useful for the implementation, especially the queue one.
t <- make_tree(7, children = 2, mode = "out")
plot.igraph(t)
bfsg(t, 1)
bfsg <- function(g, n) {
  m <- c(replicate(length(V(t)), 0))
  q <- flifo::fifo()
  m[n] <- 1
  push(q, n)
  pr <- c(replicate(length(V(t)), 0))
}
At this point, 1 should be in the queue; after this, it gets printed and popped off the queue. After the pop, the algorithm should go on to 2 and 3.
while (size(q) != 0) {
  print(n)
  pop(q)
}
for (i in unlist(adjacent_vertices(g, n, mode = "out"))) {
  if (m[i] == 0) {
    push(q, i)
    m[i] = 2
  }
}
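For reference, a minimal working version might look like the sketch below. It makes two structural changes relative to the code above: the traversal loop lives inside the function, and a plain R vector serves as the FIFO queue (so flifo is not needed):
library(igraph)

bfsg <- function(g, start) {
  visited <- rep(FALSE, vcount(g))
  queue <- c(start)                  # plain vector used as a FIFO queue
  visited[start] <- TRUE
  visit_order <- c()
  while (length(queue) > 0) {
    v <- queue[1]                    # dequeue the front vertex
    queue <- queue[-1]
    visit_order <- c(visit_order, v)
    for (u in unlist(adjacent_vertices(g, v, mode = "out"))) {
      if (!visited[u]) {             # enqueue unvisited neighbours
        visited[u] <- TRUE
        queue <- c(queue, u)
      }
    }
  }
  visit_order
}

t <- make_tree(7, children = 2, mode = "out")
bfsg(t, 1)  # 1 2 3 4 5 6 7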

Finding longest cyclic path in a graph with Gremlin

I am trying to construct Gremlin queries to use within DSE Graph with geo-searches enabled (indexed in Solr). The problem is that the graph is so densely interconnected that the cyclic-path traversals time out. Right now the prototype graph I'm working with has ~1600 vertices and ~35K edges. The number of triangles passing through each vertex is also summarised below:
+--------------------+-----+
| gps|count|
+--------------------+-----+
|POINT (-0.0462032...| 1502|
|POINT (-0.0458048...| 405|
|POINT (-0.0460680...| 488|
|POINT (-0.0478356...| 1176|
|POINT (-0.0479465...| 5566|
|POINT (-0.0481031...| 9896|
|POINT (-0.0484724...| 433|
|POINT (-0.0469379...| 302|
|POINT (-0.0456595...| 394|
|POINT (-0.0450722...| 614|
|POINT (-0.0475904...| 3080|
|POINT (-0.0479464...| 5566|
|POINT (-0.0483400...| 470|
|POINT (-0.0511753...| 370|
|POINT (-0.0521901...| 1746|
|POINT (-0.0519999...| 1026|
|POINT (-0.0468071...| 1247|
|POINT (-0.0469636...| 1165|
|POINT (-0.0463685...| 526|
|POINT (-0.0465805...| 1310|
+--------------------+-----+
only showing top 20 rows
I anticipate the graph eventually growing to a massive size, but I will limit the searches for cycles to geographic regions (say, of radius ~300 meters).
My best attempt so far has been some versions of the following:
g.V().has('gps',Geo.point(lon, lat)).as('P')
.repeat(both()).until(cyclicPath()).path().by('gps')
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request
For the sake of illustration, the map below shows a starting vertex in green and a terminating vertex in red. Assume that all the vertices are interconnected. I am interested in the longest path between green and red, which would be to circumnavigate the block.
A few links I've read through to no avail:
1) http://tinkerpop.apache.org/docs/current/recipes/#cycle-detection
2) Longest acyclic path in a directed unweighted graph
3) https://groups.google.com/forum/#!msg/gremlin-users/tc8zsoEWb5k/9X9LW-7bCgAJ
EDIT
Using Daniel's suggestion below to create a subgraph, it still times out:
gremlin> hood = g.V().hasLabel('image').has('gps', Geo.inside(point(-0.04813968113126384, 51.531259899256995), 100, Unit.METERS)).bothE().subgraph('hood').cap('hood').next()
==>tinkergraph[vertices:640 edges:28078]
gremlin> hg = hood.traversal()
==>graphtraversalsource[tinkergraph[vertices:640 edges:28078], standard]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x')
==>v[{~label=image, partition_key=2507574903070261248, cluster_key=RFAHA095CLK-2017-09-14 12:52:31.613}]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request: [91b6f1fa-0626-40a3-9466-5d28c7b5c27c - hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()]
Type ':help' or ':h' for help.
Display stack trace? [yN]n
The longest path, based on the number of hops, will be the last one you can find:
g.V().has('gps', Geo.point(x, y)).as('x').
repeat(both().simplePath()).
emit(where(both().as('x'))).
both().where(eq('x')).tail(1).
path()
There's no way to make this query perform well in OLTP unless you have a very tiny (sub)graph. So, depending on what you see as a "city block" in your graph, you should probably extract that first as a subgraph and then apply the longest-path query to it in memory.
One solution I've come up with involves Spark GraphFrames and a label propagation algorithm (GraphFrames, LPA). Each community's average GPS location can then be computed (in fact you don't even need the average; a single member of each community would suffice), together with all the edges that exist between the community representatives (average or otherwise).
Select a region of the graph and save its vertices and edges:
g.V().has('gps', Geo.inside(Geo.point(x, y), radius, Unit.METERS))
  .subgraph('g').cap('g')
Spark snippet:
import org.graphframes.GraphFrame
val V = spark.read.json("v.json")
val E = spark.read.json("e.json")
val g = GraphFrame(V,E)
val result = g.labelPropagation.maxIter(5).run()
val rdd = result.select("fullgps", "label").map(row => {
val coords = row.getString(0).split(",")
val x = coords(0).toDouble
val y = coords(1).toDouble
val z = coords(2).toDouble
val id = row.getLong(1)
(x,y,z,id)
}).rdd
// Average GPS location per community:
val newVertexes = rdd.map { case (x: Double, y: Double, z: Double, id: Long) => (id, (x, y, z)) }.toDF("lbl", "gps")
rdd.map { case (x: Double, y: Double, z: Double, id: Long) => (id, (x, y, z)) }
  .mapValues(value => (value, 1))
  .reduceByKey { case (((xL, yL, zL), countL), ((xR, yR, zR), countR)) => ((xR + xL, yR + yL, zR + zL), countR + countL) }
  .map { case (id, ((x, y, z), c)) => (id, ((x / c, y / c, z / c), c)) }
  .map { case (id, ((x, y, z), count)) => Array(x.toString, y.toString, z.toString, id.toString, count.toString) }
  .map(a => toCsv(a))
  .saveAsTextFile("avg_gps.csv")
// Keep IDs
val rdd2 = result.select("id", "label").map(row => {
val id = row.getString(0)
val lbl = row.getLong(1)
(lbl, id) }).rdd
val edgeDF = E.select("dst","src").map(row => (row.getString(0),row.getString(1))).toDF("dst","src")
// Src
val tmp0 = result.select("id", "label")
  .join(edgeDF, result("id") === edgeDF("src"))
  .withColumnRenamed("lbl", "src_lbl")
val srcDF = tmp0.select("src", "dst", "label")
  .map(row => (row.getString(0) + "###" + row.getString(1), row.getLong(2)))
  .withColumnRenamed("_1", "src_lbl").withColumnRenamed("_2", "src_edge")
// Dst
val tmp1 = result.select("id", "label")
  .join(edgeDF, result("id") === edgeDF("dst"))
  .withColumnRenamed("lbl", "dst_lbl")
val dstDF = tmp1.select("src", "dst", "label")
  .map(row => (row.getString(0) + "###" + row.getString(1), row.getLong(2)))
  .withColumnRenamed("_1", "dst_lbl").withColumnRenamed("_2", "dst_edge")
val newE = srcDF.join(dstDF, srcDF("src_lbl") === dstDF("dst_lbl"))
val newEdges = newE.filter(newE("src_edge") =!= newE("dst_edge"))
  .select("src_edge", "dst_edge")
  .map(row => (row.getLong(0).toString + "###" + row.getLong(1).toString, row.getLong(0), row.getLong(1)))
  .withColumnRenamed("_1", "edge").withColumnRenamed("_2", "src").withColumnRenamed("_3", "dst")
  .dropDuplicates("edge")
  .select("src", "dst")
val newGraph = GraphFrame(newVertexes, newEdges)
The averaged locations are then connected by edges, and in this case the problem is reduced from ~1600 vertices and ~35K edges to 25 vertices and 54 edges:
Here the non-green coloured segments (red, white, black, etc.) represent the individual communities. The green circles are the averaged GPS locations, and their sizes are proportional to the number of members in each community. Now it is considerably easier to perform an OLTP algorithm such as the one proposed by Daniel in the comment above.

CTree in R: get all leaf node split variables in the form of lists in a non-binary tree

The decision tree we are using in our current project uses the Conditional Inference Tree (ctree) algorithm. I can extract the split variables for binary ctrees using the code below:
#develop ctree decision tree
library(party)
prod_discount_data_ctree <- ctree(Discount ~ Prod, data = prod_discount_data, controls = ctree_control(minsplit = 30))
plot(prod_discount_data_ctree)
#extract the left and right terminal node split rule
lvls <- levels(prod_discount_data_ctree@tree$psplit$splitpoint)
#left leaf node split variable
left.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 1]
#right leaf node split variable
right.df = lvls[prod_discount_data_ctree@tree$psplit$splitpoint == 0]
This works fine if the tree has only one split node (depth = 1) that splits into 2 leaf nodes. But if the tree has a root node (node 1) that splits into inner nodes (nodes 2 and 5), which further split into leaf nodes (node 2 into {3, 4}, node 5 into {6, 7}), how should I traverse deeper and get the leaf node split variables?
Based on the example, I would want the split variables for nodes 3, 4, 6 and 7 in the form of 4 lists.
I tried all possible options and finally found a way to traverse a ctree and get the split variables for each leaf node. I am pasting the code snippet here in case anyone wants to refer to it in future.
if (nrow(SubBrandright_total) > 200) {
  sec_discount_data <- subset(SubBrandright_total, select = c(Discount, Sector))
  sec_discount_data_ctree <- ctree(Discount ~ Sector, data = sec_discount_data, controls = ctree_control(minsplit = 30))
  sec_lvls_r <- levels(sec_discount_data_ctree@tree$psplit$splitpoint)
  #Testing if the node is terminal [TRUE] or not [FALSE]
  #print(sec_discount_data_ctree@tree$terminal)
  #print(sec_discount_data_ctree@tree$left$terminal)
  #print(sec_discount_data_ctree@tree$left$left$terminal)
  #print(sec_discount_data_ctree@tree$left$right$terminal)
  sec_left_left.df = sec_lvls_r[sec_discount_data_ctree@tree$left$psplit$splitpoint == 1]
  sec_left.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 1]
  #Using setdiff to get the right leaf node: node levels minus left leaf node levels
  sec_left_right.df = setdiff(sec_left.df, sec_left_left.df)
  print("Sector Segmentation")
  print(sec_left_left.df)
  print(sec_left_right.df)
  sec_right.df = sec_lvls_r[sec_discount_data_ctree@tree$psplit$splitpoint == 0]
  sec_right_right.df = sec_lvls_r[sec_discount_data_ctree@tree$right$psplit$splitpoint == 0]
  #Using setdiff to get the left leaf node: node levels minus right leaf node levels
  sec_right_left.df = setdiff(sec_right.df, sec_right_right.df)
  print(sec_right_left.df)
  print(sec_right_right.df)
}
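For deeper trees, hard-coding paths like @tree$left$right quickly gets unwieldy. A recursive alternative (a sketch, assuming party's nested node structure with $terminal, $psplit, $left and $right, and a single factor split variable throughout) could collect the levels reaching each leaf:
#walk the tree recursively, returning one vector of factor levels per leaf node
collect_leaf_splits <- function(node, lvls, current = lvls) {
  if (node$terminal) return(list(current))
  left_lvls <- intersect(current, lvls[node$psplit$splitpoint == 1])
  right_lvls <- intersect(current, lvls[node$psplit$splitpoint == 0])
  c(collect_leaf_splits(node$left, lvls, left_lvls),
    collect_leaf_splits(node$right, lvls, right_lvls))
}

#usage with the ctree above:
#lvls <- levels(sec_discount_data_ctree@tree$psplit$splitpoint)
#collect_leaf_splits(sec_discount_data_ctree@tree, lvls)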

Graph branch decomposition

Hello,
I would like to know of an algorithm to decompose a graph into ranked branches, in the following way:
Rank | path (or tree branch)
-----+----------------------
 0   | 1-2
 1   | 2-3-4-5-6
 1   | 2-7
 2   | 7-8
 2   | 7-9
Node 1 would be the root node, and nodes 6, 8 and 9 would be the end nodes.
The rank of a branch is given by the number of bifurcation nodes between it and the root node. Let's assume that the graph has no loops (but I'd like to avoid that constraint if possible).
I am an electrical engineer, and perhaps this is a very standard problem, but so far I have only found the BFS algorithm for getting the paths, plus all the cut-set material, and I don't know whether any of it applies.
I hope that my question is clear enough.
PS: should this question be on Stack Overflow?
From your example, I'm making some assumptions:
You want to bifurcate whenever a node's degree is > 2
Your input graph is acyclic
With an augmented BFS this is possible from the root r. The following will generate comp_groups, a list of components (each of which is a list of its member vertices). The rank of each component is stored under the same index in the list rank.
comp[1..n] = -1               // init: all vertices belong to no component
comp[r] = 0                   // r is part of component 0
comp_groups = [[r]]           // a list of lists, starting with component 0
rank[0] = 0                   // component 0 (contains the root) has rank 0
next_comp_id = 1
queue = {r}                   // queues for BFS
next_queue = {}
while !queue.empty()
    for v in queue
        for u in neighbors(v)
            if comp[u] == -1                          // test if u is unvisited
                if degree(v) > 2
                    comp[u] = next_comp_id            // start a new component
                    next_comp_id += 1
                    rank[comp[u]] = rank[comp[v]] + 1 // new component's rank is one higher
                    comp_groups[comp[u]] += [v]       // the new branch starts at the bifurcation node v
                else
                    comp[u] = comp[v]                 // continue the same component
                comp_groups[comp[u]] += [u]           // add u to its component
                next_queue += {u}                     // add u to the next frontier
    queue = next_queue                                // move on to the next frontier
    next_queue = {}
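For reference, here is a close translation of the pseudocode into runnable R with igraph (a sketch; vertex and component ids are 1-based here, and a single growing queue replaces the two explicit frontiers):
library(igraph)

branch_decompose <- function(g, root) {
  comp <- rep(-1, vcount(g))        # component id per vertex; -1 = unvisited
  comp[root] <- 1
  comp_groups <- list(c(root))      # member vertices of each component
  rank <- c(0)                      # rank per component; the root branch has rank 0
  next_comp_id <- 2
  queue <- c(root)
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (u in as.integer(neighbors(g, v))) {
      if (comp[u] == -1) {
        if (degree(g, v) > 2) {     # bifurcation: start a new branch at v
          comp[u] <- next_comp_id
          next_comp_id <- next_comp_id + 1
          rank[comp[u]] <- rank[comp[v]] + 1
          comp_groups[[comp[u]]] <- c(v)
        } else {
          comp[u] <- comp[v]        # continue the current branch
        }
        comp_groups[[comp[u]]] <- c(comp_groups[[comp[u]]], u)
        queue <- c(queue, u)
      }
    }
  }
  list(groups = comp_groups, rank = rank)
}

# The example from the question: branches {1,2}, {2,3,4,5,6}, {2,7}, {7,8}, {7,9}
g <- graph_from_edgelist(rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6),
                               c(2, 7), c(7, 8), c(7, 9)), directed = FALSE)
branch_decompose(g, 1)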

The correct way to build a Binary Search Tree in OCaml

Ok, I have written a binary search tree in OCaml.
type 'a bstree =
  | Node of 'a * 'a bstree * 'a bstree
  | Leaf

let rec insert x = function
  | Leaf -> Node (x, Leaf, Leaf)
  | Node (y, left, right) as node ->
    if x < y then
      Node (y, insert x left, right)
    else if x > y then
      Node (y, left, insert x right)
    else
      node
The above code was said to be good in The right way to use a data structure in OCaml
However, I found a problem: this insert will only work when building a BST from a list in one go, such as
let rec set_of_list = function
  | [] -> empty
  | x :: l -> insert x (set_of_list l);;
So if we build a BST from a list in one go, there is no problem: we get a complete BST containing every node from the list.
However, if I have a BST built previously and now wish to insert a node, then the resulting BST won't reuse all the nodes from the previous tree, am I right?
Then how should I write a BST in OCaml so that we create a new BST with all the nodes from the previous tree, while keeping the previous tree immutable? If I need to copy every node from the old BST each time, will that hurt performance?
Edit:
So let's say that initially a BST is created with one node: t1 = (10, Leaf, Leaf).
Then I do let t2 = insert 5 t1 and get t2 = (10, (5, Leaf, Leaf), Leaf), right? Inside t2, let's give the name c1 to the child node (5, Leaf, Leaf).
Then I do let t3 = insert 12 t2 and get t3 = (10, (5, Leaf, Leaf), (12, Leaf, Leaf)). Let's give the name c2 to the child node (5, Leaf, Leaf) inside t3.
So my question is whether c1 == c2. Are the two (5, Leaf, Leaf)s in t2 and t3 physically the same?
I'll try to answer the sharing part of your question. The short answer is yes: the two parts of the two trees will be identical. The reason immutable data works so well is that there are no limitations on the possible sharing. That's one reason FP works so well.
Here's a session that does what you describe:
# let t1 = Node (10, Leaf, Leaf);;
val t1 : int bstree = Node (10, Leaf, Leaf)
# let t2 = insert 5 t1;;
val t2 : int bstree = Node (10, Node (5, Leaf, Leaf), Leaf)
# let t3 = insert 12 t2;;
val t3 : int bstree = Node (10, Node (5, Leaf, Leaf), Node (12, Leaf, Leaf))
# let Node (_, c1, _) = t2;;
val c1 : int bstree = Node (5, Leaf, Leaf)
# let Node (_, c2, _) = t3;;
val c2 : int bstree = Node (5, Leaf, Leaf)
# c1 == c2;;
- : bool = true
The long answer is that there's no guarantee that the two parts will be identical. If the compiler and/or runtime can see a reason to copy a subtree, it's also free to do that. There are cases (as in distributed processing) where that would be a better choice. Again the great thing about FP is that there are no limitations on sharing, which means that sharing is neither required nor forbidden in such cases.
Look at the accepted answer to the linked question.
To be specific, this line here:
let tree_of_list l = List.fold_right insert l Leaf
Work out the chain of what is happening. Take the list 1, 2, 3.
Since fold_right consumes the list from the right, first we have no tree and the result of insert 3 Leaf.
Call this T1.
Next is the tree generated by insert 2 T1.
Call this T2.
Then the tree generated by insert 1 T2.
This is what is returned as the result of tree_of_list.
If we call the result T3, then somewhere else in the code calling insert 4 T3 returns exactly what tree_of_list would return for the list 4, 1, 2, 3: inserting into an existing tree is just one more step of the same building process.
