Overlapping two gene sets, finding their overlap significance and plotting them - r

(Fig. 3a, b, Extended Data Fig. 3a, b and Supplementary Table 1).
After 48 h, more than one-third of the transcriptome was
differentially expressed (>5,000 genes; 405 genes encoding for
proteins in the extracellular region, Gene Ontology (GO) accession
0005576), significantly overlapping with the gene expression changes
of A375 tumours in vivo after 5 days of vemurafenib treatment (Fig.
3a, b and Extended Data Fig. 3c). Similar extensive gene expression
changes were observed in Colo800 and UACC62 melanoma cells treated
with vemurafenib and H3122 lung adenocarcinoma cells treated with
crizotinib (Extended Data Fig. 3d). Despite different cell lineages,
different oncogenic drivers, and different targeted therapies we
observed a significant overlap between the secretome of melanoma and
lung adenocarcinoma cells (P < 9.11 × 10−5)
The original paper
I would like to produce a plot similar to figure f, which shows the intersection and the significance of the overlap. I have code that works up to the intersection part, but I don't know how to run the significance part.
library(reshape2)
library(venneuler)
RNA_seq_cds <- read.csv("~/Downloads/RNA_seq_gene_set.txt", header=TRUE, sep="\t")
head(RNA_seq_cds)
ATAC_seq <- read.csv("~/Downloads/ATAC_seq_gene_set.txt", header=TRUE, sep="\t")
head(ATAC_seq)
RNA_seq <- RNA_seq_cds
ATAC_seq <- ATAC_seq
#https://stackoverflow.com/questions/6988184/combining-two-data-frames-of-different-lengths
cbindPad <- function(...) {
args <- list(...)
n <- sapply(args, nrow)
mx <- max(n)
pad <- function(x, mx) {
if (nrow(x) < mx) {
nms <- colnames(x)
padTemp <- matrix(NA, mx - nrow(x), ncol(x))
colnames(padTemp) <- nms
if (ncol(x) == 0) {
return(padTemp)
} else {
return(rbind(x, padTemp))
}
} else {
return(x)
}
}
rs <- lapply(args, pad, mx)
return(do.call(cbind, rs))
}
dat <- cbindPad(ATAC_seq, RNA_seq)
vennfun <- function(x) {
x$id <- seq(1, nrow(x)) #add a column of numbers (required for melt)
xm <- melt(x, id.vars="id", na.rm=TRUE) #melt table into two columns (value & variable)
xc <- dcast(xm, value~variable, fun.aggregate=length) #remove NA's, list presence/absence of each value for each variable (1 or 0)
rownames(xc) <- xc$value #value column=rownames (required for Venneuler)
xc$value <- NULL #remove redundant value column
xc #output the new dataframe
}
#https://stackoverflow.com/questions/9121956/legend-venn-diagram-in-venneuler
VennDat <- vennfun(dat)
genes.venn <- venneuler(VennDat)
genes.venn$labels <- c("RNA", "\nATAC" )
# plot(genes.venn, cex =15, )
#https://stackoverflow.com/questions/30225151/how-to-create-venn-diagram-in-r-studio-from-group-of-three-frequency-column
#https://rstudio-pubs-static.s3.amazonaws.com/13301_6641d73cfac741a59c0a851feb99e98b.html
vd <- venneuler(VennDat)
vd$labels <- paste(genes.venn$labels, colSums(VennDat))
plot(vd, cex=10)
text(.3, .45,
bquote(bold("Common ="~.(as.character(sum(rowSums(VennDat) == 2))))),
col="red", cex=1)
LABS <- vd$labels
The above code gives me the intersection plot.
For the significance part, how do I test the overlap between the two gene sets and display it as in the original figure?
The data I used to generate the above plot
Any suggestion or help would be really appreciated.

If you are asking how to place text under your figure, just use text() as you did before. It takes some trial and error to find the right x= and y= coordinates. Setting xpd=TRUE allows you to draw in the margin.
VennDat <- vennfun(dat)
vd <- venneuler(VennDat)
vd$labels <- paste(c("RNA", "ATAC"), colSums(VennDat))
plot(vd, cex=10, border=c(NA, 'red'), col=c('#6b65af', '#ad7261'))
text(x=.5, y=.5, sum(rowSums(VennDat) == 2), xpd=TRUE)
text(.5, .15, 'overlap\n', xpd=TRUE)
text(.5, .13, bquote(italic(p)*'< 9.11E-55'), xpd=TRUE)
I also adjusted some parameters of plot. You may inspect the code of the plotting method using:
venneuler:::plot.VennDiagram
If you want to know how significance is calculated, you should post your question at Cross Validated.
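One common way to attach a p-value to such an overlap is a one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test. The sketch below is not taken from the paper and may differ from the test the authors used. The function name overlap_pvalue, the toy gene sets and, above all, the background "universe" of genes are assumptions made for illustration; the choice of universe (for example, all genes measured in both assays) strongly affects the result.

```r
# Sketch: significance of the overlap between two gene sets (base R only).
# `universe` is an ASSUMED background: every gene that could have been in either set.
overlap_pvalue <- function(set_a, set_b, universe) {
  set_a <- intersect(unique(set_a), universe)
  set_b <- intersect(unique(set_b), universe)
  k <- length(intersect(set_a, set_b))  # observed overlap
  # P(overlap >= k) if length(set_b) genes were drawn at random from the universe
  phyper(k - 1, length(set_a), length(universe) - length(set_a),
         length(set_b), lower.tail = FALSE)
}

# Toy data with hypothetical gene names, not the data from the question
universe <- paste0("gene", 1:1000)
rna  <- universe[1:100]
atac <- universe[51:200]
p_overlap <- overlap_pvalue(rna, atac, universe)
format(p_overlap, digits = 3)
```

The formatted value could then replace the hard-coded string in the plot, for example text(.5, .13, paste("p =", format(p_overlap, digits = 3)), xpd = TRUE).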

Related

How do I create a combined colour plot with legend overlaying two marks in Spatstat?

I wanted to create a combined colour plot of two marks in Spatstat with a legend to show the species as well as the diameters of multiple species in one point process pattern.
I started with this plot:
set.seed(42)
species <- LETTERS[1:16]
diameter <- sample(15:50,16,TRUE)
x <- sample(2:18,16,TRUE)
y <- sample(2:18,16,TRUE)
library(spatstat)
Dat <- data.frame(x,y,species,diameter)
X <- as.ppp(Dat,W=square(20))
marks(X)$species <- factor(marks(X)$species)
ccc <- (1:16)[as.numeric(factor(marks(X)$species))]
# Here ccc will just be 1:16 since there are the same number
# of species as points, but in general ccc will be a vector of
# length = npoints(X), with entries determined by the species
# associated with the given point.
newPal <- vector("list",4)
newPal[[1]] <- colorRampPalette(c("green","red"))(10)
newPal[[2]] <- heat.colors(16)
newPal[[3]] <- topo.colors(16)
newPal[[4]] <- terrain.colors(16)
for(k in 1:4) {
palette(newPal[[k]])
plot(X,which.marks="diameter",maxsize=1,main="")
plot(X,which.marks="diameter",maxsize=1,bg=ccc,add=TRUE)
if(k < 4) readline("Go? ")
}
You need any planar point pattern file "X" (.ppp) with "dbh" (diameter at breast height) as a numeric variable and "species" (as a factor numerically coded (but still categorical variable)) as marks.
Thanks to Rolf Turner for this:
dbhAndSpecies <- function(X,palfn=c("hcl","heat","terrain","topo","cm","rainbow"),
palnm="Spectral",...) {
library(spatstat)
if(!(all(c("species","dbh") %in% names(marks(X))))) {
whinge <- paste0("The marks of \"X\" must be a data frame",
" having columns \"species\" and \"dbh\".\n")
stop(whinge)
}
X <- shift(X,origin="bottomleft")
xr <- Window(X)$xrange
yr <- Window(X)$yrange
X <- affine(X,mat=diag(c(5/xr[2],10/yr[2])))
lbls <- c("Ash","Beech","Lime","FieldMaple","Oak","WychElm",
"Alder","An","Birch","Sallow","CrabApple","Blackthorn",
"Hawthorn","Larch","WildService","Whitebeam","Holly","Privet",
"Yew","Rose","Dogwood","GuelderRose","Hazel","Spindle")
lbls <- substr(lbls,1,5)
ok <- (1:24) %in% marks(X)$species
marks(X)$species <- factor(marks(X)$species,levels=(1:24)[ok],labels=lbls[ok])
nspec <- length(levels(marks(X)$species))
palfn <- match.arg(palfn)
if(palfn=="hcl") {
pal <- hcl.colors(n=nspec,palette=palnm)
} else {
if(palfn != "rainbow") palfn <- paste0(palfn,".colors")
palfn <- get(palfn)
pal <- palfn(n=nspec)
}
kul <- (1:nspec)[as.numeric(factor(marks(X)$species))]
oldPal <- palette(pal)
on.exit(palette(oldPal))
OP <- par(xpd=NA)
on.exit(par(OP),add=TRUE)
plot(X,which.marks="dbh",maxsize=1,...)
plot(X,which.marks="dbh",maxsize=1,bg=kul,add=TRUE)
leg <- levels(marks(X)$species)
legend("right",pch=20,col=1:nspec,legend=leg,inset=-0.02,title="Species")
}

Bivariate Choropleth Map in R

I am looking for a general solution to create bivariate choropleth maps in R using raster files.
I have found the following code here which nearly does what I need but it is limited: it can only handle data which are between 0 and 1 on both axes. In my specific use-case one axis spans 0-1 while another spans between -1 and 1. Regardless as to my specific use-case, I think a more general function which can handle different data ranges would be useful to many people.
I have already tried updating the code within the function colmat to handle negative data, but for the life of me I cannot get it to work. In the interests of clarity I have avoided posting all of my failed attempts and have instead copied below the code I found at the link above, in the hope that someone may be able to offer a solution.
The current code first creates a colour matrix using colmat. The colour matrix generated is then used in bivariate.map along with your two raster files containing the data. I think the ideal solution would be to create the colour matrix based on the two rasters first (so that it can correctly bin the data based on your actual data, not the current solution which is between 0 and 1).
````
library(classInt)
library(raster)
library(rgdal)
library(dismo)
library(XML)
library(maps)
library(sp)
# Creates dummy rasters
rasterx<- raster(matrix(rnorm(400),20,20))
rasterx[rasterx <=0]<-1
rastery<- raster(matrix(rnorm(400),20,20))
# This function creates a colour matrix
# At present it cannot handle negative values i.e. the matrix spans from 0 to 1 along both axes
colmat<-function(nquantiles=10, upperleft=rgb(0,150,235, maxColorValue=255), upperright=rgb(130,0,80, maxColorValue=255), bottomleft="grey", bottomright=rgb(255,230,15, maxColorValue=255), xlab="x label", ylab="y label"){
my.data<-seq(0,1,.01)
my.class<-classIntervals(my.data,n=nquantiles,style="quantile")
my.pal.1<-findColours(my.class,c(upperleft,bottomleft))
my.pal.2<-findColours(my.class,c(upperright, bottomright))
col.matrix<-matrix(nrow = 101, ncol = 101, NA)
for(i in 1:101){
my.col<-c(paste(my.pal.1[i]),paste(my.pal.2[i]))
col.matrix[102-i,]<-findColours(my.class,my.col)
}
plot(c(1,1),pch=19,col=my.pal.1, cex=0.5,xlim=c(0,1),ylim=c(0,1),frame.plot=F, xlab=xlab, ylab=ylab,cex.lab=1.3)
for(i in 1:101){
col.temp<-col.matrix[i-1,]
points(my.data,rep((i-1)/100,101),pch=15,col=col.temp, cex=1)
}
seqs<-seq(0,100,(100/nquantiles))
seqs[1]<-1
col.matrix<-col.matrix[c(seqs), c(seqs)]
}
# Creates colour matrix
col.matrix<-colmat(nquantiles=2, upperleft="blue", upperright="yellow", bottomleft="green", bottomright="red", xlab="Species Richness", ylab="Change in activity hours")
# Function to create bivariate map, given the colour ramp created previously
bivariate.map<-function(rasterx, rastery, colormatrix=col.matrix, nquantiles=10){
quanmean<-getValues(rasterx)
temp<-data.frame(quanmean, quantile=rep(NA, length(quanmean)))
brks<-with(temp, quantile(temp,na.rm=TRUE, probs = c(seq(0,1,1/nquantiles))))
r1<-within(temp, quantile <- cut(quanmean, breaks = brks, labels = 2:length(brks),include.lowest = TRUE))
quantr<-data.frame(r1[,2])
quanvar<-getValues(rastery)
temp<-data.frame(quanvar, quantile=rep(NA, length(quanvar)))
brks<-with(temp, quantile(temp,na.rm=TRUE, probs = c(seq(0,1,1/nquantiles))))
r2<-within(temp, quantile <- cut(quanvar, breaks = brks, labels = 2:length(brks),include.lowest = TRUE))
quantr2<-data.frame(r2[,2])
as.numeric.factor<-function(x) {as.numeric(levels(x))[x]}
col.matrix2<-colormatrix
cn<-unique(colormatrix)
for(i in 1:length(col.matrix2)){
ifelse(is.na(col.matrix2[i]),col.matrix2[i]<-1,col.matrix2[i]<-which(col.matrix2[i]==cn)[1])
}
cols<-numeric(length(quantr[,1]))
for(i in 1:length(quantr[,1])){
a<-as.numeric.factor(quantr[i,1])
b<-as.numeric.factor(quantr2[i,1])
cols[i]<-as.numeric(col.matrix2[b,a])}
r<-rasterx
r[1:length(r)]<-cols
return(r)
}
# Creates map
bivmap<-bivariate.map(rasterx,rastery, colormatrix=col.matrix, nquantiles=2)
# Plots a map
plot(bivmap,frame.plot=F,axes=F,box=F,add=F,legend=F,col=as.vector(col.matrix))
````
Ideally, a more general function would take two raster files, determine the data ranges of both, and then create a bivariate choropleth map based on the number of bins/quantiles specified by the user.
Here are some ideas based on your code
Three functions
makeCM <- function(breaks=10, upperleft, upperright, lowerleft, lowerright) {
m <- matrix(ncol=breaks, nrow=breaks)
b <- breaks-1
b <- (0:b)/b
col1 <- rgb(colorRamp(c(upperleft, lowerleft))(b), max=255)
col2 <- rgb(colorRamp(c(upperright, lowerright))(b), max=255)
cm <- apply(cbind(col1, col2), 1, function(i) rgb(colorRamp(i)(b), max=255))
cm[, ncol(cm):1 ]
}
plotCM <- function(cm, xlab="", ylab="", main="") {
n <- cm
n <- matrix(1:length(cm), nrow=nrow(cm), byrow=TRUE)
r <- raster(n)
cm <- cm[, ncol(cm):1 ]
image(r, col=cm, axes=FALSE, xlab=xlab, ylab=ylab, main=main)
}
rasterCM <- function(x, y, n) {
q1 <- quantile(x, seq(0,1,1/(n)))
q2 <- quantile(y, seq(0,1,1/(n)))
r1 <- cut(x, q1, include.lowest=TRUE)
r2 <- cut(y, q2, include.lowest=TRUE)
overlay(r1, r2, fun=function(i, j) {
(j-1) * n + i
})
}
Example data
library(raster)
set.seed(42)
r <- raster(ncol=50, nrow=50, xmn=0, xmx=10, ymn=0,ymx=10, crs="+proj=utm +zone=1")
x <- init(r, "x") * runif(ncell(r), .5, 1)
y <- init(r, "y") * runif(ncell(r), .5, 1)
And now use the functions
breaks <- 5
cmat <- makeCM(breaks, "blue", "yellow", "green", "red")
xy <- rasterCM(x, y, breaks)
par(mfrow=c(2,2), mai=c(.5,.5,.5,.5), las=1)
plot(x)
plot(y)
par(mai=c(1,1,1,1))
plotCM(cmat, "var1", "var2", "legend")
par(mai=c(.5,.5,.5,.5))
image(xy, col=cmat, las=1)
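The binning idea behind rasterCM can be tried on plain numeric vectors, which makes it easier to see why the data range no longer matters. This is a minimal sketch under assumptions: bivariate_class is a hypothetical helper name, the inputs are toy vectors rather than rasters, and the quantile breaks are assumed to be unique (cut() fails if heavy ties produce duplicate breaks).

```r
# Sketch: combine two variables with arbitrary ranges into one class index.
# Breaks come from the data themselves, so a 0..1 axis and a -1..1 axis both work.
bivariate_class <- function(x, y, n) {
  qx <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  qy <- quantile(y, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  i <- as.integer(cut(x, breaks = qx, include.lowest = TRUE))  # bin of x: 1..n
  j <- as.integer(cut(y, breaks = qy, include.lowest = TRUE))  # bin of y: 1..n
  (j - 1) * n + i  # same indexing as in rasterCM: 1..n^2
}

set.seed(42)
x <- runif(200, 0, 1)    # toy variable spanning 0 to 1
y <- runif(200, -1, 1)   # toy variable spanning -1 to 1
cls <- bivariate_class(x, y, n = 3)
table(cls)
```

Each class index then selects one colour from a colour matrix with n rows and n columns, such as the one returned by makeCM.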

How to collapse branches in a phylogenetic tree by the label in their nodes or leaves?

I have built a phylogenetic tree for a protein family that can be split into different groups, classifying each one by its type of receptor or type of response. The nodes in the tree are labeled as the type of receptor.
In the phylogenetic tree I can see that proteins that belong to the same groups or type of receptor have clustered together in the same branches. So I would like to collapse these branches that have labels in common, grouping them by a given list of keywords.
The command would be something like this:
./collapse_tree_by_label -f phylogenetic_tree.newick -l list_of_labels_to_collapse.txt -o collapsed_tree.eps(or pdf)
My list_of_labels_to_collapse.txt would be like this:
A
B
C
D
My newick tree would be like this:
(A_1:0.05,A_2:0.03,A_3:0.2,A_4:0.1):0.9,(((B_1:0.05,B_2:0.02,B_3:0.04):0.6,(C_1:0.6,C_2:0.08):0.7):0.5,(D_1:0.3,D_2:0.4,D_3:0.5,D_4:0.7,D_5:0.4):1.2)
The output image without collapsing is like this:
http://i.stack.imgur.com/pHkoQ.png
The output image collapsing should be like this (collapsed_tree.eps):
http://i.stack.imgur.com/TLXd0.png
The width of the triangles should represent the branch length, and the height of the triangles must represent the number of nodes in the branch.
I have been playing with the "ape" package in R. I was able to plot a phylogenetic tree, but I still can't figure out how to collapse the branches by keywords in the labels:
require("ape")
This will load the tree:
cat("((A_1:0.05,A_2:0.03,A_3:0.2,A_4:0.1):0.9,(((B_1:0.05,B_2:0.02,B_3:0.04):0.6,(C_1:0.6,C_2:0.08):0.7):0.5,(D_1:0.3,D_2:0.4,D_3:0.5,D_4:0.7,D_5:0.4):1.2):0.5);", file = "ex.tre", sep = "\n")
tree.test <- read.tree("ex.tre")
Here should be the code to collapse
This will plot the tree:
plot(tree.test)
Your tree as it is stored in R already has the tips stored as polytomies. It's just a matter of plotting the tree with triangles representing the polytomies.
There is no function in ape to do this, as far as I am aware, but if you adjust the plotting a little you can pull it off:
# Step 1: make edges for descendent nodes invisible in plot:
groups <- c("A", "B", "C", "D")
group_edges <- numeric(0)
for(group in groups){
group_edges <- c(group_edges,getMRCA(tree.test,tree.test$tip.label[grepl(group, tree.test$tip.label)]))
}
edge.width <- rep(1, nrow(tree.test$edge))
edge.width[tree.test$edge[,1] %in% group_edges ] <- 0
# Step 2: plot the tree with the hidden edges
plot(tree.test, show.tip.label = F, edge.width = edge.width)
# Step 3: add triangles
add_polytomy_triangle <- function(phy, group){
root <- length(phy$tip.label)+1
group_node_labels <- phy$tip.label[grepl(group, phy$tip.label)]
group_nodes <- which(phy$tip.label %in% group_node_labels)
group_mrca <- getMRCA(phy,group_nodes)
tip_coord1 <- c(dist.nodes(phy)[root, group_nodes[1]], group_nodes[1])
tip_coord2 <- c(dist.nodes(phy)[root, group_nodes[1]], group_nodes[length(group_nodes)])
node_coord <- c(dist.nodes(phy)[root, group_mrca], mean(c(tip_coord1[2], tip_coord2[2])))
xcoords <- c(tip_coord1[1], tip_coord2[1], node_coord[1])
ycoords <- c(tip_coord1[2], tip_coord2[2], node_coord[2])
polygon(xcoords, ycoords)
}
Then you just have to loop through the groups to add the triangles
for(group in groups){
add_polytomy_triangle(tree.test, group)
}
I've also been searching for this kind of tool for ages, not so much for collapsing categorical groups, but for collapsing internal nodes based on a numerical support value.
The di2multi function in the ape package can collapse nodes to polytomies, but it currently can only do this by a branch length threshold.
Here is a rough adaptation that allows collapsing by a node support value threshold instead (default threshold = 0.5).
Use at your own risk, but it works for me on my rooted Bayesian tree.
di2multi4node <- function (phy, tol = 0.5)
# Adapted di2multi function from the ape package to plot polytomies
# based on numeric node support values
# (di2multi does this based on edge lengths)
# Needs adjustment for unrooted trees as currently skips the first edge
{
if (is.null(phy$edge.length))
stop("the tree has no branch length")
if (is.null(phy$node.label))
stop("the tree has no node labels")
if (is.na(as.numeric(phy$node.label[2])))
stop("node labels can't be converted to numeric values")
ind <- which(phy$edge[, 2] > length(phy$tip.label))[as.numeric(phy$node.label[2:length(phy$node.label)]) < tol]
n <- length(ind)
if (!n)
return(phy)
foo <- function(ancestor, des2del) {
wh <- which(phy$edge[, 1] == des2del)
for (k in wh) {
if (phy$edge[k, 2] %in% node2del)
foo(ancestor, phy$edge[k, 2])
else phy$edge[k, 1] <<- ancestor
}
}
node2del <- phy$edge[ind, 2]
anc <- phy$edge[ind, 1]
for (i in 1:n) {
if (anc[i] %in% node2del)
next
foo(anc[i], node2del[i])
}
phy$edge <- phy$edge[-ind, ]
phy$edge.length <- phy$edge.length[-ind]
phy$Nnode <- phy$Nnode - n
sel <- phy$edge > min(node2del)
for (i in which(sel)) phy$edge[i] <- phy$edge[i] - sum(node2del <
phy$edge[i])
if (!is.null(phy$node.label))
phy$node.label <- phy$node.label[-(node2del - length(phy$tip.label))]
phy
}
This is my answer based on phytools::phylo.toBackbone function,
see http://blog.phytools.org/2013/09/even-more-on-plotting-subtrees-as.html, and http://blog.phytools.org/2013/10/finding-edge-lengths-of-all-terminal.html. First, load the function at the end of code.
library(ape)
library(phytools) #phylo.toBackbone
library(phangorn)
cat("((A_1:0.05,E_2:0.03,A_3:0.2,A_4:0.1,A_5:0.1,A_6:0.1,A_7:0.35,A_8:0.4,A_9:01,A_10:0.2):0.9,((((B_1:0.05,B_2:0.05):0.5,B_3:0.02,B_4:0.04):0.6,(C_1:0.6,C_2:0.08):0.7):0.5,(D_1:0.3,D_2:0.4,D_3:0.5,D_4:0.7,D_5:0.4):1.2):0.5);"
, file = "ex.tre", sep = "\n")
phy <- read.tree("ex.tre")
groups <- c("A", "B|C", "D")
backboneoftree<-makebackbone(groups,phy)
# tip.label clade.label N depth
# 1 A_1 A 10 0.2481818
# 2 B_1 B|C 6 0.9400000
# 3 D_1 D 5 0.4600000
{
tryCatch(dev.off(),error=function(e){""})
par(fig=c(0,0.5,0,1), mar = c(0, 0, 2, 0))
plot(phy, main="Original" )
par(fig=c(0.5,1,0,1), oma = c(0, 0, 1.2, 0), xpd=NA, new=T)
plot(backboneoftree)
title(main="Clades")
}
makebackbone <- function(groupings,phy){
listofspecies <- phy$tip.label
listtopreserve <- character()
newedgelengths <- meandistnode<- lengthofclades<- numeric()
for (i in 1:length(groupings)){
bestmrca<-getMRCA(phy,grep(groupings[i], phy$tip.label) )
mrcatips<-phy$tip.label[unlist(phangorn::Descendants(phy,bestmrca, type="tips") )]
listtopreserve[i] <- mrcatips[1]
meandistnode[i] <- mean(dist.nodes(phy)[unlist(lapply(mrcatips,
function(x) grep(x, phy$tip.label) ) ),bestmrca] )
lengthofclades[i] <- length(mrcatips)
provtree <- drop.tip(phy,mrcatips, trim.internal=F, subtree = T)
n3 <- length(provtree$tip.label)
newedgelengths[i] <- setNames(provtree$edge.length[sapply(1:n3,function(x,y)
which(y==x),
y=provtree$edge[,2])],
provtree$tip.label)[provtree$tip.label[grep("tips",provtree$tip.label)] ]
}
newtree <- drop.tip(phy,setdiff(listofspecies,listtopreserve),
trim.internal = T)
n <- length(newtree$tip.label)
newtree$edge.length[sapply(1:n,function(x,y)
which(y==x),
y=newtree$edge[,2])] <- newedgelengths + meandistnode
trans <- data.frame(tip.label=newtree$tip.label,clade.label=groupings,
N=lengthofclades, depth=meandistnode )
rownames(trans) <- NULL
print(trans)
backboneoftree <- phytools::phylo.toBackbone(newtree,trans)
return(backboneoftree)
}
EDIT: I haven't tried this, but it might be another answer: "Script and function to transform the tip branches of a tree , i.e the thickness or to triangles, with the width of both correlating with certain parameters (e.g., species number of the clade) (tip.branches.R)"
https://www.en.sysbot.bio.lmu.de/people/employees/cusimano/use_r/index.html
I think the script is finally doing what I wanted.
From the answer that @CactusWoman provided, I changed the code a little so that the script tries to find the MRCA of the largest branch that matches my search pattern. This solved the problem of not merging non-polytomic branches, or of collapsing the whole tree because one matching node was mistakenly outside the correct branch.
In addition, I included a parameter that represents the limit for the pattern abundance ratio in a given branch, so we can select and collapse/group branches that have at least 90% of its tips matching to the search pattern, for example.
library(geiger)
library(phylobase)
library(ape)
#functions
find_best_mrca <- function(phy, group, threshold){
group_matches <- phy$tip.label[grepl(group, phy$tip.label, ignore.case=TRUE)]
group_mrca <- getMRCA(phy,phy$tip.label[grepl(group, phy$tip.label, ignore.case=TRUE)])
group_leaves <- tips(phy, group_mrca)
match_ratio <- length(group_matches)/length(group_leaves)
if( match_ratio < threshold){
#start searching for child nodes in which at least `threshold` of the descendants match the search pattern
mrca_children <- descendants(as(phy,"phylo4"), group_mrca, type="all")
i <- 1
new_ratios <- NULL
nleaves <- NULL
names(mrca_children) <- NULL
for(new_mrca in mrca_children){
child_leaves <- tips(phy, new_mrca)
child_matches <- grep(group, child_leaves, ignore.case=TRUE)
new_ratios[i] <- length(child_matches)/length(child_leaves)
nleaves[i] <- length(tips(phy, new_mrca))
i <- i+1
}
match_result <- data.frame(mrca_children, new_ratios, nleaves)
match_result_sorted <- match_result[order(-match_result$nleaves,match_result$new_ratios),]
found <- 0 #returned when no child node reaches the threshold
print(match_result_sorted)
for(line in 1:nrow(match_result_sorted)){
if(match_result_sorted$new_ratios[line]>=threshold){
return(match_result_sorted$mrca_children[line])
}
}
return(found)
}else{return(group_mrca)}
}
add_triangle <- function(phy, group,phylo_plot){
group_node_labels <- phy$tip.label[grepl(group, phy$tip.label)]
group_mrca <- getMRCA(phy,group_node_labels)
group_nodes <- descendants(as(phy,"phylo4"), group_mrca, type="tips")
names(group_nodes) <- NULL
x<-phylo_plot$xx
y<-phylo_plot$yy
x1 <- max(x[group_nodes])
x2 <-max(x[group_nodes])
x3 <- x[group_mrca]
y1 <- min(y[group_nodes])
y2 <- max(y[group_nodes])
y3 <- y[group_mrca]
xcoords <- c(x1,x2,x3)
ycoords <- c(y1,y2,y3)
polygon(xcoords, ycoords)
return(c(x2,y3))
}
#main
cat("((A_1:0.05,E_2:0.03,A_3:0.2,A_4:0.1,A_5:0.1,A_6:0.1,A_7:0.35,A_8:0.4,A_9:01,A_10:0.2):0.9,((((B_1:0.05,B_2:0.05):0.5,B_3:0.02,B_4:0.04):0.6,(C_1:0.6,C_2:0.08):0.7):0.5,(D_1:0.3,D_2:0.4,D_3:0.5,D_4:0.7,D_5:0.4):1.2):0.5);", file = "ex.tre", sep = "\n")
tree.test <- read.tree("ex.tre")
# Step 1: Find the best MRCA that matches the keywords or search pattern
groups <- c("A", "B|C", "D")
group_labels <- groups
group_edges <- numeric(0)
edge.width <- rep(1, nrow(tree.test$edge))
count <- 1
for(group in groups){
best_mrca <- find_best_mrca(tree.test, group, 0.90)
group_leaves <- tips(tree.test, best_mrca)
groups[count] <- paste(group_leaves, collapse="|")
group_edges <- c(group_edges,best_mrca)
#Step2: Remove the edges of the branches that will be collapsed, so they become invisible
edge.width[tree.test$edge[,1] %in% c(group_edges[count],descendants(as(tree.test,"phylo4"), group_edges[count], type="all")) ] <- 0
count = count +1
}
#Step 3: plot the tree hiding the branches that will be collapsed/grouped
last_plot.phylo <- plot(tree.test, show.tip.label = F, edge.width = edge.width)
#And save a copy of the plot so we can extract the xy coordinates of the nodes
#To get the x & y coordinates of a plotted tree created using plot.phylo
#or plotTree, we can steal from inside tiplabels:
last_phylo_plot<-get("last_plot.phylo",envir=.PlotPhyloEnv)
#Step 4: Add triangles and labels to the collapsed nodes
for(i in 1:length(groups)){
text_coords <- add_triangle(tree.test, groups[i],last_phylo_plot)
text(text_coords[1],text_coords[2],labels=group_labels[i], pos=4)
}
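The abundance-ratio check that find_best_mrca applies to each candidate branch can be looked at in isolation. The following is a small base R sketch, assuming the tip labels of one clade are already available as a character vector; match_ratio is a hypothetical helper name and is not part of ape, geiger or phylobase.

```r
# Sketch: share of tips in a clade whose label matches a search pattern.
match_ratio <- function(tip_labels, pattern) {
  hits <- grepl(pattern, tip_labels, ignore.case = TRUE)
  sum(hits) / length(tip_labels)
}

# Tips of the first clade in the example tree (note the stray E_2)
clade_tips <- c("A_1", "E_2", "A_3", "A_4", "A_5",
                "A_6", "A_7", "A_8", "A_9", "A_10")
ratio  <- match_ratio(clade_tips, "A")
passes <- ratio >= 0.90   # 9 of 10 tips match, so the clade meets a 90% threshold

# An unanchored, case-insensitive pattern also matches letters inside other names
loose  <- match_ratio(c("A_1", "Larch_1"), "A")     # both labels contain an "a"
strict <- match_ratio(c("A_1", "Larch_1"), "^A_")   # only labels starting with "A_"
```

With real species names it is probably safer to anchor the pattern (for example "^A_"), since a short keyword can otherwise match unrelated labels.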
This doesn't address depicting the clades as triangles, but it does help with collapsing low-support nodes. The library ggtree has a function as.polytomy which can be used to collapse nodes based on support values.
For example, to collapse bootstraps less than 50%, you'd use:
polytree = as.polytomy(raxtree, feature='node.label', fun=function(x) as.numeric(x) < 50)

lm models over all possible pairwise combinations of the columns of two matrices

I am working through a problem at the moment in R and have got stuck. I have searched around on various help lists for assistance but could not find anything - but apologies if I have missed something. A dummy example of my problem is below. I will continue to work on it, but any help would be greatly appreciated.
Thanks in advance for your time.
I have a matrix of response variables:
p<-matrix(c(rnorm(120,1),
rnorm(120,1),
rnorm(120,1)),
120,3)
and two matrices of covariates:
g<-matrix(c(rep(1:3, each=40),
rep(3:1, each=40),
rep(1:3, 40)),
120,3)
m<-matrix(c(rep(1:2, 60),
rep(2:1, 60),
rep(1:2, each=60)),
120,3)
For all combinations of the columns of the covariate matrices g and m I want to run these two models:
test <- function(uniq_m, uniq_g, p = p) {
full <- lm(p ~ factor(uniq_m) * factor(uniq_g))
null <- lm(p ~ factor(uniq_m) + factor(uniq_g))
return(list('f'=full, 'n'=null))
}
So I want to test for an interaction between column 1 of m and column 1 of g, then column 2 of m and column 1 of g, then column 2 of m and column 2 of g...and so forth across all possible pairwise interactions. The response variable is the same each time and is a matrix containing multiple columns.
So far, I can do this for a single combination of columns:
test_1 <- test(m[ ,1], g[ ,1], p)
And I can also run the model over all columns of m and one column of g:
test_2 <- apply(m, 2, function(uniq_m) {
test(uniq_m, g[ ,1], p = p)
})
I can then get the F statistics for each response variable of each model:
sapply(summary(test_2[[1]]$f), function(x) x$fstatistic)
sapply(summary(test_2[[1]]$n), function(x) x$fstatistic)
And I can compare models for each response variable using an F-test:
d1<-colSums(matrix(residuals(test_2[[1]]$n),nrow(g),ncol(p))^2)
d2<-colSums(matrix(residuals(test_2[[1]]$f),nrow(g),ncol(p))^2)
# 2 = extra (interaction) parameters in the full model; 114 = its residual degrees of freedom
F<-((d1-d2)/2) / (d2/114)
My question is how do I run the lm models over all combinations of columns from the m and the g matrix, and get the F-statistics?
While this is a dummy example, the real analysis will have a response matrix that is 700 x 8000, and the covariate matrices will be 700 x 4000 and 700 x 100 so I need something that is as fast as possible.
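One straightforward way to cover every column pair is to enumerate the pairs with expand.grid and compute the interaction F statistic for each. This is only a sketch under assumptions: interaction_F and the result names are hypothetical, the dummy matrices from the question are recreated with a fixed seed, and pairs for which the interaction cannot be estimated (empty factor combinations) are returned as NA. It refits two models per pair, so it is unlikely to be fast enough for the full-size problem without further work.

```r
# Dummy data as in the question (seed added for reproducibility)
set.seed(1)
p <- matrix(rnorm(120 * 3, 1), 120, 3)
g <- matrix(c(rep(1:3, each = 40), rep(3:1, each = 40), rep(1:3, 40)), 120, 3)
m <- matrix(c(rep(1:2, 60), rep(2:1, 60), rep(1:2, each = 60)), 120, 3)

# F statistic for the interaction term, one value per response column
interaction_F <- function(uniq_m, uniq_g, p) {
  full <- lm(p ~ factor(uniq_m) * factor(uniq_g))
  null <- lm(p ~ factor(uniq_m) + factor(uniq_g))
  rss_full <- colSums(as.matrix(residuals(full))^2)
  rss_null <- colSums(as.matrix(residuals(null))^2)
  df_num <- null$df.residual - full$df.residual   # number of interaction parameters
  if (df_num <= 0) return(rep(NA_real_, ncol(p))) # interaction not estimable
  ((rss_null - rss_full) / df_num) / (rss_full / full$df.residual)
}

# All pairwise combinations of columns of m and g
combos <- expand.grid(m_col = seq_len(ncol(m)), g_col = seq_len(ncol(g)))
Fstats <- t(mapply(function(i, j) interaction_F(m[, i], g[, j], p),
                   combos$m_col, combos$g_col))
colnames(Fstats) <- paste0("F_resp", seq_len(ncol(p)))
res <- cbind(combos, Fstats)
res
```

Taking the degrees of freedom from the fitted models, rather than hard-coding them, keeps the statistic correct when the factors have different numbers of levels.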
Hopefully this helps, this is some code a friend of mine shared with me. It may not be exactly what you need but might set you off in the right direction (though given this is 9 months later than you asked it, it may be of no use to you specifically!):
#### this first function models the correlation and fixes the text size based on the strength of the correlation
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
##### this function places a histogram of your data on the diagonal
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
### read in Fishers famous iris dataset for our example
data(iris)
head(iris)
library(corrgram)
##corrgram also gives you some nice panel options to use in pairs, but you don't necessarily need them
##e.g. panel.ellipse, panel.pie, panel.conf
library(asbio)
##asbio offers more panel options, such as a linear regression (panel.lm) etc
### run pairs() on your data
### set upper panel to panel.cor (the function we just wrote), and diagonal to panel.hist
### do what you like for the lower; adding a smoother line isn't very informative
pairs(~ Sepal.Length + Sepal.Width + Petal.Length, data=iris, lower.panel=panel.lm, upper.panel=panel.cor, diag.panel = panel.hist, main="pair plots of variables")
All credit to James Keating.

Making simple phylogenetic dendrogram (tree) from a list of species

I want to make a simple phylogenetic tree for a marine biology course as an educative example. I have a list of species with taxonomic rank:
Group <- c("Benthos","Benthos","Benthos","Benthos","Benthos","Benthos","Zooplankton","Zooplankton","Zooplankton","Zooplankton",
"Zooplankton","Zooplankton","Fish","Fish","Fish","Fish","Fish","Fish","Phytoplankton","Phytoplankton","Phytoplankton","Phytoplankton")
Domain <- rep("Eukaryota", length(Group))
Kingdom <- c(rep("Animalia", 18), rep("Chromalveolata", 4))
Phylum <- c("Annelida","Annelida","Arthropoda","Arthropoda","Porifera","Sipunculida","Arthropoda","Arthropoda","Arthropoda",
"Arthropoda","Echinodermata","Chordata","Chordata","Chordata","Chordata","Chordata","Chordata","Chordata","Heterokontophyta",
"Heterokontophyta","Heterokontophyta","Dinoflagellata")
Class <- c("Polychaeta","Polychaeta","Malacostraca","Malacostraca","Demospongiae","NA","Malacostraca","Malacostraca",
"Malacostraca","Maxillopoda","Ophiuroidea","Actinopterygii","Chondrichthyes","Chondrichthyes","Chondrichthyes","Actinopterygii",
"Actinopterygii","Actinopterygii","Bacillariophyceae","Bacillariophyceae","Prymnesiophyceae","NA")
Order <- c("NA","NA","Amphipoda","Cumacea","NA","NA","Amphipoda","Decapoda","Euphausiacea","Calanoida","NA","Gadiformes",
"NA","NA","NA","NA","Gadiformes","Gadiformes","NA","NA","NA","NA")
Species <- c("Nephtys sp.","Nereis sp.","Gammarus sp.","Diastylis sp.","Axinella sp.","Ph. Sipunculida","Themisto abyssorum","Decapod larvae (Zoea)",
"Thysanoessa sp.","Centropages typicus","Ophiuroidea larvae","Gadus morhua eggs / larvae","Etmopterus spinax","Amblyraja radiata",
"Chimaera monstrosa","Clupea harengus","Melanogrammus aeglefinus","Gadus morhua","Thalassiosira sp.","Cylindrotheca closterium",
"Phaeocystis pouchetii","Ph. Dinoflagellata")
dat <- data.frame(Group, Domain, Kingdom, Phylum, Class, Order, Species)
dat
I would like to get a dendrogram (cluster analysis) and use Domain as the first cutting point, Kingdom as the second, Phylum as the third, etc. Missing values should be ignored (no cutting point, a straight line instead). Group should be used as a coloring category for the labels.
I am a bit uncertain how to make a distance matrix from this data frame. There are a lot of phylogenetic tree packages for R, they seem to want newick data / DNA / other advanced information. Thus help with this would be appreciated.
It's probably a bit lame to answer my own question, but I found an easier solution. Maybe it helps someone one day.
library(ape)
taxa <- as.phylo(~Kingdom/Phylum/Class/Order/Species, data = dat)
col.grp <- merge(data.frame(Species = taxa$tip.label), dat[c("Species", "Group")], by = "Species", sort = F)
cols <- ifelse(col.grp$Group == "Benthos", "burlywood4", ifelse(col.grp$Group == "Zooplankton", "blueviolet", ifelse(col.grp$Group == "Fish", "dodgerblue", ifelse(col.grp$Group == "Phytoplankton", "darkolivegreen2", ""))))
plot(taxa, type = "cladogram", tip.col = cols)
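Note that all columns have to be factors. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so there dat needs to be created with stringsAsFactors = TRUE. This demonstrates the workflow with R: it takes a week to find out something, although the code itself is just a couple of rows =)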
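The nested ifelse() calls used for the tip colours can also be written as a lookup in a named vector, which is easier to extend when groups are added. This is a minimal sketch: group_cols, grp and tip_cols are names chosen for illustration, and the short grp vector stands in for col.grp$Group.

```r
# Sketch: map group names to colours with a named vector instead of nested ifelse()
group_cols <- c(Benthos       = "burlywood4",
                Zooplankton   = "blueviolet",
                Fish          = "dodgerblue",
                Phytoplankton = "darkolivegreen2")

# Stand-in for col.grp$Group; stored as a factor, as the columns of dat must be
grp <- factor(c("Fish", "Benthos", "Phytoplankton", "Zooplankton", "Fish"))

# as.character() matters: indexing by a factor would use its integer codes instead
tip_cols <- unname(group_cols[as.character(grp)])
tip_cols
```

The resulting vector could be passed to plot() as tip.col in the same way as cols above.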
If you wanted to draw the tree by hand (this is probably not the best way to do it), you could start as follows (it is not a complete answer: the colours are missing, and the edges are too long). This assumes that the data has already been sorted.
# Data: remove Group
dat <- data.frame(Domain, Kingdom, Phylum, Class, Order, Species)
# Start a new plot
par(mar=c(0,0,0,0))
plot(NA, xlim=c(0,ncol(dat)+1), ylim=c(0,nrow(dat)+1),
type="n", axes=FALSE, xlab="", ylab="", main="")
# Compute the position of each node and find all the edges to draw
positions <- NULL
links <- NULL
for(k in 1:ncol(dat)) {
y <- tapply(1:nrow(dat), dat[,k], mean)
y <- y[ names(y) != "NA" ]
positions <- rbind( positions, data.frame(
name = names(y),
x = k,
y = y
))
}
links <- apply( dat, 1, function(u) {
u <- u[ !is.na(u) & u != "NA" ]
cbind(u[-length(u)],u[-1])
} )
links <- do.call(rbind, links)
rownames(links) <- NULL
links <- unique(links[ order(links[,1], links[,2]), ])
# Draw the edges
for(i in 1:nrow(links)) {
from <- positions[links[i,1],]
to <- positions[links[i,2],]
lines( c(from$x, from$x, to$x), c(from$y, to$y, to$y) )
}
# Add the text
text(positions$x, positions$y, label=positions$name)
