I would like to export an hclust dendrogram from R into a data table in order to subsequently import it into another ("home-made") piece of software. str(unclass(fit)) provides a text overview of the dendrogram, but what I'm really looking for is a numeric table. I've looked at the Bioconductor ctc package, but the output it produces looks somewhat cryptic. I would like to have something similar to this table: http://stn.spotfire.com/spotfire_client_help/heat/heat_importing_exporting_dendrograms.htm
Is there a way to get this out of an hclust object in R?
In case anyone is also interested in dendrogram export, here is my solution. Most probably, it's not the best one as I started using R only recently, but at least it works. So suggestions on how to improve the code are welcome.
So, if hr is my hclust object and df is my data, the first column of which contains a simple index starting from 0, and the row names are the names of the clustered items:
# Retrieve the leaf order (row name and its position within the leaves)
leaf.order <- matrix(data=NA, ncol=2, nrow=nrow(df),
dimnames=list(c(), c("row.num", "row.name")))
leaf.order[,2] <- hr$labels[hr$order]
for (i in 1:nrow(leaf.order)) {
leaf.order[which(leaf.order[,2] %in% rownames(df[i,])),1] <- df[i,1]
}
leaf.order <- as.data.frame(leaf.order)
hr.merge <- hr$merge
n <- max(df[,1])
# Re-index all clustered leaves and nodes. First, all leaves are indexed starting from 0.
# Next, all nodes are indexed starting from the maximum leaf index + 1.
for (i in 1:length(hr.merge)) {
if (hr.merge[i]<0) {hr.merge[i] <- abs(hr.merge[i])-1}
else { hr.merge[i] <- (hr.merge[i]+n) }
}
node.id <- c(0:length(hr.merge))
# Generate dendrogram matrix with node index in the first column.
dend <- matrix(data=NA, nrow=length(node.id), ncol=6,
dimnames=list(c(0:(length(node.id)-1)),
c("node.id", "parent.id", "pruning.level",
"height", "leaf.order", "row.name")) )
dend[,1] <- c(0:((2*nrow(df))-2)) # Insert a leaf/node index
# Calculate parent ID for each leaf/node:
# 1) For each leaf/node index, find the corresponding row number within the merge-table.
# 2) Add the maximum leaf index to the row number as indexing the nodes starts after indexing all the leaves.
for (i in 1:(nrow(dend)-1)) {
dend[i,2] <- row(hr.merge)[which(hr.merge %in% dend[i,1])]+n
}
# Generate table with indexing of all leaves (1st column) and inserting the corresponding row names into the 3rd column.
hr.order <- matrix(data=NA,
nrow=length(hr$labels), ncol=3,
dimnames=list(c(), c("order.number", "leaf.id", "row.name")))
hr.order[,1] <- c(0:(nrow(hr.order)-1))
hr.order[,3] <- t(hr$labels[hr$order])
hr.order <- data.frame(hr.order)
hr.order[,1] <- as.numeric(hr.order[,1])
# Assign the row name to each leaf.
dend <- as.data.frame(dend)
for (i in 1:nrow(df)) {
dend[which(dend[,1] %in% df[i,1]),6] <- rownames(df[i,])
}
# Assign the position on the dendrogram (from left to right) to each leaf.
for (i in 1:nrow(hr.order)) {
dend[which(dend[,6] %in% hr.order[i,3]),5] <- hr.order[i,1]-1
}
# Insert height for each node.
dend[c((n+2):nrow(dend)),4] <- hr$height
# All leaves get the highest possible pruning level
dend[which(dend[,1] <= n),3] <- nrow(hr.merge)
# The nodes get a decreasing index starting from the pruning level of the
# leaves minus 1 and up to 0
for (i in (n+2):nrow(dend)) {
if ((dend[i,4] != dend[(i-1),4]) || is.na(dend[(i-1),4])){
dend[i,3] <- dend[(i-1),3]-1}
else { dend[i,3] <- dend[(i-1),3] }
}
dend[,3] <- dend[,3]-min(dend[,3])
dend <- dend[order(-node.id),]
# Write results table.
write.table(dend, file="path", sep=";", row.names=F)
There is a package that does exactly the opposite of what you want: Labeltodendro ;-)
But seriously, can't you just manually extract the elements from the hclust object (e.g. $merge, $height, $order) and create a custom table from the extracted elements?
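As a rough sketch of that suggestion (not the code from the answer above), the components of an hclust object can be assembled into a node table directly. The function name hclust_to_table, the 0-based id scheme and the USArrests demo data are assumptions chosen for illustration.

```r
# Sketch: turn an hclust object into a node table (one row per leaf or internal node).
# Ids are 0-based: leaves get 0..n-1, internal nodes get n..2n-2 (assumed scheme).
hclust_to_table <- function(hc) {
  n <- length(hc$order)                      # number of leaves
  m <- hc$merge                              # (n-1) x 2; negative = leaf, positive = earlier merge
  child <- ifelse(m < 0, -m - 1, m + n - 1)  # translate hclust's signed ids to the 0-based ids
  node  <- n + seq_len(n - 1) - 1            # id of the node created by each merge row
  parent <- rep(NA_real_, 2 * n - 1)         # the root keeps NA as its parent
  parent[child[, 1] + 1] <- node
  parent[child[, 2] + 1] <- node
  labels <- if (is.null(hc$labels)) as.character(seq_len(n)) else hc$labels
  data.frame(
    node.id    = 0:(2 * n - 2),
    parent.id  = parent,
    height     = c(rep(0, n), hc$height),                          # leaves sit at height 0
    leaf.order = c(match(seq_len(n), hc$order) - 1, rep(NA, n - 1)), # left-to-right position
    row.name   = c(labels, rep(NA, n - 1)),
    stringsAsFactors = FALSE
  )
}

hc  <- hclust(dist(USArrests[1:6, ]), method = "average")
tab <- hclust_to_table(hc)
print(tab)
```

The resulting data frame can be exported with write.table() in the same way as in the longer solution above.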
I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or I am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and extracting names from objects when referenced through strings. You'd need to eval() some of your object to make your current approach work.
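To illustrate the point about strings: in the loops from the question, i is the character string "l" or "t", not the data frame itself, so i[c("V1","V2")] indexes a one-element character vector and returns NA. A minimal sketch of looking objects up by name with get() and writing them back with assign() might look like this; the column total is a hypothetical stand-in for the real computation.

```r
# Two small data frames and a character vector holding their names
l <- data.frame(V1 = c(4, 3), V2 = c(3, 3))
t <- data.frame(V1 = c(4, 3), V2 = c(4, 3))
items <- c("l", "t")

for (name in items) {
  d <- get(name)           # fetch the data frame that the string refers to
  d$total <- d$V1 + d$V2   # hypothetical stand-in for the real computation
  assign(name, d)          # store the modified copy under the same name
}

print(l)
print(t)
```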
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
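If the goal is, as in the question, to attach the score to each data frame, one option is lapply() with a function that returns the modified data frame. This is only a sketch: score_fun is a hypothetical stand-in (a simple proportion of agreement) so that the example runs without the irr package; kripp.alpha(...)$value would take its place in practice.

```r
# Hypothetical stand-in for kripp.alpha(...)$value so the sketch runs without irr
score_fun <- function(m) mean(m[, 1] == m[, 2])

l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1), nrow = 2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1), nrow = 2))

# A named list keeps track of which data frame is which
items <- list(l = l, t = t)

# Return the whole data frame, with the score added as a new column
items <- lapply(items, function(d) {
  d$alpha <- score_fun(as.matrix(d[c("V1", "V2")]))
  d
})

print(items$l)
print(items$t)
```

Each element of items is then the original data frame plus an alpha column, which avoids assigning into objects by name.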
So, I have a function:
complete <- function(directory,id = 1:332 ) {
directory <- list.files(path="......a")
g <- list()
for(i in 1:length(directory)) {
g[[i]] <- read.csv(directory[i],header=TRUE)
}
rbg <- do.call(rbind,g)
rbgr <- na.omit(rbg) #reads files and omits NA's
complete_subset <- subset(rbgr,rbgr$ID %in% id,select = ID)
table.rbgr <- sapply(complete_subset,table)
table.rbd <- data.frame(table.rbgr)
id.table <- c(id)
findla.tb <- cbind (id.table,table.rbd)
names(findla.tb) <- c("id","nob")
print(findla.tb) #creates table with number of observations
}
Basically, when you call it with a specific numeric id (say 4),
you are supposed to get this output:
id nobs
15 328
So, I just need the nobs data to be fed into another function which measures the correlation between two columns if the nobs value is greater than another arbitrarily determined value(T). Since nobs is determined by the value of id, I am uncertain how to create a function that takes into account the output of the other function?
I have tried something like this:
corr <- function (directory, t) {
directory <- list.files(path=".......")
g <- list()
for(i in 1:length(directory)) {
g[[i]] <- read.csv(directory[i],header=TRUE)
}
rbg <- do.call(rbind,g)
g.all <- na.omit(rbg) #reads files and removes observations
source(".....complete.R") #sourcing the complete function above
complete("spec",id)
g.allse <- subset(g.all,g.all$ID %in% id,scol )
g.allnit <- subset(g.all,g.all$ID %in% id,nit )
for(g.all$ID %in% id) {
if(id > t) {
cor(g.allse,g.allnit) #calcualte correlation of these two columns if they have similar id
}
}
#basically for each id that matches the ID in g.all function, if the id > t variable, calculate the correlation between columns
}
complete("spec", 3)
cr <- corr("spec", 150)
head(cr)
I have also tried to make the complete function a data.frame but it does not work and it gives me the following error:
error in data.frame(... check.names = false) arguments imply differing number of rows. So, I am not sure how to proceed....
First off, a reproducible example always helps in getting your question answered, along with a clear explanation of what your functions do/are supposed to do. We cannot run your example code.
Next, you seem to have an error in your corr function. You make multiple references to id but never actually populate this variable in your example code. So we'll just have to guess at what you need help with.
I think what you are trying to do is:
given an id, call complete with that id
use the nobs from that in your code.
In this case, you need to make sure to store the output of your call to complete, e.g.
comp <- complete('spec', id)
You can access the id column value via comp['id'] and the nobs value via comp['nobs'], so you could do e.g.
if (comp['nobs'] > t) {
# do stuff e.g.
cor(g.allse, g.allnit)
}
Make sure you store the output of cor somewhere if you wish to actually get it back later.
You will have to fix the problem of id not being defined yourself, because it is unclear what you want that to be.
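The pattern of storing one function's result and using it in another can be sketched as follows. Everything here is an assumption made for illustration: the in-memory data frame dat stands in for the combined CSV files, and count_complete and corr_above are hypothetical names, not the functions from the question.

```r
# Hypothetical in-memory data standing in for the combined CSV files
set.seed(1)
dat <- data.frame(ID   = rep(1:3, times = c(5, 2, 4)),
                  scol = rnorm(11),
                  nit  = rnorm(11))

# Return (rather than only print) the number of complete rows per id
count_complete <- function(dat, id) {
  ok <- dat[complete.cases(dat) & dat$ID %in% id, ]
  data.frame(id = id, nobs = as.vector(table(factor(ok$ID, levels = id))))
}

# Store the counts, then compute a correlation only for ids above the threshold
corr_above <- function(dat, threshold) {
  counts <- count_complete(dat, unique(dat$ID))
  keep   <- counts$id[counts$nobs > threshold]
  vapply(keep, function(i) {
    rows <- dat[dat$ID == i & complete.cases(dat), ]
    cor(rows$scol, rows$nit)
  }, numeric(1))
}

counts <- count_complete(dat, 1:3)
res    <- corr_above(dat, threshold = 3)
print(counts)
print(res)
```

With these counts (5, 2 and 4 rows), only ids 1 and 3 exceed the threshold of 3, so res holds two correlations.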
I am currently working on a script that loads a TIF file into a raster object, crops it and plots two points (starting point and point of destination; selected via the click function) into that raster. I then want it to get the cell numbers of those two points. None of that has caused any trouble, but now I have tried to write a while loop which gets me the number of a random cell (adjacent to the current cell, beginning from the starting point) until that cell number equals the cell number of my point of destination. My idea behind that was to "walk" across the raster until I have reached my point of destination, or at least the column containing it (to reduce computation time). The numbers of the cells I cross during that walk should be stored in a vector ("Path"). I select the adjacent cell (= choose my next step) by randomly sampling from a vector that contains numbers that, when added to the current cell number, lead to the number of an adjacent cell. I have multiple vectors from which to sample, as the number of possible directions in which to "walk" differs depending on the position of the current cell (e.g. I can't "walk" to the cell to my lower right (= n + (ncol_dispersal + 1)) if I am currently positioned at the bottom of the raster). The script looks like this so far:
library(gdistance)
library(raster)
library(rgdal)
library(sp)
setwd("C:/Users/Giaco/Dropbox/Random Walk")
altdata <- raster("altitude.tif")
plot(altdata)
e <- extent(92760.79,93345.79,204017.5,204242.5)
dispersal_area <- crop(altdata,e)
plot(dispersal_area)
points(92790.79,204137.5,pch=16,cex=1)
points(93300.79,204062.5,pch=16,cex=1)
Pts <- matrix(c(92790.79,204137.5,93300.79,204062.5),nrow=2,ncol=2,byrow=TRUE)
Start <- cellFromXY(dispersal_area,Pts[1,])
End <- cellFromXY(dispersal_area,Pts[2,])
nrow_dispersal <- nrow(dispersal_area)
ncol_dispersal <- ncol(dispersal_area)
col_start <- colFromCell(dispersal_area,Start)
row_start <- rowFromCell(dispersal_area,Start)
col_end <- colFromCell(dispersal_area,End)
row_end <- rowFromCell(dispersal_area,End)
upper_left_corner <- cellFromRowCol(dispersal_area,1,1)
lower_left_corner <- cellFromRowCol(dispersal_area,14,1)
sample_standard <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_top <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_bottom <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
sample_left <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_upper_left <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_lower_left <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
Path <- c()
Path[1] <- Start
n <- Start
counter <- 1
while (n != End)
{
n = Start+sample(sample_standard,1)
if (colFromCell(dispersal_area,n)==col_end) {
n=End
break
} else if (n==upper_left_corner) {
n = n+sample(sample_upper_left,1)
} else if(n==lower_left_corner){
n = n+sample(sample_lower_left,1)
} else if(colFromCell(dispersal_area,n)==1) {
n = n+sample(sample_left,1)
} else if(rowFromCell(dispersal_area,n)==1){
n = n+sample(sample_top,1)
} else if(rowFromCell(dispersal_area,n)==nrow_dispersal) {
n = n+sample(sample_bottom,1)
}
counter <- counter+1
Path[counter] <- n
}
When I run the script and print the path vector, it returns a very long vector (I always have to stop it as it never finishes computing) which contains only a few different numbers. Why is that? I have been staring at this all day but I simply can't figure out where I went wrong. There must be something wrong with the while loop but I don't see it.
If anyone of you guys could help me out with this I would be really really thankful.
Thanks in advance !
Here is a simple and reproducible example (that also answers your question).
library(gdistance)
r <- raster(system.file("external/maungawhau.grd", package="gdistance"))
r <- aggregate(r, 5)
p <- matrix(c(2667531, 6478843, 2667731, 6479227), ncol=2, byrow=TRUE)
start <- cellFromXY(r, p[1,])
end <- cellFromXY(r, p[2,])
counter <- 1
cell <- start
path <- cell
while (cell != end) {
a <- adjacent(r, cell, pairs=F)
cell <- sample(a, 1)
path <- c(path, cell)
}
xy <- xyFromCell(r, path)
plot(r)
lines(xy)
or
cols <- rainbow(nrow(xy))
for (i in 1:(nrow(xy)-1)) { lines(xy[i:(i+1), ], col=cols[i]) }
This is pretty fast on this coarse raster, but it could indeed take a very long time to reach a particular cell on a large raster by random walk.
Perhaps there are functions in gdistance that are more useful?
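As for why the loop in the question keeps returning the same few numbers: its first statement is n = Start + sample(sample_standard, 1), so every iteration starts again from Start instead of from the current cell. The sketch below shows the same walk in base R without any raster package, assuming cells are numbered row by row from the top left; the helper neighbours and the 5 x 6 grid are hypothetical.

```r
# Neighbours of a cell in an nr x nc grid numbered row by row from the top left.
# Edge and corner cells simply get fewer neighbours, so no special cases are needed.
neighbours <- function(cell, nr, nc) {
  row <- (cell - 1) %/% nc + 1
  col <- (cell - 1) %% nc + 1
  offsets <- expand.grid(dr = -1:1, dc = -1:1)
  offsets <- offsets[!(offsets$dr == 0 & offsets$dc == 0), ]  # drop the cell itself
  r  <- row + offsets$dr
  cc <- col + offsets$dc
  ok <- r >= 1 & r <= nr & cc >= 1 & cc <= nc                 # keep only cells inside the grid
  (r[ok] - 1) * nc + cc[ok]
}

set.seed(42)
nr <- 5; nc <- 6
cell <- 1      # start in the top-left corner
end  <- 30     # stop in the bottom-right corner
path <- cell
while (cell != end) {
  nb   <- neighbours(cell, nr, nc)
  cell <- nb[sample.int(length(nb), 1)]  # step from the *current* cell, not from the start
  path <- c(path, cell)
}
print(length(path))
```

Sampling an index with sample.int() avoids the special behaviour of sample() when it is given a single number.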
I want to make a tree (cluster) using Interactive Tree of Life web-based tool (iTOL). As an input file (or string) this tool uses Newick format which is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. Beside that, additional information might be supported such as bootstrapped values of cluster's nodes.
For example, here I created dataset for a cluster analysis using clusterGeneration package:
library(clusterGeneration)
set.seed(1)
tmp1 <- genRandomClust(numClust=3, sepVal=0.3, numNonNoisy=5,
numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
data <- tmp1$datList[[2]]
Afterwards I performed cluster analysis and assessed support for the cluster's nodes by bootstrap using pvclust package:
set.seed(2)
y <- pvclust(data=data,method.hclust="average",method.dist="correlation",nboot=100)
plot(y)
Here is the cluster and bootstrapped values:
In order to make a Newick file, I used ape package:
library(ape)
yy<-as.phylo(y$hclust)
write.tree(yy,digits=2)
write.tree function will print tree in a Newick format:
((x2:0.45,x6:0.45):0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14):0.0064):0.12):0.22,(x5:0.28,x8:0.28):0.2):0.011);
Those numbers represent branch lengths (cluster's edge lengths). Following instructions from iTOL help page ("Uploading and working with your own trees" section) I manually added bootstrapped values into my Newick file (bolded values below):
((x2:0.45,x6:0.45)74:0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14)55:0.0064)68:0.12)100:0.22,(x5:0.28,x8:0.28)100:0.2)63:0.011);
It works fine when I upload the string into iTOL. However, I have a huge cluster and doing it by hand seems tedious...
QUESTION: What would be a code that can perform it instead of manual typing?
Bootstrap values can be obtained by:
(round(y$edges,2)*100)[,1:2]
Branch lengths used to form Newick file can be obtained by:
yy$edge.length
I tried to figure out how write.tree function works after debugging it. However, I noticed that it internally calls function .write.tree2 and I couldn't understand how to efficiently change the original code and obtain bootstrapped values in appropriate position in a Newick file.
Any suggestions are welcome.
Here is one solution for you: objects of class phylo have an available slot called node.label that, appropriately, gives you the label of a node. You can use it to store your bootstrap values. They will be written in your Newick file at the appropriate place, as you can see in the code of .write.tree2:
> .write.tree2
function (phy, digits = 10, tree.prefix = "")
{
brl <- !is.null(phy$edge.length)
nodelab <- !is.null(phy$node.label)
...
if (is.null(phy$root.edge)) {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(";")
}
else {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(":")
cp(sprintf(f.d, phy$root.edge))
cp(";")
}
...
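A small sketch of that behaviour on a toy tree, assuming the ape package is installed; the three-tip tree and the label values are made up for illustration.

```r
library(ape)

# A three-tip tree without node labels
tr <- read.tree(text = "((a:1,b:1):1,c:2);")

# One label per internal node; the first entry belongs to the root
tr$node.label <- c("100", "90")

# write.tree() places each label directly after the closing parenthesis of its node
newick <- write.tree(tr)
print(newick)
```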
The real difficulty is to find the proper order of the nodes. I searched and searched but couldn't find a way to find the right order a posteriori.... so that means we will have to get that information during the transformation from an object of class hclust to an object of class phylo.
And luckily, if you look into the function as.phylo.hclust, there is a vector containing the nodes index in their correct order vis-à-vis the previous hclust object:
> as.phylo.hclust
function (x, ...)
{
N <- dim(x$merge)[1]
edge <- matrix(0L, 2 * N, 2)
edge.length <- numeric(2 * N)
node <- integer(N) #<-This one
...
Which means we can make our own as.phylo.hclust with a nodenames parameter, as long as it is in the same order as the nodes in the hclust object (which is the case in your example since pvclust keeps a coherent order internally, i.e. the order of the nodes in the hclust is the same as in the table from which you picked the bootstraps):
# NB: in the following function definition I only modified the commented lines
as.phylo.hclust.with.nodenames <- function (x, nodenames, ...) #We add a nodenames argument
{
N <- dim(x$merge)[1]
edge <- matrix(0L, 2 * N, 2)
edge.length <- numeric(2 * N)
node <- integer(N)
node[N] <- N + 2L
cur.nod <- N + 3L
j <- 1L
for (i in N:1) {
edge[j:(j + 1), 1] <- node[i]
for (l in 1:2) {
k <- j + l - 1L
y <- x$merge[i, l]
if (y > 0) {
edge[k, 2] <- node[y] <- cur.nod
cur.nod <- cur.nod + 1L
edge.length[k] <- x$height[i] - x$height[y]
}
else {
edge[k, 2] <- -y
edge.length[k] <- x$height[i]
}
}
j <- j + 2L
}
if (is.null(x$labels))
x$labels <- as.character(1:(N + 1))
node.lab <- nodenames[order(node)] #Here we define our node labels
obj <- list(edge = edge, edge.length = edge.length/2, tip.label = x$labels,
Nnode = N, node.label = node.lab) #And you put them in the final object
class(obj) <- "phylo"
reorder(obj)
}
In the end, here is how you would use this new function in your case study:
bootstraps <- (round(y$edges,2)*100)[,1:2]
yy<-as.phylo.hclust.with.nodenames(y$hclust, nodenames=bootstraps[,2])
write.tree(yy,tree.names=TRUE,digits=2)
[1] "((x5:0.27,x8:0.27)100:0.24,((x7:0.25,(x4:0.14,(x1:0.13,x3:0.13)61:0.014)99:0.11)100:0.23,(x2:0.46,x6:0.46)56:0.022)61:0.027)100;"
#See the bootstraps ^^^ here for instance
plot(yy,show.node.label=TRUE) #To show that the order is correct
plot(y) #To compare with (here I used the yellow value)
I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"
I have the following function that works (prints a list of genes from column 4 of dmelGenome):
geneListBuild <- function(dmelGenome='', hitList='', binSize='', saveGeneList='')
{
genomeColumns <- c('chr', 'start', 'stop', 'gene')
genome <- read.table(dmelGenome, header=FALSE, col.names = genomeColumns)
chr <- genome[,1]
startAdjust <- genome[,2] - binSize
stopAdjust <- genome[,3] + binSize
gene <- genome[,4]
genome <- data.frame(chr, startAdjust, stopAdjust, gene)
hits <- read.table(hitList, header=TRUE)
chrHits <- hits[hits$chr == "chr3R",]
chrGenome <- genome[genome$chr == "chr3R",]
genes <- c()
for(i in 1:length(chrHits[,1]))
{
for(j in 1:length(chrGenome[,1]))
{
if( chrHits[i,2] >= chrGenome[j,2] && chrHits[i,3] <= chrGenome[j,3] )
{
print(chrGenome[j,4])
}
}
}
genes <- unique(genes[is.finite(genes)])
print(genes)
fileConn<-file(saveGeneList)
write(genes, fileConn)
close(fileConn)
}
However, when I substitute print() with:
genes[j] <- chrGenome[j,4]
R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?
Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.
This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.
Thanks.
I believe the inner loop could be replaced with:
gene.in <- chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3]
Then you can use that logical vector to select what you want. Doing
which(gene.in)
might also be of use to you.
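A self-contained sketch of the whole lookup using that logical vector is shown below. The two small data frames are made-up stand-ins for chrGenome and chrHits. Collecting the matches per hit and combining them afterwards also avoids writing to genes[j], which leaves NA gaps at every index j that did not match.

```r
# Made-up stand-ins for the genome and hit tables on one chromosome
chrGenome <- data.frame(chr   = "chr3R",
                        start = c(100, 500, 900),
                        stop  = c(300, 700, 1200),
                        gene  = c("geneA", "geneB", "geneC"),
                        stringsAsFactors = FALSE)
chrHits <- data.frame(chr   = "chr3R",
                      start = c(150, 950, 400),
                      stop  = c(250, 1000, 450))

# For each hit, compare against all genes at once and keep the matching names
matches <- lapply(seq_len(nrow(chrHits)), function(i) {
  gene.in <- chrHits$start[i] >= chrGenome$start & chrHits$stop[i] <= chrGenome$stop
  chrGenome$gene[gene.in]
})

genes <- unique(unlist(matches))
print(genes)
```

Setting stringsAsFactors = FALSE keeps the gene names as character strings rather than factor codes.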