Biclustering in R - r

I want to apply biclustering to a binary matrix in R. There is a nice package called "biclust" available, but it does not do and display everything that I want.
I have a binary matrix which looks like the following:
1 0 0 1 0 1 0
0 0 0 0 0 0 0
0 0 1 0 1 0 0
1 0 0 1 0 1 0
0 0 1 0 1 0 0
1 0 0 1 0 1 0
0 0 0 0 0 0 0
My goal is to bicluster (and display) it as follows (possibly colored):
1 1 1 0 0 0 0
1 1 1 0 0 0 0
1 1 1 0 0 0 0
0 0 0 1 1 0 0
0 0 0 1 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Set up code:
# install.packages("biclust") (if necessary)
library("biclust")
testMatrix <- matrix(c(1,0,0,1,0,1,0,
0,0,0,0,0,0,0,
0,0,1,0,1,0,0,
1,0,0,1,0,1,0,
0,0,1,0,1,0,0,
1,0,0,1,0,1,0,
0,0,0,0,0,0,0),
nrow = 7,
ncol = 7,
byrow = TRUE)
I applied the biclust function of the "biclust" R package:
testCluster <- biclust(x = testMatrix, method=BCBimax())
and indeed I get the two clusters expected:
An object of class Biclust
call:
biclust(x = testMatrix, method = BCBimax())
Number of Clusters found: 2
First 2 Cluster sizes:
BC 1 BC 2
Number of Rows: 3 2
Number of Columns: 3 2
I can both display the clusters separately by:
drawHeatmap(x = testMatrix, bicResult = testCluster, number = 1) # shown in picture below
drawHeatmap(x = testMatrix, bicResult = testCluster, number = 2)
and I can display the entire clustered matrix (one cluster at upper left corner) by:
drawHeatmap2(x = testMatrix, bicResult = testCluster, number = 1) # shown in picture below
drawHeatmap2(x = testMatrix, bicResult = testCluster, number = 2)
So far so good, but I want:
Colors of display switched. Now the 1 is red and the 0 is green.
I want to see the row and column numbers of the original matrix. Currently drawHeatmap shows only the row and column numbers of the specific cluster, and drawHeatmap2 shows no row or column numbers at all for the entire clustered matrix.
I want a nicely ordered clustered matrix. Currently only the cluster specified in drawHeatmap2 is shown in the upper left corner; I also want the other clusters ordered along the diagonal, from the upper left corner to the lower right corner.
Are these changes possible (with the "biclust" package)? Or is it better to do it in another way with R?

Change the drawHeatmap() function from the biclust package source:
trace("drawHeatmap", edit = TRUE)
Change the following:
(a) Switch red and green - switch the rvect and gvect in call rgb()
(b) Original rownames instead of new - change 'labels=' to '=bicCols' and '=bicRows'.
Print rownumbers: before axis about rows: cat(bicRows).
Save rownumbers to file - before axis about rows: write(bicRows, file="FILENAME.txt")
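As an alternative to editing drawHeatmap(), the block-diagonal ordering asked for in the question can be sketched in base R. This is only a sketch under assumptions: the membership matrices below are built by hand to stand in for what, as far as I understand, the RowxNumber and NumberxCol slots of a Biclust result hold; orderByCluster is a hypothetical helper, not part of the biclust package.

```r
# Toy matrix from the question, with names so the original positions stay visible
testMatrix <- matrix(c(1,0,0,1,0,1,0,
                       0,0,0,0,0,0,0,
                       0,0,1,0,1,0,0,
                       1,0,0,1,0,1,0,
                       0,0,1,0,1,0,0,
                       1,0,0,1,0,1,0,
                       0,0,0,0,0,0,0),
                     nrow = 7, byrow = TRUE,
                     dimnames = list(paste0("r", 1:7), paste0("c", 1:7)))

# ASSUMPTION: hand-built membership matrices standing in for a Biclust result
rowMember <- cbind(testMatrix[, 1] == 1, testMatrix[, 3] == 1)  # rows x clusters
colMember <- rbind(testMatrix[1, ] == 1, testMatrix[3, ] == 1)  # clusters x columns

# Hypothetical helper: order items by the first cluster they belong to,
# putting items that belong to no cluster last
orderByCluster <- function(member) {
  id <- apply(member, 1, function(m) if (any(m)) which(m)[1] else ncol(member) + 1L)
  order(id)
}

rowOrd  <- orderByCluster(rowMember)
colOrd  <- orderByCluster(t(colMember))
ordered <- testMatrix[rowOrd, colOrd]
print(ordered)  # original row/column names are kept as labels
```

The reordered matrix can then be passed to any plotting function with a colour vector of your choice, which also takes care of the red/green switch without touching the package source.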

Related

randomly regroup data in r ('regroup' in package 'LearnBayes' and 'caroline' didn't work)

Suppose I have a binomial distribution where n=12, p=0.2. I split this sample into 4 chunks (groups), each of size 3. Then I remove the chunks whose sum is equal to 0, and combine the remaining chunks into a new dataset. Here is some of my code:
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2=function(x,n)split(x,cut(seq_along(x),n,labels=FALSE))
chunk=chuck2(sample1,4)
newvector=c()
for (i in 1:4){
aa=chunk[[i]]
if (sum(aa)!=0){
a.no0=aa
newvector=c(newvector,a.no0)
}
}
print(newvector)
and this is the result I got:
[1] 1 1 0 0 1 0 0 1 0
what I'm trying to do is randomly regroup this data, for example:
[1] 0 1 0 0 1 1 1 0 0
or
[1] 1 0 1 0 1 0 1 0 0
......
I tried to use 'regroup' in package 'LearnBayes' and 'caroline', but it didn't work. Any hints please?
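If "randomly regroup" simply means a random permutation of the remaining values (which is what the two example outputs look like), base R's sample() may be enough. A minimal sketch under that assumption; the names chunks, kept, shuffled and regrouped are illustrative only:

```r
set.seed(123)
sample1 <- rbinom(12, 1, 0.2)

# Split into 4 chunks of 3 and keep only chunks that contain at least one 1
chunks    <- split(sample1, rep(1:4, each = 3))
kept      <- Filter(function(ch) sum(ch) != 0, chunks)
newvector <- unlist(kept, use.names = FALSE)

# Random permutation of the kept values.
# Caveat: sample(x) on a single number x samples from 1:x instead,
# so this assumes newvector has more than one element.
shuffled <- sample(newvector)

# Optionally cut the shuffled values back into groups of 3
regrouped <- split(shuffled, ceiling(seq_along(shuffled) / 3))
print(shuffled)
```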

How to refer to previous cell in a data-frame column (lagged cell), in R

I’m working in R and am trying to find a way to refer to the previous cell within a vector when that vector belongs to a data frame. By previous cell, I’m essentially hoping for a “lag” command of some sort so that I can compare one cell to the cell previous. As an example, I have these data:
A <- c(1,0,0,0,1,0,0)
B <- c(1,1,1,1,1,0,0)
AB_df <- cbind (A,B)
What I want is for a given cell in a given row, if that cell’s value is less than the previous cell’s value for the same column vector, to return a value of 1 and if not to return a value of 0. For this example, the new columns would be called “A-flag” and “B-flag” below.
A B A-flag B-flag
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
Any suggestions for syntax that can do this? Ideally, to just create a new column variable into an existing data-frame.
Here is one solution using the dplyr package and its lag function:
library(dplyr)
AB_df <- data.frame(A = A, B = B)
AB_df %>% mutate(A.flag = ifelse(A < lag(A, default = 0), 1, 0),
B.flag = ifelse(B < lag(B, default = 0), 1, 0))
A B A.flag B.flag
1 1 1 0 0
2 0 1 1 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 0 1 1
7 0 0 0 0
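The same flags can also be sketched without dplyr, using base R's diff(). This is an alternative to the answer above, not a description of how lag() works internally; lag_flag is a hypothetical helper name.

```r
A <- c(1, 0, 0, 0, 1, 0, 0)
B <- c(1, 1, 1, 1, 1, 0, 0)
AB_df <- data.frame(A = A, B = B)

# 1 where the value dropped relative to the previous row, else 0.
# The first row has no previous value, so it is flagged 0.
lag_flag <- function(x) as.integer(c(0, diff(x)) < 0)

AB_df$A.flag <- lag_flag(AB_df$A)
AB_df$B.flag <- lag_flag(AB_df$B)
print(AB_df)
```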

R - How to plot correct pie charts in haploNet haplotype networks {pegas} {ape} {adegenet}

To plot a haplotype network with haploNet() from {pegas}, I used a script available on the internet, but I think something is wrong with it. The script follows the woodmouse example. The code I used is:
x <- read.dna(file="Masto.fasta",format="fasta")
h <- haplotype(x)
net <- haploNet(h)
plot(net)
plot(net, size = attr(net, "freq"), fast = TRUE)
plot(net, size = attr(net, "freq"))
plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8)
table(rownames(x))
ind.hap<-with(
stack(setNames(attr(h, "index"), rownames(h))),
table(hap=ind, pop=rownames(x)[values])
)
ind.hap
plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8, pie=ind.hap)
legend(50,50, colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=20)
legend(x=7,y=10,c("Baeti ero","Felege weyni","Golgole naele","Hagare selam","Ruba feleg","Ziway"),c("red","yellow","green","turquoise","blue","magenta"))
However when plotting ind.hap, you can notice that some rows are not in the right place. You can see this here:
pop
hap Baetiero ETH022 ETH742 Felegeweyni Golgolenaele Rubafeleg
I 0 0 1 0 0 0
II 0 1 0 0 0 0
III 1 0 0 1 0 1
IV 2 0 0 0 0 3
IX 0 0 0 1 0 0
V 4 0 0 0 2 0
VI 4 0 0 1 0 4
VII 2 0 0 1 0 0
VIII 0 0 0 1 0 1
X 3 0 0 0 1 0
XI 0 0 0 0 1 1
XII 0 0 0 1 0 0
XIII 0 0 0 0 0 1
You can see that row IX is not on its right place. This would not be too much of a problem, but the program takes row 9 to make the pie plot for IX, which is the data of VIII. This is the result:
(I could not insert the image since my reputation is below 10..., you get the image by executing the whole file anyway)
You can see that for V until IX it's not as it should be (these are the swapped rows). For instance, IX contains only one haplotype, but its pie chart shows two haplotypes (each taking 50% of the chart), which is generated from the VIII data. The rows are sorted alphabetically instead of in ascending numerical order, and since this seems inherent to the package, I don't know what to do.
I'm far from a master in R, so try not to be too abstract, but provide code instead.
If there is someone who knows this package very well, please explain also why there are these weird extra lines behind the real charts (these with the numbers on them), as they were not visible in the woodmouse example (maybe that's because of what's wrong too?)
Thanx in advance
I've struggled with the same issue, but believe I came up with a solution.
The problem is that the step making the table of haplotype counts per "population" orders the haplotypes alphabetically. So, for example, haplotype "IX" comes before "V". On the other hand, the function haplotype() sorts the haplotypes by their "numerical" order. And this is what creates a discrepancy when plotting.
This can be solved by sorting the haplotype object by "label", as explained in ?haplotype help.
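The ordering mismatch itself can be reproduced without pegas. The sketch below uses made-up haplotype assignments (hap_of_sample is hypothetical data) to show that a count table built from plain character labels lists Roman numerals alphabetically, and that a factor with explicit levels keeps them in numerical order:

```r
# Roman-numeral labels I..XIII in numerical order
labs <- as.character(utils::as.roman(1:13))

# Alphabetical sorting puts "IX" right after "IV", ahead of "V"
print(sort(labs))

# Hypothetical haplotype label for each of six samples
hap_of_sample <- c("V", "IX", "V", "VIII", "IX", "I")

# table() on a character vector orders its entries alphabetically
alpha_tab <- table(hap_of_sample)

# Giving the levels explicitly keeps the numerical order (and empty rows)
num_tab <- table(factor(hap_of_sample, levels = labs))
print(names(alpha_tab))
print(names(num_tab))
```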
I'll use the woodmouse example data to exemplify:
# Sample 9 distinct haplotypes
library(pegas)
data(woodmouse)
x <- woodmouse[sample(9, 100, replace = T), ]
To simplify, I create a function to create the count table of haplotypes (based on this post):
countHap <- function(hap = h, dna = x){
with(
stack(setNames(attr(hap, "index"), rownames(hap))),
table(hap = ind, pop = attr(dna, "dimnames")[[1]][values])
)
}
Now, let's see the result without sorting haplotypes:
h <- haplotype(x) # create haplotype object
net <- haploNet(h) # create haploNet object
plot(net, pie = countHap(), size = attr(net, "freq"), legend = T)
Now, let's look at our count table, to check these results:
countHap(h, x)
pop
hap No0906S No0908S No0909S No0910S No0912S No0913S No304 No305 No306
I 0 0 0 0 0 0 0 8 0
II 0 0 0 0 0 0 9 0 0
III 0 0 0 0 0 0 0 0 10
IV 16 0 0 0 0 0 0 0 0
IX 0 0 0 0 0 8 0 0 0
V 0 12 0 0 0 0 0 0 0
VI 0 0 10 0 0 0 0 0 0
VII 0 0 0 13 0 0 0 0 0
VIII 0 0 0 0 14 0 0 0 0
Things do not match: for example, haplotype "V" should occur in individual "No0908S", but instead is coloured as individual "No0913S" (which should be the label for haplotype "IX").
Now, let's sort haplotypes:
h <- haplotype(x)
h <- sort(h, what = "labels") # This is the extra step!!
net <- haploNet(h)
plot(net, pie = countHap(), size = attr(net, "freq"), legend = T)
And all is well now!
Extra:
Although this is not requested by the OP, I thought of leaving it here if it is of interest for anyone else.
Sometimes, I find it convenient to label haplotypes by their frequency. This can be done by changing the haplotype labels to be equal to their frequencies:
attr(net, "labels") <- attr(net, "freq") # the labels used by plot() are stored on the haploNet object
plot(net, pie = countHap(), size = attr(net, "freq"), legend = T)

How to plot Pie charts in haploNet Haplotype Networks {pegas}

I'm trying to use the haploNet function of {pegas} to plot a haplotype network, but I'm having trouble putting identical haplotypes from different populations in the same pie chart. I can build a haplotype network with the following script:
x <- read.dna(file="x.fas",format="fasta")
h <- haplotype(x)
net <- haploNet(h)
plot(net)
I'd like to record in the DNAbin data the original population of each taxon, so I can have pie charts with different colors (for haplotypes from different populations) in the resulting network. I'd also like to remove overlapping circles in the resulting haplotype network.
Thanks for any help!
An example:
> data(woodmouse)
> x <- woodmouse[sample(15, size = 110, replace = TRUE), ]
> h <- haplotype(x)
> net <- haploNet(h)
> plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8)
This script builds a haplotype network using {pegas}. The bigger circles represent haplotypes that occur more often. I'd like to know how to record the origin of the haplotypes in the DNAbin matrix, so that they appear with different colors in the network.
OK, trying to make sense of your example: it appears you have 15 populations, with anywhere from 3 to 13 samples per population.
table(rownames(x))
# No0906S No0908S No0909S No0910S No0912S No0913S No1007S
# 10 8 6 3 3 7 6
# No1103S No1114S No1202S No1206S No1208S No304 No305
# 4 13 9 6 9 13 7
# No306
# 6
When you run haplotype(x), you get (unsurprisingly) 15 haplotypes representing a 1:1 mapping from population to haplotype. We can create a table showing the relationship between the populations and haplotypes with
ind.hap<-with(
stack(setNames(attr(h, "index"), rownames(h))),
table(hap=ind, pop=rownames(x)[values])
)
ind.hap[1:10, 1:9] #print just a chunk
# pop
# hap No0906S No0908S No0909S No0910S No0912S No0913S No1007S No1103S No1114S
# I 0 0 0 0 0 0 0 0 0
# II 0 0 0 0 0 0 6 0 0
# III 0 0 0 0 0 0 0 4 0
# IV 10 0 0 0 0 0 0 0 0
# IX 0 0 0 0 0 0 0 0 0
# V 0 0 6 0 0 0 0 0 0
# VI 0 0 0 0 0 0 0 0 0
# VII 0 0 0 0 0 7 0 0 0
# VIII 0 0 0 0 0 0 0 0 13
# X 0 0 0 0 0 0 0 0 0
We can use this table during plotting to draw pie charts at each of the nodes.
plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8, pie=ind.hap)
legend(50,50, colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=20)
To better show off the pie charts, we can assign incorrect populations to each of the samples
wrong.pop<-rep(letters[1:5], each=22)
ind.hap2<-with(
stack(setNames(attr(h, "index"), rownames(h))),
table(hap=ind, pop=wrong.pop[values])
)
plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8, pie=ind.hap2)
legend(50,50, colnames(ind.hap2), col=rainbow(ncol(ind.hap2)), pch=20)
Here you can see we have more diversity at each haplotype because we've incorrectly labeled the populations with artificial names so they don't clump as nicely.
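To see what the stack(setNames(...)) step in the table-building code does, it can be traced on toy data without pegas. Everything below is hypothetical: index stands in for attr(h, "index") (the sample rows that share each haplotype), labels for rownames(h), and pop for the population of each sample.

```r
# Hypothetical stand-ins for the pegas objects
index  <- list(c(1, 4), c(2, 3, 5), 6)      # samples belonging to each haplotype
labels <- c("I", "II", "III")               # haplotype labels
pop    <- c("a", "a", "b", "b", "b", "a")   # population of samples 1..6

# stack() turns the named list into long format:
# column 'values' = sample number, column 'ind' = haplotype label
long <- stack(setNames(index, labels))
print(long)

# Cross-tabulate haplotype against the population of each sample
ind.hap <- table(hap = long$ind, pop = pop[long$values])
print(ind.hap)
```

Each row of the resulting table is what gets drawn as one pie chart, so a row with counts in more than one column corresponds to a haplotype shared between populations.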

Find # of rows between events in R

I have a series of data in true/false format, e.g. something that could be generated by rbinom(n, 1, .1). I want a column that represents the number of rows since the last true. So the resulting data will look like
true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1
What is an efficient way to go from true/false to gap? (In practice this will be done on a large dataset with many different ids.)
DF <- read.table(text="true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1", header=TRUE)
DF$gap2 <- sequence(rle(DF$true.false)$lengths) * #create a sequence for each run length
(1 - DF$true.false) * #multiply with 0 for all 1s
(cumsum(DF$true.false) != 0L) #multiply with zero for the leading zeros
# true.false gap gap2
#1 0 0 0
#2 0 0 0
#3 1 0 0
#4 0 1 1
#5 0 2 2
#6 1 0 0
#7 1 0 0
#8 0 1 1
The cumsum part might not be the most efficient for large vectors. Something like
if (DF$true.false[1] == 0) DF$gap2[seq_len(rle(DF$true.false)$lengths[1])] <- 0
might be an alternative (and of course the rle result could be stored temporarily to avoid calculating it twice).
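Since the question mentions many different ids, here is a hedged sketch of one way to apply the same idea per group with base R's ave(). It is a variation on the approach above, not the answerer's code; gap_since_true and the example data frame are made up for illustration.

```r
# Rows since the last 1, with 0 for everything before the first 1
gap_since_true <- function(x) {
  seen <- cumsum(x)                                    # number of 1s seen so far
  run  <- ave(as.integer(x == 0), seen, FUN = cumsum)  # count zeros after each 1
  run * (seen > 0)                                     # blank out the leading zeros
}

x <- c(0, 0, 1, 0, 0, 1, 1, 0)
print(gap_since_true(x))

# Hypothetical data with two ids; the count restarts within each id
df <- data.frame(id   = rep(c("a", "b"), each = 4),
                 flag = c(0, 1, 0, 0,  0, 0, 1, 0))
df$gap <- ave(df$flag, df$id, FUN = gap_since_true)
print(df)
```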
OK, let me put this in an answer.
1) No-brainer method
data['gap'] = 0
for (i in 2:nrow(data)){
# note: zeros before the first 1 are counted too
if (data[i,'true/false'] == 0){
data[i,'gap'] = data[i-1,'gap'] + 1
}
}
2) No if check
data['gap'] = 0
for (i in 2:nrow(data)){
# multiply by (1 - flag) so the count resets to 0 on every 1
data[i,'gap'] = (data[i-1,'gap'] + 1) * (1 - data[i,'true/false'])
}
I really don't know which is faster: both do the same number of reads from data, but (1) has an if statement, and I don't know how fast that is compared to a single multiplication.
