I have generated a Pearson similarity matrix and plotted the results using pheatmap (clustered using hclust, method = "complete"). I'd like to output the ordered matrix, but in R the default seems to be to just alphabetize everything.
Here is my code:
df <- cor(t(genes), method = "pearson")
pheatmap(df, clustering_method = "complete")
head(genes)
pre early mid late end
AAC1 2.0059007 3.64679740 3.0092533 2.4936171 2.2693034
AAC3 -1.6843969 -1.62572636 -0.7654462 -1.5827481 -1.6059080
AAD10 2.6012529 2.05759631 1.3665322 1.4590833 0.3778324
AAD14 0.5047704 0.76021375 0.1825944 0.6111774 0.1174208
AAD15 7.6017557 8.52315453 7.2605744 6.9029452 5.9028824
AAD16 1.2018193 -0.03285354 0.2229450 -0.1337033 0.2198542
This is what the current output (df) looks like:
A B C D
A 1 0.5 0.25 0.1
B 0.1 1 0.1 0.5
C 0.5 0.2 1 0.2
D 0 0.1 0.7 1
How can I output the similarity matrix as ordered by hclust?
I've looked, but I haven't been able to find anything that quite accomplishes what I need. Thanks in advance for your help!
(also sorry I don't know how to properly format everything yet)
EDIT: maybe some visuals would help. My clustered pheatmap output looks like this: ordered heatmap
I can see groups of genes that behave similarly, but because there are so many it's impossible/useless to read the labels. I want to find out which genes cluster together, but I can't output the ordered matrix.
When I plot the data without clustering it looks like this: unclustered heatmap
So the output/data I can get is pretty much useless for further analysis.
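One possible way to get the ordered matrix, sketched with base R rather than pheatmap itself: run the same clustering (hclust with method = "complete" on Euclidean distances between rows, which is pheatmap's default distance) and index the matrix by the resulting leaf order. The toy genes values are the six rows from head(genes) above; the names hc, ordered and out_file are placeholders, not part of the original code.

```r
# Toy input: the six genes shown in head(genes) above
genes <- matrix(
  c( 2.0059007,  3.64679740,  3.0092533,  2.4936171,  2.2693034,
    -1.6843969, -1.62572636, -0.7654462, -1.5827481, -1.6059080,
     2.6012529,  2.05759631,  1.3665322,  1.4590833,  0.3778324,
     0.5047704,  0.76021375,  0.1825944,  0.6111774,  0.1174208,
     7.6017557,  8.52315453,  7.2605744,  6.9029452,  5.9028824,
     1.2018193, -0.03285354,  0.2229450, -0.1337033,  0.2198542),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("AAC1", "AAC3", "AAD10", "AAD14", "AAD15", "AAD16"),
                  c("pre", "early", "mid", "late", "end"))
)
df <- cor(t(genes), method = "pearson")

# Cluster the rows of the similarity matrix (complete linkage, Euclidean distance)
hc <- hclust(dist(df), method = "complete")

# Reorder rows and columns by the dendrogram leaf order
ordered <- df[hc$order, hc$order]

# Write the ordered matrix to a tab-separated file (placeholder file name)
out_file <- file.path(tempdir(), "ordered_similarity.txt")
write.table(ordered, out_file, sep = "\t", quote = FALSE, col.names = NA)

# If pheatmap is used, the object it returns should carry the same ordering:
#   p <- pheatmap(df, clustering_method = "complete")
#   ordered <- df[p$tree_row$order, p$tree_col$order]
```

To see which genes fall into the same group, cutree(hc, k = ...) returns one group label per gene; the number of groups k is something you would have to choose.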
I have a PCA that shows two really big clusters, and I don't know how to figure out which of my samples are in each cluster.
If it helps, I'm using prcomp to generate the PCA:
pca1 <- autoplot(prcomp(df), label = TRUE, label.size = 2)
My approach has been to attempt to cluster the PCA output using kmeans with 2 groups to get the clusters:
pca <- prcomp(df, scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=2)$cluster
I can then make a beautiful plot, but I am still lost as to which samples are in each cluster. For reference, here is the plot generated when I graph the kmeans output:
As you can see in the first PCA plot, the labels literally say which sample each dot is. My ideal output would be a two column txt file with the sample name in one column, and the group it belongs to in the other column.
All that aside, if there is a better way, please let me know.
Thanks in advance.
Here is a chunk of my data:
a b c b e
Sample_1013 312011 624559 625898 534309 220415
Sample_1046 474774 949458 951145 843049 366136
Sample_104 645363 1290450 1292520 919474 272200
Sample_1057 267319 534685 535294 690574 422645
Sample_106 414065 830571 834527 657354 234130
Sample_107 299289 602483 603756 566256 262153
In my question, clust is the name of the output from my kmeans:
clust <- kmeans(pca$x[,1:2], centers=2)$cluster
I typed clust into the terminal and got which samples belong to each group:
> clust
Sample_1013 Sample_1046 Sample_104 Sample_1057 Sample_106 Sample_107
1 1 1 1 1 1
Sample_1098 Sample_109 Sample_1109 Sample_1129 Sample_1130 Sample_1140
1 1 1 1 1 1
Sample_1149 Sample_115 Sample_118 Sample_1220 Sample_1223 Sample_1225
1 1 1 1 1 1
Hopefully this helps someone.
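For the two-column text file mentioned in the question, a minimal sketch: the named vector clust can be turned into a data frame and written with write.table. The six rows are the data chunk from the question; the column names v1 to v5, the seed, and the output file name are assumptions made for illustration.

```r
# Sample counts copied from the question; column names v1..v5 are made up
df <- matrix(
  c(312011,  624559,  625898, 534309, 220415,
    474774,  949458,  951145, 843049, 366136,
    645363, 1290450, 1292520, 919474, 272200,
    267319,  534685,  535294, 690574, 422645,
    414065,  830571,  834527, 657354, 234130,
    299289,  602483,  603756, 566256, 262153),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("Sample_1013", "Sample_1046", "Sample_104",
                    "Sample_1057", "Sample_106", "Sample_107"),
                  paste0("v", 1:5))
)

set.seed(1)  # kmeans starts from random centers; fix the seed for repeatability
pca <- prcomp(df, scale. = TRUE)
clust <- kmeans(pca$x[, 1:2], centers = 2)$cluster

# clust is a named integer vector: names are samples, values are group labels
groups <- data.frame(sample = names(clust), group = unname(clust))

# Two-column, tab-separated text file (placeholder file name)
out_file <- file.path(tempdir(), "pca_groups.txt")
write.table(groups, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

split(names(clust), clust) gives the same information as a list of sample names per group, which can be easier to read in the console.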
So I have plotted a curve, and have had a look in both my book and on Stack, but I cannot seem to find any code that makes R tell me the value of y on the curve at x = 70.
curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
)
So in short, I would like to know how to command R to give me the number of salmon at 70 years, and at other times.
Edit:
To clarify, I was curious how to command R to show multiple y values for x in increments of 5.
salmon <- data.frame(curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
))
salmon$y[salmon$x==70]
[1] 608.5285
This salmon data.frame gives you all of the data.
head(salmon)
x y
1 0.0 20.00000
2 1.4 21.41386
3 2.8 22.92768
4 4.2 24.54851
5 5.6 26.28392
6 7.0 28.14201
You can also use inequalities with the syntax above to check the number of salmon in given ranges.
It's also simple to answer the 2nd part of your question using this object:
salmon$z <- salmon$y*5 # I am using * instead of + to make the plot more clear
plot(x=salmon$x,y=salmon$z, xlab='Time passed since 1890', ylab='Population of Salmon',type="l")
lines(salmon$x,salmon$y, col="blue")
curve is plotting the function 20*1.05^x, so just plug any value you want into that function in place of x, e.g.
> 20*1.05^70
[1] 608.5285
>
20*1.05^(seq(from=0, to=70, by=10))
That was all I had to do. I had forgotten, until Ed posted his reply, that I could type a function directly into R.
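A small sketch that pulls the answers above together: wrapping the expression in a function makes it easy to evaluate at a single x or at a whole sequence. The name salmon_at and the step of 5 (taken from the question's edit) are choices made here for illustration.

```r
# The growth model plotted by curve(), written as a reusable function
salmon_at <- function(x) 20 * 1.05^x

# Value at a single time point
salmon_at(70)

# Values every 5 years across the plotted range, as a small lookup table
years <- seq(from = 0, to = 140, by = 5)
pop <- data.frame(years = years, salmon = salmon_at(years))
head(pop)

# The same function can be passed straight to curve()
curve(salmon_at, from = 0, to = 140,
      xlab = "Time passed since 1890", ylab = "Population of Salmon")
```

Because the function is vectorized, salmon_at(c(10, 70, 140)) works the same way for any set of times.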
I have 2 matrices in R. One is called
j= matrix(c(1:8,1:8), nrow=2,ncol=8)
and the second:
B= matrix (c(Dav_Bou_k_med$r,Dav_Bou$r),nrow=2,ncol=8)
Both Dav_Bou_k_med$r and Dav_Bou$r are matrices with nrow=1 and ncol=8, so they look like this:
[1] 1.668 2.000 1.5 1.7 1.7 1.9 1.9 2.5
etc.
I used this plot:
plot(j,B)
but what I get is just the points: for each value 1:8 of the first matrix (j) there are 2 points, because B has two rows. What I want is to connect the points belonging to each row of B with a line, ideally with a different color per row. Is there an easy way to achieve that?
It's a little difficult to interpret exactly what you are looking for, but I imagine it's something like this?
j= matrix(c(1:8,1:8), nrow=2,ncol=8, byrow=TRUE)
fake_data <- sample(seq(1,3,0.2), 8, replace=TRUE)
more_fake_data <- sample(seq(1,3,0.2), 8, replace=TRUE)
B= matrix (c(fake_data, more_fake_data),nrow=2,ncol=8, byrow=TRUE)
plot(j, B)
lines(j[1,],B[1,])
lines(j[2,],B[2,], col="green")
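An alternative sketch using matplot, which draws one line per column of a matrix, so transposing B gives one line per row without separate lines() calls. The first row of B is the vector shown in the question; the second row and the row names are invented for illustration.

```r
x <- 1:8

# First row is from the question; the second row is made-up example data
B <- rbind(
  k_med = c(1.668, 2.000, 1.5, 1.7, 1.7, 1.9, 1.9, 2.5),
  other = c(1.4, 1.8, 1.6, 2.1, 1.9, 2.2, 2.0, 2.3)
)

cols <- c("black", "green")

# matplot plots each column as its own series, so pass t(B)
matplot(x, t(B), type = "o", pch = 1, lty = 1, col = cols,
        xlab = "j", ylab = "B")
legend("topleft", legend = rownames(B), col = cols, lty = 1, pch = 1)
```

This scales to more rows by extending cols; the explicit lines() approach above is equally valid for two rows.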
I have a set of points with different y positions (A,B,C), all sharing the same x coordinate. Is it possible to cluster this set of 3 points together rather than individually?
I'd like to see the occurrence of this set of 3 points together in a given sample and see what set (A,B,C) is most frequent.
Most clustering algorithms I've seen cluster points by a single position (x,y), not a set of several points at a given x coordinate.
For instance, if I have the following
X A B C
1 0.7 0.1 0.2
2 0.3 0.4 0.1
3 0.4 0.5 0.1
4 0.7 0.1 0.2
5 0.7 0.1 0.2
6 0.2 0.1 0.5
The positions x = 1, 4 and 5 should be clustered together because they have the same set (A,B,C) = (0.7,0.1,0.2).
Is there any algorithm or tool (in R) that already does this: clustering by sets of several points and finding the most frequent set, with a graphical visualization?
Any help would be much appreciated.
If you're looking to tabulate the instances, then something along the lines of:
tab <- table(sprintf("%s:%s:%s", df1$A, df1$B, df1$C))
which.max(tab)
sort(tab, decreasing=TRUE)
will give you the most frequent combination (you can use strsplit to separate out the individual components if you need to go on and use them programmatically).
If you're looking to cluster, in the sense of find similar distances, then you can just use
dis <- dist(as.matrix(df1[, c("A","B","C")]))
clust <- hclust(dis)
and dis will tell you all the pairwise distances (find the zeros to get the identical rows), and clust will give you a tree based on similarity across A:C.
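Both approaches can be tried on the table from the question. This is only a sketch: df1 is rebuilt by hand from the six rows shown there, and key, by_triplet, d and identical_pairs are names chosen here.

```r
# The example table from the question
df1 <- data.frame(
  X = 1:6,
  A = c(0.7, 0.3, 0.4, 0.7, 0.7, 0.2),
  B = c(0.1, 0.4, 0.5, 0.1, 0.1, 0.1),
  C = c(0.2, 0.1, 0.1, 0.2, 0.2, 0.5)
)

# Tabulate: one string key per (A,B,C) triplet, then count occurrences
key <- paste(df1$A, df1$B, df1$C, sep = ":")
tab <- sort(table(key), decreasing = TRUE)
tab

# Which X positions share each triplet
by_triplet <- split(df1$X, key)
by_triplet

# Distances: identical triplets are exactly the pairs at distance zero
d <- as.matrix(dist(df1[, c("A", "B", "C")]))
identical_pairs <- which(d == 0 & upper.tri(d), arr.ind = TRUE)
identical_pairs
```

Matching on exact values like this only works if the triplets are truly identical; for measured data with noise, the hclust tree is the more forgiving option.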
If this isn't answering the question, you probably need to clarify. You say things like same x coordinate in the text, but none of your rows have the same X value. And it's fairly unconventional to switch interchangeably between y coordinate / position / (A,B,C) .
It's hard to suggest a visualisation without knowing what feature you want to emphasize. Possibly a multi-dimensional scaling graph, where each node represents all x with the same (A,B,C) triplet, and then neighbours are other X's with closest (A', B', C') values?
I'm fairly new to R, but I am trying to create line graphs that monitor the growth of bacteria over time. I can do this successfully, but the resulting graph isn't to my satisfaction: my time increments are not evenly spaced, yet R plots them as if they were. Here is some sample data to give you an idea of what I'm talking about.
x=c(.1,.5,.6,.7,.7)
plot(x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=1:5,lab=c(0,24,72,96,120))
As you can see, there are 48 hours between 24 and 72, but the points are evenly spaced on the graph. Is there any way I can adjust the scale to display my data more accurately?
It's always best in R to use data structures that exhibit the relationships between your data. Instead of defining growth and time as two separate vectors, use a data frame:
growth <- c(.1,.5,.6,.7,.7)
time <- c(0,24,72,96,120)
df <- data.frame(time,growth)
print(df)
time growth
1 0 0.1
2 24 0.5
3 72 0.6
4 96 0.7
5 120 0.7
plot(df, type="o")
Not sure if this produces the exact x-axis labels that you want, but you should be free to edit the graph now without changing the relationship between the growth and time variables.
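If tick labels at exactly the measured time points are wanted, one option (a sketch, not the only way) is to plot growth against the real time values, suppress the default axis, and draw ticks at those values with axis(). The data are the vectors defined above.

```r
growth <- c(0.1, 0.5, 0.6, 0.7, 0.7)
time <- c(0, 24, 72, 96, 120)

# Passing time as x makes the spacing proportional to elapsed hours
plot(time, growth, type = "o", xaxt = "n",
     xlab = "Time (hours)", ylab = "Growth")

# Draw the x-axis with ticks only at the sampled time points
axis(1, at = time, labels = time)
```

The 48-hour gap between 24 and 72 now appears twice as wide as the 24-hour gaps, which is what the original index-based plot was missing.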
x=data.frame(x=c(.1,.5,.6,.7,.7), y=c(0,24,72,96,120))
plot(x$y, x$x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")