Why does the histogram look different with two different breaks arguments in R? - r

I want to plot the distribution of my dataset using a histogram in R. I tried different breaks arguments (the default, Freedman-Diaconis, and Scott) to get the best representation. I may use a log scale later, but first I want to see the raw distribution without any scaling. However, the results look different. Why is that? The dataset I use can be downloaded from here data or here data. The code I'm running is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = 200)
[resulting histogram]
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Scott")
[resulting histogram]
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Freedman-Diaconis")
[resulting histogram]
Please help. Thank you very much.

Histograms are very sensitive to the choice of cell break points. Even for the same (!) number of cells, the histogram can become considerably different by just a small shift of the cell borders. It is thus generally preferable to use kernel density estimators instead of histograms, because they do not depend on random cell border placement:
# increase n if you have a wide range of values
d <- density(as.matrix(deviation_all_genes_all_spots), n = 512)
plot(d$x, d$y, type = "l")  # or simply plot(d)
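To see that sensitivity directly, here is a minimal sketch with simulated data: two histograms with the same cell width, differing only in a small shift of the cell borders.
set.seed(1)
x <- rnorm(500)
par(mfrow = c(1, 2))
hist(x, breaks = seq(-5, 5, by = 0.5), main = "borders on multiples of 0.5")
hist(x, breaks = seq(-5.25, 5.25, by = 0.5), main = "same width, shifted by 0.25")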
In your second and third calls of hist, you ask for an automatic way to select the number of cells and the cell borders. Evidently, this results in more cells than your first call with breaks = 200. You can query the cells from the return value of hist, e.g.
h <- hist(as.matrix(deviation_all_genes_all_spots))
cat(sprintf("number of cells = %i\n", length(h$mids)))
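For example, to compare how many cells each breaks specification actually produces, you can loop over them with plot = FALSE (object name taken from the question):
# Note: a numeric breaks value is only a suggestion; R passes it through
# pretty() to pick "nice" cell borders, so you rarely get exactly 200 cells.
for (b in list(200, "Scott", "Freedman-Diaconis")) {
  h <- hist(as.matrix(deviation_all_genes_all_spots), breaks = b, plot = FALSE)
  cat(sprintf("breaks = %s: %i cells\n", b, length(h$mids)))
}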

Related

Manually set breaks and keep equal distance between them

I'd like to plot a dataset that consists of two vectors of length 100. Because the difference between the vectors' means is large while the variance within each vector is comparatively small, it is quite difficult to plot both vectors and still be able to see the variation within each of them.
What I'd like is to be able to manually set the breaks so that we could see both the difference between the vectors and the variation within them.
Consider this data set
a <- rnorm(100, sd = 0.005) + 1
b <- rnorm(100, sd = 0.005) + 10
vec <- c(a, b)
Neither plot(vec) nor plot(vec, log="y") gives satisfying results, as it is not possible to distinguish the variation within each vector (see picture).
I'd like the breaks on the y-axis to be (min(a), max(a), 5, min(b), max(b)) (and get equal distance between them). How could one achieve that?
Depending on exactly what you are trying to do, a simple transformation of the data in each part of the vector might be enough:
vec2 <- c((a - min(a)) / (max(a) - min(a)), 3 + (b - min(b)) / (max(b) - min(b)))
plot(vec2, axes = FALSE)
box()
axis(1)
axis(2, at = c(0, 1, 2, 3, 4), labels = round(c(min(a), max(a), 5, min(b), max(b)), 2))
Alternative approaches might be a custom transformation in ggplot, a secondary axis in ggplot, breaking the graph into facets, or using ggbreak.
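For instance, a minimal sketch with ggbreak (assuming the ggplot2 and ggbreak packages are installed; the break interval is just the empty gap between the two groups):
library(ggplot2)
library(ggbreak)
df <- data.frame(index = seq_along(vec), value = vec)
ggplot(df, aes(index, value)) +
  geom_point() +
  scale_y_break(c(max(a) + 0.01, min(b) - 0.01))  # cut the empty range out of the y-axis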

How would I split a histogram or plot that shows the number of main Principal Components?

I have performed a PCA using the prcomp function alongside the FactoMineR and factoextra packages on quite a substantial dataset of 3000 x 500.
I have tried plotting the main principal components that cover up to 100% of the cumulative variance proportion with an fviz_eig plot. However, the plot is very large because of the dataset's dimensions. Is there any way in R to split a plot into multiple plots, using a for loop or otherwise?
Here is a visual of my plot; it only covers 80% of the variance because the full plot is too large. Could I split this plot into two plots?
Large Dataset Visualisation
I have tried splitting the plot up using a for loop...
for (i in data[1:20]) {
  fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using a dataset already available in R, but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has many more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have performed what was said by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim = c(0, 0.22))
lines(0.7 + (0:10)*1.2, POEV, type = "b", pch = 20)
text(0.7 + (0:10)*1.2, POEV, labels = round(100*POEV, 1), pos = 3)
barplot(POEV[1:40], ylim = c(0, 0.22), main = "PCs 1 - 40")
text(0.7 + (0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1), pos = 3)
and I have now got a graph as follows...
[graph]
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result, but I am 100% sure that you need to take more control over what is being plotted, i.e. do more of it yourself. So let me show an example of doing that. The frets data that you used has only 4 dimensions, so it is hard to illustrate what to do with more dimensions; I will instead use the nuclear data, also available in the boot package. I am going to start by reproducing the type of graph that you displayed and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig plot that you displayed, but has three main differences. First, it shows the actual variances, not the percent of variance explained. Second, it does not contain the line that connects the tops of the bars. Third, it does not have the text labels that give the heights of the bars.
Percent of Variance Explained. The return value of prcomp contains the raw information. str(N_PCA) shows that it holds the standard deviations, not the variances, and we want the proportion of total variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add it if you feel you need it, but I recommend against it. What does that line tell you that you can't already see from the barplot? If you are concerned about too much clutter obscuring the information, get rid of the line. But just in case you really want it, you can add it with
lines(0.7 + (0:10)*1.2, POEV, type = "b", pch = 20)
However, I will leave it out, as I view it as clutter.
Finally, you can add the text with
text(0.7 + (0:10)*1.2, POEV, labels = round(100*POEV, 1), pos = 3)
This is also somewhat redundant, but particularly if you change scales (as I am about to do), it can be helpful for making comparisons.
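As an aside, since the original post is about covering a given share of cumulative variance, that is available from the same quantities:
cumsum(POEV)                   # cumulative proportion of variance explained
which(cumsum(POEV) >= 0.8)[1]  # number of PCs needed to reach 80%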
OK, now that we have the substance of your original graph, it is easy to separate it into several parts. For my data, the first two bars are big, so the rest are hard to see; in fact, PCs 5-11 show up as zero. Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim = c(0, 0.8), main = "PC 1-4")
text(0.7 + (0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1), pos = 3)
barplot(POEV[5:11], ylim = c(0, 0.0001), main = "PC 5-11")
text(0.7 + (0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4), pos = 3, cex = 0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4, it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you can find an appropriate way to group your components, you can zoom in on whichever PCs you want.
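One more note, relevant to the label problem in the update above: rather than hard-coding the x positions as 0.7 + (0:k)*1.2, which only lines up when k + 1 equals the number of bars drawn, you can reuse the bar midpoints that barplot() invisibly returns. A sketch with the nuclear data from above:
bp <- barplot(POEV, ylim = c(0, 0.8))  # bp holds the x midpoint of every bar
text(bp, POEV, labels = round(100*POEV, 1), pos = 3)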

Unsure how to plot a histogram with variable break points from a one column matrix in R

I have a matrix which has the following approximate dimensions: 20000 x 1. I would like to plot the values in a histogram with bins of length 0.01 from -0.05 to +0.15. However, the values in the matrix are pretty random - for eg, 0.0123421, 0.0124523, 0.124523, -0.011234, etc. Thus, I need to first count the number of values that fall into a particular bin, and then plot a histogram. For the numbers I gave, I'd have 2 values between 0.01 and 0.02, 1 between -0.02 and -0.01, and so on, which I need in a histogram. Is there an easy way to do this? I'm relatively new to R, so any help is appreciated!
As an example illustrating breaks (content summarized from an excellent post on R-bloggers, which you can refer to here), let's assume that you start with some normally distributed data. In R, you can generate normal data using the rnorm() function:
data <- rnorm(n = 1000, mean = 24.2, sd = 2.2)
We can then generate a simple histogram using the following call:
hist(data)
Now, let's assume that you want coarser or finer groups for your bins. There are a number of ways to do this. You could, for example, use the breaks argument. Below is a tidy example illustrating this:
hist(data, breaks=20, main="Breaks=20")
hist(data, breaks=5, main="Breaks=5")
Now, if you want more control over the exact breakpoints between bins, you can be more precise with the breaks argument and give it a vector of breakpoints, like this:
hist(data, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")
This dictates exactly the start and end point of each bin. Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:
hist(data, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")
Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b). This is important if you have a lot of points exactly at the breakpoint.
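A tiny sketch showing the difference when values sit exactly on a breakpoint:
v <- c(1, 2, 2, 3)
hist(v, breaks = 0:4, plot = FALSE)$counts                 # (a,b] cells: 1 2 1 0
hist(v, breaks = 0:4, right = FALSE, plot = FALSE)$counts  # [a,b) cells: 0 1 2 1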
hist(x, breaks = seq(-0.05, 0.15, 0.01))  # x is your 20000 x 1 matrix
See ?hist.
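If you also want the per-bin counts themselves (as described in the question), hist() computes them without drawing anything; a sketch, with x again standing in for your 20000 x 1 matrix:
h <- hist(as.matrix(x), breaks = seq(-0.05, 0.15, 0.01), plot = FALSE)
data.frame(lower = head(h$breaks, -1), upper = tail(h$breaks, -1), count = h$counts)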

pch in plot with R

I have a dataset that I have plotted, and I am now trying to build a legend with the corresponding point styles. The points are plotted correctly on the graph, but the legend shows the same symbol for the binary response set. I am a bit confused as to why and hope it is something small. Here is my code:
# data should already be loaded in from the project on the school drive
library(survival)
attach(lace)
lace
# To control the type of symbol, we will use psymbol; it takes
# values 1 and 2
psymbol <- FAILURE + 1
table(psymbol)
plot(AGE, TOTAL.LACE, pch=(psymbol))
legend(0, 15, c("FAILURE = 1", "FAILURE = 0"), pch=(psymbol))
picture
Thank you,
psymbol is a vector of length n, where n is the number of data points in your data set. Your legend call is passing this entire vector to pch, where you really only need a vector of length 2. Hence legend uses the first two elements of psymbol for pch. Now, go look at psymbol[1:2]. I'll be very surprised if that doesn't return two 1s.
I'd suggest pch = unique(psymbol); it looks like psymbol should be a numeric vector, so that should work. To be explicit about matching the label order c("FAILURE = 1", "FAILURE = 0"), you could also hard-code pch = c(2, 1).
Note that you don't need parentheses around psymbol in your calls, and attach()ing an object is considered poor practice unless you quickly detach() immediately after. See ?with for an alternative approach.
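Putting it together, a sketch of the corrected calls (column names taken from the question):
# with() avoids attach(), as suggested above
with(lace, plot(AGE, TOTAL.LACE, pch = FAILURE + 1))  # FAILURE = 0 -> pch 1, FAILURE = 1 -> pch 2
legend(0, 15, legend = c("FAILURE = 1", "FAILURE = 0"), pch = c(2, 1))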

Clustering and heatmap in R

I am a newbie to R and I am trying to do some clustering on a data table where rows represent individual objects and columns represent the features that have been measured for these objects. I've worked through some clustering tutorials and I do get some output; however, the heatmap that I get after clustering does not correspond at all to the heatmap produced from the same data table with another programme. While the heatmap of that programme indicates clear differences in marker expression between the objects, my heatmap doesn't show much difference and I cannot recognize any clustering (i.e., colour) pattern on it; it just seems to be a randomly jumbled set of colours that are close to each other (no big contrast). Here is an example of the code I am using; maybe someone has an idea of what I might be doing wrong.
mydata <- read.table("mydata.csv")
datamat <- as.matrix(mydata)
datalog <- log(datamat)
I am using log values for the clustering because I know that the other programme does so, too.
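One caveat worth checking at this step: log() produces NaN for negative values and -Inf for zeros, and either will quietly break the correlation-based clustering below. A quick guard:
stopifnot(all(datamat > 0))  # log() needs strictly positive values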
library(gplots)
hr <- hclust(as.dist(1-cor(t(datalog), method="pearson")), method="complete")
mycl <- cutree(hr, k=7)
mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]
heatmap(datamat, Rowv=as.dendrogram(hr), Colv=NA,
col=colorpanel(40, "black","yellow","green"),
scale="column", RowSideColors=mycol)
Again, I plot the original colours but use the log-clusters because I know that this is what the other programme does.
I tried to play around with the methods, but I don't get anything that would at least somehow look like a clustered heatmap. When I take out the scaling, the heatmap becomes extremely dark (and I am actually quite sure that I have somehow to scale or normalize the data by column). I also tried to cluster with k-means, but again, this didn't help. My idea was that the colour scale might not be used completely because of two outliers, but although removing them slightly increased the range of colours plotted on the heatmap, this still did not reveal proper clusters.
Is there anything else I could play around with?
And is it possible to change the colour scale with heatmap so that outliers are found in the last bin that has a range of "everything greater than a particular value"? I tried to do this with heatmap.2 (argument "breaks"), but I didn't quite succeed and also I didn't manage to put the row side colours that I use with the heatmap function.
If you are okay with using heatmap.2 from the gplots package, it will allow you to supply breaks that assign colours to the ranges represented in your heatmap.
For example, if you had 3 colours (blue, white, and red) with the values going from low to high, you could do something like this:
my.breaks <- c(seq(-5, -0.6, length.out = 6), seq(-0.5999999, 0.1, length.out = 4), seq(0.100009, 5, length.out = 7))
result <- heatmap.2(mtscaled, Rowv = TRUE, scale = "none", dendrogram = "row", symm = TRUE, col = bluered(16), breaks = my.breaks)
In this case you have 3 sets of values that correspond to the 3 colors, the values will differ of course depending on what values you have with your data.
One thing you are doing in your program is to call hclust on your data and then call heatmap on it. However, if you look at the heatmap manual page, the hclustfun argument states:
Defaults to hclust.
So I don't think you need to do that yourself. You might want to take a look at some similar questions that I have asked, which might help to point you in the right direction:
Heatmap Question 1
Heatmap Question 2
If you post an image of the heatmap you get and an image of the heatmap that the other program is making it will be easier for us to help you out more.
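On the row side colours specifically: heatmap.2 accepts the same RowSideColors argument as heatmap, so the colour vector from your code carries over. A sketch reusing datalog, hr, and mycol from the question and my.breaks from above; the matrix is scaled by column up front because combining scale with explicit breaks is discouraged in heatmap.2:
datascaled <- scale(datalog)  # column-wise centring and scaling done beforehand
heatmap.2(datascaled, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row",
          col = bluered(16), breaks = my.breaks,
          scale = "none", trace = "none", RowSideColors = mycol)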
