R - Histogram Doesn't show density due to magnitude of the Data - r

I have a vector called data with a length of approximately 444000, and almost all of the numeric values are between 1 and 100. I want to draw the histogram and overlay the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried the breaks argument, and it helped, but it's really hard to see the histogram because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!

Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
When dealing with outliers like this, you can display the histogram of the logs, but that may be difficult to interpret:
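A minimal sketch of the log-scale histogram mentioned above, using the same kind of generated data. The seed, the choice of log10, and breaks = 50 are assumptions made for illustration, not part of the original answer.

```r
# Generated data: many small values plus a few very large outliers
set.seed(1)  # assumed seed, only for reproducibility
d <- c(rexp(1e4, 100), runif(100, max = 5e4))

# Histogram of the base-10 logs; the x axis is now in orders of magnitude
h <- hist(log10(d), breaks = 50,
          main = "Histogram of log10(d)", xlab = "log10(d)")
```

Each unit on the x axis corresponds to a factor of 10 in the original data, which is what can make this plot harder to read.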
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, then be up-front about the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)
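Since the question also asks for a density on top of the histogram, here is a hedged sketch that combines the quantile filter above with a kernel density estimate. The seed, breaks = 50, and the from = 0 lower bound are assumptions chosen for this generated data.

```r
set.seed(1)  # assumed seed, only for reproducibility
d <- c(rexp(1e4, 100), runif(100, max = 5e4))
d2 <- d[d < quantile(d, 0.90)]   # same 90% filter as above

# freq = FALSE puts the histogram on the density scale,
# so the kernel density estimate can be drawn over it
hist(d2, freq = FALSE, breaks = 50)
dens <- density(d2, from = 0)    # from = 0 because the data are non-negative
lines(dens, col = "red", lwd = 2)
```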

Related

Want to plot a function between [a,b] and on the same plot shade the area under [c,d] where c and d lie between a and b

I'm trying to plot a function using ggplot, which I can do. For example y = x. I plot between -1 and 4. Works great. On the same graph I now want to shade the area under the curve between 1 and 3. I cannot get it to work, nor can I find any documentation. Can someone help me?
Skeleton code that I'm trying:
eq<-function(x){(x)}
ggplot(data.frame(x=c(-1,4)),aes(x)) +
stat_function(fun=eq,geom="line",color="red") +
stat_function(fun=eq,geom="area",fill="blue")
I tried all different permutations. If there was a way to limit the second stat_function to a different domain it might work. Any ideas?
It appears that stat_function is working over the full range of the data, rather than with the particular values supplied. One option is to generate the data you want to plot first, then pass it to a specific geom_*. However, if you really want to stay with stat_function, you may need a bit more of a workaround. One approach would be to create two functions, one of which limits the outputs to the range you want.
eq<-function(x){(x)}
eqB<-function(x){ifelse(x < 3 & x > 0, x, NA)}
ggplot(data.frame(x=c(-1,4)),aes(x)) +
stat_function(fun=eq,geom="line",color="red") +
stat_function(fun=eqB,geom="area",fill="blue")
A more robust solution is to create a single function that accepts range limits, then pass those in using args:
eqC <- function(x,upr = max(x), lwr = min(x)){
ifelse(x <= upr & x >= lwr, x, NA)
}
ggplot(data.frame(x=c(-1,4)),aes(x)) +
stat_function(fun=eqC,geom="line",color="red") +
stat_function(fun=eqC,geom="area",fill="blue"
, args = list(upr = 3, lwr = 0))
Both of these generate a plot that looks like this:
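The first option mentioned above (generate the data first, then pass it to a specific geom) could look roughly like this. The data frame names curve_df and shade_df are made up for this sketch, and the shading limits 1 and 3 are taken from the question rather than from the code above.

```r
library(ggplot2)

eq <- function(x) x

# Evaluate the function on a grid over the full plotting range
curve_df <- data.frame(x = seq(-1, 4, length.out = 201))
curve_df$y <- eq(curve_df$x)

# Keep only the part of the curve that should be shaded
shade_df <- subset(curve_df, x >= 1 & x <= 3)

p <- ggplot(curve_df, aes(x, y)) +
  geom_line(color = "red") +
  geom_area(data = shade_df, fill = "blue", alpha = 0.5)
print(p)
```

Because the shaded region is ordinary data here, changing its limits only means changing the subset condition.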

R: Increase space between multiple boxplots to avoid omitted x axis labels

Let's say I generate 5 sets of random data and want to visualize them using boxplots and save those to a file "boxplots.png". Using the code
png("boxplots.png")
data <- matrix(rnorm(25),5,5)
boxplot(data, names = c("Name1","Name2","Name3","Name4","Name5"))
dev.off()
there are 5 boxplots created as desired in "boxplots.png", however the names for the second ("Name2") and the fourth ("Name4") boxplot are omitted. Even changing the window of my png-view makes no difference. How can I avoid this behavior?
Thank you!
Your offered code does not produce an overlap in my setting, but that point is relatively moot: you want a way to allow more space between words.
One (brute-force-ish) way to fix the symptom is to alternate putting them on separate lines:
set.seed(42)
data <- matrix(rnorm(25),5,5)
nms <- c("Name1","Name2","Name3","Name4","Name5")
oddnums <- which(seq_along(nms) %% 2 == 1)
evennums <- which(seq_along(nms) %% 2 == 0)
(There's got to be a better way to do that, but it works.)
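One possibly tidier way to build the two index sets is seq() with a step of 2. This is only a sketch and assumes there are at least two names, since seq(2, 1, by = 2) would fail.

```r
nms <- c("Name1", "Name2", "Name3", "Name4", "Name5")

# Positions 1, 3, 5, ... and 2, 4, ... (assumes length(nms) >= 2)
oddnums  <- seq(1, length(nms), by = 2)
evennums <- seq(2, length(nms), by = 2)
```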
From here:
png("boxplot.png", height = 240)
boxplot(data, names = FALSE)
mtext(nms[oddnums], side = 1, line = 2, at = oddnums)
mtext(nms[evennums], side = 1, line = 1, at = evennums)
dev.off()
(The use of png is not important here, I just used it because of your edit.)

avoiding over-crowding of labels in r graphs

I am trying to avoid overcrowding of the labels in the following plot:
set.seed(123)
position <- c(rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5),
rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5))
group <- c(rep (1, length (position)/2),rep (2, length (position)/2) )
mylab <- paste ("MR", 1:length (group), sep = "")
barheight <- 0.5
y.start <- c(group-barheight/2)
y.end <- c(group+barheight/2)
mydf <- data.frame (position, group, barheight, y.start, y.end, mylab)
plot(0,type="n",ylim=c(0,3),xlim=c(0,10),axes=F,ylab="",xlab="")
#Create two horizontal lines
require(fields)
yline(1,lwd=4)
yline(2,lwd=4)
#Create text for the lines
text(10,1.1,"Group 1",cex=0.7)
text(10,2.1,"Group 2",cex=0.7)
#Draw vertical bars
lng = length(position)/2
lg1 = lng+1
lg2 = lng*2
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, mydf$mylab[1:lng], srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
You can see that some areas are crowded with labels, where the x values are the same or similar. I want to display only one label when there are multiple labels at the same point. For example,
mydf$position[1:5] are all 0,
but corresponding labels mydf$mylab[1:5] -
MR1 MR2 MR3 MR4 MR5
I just want to display the first one "MR1".
Similarly, when the following points are too close (say, within a difference of 0.35), they should be considered a single cluster and only the first label should be displayed. In this way I would be able to get rid of the overcrowding of labels. How can I achieve this?
If you space the labels out and add some extra lines you can label every marker.
clpl <- function(xdata, names, y=1, dy=0.25, add=FALSE){
o = order(xdata)
xdata=xdata[o]
names=names[o]
if(!add)plot(0,type="n",ylim=c(y-1,y+2),xlim=range(xdata),axes=F,ylab="",xlab="")
abline(h=y,lwd=4)
segments(xdata,y-dy,xdata,y+dy)
tpos = seq(min(xdata),max(xdata),len=length(xdata))
text(tpos,y+2*dy,names,srt=90,adj=0)
segments(xdata,y+dy,tpos,y+2*dy)
}
Then using your data:
clpl(mydf$position[lg1:lg2],mydf$mylab[lg1:lg2])
gives:
You could then think about labelling clusters underneath the main line.
I've not given much thought to doing multiple lines in a plot, but I think with a bit of mucking with my code and the add parameter it should be possible. You could also use colour to show clusters. I'm fairly sure these techniques are present in some of the clustering packages for R...
Obviously with a lot of markers even this is going to get smushed, but with a lot of clusters the same thing is going to happen. Maybe you end up labelling clusters with this technique?
In general, I agree with @Joran that cluster labelling can't be automated but you've said that labelling a group of lines with the first label in the cluster would be OK, so it is possible to automate some of the process.
Putting the following code after the line lg2 = lng*2 gives the result shown in the image below:
clust <- cutree(hclust(dist(mydf$position[1:lng])),h=0.75)
u <- rep(T,length(unique(clust)))
clust.labels <- sapply(c(1:lng),function (i)
{
if (u[clust[i]])
{
u[clust[i]] <<- F
as.character(mydf$mylab)[i]
}
else
{
""
}
})
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, clust.labels, srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
(I've only labelled the clusters on the lower line -- the same principle could be applied to the upper line too). The parameter h of cutree() might have to be adjusted case-by-case to give the resolution of labels that you want, but this approach is at least easier than labelling every cluster by hand.
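The "first label per cluster" step can also be written without the global assignment, using duplicated() on the cluster ids. This is a sketch on a shortened version of the question's data; the variable names labels and first_in_cluster are made up here, and h = 0.75 is simply carried over from above.

```r
set.seed(123)
position <- c(rep(0, 5), rnorm(5, 1, 0.1), rnorm(10, 3, 0.1))
labels <- paste0("MR", seq_along(position))

# Hierarchical clustering on the 1-D positions, cut at height 0.75
clust <- cutree(hclust(dist(position)), h = 0.75)

# TRUE for the first member of each cluster, FALSE for the rest
first_in_cluster <- !duplicated(clust)

# Keep the label for the first member, blank out the others
clust.labels <- ifelse(first_in_cluster, labels, "")
```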

Long vector-plot/Coverage plot in R

I really need your R skills here. I've been working on this plot for several days now. I'm an R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
Looks alright, but I'm missing y-labels so I can tell which chromosome it is, and I've also been having trouble modifying the x-axis so that it ends where the coverage ends. Additionally, my own data is much, much bigger, which makes this plot in particular take an extremely long time. That is why I tried plotLongVector from HilbertVis. It works, but I can't figure out how to modify it: the x-axis, the labels, how to make the y-axis logged. The vectors also all get the same length on the plot even though they are not equally long.
source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
I would like these gene vectors plotted as red dots under the chromosomes! Basically, if the vector has a value there (>0), it is presented as a dot (or line) under the long vector plot. I have no idea how to add this, but it seems fairly straightforward.
Please help me! Thank you so much.
DISCLAIMER: Please do not simply copy and paste this code and run it on every position of your chromosome. Please sample positions (for example, as @Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many, many hours, if your computer survives the drain.
Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 100 positions each (300 rows in total), just to show an example. You can build on this, I hope.
# construct a test data with 3 chromosomes and 100 positions
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
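The sampling advice in the disclaimer might be sketched as follows in base R. The vector cov_full, its length of 1e6, and the sample size of 5000 are all made-up stand-ins for real coverage data.

```r
set.seed(45)
# Stand-in for a long per-position coverage vector (hypothetical data)
cov_full <- sample(0:500, 1e6, replace = TRUE)

# Draw 5000 positions at random and keep them in genomic order
idx <- sort(sample(length(cov_full), 5000))
df_small <- data.frame(pos = idx, cov = cov_full[idx])

# Plot only the sampled positions instead of every position
plot(df_small$pos, df_small$cov, type = "l",
     xlab = "position", ylab = "coverage")
```

The reduced data frame can be passed to the ggplot call above in the same way as df.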
You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same number of data points as the original and would only make the data smoother. This code, however, takes one average for every block of n points. It will of course run quite a bit faster, but you will lose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector,n){
chops <- as.integer(length(vector)/n)
count <- 0
pos <- numeric(chops) # preallocate the block midpoints
Cov <- numeric(chops) # preallocate the block means
for(c in 1:chops){
idx <- (count + 1):(count + n) # indices of the current non-overlapping block of n points
pos[c] <- median(idx)
Cov[c] <- mean(vector[idx])
count <- count + n
}
result <- data.frame(pos=pos,cov=Cov)
return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")
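For very long vectors, the same per-block averaging can be done without an explicit loop by reshaping the vector into a matrix with n rows and taking column means. This is a sketch; blockMeans is a hypothetical helper name, and any leftover points that do not fill a whole block are dropped.

```r
# Average every block of n consecutive points (hypothetical helper)
blockMeans <- function(v, n) {
  chops <- length(v) %/% n                        # number of complete blocks
  m <- matrix(v[seq_len(chops * n)], nrow = n)    # one block per column
  data.frame(
    pos = (seq_len(chops) - 0.5) * n + 0.5,       # midpoint index of each block
    cov = colMeans(m)                             # mean of each block
  )
}

res <- blockMeans(1:10, 5)   # blocks 1:5 and 6:10
```

The result has the same pos and cov columns as above, so it can be plotted with the same qplot call.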

Tufte tables: convert quartile plots into standard error plots hacking qTable function from NMOF package

You may remember the nice version of a table conceived by Tufte that includes small quartile plots running next to the corresponding data rows:
There is an implementation of this in R, in the function qTable from the package NMOF, which creates the table shown above and outputs it as LaTeX code:
require(NMOF)
x <- rnorm(100, mean = 0, sd = 2)
y <- rnorm(100, mean = 1, sd = 2)
z <- rnorm(100, mean = 1, sd = 0.5)
X <- cbind(x, y, z)
qTable(X,filename="res.tex")#this will save qTable in tex format in current dir
This method of visualization seems especially useful if you have a small amount of information to present and don't want to waste space on a full graph. But I would like to hack qTable a bit. Instead of displaying quartile plots, I would prefer to display standard errors of the mean. I am not great at hacking such functions, so I used brute force. I replaced the line of the qTable function that computes the quantiles:
A <- apply(X, 2L, quantile, c(0.25, 0.5, 0.75))
with something very brutal, that computes standard errors:
require(psych)#provides the 'SD' function to compute column standard deviations
M = colMeans(X)#column means
SE = SD(X)/sqrt(nrow(X))#standard error
SELo = M-SE#lower bound
SEHi = M+SE#upper bound
A = t(data.frame(SELo,M,SEHi))#combines it together
I know, I know, it's possibly an unsustainable approach, but it actually works to some extent: it plots standard errors but keeps this gap in the plot:
and I would like this gap to disappear.
Here is a qTable function with modification discussed above.
To remove the gaps, you can insert these two lines:
B[2, ] <- B[3, ]
B[4, ] <- B[3, ]
right before the for loop that starts with
for (cc in 1L:dim(X)[2L]) {...
Why does it work? Reading the graph from left to right, the five rows of B correspond to where
1) the left segments start
2) the left segments end
3) the dots are
4) the right segments start
5) the right segments end
so by forcing B[2, ] and B[4, ] to B[3, ] you are effectively getting rid of the gaps.
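As a side note, the standard-error rows computed in the question can also be built with base R alone, without psych. This is only a sketch of that replacement for the quantile line; the seed is an assumption, and A keeps the same three-row layout (lower bound, mean, upper bound).

```r
set.seed(1)  # assumed seed, only for reproducibility
X <- cbind(x = rnorm(100, mean = 0, sd = 2),
           y = rnorm(100, mean = 1, sd = 2),
           z = rnorm(100, mean = 1, sd = 0.5))

M  <- colMeans(X)                        # column means
SE <- apply(X, 2, sd) / sqrt(nrow(X))    # standard error of each column mean

# Rows: lower bound, mean, upper bound (one column per variable)
A <- rbind(SELo = M - SE, M = M, SEHi = M + SE)
```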

Resources