Axis breaks in ggplot histogram in R [duplicate] - r

I have data that is mostly centered in a small range (1-10) but there is a significant number of points (say, 10%) which are in (10-1000). I would like to plot a histogram for this data that will focus on (1-10) but will also show the (10-1000) data. Something like a log-scale for the histogram.
Yes, I know this means not all bins are of equal size.
A simple hist(x) gives
while hist(x,breaks=c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,3,4,5,7.5,10,15,20,50,100,200,500,1000,10000)) gives
none of which is what I want.
update
following the answers here I now produce something that is almost exactly what I want (I went with a continuous plot instead of bar-histogram):
breaks <- c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,4,8)
ggplot(t,aes(x)) + geom_histogram(colour="darkblue", size=1, fill="blue") + scale_x_log10('true size/predicted size', breaks = breaks, labels = breaks)
The only problem is that I'd like the scale to match the actual bars plotted. There are two options for doing that: one is simply to use the actual margins of the plotted bars (how?) and accept "ugly" x-axis labels like 1.1754, 1.2985, etc. The other, which I prefer, is to control the actual bin margins used so that they match the breaks.

Log scale histograms are easier with ggplot than with base graphics. Try something like
library(ggplot2)
dfr <- data.frame(x = rlnorm(100, sdlog = 3))
ggplot(dfr, aes(x)) + geom_histogram() + scale_x_log10()
If you are desperate for base graphics, you need to plot a log-scale histogram without axes, then manually add the axes afterwards.
h <- hist(log10(dfr$x), axes = FALSE)
Axis(side = 2)
Axis(at = h$breaks, labels = 10^h$breaks, side = 1)
For completeness, the lattice solution would be
library(lattice)
histogram(~x, dfr, scales = list(x = list(log = TRUE)))
AN EXPLANATION OF WHY LOG VALUES ARE NEEDED IN THE BASE CASE:
If you plot the data with no log-transformation, then most of the data are clumped into bars at the left.
hist(dfr$x)
The hist function ignores the log argument (because it interferes with the calculation of breaks), so this doesn't work.
hist(dfr$x, log = "y")
Neither does this.
par(xlog = TRUE)
hist(dfr$x)
That means that we need to log transform the data before we draw the plot.
hist(log10(dfr$x))
Unfortunately, this messes up the axes, which brings us to the workaround above.
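As a minimal sketch of the same base-graphics workaround with bin edges you choose yourself: bin the log-transformed values at log10(edges), then label the axis with the untransformed edges so that the ticks and the bar margins coincide. The sample data and the edges vector are made up for illustration and are not taken from the question.

```r
set.seed(1)
# Hypothetical long-tailed sample: 90% in (1, 10), 10% in (10, 1000)
x <- c(runif(900, 1, 10), runif(100, 10, 1000))

# Bin edges on the original scale (assumed values; they must span range(x))
edges <- c(1, 1.5, 2, 3, 5, 7.5, 10, 20, 50, 100, 200, 500, 1000)

# Bin on the log10 scale, using the chosen edges as breaks
h <- hist(log10(x), breaks = log10(edges), plot = FALSE)

# The bins are unequal in width, so draw densities rather than counts
plot(h, freq = FALSE, axes = FALSE,
     xlab = "x (log scale)", main = "Histogram with chosen bin edges")
axis(side = 2)
# Ticks sit exactly on the bar margins and show the untransformed values
axis(side = 1, at = log10(edges), labels = edges)
```

Drawing densities (freq = FALSE) avoids the warning that plot() raises when counts are drawn for bins of unequal width.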

Using ggplot2 seems like the easiest option. If you want more control over your axes and your breaks, you can do something like the following:
EDIT : new code provided
x <- c(rexp(1000,0.5)+0.5,rexp(100,0.5)*100)
breaks<- c(0,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000,10000)
major <- c(0.1,1,10,100,1000,10000)
H <- hist(log10(x),plot=F)
plot(H$mids,H$counts,type="n",
xaxt="n",
xlab="X",ylab="Counts",
main="Histogram of X",
bg="lightgrey"
)
abline(v=log10(breaks),col="lightgrey",lty=2)
abline(v=log10(major),col="lightgrey")
abline(h=pretty(H$counts),col="lightgrey")
plot(H,add=T,freq=T,col="blue")
#Position of ticks
at <- log10(breaks)
#Creation X axis
axis(1,at=at,labels=10^at)
This is as close as I can get to the ggplot2 look. Making the background grey is not that straightforward, but it is doable if you draw a rectangle the size of the plot region and fill it with grey.
Check all the functions I used, and also ?par. It will allow you to build your own graphs. Hope this helps.
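The grey-background idea mentioned above could be sketched roughly as follows. It assumes that par("usr") gives the extent of the plot region and that the rectangle is drawn before the bars; the seed, the white grid lines and the tick positions are illustrative choices.

```r
set.seed(42)
x <- c(rexp(1000, 0.5) + 0.5, rexp(100, 0.5) * 100)
H <- hist(log10(x), plot = FALSE)

# Empty plot that fixes the coordinate system
plot(H$mids, H$counts, type = "n", xaxt = "n",
     xlim = range(H$breaks), ylim = c(0, max(H$counts)),
     xlab = "X", ylab = "Counts", main = "Histogram of X")

# par("usr") holds the plot region as c(x1, x2, y1, y2)
usr <- par("usr")
rect(usr[1], usr[3], usr[2], usr[4], col = "lightgrey", border = NA)

# Grid lines and bars go on top of the grey rectangle
abline(h = pretty(H$counts), col = "white")
plot(H, add = TRUE, freq = TRUE, col = "blue")

# Label whole powers of ten on the original scale
ticks <- seq(ceiling(min(H$breaks)), floor(max(H$breaks)))
axis(1, at = ticks, labels = 10^ticks)
box()
```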

A dynamic graph would also help with this plot. Use the manipulate package from RStudio to make a histogram with a dynamically selected range:
library(manipulate)
data_dist <- table(data)
manipulate(barplot(data_dist[x:y]), x = slider(1,length(data_dist)), y = slider(10, length(data_dist)))
Then you will be able to use sliders to see the particular distribution in a dynamically selected range.

Related

Is it possible to over-ride the x axis range in R package ggbio when using autoplot and ensdb transcripts?

I am trying to use ggbio to plot gene transcripts. I want to plot a very specific range so it matches my ggplot2 plots. The problem is my example plot ends up having range of 133,567,500-133,570,000 regardless of the GRange and whether I specify xlim or not.
This example should only plot a small bit of intron (the thin arrowed line) but instead plots the full 2 exons and intron in between. I believe autoplot wants to plot the entire transcript or transcripts present in the range and widens the range to accommodate for that.
library(EnsDb.Hsapiens.v86)
library(ggbio)
ensdb <- EnsDb.Hsapiens.v86
mut<-GRanges("10", IRanges(133568909, 133569095))
gene <- autoplot(ensdb, which=mut, names.expr="gene_name",xlim=c(133568909,133569095))
gene.gg <- gene@ggplot
png("test_gene_plot_5.png")
gene.gg
dev.off()
Is there any way to over-ride this? I've looked at the manual page for autoplot and I couldn't narrow down an option that would fix it. Others have said to use xlim, but that does not seem to change anything.
I like ggbio because it can make a ggplot2 object to be plotted along with other ggplot2 objects. I have not seen an example for that with other approaches like Gvis. But I would entertain other approaches if they could be combined with my existing plots.
Thanks!
Amy
It kind of depends on whether you want clipped or squished data. Usually autoplot outputs a ggplot object at some point that can be manipulated as such.
For squished data:
library(GenomicRanges) # just to be sure start and end work
gene@ggplot +
scale_x_continuous(limits = c(start(mut), end(mut)), oob = scales::squish)
For clipped data:
gene@ggplot +
coord_cartesian(xlim = c(start(mut), end(mut)))
But to be totally honest, I'm unsure whether this is the most informative way to communicate that you are plotting the internals of an intron.
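To make the squished option above more concrete, here is a small sketch of what the out-of-bounds handlers from the scales package do to values outside the limits. The positions in pos are hypothetical, and window stands in for c(start(mut), end(mut)).

```r
library(scales)

# Hypothetical feature coordinates; window mimics start(mut) and end(mut)
pos    <- c(133567500, 133568950, 133570000)
window <- c(133568909, 133569095)

# oob = scales::squish: values outside the window are moved onto the nearest limit
squish(pos, range = window)

# The default handler, scales::censor, turns values outside the window into NA
censor(pos, range = window)
```

coord_cartesian(), by contrast, leaves the data untouched and only changes the visible region.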
Alternatively, I've written a gene model geom at some point that doesn't work through the autoplot methods (which can sometimes be a pain if you want to customise everything). Downside is that you'd have to do some manual gene searching and setting aesthetics. Upside is that it works like most other geoms and is therefore easy to combine with some other data.
library(ggnomics) # from: https://github.com/teunbrand/ggnomics
# Finding a gene's exons manually
my_gene <- transcriptsByOverlaps(EnsDb.Hsapiens.v86, mut)
my_gene <- exonsByOverlaps(EnsDb.Hsapiens.v86, my_gene)
my_gene <- as.data.frame(my_gene)
some_other_data <- data.frame(
x = seq(start(mut), end(mut), by = 10),
y = cumsum(rnorm(19))
)
ggplot(some_other_data) +
geom_line(aes(x, y)) +
geom_genemodel(data = my_gene,
aes(xmin = start, xmax = end,
y = max(some_other_data$y) + 1,
group = 1, strand = strand)) +
coord_cartesian(xlim = c(start(mut), end(mut)))
Hope that helped!

Plot continuous data with discrete colors

I found some similar questions but the answers didn't solve my problem.
I am trying to plot a time series of two variables as a scatterplot, using the date to color the points. In this example, I created a simple dataset (see below) and I want to plot all data with timesteps in the 1960s, 70s, 80s and 90s with one colour per decade.
Using the standard plot command (plot(x,y,...)) it works the way it should, but when I try the ggplot library something strange happens; I guess I am missing something. Does anyone have an idea how to solve this and generate a correct plot?
Here is my code using the standard plot command with a colorbar
# generate data frame with test data
x <- seq(1,40)
y <- seq(1,40)
year <- c(rep(seq(1960,1969),2),seq(1970,1989,2),seq(1990,1999))
df <- data.frame(x,y,year)
# define interval and assign color to interval
myinterval <- seq(1959,1999,10)
mycolors <- rainbow(4)
colbreaks <- findInterval(df$year, vec = myinterval, left.open = T)
# basic plot
layout(array(1:2,c(1,2)),widths =c(5,1)) # divide the device area in two panels
par(oma=c(0,0,0,0), mar=c(3,3,3,3))
plot(x,y,pch=20,col = mycolors[colbreaks])
# add colorbar
ncols <- length(myinterval)-1
colbarlabs <- seq(1960,2000,10)
par(mar=c(5,0,5,5))
image(t(array(1:ncols, c(ncols,1))), col=mycolors, axes=F)
box()
axis(4, at=seq(0.5/(ncols-1)-1/(ncols-1),1+1/(ncols-1),1/(ncols-1)), labels=colbarlabs, cex.axis=1, las=1)
abline(h=seq(0.5/(ncols-1),1,1/(ncols-1)))
mtext("year",side=3,line=0.5,cex=1)
As I use the ggplot package for my other plots, I tried this version with ggplot
# plot with ggplot
require(ggplot2)
ggplot(df, aes(x=x,y=y,color=year)) + geom_point() +
scale_colour_gradientn(colours= mycolors[colbreaks])
but it didn't work the way I thought it would. Obviously, there is something wrong with the color coding. Also, the colorbar looks strange. I also tried scale_color_manual and scale_color_gradient2, but I got more errors (Error in continuous_scale).
Any idea how to solve this and generate a plot that matches the standard plot, including a colorbar?
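One possible way to get one colour per decade in ggplot2 is sketched below; it is not an answer from the original thread. It converts year to a factor with cut(), using the same interval boundaries, and maps that factor to a manual discrete scale. The column name decade and its labels are illustrative choices.

```r
library(ggplot2)

# Same test data as in the question
dat <- data.frame(
  x = 1:40,
  y = 1:40,
  year = c(rep(1960:1969, 2), seq(1970, 1989, 2), 1990:1999)
)

myinterval <- seq(1959, 1999, 10)
mycolors <- rainbow(4)

# cut() uses right-closed intervals by default, which matches
# findInterval(..., left.open = TRUE) in the base-graphics version
dat$decade <- cut(dat$year, breaks = myinterval,
                  labels = c("1960s", "1970s", "1980s", "1990s"))

# A discrete colour scale gives one colour and one legend key per decade
p <- ggplot(dat, aes(x = x, y = y, colour = decade)) +
  geom_point() +
  scale_colour_manual(values = mycolors, name = "year")
print(p)
```

Because decade is a factor, ggplot2 draws a discrete legend rather than a continuous colourbar.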

"zoom"/"scale" with coord_polar()

I've got a polar plot which uses geom_smooth(). The smoothed loess line though is very small and rings around the center of the plot. I'd like to "zoom in" so you can see it better.
Using something like scale_y_continuous(limits = c(-.05,.7)) will make the geom_smooth ring bigger, but it will also alter it because it will recompute with the datapoints limited by the limits = c(-.05,.7) argument.
For a Cartesian plot I could use something like coord_cartesian(ylim = c(-.05,.7)) which would clip the chart but not the underlying data. However I can see no way to do this with coord_polar()
Any ideas? I thought there might be a way to do this with grid.clip() in the grid package but I am not having much luck.
Any ideas?
What my plot looks like now, note "higher" red line:
What I'd like to draw:
What I get when I use scale_y_continuous() note "higher" blue line, also it's still not that big.
I haven't figured out a way to do this directly in coord_polar, but this can be achieved by modifying the ggplot_build object under the hood.
First, here's an attempt to make a plot like yours, using the fake data provided at the bottom of this answer.
library(ggplot2)
plot <- ggplot(data, aes(theta, values, color = series, group = series)) +
geom_smooth() +
scale_x_continuous(breaks = 30*-6:6, limits = c(-180,180)) +
coord_polar(start = pi, clip = "on") # use "off" to extend plot beyond axes
plot
Here, my Y (or r for radius) axis ranges from about -2.4 to 4.3.
We can confirm this by looking at the associated ggplot_build object:
# Create ggplot_build object and look at radius range
plot_build <- ggplot_build(plot)
plot_build[["layout"]][["panel_params"]][[1]][["r.range"]]
# [1] -2.385000 4.337039
If we redefine the range of r and plot that, we get what you're looking for, a close-up of the plot.
# Here we change the 2nd element (max) of r.range from 4.337 to 1
plot_build[["layout"]][["panel_params"]][[1]][["r.range"]][2] <- 1
plot2 <- ggplot_gtable(plot_build)
plot(plot2)
Note, this may not be a perfect solution, since this seems to introduce some image cropping issues that I don't know how to address. I haven't tested to see if those can be overcome using ggsave or perhaps by further modifying the ggplot_build object.
Sample data used above:
set.seed(4.2)
data <- data.frame(
series = as.factor(rep(c(1:2), each = 10)),
theta = rep(seq(from = -170, to = 170, length.out = 10), times = 2),
values = rnorm(20, mean = 0, sd = 1)
)

Formatting and manipulating a plot from the R package "hexbin"

I generate a plot using the package hexbin:
# install.packages("hexbin", dependencies=T)
library(hexbin)
set.seed(1234)
x <- rnorm(1e6)
y <- rnorm(1e6)
hbin <- hexbin(
x = x
, y = y
, xbin = 50
, xlab = expression(alpha)
, ylab = expression(beta)
)
## Using plot method for hexbin objects:
plot(hbin, style = "nested.lattice")
abline(h=0)
This seems to generate an S4 object (hbin), which I then plot using plot.
Now I'd like to add a horizontal line to that plot using abline, but unfortunately this gives the error:
plot.new has not yet been called
I also have no idea how to manipulate, e.g., the position of the axis labels (alpha and beta sit among the tick numbers), change the position of the legend, etc.
I'm familiar with OOP, but so far I could not find out how plot() handles the object (does it call certain methods of the object?) and how I can manipulate the resulting plot.
Why can't I simply draw a line onto the plot?
How can I manipulate axis labels?
Use the lattice version of hexbin, hexbinplot(). With panel you can add your line, and with style you can choose different ways of visualizing the hexagons. Check the help for hexbinplot for more.
library(hexbin)
library(lattice)
x <- rnorm(1e6)
y <- rnorm(1e6)
hexbinplot(x ~ y, aspect = 1, bins=50,
xlab = expression(alpha), ylab = expression(beta),
style = "nested.centroids",
panel = function(...) {
panel.hexbinplot(...)
panel.abline(h=0)
})
hexbin uses grid graphics, not base. There is a similar function, grid.abline, which can draw lines on plots by specifying a slope and intercept, but the co-ordinate system used is confusing:
grid.abline(325,0)
gets approximately what you want, but the intercept here was found by eye.
You will have more luck using ggplot2:
library(ggplot2)
dat <- data.frame(alpha = x, beta = y) # x and y as generated in the question
ggplot(dat,aes(x=alpha,y=beta)) + geom_hex(bins=10) + geom_hline(yintercept=0.5)
I had a lot of trouble finding a lot of basic plot adjustments (axis ranges, labels, etc.) with the hexbin library but I figured out how to export the points into any other plotting function:
hxb<-hexbin(x=c(-15,-15,75,75),
y=c(-15,-15,75,75),
xbins=12)
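hxb@xcm #gives the x centre of mass of the points in each hex tile
hxb@ycm #gives the y centre of mass of the points in each hex tile
hxb@count #gives the number of points in each hex tile
plot(x=hxb@xcm, y=hxb@ycm, type="n") #open an empty plot so points() has something to draw on
points(x=hxb@xcm, y=hxb@ycm, pch=hxb@count)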
You can just feed these three vectors into any plotting tool you normally use. There is the usual tweaking of size scaling, etc., but it's far better than the stubborn hexplot function. The problem I found with the ggplot2 stat_binhex is that I couldn't get the hexes to be different sizes, just different colors.
If you really want hexagons, plotrix has a hexagon drawing function that I think is fine.
