Scale huge axis in R for plotting

I have a huge file I load into one vector
y = scan("my_file")
My x axis is also really huge, let's say in the range x = 1:5000000.
My question now is: how can I scale my plot so that I can actually see something? So far I am doing the following.
UPDATE:
plot(x, y, log="x", pch=".")
However, the logarithm alone is not enough. Can I scale x even further, for example by taking a square root, and if so, how? Sorry if this is a simple question, but I am really new to R.
I am not sure how to attach a file, but as I said, the file I am loading into the vector y contains 5 million values, each 0, 1, or 2:
y = c(1,0,1,....................)
with the x axis as described above.
The second thing I tried was:
zerotwo <- data.frame(x, y)
ggplot(aes(x, y, fill=as.factor(y)), data=zerotwo) + geom_tile() + scale_x_continuous(trans='log2')
But here, too, fill=as.factor() doesn't do its job.

Another possibility is to rely on color coding to encode the value and use the y-axis to put the data into different rows. 5 million tiles on the x-axis is a bit much for ggplot, but 50 rows of 100k each work OK if the plot size is large enough. Read from left to right, then top to bottom.
# create test data
zerototwo <- data.frame(position=1:5000000, value=sample(0:2, 5000000, replace=TRUE))
# for your data: zerototwo <- data.frame(position=1:length(y), value=y)
zerototwo$row <- floor((zerototwo$position -1)/100000)
zerototwo$rowpos <- (zerototwo$position - 1) %% 100000
ggplot(aes(x=rowpos, y=row, fill=as.factor(value)), data=zerototwo) +
  geom_tile(height=0.9) +
  scale_y_reverse()
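As for the square-root idea in the question: base plot() has no built-in sqrt axis, but you can plot against sqrt(x) and relabel the axis in the original units, or use ggplot2's scale_x_sqrt(). A minimal sketch (my addition, assuming x = 1:5000000 and y as in the question):
# plot against sqrt(x), then put the tick labels back in original units
plot(sqrt(x), y, pch=".", xaxt="n")
ticks <- seq(0, 5e6, by=1e6)
axis(1, at=sqrt(ticks), labels=ticks)
# ggplot2 equivalent: ggplot(zerotwo, aes(x, y)) + geom_point(shape=".") + scale_x_sqrt()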

Related

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them into bins where appropriate, while plotting the count of ">10" at the side of the distribution, but not in a separate plot. I have tried calling geom_histogram twice, once with the continuous data and once with the discrete data in a separate column, but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" (i.e. non-numeric) values. Critically, you want to maintain the "binning" of a histogram, which means you are not looking to simply switch to a discrete scale and draw the histogram as an ordinary barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset a bit more involved than the one you gave us: a uniform distribution (n=100) with 20 ">10" values thrown in.
set.seed(123)
df <- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram we need it to be numeric. We'll simply coerce it and accept that the ">10" values are coerced to NAs. This is fine, since in the end we just count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for the bar representing our NAs (">10") using the count() function, which here returns a dataframe with a single row and column: df_na$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # for plot_grid(), used below
df$time <- as.numeric(df$time) # force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
The Plot(s)
For the actual plot, you are asking for a combination of (1) a histogram and (2) a barplot. These are not the same plot, and more importantly they cannot share the same axis: by definition, the histogram needs a continuous axis, and NA or ">10" is not a numeric/continuous value. The solution is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
  geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
  geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
Some elements need to be removed from the NA barplot: the y axis entirely, and the title of the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
How do we know how to set the upper limit of the y scale? This is a bit more involved, since it depends on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is part of ggplot2.
So, the final code creates the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for the number of bins to p1 so that we can track how well the upper-limit setting works. Here's the code and some example plots where bin_num is set to 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
  geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
  geom_col(color='gray25', fill='red', alpha=0.3) +
  labs(x="") +
  theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
        axis.title.y=element_blank(), axis.ticks.y=element_blank()) +
  scale_x_discrete(expand=expansion(add=1))
# set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0, max_count), expand=expansion(mult=c(0, 0.15))) +
  annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0, max_count), expand=expansion(mult=c(0, 0.15)))
plot_grid(p1, p2, rel_widths=c(1, 0.2))
So, our upper-limit fixing works. You can get really crazy playing around with positioning, etc., and the plot_grid() function, but I think it works pretty well this way.
Perhaps this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12, 50, rep=TRUE))
df2 <- df1 %>% group_by(x) %>%
  dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2, df3)
label <- ifelse(df$x<11, as.character(df$x), ">10")
p <- ggplot(df, aes(x=x, y=y, color=x, fill=x)) +
  geom_bar(stat="identity", position="dodge") +
  scale_x_continuous(breaks=df$x, labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
library(tidyverse) # for gather() and the pipe
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a spaghetti plot when a lot more Samples are added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in Photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question answered elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this to any geom. I tried geom_col(), which worked, but the best solution was geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid overplotting problems like those described in several related posts.
ggplot(myData, aes(x, Sample)) +
  geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot.
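(myDataNorm is not constructed in the original post; a minimal sketch of one plausible construction, assuming each Sample's y is rescaled by its own maximum:)
library(dplyr)
myDataNorm <- myData %>%
  group_by(Sample) %>%
  mutate(y = y / max(y)) %>% # hypothetical normalization: y relative to per-Sample max
  ungroup()
In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution: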
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9, 1, 0)),
               color='blue3', size=14) +
  theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code gave me something that perfectly captured what I was looking for: I simply moved the ifelse() statement into the x aesthetic, so only segments with high enough y values are drawn. Note my data "starts" at x=290 here. There are probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(
    x=ifelse(y>0.9, x, 290), xend=ifelse(y>0.9, x-1, 290),
    y=Sample, yend=Sample), color='blue3', size=14) +
  xlim(290, 400) # needed to show entire scale
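As an aside on the ggridges idea from the question: no density conversion is actually needed, because geom_ridgeline() accepts raw heights via the height aesthetic. A minimal sketch (not part of the original answer; it assumes the long-format myData from above):
library(ggplot2)
library(ggridges)
ggplot(myData, aes(x = x, y = Sample, height = y, group = Sample)) +
  geom_ridgeline(fill = "blue3", alpha = 0.5)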

Speed up rendering of large heatmap from ggplot in R

I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_raster(aes(fill = z))
Although I save the plot as a vectorized image (a .pdf that is not that large), the pdf loads very slowly when opened. I expect that every individual point in the data frame is rendered when the file is opened.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices, however I would like to use ggplot to modify the image.
Question: How do I speed up the rendering of this plot? Is there a way to speed this process up (besides lowering the resolution of the plot) while keeping the image vectorized? Is it possible to downsample a vectorized ggplot?
The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100,100)) + # 10x downsample, in this case
  scale_x_continuous(breaks = 100*0:10) +
  labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine, and it had artifacts on the top and right edges, presumably because the bins at those edges are smaller ones, leaving more variation.
It got slower from there when I tried a larger grid like you are requesting.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. In this use case, however, the output is a 10,000-megapixel raster image, far beyond the limits of human perception, exported into a vector format where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful in certain unusual cases, like if you ever need to blow up your heatmap without loss of resolution onto a gigantic surface, like a football field. But it sounds like in this case it might be the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before ggplot. With that, I could take a 10k x 10k array and downsample it 100x before sending to ggplot. This necessarily reduces the resolution, but I don't understand the value in this use case of preserving resolution beyond human abilities to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Using a 10k x 10k array, 1527.1 Mb when initialized
# (recreating that larger test data here; this setup was not shown in the original)
df <- expand.grid(x = seq(1, 10000), y = seq(1, 10000))
set.seed(123)
df$z <- rnorm(nrow(df))
downsample <- 100
df2 <- df %>%
  group_by(x = downsample * round(x / downsample),
           y = downsample * round(y / downsample)) %>%
  summarise(z = mean(z))
ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  scale_x_continuous(breaks = 1000*0:10) +
  labs(title = "10,000x10,000 downsampled to 100x100")
Your reproducible example just shows noise, so it's hard to know what kind of output you would like.
One way would be to follow @dww's suggestion and use geom_hex to show aggregated data.
Another way, since you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to take a smaller sample than in your example, though, or I can't build the df.
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z), data = . %>% sample_frac(0.1))
If you want to start from your high resolution ggplot object you can do for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data, 0.1)
gg
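One more option, not from the original answers: if the goal is a small, fast-opening PDF that is still editable in Illustrator, you can rasterize only the heavy heatmap layer while keeping the axes, labels, and text as vectors, e.g. with the ggrastr package. A sketch (assumes ggrastr is installed):
library(ggrastr)
ggplot(data = df, aes(x = x, y = y)) +
  rasterise(geom_raster(aes(fill = z)), dpi = 150) # only this layer becomes a bitmap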

R plot, how to start y at non-zero?

I have a set of data which goes from 8e41 to ~1e44. I want the y-axis to start at 1e41; however, using the argument ylim=c(1e41,1e45) does not work for me.
Here is a minimal reproducible example:
x = c(1,2)
y = c(1e41,1e44)
plot(x,y,ylim=c(1e41,1e45))
The problem is that 1e41 is so much closer to 0 than to 1e45 that the two are virtually the same on a linear scale. Have you considered working on the log scale?
plot(x,y,ylim=c(1e41,1e45), log = 'y')
or even
plot(x, y, log = 'y')
Think of this another way: rescale your data by dividing your range by 1e41. c(8e41, 1e44)/1e41 gives 8 and 1000. Is there any significant difference between starting the scale at 0 (or 1) versus 8? If you chose to divide by 1e40 instead, you'd be looking at 80 and 10,000. Try the following code to see this:
m <- 1e41 # change this as desired
plot(x, y / m)
abline(h = c(0, 1e41 / m))
By changing m, the only thing that changes is the numbers on the y-axis; the relative positions do not change. Look at how close 0 and 8e41 are, and you'll see why it really doesn't matter whether the plot starts at 0 or at 1e41: as a fraction of the total height of the plot, the difference is 1/1000.
Changing the values at which the axis is labeled
Here's one more option for you - changing the values at which the plot is marked. That requires two steps - first, removing the axis labels when the plot is originally created, then adding in the ones you actually want:
plot(x, y, yaxt = 'n')
axis(2, c(1e41, seq(1e43, 1e44, 1e43)))
library(ggplot2)
x = c(1,2)
y = c(1e41,1e44)
data = data.frame(x,y)
ggplot(data, aes(x=x, y=log(y))) + geom_point() + ylim(90,150)
I think you should use the log of y, as it shows the same data.
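A related option (my addition, not in the original answer) is scale_y_log10(), which gives the same log-scale picture but keeps the axis labels in the original units:
ggplot(data, aes(x=x, y=y)) + geom_point() + scale_y_log10()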

Can I change where the x-axis intersects the y-axis in ggplot2?

I'm plotting some index data as a bar chart. I'd like to emphasise the "above index" and "below index"-ness of the numbers by forcing the x-axis to cross at 100 (such that a value of 80 would appear as a -20 bar).
This is part of a much longer process, so it's hard to share data usefully. Here, though, is some bodge-y code that illustrates the problem (and the beginnings of my solution):
library(ggplot2)
df <- data.frame(c("a","b","c"), c(118,80,65))
names(df) <- c("label","index")
my.plot <- ggplot(df, aes(label, index))
my.plot + geom_bar(stat="identity")
df$adjusted <- df$index - 100
my.plot2 <- ggplot(df, aes(label, adjusted))
my.plot2 + geom_bar(stat="identity")
I can, of course, change my index calculation to (value.new/value.old)*100 - 100 and then title the chart appropriately (something like "xxx relative to index"), but this seems clumsy.
So, too, does the approach I've been testing (running the simple calculation above, then re-labelling the y-axis). Is that really the best solution?
No doubt someone's going to tell me that this sort of axis manipulation is frowned upon. If this is the case, please could they point me in the direction of an explanation? At least then I'll have learned something.
This doesn't directly answer your question, but instead of messing about with the x-axis, why not make a single grid line a bit thicker? For example:
dd = data.frame(x = 1:10, y = runif(10))
g = ggplot(dd, aes(x, y)) + geom_point()
g + geom_hline(yintercept=0.2, colour="white", lwd=3)
Or as Paul suggested, with a black line and some text:
g + geom_hline(yintercept=0.2, colour="black", lwd=3) +
annotate("text", x = 2, y = 0.22, label = "Reference")
The coordinate system of your plot has the x-axis and the y-axis crossing at (0,0); this is just how the coordinate system is defined. You can of course draw a horizontal line at y = 100, but to call it the x-axis would be false.
What you have already proposed is to redefine your coordinate system by transforming the data. Whether or not this transformation is appropriate would be easier to answer with a reproducible example from your side.
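For what it's worth, the re-labelling approach proposed in the question can be written quite compactly in ggplot2: shift the data by 100, then relabel the y-axis with a function so it still reads in index units. A minimal sketch (my own, using the df from the question):
library(ggplot2)
df <- data.frame(label = c("a","b","c"), index = c(118,80,65))
df$adjusted <- df$index - 100
ggplot(df, aes(label, adjusted)) +
  geom_col() +
  scale_y_continuous(labels = function(b) b + 100) # axis shows original index values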
