Plotting stacked histogram with log scale in ggplot2 - r

Note: I found a similar question with an answer explaining the problem. However, I'm looking for a solution, as opposed to a reason why it's difficult (which I fully understand).
I have data for which I want to create a histogram. The data has a count of 10000 for the bin [0, 200) and a count of 1 for several bins such as [30000, 30200). Both kinds of bins are important and need to be visible. For this, I can plot a histogram with the log1p scale.
contig_len <- read.table(data_file, header = FALSE, sep = ",", col.names=c("Length"))
ggplot(contig_len, aes(x = Length)) + geom_histogram(binwidth=200) +
scale_y_continuous(trans="log1p")
This works perfectly! But now, I want to categorise the items in the histogram, as follows:
ggplot(contig_len, aes(x = Length, fill = Prevalence)) +
geom_histogram(binwidth=200, alpha=0.5, position="stack") +
scale_y_continuous(trans = "log1p")
This doesn't work, however, as the stacking is performed without taking the log scale into account. Has anyone found a way around this problem? My data looks like this:
head(contig_len)
Length Prevalence
1 606 Repetitive (<5)
2 888 Non-Repetitive
3 192 Repetitive (<9)
4 9830 Non-Repetitive
5 506 Non-Repetitive
6 850 Non-Repetitive
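A small base-R sketch of why the stacked version misleads. The counts are made up for illustration (they are not the asker's data): when the transform is applied before stacking, the bar height becomes the sum of the log1p values of the parts rather than log1p of the bin total.

```r
# Hypothetical counts for one bin, split over two categories (not the asker's data)
counts <- c(non_repetitive = 9000, repetitive = 1000)

# What the bar should show: the transformed total of the bin
height_wanted <- log1p(sum(counts))

# What stacking after the transform produces: the sum of the transformed parts
height_stacked <- sum(log1p(counts))

print(c(wanted = height_wanted, stacked = height_stacked))
```

If the aim is to stack first and transform afterwards, one option that may be worth trying is to transform the coordinate system rather than the scale, e.g. replacing scale_y_continuous(trans = "log1p") with coord_trans(y = "log1p"). This is a suggestion rather than a tested fix for this data, and the relative heights of the segments within a bar would still not be directly comparable.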

Related

Clustered Bar Plot Using ggplot2

Basically, I want to display a bar plot grouped by method, i.e. for each method I want to show the number of people who took the test and the number of positive results found. I also want to display the numbers and percentages as labels on the bars. I am trying to do this with ggplot2, but I fail every time.
Any help is appreciated.
Thanks in advance.
I'm not sure I have fully understood your question, but I suggest you take a look at geom_text.
library(ggplot2)
ggplot(df, aes(x = methods, y = percentage)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(percentage,2), " (",positive," / ", people,")")), vjust = -0.3, size = 3.5)+
scale_x_discrete(limits = c("NS1", "NS1+IgM", "NS1+IgG","Tourniquet")) +
ylim(0,100)
Data:
df = data.frame(methods = c("NS1", "NS1+IgM","NS1+IgG","Tourniquet"),
people = c(542,542,541,250),
positive = c(505,503,38,93))
df$percentage = df$positive / df$people * 100
> df
methods people positive percentage
1 NS1 542 505 93.17343
2 NS1+IgM 542 503 92.80443
3 NS1+IgG 541 38 7.02403
4 Tourniquet 250 93 37.20000
Does this answer your question? If not, can you clarify it by adding the ggplot code you have tried so far?
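If the goal is bars grouped by method (people tested next to positive results), a minimal sketch could look like the following. It reuses the data above; the long-format data frame `long` and its column names `group` and `n` are names made up for this illustration.

```r
library(ggplot2)

df <- data.frame(methods  = c("NS1", "NS1+IgM", "NS1+IgG", "Tourniquet"),
                 people   = c(542, 542, 541, 250),
                 positive = c(505, 503, 38, 93))

# Reshape to long format: one row per method and count type
# (the names `long`, `group` and `n` are illustrative)
long <- data.frame(methods = rep(df$methods, times = 2),
                   group   = rep(c("people", "positive"), each = nrow(df)),
                   n       = c(df$people, df$positive))

p <- ggplot(long, aes(x = methods, y = n, fill = group)) +
  geom_col(position = "dodge") +                     # bars side by side per method
  geom_text(aes(label = n),
            position = position_dodge(width = 0.9),  # align labels with the dodged bars
            vjust = -0.3, size = 3.5)
print(p)
```

The percentage could be added to the label of the "positive" bars with paste0(), as in the answer above.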

ggplot plotting problems and error bars

So I have some data that I imported into R using read.csv.
d = read.csv("Flux_test_results_for_R.csv", header=TRUE)
rows_to_plot = c(1,2,3,4,5,6,13,14)
d[rows_to_plot,]
It looks like it worked fine:
> d[rows_to_plot,]
strain selective rate ci.low ci.high
1 4051 rif 1.97539e-09 6.93021e-10 5.63066e-09
2 4052 rif 2.33927e-09 9.92957e-10 5.51099e-09
3 4081 (mutS) rif 1.32915e-07 1.05363e-07 1.67671e-07
4 4086 (mutS) rif 1.80342e-07 1.49870e-07 2.17011e-07
5 4124 (mutL) rif 5.53369e-08 4.03940e-08 7.58077e-08
6 4125 (mutL) rif 1.42575e-07 1.14957e-07 1.76828e-07
13 4760-all rif 6.74928e-08 5.41247e-08 8.41627e-08
14 4761-all rif 2.49119e-08 1.91979e-08 3.23265e-08
So now I'm trying to plot the column "rate", with "strain" as labels, and ci.low and ci.high as boundaries for confidence intervals.
Using ggplot, I can't even get the plot to work. This gives a plot where all the dots are at 1 on the y-axis:
g <- ggplot(data=d[rows_to_plot,], aes(x=strain, y=rate))
g + geom_dotplot()
Attempt at error bars:
error_limits = aes(ymax = d2$ci.high, ymin = d2$ci.low)
g + geom_errorbar(error_limits)
As you can tell I'm a complete noob to plotting things in R, any help appreciated.
Answer update
There were two things going on. As per boshek's answer, which I selected, it seems that geom_point(), not geom_dotplot(), was the way to go.
The other issue was that originally, I filtered the data to only plot some rows, but I didn't also filter the error limits by row. So I switched to:
d2 = d[c(1,2,3,4,5,6,13,14),]
error_limits = aes(ymax = d2$ci.high, ymin = d2$ci.low)
g = ggplot(data=d2, ...etc...
A couple of general comments. Get away from using attach. Though it has its uses, for beginners it can be very confusing. Get used to things like d$strain and d$selective. That said, once you pass the data frame to ggplot(), you can refer to its variables just by their names. Also, you really need to ask questions with a reproducible example. This is a very important step in learning how to ask questions about R.
Now for the plot. I think this should work:
# ci.low and ci.high in the question are already absolute bounds, so map them directly
error_limits = aes(ymax = ci.high, ymin = ci.low)
ggplot(data=d[rows_to_plot,], aes(x=strain, y=rate)) +
geom_point() +
geom_errorbar(error_limits)
But of course this is untested because you haven't provided a reproducible example.
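For reference, a self-contained sketch using the eight rows printed in the question, typed in by hand. It treats ci.low and ci.high as absolute interval bounds, which is what the printed values suggest. The log-scaled y axis and the error bar width are assumptions added here, not part of the original code.

```r
library(ggplot2)

# Rows shown in the question (the `selective` column is omitted)
d2 <- data.frame(
  strain  = c("4051", "4052", "4081 (mutS)", "4086 (mutS)",
              "4124 (mutL)", "4125 (mutL)", "4760-all", "4761-all"),
  rate    = c(1.97539e-09, 2.33927e-09, 1.32915e-07, 1.80342e-07,
              5.53369e-08, 1.42575e-07, 6.74928e-08, 2.49119e-08),
  ci.low  = c(6.93021e-10, 9.92957e-10, 1.05363e-07, 1.49870e-07,
              4.03940e-08, 1.14957e-07, 5.41247e-08, 1.91979e-08),
  ci.high = c(5.63066e-09, 5.51099e-09, 1.67671e-07, 2.17011e-07,
              7.58077e-08, 1.76828e-07, 8.41627e-08, 3.23265e-08)
)

p <- ggplot(d2, aes(x = strain, y = rate)) +
  geom_point() +
  geom_errorbar(aes(ymin = ci.low, ymax = ci.high), width = 0.2) +
  scale_y_log10()  # assumption: log axis, since the rates span two orders of magnitude
print(p)
```

Because the bounds are columns of the same data frame that is passed to ggplot(), subsetting the rows keeps the points and the error bars in sync.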

How do you create a bar graph for a data frame in R that uses percentages as the y-axis instead of a count?

If I had a data frame like this (but larger):
ID Rating
12 Good
12 Good
16 Good
16 Bad
16 Very Bad
34 Very Good
38 Very Bad
52 Bad
What would I have to do to make a plot that shows each rating's share of the total count as a percentage? Basically, the graph should have four bars on the x-axis, one per rating, and the y-axis should be the percentage of rows with that rating. For example, the data frame above would have four bars, with Very Bad and Bad at 25%, Good at 37.5% and Very Good at 12.5%. I would really prefer an answer in ggplot2, but since I can't find this anywhere, anything in R would work.
This is the best answer I found:
# create data
data <- data.frame(ID = as.factor(c(12,12,16,16,16,34,38,52)),
Rating = c("Good","Good","Good","Bad","Very Bad","Very Good","Very Bad","Bad"))
# get summary table of Rating
t <- table(data$Rating)
# get percentage list
percent <- as.vector(t)/nrow(data)
# plot
library(ggplot2)
ggplot(data = data,aes(x=Rating)) +
geom_bar(aes(y = after_stat(count / sum(count)))) +  # proportion per bar; older ggplot2 versions used ..count..
ylab("Percentage") +
ylim(0,0.4)
library(ggplot2)
# create some data
DT <- data.frame(ID=1:10,Rating=sample(c("Very Good","Good","Bad","Very Bad"),20,replace=TRUE))
ggplot(DT, aes(factor(Rating))) + geom_bar()
Reference: ggplot2 docs
For showing proportions in base barplots, with actual proportions displayed as text over the bars:
tmp.table <- prop.table(table(dat$Rating))
with(dat, barplot(tmp.table, xlab= "Rating", ylab="proportion", ylim=c(0,.40)))
text(x = c(0.75, 2, 3.1, 4.25), y = tmp.table + .01, labels=paste(tmp.table*100,"%"))
Data
dat <- read.csv(text="Rating
Good
Good
Good
Bad
Very Bad
Very Good
Very Bad
Bad")
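Another way to get there, sketched here as one possibility rather than the canonical answer, is to compute the percentages first with prop.table() and then plot the precomputed values with geom_col(). The helper data frame `pct_df` is a name made up for this example.

```r
library(ggplot2)

dat <- data.frame(Rating = c("Good", "Good", "Good", "Bad",
                             "Very Bad", "Very Good", "Very Bad", "Bad"))

# Percentage of rows per rating, computed before plotting
pct <- prop.table(table(dat$Rating)) * 100
pct_df <- data.frame(Rating = names(pct), percent = as.numeric(pct))

p <- ggplot(pct_df, aes(x = Rating, y = percent)) +
  geom_col() +                                                # bar height = precomputed value
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.3) +
  ylab("Percentage") +
  ylim(0, 40)
print(p)
```

Computing the percentages up front also makes it easy to check them against the numbers quoted in the question before drawing anything.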

set x/y limits in facet_wrap with scales = 'free'

I've seen similar questions asked, as well as this discussion about adding the functionality to ggplot: Setting x/y lim in facet_grid. In my research I often want to produce several panel plots, say for different simulation trials, where the axis limits remain the same to highlight differences between the trials. This is especially useful when showing the plot panels in a presentation. In each panel plot I produce, the individual plots require independent y axes, as they're often weather variables: temperature, relative humidity, wind speed, etc. Using
ggplot() + ... + facet_wrap(~ ..., scales = 'free_y')
works great as I can easily produce plot panels of different weather variables.
When I compare different plot panels, it's nice to have consistent axes. Unfortunately, ggplot provides no way of setting the individual limits of each plot within a panel plot; it defaults to using the range of the given data. The Google Group discussion linked above discusses this shortcoming, but I was unable to find any updates as to whether this could be added. Is there a way to trick ggplot into setting the individual limits?
A first suggestion that somewhat sidesteps the solution I'm looking for is to combine all my data into one data table and use facet_grid on my variable and simulation:
ggplot() + ... + facet_grid(variable~simulation, scales = 'free_y')
This produces a fine looking plot that displays the data in one figure, but can become unwieldy when considering many simulations.
To 'hack' the plotting into producing what I want, I first determined which limits I desired for each weather variable. These limits were found by looking at the greatest extents for all simulations of interest. Once determined I created a small data table with the same columns as my simulation data and appended it to the end. My simulation data had the structure
'year' 'month' 'variable' 'run' 'mean'
1973 1 'rhmax' 1 65.44
1973 2 'rhmax' 1 67.44
... ... ... ... ...
2011 12 'windmin' 200 0.4
So I created a new data table with the same columns
ylims.sims <- data.table(year = 1, month = 13,
variable = rep(c('rhmax','rhmin','sradmean','tmax','tmin','windmax','windmin'), each = 2),
run = 201, mean = c(20, 100, 0, 80, 100, 350, 25, 40, 12, 32, 0, 8, 0, 2))
Which gives
'year' 'month' 'variable' 'run' 'mean'
1 13 'rhmax' 201 20
1 13 'rhmax' 201 100
1 13 'rhmin' 201 0
1 13 'rhmin' 201 80
1 13 'sradmean' 201 100
1 13 'sradmean' 201 350
1 13 'tmax' 201 25
1 13 'tmax' 201 40
1 13 'tmin' 201 12
1 13 'tmin' 201 32
1 13 'windmax' 201 0
1 13 'windmax' 201 8
1 13 'windmin' 201 0
1 13 'windmin' 201 2
While the choice of year and run is arbitrary, the month needs to be anything outside 1:12. I then appended this to my simulation data
sim1data.ylims <- rbind(sim1data, ylims.sims)
ggplot() + geom_boxplot(data = sim1data.ylims, aes(x = factor(month), y = mean)) +
facet_wrap(~variable, scales = 'free_y') + xlab('month') +
xlim('1','2','3','4','5','6','7','8','9','10','11','12')
When I plot these data with the y limits, I limit the x-axis values to those in the original data. The appended data table with y limits has month values of 13. As ggplot still scales the axes to the entire dataset, even when the axes are limited, this gives me the y limits I desire. Note that this will not work if there are data values beyond the limits you specify.
Before: Notice the differences in the y limits for each weather variable between the panels.
After: Now the y limits remain consistent for each weather variable between the panels.
I hope to edit this post in the coming days and add a reproducible example for better explanation. Please comment if you've heard anything about adding this functionality to ggplot.
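In the meantime, here is a minimal sketch of a related trick that avoids appending dummy rows to the simulation data: an invisible geom_blank() layer fed with the desired limits. The simulated data frame `sim` is a made-up stand-in for the real simulation data. The limits for tmax and windmax are taken from the table above, and only two variables are used to keep it short.

```r
library(ggplot2)

set.seed(1)  # simulated stand-in for the real data: 5 runs x 12 months x 2 variables
sim <- expand.grid(month = 1:12, run = 1:5,
                   variable = c("tmax", "windmax"),
                   stringsAsFactors = FALSE)
sim$mean <- ifelse(sim$variable == "tmax",
                   runif(nrow(sim), 26, 38),
                   runif(nrow(sim), 1, 7))

# Desired y limits per variable (lower and upper value in separate rows)
lims <- data.frame(variable = rep(c("tmax", "windmax"), each = 2),
                   mean     = c(25, 40, 0, 8))

p <- ggplot(sim, aes(x = factor(month), y = mean)) +
  geom_boxplot() +
  # draws nothing, but its y values still take part in training each panel's scale
  geom_blank(data = lims, aes(y = mean), inherit.aes = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  xlab("month")
print(p)
```

As with the appended-rows approach, this only widens the axes: data values beyond the chosen limits would still stretch the panel.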

Set the width of ggplot geom_path based on a variable

I have two functions, a and b, that each take a value of x from 1-3 and produce an estimate and an error.
x variable estimate error
1 a 8 4
1 b 10 2
2 a 9 3
2 b 10 1
3 a 8 5
3 b 11 3
I'd like to use geom_path() in ggplot to plot the estimates and errors for each function as x increases.
So if this is the data:
d = data.frame(x=c(1,1,2,2,3,3),variable=rep(c('a','b'),3),estimate=c(8,10,9,10,8,11),error=c(4,2,3,1,5,3))
Then the output that I'd like is something like the output of:
ggplot(d,aes(x,estimate,color=variable)) + geom_path()
but with the thickness of the line at each point equal to the size of the error. I might need to use something like geom_polygon(), but I haven't been able to find a good way to do this without calculating a series of coordinates manually.
If there's a better way to visualize this data (y value with confidence intervals at discrete x values), that would be great. I don't want to use a bar graph because I actually have more than two functions and it's hard to track the changing estimate/error of any specific function with a large group of bars at each x value.
The short answer is that you need to map size to error, so that the size of the geometric object varies with the value of error. There are many ways to do what you want, as you have suggested.
df = data.frame(x = c(1,1,2,2,3,3),
variable = rep(c('a','b'), 3),
estimate = c(8,10,9,10,8,11),
error = c(4,2,3,1,5,3))
library(ggplot2)
ggplot(df, aes(x, estimate, colour = variable, group = variable, size = error)) +
geom_point() + theme(legend.position = 'none') + geom_line(size = .5)
I found geom_ribbon(). The answer is something like this:
ggplot(d,aes(x,estimate,ymin=estimate-error,ymax=estimate+error,fill=variable)) + geom_ribbon()
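A slightly fuller sketch of the ribbon idea, under the assumption that the band should be centred on the estimate with half-width equal to error. The columns `lower` and `upper` are names introduced here for illustration, and the transparency value is an arbitrary choice so that overlapping bands stay readable.

```r
library(ggplot2)

d <- data.frame(x = c(1, 1, 2, 2, 3, 3),
                variable = rep(c("a", "b"), 3),
                estimate = c(8, 10, 9, 10, 8, 11),
                error    = c(4, 2, 3, 1, 5, 3))

# Band edges (illustrative column names)
d$lower <- d$estimate - d$error
d$upper <- d$estimate + d$error

p <- ggplot(d, aes(x, estimate, group = variable)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = variable),
              alpha = 0.3) +             # semi-transparent so overlapping bands remain visible
  geom_line(aes(colour = variable))      # the estimate itself on top of the band
print(p)
```

With more than two functions, faceting by variable might be easier to read than overlapping ribbons.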
