set x/y limits in facet_wrap with scales = 'free' - r

I've seen similar questions asked, and this discussion about adding functionality to ggplot: Setting x/y lim in facet_grid. In my research I often want to produce several panel plots, say for different simulation trials, where the axis limits remain the same to highlight differences between the trials. This is especially useful when showing the plot panels in a presentation. In each panel plot I produce, the individual plots require independent y axes as they're often weather variables: temperature, relative humidity, windspeed, etc. Using
ggplot() + ... + facet_wrap(~ ..., scales = 'free_y')
works great as I can easily produce plot panels of different weather variables.
When I compare between different plot panels, it's nice to have consistent axes. Unfortunately ggplot provides no way of setting the individual limits of each plot within a panel plot. It defaults to using the range of the given data. The Google Group discussion linked above discusses this shortcoming, but I was unable to find any updates as to whether this could be added. Is there a way to trick ggplot into setting the individual limits?

A first suggestion that somewhat sidesteps the solution I'm looking for is to combine all my data into one data table and use facet_grid on my variable and simulation
ggplot() + ... + facet_grid(variable~simulation, scales = 'free_y')
This produces a fine looking plot that displays the data in one figure, but can become unwieldy when considering many simulations.
To 'hack' the plotting into producing what I want, I first determined which limits I desired for each weather variable. These limits were found by looking at the greatest extents for all simulations of interest. Once determined I created a small data table with the same columns as my simulation data and appended it to the end. My simulation data had the structure
'year' 'month' 'variable' 'run' 'mean'
1973 1 'rhmax' 1 65.44
1973 2 'rhmax' 1 67.44
... ... ... ... ...
2011 12 'windmin' 200 0.4
So I created a new data table with the same columns
ylims.sims <- data.table(year = 1, month = 13,
variable = rep(c('rhmax','rhmin','sradmean','tmax','tmin','windmax','windmin'), each = 2),
run = 201, mean = c(20, 100, 0, 80, 100, 350, 25, 40, 12, 32, 0, 8, 0, 2))
Which gives
'year' 'month' 'variable' 'run' 'mean'
1 13 'rhmax' 201 20
1 13 'rhmax' 201 100
1 13 'rhmin' 201 0
1 13 'rhmin' 201 80
1 13 'sradmean' 201 100
1 13 'sradmean' 201 350
1 13 'tmax' 201 25
1 13 'tmax' 201 40
1 13 'tmin' 201 12
1 13 'tmin' 201 32
1 13 'windmax' 201 0
1 13 'windmax' 201 8
1 13 'windmin' 201 0
1 13 'windmin' 201 2
While the choice of year and run is arbitrary, the choice of month needs to be anything outside 1:12. I then appended this to my simulation data
sim1data.ylims <- rbind(sim1data, ylims.sims)
ggplot() + geom_boxplot(data = sim1data.ylims, aes(x = factor(month), y = mean)) +
facet_wrap(~variable, scales = 'free_y') + xlab('month') +
xlim('1','2','3','4','5','6','7','8','9','10','11','12')
When I plot these data with the y limits, I limit the x-axis values to those in the original data. The appended data table with y limits has month values of 13, so its rows are never drawn. As ggplot still scales axes to the entire dataset, even when the axes are limited, this gives me the y limits I desire. Note that this will not work if any data values fall outside the limits you specify, because the axis will then stretch to fit those values instead.
Before: Notice the differences in the y limits for each weather variable between the panels.
After: Now the y limits remain consistent for each weather variable between the panels.
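A variation on the same trick, sketched here under assumptions rather than taken from the post: instead of appending month = 13 rows and trimming the x axis, the limit rows can go into their own invisible layer with geom_blank(), which takes part in scale training but draws nothing. The simulated sim1data below is made up for illustration; only the column names and the tmax/windmax limits follow the post.

```r
library(ggplot2)

set.seed(42)  # made-up data, for illustration only

# Stand-in for one simulation's output: 20 runs x 12 months x 2 variables
sim1data <- expand.grid(month = 1:12, run = 1:20,
                        variable = c("tmax", "windmax"),
                        stringsAsFactors = FALSE)
sim1data$mean <- ifelse(sim1data$variable == "tmax",
                        runif(nrow(sim1data), 28, 35),  # stays inside 25..40
                        runif(nrow(sim1data), 2, 6))    # stays inside 0..8

# Desired limits: one row for the lower and one for the upper bound per variable
ylims.sims <- data.frame(variable = rep(c("tmax", "windmax"), each = 2),
                         mean     = c(25, 40, 0, 8))

p <- ggplot(sim1data, aes(x = factor(month), y = mean)) +
  geom_boxplot() +
  # Invisible layer: only widens each panel's y range to the desired limits
  geom_blank(data = ylims.sims, aes(y = mean), inherit.aes = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  xlab("month")

print(p)
```

As with the appended-rows approach, this only pads the range: data outside the chosen limits will still stretch the axis.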
I hope to edit this post in the coming days and add a reproducible example for better explanation. Please comment if you've heard anything about adding this functionality to ggplot.

Related

Plotting stacked histogram with log scale in ggplot2

Note: I found a similar question, for which there was an answer explaining the problem. However, I'm looking for an answer, as opposed to a reason why it's difficult (which I fully understand).
I have data for which I want to create a histogram. This data has a count of 10000 for the bin [0, 200) and a count of 1 for several bins such as [30000, 30200). Both bins are important and need to be visible. For this, I can perform a histogram with the log1p scale.
contig_len <- read.table(data_file, header = FALSE, sep = ",", col.names=c("Length"))
ggplot(contig_len, aes(x = Length)) + geom_histogram(binwidth=200) +
scale_y_continuous(trans="log1p")
This works perfectly! But now, I want to categorise the items in the histogram, as follows:
ggplot(contig_len, aes(x = Length, fill = Prevalence)) +
geom_histogram(binwidth=200, alpha=0.5, position="stack") +
scale_y_continuous(trans = "log1p")
This doesn't work, however, as the stacking is performed without taking the log scale into account. Has anyone found a way around this problem? My data looks like this:
head(contig_len)
Length Prevalence
1 606 Repetitive (<5)
2 888 Non-Repetitive
3 192 Repetitive (<9)
4 9830 Non-Repetitive
5 506 Non-Repetitive
6 850 Non-Repetitive

Approach for creating plotting means from data frame

Trying to develop a flexible script to plot mean of continuous variable observations 'score' as a function of discrete time points 'day' from data frame.
I can do this by creating subsets, but I have a big set of data with many factor vectors like 'day,' so would like to get vectors or a data frame for each factor and its corresponding mean.
I have a data frame structured like this:
subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA
I tried tapply(), but couldn't get it to output vectors or tables with appropriate headers while also handling NAs.
Looking for a simple bit of code to get two vectors or a data frame with which to plot mean of 'score' as a function of factor 'day'.
So the plot will have point for average score on each day 0, 7, and 14.
I have seen a lot of posts for doing this directly with ggplot, but it seems useful to know how to do, and I need to see the output to make sure it is handling NAs correctly.
If you are able to help, please include explanatory annotations in your script. Thanks!
I think tapply should be able to handle this, you can amend the function to remove NAs:
df=data.frame("subject"=rep(1:3,3), "day"=as.factor(rep(c(0,7,14),each=3)),
"score"=c(99.13,NA,86.87,73.71,82.42,84.45,66.88,83.73,NA))
res = with(df, tapply(score, day, function(x) mean(x,na.rm=T)))
EDIT to get day and score as vectors
day=as.numeric(names(res))
day
0 7 14
score=as.numeric(res)
score
93.00000 80.19333 75.30500
Plot in base R:
plot(x=as.numeric(as.character(df$day)),y=df$score,type="p")
lines(x=as.numeric(names(res)),y=res, col="red")
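Since the question also asks for a data frame, here is a minimal sketch using base R's aggregate(). It is an alternative to the tapply() call, not part of the original answer, and it keeps day numeric so it can be plotted directly.

```r
# Same data as in the answer above, with day kept numeric
df <- data.frame(subject = rep(1:3, 3),
                 day     = rep(c(0, 7, 14), each = 3),
                 score   = c(99.13, NA, 86.87, 73.71, 82.42, 84.45, 66.88, 83.73, NA))

# The formula interface drops rows with NA in score (na.action = na.omit by default)
means <- aggregate(score ~ day, data = df, FUN = mean)
print(means)

# One point per day, joined by a line
plot(means$day, means$score, type = "b", xlab = "day", ylab = "mean score")
```

Printing means makes it easy to confirm that the NA rows were dropped rather than propagated.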
It is not entirely clear what you are trying to achieve. Here I will show how to use the ggplot2 package to create a point plot with the mean for each group, assuming that dt is your data frame.
library(ggplot2)
ggplot(dt, aes(x = day, y = score, color = factor(subject))) + # Specify x, y and color information
geom_point(size = 3) + # plot the point and specify the size is 3
scale_color_brewer(name = "Subject",
type = "qual",
palette = "Pastel1") + # Format the color of points and the legend using ColorBrewer
scale_x_continuous(breaks = c(0, 7, 14)) + # Set the breaks on x-axis
stat_summary(fun = "mean", # this argument was named fun.y before ggplot2 3.3.0
color = "red",
geom = "point",
size = 5,
shape = 8) + # Compute mean of each group and plot it
theme_classic() # Specify the theme
Warning messages: 1: Removed 2 rows containing non-finite values
(stat_summary). 2: Removed 2 rows containing missing values
(geom_point).
If you run the above code, you will get the warning message and a plot as follows. The warning message means NA has been removed, so you don't need to further remove NA from the dataset.
DATA
dt <- read.table(text = "subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA",
header = TRUE, stringsAsFactors = FALSE)

R ggplot Facet Wrap

I'm trying to facet wrap this scatter plot by the y axis. For example, if the y-axis goes up to 1000, I would like to separate this graph into 4, the first where the y-axis goes from 0-250, the next 251-500, the next 501-750, and the last 751-1000. Is this type of facet wrapping possible?
library(ggplot2)
A
nrow(A)
# 1000
ncol(A)
# 3
head(A)
# Track Base Location
# 1 1 A 1
# 2 1 C 2
# 3 1 G 3
# 4 1 G 4
# 5 1 A 5
# 6 1 A 6
p <- ggplot(data = A, aes(y=Track, x=Location)) +
geom_point(aes(colour=Base),shape=15,size=2)
print(p)
This is what I have right now, as you can see, it doesn't look aesthetically pleasing.
You can - you just have to make an indicator variable which shows which facet each point should belong in.
(Quick aside - what you have placed in your question is still not a reproducible example - what we are after is something we can copy-paste into our terminals that will still demonstrate your problem. For example, in my answer I've shown you a reproducible example where I only have 100 rows in A rather than your 1000, and my intervals are different to yours, but the key is you can copy-paste it straight to your terminal and you can easily see how it will extend to your problem).
e.g.
# here's a reproducible example, with 100 data points
A <- data.frame(Track=1, Location=1:100, Base=factor(sample(c('A', 'T', 'C', 'G'), 100, replace=T)))
library(ggplot2)
ggplot(transform(A, panel=cut(Location, seq(0, 100, by=25), include.lowest=T)),
aes(x=Track, y=Location)) +
geom_point(aes(colour=Base),shape=15,size=2) +
facet_wrap( ~ panel, scales='free')
Key points:
transform(A, panel=...) adds an extra column "panel" into your dataframe (try running that on its own to see)
cut(Location, seq(0, 100, by=25), include.lowest=T) makes a factor that indicates which interval each Location is in. The intervals here are [0,25], (25,50], (50,75] and (75,100] (the include.lowest makes sure that the 0 is included in the first interval). For the breaks you mentioned in your question you'd do something like seq(0, 1000, by=250)
the facet_wrap(~ panel,...) makes one facet per interval
the scales='free' in the facet_wrap lets the y scale differ in each panel (leave it out and all panels get plotted on a common scale - try it and you will see what I mean).
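To see what the cut() step produces on its own, here is a small sketch on a few boundary values. The vectors are made up for illustration, and dig.lab = 4 is an extra argument not used in the answer, added so that 1000 is written out in full in the interval labels.

```r
# Boundary values for the breaks used in the answer
loc <- c(0, 25, 26, 100)
panel <- cut(loc, seq(0, 100, by = 25), include.lowest = TRUE)
print(data.frame(loc, panel))

# The 0-1000 case from the question; dig.lab = 4 keeps 1000 from being
# abbreviated in the interval labels
panel1000 <- cut(c(1, 250, 251, 1000), seq(0, 1000, by = 250),
                 include.lowest = TRUE, dig.lab = 4)
print(levels(panel1000))
```

Because the intervals are closed on the right, a Location of exactly 250 falls in the first panel and 251 in the second, which matches the 0-250 / 251-500 split asked for.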

Adding total counts as horizontal lines to histograms in facet_grid()

Data:
I have a data frame comprising 4 variables and about 300k rows including a unique account ID, a start date in yyyy-mm-dd, a start year, and the total number of months to-date the customer has held an account active. Snippet of the data below (don't let the row numbers confuse, this is obviously a subset, if more data is necessary, let me know):
> head(ten.by.id)
acct.id start_date strt.yr max_ten
1 155 1998-11-01 1998 175
19 902 2001-09-01 2001 143
39 995 2001-09-01 2001 143
59 1014 2000-10-01 2000 153
78 1017 2000-04-01 2000 160
100 1137 2000-11-01 2000 153
Problem (Why I want to render a faceted plot):
Showing a histogram of the entire dataset across all years renders the following:
Obviously, there are mixed distributions of information here, but the effect is unknown. First I thought I'd check for time domain effects with a visual. By using facets, I can provide a serial histogram of frequency distributions by year, overlaying the KDE plot for each year.
If multiple distributions were a product of something that occurred over time, I could spot check relevant shape changes (i.e. uni to multimodal). I used the code below to generate this plot:
# each line must end with + (not start with it) so that R keeps reading the same expression
maxten_time <- ggplot(ten.by.id, aes(max_ten)) +
geom_histogram(colour="grey19", fill="orange", binwidth=2, stat="bin") +
scale_y_continuous(breaks=seq(0,12000,by=100)) +
scale_x_continuous(breaks=seq(0,180,by=45)) +
labs(title ="Serial Distribution of Max Length of Tenure for all Customers by Start Date", x="Max Tenure(months)", y="# of Customers", colour="blue") +
facet_grid(. ~ strt.yr) + geom_density(fill=NA, colour="orange", cex=1) + aes(y = ..count..)
Which renders the following:
Questions for recreating the faceted plot:
What I wish to do is add a horizontal line (or some other single marker) to each facet which indicates the total # of customer starts for each year. Can this be done in a faceted plot?
I would like to add an additional axis that spans across the facets to mark the number of months across all years (1 to 175). Am I reaching with ggplot to try to do this (i.e. since each facet is its own plot, would aligning the month markers across all facets even be possible)? I haven't seen any relevant examples on doing something quite like this.
The objective is merely to combine the horiz lines in each facet and the axis across facets into the entire plot. Any direction would be helpful.
Phillip
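For the first of the two questions, one possible direction (a sketch under assumptions, not an answer from the thread) is to compute the per-year totals in a small data frame that shares the faceting column, and hand it to geom_hline(). The ten.by.id data below is randomly generated for illustration, and totals and n_starts are names introduced here.

```r
library(ggplot2)

set.seed(1)  # made-up stand-in for ten.by.id, for illustration only
ten.by.id <- data.frame(strt.yr = sample(1998:2001, 400, replace = TRUE),
                        max_ten = sample(1:175, 400, replace = TRUE))

# One row per facet: total number of customer starts in that year
totals <- aggregate(max_ten ~ strt.yr, data = ten.by.id, FUN = length)
names(totals)[2] <- "n_starts"

p <- ggplot(ten.by.id, aes(max_ten)) +
  geom_histogram(colour = "grey19", fill = "orange", binwidth = 2) +
  # strt.yr in totals matches the facet variable, so each line lands in its own panel
  geom_hline(data = totals, aes(yintercept = n_starts), linetype = "dashed") +
  facet_grid(. ~ strt.yr)

print(p)
```

A yearly total is usually far larger than any single bin count, so the line will stretch the y axis; a text label per facet may be a better marker if that compresses the histograms too much.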

Multiple Plots in R

I want to plot 2 graphs in 1 frame. Basically I want to compare the results.
Anyways, the code I tried is:
plot(male,pch=16,col="red")
lines(male,pch=16,col="red")
par(new=TRUE)
plot(female,pch=16,col="green")
lines(female,pch=16,col="green")
When I run it, I DO get 2 plots in a frame BUT it changes my y-axis. Added my plot below. Anyways, y-axis values are -4,-4,-3,-3,...
It's like both of the plots display their own axis.
Please help.
Thanks
You don't need the second plot. Just use
> plot(male,pch=16,col="red")
> lines(male, pch=16, col = "red")
> lines(female, pch=16, col = "green")
> points(female, pch=16, col = "green")
Note: that will set the frame boundaries based on the first data set, so some data from the second plot could be outside the boundaries of the plot. You can fix it by e.g. setting the limits of the first plot yourself.
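Setting the limits yourself can be sketched as follows, using the male and female values from the data listing shown later in this thread. The name y_limits is introduced here for illustration.

```r
# Values copied from the data listing in this thread
male   <- c(-0.4334269341, 0.8829902521, -0.6052638138, 0.2270191965, 3.5123679143,
            0.0615821014, 3.6280155376, 2.3508890457, 2.9824432680, 1.1938052833)
female <- c(1.3151289227, 1.9956491556, 0.8229389822, 1.2062726250, 0.6633392820,
            1.1331669670, -0.9002109636, 3.2137052284, 0.3113656610, 1.4664434215)

# Range covering both series, so nothing from the second one is clipped
y_limits <- range(male, female)

plot(male, pch = 16, col = "red", ylim = y_limits, ylab = "value")
lines(male, col = "red")
lines(female, col = "green")
points(female, pch = 16, col = "green")
```

Without ylim, the female value near -0.9 would fall below the frame, since the first plot() call only sees the male data.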
For this kind of plot I usually like the plotting with ggplot2 much better. The main reason: It generalizes nicely to more than two lines without a lot of code.
The drawback for your sample data is that it is not available as a data.frame, which is required for ggplot2. Furthermore, in every case you need a x-variable to plot against. Thus, first let us create a data.frame out of your data.
dat <- data.frame(index=rep(1:10, 2), vals=c(male, female), group=rep(c('male', 'female'), each=10))
Which leaves us with
> dat
index vals group
1 1 -0.4334269341 male
2 2 0.8829902521 male
3 3 -0.6052638138 male
4 4 0.2270191965 male
5 5 3.5123679143 male
6 6 0.0615821014 male
7 7 3.6280155376 male
8 8 2.3508890457 male
9 9 2.9824432680 male
10 10 1.1938052833 male
11 1 1.3151289227 female
12 2 1.9956491556 female
13 3 0.8229389822 female
14 4 1.2062726250 female
15 5 0.6633392820 female
16 6 1.1331669670 female
17 7 -0.9002109636 female
18 8 3.2137052284 female
19 9 0.3113656610 female
20 10 1.4664434215 female
Note that my command assumes you have 10 data values each. That command would have to be adjusted according to your actual data.
Now we may use the mighty power of ggplot2:
library(ggplot2)
ggplot(dat, aes(x=index, y=vals, color=group)) + geom_point() + geom_line()
The call above has three elements: ggplot initializes the plot, tells R to use dat as datasource and defines the plot aesthetics, or better: Which aesthetic properties of the plot (such as color, position, size, etc.) are influenced by your data. We use the x and y-values as expected and furthermore set the color aesthetic to the grouping variable - that makes ggplot automatically plot two groups with different colors. Finally, we add two geometries, that pretty much do what is written above: Draw lines and draw points.
The result:
If you have your data saved in the standard way in R (in a data.frame), you end up with one line of code. And if after some thousand years of evolution you want to add another gender, it is still one line of code.

Resources