I would like to draw a horizontal line at yintercept = mean(y) for my data (x, y).
geom_line(stat="hline", linetype="dotted", yintercept="mean") works fine, but
geom_hline(linetype="dotted", yintercept="mean") doesn't work for me.
What is the difference between these two functions? I thought geom_hline = geom_line + stat_hline. Is that not the case?
Update
Answered here: Is there any difference between `geom_a(stat="b", ...)` and `stat_b(geom="a",...)`?
geom_hline is just weird!
geom_line works with your original data and is mainly designed to connect points, or otherwise characterise the data. You have instructed it to take the mean of the y-values to create a horizontal line.
geom_hline is merely an annotation. You need to specify the y-intercept explicitly. It only sees "mean" here as a character, rather than a function to use. You would need to write:
geom_hline(linetype="dotted", yintercept=mean(y))
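As a minimal sketch of that last line in context: the data frame df below is made up for illustration, and the mean is computed from the data frame directly because geom_hline does not look up y inside the plot data when yintercept is given as a plain argument.

```r
library(ggplot2)

# Made-up data standing in for the asker's (x, y)
df <- data.frame(x = 1:10, y = c(2, 4, 3, 5, 7, 6, 8, 9, 7, 10))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  # yintercept is a plain number here, computed before the layer is built
  geom_hline(yintercept = mean(df$y), linetype = "dotted")
```

Printing p draws the points with a dotted horizontal line at the mean of df$y.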
Related
I am creating histograms of substitutions (1st, 2nd, or 3rd sub) over time, so each histogram shows the number of subs in a given minute for a given sub number. The histograms make sense to me because for the most part they are smooth (I used a bin width of 1 minute). Nothing looks too out of the ordinary. However, when I overlay a density plot, the left tail inflates for one of the graphs and I cannot determine why.
The dataset is of substitutions, ranging from minute 1 to a maximum time. I then cut this dataset in half to look only at subs made after minute 45. I have not folded this data back. I have tried to create a reproducible example, but cannot given the data.
Code used to create the plots in R:
## Filter out subs that are not in the second half
df.half <- df[df$PeriodId >= 2, ]
p <- ggplot(data = df.half, aes(x = time)) +
  geom_histogram(aes(y = ..density..), position = "identity", alpha = 0.5, binwidth = 1) +
  geom_vline(data = sumy.df.half, aes(xintercept = grp.mean), color = "blue", linetype = "dashed", size = 1) +
  geom_density(alpha = .2) +
  facet_grid(SUB_NUMBER ~ .) +
  scale_y_continuous(limits = c(0, 0.075), breaks = c(seq(0, 0.075, 0.025)),
                     minor_breaks = c(seq(0, 0.075, 0.025)), name = 'Count')
p
Why, for the First Sub, is the density plot inflated in the tail if there are no values less than 45? And why isn't the density plot more inflated in the tail for the Second Sub?
Side note: I did ask this question on Cross Validated, but was told to ask it here instead since it involves R.
So I was able to change the code and get the following:
ggplot() +
geom_histogram(data=df.half, aes(x=time,y=..density..),position="identity", alpha=0.5,binwidth=1)+
geom_density(data=df.half,aes(x=time,y=..density..))+
geom_vline(data=sumy.df.half,aes(xintercept=grp.mean),color="blue", linetype="dashed", size=1)+
facet_grid(SUB_NUMBER ~ .)
This looks more correct and at least now fits the dataset. However, I am still confused as to why those issues occurred in the first place.
While there is no data sample to reproduce the error, you could try to make sure that the data and aesthetics used by geom_density are correct by specifying them explicitly. You can also try moving the line that adds the density (geom_density) to just after geom_histogram. Also, the y-axis label is probably wrong: it is currently set to 'Count', while the values suggest it is in fact a density.
How would I specify density explicitly?
You can specify the density parameters explicitly by passing data, aes and position directly in the geom_density call, so that it uses these stated arguments instead of inherited ones:
ggplot() +
geom_histogram(data=df.half, aes(x=time,y=..density..),position="identity", alpha=0.5,binwidth=1)+
geom_density(data=df.half,aes(x=time,y=..density..))+
geom_vline(data=sumy.df.half,aes(xintercept=grp.mean),color="blue", linetype="dashed", size=1)+
facet_grid(SUB_NUMBER ~ .)
I do not understand how it occurred in the first place.
I think in your initial code for geom_density, you explicitly specified only the alpha argument. Thus for all the other parameters it needed (data, aes, position, etc.) it used the inherited ones, and apparently it did not inherit them correctly. It probably tried to use the data argument from the geom_vline call (sumy.df.half), or was confused by the "..density.." syntax.
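Since the original data was never posted, here is a self-contained sketch of the "specify everything explicitly" approach on synthetic data. The data frame subs and its values are invented stand-ins for df.half, and after_stat(density) assumes ggplot2 3.3.0 or later (on older versions, ..density.. plays the same role). The y axis is labelled as a density, as suggested above.

```r
library(ggplot2)
set.seed(42)

# Synthetic stand-in for df.half: substitution minutes after minute 45
subs <- data.frame(
  time = c(runif(200, 46, 90), runif(200, 55, 90), runif(200, 65, 90)),
  SUB_NUMBER = rep(c("First", "Second", "Third"), each = 200)
)

p <- ggplot() +
  # Each layer names its own data and aesthetics, so nothing is inherited
  geom_histogram(data = subs, aes(x = time, y = after_stat(density)),
                 binwidth = 1, alpha = 0.5) +
  geom_density(data = subs, aes(x = time)) +
  facet_grid(SUB_NUMBER ~ .) +
  ylab("Density")
```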
I had a graph created with base R plotting but now want to switch to ggplot2, mainly because I want to use ggrepel to place labels correctly and without overlap.
My old plot contains diagonal lines which I need to keep. They are plotted like this:
for (i in -5:10) {
abline(a= i, b= 1, lty = 5)
}
The issues I have now are:
How do I do this for-loop with ggplot2 so I don't need to add all the lines explicitly?
How do I actually create the lines correctly?
geom_abline(slope=1, intercept=10)
Does not work as expected, probably due to log10 scale. So how can I draw diagonal lines on log10 scales correctly?
It actually works fine. This issue is directly related to my other issue about x and y axis limits. By default the plot draws a bigger area than the x and y limits define (who thought this was a good idea???), so the intercepts look wrong although they are actually OK.
If I set expand = c(0, 0) for both axes, the intercepts also look fine because the plot is drawn only up to the limits.
The solution for multiple lines is a vector of intercepts:
geom_abline(slope=1, intercept=(-3):5)
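Putting the pieces together, a sketch with made-up points (the data frame pts and the axis limits are assumptions for illustration). As far as I understand, on log10 scales the slope and intercept apply to the log10-transformed values, so slope = 1 with intercept = 1 corresponds to y = 10 * x in data units.

```r
library(ggplot2)

# Made-up points spanning a few decades
pts <- data.frame(x = c(1, 10, 100, 1000), y = c(2, 30, 150, 5000))

p <- ggplot(pts, aes(x, y)) +
  geom_point() +
  # One call draws nine parallel diagonals, one per intercept
  geom_abline(slope = 1, intercept = -3:5, linetype = "longdash") +
  # expand = c(0, 0) stops the panel from extending past the limits
  scale_x_log10(limits = c(1, 1e4), expand = c(0, 0)) +
  scale_y_log10(limits = c(1, 1e4), expand = c(0, 0))
```

This replaces the for-loop over abline() with a single layer.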
Is there a way to transform data in ggplot2 in the aes declaration of a geom?
I have a plot conceptually similar to this one:
test=data.frame("k"=rep(1:3,3),"ce"=rnorm(9),"comp"=as.factor(sort(rep(1:3,3))))
plot=ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
Suppose I would like to add a line calculated as the maximum of ce across the three comp levels at each k point, with only the plot object available. I have tried several options (e.g. using aggregate in the aes declaration, stat_function, etc.) but I could not find a way to make this work.
At the moment I am working around the problem by extracting the data frame with ggplot_build, but I would like to find a direct solution.
Is
require(plyr)
max.line = ddply(test, .(k), summarise, ce = max(ce))
plot = ggplot(test, aes(y=ce,x=k))
plot = plot + geom_line(aes(lty=comp))
plot = plot + geom_line(data=max.line, color='red')
something like what you want?
Thanks to JLLagrange and jlhoward for your help. However, both solutions require access to the underlying data.frame, which I do not have. This is the workaround I am using, based on the previous example:
data=ggplot_build(plot)$data[[1]]
cemax=with(data,aggregate(y,by=list(x),max))
plot+geom_line(data=cemax,aes(x=Group.1,y=x),colour="green",alpha=.3,lwd=2)
This does not require direct access to the dataset, but to me it is a very inefficient and inelegant solution. Obviously if there is no other way to manipulate the data, I do not have much of a choice :)
EDIT (Response to OP's comment):
OK I see what you mean now - I should have read your question more carefully. You can achieve what you want using stat_summary(...), which does not require access to the original data frame. It also solves the problem I describe below (!!).
library(ggplot2)
set.seed(1)
test <- data.frame(k=rep(1:3,3),ce=rnorm(9),comp=factor(rep(1:3,each=3)))
plot <- ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
##
plot + stat_summary(fun.y=max, geom="line", col="red")
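A note for newer installations: from ggplot2 3.3.0 the fun.y argument is deprecated in favour of fun. A minimal sketch of the same idea under that assumption (the object names p and p_max are only for illustration):

```r
library(ggplot2)
set.seed(1)

test <- data.frame(k = rep(1:3, 3), ce = rnorm(9),
                   comp = factor(rep(1:3, each = 3)))
p <- ggplot(test, aes(x = k, y = ce)) + geom_line(aes(linetype = comp))

# The summary layer has no grouping aesthetic of its own, so max() is
# taken over all comp levels at each k
p_max <- p + stat_summary(fun = max, geom = "line", colour = "red")
```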
Original Response (Requires access to original df)
One unfortunate characteristic of ggplot is that aggregating functions (like max, min, mean, sd, sum, and so on) used in aes(...) operate on the whole dataset, not subgroups. So
plot + geom_line(aes(y=max(ce)))
will not work - you get the maximum of all test$ce vs. k, which is not useful.
One way around this, which is basically the same as @JLLagrange's answer (but doesn't use external libraries), is:
plot+geom_line(data=aggregate(ce~k,test,max),colour="red")
This creates a data frame dynamically aggregating ce by k using the max function.
Hi, I have a data frame weekly.mean.values with the following structure:
week:mean:ci.lower:ci.upper
Here week is a factor; mean, ci.lower and ci.upper are numeric. For each week, there is only one mean, one ci.lower and one ci.upper.
I was trying to plot a shaded area covering the 95% confidence interval around the mean, with the following code:
ggplot(weekly.mean.values,aes(x=week,y=mean)) +
geom_line() +
geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper))
The plot, however, came out blank (that is, with only the x-axis and y-axis present, but no lines or points, let alone shaded areas).
If I removed the geom_ribbon part, I did get a line. I know that this should be a very simple task but I don't know why I couldn't get geom_ribbon to plot what I wanted. Any hint would be truly appreciated.
I realize this thread is super old, but Google still finds it.
The answer is that you need to set ymin and ymax from the data you are plotting on the y-axis. If you set them to scalar values, the ribbon covers the entire plot from top to bottom.
You can use
ymin=0
ymax=mean
to go from 0 to your y-point or even
ymin=mean-1
ymax=mean+1
to have the ribbon cover a strip encompassing your actual data.
I may be missing something, but the ribbon will be plotted filled with grey20 by default. You are plotting this layer on top of the data, so no wonder it obscures it. It is also possible that the limits for the plot axes derived from the data provided to the initial ggplot() call will not be sufficient to contain the confidence interval ribbon. In that case, I would not be surprised to see a grey/blank plot.
To see if this is the problem, try altering your geom_ribbon() line to:
geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper), alpha = 0.5)
which will plot the ribbon with transparency, which should show the data underneath if the problem is what I think it is.
If so, set the x and y limits to the range of the data +/- the confidence interval you wish to plot and swap the order of the layers (i.e. draw the line on top of the ribbon), and use transparency in the ribbon to show the grid through it.
From ggplot's docs for geom_ribbon (2.1.0):
For each continuous x value, geom_interval displays a y interval. geom_area is a special case of geom_ribbon, where the minimum of the range is fixed to 0.
In this case, x values cannot be factors for geom_ribbon. One solution would be to convert week from a factor to a numeric, e.g.:
ggplot(weekly.mean.values,aes(x=as.numeric(week),y=mean)) +
geom_line() +
geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper))
geom_line should handle the switch from factor to numeric without incident, although the X axis scale may display differently.
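For anyone who wants to try this without the original data, here is a small sketch combining the suggestions in this thread. The data frame weekly and its values are invented; note that as.numeric() on a factor returns the level codes (1, 2, 3, ...), which is what is wanted here only because the weeks are already in order.

```r
library(ggplot2)

# Invented stand-in for weekly.mean.values
weekly <- data.frame(week = factor(1:5), mean = c(10, 12, 11, 14, 13))
weekly$ci.lower <- weekly$mean - 1.5
weekly$ci.upper <- weekly$mean + 1.5

p <- ggplot(weekly, aes(x = as.numeric(week), y = mean)) +
  # Ribbon first and semi-transparent, so the line drawn after it stays visible
  geom_ribbon(aes(ymin = ci.lower, ymax = ci.upper), alpha = 0.3) +
  geom_line() +
  xlab("week")
```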
I am creating a map (choropleth) as described on the ggplot2 wiki. Everything works like a charm, except that I am running into an issue mapping a continuous value to the polygon fill color via the scale_fill_brewer() function.
This question describes the problem I'm having. As in the answer, my workaround has been to pre-cut my data into bins using the gtools quantcut() function:
UPDATE: This first example is actually the right way to do this
require(gtools) # needed for quantcut()
...
fill_factor <- quantcut(fill_continuous, q=seq(0,1,by=0.25))
ggplot(mydata) +
aes(long,lat,group=group,fill=fill_factor) +
geom_polygon() +
scale_fill_brewer(name="mybins", palette="PuOr")
This works, however, I feel like I should be able to skip the step of pre-cutting my data and do something like this with the breaks option:
ggplot(mydata) +
aes(long,lat,group=group,fill=fill_continuous) +
geom_polygon() +
scale_fill_brewer(name="mybins", palette="PuOr", breaks=quantile(fill_continuous))
But this doesn't work. Instead I get an error something like:
Continuous variable (composite score) supplied to discrete scale_brewer.
Have I misunderstood the purpose of the "breaks" option? Or is breaks broken?
A major issue with pre-cutting continuous data is that there are three pieces of information used at different points in the code:
The Brewer palette -- determines the maximum number of colors available
The number of break points (or the bin width) -- has to be specified with the data
The actual data to be plotted -- influences the choice of the Brewer palette (sequential/diverging)
A true vicious circle. This can be broken by providing a function that accepts the data and the palette, automatically derives the number of break points and returns an object that can be added to the ggplot object. Something along the following lines:
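fill_brewer <- function(fill, palette) {
  require(RColorBrewer)  # provides brewer.pal.info
  # Maximum number of colors available in the chosen palette
  n <- brewer.pal.info$maxcolors[palette == rownames(brewer.pal.info)]
  # Unevaluated call that cuts the fill variable at quantiles (quantcut() is from gtools)
  discrete.fill <- call("quantcut", match.call()$fill, q=seq(0, 1, length.out=n))
  # Return the fill mapping and the matching Brewer scale, ready to add to a plot
  list(
    do.call(aes, list(fill=discrete.fill)),
    scale_fill_brewer(palette=palette)
  )
}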
Use it like this:
ggplot(mydata) + aes(long,lat,group=group) + geom_polygon() +
fill_brewer(fill=fill_continuous, palette="PuOr")
As Hadley explains, the breaks option moves the ticks, but does not make the data discrete. Therefore pre-cutting the data as per the first example in the question is the right way to use the scale_fill_brewer command.
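If you would rather not depend on gtools, the same pre-cutting can be sketched with base R's cut() and quantile(). The grid data and the column names score and score_bin below are invented for illustration, and geom_tile stands in for the map polygons.

```r
library(ggplot2)
set.seed(1)

# Invented data: a 6 x 6 grid with a continuous score per cell
grid <- expand.grid(long = 1:6, lat = 1:6)
grid$score <- runif(nrow(grid))

# Pre-cut into quartile bins; include.lowest keeps the minimum in the first bin
grid$score_bin <- cut(grid$score, breaks = quantile(grid$score),
                      include.lowest = TRUE)

p <- ggplot(grid, aes(long, lat, fill = score_bin)) +
  geom_tile() +
  scale_fill_brewer(name = "mybins", palette = "PuOr")
```

The fill variable is now a factor, which is what the discrete Brewer scale expects.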