Histograms and Density Plots do not match up - r

I am creating histograms of substitutions (1st, 2nd, or 3rd sub) over time, so each histogram shows the number of subs made in a given minute for a given sub number. The histograms make sense to me because for the most part they are smooth (I used a bin width of 1 minute), and nothing looks too out of the ordinary. However, when I overlay a density plot, the left tail is inflated for one of the graphs and I cannot determine why.
The dataset is of substitutions, ranging from minute 1 to a maximum time. I then cut this dataset in half to look only at subs made after minute 45. I have not folded this data back. I tried to create a reproducible example, but could not given the data.
Code used to create the plots in R:
## Filter out subs that are not in the second half
df.half <- df[df$PeriodId >= 2, ]
p <- ggplot(data=df.half, aes(x=time)) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.5, binwidth=1) +
  geom_vline(data=sumy.df.half, aes(xintercept=grp.mean), color="blue", linetype="dashed", size=1) +
  geom_density(alpha=.2) +
  facet_grid(SUB_NUMBER ~ .) +
  scale_y_continuous(limits = c(0, 0.075), breaks = c(seq(0, 0.075, 0.025)),
                     minor_breaks = c(seq(0, 0.075, 0.025)), name='Count')
p
Why, for the First Sub is the density plot inflated in the tail if there are no values less than 45? Also why isn't the density plot more inflated in the tail for the Second Sub?
Side note: I asked this question on Cross Validated, but was told to ask it here instead since it involves R.
So I was able to change the code and get the following:
ggplot() +
  geom_histogram(data=df.half, aes(x=time, y=..density..), position="identity", alpha=0.5, binwidth=1) +
  geom_density(data=df.half, aes(x=time, y=..density..)) +
  geom_vline(data=sumy.df.half, aes(xintercept=grp.mean), color="blue", linetype="dashed", size=1) +
  facet_grid(SUB_NUMBER ~ .)
This looks more correct and at least now fits the dataset. However, I am still confused as to why those issues occurred in the first place.

While there is no data sample to reproduce the error, you could try to make sure that the data and aesthetics used by geom_density are correct by specifying them explicitly. You can also try moving the line specifying the density (geom_density) to just after geom_histogram. Also, the y-axis label is probably wrong: it is currently set to 'Count', while the values suggest that it is in fact a density.
How would I specify density explicitly?
You can specify the density parameters explicitly by passing data, aes and position directly in the geom_density call, so that it uses these instead of the inherited arguments:
ggplot() +
  geom_histogram(data=df.half, aes(x=time, y=..density..), position="identity", alpha=0.5, binwidth=1) +
  geom_density(data=df.half, aes(x=time, y=..density..)) +
  geom_vline(data=sumy.df.half, aes(xintercept=grp.mean), color="blue", linetype="dashed", size=1) +
  facet_grid(SUB_NUMBER ~ .)
I do not understand how it occurred in the first place.
I think that in your initial code for geom_density you explicitly specified only the alpha argument. For all the other parameters it needed (data, aes, position, etc.) it therefore used the inherited arguments, and apparently it did not inherit them correctly. It probably tried to use the data argument from the geom_vline call (sumy.df.half), or was confused by the syntax of the "..density.." argument.
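Independently of how the layers inherit their arguments, it may help to check how a kernel density estimate behaves near a hard cutoff. The following is a minimal sketch using base R's density() on simulated times (the time vector here is made up for illustration and is not the asker's data). By default the estimate is evaluated some distance beyond the observed range, so a density curve can extend below 45 even when no observation does.

```r
# Simulated second-half substitution times; purely illustrative data.
set.seed(42)
time <- runif(500, min = 45, max = 95)

# Default kernel density estimate: the evaluation grid extends `cut`
# bandwidths (3 by default) past the extremes of the data.
d_default <- density(time)

# Same estimate, but evaluated only from the cutoff onwards.
d_from45 <- density(time, from = 45)

range(time)        # all observations are >= 45
min(d_default$x)   # below 45: the curve has a tail left of the data
min(d_from45$x)    # the grid now starts at 45
```

Note that from only changes where the estimate is evaluated; it does not correct the smoothing at the boundary.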

Related

Create stacked relative change plot

I am looking to create a plot in R that shows the relative change of some variables between two factors. I would like to stack them to reduce redundant text and make it easy to visually compare the changes between the two factors. I would like it to look something like this: http://postimg.org/image/clmw5zj37/,
where the lines (or bars) represent the relative change (y) in each variable (x), a solid circle (or any other symbol) represents no change, and an asterisk indicates that the change is statistically significant. Does anyone have an idea of how to accomplish this in R?
This?
set.seed(1)
df <- data.frame(x=toupper(letters[1:10]),
                 y=rnorm(20,0,50),
                 sig=sample(0:1,20,replace=TRUE),
                 factor=rep(c("Factor1","Factor2"),each=10))
library(ggplot2)
ggplot(df) +
  geom_point(aes(x=x,y=y),shape=1,size=3)+
  geom_linerange(aes(x=x,ymin=0,ymax=y))+
  geom_text(data=df[df$sig==1,], aes(x=x,y=y+10*sign(y)),label="*",size=10)+
  geom_hline(yintercept=0)+
  facet_grid(factor~.)
Note that it is considered polite to provide a representative dataset. See this link for the way to formulate a question well.
Edit In response to OP's comment.
To plot points only when y=0, set data=df[df$y==0,] in the call to geom_point(...). Vertical alignment of the stars can be done using vjust= in the call to geom_text(...). So, this code:
set.seed(1)
df <- data.frame(x=toupper(letters[1:10]),
                 y=rnorm(20,0,50),
                 sig=sample(0:1,20,replace=TRUE),
                 factor=rep(c("Factor1","Factor2"),each=10))
df[sample(1:nrow(df),4),2:3]=0 # add some zeros to example
library(ggplot2)
ggplot(df) +
  geom_point(data=df[df$y==0,],aes(x=x,y=y),size=5)+
  geom_linerange(aes(x=x,ymin=0,ymax=y))+
  geom_text(data=df[df$sig==1,], aes(x=x,y=y+10*sign(y)),
            label="*", size=10, vjust=+0.65)+
  geom_hline(yintercept=0)+
  facet_grid(factor~.)
generates a ggplot with solid points only where y is 0 and the asterisks vertically aligned with the line ends.

What is the difference between geom_line(stat="hline") and geom_hline?

I would like to draw a horizontal line at yintercept = mean(y) for my data (x, y).
geom_line(stat="hline", linetype="dotted", yintercept="mean") works fine, but
geom_hline(linetype="dotted", yintercept="mean") doesn't work for me.
What is the difference between these two functions? I thought geom_hline = geom_line + stat_hline. Is that not the case?
Update
Answered here: Is there any difference between `geom_a(stat="b", ...)` and `stat_b(geom="a",...)`?
geom_hline is just weird!
geom_line works with your original data and is mainly designed to connect points, or otherwise characterise the data. You have instructed it to take the mean of the y-values to create a horizontal line.
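geom_hline is merely an annotation. You need to specify the y-intercept explicitly as a number. It only sees "mean" here as a character string, rather than as a function to use. You would need to write something like the following, where y is the vector of y-values (for a data frame column, e.g. df$y):
geom_hline(linetype="dotted", yintercept=mean(y))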
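As a minimal runnable sketch of this approach, assuming a made-up data frame df with columns x and y (not the asker's data), the mean can be computed from the column and passed to geom_hline as a plain number:

```r
library(ggplot2)

# Hypothetical data frame standing in for the asker's (x, y) data.
set.seed(1)
df <- data.frame(x = 1:20, y = rnorm(20, mean = 10))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  # yintercept is set to a number computed beforehand,
  # not to the character string "mean"
  geom_hline(yintercept = mean(df$y), linetype = "dotted")
p
```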

Geom_ribbon() just turns the graph blank

Hi I got a data frame weekly.mean.values with the following structure:
week:mean:ci.lower:ci.upper
Where week is a factor; mean, ci.lower and ci.upper are numeric. For each week, there is only one mean, and one ci.lower or ci.upper.
I was trying to plot a shaded area inside of the 95% confidence interval around the mean, with the following code:
ggplot(weekly.mean.values,aes(x=week,y=mean)) +
  geom_line() +
  geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper))
The plot, however, came out blank (that is only with x-axis and y-axis present, but no lines, or points, let alone shaded areas).
If I removed the geom_ribbon part, I did get a line. I know that this should be a very simple task but I don't know why I couldn't get geom_ribbon to plot what I wanted. Any hint would be truly appreciated.
I realize this thread is super old, but Google still finds it.
The answer is that you need to set ymin and ymax in terms of the data you are plotting on the y-axis. If you set them to scalar values then the ribbon covers the entire plot from top to bottom.
You can use
ymin=0
ymax=mean
to go from 0 to your y-point or even
ymin=mean-1
ymax=mean+1
to have the ribbon cover a strip encompassing your actual data.
I may be missing something, but the ribbon will be filled with grey20 by default. You are plotting this layer on top of the data, so it is no wonder that it obscures it. It is also possible that the limits for the plot axes derived from the data provided to the initial ggplot() call will not be sufficient to contain the confidence interval ribbon. In that case, I would not be surprised to see a grey/blank plot.
To see if this is the problem, try altering your geom_ribbon() line to:
geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper), alpha = 0.5)
which will plot the ribbon with transparency, which should show the data underneath if the problem is what I think it is.
If so, set the x and y limits to the range of the data +/- the confidence interval you wish to plot and swap the order of the layers (i.e. draw the line on top of the ribbon), and use transparency in the ribbon to show the grid through it.
From ggplot's docs for geom_ribbon (2.1.0):
For each continuous x value, geom_interval displays a y interval. geom_area is a special case of geom_ribbon, where the minimum of the range is fixed to 0.
In this case, x values cannot be factors for geom_ribbon. One solution would be to convert week from a factor to a numeric. e.g.
ggplot(weekly.mean.values,aes(x=as.numeric(week),y=mean)) +
  geom_line() +
  geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper))
geom_line should handle the switch from factor to numeric without incident, although the X axis scale may display differently.
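The suggestions in this thread (numeric x, a semi-transparent ribbon, and the line drawn on top of the ribbon) can be combined in one sketch. The data frame below is a made-up stand-in for weekly.mean.values, with the interval simply set to mean plus or minus 1 for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for weekly.mean.values.
set.seed(1)
weekly.mean.values <- data.frame(
  week = factor(1:10),
  mean = cumsum(rnorm(10))
)
weekly.mean.values$ci.lower <- weekly.mean.values$mean - 1
weekly.mean.values$ci.upper <- weekly.mean.values$mean + 1

# as.numeric() on a factor returns the integer level codes, which is
# fine here because the levels are already in week order.
p <- ggplot(weekly.mean.values, aes(x = as.numeric(week), y = mean)) +
  geom_ribbon(aes(ymin = ci.lower, ymax = ci.upper), alpha = 0.3) +
  geom_line() +   # drawn after the ribbon, so it sits on top
  xlab("week")
p
```

If the factor levels are not in chronological order, the level codes will not be either, so the conversion should be checked before plotting.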

Setting breakpoints for data with scale_fill_brewer() function in ggplot2

I am creating a map (choropleth) as described on the ggplot2 wiki. Everything works like a charm, except that I am running into an issue mapping a continuous value to the polygon fill color via the scale_fill_brewer() function.
This question describes the problem I'm having. As in the answer, my workaround has been to pre-cut my data into bins using the gtools quantcut() function:
UPDATE: This first example is actually the right way to do this
require(gtools) # needed for quantcut()
...
fill_factor <- quantcut(fill_continuous, q=seq(0,1,by=0.25))
ggplot(mydata) +
  aes(long,lat,group=group,fill=fill_factor) +
  geom_polygon() +
  scale_fill_brewer(name="mybins", palette="PuOr")
This works, however, I feel like I should be able to skip the step of pre-cutting my data and do something like this with the breaks option:
ggplot(mydata) +
aes(long,lat,group=group,fill=fill_continuous) +
geom_polygon() +
scale_fill_brewer(names="mybins", palette="PuOr", breaks=quantile(fill_continuous))
But this doesn't work. Instead I get an error something like:
Continuous variable (composite score) supplied to discrete scale_brewer.
Have I misunderstood the purpose of the "breaks" option? Or is breaks broken?
A major issue with pre-cutting continuous data is that there are three pieces of information used at different points in the code:
The Brewer palette -- determines the maximum number of colors available
The number of break points (or the bin width) -- has to be specified with the data
The actual data to be plotted -- influences the choice of the Brewer palette (sequential/diverging)
A true vicious circle. This can be broken by providing a function that accepts the data and the palette, automatically derives the number of break points and returns an object that can be added to the ggplot object. Something along the following lines:
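fill_brewer <- function(fill, palette) {
  require(RColorBrewer)  # provides brewer.pal.info
  # Maximum number of colors the chosen palette offers
  n <- brewer.pal.info$maxcolors[palette == rownames(brewer.pal.info)]
  # Build an unevaluated quantcut() call (from gtools) on the fill variable
  discrete.fill <- call("quantcut", match.call()$fill, q=seq(0, 1, length.out=n))
  # Return a list of components that can be added to a ggplot object
  list(
    do.call(aes, list(fill=discrete.fill)),
    scale_fill_brewer(palette=palette)
  )
}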
Use it like this:
ggplot(mydata) + aes(long,lat,group=group) + geom_polygon() +
  fill_brewer(fill=fill_continuous, palette="PuOr")
As Hadley explains, the breaks option moves the ticks, but does not make the data discrete. Therefore pre-cutting the data, as in the first example in the question, is the right way to use scale_fill_brewer.
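For readers without gtools, the pre-cutting step can be sketched with base R's cut() and quantile() instead of quantcut(). This is a substitute technique rather than the code from the question, and fill_continuous here is simulated:

```r
# Hypothetical continuous values standing in for fill_continuous.
set.seed(1)
fill_continuous <- rnorm(200)

# Quartile break points; include.lowest = TRUE keeps the minimum
# value inside the first bin instead of turning it into NA.
brks <- quantile(fill_continuous, probs = seq(0, 1, by = 0.25))
fill_factor <- cut(fill_continuous, breaks = brks, include.lowest = TRUE)

nlevels(fill_factor)   # number of bins
table(fill_factor)     # counts per bin
```

The resulting factor can then be mapped to fill and used with scale_fill_brewer() as in the first example. If the data contain many tied values, the quantile breaks may not be unique and cut() will fail, so that case needs separate handling.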

Why wont ggplot2 allow me to set a size for each individual point?

I've got a scatter plot. I'd like to scale the size of each point by its frequency. So I've got a frequency column of the same length. However, if I do:
... + geom_point(size=Freq)
I get this error:
When _setting_ aesthetics, they may only take one value. Problems: size
which I interpret as all points can only have 1 size. So how would I do what I want?
Update: data is here
The basic code I used is:
dcount=read.csv(file="New_data.csv",header=T)
ggplot(dcount,aes(x=Time,y=Counts)) + geom_point(aes(size=Freq))
Have you tried:
+ geom_point(aes(size = Freq))
Aesthetics are mapped to variables in the data with the aes function. Check out http://had.co.nz/ggplot2/geom_point.html
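The difference between setting and mapping an aesthetic can be seen in a small sketch. The dcount data frame below is made up to stand in for New_data.csv, which is not available here:

```r
library(ggplot2)

# Hypothetical stand-in for the asker's New_data.csv.
dcount <- data.frame(
  Time   = c(1, 2, 3, 4, 5),
  Counts = c(10, 12, 9, 15, 11),
  Freq   = c(1, 3, 2, 8, 5)
)

# Setting: one constant size for every point (outside aes()).
p_set <- ggplot(dcount, aes(x = Time, y = Counts)) +
  geom_point(size = 3)

# Mapping: size is taken from the Freq column (inside aes()).
p_map <- ggplot(dcount, aes(x = Time, y = Counts)) +
  geom_point(aes(size = Freq))
p_map
```

Outside aes(), size must be a single value and Freq would be looked up as an ordinary R object; inside aes(), Freq is looked up as a column of dcount and run through the size scale.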
OK, this might be what you're looking for. The code you provided above aggregates the information into four categories in the legend. If you don't want that, you can specify the categories with scale_size_manual().
sizes <- unique(dcount$Freq)
names(sizes) <- as.character(unique(dcount$Freq))
ggplot(dcount,aes(x=Time,y=Counts)) + geom_point(aes(size=as.factor(Freq))) + scale_size_manual(values = sizes/2)
If the code gd047 gave doesn't work, I'd double check that your Freq column is actually called Freq and that your workspace doesn't have some other object called Freq. Other than that, the code should work. How do you know that the scale has nothing to do with the frequency?
