I'm working with a dataset where I have to transform some data for a curve fit. I'm plotting it using ggplot2, and can use stat_smooth on the transformed data to get the fit, but then want to overlay the result on the correct datapoints.
As a toy example, let's say I had
qplot(1:10, 1:10)+stat_smooth(formula=y+1~x, method="lm")
But I want to shift the stat_smooth line down by one (other than by taking the +1 out of the formula). Is this possible?
I don't think position_nudge() was available when this was asked 10.5 years ago, but it has provided a simpler way of doing this for some time now (written as of ggplot2 3.3.5, late 2021).
qplot(1:10, 1:10 + rnorm(10, sd = 0.3)) + stat_smooth(formula = y~x, method = "lm", position = position_nudge(y = 1))
It seems worth cautioning there's a good chance of displaying confusing or misleading confidence intervals when manipulating stat_smooth()'s formula. I've added a bit of variation to qplot()'s input in the line above to illustrate this.
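For the original toy example, the same mechanism gives the shift the question actually asked about: keep the transformed formula and nudge the drawn fit (line and ribbon) back down by one. A minimal sketch:

library(ggplot2)

# fit on the transformed response, then slide the whole smooth down by one
# so it overlays the untransformed points
qplot(1:10, 1:10) +
  stat_smooth(formula = y + 1 ~ x, method = "lm",
              position = position_nudge(y = -1))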
Sometimes things can be very obvious:
qplot(1:10, 1:10)+stat_smooth(formula=(y+1)-1~x, method="lm")
If you can raise it 1 by adding 1 to y, you can lower it 1 by subtracting 1 from y. ;-)
Related
Is there a parameter that can anchor the start and end points for a loess geom_smooth regression? If I increase the span (so that the regression isn't too wiggly), the starting and ending points seem to be drastically different (I have multiple lines on a graph, using as.factor), when in reality they are quite close together. I can't share my data as it is for confidential academic research, and I'm not sure how to reproduce an example for this... just wondering if this is possible with ggplot.
Here are some pictures that illustrate the problem, though...
Low span (span = 0.1), with just the first 10 of the 750 points graphed; here you can see the true starting points:
And then with the high span (span = 1.0) and all 750 points, the starting and ending values are completely different. I'm not sure why this happens, but it is very misleading:
Basically, I want the smoothness of the second picture, but the specific and accurate starting points of the first when I graph all of the data (i.e., all 750 points). Let me know if there's any way to do this. Thanks for all your help.
Without seeing your code, I can already tell that you're setting your axis limits for the "span = 1.0" version using xlim(0,10) or scale_x_continuous(limits=c(0,10)) - is that correct? Change it to the following:
coord_cartesian(xlim = c(0, 10))
This is because xlim() (which is just a wrapper for scale_x_continuous(limits=...)) does not just zoom in on your data, but in fact discards any of the data outside of those limits before performing any calculations. Check the documentation on xlim() and the documentation on coord_cartesian() for more info.
It's easy to see how this is working using the following example:
library(ggplot2)

# create dataset
set.seed(8675309)
df <- data.frame(x=1:1000, y=rnorm(1000))
# basic plot
p <- ggplot(df, aes(x,y)) + theme_bw() +
geom_point(color='gray75', size=1) + geom_smooth()
p
We get a basic plot, and as we would expect, geom_smooth() on this dataset produces an essentially flat line at y = 0.
If we use xlim() or scale_x_continuous(limits=...) to see the first 10 points, you see that the geom_smooth() line is not the same:
p + xlim(0,10)
# or this one... results in the same plot
p + scale_x_continuous(limits=c(0,10))
The resulting line has a much wider confidence band and sits a bit above y = 0, since the first 10 points happen to be just a bit above the average of the remaining 990 points. If you use coord_cartesian(xlim=...), the zooming in happens after the calculations are made and no points are discarded, giving you the same points plotted, but a geom_smooth() line that matches the one for the full dataset:
p + coord_cartesian(xlim=c(0,10))
After many Google searches I decided to ask for your help, guys.
I am plotting just some observations at different time points and I want to add a linear regression with stat_smooth. However, I want the linear model with the intercept at 100 (because the data are percentages relative to time 0). To do that, I found that the easiest way is to use the offset parameter in lm. The problem is how to get the number of 'y' observations per group (colour and facet groups) to pass to the offset parameter.
If I use data with the same number of observations per group (10 in my case), I can just write the number and it works great:
myplot <- ggplot(mydt2, aes(x=Time_point, y=GFP_rel, col=Gene, fill=Gene,group=Gene))
myplot <- myplot + stat_smooth(method='lm', formula = y ~ x + 0, method.args=list(offset=rep(100,10))) +
facet_wrap(~Cell_line)
However, this is not very elegant and/or flexible. My question is: how can I pass the number of observations to method.args? I tried offset(100,..count..), but I get the error: (list) object cannot be coerced to type 'integer'.
Any suggestions?
Thanks
You can use the I(y - 100) coding in the formula as shown here instead of using an offset.
However, the predicted values for stat_smooth will then be predictions for y - 100, not y. This line will go through 0. You can move the lines back to the position to display predictions of the original y variable using position_nudge.
So the stat_smooth code would look something like
stat_smooth(method = "lm", formula = I(y - 100) ~ x + 0,
position = position_nudge(y = 100))
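Dropped into the plotting code from the question (same column names, untested against the real data, so treat it as a sketch), that would look something like:

library(ggplot2)

myplot <- ggplot(mydt2, aes(x = Time_point, y = GFP_rel,
                            col = Gene, fill = Gene, group = Gene)) +
  stat_smooth(method = "lm", formula = I(y - 100) ~ x + 0,
              position = position_nudge(y = 100)) +
  facet_wrap(~Cell_line)
myplot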
I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was calculating heat maps showing where the animal is most of the time and/or to identify areas of particular interest. Yet, the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
library(ggplot2)
library(data.table)

set.seed(4)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))+
  stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")+
  coord_equal(xlim=c(0,600),ylim=c(0,600))+
  expand_limits(x=c(0,600),y=c(0,600))+
  geom_path()
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))+
  stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")+
  coord_equal(xlim=c(0,600),ylim=c(0,600))+
  expand_limits(x=c(0,600),y=c(0,600))+
  geom_path()
which looks like this:
Here there are regions without a density estimate but with actual data (around x:100, y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my questions are: (i) why does this happen, and (ii) how can it be avoided or adjusted, if possible?
This may be helpful. I am not that familiar with stat_density2d, but after looking at your code and the ggplot2 documentation (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I suspected ..level.. might not be the right variable, so I tried ..density.. instead. Someone else may be able to explain exactly why you need density; in the meantime, I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +
geom_path() +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600))
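On part (ii) of the question (how to adjust): if you'd rather keep the filled-contour ..level.. style, the patchiness is largely a bandwidth issue. stat_density2d passes its h argument on to MASS::kde2d as the bandwidth, so a wider value smooths out the isolated islands; the c(300, 300) below is only an illustrative guess, not a recommendation:

# a sketch: same contour/polygon plot as in the question, but with a
# deliberately wide (hypothetical) bandwidth so the estimate is less patchy
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..level.., alpha = ..level..),
                 geom = "polygon", bins = 50, h = c(300, 300)) +
  geom_path() +
  coord_equal(xlim = c(0, 600), ylim = c(0, 600)) +
  expand_limits(x = c(0, 600), y = c(0, 600))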
I'd like to plot data such that on the y axis there is probability (in the range [0,1]) and on the x axis I have the data values. The data is continuous (also in the range [0,1]), so I'd like to use some kernel density estimation function and normalize it such that the y value at some point x would mean the probability of seeing the value x in the input data.
So, I'd like to ask:
a) Is this reasonable at all? I understand that I cannot have a probability for values that do not appear in the data, but I would just like to interpolate between the points I have using a kernel density estimate and normalize it afterwards.
b) Are there any built-in options in ggplot, for example overriding the default behaviour of geom_density(), that I could use for this?
Thanks in advance,
Timo
EDIT:
When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks, guys, for clearing this up.
Just a quick merge of @JD Long's and @yesterday's answers:
ggplot(df, aes(x=x)) +
geom_histogram(aes(y = ..density..), binwidth=density(df$x)$bw) +
geom_density(fill="red", alpha = 0.2) +
theme_bw() +
xlab('') +
ylab('')
This way the binwidth for ggplot2 is calculated by the density function, and the latter is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density as @yesterday suggested for further customization.
This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)
Start with some synthetic data:
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")
The density function has a reasonably smart bandwidth calculator which you can borrow from:
bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)
Then use that bandwidth as the SD to make some random noise:
noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")
Hey look! A kernel smoothed histogram!
I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
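If you do want the same trick inside ggplot rather than base graphics, the resampled-plus-noise vector can go straight into geom_histogram; a quick sketch reusing the objects created above:

library(ggplot2)

# histogram of the bootstrapped + jittered values, with the original
# kernel density estimate overlaid in red for comparison
ggplot(data.frame(x = resample + noise), aes(x = x)) +
  geom_histogram(aes(y = ..density..), binwidth = bw) +
  geom_density(data = data.frame(x = randomData), colour = "red")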
You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().
See the on-line user manual: http://had.co.nz/ggplot2/stat_density.html
You can specify any of the kernel estimation functions that are supported by stats::density():
library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
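The bandwidth can be tuned the same way: adjust is a multiplier on the automatically chosen bandwidth (the 0.5 below is only an illustrative value):

# same data, biweight kernel, bandwidth halved via the adjust multiplier
ggplot(df, aes(x = x)) + stat_density(kernel = "biweight", adjust = 0.5)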
I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution Computing presentation at OSCON.
library(plyr)
library(ggplot2)

birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
sum(sapply(seq(ntests), anydup))/ntests
}
x<-data.frame(x=rep(1:100, each=5))
x<-ddply(x, .(x), function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
geom_point()+geom_smooth()+
theme_bw()+
ggtitle("Probability that at least two people share a birthday in a random group")+
labs(x="Size of Group", y="Probability")
Here my graph is what I would describe as exponential, but the geom_smooth doesn't fit the data particularly well. I've tried the loess method but this didn't change things much. Can anyone suggest how to add a better smooth?
Thanks
Paul.
The smoothing routine does not react fast enough to the sudden change at low values of x (and it has no way of knowing that the values of prob are restricted to the 0-1 range). Since you have such low variability, a quick solution is to reduce the span of values over which smoothing at each point is done. Check out the red line in this plot:
birthdayplot + geom_smooth(span=0.1, colour="red")
The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x <- ddply(x, .(x), function(df) birthday(df$x))
Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.
ggplot(x, aes(n, Dups)) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_smooth(method = "glm", method.args = list(family = binomial))