I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution Computing presentation at OSCON.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
sum(sapply(seq(ntests), anydup))/ntests
}
library(plyr)    # provides ddply() and .()
library(ggplot2)
x<-data.frame(x=rep(1:100, each=5))
x<-ddply(x, .(x), function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
geom_point()+geom_smooth()+
theme_bw()+
ggtitle("Probability that at least two people share a birthday in a random group")+
labs(x="Size of Group", y="Probability")
Here my graph is what I would describe as exponential, but geom_smooth doesn't fit the data particularly well. I've tried the loess method, but this didn't change things much. Can anyone suggest how to add a better smooth?
Thanks
Paul.
The smoothing routine does not react fast enough to the sudden change at low values of x (and it has no way of knowing that the values of prob are restricted to the 0-1 range). Since your data have so little variability, a quick solution is to reduce the span, i.e. the fraction of the points used for the local fit at each x. Check out the red line in this plot:
birthdayplot + geom_smooth(span=0.1, colour="red")
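To see why a narrower span helps, here is a minimal base-R sketch (not part of the original answer). It swaps the simulated probabilities for the exact birthday probabilities and compares two loess fits. The names exact_birthday, fit_default and fit_narrow are made up for illustration; 0.75 is the default span of loess().

```r
# Exact probability that at least two of n people share a birthday
exact_birthday <- function(n) 1 - prod((365 - seq_len(n) + 1) / 365)

d <- data.frame(n = 1:100)
d$prob <- sapply(d$n, exact_birthday)

# loess with the default span (0.75) and with a narrow span (0.1)
fit_default <- loess(prob ~ n, data = d, span = 0.75)
fit_narrow  <- loess(prob ~ n, data = d, span = 0.1)

# Largest absolute deviation of each smooth from the true curve
err_default <- max(abs(residuals(fit_default)))
err_narrow  <- max(abs(residuals(fit_narrow)))
c(default = err_default, narrow = err_narrow)
```

The narrow span follows the steep rise much more closely. With noisier data the trade-off would be a wigglier line, so the span is worth tuning by eye.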
The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x<-ddply(x, .(x),function(df) birthday(df$x))
Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.
ggplot(x, aes(n, Dups)) +
stat_summary(fun = mean, geom = "point") +
stat_smooth(method = "glm", method.args = list(family = binomial))
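For reference, the fit that the smoother performs can be sketched outside ggplot2 with glm(). This is an illustrative stand-alone version with a smaller simulation (200 groups per size, sizes 2 to 60); any_dup, sim and grid are names made up for the sketch, not part of the answer's code.

```r
set.seed(1)

# Simulate one 0/1 outcome: does a group of n people contain a shared birthday?
any_dup <- function(n) as.integer(any(duplicated(sample(365, n, replace = TRUE))))

# 200 simulated groups for each group size from 2 to 60
sim <- data.frame(n = rep(2:60, each = 200))
sim$Dups <- sapply(sim$n, any_dup)

# Logistic regression on the raw 0/1 outcomes
fit <- glm(Dups ~ n, family = binomial, data = sim)

# Fitted probabilities on a grid of group sizes; these always lie in [0, 1]
grid <- data.frame(n = 2:60)
grid$prob <- predict(fit, newdata = grid, type = "response")
head(grid)
```

Because the model works on the logit scale, the fitted curve cannot leave the 0-1 range, which is the property the loess smooth lacks.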
INTRO: I am new to R and to Stack Overflow... I am doing a term paper and need to run some stats on how, or better, when habits develop.
Ideally, habit formation follows Mitscherlich’s law and looks like a non-linear regression curve approaching an asymptote. Once a participant reaches their plateau (defined as 95% of the asymptote), one can speak of an established habit... Well, actually that is debatable... BUT we are replicating a study by Lally et al. 2010 (How habits are formed: Modelling habit formation in the real world), so we have to stick to certain criteria.
ACTUAL QUESTION: The first step is to obtain the R2 for linear and non-linear regression. This I managed.
But for some reason I just cannot manage to obtain the x-axis value for the intersection (orange point in the picture) of the non-linear function and my 95% habit plateau line (purple line in the picture)...
Here is an example of what an ideal graph looks like
But exactly this x value is crucial for group comparison and, later on, for checking for significant differences...
Of course I already googled, but somehow I could not make sense of the solutions presented for other, similar questions... It seems one cannot do it in ggplot with geom_point() and therefore has to compute it separately using the approx() function, right?
Maybe someone can help me out... Tks in advance!
And here is the code of interest...
library(ggplot2)
library(patchwork)
library(stats)
days <- 0:14 # 15 days, one per score below (0:15 gave 16 values, so data.frame() failed)
score <- c(14,17,16,22,24,27,30,31,32,35,40,43,42,43,43)
df <- data.frame(days,score)
# red curve in graph
#This way the R2 for the nonlinear regression is obtained for later analysis
nonlinreg1 <- nls(score ~ SSasymp(days, Asym, R0, lrc), data = df)
summary(nonlinreg1)
RSS <- sum(residuals(nonlinreg1)^2)
TSS <- sum((df$score - mean(df$score))^2)
R.square.nonlinreg1 <- 1 - (RSS/TSS)
R.square.nonlinreg1
# purple line in graph
#Definition of plateau at 95% of asymptote reached
Asymp95 <- summary(nonlinreg1)$parameters[1,1] * 0.95
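# fitted values of the non-linear model at the observed days
nls_line <- predict(nonlinreg1)
# interpolate the day at which the fitted curve crosses the 95% plateau
# (approx() returns NA if Asymp95 lies outside the range of nls_line)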
HabitReach95 <- approx(nls_line, df$days, xout = Asymp95)$y
# Now in GGplot
ggplot(data=df,aes(x=days, y=score)) +
geom_point()+
#HERE now from this intersect below, I would like to obtain the exact X-value
geom_point(x = HabitReach95, y= Asymp95, aes(color="Intersect"), lwd=2) +
#this is the rest of ggplot code but I think it is not of interest for the described problem, but still just in case...
geom_smooth(method=lm, aes(color="Linear Reg"), se=F) +
geom_smooth(method="nls", formula=y~SSasymp(x, Asym, R0, lrc), aes(color="Non-Linear Reg"), se=F) +
geom_hline(aes(color="Asymptote for non-linear Reg", yintercept=summary(nonlinreg1)$parameters[1,1])) +
geom_hline(aes(color="Habit plateau at 95%", yintercept=Asymp95)) +
xlab("Days of Experiment") + ylab("Automaticity Score Habit") +
ggtitle("Test graph for participant") +
theme(plot.title = element_text(hjust = 0.5))+
#ylim(0,49)+ # Actual graph or scale for experiment
scale_color_manual(values = c("green", "purple", "orange", "blue", "red"), name="Legend")
OMg I am so stupid... I already calculate it with this line here!!!
HabitReach95 <- approx(nls_line, df$days, xout = Asymp95)$y
Haha can not believe it... well sometimes you don't see the forest for the trees!
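As a cross-check on the approx() result, the crossing can also be computed directly from the model equation, since SSasymp() describes Asym + (R0 - Asym) * exp(-exp(lrc) * days). The sketch below uses hypothetical parameter values in place of coef(nonlinreg1); habit_curve, target and days95 are names invented for the illustration.

```r
# Hypothetical parameter values, standing in for coef(nonlinreg1)
Asym <- 50          # horizontal asymptote
R0   <- 14          # response at days = 0
lrc  <- log(0.15)   # natural log of the rate constant

# The curve that SSasymp() describes
habit_curve <- function(days) Asym + (R0 - Asym) * exp(-exp(lrc) * days)

# Solve habit_curve(days) = 0.95 * Asym for days in closed form
target <- 0.95 * Asym
days95 <- -log((Asym - target) / (Asym - R0)) / exp(lrc)

# The same crossing found numerically, as a cross-check
days95_numeric <- uniroot(function(d) habit_curve(d) - target, c(0, 1000))$root

c(closed_form = days95, numeric = days95_numeric)
```

Unlike interpolating between fitted values, the closed form still returns a value when the curve only reaches the plateau after the last observed day; that value is then an extrapolation and should be treated with care.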
I am trying to add a linear regression to data plotted with ggplot; however, due to the nature of my data, I need to plot it such that the response variable in the linear regression is on the x-axis, not the y-axis. Is there a way to change how the regression is done (I tried changing "formula = y~x" to "formula = x~y", but no luck), maybe by specifying a mapping different from the one specified by the plot? Or is there an easy way to invert the plot after I have added the regression? Thanks! Any help is appreciated.
One straightforward way (which you suggested) would be to make the plot with y and x reversed, and then "inverting" the final plot. I used heavily right skewed "noise" so the example really makes it clear what is being fit.
library(tidyverse)
set.seed(42)
foo <- tibble(x = 1:100, y = 2 + 0.5*x + 3*rchisq(100, 3))
foo %>%
ggplot(aes(x=y, y=x)) + geom_point() + stat_smooth(method = "lm") + coord_flip()
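Why the direction of the regression matters can be shown with lm() alone. This is a small sketch on the same kind of simulated data; fit_yx and fit_xy are illustrative names. Regressing x on y does not give the same line as regressing y on x, which is why the axes have to be swapped before the smooth is fitted.

```r
set.seed(42)
x <- 1:100
y <- 2 + 0.5 * x + 3 * rchisq(100, 3)

fit_yx <- lm(y ~ x)   # y is the response (the usual smoothing direction)
fit_xy <- lm(x ~ y)   # x is the response (what the question asks for)

# Slope of each line expressed as "change in y per unit x" on a y-vs-x plot
slope_yx <- coef(fit_yx)[["x"]]
slope_xy_as_dy_dx <- 1 / coef(fit_xy)[["y"]]

c(y_on_x = slope_yx, x_on_y = slope_xy_as_dy_dx)
```

The two slopes coincide only when the correlation is perfect; in general their ratio equals the squared correlation between x and y.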
I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems to be able to do.
However, I can't seem to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
geom_smooth(se=FALSE, span=0.3) +
scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.
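If a Gaussian kernel specifically is wanted for the already-binned values, one option in base R is ksmooth(), which computes a kernel-weighted average (a Nadaraya-Watson smoother rather than a plain convolution). The sketch below uses made-up binned frequencies in place of a column of the CSV; bins, value and roughness are illustrative names, and the bandwidth of 0.1 is an arbitrary starting point.

```r
set.seed(1)

# Made-up binned frequencies standing in for one column of the CSV
bins  <- seq(0, 2, by = 0.02)
value <- dnorm(bins, mean = 1, sd = 0.3) + rnorm(length(bins), sd = 0.1)

# Kernel-weighted average with a Gaussian ("normal") kernel,
# evaluated at the original bin positions
smoothed <- ksmooth(bins, value, kernel = "normal",
                    bandwidth = 0.1, x.points = bins)

# Sum of squared second differences as a crude measure of wiggliness
roughness <- function(v) sum(diff(v, differences = 2)^2)
c(raw = roughness(value), smoothed = roughness(smoothed$y))
```

The smoothed$x and smoothed$y vectors can be put into a data frame and drawn with geom_line(); a larger bandwidth gives a smoother curve.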
I'm working with a dataset where I have to transform some data for a curve fit. I'm plotting it using ggplot2, and can use stat_smooth on the transformed data to get the fit, but then want to overlay the result on the correct datapoints.
As a toy example, let's say I had
qplot(1:10, 1:10)+stat_smooth(formula=y+1~x, method="lm")
But I want to shift the stat_smooth line down by one (other than by taking the +1 out of the formula). Is this possible?
I don't think position_nudge() was available when this was asked 10.5 years ago, but it has provided a simpler way of doing this for some time (as of ggplot2 3.3.5, late 2021).
qplot(1:10, 1:10 + rnorm(10, sd = 0.3)) + stat_smooth(formula = y~x, method = "lm", position = position_nudge(y = 1))
It seems worth cautioning there's a good chance of displaying confusing or misleading confidence intervals when manipulating stat_smooth()'s formula. I've added a bit of variation to qplot()'s input in the line above to illustrate this.
Sometimes things can be very obvious :
qplot(1:10, 1:10)+stat_smooth(formula=(y+1)-1~x, method="lm")
If you can raise it 1 by adding 1 to y, you can lower it 1 by subtracting 1 from y. ;-)
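The same idea can be checked with lm() directly. In this minimal sketch on toy data (fit_shifted and fit_plain are illustrative names), adding a constant to the response moves only the intercept, so subtracting it from the predictions recovers the original line.

```r
set.seed(1)
x <- 1:10
y <- x + rnorm(10, sd = 0.3)

fit_shifted <- lm(I(y + 1) ~ x)   # response shifted up by one
fit_plain   <- lm(y ~ x)

# Shifting the response moves only the intercept, so subtracting 1 from the
# shifted predictions recovers the unshifted line
pred_back <- predict(fit_shifted) - 1
all.equal(unname(pred_back), unname(predict(fit_plain)))
```

This only holds for a constant shift. For a non-linear transformation of y, the fitted values would have to be back-transformed with the inverse function instead.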
I'd like to plot data such that the y-axis shows probability (in the range [0,1]) and the x-axis shows the data values. The data is continuous (also in the range [0,1]), so I'd like to use a kernel density estimation function and normalize it such that the y-value at some point x would mean the probability of seeing value x in the input data.
So, I'd like to ask:
a) Is it reasonable at all? I understand that I cannot have probability of seeing values I do not have in the data, but I just would like to interpolate between points I have using a kernel density estimation function and normalize it afterwards.
b) Are there any built-in options in ggplot I could use, that would override default behavior of geom_density() for example for doing this?
Thanks in advance,
Timo
EDIT:
When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks guys for clearing up my mind about this.
Just making up a quick merge of @JD Long's and @yesterday's answers:
ggplot(df, aes(x=x)) +
geom_histogram(aes(y = ..density..), binwidth=density(df$x)$bw) +
geom_density(fill="red", alpha = 0.2) +
theme_bw() +
xlab('') +
ylab('')
This way the binwidth for ggplot2 is calculated by the density function, and the density is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.
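On the "normalize" versus "scale" point from the question, a small base-R sketch on made-up data in [0, 1] may help. The heights of a density estimate are not probabilities and can exceed 1; it is the area under the curve that equals 1. Dividing by the maximum gives a curve whose peak is 1, which is what "scaled" usually means. The variable names are illustrative.

```r
set.seed(1)
x <- rbeta(1000, 2, 5)          # made-up data in [0, 1]

d <- density(x)                 # Gaussian kernel by default

# Area under the estimated density (Riemann sum on the evenly spaced grid)
area <- sum(d$y) * diff(d$x)[1]

# Heights are densities, not probabilities, and can exceed 1
peak <- max(d$y)

# Rescale so the highest point equals 1 ("scale" rather than "normalize")
scaled <- d$y / max(d$y)

c(area = area, peak = peak, max_scaled = max(scaled))
```

In ggplot2, stat_density() also computes a scaled variable of this kind, which can be mapped to y instead of the density (via after_stat(scaled) in recent versions).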
This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)
start with some synthetic data:
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")
The density function has a reasonably smart bandwidth calculator which you can borrow from:
bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)
Then use the bandwidth calc as the SD to make some random noise
noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")
Hey look! A kernel smoothed histogram!
I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().
See the on-line user manual: http://had.co.nz/ggplot2/stat_density.html
You can specify any of the kernel estimation functions that are supported by stats::density():
library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")