I am using a model to predict some numbers. My prediction also includes a confidence interval for each number. I need to plot the actual numbers + predicted numbers and their quantile values on the same plot. Here is a simple example:
actualVals = c(12,20,15,30)
lowQuantiles = c(19,15,12,18)
midQuantiles = c(22,22,17,25)
highQuantiles = c(30,25,25,30)
and I'm looking for something like this, perhaps by using ggplot():
You can use geom_errorbar, among other geoms listed at ?geom_errorbar. I created a data.frame, dat, from your variables and added an index column with dat$x <- 1:4.
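library(ggplot2)
# Collect the question's vectors into one data frame, with x as the index
dat <- data.frame(x = 1:4, actualVals, lowQuantiles, midQuantiles, highQuantiles)
ggplot(dat) +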
geom_errorbar(aes(x, y=midQuantiles, ymax=highQuantiles, ymin=lowQuantiles, width=0.2), lwd=2, color="blue") +
geom_point(aes(x, midQuantiles), cex=4, shape=22, fill="grey", color="black") +
geom_line(aes(x, actualVals), color="maroon", lwd=2) +
geom_point(aes(x, actualVals), shape=21, cex=4, fill="white", color='maroon') +
ylim(0, 30) +
theme_bw()
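If a shaded band reads better than error bars, the same data can probably be drawn with geom_ribbon. This is only a sketch of an alternative, not part of the answer above; the object name p_ribbon and the colour choices are my own assumptions.

```r
library(ggplot2)

# Rebuild the example data from the question; `dat` and `x` follow the answer above.
dat <- data.frame(
  x = 1:4,
  actualVals = c(12, 20, 15, 30),
  lowQuantiles = c(19, 15, 12, 18),
  midQuantiles = c(22, 22, 17, 25),
  highQuantiles = c(30, 25, 25, 30)
)

p_ribbon <- ggplot(dat, aes(x)) +
  # Shaded band between the low and high quantiles
  geom_ribbon(aes(ymin = lowQuantiles, ymax = highQuantiles), fill = "blue", alpha = 0.2) +
  # Predicted (middle) quantile as a dashed line
  geom_line(aes(y = midQuantiles), colour = "blue", linetype = "dashed") +
  # Actual values as a solid line with open points
  geom_line(aes(y = actualVals), colour = "maroon") +
  geom_point(aes(y = actualVals), shape = 21, fill = "white", colour = "maroon", size = 3) +
  theme_bw()
p_ribbon
```

A ribbon implies the quantiles vary continuously between the x positions, so error bars may still be the better choice when the four predictions are unrelated.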
I created the plot below using:
ggplot(data_all, aes(x = Speed, fill = Season)) +
theme_bw() +
geom_histogram(position = "identity", alpha = 0.2, binwidth=0.1)
As you can see, the difference in the amount of data available is very large. Is there a way to look only at the distribution and not at the total data amount?
You can reference other values calculated by the stat functions using a notation you may have seen before: ..value.. (the variable name wrapped in two dots on each side). I'm not sure of the proper name for these, but they are sometimes called "special variables" or "calculated aesthetics"; the help page of each stat lists them under "Computed variables".
In this case, the default calculated aesthetic on the y axis for geom_histogram() is ..count.., the number of observations per bin. When comparing distributions with very different total sample sizes, ..density.. is more useful. You can access it by mapping it to the y aesthetic directly inside geom_histogram().
First, here's an example of two histograms with vastly different sizes (similar to OP's question):
library(ggplot2)
set.seed(8675309)
df <- data.frame(
x = c(rnorm(1000, -1, 0.5), rnorm(100000, 3, 1)),
group = c(rep("A", 1000), rep("B", 100000))
)
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
alpha=0.2, color='gray80',
position="identity", bins=80)
And here's the same plot using ..density..:
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
aes(y=..density..), alpha=0.2, color='gray80',
position="identity", bins=80)
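In newer ggplot2 releases (3.3.0 and later, as far as I know) the same mapping can be written with after_stat(), and the ..density.. spelling is deprecated from 3.4.0 on. Here is a sketch of the equivalent plot plus a quick check that each group's bars cover an area of about 1. The names df_hist, p_density, bars and area_by_group are mine.

```r
library(ggplot2)

set.seed(8675309)
df_hist <- data.frame(
  x = c(rnorm(1000, -1, 0.5), rnorm(100000, 3, 1)),
  group = c(rep("A", 1000), rep("B", 100000))
)

p_density <- ggplot(df_hist, aes(x, fill = group)) +
  theme_classic() +
  geom_histogram(
    aes(y = after_stat(density)),  # same quantity as ..density..
    alpha = 0.2, colour = "gray80",
    position = "identity", bins = 80
  )

# Pull the computed bars and sum density * bin width within each group.
bars <- layer_data(p_density, 1)
area_by_group <- tapply(bars$density * (bars$xmax - bars$xmin), bars$group, sum)
area_by_group
```

Because both histograms are scaled to unit area, their shapes can be compared directly even though one group has 100 times more observations.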
A previous post describes how to draw red circles around points which exceed a given value in ggplot. I would like to do the same for anomaly detection results, but instead have the circles drawn around points belonging to a given factor level.
How could I change this code to allow circles to be drawn around a given factor level?
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_point(data=mtcars[mtcars$mpg>30,],
pch=21, fill=NA, size=4, colour="red", stroke=1) +
theme_bw()
All you need is to first plot all points and then plot only the circles for the data reduced to the factor levels you want to highlight. Does this solve your problem?
ggplot() +
geom_point(data=iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(data=iris[iris$Species %in% c("setosa"),], aes(Sepal.Length, Sepal.Width),
pch=21, fill=NA, size=4, colour="red", stroke=1) +
theme_bw()
Please note that I changed the dataset, as I needed a factor in the data to show you how it works.
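If you would rather not repeat the data frame name and the aesthetics in the second layer, the data argument of a layer also accepts a function that receives the plot data. A sketch of that idea follows; keep_levels is a hypothetical helper I made up for illustration, not something ggplot2 provides.

```r
library(ggplot2)

# Hypothetical helper: returns a function that keeps the rows whose
# `column` value is one of `levels`.
keep_levels <- function(column, levels) {
  function(d) d[d[[column]] %in% levels, , drop = FALSE]
}

p_circled <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  # The function is applied to the plot data (iris), so aesthetics are inherited.
  geom_point(
    data = keep_levels("Species", "setosa"),
    pch = 21, fill = NA, size = 4, colour = "red", stroke = 1
  ) +
  theme_bw()
p_circled
```

Passing c("setosa", "virginica") as the second argument would circle two levels at once.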
Let's suppose that the "factor level" you are interested in is the value 10.4 for mtcars$mpg. mtcars$mpg is a numerical vector, so you first have to convert it into a factor.
mtcars$mpg <- as.factor(mtcars$mpg)
Then you can use the same code you used previously for values greater than a limit, except that this time the condition is to belong to the factor level 10.4:
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_point(data=mtcars[mtcars$mpg %in% 10.4, ],
pch=21, fill=NA, size=4, colour="red", stroke=1) +
theme_bw()
Note that the conversion of mtcars$mpg to factor is not necessary and that the code will run on the numerical vector in the same way. I converted it since your question was about "factor level".
Note also that if you are not dealing with factor levels but simply with values matching a certain number, you can use:
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_point(data=mtcars[mtcars$mpg == 10.4, ],
pch=21, fill=NA, size=4, colour="red", stroke=1) +
theme_bw()
since you are now only testing for equality and not for membership.
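The difference between the two operators matters as soon as you want to match more than one value. A small sketch, assuming a second target value of 33.9 that I picked only for illustration:

```r
# Use the packaged copy so an earlier as.factor() conversion does not interfere.
mpg <- datasets::mtcars$mpg
targets <- c(10.4, 33.9)

# Membership: TRUE wherever mpg equals any of the targets.
in_set <- mpg %in% targets

# Equality: `targets` is recycled along mpg, so each element is compared
# with only one of the two targets, depending on its position.
eq_recycled <- mpg == targets

sum(in_set)
sum(eq_recycled)
```

With a single value both forms select the same rows, but with several values == can silently miss matches, so %in% is the safer choice for subsetting.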
I recently tried to use the above methods to highlight a subset of points with a factored axis. Unfortunately, the inclusion of the second subset geom_point call seemed to reorder the axis. I was able to avoid this problem by using the gghighlight package.
library(ggplot2)
library(gghighlight)
ggplot(mtcars, aes(x = cyl, y = mpg, color = as.factor(carb))) +
geom_point() +
gghighlight(carb == 2, use_direct_label = FALSE, unhighlighted_colour = NULL) +
geom_point(pch=21, fill=NA, size=4, colour="black", stroke=0.5)
So I have a three column data frame that has Trials, Ind. Variable, Observation. Something like:
df1<- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(1:50))
I am trying to plot a 95% confidence interval around the mean for each trial using a rather inefficient method, as follows:
b<-NULL
b$mean<- aggregate(Observation~Variable, data=df1,mean)[,2]
b$sd <- aggregate(Observation~Variable, data=df1,sd)[,2]
b$Variable<- df1$Variable
b$Observation <- df1$Observation
b$ucl <- rep(qnorm(.975, mean=b$mean, sd=b$sd), each=10)
b$lcl <- rep(qnorm(.025, mean=b$mean, sd=b$sd), each=10)
b<- as.data.frame(b)
c <- ggplot(b, aes(Variable, Observation))
c + geom_point(color="red") +
geom_smooth(aes(ymin = lcl, ymax = ucl), data=b, stat="summary", fun.y="mean")
This is inefficient since it duplicates values for ymin, ymax. I've seen the geom_ribbon methods but I would still need to duplicate. However, if I was using any kind of smoothing like glm, the code is much simpler with no duplication. Is there a better way of doing this?
References:
1. R Plotting confidence bands with ggplot
2. Shading confidence intervals manually with ggplot2
3. http://docs.ggplot2.org/current/geom_smooth.html
With this method, I get the same output as with your method. This was inspired by the docs for ggplot. Again, this will be meaningful so long as each x value has multiple points.
set.seed(1)
df1 <- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(1:50))
my_ci <- function(x) data.frame(
y=mean(x),
ymin=mean(x) - 2 * sd(x),
ymax=mean(x) + 2 * sd(x)
)
ggplot(df1, aes(Variable, Observation)) + geom_point(color="red") +
stat_summary(fun.data="my_ci", geom="smooth")
The ggplot2 package comes with wrappers for a number of summarizing functions in the Hmisc package (which must be installed), including
mean_cl_normal which calculates the confidence limits based on the t-distribution,
mean_cl_boot which uses a bootstrap method that does not assume a distribution of the mean,
mean_sdl which uses a multiple of the standard deviation (default=2).
This latter method is the same as in the answer above, but is not the 95% CL. Confidence limits based on the t-distribution are given by:
CL = x̄ ± t × s / √n
where x̄ is the sample mean, t is the appropriate quantile of the t-distribution with n − 1 degrees of freedom, s is the sample standard deviation, and n is the number of observations. Compare the confidence bands:
ggplot(df1, aes(x=Variable, y=Observation)) +
stat_summary(fun.data="mean_sdl", geom="line", colour="blue")+
stat_summary(fun.data="mean_sdl", fun.args=list(mult=2), geom="errorbar",
width=0.1, linetype=2, colour="blue")+
geom_point(color="red") +
labs(title=expression(paste(bar(x)," \u00B1 ","2 * sd")))
ggplot(df1, aes(x=Variable, y=Observation)) +
geom_point(color="red") +
stat_summary(fun.data="mean_cl_normal", geom="line", colour="blue")+
stat_summary(fun.data="mean_cl_normal", fun.args=list(conf.int=0.95), geom="errorbar",
width=0.1, linetype=2, colour="blue")+
stat_summary(fun.data="mean_cl_normal", geom="point", size=3,
shape=1, colour="blue")+
labs(title=expression(paste(bar(x)," \u00B1 ","t * sd / sqrt(n)")))
Finally, rotating this last plot using coord_flip() generates something very close to a Forest Plot, which is a standard method for summarizing data like yours.
ggplot(df1, aes(x=Variable, y=Observation)) +
geom_point(color="red") +
stat_summary(fun.data="mean_cl_normal", fun.args=list(conf.int=0.95), geom="errorbar",
width=0.2, colour="blue")+
stat_summary(fun.data="mean_cl_normal", geom="point", size=3,
shape=1, colour="blue")+
geom_hline(aes(yintercept=mean(Observation)), linetype=2)+
labs(title="Forest Plot")+
coord_flip()
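To see what the t-based limits amount to numerically, they can be worked out by hand in base R. This is a sketch under my own naming (df_ci, t_limits and limits are hypothetical), cross-checked against t.test() rather than against Hmisc.

```r
set.seed(1)
df_ci <- data.frame(Variable = rep(1:5, each = 10), Observation = rnorm(50))

# Hypothetical helper: mean and t-based confidence limits for one group,
# i.e. mean(x) +/- t * s / sqrt(n).
t_limits <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per level of Variable, columns mean / lower / upper.
limits <- t(sapply(split(df_ci$Observation, df_ci$Variable), t_limits))
limits

# Cross-check for the first group against base R's one-sample t-test.
t.test(df_ci$Observation[df_ci$Variable == 1])$conf.int
```

With 10 observations per group the t quantile is noticeably larger than 1.96, but the division by √n still makes these limits much narrower than the mean ± 2 sd band.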
Using ggplot2's stat_smooth(), I'm curious how one might adjust the transparency of the generated regression line. With geom_point() or geom_line(), one normally sets a value for 'alpha' to control the opacity. However, with stat_smooth(), alpha sets the transparency of the confidence interval (turned off in my sample below with se=FALSE).
I cannot seem to find a way to give the regression line(s) an opacity lower than 1.
Your advice would be wonderful.
Sample Code
library(ggplot2)
library(reshape2)
df <- data.frame(x = 1:300)
df$y1 <- 0.5*(1/df$x + 0.1*(df$x-1)/df$x + rnorm(300,0,0.015))
df$y2 <- 0.5*(1/df$x + 0.3*(df$x-1)/df$x + rnorm(300,0,0.015))
df$y3 <- 0.5*(1/df$x + 0.6*(df$x-1)/df$x + rnorm(300,0,0.015))
df <- melt(df, id = 1)
ggplot(df, aes(x=x, y=value, color=variable)) +
geom_point(size=2) +
stat_smooth(method = "lm", formula = y ~ 0 + I(1/x) + I((x-1)/x),
se = FALSE,
size = 1.5,
alpha = 0.5)
To set alpha value just for the line you should replace stat_smooth() with geom_line() and then inside the geom_line() use the same arguments as in stat_smooth() and additionally add stat="smooth".
ggplot(df, aes(x=x, y=value, color=variable)) +
geom_point(size=2) +
geom_line(stat="smooth",method = "lm", formula = y ~ 0 + I(1/x) + I((x-1)/x),
size = 1.5,
linetype ="dashed",
alpha = 0.5)
As an alternative that's slightly more intuitive -- perhaps created since this answer -- you can use stat_smooth(geom="line"). The SE envelope disappears, though you can add it back with something like:
geom_smooth(alpha=0.3, size=0, span=0.5) +
stat_smooth(geom="line", alpha=0.3, size=3, span=0.5)
The first line draws the SE band with a zero-width line, and the second draws the line on top of it. The (current) documentation mentions that stat_smooth is meant for displaying the results with non-standard geoms (e.g. "line").
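Another way to get separate control over the band and the line, as far as I can tell, is to draw each with its own geom on top of the smooth stat. The sketch below uses mtcars and a plain linear fit purely as stand-ins; p_smooth is my own name.

```r
library(ggplot2)

# Use the packaged copy of mtcars so earlier modifications do not interfere.
p_smooth <- ggplot(datasets::mtcars, aes(wt, mpg)) +
  geom_point() +
  # Confidence band only, taken from the smooth computation (ymin / ymax)
  geom_ribbon(stat = "smooth", method = "lm", formula = y ~ x, alpha = 0.3) +
  # Fitted line only, with its own transparency
  geom_line(stat = "smooth", method = "lm", formula = y ~ x, alpha = 0.5)
p_smooth
```

Here the two alpha values are independent: the first affects only the band and the second only the line.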
I have this kind of data frame:
df<-data.frame(x=c(1,2,3,4,5,6,7,8,9,10),y=c(2,11,24,30,45,65,90,110,126,145), a=c(0.2,0.2,0.3,0.4,0.1,0.8,0.7,0.6,0.8,0.9))
Using ggplot, I would like to plot two regression lines on the same figure, each calculated for a subset of my data frame (a < 0.5 and a > 0.5).
Visually, I would like both of these regression lines:
df_a<-subset(df, df$a<0.5)
ggplot(df_a,aes(x,y))+
geom_point(aes(color = a), size=3.5) +
geom_smooth(method="lm", size=1, color="black") +
ylim(-5,155) +
xlim(0,11)
df_b<-subset(df, df$a>0.5)
ggplot(df_b,aes(x,y)) +
geom_point(aes(color = a), size=3.5) +
geom_smooth(method="lm", size=1, color="black") +
ylim(-5,155) +
xlim(0,11)
Appear on this figure:
ggplot(df,aes(x,y))+ geom_point(aes(color = a), size=3.5)
I've tried with par(new=TRUE) without success.
Make a flag variable, and use group:
df$small=df$a<0.5
ggplot(df,aes(x,y,group=small))+geom_point() + stat_smooth(method="lm")
and have yourself pretty colours and a legend if you want:
ggplot(df,aes(x,y,group=small,colour=small))+geom_point() + stat_smooth(method="lm")
Or maybe you want to colour the dots:
ggplot(df,aes(x,y,group=small)) +
stat_smooth(method="lm")+geom_point(aes(colour=a))
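If you prefer not to add a flag column at all, the grouping condition can probably be written directly inside aes(). A sketch using the question's data, with df_two and p_two as my own names:

```r
library(ggplot2)

df_two <- data.frame(
  x = 1:10,
  y = c(2, 11, 24, 30, 45, 65, 90, 110, 126, 145),
  a = c(0.2, 0.2, 0.3, 0.4, 0.1, 0.8, 0.7, 0.6, 0.8, 0.9)
)

p_two <- ggplot(df_two, aes(x, y)) +
  # Points keep the continuous colour scale for `a`
  geom_point(aes(colour = a), size = 3.5) +
  # The group is computed inside aes(), so no flag column is added to the data;
  # one linear fit is drawn per group (a < 0.5 and a >= 0.5).
  geom_smooth(aes(group = a < 0.5), method = "lm", formula = y ~ x, colour = "black")
p_two
```

Setting the group only in the smoothing layer keeps the points on a single continuous colour legend while still producing two separate fits.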