How can I make the method argument of geom_smooth() from ggplot be dynamic and adapt to the number of data points in a group?
For example, I have data in the following format:
DATE PRODUCT SIZE
3/1/2017 A 10
3/2/2017 B 14
3/3/2017 C 25
3/4/2017 A 16
etc.
This charts completely fine and adds a loess fit to each group (PRODUCT) with the following code (each PRODUCT group has about 20 entries):
library(ggplot2)
DT <- read.csv("TEST_DATA.csv")
DT$DATE <- as.Date(DT$DATE, "%m/%d/%Y")
myPlot <- ggplot(DT, aes(DATE, SIZE, color = PRODUCT))
myPlot + geom_point() + geom_smooth(method = "loess", se = FALSE)
However, let's say I add in just 2 data points for a 4th Product "D". I then get the following warning messages and no loess fit lines are added to the plot for ANY group.
Warning messages:
1: In simpleLoess(y, x, w, span, degree = degree, parametric = parametric, ... : span too small. fewer data values than degrees of freedom.
I believe this warning is due to the fact that the number of observations for product D is less than the degrees of freedom for the loess fit.
Setting method = "auto" chooses "loess" anyway, so that doesn't help, and setting method to "lm" for every group is not what I want.
I would like to do the following, but I can't quite get it to work. Can someone help?
myPlot + geom_point() + geom_smooth(data = DT, method = if(length(DT$PRODUCT)<5) {"lm"} else {"loess"}, se = F)
As you can see, I am trying to have geom_smooth() use method = "lm" if any groups have less than 5 observations, otherwise use the "loess" method. But I can't quite figure out how to access the number of observations of each group within the geom_smooth() function.
There's an n argument (number of points to evaluate smoother at) that you can use. See stat_smooth for details.
EDIT:
You can build the plot dynamically:
sProduct <- unique(DT$PRODUCT)
myPlot <- ggplot(DT, aes(DATE, SIZE, color = PRODUCT)) + geom_point()
for (i in sProduct){
sMethod <- ifelse(sum(DT$PRODUCT == i) <= 5, "lm", "loess")
myPlot <- myPlot + geom_smooth(data = subset(DT, PRODUCT == i), method = sMethod, se = FALSE)
}
myPlot
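Another possibility, offered as a sketch rather than something tested on your data: geom_smooth() also accepts a function for method, and the stat fits the model once per group, so the function can look at how many rows it was handed. The helper name smooth_by_n, its min_n argument and the simulated data below are made up for illustration.

```r
library(ggplot2)

# Hypothetical helper: straight line for small groups, loess otherwise.
# geom_smooth() calls it once per group with that group's data.
# The weights argument is accepted but deliberately ignored here.
smooth_by_n <- function(formula, data, weights, ..., min_n = 5) {
  if (nrow(data) < min_n) {
    stats::lm(formula, data = data)
  } else {
    stats::loess(formula, data = data, ...)
  }
}

# Simulated stand-in for TEST_DATA.csv: three products with 20 rows, one with 2.
set.seed(1)
DT <- data.frame(
  DATE    = as.Date("2017-03-01") + c(rep(0:19, 3), 0:1),
  PRODUCT = c(rep(c("A", "B", "C"), each = 20), "D", "D"),
  SIZE    = runif(62, 5, 30)
)

p <- ggplot(DT, aes(DATE, SIZE, color = PRODUCT)) +
  geom_point() +
  geom_smooth(method = smooth_by_n, formula = y ~ x, se = FALSE)
p
```

Compared with the loop, this keeps a single geom_smooth() layer, so the legend and colour scale are built from the full data set.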
You could write a function that chooses the smoothing method conditionally, based on minimum group length. For example:
library(tidyverse)
theme_set(theme_classic())
conditional_smooth = function(data, xvar, yvar, group) {
  p = ggplot(data, aes_string(xvar, yvar, colour=group)) +
    geom_point()
  min_group_length = split(data, data[, group]) %>% map_dbl(nrow) %>% min
  # Choose smoothing method based on minimum group length
  if(min_group_length >= 5) {
    p + geom_smooth(method=loess)
  } else {
    p + geom_smooth(method=lm)
  }
}
Let's run the function. For the iris data frame, the smallest group has length 50.
conditional_smooth(iris, "Petal.Length", "Sepal.Length", "Species")
Now let's shorten one group to four values:
conditional_smooth(iris[c(1:50,97:150), ], "Petal.Length", "Sepal.Length", "Species")
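In case it is useful, here is a sketch of the same idea without aes_string(), which recent ggplot2 releases flag as deprecated. It assumes a ggplot2 version that supports the .data pronoun; the names pick_method, conditional_smooth2 and min_n are my own, not part of any package.

```r
library(ggplot2)

# Hypothetical helper: "loess" only if every group has at least min_n rows.
pick_method <- function(data, group, min_n = 5) {
  # as.character() so that unused factor levels are not counted as empty groups
  group_sizes <- table(as.character(data[[group]]))
  if (min(group_sizes) >= min_n) "loess" else "lm"
}

conditional_smooth2 <- function(data, xvar, yvar, group, min_n = 5) {
  ggplot(data, aes(.data[[xvar]], .data[[yvar]], colour = .data[[group]])) +
    geom_point() +
    geom_smooth(method = pick_method(data, group, min_n), formula = y ~ x) +
    labs(x = xvar, y = yvar, colour = group)
}

# Same shortened data as above: one species is left with four rows, so "lm" is used.
conditional_smooth2(iris[c(1:50, 97:150), ], "Petal.Length", "Sepal.Length", "Species")
```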
I am trying to show coefficients that are not significant (p > 0.05) in a different color from the ones that are. A way to show a legend, or otherwise indicate what the colors mean, would also be nice.
Any ideas?
Sample code:
library(nycflights13)
library(dplyr)
library(dotwhisker)
library(MASS)
flights <- nycflights13::flights
flights<- sample_n (flights, 500)
m1<- glm(formula = arr_delay ~ dep_time + origin+ air_time+ distance , data = flights)
#m1<- glm(formula = arr_delay ~ . , data = flights)
m1<- stepAIC(m1)
p<- dotwhisker::dwplot(m1)
z<- p +
geom_vline(xintercept=0, linetype="dashed")+
geom_segment(aes(x=conf.low,y=term,xend=conf.high,
yend=term,col=p.value<0.05)) +
geom_point(aes(x=estimate,y=term,col=p.value<0.05)) +
xlab("standardized coefficient") +
ylab("coefficient") +
ggtitle("coefficients in the model and significance")
print(z)
Your code already more or less does what you want. The problem is that the object p produced by dwplot already has a geom_segment layer and a geom_point layer with a number of aesthetic mappings. Their colors are currently mapped to the variable model, which is just a factor level allowing for different colorings when comparing models side by side. It is possible to overwrite them, though:
p$layers[[1]]$mapping[5] <- aes(color = p.value < 0.05)
p$layers[[2]]$mapping[4] <- aes(color = p.value < 0.05)
And you can change the legend label with
p$labels$colour <- "Significant"
By default, dwplot also hides the legend, but we can reset that with:
p$theme <- list()
So without adding any new geoms or creating the object z, we have:
p
Note that p is still a valid and internally consistent ggplot, so you can continue to style it as desired, for example:
p + theme_bw() + geom_vline(xintercept = 0, lty = 2)
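If you would rather not reach into p$layers, a coefficient plot can also be assembled by hand. This is only a sketch under some assumptions: it uses a plain lm() on the built-in mtcars data as a stand-in for the flights model, and the column names of coefs are simply ones I picked.

```r
library(ggplot2)

# Example model on a built-in data set (stand-in for the flights model).
m <- lm(mpg ~ wt + hp + qsec, data = mtcars)

est <- coef(summary(m))  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
ci  <- confint(m)        # matrix with the lower and upper 95% limits

coefs <- data.frame(
  term        = rownames(est),
  estimate    = est[, "Estimate"],
  conf.low    = ci[, 1],
  conf.high   = ci[, 2],
  significant = est[, "Pr(>|t|)"] < 0.05
)
coefs <- coefs[coefs$term != "(Intercept)", ]  # drop the intercept row

ggplot(coefs, aes(x = estimate, y = term, colour = significant)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_segment(aes(x = conf.low, xend = conf.high, yend = term)) +
  geom_point() +
  labs(x = "coefficient estimate", y = "term", colour = "p < 0.05")
```

Because significant is an ordinary column mapped to colour, the legend appears without any changes to the theme.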
I have discrete data that looks like this:
height <- c(1,2,3,4,5,6,7,8)
weight <- c(100,200,300,400,500,600,700,800)
person <- c("Jack","Jim","Jill","Tess","Jack","Jim","Jill","Tess")
set <- c(1,1,1,1,2,2,2,2)
dat <- data.frame(set,person,height,weight)
I'm trying to plot a graph with the same x-axis (person) and 2 different y-axes (weight and height). All the examples I find either use a secondary axis (sec_axis) with continuous data or handle discrete data with base plots.
Is there an easy way to use sec_axis for discrete data in ggplot2?
Edit: Someone in the comments suggested I try the suggested reply. However, I now run into an error.
Here is my current code:
p1 <- ggplot(data = dat, aes(x = person, y = weight)) +
geom_point(color = "red") + facet_wrap(~set, scales="free")
p2 <- p1 + scale_y_continuous("height",sec_axis(~.*1.2, name="height"))
p2
I get the error: Error in x < range[1] :
comparison (3) is possible only for atomic and list types
Alternately, now I have modified the example to match this example posted.
p <- ggplot(dat, aes(x = person))
p <- p + geom_line(aes(y = height, colour = "Height"))
# adding the relative weight data, transformed to match roughly the range of the height
p <- p + geom_line(aes(y = weight/100, colour = "Weight"))
# now adding the secondary axis, following the example in the help file ?scale_y_continuous
# and, very important, reverting the above transformation
p <- p + scale_y_continuous(sec.axis = sec_axis(~.*100, name = "Relative weight [%]"))
# modifying colours and theme options
p <- p + scale_colour_manual(values = c("blue", "red"))
p <- p + labs(y = "Height [inches]",
x = "Person",
colour = "Parameter")
p <- p + theme(legend.position = c(0.8, 0.9))+ facet_wrap(~set, scales="free")
p
I get an error that says
"geom_path: Each group consists of only one observation. Do you need to
adjust the group aesthetic?"
I get the template, but no points are plotted.
R function arguments are matched by position when their names are not given explicitly. As @Z.Lin mentioned in the comments, you need sec.axis= before your sec_axis() call to indicate that you are passing it to the sec.axis argument of scale_y_continuous. If you don't, it is passed as the second argument of scale_y_continuous, which is breaks=. The error message thus comes from supplying an unacceptable data type for the breaks argument:
p1 <- ggplot(data = dat, aes(x = person, y = weight)) +
geom_point(color = "red") + facet_wrap(~set, scales="free")
p2 <- p1 + scale_y_continuous("weight", sec.axis = sec_axis(~.*1.2, name="height"))
p2
The first argument (name=) of scale_y_continuous is for the first y scale, whereas the sec.axis= argument is for the second y scale. I changed your first y scale name to correct that.
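The second attempt in the question fails for a different reason. With a discrete x, geom_line() treats every x value as its own group, which is what the "Each group consists of only one observation" message is complaining about. A common remedy is to set group = 1. Here is a sketch using the data from the question; the factor of 100 is just the ratio that happens to fit this toy data.

```r
library(ggplot2)

dat <- data.frame(
  set    = c(1, 1, 1, 1, 2, 2, 2, 2),
  person = c("Jack", "Jim", "Jill", "Tess", "Jack", "Jim", "Jill", "Tess"),
  height = c(1, 2, 3, 4, 5, 6, 7, 8),
  weight = c(100, 200, 300, 400, 500, 600, 700, 800)
)

p <- ggplot(dat, aes(x = person, group = 1)) +  # group = 1: connect all persons in a panel
  geom_line(aes(y = height, colour = "Height")) +
  # weight is divided by 100 to bring it onto the height scale
  geom_line(aes(y = weight / 100, colour = "Weight")) +
  scale_y_continuous(
    name = "Height",
    # the secondary axis reverses the division above
    sec.axis = sec_axis(~ . * 100, name = "Weight")
  ) +
  facet_wrap(~set, scales = "free")
p
```

In this particular data set weight is exactly 100 times height, so the two lines coincide and only the one drawn last is visible.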
I've poked around, but have been unable to find an answer. I want to do a weighted geom_bar plot overlaid with a vertical line that shows the overall weighted average per facet. I'm unable to make this happen. The vertical line seems to be a single value applied to all facets.
require('ggplot2')
require('plyr')
# data vectors
panel <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
instrument <-c("V1","V2","V1","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1")
cost <- c(1,4,1.5,1,4,4,1,2,1.5,1,2,1.5,2,1.5,1,2)
sensitivity <- c(3,5,2,5,5,1,1,2,3,4,3,2,1,3,1,2)
# put an initial data frame together
mydata <- data.frame(panel, instrument, cost, sensitivity)
# add a "contribution to" vector to the data frame: contribution of each instrument
# to the panel's weighted average sensitivity.
myfunc <- function(cost, sensitivity) {
return(cost*sensitivity/sum(cost))
}
mydata <- ddply(mydata, .(panel), transform, contrib=myfunc(cost, sensitivity))
# two views of each panels weighted average; should be the same numbers either way
ddply(mydata, c("panel"), summarize, wavg=weighted.mean(sensitivity, cost))
ddply(mydata, c("panel"), summarize, wavg2=sum(contrib))
# plot where each panel is getting its overall cost-weighted sensitivity from. Also
# put each panel's weighted average on the plot as a simple vertical line.
#
# PROBLEM! I don't know how to get geom_vline to honor the facet breakdown. It
# seems to be computing it overall the data and showing the resulting
# value identically in each facet plot.
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(xintercept=sum(contrib)) +
facet_wrap(~ panel) +
ylab("contrib")
If you pass in the pre-summarized data, it seems to work:
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(data = ddply(mydata, "panel", summarize, wavg = sum(contrib)), aes(xintercept=wavg)) +
facet_wrap(~ panel) +
ylab("contrib") +
theme_bw()
Example using dplyr and facet_wrap, in case anyone wants it.
library(dplyr)
library(ggplot2)
df1 <- mutate(iris, Big.Petal = Petal.Length > 4)
df2 <- df1 %>%
group_by(Species, Big.Petal) %>%
summarise(Mean.SL = mean(Sepal.Length))
ggplot() +
geom_histogram(data = df1, aes(x = Sepal.Length, y = ..density..)) +
geom_vline(data = df2, mapping = aes(xintercept = Mean.SL)) +
facet_wrap(Species ~ Big.Petal)
vlines <- ddply(mydata, .(panel), summarize, sumc = sum(contrib))
ggplot(merge(mydata, vlines), aes(sensitivity, weight = contrib)) +
geom_bar(binwidth = 1) + geom_vline(aes(xintercept = sumc)) +
facet_wrap(~panel) + ylab("contrib")
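For anyone without plyr, here is a base R sketch of the same idea. It assumes a ggplot2 version where geom_bar() no longer takes binwidth, so geom_histogram() is swapped in; vlines and wavg are just names I chose. The key point is unchanged: the data frame handed to geom_vline() has one row per panel and carries the faceting column.

```r
library(ggplot2)

mydata <- data.frame(
  panel       = rep(c("A", "B"), times = c(6, 10)),
  cost        = c(1, 4, 1.5, 1, 4, 4, 1, 2, 1.5, 1, 2, 1.5, 2, 1.5, 1, 2),
  sensitivity = c(3, 5, 2, 5, 5, 1, 1, 2, 3, 4, 3, 2, 1, 3, 1, 2)
)

# Contribution of each row to its panel's cost-weighted mean sensitivity.
mydata$contrib <- mydata$cost * mydata$sensitivity /
  ave(mydata$cost, mydata$panel, FUN = sum)

# One row per panel; the panel column ties each line to its facet.
vlines <- aggregate(contrib ~ panel, data = mydata, FUN = sum)
names(vlines)[2] <- "wavg"

ggplot(mydata, aes(x = sensitivity)) +
  geom_histogram(aes(weight = contrib), binwidth = 1) +
  geom_vline(data = vlines, aes(xintercept = wavg)) +
  facet_wrap(~panel) +
  ylab("contrib")
```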
Dear Stackoverflow users,
I would like to draw a grouped barplot with three independent variables and error bars. I based my graph on an example on Stack Overflow (stacked bars within grouped bars), using ggplot with geom_bar. When I add geom_errorbar following the examples on the help pages, I get the following error:
Error in if (empty(data)) { : missing value where TRUE/FALSE needed
This is the script I use:
treatment<-rep(c(rep(c(1),8),rep(c(2),8)),2)
origin<-rep(c("A","B"),16)
time<-c(rep(c(5),16),rep(c(10),16))
sulfide<-c(0,10,5,8,9,6,16,18,20,25,50,46,17,58,39,43,20,25,50,46,17,58,39,43,100,120,103,104,150,160,200,180)
Reed<-data.frame(treatment,origin,time,sulfide)
# specify factor types
Reed$treatment<-as.factor(Reed$treatment)
Reed$origin<-as.character(Reed$origin)
Reed$time<-as.factor(Reed$time)
library(ggplot2)
library(scales)
#draw plot
ggplot() +geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +theme_bw() + facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time")
This is how I added error bars:
ErrorBars <- function(x, y, upper, lower=upper, length=0.03, ...) {
  if(length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
} #function for errorbars
SE<- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x))) #function for SE
Reed$trt<- paste(Reed$treatment,Reed$origin,sep="")#combine treatment and origin to a column
mean_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt,Reed$time),mean,na.rm=TRUE)) #mean
SE_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt, Reed$time),SE)) # SE
limits <- aes(ymax = mean_Reed + SE_Reed, ymin=mean_Reed - SE_Reed)# Define the top and bottom of the errorbars
#plot with error bars:
ggplot() +geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +theme_bw() + facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time")+ geom_errorbar(limits, width=.2,position="dodge")
I really can't find what I'm doing wrong.
I hope you can help me:)
Leaving aside the issue of error bars for the moment, there's a much more serious problem with your plot. You have 2 values each of treatment, time, and origin, for a total of 8 combinations, but 32 values of sulfide - so there are 4 values of sulfide for each combination. When you plot this using, e.g.,
ggplot(data=Reed) +
geom_bar(aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")
you are plotting bars for all four sulfide values on top of each other all in the same color. This has the effect of displaying only the maximum value. It's a little hard to believe this is what you intended, and even if you did there's a better way to do that. For instance, if you want to plot the mean value of sulfide for each combination of factors, you can do it this way.
ggp <- ggplot(data=Reed, aes(y = sulfide, x = as.factor(treatment), group=origin)) +
geom_bar(aes(fill=origin), stat="summary", fun.y=mean, position="dodge") +
theme_bw() +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time")
ggp
This uses stat="summary" to automatically summarize the result using the aggregating function mean (fun.y=mean).
A similar approach can be used to very simply add the error bars:
se <- function(y) sd(y)/sqrt(length(y)) # to calculate the standard error of the mean
ggp+stat_summary(geom="errorbar",position=position_dodge(width=0.85),
fun.data=function(y)c(ymin=mean(y)-se(y),ymax=mean(y)+se(y)), width=0.1)
Notice that there is no need to aggregate the data externally - ggplot does it for you.
Finally, this approach lends itself to the use of many built-in functions for generating confidence limits with more statistical rigor.
ggp+stat_summary(fun.data=mean_cl_normal, conf.int=0.95,
geom="errorbar",position=position_dodge(width=0.85), width=0.1)
So here we use the ggplot built-in function mean_cl_normal to calculate 95% confidence limits on the mean assuming the data follows a normal distribution (and that, hence, the means will follow a t-distribution). We use the argument conf.int=... to specify the desired confidence interval, but the default is 0.95 so it really wasn't necessary in this example.
There are several other functions of this type: see the documentation and links therein for an explanation.
If you want to build your error bars by making a summary dataset, you just need to get that dataset in the correct format. There are lots of options for this; I will use dplyr. Notice I keep all the grouping variables from the plot in this dataset in a "tidy" format, with each variable in a separate column.
library(dplyr)
meandat = Reed %>%
group_by(treatment, time, origin) %>%
summarise(mean = mean(sulfide, na.rm = TRUE), se = SE(sulfide))
Source: local data frame [8 x 5]
Groups: treatment, time [?]
treatment time origin mean se
(fctr) (fctr) (chr) (dbl) (dbl)
1 1 5 A 7.50 3.378856
2 1 5 B 10.50 2.629956
3 1 10 A 31.50 7.858117
4 1 10 B 43.00 6.819091
5 2 5 A 31.50 7.858117
6 2 5 B 43.00 6.819091
7 2 10 A 138.25 23.552689
8 2 10 B 141.00 17.540429
Now error bars can be added via geom_errorbar. You'll see I set the aesthetics globally within ggplot to save myself having to re-type some of these, but you can change this as you want. I use position_dodge to get the error bars placed correctly over each bar.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity", position="dodge") +
theme_bw() +
facet_grid( ~ time)+
xlab("treatment") +
ylab("Sulfide")+
ggtitle("Time")+
geom_errorbar(data = meandat, aes(ymin = mean - se, ymax = mean + se, y = mean),
position = position_dodge(width = .9))
You can actually do all of this via stat_summary, rather than calculating the summary statistics "by hand". An example is here. The code would look like so, and gives the same plot as above.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity",position="dodge") +
theme_bw() +
facet_grid( ~ time) +
xlab("treatment") +
ylab("Sulfide") +
ggtitle("Time") +
stat_summary(geom = "errorbar", fun.data = mean_cl_normal, mult = 1,
position = position_dodge(width = .9))
I've been using the development version of ggplot2, ggplot2_1.0.1.9003, and found that I needed to add stat_summary function arguments via fun.args. This would look like fun.args = list(mult = 1) to get error bars of 1 standard error.
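To make the fun.args form concrete, here is a sketch that draws both the mean bars and the ±1 standard error bars with stat_summary. It assumes a ggplot2 release recent enough to have the fun argument (the successor to fun.y) and the built-in mean_se; the data frame is rebuilt from the question so the snippet runs on its own.

```r
library(ggplot2)

Reed <- data.frame(
  treatment = factor(rep(c(rep(1, 8), rep(2, 8)), 2)),
  origin    = rep(c("A", "B"), 16),
  time      = factor(c(rep(5, 16), rep(10, 16))),
  sulfide   = c(0, 10, 5, 8, 9, 6, 16, 18, 20, 25, 50, 46, 17, 58, 39, 43,
                20, 25, 50, 46, 17, 58, 39, 43, 100, 120, 103, 104, 150, 160, 200, 180)
)

# Shared dodge so the bars and their error bars line up.
dodge <- position_dodge(width = 0.9)

ggplot(Reed, aes(x = treatment, y = sulfide, fill = origin)) +
  # bar height = group mean
  stat_summary(fun = mean, geom = "bar", position = dodge) +
  # error bars = mean +/- 1 standard error, passed through fun.args
  stat_summary(fun.data = mean_se, fun.args = list(mult = 1),
               geom = "errorbar", position = dodge, width = 0.2) +
  facet_grid(~time) +
  theme_bw()
```

Unlike the stat = "identity" version, the bars here show group means, so the error bars sit on top of the quantity they describe.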
We have some data which represents many model runs under different scenarios. For a single scenario, we'd like to display the smoothed mean, with the filled areas representing the standard deviation at a particular point in time, rather than the quality of the fit of the smoothing.
For example:
d <- as.data.frame(rbind(cbind(1:20, 1:20, 1),
cbind(1:20, -1:-20, 2)))
names(d)<-c("Time","Value","Run")
ggplot(d, aes(x=Time, y=Value)) +
geom_line(aes(group=Run)) +
geom_smooth()
This produces a graph with two runs represented, and a smoothed mean, but even though the SD between the runs is increasing, the smoother's bars stay the same size. I'd like to make the surrounds of the smoother represent standard deviation at a given timestep.
Is there a non-labour intensive way of doing this, given many different runs and output variables?
Hi, I'm not sure I understand correctly what you want, but for example:
d <- data.frame(Time=rep(1:20, 4),
Value=rnorm(80, rep(1:20, 4)+rep(1:4*2, each=20)),
Run=gl(4,20))
mean_se <- function(x, mult = 1) {
x <- na.omit(x)
se <- mult * sqrt(var(x) / length(x))
mean <- mean(x)
data.frame(y = mean, ymin = mean - se, ymax = mean + se)
}
ggplot( d, aes(x=Time,y=Value) ) + geom_line( aes(group=Run) ) +
geom_smooth(se=FALSE) +
stat_summary(fun.data=mean_se, geom="ribbon", alpha=0.25)
Note that mean_se is going to appear in the next version of ggplot2.
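Since the question asks for the standard deviation rather than the standard error, the same pattern works with a different summary function. A sketch, where mean_sd is a name I made up (it is not a ggplot2 function) and the fun argument assumes a reasonably recent ggplot2:

```r
library(ggplot2)

# Hypothetical summary function: mean plus/minus `mult` standard deviations.
mean_sd <- function(x, mult = 1) {
  x <- stats::na.omit(x)
  m <- mean(x)
  s <- mult * stats::sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

# The two diverging runs from the question.
d <- data.frame(
  Time  = rep(1:20, 2),
  Value = c(1:20, -(1:20)),
  Run   = rep(1:2, each = 20)
)

ggplot(d, aes(x = Time, y = Value)) +
  geom_line(aes(group = Run)) +
  # ribbon = mean +/- 1 SD across runs at each time step
  stat_summary(fun.data = mean_sd, geom = "ribbon", alpha = 0.25) +
  # line = mean across runs at each time step
  stat_summary(fun = mean, geom = "line", colour = "blue")
```

With these two runs the ribbon widens over time, because the spread between the runs grows at every step.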
The accepted answer only works if the measurements are aligned/discretized on x. For continuous data you could use a rolling window and add a custom ribbon:
library(dplyr)
library(ggplot2)
library(zoo) # for rollapply() and rollmean()
iris %>%
## apply same grouping as for plot
group_by(Species) %>%
## Important sort along x!
arrange(Petal.Length) %>%
## calculate rolling mean and sd
mutate(rolling_sd=rollapply(Petal.Width, width=10, sd, fill=NA), rolling_mean=rollmean(Petal.Width, k=10, fill=NA)) %>%
## build the plot
ggplot(aes(Petal.Length, Petal.Width, color = Species)) +
# optionally we could rather plot the rolling mean instead of the geom_smooth loess fit
# geom_line(aes(y=rolling_mean), color="black") +
geom_ribbon(aes(ymin=rolling_mean-rolling_sd/2, ymax=rolling_mean+rolling_sd/2), fill="lightgray", color="lightgray", alpha=.8) +
geom_point(size = 1, alpha = .7) +
geom_smooth(se=FALSE)