ggplot: adding new lines from a subset of years - r

I have about 90 years of daily data and I want to plot the long term mean, plus the individual lines for each year of my survey period (2014-2018). The data looks like this:
> head(dischg)
date ddmm year cfs daymo
1 1-Jan-27 01-Jan 1927 715 2018-01-01
2 2-Jan-27 02-Jan 1927 697 2018-01-02
3 3-Jan-27 03-Jan 1927 715 2018-01-03
4 4-Jan-27 04-Jan 1927 796 2018-01-04
5 5-Jan-27 05-Jan 1927 825 2018-01-05
6 6-Jan-27 06-Jan 1927 865 2018-01-06
I have been able to plot the long term mean easily enough:
p1 <- ggplot(dischg, aes(x=daymo, y=cfs)) +
stat_summary(fun.data = "mean_cl_boot", geom = "smooth", colour = "blue")
... but I need some help plotting the subset of years. I tried using "subset"
p2 <- p1 +
ggplot (subset(dischg, year %in% c(2014:2018)), aes(x=daymo, y=cfs, linetype=year)) +
geom_line() +
scale_colour_brewer(palette="Set1")
but I received this error:
Error: Don't know how to add o to a plot
Would it be smarter to just add one year at a time? That seems a bit cumbersome when there are five years of data to plot.

Thank you for providing sample data. Unfortunately, I cannot get the ggplot code to run with the sample data you provided, so I will use a built-in R dataset. The concepts are the same, though.
The issue is that you are trying to add ggplot to an object that is already of class ggplot. Once you have initialized your object as a ggplot object, you don't need to call ggplot each time you want to add a layer. For example, I get the same error you do if I try:
p1 <- ggplot(mtcars, aes(x=hp,y=cyl)) + geom_point()
p2 <- p1 + ggplot(mtcars[mtcars$am == 1, ], aes(x = hp, y = cyl)) + geom_line()
As mentioned in my comment, if you want to add another layer with separate data (in your case the geom_line) you can do this by putting the data directly into the geom_ call. In your case you would do something like:
p1 <- ggplot(mtcars, aes(x=hp,y=cyl)) + geom_point()
p2 <- p1 + geom_line(data = mtcars[mtcars$am == 1, ])
p2

With thanks to feedback from @MikeH., I figured it out:
p1 <- ggplot(dischg, aes(x=daymo, y=cfs)) +
stat_summary(fun.data = "mean_cl_boot", geom = "smooth", colour = "blue") +
geom_line(data=subset(dischg, year %in% c(2014:2018)),
aes(colour=year)) +
scale_colour_brewer(palette="Set1")
(Also, I had to make sure the 'year' was a factor rather than a continuous variable.)
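As a rough, self-contained sketch of the accepted approach: the data below is a made-up stand-in for `dischg` (the column names follow the question, the values are simulated), and `stat_summary(fun = mean, ...)` is swapped in for `mean_cl_boot` so that the Hmisc package is not required. It assumes ggplot2 3.3.0 or later for the `fun` argument.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for `dischg`: simulated daily flow for 2010-2018
dischg <- expand.grid(doy = 1:365, year = 2010:2018)
dischg$daymo <- as.Date("2018-01-01") + dischg$doy - 1  # common dummy year for the x axis
dischg$cfs <- 700 + 200 * sin(2 * pi * dischg$doy / 365) +
  rnorm(nrow(dischg), sd = 50)

# Subset the survey years and make `year` discrete so each year gets its own colour
survey <- subset(dischg, year %in% 2014:2018)
survey$year <- factor(survey$year)

p <- ggplot(dischg, aes(x = daymo, y = cfs)) +
  stat_summary(fun = mean, geom = "line", colour = "blue") +  # long-term daily mean
  geom_line(data = survey, aes(colour = year)) +              # one line per survey year
  scale_colour_brewer(palette = "Set1")
p
```

The factor conversion matters because `scale_colour_brewer()` is a discrete scale; a numeric `year` would be treated as continuous and rejected by it.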

Related

coloring legend in bar chart in R

I'm fairly new to R and need some help manipulating my graph. I'm trying to compare actual and forecast figures, but cannot get the colouring of the legend right. The data looks like this:
hierarchy Actual Forecast
<fctr> <dbl> <dbl>
1 E 9313 5455
2 K 6257 3632
3 O 7183 8684
4 A 1579 6418
5 S 8755 0149
6 D 5897 7812
7 F 1400 8810
8 G 4960 5710
9 R 3032 0412
And the code looks like this:
ggplot(sam4, aes(hierarchy))+ theme_bw() +
geom_bar(aes(y = Actual, colour="Actual"),fill="#66FF33", stat="identity",position="dodge", width=0.40) +
geom_bar(aes(y = Forecast, colour="Forecast"), fill="#FF3300", stat="identity",position="dodge", width=0.2)
The graph ends up looking like this:
I believe your problem is that your data is not formatted well to use ggplot. You want to tidy up your dataframe first. Check out http://tidyr.tidyverse.org/ to get familiar with the concept of tidy data.
Using the tidyverse (ggplot is part of it), I tidied up your data and I believe got the plot you want.
library(tidyverse) #includes ggplot
newdata <- gather(sam4, actualorforecast, value, -hierarchy)
ggplot(newdata, aes(x = hierarchy)) +
theme_bw() +
geom_bar(aes(y = value, fill = actualorforecast),
stat = "identity",
width = ifelse(newdata$actualorforecast == "Actual", .4, .2),
position = "dodge") +
scale_fill_manual(values= c(Actual ="#66FF33", Forecast="#FF3300"))
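In current tidyr, `gather()` is superseded by `pivot_longer()`. A minimal sketch of the same reshaping with that function follows, using only the first three rows of the question's data; the column name `series` is my own choice, and `geom_col()` stands in for `geom_bar(stat = "identity")`. The per-series bar widths from the answer are left out to keep the example short.

```r
library(tidyr)
library(ggplot2)

# First three rows of the question's data, typed in by hand
sam4 <- data.frame(
  hierarchy = c("E", "K", "O"),
  Actual    = c(9313, 6257, 7183),
  Forecast  = c(5455, 3632, 8684)
)

# One row per hierarchy/series pair instead of one column per series
long <- pivot_longer(sam4, cols = c(Actual, Forecast),
                     names_to = "series", values_to = "value")

p <- ggplot(long, aes(x = hierarchy, y = value, fill = series)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Actual = "#66FF33", Forecast = "#FF3300")) +
  theme_bw()
p
```

Because `fill` is now mapped to a column, the legend colours and the bar colours come from the same scale and cannot disagree.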

How to add the deviation from the mean on a geom_line with geom_smooth?

Is it possible to add a kind of smooth band with the absolute values of the distance from the mean on a geom_line?
I have a matrix like this:
mean Date abs(mean-observed_value)
1 0.2955319 2015-08-04 1.167321e-02
2 0.2802859 2015-08-12 7.537708e-03
3 0.2671653 2015-08-20 2.074987e-03
4 0.2552016 2015-08-28 4.883826e-03
5 0.2554279 2015-09-05 4.419968e-03
The abs(mean-observed_value) column holds many time series of 54 observations each, and the Date and mean values act as the groups, repeated for every block of 54 rows. I was plotting all the time series (using the original values) like this:
p<-ggplot() +
geom_line(data = y_m, aes(x = Date, y = value, group = variable), color="steelblue", size =0.1)
p + geom_line(data =y_mean, aes(x = Date, y = as.numeric(df.ts_mean)), color=1, size =2) + ylab("EVI")
But now with the deviations I want to plot them as a smooth band. Something like this:
I would appreciate a lot any possible solutions! Thanks a lot!
You can use geom_ribbon from the ggplot2 package, where you can set the ymin and ymax values (in your case, the mean minus and plus the abs column). Here is some example code:
library(ggplot2)
huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
h <- ggplot(huron, aes(year))
h + geom_ribbon(aes(ymin = level - 1, ymax = level + 1), fill = "grey70") +
geom_line(aes(y = level))
For the future, please post sample data as dput() output; it is much easier to use than copying each value!
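Applied to the five rows shown in the question, the ribbon idea might look like the sketch below. It assumes a single deviation value per date (the question has many series per date, so some aggregation would be needed first), and the column name `dev` is my own shorthand for the abs(mean-observed_value) column.

```r
library(ggplot2)

# The five rows from the question; `dev` is a hypothetical name for the deviation column
y_mean <- data.frame(
  Date = as.Date(c("2015-08-04", "2015-08-12", "2015-08-20",
                   "2015-08-28", "2015-09-05")),
  mean = c(0.2955319, 0.2802859, 0.2671653, 0.2552016, 0.2554279),
  dev  = c(1.167321e-02, 7.537708e-03, 2.074987e-03,
           4.883826e-03, 4.419968e-03)
)

p <- ggplot(y_mean, aes(x = Date)) +
  geom_ribbon(aes(ymin = mean - dev, ymax = mean + dev), fill = "grey70") +  # band
  geom_line(aes(y = mean)) +                                                 # mean line on top
  ylab("EVI")
p
```

The ribbon layer is added first so that the mean line is drawn over it.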

Facing an issue with ggplot

I am having a very simple data frame as below.
cat_group total abort_rate cancel_rate success_rate
100 1804 18.8 45.1 31.8
200 4118 17.7 30.0 48.3
500 14041 19.2 16.9 60.0
I am trying to put this data on a plot such that on the x-axis, I will have cat_group and then I would line plot all the other variables total, abort_rate, cancel_rate and success_rate. My idea is to show how each of these variables vary according to the value in cat_group. I would need four lines in total, one for each variable in a different colour
But when I use the below plot function in R, I am seeing the error: geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
ggplot(my_data_frame, aes(category)) +
geom_line(aes(y = abort_rate, colour = "abort_rate")) +
geom_line(aes(y = success_rate, colour = "success_rate"))+
geom_line(aes(y = success_rate, colour = "total"))+
geom_line(aes(y = success_rate, colour = "cancel_rate"))
Any suggestions on how to resolve this issue?
Assuming that cat_group is of factor type (that's the only way I can reproduce your error) you could do it like this:
my_data_frame$cat_group <- as.factor(my_data_frame$cat_group)
library(ggplot2)
ggplot(my_data_frame, aes(cat_group)) +
geom_line(aes(y = abort_rate, colour = "abort_rate", group=1)) +
geom_line(aes(y = success_rate, colour = "success_rate", group=1))+
geom_line(aes(y = total, colour = "total", group=1))+
geom_line(aes(y = cancel_rate, colour = "cancel_rate", group=1))
i.e. by specifying group=1 in each geom_line. This has the problem that all four lines share one y axis: total is orders of magnitude larger than the three rates, so on a linear scale the rate lines are squashed together near the bottom and are hard to tell apart.
The typical way of working with such data is to melt the data.frame and then plot it like this:
library(reshape2)
dfm <- melt(my_data_frame, id.vars='cat_group')
ggplot(dfm, aes(x=cat_group, y=value, colour=variable, group=variable)) + geom_line() +
scale_y_log10()
Notice the scale_y_log10 in order to plot (and actually see) all 4 lines. You probably need a log scale since otherwise you will only be able to see the total which is very big and every other line will be overlapped.
One easy way to do this is to use autoplot.zoo:
library(ggplot2)
library(zoo)
z <- read.zoo(my_df)
autoplot(z, facet = NULL) + scale_y_log10()
or for separate panels without a log scale:
autoplot(z) + facet_free()
Note: Here is the input data in reproducible form:
Lines <- "cat_group total abort_rate cancel_rate success_rate
100 1804 18.8 45.1 31.8
200 4118 17.7 30.0 48.3
500 14041 19.2 16.9 60.0"
my_df <- read.table(text = Lines, header = TRUE)
The best way to solve this is to reshape your data so that you have one column for the x axis, one for the y axis, and one for the type of data contained in the row. To do this you can use the tidyr package.
library(tidyr)
library(ggplot2)
plottingData <- my_data_frame %>% gather(type,value,-cat_group)
ggplot(plottingData,aes(x=cat_group,y=value,color=type)) + geom_line()
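The long-format idea and the factor issue from the original error can be combined. The sketch below is one possible way to do it, not any answerer's exact code: it uses `pivot_longer()` (the newer tidyr replacement for `gather()`), keeps `cat_group` as a factor, and sets `group = type` explicitly so that each variable gets a connected line. The names `long` and `type` are my own.

```r
library(tidyr)
library(ggplot2)

my_data_frame <- read.table(text = "cat_group total abort_rate cancel_rate success_rate
100 1804 18.8 45.1 31.8
200 4118 17.7 30.0 48.3
500 14041 19.2 16.9 60.0", header = TRUE)

# One row per cat_group/variable pair
long <- pivot_longer(my_data_frame, cols = -cat_group,
                     names_to = "type", values_to = "value")
long$cat_group <- factor(long$cat_group)  # discrete x axis, as in the question

p <- ggplot(long, aes(x = cat_group, y = value, colour = type, group = type)) +
  geom_line() +
  scale_y_log10()  # total is far larger than the rates
p
```

Without `group = type`, a factor on the x axis would put every point in its own group and reproduce the "only one observation" message.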

Error when adding errorbars to ggplot

Dear Stackoverflow users,
I would like to draw a grouped barplot with three independent variables with error bars. I based my graph on an example on Stack Overflow (stacked bars within grouped bars), using ggplot with geom_bar. When I add the geom_errorbar according to examples of the help pages, I get the following error:
Error in if (empty(data)) { : missing value where TRUE/FALSE needed
This is the script I use:
treatment<-rep(c(rep(c(1),8),rep(c(2),8)),2)
origin<-rep(c("A","B"),16)
time<-c(rep(c(5),16),rep(c(10),16))
sulfide<-c(0,10,5,8,9,6,16,18,20,25,50,46,17,58,39,43,20,25,50,46,17,58,39,43,100,120,103,104,150,160,200,180)
Reed<-data.frame(treatment,origin,time,sulfide)
# specify factor types
Reed$treatment<-as.factor(Reed$treatment)
Reed$origin<-as.character(Reed$origin)
Reed$time<-as.factor(Reed$time)
library(ggplot2)
library(scales)
#draw plot
ggplot() +
  geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity", position="dodge") +
  theme_bw() +
  facet_grid( ~ time) +
  xlab("treatment") +
  ylab("Sulfide") +
  ggtitle("Time")
This is how I added error bars:
# function for error bars (base graphics)
ErrorBars <- function(x, y, upper, lower=upper, length=0.03, ...) {
  if (length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
SE<- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x))) #function for SE
Reed$trt<- paste(Reed$treatment,Reed$origin,sep="")#combine treatment and origin to a column
mean_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt,Reed$time),mean,na.rm=TRUE)) #mean
SE_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt, Reed$time),SE)) # SE
limits <- aes(ymax = mean_Reed + SE_Reed, ymin=mean_Reed - SE_Reed)# Define the top and bottom of the errorbars
#plot with error bars:
ggplot() +
  geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity", position="dodge") +
  theme_bw() +
  facet_grid( ~ time) +
  xlab("treatment") +
  ylab("Sulfide") +
  ggtitle("Time") +
  geom_errorbar(limits, width=.2, position="dodge")
I really can't find what I'm doing wrong.
I hope you can help me:)
Leaving aside the issue of error bars for the moment, there's a much more serious problem with your plot. You have 2 values each of treatment, time, and origin, for a total of 8 combinations, but 32 values of sulfide - so there are 4 values of sulfide for each combination. When you plot this using, e.g.,
ggplot(data=Reed) +
geom_bar(aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")
you are plotting bars for all four sulfide values on top of each other all in the same color. This has the effect of displaying only the maximum value. It's a little hard to believe this is what you intended, and even if you did there's a better way to do that. For instance, if you want to plot the mean value of sulfide for each combination of factors, you can do it this way.
ggp <- ggplot(data=Reed, aes(y = sulfide, x = as.factor(treatment), group=origin)) +
geom_bar(aes(fill=origin), stat="summary", fun.y=mean, position="dodge") +
theme_bw() +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time")
ggp
This uses stat="summary" to automatically summarize the result using the aggregating function mean (fun.y=mean).
A similar approach can be used to very simply add the error bars:
se <- function(y) sd(y)/sqrt(length(y)) # standard error of the mean
ggp+stat_summary(geom="errorbar",position=position_dodge(width=0.85),
fun.data=function(y)c(ymin=mean(y)-se(y),ymax=mean(y)+se(y)), width=0.1)
Notice that there is no need to aggregate the data externally - ggplot does it for you.
Finally, this approach lends itself to the use of many built-in functions for generating confidence limits with more statistical rigor.
ggp+stat_summary(fun.data=mean_cl_normal, fun.args=list(conf.int=0.95),
geom="errorbar",position=position_dodge(width=0.85), width=0.1)
So here we use the function mean_cl_normal (a ggplot2 wrapper around Hmisc::smean.cl.normal, so the Hmisc package must be installed) to calculate 95% confidence limits on the mean assuming the data follows a normal distribution (and that, hence, the means will follow a t-distribution). We pass conf.int=... through fun.args to specify the desired confidence level, but the default is 0.95 so it really wasn't necessary in this example. (Older versions of ggplot2 accepted conf.int=... directly in stat_summary.)
There are several other functions of this type: see the documentation and links therein for an explanation.
If you want to build your error bars by making a summary dataset, you just need to get that dataset in the correct format. There are lots of options for this; I will use dplyr. Notice I keep all the grouping variables from the plot in this dataset in a "tidy" format, with each variable in a separate column.
library(dplyr)
meandat = Reed %>%
group_by(treatment, time, origin) %>%
summarise(mean = mean(sulfide, na.rm = TRUE), se = SE(sulfide))
Source: local data frame [8 x 5]
Groups: treatment, time [?]

  treatment   time origin   mean        se
     (fctr) (fctr)  (chr)  (dbl)     (dbl)
1         1      5      A   7.50  3.378856
2         1      5      B  10.50  2.629956
3         1     10      A  31.50  7.858117
4         1     10      B  43.00  6.819091
5         2      5      A  31.50  7.858117
6         2      5      B  43.00  6.819091
7         2     10      A 138.25 23.552689
8         2     10      B 141.00 17.540429
Now error bars can be added via geom_errorbar. You'll see I set the aesthetics globally within ggplot to save myself having to re-type some of these, but you can change this as you want. I use position_dodge to get the error bars placed correctly over each bar.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity", position="dodge") +
theme_bw() +
facet_grid( ~ time)+
xlab("treatment") +
ylab("Sulfide")+
ggtitle("Time")+
geom_errorbar(data = meandat, aes(ymin = mean - se, ymax = mean + se, y = mean),
position = position_dodge(width = .9))
You can actually do all of this via stat_summary, rather than calculating the summary statistics "by hand". An example is here. The code would look like so, and gives the same plot as above.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity",position="dodge") +
theme_bw() +
facet_grid( ~ time) +
xlab("treatment") +
ylab("Sulfide") +
ggtitle("Time") +
stat_summary(geom = "errorbar", fun.data = mean_cl_normal, mult = 1,
position = position_dodge(width = .9))
I've been using the development version of ggplot2, ggplot2_1.0.1.9003, and found that I needed to add stat_summary function arguments via fun.args. This would look like fun.args = list(mult = 1) to get error bars of 1 standard error.
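For reference, here is a sketch of what the `fun.args` form can look like in a recent ggplot2 (3.3.0 or later is assumed for the `fun` argument). To avoid depending on Hmisc, it substitutes ggplot2's own `mean_se()` for `mean_cl_normal`; `mean_se(x, mult = 1)` returns the mean plus and minus `mult` standard errors. The bars here show group means rather than the raw values, which is my own choice for the illustration.

```r
library(ggplot2)

# Data from the question
treatment <- rep(c(rep(1, 8), rep(2, 8)), 2)
origin    <- rep(c("A", "B"), 16)
time      <- c(rep(5, 16), rep(10, 16))
sulfide   <- c(0, 10, 5, 8, 9, 6, 16, 18, 20, 25, 50, 46, 17, 58, 39, 43,
               20, 25, 50, 46, 17, 58, 39, 43, 100, 120, 103, 104, 150, 160, 200, 180)
Reed <- data.frame(treatment = factor(treatment), origin,
                   time = factor(time), sulfide)

p <- ggplot(Reed, aes(x = treatment, y = sulfide, fill = origin)) +
  # bars: mean sulfide per treatment/origin/time combination
  stat_summary(fun = mean, geom = "bar",
               position = position_dodge(width = 0.9)) +
  # error bars: mean +/- 1 standard error, with arguments passed via fun.args
  stat_summary(fun.data = mean_se, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.2,
               position = position_dodge(width = 0.9)) +
  facet_grid(~ time) +
  theme_bw()

bars <- layer_data(p, 2)  # computed ymin/ymax for each of the 8 combinations
p
```

Inspecting `bars` is a quick way to confirm that the limits agree with a summary table computed by hand.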

Plotting lines and the group aesthetic in ggplot2

This question follows on from an earlier question and its answers.
First some toy data:
df = read.table(text =
"School Year Value
A 1998 5
B 1999 10
C 2000 15
A 2000 7
B 2001 15
C 2002 20", sep = "", header = TRUE)
The original question asked how to plot Value-Year lines for each School. The answers more or less correspond to p1 and p2 below. But also consider p3.
library(ggplot2)
(p1 <- ggplot(data = df, aes(x = Year, y = Value, colour = School)) +
geom_line() + geom_point())
(p2 <- ggplot(data = df, aes(x = factor(Year), y = Value, colour = School)) +
geom_line(aes(group = School)) + geom_point())
(p3 <- ggplot(data = df, aes(x = factor(Year), y = Value, colour = School)) +
geom_line() + geom_point())
Both p1 and p2 do the job. The difference between p1 and p2 is that p1 treats Year as numeric whereas p2 treats Year as a factor. Also, p2 contains a group aesthetic in geom_line. But when the group aesthetic is dropped as in p3, the lines are not drawn.
The question is: Why is the group aesthetic necessary when the x-axis variable is a factor but the group aesthetic is not needed when the x-axis variable is numeric?
In the words of Hadley himself:
The important thing [for a line graph with a factor on the horizontal axis] is to manually specify the grouping. By
default ggplot2 uses the combination of all categorical variables in
the plot to group geoms - that doesn't work for this plot because you
get an individual line for each point. Manually specify group = 1
indicates you want a single line connecting all the points.
You can actually group the points in very different ways as demonstrated by koshke here
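The default grouping can be inspected directly. The sketch below rebuilds p2 and p3 from the question and counts the distinct groups that ggplot2 assigns to the line layer; `n_groups` is a small helper I made up for this purpose.

```r
library(ggplot2)

df <- read.table(text = "School Year Value
A 1998 5
B 1999 10
C 2000 15
A 2000 7
B 2001 15
C 2002 20", header = TRUE)

# p2: explicit grouping by School; p3: default grouping only
p2 <- ggplot(df, aes(x = factor(Year), y = Value, colour = School)) +
  geom_line(aes(group = School)) + geom_point()
p3 <- ggplot(df, aes(x = factor(Year), y = Value, colour = School)) +
  geom_line() + geom_point()

# Hypothetical helper: number of distinct groups in the first (line) layer
n_groups <- function(p) length(unique(layer_data(p, 1)$group))

n_groups(p2)  # one group per School
n_groups(p3)  # one group per School x factor(Year) combination
```

In p3 both `factor(Year)` and `School` are discrete, so their interaction defines the groups and every point ends up alone in its group, leaving nothing for `geom_line()` to connect.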
