R ggplot boxplot group plots by combining factors - r

I have some water quality (metals) results that are taken in June and December of each year. My current df has Month, Year, Detection. I would like to group by each test, i.e. June 2019, December 2019 and June 2020. I could create a new factor, say Test, with values of 0619, 1219, 0620. I could also create a new factor from (Month, Year) for each value.
Before that I was wondering if geom_boxplot could combine factor of Month, Year to accomplish plotting the 3 unique tests. Grouping by Year or Month will not give me the 3 unique tests.
I am looking for a call syntax solution before the new factor route.
ggplot(data = Agm, aes(x = Month+Year, y = Level) , na.rm=TRUE) +
ggtitle("Lead Levels",subtitle=subtext )+
xlab("Test") + ylab("ppb") +
geom_boxplot( fill="red",width = 0.8) + theme_bw()

If I understand correctly, you want to display a boxplot using two columns of factors (Month and Year).
There are a couple of ways you can accomplish this. Firstly, you can simply paste your columns together within the ggplot call, for example:
ggplot(data = Agm, aes(x = paste(Year, Month), y = Level)) +
geom_boxplot() + theme_bw()
In this situation though I usually create a new column and use that as the variable for the X axis. This will allow you more flexibility in managing the values and how they display. For example:
library(tidyverse)
# Create a new Date column, combining year and month, separated by a -
Agm <- Agm %>% mutate(Date = paste(Year, Month, sep = "-")) %>% arrange(Date)
ggplot(data = Agm, aes(x = Date, y = Level)) +
geom_boxplot() + theme_bw()
Note, when using either method above I would suggest that you join based on the year first, and then the month as I have done, so that it doesn't order the data incorrectly on your plot. If you do month first, then January for all the years will be displayed first/left most, then February or October, depending if you have leading zeros or not.
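If Month is stored as a name rather than a number, pasting year first still sorts alphabetically within a year ("2019 December" lands before "2019 June"). A minimal sketch of one way around that, assuming a made-up data frame `Agm_demo` that stands in for `Agm` (the values and the helper names `key` and `label` are hypothetical), is to set the factor levels from a chronological sort key:

```r
# Hypothetical stand-in for Agm: three tests, four readings each
Agm_demo <- data.frame(
  Month = rep(c("June", "December", "June"), each = 4),
  Year  = rep(c(2019, 2019, 2020), each = 4),
  Level = c(3.1, 2.8, 3.5, 4.0, 2.2, 2.9, 3.3, 2.5, 4.1, 3.8, 3.0, 3.6)
)

# Sort key: year first, then the calendar position of the month name
key <- Agm_demo$Year * 100 + match(Agm_demo$Month, month.name)

# One label per test; levels follow the chronological key, not the alphabet
label <- paste(Agm_demo$Month, Agm_demo$Year)
Agm_demo$Test <- factor(label, levels = unique(label[order(key)]))

levels(Agm_demo$Test)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(Agm_demo, aes(x = Test, y = Level)) +
    geom_boxplot() + theme_bw()
}
```

Calling print(p) draws the boxplots in test order. The same idea works with numeric months, in which case the match() step is not needed.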

Related

Specify limits to Date axis that crosses the new year

I have this dataset I am working with where I am plotting monthly
summaries. A problem I have encountered in ggplot2 is to let the x axis go from say month 10 to 12 and then continue onwards with months 1 to say 4. In the example below I show this
with a 20 year dataset where I remove months May to September and plot the rest.
library(lubridate)
library(ggplot2)
mon=seq.Date(from=as.Date("2000-01-01"),to=as.Date("2019-12-01"),by="month")
val=rnorm(length(mon))
dd=data.frame(mon,val)
ddsub=subset(dd,month(mon)<5 |month(mon) >9)
ggplot(data=ddsub,aes(month(mon),val,group=month(mon))) + geom_boxplot() +
xlab("Month") + scale_x_continuous(breaks=c(1:12))
What I would like is for the x axis to start in Oct and to continue past the end of year to Apr.
Since month(ddsub$mon) returns a numeric resulting in a continuous horizontal axis, I have not found any neat way of breaking the ascending numerical order.
My only solution is to define the months as factors that I then reorder in the right way
mon_factor=as.factor(month(ddsub$mon))
ddsub$mon_ahead=reorder(mon_factor,rep(c(4,5,6,7,1,2,3),20))
ggplot(data=ddsub,aes(mon_ahead,val)) + geom_boxplot() + xlab("Month")
While this works, I don't find it an elegant solution. It is cumbersome to have to
define a new month variable and then reorder it.
Does anyone know if there is a way of working with the Date-objects directly and define
the limits of the axis so that it begins in Oct and ends in Apr ?
I think using a factor will be simplest here, and you can automate the ordering using a helper column like mo_FY below, which makes October month 1 of the fiscal year. I like the syntax of forcats::fct_reorder to establish the ordering.
ddsub$mo_FY = (month(ddsub$mon) + 2) %% 12 + 1
ddsub$mon_fct = forcats::fct_reorder(factor(month(ddsub$mon)), ddsub$mo_FY)
ggplot(data=ddsub, aes(mon_fct, val)) +
geom_boxplot() +
xlab("Month")
If you want to avoid creating a factor, you can do it on the fly with the modulus operator and creative labels:
library(dplyr) # provides %>%
ddsub %>%
ggplot() +
geom_boxplot(aes(x = (month(mon)+2) %% 12, y = val, group = month(mon))) +
xlab("Month") +
scale_x_continuous(breaks = 0:6, labels = month(c(10:12,1:4), label = TRUE))
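To see why the modulus gives the right order, here is a small base R check of where each plotted month lands on the shifted axis. The names `months_plotted`, `pos`, `start` and `pos_general` are introduced only for this illustration:

```r
months_plotted <- c(10:12, 1:4)            # Oct..Dec, then Jan..Apr

# Same expression as in the aes() call: October becomes 0, April becomes 6
pos <- (months_plotted + 2) %% 12
setNames(pos, month.abb[months_plotted])

# Equivalent form for an arbitrary starting month; R's %% returns a
# non-negative result for a positive divisor, so Jan (1 - 10 = -9) maps to 3
start <- 10
pos_general <- (months_plotted - start) %% 12
```

Changing `start` would shift the axis to begin at a different month, with the breaks and labels adjusted to match.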

How to plot bar chart of monthly deviations from annual mean?

SO!
I am trying to create a plot of monthly deviations from annual means for temperature data using a bar chart. I have data across many years and I want to show the seasonal behavior in temperatures between months. The bars should represent the deviation from the annual average, which is recalculated for each year. Here is an example that is similar to what I want, only it is for a single year:
My data is sensitive so I cannot share it yet, but I made a reproducible example using the txhousing dataset (it comes with ggplot2). The salesdiff column is the deviation between monthly sales (averaged across all cities) and the annual average for each year. Now the problem is plotting it.
library(ggplot2)
df <- aggregate(sales~month+year,txhousing,mean)
df2 <- aggregate(sales~year,txhousing,mean)
df2$sales2 <- df2$sales #RENAME sales
df2 <- df2[,-2] #REMOVE sales
df3<-merge(df,df2) #MERGE dataframes
df3$salesdiff <- df3$sales - df3$sales2 #FIND deviation between monthly and annual means
#plot deviations
ggplot(df3,aes(x=month,y=salesdiff)) +
geom_col()
My ggplot is not looking good at the moment-
Somehow it is stacking the columns for each month with all of the data across the years. Ideally the date would be along the x-axis spanning many years (I think the dataset is from 2000-2015...), and different colors depending on if salesdiff is higher or lower. You are all awesome, and I would welcome ANY advice!!!!
Probably the main issue here is that geom_col() will not take on different aesthetic properties unless you explicitly tell it to. One way to get what you want is to use two calls to geom_col() to create two different bar charts that will be combined together in two different layers. Also, you're going to need to create date information which can be easily passed to ggplot(); I use the lubridate package for this task.
Note that we combine the "month" and "year" columns here, and then use ymd() to obtain date values. I chose not to convert the double valued "date" column in txhousing using something like date_decimal(), because sometimes it can confuse February and January months (e.g. Feb 1 gets "rounded down" to Jan 31).
I decided to plot a subset of the txhousing dataset, which is a lot more convenient to display for teaching purposes.
Code:
library("tidyverse")
library("ggplot2")
library("lubridate") # provides ymd(); tidyverse only attaches it from version 2.0.0
# subset txhousing to just years >= 2011, and calculate nested means and dates
housing_df <- filter(txhousing, year >= 2011) %>%
group_by(year, month) %>%
summarise(monthly_mean = mean(sales, na.rm = TRUE),
date = first(date)) %>%
mutate(yearmon = paste(year, month, sep = "-"),
date = ymd(yearmon, truncated = 1), # create date column
salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
higherlower = case_when(salesdiff >= 0 ~ "higher", # for fill aes later
salesdiff < 0 ~ "lower"))
ggplot(data = housing_df, aes(x = date, y = salesdiff, fill = as.factor(higherlower))) +
geom_col() +
scale_x_date(date_breaks = "6 months",
date_labels = "%b-%Y") +
scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
theme_bw()+
theme(legend.position = "none") # remove legend
Plot:
You can see the periodic behaviour here nicely; an increase in sales appears to occur every spring, with sales decreasing during the fall and winter months. Do keep in mind that you might want to reverse the colours I assigned if you want to use this code for temperature data! This was a fun one - good luck, and happy plotting!
Something like this should work?
Basically you need to create a binary variable that lets you change the color (fill) if salesdiff is positive or negative, called below factordiff.
Plus you needed a date variable for month and year combined.
library(ggplot2)
library(dplyr)
df3$factordiff <- ifelse(df3$salesdiff>0, 1, 0) # factor variable for colors
df3 <- df3 %>%
mutate(date = sprintf("%d-%02d", year, month)) # zero-padded label such as "2001-01", so the text sorts in calendar order
#plot deviations
ggplot(df3,aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col()
Of course this results in a hard-to-read plot because you have lots of dates. You can subset it and show only a restricted time span:
df3 %>%
filter(date >= "2014-01") %>% # keep data from January 2014 onwards
ggplot(aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adds label rotation
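The deviation step itself does not need two aggregate() calls and a merge. A minimal sketch using base R's ave(), which returns each group's mean aligned to the original rows; the data frame `monthly` holds made-up numbers standing in for the monthly means:

```r
# Made-up monthly means for two years (stand-in for the aggregated txhousing data)
monthly <- data.frame(
  year  = rep(c(2014, 2015), each = 12),
  month = rep(1:12, times = 2),
  sales = c(seq(100, 210, by = 10), seq(120, 230, by = 10))
)

# ave() gives every row its own year's mean, so no merge is needed
monthly$salesdiff <- monthly$sales - ave(monthly$sales, monthly$year)

# A real Date for the x-axis keeps the months in calendar order
monthly$date <- as.Date(sprintf("%d-%02d-01", monthly$year, monthly$month))

# Deviations within each year sum to zero by construction
tapply(monthly$salesdiff, monthly$year, sum)
```

With a Date column on the x-axis, scale_x_date() can then control the breaks and labels instead of relying on text labels.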

Time series from three years in one plot

I am struggling (due to lack of knowledge and experience) to create a plot in R with time series from three different years (2009, 2013 and 2017). Failing to solve this problem by searching online has led me here.
I wish to create a plot that shows change in nitrate concentrations over the course of May to October for all years, but keep failing since the x-axis is defined by one specific year. I also receive errors because the x-axis lengths differ (due to different number of samples). To solve this I have tried making separate columns for month and year, with no success.
Data example:
date NO3.mg.l year month
2009-04-22 1.057495 2009 4
2013-05-08 1.936000 2013 5
2017-05-02 2.608000 2017 5
Code:
ggplot(nitrat.all, aes(x = date, y = NO3.mg.l, colour = year)) + geom_line()
This code produces a plot where the lines are positioned next to one another, whilst I want a plot where they overlay one another. Any help will be much appreciated.
Nitrate plot
This may help with the plotting:
library("lubridate")
library("ggplot2")
# example of data with some points for each year
nitrat.all <- data.frame(date = c(ymd("2009-03-21"), ymd("2009-04-22"), ymd("2009-05-27"),
ymd("2010-03-15"), ymd("2010-04-17"), ymd("2010-05-10")), NO3.mg.l = c(1.057495, 1.936000, 2.608000,
3.157495, 2.336000, 3.908000))
nitrat.all$year <- format(nitrat.all$date, format = "%Y")
ggplot(data = nitrat.all) +
geom_point(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year)) +
geom_line(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year))
As for selecting of the dates corresponding to a certain month, you may subset your data frame by a condition using basic R-functions:
n_month1 <- 3 # an index of the first month of the period to select
n_month2 <- 4 # an index of the last month of the period to select
test_for_month <- (as.numeric(format(nitrat.all$date, format = "%m")) >= n_month1) &
(as.numeric(format(nitrat.all$date, format = "%m")) <= n_month2)
nitrat_to_plot <- nitrat.all[test_for_month, ]
Another, quite elegant, approach is to use filter() from the dplyr package
nitrat.all$month <- as.numeric(format(nitrat.all$date, format = "%m"))
library("dplyr")
nitrat_to_plot <- filter(nitrat.all, ((month >= n_month1) & (month <= n_month2)))
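Mapping "%m-%d" strings to x gives a discrete axis, so samples are spaced evenly no matter how many days lie between them. One possible alternative, sketched here on made-up values, is to put the numeric day of year on the x-axis. The data frame `samples` and the column `doy` are names introduced for this illustration:

```r
# Made-up samples from the three years in the question
samples <- data.frame(
  date = as.Date(c("2009-04-22", "2009-06-10", "2013-05-08",
                   "2013-07-01", "2017-05-02", "2017-08-15")),
  NO3.mg.l = c(1.06, 1.40, 1.94, 2.10, 2.61, 2.30)
)

samples$year <- factor(format(samples$date, "%Y"))       # one line per year
samples$doy  <- as.numeric(format(samples$date, "%j"))   # day of year, 1..366

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(samples, aes(x = doy, y = NO3.mg.l, colour = year, group = year)) +
    geom_line() + geom_point()
}
```

Because year is a factor, each year gets its own discrete colour and the lines overlay one another on a shared seasonal axis; print(p) draws the plot.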

R - How to create a seasonal plot - Different lines for years

I already asked the same question yesterday, but I haven't had any suggestions so far, so I decided to delete the old one and ask again with additional information.
So here again:
I have a dataframe like this:
Link to the original dataframe: https://megastore.uni-augsburg.de/get/JVu_V51GvQ/
Date DENI011
1 1993-01-01 9.946
2 1993-01-02 13.663
3 1993-01-03 6.502
4 1993-01-04 6.031
5 1993-01-05 15.241
6 1993-01-06 6.561
....
....
6569 2010-12-26 44.113
6570 2010-12-27 34.764
6571 2010-12-28 51.659
6572 2010-12-29 28.259
6573 2010-12-30 19.512
6574 2010-12-31 30.231
I want to create a plot that enables me to compare the monthly values in the DENI011 over the years. So I want to have something like this:
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Seasonal%20Plot
Jan-Dec on the x-scale, values on the y-scale and the years displayed by different colored lines.
I found several similar questions here, but nothing works for me. I tried to follow the instructions on the website with the example, but the problem is that I can't create a ts-object.
Then I tried it this way:
Ref_Data$MonthN <- as.numeric(format(as.Date(Ref_Data$Date),"%m")) # Month's number
Ref_Data$YearN <- as.numeric(format(as.Date(Ref_Data$Date),"%Y"))
Ref_Data$Month <- months(as.Date(Ref_Data$Date), abbreviate=TRUE) # Month's abbr.
g <- ggplot(data = Ref_Data, aes(x = MonthN, y = DENI011, group = YearN, colour=YearN)) +
geom_line() +
scale_x_discrete(breaks = Ref_Data$MonthN, labels = Ref_Data$Month)
That also didn't work; the plot looks horrible. I don't need to put all the years from 1993-2010 in one plot. Actually only a few years would be OK, maybe 1998-2006.
Any suggestions on how to solve this?
As others have noted, in order to create a plot such as the one you used as an example, you'll have to aggregate your data first. However, it's also possible to retain daily data in a similar plot.
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-11
library(tidyverse)
library(lubridate)
# Import the data
url <- "https://megastore.uni-augsburg.de/get/JVu_V51GvQ/"
raw <- read.table(url, stringsAsFactors = FALSE)
# Parse the dates, and use lower case names
df <- as_tibble(raw) %>%
rename_all(tolower) %>%
mutate(date = ymd(date))
One trick to achieve this would be to set the year component in your date variable to a constant, effectively collapsing the dates to a single year, and then controlling the axis labelling so that you don't include the constant year in the plot.
# Define the plot
p <- df %>%
mutate(
year = factor(year(date)), # use year to define separate curves
date = update(date, year = 1) # use a constant year for the x-axis
) %>%
ggplot(aes(date, deni011, color = year)) +
scale_x_date(date_breaks = "1 month", date_labels = "%b")
# Raw daily data
p + geom_line()
In this case though, your daily data are quite variable, so this is a bit of a mess. You could hone in on a single year to see the daily variation a bit better.
# Hone in on a single year
p + geom_line(aes(group = year), color = "black", alpha = 0.1) +
geom_line(data = function(x) filter(x, year == 2010), size = 1)
But ultimately, if you want to look at several years at a time, it's probably a good idea to present smoothed lines rather than raw daily values. Or, indeed, some monthly aggregate.
# Smoothed version
p + geom_smooth(se = F)
#> `geom_smooth()` using method = 'loess'
#> Warning: Removed 117 rows containing non-finite values (stat_smooth).
There are multiple values from one month, so when plotting your original data, you got multiple points in one month. Therefore, the line looks strange.
If you want to create something similar to the example you provided, you have to summarize your data by year and month. Below I calculated the mean of each year and month for your data. In addition, you need to convert your year and month to factors if you want to plot them as discrete variables.
library(dplyr)
Ref_Data2 <- Ref_Data %>%
group_by(MonthN, YearN, Month) %>%
summarize(DENI011 = mean(DENI011)) %>%
ungroup() %>%
# Convert the Month column to factor variable with levels from Jan to Dec
# Convert the YearN column to factor
mutate(Month = factor(Month, levels = unique(Month)),
YearN = as.factor(YearN))
g <- ggplot(data = Ref_Data2,
aes(x = Month, y = DENI011, group = YearN, colour = YearN)) +
geom_line()
g
If you don't want to add in library(dplyr), this is the base R code. Exact same strategy and results as www's answer.
dat <- read.delim("~/Downloads/df1.dat", sep = " ")
dat$Date <- as.Date(dat$Date)
dat$month <- factor(months(dat$Date, TRUE), levels = month.abb)
dat$year <- gsub("-.*", "", dat$Date)
month_summary <- aggregate(DENI011 ~ month + year, data = dat, mean)
ggplot(month_summary, aes(month, DENI011, color = year, group = year)) +
geom_path()

plotting multiple plot in R for different calendar date

I have about 20 years of daily data in a time series. It has columns Date, rainfall and other data.
I am trying to plot rainfall vs. time. I want to get 20 line plots with different colours in one graph, with a legend that shows the years. I tried the following code but it is not giving me the desired results. Any suggestion to fix my issue would be most welcome
library(ggplot2)
library(seas)
data(mscdata)
p<-ggplot(data=mscdata,aes(x=date,y=precip,group=year,color=year))
p+geom_line()+scale_x_date(labels=date_format("%m"),breaks=date_breaks("1 months"))
It doesn't look great, but here's a method. We first coerce the data into dates in the same year:
mscdata$dayofyear <- as.Date(format(mscdata$date, "%j"), format = "%j")
Then we plot:
library(ggplot2)
library(scales)
p <- ggplot(data = mscdata, aes(x = dayofyear, y = precip, group = year, color = year))
p + geom_line() +
scale_x_date(labels = date_format("%m"), breaks = date_breaks("1 months"))
While I agree with @Jaap that this may not be the best way to depict these data, try the following:
mscdata$doy <- as.numeric(strftime(mscdata$date, format="%j"))
ggplot(data=mscdata,aes(x=doy,y=precip,group=year)) +
geom_line(aes(color=year))
Although the given answers are good answers to your question as it stands, I don't think they will solve your problem. I think you should be looking at a different way to present the data. @Jaap already suggested using facets. Take for example this approach:
#first add a month column to your dataframe
mscdata$month <- format(mscdata$date, "%m")
#then plot it using boxplot with year on the X-axis and month as facet.
p1 <- ggplot(data = mscdata, aes(x = year, y = precip, group=year))
p1 + geom_boxplot(outlier.shape = 3) + facet_wrap(~month)
This will give you a graph per month, showing the rainfall per year next to each other. Because I use a boxplot, the peaks in rainfall show up as dots ('normal' rain events are inside the box).
Another possible approach would be to use stat_summary.
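A rough sketch of what the stat_summary route could look like, assuming ggplot2 3.3.0 or later (where the argument is called `fun`). The data frame `rain` is randomly generated and only stands in for mscdata; `monthly_mean` is computed separately to show the numbers the summary layer would draw:

```r
# Randomly generated daily rainfall for two years (stand-in for mscdata)
set.seed(1)
rain <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2002-12-31"), by = "day")
)
rain$precip <- rexp(nrow(rain), rate = 0.5)
rain$year   <- factor(format(rain$date, "%Y"))
rain$month  <- as.integer(format(rain$date, "%m"))

# The values stat_summary(fun = mean) draws: one mean per year and month
monthly_mean <- aggregate(precip ~ month + year, data = rain, FUN = mean)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rain, aes(x = month, y = precip, colour = year, group = year)) +
    stat_summary(fun = mean, geom = "line") +   # summarise daily values per month
    scale_x_continuous(breaks = 1:12, labels = month.abb)
}
```

This keeps the daily data frame intact and lets the plot layer do the aggregation, giving one line per year across Jan to Dec; print(p) draws it.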
