I have a data frame called "fish" which contains variables such as mass, length and day of the year. I need to make a boxplot of fish length by month, but there is no month variable, only day of the year (i.e. 1:365). How can I group the days into blocks of 30 to represent months and then name them so I can make a boxplot? I have attached a screenshot of the data.
You can use this solution:
# load packages
library(tidyverse)
# make example data frame
n <- 100
tmp <- tibble(year = rep(1994, n), day = 1:n, length_mm = rnorm(n), mass_g = rnorm(n, 5))
# add month column: days 1-30 -> 1, days 31-60 -> 2, and so on
# note: with a full year, days 361-365 end up in a 13th bin
tmp <- tmp %>%
  mutate(month = as.factor(ceiling(day / 30)))
# make plot
tmp %>%
  ggplot(aes(month, length_mm, col = month)) +
  geom_boxplot() +
  theme_bw()
I would add a new column with the full date. Day 1 is 1 January, so subtract 1 before adding the day number to the origin:
as.Date(104 - 1, origin = "2014-01-01")
From that you can get the month:
months(as.Date(104 - 1, origin = "2014-01-01"))
Put together:
df %>% mutate(date = as.Date(day_of_the_year - 1, origin = "2014-01-01"),
month = months(date))
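As a quick check of the date approach, here is a minimal base-R sketch. The `fish` data below is simulated, and its column names (`day`, `length_mm`) as well as the reference year 2014 are assumptions; any non-leap year gives the same mapping.

```r
# Simulated stand-in for the real data; column names are assumptions.
set.seed(1)
fish <- data.frame(day = 1:365, length_mm = rnorm(365, mean = 200, sd = 20))

# Day 1 is 1 January, so subtract 1 before adding to the origin.
fish$date <- as.Date(fish$day - 1, origin = "2014-01-01")

# Month number as a factor (locale-independent, unlike months()).
fish$month <- factor(as.integer(format(fish$date, "%m")), levels = 1:12)

# One box per calendar month.
boxplot(length_mm ~ month, data = fish, xlab = "month", ylab = "length (mm)")
```

Using the month number rather than the month name keeps the boxes in calendar order without any extra releveling.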
I'm trying to create labels in ggplot with geom_point and geom_text that label only the first quarter of each year, with no labels in between. The labelled 'dots' should just give the year.
So 2009 + + + 2010 + + + 2011 and so on. I'm sure there is a simple way to do this, but I'm stuck!
I should add that I'm looking to do this "programmatically" and not manually, as I will be doing this for multiple data-sets and several years. Appreciate any help!
CODE:
library(tidyverse)
library(pxweb)
library(lubridate)
# Download data from Statistic Sweden
# PXWEB query
pxweb_query_list <-
list("Kon"=c("1+2"),
"Alder"=c("tot16-64"),
"Arbetskraftstillh"=c("ALÖS"),
"ContentsCode"=c("0000062S"),
"Tid"=c("*"))
# Download data on unemployment and vacancy rates
unemp <-
pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/en/ssd/AM/AM0403/AM0403A/NAKUBefolkning2KTD",
query = pxweb_query_list)
# Convert to data.frame
unemp <- as.data.frame(unemp, column.name.type = "text", variable.value.type = "text")
# PXWEB query
pxlist2 <-
list("Antanstallda"=c("TOT"),
"ContentsCode"=c("AM0701F4"),
"Tid"=c("*"))
vacancy <-
pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/en/ssd/AM/AM0701/AM0701D/KVRekVakStor",
query = pxlist2)
# Convert to data.frame
vacancy <- as.data.frame(vacancy, column.name.type = "text", variable.value.type = "text")
rm(pxlist2, pxweb_query_list)
# Fix date-format
unemp$quarter <- yq(unemp$quarter)
vacancy$quarter <- yq(vacancy$quarter)
# Join df's
final <- inner_join(unemp, vacancy, by = "quarter")
# clean names
final <- final %>% janitor::clean_names()
# rename etc
final <- final %>%
mutate(unemp = percent) %>%
mutate(date = quarter) %>%
select(date, unemp, vacancy_rate)
# Separate data into two periods
final <- final %>%
mutate(period =
case_when(
date>=ymd('2002-04-01') & date<ymd('2008-01-01') ~ "one",
date>=ymd('2008-01-01') & date<=ymd('2015-01-01') ~ "two"))
ggplot(final) +
aes(x = unemp,
y = vacancy_rate,
group = period,
color = period,
label = date) +
geom_point() +
geom_path() +
geom_text() +
facet_wrap(~period) +
theme_gray()
The result currently looks like this (it's mid-process, I know it looks ugly):
Often the simplest way to do this is to create another label variable in your data frame and use ifelse() to set the labels you don't want to an empty string "".
See the example below, where I create date_label, which is the year when month == 1 and "" otherwise. Then map label = date_label instead of label = date in aes().
final <- final %>%
mutate(
period = case_when(
date>=ymd('2002-04-01') & date<ymd('2008-01-01') ~ "one",
date>=ymd('2008-01-01') & date<=ymd('2015-01-01') ~ "two"),
date_label = ifelse(month(date) == 1, year(date), "")
)
The graph still looks hella ugly, but at least the labels are now right :)
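For completeness, the new column still has to be mapped to the label aesthetic. Below is a minimal sketch with made-up quarterly data; the `toy` data frame and its values are assumptions, not the Statistics Sweden series.

```r
library(ggplot2)

# Made-up quarterly data standing in for the real series.
toy <- data.frame(
  date         = seq(as.Date("2009-01-01"), by = "quarter", length.out = 12),
  unemp        = c(8.0, 8.6, 8.3, 8.5, 9.2, 8.9, 8.1, 7.9, 8.2, 8.0, 7.4, 7.3),
  vacancy_rate = c(0.6, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.1, 1.0, 1.0)
)

# Year for first-quarter rows, empty string otherwise.
toy$date_label <- ifelse(format(toy$date, "%m") == "01",
                         format(toy$date, "%Y"), "")

p <- ggplot(toy, aes(x = unemp, y = vacancy_rate, label = date_label)) +
  geom_path() +
  geom_point() +
  geom_text(vjust = -0.8)  # empty labels draw nothing
p
```

Because the rule is computed from the date itself, the same code should carry over to other data sets and year ranges without manual edits.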
I am trying to compare different years' variables but I am having trouble plotting them together.
The time series is a temperature series which can be found in https://github.com/gonzalodqa/timeseries as temp.csv
I would like to plot something like the image, but I find it difficult to subset the months between the years and then combine the lines in the same plot under the same months.
If someone can give some advice or point me in the right direction, I would really appreciate it.
You can try this way.
The first chart shows all the available temperatures, the second chart is aggregated by month.
In the first chart, we force the same year so that ggplot will plot them aligned, but we separate the lines by colour.
For the second one, we just use month as x variable and year as colour variable.
Note that:
with scale_x_datetime we can hide the year, so that no one can see that we forced every observation into the year 2020
with scale_x_continuous we can show the names of the months instead of their numbers
[just try running the charts with and without scale_x_... to see what I'm talking about]
month.abb is a built-in constant holding the abbreviated English month names.
# read data
df <- readr::read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
# libraries
library(ggplot2)
library(dplyr)
# line chart by datetime
df %>%
# make datetime: force unique year
mutate(datetime = lubridate::make_datetime(2020, month, day, hour, minute, second)) %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = factor(year))) +
scale_x_datetime(breaks = lubridate::make_datetime(2020,1:12), labels = month.abb) +
labs(title = "Temperature by Datetime", colour = "Year")
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
ggplot() +
geom_line(aes(x = month, y = T42, colour = factor(year))) +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
labs(title = "Average Temperature by Month", colour = "Year")
In case you want your chart to start from July, you can use this code instead:
months_order <- c(7:12,1:6)
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
# create new groups starting from each July
group_by(neworder = cumsum(month == 7)) %>%
# keep only complete years
filter(n() == 12) %>%
# give new names to groups
mutate(years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# reorder months
mutate(month = factor(month, levels = months_order, labels = month.abb[months_order], ordered = TRUE)) %>%
# plot
ggplot() +
geom_line(aes(x = month, y = T42, colour = years, group = years)) +
labs(title = "Average Temperature by Month", colour = "Year")
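The grouping step `cumsum(month == 7)` is the part that is easiest to misread, so here is a minimal base-R sketch of what it does on a made-up run of months (the vector `months_seen` is an assumption for illustration only):

```r
# Months of a series running from March 2018 to August 2019.
months_seen <- c(3:12, 1:8)

# Each July bumps the counter, so a new group starts there.
season <- cumsum(months_seen == 7)
season

# Group sizes: only groups with 12 months are complete July-to-June seasons.
table(season)
```

Here the first group (March to June) and the last group (July and August) are incomplete, which is why the `filter(n() == 12)` step drops them.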
EDIT
To have something similar to the first plot but starting from July, you could use the following code:
# libraries
library(ggplot2)
library(dplyr)
library(lubridate)
# custom months order
months_order <- c(7:12,1:6)
# fake dates for plot
# note: year 4 is used because it is a leap year, so 29 Feb is included
dates <- make_datetime(c(rep(3,6), rep(4,6)), months_order)
# line chart by datetime
df %>%
# create date time
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
# filter years of interest
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
# create increasing group after each july
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
# force unique years and create custom name
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# plot
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(title = "Temperature by Datetime", colour = "Year")
To order the months differently and aggregate the values over pairs of years, you have to work on your data a bit before plotting:
library(dplyr) # work data
library(ggplot2) # plots
library(lubridate) # date
library(readr) # fetch data
# your data
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df %>%
mutate(date = make_date(year, month,day)) %>%
# reorder month
# month.abb gives English abbreviations regardless of the system locale
group_by(month_2 = factor(month.abb[month(date)],
levels = month.abb[c(7:12, 1:6)]),
# group years as you like
year_2 = ifelse( year(date) %in% (2018:2019), '2018/2019', '2020/2021')) %>%
# you can put whatever aggregation function you need
summarise(val = mean(T42, na.rm = T)) %>%
# plot it!
ggplot(aes(x = month_2, y = val, color = year_2, group = year_2)) +
geom_line() +
ylab('T42') +
xlab('month') +
theme_light()
A slightly different solution, without the trick of forcing all dates to 2020.
library(tidyverse)
library(lubridate)
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df <- df |>
filter(year %in% c(2018, 2019, 2020)) |>
mutate(year = factor(year),
# zero-pad month and day to two digits so the labels sort chronologically
month_day = sprintf("%02d-%02d", month, day))
df |> ggplot(aes(x=month_day, y=T42, group=year, col=year)) +
geom_line() +
scale_x_discrete(breaks = c("01-01", "02-01", "03-01", "04-01", "05-01", "06-01", "07-01", "08-01", "09-01", "10-01", "11-01", "12-01"))
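A discrete axis orders its labels as text, so month-day labels only come out in calendar order if both parts are zero-padded to a fixed width. A small base-R sketch of the difference, using made-up day numbers:

```r
day <- c(1, 2, 10, 20)

# Prefixing every day with "0" gives labels of unequal width.
padded_always <- paste0("01-", paste0(0, day))
sort(padded_always)

# sprintf("%02d") pads to exactly two digits.
padded_fixed <- sprintf("%02d-%02d", 1, day)
sort(padded_fixed)
```

With unequal widths, day 10 sorts before day 2, which would scramble the line along the x-axis.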
I have a monthly time series with sales in this format (so there's no month or year column):
ts(data = Datos, start = c(2015,1), end = c(2020,12), frequency = 12)
How can I plot a multi-boxplot by month?
If you want to use boxplots to display the variation for each month across the six years covered (2015 to 2020), you can try:
library(tidyverse)
library(lubridate) # provides month()
library(tsibble)
ts(data = sample(100), start = c(2015,1), end = c(2020,12), frequency = 12) %>%
as_tsibble() %>%
mutate(month = as.factor(month(index))) %>%
ggplot(aes(month, value)) +
geom_boxplot()
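If you would rather stay in base R, `cycle()` returns the position of each observation within the year, which for a monthly `ts` is the month number. A minimal sketch with simulated values (the `sales` object is a placeholder for your own series):

```r
set.seed(42)
# Simulated monthly series, January 2015 to December 2020 (72 values).
sales <- ts(rnorm(72, mean = 100, sd = 10), start = c(2015, 1), frequency = 12)

# cycle() gives 1..12 for a monthly series; use it as the grouping factor.
month <- factor(as.integer(cycle(sales)), levels = 1:12, labels = month.abb)

# One box per month, each built from the six yearly values.
boxplot(as.numeric(sales) ~ month, xlab = "month", ylab = "sales")
```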
I have crime data of the years 2018-2020. Each row represents one crime. For the sake of this example let's assume that there are two variables crimetype (e.g. theft, robbery) and date (when the crime was committed).
Some sample data:
data <- data.frame(date= sample(seq(as.Date('2018/01/01'), as.Date('2020/12/31'), by="day"),10000, replace=T),
crimetype = sample(c("A", "B", "C"), 100000, replace=T))
My goal is to create a line plot for, let's say, type "A" crimes. The x-axis should show the date (from January 1st to December 31st) and the y-axis the number of crimes per day. However, I want the three lines (one for each year) to be drawn on top of each other so that I can compare them, so there should be no year on the x-axis, or at least it should not be displayed.
^ . . . . . .
| . . .
| . . .
n | . 2018
| - - -
| - - - - - - - - 2019
| = = =
| = = = = = = = = 2020
|
------------------------------------->
Jan-1 Dec-31
I was trying to create a new date-variable with all the dates in the same year (here 2020).
data <- data %>% mutate(daymonth = substr(date, 5, length(date)),
date_new = as.Date(paste("2020", daymonth, sep="")),
daymonth = NULL)
Is there a better way to do this and how can I plot the graph?
data_plot <- data %>% filter(crimetype == 'A')
ggplot(data = data_plot, aes(x = date_new, y = ?, color=format(date, "%Y")) + geom_line()
For working with dates, have a look at the lubridate package, which I use here to extract the year. You can also get rid of the year by using format(date, "%d-%m"). The following approach is a bit of a hack: to use a date axis but still get rid of the year, I set the year of all dates to 2018. As for which variable to plot, simply count the observations to get the number of crimes per date. Finally, I set the breaks of the date axis to one month; adjust this as you like. Note that 2018 is not a leap year, so 29 Feb 2020 cannot be converted and becomes NA, which is most likely the row the warning below refers to. Try this:
library(ggplot2)
library(dplyr)
library(lubridate)
data <- data.frame(date= sample(seq(as.Date('2018/01/01'), as.Date('2020/12/31'), by="day"),10000, replace=T),
crimetype = sample(c("A", "B", "C"), 100000, replace=T))
data_plot <- data %>%
mutate(
year = lubridate::year(date),
year = factor(year),
# A hack. Set year to 2018. Allows me to use a date axis
date_foo = as.Date(paste(2018, format(date, "%m-%d"), sep = "-"))) %>%
filter(crimetype == 'A') %>%
count(date, date_foo, year, crimetype)
ggplot(data = data_plot, aes(x = date_foo, y = n, color = year, group = year)) +
geom_line() +
scale_x_date(date_breaks = "1 month", date_labels = "%d-%m")
#> Warning: Removed 1 row(s) containing missing values (geom_path).
Created on 2020-03-28 by the reprex package (v0.3.0)
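One variation worth considering: if the dummy year is a leap year, 29 February is kept rather than dropped. A minimal base-R sketch, where the choice of 2020 as the dummy year and the three example dates are my assumptions:

```r
dates <- as.Date(c("2018-03-01", "2019-12-31", "2020-02-29"))

# Swap the real year for a leap year so every month-day combination is valid.
date_foo <- as.Date(format(dates, "2020-%m-%d"))
date_foo

# The real year is kept separately for colouring the lines.
year <- factor(format(dates, "%Y"))
```

The rest of the pipeline (filter, count, plot with `scale_x_date`) would stay the same; only the dummy year changes.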
I wish to plot the frequency of subscribers over time using start and end date.
I have a method that creates a row for each day per subscriber, then calculates the frequency per day, then plots the frequency by day.
This works fine for small data but does not scale to large subscriber numbers because the rows per customer step is too big.
Is there an efficient method? Many thanks for any help.
library(ggplot2)
library(dplyr)
# create dummy dataset
subscribers <- data.frame(id = seq(1:10),
start = sample(seq(as.Date('2016/01/01'), as.Date('2016/06/01'), by="day"), 10),
end = sample(seq(as.Date('2017/01/01'), as.Date('2017/06/01'), by="day"), 10))
# creates a row for each day per user - OK for small datasets, but not scalable
date_map <- Map(seq, subscribers$start, subscribers$end, by = "day")
date_rows <- data.frame(
org = rep.int(subscribers$id, vapply(date_map, length, 1L)),
date = do.call(c, date_map))
# finds the frequency of users for each day
date_rows %>%
group_by(date) %>%
dplyr::summarise(users = n()) -> plot_data
ggplot(data = plot_data,
aes(x = date, y = users)) +
geom_line(size = 1.2,alpha = .6)
How's this?
library(tidyverse)
df <- subscribers %>%
gather(key, value, start, end) %>%
mutate(key = ifelse(key == "start",1,-1)) %>%
arrange(value)
df$cum <- cumsum(df$key)
ggplot(data = df,
aes(x = value, y = cum)) +
geom_step()
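The idea behind this answer is to treat each start as a +1 event and each end as a -1 event, then take a running sum, which needs only two rows per subscriber. The base-R sketch below checks that idea against a direct count on a tiny made-up table. One assumption of mine: the -1 is placed on the day after `end`, so that a subscriber still counts on their end date, as in the question's row-per-day method.

```r
subscribers <- data.frame(
  id    = 1:3,
  start = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05")),
  end   = as.Date(c("2016-01-04", "2016-01-06", "2016-01-05"))
)

# Two events per subscriber: +1 on the start date, -1 on the day after the end.
events <- data.frame(
  date   = c(subscribers$start, subscribers$end + 1),
  change = rep(c(1, -1), each = nrow(subscribers))
)

# Net change per date, then the running total of active subscribers.
daily <- aggregate(change ~ date, data = events, FUN = sum)
daily$users <- cumsum(daily$change)
daily

# Direct check: count subscribers active on each event date.
brute <- vapply(seq_along(daily$date), function(i) {
  sum(subscribers$start <= daily$date[i] & subscribers$end >= daily$date[i])
}, integer(1))
```

The number of rows grows with the number of subscribers rather than with subscribers times days, which is what makes this scale.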