Having trouble plotting means of subgroups of data - r

I have data describing a series of observations (sound level) grouped by date and hour. I want to plot the mean sound level per hour for each day, with sound level on the Y axis, hour on the X axis, and one line for each day. Example data:
Hour Date SPL
1 18-May 107.9868
2 18-May 106.5656
1 19-May 107.4321
2 19-May 107.8993
I have played around with the group_by function, but I'm not sure how to do any better than this:
spl_mean <- group_by(sound, Hour) %>%
summarize(count = n(), Mean = mean(SPL, na.rm = T))
ggplot(data=spl_mean) + geom_line(aes(x = Hour, y = Mean, group = 1), size = 2)
This obviously just gives the mean SPL by hour, but it doesn't preserve the per-day subgroups.

Use dplyr to calculate the mean per day and hour, and then ggplot2 to plot the result.
df %>%
group_by(Date, Hour) %>%
# one mean per Date/Hour combination
summarise(SPL_mean = mean(SPL, na.rm = TRUE)) %>%
arrange(Date, Hour) %>%
# one line per day
ggplot(aes(x = Hour, y = SPL_mean, color = Date, group = Date)) + geom_line()
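As a quick check of the grouping logic, here is a minimal sketch on the four example rows from the question. The two extra rows are made-up repeat readings (an assumption, not from the original data), added only so that each mean has something to average. The names `sound` and `spl_mean` follow the question; `p` is just a placeholder for the plot object.

```r
library(dplyr)
library(ggplot2)

# Example rows from the question, plus two made-up repeat readings
# (hypothetical values) so that some means average more than one row.
sound <- data.frame(
  Hour = c(1, 2, 1, 2, 1, 2),
  Date = c("18-May", "18-May", "19-May", "19-May", "18-May", "19-May"),
  SPL  = c(107.9868, 106.5656, 107.4321, 107.8993, 108.0132, 108.1007)
)

# One mean per Date/Hour combination
spl_mean <- sound %>%
  group_by(Date, Hour) %>%
  summarise(SPL_mean = mean(SPL, na.rm = TRUE), .groups = "drop")

# One line per day: Date is mapped to both colour and group
p <- ggplot(spl_mean, aes(x = Hour, y = SPL_mean, colour = Date, group = Date)) +
  geom_line()
```

Call print(p) to draw the plot. Grouping by both Date and Hour is what keeps the days apart; grouping by Hour alone collapses them into a single line.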

Using ggplot, where D is day, H is hour and V is volume:
# setup for demo
library('tidyverse')
df <- tibble(
'D' = c(1:5,1:5),
'H' = rep(c(1,2), each = 5),
'V' = rnorm(10, 100, 5))
# Figure
ggplot(data = df) +
geom_line(mapping = aes(x = H, y = V, group = D, color = D))
This is telling it to do hour on the x, volume on the y and plot different days individually.
If you need to get per day and hour means first then group_by day and hour, then summarise:
df %>%
group_by(D, H) %>%
summarize(MV = mean(V)) %>%
ggplot() +
geom_line(mapping = aes(x = H, y = MV, group = D, color = D))
Then go make it pretty with labs, theme, scales etc.
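A possible version of that "make it pretty" step, as a rough sketch on the same kind of demo data. The axis labels, the theme and the conversion of D to a factor are my own choices (assumptions), not something the question asked for. Converting D to a factor gives one discrete colour per day instead of a continuous colour bar.

```r
library(dplyr)
library(ggplot2)

# Demo data in the same shape as above (random values, seed fixed for repeatability)
set.seed(42)
df <- tibble::tibble(
  D = rep(1:5, times = 2),
  H = rep(c(1, 2), each = 5),
  V = rnorm(10, 100, 5)
)

# Treat day as a category so that each day gets its own legend entry
plot_df <- df %>%
  mutate(D = factor(D))

p <- ggplot(plot_df, aes(x = H, y = V, colour = D, group = D)) +
  geom_line() +
  scale_x_continuous(breaks = 1:2) +                       # only whole hours on the axis
  labs(x = "Hour", y = "Sound level", colour = "Day") +    # hypothetical labels
  theme_minimal()
```

Call print(p) to draw it.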

Related

Plot time series of different years together

I am trying to compare different years' variables but I am having trouble plotting them together.
The time series is a temperature series which can be found in https://github.com/gonzalodqa/timeseries as temp.csv
I would like to plot something like the image, but I find it difficult to subset the months between the years and then combine the lines in the same plot under the same months.
If someone can give some advice or point me in the right direction, I would really appreciate it.
You can try this way.
The first chart shows all the available temperatures, the second chart is aggregated by month.
In the first chart, we force the same year so that ggplot will plot them aligned, but we separate the lines by colour.
For the second one, we just use month as x variable and year as colour variable.
Note that:
with scale_x_datetime we can hide the year so that no one can see that we forced the year 2020 to every observation
with scale_x_continuous we can show the names of the months instead of the numbers
[just try to run the charts with and without scale_x_... to understand what I'm talking about]
month.abb is a useful default variable for months names.
# read data
df <- readr::read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
# libraries
library(ggplot2)
library(dplyr)
# line chart by datetime
df %>%
# make datetime: force unique year
mutate(datetime = lubridate::make_datetime(2020, month, day, hour, minute, second)) %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = factor(year))) +
scale_x_datetime(breaks = lubridate::make_datetime(2020,1:12), labels = month.abb) +
labs(title = "Temperature by Datetime", colour = "Year")
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
ggplot() +
geom_line(aes(x = month, y = T42, colour = factor(year))) +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
labs(title = "Average Temperature by Month", colour = "Year")
In case you want your chart to start from July, you can use this code instead:
months_order <- c(7:12,1:6)
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
# create new groups starting from each July
group_by(neworder = cumsum(month == 7)) %>%
# keep only complete years
filter(n() == 12) %>%
# give new names to groups
mutate(years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# reorder months
mutate(month = factor(month, levels = months_order, labels = month.abb[months_order], ordered = TRUE)) %>%
# plot
ggplot() +
geom_line(aes(x = month, y = T42, colour = years, group = years)) +
labs(title = "Average Temperature by Month", colour = "Year")
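To see what cumsum(month == 7) does in the pipeline above, here is a small sketch on a made-up month sequence (not the temperature data). The running count goes up by one at every July, so each July-to-June run gets its own group id, and a filter like n() == 12 then keeps only the complete runs.

```r
# Made-up sequence of months: May through August of the following year
month <- c(5:12, 1:8)

# TRUE at every July; the running sum labels each July-to-June block
neworder <- cumsum(month == 7)

# Number of months in each block; only blocks with 12 months are complete
block_sizes <- table(neworder)

print(neworder)
print(block_sizes)
```

Block 0 (May and June before the first July) and block 2 (the trailing July and August) are incomplete, so only block 1 would survive the filter.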
EDIT
To have something similar to the first plot but starting from July, you could use the following code:
# libraries
library(ggplot2)
library(dplyr)
library(lubridate)
# custom months order
months_order <- c(7:12,1:6)
# fake dates for plot
# note: choose 4 to include 29 Feb which exist only in leap years
dates <- make_datetime(c(rep(3,6), rep(4,6)), months_order)
# line chart by datetime
df %>%
# create date time
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
# filter years of interest
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
# create increasing group after each july
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
# force unique years and create custom name
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# plot
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(title = "Temperature by Datetime", colour = "Year")
To order the months differently and aggregate the values over pairs of years, you have to work a bit with your data before plotting it:
library(dplyr) # work data
library(ggplot2) # plots
library(lubridate) # date
library(readr) # fetch data
# your data
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df %>%
mutate(date = make_date(year, month,day)) %>%
# reorder month
group_by(month_2 = factor(as.character(month(date, label = T, locale = Sys.setlocale("LC_TIME", "English"))),
levels = c('Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun')),
# group years as you like
year_2 = ifelse( year(date) %in% (2018:2019), '2018/2019', '2020/2021')) %>%
# you can put whatever aggregation function you need
summarise(val = mean(T42, na.rm = T)) %>%
# plot it!
ggplot(aes(x = month_2, y = val, color = year_2, group = year_2)) +
geom_line() +
ylab('T42') +
xlab('month') +
theme_light()
A slightly different solution that avoids the trick of forcing all dates into 2020.
library(tidyverse)
library(lubridate)
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df <- df %>%
filter(year %in% c(2018, 2019, 2020)) %>%
mutate(year = factor(year),
# zero-pad month and day so that the "MM-DD" labels sort chronologically
# (pasting a 0 in front of every day would turn day 10 into "010")
month_day = sprintf("%02d-%02d", month, day))
df |> ggplot(aes(x=month_day, y=T42, group=year, col=year)) +
geom_line() +
scale_x_discrete(breaks = c("01-01", "02-01", "03-01", "04-01", "05-01", "06-01", "07-01", "08-01", "09-01", "10-01", "11-01", "12-01"))

R: identify outliers and mark them in a boxplot

I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:
n <- 1000
dat <- data.frame(user = 1:n,
question = sample(paste("q", 1:10, sep = ""), size = 10, replace = TRUE),
time = round(rnorm(n, mean = 10, sd=4), 0)
)
dat %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
ggtitle("Answering time per question")
Then, I am plotting the answering times as boxplots for each question. But how can I first calculate a column with a binary variable showing whether a case is an outlier or not [defined as median(time) +/- 3 * mad(time) ] within each question?
library(dplyr)
dat %>%
group_by(question) %>%
mutate(outlier = abs(time - median(time)) > 3*mad(time) ) %>%
ungroup() %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
geom_point(data = . %>% filter(outlier), color = "red") +
ggtitle("Answering time per question")
By first grouping within each question, the calculation is applied for each row compared to the median and mad for that question.
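A small sketch to make the per-question calculation visible. The toy data and the helper columns `med` and `spread` are made up for illustration. Note that mad() scales the raw median absolute deviation by 1.4826 by default, so the threshold is 3 times that scaled value.

```r
library(dplyr)

# Hypothetical toy data: two questions, one obvious outlier in each
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 6),
  time     = c(9, 10, 10, 11, 12, 40,   # 40 is far from the q1 median
               1, 20, 21, 22, 23, 24)   # 1 is far from the q2 median
)

flagged <- toy %>%
  group_by(question) %>%
  mutate(
    med     = median(time),            # median within the question
    spread  = mad(time),               # scaled MAD within the question
    outlier = abs(time - med) > 3 * spread
  ) %>%
  ungroup()

print(flagged)
```

Without group_by(question), the median and MAD would be computed over all twelve rows at once, and the flags would no longer be relative to each question.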

ggplot: sums per time unit (e.g. an hour) without group_by() and summarise()

I have the following sample data. It's a list of interviews with a date variable indicating when each interview was completed.
n <- 10000
df <- data.frame(
year = rep(2020,n),
month = sample(1:12, n, replace = T),
day = sample(1:28, n, replace = T),
hour = sample(0:23, n, replace = T),
min = sample(0:59, n, replace = T),
sec = sample(0:59, n, replace = T)
)
df
df %>% mutate(dt = make_datetime(year, month, day, hour, min, sec)) %>%
group_by(format(dt, "%Y/%m/%d")) %>%
summarise(n = n())
My goal is to have a line plot (e.g. number of completed interviews (Y) per week (X) ) where I can easily change the x-axis (e.g. if I'd like to plot the number of interviews per MONTH or MINUTES instead of weeks).
So my question is: Do I always have to use group_by(<TIME UNIT>), then summarise(n = n()) and finally plot it or is there a way to directly calculate/plot the number of interviews per time unit?
Interviews
^
|
| .
| . .
| ... .. .. ... ..... .....
|. . ..
|
|
|__________________________________> Time
Thanks!
You can use stat_bin to automatically bin the counts, with a line geom. The bin widths have to be given as a numeric value of seconds, but this is straightforward since you are using lubridate. Here we'll group by weeks:
df %>%
mutate(dt = make_datetime(year, month, day, hour, min, sec)) %>%
ggplot(aes(x = dt)) +
stat_bin(geom = "line", binwidth = as.numeric(seconds(weeks(1))))
Here by days:
df %>%
mutate(dt = make_datetime(year, month, day, hour, min, sec)) %>%
ggplot(aes(x = dt)) +
stat_bin(geom = "line", binwidth = as.numeric(seconds(days(1))))
You can calculate the number of interviews with the stat-argument in geom_line and the time unit in x.
library(lubridate)
library(ggplot2)
df2 <- df %>%
mutate(dt = make_datetime(year, month, day, hour, min, sec))
# days
ggplot(df2, aes(x = format(dt, "%d"), group = 1)) +
geom_line(stat = "count")
# months
ggplot(df2, aes(x = format(dt, "%m"), group = 1)) +
geom_line(stat = "count")
# weekday
ggplot(df2, aes(x = format(dt, "%w"), group = 1)) +
geom_line(stat = "count")
# week of year
ggplot(df2, aes(x = format(dt, "%W"), group = 1)) +
geom_line(stat = "count")
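If aggregating first turns out to be acceptable after all, one way to keep the time unit easy to swap is to wrap the counting in a small helper built on lubridate::floor_date. This is only a sketch: `count_per_unit`, `interviews`, `period` and the simulated data are hypothetical names and values, not part of the question.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical helper: count rows per time unit ("week", "month", "hour", ...)
count_per_unit <- function(data, unit = "week") {
  data %>%
    mutate(period = floor_date(dt, unit = unit)) %>%  # round each timestamp down
    count(period, name = "interviews")                # rows per rounded timestamp
}

# Simulated interview timestamps, similar in spirit to the question's data
set.seed(1)
n <- 1000
interviews <- data.frame(
  dt = make_datetime(2020,
                     sample(1:12, n, replace = TRUE),
                     sample(1:28, n, replace = TRUE),
                     sample(0:23, n, replace = TRUE))
)

per_month <- count_per_unit(interviews, "month")
per_week  <- count_per_unit(interviews, "week")

# Changing the unit only means changing the string passed to the helper
p <- ggplot(per_week, aes(x = period, y = interviews)) +
  geom_line()
```

Call print(p) to draw the weekly counts. Unlike format(dt, "%W"), floor_date keeps the x values as date-times, so the axis stays continuous.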

Spaghetti plot using ggplot in R?

I would like to produce a spaghetti plot with the days of the year on the x-axis and the data on the y-axis, with one line for each Year. I would then like a separate year that has data for only 3 months (PCPNewData) to be plotted on the same figure, but in a different color and with a bold line. Here is my sample code, which produces a graph (attached) where the data for each Year for a particular Day is stacked. I don't want a bar graph; I would like to have a line graph. Thanks
library(tidyverse)
library(tidyr)
myDates=as.data.frame(seq(as.Date("2000-01-01"), to=as.Date("2010-12-31"),by="days"))
colnames(myDates) = "Date"
Dates = myDates %>% separate(Date, sep = "-", into = c("Year", "Month", "Day"))
LatestDate=as.data.frame(seq(as.Date("2011-01-01"), to=as.Date("2011-03-31"),by="days"))
colnames(LatestDate) = "Date"
NewDate = LatestDate %>% separate(Date, sep = "-", into = c("Year", "Month", "Day"))
PCPDataHis = data.frame(total_precip = runif(4018, 0,70), Dates)
PCPNewData = data.frame(total_precip = runif(90, 0,70), NewDate)
PCPDataHisPlot =PCPDataHis %>% group_by(Year) %>% gather(key = "Variable", value = "Value", -Year, -Day,-Month)
ggplot(PCPDataHisPlot, aes(Day, Value, colour = Year))+
geom_line()+
geom_line(data = PCPNewData, aes(Day, total_precip))
I would like to have a figure like the one below, where each line represents the data for a particular year.
UPDATE:
I drew my desired figure by hand (see attached). I would like to have all the days of the year on the x-axis, with the data on the y-axis.
You have a few errors in your code.
First, your days are in character format. You need to pass them in a numerical format to get a continuous line.
Then, you have multiple values for each day (because every day of the month occurs in each of the 12 months of a year), so you need to summarise the data first:
Pel2 <- Pelly2Data %>% group_by(year,day) %>% summarise(Value = mean(Value, na.rm = TRUE))
Pel3 <- Pelly2_2011_3months %>% group_by(year, day) %>% summarise(total_precip = mean(total_precip, na.rm = TRUE))
ggplot(Pel2, aes(as.numeric(day), Value, color = year))+
geom_line()+
geom_line(data = Pelly2_2011_3months, aes(as.numeric(day), y= total_precip),size = 2)
It looks better, but it is hard to apply a specific color pattern.
In my opinion, it will be less confusing if you compare the means of each dataset, such as:
library(tidyverse)
Pel2 <- Pelly2Data %>% group_by(day) %>%
summarise(Mean = mean(Value, na.rm = TRUE),
SEM = sd(Value,na.rm = TRUE)/sqrt(n())) %>%
mutate(Name = "Pel_ALL")
Pel3 <- Pelly2_2011_3months %>% group_by(day) %>%
summarise(Mean = mean(total_precip, na.rm = TRUE),
SEM = sd(total_precip, na.rm = TRUE)/sqrt(n())) %>%
mutate(Name = "Pel3")
Pel <- bind_rows(Pel2,Pel3)
ggplot(Pel, aes(x = as.numeric(day), y = Mean, color = Name))+
geom_ribbon(aes(ymin = Mean-SEM, ymax = Mean+SEM), alpha = 0.2)+
geom_line(size = 2)
EDIT: New graph based on update
To get the graph you posted as a drawing, you need the day of the year and not the day of the month. We can get this information by building a date sequence and extracting the day of the year with the yday function from the lubridate package.
library(tidyverse)
library(lubridate)
Pelly2$Date = seq(ymd("1990-01-01"),ymd("2010-12-31"), by = "day")
Pelly2$Year_day <- yday(Pelly2$Date)
Pelly2_2011_3months$Date <- seq(ymd("2011-01-01"), ymd("2011-03-31"), by = "day")
Pelly2_2011_3months$Year_day <- yday(Pelly2_2011_3months$Date)
Pelly2$Dataset = "ALL"
Pelly2_2011_3months$Dataset = "2011_Dataset"
Pel <- bind_rows(Pelly2, Pelly2_2011_3months)
Then, you can combine both datasets and represent them with different colors, sizes and transparency (alpha), as shown here:
ggplot(Pel, aes(x = Year_day, y = total_precip, color = year, size = Dataset, alpha = Dataset))+
geom_line()+
scale_size_manual(values = c(2,0.5))+
scale_alpha_manual(values = c(1,0.5))
Does it answer your question?

Plot subscribers using start and end dates in R

I wish to plot the frequency of subscribers over time using start and end date.
I have a method that creates a row for each day per subscriber, then calculates the frequency per day, then plots the frequency by day.
This works fine for small data but does not scale to large subscriber numbers because the rows per customer step is too big.
Is there an efficient method? Many thanks for any help.
library(ggplot2)
library(dplyr)
# create dummy dataset
subscribers <- data.frame(id = seq(1:10),
start = sample(seq(as.Date('2016/01/01'), as.Date('2016/06/01'), by="day"), 10),
end = sample(seq(as.Date('2017/01/01'), as.Date('2017/06/01'), by="day"), 10))
# creates a row for each day per user - OK for small datasets, but not scalable
date_map <- Map(seq, subscribers$start, subscribers$end, by = "day")
date_rows <- data.frame(
org = rep.int(subscribers$id, vapply(date_map, length, 1L)),
date = do.call(c, date_map))
# finds the frequency of users for each day
date_rows %>%
group_by(date) %>%
dplyr::summarise(users = n()) -> plot_data
ggplot(data = plot_data,
aes(x = date, y = users)) +
geom_line(size = 1.2,alpha = .6)
How's this?
library(tidyverse)
df <- subscribers %>%
gather(key, value, start, end) %>%
mutate(key = ifelse(key == "start",1,-1)) %>%
arrange(value)
df$cum <- cumsum(df$key)
ggplot(data = df,
aes(x = value, y = cum)) +
geom_step()
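To check the idea on something small, here is a sketch with three made-up subscribers. It uses tidyr::pivot_longer in place of gather (a swap on my part, gather still works) and sums the +1/-1 changes per date before taking the running total, so that a start and an end falling on the same date produce a single row. All names (`subs`, `events`, `change`, `active`) are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three subscribers with overlapping periods
subs <- data.frame(
  id    = 1:3,
  start = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05")),
  end   = as.Date(c("2016-01-04", "2016-01-10", "2016-01-06"))
)

events <- subs %>%
  # two rows per subscriber (start and end) instead of one row per day
  pivot_longer(c(start, end), names_to = "key", values_to = "date") %>%
  mutate(change = ifelse(key == "start", 1, -1)) %>%
  # net change per date, then a running total of active subscribers
  group_by(date) %>%
  summarise(change = sum(change), .groups = "drop") %>%
  arrange(date) %>%
  mutate(active = cumsum(change))

print(events)
```

The table grows with the number of subscribers (two events each) rather than with the number of subscriber-days. One thing to be aware of: with this convention a subscriber stops counting on the end date itself, so if the end date should still count as active, shift the -1 event to end + 1 before summing.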
