I want to plot the frequency of active subscribers over time, using each subscriber's start and end date.
I have a method that creates a row for each day per subscriber, then calculates the frequency per day, then plots the frequency by day.
This works fine for small data but does not scale to large subscriber numbers, because the row-per-day-per-subscriber step produces far too many rows.
Is there an efficient method? Many thanks for any help.
library(ggplot2)
library(dplyr)
# create dummy dataset
subscribers <- data.frame(id = 1:10,
start = sample(seq(as.Date('2016/01/01'), as.Date('2016/06/01'), by="day"), 10),
end = sample(seq(as.Date('2017/01/01'), as.Date('2017/06/01'), by="day"), 10))
# creates a row for each day per user - OK for small datasets, but not scalable
date_map <- Map(seq, subscribers$start, subscribers$end, by = "day")
date_rows <- data.frame(
org = rep.int(subscribers$id, vapply(date_map, length, 1L)),
date = do.call(c, date_map))
# finds the frequency of users for each day
date_rows %>%
group_by(date) %>%
dplyr::summarise(users = n()) -> plot_data
ggplot(data = plot_data,
aes(x = date, y = users)) +
geom_line(size = 1.2, alpha = 0.6)
How's this? Instead of expanding one row per day per subscriber, treat each start date as +1 and each end date as -1, sort the events by date, and take the cumulative sum: that gives the number of active subscribers at every change point, with no per-day expansion.
library(tidyverse)
df <- subscribers %>%
gather(key, value, start, end) %>%              # one row per start/end event
mutate(key = ifelse(key == "start", 1, -1)) %>% # +1 at a start, -1 at an end
arrange(value)                                  # order events by date
df$cum <- cumsum(df$key)                        # running count of active subscribers
ggplot(data = df,
aes(x = value, y = cum)) +
geom_step()
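If you still need one value per calendar day (for example, to compare against the original plot_data), the step values can be carried forward onto a daily grid without ever expanding one row per subscriber per day. A rough sketch, reusing the df built above (last(), complete(), and fill() come from dplyr/tidyr, already loaded via tidyverse):
daily <- df %>%
  group_by(value) %>%
  summarise(cum = last(cum)) %>%                                 # end-of-day count for each event date
  complete(value = seq(min(value), max(value), by = "day")) %>%  # add the days with no events
  fill(cum)                                                      # carry the last known count forward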
I have been struggling with my data for a long time and I don't know how to solve my problem. I work with nutritional data, which can be faked with this data set:
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/05'), by="15 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# For every id, for each day, every hour and each category: sum the eaten amount,
# and keep 0 eaten amounts so they are counted in the mean calculation in step 2!
# PROBLEM: we lose time intervals where a given id didn't eat anything, and this will
# bias the mean calculation in step 2!
step1 <- fakeData %>%
mutate(eaten_at_hour = hour(eaten_at)) %>%
group_by(id, eaten_at, eaten_at_hour, category, group) %>%
summarise(eaten_amount = sum(amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at, eaten_at_hour),
nesting(category, group),
fill = list(eaten_amount = 0))
# For every id, mean over the days the eaten amount for every hour interval.
# As before, keep 0 counts so they are counted in the mean calculation in step 3!
step2 <- step1 %>%
group_by(id, eaten_at_hour, category, group) %>%
summarise(mean_per_id = mean(eaten_amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at_hour),
nesting(category, group),
fill = list(mean_per_id = 0))
# Mean over all id
step3 <- step2 %>%
group_by(eaten_at_hour, category, group) %>%
summarise(mean_for_all = mean(mean_per_id)) %>%
ungroup()
# Plot the data
ggplot(step3, aes(x=eaten_at_hour, y=category, color = mean_for_all, shape = group)) +
geom_point( size = 3) +
scale_color_gradient(low="blue", high="red", "Mean eaten\namount [g]")
What I want to build is a plot with a 1 h time interval on the x-axis and the different food categories on the y-axis, showing the mean eaten amount across all ids for each X-minute period over 24 h (i.e. the time interval must be flexible). I would like a plot looking like this:
My thought was to compute:
for every id,
for every day this id has eaten something,
for every time interval of X hour (even if the id didn't eat anything),
and for every food category:
-> sum the eaten amount
Then:
for each id,
for each category,
and for each 1 hour interval over the days of participation:
-> average the eaten amount
Then:
-> average all id so that we get for each category and for each 1 hour interval of a 24h period, the mean eaten amount
For this I use the group_by(), nesting() and complete() functions. But I have 3 problems:
I want to be able to set the desired time interval; it could be 15 min but also 2 hours. I haven't found a solution to this yet.
I need to have all time intervals for all ids even if they didn't eat anything (so amount = 0), because when I average over the days or among ids, the mean would be biased if the zero counts were left out.
My actual data set contains about 100k rows, so I don't think my approach is the most efficient. Furthermore, I want to build a Shiny app for these data where a user could set the time interval manually, which means the plot would have to be recomputed again and again (a lot of work for the computer if the code is not efficient).
I'm aware that my question is oriented towards a very specific problem, but since I'm really stuck I would highly appreciate any help, inputs, or ideas on any of these questions. Thanks a lot!
I'm not sure I fully understood your problem, but here is a draft of an answer.
First, a handy trick to group on an interval is to take the hour (using lubridate::hour()), divide it by the step, floor the result, and multiply it back by the step. Then I grouped by id, hour, and group to sum, and then by hour and group only to compute the mean.
eaten_n_hours = 2
df = fakeData %>%
mutate(hour = floor(hour(eaten_at)/eaten_n_hours)*eaten_n_hours) %>%
group_by(id, hour, group) %>%
summarise(amount = sum(amount, na.rm=TRUE)) %>%
group_by(hour, group) %>%
summarise(amount_m = mean(amount, na.rm=T),
amount_sd = sd(amount, na.rm=T)) %>%
identity()
Then, you can plot the whole thing like this:
breaks_hour = seq(min(df$hour), max(df$hour)+1, eaten_n_hours)
ggplot(df, aes(x=hour, y=amount_m, group=group, color=group, fill=group))+
geom_col(position="dodge") +
# geom_errorbar(aes(ymin=amount_m-amount_sd, ymax=amount_m+amount_sd), position="dodge") +
scale_x_binned(breaks=breaks_hour)
This is not the prettiest plot ever, but I'm not sure whether that is due to my misunderstanding of the problem or to the example fakeData.
EDIT
I'm not familiar with tiles, but you can try using geom_tile() this way. Also, using scales::breaks_width() allows a flexible time interval.
ggplot(df, aes(x=hour, y=group, fill=amount_m))+
geom_tile()+
scale_x_binned(breaks=scales::breaks_width(3)) # try other values
So I figured out a way to do this (thanks @Dan Chaltiel). It's surely not perfect, but I'll post it here so it can be useful to others or for further discussion:
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01 22:00:00'), as.POSIXct('2020/01/05'), by="17 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# Set time interval in minutes here (0-60 min only):
set_time <- 60
# Generate a time sequence for one day (1440 minutes), with the desired interval. Then set it as a factor.
timeLevels <- seq(from = as.POSIXct("2020-1-1 0:00"), by = paste(set_time, "min", sep = " "), length.out = 1440/set_time)
timeLevels <- paste(hour(timeLevels), minute(timeLevels), sep = ":")
# Calculate the means, keeping zero counts
toPlot <- fakeData %>%
mutate(eaten_hour = floor_date(eaten_at, unit = paste(set_time, "min", sep = " ")),
eaten_hour = paste(hour(eaten_hour), minute(eaten_hour), sep = ":"),
eaten_hour = factor(eaten_hour, levels = timeLevels),
eaten_date = date(eaten_at)) %>%
group_by(eaten_date, eaten_hour, category, group) %>%
summarise(sum_amount = sum(amount)) %>%
ungroup() %>%
complete(eaten_date, eaten_hour, nesting(category, group), fill = list(sum_amount = 0)) %>%
group_by(eaten_hour, category, group) %>%
summarise(mean_amount = mean(sum_amount)) %>%
ungroup()
# Plot the data
gg <- ggplot(toPlot, aes(x=eaten_hour, y=category, fill=mean_amount))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + coord_equal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gg <- gg + labs(x = "Time of the day", y = NULL, title = "Mean eaten quantity over one day", fill = "Mean amount [g]")
gg
Output looks like:
I'm still open to any input on how to improve my code!
Please help!
I have case data that I need to prepare for a report soon, and I just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date mean more cases that day), I want to display the number of positive cases divided by the total (positive + negative) cases for that day as a percentage on the y-axis, with collection dates along the x-axis. Then I want to break it down by region. The goal is to look like this, but in terms of daily positives / number of tests rather than just positives vs. negatives. I also want to add a horizontal line at 20% on every graph.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x = CollectionDate, fill = TestResult)) +
  geom_bar(aes(y = ..prop..)) +
  scale_y_continuous(labels = scales::percent_format())
Which is, again, close, but the percentages are wrong: they take each day's share of the counts over all days instead of the proportion within each day.
Then I tried using tally() in the following command to try to count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure I got what you want, but this may still be helpful.
The data: since you are new here, I should let you know that providing a simple and reproducible version of your data makes it easier for the rest of us to answer. To do this you can simulate a data frame (or any other object), or use the dput() function on your data.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
I can get you halfway there (refer to the comments in the code for clarification). This code handles the per-day, per-region breakdown (plotted separately for each region). I think you can tweak things further to calculate it per day per county too, and the whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
df_final %<>%
  spread(key = TestResult, value = n) %>%
  mutate(Total = Negative + Positive,
         Negative = Negative/Total,
         Positive = Positive/Total)
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:
I would like to produce a spaghetti plot where I can see days of the year on the x-axis and data on the y-axis for each Year. I would then want a separate year that has data for only 3 months (PCPNewData) to be plotted on the same figure, but in a different color and with a bold line. Here is my sample code, which produces a graph (attached) where the data for each Year on a particular Day are stacked; I don't want a bar graph, I would like a line graph. Thanks.
library(tidyverse)
library(tidyr)
myDates=as.data.frame(seq(as.Date("2000-01-01"), to=as.Date("2010-12-31"),by="days"))
colnames(myDates) = "Date"
Dates = myDates %>% separate(Date, sep = "-", into = c("Year", "Month", "Day"))
LatestDate=as.data.frame(seq(as.Date("2011-01-01"), to=as.Date("2011-03-31"),by="days"))
colnames(LatestDate) = "Date"
NewDate = LatestDate %>% separate(Date, sep = "-", into = c("Year", "Month", "Day"))
PCPDataHis = data.frame(total_precip = runif(4018, 0,70), Dates)
PCPNewData = data.frame(total_precip = runif(90, 0,70), NewDate)
PCPDataHisPlot =PCPDataHis %>% group_by(Year) %>% gather(key = "Variable", value = "Value", -Year, -Day,-Month)
ggplot(PCPDataHisPlot, aes(Day, Value, colour = Year))+
geom_line()+
geom_line(data = PCPNewData, aes(Day, total_precip))
I would like to have a figure like below, where each line represents the data for a particular year.
UPDATE:
I drew my desired figure by hand (see attached). I would like to have all the days of the year on the x-axis with their data on the y-axis.
You have a few errors in your code.
First, your days are in character format; you need to convert them to numeric to get a continuous line.
Second, you have multiple values for each day (because there are 12 months per year), so you need to summarise these data a little:
Pel2 <- Pelly2Data %>% group_by(year,day) %>% summarise(Value = mean(Value, na.rm = TRUE))
Pel3 <- Pelly2_2011_3months %>% group_by(year, day) %>% summarise(total_precip = mean(total_precip, na.rm = TRUE))
ggplot(Pel2, aes(as.numeric(day), Value, color = year))+
geom_line()+
geom_line(data = Pel3, aes(as.numeric(day), y = total_precip), size = 2)
It looks better, but it is hard to apply a specific color pattern.
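If a smooth gradient across the years is acceptable, one workaround (just a sketch, reusing the Pel2 summary from above; scale_color_viridis_c() ships with ggplot2) is to map the year as a number onto a continuous scale:
ggplot(Pel2, aes(as.numeric(day), Value, color = as.numeric(year), group = year)) +
  geom_line() +
  scale_color_viridis_c(name = "Year")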
In my opinion, it would be less confusing to compare the mean of each dataset, such as:
library(tidyverse)
Pel2 <- Pelly2Data %>% group_by(day) %>%
summarise(Mean = mean(Value, na.rm = TRUE),
SEM = sd(Value,na.rm = TRUE)/sqrt(n())) %>%
mutate(Name = "Pel_ALL")
Pel3 <- Pelly2_2011_3months %>% group_by(day) %>%
summarise(Mean = mean(total_precip, na.rm = TRUE),
SEM = sd(total_precip, na.rm = TRUE)/sqrt(n())) %>%
mutate(Name = "Pel3")
Pel <- bind_rows(Pel2,Pel3)
ggplot(Pel, aes(x = as.numeric(day), y = Mean, color = Name))+
geom_ribbon(aes(ymin = Mean-SEM, ymax = Mean+SEM), alpha = 0.2)+
geom_line(size = 2)
EDIT: New graph based on update
To get the graph you posted as a drawing, you need the day of the year rather than the day of the month. We can get this by building a date sequence and extracting the day of the year with the yday() function from the lubridate package.
library(tidyverse)
library(lubridate)
Pelly2$Date = seq(ymd("1990-01-01"),ymd("2010-12-31"), by = "day")
Pelly2$Year_day <- yday(Pelly2$Date)
Pelly2_2011_3months$Date <- seq(ymd("2011-01-01"), ymd("2011-03-31"), by = "day")
Pelly2_2011_3months$Year_day <- yday(Pelly2_2011_3months$Date)
Pelly2$Dataset = "ALL"
Pelly2_2011_3months$Dataset = "2011_Dataset"
Pel <- bind_rows(Pelly2, Pelly2_2011_3months)
Then you can combine both datasets and represent them with different colors, sizes, and transparency (alpha), as shown here:
ggplot(Pel, aes(x = Year_day, y = total_precip, color = year, size = Dataset, alpha = Dataset))+
geom_line()+
scale_size_manual(values = c(2,0.5))+
scale_alpha_manual(values = c(1,0.5))
Does this answer your question?
I have data that describes a series of observations (sound level) grouped by date and hour. I want to plot the mean sound level per hour for each day, with sound level on the y-axis and hour on the x-axis, and a line for each day. Example data:
Hour Date SPL
1 18-May 107.9868
2 18-May 106.5656
1 19-May 107.4321
2 19-May 107.8993
I have played around with the group_by function but I'm not sure how to do any better than this:
spl_mean <- group_by(sound, Hour) %>%
summarize(count = n(), Mean = mean(SPL, na.rm = T))
ggplot(data=spl_mean) + geom_line(aes(x = Hour, y = Mean, group = 1), size = 2)
Which obviously just gives the mean SPL by hour but doesn't preserve the day subgroups.
Use dplyr to calculate the mean per hour and day, and then ggplot2 to plot the result.
df %>%
  group_by(Date, Hour) %>%
  summarise(SPL_mean = mean(SPL, na.rm = TRUE)) %>%
  arrange(Date, Hour) %>%
  ggplot(aes(x = Hour, y = SPL_mean, color = Date)) + geom_line()
Using ggplot, where D is day, H is hour, and V is volume:
# setup for demo
library('tidyverse')
df <- tibble(
'D' = c(1:5,1:5),
'H' = rep(c(1,2), each = 5),
'V' = rnorm(10, 100, 5))
# Figure
ggplot(data = df) +
geom_line(mapping = aes(x = H, y = V, group = D, color = D))
This tells it to put hour on the x-axis and volume on the y-axis, and to plot each day as its own line.
If you need the per-day, per-hour means first, then group_by day and hour and summarise:
df %>%
group_by(D, H) %>%
summarize(MV = mean(V)) %>%
ggplot() +
geom_line(mapping = aes(x = H, y = MV, group = D, color = D))
Then go make it pretty with labs, theme, scales etc.
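For example, a minimal styling pass might look like this (the labels and theme are purely illustrative choices):
df %>%
  group_by(D, H) %>%
  summarize(MV = mean(V)) %>%
  ggplot() +
  geom_line(mapping = aes(x = H, y = MV, group = D, color = D)) +
  labs(x = "Hour", y = "Mean volume", color = "Day",
       title = "Mean volume per hour, one line per day") +
  theme_minimal()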
I feel like this should be an easy task for ggplot, tidyverse, lubridate, but I cannot seem to find an elegant solution.
GOAL: Create a bar graph of my data aggregated/summarized/grouped_by year and month.
#Libraries
library(tidyverse)
library(lubridate)
# Data
date <- sample(seq(as_date('2013-06-01'), as_date('2014-5-31'), by="day"), 10000, replace = TRUE)
value <- rnorm(10000)
df <- tibble(date, value)
# Summarise
df2 <- df %>%
mutate(year = year(date), month = month(date)) %>%
unite(year_month,year,month) %>%
group_by(year_month) %>%
summarise(avg = mean(value),
cnt = n())
# Plot
ggplot(df2) +
geom_bar(aes(x=year_month, y = avg), stat = 'identity')
When I create the year_month variable, it naturally becomes a character variable instead of a date variable. I have also tried grouping by year(date), month(date) but then I can't figure out how to use two variables as the x-axis in ggplot. Perhaps this could be solved by flooring the dates to the first day of the month...?
You were really close. The missing pieces are floor_date() and scale_x_date():
library(tidyverse)
library(lubridate)
date <- sample(seq(as_date('2013-06-01'), as_date('2014-5-31'), by = "day"),
10000, replace = TRUE)
value <- rnorm(10000)
df <- tibble(date, value) %>%
group_by(month = floor_date(date, unit = "month")) %>%
summarize(avg = mean(value))
ggplot(df, aes(x = month, y = avg)) +
geom_bar(stat = "identity") +
scale_x_date(NULL, date_labels = "%b %y", date_breaks = "month")
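As a side note, geom_col() is shorthand for geom_bar(stat = "identity"), so the same plot can also be written as:
ggplot(df, aes(x = month, y = avg)) +
  geom_col() +
  scale_x_date(NULL, date_labels = "%b %y", date_breaks = "month")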