Creating a Cumulative Sum Plot using ggplot with duplicate x values - r

In my hypothetical example, people order ice-cream at a stand, and each time an order is placed, the month of the order and the number of orders placed are recorded. Each row represents a unique person who placed an order. For each flavor of ice-cream, I want to know the cumulative orders placed over the various months. For instance, if a total of 3 Vanilla orders were placed in April and 4 in May, the graph should show one data point at 3 for April and one at 7 for May.
The issue I am running into is that each row is plotted separately (so there are 3 separate points at April as opposed to just 1).
My secondary issue is that my dates are not in chronological order on my graph. I thought converting the Month column to Date format would fix this but it doesn't seem to.
Here is my code below:
library(lubridate)
library(ggplot2)
Flavor <- c("Vanilla", "Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","chocolate","chocolate","chocolate")
Month <- c("1-Jun-21", "1-May-19", "1-May-19","1-Apr-19", "1-Apr-19","1-Apr-19","1-Apr-19", "1-Mar-19", "1-Mar-19", "1-Mar-19","1-Mar-19", "1-Apr-19", "1-Mar-19", " 1-Apr-19", " 1-Jan-21", "1-May-19", "1-May-19","1-May-19","1-May-19","1-Jun-19","2-September-19", "1-September-19","1-September-19","1-December-19","1-May-19","1-May-19","1-Jun-19")
Orders <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
data <- data.frame(Flavor,Month,Orders)
data$Month <- dmy(data$Month)
str(data)
data2 <- data[data$Flavor == "Vanilla",]
ggplot(data=data2, aes(x=Month, y=cumsum(Orders))) + geom_point()

In these situations, it's usually best to pre-compute your desired summary and send that to ggplot, rather than messing around with ggplot's summary functions. I've also added a geom_line() for clarity.
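library(dplyr)
library(ggplot2)
data %>%
  group_by(Flavor, Month) %>%
  # one row per flavor and month
  summarize(Orders = sum(Orders), .groups = "drop") %>%
  group_by(Flavor) %>%
  arrange(Month) %>%
  # running total within each flavor, in date order
  mutate(Orders = cumsum(Orders)) %>%
  ggplot(aes(x = Month, y = Orders, color = Flavor)) + geom_point() + geom_line()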
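The two steps (sum per month, then running total) can be checked without plotting. The following is a minimal base R sketch on hypothetical toy data matching the example in the question (3 Vanilla orders in April, 4 in May); the names orders, monthly and Cumulative are made up for illustration.

```r
# Hypothetical toy data: one row per order, 3 in April and 4 in May
orders <- data.frame(
  Flavor = "Vanilla",
  Month  = as.Date(c(rep("2019-04-01", 3), rep("2019-05-01", 4))),
  Orders = 1
)

# Step 1: collapse to one row per flavor and month
monthly <- aggregate(Orders ~ Flavor + Month, data = orders, FUN = sum)

# Step 2: sort by flavor and month, then take the running total within each flavor
monthly <- monthly[order(monthly$Flavor, monthly$Month), ]
monthly$Cumulative <- ave(monthly$Orders, monthly$Flavor, FUN = cumsum)

print(monthly)
```

The resulting data frame has one row per month, which is what ggplot needs in order to draw a single point per x value.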

Related

Geom Histogram of two variables with different data types RStudio

I am trying to make a very simple plot, but it is turning out to be complicated for the following reasons. I have two variables, Date and Condition, whose data types are Date and character respectively. The data look like this:
Date        Condition
2015-11-26  zoo1
2022-01-14  K621
2020-01-14  K20
2021-01-14  G341
2025-01-14  F21
2025-01-14  G309 D
I have 83742 entries in total for the above example table. I am trying to find the total number of MAIN entries for each year and each month, i.e. I want to generate two separate graphs, one by month and one by year, showing the total number of conditions in each month or each year. Thanks
I use the lubridate library's floor_date() function for this type of problem. (round_date() rounds to the nearest unit, so a date such as 2015-11-26 would be counted under December 2015.)
library(dplyr)
library(ggplot2)
library(lubridate)
# Counts per month
df %>%
  mutate(YearMonth = floor_date(Date, "month")) %>%
  ggplot(aes(x = YearMonth, fill = Condition)) +
  geom_bar()
# Counts per year
df %>%
  mutate(Year = floor_date(Date, "year")) %>%
  ggplot(aes(x = Year, fill = Condition)) +
  geom_bar() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
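If only the counts are needed (for example to sanity-check the bar heights), they can be tabulated directly. This is a minimal base R sketch using the six sample rows from the question; the names by_month, by_year and n are made up for illustration.

```r
# The six sample rows from the question
df <- data.frame(
  Date = as.Date(c("2015-11-26", "2022-01-14", "2020-01-14",
                   "2021-01-14", "2025-01-14", "2025-01-14")),
  Condition = c("zoo1", "K621", "K20", "G341", "F21", "G309 D")
)

# Number of entries per calendar month and per calendar year
by_month <- as.data.frame(table(YearMonth = format(df$Date, "%Y-%m")),
                          responseName = "n")
by_year  <- as.data.frame(table(Year = format(df$Date, "%Y")),
                          responseName = "n")

print(by_month)
print(by_year)
```

Formatting the date as text truncates it to the month or year, so no date is moved into a neighbouring period.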

R - Draw cases per 100k population

I am trying to draw a line plot of COVID cases for each date. I do not have the expected output; the lecturer only gave us the questions. I have solved the question, but my problem is the output, which looks weird. Here is the question:
"For the ten countries with the highest absolute number of total deaths, draw the following line graphs to visualize whether epidemic has started to slow down and how the growth rate of new cases/deaths differs across those countries.
a) Number of new cases at each date (absolute number vs per 100.000 population)"
Here is my code:
library(utils)
COVID_data <-read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
#Finding ten countries where the highest absolute total deaths number is
abs_total_deaths <-COVID_data %>%
group_by(countriesAndTerritories) %>%
summarise(abs_total_deaths = sum(deaths)) %>%
arrange(desc(abs_total_deaths))
abs_ten_total_deaths <- c('Italy','France','Germany','Spain','Poland',
'Romania','Czechia','Hungary','Belgium','Bulgaria')
#Calculate new cases by dividing absolute number to 100.000 population
#Draw line for each country
COVID_data %>%
filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
filter(cases >0) %>%
mutate(new_cases = cases/100000) %>%
ungroup() %>%
ggplot()+
geom_line(aes(x = dateRep, y = new_cases, color = countriesAndTerritories),size=1)+
labs(x="Date",
y="New Cases",
title="New Cases per 100.000 population") +
facet_wrap(~countriesAndTerritories)+
theme_bw()
I have also added a picture of my output. I think my graph is not correct, because the output looks really weird, and I can't understand where I made a mistake. I would appreciate any help.
Here is the output: (image not preserved)
Looking at Belgium, I get total deaths = 25051 from your data file, which tallies exactly with the data here.
It's obvious that the highest value (by far) for every country occurs "on" the earliest date for the country in the file. Amongst your top ten (I agree with your selection), this is 01Mar2021 for every country apart from Spain, and 28Feb2021 for Spain.
These two facts lead me to conclude (1) your graphs correctly display the data you have asked them to summarise and that (2) you have a data artefact: the first record for each country contains the cumulative total to date, whereas subsequent dates contain data reported "in the previous 24 hours". I use quotes because different countries have different reporting conventions. For example, in the UK (since August 2020) "COVID-related deaths" are deaths from any cause within 28 days of a positive COVID test. Citation
Therefore, to get meaningful graphs, I think your only option is to discard the cumulative data contained in the first record for each country. Here's how I would do that:
library(utils)
library(tidyverse)
COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
# For better printing
COVID_data <- as_tibble(COVID_data)
# Which countries have the highest absolute death toll?
# [I get the same countries as you do.]
top10 <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(TotalDeaths = sum(deaths)) %>%
  slice_max(TotalDeaths, n = 10) %>%
  distinct(countriesAndTerritories) %>%
  pull(countriesAndTerritories)
COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    caseRate = 100000 * cases / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate)) +
  facet_wrap(~countriesAndTerritories)
The critical part that excludes the first data row for each country is
arrange(countriesAndTerritories, Date) %>%
group_by(countriesAndTerritories) %>%
filter(row_number() > 1) %>%
The call to arrange is necessary because the data are not in date order to begin with.
This gives the following plot (image not preserved), which is much more like what I (and, I suspect, you) would expect.
The sawtooth patterns you see are most likely also reporting artefacts: deaths that take place over the weekend (or on public holidays) are not reported until the following Monday (or next working day). This is certainly true in the UK.
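The same "sort, drop the first record per country, then scale by population" logic can be tried on a small example. This is a minimal base R sketch on hypothetical toy data (the countries, figures and column names below are invented and are not taken from the ECDC file).

```r
# Hypothetical toy data, deliberately out of date order.
# The earliest record of each country holds a cumulative total.
toy <- data.frame(
  country    = c("A", "A", "A", "B", "B", "B"),
  date       = as.Date(c("2021-03-03", "2021-03-01", "2021-03-02",
                         "2021-03-02", "2021-03-03", "2021-03-01")),
  cases      = c(120, 50000, 100, 40, 60, 30000),
  population = c(1e6, 1e6, 1e6, 2e6, 2e6, 2e6)
)

# Sort by country and date so that "first row" means "earliest date"
toy <- toy[order(toy$country, toy$date), ]

# Position of each row within its country; keep everything after the first
position <- ave(seq_len(nrow(toy)), toy$country, FUN = seq_along)
daily <- toy[position > 1, ]

# Cases per 100,000 population
daily$caseRate <- 100000 * daily$cases / daily$population

print(daily)
```

Without the sort, the row that gets dropped would simply be whichever record happens to come first in the file, which is why the ordering step matters.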

Align left border of geom_col column with data anchor

I'm trying to plot some timestamped data with ggplot2 and R. Here is a minimal and reproducible example of my current work
library(lubridate)
library(ggplot2)
sample_size <- 100
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-02 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
df$hour_group <- floor_date(df$timestamps, unit="1 hour")
ggplot(df, aes(x=hour_group, y=amount)) + geom_col()
Explanation: First, a sample data frame with the columns timestamps and amount is created. The timestamps are sampled uniformly between start_date and end_date. I'd like to plot the amount variable for each hour of the day, so another column, hour_group, is created and filled with the start of the hour that each timestamp falls in.
Plotting this data yields the following graph:
The columns look alright, but since the first column for example represents the sum of the amount with timestamps between 00:00 and 01:00 I'd like the column to fill exactly this space (not 23:30 to 00:30 as in the current plot). Therefore I want to align the left border of each column with the anchor point (in the example 00:00) and not center the column at this point. How can this be achieved?
My approach: One way I can think of is to create another column with shifted anchor points. In the example, a 30-minute shift is necessary.
df$hour_group_shifted <- df$hour_group + 60*30
The new plot creates the expected result
I'm still wondering if there may be a simpler way to achieve this directly with a ggplot setting without the extra column.
You can use position_nudge.
ggplot(df, aes(x=hour_group, y=amount)) +
geom_col(position = position_nudge(60*30))
Since ggplot2 3.4.0, you can use just = 0 to align your columns as needed:
ggplot(df, aes(x=hour_group, y=amount)) +
geom_col(just = 0)

ggplot2 timeseries plot through night with non-ordered values

a = c(22,23,00,01,02) #hours from 22 in to 2 in morning next day
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
When I plot this data with ggplot2, it sorts the first column a, but I don't want it to be sorted.
The code I use is ggplot(df, aes(a,b)) + geom_line()
In this case the x-axis is sorted, which gives wrong results: for example, hour 0 appears to have the value 4, when in truth it is hour 22 that has the value 4.
R needs to somehow know that what you provide in vector "a" is a time. I have changed your vector slightly to give R the necessary information:
a = as.POSIXct(c("0122","0123","0200","0201","0202"), format="%d%H")
# hours from 22 in to 2 in morning next day (as strings)
# the day is arbitrary but must be provided
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
ggplot(df, aes(a,b)) + geom_line()
You can use paste() to glue days and hours together automatically (e.g. paste(day,22,sep=""))
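The day part can also be derived from the hours themselves, by starting a new day whenever the hour value drops. This is a minimal base R sketch under that assumption; start_day is an arbitrary, made-up date and day_offset and stamps are illustrative names.

```r
hours <- c(22, 23, 0, 1, 2)   # hours from 22:00 to 02:00 the next day
b     <- c(4, 8, -12, 3, 5)   # some values

# Each time the hour decreases, assume midnight was crossed
day_offset <- cumsum(c(0, diff(hours) < 0))

# Arbitrary start day (assumption); only the ordering matters for the plot
start_day <- as.Date("2020-01-01")

# Glue day and hour together and parse them as date-times
stamps <- as.POSIXct(
  paste(start_day + day_offset, sprintf("%02d:00", hours)),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

df <- data.frame(a = stamps, b = b)
print(df)
```

The resulting column a is strictly increasing, so ggplot(df, aes(a, b)) + geom_line() draws the points in the intended order.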

Plot two sub-variables during a 12 month period - R

The first row of the table holds the 12 month names, followed by the visitor numbers for Portuguese (Portugal) and foreign (ESTRANGEIRO) visitors (ignore the row with no names).
How can I plot, in ggplot2, a bar graph that shows the Portuguese visitors and the foreign visitors over the 12-month period?
Usually it is better to provide a reproducible code example than to submit a screenshot.
To accomplish what you want to do, you will have to change your format a little bit. Given a dataframe that looks like yours and using reshape2:
library(reshape2)
library(ggplot2)
# Use levels= (not labels=) so the months keep calendar order and are not relabelled
df <- data.frame(month = factor(c("Jan","Feb","Mar"), levels = c("Jan","Feb","Mar"), ordered = TRUE),
                 portugal = c(4000, 2330, 3000),
                 foreigner = c(4999, 2600, 3244),
                 stringsAsFactors = FALSE)
# Wide to long: one row per month and visitor group
plotdf <- melt(df, id.vars = "month")
colnames(plotdf) <- c("Month", "Country", "Visitors")
levels(plotdf$Country) <- c("Portugal", "Foreigners")
ggplot(plotdf, aes(x = Month, y = Visitors, fill = Country)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  xlab("Month") +
  ylab("Visitors")
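Month names sort alphabetically when they are turned into a factor without explicit levels, which is a common reason for months appearing out of order on an axis. This is a minimal base R sketch of the difference, using the built-in month.abb vector; the variable names are illustrative.

```r
months <- c("Jan", "Feb", "Mar")

# Default: levels are sorted alphabetically
default_factor <- factor(months)
print(levels(default_factor))

# Explicit calendar order taken from the built-in month.abb constant
calendar_factor <- factor(months, levels = month.abb, ordered = TRUE)
sorted_months <- as.character(sort(calendar_factor))
print(sorted_months)
```

Passing the month names through levels= fixes the order without changing which value each row carries.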
