ggplot2 timeseries plot through night with non-ordered values - r

a = c(22,23,00,01,02) #hours from 22 in to 2 in morning next day
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
When I plot this data with ggplot2, it sorts the values in the first column, a, but I don't want them to be sorted.
The code used in ggplot2 is ggplot(df, aes(a,b)) + geom_line()
In this case the x-axis is sorted, and the plot gives misleading results:
it looks as if hour 0 has the value 4, when in truth hour 22 has the value 4.

R needs to somehow know that what you provide in vector "a" is a time. I have changed your vector slightly to give R the necessary information:
a = as.POSIXct(c("0122","0123","0200","0201","0202"), format="%d%H")
# hours from 22 in to 2 in morning next day (as strings)
# the day is arbitrary but must be provided
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
ggplot(df, aes(a,b)) + geom_line()
You can use paste() to glue days and hours together automatically (e.g. paste(day,22,sep=""))
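As a rough sketch of that last suggestion, the timestamps can be built from separate day and hour vectors. The vector names day and hour, the use of sprintf() for zero-padding, and tz = "UTC" are assumptions made for this illustration, not part of the original answer.

```r
# Hypothetical inputs: the (arbitrary) day each reading belongs to, and its hour
day  <- c(1, 1, 2, 2, 2)
hour <- c(22, 23, 0, 1, 2)
b    <- c(4, 8, -12, 3, 5)

# sprintf() zero-pads both parts so that "%d%H" can parse them unambiguously
stamp <- sprintf("%02d%02d", day, hour)   # "0122" "0123" "0200" "0201" "0202"

# A fixed time zone (an assumption here) avoids daylight-saving surprises
a <- as.POSIXct(stamp, format = "%d%H", tz = "UTC")

df <- data.frame(a, b)
df <- df[order(df$a), ]   # chronological order: 22:00, 23:00, 00:00, 01:00, 02:00
```

The resulting df can be passed to ggplot(df, aes(a,b)) + geom_line() exactly as above; because a is now a real time, hour 22 comes before hour 0 of the next day.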

Related

Creating a Cumulative Sum Plot using ggplot with duplicate x values

In my hypothetical example, people order ice-cream at a stand and each time an order is placed, the month the order was made and the number of orders placed is recorded. Each row represents a unique person who placed the order. For each flavor of ice-cream, I am curious to know the cumulative orders placed over the various months. For instance if a total of 3 Vanilla orders were placed in April and 4 in May, the graph should show one data point at 3 for April and one at 7 for May.
The issue I am running into is each row is being plotted separately (so there would be 3 separate points at April as opposed to just 1).
My secondary issue is that my dates are not in chronological order on my graph. I thought converting the Month column to Date format would fix this but it doesn't seem to.
Here is my code below:
library(lubridate)
library(ggplot2)
Flavor <- c("Vanilla", "Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","chocolate","chocolate","chocolate")
Month <- c("1-Jun-21", "1-May-19", "1-May-19","1-Apr-19", "1-Apr-19","1-Apr-19","1-Apr-19", "1-Mar-19", "1-Mar-19", "1-Mar-19","1-Mar-19", "1-Apr-19", "1-Mar-19", " 1-Apr-19", " 1-Jan-21", "1-May-19", "1-May-19","1-May-19","1-May-19","1-Jun-19","2-September-19", "1-September-19","1-September-19","1-December-19","1-May-19","1-May-19","1-Jun-19")
Orders <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
data <- data.frame(Flavor,Month,Orders)
data$Month <- dmy(data$Month)
str(data)
data2 <- data[data$Flavor == "Vanilla",]
ggplot(data=data2, aes(x=Month, y=cumsum(Orders))) + geom_point()
In these situations, it's usually best to pre-compute your desired summary and send that to ggplot, rather than messing around with ggplot's summary functions. I've also added a geom_line() for clarity.
library(dplyr)
library(ggplot2)
data %>%
  group_by(Flavor, Month) %>%
  summarize(Orders = sum(Orders), .groups = "drop") %>% # one row per flavor and month
  group_by(Flavor) %>%
  arrange(Month) %>%
  mutate(Orders = cumsum(Orders)) %>% # running total within each flavor
  ggplot(data = ., aes(x=Month, y=Orders, color = Flavor)) + geom_point() + geom_line()
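To check the two-step logic (sum per month, then running total per flavor) without any packages, here is a minimal base-R sketch. The small orders data frame is made up for illustration and mirrors the example in the question: 3 Vanilla orders in April and 4 in May.

```r
# Made-up data: one row per order
orders <- data.frame(
  Flavor = c(rep("Vanilla", 7), rep("Strawberry", 2)),
  Month  = as.Date(c(rep("2019-04-01", 3), rep("2019-05-01", 4),
                     "2019-04-01", "2019-05-01")),
  Orders = 1
)

# Step 1: collapse duplicate months into one row per flavor and month
monthly <- aggregate(Orders ~ Flavor + Month, data = orders, FUN = sum)

# Step 2: sort chronologically, then take the running total within each flavor
monthly <- monthly[order(monthly$Flavor, monthly$Month), ]
monthly$Cumulative <- ave(monthly$Orders, monthly$Flavor, FUN = cumsum)
```

With this data, Vanilla ends up with a single point at 3 for April and a single point at 7 for May, which is the behaviour the question asks for.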

Plotting this week's data versus last week's in ggplot

I have a dataset structured the following way:
date transaction
8/15/2020 585
8/14/2020 780
8/13/2020 1427.8
8/12/2020 4358
8/11/2020 780.9
8/8/2020 585
8/6/2020 1107.4
8/5/2020 2917.35
8/4/2020 1237.1
Is there a way to plot a line graph with all the transactions that occurred this week compared to the previous week? I tried filtering the data manually and assigning it to a new data frame, which seemed to work, but it's very labor-intensive. Would it be possible to use today() so that the code picks up the day of execution and computes the results from there? Thanks!
To do that, you need:
real Date values (using as.Date), so that we can deal with them numerically (not categorically) and break them into weeks;
format, to get each date's week of the year; and
facet_wrap, so that each week gets its own facet with a distinct x axis.
dat$date <- as.Date(dat$date, format = "%m/%d/%Y")
dat$week <- format(dat$date, format = "%V") # or %W
library(ggplot2)
ggplot(dat, aes(date, transaction)) +
facet_wrap("week", ncol = 1, scales = "free_x") +
geom_path()
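The answer above facets every week in the data. To restrict the plot to the current and previous week relative to the day the code runs, one possible sketch is shown below. The helper week_start(), the period and weekday columns, and the Monday-based week are all assumptions made for this example; today is fixed so the result is reproducible, and Sys.Date() would replace it in practice.

```r
# Fixed here for reproducibility; use Sys.Date() (or lubridate::today()) in practice
today <- as.Date("2020-08-15")

dat <- data.frame(
  date = as.Date(c("8/15/2020", "8/14/2020", "8/13/2020", "8/12/2020", "8/11/2020",
                   "8/8/2020", "8/6/2020", "8/5/2020", "8/4/2020"),
                 format = "%m/%d/%Y"),
  transaction = c(585, 780, 1427.8, 4358, 780.9, 585, 1107.4, 2917.35, 1237.1)
)

# Hypothetical helper: Monday of the week containing each date
# (%u gives the weekday as 1 = Monday ... 7 = Sunday)
week_start <- function(d) d - (as.integer(format(d, "%u")) - 1)

# Whole weeks between each row's week and the current week
weeks_ago <- as.integer(week_start(today) - week_start(dat$date)) / 7

dat$period <- ifelse(weeks_ago == 0, "this week",
              ifelse(weeks_ago == 1, "last week", NA_character_))

# Keep only the two weeks of interest; weekday gives a shared x axis
recent <- dat[!is.na(dat$period), ]
recent$weekday <- as.integer(format(recent$date, "%u"))
```

The two weeks can then be overlaid with something like ggplot(recent, aes(weekday, transaction, colour = period)) + geom_line().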

Align left border of geom_col column with data anchor

I'm trying to plot some timestamped data with ggplot2 and R. Here is a minimal and reproducible example of my current work
library(lubridate)
library(ggplot2)
sample_size <- 100
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-02 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
df$hour_group <- floor_date(df$timestamps, unit="1 hour")
ggplot(df, aes(x=hour_group, y=amount)) + geom_col()
Explanation: First, a sample data frame with the columns timestamps and amount is created. The timestamps are sampled uniformly between start_date and end_date. I'd like to plot the amount variable for each hour of the day, so another column, hour_group, is created and filled with the hour of each timestamp.
Plotting this data yields the following graph:
The columns look alright, but since the first column for example represents the sum of the amount with timestamps between 00:00 and 01:00 I'd like the column to fill exactly this space (not 23:30 to 00:30 as in the current plot). Therefore I want to align the left border of each column with the anchor point (in the example 00:00) and not center the column at this point. How can this be achieved?
My approach: One way I can think of is to create another column with shifted anchor points. In the example, a 30-minute shift is necessary.
df$hour_group_shifted <- df$hour_group + 60*30
The new plot creates the expected result
I'm still wondering if there may be a simpler way to achieve this directly with a ggplot setting without the extra column.
You can use position_nudge.
ggplot(df, aes(x=hour_group, y=amount)) +
geom_col(position = position_nudge(60*30))
Since ggplot2 3.4.0, you can use just = 0 to align your columns as needed:
ggplot(df, aes(x=hour_group, y=amount)) +
geom_col(just = 0)
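The arithmetic behind the 60*30 nudge can be sketched in base R. This assumes each column is exactly one hour (3600 seconds) wide; ggplot2's default column width is 90% of the data resolution, so for columns that fill the whole hour you would likely also need width = 3600 in geom_col().

```r
# Three example anchors, one per hour (POSIXct values are counted in seconds)
hour_group <- as.POSIXct("2020-01-01 00:00", tz = "UTC") + 3600 * 0:2

width <- 3600  # assumed column width: one full hour, in seconds

# A centred column starts half a width before its anchor
centred_left <- hour_group - width / 2

# Nudging the centre by 30 minutes moves the left edge onto the anchor
nudged_left <- (hour_group + 60 * 30) - width / 2
```

Here centred_left starts at 23:30 of the previous day, while nudged_left coincides with hour_group, which is the alignment the question asks for.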

Plot overlaps of time intervals

I have the following df
Id a_min_date a_max_date b_min_date b_max_date c_min_date c_max_date d_min_date d_max_date
1 2014-01-01 2014-01-10 2014-01-05 2014-01-15 NA NA 2014-02-20 2014-05-01
2 2014-02-01 2014-02-10 NA NA 2015-02-20 2015-03-01 NA NA
I have added the intervals of each group (a, b, c,d) by ID. First, I have converted the start and end dates to lubridate intervals.
I want to plot the intervals and calculate the time difference in days between the end of each group and the start of next group if there is no overlap.
I tried to use the IRanges package and converted the dates into integers (as used here (link)), but it does not work for me.
library(IRanges)
library(ggplot2)
ir <- IRanges(start = as.integer(as.Date(df$a_min_date)), end = as.integer(as.Date(df$a_max_date)))
bins <- disjointBins(IRanges(start(ir), end(ir) + 1))
dat <- cbind(as.data.frame(ir), bin = bins)
ggplot(dat) +
  geom_rect(aes(xmin = start, xmax = end,
                ymin = bin, ymax = bin + 0.9)) +
  theme_bw()
I got this error for my original df:
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
solving row 1: range cannot be determined from the supplied arguments (too many NAs)
Does someone have another solution using other packages?
To my knowledge, IRanges is the best package out there to solve this problem.
IRanges needs range values (in this case dates) to compare and does not handle undefined values (NAs).
To solve this problem, I would remove all rows with NAs in df before doing the analysis.
df <- df[complete.cases(df[, c("a_min_date", "a_max_date")]), ] # keep rows where both dates passed to IRanges are present
For an explanation and other ways to remove NAs, see Remove rows with all or some NAs (missing values) in data.frame.
If this does not fix the problem, you could convert the dates into integers. It is important that the dates are in year-month-day format so that the resulting integers sort correctly.
Example:
str <- "2006-06-26"
splitted <- unlist(strsplit(str, "-"))
# [1] "2006" "06" "26"
result <- paste(splitted, collapse = "")
# [1] "20060626"
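Integers of the form yyyymmdd keep dates in the right order, but the difference between two of them is not a number of days (20140220 - 20140115 is 105, while the real gap is 36 days). If the goal is the gap in days between the end of one group and the start of the next, a sketch using as.integer() on Date values (days since 1970-01-01) may be closer to what is needed. The data frame iv below is a hand-reshaped copy of Id 1 from the question, and its layout and column names are assumptions for this example.

```r
# Intervals for Id 1, one row per group (group c has no dates)
iv <- data.frame(
  group = c("a", "b", "c", "d"),
  start = as.Date(c("2014-01-01", "2014-01-05", NA, "2014-02-20")),
  end   = as.Date(c("2014-01-10", "2014-01-15", NA, "2014-05-01"))
)

iv <- iv[complete.cases(iv), ]   # drop groups without an interval
iv <- iv[order(iv$start), ]

# Days since 1970-01-01, so differences are real day counts
start_day <- as.integer(iv$start)
end_day   <- as.integer(iv$end)

# Gap between the end of each interval and the start of the next one
gap <- start_day[-1] - end_day[-nrow(iv)]
gap[gap < 0] <- NA               # a negative gap means the intervals overlap

gaps <- data.frame(from = iv$group[-nrow(iv)], to = iv$group[-1], gap_days = gap)
```

For this row, a and b overlap (so the gap is NA), and there are 36 days between the end of b and the start of d. The same start_day and end_day integers could also be passed to IRanges once the NAs are removed.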

Using scale_x_date in ggplot2 with different columns

Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of each month's date for miles, grouped and colored by year. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying. Here in Excel is basically what I want--the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?
The code below worked fine for me with the small dataset. If you convert your data.frame to a data.table, you can sum the miles per activity and month with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on, but it should be fairly self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load ggplot2, lubridate and data.table
library(ggplot2)
library(lubridate)
library(data.table)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert data to a column of Class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.tables functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # Data.table creates the column V1 which is the sum of miles
geom_line() +
scale_x_date(date_labels = '%b-%y') # %b is used to display the first 3 letters of the month
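If the data has already been summed into one row per activity and month, with separate Month and Year columns as described in the question, the two columns can be combined into a Date that scale_x_date understands. This is a small sketch: the monthly data frame holds the monthly sums of the question's sample rows, and using the first day of each month as a stand-in date is an assumption of this example.

```r
# Pre-summed data: one row per activity and month
monthly <- data.frame(
  Month    = c(3, 3, 4, 1, 1),
  Year     = c(2014, 2014, 2014, 2015, 2015),
  Activity = c("Walking", "Running", "Biking", "Walking", "Running"),
  Miles    = c(72, 127, 65, 21, 32)
)

# Build a real Date from Year and Month, using day 1 as a placeholder
monthly$Date <- as.Date(paste(monthly$Year, monthly$Month, 1, sep = "-"))
```

The new Date column can then be used as the x aesthetic, for example ggplot(monthly, aes(x = Date, y = Miles, colour = Activity)) + geom_line() + scale_x_date(date_labels = '%b-%y').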
