Align left border of geom_col column with data anchor - r

I'm trying to plot some timestamped data with ggplot2 in R. Here is a minimal, reproducible example of my current work:
library(lubridate)
library(ggplot2)
sample_size <- 100
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-02 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
df$hour_group <- floor_date(df$timestamps, unit="1 hour")
ggplot(df, aes(x=hour_group, y=amount)) + geom_col()
Explanation: First, a sample data frame with the columns timestamps and amount is created. The timestamps are sampled uniformly between start_date and end_date. I'd like to plot the amount variable for each hour of the day, so another column, hour_group, is created and filled with each timestamp floored to the hour.
Plotting this data yields the following graph:
The columns look alright, but the first column, for example, represents the sum of the amounts with timestamps between 00:00 and 01:00, so I'd like it to fill exactly this space (not 23:30 to 00:30 as in the current plot). In other words, I want to align the left border of each column with the anchor point (00:00 in the example) instead of centering the column on it. How can this be achieved?
My approach: One way I can think of is to create another column with shifted anchor points. In the example, a 30-minute shift is necessary.
df$hour_group_shifted <- df$hour_group + 60*30
Plotting with hour_group_shifted on the x-axis gives the expected result.
I'm still wondering if there may be a simpler way to achieve this directly with a ggplot setting without the extra column.

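You can use position_nudge. The x offset is given in the units of the axis, which for POSIXct values is seconds, so 60*30 shifts each column by 30 minutes.
ggplot(df, aes(x=hour_group, y=amount)) +
  geom_col(position = position_nudge(x = 60*30))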

Since ggplot2 3.4.0, you can use just = 0 to align your columns as needed:
ggplot(df, aes(x=hour_group, y=amount)) +
  geom_col(just = 0)
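As a minimal sketch of the just = 0 approach (assuming ggplot2 3.4.0 or later), the example below first sums the amounts per hour so that each hour is drawn as a single bar. The fixed seed, the UTC time zone, the aggregate() step, the width of 3600 seconds and the axis breaks are assumptions added for illustration, not part of the original answer.

```r
library(ggplot2)
library(lubridate)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

start_date <- as.POSIXct("2020-01-01 00:00", tz = "UTC")
end_date   <- as.POSIXct("2020-01-02 00:00", tz = "UTC")

df <- data.frame(
  timestamps = sample(seq(start_date, end_date, by = 60), 100),
  amount     = rpois(100, 5)
)
df$hour_group <- floor_date(df$timestamps, unit = "1 hour")

# One row per hour, so every bar is a single rectangle
hourly <- aggregate(amount ~ hour_group, data = df, FUN = sum)

# just = 0 puts the left edge of each bar on its x value;
# width = 3600 (seconds) makes each bar span the full hour
p <- ggplot(hourly, aes(x = hour_group, y = amount)) +
  geom_col(just = 0, width = 3600) +
  scale_x_datetime(date_breaks = "3 hours", date_labels = "%H:%M")
```

Calling print(p) draws the plot. Leaving out width keeps ggplot2's default bar width, which leaves a small gap between neighbouring bars.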

Related

Creating a Cumulative Sum Plot using ggplot with duplicate x values

In my hypothetical example, people order ice-cream at a stand and each time an order is placed, the month the order was made and the number of orders placed is recorded. Each row represents a unique person who placed the order. For each flavor of ice-cream, I am curious to know the cumulative orders placed over the various months. For instance if a total of 3 Vanilla orders were placed in April and 4 in May, the graph should show one data point at 3 for April and one at 7 for May.
The issue I am running into is each row is being plotted separately (so there would be 3 separate points at April as opposed to just 1).
My secondary issue is that my dates are not in chronological order on my graph. I thought converting the Month column to Date format would fix this but it doesn't seem to.
Here is my code below:
library(lubridate)
library(ggplot2)
Flavor <- c("Vanilla", "Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Vanilla","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","Strawberry","chocolate","chocolate","chocolate")
Month <- c("1-Jun-21", "1-May-19", "1-May-19","1-Apr-19", "1-Apr-19","1-Apr-19","1-Apr-19", "1-Mar-19", "1-Mar-19", "1-Mar-19","1-Mar-19", "1-Apr-19", "1-Mar-19", " 1-Apr-19", " 1-Jan-21", "1-May-19", "1-May-19","1-May-19","1-May-19","1-Jun-19","2-September-19", "1-September-19","1-September-19","1-December-19","1-May-19","1-May-19","1-Jun-19")
Orders <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
data <- data.frame(Flavor,Month,Orders)
data$Month <- dmy(data$Month)
str(data)
data2 <- data[data$Flavor == "Vanilla",]
ggplot(data=data2, aes(x=Month, y=cumsum(Orders))) + geom_point()
In these situations, it's usually best to pre-compute your desired summary and send that to ggplot, rather than messing around with ggplot's summary functions. I've also added a geom_line() for clarity.
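library(dplyr)
library(ggplot2)
data %>%
  group_by(Flavor, Month) %>%
  summarize(Orders = sum(Orders), .groups = "drop") %>%  # one row per flavor and month
  group_by(Flavor) %>%
  arrange(Month, .by_group = TRUE) %>%
  mutate(Orders = cumsum(Orders)) %>%  # running total within each flavor
  ggplot(data = ., aes(x = Month, y = Orders, color = Flavor)) + geom_point() + geom_line()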

Plotting this weeks data versus last weeks in ggplot

I have a dataset structured the following way:
date transaction
8/15/2020 585
8/14/2020 780
8/13/2020 1427.8
8/12/2020 4358
8/11/2020 780.9
8/8/2020 585
8/6/2020 1107.4
8/5/2020 2917.35
8/4/2020 1237.1
Is there a way to plot a line graph with all the transactions that occurred this week compared to the previous week? I tried filtering the data manually and assigning it to a new data frame, which seemed to work, but it's very labor-intensive. Would it be possible to use today() so that it registers the day of execution and produces the results from there? Thanks!
To do that, you need to:
convert the date column to real Date values (using as.Date), so that they can be handled numerically (not categorically) and broken into weeks;
use format to get each date's week of the year; and
use facet_wrap so that each week gets its own facet with a distinct x axis.
dat$date <- as.Date(dat$date, format = "%m/%d/%Y")
dat$week <- format(dat$date, format = "%V") # or %W
library(ggplot2)
ggplot(dat, aes(date, transaction)) +
  facet_wrap("week", ncol = 1, scales = "free_x") +
  geom_path()
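To let the day of execution pick the two weeks, as the question asks, one option is to derive the week boundaries from today's date and keep only the rows that fall inside them. The following is a minimal sketch, not part of the original answer. It assumes weeks start on Monday, week_start is a hypothetical helper, and today is fixed to 2020-08-15 so the result is reproducible (in practice it would be Sys.Date()).

```r
library(ggplot2)

dat <- data.frame(
  date = c("8/15/2020", "8/14/2020", "8/13/2020", "8/12/2020", "8/11/2020",
           "8/8/2020", "8/6/2020", "8/5/2020", "8/4/2020"),
  transaction = c(585, 780, 1427.8, 4358, 780.9, 585, 1107.4, 2917.35, 1237.1)
)
dat$date <- as.Date(dat$date, format = "%m/%d/%Y")

# Assumption: fixed date for reproducibility; use Sys.Date() in real use
today <- as.Date("2020-08-15")

# Hypothetical helper: the Monday of the week containing d
# ("%u" is the ISO weekday number, 1 = Monday ... 7 = Sunday)
week_start <- function(d) d - (as.integer(format(d, "%u")) - 1)

this_start <- week_start(today)

# Keep the current week and the one before it
recent <- dat[dat$date >= this_start - 7 & dat$date < this_start + 7, ]
recent$week    <- ifelse(recent$date >= this_start, "this week", "last week")
recent$weekday <- as.integer(format(recent$date, "%u"))

# Both weeks share a weekday axis, so the lines can be compared directly
p <- ggplot(recent, aes(weekday, transaction, colour = week)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:7,
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
```

Calling print(p) draws the plot. Plotting against the weekday rather than the date overlays the two weeks on one panel, which is a design choice of this sketch. The faceted version above keeps the real dates on the axis instead.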

ggplot2 timeseries plot through night with non-ordered values

a = c(22,23,00,01,02) #hours from 22 in to 2 in morning next day
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
When I plot this data with ggplot2, it sorts the first column a, but I don't want it to be sorted.
The code used is ggplot(df, aes(a,b)) + geom_line()
In this case the x-axis is sorted, which gives a misleading result: the plot reads as if hour 0 had the value 4, when in truth hour 22 has the value 4.
R needs to somehow know that what you provide in vector "a" is a time. I have changed your vector slightly to give R the necessary information:
a = as.POSIXct(c("0122","0123","0200","0201","0202"), format="%d%H")
# hours from 22 in to 2 in morning next day (as strings)
# the day is arbitrary but must be provided
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
ggplot(df, aes(a,b)) + geom_line()
You can use paste() to glue days and hours together automatically (e.g. paste(day,22,sep=""))
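If the hours arrive as plain numbers, the timestamps can also be built arithmetically instead of with paste(). This is a minimal sketch of a different technique from the one in the answer. It assumes that a drop in the hour value means midnight was crossed, and the base date 2020-01-01 and the UTC time zone are arbitrary choices.

```r
library(ggplot2)

hours <- c(22, 23, 0, 1, 2)   # hours from 22:00 to 02:00 the next day
b     <- c(4, 8, -12, 3, 5)   # some values

# Assumption: whenever the hour decreases, a new day has started
day_offset <- cumsum(c(0, diff(hours) < 0))

# Arbitrary base day plus whole days and hours, both expressed in seconds
a <- as.POSIXct("2020-01-01", tz = "UTC") + day_offset * 86400 + hours * 3600

df <- data.frame(a, b)
p <- ggplot(df, aes(a, b)) + geom_line()
```

Because a is now a real, increasing time, ggplot2 keeps the points in the recorded order.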

Force ggplot scales to start on e.g. 1st of year, 1st of month etc

I'm looking for a way to force the date labels on a ggplot to start at a (seemingly) logical time. I've had the problem a number of times but my current problem is I want the breaks to be on the 01/01/yyyy
My data is a large dataset with POSIXct Date column, data to plot in Flow column and a number of site names in the Site column.
library(ggplot2)
library(scales)
ggplot(AllFlowData, aes(x = Date, y = Flow, colour = Site)) + geom_line() +
  scale_x_datetime(date_breaks = "1 year", expand = c(0,0), labels = date_format("%Y"))
I can force the breaks to be every year, and without labels=date_format("%Y") they appear okay (starting on 01/01 each year). But if I include labels=date_format("%Y") (there are 10 years of data, so the default labels get a bit messy), the date labels move to around November, and 1989 is the first label even though my data starts on 01/01/1990.
I have had this problem numerous times in the past with different time steps, such as wanting to force breaks to the 1st of the month, or daily breaks to midnight instead of during the day. Is there a generic way to do this?
I have looked at "create specific date range in ggplot2 (scale_x_date)", but I do not want to hard-code my breaks, as I have a fair few plots to do with different date ranges.
Thanks
If the dates come to you in a vector like:
dates <- seq.Date(as.Date("2001-03-04"), as.Date("2001-11-04"), by="day")
## "2001-03-04" "2001-03-05" "2001-03-06" ... "2001-11-03" "2001-11-04"
use pretty() (which dispatches to the pretty.Date() method for Date vectors) to make a best guess about the end points.
range(pretty(dates))
## "2001-01-01" "2002-01-01"
Then pass this range to ggplot.
However, I recommend coord_cartesian() instead of scale_x_date(). Typically I want to crop the graphic bounds rather than exclude the values entirely (which can distort things like a loess summary).
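Passing the pretty() range to coord_cartesian() might look like the following minimal sketch. The Flow values are made up for illustration, and the label format is an assumption. For a POSIXct column, as in the question, the limits would need to be POSIXct values as well.

```r
library(ggplot2)

dates <- seq.Date(as.Date("2001-03-04"), as.Date("2001-11-04"), by = "day")

# Made-up data for illustration only
flow_df <- data.frame(Date = dates, Flow = seq_along(dates))

# "Pretty" end points that enclose the data
lims <- range(pretty(dates))

# coord_cartesian() zooms the view without dropping any observations
p <- ggplot(flow_df, aes(Date, Flow)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  coord_cartesian(xlim = lims)
```

Since the limits are computed from the data, the same code can be reused for plots with different date ranges without hard-coding breaks.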

Plot two sub-variables during a 12 month period - R

The table shows the first row with 12 month names and the values of visitors, with portuguese (Portugal) and foreigners (ESTRANGEIRO) (ignore the row with no names)
How can I plot, in ggplot2, a bar graph that shows the Portuguese visitors and the foreign visitors during the 12-month period?
Usually it is better to provide a reproducible code example than to submit a screenshot.
To accomplish what you want to do, you will have to change your format a little bit. Given a dataframe that looks like yours and using reshape2:
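library(ggplot2)
library(reshape2)
# Set the levels explicitly: with labels= alone, the alphabetically sorted
# levels (Feb, Jan, Mar) would be relabelled and Jan and Feb would swap
df <- data.frame(month=factor(c("Jan","Feb","Mar"),levels=c("Jan","Feb","Mar"),ordered=TRUE),
                 portugal=c(4000,2330,3000),
                 foreigner=c(4999,2600,3244),
                 stringsAsFactors = FALSE)
# Long format: one row per month and visitor group
plotdf <- melt(df, id.vars = "month")
colnames(plotdf) <- c("Month","Country","Visitors")
levels(plotdf$Country) <- c("Portugal","Foreigners")
ggplot(plotdf,aes(x=Month,y=Visitors,fill=Country)) +
  geom_bar(stat="identity",position=position_dodge()) +
  xlab("Month") +
  ylab("Visitors")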
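The same reshaping can also be done with tidyr instead of reshape2. This is a minimal sketch of that alternative, not the original answer's method. The three-month data frame is the illustrative one from above, and the object names visitors and long are made up here.

```r
library(ggplot2)
library(tidyr)

visitors <- data.frame(
  month     = factor(c("Jan", "Feb", "Mar"), levels = c("Jan", "Feb", "Mar")),
  portugal  = c(4000, 2330, 3000),
  foreigner = c(4999, 2600, 3244)
)

# One row per month and visitor group
long <- pivot_longer(visitors,
                     cols      = c(portugal, foreigner),
                     names_to  = "Country",
                     values_to = "Visitors")

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(long, aes(x = month, y = Visitors, fill = Country)) +
  geom_col(position = position_dodge()) +
  xlab("Month") +
  ylab("Visitors")
```

For the full table, the factor levels could be set to month.abb (built into R) so that all twelve months keep calendar order.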
