Can I chronologically order dates as characters in R?

I have data in a .csv. The first column is dates, the second column counts a number of days. I want to plot number of days vs. date. (see here)
In my .csv the dates are chronological by year. In RStudio, the initial plot is chronological by the month's number.
install.packages("tidyverse")
library(tidyverse)
#load my spreadsheet
openingData <- read_csv("daysPriorToOpening.csv")
ggplot(data = openingData) +
geom_col(mapping = aes(x = dateOpened, y = daysPrior)) +
labs(x = "Date Opened", y = "Days prior to opening at or above 11.0")
That creates this output, with it arranged in order by the number of the month. I like the appearance, just not the order. Someone suggested I try using as.Date()
openingData$dateOpened <- as.Date(openingData$dateOpened, format = "%m/%d/%Y")
Then I ran the code again to graph and it plotted chronologically, but now there are large gaps. See here. The dates aren't labeled as they were in the first picture; the reader has to guess the exact date.
My guess as to the different appearance is that in the first case, the dates are characters and discrete. In the second case, using as.Date() changed them to Dates and they become continuous. Is there a way to either,
keep the display as the first graph but order it by year, or
display as in the second graph but either eliminate the gaps or label the columns with their corresponding date?

openingData %>%
mutate(dateOpened = as.Date(dateOpened,"%m/%d/%y")) %>%
arrange(dateOpened) %>%
mutate(id = factor(row_number(),labels = dateOpened)) %>%
ggplot() +
geom_col(mapping = aes(x = id, y = daysPrior))+
labs(x = "Date Opened", y = "Days prior to opening at or above 11.0")

You need to convert your dates to a factor, and order the factor levels according to the date they represent. This involves converting to a date, ordering, then converting back again.
dates <- as.Date(openingData$dateOpened, format = "%m/%d/%y")
levs <- unique(strftime(sort(dates), format = "%m/%d/%y")) # unique() so repeated dates do not create duplicate levels
openingData$dateOpened <- factor(strftime(dates, format = "%m/%d/%y"), levs)
ggplot(data = openingData) +
geom_col(mapping = aes(x = dateOpened, y = daysPrior)) +
labs(x = "Date Opened", y = "Days prior to opening at or above 11.0")
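To see why the factor levels matter, here is a minimal base R sketch on made-up dates (the vector raw is hypothetical, not the asker's data). It compares plain text sorting with levels built from the parsed dates:

```r
# Hypothetical sample in the same month/day/two-digit-year layout
raw <- c("12/01/19", "01/15/20", "03/10/18")

# Sorting the text puts the month number first, which is not chronological
text_order <- sort(raw)

# Parse to Date, sort chronologically, then turn back into labels for the levels
dates <- as.Date(raw, format = "%m/%d/%y")
levs <- unique(format(sort(dates), "%m/%d/%y"))

# A discrete axis follows the level order, so the bars come out by date
f <- factor(raw, levels = levs)
levels(f)
```

The axis stays discrete (no gaps, one label per bar), and only the order of the levels changes.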

Related

ggplot scale for time of date only, when using POSIXct datetimes

In ggplot2, I have a question about appropriate scales for making POSIXct datetimes into time-of-day in an axis. Consider:
library(tidyverse)
library(lubridate)
library(hms)
library(patchwork)
test <- tibble(
dates = c(ymd_hms("2022-01-01 6:00:00"),
ymd_hms("2023-01-01 19:00:00")),
x = c(1, 2),
hms_dates = as_hms(dates)
)
plot1 <- ggplot(test) + geom_point(aes(x = x, y = dates)) +
scale_y_time()
plot2 <- ggplot(test) + geom_point(aes(x = x, y = hms_dates)) +
scale_y_time()
plot1 + plot2
Plot 1 y axis includes dates and time, but Plot 2 shows just time of day. That's what I want! I'd like to generate plot 2 like images without having to use the hms::as_hms approach. This seems to imply some options for scale_y_datetime (or similar) that I can't discover. I'd welcome suggestions.
Does someone have an example of how to use the limits option in scale_*_time, or (see question #1) limits for a scale_y_datetime that specifies hours within the day, e.g. .. limits(c(8,22)) predictably fails.
For your second question, when dealing with dates or datetimes or times you have to set the limits and/or breaks as dates, datetimes or times too, i.e. use limits = as_hms(c("8:00:00", "22:00:00")):
library(tidyverse)
library(lubridate)
library(hms)
ggplot(test) + geom_point(aes(x = x, y = hms_dates)) +
scale_y_time(limits = as_hms(c("8:00:00", "22:00:00")))
#> Warning: Removed 1 rows containing missing values (`geom_point()`).
Concerning your first question: to the best of my knowledge, this cannot be achieved via scale_..._datetime. If you just want to show the time part of your dates, then converting to an hms object is IMHO the easiest way to achieve that. You could of course set the format of the axis text via the date_labels argument, e.g. date_labels="%H:%M:%S", to show only the time of day. However, as your dates variable is still a datetime, the scale, breaks and limits will still reflect that: you only change the format of the labels, and for your example data you end up with an axis showing the same time for each break, i.e. the start of the day.
ggplot(test) + geom_point(aes(x = x, y = dates)) +
scale_y_datetime(date_labels = "%H:%M:%S")
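For intuition about why the conversion is needed, here is a rough base R sketch of the idea behind a time-of-day value: the number of seconds elapsed since midnight of the same day. The name secs_since_midnight is mine, and fixing the time zone to UTC is an assumption made so the result does not depend on the local setting. This is not how hms is implemented, only an illustration of the quantity it represents.

```r
# Two datetimes a year apart, as in the question, pinned to UTC (assumption)
x <- as.POSIXct(c("2022-01-01 06:00:00", "2023-01-01 19:00:00"), tz = "UTC")

# Midnight of each datetime's own day
midnight <- as.POSIXct(trunc(x, units = "days"))

# Seconds elapsed since midnight: the date part is gone, only time of day remains
secs_since_midnight <- as.numeric(difftime(x, midnight, units = "secs"))
secs_since_midnight / 3600  # hours of the day
```

Once the date part is removed like this, both points live on the same 0 to 24 hour scale, which is what the second plot shows.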

Remove all levels from dates data in R

I have a dataset where one of the columns is dates but in character format. I used the following code to convert it to dates format and then take the month only:
library(lubridate)
dates <- dmy(Austria$date)
Month <- month(dates, label = TRUE, abbr = FALSE)
The problem is that I am getting levels back for the months, which I don't want. I searched on how to remove the levels but everything I found was about removing levels that are unused (which is not my case).
I also used as.Date but I am still having the same problem:
dates_Austria <- as.Date(Austria$date, "%d/%m/%Y")
My final purpose is to make a plot which will have unemployment on the horizontal axis, income level on the vertical axis and then change the color of the plot according to the month, like that:
ggplot(data = my_data, aes(x = unemployment, y = income, colour = Month)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
But by using that code I am getting back a different regression line for each month. I want one line for all the data, with the dots of the scatter plot changing colour according to the month.
Any help would be appreciated.
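On the levels part only, a possible base R sketch, assuming the dates come as day/month/year text (the vector raw stands in for Austria$date and is made up): month names can be taken from the built-in month.name constant, which gives a plain character vector with no levels at all.

```r
# Hypothetical stand-in for Austria$date
raw <- c("15/01/2020", "03/03/2020")

# Parse the text as day/month/year
d <- as.Date(raw, format = "%d/%m/%Y")

# Look up English month names by month number; the result is character, not a factor
Month <- month.name[as.integer(format(d, "%m"))]
Month
```

Note that the separate regression lines come from the colour grouping in the plot rather than from the levels themselves, so removing the levels alone may not change the lines.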

How to properly plot a histogram with dates using ggplot?

I would like to create an interactive histogram with dates on the x-axis.
I have used ggplot+ggplotly.
I've read that I need to pass the proper information using the "text=as.character(mydates)" option and sometimes "tooltips=mytext".
This trick works for other kinds of plots but there is a problem with the histograms, instead of getting a single bar with a single value I get many sub-bars stacked.
I guess the reason is passing "text=as.character(fechas)" produces many values instead of just the class value defining that bar.
How can I solve this problem?
I have tried filtering the data myself, but I don't know how to make my parameters match those used by the histogram, such as where the dates start for each bar.
library(lubridate)
library(ggplot2)
library(plotly) # ggplotly() is a function in the plotly package
Ejemplo <- data.frame(fechas = dmy("1-1-20")+sample(1:100,100, replace=T),
valores=runif(100))
dibujo <- ggplot(Ejemplo, aes(x=fechas, text=as.character(fechas))) +
theme_bw() + geom_histogram(binwidth=7, fill="darkblue",color="black") +
labs(x="Fecha", y="Nº casos") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_x_date(date_breaks = "weeks", date_labels = "%d-%m-%Y",
limits=c(dmy("1-1-20"), dmy("1-4-20")))
ggplotly(dibujo)
ggplotly(dibujo, tooltip = "text")
As you can see, the bars are not regular histogram bars but something complex.
Using just ggplot instead of ggplotly shows the same problem, though then you wouldn't need to use the extra "text" parameter.
Presently, feeding as.character(fechas) to the text = ... argument inside of aes() will display the relative counts of distinct dates within each bin. Note the height of the first bar is simply a count of the total number of dates between 6th of January and the 13th of January.
After a thorough reading of your question, it appears you want the maximum date within each weekly interval. In other words, one date should hover over each bar. If you're partial to converting ggplot objects into plotly objects, then I would advise pre-processing the data frame before feeding it to the ggplot() function. First, group by week. Second, pull the desired date by each weekly interval to show as text (i.e., end date). Next, feed this new data frame to ggplot(), but now layer on geom_col(). This will achieve similar output since you're grouping by weekly intervals.
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
set.seed(13)
Ejemplo <- data.frame(fechas = dmy("1-1-20") + sample(1:100, 100, replace = T),
valores = runif(100))
Ejemplo_stat <- Ejemplo %>%
arrange(fechas) %>%
filter(fechas >= ymd("2020-01-01"), fechas <= ymd("2020-04-01")) %>% # specify the limits manually
mutate(week = week(fechas)) %>% # create a week variable
group_by(week) %>% # group by week
summarize(total_days = n(), # number of observations (rows) in the week
last_date = max(fechas)) # pull the maximum date within each weekly interval
dibujo <- ggplot(Ejemplo_stat, aes(x = factor(week), y = total_days, text = as.character(last_date))) +
geom_col(fill = "darkblue", color = "black") +
labs(x = "Fecha", y = "Nº casos") +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_discrete(label = function(x) paste("Week", x))
ggplotly(dibujo) # add more text (e.g., week id, total unique dates, and end date)
ggplotly(dibujo, tooltip = "text") # only the end date is revealed
The "end date" is displayed once you hover over each bar, as requested. Note, the value "2020-01-12" is not the last day of the second week. It is the last date observed in the second weekly interval.
The benefit of the preprocessing approach is your ability to modify your grouped data frame, as needed. For example, feel free to limit the date range to a smaller (or larger) subset of weeks, or start your weeks on a different day of the week (e.g., Sunday). Furthermore, if you want more textual options to display, you could also display your total number of unique dates next to each bar, or even display the date ranges for each week.
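The grouping step can also be sketched in base R on a handful of fixed dates (made up for illustration). The week number here is computed as day of year divided into blocks of seven, which is my understanding of what lubridate's week() returns; treat that equivalence as an assumption.

```r
# Five hypothetical dates in January 2020
d <- as.Date("2020-01-01") + c(0, 3, 6, 7, 12)

# POSIXlt stores day of year starting at 0, so days 0 to 6 fall in week 1
week <- as.POSIXlt(d)$yday %/% 7 + 1

# Latest date in each week, computed on the underlying day numbers
last <- vapply(split(as.numeric(d), week), max, numeric(1))

weekly <- data.frame(
  week = as.integer(names(last)),
  n = as.vector(table(week)),  # rows per week
  last_date = as.Date(unname(last), origin = "1970-01-01")
)
weekly
```

Each row of weekly corresponds to one bar, with last_date available as the hover text.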

How to plot bar chart of monthly deviations from annual mean?

SO!
I am trying to create a plot of monthly deviations from annual means for temperature data using a bar chart. I have data across many years and I want to show the seasonal behavior in temperatures between months. The bars should represent the deviation from the annual average, which is recalculated for each year. Here is an example that is similar to what I want, only it is for a single year:
My data is sensitive so I cannot share it yet, but I made a reproducible example using the txhousing dataset (it comes with ggplot2). The salesdiff column is the deviation between monthly sales (averaged across all cities) and the annual average for each year. Now the problem is plotting it.
library(ggplot2)
df <- aggregate(sales~month+year,txhousing,mean)
df2 <- aggregate(sales~year,txhousing,mean)
df2$sales2 <- df2$sales #RENAME sales
df2 <- df2[,-2] #REMOVE sales
df3<-merge(df,df2) #MERGE dataframes
df3$salesdiff <- df3$sales - df3$sales2 #FIND deviation between monthly and annual means
#plot deviations
ggplot(df3,aes(x=month,y=salesdiff)) +
geom_col()
My ggplot is not looking good at the moment-
Somehow it is stacking the columns for each month with all of the data across the years. Ideally the date would be along the x-axis spanning many years (I think the dataset is from 2000-2015...), and different colors depending on if salesdiff is higher or lower. You are all awesome, and I would welcome ANY advice!!!!
Probably the main issue here is that geom_col() will not take on different aesthetic properties unless you explicitly tell it to. One way to get what you want is to map the fill aesthetic to a variable that records whether each bar lies above or below the annual mean. Also, you're going to need to create date information which can be easily passed to ggplot(); I use the lubridate package for this task.
Note that we combine the "month" and "year" columns here, and then use ymd() to obtain date values. I chose not to convert the double valued "date" column in txhousing using something like date_decimal(), because sometimes it can confuse February and January months (e.g. Feb 1 gets "rounded down" to Jan 31).
I decided to plot a subset of the txhousing dataset, which is a lot more convenient to display for teaching purposes.
Code:
library("tidyverse")
library("lubridate") # provides ymd(); ggplot2 is already attached by tidyverse
# subset txhousing to just years >= 2011, and calculate nested means and dates
housing_df <- filter(txhousing, year >= 2011) %>%
group_by(year, month) %>%
summarise(monthly_mean = mean(sales, na.rm = TRUE),
date = first(date)) %>%
mutate(yearmon = paste(year, month, sep = "-"),
date = ymd(yearmon, truncated = 1), # create date column
salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
higherlower = case_when(salesdiff >= 0 ~ "higher", # for fill aes later
salesdiff < 0 ~ "lower"))
ggplot(data = housing_df, aes(x = date, y = salesdiff, fill = as.factor(higherlower))) +
geom_col() +
scale_x_date(date_breaks = "6 months",
date_labels = "%b-%Y") +
scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
theme_bw()+
theme(legend.position = "none") # remove legend
Plot:
You can see the periodic behaviour here nicely; an increase in sales appears to occur every spring, with sales decreasing during the fall and winter months. Do keep in mind that you might want to reverse the colours I assigned if you want to use this code for temperature data! This was a fun one - good luck, and happy plotting!
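The deviation step on its own can be sketched in base R with ave(), which returns the group mean repeated for every row. The numbers below are invented to keep the arithmetic easy to check.

```r
# Hypothetical monthly means for two years
year <- c(2000, 2000, 2000, 2001, 2001)
monthly_mean <- c(10, 20, 30, 5, 15)

# ave() gives each row its own year's mean (20 for 2000, 10 for 2001)
annual_mean <- ave(monthly_mean, year)

# Deviation of each month from the mean of its year
salesdiff <- monthly_mean - annual_mean

# Label used for the fill colour
higherlower <- ifelse(salesdiff >= 0, "higher", "lower")
data.frame(year, monthly_mean, annual_mean, salesdiff, higherlower)
```

Because the mean is recomputed within each year, the deviations in every year sum to zero.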
Something like this should work?
Basically you need to create a binary variable that lets you change the color (fill) if salesdiff is positive or negative, called below factordiff.
Plus you needed a date variable for month and year combined.
library(ggplot2)
library(dplyr)
df3$factordiff <- ifelse(df3$salesdiff>0, 1, 0) # factor variable for colors
df3 <- df3 %>%
mutate(date = sprintf("%d-%02d", year, month)) # zero-padded label such as "2001-01", so text order is chronological
#plot deviations
ggplot(df3,aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col()
Of course this results in a hard-to-read plot because you have lots of dates. You can subset it and show only a restricted time:
df3 %>%
filter(date >= "2014-01") %>% # keep data from January 2014 onwards
ggplot(aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adds label rotation

Time series from three years in one plot

I am struggling (due to lack of knowledge and experience) to create a plot in R with time series from three different years (2009, 2013 and 2017). Failing to solve this problem by searching online has led me here.
I wish to create a plot that shows change in nitrate concentrations over the course of May to October for all years, but keep failing since the x-axis is defined by one specific year. I also receive errors because the x-axis lengths differ (due to different number of samples). To solve this I have tried making separate columns for month and year, with no success.
Data example:
date NO3.mg.l year month
2009-04-22 1.057495 2009 4
2013-05-08 1.936000 2013 5
2017-05-02 2.608000 2017 5
Code:
ggplot(nitrat.all, aes(x = date, y = NO3.mg.l, colour = year)) + geom_line()
This code produces a plot where the lines are positioned next to one another, whilst I want a plot where they overlay one another. Any help will be much appreciated.
Probably, that will be helpful for plotting:
library("lubridate")
library("ggplot2")
# example of data with some points for each year
nitrat.all <- data.frame(date = c(ymd("2009-03-21"), ymd("2009-04-22"), ymd("2009-05-27"),
ymd("2010-03-15"), ymd("2010-04-17"), ymd("2010-05-10")), NO3.mg.l = c(1.057495, 1.936000, 2.608000,
3.157495, 2.336000, 3.908000))
nitrat.all$year <- format(nitrat.all$date, format = "%Y")
ggplot(data = nitrat.all) +
geom_point(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year)) +
geom_line(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year))
As for selecting of the dates corresponding to a certain month, you may subset your data frame by a condition using basic R-functions:
n_month1 <- 3 # an index of the first month of the period to select
n_month2 <- 4 # an index of the last month of the period to select
test_for_month <- (as.numeric(format(nitrat.all$date, format = "%m")) >= n_month1) &
(as.numeric(format(nitrat.all$date, format = "%m")) <= n_month2)
nitrat_to_plot <- nitrat.all[test_for_month, ]
Another, quite elegant, approach is to use filter() from the dplyr package:
nitrat.all$month <- as.numeric(format(nitrat.all$date, format = "%m"))
library("dplyr")
nitrat_to_plot <- filter(nitrat.all, ((month >= n_month1) & (month <= n_month2)))
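A possible variation, if you would rather keep a real date axis than text labels: move every date into one common dummy year, so that all years share the same x values. The choice of 2000 as the dummy year is an assumption made here because it is a leap year, and the sample dates are the three from the question.

```r
# The three sample dates from the question
d <- as.Date(c("2009-04-22", "2013-05-08", "2017-05-02"))

# Keep the real year for grouping and colour
year <- format(d, "%Y")

# Same month and day, but in the dummy year 2000
common_date <- as.Date(format(d, "2000-%m-%d"))
data.frame(d, year, common_date)
```

Plotting common_date on the x-axis with year as the colour and group should make the lines overlay, and the axis labels can then be formatted to show only month and day.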

Resources