I have a dataset with events. These events have a start time and a duration. I want to create a scatter plot with the start time on the x-axis and the duration on the y-axis, but I want to alter the x-axis so that it displays the course of a week. That is, I want the x-axis to start on Monday 00:00 and run through Sunday 23:59.
All the solutions I've found online show me how to perform group-by-and-sum over weekdays, which is not what I want to do. I want to plot all data points individually; I simply want to reduce the date axis to weekday and time.
Any suggestions?
This does what you need. It creates a new variable that places every observation in a single week, and then generates a scatter plot in the required format.
library(lubridate)
library(dplyr)
set.seed(1)
tmp <- data.frame(st_time = mdy("01-01-2018") + minutes(sample(1e5, size = 100)))
tmp <- tmp %>%
  mutate(st_week = floor_date(st_time, unit = 'week', week_start = 1)) %>% # calculate the start of the week (Monday)
  mutate(st_time_inweek = st_time - st_week) %>% # calculate the time elapsed from the start of the week
  mutate(st_time_all_in_oneweek = st_week[1] + st_time_inweek) %>% # put every obs in one week
  mutate(duration = runif(100, 0, 100)) # generate a random duration variable
This is how to generate the plot. The part "%a %H:%M:%S" could be just "%a" as the time portion is not informative.
library(ggplot2)
ggplot(tmp) + aes(x = st_time_all_in_oneweek, y = duration) +
  geom_point() + scale_x_datetime(date_labels = "%a %H:%M:%S", date_breaks = "1 day")
With "%a" the plot looks like this:
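The same mapping can be wrapped in a small helper. This is only a sketch: the function name to_reference_week and the reference Monday are hypothetical choices, not part of the answer above. Note that floor_date() starts weeks on Sunday by default, so week_start = 1 is passed to get a Monday-to-Sunday axis.

```r
library(lubridate)

# Hypothetical helper: shift any timestamp into one reference week that
# starts on Monday 00:00. The reference Monday is an arbitrary assumption.
to_reference_week <- function(x,
                              ref_monday = ymd_hms("2018-01-01 00:00:00", tz = "UTC")) {
  # Monday 00:00 of the week each timestamp falls in
  week_begin <- floor_date(x, unit = "week", week_start = 1)
  # Seconds elapsed since that Monday, added to the reference Monday
  ref_monday + as.numeric(difftime(x, week_begin, units = "secs"))
}

x <- ymd_hms(c("2018-01-03 12:30:00", "2018-02-11 23:59:00"), tz = "UTC")
mapped <- to_reference_week(x)
mapped
```

The result can be used directly as the x aesthetic with scale_x_datetime(date_labels = "%a", date_breaks = "1 day").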
Maybe late, but for others searching:
there is a solution with
scale_x_date(date_labels = '%a')
described here: Weekdays below date on x-axis in ggplot2
I would like to create an interactive histogram with dates on the x-axis.
I have used ggplot+ggplotly.
I've read that I need to pass the proper information using the "text=as.character(mydates)" aesthetic and sometimes the "tooltip=mytext" argument.
This trick works for other kinds of plots, but there is a problem with histograms: instead of getting a single bar with a single value, I get many sub-bars stacked.
I guess the reason is passing "text=as.character(fechas)" produces many values instead of just the class value defining that bar.
How can I solve this problem?
I have tried filtering the data myself, but I don't know how to make my parameters match the ones used by the histogram, such as where the dates start for each bar.
library(lubridate)
library(ggplot2)
library(plotly) # provides ggplotly()
Ejemplo <- data.frame(fechas = dmy("1-1-20")+sample(1:100,100, replace=T),
valores=runif(100))
dibujo <- ggplot(Ejemplo, aes(x=fechas, text=as.character(fechas))) +
theme_bw() + geom_histogram(binwidth=7, fill="darkblue",color="black") +
labs(x="Fecha", y="Nº casos") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_x_date(date_breaks = "weeks", date_labels = "%d-%m-%Y",
limits=c(dmy("1-1-20"), dmy("1-4-20")))
ggplotly(dibujo)
ggplotly(dibujo, tooltip = "text")
As you can see, the bars are not regular histogram bars but something complex.
Using just ggplot instead of ggplotly shows the same problem, though then you wouldn't need the extra "text" parameter.
Presently, feeding as.character(fechas) to the text = ... argument inside of aes() will display the relative counts of distinct dates within each bin. Note the height of the first bar is simply a count of the total number of dates between 6th of January and the 13th of January.
After a thorough reading of your question, it appears you want the maximum date within each weekly interval. In other words, one date should hover over each bar. If you're partial to converting ggplot objects into plotly objects, then I would advise pre-processing the data frame before feeding it to the ggplot() function. First, group by week. Second, pull the desired date by each weekly interval to show as text (i.e., end date). Next, feed this new data frame to ggplot(), but now layer on geom_col(). This will achieve similar output since you're grouping by weekly intervals.
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
set.seed(13)
Ejemplo <- data.frame(fechas = dmy("1-1-20") + sample(1:100, 100, replace = T),
valores = runif(100))
Ejemplo_stat <- Ejemplo %>%
  arrange(fechas) %>%
  filter(fechas >= ymd("2020-01-01"), fechas <= ymd("2020-04-01")) %>% # specify the limits manually
  mutate(week = week(fechas)) %>% # create a week variable
  group_by(week) %>% # group by week
  summarize(total_days = n(), # number of observations (rows) in each week
            last_date = max(fechas)) # pull the maximum date within each weekly interval
dibujo <- ggplot(Ejemplo_stat, aes(x = factor(week), y = total_days, text = as.character(last_date))) +
geom_col(fill = "darkblue", color = "black") +
labs(x = "Fecha", y = "Nº casos") +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_discrete(label = function(x) paste("Week", x))
ggplotly(dibujo) # add more text (e.g., week id, total unique dates, and end date)
ggplotly(dibujo, tooltip = "text") # only the end date is revealed
The "end date" is displayed once you hover over each bar, as requested. Note, the value "2020-01-12" is not the last day of the second week. It is the last date observed in the second weekly interval.
The benefit of the preprocessing approach is your ability to modify your grouped data frame, as needed. For example, feel free to limit the date range to a smaller (or larger) subset of weeks, or start your weeks on a different day of the week (e.g., Sunday). Furthermore, if you want more textual options to display, you could also display your total number of unique dates next to each bar, or even display the date ranges for each week.
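As a rough sketch of those options, the grouping could be done with floor_date() so that the week can start on a chosen day, and a label with the date range can be built for the tooltip. The column names week_start, n_obs, n_dates and label are made up for this illustration.

```r
library(dplyr)
library(lubridate)

set.seed(13)
Ejemplo <- data.frame(fechas = dmy("1-1-20") + sample(1:100, 100, replace = TRUE))

Ejemplo_rangos <- Ejemplo %>%
  # First day of each week; week_start = 7 means weeks begin on Sunday
  mutate(week_start = floor_date(fechas, unit = "week", week_start = 7)) %>%
  group_by(week_start) %>%
  summarize(n_obs = n(),                   # rows falling in the week
            n_dates = n_distinct(fechas),  # distinct calendar dates in the week
            .groups = "drop") %>%
  # Text for the tooltip: the date range of the week plus the count
  mutate(label = paste0(week_start, " to ", week_start + days(6), ": ", n_obs, " obs"))

head(Ejemplo_rangos)
```

Passing text = label inside aes() and calling ggplotly(..., tooltip = "text") should then show the range and count for each bar.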
I'm working on a school project and I have been trying to solve this for some time now, but I can't find a solution.
The problem is that whenever I run this, the x axis is crowded with too many labels. I found a similar post, but that post works with ordinary variables, not with date-time variables (%Y/%m) like mine, which creates problems when I try to run code like this:
"scale_x_discrete(breaks = seq(0, 100, by = 5))"
Keep in mind I have many rows, I don't know if that can cause problems but:
And the code:
plottest1 <- function(St, na){
  test1 <- ggplot(data = KunskiDepozit1, aes(x=Datum, y=St, group = 1)) +
    geom_line() + labs(x = "Datum", y = na, title = paste("vizualization ", na)) + geom_point()
  # theme() changes only this plot; theme_update() would alter the global theme
  test1 <- test1 +
    theme(plot.title = element_text(hjust = 0.5))
  return(test1)
}
As Geosopher and PoGibas have noted you need to make sure ggplot understands that Datum is a date. You may want to consider the package lubridate.
If I squint, I think your date information is in one column as YYYY-MM, so, to achieve that, you merely need something like:
date_df <- existing_df %>%
  mutate(Datum = paste0(Datum, "-01")) %>%
  mutate(Datum = lubridate::ymd(Datum))
I have extracted the following sample code from the chapter on lubridate of R for Data Science (freely available online) which explains how to do it when your date and time elements are split in various columns, using the function lubridate::make_datetime. It also shows that you can plot a date-time variable directly and ggplot will do the right thing.
library(tidyverse)
library(lubridate)
library(nycflights13) # Dataset with flight details
# Custom function to transform the date and time information from several columns
# into one "date-time" column. You may be able to get away simply with make_datetime
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
# Apply that function to relevant columns in the dataset
flights_dt <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
# Plot the dataset
flights_dt %>%
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
As your x axis is a date I would try using an actual date axis instead of a discrete axis. Maybe play around with something like:
scale_x_date(date_breaks = "2 weeks")
Check out the ggplot2 scale documentation for the details!
?ggplot2::scale_x_date
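Putting the two suggestions together, a minimal sketch could look like this. The data frame below is invented as a stand-in for KunskiDepozit1, and it assumes Datum holds strings in "YYYY/MM" form as described in the question.

```r
library(lubridate)
library(ggplot2)

# Hypothetical stand-in for the asker's data: 36 monthly values
set.seed(1)
df <- data.frame(
  Datum = format(seq(as.Date("2015-01-01"), by = "month", length.out = 36), "%Y/%m"),
  St = cumsum(rnorm(36))
)

# Append a day so each string is a complete date, then parse it
df$Datum <- ymd(paste0(df$Datum, "/01"))

# With a real Date column, scale_x_date() controls how many labels appear
p <- ggplot(df, aes(x = Datum, y = St)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "6 months", date_labels = "%Y/%m") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
```

Changing date_breaks (for example to "1 year") is the knob that thins out or densifies the labels.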
I have a set of data showing patients arrival and departure in a hospital:
arrival<-c("12:00","12:30","14:23","16:55","00:04","01:00","03:00")
departure<-c("13:00","16:00","17:38","00:30","02:00","07:00","23:00")
I want to produce a histogram counting the number of patients at each time band (00:00-01:00; 01:00-02:00 etc) in the hospital.
So I would get something like: between 12:00 and 12:59 there are 2 patients, etc.
You can try this. I changed the example data slightly to ensure that each departure time is later than its arrival time; ideally you would have both date and time for arrival and departure. In the figure below, the time label 10:00 represents the band 10:00-10:59; you can change the labels if you want.
arrival<-c("12:00","12:30","14:23","16:55","00:04","01:00","03:00")
departure<-c("13:00","16:00","17:38","23:30","02:00","07:00","11:00")
df <- data.frame(arrival=strptime(arrival, '%H:%M'),departure=strptime(departure, '%H:%M'))
hours_present <- do.call('c', apply(df, 1, function(x) seq(from=as.POSIXct(x[1], tz='UTC'),
to=as.POSIXct(x[2], tz='UTC'), by="hour")))
library(ggplot2)
qplot(hours_present, geom='bar') +
scale_x_datetime(date_breaks= "1 hour", date_labels = "%H:%M",
limits = as.POSIXct(c(strptime("0:00", "%H:%M"), strptime("23:00", "%H:%M")), tz='UTC')) +
scale_y_continuous(breaks=1:5) +
theme(axis.text.x = element_text(angle=90, vjust = 0.5))
You can use 'histogram' instead as the geom in qplot to get the following figure:
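For the counts themselves, here is a sketch of a different technique from the seq()-based code above: treat each stay as an interval and count how many stays overlap each hourly band. It uses the modified example data, assumes every stay falls within a single day, and the helper to_minutes is a name made up for this example.

```r
arrival   <- c("12:00","12:30","14:23","16:55","00:04","01:00","03:00")
departure <- c("13:00","16:00","17:38","23:30","02:00","07:00","11:00")

# Hypothetical helper: "HH:MM" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) as.integer(p[1]) * 60 + as.integer(p[2]))
}
arr <- to_minutes(arrival)
dep <- to_minutes(departure)

# Hourly bands [00:00, 01:00), [01:00, 02:00), ...
band_start <- (0:23) * 60
band_end   <- band_start + 60

# A patient is present in a band if the stay [arr, dep) overlaps the band
present <- sapply(seq_along(band_start), function(i) {
  sum(arr < band_end[i] & dep > band_start[i])
})
names(present) <- sprintf("%02d:00", 0:23)
present
```

The named vector can be drawn with barplot(present) or turned into a data frame for geom_col().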
I am trying to plot a time series in ggplot2. Assume I am using the following data structure (2500 x 20 matrix):
set.seed(21)
n <- 2500
x <- matrix(replicate(20,cumsum(sample(c(-1, 1), n, TRUE))),nrow = 2500,ncol=20)
aa <- x
rnames <- seq(as.Date("2010-01-01"), length=dim(aa)[1], by="1 month") - 1
rownames(aa) <- format(as.POSIXlt(rnames, format = "%Y-%m-%d"), format = "%d.%m.%Y")
colnames(aa) <- paste0("aa", 1:ncol(aa))
library("ggplot2")
library("reshape2")
library("scales")
aa <- melt(aa, id.vars = rownames(aa))
names(aa) <- c("time","id","value")
Now the following command to plot the time series produces a weird looking x axis:
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
What I found out is that I can change the format to date:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
This looks better, but still not a good graph. My question is especially how to control the formatting of the x axis.
Does it have to be in Date format? How can I control the number of breaks (i.e. years) shown in either case? Setting the breaks manually seems to be mandatory if Date is not used; otherwise I believe ggplot2 picks some kind of useful default for the breaks.
For example the following command does not work:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line() +
scale_x_continuous(breaks=pretty_breaks(n=10))
Also, if you have any hints on how to improve the overall look of the graph, feel free to add them (e.g. the lines look a bit imprecise imho).
You can format dates with scale_x_date as @Gopala mentioned. Here's an example using a shortened version of your data for illustration.
library(dplyr)
# Dates need to be in date format
aa$time <- as.Date(aa$time, "%d.%m.%Y")
# Shorten data to speed rendering
aa = aa %>% group_by(id) %>% slice(1:200)
In the code below, we get date breaks every six months with date_breaks="6 months". That's probably more breaks than you want in this case and is just for illustration. If you want to determine which months get the breaks (e.g., Jan/July, Feb/Aug, etc.) then you also need to use coord_cartesian and set the start date with xlim and expand=FALSE so that ggplot won't pad the start date. But when you set expand=FALSE you also don't get any padding on the y-axis, so you need to add the padding manually with scale_y_continuous (I'd prefer to be able to set expand separately for the x and y axes, but AFAIK it's not possible). Because the breaks are packed tightly, we use a theme statement to rotate the labels by 90 degrees.
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line(show.legend=FALSE) +
scale_y_continuous(limits=c(min(aa$value) - 2, max(aa$value) + 1)) +
scale_x_date(date_breaks="6 months",
labels=function(d) format(d, "%b %Y")) +
coord_cartesian(xlim=c(as.Date("2009-07-01"), max(aa$time) + 182),
expand=FALSE) +
theme_bw() +
theme(axis.text.x=element_text(angle=-90, vjust=0.5))
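If the goal is only to control roughly how many breaks appear, a sketch along these lines may also be worth trying: pretty_breaks() from scales can be handed to scale_x_date() rather than scale_x_continuous(). The small data frame is invented for illustration, and the expand setting on the x scale alone is shown as one possible way to drop padding on a single axis.

```r
library(ggplot2)
library(scales)

# Hypothetical single series: 200 monthly observations of a random walk
set.seed(21)
df <- data.frame(
  time  = seq(as.Date("2010-01-01"), by = "month", length.out = 200),
  value = cumsum(sample(c(-1, 1), 200, TRUE))
)

p <- ggplot(df, aes(x = time, y = value)) +
  geom_line() +
  scale_x_date(breaks = pretty_breaks(n = 10),  # ask for about 10 breaks
               date_labels = "%Y",
               expand = c(0, 0)) +              # no padding on the x axis only
  theme_bw()
p

# The break function can be inspected on its own
b <- pretty_breaks(n = 10)(range(df$time))
b
```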
What is the smartest way to manipulate POSIX for use in ggplot axis?
I am trying to create a function for plotting many graphs (One per day) spanning a period of weeks, using POSIX time for the x axis.
To do so, I create an additional integer column DF$Day with the day, that I input into the function. Then, I create a subset using that day, which I plot using ggplot2. I figured how to use scale_x_datetime to format the POSIX x axis. Basically, I have it show the hours & minutes only, omitting the date.
Here is my question: How can I set the limits for each individual graph in hours of the day?
Below is some working, reproducible code to give an idea. It creates the first day, shows it for 3 seconds and then proceeds to create the second day. But each day's limits are chosen based on the range of the time variable. How can I make the range, for instance, all day long (0h - 24h)?
library(ggplot2)
library(scales) # for date_format()
DF <- data.frame(matrix(ncol = 0, nrow = 4))
DF$time <- as.POSIXct(c("2010-01-01 02:01:00", "2010-01-01 18:10:00", "2010-01-02 04:20:00", "2010-01-02 13:30:00"))
DF$observation <- c(1,2,1,2)
DF$Day <- c(1,1,2,2)
for (Individual_Day in 1:2) {
  Day_subset <- DF[DF$Day == as.integer(Individual_Day),]
  print(ggplot( data=Day_subset, aes_string( x="time", y="observation") ) + geom_point() +
    scale_x_datetime( breaks=("2 hour"), minor_breaks=("1 hour"), labels=date_format("%H:%M")))
  Sys.sleep(3)
}
Well, here's one way.
# ...
for (Individual_Day in 1:2) {
Day_subset <- DF[DF$Day == as.integer(Individual_Day),]
lower <- with(Day_subset,as.POSIXct(strftime(min(time),"%Y-%m-%d")))
upper <- with(Day_subset,as.POSIXct(strftime(as.Date(max(time))+1,"%Y-%m-%d"))-1)
limits = c(lower,upper)
print(ggplot( data=Day_subset, aes( x=time, y=observation) ) +
geom_point() +
scale_x_datetime( breaks=("2 hour"),
minor_breaks=("1 hour"),
labels=date_format("%H:%M"),
limits=limits)
)
}
The calculation for lower takes the minimum time in the subset and coerces it to character with only the date part (i.e., strips away the time part). Converting back to POSIXct generates the beginning of that day.
The calculation for upper is a little more complicated. You have to convert the maximum time to a Date value and add 1 (i.e., 1 day), then convert to character (strip off the time part), convert back to POSIXct, and subtract 1 (i.e., 1 second). This generates 23:59:59 on the end day.
Huge amount of work for such a small thing. I hope someone else posts a simpler way to do this...
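One possibly simpler route, sketched here with lubridate, is to round down to the start of the day and add a day minus one second. This assumes the lubridate package is acceptable and uses UTC timestamps to keep the example deterministic; the original data used the local time zone.

```r
library(lubridate)

# Times for one day's subset (UTC is an assumption made for this sketch)
time <- as.POSIXct(c("2010-01-02 04:20:00", "2010-01-02 13:30:00"), tz = "UTC")

lower <- floor_date(min(time), unit = "day")                         # 00:00:00 of that day
upper <- floor_date(max(time), unit = "day") + days(1) - seconds(1)  # 23:59:59 of that day
limits <- c(lower, upper)
limits
```

The limits vector can then be passed as scale_x_datetime(limits = limits) inside the loop.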