Create barplot to represent time series in ggplot2 - r

I have a basic dataframe with 3 columns: (i) a date (when a sample was taken); (ii) a site location and (iii) a binary variable indicating what the condition was when sampling (e.g. wet versus dry).
Some reproducible data:
df <- data.frame(Date = rep(seq(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
What I would like to do is use ggplot to create a bar chart indicating the condition of each site (y axis) over time (x axis) - the condition indicated by a different colour. I am guessing some kind of flipped barplot would be the way to do this, but I cannot figure out how to tell ggplot2 to recognise the values chronologically, rather than summed for each condition. This is my attempt so far which clearly doesn't do what I need it to.
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()
So I have 2 questions. Firstly, how do I tell ggplot to recognise changes in condition over time and not just group each condition in a traditional stacked bar chart?
Secondly, it seems ggplot converts the date to a numerical value, how would I reformat the x-axis to show a time period, e.g. in a month-year format? I have tried doing this via the scale_x_date function, but get an error message.
labDates <- seq(from = (head(df$Date, 1)),
to = (tail(df$Date, 1)), by = "1 months")
Datelabels <-format(labDates,"%b %y")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_x_date(labels = Datelabels, breaks=labDates)
I have also tried converting sampling times to factors and displaying these instead. Below I have done this by changing each sampling period to a letter (in my own code, the factor levels are in a month-year format - I put letters here for simplicity). But I cannot format the axis to place each level of the factor as a tick mark. Either a date or factor solution for this second question would be great!
df$Factor <- as.factor(unique(df$Date))
levels(df$Factor) <- list(A = "2010-01-01", B = "2010-02-01",
C = "2010-03-01", D = "2010-04-01", E = "2010-05-01",
`F` = "2010-06-01", G = "2010-07-01", H = "2010-08-01",
I = "2010-09-01", J = "2010-10-01", K= "2010-11-01", L = "2010-12-01")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_y_discrete(breaks=as.numeric(unique(df$Date)),
labels=levels(df$Factor))
Thank you in advance!

It doesn't really make sense to use geom_bar() considering you do not want to summarise the data and require the visualisation over "time"
I would rather use geom_line() and increase the line thickness if you want to portray a bar chart.
library(tidyr)
library(dplyr)
library(ggplot2)
library(scales)
library(lubridate)
df <- data.frame(Date = rep(seq.Date(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
df$Date <- ymd(df$Date)
ggplot(df) +
geom_line(aes(y=Site,x=Date,color=Condition),size=10)+
scale_x_date(labels = date_format("%b-%y"))
Note using coord_flip() also does not work, I think this causes the Date issue, see below threads:
how to use coord_carteisan and coord_flip together in ggplot2
In ggplot2, coord_flip and free scales don't work together

Related

ggplot scale for time of date only, when using POSIXct datetimes

In ggplot2, I have a question about appropriate scales for making POSIXct datetimes into time-of-day in an axis. Consider:
library(tidyverse)
library(lubridate)
library(hms)
library(patchwork)
test <- tibble(
dates = c(ymd_hms("2022-01-01 6:00:00"),
ymd_hms("2023-01-01 19:00:00")),
x = c(1, 2),
hms_dates = as_hms(dates)
)
plot1 <- ggplot(test) + geom_point(aes(x = x, y = dates)) +
scale_y_time()
plot2 <- ggplot(test) + geom_point(aes(x = x, y = hms_dates)) +
scale_y_time()
plot1 + plot2
Plot 1 y axis includes dates and time, but Plot 2 shows just time of day. That's what I want! I'd like to generate plot 2 like images without having to use the hms::as_hms approach. This seems to imply some options for scale_y_datetime (or similar) that I can't discover. I'd welcome suggestions.
Does someone have an example of how to use the limits option in scale_*_time, or (see question #1) limits for a scale_y_datetime that specifies hours within the day, e.g. .. limits(c(8,22)) predictably fails.
For your second question, when dealing with dates or datetimes or times you have to set the limits and/or breaks as dates, datetimes or times too, i.e. use limits = as_hms(c("8:00:00", "22:00:00"):
library(tidyverse)
library(lubridate)
library(hms)
ggplot(test) + geom_point(aes(x = x, y = hms_dates)) +
scale_y_time(limits = as_hms(c("8:00:00", "22:00:00")))
#> Warning: Removed 1 rows containing missing values (`geom_point()`).
Concerning your first question. TBMK this could not be achieved via scale_..._datetime. And if you just want to show the time part of your dates then converting to an has object is IMHO the easiest way to achieve that. You could of course set the units to be shown as axis text via the date_labels argument, e.g. date_labels="%H:%M:%S" to show only the time of day. However, as your dates variable is still a datetime the scale, breaks and limits will still reflect that, i.e. you only change the format of the labels and for your example data you end up with an axis showing the same time for each break, i.e. the start of the day.
ggplot(test) + geom_point(aes(x = x, y = dates)) +
scale_y_datetime(date_labels = "%H:%M:%S")

How to properly plot a histogram with dates using ggplot?

I would like to create an interactive histogram with dates on the x-axis.
I have used ggplot+ggplotly.
I've read I need to use to pass the proper information using the "text=as.character(mydates)" option and sometimes "tooltips=mytext".
This trick works for other kinds of plots but there is a problem with the histograms, instead of getting a single bar with a single value I get many sub-bars stacked.
I guess the reason is passing "text=as.character(fechas)" produces many values instead of just the class value defining that bar.
How can I solve this problem?
I have tried filtering myself the data but I don't know how to make this the parameters match the parameters used by the histogram, such as where the dates start for each bar.
library(lubridate)
library(ggplot2)
library(ggplotly)
Ejemplo <- data.frame(fechas = dmy("1-1-20")+sample(1:100,100, replace=T),
valores=runif(100))
dibujo <- ggplot(Ejemplo, aes(x=fechas, text=as.character(fechas))) +
theme_bw() + geom_histogram(binwidth=7, fill="darkblue",color="black") +
labs(x="Fecha", y="Nº casos") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_x_date(date_breaks = "weeks", date_labels = "%d-%m-%Y",
limits=c(dmy("1-1-20"), dmy("1-4-20")))
ggplotly(dibujo)
ggplotly(dibujo, tooltip = "text")
As you can see, the bars are not regular histogram bars but something complex.
Using just ggplot instead of ggplotly shows the same problem, though then you woulnd't need to use the extra "text" parameter.
Presently, feeding as.character(fechas) to the text = ... argument inside of aes() will display the relative counts of distinct dates within each bin. Note the height of the first bar is simply a count of the total number of dates between 6th of January and the 13th of January.
After a thorough reading of your question, it appears you want the maximum date within each weekly interval. In other words, one date should hover over each bar. If you're partial to converting ggplot objects into plotly objects, then I would advise pre-processing the data frame before feeding it to the ggplot() function. First, group by week. Second, pull the desired date by each weekly interval to show as text (i.e., end date). Next, feed this new data frame to ggplot(), but now layer on geom_col(). This will achieve similar output since you're grouping by weekly intervals.
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
set.seed(13)
Ejemplo <- data.frame(fechas = dmy("1-1-20") + sample(1:100, 100, replace = T),
valores = runif(100))
Ejemplo_stat <- Ejemplo %>%
arrange(fechas) %>%
filter(fechas >= ymd("2020-01-01"), fechas <= ymd("2020-04-01")) %>% # specify the limits manually
mutate(week = week(fechas)) %>% # create a week variable
group_by(week) %>% # group by week
summarize(total_days = n(), # total number of distinct days
last_date = max(fechas)) # pull the maximum date within each weekly interval
dibujo <- ggplot(Ejemplo_stat, aes(x = factor(week), y = total_days, text = as.character(last_date))) +
geom_col(fill = "darkblue", color = "black") +
labs(x = "Fecha", y = "Nº casos") +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_discrete(label = function(x) paste("Week", x))
ggplotly(dibujo) # add more text (e.g., week id, total unique dates, and end date)
ggplotly(dibujo, tooltip = "text") # only the end date is revealed
The "end date" is displayed once you hover over each bar, as requested. Note, the value "2020-01-12" is not the last day of the second week. It is the last date observed in the second weekly interval.
The benefit of the preprocessing approach is your ability to modify your grouped data frame, as needed. For example, feel free to limit the date range to a smaller (or larger) subset of weeks, or start your weeks on a different day of the week (e.g., Sunday). Furthermore, if you want more textual options to display, you could also display your total number of unique dates next to each bar, or even display the date ranges for each week.

How to plot a variable over time with time as rownames

I am trying to plot a time series in ggplot2. Assume I am using the following data structure (2500 x 20 matrix):
set.seed(21)
n <- 2500
x <- matrix(replicate(20,cumsum(sample(c(-1, 1), n, TRUE))),nrow = 2500,ncol=20)
aa <- x
rnames <- seq(as.Date("2010-01-01"), length=dim(aa)[1], by="1 month") - 1
rownames(aa) <- format(as.POSIXlt(rnames, format = "%Y-%m-%d"), format = "%d.%m.%Y")
colnames(aa) <- paste0("aa",1:k)
library("ggplot2")
library("reshape2")
library("scales")
aa <- melt(aa, id.vars = rownames(aa))
names(aa) <- c("time","id","value")
Now the following command to plot the time series produces a weird looking x axis:
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
What I found out is that I can change the format to date:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
This looks better, but still not a good graph. My question is especially how to control the formatting of the x axis.
Does it have to be in Date format? How can I control the amount of breaks (i.e. years) shown in either case? It seems to be mandatory if Date is not used; otherwise ggplot2 uses some kind of useful default for the breaks I believe.
For example the following command does not work:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line() +
scale_x_continuous(breaks=pretty_breaks(n=10))
Also if you got any hints how to improve the overall look of the graph feel free to add (e.g. the lines look a bit inprecise imho).
You can format dates with scale_x_date as #Gopala mentioned. Here's an example using a shortened version of your data for illustration.
library(dplyr)
# Dates need to be in date format
aa$time <- as.Date(aa$time, "%d.%m.%Y")
# Shorten data to speed rendering
aa = aa %>% group_by(id) %>% slice(1:200)
In the code below, we get date breaks every six months with date_breaks="6 months". That's probably more breaks than you want in this case and is just for illustration. If you want to determine which months get the breaks (e.g., Jan/July, Feb/Aug, etc.) then you also need to use coord_cartesian and set the start date with xlim and expand=FALSE so that ggplot won't pad the start date. But when you set expand=FALSE you also don't get any padding on the y-axis, so you need to add the padding manually with scale_y_continuous (I'd prefer to be able to set expand separately for the x and y axes, but AFAIK it's not possible). Because the breaks are packed tightly, we use a theme statement to rotate the labels by 90 degrees.
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line(show.legend=FALSE) +
scale_y_continuous(limits=c(min(aa$value) - 2, max(aa$value) + 1)) +
scale_x_date(date_breaks="6 months",
labels=function(d) format(d, "%b %Y")) +
coord_cartesian(xlim=c(as.Date("2009-07-01"), max(aa$time) + 182),
expand=FALSE) +
theme_bw() +
theme(axis.text.x=element_text(angle=-90, vjust=0.5))

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for cars, trucks, suvs
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is there are two different x axis on the plot, one for Data1 and another for Data2. I want to have a single x axis spanning the date range determined by the dates in Data1 and Data2.
My questions is:
How do i dynamically create an x axis for both series? (i.e select the minimum and maximum date from the data frames 'Data1' and 'Data2')
The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the timeseries do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has it's own x-axis, i.e. one for each sub-dataset. What approach is most valid depends on the specific situation.

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour per srv01, srv02, srv03, srv04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), bin=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), bin=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The examples are given below. The third makes comparison easier but ggplot2 simply looks better I think.
EDIT
Here is how thes solutions would look like
first solution:
second solution:
third solution:
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magick:
hits_hour = count(dat, vars = c("server","hr"))
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.

Resources