I'm trying to create a histogram from time-series data in R, similar to this question. Each bin should show the total duration for the values falling within the bin. I have non-integer sample times in a zoo object of thousands of rows. The timestamps are irregular, and the data is assumed to be constant between timestamps (sample-and-hold).
Example data:
library(zoo)
library(ggplot2)
library(scales)
timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5", "2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3", "2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))
ggplot(x.df, aes(x = Date, y = Value)) + geom_step() + scale_x_datetime(labels = date_format("%H:%M:%OS"))
Please see the time-series plot here. Creating a histogram with hist(z, freq = TRUE) does not take the timestamps into account: plot from the hist method.
My desired output is a histogram with duration in seconds on the y-axis, something like this: Histogram with non-integer duration on y-axis.
Edit:
I should point out that the data values are not integers, and that I want to be able to control the bin width(s). I could use diff(timestamp) to create a (non-integer) column showing the duration of each point, and plot a bar graph as suggested by @MKR:
x.df = data.frame(DurationSecs = as.numeric(diff(timestamp)), Value = data[-length(data)])
ggplot(x.df, aes(x = Value, y = DurationSecs)) + geom_bar(stat = "identity")
This gives a histogram with the right bar heights for the example, but it fails when the values are floating-point numbers.
Since you want duration (in seconds) on the y-axis, you should add a column to x.df for duration. A histogram with stat = "sum" will fit the OP's needs. The steps are:
library(zoo)
library(dplyr)
library(ggplot2)
timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5",
"2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3",
"2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))
# DurationSecs is added as numeric. It is the time difference to the next timestamp.
x.df <- x.df %>% arrange(Date) %>%
mutate(DurationSecs = ifelse(is.na(lead(Date)), 0, lead(Date) - Date))
# Draw the plot now
ggplot(x.df, aes(x = Value, y = DurationSecs)) + geom_histogram(stat="sum")
#The data
# Date Value DurationSecs
#1 2018-02-21 15:00:00 0 2.5
#2 2018-02-21 15:00:02 3 2.7
#3 2018-02-21 15:00:05 5 1.8
#4 2018-02-21 15:00:07 1 2.3
#5 2018-02-21 15:00:09 3 0.7
#6 2018-02-21 15:00:10 0 2.0
#7 2018-02-21 15:00:12 2 0.0
After some trial and error I found a solution. The answer provided by MKR sort of works, but I could not set the number of bins and it failed for floating-point values.
I came across the wonderful functions cut and xtabs in this question: How to plot an histogram with y as a sum of the x values for every bin in ggplot2. The solution provided there was painfully slow, drawing each data-point duration as stacked bars.
I don't need separate bars for each data-point, I just need the sum of the durations within each bin. This is my solution:
library(dplyr)
library(magrittr)
library(zoo)
library(ggplot2)
timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5",
"2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3",
"2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))
# DurationSecs is added as numeric. It is the time difference to the next datapoint.
x.df <- x.df %>% arrange(Date) %>%
mutate(DurationSecs = ifelse(is.na(lead(Date)), 0, lead(Date) - Date))
# Adding a column of bins to the dataframe:
BinCount <- 7
x.df$bins = cut(x.df$Value, pretty(x.df$Value, n = BinCount), include.lowest = TRUE, right = FALSE)
# Creating a new dataframe containing bins and the sum of DurationSecs for each bin.
y.df = data.frame(xtabs(DurationSecs ~ bins, x.df))
# Ready to plot
ggplot(y.df, aes(x = bins, y = Freq)) +
geom_bar(stat = "identity") +
ylab("Duration") +
xlab("Value") +
scale_x_discrete(drop = F) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10))
The result is shown here. As a bonus, the labels on the x-axis are really beautiful, and I have the frequency table available for further analysis.
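If you need explicit control over the bin widths rather than letting pretty() choose, you can pass your own break points to cut. A minimal sketch, under the assumption that bins of width 1 from 0 to 6 cover the data (reusing x.df from above):
# Hypothetical fixed-width bins of 1 unit; adjust the range to your own data
my.breaks <- seq(0, 6, by = 1)
x.df$bins <- cut(x.df$Value, my.breaks, include.lowest = TRUE, right = FALSE)
y.df <- data.frame(xtabs(DurationSecs ~ bins, x.df))
ggplot(y.df, aes(x = bins, y = Freq)) + geom_bar(stat = "identity")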
Related
I created a plot in R using the ggplot2 library:
library(ggplot2)
ggplot(df, aes(x = yQ, y = value, group =1)) +
geom_line(aes(color = variable), size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
I got the plot that I want, but the only problem is that the variable yQ has values in the format:
1990Q1
1990Q2
1990Q3
1990Q4
......
......
2017Q1
2017Q2
2017Q3
2017Q4
and because there are many years, the x-axis labels cannot show all the dates clearly (they overlap).
Therefore, I want the x-axis label to show only Q1 and Q3 for every 5 years.
So I want the x-axis to be something like this:
1990Q1 1990Q3 1995Q1 1995Q3 ...... 2015Q1 2015Q3
I tried to use scale_x_date but my dates are not in date format (e.g. 1990Q1) and therefore this does not work. How can I fix it?
The question does not provide reproducible input, but using df from the Note below with the autoplot.zoo method of ggplot2's autoplot generic we can write:
library(ggplot2)
library(zoo)
z <- read.zoo(df, index = "yQ", FUN = as.yearqtr)
autoplot(z) + scale_x_yearqtr()
Note
Test input--
df <- data.frame(yQ = c("1990Q1", "1990Q2", "1990Q3", "1990Q4"), value = 1:4)
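If you also want 1990Q1-style labels and control over the number of breaks, zoo's scale_x_yearqtr takes format and n arguments; a hedged sketch, reusing z from above:
autoplot(z) + scale_x_yearqtr(format = "%YQ%q", n = 8)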
The zoo::format.yearqtr() function is quite easy to use with ggplot2.
Try
scale_x_date(labels = function(x) zoo::format.yearqtr(x, "%YQ%q"))
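For a self-contained sketch of that idea (the data below is made up, and I convert the Date axis values to yearqtr explicitly before formatting):
library(ggplot2)
library(zoo)
# Hypothetical quarterly series stored as Date (first day of each quarter)
df <- data.frame(date = as.Date(as.yearqtr(1990 + 0:27 / 4)),
                 value = cumsum(rnorm(28)))
ggplot(df, aes(date, value)) +
  geom_line() +
  scale_x_date(date_breaks = "1 year",
               labels = function(x) format(zoo::as.yearqtr(x), "%YQ%q"))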
Use function zoo::as.yearqtr (zoo package) to work with quarterly dates.
Generate example data:
year <- 1990:2000
quar <- paste0("Q", 1:4)
foo <- as.vector(outer(year, quar, paste0))
data <- data.frame(dateQ = foo, Y = rnorm(length(foo)))
head(data)
dateQ Y
1 1990Q1 -0.09944705
2 1991Q1 0.14493910
3 1992Q1 0.54856787
4 1993Q1 1.12966224
5 1994Q1 -0.93539302
6 1995Q1 0.24772265
Transform quarterly date to "normal" date:
data$dateNorm <- as.Date(zoo::as.yearqtr(data$dateQ))
head(data)
dateQ Y dateNorm
1 1990Q1 -0.09944705 1990-01-01
2 1991Q1 0.14493910 1991-01-01
3 1992Q1 0.54856787 1992-01-01
4 1993Q1 1.12966224 1993-01-01
5 1994Q1 -0.93539302 1994-01-01
6 1995Q1 0.24772265 1995-01-01
It sets Q1/2/3/4 as the first day of January/April/July/October.
data[grep("1991", data$dateQ), ]
dateQ Y dateNorm
2 1991Q1 0.1449391 1991-01-01
13 1991Q2 1.5878678 1991-04-01
24 1991Q3 -0.1071823 1991-07-01
35 1991Q4 2.2905729 1991-10-01
Now you can plot it or perform other calculations as it's in Date format.
library(ggplot2)
ggplot(data, aes(dateNorm, Y)) +
geom_line()
You can
manipulate x-axis breaks and labels with scale_x_discrete(breaks = ..., labels = ...)
change the angle of text with theme(axis.text.x = element_text(angle = ...))
I generated some data
Combs <- expand.grid(1990:2017, c("Q1", "Q2", "Q3", "Q4"))
df <- data.frame(
yQ = sort(apply(Combs, 1, paste, collapse="")),
value = runif(112)
)
In the first example, I subset the yQ values you want with a logical vector and change the angle of the text:
library(ggplot2)
pattern <- c(T, F, T, F, rep(F, 16))
ggplot(df, aes(x = yQ, y = value, group =1)) +
geom_line(aes(color = "red"), size = 1) +
scale_x_discrete(breaks = df$yQ[pattern], labels = df$yQ[pattern]) +
theme(axis.text.x = element_text(angle=90))
But notice that tick marks not specified by breaks are not shown - so the alternative is to copy the yQ values into a vector and set the non-relevant years to "":
xVec <- as.character(df$yQ)
xVec[pattern==F] <- ""
ggplot(df, aes(x = yQ, y = value, group =1)) +
geom_line(aes(color = "red"), size = 1) +
scale_x_discrete(breaks = df$yQ, labels = xVec) +
theme(axis.text.x = element_text(angle=90))
I've the following dataset:
https://app.box.com/s/au58xaw60r1hyeek5cua6q20byumgvmj
I want to create a density plot based on the time of the day. Here is what I've done so far:
library("ggplot2")
library("scales")
library("lubridate")
timestamp_df$timestamp_time <- format(ymd_hms(hn_tweets$timestamp), "%H:%M:%S")
ggplot(timestamp_df, aes(timestamp_time)) +
geom_density(aes(fill = ..count..)) +
scale_x_datetime(breaks = date_breaks("2 hours"),labels=date_format("%H:%M"))
It gives the following error:
Error: Invalid input: time_trans works with objects of class POSIXct only
If I convert that to POSIXct, it adds dates to the data.
Update 1
The following converted data to 'NA'
timestamp_df$timestamp_time <- as.POSIXct(timestamp_df$timestamp_time, format = "%H:%M%:%S", tz = "UTC")
Update 2
Following is what I want to achieve:
One problem with the solutions posted here is that they ignore the fact that this data is circular/polar (i.e. 00hrs == 24hrs). You can see on the plots in the other answer that the ends of the charts don't match up with each other. This won't make much of a difference with this particular dataset, but for events that happen near midnight it could be an extremely biased estimator of density. Here's my solution, taking into account the circular nature of time data:
# modified code from https://freakonometrics.hypotheses.org/2239
library(dplyr)
library(ggplot2)
library(lubridate)
library(circular)
df = read.csv("data.csv")
datetimes = df$timestamp %>%
lubridate::parse_date_time("%m/%d/%Y %h:%M")
times_in_decimal = lubridate::hour(datetimes) + lubridate::minute(datetimes) / 60
times_in_radians = 2 * pi * (times_in_decimal / 24)
# Doing this just for bandwidth estimation:
basic_dens = density(times_in_radians, from = 0, to = 2 * pi)
res = circular::density.circular(circular::circular(times_in_radians,
type = "angle",
units = "radians",
rotation = "clock"),
kernel = "wrappednormal",
bw = basic_dens$bw)
time_pdf = data.frame(time = as.numeric(24 * (2 * pi + res$x) / (2 * pi)), # Convert from radians back to 24h clock
likelihood = res$y)
p = ggplot(time_pdf) +
geom_area(aes(x = time, y = likelihood), fill = "#619CFF") +
scale_x_continuous("Hour of Day", labels = 0:24, breaks = 0:24) +
scale_y_continuous("Likelihood of Data") +
theme_classic()
Note that the values and slopes of the density plot match up at the 00h and 24h points.
Here is one approach:
library(ggplot2)
library(lubridate)
library(scales)
df <- read.csv("data.csv") #given in OP
Convert character to POSIXct:
df$timestamp <- as.POSIXct(strptime(df$timestamp, "%m/%d/%Y %H:%M", tz = "UTC"))
library(hms)
Extract hour and minute:
df$time <- hms::hms(second(df$timestamp), minute(df$timestamp), hour(df$timestamp))
Convert to POSIXct again, since ggplot does not work with class hms:
df$time <- as.POSIXct(df$time)
ggplot(df, aes(time)) +
geom_density(fill = "red", alpha = 0.5) + #also play with adjust such as adjust = 0.5
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
To plot it scaled to 1:
ggplot(df) +
geom_density( aes(x = time, y = ..scaled..), fill = "red", alpha = 0.5) +
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
where ..scaled.. is a computed variable for stat_density made during plot creation.
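As a side note, in newer ggplot2 releases (3.4.0 and later) the ..scaled.. notation is deprecated in favour of after_stat(); the equivalent call would be:
ggplot(df) +
  geom_density(aes(x = time, y = after_stat(scaled)), fill = "red", alpha = 0.5) +
  scale_x_datetime(breaks = date_breaks("2 hours"), labels = date_format("%H:%M"))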
I've following dataset:
time tta
08:20:00 1
21:30:00 5
22:00:00 1
22:30:00 1
00:25:00 1
17:00:00 5
I would like to plot a bar chart using ggplot so that the x-axis has a break every 2 hours (00:00:00, 02:00:00, 04:00:00 and so on) and the y-axis has the frequency for a factor tta (1 and 5).
The x-axis should be 00-01, 01-02, ... and so on.
I approached this using the xts package, but then found that it does not offer flooring the time. Hence I find lubridate more practical here, also because ggplot does not understand xts objects right away. Both packages help you transform time data in many ways.
Use xts::align.time or lubridate::floor_date to shift your times to the next/previous full hour/day/etc.
Either way, you aggregate the data before you pass it to ggplot. You can use sum to sum up tta, or just use length to count the number of occurrences, but in the latter case you could also use geom_histogram on the time series only. You can carefully shift the bars in ggplot with position_nudge so that they represent a period rather than sitting centered on a point in time. You should specify scale_x_time(labels = ..., breaks = ...) in the plot.
Data:
time <- c(
"08:20:00",
"21:30:00",
"22:00:00",
"22:30:00",
"00:25:00",
"17:00:00"
)
time <- as.POSIXct(time, format = "%H:%M:%S")
tta <- c(1, 5, 1, 1, 1, 5)
Using xts:
library(xts)
myxts <- xts(tta, order.by = time)
myxts_aligned <- align.time(myxts, n = 60*60*2) # shifts all times to the next full
# 2 hours
myxts_agg <- period.apply(myxts_aligned,
INDEX = endpoints(myxts, "hours", 2),
FUN = sum) # sums up every two hours
require(ggplot2)
ggplot(mapping = aes(x = index(myxts_agg), y = myxts_agg[, 1])) +
geom_bar(stat = "identity",
width = 60*60*2, # one bar to be 2 hours wide
position = position_nudge(x = -60*60), # shift one hour to the left
# so that the bar represents the actual period
colour = "black") +
scale_x_time(labels = function(x) strftime(x, "%H:%M"),
breaks = index(myxts_agg)) + # add more breaks manually if you like
scale_y_continuous() # to escape the warning of ggplot not knowing
# how to deal with xts object
Using lubridate:
require(lubridate)
require(tidyverse)
mydf <- data.frame(time = time, tta = tta)
mydf_agg <-
mydf %>%
group_by(time = floor_date(time, "2 hours")) %>%
summarise(tta_sum = sum(tta), tta_freq = n())
ggplot(mydf_agg, aes(x = time, y = tta_sum)) +
geom_bar(stat = "identity",
width = 60*60*2, # one bar to be 2 hours wide
position = position_nudge(x = 60*60), # shift one hour to the *right*
# so that the bar represents the actual period
colour = "black") +
scale_x_time(labels = function(x) strftime(x, "%H:%M"),
breaks = mydf_agg$time) # add more breaks manually if you like
After all, almost the same:
Use the floor_date function from lubridate:
library(tidyverse)
library(lubridate)
your_df %>% group_by(floor_date(time,"2 hours")) %>% count(tta)
and then ggplot with geom_col from there; a minimal sketch follows.
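A minimal sketch of that last plotting step, assuming the grouped column is named bin and tta is treated as a factor (both choices are mine, not from the question):
counts <- your_df %>%
  group_by(bin = floor_date(time, "2 hours")) %>%
  count(tta)
ggplot(counts, aes(x = bin, y = n, fill = factor(tta))) +
  geom_col(position = "dodge")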
library(lubridate)
library(ggplot2)
library(scales)
Make sure the class of your timestamp is POSIXct:
> class(df$timestamp)
[1] "POSIXct" "POSIXt"
Then use the scale_x_datetime function as follows.
gg +
scale_x_datetime(expand = c(0, 0), breaks=date_breaks("1 hour"), labels=date_format("%H:%M"))
In this case, it will space the breaks on the x-axis every one hour, and the labels will look like 09:00, for example.
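For completeness, a hedged end-to-end sketch (the data frame and column names here are placeholders, not from the question):
library(ggplot2)
library(scales)
# Hypothetical data: one reading every 10 minutes over one day
df <- data.frame(
  timestamp = seq(as.POSIXct("2018-02-21 00:00:00", tz = "UTC"),
                  by = "10 min", length.out = 144),
  value = runif(144)
)
gg <- ggplot(df, aes(timestamp, value)) + geom_line()
gg + scale_x_datetime(expand = c(0, 0), breaks = date_breaks("1 hour"),
                      labels = date_format("%H:%M"))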
I am analyzing day-to-day data to see when the value is lower. I set each day as a categorical variable so I can differentiate the days. But I want each day plotted on top of the others instead of one continuous graph, as shown below.
Data set:
Value Day
2013-01-03 01:55:00 0.35435715 1
2013-01-03 02:00:00 0.33018654 1
2013-01-03 02:05:00 0.38976118 1
2013-01-04 02:10:00 0.45583868 2
2013-01-04 02:15:00 0.29290860 2
My current ggplot code is as follows:
g <- ggplot(data = Data, aes(x = Index, color = Dates)) +
geom_line(y = Data$Value) +
scale_x_datetime(date_breaks = TimeIntervalForGraph, date_labels = "%H") +
xlab("Time") +
ylab("Random value")
I would really appreciate it if anyone can guide me on how to turn my x-axis into a 24-hour time scale so that I can plot each day on the same graph and see when the value is lower during the 24 hours. Thanks in advance.
Method tried:
I tried creating a 3rd column with time only; for some reason the following code didn't work:
time <- format(index(x), format = "%H:%M")
data <- cbind(data, time)
You need a way of summarising the data for each hour of the day. Here are some approaches you're probably looking for:
library(xts)
library(data.table)
library(ggplot2)
tm <- seq(as.POSIXct("2017-08-08 17:30:00"), by = "5 mins", length.out = 10000)
z <- xts(runif(10000), tm, dimnames = list(NULL, "vals"))
DT <- data.table(time = index(z), coredata(z))
# note the data.table syntax is different:
DT[, hr := hour(time)]
# Plot the average value by hour:
datByHour <- DT[, list(avgval = mean(vals)), by = c("hr")]
# Use line plot if you have one point per hour:
g <- ggplot(data = datByHour, aes(x = hr, y = avgval, colour = avgval)) +
geom_line()
# visualise the distribution by hour:
g2 <- ggplot(data = DT, aes(x = hr, y = vals, group = hr)) +
geom_boxplot()
Please try the following and let me know if it works (here I am taking the tm time column as given):
Data$tm = strftime(Data$tm, format="%H:%M:%S")
library(ggplot2)
ggplot(Data, aes(x = tm, y = Value, group = Day, colour = Day)) +
geom_line() +
theme_classic()
I am trying to make a heatmap of several years of daily averages of salinity in an estuary in R.
I would like the format to include month on the x-axis and year on the y-axis, so that each Jan 1st is directly above another Jan 1st. In other words, NOT like a typical annual calendar style (not like this: http://www.r-bloggers.com/ggplot2-time-series-heatmaps/).
So far I have only been able to plot by the day of the year using:
d <- read.xlsx('GC salinity transposed.xlsx', sheetName = "vert-3", header = TRUE, stringsAsFactors = FALSE, colClasses = c("integer", "integer", "numeric"), endRow = 2254)
ggplot(d, aes(x = Day.Number, y = Year)) + geom_tile(aes(fill = Salinity)) + scale_fill_gradient(name = 'Mean Daily Salinity', low = 'white', high = 'blue') + theme(axis.title.y = element_blank())
And get this:
heat map not quite right
Could someone please tell me a better way to do this - a way that would include month, rather than day of the year along the x-axis? Thank you. New to R.
The lubridate package comes in handy for stuff like this. Does this code do what you want? I'm assuming you only have one salinity reading per month and there's no need to average across multiple values in the same month.
library(lubridate)
library(ggplot2)
# Define some data
df <- data.frame(date = seq.Date(from = as.Date("2015-01-01"), by = 1, length.out = 400),
salinity = runif(400, min=5, max=7))
# Create fields for plotting
df$day <- paste0(ifelse(month(df$date)<10,"0",""),
month(df$date),
"-",
ifelse(day(df$date)<10,"0",""),
day(df$date))
df$month <- paste0(ifelse(month(df$date)<10,"0",""),
month(df$date))
df$year <- year(df$date)
#Plot results by month
ggplot(data=df) +
geom_tile(aes(x = month, y = year, fill = salinity)) +
scale_y_continuous(breaks = c(2015,2016))
#Plot results by day
ggplot(data=df) +
geom_tile(aes(x = day, y = year, fill = salinity)) +
scale_y_continuous(breaks = c(2015,2016))
Results by month:
Results by day (do you really want this? It's very hard to read with 366 x-axis values):
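If the data does contain more than one reading per month, a hedged sketch of averaging first with dplyr (reusing the df built above):
library(dplyr)
df_monthly <- df %>%
  group_by(year, month) %>%
  summarise(salinity = mean(salinity), .groups = "drop")
ggplot(df_monthly) +
  geom_tile(aes(x = month, y = year, fill = salinity)) +
  scale_y_continuous(breaks = c(2015, 2016))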