Calculating 12-hour differences for hourly data - r

I have data measurements over a 10-day period, recorded on an hourly basis, with a sample provided below:
Date_Time Measure
1/1/2021 05:00 430.1
1/1/2021 06:00 430.2
1/1/2021 07:00 429.8
First, I want to calculate the difference for every 12-hour period - that is, the difference from 00:00 to 12:00, from 12:00 to 00:00, and so on.
Second, I want to find the maximum difference over this period.
This is all done in R. I have only been able to find code for calculating averages, or I know how to calculate differences one at a time, but not how to create a new column of those differences.
I tried using diff(Measure, lag = 11), thinking that would calculate the difference between 12-hour periods, but I kept getting the error:
Error in mutate(., diff_12 = diff(Level, lag = 11)) : x `diff_12` must be size 265 or 1, not 254.

While it is not the cleanest line of code, I used:
mutate(diff = lag(Measure, 12) - Measure)
This answered my own question.
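For anyone hitting the same error: diff() returns a vector that is shorter than the input (it drops the first lag values), which is why mutate() complains about the size mismatch, whereas lag() keeps the full length and pads the start with NA. A minimal sketch of the whole pipeline, assuming dplyr and a data frame called df with the columns shown above:
library(dplyr)
df <- df %>%
  mutate(diff_12 = lag(Measure, 12) - Measure)  # 12-hour difference; the first 12 rows are NA
max(abs(df$diff_12), na.rm = TRUE)              # largest 12-hour change in the record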

Related

Time series frequency in ts function using r

I am trying hard to understand the frequency setup in the ts function in R. My time series dataset is recorded every 15 minutes, but only within an interval of the day that changes from day to day - for example, one day the data is recorded between 6 am and 9 am, and another day between 10 am and 10 pm - over 5 years. If this is the case, how do I set my frequency value in the ts function? Please advise.
Thank you.
Regards,
Anil

How do I transform half-hourly data that does not span the whole day into a time series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (so the first value represents the amount from 8:00 until just before 8:30, and so on). This is a 1-day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency defines the number of (usually regularly spaced) observations within a given time period. For your example, your observations are every 30 minutes between 8AM and 8PM, and your time period is 1 day. Choosing 1 day as the period assumes that the pattern within each day is of most interest here; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half-hours), and a suitable frequency for this data would therefore be 24.
You can also pad the data with 0 values, but this isn't necessary and would complicate the model. If you padded the data so that it had observations for all half-hours of the day, the frequency would then be 48.
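As a rough illustration (not part of the original answer), assuming the half-hourly amounts are already in a numeric vector called menge, in time order, and that the forecast package is used for Croston's method:
library(forecast)
y <- ts(menge, frequency = 24)  # 24 half-hour observations per 8:00-20:00 day
fc <- croston(y, h = 24)        # forecast the next day's 24 half-hour steps
plot(fc)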

How to make an hourly time series in R with this data?

times booked_res
11:00 23
13:00 26
15:00 27
17:00 25
19:00 28
21:00 30
So I need to use the ts() function in R to convert this frame into a time series. The column on the right is the number of people with reservations at each time. How should I approach this? I'm not sure about the arguments, and I don't know whether the frequency should be set to 24 (hours in a day) or 10 (11:00 to 21:00) as shown above. Any help appreciated.
First, find the frequency by noting that you are taking a sample every two hours. The frequency is the number of observations per period: with one observation every two hours and a period of one day, that is 12 observations per day. The data you're interested in is the right-hand column, so you can just use that vector rather than the entire data frame. The code to convert it into a time series is simply:
booked_res <- c(23, 26, 27, 25, 28, 30)
ts(booked_res, frequency = 12)
A simple plot with your data might be coded like this:
plot(ts(booked_res, frequency = 12), ylab='Number of people reserved', xlab='Time (in days) since start of sampling', main='Time series chart of people reservations')
UPDATE:
A ts object in R assumes a constant, regular sampling rate. Using a varying sample rate would make the series irregularly spaced, so you wouldn't be able to represent it faithfully as a ts object. (Stationarity is a separate property that matters when fitting time-series models, not when creating the object.)
This page on Analytics Vidhya provides a nice definition of stationary and non-stationary time series, while this page on R bloggers gives some resources that relate to analyzing a non-stationary time series.
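If the sampling really were irregular, one option (a sketch, not part of the answer above) is an index-based class such as zoo, which stores each observation against its actual timestamp instead of assuming a fixed frequency:
library(zoo)
# made-up example timestamps and values
times <- as.POSIXct(c("2023-01-01 11:00", "2023-01-01 13:30", "2023-01-01 17:00"))
z <- zoo(c(23, 26, 27), order.by = times)  # an irregularly spaced series
plot(z, ylab = "Number of people reserved")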

Generating machine utilisation chart from time data over a 12 hour period

I would like to generate a line plot which depicts the utilisation of a system of machines over a 12 hour period. As I am new to R I would like some advice on the approach I could use to generate such a plot.
Here is an example of the data frame that is used -
Machine StartTime StopTime
A 10:30 11:00
B 12:00 13:00
B 7:00 9:00
A 13:00 16:00
Say, the 12 hour period is from 4:00 to 16:00
My approach (probably not the most efficient) is to create an empty matrix with 720 rows (1 for each minute), then compute the utilisation of the system using the formula:
utilisation = machines busy / total machines
This would mean that I would some how need to iterate through each minute from 4:00 to 16:00. Is that possible?
Yes, but it's not something that works out of the box. I'd probably use data frames or data tables instead of a matrix. I'll use data.table in my examples.
To create a sequence of times you can try:
data.table(time=seq(from=as.POSIXlt("2016-06-09 4:00:00"),to=as.POSIXlt("2016-06-09 16:00:00"),by="min"))
However, this is probably unnecessary, since the plots can recognise times (well, at least ggplot2 can). For instance:
require(data.table)
require(reshape2)
require(ggplot2)
#Make the data table.
dt <- data.table(Machine = c("A","B","B","A"),
                 StartTime = c("10:30","12:00","7:00","13:00"),
                 StopTime = c("11:00","13:00","9:00","16:00"))
#Reshape the data, so that we have one column of dates.
mdt<-melt(dt,id.vars = "Machine",value.name = "Time",variable.name = "Event")
#Make the time vector a time vector, not a character vector.
mdt[,time:=as.POSIXct(Time,format="%H:%M")]
#delete the character vector.
mdt[,Time:=NULL]
#order the data.table by time.
mdt<-mdt[order(time)]
#Define how each time affects the cumulative number of machines.
mdt[Event=="StartTime",onoff:=1]
mdt[Event=="StopTime",onoff:=-1]
#EDIT: Sum the onoff effects at each point in time -this ensures you get one measurement for each time -eliminating the vertical line.
mdt<-mdt[,list(onoff=sum(onoff)),by=time]
#Calculate the cumulative number of machines on.
mdt[,TotUsage:=cumsum(onoff)]
#Plot the number of machines on at any given time.
ggplot(mdt,aes(x=time,y=TotUsage))+geom_step()
That will get you something like this (EDIT: without the vertical spike):
I turned your idea into code. It checks whether each machine is on or off for every minute.
[Caution] If your data is big, this code takes a long time. This method is simple but not efficient.
# make example data
d <- data.frame(Machine = c("A", "B", "B", "A"),
                StartTime = strptime(c("10:30", "12:00", "7:00", "13:00"), "%H:%M", tz = "GMT"),
                StopTime = strptime(c("11:00", "13:00", "9:00", "16:00"), "%H:%M", tz = "GMT"))
# cut from 4:00 to 16:00 by the minute
time <- seq(strptime("04:00", "%H:%M", tz = "GMT"), strptime("16:00", "%H:%M", tz = "GMT"), by = 60)
# sum(logical test) counts the machines that are busy; sapply runs the check for each minute
a <- sapply(seq_along(time), function(x) sum((d$StartTime <= time[x]) & (time[x] < d$StopTime)) / length(unique(d$Machine)))
plot(time, a, type = "l", ylab = "utilisation")

Directional statistics in R

I need to create a function for some work I'm doing on directional statistics. I want to show the distribution of flood events using a circle and calculate the mean direction and variance.
I need to calculate the angular value by multiplying the Julian day by (360/365), which gives the angle in degrees. I am having problems because I need a function that takes account of the leap years in the 40-year record I am considering, i.e. if it is a leap year, angular value = Julian day x (360/366).
The data I am using is peaks above a threshold, so I do not have an observation for every year, and in some years I have more than one entry:
Date Time Flow
04/05/1973 00:00 44.67
22/06/1974 00:00 128.38
22/11/1974 23:45 129.15
26/09/1976 22:00 89.51
15/10/1976 00:00 139.35
24/02/1978 19:30 183.69
27/12/1978 04:00 229.65
18/03/1980 09:15 117.7
02/03/1981 22:00 262.39
Many thanks
Rich
There may be a more elegant way to do this, but try
df$Year<-format(df$Date,"%Y")
that should put just the year in a single column. Then make a new column to indicate whether it is a leap year:
df$Leap<-0
df$Leap[df$Year=="1972" | df$Year=="1976" | df$Year=="1980"]<-1
Depending on your data, you may find it easier to convert the year to a number and then use %% to check whether it divides evenly by 4; just beware of century years such as 1900 and 2100, which are not leap years despite dividing by 4 (2000 is a leap year, because it also divides by 400).
Then you can use the vectorised ifelse() (a plain if would only look at the first element of df$Leap) to the effect of:
ifelse(df$Leap == 0, julian_day * 360/365, julian_day * 360/366)
where julian_day stands for your Julian-day values.
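Putting the pieces together, a vectorised sketch might look like this (assuming df$Date is already of class Date; lubridate's yday() and leap_year() are used purely for convenience, and the mean direction and circular variance are computed with the standard formulas):
library(lubridate)
df$DayOfYear  <- yday(df$Date)                         # Julian day within the year
df$YearLength <- ifelse(leap_year(df$Date), 366, 365)  # 366 in leap years, otherwise 365
df$AngleDeg   <- df$DayOfYear * 360 / df$YearLength    # angle in degrees, as in the question
theta <- df$AngleDeg * pi / 180                        # converted to radians for the trig below
# mean direction (in radians) and circular variance = 1 - mean resultant length
mean_dir <- atan2(mean(sin(theta)), mean(cos(theta)))
circ_var <- 1 - sqrt(mean(sin(theta))^2 + mean(cos(theta))^2)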
