Check whether time-series 1 is greater than time-series 2 for a specific continuous duration - r

I have two time series (timeseries1, timeseries2) covering the same period, generated as follows:
library(xts)
set.seed(1024)
series <- seq(from= as.POSIXct(strptime("2015-01-01", format="%Y-%m-%d")),to = as.POSIXct(strptime("2015-01-02", format="%Y-%m-%d")), by= "10 mins")
timeseries1 <- xts(rnorm(length(series),50,2),series)
timeseries2 <- xts(rnorm(length(series),51.5,1),series)
plot(timeseries1,main="")
lines(timeseries2,col="blue")
legend("topleft", legend=c("Timeseries-1","Timeseries-2"),lty = 1, col=c("black","blue"))
The plot shows the two series overlaid.
I need to find whether timeseries2 is greater than timeseries1 for a continuous duration of one hour. I know I can do a point-by-point comparison and keep a counter to check whether timeseries2 is greater than timeseries1 for n consecutive intervals, but I suspect there is already a more elegant method for this.
Is there any existing method to do this for time-series data in R?

You're probably looking for the rle function, which computes the length of runs of equal values.
In your case, the series is sampled every 10 minutes, so one hour corresponds to six consecutive observations, and you can check whether timeseries2 is greater than timeseries1 for a continuous duration of one hour as follows:
comparison <- as.integer(coredata(timeseries2) > coredata(timeseries1))
lengthEnc <- rle(comparison)
any(lengthEnc$lengths >= 6 & lengthEnc$values == 1)
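If you also need to know when such a run occurs, a small extension of the above (my own sketch, not part of the original answer) recovers the starting timestamp of each qualifying run:
runEnds   <- cumsum(lengthEnc$lengths)                             # last index of each run
runStarts <- runEnds - lengthEnc$lengths + 1                       # first index of each run
hits      <- which(lengthEnc$lengths >= 6 & lengthEnc$values == 1) # runs of >= 6 consecutive TRUEs
index(timeseries1)[runStarts[hits]]                                # timestamps where each qualifying run begins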

Related

Rbeast Time Series - Weird Time Axis In Plot

I am looking into ambient air pollution within regions of NSW and conducting a daily time series decomposition analysis using Rbeast to investigate if there is a change point signature around the time of Covid-19 lockdowns.
I have created a looping code to analyse the data for each pollutant within each region. However, the BEAST x axis, which should show dates (e.g. 01-01-2021, ideally labelled by year from 2012 to 2022), is plotting strangely (i.e. Time = 16000, 17000, 18000, etc.).
Does anyone know how to fix this?
beast_output = list()
target_pollutants = c("PM10", "OZONE", "NO", "NO2")
target_sites = c("WOLLONGONG", "MUSWELLBROOK", "SINGLETON", "CAMBERWELL", "WAGGAWAGGANORTH", "RICHMOND", "CAMDEN", "CHULLORA", "EARLWOOD", "WALLSEND", "BERESFIELD", "BARGO", "BRINGELLY", "PROSPECT", "STMARYS", "OAKDALE", "RANDWICK", "ROZELLE", "NEWCASTLE", "KEMBLAGRANGE", "ALBIONPARKSOUTH")
for (poll in target_pollutants) {
  beast_output[[poll]] = list()
  df = time_by_poll[[poll]]   # grab the target df
  sites = colnames(df)
  sites$Date = NULL           # clear date from the list
  for (site in sites) {
    ts = ts(df[[site]], start = min(df$Date), end = max(df$Date))
    beast_results = beast(ts)
    # plot(beastie_resulty)
    beast_output[[poll]][[site]] = beast_results
  }
}
plot(beast_results[["OZONE"]][["RANDWICK"]])
Thanks for asking and sorry about the issue. Indeed, the API interface in Rbeast is kinda confusing because it was originally coded to handle satellite time series.
Regardless, the BEAST model in the package was formulated only for regular time series. (By regular, I mean equally spaced time series with the same number of data points per period.) Because leap years have 366 days while other years have 365 days, daily time series are treated in BEAST as irregular time series if the periodicity is one year. However, if the periodic variation is weekly (7 days), daily time series are considered regular. To handle irregular time series, I implemented the beast.irreg function, which accepts irregular inputs and aggregates them into a regular time series before doing the decomposition and changepoint detection.
To illustrate, I got a sample PM10 dataset for several regions (e.g., "WOLLONGONG" and "MUSWELLBROOK") from this site https://www.dpie.nsw.gov.au/air-quality/air-quality-data-services/data-download-facility, and I posted the CSV file (as well as another dataset on ozone) under https://github.com/zhaokg/Rbeast/tree/master/R/SampleData. You can read the files directly from R as shown below:
df    = read.csv('https://github.com/zhaokg/Rbeast/raw/master/R/SampleData/pm10_1658112168.csv', header=FALSE, skip=3)
dates = as.Date(df[,1], "%d/%m/%Y")  # the 1st col is dates
Y     = df[,2:ncol(df)]              # the rest are PM10 data for the several sample sites
                                     # e.g., Y[,1] for the first region (WOLLONGONG)
library(Rbeast)
o = beast.irreg(log(Y[,1]), time=dates, deltat=1/12, freq=12, tseg.min=3, sseg.min=6)
# log(Y[,1]) : log-transformation may help if the data is skewed, because the BEAST model
#              assumes Gaussian errors
# time=dates : use the 'time' arg to supply the times of individual data points.
#              Alternatively, the beast123 function also handles date strings of different formats
# deltat=1/12: aggregate the daily time series into a regular one at an interval
#              of 1 month = 1/12 year
# freq=12    : the period is 1 year, which is 12 data points (1.0/deltat = 12)
# tseg.min   : the minimum trend segment length allowed in the changepoint detection is
#              3 data points (3 months) -- the results MAY be sensitive to this parameter
# sseg.min   : the minimum seasonal segment length allowed in the changepoint detection is
#              6 data points (6 months) -- the results MAY be sensitive to this parameter
plot(o)
# For sure, the daily time series can be re-sampled/aggregated to a different time interval
# Below, it is aggregated into a half monthly time series (dT=1/24 year), and the number
# of data points per period is freq=1.0 year/dT=24
o = beast.irreg( log(Y[,1]),time=dates, deltat = 1/24, freq=24, tseg.min=12, sseg.min=12)
plot(o)
# Aggregated to a weekly time series (e.g., dT = 7/365 year: the unit again is year),
# and freq = 1 year/dT = 365/7.
# tcp.minmax : the min and max numbers of changepoints allowed in the trend component
o = beast.irreg(log(Y[,1]), time=dates, deltat=7/365, freq=365/7, tcp.minmax=c(0,15), tseg.min=5, sseg.min=20, ocp=10)
plot(o)
# Finally, if you want to run at the daily interval, specify dT = deltat = 1/365 year and
# freq = period/dT = 1.0 year/(1/365 year) = 365. Because the raw data is daily,
# the majority of the raw data is kept intact during the aggregation, except when
# there is a leap year (the last two days of the leap year are merged into a single day)
o = beast.irreg(log(Y[,1]), time=dates, deltat=1/365, freq=365, tseg.min=30, sseg.min=180)
plot(o)
By default, a time series is decomposed as Y = season + trend + error, but for your dataset in the original scale (i.e., not log-transformed), there could be some spikes. One way to model this is to add an extra spike/outlier component: Y = season + trend + outlier/spike + error.
# Use 'ocp' to specify the maximum number of spikes (either upward or downward) to be allowed in the outlier/spike component
o = beast.irreg(Y[,1],time=dates, deltat = 1/365, freq=365/1, tseg.min=30, sseg.min=180, ocp=10)
plot(o)
The figure (omitted here) shows an example for one time series analyzed at the weekly interval; again, the exact results vary depending on the choices of tseg.min and sseg.min.
More importantly, another issue I noticed from your figure is that your data seem to have lots of missing values, which should be assigned NA but instead appear as zeros in your figure. If that is the case, the analysis result would certainly be wrong. BEAST can handle missing data, but the missing values should be given as NA or NaN (e.g., Y[Y==0] <- NA).

Time series daily data modeling

I am looking to forecast my time series. I have daily data for the period 2021-Jan-01 to 2022-Jul-01, i.e. one observation per day.
What I have tried so far:
d1=zoo(data, seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1))
tsdata <- ts(d1, frequency = 365)
ddata <- decompose(tsdata, "multiplicative")
I get the following error:
Error in decompose(tsdata, "multiplicative") :
time series has no or less than 2 periods
From what I have read, it seems this is because I do not have two full years of data - is that correct? I have tried doing it weekly as well:
series <- ts(data, frequency = 52, start = c(2021, 1))
and I get the same issue.
How do I decompose the series without having to extend my dataset to two years, since I do not have that much data?
Also, when I actually try to forecast it, the forecast isn't good enough:
Plot with forecast
My data somewhat resembles a bell curve over that period, so is there a better-fitting time series model I can apply instead?
A weekly frequency for daily data should have frequency = 7, not 52. It's possible that this fix to your code will produce a model with a seasonal term.
I don't think you'll be able to produce a time series model with annual seasonality with less than 2 years of data.
You can either produce a model with only weekly seasonality (I expect this is what most folks would recommend), or, if you truly believe in the annual seasonal pattern exhibited in your data, your "forecast" can be a seasonal naive forecast that simply repeats last year's value for that particular day. I wouldn't recommend the latter, because it seems risky, and I don't really see the same trajectory in your screenshot over 2022 that's apparent in 2021.
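To make those two options concrete, here is a minimal sketch (my addition, not from the original answer) using the forecast package; data is assumed to be the daily vector from the question:
library(forecast)
# Option 1: weekly seasonality only
y_week <- ts(data, frequency = 7)
fc_week <- stlf(y_week, h = 28)      # STL decomposition + ETS on the remainder, 4 weeks ahead
plot(fc_week)
# Option 2: seasonal naive with an annual period (repeats last year's value for each day)
y_year <- ts(data, frequency = 365)
fc_naive <- snaive(y_year, h = 28)
plot(fc_naive)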
decompose requires two full cycles and that a full cycle represents 1 time unit. The ts class can't use the Date class anyway. To use frequency 7 we must use times 1/7th apart, such as 1, 1+1/7, 1+2/7, etc., so that 1 cycle (7 days) covers 1 unit. Then just label the plot appropriately rather than using those times on the x axis. In the code below, use %Y in place of %y if the years start in 19?? and end in 20?? so that tapply maintains the order.
# test data
set.seed(123)
s <- seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1)
data <- rnorm(length(s))
tsdata <- ts(data, freq = 7)                             # 1 time unit = one 7-day cycle
ddata <- decompose(tsdata, "multiplicative")
plot(ddata, xaxt = "n")                                  # suppress the default numeric x axis
m <- tapply(time(tsdata), format(s, "%y/%m"), head, 1)   # first time value in each year/month
axis(1, m, names(m))                                     # label the axis with year/month instead

Weekly and Yearly Seasonality in R

I have daily electric load data from 1-1-2007 till 31-12-2016. I use the ts() function to load the data like so:
ts_load <- ts(data, start = c(2007,1), end = c(2016,12),frequency = 365)
I want to remove the yearly and weekly seasonality from my data. To decompose the data and remove the seasonality, I use the following code:
decompose_load = decompose(ts_load, "additive")
deseasonalized = ts_load - decompose_load$seasonal
My question is: am I doing it right? Is this the right way to remove the yearly seasonality, and what is the right way to remove the weekly seasonality?
A few points:
a ts series must have regularly spaced points and the same number of points in each cycle. In the question a frequency of 365 is specified, but some years, i.e. leap years, have 366 points. In particular, if you want the frequency to be a year, then you can't use daily or weekly data without adjustment, since different years have different numbers of days and the number of weeks in a year is not an integer.
decompose does not handle multiple seasonalities. If by weekly you mean remove the effect of Monday, of Tuesday, etc. and if by yearly you mean remove the effect of being 1st of the year, 2nd of the year, etc. then you are asking for multiple seasonalities.
end = c(2016, 12) means the 12th day of 2016, since the frequency is 365.
The msts function in the forecast package can handle multiple and non-integer seasonalities.
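For completeness, a minimal sketch of that route (my addition, assuming the forecast package is installed; data is the simulated series from the Note at the end):
library(forecast)
y <- msts(data, seasonal.periods = c(7, 365.25))  # weekly and yearly seasonal periods
fit <- mstl(y)                                    # multiple seasonal-trend decomposition
plot(fit)                                         # trend, both seasonal components, remainder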
Staying with base R, another approach is to approximate it with a linear model, which avoids all the above problems (but ignores correlations); we discuss that below.
Assuming the data shown reproducibly in the Note at the end, we define a day-of-week variable, dow, and a day-of-year variable, doy, regress on those together with an intercept and a trend, and then construct the intercept plus trend plus residuals in the last line of code to deseasonalize. It isn't strictly necessary, but we use scale to remove the mean of trend so that the three terms defining data.ds are mutually orthogonal; whether or not we do this, the third term will be orthogonal to the other two by the properties of linear models.
trend <- scale(seq_along(d), TRUE, FALSE)
dow <- format(d, "%a")
doy <- format(d, "%j")
fm <- lm(data ~ trend + dow + doy)
data.ds <- coef(fm)[1] + coef(fm)[2] * trend + resid(fm)
Note
Test data used in reproducible form:
set.seed(123)
d <- seq(as.Date("2007-01-01"), as.Date("2016-12-31"), "day")
n <- length(d)
trend <- 1:n
seas_week <- rep(1:7, length = n)
seas_year <- rep(1:365, length = n)
noise <- rnorm(n)
data <- trend + seas_week + seas_year + noise
You can use the dsa function in the dsa package to adjust a daily time series. The advantage over the regression solution is that it takes into account that the impact of the season can change over time, which is usually the case.
To use that function, your data should be in the xts format (from the xts package), because in that format leap years are not ignored.
The code will then look something like this:
install.packages(c("xts", "dsa"))
data = rnorm(365.25*10, 100, 1)
data_xts <- xts::xts(data, seq.Date(as.Date("2007-01-01"), by="days", length.out = length(data)))
sa = dsa::dsa(data_xts, fourier_number = 24)
# the fourier_number is used to model monthly recurring seasonal patterns in the regARIMA part
data_adjusted <- sa$output[,1]

Find adjacent rows that match condition

I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).
How do I find the probability of 2 adjacent rows matching a condition?
For example, I want to know the probability of 2 consecutive days having a higher-than-mean/median value. I know I can lag the previous day's value into the next row, which would allow me to get this statistic, but that seems very cumbersome and inflexible.
Is there a better way to get this done?
xts sample data:
foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))
What's the probability of 2 consecutive days having a higher than median value?
You can create a new column that flags which values are higher than the median, and then take only those that are both consecutive and higher:
library(tibble)
library(data.table)
foo <- as_tibble(data.table(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days")))
Step 1
Create column to find those that are higher than median
foo$higher_than_median <- foo$x > median(foo$x)
Step 2
Compare that column using diff: keep only the positions where the value is the same as the day before, i.e. c(0, diff(foo$higher_than_median)) == 0, and then add the condition that the current day itself must be higher, foo$higher_than_median == TRUE.
Full expression:
foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE
Step 3
To find the probability, take the mean of foo$both_higher:
mean(foo$both_higher)
[1] 0.1428571
Here is a pure xts solution.
How do you define the median? There are several ways.
In an online time series setting, as when computing a moving average, you can compute the median over a fixed lookback window (shown below) or from the origin up to the current point (an anchored window calculation). Either way, the median computation does not use values beyond the current time step, which avoids look-ahead bias:
library(xts)
library(TTR)
x <- rep(c(1,1,5,1,5,5,1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"), dimnames = list(NULL, "x"))
# Avoid look ahead bias in an online time series application by computing the median over a rolling fixed time window:
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5
nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)
z <- na.omit(y)
prob <- sum(z[,"runSum2"] >= nSum) / NROW(z)
The case where your median is over the entire data set is obviously a much easier modification of this.
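For example, a minimal sketch of that simpler case (my addition, reusing y from above): compute one median over the whole series, flag the days above it, and check how often two flags occur in a row.
aboveMed  <- y[, "x"] > median(coredata(y[, "x"]))  # full-sample median (uses the whole series)
bothAbove <- aboveMed & lag.xts(aboveMed)           # TRUE when today and yesterday are both above
mean(bothAbove, na.rm = TRUE)                       # empirical probability of two consecutive days above the median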

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical but one is lagging the other. The arrows suggest a phase angle of 45 degrees - apparently pointing down or up means 90 degrees (in or out of phase) so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one calculate the time lag?
So I believe you're asking how you can determine what the lag time is given two time series (in this case you artificially added in a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sine waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between the zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easily calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
# smooth the data to get rid of the noise that would introduce excess zero crossings
# subtract 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
# function for finding zero crossing points, borrowed from the msProcess package
# (its helpers checkVectorType, checkScalarType, lowerCase and ifelse1 come from the splus2R package)
zeroCross <- function(x, slope = "positive")
{
  checkVectorType(x, "numeric")
  checkScalarType(slope, "character")
  slope <- match.arg(slope, c("positive", "negative"))
  slope <- match.arg(lowerCase(slope), c("positive", "negative"))
  ipost <- ifelse1(slope == "negative",
                   sort(which(c(x, 0) < 0 & c(0, x) > 0)),
                   sort(which(c(x, 0) > 0 & c(0, x) < 0)))
  offset <- apply(matrix(abs(x[c(ipost - 1, ipost)]), nrow = 2, byrow = TRUE), MARGIN = 2, order)[1, ] - 2
  ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more eloquent ways of going about this, but it should get you the information that you need.
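As one such alternative (my addition, not part of the original answer), the lag can also be estimated directly with cross-correlation via base R's ccf, using the data_1 and data_2 objects built in the question:
# the lag with the largest cross-correlation should sit near the 49-day offset introduced above
# (the sign of the reported lag depends on ccf's convention)
cc <- ccf(data_1$temp, data_2$temp, lag.max = 100, plot = FALSE)
cc$lag[which.max(cc$acf)]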
In my case, for the semidiurnal tidal wave, 90 degrees corresponds to about 3 hours (90*12.5 hours/360 = 3.125 hours), where 12.5 hours is the semidiurnal period. So 45 degrees corresponds to 45*12.5/360 = 1.56 hours.
Thus in your case, with a period of 365 days:
90 degrees -> 90*365/360 = 91.25 days.
45 degrees -> 45*365/360 = 45.625 days.
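To make the arithmetic reusable, here is a one-line helper (my addition, just wrapping the formula above):
# convert a phase angle in degrees to a time lag, in the same units as the period
phase_to_lag <- function(phase_deg, period) phase_deg / 360 * period
phase_to_lag(45, 365)    # 45.625 days for an annual (365-day) period
phase_to_lag(90, 12.5)   # 3.125 hours for a semidiurnal (12.5-hour) tide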
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect the oscillations to be phase-locked (Grinsted et al. 2004); so the period at which you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between cause-and-effect phenomena; observe that the greater the distance, the greater the period at which the "in phase" arrow occurs in the cross wavelet transform.
Grinsted, A., Moore, J. C., and Jevrejeva, S. (2004). Application of the cross wavelet transform and wavelet coherence to geophysical time series. Nonlinear Processes in Geophysics, 11, 561-566. SRef-ID: 1607-7946/npg/2004-11-561.
