How to extract timestamps from a ts object in R

Consider the treering dataset.
library("datasets", lib.loc="C:/Program Files/R/R-3.3.1/library")
tr<-treering
length(tr)
[1] 7980
class(tr)
[1] "ts"
From my understanding, it is a time series of length 7980.
How can I find out what the time stamps are for each value?
After plotting the time series and looking at the x axis of the plot, it appears that the time stamps range from -6000 to 2000. But to me the time stamps appear to be "hidden".
plot(tr)
More generally, I'm trying to understand what exactly a ts object is and what the benefits of using this type of object are.
Univariate and multivariate time series can easily be displayed in a data frame with two or more columns: Time and the variable(s).
univariatetimeseries <- data.frame(Time = c(0, 1, 2, 3, 4, 5, 6), y = c(1, 2, 3, 4, 5, 6, 7))
multivariatetimeseries <- data.frame(Time = c(0, 1, 2, 3, 4, 5, 6), y = c(1, 2, 3, 4, 5, 6, 7), z = c(7, 6, 5, 4, 3, 2, 1))
This to me seems simple and straightforward, and it is consistent with the basic science examples that I learned in high school. Additionally, the time stamps are not "hidden" as they are in the treering example. So what are the benefits of using a ts object?

An object of a given class comes with many generic functions for convenience. For the "ts" class, for example, there are ts.plot, plot.ts, etc. If you store your time series as a data frame, you have to do a lot of the work yourself when plotting it.
Perhaps for seasonal time series the advantage of using "ts" is most evident. For example, x <- ts(rnorm(36), start = c(2000, 1), frequency = 12) generates a monthly time series spanning 3 years. The print method will nicely arrange it like a matrix when you print x.
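A quick sketch of that layout:
x <- ts(rnorm(36), start = c(2000, 1), frequency = 12)
x  # printed as a year-by-month matrix: rows 2000-2002, columns Jan through Dec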
A "ts" object has a number of attributes. Modelling fitting routines like arima0 and arima can see such attributes so you don't need to specify them manually.
For your question, there are a number of functions to extract / set attributes of a time series. Have a look at ?start, ?tsp, ?time, ?window.
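A minimal sketch of those extractors applied to treering (the start and end values in the comments are what the annual series should report, but worth verifying on your install):
tr <- treering
time(tr)[1:5]   # the "hidden" time stamps: -6000, -5999, -5998, ...
start(tr)       # first time stamp, circa -6000
end(tr)         # last time stamp, circa 1979
frequency(tr)   # 1, i.e. one observation per year
tsp(tr)         # start, end and frequency in a single vector
window(tr, start = 0, end = 10)  # subset by time stamp rather than by position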

Related

Time series daily data modeling

I am looking to forecast my time series. I have daily data for the period 2021-Jan-01 to 2022-Jul-01, so I have a column of observations for each day.
What I have tried so far:
library(zoo)
d1 <- zoo(data, seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1))
tsdata <- ts(d1, frequency = 365)
ddata <- decompose(tsdata, "multiplicative")
I get the following error here:
Error in decompose(tsdata, "multiplicative") :
time series has no or less than 2 periods
From what I have read, it seems this is because I do not have two full years? Is that correct? I have tried doing it weekly as well:
series <- ts(data, frequency = 52, start = c(2021, 1))
but I get the same error.
How do I go about decomposing it without having to extend my dataset to two years, since I do not have that much data?
Also, when I actually try to forecast it, the forecast isn't good enough:
[plot with forecast]
My data somewhat resembles a bell curve over that period, so is there a better-fitting time series model I can apply instead?
A weekly cycle in daily data should have frequency = 7, not 52. It's possible that this fix to your code will produce a model with a seasonal term.
I don't think you'll be able to produce a time series model with annual seasonality with less than 2 years of data.
You can either produce a model with only weekly seasonality (I expect this is what most folks would recommend), or, if you truly believe in the annual seasonal pattern exhibited in your data, your "forecast" can be a seasonal naive forecast that simply repeats last year's value for that particular day. I wouldn't recommend this, because it seems risky, and I don't really see the same trajectory over 2022 in your screenshot that's apparent in 2021.
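A minimal sketch of the weekly-seasonality route, assuming the forecast package and the daily vector data from the question:
library(forecast)
y <- ts(data, frequency = 7)  # daily observations, one cycle per week
fit <- stlf(y, h = 28)        # STL decomposition + ETS forecast, 4 weeks ahead
plot(fit)
snaive(y, h = 28)             # seasonal naive alternative: repeat last week's values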
decompose requires two full cycles, and that a full cycle span one time unit. The ts class can't use the Date class anyway. To use frequency 7 we must use times 1/7 apart, such as 1, 1 + 1/7, 1 + 2/7, etc., so that one cycle (7 days) covers one unit. Then just label the plot appropriately rather than using those times on the X axis. In the code below, use %Y in place of %y if the years start in 19?? and end in 20?? so that tapply maintains the order.
# test data
set.seed(123)
s <- seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1)
data <- rnorm(length(s))
tsdata <- ts(data, frequency = 7)  # one cycle = 7 days = 1 time unit
ddata <- decompose(tsdata, "multiplicative")
plot(ddata, xaxt = "n")            # suppress the default numeric time axis
# label the axis with year/month at the first day of each month instead
m <- tapply(time(tsdata), format(s, "%y/%m"), head, 1)
axis(1, m, names(m))

Specifying truncation point in glmmTMB R package

I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data is based on complete tracking of electronic gambling behavior within a gambling operator. Gambling behavior data is aggregated at a monthly level, for a total of 70 months. I have an ID variable separating participants, a time variable (months), as well as numerous gambling behavior variables such as active days played in a given month, bets placed in a given month, total losses in a given month, etc. Participants vary in when they were actively gambling: one participant may have gambled in months 2, 3, 4, and 7, another in months 3, 5, and 7, and a third in months 23, 24, 48, and 65.
I am attempting to run a truncated negative binomial 2 model in glmmTMB, and I am wondering how the package handles the absence of zeros. I have longitudinal data on gambling behavior, days played for each month (for a total of 70 months). The variable can take values between 1 and 31 (depending on the month); there are no zeros, because months in which a participant did not play are absent from the dataset. Example of how the data are structured, with just two participants:
# Example variables and data frame in long form
# Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)
My question: how do I specify where the truncation happens in glmmTMB? Does it default to 0? I want to truncate at 0 and have run the following code (I am going to compare models; the first one is a simple unconditional one):
DaysPlayedUnconditional <- glmmTMB(daysPlayed ~ 1 + (1 | id), dfLong, family = truncated_nbinom2)
Will it do the trick?
From Ben Bolker via r-sig-mixed-models@r-project.org:
"I'm not 100% clear on your question, but: glmmTMB only does zero-truncation, not k-truncation with k>0, i.e. you can only specify the model Prob(x==0) = 0 Prob(x>0) = Prob(NBinom(x))/Prob(NBinom(x>0)) (terrible notation, but hopefully you get the idea)"

How to specify formula in linear model with 100 dependent variables without having to write them explicitly in R

The problem is to (a) model the intraday demand for ATM withdrawals and (b) create prediction intervals for future demand. One day has 144 10-minute periods, and my dataset is the number of ATM withdrawals in each period. Here is a chart so you can have a glimpse of what I'm talking about.
My dataset also has other data (mainly dummies), such as WeekDay and Holiday. For the purpose of this post, I'll be using the following data.frame as a representation of my dataset (it has only 6 time periods, between 00:10 and 01:00, not the full day):
df <- data.frame(H0010 = 1, H0020 = 2, H0030 = 3, H0040 = 4, H0050 = 5, H0100 = 6,
                 WeekDay = 7, Holiday = 8)
The first idea that crossed my mind was to fit a linear regression, more precisely a multivariate multiple linear regression. But because I have 144 dependent variables (one for each 10-minute period) and not only 6, my code in R would be hugely long:
lm.fit <- lm(cbind(H0010, H0020, H0030, H0040, H0050, H0100,
                   H0200, H0210, H0220, H0230, H0240, H0250,
                   # ... and it goes on and on until midnight ...
                   H2310, H2320, H2330, H2340, H2350, H2359)
             ~ WeekDay + Holiday, data = df)
Is there a way I could write the model formula without having to specify all 144 dependent variables?
I would also appreciate any other thoughts on how to address this problem using other methods (although this post's question is the one mentioned above).
EDIT:
My dataset is composed of the dependent variables (numbers of transactions) and dummies, which are factors. As such, the solution lm(cbind(-WeekDay, -Holiday) ~ WeekDay + Holiday, data = df) does not work.
f <- sprintf("cbind(%s) ~ WeekDay + Holiday", paste(names(df)[1:6], collapse = ", "))
lm(as.formula(f), data = df)
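For the toy df above, f expands to "cbind(H0010, H0020, H0030, H0040, H0050, H0100) ~ WeekDay + Holiday", so the full 144-column version comes for free as long as the response columns are the first ones in the frame.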
Sure, you can select variables by specifying which you would like to exclude:
lm(cbind(-WeekDay, -Holiday) ~ WeekDay + Holiday, data=df)
EDIT:
How's this? I included a more realistic dataframe too.
df <- data.frame(H0010 = rnorm(100, 1, 1), H0020 = rnorm(100, 2, 1),
                 H0030 = rnorm(100, 3, 1), H0040 = rnorm(100, 4, 1),
                 H0050 = rnorm(100, 5, 1), H0100 = rnorm(100, 6, 1),
                 WeekDay = factor(c(rep(seq(1, 7), 14), 1, 2)),
                 Holiday = factor(rbinom(100, 1, prob = .05)))
y <- as.matrix(df[,1:6])
x <- model.matrix(~df$WeekDay+df$Holiday)
lm(y~0+x) #suppress intercept, as it's in the model.matrix

Time series in R with duplicate items for daily forecast

I would like some guidance on how to plot daily data and use forecasting in R.
Purchases are low on Saturdays and Sundays in this data, and certain weekdays have no purchases at all, which are obstacles for the analysis.
I have around 300 rows with various item names; the items are duplicated within the column, but with different dates.
For example, I bought exactly 1 soap 3 times a week: on Monday, Wednesday and also Sunday.
This is the example data table:
My trouble so far is that it took me a long time to forecast manually using other statistical software, so I am trying to learn R from scratch and see how it could save time. The table above has been read into R, and the date column has been converted from factor to Date class using as.Date(data$Date).
I usually use an exponential smoothing method; since purchases are still low and items are sometimes out of stock, not much of a pattern shows in the historical data. The output of this analysis should be a daily forecast of purchases per item, in order to indicate when we should order an item.
First, please consider adding a reproducible example for a more substantial answer; look at the most upvoted question under the r tag for a how-to.
EDIT: I think this is what you want before creating the ts:
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
If your data is not yet of class 'ts' you can create a time-series object with the ts() command. From the ?ts page:
ts(data = NA, start = 1, end = numeric(), frequency = 1,
   deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )
as.ts(x, ...)
Generally you could use the HoltWinters function for exponential smoothing like so:
data.hw <- HoltWinters(data)
data.predict <- predict(data.hw, n.ahead = x)  # x = number of time units ahead to predict
See also ?HoltWinters for more info on the function.
Reproducible Example for aggregate:
data <- data.frame(date = c(1, 2, 1, 2, 1, 1), item = c('b','b','a','a', 'a', 'a'), purchase = c(5,15, 23, 7, 12, 11))
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
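By default aggregate names the result columns Group.1, Group.2 and x; here is a hedged sketch of a possible next step, picking one item out of data.agg and turning it into a regular series (the frequency = 7 weekly cycle is an assumption):
names(data.agg) <- c("date", "item", "purchase")
soap <- subset(data.agg, item == "a")        # one item's history
soap.ts <- ts(soap$purchase, frequency = 7)  # daily purchases, weekly cycle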
Reproducible Example for HoltWinters:
library(AER)
data("UKNonDurables")
nd <- window((log(UKNonDurables)), end = c(1970, 4))
tsp(nd)
hw <- HoltWinters(nd)
pred <- predict(hw, n.ahead = 35)
pred
plot(hw, pred, ylim = range(log(UKNonDurables)))
lines(log(UKNonDurables))

Interpolate missing values in a time series with a seasonal cycle

I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo:
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a per-time-of-day value to a curve fitted to the larger trend, but I hope there already exists some package or function that applies to this situation?
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value falls at the time of day with the lowest value. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diffinv, but I just can't figure it out.
Here are three approaches that don't work:
EDIT 2: Image produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA                    # knock out the missing point
plot(myzoo, type = "o", pch = 16)  # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31                    # replace the missing point
lines(myzoo, type = "o", lty = 3, pch = 16)  # dashed line over the gap
legend(x = "topleft",
       legend = c("na.spline", "na.locf", "na.approx"),
       col = c("green", "blue", "red"), pch = 1)
Try this:
x <- ts(myzoo, frequency = 4)
fit <- ts(rowSums(tsSmooth(StructTS(x))[, -2]))
tsp(fit) <- tsp(x)
plot(x)
lines(fit, col = 2)
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smooth is used to estimate each point in the time series, including any omitted.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
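A sketch of one way to write the smoothed values back into the original zoo series, assuming the objects from the code above; only the gap is overwritten:
filled <- myzoo
filled[is.na(coredata(filled))] <- fit[is.na(coredata(myzoo))]  # fill the NA at position 10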
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough data here to fit the seasonal model, but this should get you started.
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- myzoo.orig <- zoo(obs, index)
myzoo[10] <- NA
myzoo.fixed <- na.locf(myzoo)
myarima.resid <- arima(myzoo.fixed, order = c(3, 0, 3), seasonal = list(order = c(0, 0, 0), period = 4))$residuals
myzoo.reallyfixed <- myzoo.fixed
myzoo.reallyfixed[10] <- myzoo.fixed[10] + myarima.resid[10]
plot(myzoo.reallyfixed)
points(myzoo.orig)
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. It would help to have a good prior on the underlying mechanisms for both the signal and the seasonal correction, in order to get better out-of-sample performance.
forecast::na.interp is a good approach. From the documentation:
Uses linear interpolation for non-seasonal series and a periodic stl decomposition with seasonal series to replace missing values.
library(forecast)
fit <- na.interp(myzoo)
fit[10] # 32.5, vs. 31.0 actual and 32.0 from Rob Hyndman's answer
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
From the R implementations tested in this paper, na.interp from the forecast package and na.StructTS from the zoo package showed the best overall results.
The na.interp function is also not that much slower than na.approx [the fastest method], so the loess decomposition seems not to be very demanding in terms of computing time.
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
Package imputeTS has a method for Kalman smoothing on the state space representation of an ARIMA model, which might be a good solution for this problem.
library(imputeTS)
na_kalman(myzoo, model = "auto.arima")
It also works directly with zoo time series objects. You can also supply your own ARIMA model to this function, if you think you can do better than "auto.arima". That would be done this way:
library(imputeTS)
usermodel <- arima(myts, order = c(1, 0, 1))$model
na_kalman(myts, model = usermodel)
But in this case you have to convert the zoo object back to ts, since arima() only accepts ts.
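A sketch of that conversion, mirroring the frequency-4 trick from the StructTS answer above:
library(zoo)
myts <- ts(coredata(myzoo), frequency = 4)  # drop the POSIXct index, keep the 4-per-day cycle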
