Time Series Analysis - Model Choosing - r

I am new to time series analysis and want to know which R package is best suited to my dilemma. I have a data frame with the following columns:
Date Spend Result
2017-06-22 2 17
2017-06-21 5 19
2017-06-20 11 45
2017-06-19 34 78
2017-06-18 23 56
2017-06-17 12 34
The business problem is this: based on the seasonality of the data and the amount spent, can I predict the Result column?
For example, if I increased my spend by $45 per day, could I predict the Result from the spend and the time of year?
I was going to use a generalized additive model, but as far as I can tell that only takes one variable into account. Is it possible to do a simple regression analysis with time as one of the variables?
I was thinking of extracting the month from the Date column and turning it into dummy variables. I am not sure whether there is a better way, though.
Thanks!
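The month-dummy idea from the question can be sketched with an ordinary linear regression in base R. This is only a minimal sketch under stated assumptions: the data are simulated (six rows are too few to estimate twelve month effects), and the names dat, fit and new_day are made up for illustration.

```r
# Simulated daily data standing in for the real Date / Spend / Result frame
set.seed(42)
dates <- seq(as.Date("2016-01-01"), as.Date("2017-06-22"), by = "day")
month <- factor(format(dates, "%m"))            # "01" ... "12"
spend <- round(runif(length(dates), 1, 60))
season <- 10 * sin(2 * pi * as.integer(month) / 12)  # made-up seasonal effect
dat <- data.frame(
  Date   = dates,
  Month  = month,
  Spend  = spend,
  Result = 5 + 1.5 * spend + season + rnorm(length(dates), sd = 3)
)

# A factor on the right-hand side is expanded into dummy variables by lm()
fit <- lm(Result ~ Spend + Month, data = dat)

# Predict Result for a June day with a spend of 45
new_day <- data.frame(Spend = 45,
                      Month = factor("06", levels = levels(dat$Month)))
pred <- predict(fit, newdata = new_day, interval = "prediction")
print(pred)
```

A GAM is not limited to one predictor either. With the mgcv package, a formula along the lines of Result ~ s(Spend) + s(day_of_year, bs = "cc") is possible, where day_of_year is a column you would have to create yourself.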

Related

Getting the same day across different years in R

I have a daily time series spanning a couple of years. Some observations are clearly wrong (for example, negative values for a variable that cannot go below zero), and I want to replace them by smoothing or "interpolating": using both the mean of the days around the bad observation and the mean of the same day (or couple of days) from previous years, since the data have yearly seasonality. I'm still unsure about this part, so any comment would be greatly appreciated.
So my question is whether I can easily access the same day across different years.
Here's a dummy example of my data:
library(tidyverse)
library(lubridate)
date value
2016-10-01 00:00:00 28
2016-10-02 00:00:00 25
2016-10-03 00:00:00 24
2016-10-04 00:00:00 22
2016-10-05 00:00:00 -6
2016-10-06 00:00:00 26
I have that for years 2016 through 2020. So in this example I would use the dates around 2016-10-05, and I would also like to use the dates around the 5th of October from 2017 to 2020 to maintain the seasonality, but maybe this is incorrect.
I tried to use + years() from lubridate, but I still have to do things manually and I would like to automate this.
If your question is solely "whether [you] can easily access the same day [across] different years", you could do that as follows:
# say your data frame is called df
library(lubridate)
day(df$date)
This will return the day of the month (1 to 31) for every entry in that column of your data frame.
Edit to reply to comment from asker:
This is a very basic way to specify the day and month for which you would like to obtain the corresponding rows in your data frame:
df[day(df$date) == 5 & month(df$date) == 10, ]
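To automate this for every calendar day rather than one hard-coded date, a month-day key can be built and used for grouping. The following is a rough sketch in base R (format() stands in for lubridate's day() and month()); the data are simulated, and the columns value_fixed and same_day_mean are names invented here. It covers only the "same day in other years" part of the idea, not the neighbouring-days part.

```r
# Simulated daily series for 2016-2020 with one impossible negative value
set.seed(1)
dates <- seq(as.Date("2016-01-01"), as.Date("2020-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))
df <- data.frame(
  date  = dates,
  value = round(20 + 5 * sin(2 * pi * doy / 365) + rnorm(length(dates)))
)
df$value[df$date == as.Date("2016-10-05")] <- -6

# Key that is identical for the same calendar day in every year, e.g. "10-05"
key <- format(df$date, "%m-%d")

# Flag bad values and blank them out so they do not enter the mean
bad <- df$value < 0
clean <- ifelse(bad, NA, df$value)

# Mean of the valid values sharing the same month-day, aligned to the rows
same_day_mean <- ave(clean, key, FUN = function(v) mean(v, na.rm = TRUE))

# Replace only the flagged observations
df$value_fixed <- ifelse(bad, same_day_mean, df$value)
df[key == "10-05", ]
```

Note that 29 February only has a match in leap years, so a bad value on that day would be averaged over fewer observations.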

How to make an hourly time series in R with this data?

times booked_res
11:00 23
13:00 26
15:00 27
17:00 25
19:00 28
21:00 30
So I need to use the ts() function in R to convert this data frame into a time series. The column on the right is the number of people with a reservation at each time. How should I approach this? I'm not sure about the arguments, and I don't know whether the frequency should be set to 24 (hours in a day) or 10 (11:00 to 21:00) as shown above. Any help appreciated.
First, find the frequency by noting that you are taking one sample every two hours. The frequency is the inverse of the time between samples, which is 1/2 sample per hour. The data you're interested in is in the right-hand column, so you can use that vector rather than the entire data frame. Passing start = 11 makes the time axis read as the hour of the day. (If you later collect several days and want the daily cycle treated as the seasonal period, the frequency would instead be the number of observations per day, which is 6 here rather than 24 or 10.) The code to convert the data into a time series is simply:
booked_res <- c(23,26,27,25,28,30)
ts(booked_res, start = 11, frequency = 0.5)
A simple plot with your data might be coded like this:
plot(ts(booked_res, start = 11, frequency = 0.5), ylab='Number of people reserved', xlab='Time (hour of day)', main='Time series chart of people reservations')
UPDATE:
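A ts object in R assumes that the observations are equally spaced in time. A varying sample rate breaks that assumption, so you wouldn't be able to store such data in a ts object as-is; a class for irregular series, such as those in the zoo or xts packages, would be needed instead. Note that irregular spacing is a separate issue from stationarity, which many classical time series models (ARMA, for example) assume.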
This page on Analytics Vidhya provides a nice definition of stationary and non-stationary time series, while this page on R bloggers gives some resources that relate to analyzing a non-stationary time series.

STL function minimum observation

I am building a forecasting model in R using the stl() function. I have monthly data for 2 years, which is 24 observations, or 2 periods.
stl() won't decompose my data and reports that I have fewer than the required minimum number of observations. I checked the code and, indeed, it needs 24 + 1 observations, i.e. 2 periods + 1.
My question is: why does it need the extra observation? The decompose() function only needs at least 2 periods.
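This does not explain why the extra observation is required, but the difference described above can be reproduced with a short sketch. The series is simulated, and the names x24, x25, stl_24, stl_25 and dec_24 are made up for the example.

```r
# Simulated monthly series: 25 points, so we can compare 24 vs 25 observations
set.seed(123)
x <- 100 + 10 * sin(2 * pi * (1:25) / 12) + rnorm(25)
x24 <- ts(x[1:24], start = c(2015, 1), frequency = 12)  # exactly 2 periods
x25 <- ts(x,       start = c(2015, 1), frequency = 12)  # 2 periods + 1

# stl() on exactly two periods: capture the error message instead of stopping
stl_24 <- tryCatch(stl(x24, s.window = "periodic"),
                   error = function(e) conditionMessage(e))
print(stl_24)

# stl() with one extra observation runs
stl_25 <- stl(x25, s.window = "periodic")

# decompose() accepts exactly two periods
dec_24 <- decompose(x24)
```

With only two years of data, either decomposition estimates each monthly effect from two values, so the seasonal component should be read with caution.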

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into time series using ts(), but the problem is: the current data frame has multiple values for the same date. Can we apply time series in this case?
Can I convert the data frame into a time series and build a model (ARIMA) that can forecast the count value on a daily basis?
Or should I forecast the count value based on grp? In that case I would have to select only the grp and count columns of the data frame and skip the date column, so a daily forecast of the count value would not be possible?
Suppose I want to aggregate the count value per day. I tried the aggregate function, but there we have to specify the date value, and I have a very large data set. Is any other option available in R?
Can somebody, please, suggest if there is a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
It seems like there are two aspects of your problem:
I want to convert this data frame into time series using ts(), but the
problem is: the current data frame has multiple values for the same
date. Can we apply time series in this case?
If you are happy making use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")
dtaXTS <- xts::as.xts(dta2[,2:3], dta2$date)
which would result in:
> head(dtaXTS)
           count grp
2009-01-09    54   1
2009-01-09   100   2
2009-01-09   546   3
2009-01-10    67   4
2009-01-11    80   5
2009-01-11    45   6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as a univariate time series by referring to the selected variable, or as a multivariate time series. An example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
Can somebody, please, suggest if there is a better approach to follow? My
assumption is that time series forecasting works only for bivariate data. Is
this assumption right?
IMHO, this is rather broad. I would suggest that you use the xts object created above and elaborate on the model you want to use and why. If it's a conceptual question about the nature of time series analysis, you may prefer to post your follow-up question on CrossValidated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.
Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.
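If you would rather stay in base R, the per-day aggregation can also be sketched with aggregate() and a formula, which does not require listing the date values by hand. This is a minimal sketch using the same example data; the names daily and daily_ts are made up here.

```r
# Same example data as in the Note above
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)

# Parse the mm-dd-YYYY strings into Date objects
DF$date <- as.Date(DF$date, format = "%m-%d-%Y")

# Sum count over all rows that share a date; one row per day in the result
daily <- aggregate(count ~ date, data = DF, FUN = sum)
print(daily)

# Plain ts of the daily totals; this assumes the days are consecutive
daily_ts <- ts(daily$count)
```

If some days are missing from the data, the gaps would need to be filled (or an irregular series class used) before treating the totals as an evenly spaced series.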

Using multiple methods of data imputation accounting for length of missing data period

I have a dataset of time series data with a number of missing values. The data is of ozone concentrations recorded every hour over a year and the length of periods of missing data varies greatly. Because of the different cycles in the dataset (e.g. daily and seasonal), I want to use different data imputation methods based on the length of the period where data is missing. I am planning on using the zoo package for data imputation.
The breakdown of data imputation for periods of missing data:
<=4 cells with missing data - use linear interpolation (na.approx(x))
>4 to <=23 cells - use spline fit (na.spline(x))
>23 cells - use seasonal Kalman filter (na.StructTS(x))
My guess is that I need to use conditional execution to dictate which cells are affected by each command; however, I don't know how to refer to the cells that come before and after a given cell and use them in an if statement.
I am fairly new to R so sorry if there is an obvious answer to this question or if this question has already been answered. I have done a search but couldn’t seem to find anything.
Date2006 Ozone2006
1 06-01-01 0:00 64
2 06-01-01 1:00 64
3 06-01-01 2:00 63
4 06-01-01 3:00 61
5 06-01-01 4:00 NA
6 06-01-01 5:00 52
7 06-01-01 6:00 60
8 06-01-01 7:00 59
9 06-01-01 8:00 47
This is what my dataset looks like. The ozone concentrations are integers.
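One possible way to avoid looking at neighbouring cells in an if statement is to measure the length of each run of NA values with rle() and then fill by run length. The sketch below uses base R only: approx() and spline() are stand-ins for zoo's na.approx() and na.spline(), the vector x is made up (its first nine values are taken from the sample above), and gaps longer than 23 are simply left as NA for a separate step such as na.StructTS().

```r
# Toy hourly series: one gap of length 1 and one gap of length 6
x <- c(64, 64, 63, 61, NA, 52, 60, 59, 47, rep(NA, 6), 50, 55, 58)

# Length of the NA run each element belongs to (0 for observed values)
runs <- rle(is.na(x))
gap_len <- rep(ifelse(runs$values, runs$lengths, 0L), runs$lengths)

# Candidate fills computed over the whole index from the observed points
idx <- seq_along(x)
ok  <- !is.na(x)
lin <- approx(idx[ok], x[ok], xout = idx)$y   # linear interpolation
spl <- spline(idx[ok], x[ok], xout = idx)$y   # cubic spline interpolation

# Pick the method by gap length; gaps > 23 stay NA for a later step
short  <- gap_len > 0 & gap_len <= 4
medium <- gap_len > 4 & gap_len <= 23
filled <- x
filled[short]  <- lin[short]
filled[medium] <- spl[medium]
print(round(filled, 1))
```

A series that starts or ends with NA would need extra handling, since approx() does not extrapolate beyond the observed range by default.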

Resources