Interpreting anomaly detection values in R

I have an assignment in which I need to detect anomalies in a dataset. I'm using the 'anomalize' package in R and was wondering how to interpret the following output values of the 'anomalize' function:
Remainder_L1
Remainder_L2
I've checked the documentation but I'm unable to find the calculation method for these values. Can someone explain this calculation?

The anomalize documentation gives a great example of how to apply anomalize() to a time series.
This generates the remainder_l1 and remainder_l2 values for CRAN tidyverse downloads (that data ships with the anomalize package, so there's no need to import anything; just run the code below to see how the columns are generated).
# install.packages("anomalize")
library(tidyverse)
library(tibbletime)
library(anomalize)
tidyverse_cran_downloads %>%
  time_decompose(count, merge = TRUE) %>%  # decompose into observed, season, trend, remainder
  anomalize(remainder)                     # flag remainders outside [remainder_l1, remainder_l2]
#   package date       count observed season trend remainder remainder_l1 remainder_l2 anomaly
#   <chr>   <date>     <dbl>    <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl> <chr>
# 1 broom   2017-01-01  1053    1053. -1007. 1708.      352.       -1725.        1704. No
# 2 broom   2017-01-02  1481    1481    340. 1731.     -589.       -1725.        1704. No
# 3 broom   2017-01-03  1851    1851    563. 1753.     -465.       -1725.        1704. No
# 4 broom   2017-01-04  1947    1947    526. 1775.     -354.       -1725.        1704. No
# 5 broom   2017-01-05  1927    1927    430. 1798.     -301.       -1725.        1704. No
What do these values mean? From the anomalize source code we see:
"remainder_l1" (lower limit for anomalies), "remainder_l2" (upper limit for anomalies)
In the example above, that means anomalize() would flag the first row as an anomaly if its remainder (352.) were below -1725 (remainder_l1) or above 1704 (remainder_l2); since 352. falls inside those limits, anomaly is "No".
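For reference, this is roughly how the default method = "iqr" computes those limits, going by the package documentation (a sketch, not the exact internal code): the 25th/75th percentiles of the remainder are widened by (0.15 / alpha) * IQR, which is 3 * IQR at the default alpha = 0.05.
# Sketch of anomalize's "iqr" limit calculation, per the documentation
iqr_limits <- function(remainder, alpha = 0.05) {
  q <- stats::quantile(remainder, probs = c(0.25, 0.75), na.rm = TRUE)
  margin <- (0.15 / alpha) * (q[[2]] - q[[1]])  # 3 * IQR when alpha = 0.05
  c(remainder_l1 = q[[1]] - margin,  # lower limit
    remainder_l2 = q[[2]] + margin)  # upper limit
}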

Related

Turning a time series back into a data frame

Firstly, apologies for what is probably a very easy question. I have been following an example to plot STL and have come up with a nice line chart. I would like to extract the data points so I can use them in Tableau in this format:
(sorry, having trouble getting tables to display)
My time series is generated from a count in the same format as the table above, so I assume it is quite simple to stitch it back together, but I am not very experienced with data manipulation in R yet. I am happy with the actual seasonal plot; it's just a matter of tying it all back up into something I can use.
I cannot provide my data, but I can provide the following from a tutorial which does the same thing:
library(xts)
## load co2 data set
load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/KeelingCurve.Rdata"))
library(lubridate)
start <- c(year(xts::first(co2)), month(xts::first(co2)))
start
end <- c(year(xts::last(co2)), month(xts::last(co2)))
end
# creation of a ts object
co2 <- ts(data = as.vector(coredata(co2)),
          start = start,
          end = end, frequency = 12)
# set up stl function
fit <- stl(co2, s.window = "periodic")
I am able to extract the list of y-axis values using:
seasonal_stl <- fit$time.series[,1]
What I would like to do is reconstruct that into a table of Month, Year and the seasonal value. Can anyone suggest how to do that? Many thanks in advance.
You can use the tsibble package to convert the ts object into a data frame in the form you want.
ts(fit$time.series, start = start, frequency = 12) |>
  tsibble::as_tsibble() |>
  tidyr::pivot_wider(names_from = "key", values_from = "value") |>
  tibble::as_tibble()
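If you prefer to stay in base R, a minimal sketch (assuming fit is the stl() result from the question) rebuilds the year/month/seasonal table from the ts attributes directly:
# Base-R sketch: reconstruct a year / month / seasonal table from the stl fit
seasonal_stl <- fit$time.series[, "seasonal"]
seasonal_df <- data.frame(
  year     = floor(time(seasonal_stl)),  # calendar year
  month    = cycle(seasonal_stl),        # month of year, 1-12
  seasonal = as.numeric(seasonal_stl)
)
head(seasonal_df)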
But you might find it easier to use the tsibble and feasts packages from the start, like this.
library(tsibble)
library(feasts)
library(lubridate)
## load co2 data set
load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/KeelingCurve.Rdata"))
start <- c(year(xts::first(co2)), month(xts::first(co2)))
# creation of a tsibble object
co2 <- ts(co2, start = start, frequency = 12) |>
  as_tsibble()
# Fit STL
fit <- co2 |>
  model(stl = STL(value ~ season(window = "periodic")))
# Extract components
components(fit)
#> # A dable: 711 x 7 [1M]
#> # Key: .model [1]
#> # STL Decomposition: value = trend + season_year + remainder
#> .model index value trend season_year remainder season_adjust
#> <chr> <mth> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 stl 1958 Mar 316. 315. 1.46 -0.551 314.
#> 2 stl 1958 Apr 317. 315. 2.59 -0.0506 315.
#> 3 stl 1958 May 318. 315. 3.00 -0.514 315.
#> 4 stl 1958 Jun 317. 315. 2.28 -0.286 315.
#> 5 stl 1958 Jul 316. 315. 0.668 -0.00184 315.
#> 6 stl 1958 Aug 315. 315. -1.48 1.13 316.
#> 7 stl 1958 Sep 313. 315. -3.16 1.01 316.
#> 8 stl 1958 Oct 313. 315. -3.25 0.468 316.
#> 9 stl 1958 Nov 313. 316. -2.05 -0.148 315.
#> 10 stl 1958 Dec 315. 316. -0.860 -0.0377 316.
#> # … with 701 more rows
Created on 2023-01-26 with reprex v2.0.2

Doing operations down columns in R, indexing by another column

I am trying to compute the hedging error for an options pricing model. Each day, I will compute an equivalent position that one should take when hedging against this option in the market, let's call it X_s, and compute the cash position of the hedge, let's call it X_0, for every given day. This doesn't present any issues since I can mapply() a function that calculates all the necessary partials given my parameters, stock price, etc. to compute X_s and X_0. Where I am starting to run into issues is when trying to compute the hedging error for my models. Here's a subset of my data that I'm looking at:
        date  optionid px_last           r X_s_position     X_0_cash  mp_ba
1 2020-03-03 127117475 3003.37 0.011587702  0.642588548 -1783.881169 146.05
2 2020-03-03 131373646 3003.37 0.011587702  0.527107056 -1477.947518 105.15
3 2020-03-06 127117475 2972.37 0.008128021  0.566540143 -1558.566925 125.40
4 2020-03-09 127117475 2746.56 0.004745339  0.133284145  -332.122900  33.95
5 2020-03-10 127117475 2882.23 0.005884274  0.413389283 -1125.632994  65.85
6 2020-03-11 127117475 2741.38 0.006223502  0.131700734  -333.691757  27.35
7 2020-03-12 127117475 2480.64 0.003787032  0.003680431    -8.179825   0.95
So, let's say we're looking at optionid == 127117475. On the first observation date we won't have any hedge error, so we go to the next observation on 2020-03-06. The hedge error on that day would be
0.642588548*2972.37 + -1783.881169*exp(0.011587702*as.numeric(as.Date("2020-03-06") - as.Date("2020-03-03"))/365) - 105.15
So in row 3, in the new 'hedge error' column I want to create, the value would be 20.80985. In general, to calculate the hedge error for the next observation of optionid == 127117475, I take the previous observation's X_s_position and multiply it by the next spot price (px_last), add the previous observation's X_0_cash multiplied by exp(previous r * (difference in days between the two observations) / 365), and then subtract the next observation's option price (mp_ba).
Perhaps like so? Should the mp_ba in your example be 125.40?
library(dplyr)
df %>%
  group_by(optionid) %>%
  mutate(hedge_error = lag(X_s_position) * px_last +
           X_0_cash * exp(lag(r) * as.numeric(date - lag(date)) / 365) -
           mp_ba)
Result
# A tibble: 7 × 8
# Groups:   optionid [2]
  date       optionid  px_last       r X_s_position X_0_cash  mp_ba hedge_error
  <date>     <int>       <dbl>   <dbl>        <dbl>    <dbl>  <dbl>       <dbl>
1 2020-03-03 127117475   3003. 0.0116       0.643     -1784. 146.          NA
2 2020-03-03 131373646   3003. 0.0116       0.527     -1478. 105.          NA
3 2020-03-06 127117475   2972. 0.00813      0.567     -1559. 125.         226.
4 2020-03-09 127117475   2747. 0.00475      0.133      -332.  34.0       1190.
5 2020-03-10 127117475   2882. 0.00588      0.413     -1126.  65.8       -807.
6 2020-03-11 127117475   2741. 0.00622      0.132      -334.  27.4        772.
7 2020-03-12 127117475   2481. 0.00379      0.00368    -8.18   0.95       318.
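One caveat: the worked example in the question discounts the previous observation's cash position, while the code above uses the current row's X_0_cash. A stricter reading of the example would lag X_0_cash as well; a sketch:
df %>%
  group_by(optionid) %>%
  mutate(hedge_error = lag(X_s_position) * px_last +
           lag(X_0_cash) * exp(lag(r) * as.numeric(date - lag(date)) / 365) -
           mp_ba) %>%
  ungroup()
With that change, row 3 works out to the question's 20.80985 if mp_ba is taken as 105.15, or about 0.56 with row 3's own mp_ba of 125.40.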

How can I add exogenous variables to my ARIMA model estimation when using the fable package with model()

I am trying to estimate ARIMA models for 100 different series, so I employed the fabletools::model() method and the fable::ARIMA() function for the job. But I wasn't able to use my exogenous variables in the model estimation.
My data has 3 columns: an ID tag identifying the outlet, a Date.Time tag, and finally the Sales. In addition to these variables I also have dummy variables representing hour of day and day of week.
With the code given below, I transformed the data frame containing my endogenous and exogenous variables to a tsibble.
ts_forecast <- df11 %>%
  select(-Date) %>%
  mutate(ID = factor(ID)) %>%
  group_by(ID) %>%
  as_tsibble(index = Date.Time, key = ID) %>%
  tsibble::fill_gaps(Sales = 0) %>%
  fabletools::model(Arima = ARIMA(Sales, stepwise = TRUE, xreg = df12))
With this code I try to forecast values over the same Date.Time interval for multiple outlets identified by the ID factor. But the code returns the following error:
> Could not find an appropriate ARIMA model.
> This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
> For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
Sales is my endogenous target variable and df12 includes the dummy variables representing hour and day. Some of the stores make no sales in certain hours, so their dummy representing, say, 01:00 AM could equal zero for every observation. However, I don't think that should be a problem when fable uses the stepwise method; I'd expect it to exclude variables that are all zeros.
I am not sure what the problem is. Am I adding xreg to the model in a problematic way (the ARIMA help page says that xreg=, as in the older forecast package, is OK), or is the issue the second problem I mentioned, dummies that are "0" for every observation? If it is the latter, a solution that excludes all variables with a constant zero value would help.
I would be delighted if you can help me.
Thanks
Here is an example using hourly pedestrian count data.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tsibble)
library(fable)
#> Loading required package: fabletools
# tsibble with hourly data
df <- pedestrian %>%
  mutate(dow = lubridate::wday(Date, label = TRUE))
# Training data
train <- df %>%
  filter(Date <= "2015-01-31")
# Fit models
fit <- train %>%
  model(arima = ARIMA(Count ~ season("day") + dow + pdq(2,0,0) + PDQ(0,0,0)))
# Forecast period
fcast_xregs <- df %>%
  filter(Date > "2015-01-31", Date <= "2015-02-07")
# Forecasts
fit %>%
  forecast(fcast_xregs)
#> # A fable: 504 x 8 [1h] <Australia/Melbourne>
#> # Key: Sensor, .model [3]
#> Sensor .model Date_Time Count .mean Date Time
#> <chr> <chr> <dttm> <dist> <dbl> <date> <int>
#> 1 Birra… arima 2015-02-01 00:00:00 N(-67, 174024) -67.1 2015-02-01 0
#> 2 Birra… arima 2015-02-01 01:00:00 N(-270, 250881) -270. 2015-02-01 1
#> 3 Birra… arima 2015-02-01 02:00:00 N(-286, 310672) -286. 2015-02-01 2
#> 4 Birra… arima 2015-02-01 03:00:00 N(-283, 351704) -283. 2015-02-01 3
#> 5 Birra… arima 2015-02-01 04:00:00 N(-264, 380588) -264. 2015-02-01 4
#> 6 Birra… arima 2015-02-01 05:00:00 N(-244, 4e+05) -244. 2015-02-01 5
#> 7 Birra… arima 2015-02-01 06:00:00 N(-137, 414993) -137. 2015-02-01 6
#> 8 Birra… arima 2015-02-01 07:00:00 N(93, 424929) 93.0 2015-02-01 7
#> 9 Birra… arima 2015-02-01 08:00:00 N(292, 431894) 292. 2015-02-01 8
#> 10 Birra… arima 2015-02-01 09:00:00 N(225, 436775) 225. 2015-02-01 9
#> # … with 494 more rows, and 1 more variable: dow <ord>
Created on 2020-10-09 by the reprex package (v0.3.0)
Notes:
You don't need to create dummy variables in R. The formula interface will handle categorical variables appropriately.
The season("day") special within ARIMA will generate the appropriate seasonal categorical variable, equivalent to 23 hourly dummy variables.
I've specified a particular ARIMA model to save computation time; omit the pdq() special to have the optimal model selected automatically.
Keep the PDQ(0,0,0) special, as you don't need the ARIMA model to handle the seasonality when the exogenous variables are doing that. A sketch adapting this approach to your own data follows.
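Applying the same pattern to your own data might look like the sketch below. The column names (ID, Date.Time, Sales) come from your question; the dow variable and the automatic order selection are my assumptions.
library(dplyr)
library(tsibble)
library(fable)
# Sketch: season("day") replaces the hand-made hourly dummies and dow the
# day-of-week dummies; PDQ(0,0,0) leaves the seasonality to the regressors.
fit <- df11 %>%
  select(-Date) %>%
  mutate(ID  = factor(ID),
         dow = lubridate::wday(Date.Time, label = TRUE)) %>%
  as_tsibble(index = Date.Time, key = ID) %>%
  fill_gaps(Sales = 0) %>%
  model(arima = ARIMA(Sales ~ season("day") + dow + PDQ(0, 0, 0)))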

ts object does not work for daily data in R - really confused

I have spent a day searching for the answer to this question and still could not figure out how this works (I'm relatively new to R).
The data:
I have the daily revenue of a store. The starting date is November 2017 and the end date is February 2020, so it is not the typical January-to-December-each-year data. There are no missing values, and every day's sales are recorded. There are 2 columns: date (in proper Date format) and revenue (numeric).
I am trying to build a time series forecasting model for my sales data. One prerequisite is that I need to transform my data into a ts object. All the posts I have seen online deal with yearly or monthly data; I have not seen anyone mention daily data.
I tried to convert my data to a ts object this way (I named my data "d"):
d_ts <- ts(d$revenue, start=min(d$date), end = max(d$date), frequency = 365)
I then got really strange results, like these:
Start = c(17420, 1)
End = c(18311, 1)
Frequency = 365
[1] 1174.77 214.92 10.00 684.86 7020.04 11302.50 30613.55 29920.98 24546.49 22089.89 30291.65 32993.05 26517.11 39670.38 30361.32 17510.72
[17] 9888.76 3032.27 1229.74 2426.36 ....... [ reached getOption("max.print") -- omitted 324216 entries ]
There are 892 days in this dataset, so how can the ts object have 325,216 entries?
I looked into this book called "Hands-On Time-Series with R" and found the following excerpt:
[screenshot of the book excerpt omitted]
This basically means the ts() object does NOT work for daily data. Is this why my ts() conversion is messed up?
My questions are ...
(1) How can I make my daily revenue data into a time series object before feeding it into a model, if ts() does not work for daily data? All of the time-series models require my data to be in a time-series format.
(2) Does the fact that my data does not start on Jan 2017 and end on Dec 2019 (i.e. the perfect 12-months-per-year data shown in many online posts) cause any complications? If so, what should I adjust so that the time series forecasting is meaningful?
I have been stuck on this issue and could not wrap my head around it. I really, really appreciate your help!
The ts function can work with any time interval; that interval is defined by the start and end points together with the frequency. As you're using Date objects, one unit corresponds to one day, since that is how dates are stored internally (days since 1970-01-01). With frequency = 365, ts() expects 365 observations per unit, so your span of 891 day-units implies 891 * 365 + 1 = 325,216 observations, which is where that strange dimension comes from. The help file at ?ts also shows examples of how to use annual or quarterly data.
To read in your daily data correctly you need to set frequency = 1. Using some data similar in structure to what you've got:
# Compile a data frame like yours
library(lubridate)
set.seed(0)
d <- data.frame(date = seq.Date(dmy("01/11/2017"), by = "day", length.out = 892))
d$revenue <- runif(892)
head(d)
#         date   revenue
# 1 2017-11-01 0.8966972
# 2 2017-11-02 0.2655087
# 3 2017-11-03 0.3721239
# 4 2017-11-04 0.5728534
# 5 2017-11-05 0.9082078
# 6 2017-11-06 0.2016819
# Convert to time series object
d_ts <- ts(d$revenue, start = min(d$date), end = max(d$date), frequency = 1)
d_ts
# Time Series:
# Start = 17471
# End = 18362
# Frequency = 1
# [1] 0.896697200 0.265508663 0.372123900 0.572853363 0.908207790 0.201681931 0.898389685 0.944675269 0.660797792
# [10] 0.629114044 0.061786270 0.205974575 0.176556753 0.687022847 0.384103718 0.769841420 0.497699242 0.717618508
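As an aside: with frequency = 1 no seasonal structure is encoded, so seasonal models cannot pick up day-of-week effects. A common variant (a sketch, not part of the original answer) declares a weekly cycle instead:
# Sketch: treat each week as one seasonal cycle so that seasonal models
# can capture day-of-week effects; time is then measured in weeks.
d_ts7 <- ts(d$revenue, frequency = 7)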
With daily data, you are better off using a tsibble class rather than a ts class. There are modelling and forecasting tools available via the fable package.
library(tsibble)
library(fable)
set.seed(1)
d_tsibble <- data.frame(
  date = seq(as.Date("2017-11-01"), by = "day", length.out = 892),
  revenue = rnorm(892)
) %>%
  as_tsibble(index = date)
d_tsibble
#> # A tsibble: 892 x 2 [1D]
#> date revenue
#> <date> <dbl>
#> 1 2017-11-01 -0.626
#> 2 2017-11-02 0.184
#> 3 2017-11-03 -0.836
#> 4 2017-11-04 1.60
#> 5 2017-11-05 0.330
#> 6 2017-11-06 -0.820
#> 7 2017-11-07 0.487
#> 8 2017-11-08 0.738
#> 9 2017-11-09 0.576
#> 10 2017-11-10 -0.305
#> # … with 882 more rows
d_tsibble %>%
  model(
    arima = ARIMA(revenue)
  ) %>%
  forecast(h = "14 days")
#> # A fable: 14 x 4 [1D]
#> # Key: .model [1]
#> .model date revenue .distribution
#> <chr> <date> <dbl> <dist>
#> 1 arima 2020-04-11 -0.0178 N(-1.8e-02, 1.1)
#> 2 arima 2020-04-12 -0.0117 N(-1.2e-02, 1.1)
#> 3 arima 2020-04-13 -0.00765 N(-7.7e-03, 1.1)
#> 4 arima 2020-04-14 -0.00501 N(-5.0e-03, 1.1)
#> 5 arima 2020-04-15 -0.00329 N(-3.3e-03, 1.1)
#> 6 arima 2020-04-16 -0.00215 N(-2.2e-03, 1.1)
#> 7 arima 2020-04-17 -0.00141 N(-1.4e-03, 1.1)
#> 8 arima 2020-04-18 -0.000925 N(-9.2e-04, 1.1)
#> 9 arima 2020-04-19 -0.000606 N(-6.1e-04, 1.1)
#> 10 arima 2020-04-20 -0.000397 N(-4.0e-04, 1.1)
#> 11 arima 2020-04-21 -0.000260 N(-2.6e-04, 1.1)
#> 12 arima 2020-04-22 -0.000171 N(-1.7e-04, 1.1)
#> 13 arima 2020-04-23 -0.000112 N(-1.1e-04, 1.1)
#> 14 arima 2020-04-24 -0.0000732 N(-7.3e-05, 1.1)
Created on 2020-04-01 by the reprex package (v0.3.0)

How to loop through objects in the global environment in R

I have looked far and wide for a solution to this issue, but I cannot seem to figure it out. I do not have much experience working with xts objects in R.
I have 40 xts objects (ETF data), and I want to run the quantmod function weeklyReturn on each of them individually.
I have tried to refer to them by using the ls() function:
lapply(ls(), weeklyReturn)
I have also tried the objects() function:
lapply(objects(), weeklyReturn)
I have also tried using as.xts() in my call to coerce the ls() objects to be used as xts but to no avail.
How can I run this function on every xts object in the environment?
Thank you,
It would be better to load all of your xts objects into a list or create them in a way that returns them in a list to begin with. Then you could do results = lapply(xts.list, weeklyReturn).
To work with objects in the global environment, you could test for whether the object is an xts object and then run weeklyReturn on it if it is. Something like this:
results = lapply(setNames(ls(), ls()), function(i) {
  x = get(i)
  if (is.xts(x)) {
    weeklyReturn(x)
  }
})
results = results[!sapply(results, is.null)]
Or you could select only the xts objects to begin with:
results = sapply(ls()[sapply(ls(), function(i) is.xts(get(i)))],
                 function(i) weeklyReturn(get(i)), simplify = FALSE, USE.NAMES = TRUE)
lapply(ls(), weeklyReturn) doesn't work, because ls() returns the object names as strings. The get function takes a string as an argument and returns the object with that name.
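A compact variant of the same idea (a sketch) fetches every object at once with mget() and keeps only the xts ones with Filter():
library(quantmod)  # weeklyReturn(); loads xts, which provides is.xts()
# mget(ls()) returns a named list of all objects in the environment;
# Filter(is.xts, ...) drops everything that is not an xts object.
results <- lapply(Filter(xts::is.xts, mget(ls())), weeklyReturn)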
An alternative solution uses the tidyquant package. Note that this is data-frame based, so I will not be working with xts objects. I use two core functions to scale the analysis: first, tq_get() goes from a vector of ETF symbols to their prices; second, tq_transmute() applies the weeklyReturn function to the adjusted prices.
library(tidyquant)
etf_vec <- c("SPY", "QEFA", "TOTL", "GLD")
# Use tq_get to get prices
etf_prices <- tq_get(etf_vec, get = "stock.prices", from = "2017-01-01", to = "2017-05-31")
etf_prices
#> # A tibble: 408 x 8
#> symbol date open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 SPY 2017-01-03 227.121 227.919 225.951 225.24 91366500 223.1760
#> 2 SPY 2017-01-04 227.707 228.847 227.696 226.58 78744400 224.5037
#> 3 SPY 2017-01-05 228.363 228.675 227.565 226.40 78379000 224.3254
#> 4 SPY 2017-01-06 228.625 229.856 227.989 227.21 71559900 225.1280
#> 5 SPY 2017-01-09 229.009 229.170 228.514 226.46 46265300 224.3848
#> 6 SPY 2017-01-10 228.575 229.554 228.100 226.46 63771900 224.3848
#> 7 SPY 2017-01-11 228.453 229.200 227.676 227.10 74650000 225.0190
#> 8 SPY 2017-01-12 228.595 228.847 227.040 226.53 72113200 224.4542
#> 9 SPY 2017-01-13 228.827 229.503 228.786 227.05 62717900 224.9694
#> 10 SPY 2017-01-17 228.403 228.877 227.888 226.25 61240800 224.1767
#> # ... with 398 more rows
# Use tq_transmute to apply weeklyReturn to multiple groups
etf_returns_w <- etf_prices %>%
  group_by(symbol) %>%
  tq_transmute(select = adjusted, mutate_fun = weeklyReturn)
etf_returns_w
#> # A tibble: 88 x 3
#> # Groups: symbol [4]
#> symbol date weekly.returns
#> <chr> <date> <dbl>
#> 1 SPY 2017-01-06 0.0087462358
#> 2 SPY 2017-01-13 -0.0007042173
#> 3 SPY 2017-01-20 -0.0013653367
#> 4 SPY 2017-01-27 0.0098350474
#> 5 SPY 2017-02-03 0.0016159256
#> 6 SPY 2017-02-10 0.0094619381
#> 7 SPY 2017-02-17 0.0154636969
#> 8 SPY 2017-02-24 0.0070186222
#> 9 SPY 2017-03-03 0.0070964211
#> 10 SPY 2017-03-10 -0.0030618336
#> # ... with 78 more rows
