STL decomposition of time series with missing values for anomaly detection - r

I am trying to detect anomalous values in a time series of climatic data with some missing observations. Searching the web I found many available approaches. Of those, stl decomposition seems appealing, in the sense of removing trend and seasonal components and studying the remainder. Reading STL: A Seasonal-Trend Decomposition Procedure Based on Loess, stl appears to be flexible in how variability is assigned to the components, robust to outliers, and applicable despite missing values. However, when I try to apply it in R, with four years of observations and all the parameters defined according to http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stl.html, I encounter the error
time series contains internal NAs
when na.action = na.omit, and
series is not periodic or has less than two periods
when na.action = na.exclude.
I have double checked that the frequency is correctly defined. I have seen relevant questions in blogs, but didn't find any suggestion that could solve this. Is it not possible to apply stl in a series with missing values? I am very reluctant to interpolate them, as I do not want to be introducing (and consequently detecting...) artifacts. For the same reason, I do not know how advisable it would be to use ARIMA approaches instead (and if missing values would still be a problem).
Please share if you know a way to apply stl in a series with missing values, or if you believe my choices are methodologically not sound, or if you have any better suggestion. I am quite new in the field and overwhelmed by the heaps of (seemingly...) relevant information.

In the beginning of stl we find
x <- na.action(as.ts(x))
and soon after that
period <- frequency(x)
if (period < 2 || n <= 2 * period)
stop("series is not periodic or has less than two periods")
That is, stl expects x to still be a ts object after na.action(as.ts(x)) (otherwise period == 1). Let us check na.omit and na.exclude first.
Clearly, at the end of getAnywhere("na.omit.ts") we find
if (any(is.na(object)))
stop("time series contains internal NAs")
which is straightforward and nothing can be done (na.omit does not exclude NAs from ts objects). Now getAnywhere("na.exclude.default") excludes NA observations, but returns an object of class exclude:
attr(omit, "class") <- "exclude"
and this is a different situation. As mentioned above, stl expects na.action(as.ts(x)) to be a ts, but na.exclude(as.ts(x)) is of class exclude.
Hence, if one is satisfied with excluding the NAs, then e.g.
nottem[3] <- NA
frequency(nottem)
# [1] 12
na.new <- function(x) ts(na.exclude(x), frequency = 12)
stl(nottem, na.action = na.new, s.window = "per")
works. In general, stl does not work with NA values (i.e. with na.action = na.pass); it crashes deeper in the Fortran code (see the full source code here):
z <- .Fortran(C_stl, ...
Alternatives to na.new are not delightful:
na.contiguous - finds the longest consecutive stretch of non-missing values in a time series object.
na.approx, na.locf from zoo, or some other interpolation function (a short sketch follows this list).
Not sure about this one, but another Fortran implementation can be found for Python here. One could use Python, or possibly install R from source after some modifications, in case this module really allows missing values.
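For instance, the second option can be passed straight to stl as na.action. A minimal sketch, reusing the nottem example with the NA introduced above and assuming zoo is attached:
library(zoo) # provides na.approx and na.locf
stl(nottem, na.action = na.approx, s.window = "per") # the internal NA is linearly interpolated before decomposition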
As we can see in the paper, there is no simple procedure for missing values (such as approximating them at the very beginning) that could be applied to the time series before calling stl. So, considering that the original implementation is quite lengthy, I would think about other alternatives rather than a whole new implementation.
Update: in many respects, na.approx from zoo is close to an optimal choice when NAs are present, so let us check its performance, i.e. compare the results of stl on the full data set with the results obtained after introducing some NAs and using na.approx. I am using MAPE as a measure of accuracy, but only for the trend, because the seasonal component and the remainder cross zero, which would distort the measure. Positions for the NAs are chosen at random.
library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100) # mean absolute percentage error, column-wise
stlCheck <- function(data, p = 3, ...){
  set.seed(20130201)
  # positions of 3^0, ..., 3^p randomly placed NAs
  pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
  datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
  # decomposition of the complete series
  original <- data.frame(stl(data, ...)$time.series, stringsAsFactors = FALSE)
  original$id <- "Original"
  # decompositions of the series with NAs, handled via na.approx
  datasetsNA <- lapply(datasetsNA, function(x)
    data.frame(stl(x, na.action = na.approx, ...)$time.series,
               id = paste(sum(is.na(x)), "NAs"),
               stringsAsFactors = FALSE))
  stlAll <- rbind.fill(c(list(original), datasetsNA))
  stlAll$Date <- time(data)
  stlAll <- melt(stlAll, id.var = c("id", "Date"))
  # MAPE of the trend against the complete-data trend
  results <- data.frame(trend = sapply(lapply(datasetsNA, '[', i = "trend"), mape, original[, "trend"]))
  results$id <- paste(3^(0:p), "NAs")
  results <- melt(results, id.var = "id")
  # coordinates for the MAPE annotations in the plot
  results$x <- min(stlAll$Date) + diff(range(stlAll$Date)) / 4
  results$y <- min(original[, "trend"]) + diff(range(original[, "trend"])) / (4 * p) * (0:p)
  results$value <- round(results$value, 2)
  ggplot(stlAll, aes(x = Date, y = value, colour = id, group = id)) + geom_line() +
    facet_wrap(~ variable, scales = "free_y") + theme_bw() +
    theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) +
    labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
    lapply(unique(results$id), function(z)
      geom_text(data = results, colour = "black", size = 3,
                aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}
nottem, 240 observations
stlCheck(nottem, s.window = 4, t.window = 50, t.jump = 1)
co2, 468 observations
stlCheck(log(co2), s.window = 21)
mdeaths, 72 observations
stlCheck(mdeaths, s.window = "per")
Visually we do see some differences in the trend in cases 1 and 3. But these differences are pretty small in case 1 and still satisfactory in case 3, considering the sample size (72 observations).

I realize this is an old question, but I thought I'd update it since there is a newer stl package available in R called stlplus, which handles missing values. Here is its homepage on github. You can install it from CRAN with install.packages("stlplus") or directly from github with devtools::install_github("hafen/stlplus").
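A minimal sketch of using it on a series with missing values (the component names below are as I recall them from the package documentation, so verify with ?stlplus and str(fit)):
library(stlplus)
nottem[3] <- NA # introduce a missing value
fit <- stlplus(nottem, s.window = "periodic") # stlplus tolerates internal NAs
head(fit$data) # data frame with raw, seasonal, trend and remainder columns
plot(fit)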

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 observations of 196 variables) whose elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were well classified, whereas with k-means the result reached more than 90%).
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
gmm <- Mclust(X, G = 5) # I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
nclusters <- 5 # number of clusters requested above
for (k in 1:nclusters){
  ii = which(cl_gmm == k) # individuals of group k
  counts = table(y[ii]) # number of occurrences of each label
  imax = which.max(counts) # majority label
  maj_lab = attributes(counts)$dimnames[[1]][imax]
  print(paste("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
So far I have not found any solution to this problem... are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every covariance model when too many explanatory variables are involved. The right thing to do is to first reduce the dimensionality, selecting a number of dimensions that makes it possible to properly fit the GMM while preserving as much information as possible about the data.
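A minimal sketch of that workflow, using a plain PCA via prcomp as the dimension-reduction step (the choice of 10 components is arbitrary here and should be tuned, e.g. via the explained variance):
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
pca <- prcomp(X, center = TRUE, scale. = TRUE) # reduce the 196 variables
X_red <- pca$x[, 1:10] # keep the first 10 principal components (arbitrary choice)
gmm <- Mclust(X_red, G = 5) # covariance models other than "EII" should now be estimable
summary(gmm$BIC)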

R - Extract coefficients from a factor of lm object using conditions

I have fitted a lm with the following code:
Eq1_females <- lm(earnings ~ event_time + factor(age) + factor(year) - 1, data = females)
Now, I would like to calculate a predicted value based on the factor coefficients, but this predicted value depends on certain conditions in the data. I therefore create a list of the coefficients and I now want to extract the factor coefficients if age = k and year = y, but it keeps returning 0 or NA. However, if I input a number (e.g. 34) instead of k, it does give the right value. I tried two different codes:
estimates <- coef(Eq1_females)
k = females$age[1]
Eq1_females$coefficients["factor(age)k"]
and
estimates <- coef(Eq1_females)
k = females$age[1]
beta_age = estimates[grep("^factor\\(age\\)k", names(estimates))]
(note that in the end, I would like to loop over different rows of females$age)
What does work, is calculating
beta_age = estimates[grep("^factor\\(age\\)34", names(estimates))]
Could anyone tell me if there is a way of also getting the code to work with k in the beta_age formula?
Thanks a lot in advance!
Answer
Paste the right number onto the regex pattern using paste0 (the example below comes from a different model, with a factor(Petal.Width) term, but the idea carries over directly):
beta = estimates[grep(paste0("^factor\\(Petal.Width\\)", k), names(estimates))]
This returns:
factor(Petal.Width)0.2
3.764947
Rationale
In "^factor\\(age\\)k", it will treat k as the literal k. However, you are referring to variable k. By using paste(..., sep = "") or paste0(...) you can simply paste k to the base pattern.

How to accomplish replicated calculations and plots on subsets of a dataset?

I have simulated data created like this:
library(MASS) # for mvrnorm
average_vector = c(0,0,25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9), nrow = 3, ncol = 3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n = 20000, mu = average_vector, Sigma = sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now, I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people), I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step, I encountered an issue: I want to run a regression model on the 50 selected people (1,000 times) and record/store the point estimates of "hard" from model 4, which is given like this:
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
I then want to plot the variation around the line of 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000 # number of bootstrap resamples
ns <- 50 # size of each resample
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for (i in seq_len(k)) {
  params[[i]] <- coef(lm(formula = final ~ hard + smartness + age, data = data0[samples[, i], ]))
}
# merge params into columns
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.

Is there a way to simulate time series data with a specific rolling mean and autocorrelation in R?

I have an existing time series (1000 samples) and calculated the rolling mean using the filter() function in R, averaging across 30 samples each. The goal of this was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, that would result in the same rolling mean if I would apply the same filter() function to the artificial data. In short, I would like to simulate a time series with the same overall course but not the exact same values as those of an existing time series. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend are not the same.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){filter(x, rep(1 / n, n), sides = 2, circular = T)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problems are that a) the variance of the simulated values is higher than the variance of the original values, b) the rolling average, although quite similar to the original, sometimes deviates quite a bit from it, and c) there is no autocorrelation. Having an autocorrelation structure in the data would be good, since it is supposed to resemble the original data.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
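In code, the corrected first attempt then reads (sim1b is simply a new column name for the corrected simulation):
df$sim1b <- df$rolling_mean + rnorm(1000, sd = sqrt(var(df$ts) - var(df$rolling_mean)))
var(df$sim1b) # now roughly equal to var(df$ts)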
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
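One direction that addresses a) and c) at once is a sketch along these lines (untested, and problem b) may remain): keep the rolling mean as the trend, fit an AR(1) model to the fluctuations around it, and add freshly simulated AR(1) noise back on top.
resid_orig <- df$ts - df$rolling_mean # fluctuations around the smoothed trend
ar1_fit <- arima(resid_orig, order = c(1, 0, 0)) # AR(1) model for those fluctuations
df$sim3 <- df$rolling_mean + arima.sim(model = list(ar = coef(ar1_fit)["ar1"]),
                                       n = 1000, sd = sqrt(ar1_fit$sigma2))
lines(df$sim3, col = "purple")
df$sim3_rm <- my_filter(df$sim3) # rolling mean of the simulated series, for comparison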

Interpolate missing values in a time series with a seasonal cycle

I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a value for the day to a curve fitted to the larger trend, but I hope there already exists some package or function that applies to this situation?
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value is at the time that is the lowest value of the day. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diff.inv, but I just can't figure it out.
Here are three approaches that don't work: na.approx, na.locf, and na.spline.
EDIT2: The image illustrating them was produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA # knock out the missing point
plot(myzoo, type="o", pch=16) # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31 # replace the missing point
lines(myzoo, type = "o", lty=3, pch=16) # dashed line over the gap
legend(x = "topleft",
legend = c("na.spline", "na.locf", "na.approx"),
col=c("green","blue","red"), pch = 1)
Try this:
x <- ts(myzoo, frequency = 4)
fit <- ts(rowSums(tsSmooth(StructTS(x))[,-2]))
tsp(fit) <- tsp(x)
plot(x)
lines(fit,col=2)
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smooth is used to estimate each point in the time series, including any omitted.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
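For instance (a small sketch that simply re-attaches the original zoo index to the smoothed values):
fit_zoo <- zoo(as.numeric(fit), index(myzoo)) # back to the original POSIXct index
fit_zoo[10] # the estimate for the formerly missing observation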
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough data here to fit the seasonal model, but this should get you started.
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- myzoo.orig <- zoo(obs, index)
myzoo[10] <- NA
myzoo.fixed <- na.locf(myzoo)
myarima.resid <- arima(myzoo.fixed, order = c(3, 0, 3), seasonal = list(order = c(0, 0, 0), period = 4))$residuals
myzoo.reallyfixed <- myzoo.fixed
myzoo.reallyfixed[10] <- myzoo.fixed[10] + myarima.resid[10]
plot(myzoo.reallyfixed)
points(myzoo.orig)
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. It would also help to have a good prior on the underlying mechanisms of both the signal and the seasonal correction, to get better out-of-sample performance.
forecast::na.interp is a good approach. From the documentation
Uses linear interpolation for non-seasonal series and a periodic stl decomposition with seasonal series to replace missing values.
library(forecast)
fit <- na.interp(myzoo)
fit[10] # 32.5, vs. 31.0 actual and 32.0 from Rob Hyndman's answer
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
From the R implementations tested in this paper, na.interp from the forecast package and na.StructTS from the zoo package showed the best overall results.
The na.interp function is also not that much slower than
na.approx [the fastest method], so the loess decomposition seems not to be very demanding in terms of computing time.
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
Package imputeTS has a method for Kalman Smoothing on the state space representation of an ARIMA model - which might be a good solution for this problem.
library(imputeTS)
na_kalman(myzoo, model = "auto.arima")
It also works directly with zoo time series objects. You could also use your own ARIMA model in this function, if you think you can do better than "auto.arima". This would be done this way:
library(imputeTS)
myts <- as.ts(myzoo) # arima() needs a ts object, so convert the zoo series first
usermodel <- arima(myts, order = c(1, 0, 1))$model
na_kalman(myts, model = usermodel)
In this case you have to convert the zoo object back to ts first (as above), since arima() only accepts ts objects.
