R: Find outliers in a time series dataset using standard deviation

I have an xts time series object with numeric values for the data. Here is str(dataTS):
An ‘xts’ object on 2014-02-14 14:27:00/2014-02-28 14:22:00 containing:
Data: num [1:4032, 1] 51.8 44.5 41.2 48.6 46.7 ...
Indexed by objects of class: [POSIXlt,POSIXt] TZ:
xts Attributes:
NULL
I want to find the data points that are more than 2 standard deviations away from the mean and create a new dataset from them. Here is a sample of the time series:
[,1]
2015-02-14 14:27:00 51.846
2015-02-14 14:32:00 44.508
2015-02-14 14:37:00 41.244
2015-02-14 14:42:00 48.568
2015-02-14 14:47:00 46.714
2015-02-14 14:52:00 44.986
2015-02-14 14:57:00 49.108
2015-02-14 15:02:00 1000.470
2015-02-14 15:07:00 53.404
2015-02-14 15:12:00 45.400
2015-02-14 15:17:00 3.216
2015-02-14 15:22:00 49.7204
I want to subset the outliers, 3.216 and 1000.470.

You can scale your data to have zero mean and unit standard deviation. You can then directly identify individual observations that are >= 2 sd away from the mean.
As an example, I randomly sample some data from a Cauchy distribution.
set.seed(2010);
smpl <- rcauchy(10, location = 4, scale = 3);
To illustrate, I store the sample data and scaled sample data in a data.frame; I also mark observations that are >= 2 standard deviations away from the mean.
library(tidyverse);
df <- data.frame(Data = smpl) %>%
mutate(
Data.scaled = as.numeric(scale(Data)),
deviation_greater_than_2sd = ifelse(Data.scaled >= 2, TRUE, FALSE));
df;
# Data Data.scaled deviation_greater_than_2sd
#1 8.007951 -0.2639689 FALSE
#2 -34.072054 -0.5491882 FALSE
#3 465.099800 2.8342104 TRUE
#4 7.191778 -0.2695010 FALSE
#5 2.383882 -0.3020890 FALSE
#6 3.544079 -0.2942252 FALSE
#7 -7.002769 -0.3657119 FALSE
#8 4.384503 -0.2885287 FALSE
#9 15.722492 -0.2116796 FALSE
#10 4.268082 -0.2893179 FALSE
We can also visualise the distribution of Data.scaled:
ggplot(df, aes(Data.scaled)) + geom_histogram();
The "outlier" is 2.8 standard deviations away from the mean.
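To tie this back to the xts object in the question, the same rule can subset the series directly. A minimal sketch, using a toy dataTS rebuilt from the sample values in the question; note that the extreme value inflates the standard deviation, so with these values only 1000.470 exceeds the 2-sd cut (a robust cutoff based on the median and MAD would also catch 3.216):

```r
library(xts)

# Toy stand-in for the asker's dataTS (times and values taken from the question)
times  <- seq(as.POSIXct("2015-02-14 14:27:00"), by = "5 min", length.out = 12)
vals   <- c(51.846, 44.508, 41.244, 48.568, 46.714, 44.986,
            49.108, 1000.470, 53.404, 45.400, 3.216, 49.7204)
dataTS <- xts(vals, order.by = times)

# z-scores of the raw values; keep observations with |z| > 2
z        <- abs(scale(coredata(dataTS)))
outliers <- dataTS[as.vector(z) > 2]
outliers
```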

Weighted Moving Average based on Irregular Date Intervals

I am new to time series and was hoping someone could provide some input/ideas here.
I am trying to find ways to impute missing values.
I was hoping to find the moving average, but most of the packages (smooth, mgcv, etc.) don't seem to take time intervals into consideration.
For example, the dataset might look like something below and I would want value at 2016-01-10 to have the greatest influence in calculating the missing value:
Date Value Diff_Days
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-01-30 50 16
I have instances where NA might be the first observation or the last observation. Sometimes NA values also occur multiple times, at which point the rolling window would need to expand, and this is why I would like to use the moving average.
Is there a package that would take date intervals / separate weights into consideration?
Or please suggest if there is a better way to impute NA values in such cases.
You can use glm or any other model.
Input
con <- textConnection("Date Value Diff_Days
2015-12-14 NA 0
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-02-14 NA 0
2016-02-18 NA 0
2016-02-29 50 16")
df <- read.table(con, header = T)
df$Date <- as.Date(df$Date)
df$Date.numeric <- as.numeric(df$Date)
fit <- glm(Value ~ Date.numeric, data = df)
df.na <- df[is.na(df$Value),]
predicted <- predict(fit, df.na)
df$Value[is.na(df$Value)] <- predicted
plot(df$Date, df$Value)
points(df.na$Date, predicted, type = "p", col="red")
df$Date.numeric <- NULL
rm(df.na)
print(df)
Output
Date Value Diff_Days
1 2015-12-14 -3.054184 0
2 2016-01-01 10.000000 13
3 2016-01-10 14.000000 4
4 2016-01-14 18.518983 0
5 2016-01-28 30.000000 14
6 2016-02-14 40.092149 0
7 2016-02-18 42.875783 0
8 2016-02-29 50.000000 16
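If you specifically want the distance-weighted moving average the question asks for, here is a small base-R sketch; the function name, the window covering all non-missing points, and the 1/(distance + 1) weighting scheme are assumptions for illustration, not a package API:

```r
# Impute each NA with an inverse-distance-weighted mean of the non-missing
# points, so closer dates get larger weights
impute_idw <- function(dates, values) {
  dates <- as.numeric(as.Date(dates))
  out <- values
  for (i in which(is.na(values))) {
    ok <- !is.na(values)
    w  <- 1 / (abs(dates[ok] - dates[i]) + 1)  # closer dates weigh more
    out[i] <- sum(values[ok] * w) / sum(w)
  }
  out
}

d <- c("2016-01-01", "2016-01-10", "2016-01-14", "2016-01-28", "2016-01-30")
v <- c(10, 14, NA, 30, 50)
res <- impute_idw(d, v)
```

Here the value at 2016-01-10, four days away, gets the largest weight, as the asker wanted, and leading or trailing NAs are handled too, because every non-missing point contributes.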

Cut timestamp into numeric slots in R

I have timestamp column, having data in the form 2016-01-01 00:41:23
I want to convert this data into 12 slots, each of 2 hours, across the entire dataset. The date is not important; only the time needs to be considered.
00:00:00 - 01:59:59 - slot1
02:00:00 - 03:59:59 - slot2
.......
22:00:00 - 23:59:59 - slot12
How can I achieve this in R?
x <- c("01:59:59", "03:59:59", "05:59:59",
"07:59:59", "09:59:59", "11:59:59",
"13:59:59", "15:59:59", "17:59:59",
"19:59:59", "21:59:59", "23:59:59")
cut(pickup_time, breaks = x)
The above code gives the error: 'x' must be numeric.
Considering your dataframe as df, we can use cut with breaks of 2 hours.
df$slotnumber <- cut(strptime(df$x, "%H:%M:%S"), breaks = "2 hours",
labels = paste0("slot", 1:12))
# x slotnumber
#1 01:59:59 slot1
#2 03:59:59 slot2
#3 05:59:59 slot3
#4 07:59:59 slot4
#5 09:59:59 slot5
#6 11:59:59 slot6
#7 13:59:59 slot7
#8 15:59:59 slot8
#9 17:59:59 slot9
#10 19:59:59 slot10
#11 21:59:59 slot11
#12 23:59:59 slot12
data
df <- data.frame(x)
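As an alternative sketch: since the slots are fixed two-hour blocks, the slot number can also be computed arithmetically from the hour, with no parsing of minutes or seconds (the vector of times here is a shortened version of the one in the question):

```r
x    <- c("01:59:59", "03:59:59", "13:00:00", "23:59:59")
hour <- as.integer(substr(x, 1, 2))   # hours 0-23
slot <- hour %/% 2 + 1                # 0-1 -> 1, 2-3 -> 2, ..., 22-23 -> 12
paste0("slot", slot)
# "slot1" "slot2" "slot7" "slot12"
```

For full timestamps like 2016-01-01 00:41:23, extract the hour first with as.integer(format(as.POSIXct(ts), "%H")).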

R: 3rd Wednesday of a specific month using xts

I want to retrieve the third Wednesday of specific months in R.
This is not exactly a duplicate question of How to figure third Friday of a month in R because I want to use either Base R or XTS.
The data is in x:
library(xts)
x = xts(1:100, Sys.Date()+1:100)
and I can retrieve Wednesdays by using:
wed=x[.indexwday(x) %in% 3]
> wed
[,1]
2015-09-30 6
2015-10-07 13
2015-10-14 20
2015-10-21 27
2015-10-28 34
2015-11-04 41
2015-11-11 48
2015-11-18 55
2015-11-25 62
2015-12-02 69
2015-12-09 76
2015-12-16 83
2015-12-23 90
2015-12-30 97
I haven't figured out how to get the third observation in each month of this wed vector using xts, but there must be a way.
third=wed[head(endpoints(wed, "months") + 3, -3)]
returns a wrong result.
I have read the xts documentation and couldn't find the right function there.
Any help would be appreciated.
Why not just
library(xts)
x = xts(1:3650, Sys.Date()+1:3650)
x[.indexwday(x) == 3 &
.indexmday(x) >= 15 &
.indexmday(x) <= 21
]
If first Wednesday is on 1st then third is on 15th.
If first Wednesday is on 7th then third is on 21st.
So anywhere between 15th and 21st.
Take your wed object, split it by month, then select the 3rd row. Then use do.call and rbind to put it back together.
R> # 3rd or last available Wednesday
R> wedList <- split(wed, "months")
R> do.call(rbind, lapply(wedList, function(x) x[min(nrow(x),3),]))
# [,1]
# 2015-09-30 6
# 2015-10-21 27
# 2015-11-18 55
# 2015-12-16 83
R> # no observation if 3rd Wednesday isn't available
R> do.call(rbind, lapply(wedList, function(x) if(nrow(x) < 3) NULL else x[3,]))
# [,1]
# 2015-10-21 27
# 2015-11-18 55
# 2015-12-16 83
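If you only need the dates rather than the xts subset, the third Wednesday of a month can also be computed directly in base R. A sketch (the function name is made up for illustration; %u is the ISO weekday, where 3 = Wednesday):

```r
# Third Wednesday of a given month, base R only
third_wednesday <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  days  <- first + 0:27                 # first four weeks of the month
  days[format(days, "%u") == "3"][3]    # third ISO-weekday-3 (Wednesday)
}
third_wednesday(2015, 10)
# 2015-10-21, matching the xts result above
```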

Warnings when applying a custom function to every row of a table using dplyr?

I am trying to replicate something like this with a custom function but I am getting errors. I have the following data frame
> dd
datetimeofdeath injurydatetime
1 2/10/05 17:30
2 2/13/05 19:15
3 2/15/05 1:10
4 2/24/05 21:00 2/16/05 20:36
5 3/11/05 0:45
6 3/19/05 23:05
7 3/19/05 23:13
8 3/23/05 20:51
9 3/31/05 11:30
10 4/9/05 3:07
The typeof of these columns is integer, but for some reason they have levels, as if they were factors. This could be the root of my problem, but I am not sure.
> typeof(dd$datetimeofdeath)
[1] "integer"
> typeof(dd$injurydatetime)
[1] "integer"
> dd$injurydatetime
[1] 2/10/05 17:30 2/13/05 19:15 2/15/05 1:10 2/16/05 20:36 3/11/05 0:45 3/19/05 23:05 3/19/05 23:13 3/23/05 20:51 3/31/05 11:30
[10] 4/9/05 3:07
549 Levels: 1/1/07 18:52 1/1/07 20:51 1/1/08 17:55 1/1/11 15:25 1/1/12 0:22 1/1/12 22:58 1/11/06 23:50 1/11/07 6:26 ... 9/9/10 8:15
Now I would like to apply the following function rowwise()
library(lubridate)
library(dplyr)
get_time_alive = function(datetimeofdeath, injurydatetime)
{
if(as.character(datetimeofdeath) == "" | as.character(injurydatetime) == "") return(NA)
time_of_death = parse_date_time(as.character(datetimeofdeath), "%m/%d/%y %H:%M")
time_of_injury = parse_date_time(as.character(injurydatetime), "%m/%d/%y %H:%M")
time_alive = as.duration(new_interval(time_of_injury,time_of_death))
time_alive_hours = as.numeric(time_alive) / (60*60)
return(time_alive_hours)
}
This works on individual rows, but not when I do the operation rowwise.
> get_time_alive(dd$datetimeofdeath[1], dd$injurydatetime[1])
[1] NA
> get_time_alive(dd$datetimeofdeath[4], dd$injurydatetime[4])
[1] 192.4
> dd = dd %>% rowwise() %>% dplyr::mutate(time_alive_hours=get_time_alive(datetimeofdeath, injurydatetime))
There were 20 warnings (use warnings() to see them)
> dd
Source: local data frame [10 x 3]
Groups:
datetimeofdeath injurydatetime time_alive_hours
1 2/10/05 17:30 NA
2 2/13/05 19:15 NA
3 2/15/05 1:10 NA
4 2/24/05 21:00 2/16/05 20:36 NA
5 3/11/05 0:45 NA
6 3/19/05 23:05 NA
7 3/19/05 23:13 NA
8 3/23/05 20:51 NA
9 3/31/05 11:30 NA
10 4/9/05 3:07 NA
As you can see the fourth element is NA even though when I applied my custom function to it by itself I got 192.4. Why is my custom function failing here?
I think you can simplify your code a lot and just use something like this:
dd %>%
mutate_each(funs(as.POSIXct(as.character(.), format = "%m/%d/%y %H:%M"))) %>%
mutate(time_alive = datetimeofdeath - injurydatetime)
# datetimeofdeath injurydatetime time_alive
#1 <NA> 2005-02-15 01:10:00 NA days
#2 2005-02-24 21:00:00 2005-02-16 20:36:00 8.016667 days
#3 <NA> 2005-03-11 00:45:00 NA days
Side notes:
I shortened your input data, because it's not easy to copy (I only took those three rows that you also see in my answer)
If you want the "time_alive" formatted in hours, just use mutate(time_alive = (datetimeofdeath - injurydatetime)*24) in the last mutate.
If you use this code, there's no need for rowwise(), which should also make it faster, I guess
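An equivalent base-R sketch of the same approach, using difftime() with explicit units so the hours conversion is spelled out instead of multiplying days by 24 (the three rows mirror the shortened input above):

```r
df <- data.frame(
  datetimeofdeath = c(NA, "2/24/05 21:00", NA),
  injurydatetime  = c("2/15/05 1:10", "2/16/05 20:36", "3/11/05 0:45"),
  stringsAsFactors = FALSE)

# Parse both columns, then take the difference directly in hours
death  <- as.POSIXct(df$datetimeofdeath, format = "%m/%d/%y %H:%M", tz = "UTC")
injury <- as.POSIXct(df$injurydatetime,  format = "%m/%d/%y %H:%M", tz = "UTC")
df$time_alive_hours <- as.numeric(difftime(death, injury, units = "hours"))
df$time_alive_hours
# NA 192.4 NA
```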

How can I filter specifically for certain months if the days are not the same in each year?

This is probably a very simple question that has been asked already but..
I have a data frame that I have constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e. they are for "On Peak" times of electricity usage. That means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]
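A numeric variant of the same idea, using the zero-based mon field of as.POSIXlt instead of parsing month names, which keeps the filter locale-independent (the toy df mirrors the reproducible data above):

```r
set.seed(1)
df <- data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"),
                            by = "month"),
                 MaxLoad = floor(runif(25, 2000, 3000)),
                 TMAX = runif(25, 100, 200))

# mon is 0-based (Jan = 0), so add 1 before comparing; the parentheses matter,
# because %in% binds tighter than +
summer <- df[(as.POSIXlt(df$Date)$mon + 1) %in% 6:9, ]
nrow(summer)
# 8 rows: Jun-Sep of 2009 and 2010
```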
