rollapply for moving average with non-business days

I'd like to compute a moving average for data that have NAs in the middle, like below.
library(xts)  # as.xts() needs xts; xts loads zoo, which provides rollapply()
date <- seq.Date(as.Date("2018-07-02"), as.Date("2018-07-14"), by = "days")
A <- c(100,110,120,130,140,NA,NA,150,160,170,180,190,200)
B <- c(200,220,240,260,280,NA,NA,300,320,340,360,380,400)
C <- c(150,160,170,180,190,200,210,NA,NA,220,230,240,250)
dataset <- data.frame(A,B,C)
dataset <- as.xts(dataset, order.by = date)
If I use rollapply like below to get a 3-day moving average...
y <- rollapply(dataset, width = 3, function(x) mean(x, na.rm = TRUE ))
This is not what I want.
For example, in the moving average of A at "2018-07-09", the result is (NA+NA+150)/1 = 150, but I want to get (130+140+150)/3 = 140.
How can I do that?

I assume you want NAs to stay as NA and otherwise to take the mean of the last 3 non-NAs.
1) Take 5 elements at a time and if the last element is NA then return NA; otherwise, remove the NAs and take the mean of the last 3. Note that this does imply that the first 4 rows will be NA.
mean_bus <- function(x) if (is.na(tail(x, 1))) NA else mean(tail(na.omit(x), 3))
y1 <- rollapplyr(dataset, width = 5, mean_bus)
2) An alternate would be to take the last 3 non-NAs and then overwrite that with NAs in all positions where the input is NA.
mean_omit <- function(x) mean(tail(na.omit(x), 3))
y <- rollapplyr(dataset, 5, mean_omit)
y2 <- replace(y, is.na(dataset), NA)
all.equal(y1, y2)
## [1] TRUE
3) If you prefer to fill in the first 4 rows with partial values then convert to zoo and use the partial= argument of rollapplyr.zoo. mean_bus is from (1).
y3 <- as.xts(rollapplyr(as.zoo(dataset), 5, mean_bus, partial = TRUE))

You could either remove the NAs in each series before you compute the moving average (MA), or use a larger window and keep only the last three non-NA values for the MA (a sketch of the first option follows the code below). The second option:
y <- rollapply(dataset, width = 5,
               function(x) mean(tail(x[!is.na(x)], 3)))
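A minimal sketch of the first option, under the assumption that the NA rows simply mark non-business days that can be dropped per series and merged back on the shared date index (ma_skip_na is just an illustrative helper name):
ma_skip_na <- function(col) rollapplyr(na.omit(col), width = 3, FUN = mean, fill = NA)
y_alt <- do.call(merge, lapply(seq_len(ncol(dataset)), function(j) ma_skip_na(dataset[, j])))
merge() realigns the per-series results on the union of dates, so here the positions that were NA in a given series stay NA in that series' moving average.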

Related

R: Extrapolating x no. of values beyond known values

I'm looking for a function/method to extrapolate (linearly) an x number of values beyond the original values.
Let's say I start with:
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
If I want to extrapolate two values beyond, I would end up with:
[1] NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA
What I found so far is the approxExtrap function from Hmisc (https://rdrr.io/cran/Hmisc/man/approxExtrap.html). But since you have to define 'xout', I feel that I have to write a loop and every time select pieces I want to extrapolate on. This is possible of course, but ultimately I expect to have sequences of millions of datapoints with a lot of gaps, so I feel this may be too time consuming. So I hope I'm overlooking a simpler solution.
Added: There are no small gaps in the data, but typically ~ 100 NA's and then ~ 40 datapoints. I would like to extrapolate/extend the 40 datapoints with 5 new datapoints before the start and after the end of the 40 datapoints and replace 5 NA's at both locations. It is not possible to interpolate between two sequences of 40 datapoints.
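For the single non-NA block in the toy vector above, approxExtrap can be applied directly; a minimal illustration (not the multi-block solution, which follows below):
library(Hmisc)
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
idx <- which(!is.na(a))                   # 7 8 9
xout <- (min(idx) - 2):(max(idx) + 2)     # extend two positions beyond each end
a[xout] <- approxExtrap(idx, a[idx], xout = xout)$y
a
## [1] NA NA NA NA -1  0  1  2  3  4  5 NA NA NA NA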
I managed to solve the problem by:
1) determining the ranges of the different series of data,
2) defining the range I want to extrapolate to, and
3) doing the actual extrapolation through the Hmisc package.
Initially, I thought I could only manage this with loops that go through the raw data row by row, and I was hoping for an existing function.
I'm sure many of you would have coded this more efficiently and more elegantly, but I wanted to post my script anyway for people with a similar problem.
require(Hmisc)
extrapol.length <- 5
# 'Time' is not used below because my data are equally spaced in time; if you need
# it, pass it as the first argument of approxExtrap() in the second-to-last line
test <- data.frame('Time' = c(1:100),
                   'x' = c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20)))
## Determine start and end of the continuous (non-NA) data streams
length.values <- diff(c(0, which(is.na(test[,2]))))-2 # length non-NA's
length.values <- length.values[length.values > -1]
length.nas <- diff(c(0, which(!is.na(test[,2])))) # length NA's
length.nas <- length.nas[length.nas > 1]
if (is.na(test[1, 2])) {
  # data starts with NA
  length.nas <- data.frame('Order' = seq(1, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(2, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  start.end <- start.end[order(start.end$Order), ]
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'na'])),
                           'start' = NA, 'end' = NA)
  for (a in value.seqs$no) {
    value.seqs$start[a] <- sum(start.end$Length[1:((a*2)-1)])
    value.seqs$end[a] <- sum(start.end$Length[1:(a*2)])
  }
} else {
  # Data starts with actual values
  length.nas <- data.frame('Order' = seq(2, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(1, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  start.end <- start.end[order(start.end$Order), ]
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'value'])),
                           'start' = c(1, rep(NA, (length(start.end$Type[start.end$Type == 'value'])-1))),
                           'end' = NA)
  for (a in value.seqs$no) {
    value.seqs$end[a] <- sum(start.end$Length[1:((a*2)-1)])+1
    if (a < max(value.seqs$no))
      value.seqs$start[a+1] <- sum(start.end$Length[1:(a*2)])+1
  }
}
## Do not extrapolate outside of the time-range of the original dataframe
value.seqs$start.extr <- value.seqs$start - extrapol.length
value.seqs$start.extr[value.seqs$start.extr < 1] <- 1 # do not extrapolate below time < 1
value.seqs$end.extr <- value.seqs$end + extrapol.length
value.seqs$end.extr[value.seqs$end.extr > nrow(test) | is.na(value.seqs$end.extr)] <- nrow(test)
value.seqs$end[is.na(value.seqs$end)] <- max(which(!is.na(test[,2])))
## Extrapolate
for (b in value.seqs$no) {
  test[c(value.seqs$start.extr[b]:value.seqs$end.extr[b]), 3] <-
    approxExtrap(value.seqs$start[b]:value.seqs$end[b],
                 test[c(value.seqs$start[b]:value.seqs$end[b]), 2],
                 xout = c(value.seqs$start.extr[b]:value.seqs$end.extr[b]))[2]
}
Thanks for thinking along!

How can I use rollapply with a 5 month window?

I noticed this in the documentation of rollapply() to roll by 3 days:
## rolling mean by time window (e.g., 3 days) rather than
## by number of observations (e.g., when these are unequally spaced):
#
## - test data
tt <- as.Date("2000-01-01") + c(1, 2, 5, 6, 7, 8, 10)
z <- zoo(seq_along(tt), tt)
## - fill it out to a daily series, zm, using NAs
## using a zero width zoo series g on a grid
g <- zoo(, seq(start(z), end(z), "day"))
zm <- merge(z, g)
## - 3-day rolling mean
rollapply(zm, 3, mean, na.rm = TRUE, fill = NA)
Suppose I have the following data:
data.zoo <- read.zoo(
data.frame(
date = sample(seq(as.Date('2001-04-12'), as.Date("2019-04-05"), by="day"), 600),
val = runif(1:600),
val2 = runif(1:600)
))
Is it possible to somehow use rollapply() with a 5 month rolling window to calculate the rolling mean of val? The problem with a 5-month rolling window is that the number of days in a month varies...
NOTE: I would prefer a base-R solution but other libraries would be interesting to see
Since width can be a vector of widths, one for each row of the input, we can simply compute the number of days between each date and 5 months prior and use those numbers for the width vector:
library(zoo)
ym <- as.yearmon(time(data.zoo))
w <- as.Date(ym) - as.Date(ym - 5/12)
r <- rollapplyr(data.zoo, w, mean, fill = NA)
Alternatively, we could compute w like this with lubridate:
library(lubridate)
w <- time(data.zoo) - (time(data.zoo) %m-% months(5))
Update
If there can be missing dates then
library(lubridate)
w <- sapply(time(data.zoo), function(x)
  length(intersect(seq(x %m-% months(5), x, "day"), time(data.zoo))))
or repeat this, replacing %m-% months(5) with subtract5m, which does not use additional packages:
subtract5m <- function(x) {
  if (length(x) == 1) seq(x, length = 2, by = "-5 month")[2]
  else as.Date(sapply(x, subtract5m))
}
w <- sapply(time(data.zoo), function(x)
length(intersect(seq(subtract5m(x), x, "day"), time(data.zoo))))
Note that, because the definition of "5 months ago" is ambiguous, the various computations for w may differ slightly, since each rests on slightly different assumptions.
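For instance, just to illustrate that ambiguity, the two definitions used above give different cut-offs for a month-end date:
library(lubridate)
as.Date(as.yearmon(as.Date("2019-03-31")) - 5/12)  # "2018-10-01" (yearmon arithmetic)
as.Date("2019-03-31") %m-% months(5)               # "2018-10-31" (lubridate)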
Improving on G. Grothendieck's ideas, I went with:
ym <- as.yearmon(time(data.zoo))
ym.cutoff.ideal <- ym - 5/12
ym.cutoff.closest.to.ideal <- as.yearmon(time(data.zoo)[findInterval(as.Date(ym.cutoff.ideal), as.Date(ym)) + 1])
w <- time(data.zoo) - as.Date(ym.cutoff.closest.to.ideal) + 1
r <- rollapplyr(data.zoo, w, mean, fill = NA)
It looks like it is working correctly...

DMwR::unscale to unscale only selected columns

I've got a data.frame with 4 columns which I want to scale, and then add some new columns (without scaling them). Then I perform some calculations, after which I need to unscale only the first 4 columns (as the remaining two weren't scaled in the first place). DMwR::unscale seems to allow for that with the col.ids argument. But when I call the function like below, it returns
Error in DMwR::unscale(cbind(scale(x), x2), scale(x), 1:4) :
Incorrect dimension of data to unscale.
x <- matrix(2*rnorm(400) + 1, ncol = 4)
x2 <- matrix(9*rnorm(200), ncol = 2)
DMwR::unscale(cbind(scale(x), x2), scale(x), 1:4)
What am I doing wrong? How can I unscale only the first 4 columns of the matrix?
The DMwR::unscale(vals, norm.data, col.ids) function requires col.ids to cover every column of vals (its dimension check compares the number of selected columns with NCOL(vals)), so it cannot unscale only a subset of the columns.
I suggest considering the following modified version of unscale:
myunscale <- function(vals, norm.data, col.ids) {
  cols <- if (missing(col.ids)) 1:NCOL(vals) else col.ids
  if (length(cols) > NCOL(vals))
    stop("Incorrect dimension of data to unscale.")
  centers <- attr(norm.data, "scaled:center")[cols]
  scales <- attr(norm.data, "scaled:scale")[cols]
  unvals <- scale(vals[, cols], center = (-centers/scales), scale = 1/scales)
  unvals <- cbind(unvals, vals[, -cols])
  attr(unvals, "scaled:center") <- attr(unvals, "scaled:scale") <- NULL
  unvals
}
set.seed(1)
x <- matrix(2*rnorm(4000) + 1, ncol = 4)
x2 <- matrix(9*rnorm(2000), ncol = 2)
x_unsc <- myunscale(cbind(scale(x), x2), scale(x) , 1:4)
The mean values and the standard deviations of x_unsc are:
apply(x_unsc, 2, mean)
# [1] 0.9767037 0.9674762 1.0306181 1.0334445 -0.1805717 -0.1053083
apply(x_unsc, 2, sd)
# [1] 2.069832 2.079963 2.062214 2.077307 8.904343 8.810420
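For comparison, the moments of the original, never-scaled columns; since unscaling inverts scale() exactly, these should reproduce the values above:
apply(cbind(x, x2), 2, mean)
apply(cbind(x, x2), 2, sd)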

for each row in a data frame, find whether there is a "close" row in another data frame

I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get FALSE for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use the fuzzyjoin package:
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2,df[1:2,] +0.01)
df %>%
  fuzzy_left_join(df2, match_fun = function(x, y) y < x + 0.1 & y > x - 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
You can find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes. Make the range for each column the same, zero to one or negative one to one. That equalizes the effect of the columns with large and small values. In more advanced approaches one would also center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation1 (row1) to observation2 (row2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
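A rough sketch of that generic recipe (shared 0-1 scaling, then a full distance matrix); this is only an illustration with made-up helper names, not the 0.1-per-column criterion used below:
m1 <- as.matrix(df); m2 <- as.matrix(df2)
rng <- apply(rbind(m1, m2), 2, range)              # shared column minima and maxima
norm01 <- function(m) sweep(sweep(m, 2, rng[1, ]), 2, rng[2, ] - rng[1, ], "/")
d1 <- norm01(m1); d2 <- norm01(m2)
dist_mat <- sapply(seq_len(nrow(d2)), function(j)  # nrow(df) x nrow(df2) distances
  sqrt(rowSums(sweep(d1, 2, d2[j, ])^2)))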
In your case, the selection criterion is simpler. Two observations are close if no attribute differs by more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order, and that close observations are relatively rare. My approach is that once we discover a pair is distant, we can stop investigating it. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from each observation in df (rows in dfclose) to each observation in df2 (columns in dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to df2.
The total distance is zero. To that we add the Manhattan Distance. When the total Manhattan distance is greater than .1, they are no longer close. We needn't evaluate any more.
closeCriterion <- function(origin, dest) {
  manhattanDistance <- abs(origin - dest)
  # print(paste("manhattanDistance =", manhattanDistance))
  if (manhattanDistance < .1) ret <- 0 else ret <- 1
}
convertScore <- function(x) if (x>0) FALSE else TRUE
for (j in 1:ncol(df)) {
  print(paste("col =", j))
  for (i in 1:nrow(df)) {
    print(paste("df row =", i))
    for (k in 1:nrow(df2)) {
      # print(paste("df2 row (and dfdist column) =", k))
      distantScore <- closeCriterion(df[i, j], df2[k, j])
      # print(paste("df and dfdist row =", i, " df2 row (and dfdist column) =", k, " distantScore = ", distantScore))
      dfdist[i, k] <- dfdist[i, k] + distantScore
    }
  }
}
dfdist # have a look at the numerical results
dfclose <- matrix(lapply(dfdist, convertScore), ncol = nrow(df2))
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
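A quick check of that figure:
0.9 ^ 36  # ~0.0225, i.e. roughly a 2.2% chance that all three dozen columns match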
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
  expand.grid(row.df = seq(nrow(df)),
              row.df2 = seq(nrow(df2))) %>%
  mutate(in.range = pmap_lgl(., ~ all(abs(df[.x, ] - df2[.y, ]) <= 0.1))) %>%
  group_by(row.df) %>%
  summarise_at('in.range', any) %>%
  select(in.range)

use rollapply for certain rows

I wonder if it is possible to use rollapply() only for certain rows of a dataframe. I know the "by" argument can make FUN be calculated at every by-th time point, but now I have a very specific vector of row indices to which I wish to apply rollapply(). For example, I have the below dataframe:
df <- data.frame(x = (1:10), y = (11:20))
I know how to calculate the rolling mean for y column when the rolling width is 3.
library(zoo)
m <- rollapply(df$y, width = 3, FUN = mean, fill = NA, align = "right")
But what if I want the width-3-mean only for the 4th and 9th row? Is there something in "by" argument that I can manipulate? Or some other better methods (using apply to do rolling calculation maybe)?
Hopefully I am understanding your question correctly. I think you are asking how to perform a function on the 4th and 9th elements in a sliding window? If yes, just restrict your function to the 4th and 9th elements using x[4] and x[9]. Like this:
output <- rollapply(df, 9, function(x) (x[4] + x[9])/2, fill = NA, align = "right")
I also interpret your question as asking how to get the mean when the window contains the 4th or 9th row. This can be done by subsetting. The question you need to think about is where you want the 4th and 9th rows to be located within your window: do you want the 4th row to be at position x[1], x[2], or x[3]? What sits at the other positions will obviously affect your output. Say you don't know, and all three seem reasonable; then you will need to write a function that creates a list of dataframes containing the ranges of data you are interested in, and use an apply function, or a for loop, to rollapply the mean function over each dataframe in the list. You can then bind all of these outputs into a dataframe to work with further. Like this:
# the rlist library has a function that allows us to add items to a list
# which will be handy later on
library(rlist)
library(zoo)
# your example data
df <- data.frame(x = (1:10), y = (11:20))
# a vector of your desired rows
desired_rows <- c(4,9)
# A for loop that generates a list of dataframes
# with your desired rows in the middle of each
for (i in desired_rows) {
  lower_bound <- i - 2
  upper_bound <- i + 2
  df_subset <- df[c(lower_bound:upper_bound), ]
  if (exists("list_df_range")) {
    list_df_range <- list.append(list_df_range, df_subset)
  } else {
    list_df_range <- list(df_subset)
  }
}
# a second for loop that applies your rollapply function to each
# data frame in the list and then
# returns a dataframe of the final results
# with each column named after the originating row
for (n in list_df_range) {
  m <- rollapply(n$y, width = 3, FUN = mean, fill = NA, align = "right")
  if (exists("final_out")) {
    final_out <- cbind(final_out, m)
  } else {
    final_out <- data.frame(m)
  }
}
names(final_out) <- desired_rows
Based on the poster's comment below the question, it seems that what is wanted is to take the mean of each rolling window of width 3, excluding the middle element in each window, and to keep only the 4th and 9th elements. So:
cc <- c(4, 9)
rollapply(df$y, list(c(-2, 0)), mean, fill = NA)[cc]
## [1] 13 18
or
rollapplyr(df$y, 3, function(x) mean(x[-2]), fill = NA)[cc]
## [1] 13 18
or
sapply(cc, function(ix) mean(df$y[seq(to = ix, by = 2, length = 2)]))
## [1] 13 18
or
(df$y[cc - 2] + df$y[cc]) / 2
## [1] 13 18
