Rolling window to calculate Value at Risk in R

I'm sure this is very obvious, but I'm a beginner in R and I spent a good part of the afternoon trying to solve this...
I'm trying to create a rolling window to calculate the Value at Risk (VaR) over time.
I already calculated the unconditional VaR for my entire time series of 7,298 daily returns.
Now, what I'm trying to do is set up a rolling window that calculates the VaR over a 25-day window, rolling forward one observation at a time across the entire time series.
I tried
apply.rolling(nas, trim = TRUE, gap = 25, by = 1, FUN = function(x) VaR(R = nas, p = 0.99, method="historical"))
and
rollapply(nas, width = 25, FUN = function(x) VaR(R = nas, p = 0.99, method="historical"))
where nas is my time series.
My code has been running for over an hour now... I don't know what I did wrong...
Thank you very much in advance for any help you can provide.
H.

It should be:
rollapply(nas, width = 25, FUN = function(x) VaR(R = x, p = 0.99, method="historical"))
Basically, you are applying a function that takes a value x (nas filtered down to a 25-observation window) and produces output based on that x. In your original attempt, the function was function(x) VaR(R = nas, p = 0.99, method="historical"): it takes in x but still calculates the VaR of the whole nas, and it does that more than 7,000 times, which is why it takes forever.
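For reference, a minimal self-contained sketch (using simulated returns in place of your series, and assuming nas is a zoo/xts object) would look like this:
library(zoo)
library(PerformanceAnalytics)
set.seed(1)
# stand-in for your 7,298 daily returns
nas <- zoo(rnorm(500, mean = 0, sd = 0.01), order.by = as.Date(1:500))
# each 25-observation window is passed to the anonymous function as x
rolling_var <- rollapply(nas, width = 25,
                         FUN = function(x) VaR(R = x, p = 0.99, method = "historical"))
plot(rolling_var)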

Related

R outer function not inserting elements as arguments

I am working on a script that should estimate the probability that at least 2 out of n people have birthdays within k days of each other. To estimate this I have the following function:
birthdayRangeCheck.prob = function(nPeople, seperation, nSimulations) {
  count = 0
  for (i in 1:nSimulations) {
    count = count + birthdayRangeCheck(nPeople, seperation)
  }
  return(count / nSimulations)
}
Now, just entering simple values for nPeople, seperation and nSimulations gives me a sensible number.
e.g.
birthdayRangeCheck.prob(10,4,100)
-> 0.75
However, when I want to plot the probability as a function of nPeople and seperation, I stumble upon the following problem:
x = 1:999
y = 0:998
z = outer(X = x, Y = y, FUN = birthdayRangeCheck.prob, nSimulations = 100)
numerical expression has 576 elements: only the first used... (a lot of times)
So it seems like outer is not passing single elements of x and y, but rather the vectors themselves, which is the opposite of what outer should do, right?
Am I overlooking something? I can't figure out what is causing this error. (Replacing FUN with e.g. sin(x+y) works like a charm, so I did pin it down to the function itself. But since the function works just fine with numeric arguments, I don't see why R doesn't simply pass single elements of x and y as arguments.)
Any help would be greatly appreciated. Thanks ;)
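(One likely explanation, sketched below; this is not from the original thread. outer() expands x and y into full-length vectors and calls FUN once with those vectors, so a function written for scalar inputs has to be wrapped with Vectorize() first. The code assumes birthdayRangeCheck is defined elsewhere, as in the question.)
# Vectorize() makes outer() evaluate the scalar function once per (nPeople, seperation) pair
birthdayRangeCheck.vec <- Vectorize(birthdayRangeCheck.prob,
                                    vectorize.args = c("nPeople", "seperation"))
# a smaller grid is shown here; the full 999 x 999 grid would take very long
z <- outer(X = 2:20, Y = 0:6, FUN = birthdayRangeCheck.vec, nSimulations = 100)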

Rolling weighted average in R (multiple observations)

Is there any fast function that can calculate a weighted rolling average? This is necessary because I have multiple observations (not always the same number) per data point (the data changes by the second), and I average those. When I take the rolling average, I want to re-weight by the number of observations to get an unbiased rolling average.
So far, I came up with this solution (in this example with a window of 3 seconds).
sam <- data.table(val_mean=c(1:15),N=c(11:25))
sam[,weighted:=val_mean*N]
sam[,rollnumerator:=rollapply(weighted,3,sum,fill=NA,align="left")]
sam[,rolldenominator:=rollapply(N,3,sum,fill=NA,align="left")]
sam[,rollnumerator/rolldenominator]
I couldn't find any question that already addresses this problem.
This is not about unequal spacing of the data: I can take care of that by expanding my data.table with NAs to include each second (the example above is equally spaced). Also, I don't want to include weights in the sense of RcppRoll's roll_mean: There, weights are fixed for all time windows ("A vector of length n, giving the weights for each element within a window."), while in my case the weights change according to the values currently processed. Thirdly, I don't want an adaptive window size, it should stay fixed (say at 3 seconds).
1) Use by.column = FALSE:
library(data.table)
library(zoo)
wmean <- function(x) weighted.mean(x[, 1], x[, 2])
sam[, rollapplyr(.SD, 3, wmean, by.column = FALSE, fill = NA, align = "left")]
2) Another approach is to encode the values and weights into a complex vector:
wmean_cmplx <- function(x) weighted.mean(Re(x), Im(x))
sam[, rollapply(complex(real = val_mean, imag = N), 3, wmean_cmplx,
                fill = NA, align = "left")]
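A quick check (assuming the sam, wmean and library calls from above): approach (1) should reproduce the manual numerator/denominator version from the question.
sam[, roll_wmean := rollapplyr(.SD, 3, wmean, by.column = FALSE, fill = NA, align = "left")]
sam[, manual := rollapply(val_mean * N, 3, sum, fill = NA, align = "left") /
                rollapply(N, 3, sum, fill = NA, align = "left")]
all.equal(sam$roll_wmean, sam$manual)  # should be TRUE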

Calculate all distances between two set of points using st_distance

I have two sets of points stored in R as sf objects. Point object x contains 204,467 points and point object y contains 5,297 points.
In theory, I would want to calculate the distance from all points in x to all points in y. I understand that this would create a beast of a matrix, but it is doable using st_distance(x, y, by_element=FALSE) in the sf package in about 40 minutes on my i7 desktop.
What I want to do is calculate the distance from all of the points in x to all of the points in y, then convert this into a data.frame that contains all variables for the respective pair of x and y points. This is because I want flexibility in terms of aggregation using dplyr; for instance, I want to find the number of points in y that are within 10, 50, or 100 km of x, and where x$year < y$year.
I successfully created the distance matrix, which has around 1,083,061,699 cells. I know this is a very inefficient way of doing this, but it gives flexibility in terms of aggregation. Other suggestions are welcome.
Below is code to create two sf point objects, and measure the distance between them. Next, I would want to convert this into a data.frame with all variables from x and y, but this is where I fail to proceed.
If my suggested workflow is infeasible, can someone provide an alternative solution to measure the distance to all points within a predefined radius and create a data.frame of the result with all variables from x and y?
# Create two sf point objects
set.seed(123)
library(sf)
pts1 <- st_as_sf(x = data.frame(id = seq(1, 204467, 1),
                                year = sample(seq(from = 1990, to = 2018, by = 1), size = 204467, replace = TRUE),
                                xcoord = sample(seq(from = -180, to = 180, by = 1), size = 204467, replace = TRUE),
                                ycoord = sample(seq(from = -90, to = 90, by = 1), size = 204467, replace = TRUE)),
                 coords = c("xcoord", "ycoord"), crs = 4326)
pts2 <- st_as_sf(x = data.frame(id = seq(1, 5297, 1),
                                year = sample(seq(from = 1990, to = 2018, by = 1), size = 5297, replace = TRUE),
                                xcoord = sample(seq(from = -180, to = 180, by = 1), size = 5297, replace = TRUE),
                                ycoord = sample(seq(from = -90, to = 90, by = 1), size = 5297, replace = TRUE)),
                 coords = c("xcoord", "ycoord"), crs = 4326)
distmat <- st_distance(pts1,pts2,by_element = FALSE)
I would consider approaching this differently. Once you have your distmat matrix, you can do the types of calculation you describe without needing a data.frame. You can use standard subsetting to find which points meet your specified criteria.
For example, to find the combinations of points where pts1$year is greater than pts2$year we can do:
subset_points = outer(pts1$year, pts2$year, `>`)
Then, to find how many of these are separated more than 100 km, we can do
library(units)
sum(distmat[subset_points] > (100 * as_units('km', 1)))
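Building on that, a hedged sketch of the per-point aggregation described in the question (assuming distmat and subset_points as defined above):
# logical matrix: TRUE where the pair is closer than 100 km
within_100km <- distmat < set_units(100, "km", mode = "standard")
# for each point in pts1, count the pts2 points that satisfy both conditions
counts <- rowSums(within_100km & subset_points)
head(counts)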
A note on memory usage
However you approach this with sf or data.frame objects, the chances are that you will start to bump up against RAM limits with around 1e9 floating-point values in each matrix or in a single column of a data.table. You might think about instead converting your distance matrix to a raster. Then the raster can be stored on disk rather than in memory, and you can utilise the memory-safe functions in the raster package to crunch your way through.
How we might use rasters to work from disk and save RAM
We can use memory-safe raster operations for the very large matrices like this, for example:
library(raster)
# convert our matrices to rasters, so we can work on them from disk
r = raster(matrix(as.numeric(distmat), length(pts1$id), length(pts2$id)))
s = raster(subset_points)
remove('distmat', 'subset_points')
# now create a raster equal to r, but with zeroes in the cells we wish to exclude from calculation
rs = overlay(r,s,fun=function(x,y){x*y}, filename='out1.tif')
# find which cells have value greater than x (1e6 in the example)
Big_cells = reclassify(rs, matrix(c(-Inf, 1e6, 0, 1e6, Inf, 1), ncol=3, byrow=TRUE), 'out.tiff', overwrite=T)
# and finally count the cells
N = cellStats(Big_cells, sum)

Using subsets and whole dataframes simultaneously in a loop

I'm trying to write a function that loops over rows of a dataframe and uses information about other rows to determine the output for each loop.
Consider the following dataframe, which is meant to represent people who have a longitude coordinate, a latitude coordinate, and a value indicating whether or not they are sick:
game.mat<-as.data.frame(matrix(0, nrow = 100, ncol = 3))
colnames(game.mat)<-c("PosX","PosY","Sick")
game.mat[,"PosX"]<-sample(x = c(1:100), 100, replace = TRUE)
game.mat[,"PosY"]<-sample(x = c(1:100), 100, replace = TRUE)
game.mat[,"Sick"]<-sample((c(rep(0,8),rep(1,2))),100,replace=TRUE)
A minority of people will be sick at baseline. My function is meant to infect people whose x-y coordinates neighbor those of a sick person (so anyone who is adjacent to someone who is sick). I considered embedding a function like this in an ifelse statement:
search_sick <- function(d, corx, cory){
  d2 <- d[d$PosX < corx+2 & d$PosX > corx-2 & d$PosY < cory+2 & d$PosY > cory-2, ]
  if(sum(d2$Sick > 0)){
    d$Sick <- 1
  } else{
    d$Sick <- 0
  }
}
But it makes everyone sick, perhaps because it gives everyone a value of 1 if anyone is next to a sick person. I also considered using an apply function, but from what I understand about apply, it will only operate on a single row at a time, so it will be impossible to retrieve information about whether other rows have neighboring coordinate values.
I hope this makes sense. Happy to provide any additional information.
Here's an example using apply
set.seed(1)
game.mat<-as.data.frame(matrix(0, nrow = 100, ncol = 3))
colnames(game.mat)<-c("PosX","PosY","Sick")
game.mat[,"PosX"]<-sample(x = c(1:100), 100, replace = TRUE)
game.mat[,"PosY"]<-sample(x = c(1:100), 100, replace = TRUE)
game.mat[,"Sick"]<-sample((c(rep(0,8),rep(1,2))),100,replace=TRUE)
#plot the sick individuals in red
plot(PosY~PosX, data=game.mat, col=as.factor(Sick), pch=16)
We'll modify your function to have a flexible search radius "r", and to return the indices of the newly infected individuals
search_sick <- function(d, corx, cory, r){
  indx <- which(d$PosX < corx+r & d$PosX > corx-r & d$PosY < cory+r & d$PosY > cory-r)
  indx
}
contagious <- game.mat[game.mat$Sick==1, ]
infected <- apply(contagious, 1, function(x) {
  search_sick(game.mat, x[1], x[2], r=10)
})
game.mat$T1<-game.mat$Sick
game.mat$T1[unique(unlist(infected))]<-1
#circle points which have become sick
points(PosY~PosX, data=game.mat[game.mat$Sick==0 & game.mat$T1==1,], col="red", cex=2)

Error in rollapply: subscript out of bounds

I'd first like to describe my problem:
What I want to do is calculate the number of spikes in prices within a 24-hour window, while I have half-hourly data.
I have seen all Stackoverflow posts like e.g. this one:
Rollapply for time series
(If there are more relevant ones, please let me know ;) )
As I cannot and probably also should not upload my data, here's a minimal example:
I simulate a random variable, convert it to an xts object, and use a user-defined function to detect "spikes" (of course pretty ridiculous in this case, but it illustrates the error).
library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so i can convert it to a xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)
The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.
Then, I defined my own function to find peaks: a peak is a local maximum (higher than the m points around it) AND at least as big as the mean of the time series + h.
This leads to:
find_peaks <- function (x, m, h){
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i){
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1]) & x[i+1] > mean(x) + h) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}
And it works fine. Back to the example:
plot(yDate,y)
#Is supposed to find the points which are higher than 3 points around them
#and higher than the average:
#Does so, so works.
points(yDate[find_peaks(y,3,0)],y[find_peaks(y,3,0)],col="red")
However, using the rollapply() function leads to:
x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds
I first thought that maybe the error occurs because it might run into a negative index for the first points, because of the m parameter. Sadly, setting m to zero does not change the error.
I have tried to trace this error too, but cannot find the source.
Can anyone help me out here?
Edit: A picture of spikes (spikes on the Australian electricity market): find_peaks(20,50) determines the red points to be spikes, while find_peaks(0,50) additionally finds the blue ones to be spikes (therefore, the second parameter h is important, because the blue points are clearly not what we want to analyse when we talk about spikes).
I'm still not entirely sure what it is that you are after. On the assumption that, given a window of data, you want to identify whether its center is greater than the rest of the window while also being greater than the mean of the window + h, you could do the following:
peakfinder = function(x, h = 0){
  xdat = as.numeric(x)
  meandat = mean(xdat)
  center = xdat[ceiling(length(xdat)/2)]
  ifelse(all(center >= xdat) & center >= (meandat + h), center, NA)
}
y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)
Although it would appear to me that if the center point is greater than its neighbours it is necessarily greater than the local mean too, so the mean comparison would only be redundant when h <= 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean passed as an argument to peakfinder.
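To connect this back to the original setting (half-hourly prices, 24-hour window): 24 hours correspond to 48 half-hourly observations, so a sketch along these lines (assuming peakfinder as above, and using an odd width so there is a single center point) might be:
library(xts)
library(zoo)
set.seed(42)
# stand-in for half-hourly price data
halfhourly <- xts(rnorm(200),
                  order.by = seq(as.POSIXct("2020-01-01"), by = "30 min", length.out = 200))
spikes <- rollapply(halfhourly, width = 49, FUN = peakfinder, h = 0, align = "center")
sum(!is.na(spikes))  # number of detected spikes in the sample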
