R: Calculate standard deviation for a specific time interval

I have a dataset with daily bond returns for a number of unique RIC codes (in total approx. 200,000 observations).
Now I want to calculate the standard deviation of those returns over the combined period t-30 to t-6 and t+6 to t+30. That is, for every observation (i, t) I need the 24 returns in the window t-30 to t-6 and the 24 returns in the window t+6 to t+30, and then the standard deviation of those 48 returns.
Here is a small snippet of my dataset:
#My data:
date <- c("2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17","2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17")
ric <- c("AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=")
return <- c(0.001009681, 0.003925873, 0.000354606, -0.000472641, -0.002935700, 0.003750854, 0.012317347, -0.001314047, 0.001014453, -0.007234452)
df <- data.frame(ric, date, return)
I have tried to use the slider package to generate two lists with the returns of the specific time frame. However, I feel that there is some more efficient way to solve this problem. I hope to find some help here.
This is what I tried before:
library(slider)
# returns in the window t-30 to t-6
x <- slide(df$return, ~.x, .before = 30, .after = -6)
# returns in the window t+6 to t+30
y <- slide(df$return, ~.x, .before = -6, .after = 30)
# combine the two windows for each observation
z <- mapply(c, x, y, SIMPLIFY = FALSE)
for (i in seq_along(z)) {
  df$sd[i] <- sd(z[[i]])
}
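One possibly more direct route (a sketch, not a tested solution) is to do everything inside a single grouped mutate(), so the windows are built per RIC and the standard deviation is computed in the same step. This assumes the data are sorted by date within each ric and that the windows are meant as row positions (trading days) rather than calendar days; for calendar-day windows, slide_index() with the date as index would be the analogue.
library(dplyr)
library(slider)

df_sd <- df %>%
  arrange(ric, date) %>%
  group_by(ric) %>%
  mutate(
    pre    = slide(return, ~.x, .before = 30, .after = -6),  # t-30 .. t-6
    post   = slide(return, ~.x, .before = -6, .after = 30),  # t+6 .. t+30
    sd_win = mapply(function(a, b) sd(c(a, b)), pre, post)
  ) %>%
  ungroup() %>%
  select(-pre, -post)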

Related

Subsetting a rasterbrick to give mean of three minimum months in each year

I'm interested in creating two variables from a time series of spatial raster data in R from a NetCDF file. Opening the data, subsetting by Z, and creating a mean value is straightforward:
# Working example below using a ~16mb nc file of sea ice from HadISST
library(R.utils)
library(raster)
library(tidyverse)
download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz","HadISST_ice.nc.gz")
gunzip("HadISST_ice.nc.gz", ext="gz", FUN=gzfile)
hadISST <- brick('HadISST_ice.nc')  # the file downloaded and unzipped above
# subset to a decade time period and create decadal mean from monthly data
hadISST_a <- hadISST %>% subset(., which(getZ(.) >= as.Date("1900-01-01") & getZ(.) <= as.Date("1909-12-31"))) %>% mean(., na.rm = TRUE)
But I'm interested in extracting 1) annual mean values, and 2) the annual mean of the three minimum monthly values for the subsetted time period. My current workflow uses nc_open() and ncvar_get() to open the data, raster::extract() to get the values, and then tidyverse group_by() and slice_min() to get the annual coolest months, but it's a slow and CPU-intensive approach. Is there a more effective way of doing this without converting from raster to data.frame?
Questions:
Using the above code, how can I extract annual means rather than a mean of ALL months over the decadal period?
Is there a way of using slice_min(order_by = sst, n = 3) or similar with brick objects to get the minimum three values per year prior to annual averaging?
Example data
if (!file.exists("HadISST_ice.nc")) {
download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz","HadISST_ice.nc.gz")
R.utils::gunzip("HadISST_ice.nc.gz")
}
library(terra)
hadISST <- rast('HadISST_ice.nc')
Annual mean
y <- format(time(hadISST), "%Y")
m <- tapp(hadISST, y, mean)
Mean of the lowest three monthly values by year (this takes much longer because a user-defined R function is used). I now see that there is a bug in the CRAN version; you can instead use version 1.5-47, which you can install with install.packages('terra', repos='https://rspatial.r-universe.dev').
f <- function(i) mean(sort(i)[1:3])
m3 <- tapp(hadISST, y, f)
To make this faster (if you have multiple cores):
m3 <- tapp(hadISST, y, f, cores=4)
There are likely much more intelligent ways to do this, but here is a thought on your annual means (calling hadISST hadI for brevity):
v <- getValues(hadI)
v_t <- t(v) # this takes a while
v_mean <- vector(mode='numeric')
for(k in 1:nrow(v_t)) {
v_mean[k] = mean(v_t[k, ], na.rm = TRUE)
}
length(v_mean)
[1] 1828
v_mean[1:11]
[1] 0.1651351 0.1593368 0.1600364 0.1890360 0.1931470 0.1995657 0.1982052
[8] 0.1917534 0.1917840 0.1911638 0.1911657
I'm a little unclear on proposition 2, as it seems it would be the average of 3 0's...
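As a hedged aside (not part of the original answer): the transpose-and-loop above computes one mean per layer, so under the same assumptions a vectorised equivalent should give the same numbers, for example:
# per-layer (monthly) means over all cells, equivalent to the loop above
v_mean_vec <- colMeans(v, na.rm = TRUE)
# or computed directly on the brick, without extracting the values first
v_mean_cs <- cellStats(hadI, stat = 'mean', na.rm = TRUE)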

Recursive prediction using vector autoregression

I use the Canada data in the vars package as an example. I want to use the first 48 observations to run the VAR, predict the next quarter, add the predicted data to the original 48 observations, use the 49 observations to run the VAR, predict the next quarter, add the new prediction to the 49 observations, re-run the VAR, and repeat the process until all observations are used. In the end, I hope to generate a data frame that contains the predicted values, which I can use to calculate the RMSE. Below is my code:
library(dplyr)
library(stats)
library(vars)
data(Canada)
Canada_df <- as.data.frame(Canada)
prefit2 <- Canada_df[1:48,]
locs <- data.frame()
while (i <= nrow(Canada_df) & i >=48){
varfit <- VAR(y = prefit2, p = 6, type = 'const')
pred <- predict(varfit, n.ahead =1)
locs <- sapply(pred$fcst[1:3], function (k) k[ , 1]) %>% matrix(nrow = 1) %>% data.frame()
colnames(locs) <- colnames(prefit)
prefit2 <- bind_rows(prefit2, locs)
i = i + 1
}
When I run the code, no error is generated. However, there are still 48 observations in prefit2, meaning no prediction was appended during the while loop. Additionally, i turns out to be 5, so it looks like the loop stops when it becomes 5. I am not sure where the problem is.
I tried your code and it actually ran as you expected. The only thing worth mentioning is that you need to declare i before the while loop; in this case, for example, you want to set i <- 48.
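For completeness, here is a rough sketch of how the corrected loop might look. The assumptions are mine: the counter starts at 48, the column names are taken from prefit2 (prefit is never defined in the question), and all four Canada series are forecast rather than only the first three.
library(dplyr)
library(vars)

data(Canada)
Canada_df <- as.data.frame(Canada)
prefit2 <- Canada_df[1:48, ]

i <- 48                                  # declare the counter before the loop
while (i < nrow(Canada_df)) {            # stop once prefit2 is as long as Canada_df
  varfit <- VAR(y = prefit2, p = 6, type = "const")
  pred   <- predict(varfit, n.ahead = 1)
  # point forecast (first column) of every series, as a one-row data frame
  locs <- sapply(pred$fcst, function(k) k[, 1]) %>% matrix(nrow = 1) %>% data.frame()
  colnames(locs) <- colnames(prefit2)
  prefit2 <- bind_rows(prefit2, locs)
  i <- i + 1
}
# rows 49 onward of prefit2 now hold the recursive one-step-ahead predictions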

Standard deviation in every quarter-hour in R

I have raw data of power system frequency: 86,400 values (one per second over a full day).
frequency=a$Ist_Frq
plot.ts(frequency, main="System frequency [Hz]", xlab="Time [s]")
See example: [plot of the raw frequency data]
Now I have to determine the quarter-hour time intervals.
frequency=ts(a$Ist_Frq, start=1, frequency=900)
[plot of the quarter-hour time intervals]
My question is:
Is there any way to determine the standard deviation for every quarter-hour?
Thanks for your answers.
There are probably several solutions to this problem; here is one:
#some data
x <- rnorm(10000)
#identify quarter hour segments
y <- rep(1:ceiling(length(x)/(15 * 60)), each = 15 * 60)[1:length(x)]
#use tapply to find sd of x for every value of y
tapply(x, y, sd)
NB: the last segment might be based on fewer than 900 values.
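Applied to the data from the question (assuming a$Ist_Frq holds the 86,400 one-per-second values), the same idea would look roughly like this:
frequency <- a$Ist_Frq
# label each observation with its quarter-hour segment (900 seconds each)
segment <- rep(1:ceiling(length(frequency) / 900), each = 900)[1:length(frequency)]
# standard deviation of the frequency within each quarter-hour
tapply(frequency, segment, sd)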

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical but one is lagging the other. The arrows suggest a phase angle of 45 degrees - apparently pointing down or up means 90 degrees (in or out of phase) so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one calculate the time lag?
So I believe you're asking how you can determine what the lag time is given two time series (in this case you artificially added in a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sine waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between the zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easily calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, borrowed from the msProcess package
zeroCross <- function(x, slope="positive")
{
checkVectorType(x,"numeric")
checkScalarType(slope,"character")
slope <- match.arg(slope,c("positive","negative"))
slope <- match.arg(lowerCase(slope), c("positive","negative"))
ipost <- ifelse1(slope == "negative", sort(which(c(x, 0) < 0 & c(0, x) > 0)),
sort(which(c(x, 0) > 0 & c(0, x) < 0)))
offset <- apply(matrix(abs(x[c(ipost-1, ipost)]), nrow=2, byrow=TRUE), MARGIN=2, order)[1,] - 2
ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more elegant ways of going about this, but it should get you the information that you need.
In my case, for a semidiurnal tidal wave, 90 degrees corresponds to about 3 hours (90 * 12.5 hours / 360 = 3.125 hours), where 12.5 hours is the semidiurnal period. So 45 degrees corresponds to 45 * 12.5 / 360 = 1.56 hours.
Thus in your case, with a period of 365 days:
90 degrees -> 90 * 365 / 360 = 91.25 days.
45 degrees -> 45 * 365 / 360 = 45.625 days.
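In R the conversion is simply the phase fraction multiplied by the period; a minimal sketch, using the 45-degree angle and 365-day period assumed in the question:
phase_deg <- 45                       # phase angle read off the arrows
period    <- 365                      # period of interest, in days
lag_days  <- phase_deg / 360 * period
lag_days                              # 45.625 days, close to the 49-day lag added above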
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect that the oscillations are phase-locked (Grinsted 2004); so, the period where you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between cause-and-effect phenomena; observe that the greater the distance, the greater the period at which the "in phase" arrow occurs in the cross wavelet transform.
Grinsted et al. (2004), Nonlinear Processes in Geophysics 11: 561–566, SRef-ID: 1607-7946/npg/2004-11-561.

Generate table with proportions of success vs failure

I want to calculate the error rate by interval, where 0 is good and 1 is bad. Say I have a sample of observations with levels between 1 and 100, divided into intervals as follows:
X <- 10; q<-sample(c(0,1), replace=TRUE, size=X)
l <- sample(c(1:100),replace=T,size=10)
bornes<-seq(min(l),max(l),5)
v <- cut(l,breaks=bornes,include.lowest=T)
table(v)
How can I get a table or a function that calculates the default rate for each interval, i.e. the number of bad observations divided by the total number of observations?
tx_erreur<-function(x){
t<-table(x,q)
return(sum(t[,2])/sum(t))
}
I already tried this code above and tapply.
Thank you!
I think you want this:
tapply(q,  # the variable to be summarized
       v,  # the variable that defines the bins
       function(x) sum(x) / length(x))  # the summary statistic within each bin
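Since q is coded as 0/1, the proportion of bad observations in a bin is just its mean, so the same result should come from:
tapply(q, v, mean)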
