I am new to R and am trying to calculate the bootstrapped standard deviation (sd) and associated standard error within a 30-observation rolling window. The function below performs the rolling window appropriately if I just want sd, but when I add the bootstrap function from the boot package I get the error specified below. I gather that I am trying to store bootstrap results in a vector that isn't the correct size. Does anyone have any advice on how to store just the bootstrapped sd and associated standard error for each window in rows of a new matrix? The goal is to then plot the sd and associated 95% confidence intervals for each window along the time series. Thanks in advance for any help.
> head(data.srs)
LOGFISH
1 0.8274083
2 1.0853433
3 0.8049845
4 0.8912097
5 1.3514569
6 0.8694499
###Function to apply rolling window
rollWin <- function(timeSeries, windowLength)
{
  data <- timeSeries
  nOut <- length(data[, 1]) - windowLength + 1
  out <- numeric(nOut)
  if (length(data[, 1]) >= windowLength)
  {
    for (i in 1:nOut)
    {
      sd.fun <- function(data, d) sd(data[d], na.rm = TRUE)
      out[i] <- boot(data[i:(i + windowLength - 1), ], sd.fun, R = 1000)
    }
  }
  return(list(result = out))
}
###run rolling window function, e.g. rollWin(data, windowLength)
a.temp <- rollWin(data.srs, 30)
> warnings()
Warning messages:
1: In out[i] <- boot(data[i:(i + windowLength - 1), ], sd.fun, ... :
number of items to replace is not a multiple of replacement length
You can simplify it quite a lot. I am not familiar with the boot package, but we can roll a function along a vector using the rollapply function quite easily, and then we can make bootstrap samples using the replicate function:
# Create some data, 12 items long
r <- runif(12)
# [1] 0.44997964 0.27425412 0.07327872 0.68054759 0.33577348 0.49239478
# [7] 0.93421646 0.19633079 0.45144966 0.53673296 0.71813017 0.85270346
require(zoo)
# use rollapply to calculate a function along a moving window
# width is the width of the window
sds <- rollapply( r , width = 4 , by = 1 , sd )
#[1] 0.19736258 0.26592331 0.16770025 0.12585750 0.13730946 0.08488467
#[7] 0.16073722 0.22460430 0.22462168
# Now we use replicate to repeatedly evaluate a bootstrap sampling method
# 'n' is the number of bootstrap replications
n <- 4
replicate( n , rollapply( r , width = 4 , function(x) sd( x[ sample( length(x) , replace = TRUE ) ] ) ) )
# [,1] [,2] [,3] [,4]
# [1,] 0.17934073 0.1815371 0.11603320 0.2992379
# [2,] 0.03551822 0.2862702 0.18492837 0.2526193
# [3,] 0.09042535 0.2419768 0.13124738 0.1666012
# [4,] 0.17238705 0.1410475 0.18136178 0.2457248
# [5,] 0.32008385 0.1709326 0.32909368 0.2550859
# [6,] 0.30832533 0.1480320 0.02363968 0.1275594
# [7,] 0.23069951 0.1275594 0.25648052 0.3016909
# [8,] 0.11235170 0.2493055 0.26089969 0.3012610
# [9,] 0.16819174 0.2099518 0.18033502 0.0906986
Each column is one bootstrap replicate of the rollapply: within each window the observations are resampled with replacement before sd is applied, so each row holds the bootstrap distribution of sd for that window.
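To get all the way to the original goal (the per-window sd together with a bootstrap standard error and 95% confidence interval), here is a minimal sketch building on the replicate matrix above; it assumes zoo is loaded and uses R = 1000 replicates, with the row-wise spread of the replicates estimating the standard error and row-wise quantiles giving percentile intervals:
boots <- replicate( 1000 , rollapply( r , width = 4 , function(x) sd( x[ sample( length(x) , replace = TRUE ) ] ) ) )
# each row of boots is one window, each column one bootstrap replicate
se <- apply( boots , 1 , sd )                               # bootstrap standard error per window
ci <- t( apply( boots , 1 , quantile , c(0.025 , 0.975) ) ) # percentile 95% CI per window
cbind( sd = sds , se = se , lower = ci[ , 1 ] , upper = ci[ , 2 ] )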
`values <- matrix(c(0.174,0.349,1.075,3.1424,0.173,0.346,1.038,3.114,0.171,0.343,1.03,3.09,0.17,0.34,1.02,3.06),ncol=4) `
I am attempting to maximize the total value over the dataset, taking exactly one value from each row, with an associated cost for each column,
subject to:
One value column used per row.
cost of each use of column 1 is 4
cost of each use of column 2 is 3
cost of each use of column 3 is 2
cost of each use of column 4 is 1
total cost <= 11
These are stand-in values for a larger dataset; I need to be able to apply the approach directly to all the rows of a dataset.
I have been trying to use the lpSolve package, with no success.
`f.obj <- values
f.con <- c(4,3,2,1)
f.dir <- "<="
f.rhs <- 11
lp("max", f.obj, f.con, f.dir, f.rhs)`
I am getting a solution of 0.
I do not know how to model this in a way that chooses one value per row and then uses a different value in calculating the constraints.
Looks like the problem is as follows:
We have a matrix a[i,j] with values, and a vector c[j] with costs.
We want to select one value for each row such that:
a. total cost <= 11
b. total value is maximized
To develop a mathematical model, we introduce binary variables x[i,j] ∈ {0,1}. With this, we can write:
max sum((i,j), a[i,j]*x[i,j])
subject to
sum((i,j), c[j]*x[i,j]) <= 11
sum(j, x[i,j]) = 1 ∀i
x[i,j] ∈ {0,1}
Implementation in R: here I use CVXR.
#
# data
# A : values
# C : cost
#
A <- matrix(c(0.174,0.349,1.075,3.1424,0.173,0.346,1.038,3.114,0.171,0.343,1.03,3.09,0.17,0.34,1.02,3.06),ncol=4)
C <- c(4,3,2,1)
maxcost <- 11
#
# form a matrix cmat[i,j] indicating the cost of element i,j
#
cmat <- matrix(C,nrow=dim(A)[1],ncol=dim(A)[2],byrow=T)
#
# problem:
# pick one value from each row
# such that total value of selected cells is maximized
# and cost of selected cells is limited to maxcost
#
# model:
# max sum((i,j), a[i,j]*x[i,j])
# subject to
# sum((i,j), c[j]*x[i,j]) <= maxcost
# sum(j,x[i,j]) = 1 ∀i
# x[i,j] ∈ {0,1}
#
#
library(CVXR)
x = Variable(dim(A), name="x", boolean=T)
p <- Problem(Maximize(sum_entries(A*x)),
             constraints = list(
               sum_entries(cmat*x) <= maxcost,
               sum_entries(x, axis=1) == 1
             ))
res <- solve(p,verbose=T)
res$status
res$value
res$getValue(x)*A
The output looks like:
> res$status
[1] "optimal"
> res$value
[1] 4.7304
> res$getValue(x)*A
[,1] [,2] [,3] [,4]
[1,] 0.0000 0 0.000 0.17
[2,] 0.0000 0 0.343 0.00
[3,] 1.0750 0 0.000 0.00
[4,] 3.1424 0 0.000 0.00
The description in the original post is not very precise. For instance, I assumed that we need to select precisely one cell from each row. If we just want "select at most one cell from each row", then replace
sum(j, x[i,j]) = 1 ∀i
by
sum(j, x[i,j]) <= 1 ∀i
As mentioned by Steve, the lpSolve package expects a single objective vector, not a matrix. You could reformulate as maximizing sum(values[i,j] * x[i,j]) subject to the given constraints.
E.g., change the matrix to a vector and change the problem to an integer optimization problem:
library(lpSolve)
obj <- as.vector(values)       # column-major, so costs repeat in blocks of 4
f.con <- rep(f.con, each = 4)  # reuses f.con, f.dir, f.rhs from the question
r <- lp('max', obj, matrix(f.con, nrow = 1), f.dir, f.rhs, int.vec = seq_along(obj))
#' Success: the objective function is 9.899925
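For completeness, here is a hedged sketch of the full binary model from the CVXR answer solved with lpSolve, now including the one-value-per-row constraints; the variable layout is column-major to match as.vector(values), and all.bin = TRUE makes every x[i,j] a 0/1 variable:
library(lpSolve)
n <- nrow(values)                          # 4 rows
obj <- as.vector(values)                   # column-major: x[1,1], x[2,1], ..., x[4,4]
cost_row <- rep(c(4, 3, 2, 1), each = n)   # cost of the column each element belongs to
row_id <- rep(1:n, times = ncol(values))   # which data row each element belongs to
row_con <- t(sapply(1:n, function(i) as.numeric(row_id == i)))  # sum_j x[i,j] == 1
con <- rbind(cost_row, row_con)
dir <- c("<=", rep("==", n))
rhs <- c(11, rep(1, n))
sol <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
sol$objval                                 # should reproduce the 4.7304 optimum from CVXR
matrix(sol$solution, ncol = ncol(values))  # the selection matrix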
If x is an n*m matrix, then cor(x) gives an m*m correlation matrix between each pair of columns.
How can I use the cor.test function on the n*m matrix to also obtain an m*m p-value matrix?
There may be an existing function, but here's my version. p_cor_mat runs cor.test on each pair of columns in matrix x and records the p-value. These are then put into a square matrix and returned.
# Set seed
set.seed(42)
# Matrix of data
x <- matrix(runif(120), ncol = 4)
# Function for creating p value matrix
p_cor_mat <- function(x){
# All combinations of columns
colcom <- t(combn(1:ncol(x), 2))
# Calculate p values
p_vals <- apply(colcom, MAR = 1, function(i)cor.test(x[,i[1]], x[,i[2]])$p.value)
# Create matrix for result
p_mat <- diag(ncol(x))
# Fill upper & lower triangles
p_mat[colcom] <- p_mat[colcom[,2:1]] <- p_vals
# Return result
p_mat
}
# Test function
p_cor_mat(x)
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0000000 0.4495713 0.9071164 0.8462530
#> [2,] 0.4495713 1.0000000 0.5960786 0.7093539
#> [3,] 0.9071164 0.5960786 1.0000000 0.7466226
#> [4,] 0.8462530 0.7093539 0.7466226 1.0000000
Created on 2019-03-06 by the reprex package (v0.2.1)
Please also see the cor.mtest() function in the corrplot package.
https://www.rdocumentation.org/packages/corrplot/versions/0.92/topics/cor.mtest
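For reference, a brief usage sketch (assuming a recent corrplot, where cor.mtest returns a list whose p element holds the p-value matrix):
library(corrplot)
res <- cor.mtest(x, conf.level = 0.95)
res$p  # m*m matrix of p-values, laid out like cor(x)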
I’m trying to replicate the Matlab ewstats function defined here:
https://it.mathworks.com/help/finance/ewstats.html
The results given by Matlab are the following ones:
> ExpReturn = 1×2
0.1995 0.1002
> ExpCovariance = 2×2
0.0032 -0.0017
-0.0017 0.0010
I’m trying to replicate the example with the RiskPortfolios R package:
https://cran.r-project.org/web/packages/RiskPortfolios/RiskPortfolios.pdf
The R code I’m using is this one:
library(RiskPortfolios)
rets <- as.matrix(cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13)))
w <- 0.98
rets
w
meanEstimation(rets, control = list(type = 'ewma', lambda = w))
covEstimation(rets, control = list(type = 'ewma', lambda = w))
The mean estimate is the same as the one in the example, but the covariance matrix is different:
> rets
[,1] [,2]
[1,] 0.24 0.08
[2,] 0.15 0.13
[3,] 0.27 0.06
[4,] 0.14 0.13
> w
[1] 0.98
>
> meanEstimation(rets, control = list(type = 'ewma', lambda = w))
[1] 0.1995434 0.1002031
>
> covEstimation(rets, control = list(type = 'ewma', lambda = w))
[,1] [,2]
[1,] 0.007045044 -0.003857217
[2,] -0.003857217 0.002123827
Am I missing something?
Thanks
They give the same answer if type = "lw" is used:
round(covEstimation(rets, control = list(type = 'lw')), 4)
## 0.0032 -0.0017
## -0.0017 0.0010
They are using different algorithms. From the RiskPortfolios manual:
ewma ... See RiskMetrics (1996)
From the Matlab help page:
There is no relationship between ewstats function and the RiskMetrics® approach for determining the expected return and covariance from a return time series.
Unfortunately Matlab does not tell us which algorithm is used.
For those who eventually need an equivalent ewstats function in R, here is the code I wrote:
ewstats <- function(RetSeries, DecayFactor=NULL, WindowLength=NULL){
#EWSTATS Expected return and covariance from return time series.
# Optional exponential weighting emphasizes more recent data.
#
# [ExpReturn, ExpCovariance, NumEffObs] = ewstats(RetSeries, ...
# DecayFactor, WindowLength)
#
# Inputs:
# RetSeries : NUMOBS by NASSETS matrix of equally spaced incremental
# return observations. The first row is the oldest observation, and the
# last row is the most recent.
#
# DecayFactor : Controls how much less each observation is weighted than its
# successor. The k'th observation back in time has weight DecayFactor^k.
# DecayFactor must lie in the range: 0 < DecayFactor <= 1.
# The default is DecayFactor = 1, which is the equally weighted linear
# moving average Model (BIS).
#
# WindowLength: The number of recent observations used in
# the computation. The default is all NUMOBS observations.
#
# Outputs:
# ExpReturn : 1 by NASSETS estimated expected returns.
#
# ExpCovariance : NASSETS by NASSETS estimated covariance matrix.
#
# NumEffObs: The number of effective observations is given by the formula:
# NumEffObs = (1-DecayFactor^WindowLength)/(1-DecayFactor). Smaller
# DecayFactors or WindowLengths emphasize recent data more strongly, but
# use less of the available data set.
#
# The standard deviations of the asset return processes are given by:
# STDVec = sqrt(diag(ECov)). The correlation matrix is :
# CorrMat = VarMat./( STDVec*STDVec' )
#
# See also MEAN, COV, COV2CORR.
NumObs <- dim(RetSeries)[1]
NumSeries <- dim(RetSeries)[2]
# size the series and the window
if (is.null(WindowLength)) {
WindowLength <- NumObs
}
if (is.null(DecayFactor)) {
DecayFactor = 1
}
if (DecayFactor <= 0 || DecayFactor > 1) {
stop('Must have 0 < DecayFactor <= 1.')
}
if (WindowLength > NumObs){
stop(sprintf('Window Length %d must be <= number of observations %d',
WindowLength, NumObs))
}
# ------------------------------------------------------------------------
# size the data to the window
RetSeries <- RetSeries[(NumObs-WindowLength+1):NumObs, , drop = FALSE]  # parentheses needed: ":" binds tighter than "-"
# Calculate decay coefficients
DecayPowers <- seq(WindowLength-1, 0, by = -1)
VarWts <- sqrt(DecayFactor)^DecayPowers
RetWts <- (DecayFactor)^DecayPowers
NEff = sum(RetWts) # number of equivalent values in computation
# Compute the exponentially weighted mean return
WtSeries <- matrix(rep(RetWts, times = NumSeries),
nrow = length(RetWts), ncol = NumSeries) * RetSeries
ERet <- colSums(WtSeries)/NEff;
# Subtract the weighted mean from the original Series
CenteredSeries <- RetSeries - matrix(rep(ERet, each = WindowLength),
nrow = WindowLength, ncol = length(ERet))
# Compute the weighted variance
WtSeries <- matrix(rep(VarWts, times = NumSeries),
nrow = length(VarWts), ncol = NumSeries) * CenteredSeries
ECov <- t(WtSeries) %*% WtSeries / NEff
list(ExpReturn = ERet, ExpCovariance = ECov, NumEffObs = NEff)
}
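As a quick sanity check, running it on the example data from the question with the decay factor of 0.98 should reproduce the Matlab output (rounded to 4 digits):
rets <- cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13))
res <- ewstats(rets, DecayFactor = 0.98)
round(res$ExpReturn, 4)      # 0.1995 0.1002
round(res$ExpCovariance, 4)  # 0.0032 -0.0017 / -0.0017 0.0010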
I'm trying to write a loop that will iterate through my vector of doubles and calculate the standard deviation of every group of 5 values. Below is the code I've written to do so; however, when I attempt to run it, the majority of my output comes out as NA, which isn't accurate.
data is a large matrix with 53412 elements; it should be approx 1175 rows.
for(i in floor((nrow(data)/5)-5)){sd5[i] <-sd(data[seq((5*i) + 1,(5*i) + 5),6])}
I've attempted to iterate through it manually, just executing the following in the console:
sd(data[seq((5) + 1,(5*i) + 5),6])
sd(data[seq((10) + 1,(10) + 5),6])
sd(data[seq((15) + 1,(15) + 5),6])
Each of those operates properly; however, when I attempted to do it with the loop, it results in NA for the majority of my data, including the 2nd and 3rd lines in the code block above.
Here are a few lines from the CSV it is reading:
2016-04-01,108.779999,110.00,108.199997,109.989998,25626200,109.989998
2016-03-31,109.720001,109.900002,108.879997,108.989998,25685700,108.989998
2016-03-30,108.650002,110.419998,108.599998,109.559998,45159900,109.559998
2016-03-29,104.889999,107.790001,104.879997,107.68,30774100,107.68
2016-03-28,106.00,106.190002,105.059998,105.190002,19303600,105.190002
Just in case, I wanted to point out that I am grabbing the correct values from the CSV file, at least when I manually execute sd(), as I've compared the console output to the CSV file. However, that doesn't mean I'm not iterating incorrectly in a way that I just can't seem to find. I set the loop to round down to avoid any out-of-bounds errors.
No need for a for loop. (Incidentally, the NAs come from for(i in floor((nrow(data)/5)-5)), which loops over the single value floor((nrow(data)/5)-5) rather than the sequence 1:floor(...), so only one element of sd5 is ever assigned.)
If the vector is stored as x, you could do:
NN <- length(x)                 # original length
x <- x[1:(5 * floor(NN / 5))]   # truncate to a multiple of 5
dim(x) <- c(5, length(x) / 5)   # recast as a 5-row matrix, one column per group
apply(x, 2, sd)                 # sd of each group of 5
If it's in a data.frame, I'd use data.table (especially since sd is GForce-optimized in the current devel version):
library(data.table); setDT(data)
data[ , sd(x), by = .(grp = (0:(length(x) - 1) %/% 5))]
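Applied to the question's data, a hedged sketch (assuming setDT(data) from above has run and the CSV was read with header = FALSE, so the price column of interest is named V6):
data[ , .(sd5 = sd(V6)), by = .(grp = (seq_along(V6) - 1) %/% 5)]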
You can just recast the data into a 5-column (or 5-row) matrix and apply the function to the rows (or columns).
And since the matrix is so large, you can use the matrixStats library:
mm <- read.csv(header = FALSE, text = "2016-04-01,108.779999,110.00,108.199997,109.989998,25626200,109.989998
2016-03-31,109.720001,109.900002,108.879997,108.989998,25685700,108.989998
2016-03-30,108.650002,110.419998,108.599998,109.559998,45159900,109.559998
2016-03-29,104.889999,107.790001,104.879997,107.68,30774100,107.68
2016-03-28,106.00,106.190002,105.059998,105.190002,19303600,105.190002")
set.seed(1)
mm <- mm[, -1]
mm <- matrix(sample(unlist(mm), 1500 * 55000, TRUE), 1500)
# num [1:1500, 1:55000] 110 109 110 110 110 ...
m2 <- matrix(mm, ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 110.42 108.88 109.56 109.56 109.90
# [2,] 108.99 107.68 105.19 107.68 109.72
# [3,] 109.90 110.00 25626200.00 108.88 30774100.00
# [4,] 105.06 25685700.00 105.19 108.88 30774100.00
# [5,] 107.68 109.90 105.19 104.89 107.79
# [6,] 108.88 108.78 108.88 108.99 108.20
system.time({
sds <- apply(m2, 1, sd)
})
# user system elapsed
## a damn long time
# Timing stopped at: 114.028 0.81 115.398
library('matrixStats')
system.time({
sds <- rowSds(m2)
})
# user system elapsed
# 0.347 0.051 0.402
head(sds)
# [1] 5.620328e-01 1.726982e+00 1.555266e+07 1.556640e+07 2.072692e+00 3.141340e-01
I have a sample dataframe:
data<-data.frame(a=c(1,2,3),b=c(4,5,5),c=c(6,8,7),d=c(8,9,10))
and wish to calculate the z-scores for every row in the data frame, so I did:
scores<-apply(data,1,zscore)
I used the zscore function from
install.packages(c("R.basic"), contriburl="http://www.braju.com/R/repos/")
and obtained this:
row.names V1 V2 V3
a -1.2558275 -1.2649111 -1.0883839
b -0.2511655 -0.3162278 -0.4186092
c 0.4186092 0.6324555 0.2511655
d 1.0883839 0.9486833 1.2558275
But when I try manually calculating the z score for the first row of the data frame I obtain the following values:
-1.45 -0.29 0.4844, 1.25
Manually, for the first row, I calculated as follows:
1) calculate the row mean (4.75) for the first row
2) subtract each value from the row mean (e.g., 4.75-1, 4.75-4, 4.75-6, 4.75-8)
3) square each difference
4) add them up and divide by the number of samples in row 1
5) thus I obtain the variance (answer = 6.685) and then get the standard deviation (2.58) of the first row alone
6) then apply the z-score formula
The zscore function, whatever it is, seems to be the same as scale in the base package.
apply(data, 1, scale)
## [,1] [,2] [,3]
## [1,] -1.2558275 -1.2649111 -1.0883839
## [2,] -0.2511655 -0.3162278 -0.4186092
## [3,] 0.4186092 0.6324555 0.2511655
## [4,] 1.0883839 0.9486833 1.2558275
For each column, it is calculating (x - mean(x)) / sd(x). Note that sd() uses the sample standard deviation (denominator n - 1), whereas the manual calculation divided by n (the population version); that is exactly why the manual values (-1.45, ...) differ from scale's (-1.2558, ...).
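A quick numerical check of that difference on the first row of the example data (a minimal sketch):
x <- c(1, 4, 6, 8)                           # first row of data
(x - mean(x)) / sd(x)                        # sample sd (n - 1): -1.2558 -0.2512 0.4186 1.0884
(x - mean(x)) / sqrt(mean((x - mean(x))^2))  # population sd (n): -1.4501 -0.2900 0.4834 1.2568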