I have several variables in my dataset that represent daily timing of events across a week.
For example, two rows might look like:
t1 = c(NA,12.6,10.7,11.5,12.5,9.5,14.1)
t2 = c(23.7,1.2,NA,22.9,23.2,0.5,0.1)
I want to calculate the variance of these rows. To do this, I need the mean, and because these are periodic variables, I've adapted the code from this page:
#This can all be wrapped in a function like this
circ.mean <- function(m,int,na.rm=T) {
  if(na.rm) m <- m[!is.na(m)]
  rad.m = m*(360/int)*(pi/180)
  mean.cos = mean(cos(rad.m))
  mean.sin = mean(sin(rad.m))
  x.deg = atan(mean.sin/mean.cos)*(180/pi)
  return(x.deg/(360/int))
}
This works as expected for t2:
> circ.mean(t2,24)
[1] -0.06803088
although ideally the answer would be 23.93197. But for t1, it gives an incorrect answer:
> circ.mean(t1,24)
[1] -0.1810074
whereas using the normal mean function gives the right answer:
> mean(t1,na.rm=T)
[1] 11.81667
My questions are:
1) Is this "circular mean" code correct and if so, am I using it correctly?
2) I've had a stab at my own circ.var function (see below) to calculate the variance of a periodic variable. Will this produce the correct variances for all possible input timing vectors?
circ.var <- function(m,int=NULL,na.rm=TRUE) {
  if(is.null(int)) stop("Period parameter missing")
  if(na.rm) m <- m[!is.na(m)]
  if(sum(!is.na(m))==0) return(NA)
  n = length(m)
  mean.m = circ.mean(m,int)
  var.m = 1/(n-1)*sum((((m-mean.m+(int/2))%%int)-(int/2))^2)
  return(var.m)
}
Any help would be hugely appreciated! Thanks for taking the time to read this!

I deleted my old answer, as I believe there was a mistake in the solution I provided.
I've written a series of R scripts that I've made available at my GitHub page which should calculate the mean, variance and other stats.
Thanks to @Gregor for his help.
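A possible explanation for the t1 result in the question, offered as a sketch rather than a definitive fix: atan() only sees the ratio mean.sin/mean.cos, so it cannot tell an angle near half a period from one near zero, whereas atan2() takes the two components separately and keeps the quadrant. The name circ.mean2 is made up for illustration and is not part of any package.

```r
# Sketch of a circular mean that keeps the quadrant by using atan2().
# circ.mean2 is a hypothetical name, not the code from the linked page.
circ.mean2 <- function(m, int, na.rm = TRUE) {
  if (na.rm) m <- m[!is.na(m)]
  rad.m <- m * 2 * pi / int                         # map [0, int) onto [0, 2*pi)
  ang <- atan2(mean(sin(rad.m)), mean(cos(rad.m)))  # mean angle in (-pi, pi]
  (ang * int / (2 * pi)) %% int                     # back to original units, wrapped into [0, int)
}

t1 <- c(NA, 12.6, 10.7, 11.5, 12.5, 9.5, 14.1)
t2 <- c(23.7, 1.2, NA, 22.9, 23.2, 0.5, 0.1)
circ.mean2(t1, 24)
circ.mean2(t2, 24)
```

With the two vectors from the question this gives roughly 11.82 for t1 and 23.93 for t2. The final %% int is what turns a small negative value such as -0.068 into 23.93.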


Taylor diagram using JJAS mean values in R

I need some assistance with the following issue.
I am drawing a Taylor diagram in R. I calculated the JJAS mean precipitation values (mm/day) for the observations and two models, then entered those values manually to produce the Taylor diagram. It gives me output, but the result does not seem right because the standard deviation values are too low (sample is attached).
This is the code I am using:
An alternative would be to use the netCDF files of the observations and models directly, but I don't know how to extract the precipitation data and use it (I know how to view netCDF data in R, but extraction is challenging for me at this stage).
Kindly help me solve this problem.
I have no knowledge about this but a cursory look at the code of taylor.diagram clarifies what is going on here.
If you enter taylor.diagram without parentheses in the console it will print the function. Around line 15 you will find this function to calculate SD:
SD <- function(x, subn) {
  meanx <- mean(x, na.rm = TRUE)
  devx <- x - meanx
  ssd <- sqrt(sum(devx * devx, na.rm = TRUE)/(length(x[!is.na(x)]) -
    subn))
  return(ssd)
}
We can run this function with parameter subn as TRUE or FALSE (in R, TRUE equates to 1 and FALSE equates to 0):
> SD(ref, TRUE)
[1] 0.4505061
> SD(ref, FALSE)
[1] 0.3901498
> SD(model1, FALSE)
[1] 0.2798994
> SD(model1, TRUE)
[1] 0.3232
And from this we can see that subn is set to FALSE. Further inspection of the code shows:
subn <- sd.method != "sample"
In other words: if sd.method equals sample (the default value) then subn will be FALSE.
It's up to you to decide what is the correct choice here.
One of the great things about R is that functions can be inspected at the console. Doing so can resolve most questions of the form 'why is this function behaving like this' with limited effort.
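To make the difference between the two divisors concrete, here is a small sketch. The vector x is made-up data, since the ref and model vectors from the question are not shown; only the arithmetic is the point.

```r
# Hypothetical data; the 'ref' vector from the question is not available here.
x <- c(2.1, 2.9, 3.4, 2.6)

n    <- sum(!is.na(x))
dev2 <- sum((x - mean(x, na.rm = TRUE))^2, na.rm = TRUE)

sd_n1 <- sqrt(dev2 / (n - 1))   # divisor n - 1 (subn = TRUE); same as stats::sd()
sd_n  <- sqrt(dev2 / n)         # divisor n     (subn = FALSE)

all.equal(sd_n1, sd(x))         # TRUE
sd_n1 / sd_n                    # equals sqrt(n / (n - 1))
```

The ratio between the two results is always sqrt(n / (n - 1)). The outputs quoted above (0.4505061 against 0.3901498) differ by a factor of about 1.1547, which is sqrt(4/3), so they are consistent with four non-missing values. With so few points the choice of divisor matters a lot.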

Converting R code to MATLAB code: Stuck at sapply()

I have the following R code, which I am trying to convert to MATLAB. (No, I do not want to run the R code in MATLAB like shown here).
The R code is here:
# model parameters
dt <- 0.001
t <- seq(dt,0.3,dt)
D = 1
d = 0.5
# model
ft <- n*d/sqrt(2*D*t^3)*dnorm(d/sqrt(2*D*t),0,1)
fmids <- n*d/sqrt(2*D*(t+dt/2)^3)*dnorm(d/sqrt(2*D*(t+dt/2)),0,1)
# simulation
# simulation by drawing from uniform distribution
# and converting to time by using quantile function of normal distribution
ps <- runif(n,0,1)
ts <- 2*pnorm(-d/sqrt(2*D*t))
sumn <- sapply(ts, FUN = function(tb) sum(ps < tb))
And the MATLAB code I have done so far is
% # model
ft = (n*d)./sqrt(2*D.*t.^3).*normpdf(d./sqrt(2*D.*t),0,1);
fmids = (n*d)./sqrt(2*D*(t+dt/2).^3).*normpdf(d./sqrt(2*D*(t+dt/2)),0,1);
% # simulation
% #
% # simulation by drawing from uniform distribution
% # and converting to time by using quantile function of normal distribution
ps = rand(1,n);
ts = 2*normcdf(-d./sqrt(2*D*t));
So, here is where I am stuck. I don't understand what the function sumn = sapply(ts, FUN = function(tb) sum(ps < tb)) does or where the parameter 'tb' comes from. It is not defined anywhere else in the given R code either.
Could anyone tell me what the equivalent of that function R code is in MATLAB?
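For readers unsure what the sapply() call does, here is a small R sketch with made-up values for ps and ts (the seed, sizes and thresholds are illustrative only). It shows that tb is just the argument name of the anonymous function, and that the call counts, for each threshold in ts, how many elements of ps fall below it.

```r
set.seed(1)                  # hypothetical seed, for reproducibility only
ps <- runif(10)              # stand-in for the uniform draws
ts <- c(0.25, 0.5, 0.75)     # stand-in thresholds

# sapply() calls the anonymous function once per element of ts,
# binding that element to the argument tb each time.
sumn <- sapply(ts, FUN = function(tb) sum(ps < tb))

# The same counts written as an explicit loop
sumn_loop <- integer(length(ts))
for (i in seq_along(ts)) sumn_loop[i] <- sum(ps < ts[i])

# And as one vectorised comparison: rows index ts, columns index ps
sumn_outer <- rowSums(outer(ts, ps, FUN = ">"))
```

All three vectors hold the same counts. The outer() form is the closest analogue of building a logical comparison matrix and summing along one dimension, which is the shape a MATLAB translation would take.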
So, based on the comments from @Croote, I came up with the following code for the function defined in sapply()
sumidx = bsxfun(@lt,ps,ts');
summat = sumidx.*repmat(ps,300,1);
sumn = sum(summat,2);
sumnfin = sumn(2:end)-sumn(1:end-1);
However, I am not getting the desired results. The curves should overlap with each other: the blue curve is correct, so the orange one needs to overlap with the blue curve.
What am I missing here? Is R's pnorm() equivalent to MATLAB's normcdf() as I have used it here?
So, after fiddling around, I discovered that all I had to do was obtain the number of occurrences of ps < tb. The line summat = sumidx.*repmat(ps,300,1) is not supposed to be there. After removing that line and keeping sumn = sum(sumidx,2);, I get the desired result.
So, based on the comments from @Croote and after fiddling around, I came up with the following code for the function defined in sapply()
sumidx = bsxfun(@lt,ps,ts');
sumn = sum(sumidx,2);
And for the plot, I coded it as
sumnfin = sumn(2:end)-sumn(1:end-1);
Finally, I get the desired result

Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit it on the page here, and make it easy to reproduce. Code for visualizations omitted. Comments are from vignette author.
(Full vignette:
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
  m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
  m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
  sr <- apply(x,2,function(col) {
    er = col - rf
    return(mean(er)/sd(er))
  })
  return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
library(pbo) # provides pbo()
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
Here's the portion I'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
  m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
  m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through stackoverflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation:
Why is the data transformed in this way, and what is the result of this transformation?
Is it only necessary for this simulated data, or do I need to similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D."
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.
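As a quick check of what the loop does, here is a minimal sketch on a smaller matrix than the vignette uses (five trials and a fixed seed, both chosen only for illustration). After the loop every column has exactly the target daily standard deviation and mean, whatever the random draw happened to be.

```r
set.seed(42)                                   # hypothetical seed
n <- 5                                         # fewer trials than the vignette, for speed
t <- 2400
m <- matrix(rnorm(n * t), nrow = t, ncol = n)

sr_base    <- 0
mu_base    <- sr_base / 252                    # target daily mean
sigma_base <- 1 / sqrt(252)                    # target daily sd (annual sd of 1)

for (i in 1:n) {
  m[, i] <- m[, i] * sigma_base / sd(m[, i])   # re-scale: sd becomes sigma_base
  m[, i] <- m[, i] + mu_base - mean(m[, i])    # re-center: mean becomes mu_base
}

apply(m, 2, sd) * sqrt(252)                    # annualised sd of each column
colMeans(m) * 252                              # annualised mean of each column
```

The re-centering step adds a constant to each column, so it does not undo the re-scaling. Real backtest returns would not be forced onto a common mean and standard deviation like this; they only need to be on the same time scale as the rf argument of the performance function.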

How to do top down forecasted proportions for hts objects with 2 levels?

I had previously asked this question about forecast recombination with top-down forecasted proportions using the hts package. The solution there works great for multilevel hierarchies; however, I get an error when I try to use it on a two-level hierarchy.
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = htseg1$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two-level hierarchy skips the last conditional, if (l.levels > 2L). The last statement inside that conditional multiplies prop by the time series flist[[k + 1L]], which converts prop into a time series matrix. When this statement is skipped, prop remains a regular matrix, which causes the error when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non-exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two-level hierarchies for method = "tdfp", I thought there might be a nice clean solution.
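For the two-level case the calculation is small enough to write by hand. The sketch below assumes the usual definition of top-down forecast proportions (each bottom-level forecast's share of the summed bottom-level forecasts, applied to the top-level forecast) and uses made-up numbers in place of allf; it is not the package's internal code. Working on plain matrices also sidesteps the time-series/matrix mismatch described above.

```r
# Hypothetical forecasts: column 1 is the total, the others are bottom-level series.
allf <- cbind(Total = c(100, 110, 120),
              A     = c(30, 33, 30),
              B     = c(50, 55, 60),
              C     = c(30, 22, 30))

top    <- allf[, 1]
bottom <- allf[, -1, drop = FALSE]

# Forecast proportions: share of each bottom forecast in the summed bottom forecasts,
# computed separately for every forecast horizon (row)
prop <- bottom / rowSums(bottom)

# Revised bottom-level forecasts: distribute the top-level forecast by those shares
tdfp <- prop * top

rowSums(tdfp)   # adds back up to the top-level forecast at every horizon
```

If the inputs are ts objects, converting them with as.matrix() or unclass() first and restoring the time attributes with ts() afterwards would be one way to keep the arithmetic on ordinary matrices.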

What is the formula to calculate the gini with sample weight

I need help understanding how to obtain the same result as this function:
gini(x, weights=rep(1,length=length(x))) --> page 2. Gini
Let's say we need to measure the income of a population N. To do that, we can divide the population into K subgroups, and in each subgroup k we sample nk individuals and ask for their income. As a result, we get each individual's income, and each individual has a particular sample weight representing their contribution to the population N. Here is an example taken from the link above; the dataset is from the NLS:
# Convert the wage growth from (log. dollar) to (dollar)
y <- exp(recent$chpermwage);y
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y,w=recent$wgt)
> --- Here is the result----
> gini_y = 0.3418394
> gini_yw = 0.3483615
I know how to compute the Gini without weights using my own code, so I am happy to keep the command gini(y) as it is. My only concern is how gini(y,w) arrives at the result 0.3483615. I tried another calculation, shown below, to see whether I could reproduce gini_yw. It is based on the CDF in Section 9.5 of the book "Relative Distribution Methods in the Social Sciences" by Mark S. Handcock:
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you can see, my result gini_z differs from the result of gini(y, weights). If anyone knows how to set up the computation so that it gives exactly gini_yw = 0.3483615, please advise.
Thanks a lot, friends.
function (x, weights = rep(1, length = length(x)))
{
    ox <- order(x)
    x <- x[ox]
    weights <- weights[ox]/sum(weights)
    p <- cumsum(weights)
    nu <- cumsum(weights * x)
    n <- length(nu)
    nu <- nu/nu[n]
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini which can be seen by entering gini into the console. No parentheses or anything else.
This can be done for any function or object really.
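Reading the source this way also shows why multiplying the incomes by the weights gives a different number. The sketch below copies the function body shown above under a made-up name, gini_w, so that it runs without any package, and uses invented incomes and integer weights. The weights enter as population shares, so a weight of 3 behaves like observing the same income three times; it does not scale the income itself.

```r
# Copy of the function body shown above, under a hypothetical name (no package needed)
gini_w <- function(x, weights = rep(1, length = length(x))) {
  ox <- order(x)
  x <- x[ox]
  weights <- weights[ox] / sum(weights)
  p <- cumsum(weights)            # cumulative population share
  nu <- cumsum(weights * x)       # cumulative weighted income
  n <- length(nu)
  nu <- nu / nu[n]                # cumulative income share (Lorenz curve ordinates)
  sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

x <- c(10, 20, 40)     # hypothetical incomes
w <- c(2, 1, 3)        # hypothetical integer sample weights

gini_w(x, w)           # weighted estimate
gini_w(rep(x, w))      # same value: each income repeated w times
gini_w(x * w)          # a different quantity: the Gini of the products x * w
```

For these numbers the first two calls agree (about 0.271) while the third gives about 0.417, which mirrors the gap between gini_yw and gini_z in the question.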
This is a bit late, but one may be interested in the concentration/diversity measures contained in the SciencesPo package.
