Rolling weighted average in R (multiple observations) - r

Is there any fast function that can calculate a rolling average that is weighted? This is necessary because I have multiple observations (not always the same number) per data point (change in seconds) and I average them. When I take the rolling average, I want to re-weight to get an unbiased rolling average.
So far, I have come up with this solution (in this example with a window of 3 seconds).
library(data.table)
library(zoo)   # for rollapply

sam <- data.table(val_mean = c(1:15), N = c(11:25))
sam[, weighted := val_mean * N]
sam[, rollnumerator := rollapply(weighted, 3, sum, fill = NA, align = "left")]
sam[, rolldenominator := rollapply(N, 3, sum, fill = NA, align = "left")]
sam[, rollnumerator / rolldenominator]
I couldn't find any question that already addresses this problem.
This is not about unequal spacing of the data: I can take care of that by expanding my data.table with NAs so that every second is included (the example above is equally spaced). Nor do I want weights in the sense of RcppRoll's roll_mean: there, the weights are fixed for all windows ("A vector of length n, giving the weights for each element within a window."), whereas in my case the weights change with the values currently being processed. Finally, I don't want an adaptive window size; it should stay fixed (say, at 3 seconds).

1) Use by.column = FALSE:
library(data.table)
library(zoo)
wmean <- function(x) weighted.mean(x[, 1], x[, 2])
sam[, rollapplyr(.SD, 3, wmean, by.column = FALSE, fill = NA, align = "left")]
2) Another approach is to encode the values and weights into a complex vector:
wmean_cmplx <- function(x) weighted.mean(Re(x), Im(x))
sam[, rollapply(complex(real = val_mean, imag = N), 3, wmean_cmplx,
                fill = NA, align = "left")]
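Both forms should reproduce the manual numerator/denominator ratio from the question. A quick consistency check (my own sketch, assuming sam and wmean from above are still in the workspace):
manual <- sam[, rollapply(val_mean * N, 3, sum, fill = NA, align = "left") /
                rollapply(N, 3, sum, fill = NA, align = "left")]
via_sd <- sam[, rollapplyr(.SD[, .(val_mean, N)], 3, wmean, by.column = FALSE,
                           fill = NA, align = "left")]
all.equal(manual, as.vector(via_sd))   # expected TRUE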

Related

Calculating the percentage of overlap between polytopes (n-dimensions)

I have to figure out the percentage of overlap between polytopes in n-dimensional spaces, where my only available source of reference is a set of randomly sampled points within those polytopes.
Assume that the following two R objects are two sets of randomly sampled points from two different polytopes in 5 dimensions:
one <- matrix(runif(5000, min = 0, max = 5), ncol = 5)
two <- matrix(runif(5000, min = 0, max = 4), ncol = 5)
In this example, I selected a smaller range for the second object, so we know that there should be less than 10% overlap. Let me know if I am wrong.
EDIT:
Just to make it really clear, the question is: what is the percentage of overlap between those two objects?
I need a method that generalizes to n-dimensional spaces.
This stackoverflow question is somewhat similar to what I am trying to do, but I didn't manage to get it to work.
So, the most straightforward way is to use the hypervolume package.
library(hypervolume)
one <- hypervolume(matrix(runif(5000, min = 0, max = 5), ncol = 5))
two <- hypervolume(matrix(runif(5000, min = 0, max = 4), ncol = 5))
three <- hypervolume_set(one, two, check.memory = FALSE)
get_volume(three)
This will get you the volume.
hypervolume_overlap_statistics(three)
This function will output four different metrics, one of which is the Jaccard Similarity Index.
The Jaccard Similarity is the proportion of overlap between the two sample sets (the intersection divided by the union).
Alternatives
Chris suggested volesti as an alternative; another option is the geometry package.
Neither calculates the proportion straight away. You need to find the intersection (e.g. intersectn in geometry, VpolytopeIntersection in volesti), calculate the volumes of the two polytopes and of their intersection, and then divide the volume of the intersection by the volume of the union (the sum of the two volumes minus the intersection) to get the proportion of overlap.
These packages also use a different method to calculate the volume, which might be more appropriate if you are trying to construct convex hulls in an n-dimensional space. For me, hypervolume is the better solution, because I am doing something more akin to Hutchinson’s n-dimensional hypervolume concept from ecology and evolutionary biology.
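For illustration, here is a rough sketch of the geometry route (my own example, not from the original answer; I am assuming the intersectn() result exposes the intersection hull and its volume as $ch$vol, as described in ?intersectn, so check this against your installed version):
library(geometry)
one <- matrix(runif(5000, min = 0, max = 5), ncol = 5)
two <- matrix(runif(5000, min = 0, max = 4), ncol = 5)
vol1 <- convhulln(one, output.options = "FA")$vol   # volume of the hull of the first point set
vol2 <- convhulln(two, output.options = "FA")$vol   # volume of the hull of the second point set
volI <- intersectn(one, two)$ch$vol                 # volume of the intersection hull (assumed slot)
volI / (vol1 + vol2 - volI)                         # intersection over union, as above
Note that for point clouds this measures the overlap of the convex hulls of the samples, which is only an approximation to the overlap of the underlying polytopes.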

Trim argument in mean() when number of observations is odd

I need some clarification about the trim argument in the function mean().
In ?mean we find that
trim is the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed
If trim is non-zero, a symmetrically trimmed mean is computed
I assume that it will trim the values symmetrically, taking as many observations from the lower range of values as from the upper.
My question is: if x has an odd number of observations and we set trim = 0.5, will it remove one observation fewer so that it cuts the same number from both sides, or will it take one extra out at random from either the top or the bottom?
Thanks in advance,
Ines
I don't exactly know the answer to your question, but I tested with this:
vec <- c(rep(0, 50), rep(1, 51))
mean(vec)
# 0.5049505
mean(vec, trim = .1)
# 0.5061728
So in this case the function appears to have trimmed the same number of observations (10) from each end: removing 10 zeros and 10 ones leaves 40 zeros and 41 ones, and 41/81 ≈ 0.5061728 matches the trimmed mean above.
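To address the trim = 0.5 case directly, here is a small stand-in of my own that mimics, from memory, the trimming logic in base R's mean.default (verify against getAnywhere("mean.default") on your installation):
trimmed_mean <- function(x, trim = 0) {
  n <- length(x)
  if (trim >= 0.5) return(stats::median(x))  # trim = 0.5 collapses to the median
  lo <- floor(n * trim) + 1                  # first index kept after sorting
  hi <- n + 1 - lo                           # last index kept after sorting
  mean(sort(x)[lo:hi])                       # the same count is dropped from both ends
}
vec <- c(rep(0, 50), rep(1, 51))
trimmed_mean(vec, 0.1)   # 0.5061728, matching mean(vec, trim = 0.1)
mean(vec, trim = 0.5)    # 1, i.e. the median of vec
So for trim < 0.5 the same count, floor(n * trim), is dropped from each end and nothing is removed at random; trim = 0.5 simply returns the median.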

Rolling window to calculate Value at Risk in R

I'm sure this is very obvious, but I'm a beginner in R and I spent a good part of the afternoon trying to solve this...
I'm trying to create a rolling window to calculate the Value at Risk (VaR) over time.
I have already calculated the unconditional VaR for my entire time series of 7298 daily returns.
Now what I'm trying to do is build a rolling window that calculates the VaR over 25 days and advances by one observation at a time over the entire time series.
I tried
apply.rolling(nas, trim = TRUE, gap = 25, by = 1, FUN = function(x) VaR(R = nas, p = 0.99, method="historical"))
and
rollapply(nas, width = 25, FUN = function(x) VaR(R = nas, p = 0.99, method="historical"))
where nas is my time series.
My code has been running for an hour now... I don't know what I did wrong...
Thank you very much in advance for any help you can provide.
H.
It should be:
rollapply(nas, width = 25, FUN = function(x) VaR(R = x, p = 0.99, method="historical"))
Basically, you are applying a function that takes in a value x (which is nas filtered down to a 25-observation window) and produces output based on that x. In your original attempt the function was function(x) VaR(R = nas, p = 0.99, method="historical"), so it takes x in but still calculates the VaR of the whole nas series, and it does that more than 7,000 times, hence it takes forever.
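For completeness, a minimal runnable sketch of the corrected call (the nas below is simulated stand-in data, not the poster's returns; VaR comes from PerformanceAnalytics and rollapply from zoo):
library(zoo)                   # rollapply
library(PerformanceAnalytics)  # VaR

# Simulated stand-in for the poster's 7298 daily returns.
set.seed(1)
nas <- zoo(rnorm(7298, sd = 0.01), order.by = Sys.Date() - 7298:1)

# 25-day historical VaR, recomputed as the window advances one observation at a time.
# as.numeric() just strips the 1x1 matrix that VaR returns.
rollVaR <- rollapply(nas, width = 25, align = "right",
                     FUN = function(x) as.numeric(VaR(R = x, p = 0.99, method = "historical")))
head(rollVaR)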

Sampling within ranges

I am having trouble sampling within a certain background, or excluding some possibilities.
I am trying to create an R function that shuffles genomic regions.
For now the function works well and follows these steps:
Retrieves all the genomic region lengths and chromosomes of the query.
Calculates all the possible starts as the total size of the specified chromosome minus the length of each query region.
Calculates the shuffled genomic regions: the start is sampled between 0 and the possible start, and the width is simply the width of each query region.
This function uses GenomicRanges objects; here is its code:
GrShuffle <- function(regions, chromSizes = LoadChromSizes("hg19")) {
  # Gets all the regions lengths from the query.
  regionsLength <- regions@ranges@width
  # The possible starts are the chromosome sizes - the regions lengths.
  possibleStarts <- chromSizes[as.vector(regions@seqnames), ] - regionsLength
  # Gets all the random starts from sampling the possible starts.
  randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
  granges <- GRanges(regions@seqnames, IRanges(start = randomStarts,
                                               width = regionsLength),
                     strand = regions@strand)
  return(granges)
}
But now I need to use a universe, i.e. another set of regions that determines the ranges within which the random starts may fall. The universe works like a restriction on the sampling: it is another set of regions, like the query, and no shuffling should take place outside of those regions.
Any clue on how to sample within ranges in R?
The lapply is important as it drastically reduces the execution time of the function compared to using a loop.
[EDIT]
Here is a reproducible example that does not use GenomicRanges, to simplify as much as possible what I want to achieve.
## GENERATES A RANDOM QUERY
chromSizes <- c(100,200,250)
names(chromSizes) <- c("1","2","3")
queryChrom <- sample(names(chromSizes), 100, replace = TRUE)
queryLengths <- sample(10, 100, replace = TRUE)
queryPossibleStarts <- chromSizes[queryChrom] - queryLengths
queryStarts <- unlist(lapply(queryPossibleStarts, sample.int, size = 1))
query <- data.frame(queryChrom, queryStarts, queryStarts + queryLengths)
colnames(query) <- c("chrom", "start", "end")
##
##SIMPLIFIED FUNCTION
# Gets all the regions lengths from the query.
regionsLength <- query$end - query$start
# The possible starts are the chromosome sizes - the regions lengths.
possibleStarts <- chromSizes[query$chrom] - regionsLength
# Gets all the random starts from sampling the possible starts.
randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
shuffledQuery <- data.frame(queryChrom, randomStarts, randomStarts + queryLengths)
colnames(shuffledQuery) <- c("chrom", "start", "end")
##
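One possible way to add the universe restriction to this simplified example (my own sketch, not from the original post; the universe data frame and the shuffleWithinUniverse function are hypothetical names): for each query region, keep only the universe regions on the same chromosome that are long enough to contain it, pick one of them, and sample a start position inside it.
## SKETCH: restrict shuffling to a universe of allowed regions.
universe <- data.frame(chrom = c("1", "2", "3"),
                       start = c(10, 50, 100),
                       end   = c(90, 180, 240))

shuffleWithinUniverse <- function(query, universe) {
  regionsLength <- query$end - query$start
  starts <- vapply(seq_len(nrow(query)), function(i) {
    # Universe regions on the same chromosome that are long enough to hold region i.
    candidates <- universe[universe$chrom == query$chrom[i] &
                             (universe$end - universe$start) >= regionsLength[i], ]
    if (nrow(candidates) == 0) stop("no universe region can hold region ", i)
    # Pick one candidate region, then a start inside it so that the region fits entirely.
    j <- sample.int(nrow(candidates), 1)
    candidates$start[j] +
      sample.int(candidates$end[j] - candidates$start[j] - regionsLength[i] + 1, 1) - 1
  }, numeric(1))
  data.frame(chrom = query$chrom, start = starts, end = starts + regionsLength)
}

shuffledInUniverse <- shuffleWithinUniverse(query, universe)
Note that this picks a candidate universe region uniformly rather than in proportion to its length; weighting the choice by the number of valid start positions would make the sampling uniform over all allowed starts.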

Apply a vector to an optimize function in R

How can I apply a vector of observations to find the local maxima between observations in R? I tried the following code, but according to the plot there should be just two local maxima.
How can I do this in R?
x = c(0.0000005, 0.1578947, 0.3157895, 0.4736842, 0.6315789, 0.7894737,
0.9473684, 1.1052632,1.2631579, 1.4210526, 1.5789474, 1.7368421,
1.8947368, 2.0526316, 2.2105263, 2.3684211, 2.5263158 ,
2.6842105, 2.8421053, 3.000000)
f = function(x) (x+1)*(x-2)*(x-1)*(x)*(x+1)*(x-2)*(x-3)
plot(x, f(x), type="l")
maximums = sapply(x, function(x)optimize(f, c(0, x), maximum = TRUE)$maximum)
I'm not sure how to apply optimize to that sequence for that purpose, but it surely wouldn't be applied pointwise. You could conceivably fit a polynomial spline and then differentiate it. The numerical analogue of differentiation is diff, and the conditions for a local maximum are that the first derivative be small and the second derivative be negative. Here is a plot of the points that satisfy those conditions (shifting the coloring by one to account for the shortening of the vector each time you diff it):
plot(x, f(x),
     col = c("red", "blue")[1 + seq_along(x) %in%  # adding one to the logical values 0,1
       c(0, which(diff(diff(f(x))) < 0 & diff(f(x)) < 0.1))])
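As a further sketch of my own (not part of the original answer), you can also pick out the discrete local maxima of the sampled values directly, without optimize, by finding where the sequence switches from increasing to decreasing:
y <- f(x)
# Indices where the sampled curve switches from rising to falling,
# i.e. interior local maxima of the sampled sequence.
loc_max <- which(diff(sign(diff(y))) == -2) + 1
x[loc_max]   # x-values of the local maxima
y[loc_max]   # corresponding function values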
