Function for polynomials of arbitrary order (symbolic method preferred) - r

I've found polynomial coefficients from my data:
R <- c(0.256,0.512,0.768,1.024,1.28,1.437,1.594,1.72,1.846,1.972,2.098,2.4029)
Ic <- c(1.78,1.71,1.57,1.44,1.25,1.02,0.87,0.68,0.54,0.38,0.26,0.17)
NN <- 3
ft <- lm(Ic ~ poly(R, NN, raw = TRUE))
pc <- coef(ft)
So I can create a polynomial function:
f1 <- function(x) pc[1] + pc[2] * x + pc[3] * x ^ 2 + pc[4] * x ^ 3
And for example, take a derivative:
g1 <- Deriv(f1)
How to create a universal function so that it doesn't have to be rewritten for every new polynomial degree NN?

My original answer may not be what you really want, as it was numerical rather than symbolic. Here is the symbolic solution.
## use `"x"` as variable name
## taking polynomial coefficient vector `pc`
## can return a string, or an expression by further parsing (mandatory for `D`)
f <- function (pc, expr = TRUE) {
stringexpr <- paste("x", seq_along(pc) - 1, sep = " ^ ")
stringexpr <- paste(stringexpr, pc, sep = " * ")
stringexpr <- paste(stringexpr, collapse = " + ")
if (expr) return(parse(text = stringexpr))
else return(stringexpr)
}
## an example cubic polynomial with coefficients 0.1, 0.2, 0.3, 0.4
cubic <- f(pc = 1:4 / 10, TRUE)
## using R base's `D` (requiring expression)
dcubic <- D(cubic, name = "x")
# 0.2 + 2 * x * 0.3 + 3 * x^2 * 0.4
## using `Deriv::Deriv`
library(Deriv)
dcubic <- Deriv(cubic, x = "x", nderiv = 1L)
# expression(0.2 + x * (0.6 + 1.2 * x))
Deriv(f(1:4 / 10, FALSE), x = "x", nderiv = 1L) ## use string, get string
# [1] "0.2 + x * (0.6 + 1.2 * x)"
Of course, Deriv makes higher order derivatives easier to get. We can simply set nderiv. For D however, we have to use recursion (see examples of ?D).
Deriv(cubic, x = "x", nderiv = 2L)
# expression(0.6 + 2.4 * x)
Deriv(cubic, x = "x", nderiv = 3L)
# expression(2.4)
Deriv(cubic, x = "x", nderiv = 4L)
# expression(0)
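For completeness, the recursion needed with base D can be sketched as follows. It follows the DD helper shown in the examples of ?D; the polynomial is written out by hand here as an unevaluated call, purely for illustration.

```r
## recursive wrapper around base `D`, in the spirit of the `DD` example in `?D`
DD <- function(expr, name, order = 1L) {
  if (order < 1L) stop("'order' must be >= 1")
  if (order == 1L) D(expr, name)
  else DD(D(expr, name), name, order - 1L)
}

## the cubic with coefficients 0.1, 0.2, 0.3, 0.4, written out by hand
cubic_call <- quote(0.1 + 0.2 * x + 0.3 * x^2 + 0.4 * x^3)

d2 <- DD(cubic_call, "x", 2L)   ## second derivative, still symbolic
eval(d2, list(x = 1))           ## 0.6 + 2.4 * 1 = 3
```

Unlike Deriv, D does not simplify its result, so the returned call can look clumsy even though it evaluates correctly.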
If we use expression, we will be able to evaluate the result later. For example,
eval(cubic, envir = list(x = 1:4)) ## cubic polynomial
# [1] 1.0 4.9 14.2 31.3
eval(dcubic, envir = list(x = 1:4)) ## its first derivative
# [1] 2.0 6.2 12.8 21.8
The above implies that we can wrap up an expression for a function. Using a function has several advantages, one being that we are able to plot it using curve or plot.function.
fun <- function(x, expr) eval.parent(expr, n = 0L)
Note, the success of fun requires expr to be an expression in terms of symbol x. If expr was defined in terms of y for example, we need to define fun with function (y, expr). Now let's use curve to plot cubic and dcubic, on a range 0 < x < 5:
curve(fun(x, cubic), from = 0, to = 5) ## colour "black"
curve(fun(x, dcubic), add = TRUE, col = 2) ## colour "red"
The most convenient way is, of course, to define a single function FUN rather than combining f and fun. This way, we also don't need to worry about keeping the variable name consistent between f and fun.
FUN <- function (x, pc, nderiv = 0L) {
## check missing arguments
if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
## expression of polynomial
stringexpr <- paste("x", seq_along(pc) - 1, sep = " ^ ")
stringexpr <- paste(stringexpr, pc, sep = " * ")
stringexpr <- paste(stringexpr, collapse = " + ")
expr <- parse(text = stringexpr)
## taking derivatives
dexpr <- Deriv::Deriv(expr, x = "x", nderiv = nderiv)
## evaluation
val <- eval.parent(dexpr, n = 0L)
## note, if we take too many derivatives so that `dexpr` becomes constant
## `val` is free of `x` so it will only be of length 1
## we need to repeat this constant to match `length(x)`
if (length(val) == 1L) val <- rep.int(val, length(x))
## now we return
val
}
Suppose we want to evaluate a cubic polynomial with coefficients pc <- c(0.1, 0.2, 0.3, 0.4) and its derivatives on x <- seq(0, 1, 0.2), we can simply do:
FUN(x, pc)
# [1] 0.1000 0.1552 0.2536 0.4144 0.6568 1.0000
FUN(x, pc, nderiv = 1L)
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
FUN(x, pc, nderiv = 2L)
# [1] 0.60 1.08 1.56 2.04 2.52 3.00
FUN(x, pc, nderiv = 3L)
# [1] 2.4 2.4 2.4 2.4 2.4 2.4
FUN(x, pc, nderiv = 4L)
# [1] 0 0 0 0 0 0
Now plotting is also easy:
curve(FUN(x, pc), from = 0, to = 5)
curve(FUN(x, pc, 1), from = 0, to = 5, add = TRUE, col = 2)
curve(FUN(x, pc, 2), from = 0, to = 5, add = TRUE, col = 3)
curve(FUN(x, pc, 3), from = 0, to = 5, add = TRUE, col = 4)

Since my symbolic solution eventually grew too long, I use a separate answer for the numerical calculation. This is possible because the derivatives of a polynomial are known explicitly, so we can code them directly. Note that no R expression is used here; everything is done directly with functions.
So we first generate the polynomial basis from degree 0 to degree p - n, then multiply by the coefficients and the factorial multipliers. It is more convenient to use outer than poly here.
## use `outer`
g <- function (x, pc, nderiv = 0L) {
## check missing arguments
if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
## polynomial order p
p <- length(pc) - 1L
## number of derivatives
n <- nderiv
## earlier return?
if (n > p) return(rep.int(0, length(x)))
## polynomial basis from degree 0 to degree `(p - n)`
X <- outer(x, 0:(p - n), FUN = "^")
## initial coefficients
## the additional `+ 1L` is because R vector starts from index 1 not 0
beta <- pc[n:p + 1L]
## factorial multiplier
beta <- beta * factorial(n:p) / factorial(0:(p - n))
## matrix vector multiplication
drop(X %*% beta)
}
We still use the example x and pc defined in the symbolic solution:
x <- seq(0, 1, by = 0.2)
pc <- 1:4 / 10
g(x, pc, 0)
# [1] 0.1000 0.1552 0.2536 0.4144 0.6568 1.0000
g(x, pc, 1)
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
g(x, pc, 2)
# [1] 0.60 1.08 1.56 2.04 2.52 3.00
g(x, pc, 3)
# [1] 2.4 2.4 2.4 2.4 2.4 2.4
g(x, pc, 4)
# [1] 0 0 0 0 0 0
The result is consistent with what we get from FUN in the symbolic solution.
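As a quick sanity check on the factorial multiplier, the coefficients of the n-th derivative can also be computed on their own. This is only a sketch; poly_dcoef is a hypothetical helper name, and pc is assumed to hold coefficients in increasing order of degree as above.

```r
## hypothetical helper: coefficient vector of the n-th derivative
## the term pc[j + 1] * x^j becomes pc[j + 1] * j! / (j - n)! * x^(j - n)
poly_dcoef <- function(pc, n = 1L) {
  p <- length(pc) - 1L          ## polynomial order
  if (n > p) return(0)          ## everything has been differentiated away
  j <- n:p                      ## degrees that survive n derivatives
  pc[j + 1L] * factorial(j) / factorial(j - n)
}

poly_dcoef(1:4 / 10, 1L)  ## 0.2 0.6 1.2, i.e. 0.2 + 0.6 * x + 1.2 * x^2
poly_dcoef(1:4 / 10, 3L)  ## 2.4
```

These are the same numbers that appear in the symbolic derivatives of the cubic.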
Similarly, we can plot g using curve:
curve(g(x, pc), from = 0, to = 5)
curve(g(x, pc, 1), from = 0, to = 5, col = 2, add = TRUE)
curve(g(x, pc, 2), from = 0, to = 5, col = 3, add = TRUE)
curve(g(x, pc, 3), from = 0, to = 5, col = 4, add = TRUE)

Now, after quite a lot of effort demonstrating how we can work this out ourselves, consider using the R package polynom. It is a small package that implements construction, differentiation, integration, arithmetic and root-finding for univariate polynomials. The package is written entirely in R, without any compiled code.
## install.packages("polynom")
library(polynom)
We still consider the cubic polynomial example used before.
pc <- 1:4 / 10
## step 1: making a "polynomial" object as preparation
pcpoly <- polynomial(pc)
#0.1 + 0.2*x + 0.3*x^2 + 0.4*x^3
## step 2: compute derivative
expr <- deriv(pcpoly)
## step 3: convert to function
g1 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 1.2 + x * w
# w <- 0.6 + x * w
# w <- 0.2 + x * w
# w
#}
#<environment: 0x9f4867c>
Note, by step-by-step construction, the resulting function has all parameters inside. It only requires a single argument for the x value. In contrast, the functions in the other two answers take the coefficients and the derivative order as mandatory arguments, too. We can call this function:
g1(seq(0, 1, 0.2))
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
To produce the same graph as in the other two answers, we get the other derivatives as well:
g0 <- as.function(pcpoly) ## original polynomial
## second derivative
expr <- deriv(expr)
g2 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 2.4 + x * w
# w <- 0.6 + x * w
# w
#}
#<environment: 0x9f07c68>
## third derivative
expr <- deriv(expr)
g3 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 2.4 + x * w
# w
#}
#<environment: 0x9efd740>
Perhaps you have already noticed that I did not specify nderiv, but recursively take 1 derivative at a time. This may be a disadvantage of this package. It does not facilitate higher order derivatives.
Now we can make a plot
## As mentioned, `g0` to `g3` are parameter-free
curve(g0(x), from = 0, to = 5)
curve(g1(x), add = TRUE, col = 2)
curve(g2(x), add = TRUE, col = 3)
curve(g3(x), add = TRUE, col = 4)

floating-point error fligner.test r function?

I have noticed that the statistical test fligner.test from the R stats package gives different results after a simple linear transformation of the data, even though this shouldn't be the case.
Here is an example (the difference for the original dataset is much more dramatic):
g <- factor(rep(1:2, each=6))
x1 <- c(2,2,6,6,1,4,5,3,5,6,5,5)
x2 <- (x1-1)/5 #> cor(x1,x2) [1] 1
fligner.test(x1,g) # chi-squared = 4.2794, df = 1, p-value = 0.03858
fligner.test(x2,g) # chi-squared = 4.8148, df = 1, p-value = 0.02822
Looking at the function code, I have noticed that the median centering might be causing the issue:
x1 <- x1 - tapply(x1,g,median)[g]
x2 <- x2 - tapply(x2,g,median)[g]
unique(abs(x1)) # 1 3 2 0
unique(abs(x2)) # 0.2 0.6 0.4 0.2 0.0 <- repeated 0.2
Is this a known issue, and how should this inconsistency be resolved?
I think your analysis is correct here. In your example the problem ultimately occurs because (0.8 - 0.6) == 0.2 is FALSE unless rounded to 15 decimal places. You should file a bug report, since this is avoidable.
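The floating-point effect can be reproduced in isolation. This is a minimal sketch using only base R; the variable name d is just for illustration.

```r
d <- 0.8 - 0.6

d == 0.2                     ## FALSE: the two doubles differ in the last bits
isTRUE(all.equal(d, 0.2))    ## TRUE: equal up to a small tolerance
round(d, 15) == 0.2          ## TRUE: rounding removes the spurious difference
```

Because rank() treats such values as distinct, ties that exist in exact arithmetic are broken after the transformation, which changes the test statistic.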
If you are desperate in the meantime, you can adapt stats:::fligner.test.default by applying a tiny bit of rounding at the median centering stage to remove floating point inequalities:
fligner <- function (x, g, ...)
{
if (is.list(x)) {
if (length(x) < 2L)
stop("'x' must be a list with at least 2 elements")
DNAME <- deparse1(substitute(x))
x <- lapply(x, function(u) u <- u[complete.cases(u)])
k <- length(x)
l <- lengths(x)
if (any(l == 0))
stop("all groups must contain data")
g <- factor(rep(1:k, l))
x <- unlist(x)
}
else {
if (length(x) != length(g))
stop("'x' and 'g' must have the same length")
DNAME <- paste(deparse1(substitute(x)), "and",
deparse1(substitute(g)))
OK <- complete.cases(x, g)
x <- x[OK]
g <- g[OK]
g <- factor(g)
k <- nlevels(g)
if (k < 2)
stop("all observations are in the same group")
}
n <- length(x)
if (n < 2)
stop("not enough observations")
x <- round(x - tapply(x, g, median)[g], 15)
a <- qnorm((1 + rank(abs(x))/(n + 1))/2)
a <- a - mean(a)
v <- sum(a^2)/(n - 1)
a <- split(a, g)
STATISTIC <- sum(lengths(a) * vapply(a, mean, 0)^2)/v
PARAMETER <- k - 1
PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
names(STATISTIC) <- "Fligner-Killeen:med chi-squared"
names(PARAMETER) <- "df"
METHOD <- "Fligner-Killeen test of homogeneity of variances"
RVAL <- list(statistic = STATISTIC, parameter = PARAMETER,
p.value = PVAL, method = METHOD, data.name = DNAME)
class(RVAL) <- "htest"
return(RVAL)
}
This now returns the correct result for both your vectors:
fligner(x1,g)
#>
#> Fligner-Killeen test of homogeneity of variances
#>
#> data: x1 and g
#> Fligner-Killeen:med chi-squared = 4.2794, df = 1, p-value = 0.03858
fligner(x2,g)
#>
#> Fligner-Killeen test of homogeneity of variances
#>
#> data: x2 and g
#> Fligner-Killeen:med chi-squared = 4.2794, df = 1, p-value = 0.03858

Compute sample statistics for a data vector with ties which is stored as a frequency table

I am trying to get some summary statistics (mean, variance and quantiles) from a data vector with tied values. In particular, it is stored in a frequency distribution table: unique data values var and number of ties frequency.
I know I could use the rep function to first expand the vector to its full format:
xx <- rep(mydata$var, mydata$frequency)
then do standard
mean(xx)
var(xx)
quantile(xx)
But the frequency is really large and I have many unique values, which makes the program really slow. Is there a way to compute these statistics directly from var and frequency?
set.seed(0)
x <- runif(10) ## unique data values
k <- sample.int(5, 10, TRUE) ## frequency
n <- sum(k)
xx <- rep.int(x, k) ## "expanded" data
#################
## sample mean ##
#################
mean(xx) ## using `xx`
#[1] 0.6339458
mu <- c(crossprod(x, k)) / n ## using `x` and `k`
#[1] 0.6339458
#####################
## sample variance ##
#####################
var(xx) * (n - 1) / n ## using `xx`
#[1] 0.06862544
v <- c(crossprod(x ^ 2, k)) / n - mu * mu ## using `x` and `k`
#[1] 0.06862544
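The two formulas can be wrapped into a small helper. This is only a sketch: freq_mean_var is a hypothetical name, and it uses the centred form of the variance, which tends to be numerically safer than subtracting mu * mu when the mean is large relative to the spread. It also returns the usual unbiased estimate that var() gives.

```r
## hypothetical helper: mean and variance from values `x` and frequencies `k`
freq_mean_var <- function(x, k) {
  n  <- sum(k)                            ## total number of observations
  mu <- sum(x * k) / n                    ## weighted mean
  v  <- sum(k * (x - mu)^2) / n           ## biased variance (divisor n)
  c(mean = mu, var_biased = v, var_unbiased = v * n / (n - 1))
}

## small made-up example, checked against the expanded vector
x_demo <- c(1.5, 2, 4)
k_demo <- c(3L, 1L, 2L)
res <- freq_mean_var(x_demo, k_demo)
xx_demo <- rep.int(x_demo, k_demo)
all.equal(unname(res["mean"]), mean(xx_demo))
all.equal(unname(res["var_unbiased"]), var(xx_demo))
```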
Computing quantiles is much more involved, but doable. We first need to understand how quantiles are computed in the standard way.
xx <- sort(xx)
pp <- seq(0, 1, length = n)
plot(pp, xx); abline(v = pp, col = 8, lty = 2)
The standard quantile computation is a linear interpolation problem. However, when data have ties, we can clearly see that there are "runs" (of the same value) and "jumps" (between two values) in the plot. Linear interpolation is only needed on "jumps", while on "runs" the quantiles are just the run values.
The following function finds quantiles using only x and k. For demonstration purposes there is an argument verbose. If TRUE, it will produce a plot and a data frame containing information on the "runs" (and "jumps").
find_quantile <- function (x, k, prob = seq(0, 1, length = 5), verbose = FALSE) {
if (is.unsorted(x)) {
ind <- order(x); x <- x[ind]; k <- k[ind]
}
m <- length(x) ## number of unique values
n <- sum(k) ## number of data
d <- 1 / (n - 1) ## break [0, 1] into (n - 1) intervals
## the right and left end of each run
r <- (cumsum(k) - 1) * d
l <- r - (k - 1) * d
if (verbose) {
breaks <- seq(0, 1, d)
plot(r, x, "n", xlab = "prob (p)", ylab = "quantile (xq)", xlim = c(0, 1))
abline(v = breaks, col = 8, lty = 2)
## sketch each run
segments(l, x, r, x, lwd = 3)
## sketch each jump
segments(r[-m], x[-m], l[-1], x[-1], lwd = 3, col = 2)
## sketch `prob`
abline(v = prob, col = 3)
print( data.frame(x, k, l, r) )
}
## initialize the vector of quantiles
xq <- numeric(length(prob))
run <- rbind(l, r)
i <- findInterval(prob, run, rightmost.closed = TRUE)
## odd integers in `i` means that `prob` lies on runs
## quantiles on runs are just run values
on_run <- (i %% 2) != 0
run_id <- (i[on_run] + 1) / 2
xq[on_run] <- x[run_id]
## even integers in `i` means that `prob` lies on jumps
## quantiles on jumps are linear interpolations
on_jump <- !on_run
jump_id <- i[on_jump] / 2
xl <- x[jump_id] ## x-value to the left of the jump
xr <- x[jump_id + 1] ## x-value to the right of the jump
pl <- r[jump_id] ## percentile to the left of the jump
pr <- l[jump_id + 1] ## percentile to the right of the jump
p <- prob[on_jump] ## probability on the jump
## evaluate the line `(pl, xl) -- (pr, xr)` at `p`
xq[on_jump] <- (xr - xl) / (pr - pl) * (p - pl) + xl
xq
}
Applying the function to the example data above with verbose = TRUE gives:
result <- find_quantile(x, k, prob = seq(0, 1, length = 5), TRUE)
# x k l r
#1 0.2016819 4 0.0000000 0.1111111
#2 0.2655087 2 0.1481481 0.1851852
#3 0.3721239 1 0.2222222 0.2222222
#4 0.5728534 4 0.2592593 0.3703704
#5 0.6291140 2 0.4074074 0.4444444
#6 0.6607978 5 0.4814815 0.6296296
#7 0.8966972 1 0.6666667 0.6666667
#8 0.8983897 3 0.7037037 0.7777778
#9 0.9082078 2 0.8148148 0.8518519
#10 0.9446753 4 0.8888889 1.0000000
Each row of the data frame is a "run". x gives the run values, k is the run length, and l and r are the left and right percentiles of the run. In the figure, "runs" are drawn as black horizontal lines.
Information on "jumps" is implied by the r, x values of a row and the l, x values of the next row. In the figure, "jumps" are drawn as red lines.
The vertical green lines signal the prob values we give.
The computed quantiles are
result
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
which are identical to
quantile(xx, names = FALSE)
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
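If interpolation is not required, a simpler alternative is the inverse empirical CDF, which corresponds to type = 1 in quantile. Note this is a different quantile definition from the default (type 7) reproduced above, so the results generally differ on "jumps". The sketch below uses a hypothetical helper name and made-up data.

```r
## hypothetical helper: type-1 quantiles (inverse ECDF) from values and frequencies
freq_quantile_type1 <- function(x, k, prob) {
  o <- order(x); x <- x[o]; k <- k[o]     ## sort values, carry frequencies along
  cp <- cumsum(k) / sum(k)                ## cumulative proportion at each value
  ## smallest value whose cumulative proportion reaches `p`
  idx <- vapply(prob, function(p) which(cp >= p)[1L], integer(1))
  x[idx]
}

x_demo <- c(4, 1.5, 2)
k_demo <- c(2L, 3L, 1L)
p_demo <- c(0.25, 0.5, 0.75)
freq_quantile_type1(x_demo, k_demo, p_demo)                        ## 1.5 1.5 4.0
quantile(rep.int(x_demo, k_demo), p_demo, type = 1, names = FALSE) ## same values
```

No floating-point fuzz is applied here, so probabilities that fall exactly on a cumulative proportion may need extra care with real data.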

How to perform sequential mathematical operations with R [duplicate]

I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?
Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.
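A short usage sketch of the filter approach on a made-up vector. The sides argument of stats::filter decides whether the window is centred (sides = 2) or trailing (sides = 1).

```r
## same idea as above, with the namespace spelled out to avoid the dplyr clash
ma <- function(x, n = 5) stats::filter(x, rep(1 / n, n), sides = 2)

x_demo <- c(1, 2, 3, 4, 5, 6, 7)

## centred window of width 3: the two end points have no full window
as.numeric(ma(x_demo, n = 3))                                  ## NA 2 3 4 5 6 NA

## trailing window of width 3: the first two points have no full window
as.numeric(stats::filter(x_demo, rep(1 / 3, 3), sides = 1))    ## NA NA 2 3 4 5 6
```

The as.numeric() call only drops the time-series class that filter returns.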
Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast
Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running mean over windows of n numbers (the running sum of each window divided by n):
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by #mzuther, this assumes that there are no NAs in the data. Dealing with those requires dividing each window sum by the number of non-NA values in that window. Here's one way of doing that, incorporating the comment from #Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in a window are NA, the result for that window is 0 / 0, which R evaluates to NaN rather than NA.
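One way to handle windows that contain only NAs is to return NA for them explicitly. This is a sketch wrapping the code above; roll_mean_na is a hypothetical name, and it assumes n is no larger than length(x).

```r
## hypothetical wrapper: NA-aware running mean based on cumulative sums
roll_mean_na <- function(x, n) {
  cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))   ## cumulative sum, NAs count as 0
  cn <- c(0, cumsum(!is.na(x)))                ## cumulative count of non-NA values
  hi <- (n + 1):length(cx)
  lo <- 1:(length(cx) - n)
  rx <- cx[hi] - cx[lo]                        ## window sums
  rn <- cn[hi] - cn[lo]                        ## non-NA counts per window
  ifelse(rn == 0, NA_real_, rx / rn)           ## all-NA windows give NA
}

roll_mean_na(c(1, NA, NA, 4, 6), 2)   ## 1 NA 4 5
```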
In data.table 1.12.0 a new frollmean function was added to compute fast and exact rolling means, carefully handling NA, NaN, +Inf and -Inf values.
As there is no reproducible example in the question, there is not much more to address here.
You can find more information in the manual under ?frollmean, which is also available online.
Examples from manual below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp
The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.
You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}
Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9
In fact RcppRoll is very good.
The code posted by cantdutchthis must be corrected in its fourth line so that the window has a fixed width of n:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Another way, which handles missing values, is given here.
A third way, improving cantdutchthis's code so that partial averages can be either kept or dropped, follows:
ma <- function(x, n = 2, parcial = TRUE) {
  res <- x  ## same length as x; every element is overwritten below
  for (i in seq_along(x)) {
    t <- max(i - n + 1, 1)  ## window start, clipped at the first element
    res[i] <- mean(x[t:i])
  }
  ## if partial averages are not wanted, drop the first n - 1 values
  if (!parcial && n > 1) res <- res[-seq_len(n - 1)]
  res
}
To complement the answers of cantdutchthis and Rodrigo Remedio:
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i], ...)
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)
You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)
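To see why this works, it helps to look at what embed returns. The following sketch uses a made-up vector; each row of the matrix is one window (in reversed order), so a row-wise mean gives the moving average. rowMeans is used as a drop-in replacement for the apply call.

```r
x_demo <- c(2, 4, 6, 8, 10)
k_demo <- 3

## one window per row, most recent value first
embed(x_demo, k_demo)
##      [,1] [,2] [,3]
## [1,]    6    4    2
## [2,]    8    6    4
## [3,]   10    8    6

rowMeans(embed(x_demo, k_demo))         ## 4 6 8
apply(embed(x_demo, k_demo), 1, mean)   ## same result
```

The result has length(x) - k + 1 elements, so it is shorter than the input.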
The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7
EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our workhorse:
moving_fn <- function(x, w, fun, side = "centre", ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}
Though a bit slow, you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function (you can change it to min, max, sd etc.) and width is the rolling window.
One can use the runner package for moving functions, in this case its mean_run function. The problem with cummean is that it doesn't handle NA values, but mean_run does. The runner package also supports irregular time series, and windows can depend on dates:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.
Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
library(dplyr) ## provides first(), last() and %>% used below
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}
vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}
I use aggregate along with a vector created by rep(). This has the advantage that cbind() can be used to aggregate more than one column of your data frame at a time. Below is an example of a moving average of 60 for a vector (v) of length 1000:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min); mean is shown as an example. Again, you could use a formula with cbind to do this on more than one (or all) columns in a data frame.

How to efficiently do complex row operations with nested functions in R?

Consider a multidimensional array, e.g. a zoo object z, with columns a, b, c, x, and a function W(w=c(1,1,1), x) which, for example, weights every column individually but also DEPENDS on the specific row value in column x. How can row operations, e.g. calculating the rowWeightedMeans, be done efficiently here?
It is known that R::zoo is very fast and efficient for row operations, if the function is very simple, e.g.:
W <- function(w) { return(w); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3)))
But what if W() depends on a value in that row? E.g.:
W <- function(w, x) { return(w*x); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3), z[,4]))
R complains here because it does not know how to handle the multiple dimensions of the arguments in the nested function.
The solution could be a for(i in 1:nrow(z)) loop, computing the values individually for every row i. However, for large data sets this takes an enormous amount of extra computational effort and time.
EDIT
Ok guys, thanks for your time and criticism. I tried and tested all your answers but must admit that the actual problem was not solved or understood. For example, I hadn't asked you to rewrite my weight function or calculations, because I had already presented a minimal version of much more complex calculations. The issue or question here lies much deeper. So I sat back and tried to boil down the problem to the root of the evil and found a minimal working example for you without any zoos, weightedMeans, and so on. Here you go:
z <- data.frame(matrix (1:20, nrow = 4))
colnames (z) <- c ("a", "b", "c", "x", "y")
z
# a b c x y
#1 1 5 9 13 17
#2 2 6 10 14 18
#3 3 7 11 15 19
#4 4 8 12 16 20
W <- function(abc, w, p) {
ifelse (w[1] == p, return(length(p)), return(0))
# Please do not complain! I know this is stupid, but it is an MWE
# and my calculations contained in W() are much more complex!
}
z[,"y"] <- W(z[,1:3], c(14,7,8), z[,"x"])
# same result: z[,"y"] <- apply(z[,1:3], 1, W, c(14,7,8), z[,"x"])
z
# a b c x y
#1 1 5 9 13 4
#2 2 6 10 14 4
#3 3 7 11 15 4
#4 4 8 12 16 4
# expected outcome:
# a b c x y
#1 1 5 9 13 0
#2 2 6 10 14 4
#3 3 7 11 15 0
#4 4 8 12 16 0
The problem I am facing is, that R passes all lines of z[,"x"] to the function, however, I expect it to take only the line which corresponds to the line of z[,"y"] that is currently processed internally when R loops through it. In this example, I expect 14==14 only in line number 2!
So: how to tell R to pass line by line to functions?
SOLUTION
Besides the awarded and accepted answer, I like to summarize the solution here to improve clarity and provide a better overview about the discussion.
This question was not about rewriting the specific function W (e.g. weighting). It was only about how to pass several arguments row by row to a general function. With either z$y <- f(z$a, z$x) or z$y <- apply(z$a, 1, f, z$x), only the first argument is passed row by row, while the second is passed as a complete column with all rows. This seems to be intrinsic behaviour of R, which we need to work around.
To solve this, the whole row needs to be passed as a single argument to a wrapper function, which in turn then applies the specific calculations on that row. Solution for the problem with the weights:
f <- function(x) weighted.mean(x[1:3], W(c(0.1,0.5,0.3), x[4]))
z[,"wmean"] <- apply(z[,1:4], 1, f)
Solution for the general problem with the data frame:
f <- function(x) W(x[1:3], c(14,7,8), x[4])
z$y <- apply(z, 1, f)
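The wrapper idea can be tried on the small data frame from the question. This is only a sketch: the W below is a simplified stand-in (it returns the row sum when the first weight equals the row's x value, otherwise 0), not the real calculation. The mapply variant is an alternative that walks several columns in parallel, one element at a time.

```r
z <- data.frame(matrix(1:20, nrow = 4))
colnames(z) <- c("a", "b", "c", "x", "y")

## stand-in for the real calculation; `p` must be a single value per row
W <- function(abc, w, p) if (w[1] == p) sum(abc) else 0

## wrapper: receives one whole row, then splits it into the pieces W needs
f <- function(row) W(row[1:3], c(14, 7, 8), row[4])
z$y <- apply(z[, 1:4], 1, f)
z$y    ## 0 18 0 0: only row 2 has x == 14

## alternative: iterate over the columns in parallel with mapply
y2 <- mapply(function(a, b, cc, x) W(c(a, b, cc), c(14, 7, 8), x),
             z$a, z$b, z$c, z$x)
```

Note that apply converts the data frame to a matrix first, so all columns passed to it should share one type.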
Brian also presents even faster methods using compiled C++ code in his accepted answer. Thanks to #BrianAlbertMonroe, #jaimedash and #inscaven for dealing with the poorly clarified question and for hinting at this solution.
Haven't really worked with zoo or rowWeightedMeans but if you simply apply weights to row elements before taking the mean of them, and require the weights to depend on one of the elements of the row:
z <- matrix(rnorm(100),ncol=4)
W <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights)
}
w.means <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
If the above gives the correct answer but you're worried about quickness write the W function in Rcpp or use the built in cmpfun,
N <- 10000
z <- matrix(rnorm(N),ncol=4)
# Interpreted R function
W1 <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights) # weighted mean, consistent with W above and Wcpp below
}
# Compiled R function
W2 <- compiler::cmpfun(W1)
# C++ function imported into R via Rcpp
Rcpp::cppFunction('double Wcpp(NumericVector row, NumericVector weights){
int x = row.size() ;
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
nweights = weights * row[x - 1];
for( int i = 0; i < (x-1) ; i++){
wrow[i] = row[i] * nweights[i];
}
double res = sum(wrow) / sum(nweights);
return(res);
}')
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3))
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3))
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3))
# identical() compares only two objects (a third positional argument is taken as num.eq),
# so compare pairwise; all.equal() tolerates rounding differences between R and C++ sums
isTRUE(all.equal(w.means0, w.means1)) && isTRUE(all.equal(w.means0, w.means2))
#[1] TRUE
Or
# Write the whole thing in C++
Rcpp::cppFunction('NumericVector WM(NumericMatrix z , NumericVector weights){
int x = z.ncol() ;
int y = z.nrow() ;
NumericVector res(y);
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
double nwsum;
double mult;
for( int row = 0 ; row < y ; row++){
mult = z(row,x-1);
nweights = weights * mult;
nwsum = sum(nweights);
for( int i = 0; i < (x-1) ; i++){
wrow[i] = z(row,i) * nweights[i] ;
}
res[row] = sum(wrow) / nwsum;
}
return(res);
}')
microbenchmark::microbenchmark(
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3)),
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3)),
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3)),
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))
)
Unit: microseconds
expr min lq mean median uq max neval
w.means0 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 12114.834 12536.9330 12995.1722 12838.2805 13163.4835 15796.403 100
w.means1 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 9941.571 10286.8085 10769.7330 10410.9465 10788.6800 19526.840 100
w.means2 <- apply(z, 1, Wcpp, weights = c(0.1, 0.5, 0.3)) 10919.112 11631.5530 12849.7294 13262.9705 13707.7465 17438.524 100
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3)) 94.172 107.9855 146.2606 125.0075 140.2695 2089.933 100
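The three apply variants call a function once per row, while WM loops in compiled code, which likely explains most of the gap in the timings above. As a hedged sketch (not part of the original answer, and not benchmarked here), the same weighted mean can also be written in vectorised base R without any per-row function call. The names nw, by_row and vectorised are made up for this illustration.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
z <- matrix(rnorm(10000), ncol = 4)
w <- c(0.1, 0.5, 0.3)

# Row-by-row reference, same definition as W in the answer
W <- function(row, weights) {
  weights <- weights * row[4]
  row2 <- row[1:3] * weights
  sum(row2) / sum(weights)
}
by_row <- apply(z, 1, W, weights = w)

# Vectorised: build the n x 3 matrix of row-specific weights in one step
nw <- outer(z[, 4], w)              # nw[i, j] = z[i, 4] * w[j]
vectorised <- rowSums(z[, 1:3] * nw) / rowSums(nw)

all.equal(by_row, vectorised)
```

Whether this comes close to the Rcpp version would have to be measured with microbenchmark on the actual data.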
EDIT:
Incorporating the weighted.mean function roughly doubles the run time in the benchmark below, and according to its help file it does not handle missing values in the weights specially, so you will still need to write code to manage them.
> z <- matrix(rnorm(100),ncol=4)
> W <- function(row, weights){
+ weights <- weights * row[4]
+ row2 <- row[1:3] * weights
+ sum(row2) / sum(weights)
+
+ }
> W1 <- compiler::cmpfun(W)
> W2 <- function(row, weights){
+ weights <- weights * row[4]
+ weighted.mean(row[1:3],weights)
+ }
> W3 <- compiler::cmpfun(W2)
> w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
> w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))
> identical(w.means1,w.means2)
[1] TRUE
> microbenchmark(
+ w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)),
+ w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)),
+ w.means2 < .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)) 145.315 167.4550 172.8163 172.9120 180.6920 194.673 100
w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 124.087 134.3365 143.6803 137.8925 148.7145 225.459 100
w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 307.311 346.6320 356.4845 354.7325 371.7620 412.110 100
w.means2 <- apply(z, 1, W3, weights = c(0.1, 0.5, 0.3)) 280.073 308.7110 323.0156 324.1230 333.7305 407.963 100
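The EDIT above notes that missing values still have to be managed by hand. One possible way to do that, sketched here under the assumption that a missing value should simply be dropped together with its weight, is the following. W_na and the small matrix are hypothetical names and data used only for illustration.

```r
# Row-wise weighted mean that skips missing values (assumed policy: drop them)
W_na <- function(row, weights) {
  w <- weights * row[4]               # weights depend on the 4th element
  v <- row[1:3]
  keep <- !is.na(v) & !is.na(w)       # use only complete value/weight pairs
  sum(v[keep] * w[keep]) / sum(w[keep])
}

# Two example rows; the first has a missing value in the second column
z_na <- matrix(c(1, NA, 3, 2,
                 4,  5, 6, 1), ncol = 4, byrow = TRUE)
res_na <- apply(z_na, 1, W_na, weights = c(0.1, 0.5, 0.3))
```

If the auxiliary value in column 4 is itself missing, every weight is NA and the result is NaN, so that case would need its own rule.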
Here's a solution with zoo::rollapply. It produces the same answer as matrixStats::rowWeightedMeans for the simpler case.
if(! require(matrixStats)) {
install.packages('matrixStats')
library(matrixStats)
}
if(! require(zoo)) {
install.packages('zoo')
library(zoo)
}
z <- zoo (matrix (1:20, nrow = 5))
colnames (z) <- c ("a", "b", "c", "x")
z$x <- 0 # so we can see an effect below...
z
## a b c x
## 1 1 6 11 0
## 2 2 7 12 0
## 3 3 8 13 0
## 4 4 9 14 0
## 5 5 10 15 0
weights <- c(0.1,0.5,0.3)
W <- function (w) { return(w); }
z$wmean <- rowWeightedMeans(z[,1:3], w=W(weights))
## z[,new]<- doesn't work to create new columns in zoo
## objects
## use $
rowWeightMean_zoo <- function (r, W, weights) {
s <- sum(W(weights))
return(sum(r[1:3] * W(weights) / s))
}
z$wmean_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo(r, W, weights))
z
For the requirement in the question, that the return value be dependent on some ancillary data in the row, rowWeightedMeans doesn't work. But, the function passed to rollapply can be modified to use other elements of the row.
W2 <- function (w, x) { return(w * x); }
# z$wmean2 <- rowWeightedMeans(z[,1:3], w=W2(c(0.1,0.5,0.3), z[,4]))
## doesn't work
## Error in rowWeightedMeans(z[, 1:3], w = W#(c(0.1, 0.5, 0.3), z[, 4])) :
## The length of argument 'w' is does not match the number of column in 'x': 5 != 3
## In addition: Warning message:
## In `*.default`(w, x) :
## longer object length is not a multiple of shorter object length
## Calls: rowWeightedMeans -> W -> Ops.zoo -> NextMethod
rowWeightMean_zoo_dependent <- function (r, W, weights) {
s <- sum(W(weights, r[4]))
return(sum(r[1:3] * W(weights, r[4]) / s))
}
z$wmean2_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo_dependent(r, W2, weights))
z
## a b c x wmean wmean_zoo wmean2_zoo
## 1 1 6 11 0 7.111111 7.111111 NaN
## 2 2 7 12 0 8.111111 8.111111 NaN
## 3 3 8 13 0 9.111111 9.111111 NaN
## 4 4 9 14 0 10.111111 10.111111 NaN
## 5 5 10 15 0 11.111111 11.111111 NaN
I think this can be solved by clever reshaping. I would use dplyr for that, but the workflow should be similar for plyr or data.table; all these packages are heavily optimized.
For this example I assume the weight function is w(x) = w0 ^ x.
Here I create some sample data z, and generic weights w (note I add a row number r to z):
library(dplyr)
library(tidyr)
N <- 10
z <- data.frame(r=1:N, a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
Now the calculation would be:
res <- z %>% gather(key,value,-r,-x) %>% # convert to long format, but keep row numbers and x
left_join(w, 'key') %>% # add generic weights
mutate(eff_weight = weight^x) %>% # calculate effective weights
group_by(r) %>% # group by the original rows for the weighted mean
summarise(ws = sum(value*eff_weight), ww=sum(eff_weight)) %>% # calculate two helper values
mutate(weighted_mean = ws/ww) %>% # calculate the weighted mean itself
select(r, weighted_mean) # remove unnecessary output
left_join(z, res) # add to the original data
I added some notes, but if you have trouble understanding the pipeline, you can evaluate res stepwise (drop the tail, including the trailing %>%) and look at the intermediate results.
Update
I took up the challenge of doing the same in base R:
N <- 10
z <- data.frame(a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
long.z <- reshape(z, idvar = "row", times=c('a','b','c'),
timevar='key',
varying = list(c('a','b','c')), direction = "long")
compose.z <- merge(long.z,w, by='key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)
sum.stat <- by(compose.z2, compose.z2$row, function(x) {sum(x$a * x$eff.weight )/sum(x$eff.weight)})
nice.data <- c(sum.stat)
It requires somewhat more verbose code, but the same pattern applies.
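As a sanity check, the reshape-based result can be compared against a direct row-wise computation. The sketch below repeats the base R steps from the answer with a fixed seed (the seed and the names w0 and direct are assumptions for illustration) and uses the same assumed weight function w0 ^ x.

```r
set.seed(1)                                   # arbitrary seed
N <- 10
z <- data.frame(a = rnorm(N), b = rnorm(N), c = rnorm(N), x = rpois(N, 5))
w <- data.frame(key = c('a', 'b', 'c'), weight = c(0.1, 0.5, 0.3))

# Direct row-wise reference: weights are w0 ^ x for each row
w0 <- w$weight
direct <- apply(as.matrix(z), 1, function(r) weighted.mean(r[1:3], w0^r[4]))

# Long-format route, following the steps of the answer
long.z <- reshape(z, idvar = "row", times = c('a', 'b', 'c'),
                  timevar = 'key',
                  varying = list(c('a', 'b', 'c')), direction = "long")
compose.z <- merge(long.z, w, by = 'key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)
sum.stat <- by(compose.z2, compose.z2$row,
               function(d) sum(d$a * d$eff.weight) / sum(d$eff.weight))
nice.data <- as.numeric(sum.stat)             # one value per original row

all.equal(unname(direct), nice.data)
```

Note that reshape stores the stacked values in a column named after the first varying variable, here a, which is why the summary function reads d$a.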

Taylor approximation in R

Is there a function/package in R which takes a function f and a parameter k, and then returns a Taylor approximation of f of degree k?
You can use Ryacas to work with the yacas computer algebra system (which you will need to install as well).
Using an example from the vignette:
library(Ryacas)
# run yacasInstall() if prompted to install yacas
#
yacas("texp := Taylor(x,0,3) Exp(x)")
## expression(x + x^2/2 + x^3/6 + 1)
# or
Now, if you want to turn that into a function that you can evaluate at given values of x:
myTaylor <- function(f, k, var,...){
.call <- sprintf('texp := Taylor( %s, 0, %s) %s', var,k,f)
result <- yacas(.call)
foo <- function(..., print = FALSE){
if(print){print(result)}
Eval(result, list(...))}
return(foo)
}
# create the function
foo <- myTaylor('Exp(x)', 3, 'x')
foo(x=1:5)
## [1] 2.666667 6.333333 13.000000 23.666667 39.333333
foo(x=1:5, print = TRUE)
## expression(x + x^2/2 + x^3/6 + 1)
## [1] 2.666667 6.333333 13.000000 23.666667 39.333333
Compare the above symbolic solution with a numerical Taylor approximation:
library(pracma)
p <- taylor(f = exp, x0 = 0, n = 3) # Numerical coefficients, highest power first
# 0.1666667 0.5000000 1.0000000 1.0000000 # x^3/6 + x^2/2 + x + 1
polyval(p, 1:5) # Evaluate the polynomial
# 2.66667 6.33333 13.00000 23.66667 39.33334 # cubic approximation of exp(x) at x = 1:5
As a followup, consider:
foo <- myTaylor('Exp(x)', 3, 'x')
sprintf('%2.15f',foo(x=1:5))
[1] "2.666666666666667" "6.333333333333333" "13.000000000000000"
[4] "23.666666666666664" "39.333333333333329"
p <- taylor(f = exp, x0 = 0, n = 3)
sprintf('%2.15f',polyval(p,1:5))
[1] "2.666666721845557" "6.333333789579300" "13.000001556539996"
[4] "23.666670376066413" "39.333340601497312"
Which of these is more accurate I'll leave up to the reader :-)
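One way to settle the comparison, assuming the target is the cubic Taylor polynomial 1 + x + x^2/2 + x^3/6 itself, is to evaluate that polynomial directly and measure how far each printed result is from it. The two vectors below are copied from the printouts above; the variable names are made up for this sketch.

```r
x <- 1:5
# Reference: the cubic Taylor polynomial of exp at 0, evaluated directly
exact <- 1 + x + x^2 / 2 + x^3 / 6

# Values copied from the Ryacas (symbolic) printout above
symbolic <- c(2.666666666666667, 6.333333333333333, 13.000000000000000,
              23.666666666666664, 39.333333333333329)
# Values copied from the pracma (numerical) printout above
numerical <- c(2.666666721845557, 6.333333789579300, 13.000001556539996,
               23.666670376066413, 39.333340601497312)

err_symbolic  <- max(abs(symbolic - exact))   # largest deviation, symbolic
err_numerical <- max(abs(numerical - exact))  # largest deviation, numerical
c(symbolic = err_symbolic, numerical = err_numerical)
```

The symbolic values agree with the polynomial to about machine precision, while the numerically differentiated coefficients are off in the sixth or seventh significant digit.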
The function below returns a function implementing the n-th order Taylor series approximation of f at the point a. The body of f must be a single expression in the variable x that D can differentiate.
taylor <- function(f, n, a) {
ith_derivative <- as.expression(body(f))
f_temp <- function(x) x
series <- as.character(f(a))
for (i in seq_len(n)) {
ith_derivative <- body(f_temp) <- D(ith_derivative, "x")
series <- paste0(series, "+", f_temp(a) / factorial(i), "*(x - ", a, ")^", i)
}
f_output <- function(x) x
body(f_output) <- parse(text = series)
f_output
}
taylor(f = function(x) sin(x), n = 3, a = 0)
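A related sketch, not the answer's function: instead of pasting a string together, the Taylor coefficients can be collected numerically with repeated calls to base R's D and then evaluated. The names taylor_coefs and taylor_eval are hypothetical, and the expression is assumed to use x as its variable.

```r
# Coefficients c_0..c_n of the Taylor polynomial of `expr` around a
taylor_coefs <- function(expr, n, a) {
  d <- expr
  coefs <- numeric(n + 1)
  coefs[1] <- eval(d, list(x = a))            # f(a)
  for (i in seq_len(n)) {
    d <- D(d, "x")                            # i-th symbolic derivative
    coefs[i + 1] <- eval(d, list(x = a)) / factorial(i)
  }
  coefs
}

# Evaluate sum_i coefs[i] * (x - a)^(i - 1) for each x
taylor_eval <- function(x, coefs, a) {
  powers <- seq_along(coefs) - 1
  sapply(x, function(xi) sum(coefs * (xi - a)^powers))
}

cf <- taylor_coefs(quote(sin(x)), n = 3, a = 0)   # x - x^3/6
approx_sin <- taylor_eval(c(0, 0.1, 0.5), cf, a = 0)
```

Comparing approx_sin with sin(c(0, 0.1, 0.5)) gives a quick feel for how fast the cubic approximation degrades away from the expansion point.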
If you are asking about Taylor approximation in the context of error propagation, you might try the "propagate" function of my qpcR package. It evaluates symbolic gradient vectors together with the covariance matrix in the form g * V * t(g) (the well-known matrix notation for error propagation), which is equivalent to a first-order Taylor expansion.
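The g * V * t(g) form can be illustrated in base R without qpcR. This is only a sketch of the first-order formula, not of the propagate function; the model f(a, b) = a * b, the means mu and the covariance matrix V are made-up values.

```r
# Hypothetical model and inputs
f_expr <- quote(a * b)
mu <- c(a = 2, b = 3)                          # assumed means
V <- matrix(c(0.01, 0,
              0,    0.04), nrow = 2,
            dimnames = list(names(mu), names(mu)))  # assumed covariance

# Symbolic gradient, evaluated at the means: g = (df/da, df/db)
g <- sapply(names(mu), function(v) eval(D(f_expr, v), as.list(mu)))

# First-order (delta method) variance: g V g^T
var_f <- drop(t(g) %*% V %*% g)
sd_f <- sqrt(var_f)
```

With these numbers the gradient is (3, 2), so the propagated variance is 9 * 0.01 + 4 * 0.04 = 0.25.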

Resources