Efficient implementation of double for-loop - r

I am new to R and I was wondering if there is a more efficient implementation of the following setting. The time series x and y each have length around 5000, and h != nrow(q).
set.seed(1)
h = 21
x <- rnorm(5e3, 1)
y <- rnorm(5e3, 2)
q <- c(0.1, 0.3, 0.5, 0.7, 0.9)
qx <- quantile(x, probs = q)
qx <- expand.grid(qx, qx)
qy <- quantile(y, probs = q)
qy <- expand.grid(qy, qy)
q <- expand.grid(q, q)
f <- function(z, l, qz) {
  # note: q and the loop index i are read from the enclosing (global) environment
  n <- length(z)
  1/(n - l) * sum((z[1:(n-l)] <= qz[[1]]) * (z[(1+l):n] <= qz[[2]])) - prod(q[i,])
}
sum = 0
for (i in 1:h) {
  for (j in 1:nrow(q)) {
    sum = sum + (f(x, l = i, qx[j,]) - f(y, l = i, qy[j,]))^2
  }
}
sum
# 0.0008698279
Thank you very much!

Under some circumstances, one faster alternative to loops is the sapply function.
It works as follows: for each element of a vector, it applies some function.
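For instance, a trivial illustration of the calling pattern (my own example):
sapply(1:3, function(i) i^2)
# [1] 1 4 9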
Alternatively, you could take a look at the foreach package, which offers some fast looping constructs.
Here is an example using sapply. Depending on what exactly you need, you might want to use either of the two variants below. Also, sapply is just one of the faster ways of doing this, not necessarily the fastest.
# setup from the question
set.seed(1)
h = 1
x <- rnorm(5e3, 1)
y <- rnorm(5e3, 2)
q <- c(0.1, 0.3, 0.5, 0.7, 0.9)
qx <- quantile(x, probs = q)
qx <- expand.grid(qx, qx)
qy <- quantile(y, probs = q)
qy <- expand.grid(qy, qy)
q <- expand.grid(q, q)
f <- function(z, l, qz) {
  n <- length(z)
  1/(n - l) * sum((z[1:(n-l)] <= qz[[1]]) * (z[(1+l):n] <= qz[[2]])) - prod(q[i,])
}
# load microbenchmark library for comparison of execution times
library(microbenchmark)
microbenchmark({
  # the version from the question, with a for loop
  sum = 0
  for (i in 1:h) {
    for (j in 1:nrow(q)) {
      sum = sum + (f(x, l = i, qx[j,]) - f(y, l = i, qy[j,]))^2
    }
  }
},
{
  # using sapply and storing into an object; this gives you an h x nrow(q) matrix as well as the sum
  sum = 0
  sapply(1:h, function(i) sapply(1:nrow(q), function(j) {
    sum <<- sum + (f(x, l = i, qx[j,]) - f(y, l = i, qy[j,]))^2
  }))
},
{
  # using sapply and summing the output
  sum(sapply(1:h, function(i) sapply(1:nrow(q), function(j) {
    (f(x, l = i, qx[j,]) - f(y, l = i, qy[j,]))^2
  })))
},
# run each piece of code 200 times for the timing comparison
times = 200
)
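For completeness, here is a fully vectorized sketch of my own (not part of the original answer, so treat it as an unverified alternative). It rests on two observations: the comparisons z <= q can be precomputed for all 25 quantile pairs at once with outer(), and the prod(q[i,]) term inside f is the same for x and y, so it cancels in the difference f(x, ...) - f(y, ...) and can be dropped.
n <- length(x)
Ix1 <- outer(x, qx[, 1], "<=")   # n x 25 indicator matrices, computed once
Ix2 <- outer(x, qx[, 2], "<=")
Iy1 <- outer(y, qy[, 1], "<=")
Iy2 <- outer(y, qy[, 2], "<=")
total <- 0
for (i in 1:h) {
  # all 25 quantile pairs for lag i in a single colSums call
  fx <- colSums(Ix1[1:(n - i), ] * Ix2[(1 + i):n, ]) / (n - i)
  fy <- colSums(Iy1[1:(n - i), ] * Iy2[(1 + i):n, ]) / (n - i)
  total <- total + sum((fx - fy)^2)
}
total  # should match the double-loop result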

Related

Is there any way to improve the performance of (e.g. vectorize) this look-up and recoding problem implemented with a for loop?

I need to make recodings to data sets of the following form.
# List elements of varying length
set.seed(12345)
n = 1e3
m = sample(2:5, n, T)
V = list()
for (i in 1:n) {
  for (j in 1:m[i])
    if (j == 1) V[[i]] = 0 else V[[i]][j] = V[[i]][j-1] + rexp(1, 1/10)
}
As an example consider
[1] 0.00000 23.23549 30.10976
Each list element contains an ascending vector of length m[i], each starting with 0 and ending somewhere in the positive real numbers.
Now, consider a value s, where s is smaller than the maximum v_m of each V[[i]]. Also let v_m_ denote the (m-1)-th element of V[[i]]. Our goal is to find all elements of V[[i]] that are bounded by v_m_ - s and v_m - s. In the example above, if s = 5, the desired vector v would be 23.23549. v can contain more elements if the interval encloses more values. As another example, consider:
> V[[1]]
[1] 0.000000 2.214964 8.455576 10.188048 26.170458
If we now let s = 16, the resulting vector is 0 2.214964 8.455576, so it has length 3. The code below implements this procedure using a for loop. It returns v in a list for all n. Note that I also attach the lower/upper bound before/after v if the bound led to a reduction in the length of v (in other words, if the bound has a positive value).
This loop is too slow in my application because n is large and the procedure is part of a larger algorithm that has to be run many times with some parameters changing. Is there a way to obtain the result faster than with a for loop, for example using vectorization? I know that lapply is in general not faster than for.
# Series maximum and one below maximum
v_m = sapply(V, function(x) x[length(x)])
v_m_ = sapply(V, function(x) x[length(x)-1])
# Set some offsets s
s = runif(n,0,v_m)
# Procedure
d1 = (v_m_ - s)
d2 = (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')
# For loop - can this be done faster?
W = list()
for (i in 1:n) {
  v = V[[i]]
  l = length(v)
  v = v[v > d1[i]]
  if (l > length(v)) v = c(d1[i], v)
  l = length(v)
  v = v[v < d2[i]]
  if (l > length(v)) v = c(v, d2[i])
  W[[i]] = v
}
I guess you can try mapply like below
V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v <- sapply(V, tail, 2)
s <- runif(n, 0, v[1, ])
if (sum(v[2, ] < 0) > 0) stop("s cannot be larger than series maximum.")
W <- mapply(
  function(x, lb, ub) c(pmax(lb, 0), x[x >= lb & x <= ub], pmin(ub, max(x))),
  V,
  v[1, ] - s,
  v[2, ] - s
)
I don't think vectorization is an option here, since the operation goes from a list of unequal-length vectors to another list of unequal-length vectors.
For example, the following vectorizes all the comparisons, but the unlist/relist operations are too expensive (not to mention the final lapply(..., unique)). Stick with the for loop.
W <- lapply(
  relist(
    pmax(
      pmin(
        unlist(V),
        rep.int(d2, lengths(V))
      ),
      rep.int(d1, lengths(V))
    ),
    V
  ),
  unique
)
I see two things that will give modest gains in speed. First, if s is always greater than 0, your final if statement will always evaluate to TRUE, so it can be skipped, simplifying some of the code. Second, pre-allocate W. These are both implemented in fRecode2 below. A third thing that gives a slight gain is to avoid multiple reassignments to v; this is implemented in fRecode3 below.
For additional speed, move to Rcpp: it will allow the vectors in W to be built in a single pass through each vector element in V instead of two (a sketch follows the benchmark below).
set.seed(12345)
n <- 1e3
m <- sample(2:5, n, T)
V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v_m <- sapply(V, function(x) x[length(x)])
v_m_ <- sapply(V, function(x) x[length(x)-1])
s <- runif(n,0,v_m)
d1 <- (v_m_ - s)
d2 <- (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')
fRecode1 <- function() {
  # original function
  W = list()
  for (i in 1:n) {
    v = V[[i]]
    l = length(v)
    v = v[v > d1[i]]
    if (l > length(v)) v = c(d1[i], v)
    l = length(v)
    v = v[v < d2[i]]
    if (l > length(v)) v = c(v, d2[i])
    W[[i]] = v
  }
  W
}
fRecode2 <- function() {
  W <- vector("list", length(V))
  i <- 0L
  for (v in V) {
    l <- length(v)
    v <- v[v > d1[i <- i + 1L]]
    if (l > length(v)) v <- c(d1[i], v)
    W[[i]] <- c(v[v < d2[i]], d2[[i]])
  }
  W
}
fRecode3 <- function() {
  W <- vector("list", length(V))
  i <- 0L
  for (v in V) {
    idx1 <- sum(v <= d1[i <- i + 1L]) + 1L
    idx2 <- sum(v < d2[i])
    if (idx1 > 1L) {
      if (idx2 >= idx1) {
        W[[i]] <- c(d1[i], v[idx1:idx2], d2[i])
      } else {
        W[[i]] <- c(d1[i], d2[i])
      }
    } else {
      W[[i]] <- c(v[1:idx2], d2[i])
    }
  }
  W
}
microbenchmark::microbenchmark(fRecode1 = fRecode1(),
                               fRecode2 = fRecode2(),
                               fRecode3 = fRecode3(),
                               times = 1e3,
                               check = "equal")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fRecode1 2.0210 2.20405 2.731124 2.39785 2.80075 12.7946 1000
#> fRecode2 1.2829 1.43315 1.917761 1.54715 1.88495 51.8183 1000
#> fRecode3 1.2710 1.38920 1.741597 1.45640 1.76225 5.4515 1000
Not a huge speed boost: fRecode3 shaves just under a microsecond on average for each vector in V.
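For reference, here is a minimal Rcpp sketch of the single-pass idea mentioned above (my own addition; the name recodeCpp and all implementation details are mine, so verify before relying on it). Each element of v is visited exactly once, and a bound is attached only when values were actually dropped on that side, mirroring fRecode1:
Rcpp::cppFunction('
List recodeCpp(List V, NumericVector d1, NumericVector d2) {
  int n = V.size();
  List W(n);
  for (int i = 0; i < n; ++i) {
    NumericVector v = V[i];
    std::vector<double> out;
    bool dropLo = false, dropHi = false;
    for (int j = 0; j < v.size(); ++j) {
      if (v[j] <= d1[i])      dropLo = true;   // would be removed by v > d1[i]
      else if (v[j] >= d2[i]) dropHi = true;   // would be removed by v < d2[i]
      else                    out.push_back(v[j]);
    }
    if (dropLo) out.insert(out.begin(), d1[i]);
    if (dropHi) out.push_back(d2[i]);
    W[i] = out;
  }
  return W;
}')
# sanity check against the original: all.equal(recodeCpp(V, d1, d2), fRecode1())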

Iterative optimization of alternative glm family

I'm setting up an alternative response function to the exponential function commonly used in Poisson GLMs. It is called softplus and is defined as $\frac{1}{c} \log(1+\exp(c \eta))$, where $\eta$ corresponds to the linear predictor $X\beta$.
I already managed the optimization by setting the parameter $c$ to arbitrary fixed values and searching only for $\hat{\beta}$.
BUT now, for the next step, I have to optimize this parameter $c$ as well (iteratively alternating between an updated $\beta$ and the current $c$).
I tried to write a log-likelihood function and a score function, and then to set up a Newton-Raphson optimization (using a while loop),
but I don't know how to separate the updating of $c$ in an outer step from the updating of $\beta$ in an inner step.
Are there any suggestions?
# Response function:
sp <- function(eta, c = 1) {
  return(log(1 + exp(abs(c * eta))) / c)
}
# Log-likelihood
l.lpois <- function(par, y, X) {
  beta <- par[1:(length(par)-1)]
  c <- par[length(par)]
  l <- rep(NA, times = length(y))
  for (i in 1:length(l)) {
    l[i] <- y[i] * log(sp(X[i,] %*% beta, c)) - sp(X[i,] %*% beta, c)
  }
  l <- sum(l)
  return(l)
}
# Score function
score <- function(y, X, par) {
  beta <- par[1:(length(par)-1)]
  c <- par[length(par)]
  s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))
  for (i in 1:length(y)) {
    s[,i] <- c(X[i,], 1) * (y[i] * plogis(c * X[i,] %*% beta) / sp(X[i,] %*% beta, c) - plogis(c * X[i,] %*% beta))
  }
  score <- rep(NA, times = nrow(s))
  for (j in 1:length(score)) {
    score[j] <- sum(s[j,])
  }
  return(score)
}
# Optimization function
opt <- function(y, X, b.start, eps = 0.0001, maxiter = 1e5) {
  beta <- b.start[1:(length(b.start)-1)]
  c <- b.start[length(b.start)]
  b.old <- b.start
  i <- 0
  conv <- FALSE
  while (conv == FALSE) {
    eta <- X %*% b.old[1:(length(b.old)-1)]
    s <- score(y, X, b.old)
    h <- numDeriv::hessian(l.lpois, b.old, y = y, X = X)
    invh <- solve(h)
    # update
    b.new <- b.old + invh %*% s
    i <- i + 1
    # Test
    if (any(is.nan(b.new))) {
      b.new <- b.old
      warning("convergence failed")
      break
    }
    # convergence reached?
    if (sqrt(sum((b.new - b.old)^2))/sqrt(sum(b.old^2)) < eps | i >= maxiter) {
      conv <- TRUE
    }
    b.old <- b.new
  }
  eta <- X %*% b.new[1:(length(b.new)-1)]
  # covariance
  invh <- solve(numDeriv::hessian(l.lpois, b.new, y = y, X = X))
  fitted <- sp(eta, b.new[length(b.new)])
  result <- list("coefficients" = c(beta = b.new),
                 "fitted.values" = fitted,
                 "covariance" = invh)
}
# Running fails ..
n <- 100
x <- runif(n, 0, 1)
Xdes <- cbind(1, x)
eta <- 1 + 2 * x
y <- rpois(n, sp(eta, c = 1))
opt(y, Xdes, c(0, 1, 1))
You have 2 bugs, both in your score function (line numbers refer to your original code).
Line 25, the line inside the loop:
(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
This returns a matrix, so you must convert it to numeric:
as.numeric(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
Line 23, the line creating s: a closing parenthesis is missing.
You have:
s <- matrix(rep(NA, times = length(y)*length(par), ncol = length(y))
while it should be:
s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))
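The fixes above address the crashes but not the outer/inner separation asked about. One common pattern for that (my own sketch, not part of the answer; the search interval for c is an arbitrary assumption) is to alternate a multivariate update of beta with a one-dimensional search over c, reusing l.lpois from the question:
fit_alternating <- function(y, X, beta0, c0, eps = 1e-4, maxiter = 100) {
  beta <- beta0; cpar <- c0
  for (it in 1:maxiter) {
    # inner step: update beta with c held fixed (Nelder-Mead on the negative log-likelihood)
    beta <- optim(beta, function(b) -l.lpois(c(b, cpar), y, X))$par
    # outer step: update c with beta held fixed (1-d search; interval is an assumption)
    cpar.new <- optimize(function(cc) -l.lpois(c(beta, cc), y, X),
                         interval = c(1e-3, 10))$minimum
    converged <- abs(cpar.new - cpar) < eps
    cpar <- cpar.new
    if (converged) break
  }
  list(coefficients = beta, c = cpar, iterations = it)
}
# e.g. fit_alternating(y, Xdes, beta0 = c(0, 1), c0 = 1)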

Sampling a log-concave distribution using the adaptive rejection sampling method (R)

I am not very familiar with R. I have been trying to use the implementation of the adaptive rejection sampling method in R in order to sample from the following distribution, whose log-density is, up to an additive constant, $a \sum_{r=1}^{k} \frac{(1-x)^r}{r} + (a-1)\log(x) + k\log(1-x)$ for $0 < x < 1$ (this is what f1 below computes).
Here is my R code:
library(ars)
g1 <- function(x, r) { (1/r) * ((1-x)^r) }
f1 <- function(x, a, k) {
  add <- 0
  for (i in 1:k) {
    add <- add + g1(x, i)
  }
  res <- (a * add) + (a-1)*log(x) + k*log(1-x)
  return(res)
}
g2 <- function(x, r) { (1-x)^(r-1) }
f1prima <- function(x, a, k) {
  add <- 0
  for (i in 1:k) {
    add <- add - g2(x, i)
  }
  res <- (a * add) + (a-1)/x - k/(1-x)
  return(res)
}
mysample1 <- ars(20, f1, f1prima, x = c(0.001, 0.09), m = 2, emax = 128,
                 lb = TRUE, xlb = 0.0, ub = TRUE, xub = 1, a = 0.5, k = 100)
The function is log-concave, but I get different error messages when I run ars, and fiddling around with the input parameters doesn't help. Any suggestion would be appreciated.
The first thing, which you already noticed, is that your log-concave function is not well defined at x = 0 and x = 1.0. So a useful interval would be something like 0.01...0.99, not 0.0...1.0.
Second, I don't like the idea of computing hundreds of terms in your summation.
So a good idea might be to express it in the following way, starting with the derivative:
$\sum_{i=0}^{N-1} q^i$ is obviously a geometric series and can be replaced with $(1-q^N)/(1-q)$, where $q = 1-x$.
This is the derivative; to get the corresponding term in the function itself, just integrate it.
http://www.wolframalpha.com/input/?i=integrate+(1-q%5EN)%2F(1-q)+dq returns a Gauss hypergeometric function ${}_2F_1$ plus a logarithm:
$-\frac{q^{N+1}}{N+1}\,{}_2F_1(1,\,N+1;\,N+2;\,q) - \log(1-q)$
NB: it is the same integral as the Beta one before, but dealing with it was a bit more cumbersome.
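Spelling that out (my own restatement of the step above): with $q = 1-x$,
$$\sum_{r=1}^{N} q^{r-1} = \frac{1-q^N}{1-q}, \qquad \int_0^q \frac{1-t^N}{1-t}\,dt = -\frac{q^{N+1}}{N+1}\,{}_2F_1(1,\,N+1;\,N+2;\,q) - \log(1-q) = \sum_{r=1}^{N} \frac{q^r}{r},$$
and the right-hand side is exactly the summation term appearing in f1.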
So, code to compute those terms:
library(gsl)
library(ars)
library(ggplot2)
Gauss2F1 <- function(a, b, c, x) {
  # for x < 0, apply a linear transformation of 2F1 to map the argument back into [0, 1)
  ifelse(x >= 0.0 & x < 1.0,
         hyperg_2F1(a, b, c, x),
         hyperg_2F1(c - a, b, c, 1.0 - 1.0/(1.0 - x)) / (1.0 - x)^b)
}
f1sum <- function(x, N) {
  q <- 1.0 - x
  - q^(N+1) * Gauss2F1(1, N+1, N+2, q)/(N+1) - log(1.0 - q)
}
f1sum.1 <- function(x, N) {
  q <- 1.0 - x
  res <- rep(0.0, length.out = length(x))
  s <- rep(1.0, length.out = length(x))
  for (k in 1:N) {
    s <- s * q                      # s is now q^k
    res <- res + s / as.numeric(k)  # accumulate q^k / k (dividing s itself would give q^k / k!)
  }
  res
}
f1 <- function(x, a, N) {
  a * f1sum(x, N) + (a - 1.0)*log(x) + N*log(1.0 - x)
}
f1.1 <- function(x, a, N) {
  a * f1sum.1(x, N) + (a - 1.0)*log(x) + N*log(1.0 - x)
}
f1primesum <- function(x, N) {
  q <- 1.0 - x
  # leading minus sign because d/dx q^r = -r q^(r-1); this matches the sign
  # convention of f1prima in the question and of f1primesum.1 below
  -(1.0 - q^N)/(1.0 - q)
}
f1primesum.1 <- function(x, N) {
  q <- 1.0 - x  # this definition was missing; q was undefined here
  res <- rep(0.0, length.out = length(x))
  s <- rep(1.0, length.out = length(x))
  for (k in 1:N) {
    res <- res + s
    s <- s * q
  }
  -res
}
f1prime <- function(x, a, N) {
  a * f1primesum(x, N) + (a - 1.0)/x - N/(1.0 - x)
}
f1prime.1 <- function(x, a, N) {
  a * f1primesum.1(x, N) + (a - 1.0)/x - N/(1.0 - x)
}
p <- ggplot(data.frame(x = c(0, 1)), aes(x = x)) +
  stat_function(fun = f1, args = list(0.5, 100), colour = "#4271AE") +
  stat_function(fun = f1.1, args = list(0.5, 100), colour = "#1F3552") +
  scale_x_continuous(name = "X", breaks = seq(0, 1, 0.2), limits = c(0.001, 0.5)) +
  scale_y_continuous(name = "F") +
  ggtitle("Log-concave function")
p
As you can see, I've implemented both versions: one using direct summation and another using the analytical form of the sums. The curves shown are for a = 0.5, N = 100.
First, there is a bit of a difference between the direct sum and the 2F1 form; I attribute it to precision loss in the summation.
Second, and more important: the function is NOT log-concave. No wonder ars() is failing left and right. See the plot produced by the code above.

Knapsack 0-1 in R

I have been trying to formulate a simple knapsack problem, but I cannot see why it is not working.
i <- c(1, 2, 3, 4)
v <- c(100, 80, 10, 120)
w <- c(10, 5, 10, 4)
k <- 15
F <- function(i, k) {
  if (i == 0 | k == 0) {
    output <- 0
  } else if (k < w[i]) {
    output <- F(i-1, w)
  } else {
    output <- max(v[i] + F(i-1, k-w[i]), F(i-1, k))
  }
  return(output)
}
Having a look at the knapsack function of the adagio package should help you; there, w is the vector of weights, p the vector of profits, and cap is your k (see ?knapsack).
knapsack <- function(w, p, cap) {
  n <- length(w)
  x <- logical(n)
  F <- matrix(0, nrow = cap + 1, ncol = n)
  G <- matrix(0, nrow = cap + 1, ncol = 1)
  for (k in 1:n) {
    F[, k] <- G
    H <- c(numeric(w[k]), G[1:(cap + 1 - w[k]), 1] + p[k])
    G <- pmax(G, H)
  }
  fmax <- G[cap + 1, 1]
  f <- fmax
  j <- cap + 1
  for (k in n:1) {
    if (F[j, k] < f) {
      x[k] <- TRUE
      j <- j - w[k]
      f <- F[j, k]
    }
  }
  inds <- which(x)
  wght <- sum(w[inds])
  prof <- sum(p[inds])
  return(list(capacity = wght, profit = prof, indices = inds))
}
However, the problems in your function seem to be:
You did not declare all the objects used in your function (w and v); you should declare them as parameters of your function.
F, which is the name of your function, is called recursively inside it, and in the k < w[i] branch you call F(i-1, w), passing the weight vector w where the remaining capacity k is expected. With that argument the stopping condition (i==0 | k==0) is never reached correctly, so the function never stops processing.
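For instance, applied to the data in the question (my own quick check, not part of the original answer):
knapsack(w, p = v, cap = k)
# expected: profit 220 from items 1 and 4, total weight 14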

Error in seq.default(a, length = max(0, b - a - 1)) : length must be non-negative number

I tried running the code below.
set.seed(307)
y<- rnorm(200)
h2=0.3773427
t=seq(-3.317670, 2.963407, length.out=500)
fit=density(y, bw=h2, n=1024, kernel="epanechnikov")
integrate.xy(fit$x, fit$y, min(fit$x), t[407])
However, I received the following message:
"Error in seq.default(a, length = max(0, b - a - 1)) :
length must be non-negative number"
I am not sure what's wrong.
I do not encounter any problem when I use t[406] or t[408], as follows:
integrate.xy(fit$x, fit$y, min(fit$x), t[406])
integrate.xy(fit$x, fit$y, min(fit$x), t[408])
Does anyone know what the problem is and how to fix it? I'd appreciate your help. Thanks!
I went through the source code of the integrate.xy function, and there seems to be a bug relating to the usage of the xtol argument.
For reference, here is the source code of the integrate.xy function:
function (x, fx, a, b, use.spline = TRUE, xtol = 2e-08)
{
    dig <- round(-log10(xtol))
    f.match <- function(x, table) match(signif(x, dig), signif(table, dig))
    if (is.list(x)) {
        fx <- x$y
        x <- x$x
        if (length(x) == 0)
            stop("list 'x' has no valid $x component")
    }
    if ((n <- length(x)) != length(fx))
        stop("'fx' must have same length as 'x'")
    if (is.unsorted(x)) {
        i <- sort.list(x)
        x <- x[i]
        fx <- fx[i]
    }
    if (any(i <- duplicated(x))) {
        n <- length(x <- x[!i])
        fx <- fx[!i]
    }
    if (any(diff(x) == 0))
        stop("bug in 'duplicated()' killed me: have still multiple x[]!")
    if (missing(a))
        a <- x[1]
    else if (any(a < x[1]))
        stop("'a' must NOT be smaller than min(x)")
    if (missing(b))
        b <- x[n]
    else if (any(b > x[n]))
        stop("'b' must NOT be larger than max(x)")
    if (length(a) != 1 && length(b) != 1 && length(a) != length(b))
        stop("'a' and 'b' must have length 1 or same length !")
    else {
        k <- max(length(a), length(b))
        if (any(b < a))
            stop("'b' must be elementwise >= 'a'")
    }
    if (use.spline) {
        xy <- spline(x, fx, n = max(1024, 3 * n))
        if (xy$x[length(xy$x)] < x[n]) {
            if (TRUE)
                cat("working around spline(.) BUG --- hmm, really?\n\n")
            xy$x <- c(xy$x, x[n])
            xy$y <- c(xy$y, fx[n])
        }
        x <- xy$x
        fx <- xy$y
        n <- length(x)
    }
    ab <- unique(c(a, b))
    xtol <- xtol * max(b - a)
    BB <- abs(outer(x, ab, "-")) < xtol
    if (any(j <- 0 == apply(BB, 2, sum))) {
        y <- approx(x, fx, xout = ab[j])$y
        x <- c(ab[j], x)
        i <- sort.list(x)
        x <- x[i]
        fx <- c(y, fx)[i]
        n <- length(x)
    }
    ai <- rep(f.match(a, x), length = k)
    bi <- rep(f.match(b, x), length = k)
    dfx <- fx[-c(1, n)] * diff(x, lag = 2)
    r <- numeric(k)
    for (i in 1:k) {
        a <- ai[i]
        b <- bi[i]
        r[i] <- (x[a + 1] - x[a]) * fx[a] + (x[b] - x[b - 1]) * fx[b] +
            sum(dfx[seq(a, length = max(0, b - a - 1))])
    }
    r/2
}
The value given to the xtol argument is overwritten in the line xtol <- xtol * max(b - a), but the value of the dig variable is calculated from the original xtol, as given in the input to the function. Because of this mismatch, the f.match function, in the line bi <- rep(f.match(b, x), length = k), finds no matches between x and b (i.e., it returns NA). This is what produces the error you encountered.
A simple fix, at least for the case in question, would be to remove the xtol <- xtol * max(b - a) line. But you should also file a bug report with the maintainer of this package so the issue gets a more rigorous fix.
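In code, that simple fix can be applied to a session-local copy without editing the package (my own sketch, assuming integrate.xy comes from the sfsmisc package):
library(sfsmisc)
integrate.xy.fixed <- integrate.xy
bd <- as.list(body(integrate.xy.fixed))
# drop the statement that rescales xtol
keep <- !vapply(bd, identical, logical(1), quote(xtol <- xtol * max(b - a)))
body(integrate.xy.fixed) <- as.call(bd[keep])
integrate.xy.fixed(fit$x, fit$y, min(fit$x), t[407])  # should no longer error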
