Improving my R function performance with VCM - r

I am running a simulation using Varying-Coefficient Models, yet there are some adjustments. There is no R package that can do what I am looking for.
My code is not running fast enough. I am looking forward to making vcm function run faster
###########################################################################
###########################################################################
### ###
### EPANECHNIKOV FUNCTION ###
### ###
###########################################################################
###########################################################################
epan <- function(t,h){
idx = 0.75 * (1 - (t/h)**2) / h
kernal = 0.50 * (abs(idx) + idx)
kernal
}
###########################################################################
###########################################################################
### ###
### UNPENALIZED ###
### VARYING COEFFICIENT MODEL ###
### ###
###########################################################################
###########################################################################
vcm <- function(x,y,z,z0) {
n = dim(x)[1]
p = dim(x)[2]
n0 = length(z0)
Z = outer(z,z0,"-")
Width = sd(z) * n**(-0.2) * 2
H = sapply(X = 1:n0, FUN = function(X) epan(t = Z[,X], h = Width))
diag(H) = 0
W_h = H / rep(colSums(H), each = n0)
G = lapply(X = 1:n0, FUN = function(X) cbind(x, Z[,X]*x))
AB = matrix(NA, n0, 2*p)
II = 1e-4 * diag(2*p) # to avoid singularity
for(i in 1:n0) {
AB[i,] = solve(crossprod(G[[i]] * W_h[,i], G[[i]]) + II) %*% crossprod(G[[i]] * W_h[,i], y)
}
AB
}
What I have done so far is
Profile the code and see where the slow part
Used sapply and lapply instead of for loop, yet no significant difference
How to use the code? Here is a small simulation where the code functions are used.
n = 100000
p = 5
n0 = 1000
z = runif(n)
z0 = seq(0.05, 0.95, length.out = n0)
x = MASS::mvrnorm(n, rep(0,p), diag(p))
gz = cbind(2*sin(2*pi*z), 3*z*(1-2*z), exp(-2*z + z**2), 2*z, 0)
y = apply(x * gz, 1, sum) + rnorm(n)
vvc_m = vcm(x,y,z,z0)
I am willing to use Rcpp or any other libraries if they would significantly improve my code's performance even though I have no experience with Rcpp.
Your help is appreciated!

The apply and lapply are not needed. Also, G[[i]] * W_h[,i] needs to be computed only once. These changes will shave off a few seconds, but the bulk of the time is spent in the for loop. You are probably correct that any gains there will have to be with Rcpp/RcppArmadillo.
vcm2 <- function(x,y,z,z0) {
n = dim(x)[1]
p = dim(x)[2]
n0 = length(z0)
Z = outer(z,z0,"-")
Width = sd(z) * n**(-0.2) * 2
H = epan(Z, Width)
diag(H) = 0
W_h = H / rep(colSums(H), each = n0)
AB = matrix(NA, n0, 2*p)
II = 1e-4 * diag(2*p) # to avoid singularity
G = matrix(x, n, 2*p)
idx = (p + 1):(2*p)
for(i in 1:n0) {
G[,idx] = Z[,i]*x
GW_h = G*W_h[,i]
AB[i,] = solve(crossprod(GW_h, G) + II) %*% crossprod(GW_h, y)
}
AB
}
system.time(vvc_m <- vcm(x,y,z,z0))
#> user system elapsed
#> 21.71 5.42 27.14
system.time(vvc_m2 <- vcm2(x,y,z,z0))
#> user system elapsed
#> 19.45 3.52 22.99
identical(vvc_m, vvc_m2)
#> [1] TRUE

Related

REBayes Error in KWDual MKS_RES_TERM_STALL

I am trying to run the following simulation below. Note that this does require Mosek and RMosek to be installed!
I keep getting the error
Error in KWDual(A, d, w, ...) :
Mosek error: MSK_RES_TRM_STALL: The optimizer is terminated due to slow progress.
How can I resolve the MSK_RES_TRM_STALL error?
Further Research
When looking up the documentation for this I found this:
The optimizer is terminated due to slow progress.
Stalling means that numerical problems prevent the optimizer from making reasonable progress and that it makes no sense to continue. In many cases this happens if the problem is badly scaled or otherwise ill-conditioned. There is no guarantee that the solution will be feasible or optimal. However, often stalling happens near the optimum, and the returned solution may be of good quality. Therefore, it is recommended to check the status of the solution. If the solution status is optimal the solution is most likely good enough for most practical purposes.
Please note that if a linear optimization problem is solved using the interior-point optimizer with basis identification turned on, the returned basic solution likely to have high accuracy, even though the optimizer stalled.
Some common causes of stalling are a) badly scaled models, b) near feasible or near infeasible problems.
So I checked the final value A, but nothing was in it. I found that if I change the simulations from 1000 to 30 I do get values (A <- sim1(30, 30, setting = 1)), but this is suboptimal.
Reproducible Script
KFE <- function(y, T = 300, lambda = 1/3){
# Kernel Fourier Estimator: Stefanski and Carroll (Statistics, 1990)
ks <- function(s,x) exp(s^2/2) * cos(s * x)
K <- function(t, y, lambda = 1/3){
k <- y
for(i in 1:length(y)){
k[i] <- integrate(ks, 0, 1/lambda, x = (y[i] - t))$value/pi
}
mean(k)
}
eps <- 1e-04
if(length(T) == 1) T <- seq(min(y)-eps, max(y)+eps, length = T)
g <- T
for(j in 1:length(T))
g[j] <- K(T[j], y, lambda = lambda)
list(x = T, y = g)
}
BDE <- function(y, T = 300, df = 5, c0 = 1){
# Bayesian Deconvolution Estimator: Efron (B'ka, 2016)
require(splines)
eps <- 1e-04
if(length(T) == 1) T <- seq(min(y)-eps, max(y)+eps, length = T)
X <- ns(T, df = df)
a0 <- rep(0, ncol(X))
A <- dnorm(outer(y,T,"-"))
qmle <- function(a, X, A, c0){
g <- exp(X %*% a)
g <- g/sum(g)
f <- A %*% g
-sum(log(f)) + c0 * sum(a^2)^.5
}
ahat <- nlm(qmle, a0, X=X, A=A, c0 = c0)$estimate
g <- exp(X %*% ahat)
g <- g/integrate(approxfun(T,g),min(T),max(T))$value
list(x = T,y = g)
}
W <- function(G, h, interp = FALSE, eps = 0.001){
#Wasserstein distance: ||G-H||_W
H <- cumsum(h$y)
H <- H/H[length(H)]
W <- integrate(approxfun(h$x, abs(G(h$x) - H)),min(h$x),max(h$x))$value
list(W=W, H=H)
}
biweight <- function(x0, x, bw){
t <- (x - x0)/bw
(1-t^2)^2*((t> -1 & t<1)-0) *15/16
}
Wasser <- function(G, h, interp = FALSE, eps = 0.001, bw = 0.7){
#Wasserstein distance: ||G-H||_W
if(interp == "biweight"){
yk = h$x
for (j in 1:length(yk))
yk[j] = sum(biweight(h$x[j], h$x, bw = bw)*h$y/sum(h$y))
H <- cumsum(yk)
H <- H/H[length(H)]
}
else {
H <- cumsum(h$y)
H <- H/H[length(H)]
}
W <- integrate(approxfun(h$x, abs(G(h$x) - H)),min(h$x),max(h$x),
rel.tol = 0.001, subdivisions = 500)$value
list(W=W, H=H)
}
sim1 <- function(n, R = 10, setting = 0){
A <- matrix(0, 4, R)
if(setting == 0){
G0 <- function(t) punif(t,0,6)/8 + 7 * pnorm(t, 0, 0.5)/8
rf0 <- function(n){
s <- sample(0:1, n, replace = TRUE, prob = c(1,7)/8)
rnorm(n) + (1-s) * runif(n,0,6) + s * rnorm(n,0,0.5)
}
}
else{
G0 <- function(t) 0 + 7 * (t > 0)/8 + (t > 2)/8
rf0 <- function(n){
s <- sample(0:1, n, replace = TRUE, prob = c(1,7)/8)
rnorm(n) + (1-s) * 2 + s * 0
}
}
for(i in 1:R){
y <- rf0(n)
g <- BDE(y)
Wg <- Wasser(G0, g)
h <- GLmix(y)
Wh <- Wasser(G0, h)
Whs <- Wasser(G0, h, interp = "biweight")
k <- KFE(y)
Wk <- Wasser(G0, k)
A[,i] <- c(Wg$W, Wk$W, Wh$W, Whs$W)
}
A
}
require(REBayes)
set.seed(12)
A <- sim1(1000, 1000, setting = 1)
I ran the code and indeed it stalls at the end, but the solution is not any worse than in the preceding cases that solve without stall:
17 1.7e-07 3.1e-10 6.8e-12 1.00e+00 5.345949918e+00 5.345949582e+00 2.4e-10 0.40
18 2.6e-08 3.8e-11 2.9e-13 1.00e+00 5.345949389e+00 5.345949348e+00 2.9e-11 0.41
19 2.6e-08 3.8e-11 2.9e-13 1.00e+00 5.345949389e+00 5.345949348e+00 2.9e-11 0.48
20 2.6e-08 3.8e-11 2.9e-13 1.00e+00 5.345949389e+00 5.345949348e+00 2.9e-11 0.54
Optimizer terminated. Time: 0.62
Interior-point solution summary
Problem status : PRIMAL_AND_DUAL_FEASIBLE
Solution status : OPTIMAL
Primal. obj: 5.3459493890e+00 nrm: 6e+00 Viol. con: 2e-08 var: 0e+00 cones: 4e-09
Dual. obj: 5.3459493482e+00 nrm: 7e-01 Viol. con: 1e-11 var: 4e-11 cones: 0e+00
A quick hack for now that worked for me is to relax the termination tolerances a little bit in the call to GLmix:
control <- list()
control$dparam <- list(INTPNT_CO_TOL_REL_GAP=1e-7,INTPNT_CO_TOL_PFEAS=1e-7,INTPNT_CO_TOL_DFEAS=1e-7)
h <- GLmix(y,control=control,verb=5)
A better solution as I indicated in the comments is not to treat the stall termination code as an error by the REBayes package but use solution status/quality instead.
I have modified the return from KWDual to avoid such messages provided that
the status sol$itr$solsta from Mosek is "Optimal" in REBayes v2.2 now on CRAN.

Iterative optimization of alternative glm family

I'm setting up an alternative response function to the commonly used exponential function in poisson glms, which is called softplus and defined as $\frac{1}{c} \log(1+\exp(c \eta))$, where $\eta$ corresponds to the linear predictor $X\beta$
I already managed optimization by setting parameter $c$ to arbitrary fixed values and only searching for $\hat{\beta}$.
BUT now for the next step I have to optimize this parameter $c$ as well (iteratively changing between updated $\beta$ and current $c$).
I tried to write a log-lik function, score function and then setting up a Newton Raphson optimization (using a while loop)
but I don't know how to seperate the updating of c in an outer step and updating \beta in an inner step..
Are there any suggestions?
# Response function:
sp <- function(eta, c = 1 ) {
return(log(1 + exp(abs(c * eta)))/ c)
}
# Log Likelihood
l.lpois <- function(par, y, X){
beta <- par[1:(length(par)-1)]
c <- par[length(par)]
l <- rep(NA, times = length(y))
for (i in 1:length(l)){
l[i] <- y[i] * log(sp(X[i,]%*%beta, c)) - sp(X[i,]%*%beta, c)
}
l <- sum(l)
return(l)
}
# Score function
score <- function(y, X, par){
beta <- par[1:(length(par)-1)]
c <- par[length(par)]
s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))
for (i in 1:length(y)){
s[,i] <- c(X[i,], 1) * (y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
}
score <- rep(NA, times = nrow(s))
for (j in 1:length(score)){
score[j] <- sum(s[j,])
}
return(score)
}
# Optimization function
opt <- function(y, X, b.start, eps=0.0001, maxiter = 1e5){
beta <- b.start[1:(length(b.start)-1)]
c <- b.start[length(b.start)]
b.old <- b.start
i <- 0
conv <- FALSE
while(conv == FALSE){
eta <- X%*%b.old[1:(length(b.old)-1)]
s <- score(y, X, b.old)
h <- numDeriv::hessian(l.lpois,b.old,y=y,X=X)
invh <- solve(h)
# update
b.new <- b.old + invh %*% s
i <- i + 1
# Test
if(any(is.nan(b.new))){
b.new <- b.old
warning("convergence failed")
break
}
# convergence reached?
if(sqrt(sum((b.new - b.old)^2))/sqrt(sum(b.old^2)) < eps | i >= maxiter){
conv <- TRUE
}
b.old <- b.new
}
eta <- X%*%b.new[1:(length(b.new)-1)]
# covariance
invh <- solve(numDeriv::hessian(l.lpois,b.new,y=y,X=X))
fitted <- sp(eta, b.new[length(b.new)])
result <- list("coefficients" = c(beta = b.new),
"fitted.values" = fitted,
"covariance" = invh)
}
# Running fails ..
n <- 100
x <- runif(n, 0, 1)
Xdes <- cbind(1, x)
eta <- 1 + 2 * x
y <- rpois(n, sp(eta, c = 1))
opt(y,Xdes,c(0,1,1))
You have 2 bugs:
line 25:
(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
this returns matrix so you must convert to numeric:
as.numeric(y[i] * plogis(c * X[i,]%*%beta) / sp(X[i,]%*%beta, c) - plogis(c * X[i,]%*%beta))
line 23:
) is missing:
you have:
s <- matrix(rep(NA, times = length(y)*length(par), ncol = length(y))
while it should be:
s <- matrix(rep(NA, times = length(y)*length(par)), ncol = length(y))

Parallel big matrix multiplication

I need to multiply two big matrices A and B as follow:
library(bigmemory)
library(bigalgebra)
library(biganalytics)
A <- big.matrix( replicate(100, rnorm(10^5)) )
B <- big.matrix( replicate(10^5, rnorm(100)) )
AB <- A %*% B
How could I compute this multiplication in parallel?
The only tutorial I've come across so far is this one:
> library("doRedis")
> registerDoRedis(queue="example")
> L = foreach(j=1:2,.packages="VAM",.combine=c) %dopar%
+ {
+ key = paste("X",j,sep="")
+ ridx = ((j-1)*5 + 1):min((j*5),nrow(A))
+ X = A[ridx,] %*% B[,]
+ Y = as.big.matrix(X,backingfile=key)
+ vnew(Y, key)
+ key
+ }
> X = vam(matrix(L,nrow=2))
> sum(X[,] - A[,] %*% B[,])
[1] 0
But I'm not sure how to put it into practice. There may also be a simpler/more efficient way to achieve the same result?
Installing Microsoft R Open, I go from 3 sec to 0.1 sec!
library(bigmemory)
library(bigalgebra)
N <- 200
M <- 1e5
A <- big.matrix(N, M, init = rnorm(N * M))
B <- big.matrix(M, N, init = rnorm(N * M))
system.time(AB <- A %*% B)

Markowitz model / portfolio optimization using local search in R

I am taking baby steps to use metaheuristics for solving constrained optimization problems. I am trying to solve basic Markowitz Mean-Variance optimization model (given below) using NMOFpackage in R.
Min
lambda * [sum{i=1 to N}sum{j = 1 to N}w_i*w_i*Sigma_ij] - (1-lambda) * [sum{i=1 to N}(w_i*mu_i)]
subject to
sum{i=1 to N}{w_i} = 1
0 <= w_i <= 1; i = 1,...,N
where, lambda takes values between 0 and 1, N is number of assets.
Following is my code (Based on Book: Numerical Methods and Optimization in Finance):
library(NMOF)
na <- dim(fundData)[2L]
ns <- dim(fundData)[1L]
Sigma <- cov(fundData)
winf <- 0.0
wsup <- 1.0
m <- colMeans(fundData)
resample <- function(x,...) x[sample.int(length(x),...)]
data <- list(R = t(fundData),
m = m,
na = dim(fundData)[2L],
ns = dim(fundData)[1L],
Sigma = Sigma,
eps = 0.5/100,
winf = winf,
wsup = wsup,
nFP = 100)
w0 <- runif(data$na); w0 <- w0/sum(w0)
OF <- function(w,data){
wmu <- crossprod(w,m)
res <- crossprod(w, data$Sigma)
res <- tcrossprod(w,res)
result <- res - wmu
}
neighbour <- function(w, data){
toSell <- w > data$winf
toBuy <- w < data$wsup
i <- resample(which(toSell), size = 1L)
j <- resample(which(toBuy), size = 1L)
eps <- runif(1) * data$eps
eps <- min(w[i] - data$winf, data$wsup - w[j], eps)
w[i] <- w[i] - eps
w[j] <- w[j] + eps
w
}
algo <- list(x0 = w0, neighbour = neighbour, nS = 5000L)
system.time(sol1 <- LSopt(OF, algo, data))
I am not sure how to include lambda in the objective function (OF). The above code does not include lambda in OF. I tried using for loop but it resulted in following error:
OF <- function(w,data){
lambdaSeq <- seq(.001,0.999, length = data$nFP)
for(lambda in lambdaSeq){
wmu <- crossprod(w,m)
res <- crossprod(w, data$Sigma)
res <- tcrossprod(w,res)
result <- lambda*res - (1-lambda)*wmu
}
}
Error:
Local Search.
Initial solution:
| | 0%
Error in if (xnF <= xcF) { : argument is of length zero
Timing stopped at: 0.01 0 0.03
It would be nice if someone could help me in this regard.
P.S: I am also aware that this can be solved using quadratic programming. This is just an initiation to include other constraints.
If I understand correctly, you want to replicate the mean--variance efficient frontier by Local Search? Then you need to run a Local Search for every value of lambda that you want to include in the frontier.
The following example should help you get going. I start by attaching the package and setting up the list data.
require("NMOF")
data <- list(m = colMeans(fundData), ## expected returns
Sigma = cov(fundData), ## expected var of returns
na = dim(fundData)[2L], ## number of assets
eps = 0.2/100, ## stepsize for LS
winf = 0, ## minimum weight
wsup = 1, ## maximum weight
lambda = 1)
Next I compute a benchmark for the minimum-variance case (i.e. lambda equals one).
## benchmark: the QP solution
## ==> this will only work with a recent version of NMOF,
## which you can get by saying:
## install.packages('NMOF', type = 'source',
## repos = c('http://enricoschumann.net/R',
## getOption('repos')))
##
require("quadprog")
sol <- NMOF:::minvar(data$Sigma, 0, 1)
Objective function and neighbourhood function. I have slightly simplified both functions (for clarity; using crossprod in the objective function would probably be more efficient).
OF <- function(w, data){
data$lambda * (w %*% data$Sigma %*% w) -
(1 - data$lambda) * sum(w * data$m)
}
neighbour <- function(w, data){
toSell <- which(w > data$winf)
toBuy <- which(w < data$wsup)
i <- toSell[sample.int(length(toSell), size = 1L)]
j <- toBuy[sample.int(length(toBuy), size = 1L)]
eps <- runif(1) * data$eps
eps <- min(w[i] - data$winf, data$wsup - w[j], eps)
w[i] <- w[i] - eps
w[j] <- w[j] + eps
w
}
Now we can run Local Search. Since it is a fairly large dataset (200 assets),
you will need a relatively large number of steps to reproduce the QP solution.
w0 <- runif(data$na) ## a random initial solution
w0 <- w0/sum(w0)
algo <- list(x0 = w0, neighbour = neighbour, nS = 50000L)
sol1 <- LSopt(OF, algo, data)
You can compare the weights you get from Local Search with the QP solution.
par(mfrow = c(3,1), mar = c(2,4,1,1), las = 1)
barplot(sol, main = "QP solution")
barplot(sol1$xbest, main = "LS solution")
barplot(sol - sol1$xbest,
ylim = c(-0.001,0.001)) ## +/-0.1%
Finally, if you want to compute the whole frontier, you need to rerun this code for different levels of data$lambda.

Optimisation in R using Ucminf package

I am not able to apply ucminf function to minimise my cost function in R.
Here is my cost function:
costfunction <- function(X,y,theta){
m <- length(y);
J = 1/m * ((-t(y)%*%log(sigmoid(as.matrix(X)%*%as.matrix(theta)))) - ((1-t(y))%*%log(1-sigmoid(as.matrix(X)%*%as.matrix(theta)))))
}
Here is my sigmoid function:
sigmoid <- function(t){
g = 1./(1+exp(-t))
}
Here is my gradient function:
gradfunction <- function(X,y,theta){
grad = 1/ m * t(X) %*% (sigmoid(as.matrix(X) %*% as.matrix(theta) - y));
}
I am trying to do the following:
library("ucminf")
data <- read.csv("ex2data1.txt",header=FALSE)
X <<- data[,c(1,2)]
y <<- data[,3]
qplot(X[,1],X[,2],colour=factor(y))
m <- dim(X)[1]
n <- dim(X)[2]
X <- cbind(1,X)
initial_theta <<- matrix(0,nrow=n+1,ncol=1)
cost <- costfunction(X,y,initial_theta)
grad <- gradfunction(X,y,initial_theta)
This is where I want to call ucminf to find the minimum cost and values of theta. I am not sure how to do this.
Looks like you are trying to do the week2 problem of the machine learning course of Coursera.
No need to use ucminf packages here, you can simply use the R function optim it works
We will define the sigmoid and cost function first.
sigmoid <- function(z)
1 / (1 + exp(-z))
costFunction <- function(theta, X, y) {
m <- length(y)
J <- -(1 / m) * crossprod(c(y, 1 - y),
c(log(sigmoid(X %*% theta)), log(1 - sigmoid(X %*% theta))))
grad <- (1 / m) * crossprod(X, sigmoid(X %*% theta) - y)
list(J = J, grad = grad)
}
Let's load the data now, to make this code it reproductible, I put the data in my dropbox.
download.file("https://dl.dropboxusercontent.com/u/8750577/ex2data1.txt",
method = "curl", destfile = "/tmp/ex2data1.txt")
data <- matrix(scan('/tmp/ex2data1.txt', what = double(), sep = ","),
ncol = 3, byrow = TRUE)
X <- data[, 1:2]
y <- data[, 3, drop = FALSE]
m <- nrow(X)
n <- ncol(X)
X <- cbind(1, X)
initial_theta = matrix(0, nrow = n + 1)
We can then compute the result of the cost function at the initial theta like this
cost <- costFunction(initial_theta, X, y)
(grad <- cost$grad)
## [,1]
## [1,] -0.100
## [2,] -12.009
## [3,] -11.263
(cost <- cost$J)
## [,1]
## [1,] 0.69315
Finally we can use optim to ge the optimal theta
res <- optim(par = initial_theta,
fn = function(t) costFunction(t, X, y)$J,
gr = function(t) costFunction(t, X, y)$grad,
method = "BFGS", control = list(maxit = 400))
(theta <- res$par)
## [,1]
## [1,] -25.08949
## [2,] 0.20566
## [3,] 0.20089
(cost <- res$value)
## [1] 0.2035
If you have some problem with the function download.file, the data can be downloaded
here
As you did not provide a reproducible example it is hard to exactly give you the code you need, but the general idea is to hand the functions over to ucminf:
ucminf(start, costfunction, gradfunction, y = y, theta = initial_theta)
Note that start needs to be a vector of initial starting values which when handed over as X to the two functions need to produce a result. Usually you use random starting value (e.g., runif).

Resources