Why is the Portmanteau test slower in Rcpp than in R?

I need to do a power study of the main Portmanteau tests (Article), evaluating them under different scenarios, sample sizes, and ARMA(p, q) models, which gives 180 scenarios and takes me close to 6 hours. I programmed my function in both R and Rcpp, but to my surprise the C++ version is slower. My question is: why?
My R Code:
Portmanteau <- function(x, h = 1, type = c("Box-Pierce", "Ljung-Box", "Monti"), fitdf = 0) {
    type <- match.arg(type)
    Ti <- length(x)
    df <- h - fitdf
    ri <- acf(x, lag.max = h, plot = FALSE, na.action = na.pass)
    pi <- pacf(x, lag.max = h, plot = FALSE, na.action = na.pass)
    if (type == "Monti") {d <- 0} else {d <- 1}
    if (type == "Box-Pierce") {wi <- 1} else {wi <- (Ti + 2) / seq(Ti - 1, Ti - h)}
    Q <- Ti * (d * sum(wi * ri$acf[-1]^2) + (1 - d) * sum(wi * pi$acf^2))
    pv <- pchisq(Q, df, lower.tail = FALSE)
    result <- cbind(Statistic = Q, df, p.value = pv)
    rownames(result) <- paste(type, "test")
    return(result)
}
My Rcpp code:
#include <Rcpp.h>
#include <cstring>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector PortmanteauC(NumericVector x, int h = 1, const char* type = "Box-Pierce", int fitdf = 0) {
    Environment stats("package:stats");
    Function acf = stats["acf"];
    Function pacf = stats["pacf"];
    Function na_pass = stats["na.pass"];
    List ri = acf(x, h, "correlation", false, na_pass);
    List pi = pacf(x, h, false, na_pass);
    int Ti = x.size();
    int df = h - fitdf;
    double d;
    NumericVector wi;
    NumericVector rk = ri["acf"];
    NumericVector pk = pi["acf"];
    NumericVector S(h);
    for (int i = 0; i < h; ++i) { S[i] = Ti - i - 1; }
    rk.erase(0);
    if (strcmp(type, "Monti") == 0) { d = 0; } else { d = 1; }
    if (strcmp(type, "Box-Pierce") == 0) { wi = rep(1, h); } else { wi = (Ti + 2) / S; }
    double Q = Ti * (d * sum(wi * pow(rk, 2)) + (1 - d) * sum(wi * pow(pk, 2)));
    double pv = R::pchisq(Q, df, 0, false);
    NumericVector result(3);
    result[0] = Q;
    result[1] = df;
    result[2] = pv;
    return result;
}
Example
set.seed(1)
y = arima.sim(model = list(ar = 0.5), n = 250)
mod = arima(y, order = c(1,0,0))
res = mod$residuals
Box-Pierce
library(rbenchmark)
benchmark(PortmanteauC(res, h = 10, type = "Box-Pierce", fitdf = 1),
          Portmanteau(res, h = 10, type = "Box-Pierce", fitdf = 1),
          Box.test(res, lag = 10, type = "Box-Pierce", fitdf = 1),
          replications = 500)[, 1:4]
test replications elapsed relative
3 Box.test(res, lag = 10, type = "Box-Pierce", fitdf = 1) 500 0.17 1.000
2 Portmanteau(res, h = 10, type = "Box-Pierce", fitdf = 1) 500 0.44 2.588
1 PortmanteauC(res, h = 10, type = "Box-Pierce", fitdf = 1) 500 1.82 10.706
Ljung-Box
benchmark(Box.test(res, lag = 5, type = "Ljung-Box", fitdf = 1),
          Portmanteau(res, h = 5, type = "Ljung-Box", fitdf = 1),
          PortmanteauC(res, h = 5, type = "Ljung-Box", fitdf = 1),
          replications = 500)[, 1:4]
test replications elapsed relative
1 Box.test(res, lag = 5, type = "Ljung-Box", fitdf = 1) 500 0.17 1.000
2 Portmanteau(res, h = 5, type = "Ljung-Box", fitdf = 1) 500 0.45 2.647
3 PortmanteauC(res, h = 5, type = "Ljung-Box", fitdf = 1) 500 1.84 10.824
I would have expected Rcpp to be much faster than the byte-compiled R.

Let's analyze the performance properties of your R code. Since an individual call is so fast that the sampling profiler provided by R cannot be used easily, I simply use repeat() to repeat the code until interrupted:
set.seed(1)
profvis::profvis({
    repeat({
        y <- arima.sim(model = list(ar = 0.5), n = 250)
        mod <- arima(y, order = c(1, 0, 0))
        res <- mod$residuals
        Portmanteau(res, h = 10, type = "Box-Pierce", fitdf = 1)
    })
})
I let it run for about 49 s. From the profiling output displayed in RStudio (the flame graph is not reproduced here), we learn the following:
arima() takes about seven times longer than Portmanteau(). Depending on the ratio of calls between these two functions, you might be optimizing the wrong function.
For the Portmanteau() call, almost all of the time is spent in pacf() and acf(). These R functions are also used within your Rcpp code, but with the additional complexity of going back to R from C++. This explains why your C++ code is slower than your R code.
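For the Rcpp version to pay off, the autocorrelations themselves would have to be computed in C++ instead of calling back into acf()/pacf(). As a rough sketch of what that inner computation looks like (written here in R; acf_raw is a name introduced for illustration, and it assumes a complete series without missing values):

# minimal sketch: first h autocorrelations without acf()'s overhead
# (matches acf()'s r_k = c_k / c_0 when there are no NAs)
acf_raw <- function(x, h) {
    n <- length(x)
    xc <- x - mean(x)
    denom <- sum(xc^2)
    vapply(seq_len(h), function(k) {
        sum(xc[1:(n - k)] * xc[(k + 1):n]) / denom
    }, numeric(1))
}

all.equal(acf_raw(res, 10),
          as.vector(acf(res, lag.max = 10, plot = FALSE)$acf[-1]))
# should be TRUE

Porting this loop (and the corresponding partial-autocorrelation recursion) to C++ is what would remove the R round-trips.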


Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same very thin set of quantiles. I can do it but it seems to require lots of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to compute the quantile function of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...) {
    G <- get(paste("p", spec, sep = ""), mode = "function")
    Gin <- get(paste("q", spec, sep = ""), mode = "function")
    tt <- Gin(G(a, ...) + p * (G(b, ...) - G(a, ...)), ...)
    return(tt)
}
where spec can be the name of any untruncated distribution for which code in R exists, and the ... argument is used to provide the names of the parameters of that untruncated distribution.
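For example, the median of a standard normal truncated to [0, Inf) lands at the 75th percentile of the untruncated normal:

qtrunc(0.5, "norm", a = 0, mean = 0, sd = 1)
# [1] 0.6744898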
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2- ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation point, h is the upper one, and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
                   fn = fit_gamma,
                   l = l,
                   h = h,
                   t1 = t1,
                   t2 = t2,
                   method = "L-BFGS-B",
                   lower = c(1.01, 1.4))
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec as it were so that the parameter names do not have to be written out by me.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13) {
    args <- c(x, list(spec = spec, a = l, b = h))
    ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
    ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
    dist <- vector(mode = "numeric", length = 2)
    dist[1] <- (t1 - ct1)^2
    dist[2] <- (t2 - ct2)^2
    return(sqrt(sum(dist)))
}
This is called as follows and gives the same result as your original function:
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371

R: Changing values of variables after a certain time

The question I am trying to ask is: how do I change one of the values of my variables (noted as LO$M in my list) after a certain time has passed?
What I am trying to achieve is that after 20,000 seconds have passed, the value of Lac changes to its value at time 20,000 plus 10,000.
So at t = 20,000, Lac = Lac + 10,000.
The issue I am having with my code is that within my if statement I have if tt >= 20000, but this leads to every value of Lac after 20,000 being increased by 10,000, when what I want is for only the FIRST value after 20,000 to be increased by 10,000.
Basically, after 20,000 seconds of my experiment have passed, I am trying to inject 10,000 more Lac into the experiment.
My code is given below:
LO = list()
LO$M = c(i = 1, ri = 0, I = 50, Lac = 20, ILac = 0, o = 1, Io = 0, RNAP = 100, RNAPo = 0, r = 0, z = 0)
LO$Pre = matrix(c(1,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,0,0,0,0,0,0,0,
0,0,1,1,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,
0,0,1,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,
0,0,0,0,0,1,0,1,0,0,0,
0,0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,1,0,
0,0,0,1,0,0,0,0,0,0,1,
0,1,0,0,0,0,0,0,0,0,0,
0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,0,0,0,1), ncol=11, byrow=TRUE)
LO$Post = matrix(c(1,1,0,0,0,0,0,0,0,0,0,
0,1,1,0,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,
0,0,1,1,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,
0,0,1,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,1,0,1,0,0,0,
0,0,0,0,0,1,0,1,0,1,0,
0,0,0,0,0,0,0,0,0,1,1,
0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,
0,0,0,1,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0), ncol=11, byrow=TRUE)
LO$h = function(x, t, th = c(0.02, 0.1, 0.005, 0.1, 1, 0.01, 0.1, 0.01, 0.03, 0.1, 1e-05, 0.01, 0.002, 0.01, 0.001))
{
    with(as.list(c(x, th)), {
        return(c(th[1]*i, th[2]*ri, th[3]*I*Lac, th[4]*ILac, th[5]*I*o, th[6]*Io, th[7]*o*RNAP,
                 th[8]*RNAPo, th[9]*RNAPo, th[10]*r, th[11]*Lac*z, th[12]*ri, th[13]*I,
                 th[13]*ILac, th[14]*r, th[15]*z))
    })
}
gillespie1 = function (N, n, ...)
{
    tt = 0
    x = N$M
    S = t(N$Post - N$Pre)
    u = nrow(S)
    v = ncol(S)
    tvec = vector("numeric", n)
    xmat = matrix(ncol = u, nrow = n + 1)
    xmat[1, ] = x
    for (i in 1:n) {
        h = N$h(x, tt, ...)
        tt = tt + rexp(1, sum(h))
        j = sample(v, 1, prob = h)
        x = x + S[, j]
        tvec[i] = tt
        xmat[i + 1, ] = x
        # this is the problematic part: the condition is TRUE on every
        # iteration once tt >= 20000, so 10,000 is added again and again
        if (tt >= 20000) {
            x[4] = x[4] + 10000
        }
    }
    return(list(t = tvec, x = xmat))
}
newout = gillespie1(LO,200000)
matplot(newout$x[,4], type="l", lwd=0.25, col="grey")
I don't have a high enough reputation to attach images, but it should look something like this:
https://gyazo.com/0ffd940a22df23b2ccfdf4a17e85dca8
Sorry if this isn't clear. Thanks
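(For reference, the usual way to make such an injection fire only once is to remember it in a logical flag; `injected` below is an illustrative name, shown with a toy loop rather than the full Gillespie code:)

tt <- 0
x4 <- 20
injected <- FALSE          # illustrative flag, not in the original code
for (i in 1:10) {
    tt <- tt + 5000
    if (tt >= 20000 && !injected) {
        x4 <- x4 + 10000   # fires exactly once, at the first step past 20,000
        injected <- TRUE
    }
}
x4
# [1] 10020

Guarding the if block inside gillespie1() the same way would make the injection happen only on the first step past t = 20,000.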
In this example, you have the function myTask(). When you call execMyTask(), you will execute myTask() once, and after that it will be executed at random intervals of between 1 and max_wait milliseconds. When you get tired, you can kill the task with tclTaskDelete().
library(tcltk2)

myTask <- function() cat("some task!\n")

id = "execMyTask"
execMyTask <- function(max_wait = 3000) {
    id <- toString(match.call()[[1]])
    myTask()
    wait = sample(1:max_wait, 1)
    cat("Waiting", wait, "milliseconds\n") # replace with your function
    if (is.null(tclTaskGet(id))) {
        tclTaskSchedule(wait = wait, execMyTask(), id = id, redo = TRUE)
    } else {
        tclTaskChange(wait = wait, execMyTask(), id = id, redo = TRUE)
    }
}
execMyTask()
tclTaskDelete(id)
So far, there is one small problem with this approach: we cannot supply arguments to the function fun in tclTaskChange().
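A possible workaround is to capture the arguments in a closure before scheduling, so the scheduled expression itself needs none; make_task and myTask2 below are illustrative names:

# capture the argument in a closure so the scheduled call takes none
make_task <- function(msg) function() cat(msg, "\n")

myTask2 <- make_task("task with a captured argument!")
tclTaskSchedule(wait = 1000, myTask2(), id = "task2", redo = TRUE)
# ... later:
tclTaskDelete("task2")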

specClust() in kknn - arpack iteration limit increase

I am applying spectral clustering to a dataset with 4200 rows and 2 columns.
spec <- specClust(df1, centers=7, nn = 7, method = "symmetric")
I get the error below.
Error in .Call("R_igraph_arpack", func, extra, options, env, sym, PACKAGE = "igraph") :
  At arpack.c:944 : ARPACK error, Maximum number of iterations reached
In addition: Warning message:
In .Call("R_igraph_arpack", func, extra, options, env, sym, PACKAGE = "igraph") :
  At arpack.c:776 : ARPACK solver failed to converge (1001 iterations, 0/7 eigenvectors converged)
How do I increase the number of ARPACK iterations? This doesn't work:
spec <- specClust(df1, centers = 7, nn = 7, method = "symmetric", iter.max = 301000)
Digging into specClust, the ... argument does not pass anything on to the arpack call.
The simplest thing to do, I think, is to copy the specClust code, add a max.iter argument that is passed to the arpack call, and source the function in your script.
specClust2 <- function (data, centers = NULL, nn = 7, method = "symmetric",
                        gmax = NULL, max.iter = 10000, ...)
{
    call = match.call()
    if (is.data.frame(data))
        data = as.matrix(data)
    da = apply(data, 1, paste, collapse = "#")
    indUnique = which(!duplicated(da))
    indAll = match(da, da[indUnique])
    data2 = data
    data = data[indUnique, ]
    n <- nrow(data)
    data = scale(data, FALSE, TRUE)
    if (is.null(gmax)) {
        if (!is.null(centers))
            gmax = centers - 1L
        else gmax = 1L
    }
    test = TRUE
    while (test) {
        DC = mydist(data, nn)
        sif <- rbind(1:n, as.vector(DC[[2]]))
        g <- graph(sif, directed = FALSE)
        g <- decompose(g, min.vertices = 4)
        if (length(g) > 1) {
            if (length(g) >= gmax)
                nn = nn + 2
            else test = FALSE
        }
        else test = FALSE
    }
    W <- DC[[1]]
    n <- nrow(data)
    wi <- W[, nn]
    SC <- matrix(1, nrow(W), nn)
    SC[] <- wi[DC[[2]]] * wi
    W = W^2/SC
    alpha = 1/(2 * (nn + 1))
    qua = abs(qnorm(alpha))
    W = W * qua
    W = dnorm(W, sd = 1)
    DC[[1]] = W
    L = Laplacian(DC, nn, method)
    f <- function(x, extra) as.vector(extra %*% x)
    if (is.null(centers))
        kmax = 25
    else kmax = max(centers)
    ###
    # add the maxiter parameter to the arpack call, below
    ###
    U <- arpack(f, extra = L, options = list(n = n, which = "SM",
        nev = kmax, ncv = 2 * kmax, mode = 1, maxiter = max.iter), sym = TRUE)
    ind <- order(U[[1]])
    U[[2]] = U[[2]][indAll, ind]
    U[[1]] = U[[1]][ind]
    if (is.null(centers)) {
        tmp = which.max(diff(U[[1]])) + 1
        centers = which.min(AUC(U[[1]][1:tmp]))
    }
    if (method == "symmetric") {
        rs = sqrt(rowSums(U[[2]]^2))
        U[[2]] = U[[2]]/rs
    }
    result = kmeans(U[[2]], centers = centers, nstart = 20, ...)
    archeType = getClosest(U[[2]][indAll, ], result$centers)
    result$eigenvalue = U[[1]]
    result$eigenvector = U[[2]]
    result$data = data2
    result$indAll = indAll
    result$indUnique = indUnique
    result$L = L
    result$archetype = archeType
    result$call = call
    class(result) = c("specClust", "kmeans")
    result
}
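One caveat: the copied code calls kknn internals (mydist, Laplacian, AUC, getClosest), so when sourcing it outside the package you may need to point the function at the kknn namespace. A sketch, with df1 being the data from the question:

library(kknn)
library(igraph)

# let the sourced copy resolve kknn's unexported helpers
environment(specClust2) <- asNamespace("kknn")

spec <- specClust2(df1, centers = 7, nn = 7, method = "symmetric",
                   max.iter = 10000)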

arima method in mtsdi

I have a large data set (more than 2000 rows and 2000 variables) with lots of missing values. I am using the mnimput function of the mtsdi package of R to impute all missing values. This is my code:
formula = data
imput_out <- mnimput(formula, data, by = NULL, log = FALSE, log.offset = 1,
                     eps = 1e-3, maxit = 1e2, ts = TRUE, method = "arima",
                     ar.control = list(order = c(1,1,1), period = 4, f.eps = 1e-6,
                                       f.maxit = 1e3, ga.bf.eps = 1e-6,
                                       verbose = TRUE, digits = getOption("digits")))
But I am getting an error
Error in o[1:3, j] : incorrect number of dimensions
Please help me out.
You have to dig deep into the package source to uncover what's going on here.
The order element of ar.control is passed in as a variable o that is indexed by j, running over the columns you put into your formula. So if your formula looks like ~c31+c32+c33, your ar term needs to be 3 columns of (p, d, q) values.
I assigned it outside of the ar.control parameter for ease of editing:
arcontrol <- list(order = cbind(c(1,0,0), c(0,0,1), c(1,0,0)), period = NULL)
mnimput(formula, data, eps = 1e-3, ts = TRUE, method = "arima", ar.control = arcontrol)
Here is the relevant package source, if you are interested:
function (xn, o, s, eps, maxit)
{
    rows <- dim(xn)[1]
    cols <- dim(xn)[2]
    models <- as.list(rep(NA, cols))
    ar.pred <- matrix(NA, nrow = rows, ncol = cols)
    for (j in 1:cols) {
        if (is.null(s)) {
            order <- o[1:3, j]
            seasonal <- list(order = c(0, 0, 0), period = NA)
        }
        else {
            order <- o[1:3, j]
            seasonal <- list(order = o[4:6, j], period = s)
        }
        models[[j]] <- arima(xn[, j], order = order, seasonal = seasonal,
            xreg = NULL, optim.control = list(maxit = maxit, reltol = eps))
        ar.pred[, j] <- xn[, j] - residuals(models[[j]])
    }
    retval <- list(ar.pred = ar.pred, models = models)
    return(retval)
}
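From the o[4:6, j] branch above it follows that, with a seasonal period, each column of order needs 6 values: (p, d, q) stacked on the seasonal (P, D, Q). A sketch for two columns (the orders and period here are placeholders, not recommendations):

# rows 1-3: non-seasonal (p, d, q); rows 4-6: seasonal (P, D, Q)
arcontrol <- list(order = cbind(c(1, 0, 0,  0, 1, 1),
                                c(0, 1, 1,  1, 0, 0)),
                  period = 12)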

Is it a bug in Rglpk?

I used Rglpk to solve a linear programming problem, but its results seem weird. I changed to lpSolve, and the two results are different.
To switch the solver to lpSolve, comment out the Rglpk statements and uncomment the lpSolve statements.
# Lo, S.-F., & Lu, W.-M. (2009). An integrated performance evaluation of financial holding companies in Taiwan.
# European Journal of Operational Research, 198(1), 341–350. doi:10.1016/j.ejor.2008.09.006
sbm = function(X, Y)
{
    # Here X is an N*m matrix, Y is an N*s matrix.
    library(Rglpk)
    # require(lpSolve)
    N = nrow(X)
    m = ncol(X)
    s = ncol(Y)
    # variables are:
    # t
    # gamma_j, j = 1..N
    # s_i^(-), i = 1..m
    # s_r^(+), r = 1..s
    efficiency = numeric(N)
    max_positive_y = apply(Y[, 1:s], MARGIN = 2, function(x) max(x[x > 0]))
    min_positive_y = apply(Y[, 1:s], MARGIN = 2, function(x) min(x[x > 0]))
    dir = rep("==", 1 + m + s + 1)
    rhs = c(1, rep(0, m), rep(0, s), 0)
    for (i in 1:N)
    {
        x = X[i, ]
        y = Y[i, ]
        # objective coefficients, one block per variable group
        coef_t = 1
        coef_gamma = rep(0, N)
        coef_s_i = -1/(m * x)
        coef_s_r = rep(0, s)
        obj = c(coef_t, coef_gamma, coef_s_i, coef_s_r)
        coef_constraint1_s = y
        for (r in 1:s)
        {
            if (y[r] < 0) {
                coef_constraint1_s[r] =
                    min_positive_y[r] * (max_positive_y[r] - min_positive_y[r]) /
                    (max_positive_y[r] - y[r])
            }
        }
        constraint1 = c(1, rep(0, N), rep(0, m), 1/(s * coef_constraint1_s))
        constraint2 = cbind(-x, t(X), diag(m), matrix(0, m, s))
        constraint3 = cbind(-y, t(Y), matrix(0, s, m), -diag(s))
        constraint4 = c(-1, rep(1, N), rep(0, m), rep(0, s))
        mat = rbind(constraint1, constraint2, constraint3, constraint4)
        results = Rglpk_solve_LP(obj = obj, mat = mat, dir = dir, rhs = rhs, max = FALSE)
        efficiency[i] = results$optimum
        # results <- lp("min", obj, mat, dir, rhs)
        # efficiency[i] = results$objval
    }
    efficiency
}
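For anyone who wants to reproduce the comparison, a minimal call with random positive data (purely illustrative; any small matrices of the stated shapes will do):

set.seed(42)
X <- matrix(runif(20, 1, 2), nrow = 10, ncol = 2)  # N = 10 DMUs, m = 2 inputs
Y <- matrix(runif(20, 1, 2), nrow = 10, ncol = 2)  # s = 2 outputs
sbm(X, Y)  # run once with Rglpk, then again with the lpSolve lines swapped in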
