Speed up Monte Carlo simulation with nested loop (2) - r

I would like to know if there is a more efficient way to speed up the code below. It uses a procedure where subsampling is required in the nested loop (which a previous answer https://stackoverflow.com/a/13629611/1176697 helped to make more efficient).
R has a tendency to hang when B = 500, although the computer's OS isn't unduly affected.
The goal is to run the code below with B = 1000 and with larger m values (m = 75, m = 100, m = 150).
I have detailed the procedure in the code below and included a link to a reproducible data set.
#Estimation of order m 'leave one out' hyperbolic efficiency scores
#The procedure sequentially works through each observation in 'IOs' and
#calculates a DEA order m efficiency score by leaving out the observation
# under investigation in the DEA reference set
# Step 1: Load the packages, create Inputs (x) and Outputs (y), choose
# m (the order of the partial frontier or reference set to be used in DEA)
# and B, the number of Monte Carlo simulations of each order m DEA estimate
# Step 2: For each observation a in x and y, matrices x1 and y1
# are created which 'leave out' this observation.
# Step 3: From these matrices, subsamples (xref, yref) of size [m,] are
# taken and used in DEA estimation.
# Step 4: The DEA estimation uses the m subsample from Step 3
# as a reference set and evaluates the efficiency of the observation that
# has been 'left out'
# (thus the first two arguments in dea() are matrices of order [1,3])
# Step 5: Steps 3 and 4 are repeated B times to obtain B simulations of the
# order m efficiency score, and a mean and standard deviation are
# calculated and placed in effm.
# IOs data can be found here: https://dl.dropbox.com/u/1972975/IOs.txt
# From IOs an Input matrix (x[1376,3]) and an Output matrix (y[1376,3])
# are created.
library(Benchmarking)
x <- IOs[, 1:3]
y <- IOs[, 4:6]
A <- nrow(x)
effm <- matrix(nrow = A, ncol = 2)
m <- 50
B <- 500
pb <- txtProgressBar(min = 0, max = A, style = 3)
for (a in 1:A) {
  x1 <- x[-a, ]
  y1 <- y[-a, ]
  theta <- numeric(B)
  xynrow <- nrow(x1)
  mB <- m * B
  # Get all of your samples at once (https://stackoverflow.com/a/13629611/1176697)
  xrefm <- x1[sample(1:xynrow, mB, replace = TRUE), ]
  yrefm <- y1[sample(1:xynrow, mB, replace = TRUE), ]
  deaX <- as.matrix(x[a, ], ncol = 3)
  deaY <- as.matrix(y[a, ], ncol = 3)
  for (i in 1:B) {
    theta[i] <- dea(deaX, deaY, RTS = 'vrs', ORIENTATION = 'graph',
                    xrefm[(1:m) + (i - 1) * m, ], yrefm[(1:m) + (i - 1) * m, ],
                    FAST = TRUE)
  }
  effm[a, 1] <- mean(theta)
  effm[a, 2] <- sd(theta) / sqrt(B)
  setTxtProgressBar(pb, a)
}
close(pb)
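One direction worth exploring (a sketch only, not tested against this data set): since each pass of the outer loop over a is independent of the others, the passes can be distributed over several cores with the parallel package. The helper run_one() and the choice mc.cores = 4 below are illustrative assumptions, not part of the original code, and this assumes Benchmarking::dea runs unchanged inside forked workers.
library(parallel)
# Hypothetical helper: the order-m 'leave one out' score for a single observation a.
run_one <- function(a, x, y, m, B) {
  x1 <- x[-a, ]
  y1 <- y[-a, ]
  mB <- m * B
  xrefm <- x1[sample(1:nrow(x1), mB, replace = TRUE), ]
  yrefm <- y1[sample(1:nrow(y1), mB, replace = TRUE), ]
  deaX <- as.matrix(x[a, ], ncol = 3)
  deaY <- as.matrix(y[a, ], ncol = 3)
  theta <- numeric(B)
  for (i in 1:B) {
    theta[i] <- dea(deaX, deaY, RTS = 'vrs', ORIENTATION = 'graph',
                    xrefm[(1:m) + (i - 1) * m, ], yrefm[(1:m) + (i - 1) * m, ],
                    FAST = TRUE)
  }
  c(mean(theta), sd(theta) / sqrt(B))
}
# mclapply forks the current R session (Unix-alike systems); on Windows use parLapply instead.
res <- mclapply(1:A, run_one, x = x, y = y, m = m, B = B, mc.cores = 4)
effm <- do.call(rbind, res)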

Related

Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors

I need to compute the sum of squares crossproduct matrix (indeed the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit lm
fit <- lm( Y ~ ., data = X )
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type = 3 # could be also 2 (II)
car::Anova(fit, type = type)$SSP
This has to be done thousands of times, and unfortunately it gets slow when the number of predictors is relatively large. As I am often interested only in a subset of s predictors, I tried to re-implement this calculation. Although my implementation, which directly translates the linear algebra for s = 1 (below), is faster for small sample sizes (n),
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y
car still goes faster for large n.
I already tried an Rcpp implementation, without much success, as these matrix products in R already use very efficient code.
Any hint on how to do this faster?
UPDATE
After reading the answers, I tried the solution proposed in this post, which relies on QR/SVD/Cholesky factorization for the hat matrix calculation. However, it seems that car::Anova is still faster at computing all p = 30 matrices than I am at computing just one (s = 1), e.g. for n = 5000, q = 10:
Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343    10
   QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508    10
  SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318    10
 Chol  969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274    10
  CAR  205.1665  211.8523  218.6195  214.6761  222.0973  242.4617    10
UPDATE 2
The best solution for now was to go over the car::Anova code (i.e. functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to account for a subset of predictors, instead of all of them.
The relevant code from car is as follows (I skipped the checks and simplified a bit):
B <- coef(fit)                   # Model coefficients
M <- model.matrix(fit)           # Model matrix M
V <- solve(crossprod(M))         # (M'M)^-1
p <- ncol(M)                     # Number of predictors in M
I.p <- diag(p)                   # Identity (p x p)
terms <- labels(terms(fit))      # Model terms (intercept added below)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign             # Mapping between terms and the p columns of M
SSP <- as.list(rep(0, n.terms))  # Initialize list of sums of squares cross-product matrices
names(SSP) <- terms
for (term in 1:n.terms) {
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]
  SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
Then it is just a matter of selecting the subset of terms, as sketched below.
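A minimal sketch of that selection (the vector my.terms and the two indices chosen are hypothetical, purely for illustration; it reuses terms, assign, I.p, B and V from the snippet above):
my.terms <- terms[c(2, 5)]  # hypothetical choice: the 2nd and 5th model terms
keep <- which(terms %in% my.terms)
SSP.sub <- setNames(vector("list", length(keep)), terms[keep])
for (term in keep) {
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]
  SSP.sub[[terms[term]]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}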
This line and the similar one below it for H1 could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.
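As a rough illustration of that idea (a sketch only, not a benchmarked replacement; X and Y are the design matrix and response matrix from the snippet above), the hat matrices can be written so that no explicit inverse is formed:
# solve() is given the right-hand side directly, so (X'X)^-1 is never formed explicitly.
H   <- X %*% solve(crossprod(X), t(X))
X.r <- X[, -2]
H1  <- X.r %*% solve(crossprod(X.r), t(X.r))
SS  <- crossprod(Y, H - H1) %*% Y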
You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.
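For reference, a minimal sketch of that QR route (again assuming X is the full design matrix, as above): the hat matrix equals QQ', where Q is the thin Q factor of X.
Q <- qr.Q(qr(X))    # thin Q factor of the design matrix
H <- tcrossprod(Q)  # H = Q Q' = X (X'X)^{-1} X'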

How to generate samples from MVN model?

I am trying to run some code in R based on this paper here, following example 5.1. I want to simulate the following:
My background in R isn't great, so I have the code below. How can I generate a histogram and samples from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
  u <- rbinom(1, 1, 0.4)
  z <- rnorm(1)
  return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
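For completeness, a vectorised sketch of that matrix/vector alternative (same distributional assumptions as above; all 5000 draws are generated at once instead of via replicate()):
n <- 100
u <- rbinom(5000, 1, 0.4)  # 5000 Bernoulli draws
z <- rnorm(5000)           # 5000 standard normal draws
betadraws <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws)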
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
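A sketch of that wrapping (the function name genY and the choice of 100 replications are illustrative only):
genY <- function(betadraws){
  x <- replicate(5000, rnorm(17))  # 17 x 5000 matrix of standard normals
  epsilon <- rnorm(17)             # vector of 17 standard normals
  drop(x %*% betadraws + epsilon)  # one simulated Y, returned as a length-17 vector
}
Ydraws <- replicate(100, genY(betadraws))  # 17 x 100 matrix, one column per simulation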
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Simulate compound Poisson process in r

I'm trying to simulate a compound Poisson process in R. The process is defined by $\sum_{j=1}^{N_t} Y_j$, where $(Y_j)$ is an i.i.d. sequence of $N(0,1)$ values, independent of $N_t$, and $N_t$ is a Poisson process with parameter $1$. I'm trying to simulate this in R without luck. I have an algorithm to compute this as follows:
Simulate the cPp from 0 to T:
Initiate: $k = 0$
Repeat while $\sum_{i=1}^k T_i < T$:
Set $k = k+1$
Simulate $T_k \sim \exp(\lambda)$ (in my case $\lambda = 1$)
Simulate $Y_k \sim N(0,1)$ (this is just a special case; I would like to be able to change this to any distribution)
The trajectory is given by $X_t = \sum_{j=1}^{N_t} Y_j$, where $N_t = \sup\{k : \sum_{i=1}^k T_i \leq t\}$
Can someone help me to simulate this in r so that I can plot the process? I have tried, but can't get it done.
Use cumsum for the cumulative sums that determine the times N_t as well as the X_t. This illustrative code specifies the number of times to simulate, n, simulates the times in n.t and the values in x, and (to display what it has done) plots the trajectory.
n <- 1e2
n.t <- cumsum(rexp(n))
x <- c(0,cumsum(rnorm(n)))
plot(stepfun(n.t, x), xlab="t", ylab="X")
This algorithm, since it relies on low-level optimized functions, is fast: the six-year-old system I tested it on will generate over three million (time, value) pairs per second.
That's usually good enough for simulation, but it doesn't quite satisfy the problem, which asks to generate a simulation out to time T. We can leverage the preceding code, but the solution is a little trickier. It computes a reasonable upper limit on how many times will occur in the Poisson process before time T. It generates the inter-arrival times. This is wrapped in a loop that will repeat the procedure in the (rare) event the time T is not actually reached.
The additional complexity doesn't change the asymptotic calculation time.
T <- 1e2           # Specify the end time
T.max <- 0         # Last time encountered
n.t <- numeric(0)  # Inter-arrival times
while (T.max < T) {
  #
  # Estimate how many random values to generate before exceeding T.
  #
  T.remaining <- T - T.max
  n <- ceiling(T.remaining + 3*sqrt(T.remaining))
  #
  # Continue the Poisson process.
  #
  n.new <- rexp(n)
  n.t <- c(n.t, n.new)
  T.max <- T.max + sum(n.new)
}
#
# Sum the inter-arrival times and cut them off after time T.
#
n.t <- cumsum(n.t)
n.t <- n.t[n.t <= T]
#
# Generate the iid random values and accumulate their sums.
#
x <- c(0, cumsum(rnorm(length(n.t))))
#
# Display the result.
#
plot(stepfun(n.t, x), xlab="t", ylab="X", sub=paste("n =", length(n.t)))
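Since the question asks for the jump distribution to be swappable, note that only the rnorm() call carries the N(0,1) assumption. A sketch with, say, Exp(2) jumps (an arbitrary illustrative choice) reuses the same arrival times n.t:
# Any other i.i.d. jump distribution can be plugged in; the Poisson arrivals are unchanged.
x <- c(0, cumsum(rexp(length(n.t), rate = 2)))
plot(stepfun(n.t, x), xlab="t", ylab="X")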

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000 x 3 matrices.
So far I have been successful in calculating the Total Variation distance between two 2-dimensional posteriors (two 30,000 x 2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (the column-wise entries); rather, I want to obtain an overall value by comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D   <- matrix(rnorm(40000*2, 0, 1), ncol = 2)
ground.truth.2D <- matrix(rnorm(40000*2, 0, 2), ncol = 2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D <- function(true,
                                        comparison,
                                        burnin,
                                        window.size){
  # Bandwidth for theta.1.
  my_bw_x <- window.size
  # Bandwidth for theta.2.
  my_bw_y <- window.size
  range_x <- range(c(true[-c(1:burnin), 1], comparison[-c(1:burnin), 1]))
  range_y <- range(c(true[-c(1:burnin), 2], comparison[-c(1:burnin), 2]))
  xx <- seq(range_x[1], range_x[2], by = my_bw_x)
  yy <- seq(range_y[1], range_y[2], by = my_bw_y)
  true.pointidxs <- matrix(c(findInterval(true[-c(1:burnin), 1], xx),
                             findInterval(true[-c(1:burnin), 2], yy)), ncol = 2)
  comparison.pointidxs <- matrix(c(findInterval(comparison[-c(1:burnin), 1], xx),
                                   findInterval(comparison[-c(1:burnin), 2], yy)), ncol = 2)
  # Count the frequencies in the corresponding cells:
  square.mat.dims <- max(length(xx), length(yy))
  frequencies.true <- frequencies.comparison <- matrix(0, ncol = square.mat.dims, nrow = square.mat.dims)
  for (i in 1:dim(true.pointidxs)[1]){
    frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] <-
      frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] + 1
    frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] <-
      frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] + 1
  }# End for
  # Normalize the frequency matrices:
  frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
  frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
  TVD <- 0.5*sum(abs(frequencies.comparison - frequencies.true))
  return(TVD)
}# End function
TVD.2D <- Total.Variation.Distance.2D(true = ground.truth.2D, comparison = comparison.2D,
                                      burnin = 10000, window.size = 0.05)
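For the 3-dimensional case asked about here, one possible extension of the same binning idea is sketched below (an untested illustration, not a validated answer; the function name and the cell-labelling trick are my own assumptions): bin each column with findInterval, tabulate the resulting cells, and take half the L1 distance between the normalised cell frequencies.
# Sketch: TVD between two matrices of posterior draws with d columns (here d = 3),
# generalising the 2D binning above. 'window.size' is the common bin width.
Total.Variation.Distance <- function(true, comparison, burnin, window.size){
  true       <- true[-c(1:burnin), , drop = FALSE]
  comparison <- comparison[-c(1:burnin), , drop = FALSE]
  d <- ncol(true)
  # One set of bin breaks per dimension, covering both samples.
  breaks <- lapply(1:d, function(j)
    seq(min(true[, j], comparison[, j]), max(true[, j], comparison[, j]), by = window.size))
  # Map each draw to a single cell label such as "3-7-2".
  cell <- function(m) apply(sapply(1:d, function(j) findInterval(m[, j], breaks[[j]])),
                            1, paste, collapse = "-")
  all.cells <- union(cell(true), cell(comparison))
  f.true <- table(factor(cell(true), levels = all.cells)) / nrow(true)
  f.comp <- table(factor(cell(comparison), levels = all.cells)) / nrow(comparison)
  0.5 * sum(abs(f.true - f.comp))
}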

Storing For Loop values after simulation

I'm brand new to R and trying to implement a simple model (which I will extend later) that deals with corporate bond defaults.
For starters, I'm using only two clients.
Parameters:
- two clients (which I name "A" and "B")
- a cash flow of $10,000 will be received from each client if they do not default within 10 years
- pulling together concepts using standard normal random variables, dependent uniform random variables and Gaussian copulas
- run some number of simulations
- compute the sum of Client A's cash flow plus Client B's cash flow and store it in a vector named "result"
- finally, take the average of the result vector
My code is:
# define variables
nSim <- 5 # of simulations
rho <- 0.3 # rho
lambda <- 0.01 # default intensity
T <- 10 # time to default
for (i in 1:nSim){
  # Step 1: generate 2 independent standard normal random variables
  z1 <- rnorm(1, mean=0, sd=1)
  z2 <- rnorm(1, mean=0, sd=1)
  # Step 2: map the normals into correlated normals
  # by Cholesky composition of the correlation matrix
  # w1 = z1
  # w2 = rho(z1)+sqrt(1-(rho^2))*z2
  w1 <- z1
  w2 <- rho*z1 - sqrt(1-(rho^2))*z2
  # Step 3: using the correlated normals, generate two dependent uniform variables
  u <- runif(1, min=0, max=1)
  v <- runif(1, min=0, max=1)
  # Step 4: using the dependent uniforms, generate two dependent exponentials
  tau.A <- (-1/lambda)*log(u)
  tau.B <- (-1/lambda)*log(v)
  payout.A <- if (tau.A > 10) {10000} else {0}
  payout.B <- if (tau.B > 10) {10000} else {0}
  result[i] = (payout.A[i] + payout.B[i])
}
# calculate expected value of portfolio
mean(result)
When I run this code, I'm getting an error of "NA" and can't figure out why (again, I'm brand new to R). I don't think each of the simulation values is being stored in the result vector, but I don't know how to diagnose the problem.
Thanks in advance to anyone who can help!
--Sarah
Everything works until the result[i] = (payout.A[i] + payout.B[i]) line. The problem is that you never create result.
Before your for loop, add the line:
result <- vector('numeric', length = nSim)
This will create a vector of 0s with a length of nSim. In R it is best to preallocate the space instead of dynamically growing a vector using c().
No, the problem is the presence of the [i] subscripts in the result[i] = (payout.A[i] + payout.B[i]) line.
The [i] subscript is fine for result, but not for the two payout objects, because each of those is a single value generated anew in each iteration of the loop. So simply remove them:
result[i] <- (payout.A + payout.B)
This will solve your issue. If you wish to keep each payout in its own vector then you need to assign it as such, but it seems that you don't.
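Putting both answers together, a minimal sketch of the corrected loop (preallocating result and dropping the [i] subscripts on the scalar payouts; everything else is left exactly as in the question):
result <- vector('numeric', length = nSim)  # preallocate the output vector
for (i in 1:nSim){
  z1 <- rnorm(1); z2 <- rnorm(1)
  # (w1, w2 are computed as in the question but, as there, not yet used)
  w1 <- z1
  w2 <- rho*z1 - sqrt(1-(rho^2))*z2
  u <- runif(1); v <- runif(1)
  tau.A <- (-1/lambda)*log(u)
  tau.B <- (-1/lambda)*log(v)
  payout.A <- if (tau.A > 10) {10000} else {0}
  payout.B <- if (tau.B > 10) {10000} else {0}
  result[i] <- payout.A + payout.B  # scalar payouts, so no [i] subscript here
}
mean(result)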
