Using R to honor correlations for Latin hypercube / Monte Carlo trials - r

I am currently using Python and RPy to access functionality inside R.
How do I use an R library to generate Monte Carlo samples that honor the correlation between two variables?
E.g. if variables A and B have a correlation of 85% (0.85), I need to generate all the Monte Carlo samples honoring that correlation between A and B.
Would appreciate it if anyone can share ideas / snippets.
Thanks

The rank correlation method of Iman and Conover seems to be a widely used and general approach to producing correlated Monte Carlo samples for computer-based experiments, sensitivity analysis, etc. Unfortunately I have only just come across this and don't have access to the PDF, so I don't know how the authors actually implement their method, but you could follow this up.
Their method is more general because each variable can come from a different distribution, unlike the multivariate normal of @Dirk's answer.
Update: I found an R implementation of the above approach in the package mc2d; in particular, you want the cornode() function.
Here is an example taken from ?cornode:
> require(mc2d)
> x1 <- rnorm(1000)
> x2 <- rnorm(1000)
> x3 <- rnorm(1000)
> mat <- cbind(x1, x2, x3)
> ## Target
> (corr <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.2, 0.2, 0.2, 1), ncol=3))
     [,1] [,2] [,3]
[1,]  1.0  0.5  0.2
[2,]  0.5  1.0  0.2
[3,]  0.2  0.2  1.0
> ## Before
> cor(mat, method="spearman")
            x1         x2          x3
x1  1.00000000 0.01218894 -0.02203357
x2  0.01218894 1.00000000  0.02298695
x3 -0.02203357 0.02298695  1.00000000
> matc <- cornode(mat, target=corr, result=TRUE)
Spearman Rank Correlation Post Function
          x1        x2        x3
x1 1.0000000 0.4515535 0.1739153
x2 0.4515535 1.0000000 0.1646381
x3 0.1739153 0.1646381 1.0000000
The rank correlations in matc are now very close to the target correlations in corr.
The idea is that you draw the samples separately from the distribution of each variable, and then use the Iman & Conover approach to reorder the samples so that their rank correlations are as close to the target correlations as possible.
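Because only the ordering within each column changes, the marginals need not be normal. A minimal sketch of that (my own, not from ?cornode; the 0.85 echoes the original question):
require(mc2d)
## three very different marginals
mat2 <- cbind(a = runif(1000), b = rexp(1000, rate = 2), c = rlnorm(1000))
target <- matrix(c(1, 0.85, 0.3,
                   0.85, 1, 0.3,
                   0.3, 0.3, 1), ncol = 3)
mat2c <- cornode(mat2, target = target)
cor(mat2c, method = "spearman")  # close to target; each column's values are unchanged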

That is a FAQ. Here is one answer using a recommended package:
R> library(MASS)
R> example(mvrnorm)
mvrnrmR> Sigma <- matrix(c(10,3,3,2),2,2)
mvrnrmR> Sigma
     [,1] [,2]
[1,]   10    3
[2,]    3    2
mvrnrmR> var(mvrnorm(n=1000, rep(0, 2), Sigma))
        [,1]    [,2]
[1,] 8.82287 2.63987
[2,] 2.63987 1.93637
mvrnrmR> var(mvrnorm(n=1000, rep(0, 2), Sigma, empirical = TRUE))
     [,1] [,2]
[1,]   10    3
[2,]    3    2
R>
Switching between correlation and covariance matrices is straightforward (hint: outer product of the vector of standard deviations).
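For example, a small sketch of that hint (my own, reusing the Sigma from above; cov2cor() does the first step directly):
sds <- sqrt(diag(Sigma))        # standard deviations
R   <- Sigma / outer(sds, sds)  # covariance -> correlation, same as cov2cor(Sigma)
S   <- R * outer(sds, sds)      # correlation -> covariance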

This question was not tagged as python, but based on your comment it looks like you might be looking for a Python solution as well. The most basic Python implementation of Iman Conover that I can concoct looks like the following (using NumPy):
from numpy import argsort, array, sort, take, zeros
from numpy.random import multivariate_normal

def makeCorrelated(y, corMatrix):
    # correlated normal scores, one column per variable (implied Gaussian copula)
    c = multivariate_normal(zeros(y.shape[0]), corMatrix, y.shape[1])
    # rank of each score within its column; transpose to one row per variable
    key = argsort(argsort(c, axis=0), axis=0).T
    # sort each variable's samples, then reorder them to follow the rank pattern
    out = array([take(sort(row), k) for row, k in zip(y, key)])
    return out
where y is an array with one row of samples per marginal distribution and corMatrix is a positive semi-definite, symmetric correlation matrix. Given that this function draws the c matrix with multivariate_normal(), it uses an implied Gaussian copula. To use different copula structures you'll need different drivers for the c matrix.

Related

Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset, and then
plot the number of principal components vs classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~., data = data[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 2)
I am having a tough time understanding how to use this PCA output for classification (I can use any classifier, e.g. an SVM).
EDIT: My question is how to feed the output of the PCA into a classifier.
(The question included screenshots of the cleaned data and the uncleaned original data.)
I will show you a small example of how to use the kpca function of the kernlab package here.
I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set to show you how.
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
   y          x1          x2          x3         x4          x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the PCA you need to do:
kpc <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 4)
which is the same way you use it. However, I need to point out that the features argument is the number of principal components, not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably pick around 100 principal components and then keep the first n components with the highest eigenvalues. Let's see this in my random example after running the previous code.
In order to see the eigenvalues:
> kpc@eig
    Comp.1     Comp.2     Comp.3     Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data are random. In your case I assume you will get better ones. You need to choose the n components that have the highest values; a value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the SVM below.)
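A quick sketch (my own, reusing the kpc object above) of one way to eyeball that choice via the proportion of variance each component captures:
round(kpc@eig / sum(kpc@eig), 3)    # proportion per component
cumsum(kpc@eig) / sum(kpc@eig)      # cumulative proportion explained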
In order to access the principal components, i.e. the PCA output, you do this:
> kpc@pcv
              [,1]        [,2]         [,3]        [,4]
 [1,] -0.1220123051  1.01290883 -0.935265092  0.37279158
 [2,]  0.0420830469  0.77483019 -0.009222970  1.14304032
 [3,] -0.7060568260  0.31153129 -0.555538694 -0.71496666
 [4,]  0.3583160509 -0.82113573  0.237544936 -0.15526000
 [5,]  0.1158956953 -0.92673486  1.352983423 -0.27695507
 [6,]  0.2109994978 -1.21905573 -0.453469345 -0.94749503
 [7,]  0.0833758766  0.63951377 -1.348618472 -0.26070127
 [8,]  0.8197838629  0.34794455  0.215414610  0.32763442
 [9,] -0.5611750477 -0.03961808 -1.490553198  0.14986663
...
This returns a matrix with 4 columns, i.e. the number given in the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, which is why you access slots with @, as in kpc@pcv.
You then need to use the above matrix to feed an SVM in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  95
And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.
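To get to the plot of the number of components vs classification accuracy that the question asks for, a rough sketch along these lines could work (my own, reusing kpc and y from above; note it measures training accuracy, not cross-validated accuracy):
library(e1071)
acc <- sapply(1:4, function(n) {
  pcs <- kpc@pcv[, 1:n, drop = FALSE]   # first n principal components
  fit <- svm(pcs, as.factor(y))
  mean(predict(fit, pcs) == as.factor(y))
})
plot(1:4, acc, type = "b",
     xlab = "number of principal components",
     ylab = "classification accuracy")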

bootstrapping from a matrix

I have spent over a week looking at different forums trying to figure this out and unfortunately remain stuck. I'm new to bootstrapping and have found it difficult to get it to work in R for my data set.
I have a matrix of data from which I would simply like to draw 1000 samples by parametric bootstrapping, and then calculate the mean of these sampled values. I have tried the code below and get no results.
Any help would be appreciated.
            A1        A2        A3        A4        D1        D2        E1
[1,]  0.900111 -0.314068  0.203188 -0.548964 -0.107771 -0.072454  0.084097
[2,] -0.314068  0.195798 -0.138751  0.198521  0.066360  0.048523 -0.126348
[3,]  0.203188 -0.138751  0.400325 -0.128715 -0.180103 -0.037768  0.128198
[4,] -0.548964  0.198521 -0.128715  1.190415  0.067779  0.047209 -0.053145
[5,] -0.107771  0.066360 -0.180103  0.067779  0.149419  0.039649 -0.102587
[6,] -0.072454  0.048523 -0.037768  0.047209  0.039649  0.396405  0.016789
[7,]  0.084097 -0.126348  0.128198 -0.053145 -0.102587  0.016789  0.790767
#creating the data matrix
data <- read.csv("Matrix.csv", header=F)
data1 <- as.matrix(data)
#Bootstrap 1000 samples
psi<- function (data,i) mean (data[i])
byboot = boot(data, psi, R=1000)
myboot
If you're trying to sample from a correlated multivariate normal distribution, you can use MASS::mvrnorm:
library(MASS)
x <- mvrnorm(1000, rep(0,7), data)
colMeans(x)
cov(x) # check the covariance matrix is approximately recovered
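If the goal really is a parametric bootstrap of the means, a sketch building on that (my own; it assumes data1 is the 7 x 7 covariance matrix shown above and picks an arbitrary per-replicate sample size of 100):
library(MASS)
set.seed(1)
boot_means <- t(replicate(1000, colMeans(mvrnorm(100, rep(0, 7), data1))))
colMeans(boot_means)       # bootstrap estimates of the means
apply(boot_means, 2, sd)   # bootstrap standard errors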

Can I generate bivariate normal random variables with correlation 1 using Cholesky factorization?

Is it possible to set a correlation of 1 using the Cholesky decomposition technique?
set.seed(88)
mu<- 0
sigma<-1
x<-rnorm(10000, mu, sigma)
y<-rnorm(10000, mu, sigma)
MAT<-cbind(x,y)
cor(MAT[,1],MAT[,2])
# this doesn't work because a correlation of 1 makes the matrix NOT positive definite;
# any number from 0 to 0.99 works
correlationMAT <- matrix(1, nrow = 2, ncol = 2)
U <- chol(correlationMAT)
newMAT <- MAT %*% U
cor(newMAT[,1], newMAT[,2])  # ..... but I want to make this cor = 1
Any ideas?
Actually you can, by using pivoted Cholesky factorization.
correlationMAT<- matrix(1,nrow = 2,ncol = 2)
U <- chol(correlationMAT, pivot = TRUE)
#Warning message:
#In chol.default(correlationMAT, pivot = TRUE) :
# the matrix is either rank-deficient or indefinite
U
#     [,1] [,2]
#[1,]    1    1
#[2,]    0    0
#attr(,"pivot")
#[1] 1 2
#attr(,"rank")
#[1] 1
Note that U has two identical columns. If we do MAT %*% U, we replicate MAT[, 1] twice, which means the second random variable will be identical to the first one. (In general, after a pivoted factorization you need to reorder the columns of U by order(attr(U, "pivot")); here the pivot order is 1, 2, so no reordering is needed.)
newMAT<- MAT %*% U
cor(newMAT)
#     [,1] [,2]
#[1,]    1    1
#[2,]    1    1
You don't need to worry that the two random variables are identical. Remember, this only means they are identical after standardization (to N(0, 1)). You can rescale them by different standard deviations and then shift them by different means to make them distinct.
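For instance, a quick sketch (my own, reusing newMAT from above):
## give the two perfectly correlated columns different means and scales
z1 <-  2 + 3.0 * newMAT[, 1]
z2 <- -1 + 0.5 * newMAT[, 2]
cor(z1, z2)   # still 1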
Pivoted Cholesky factorization is very useful. My answer to this post: Generate multivariate normal r.v.'s with rank-deficient covariance via Pivoted Cholesky Factorization gives a more comprehensive picture.

Simulating data from multivariate distribution in R based on Winbugs/JAGS script

I am trying to simulate data based on part of a JAGS/WinBUGS script. The script comes from Eaves & Erkanli (2003; see http://psych.colorado.edu/~carey/pdffiles/mcmc_eaves.pdf, pages 295-296).
The part of the script I want to base my simulations on is as follows (with different variable names than in the original paper):
for(fam in 1:nmz) {
  a2mz[fam, 1:N] ~ dmnorm(mu[1:N], tau.a[1:N, 1:N])
  a1mz[fam, 1:N] ~ dmnorm(a2mz[fam, 1:N], tau.a[1:N, 1:N])
}
# Prior
tau.a[1:N, 1:N] ~ dwish(omega.g[,], N)
I want to simulate data in R for the parameters a2mz and a1mz as given in the script above.
So basically, I want to simulate data from N (e.g. 3) multivariate distributions for fam (e.g. 10) persons with covariance tau.a.
To make this more illustrative: the purpose is to simulate genetic effects for fam (e.g. 10) families. The genetic effect is the same within each family (e.g. monozygotic twins), with a variance of tau.a (e.g. 0.5). Of these genetic effects, 3 'versions' (3 multivariate distributions) have to be simulated.
What I tried in R to simulate the data as given in the JAGS/Winbugs script is as follows:
library(MASS)
nmz = 10 #number of families, here e.g. 10
var_a = 0.5 # tau.a in the script
a2_mz <- mvrnorm(3, mu = rep(0, nmz), Sigma = diag(nmz)*var_a)
This simulates data for the a2mz parameter as referred to in the JAGS/Winbugs script above:
> print(t(a2_mz))
            [,1]       [,2]        [,3]
 [1,] -1.1563683 -0.4478091 -0.15037563
 [2,]  0.5673873 -0.7052487  0.44377336
 [3,]  0.2560446  0.9901964 -0.65463341
 [4,] -0.8366952  0.4924839 -0.56891991
 [5,]  0.7343780  0.5429955  0.87529201
 [6,]  0.5592868 -0.3899988 -0.33709105
 [7,] -1.8233663 -0.7149141 -0.18153049
 [8,] -0.8213804 -1.4397075 -0.09159725
 [9,] -0.7002797 -0.3996970 -0.29142215
[10,]  1.1084067  0.3884869 -0.46207940
However, when I then try to use these data to simulate data for the a1mz parameter (third line of the JAGS/WinBUGS script), something goes wrong and I am not sure what:
a1_mz <- mvrnorm(3, mu = t(a2_mz), Sigma = c(diag(nmz)*var_a, diag(nmz)*var_a, diag(nmz)*var_a))
This results in the error:
Error in eigen(Sigma, symmetric = TRUE, EISPACK = EISPACK) :
non-square matrix in 'eigen'
Can anyone give me any hints or tips on what I am doing wrong?
Many thanks,
Best regards,
inga
mvrnorm() takes a mean vector and a variance matrix as input, and that's not what you're feeding it. I'm not sure I understand your question, but if you want to simulate 3 samples from each of 3 different multivariate normal distributions with the same variance matrix and different means, then just use:
a1_mz <- array(dim = c(dim(a2_mz), 3))
for(i in 1:3) a1_mz[,,i] <- mvrnorm(3, t(a2_mz)[,i], diag(nmz) * var_a)
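A quick sanity check on the result (a sketch of mine; each slice i holds 3 draws of the 10 family effects, centred on column i of t(a2_mz)):
dim(a1_mz)            # 3 10 3
colMeans(a1_mz[,,1])  # roughly t(a2_mz)[, 1], though noisy with only 3 draws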

log covariance to arithmetic covariance matrix function?

Is there a function that can convert a covariance matrix built using log-returns into a covariance matrix based on simple arithmetic returns?
Motivation: we'd like to use a mean-variance utility function where expected returns and variances are specified in arithmetic terms. However, estimating returns and covariances is often performed with log-returns because of the additivity property of log-returns, and we assume asset prices follow a lognormal stochastic process.
Meucci describes a process to generate an arithmetic-returns-based covariance matrix for a generic/arbitrary distribution of lognormal returns in the Appendix, page 5.
Here's my translation of the formulae:
linreturn <- function(mu, Sigma) {
  # mean of arithmetic returns
  m <- exp(mu + diag(Sigma) / 2) - 1
  x1 <- outer(mu, mu, "+")
  x2 <- outer(diag(Sigma), diag(Sigma), "+") / 2
  # covariance of arithmetic returns
  S <- exp(x1 + x2) * (exp(Sigma) - 1)
  list(mean = m, vcov = S)
}
Edit: fixed the -1 issue based on comments.
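For reference, these are the standard lognormal moment identities the function encodes (my transcription, with $R = e^X - 1$ and $X \sim N(\mu, \Sigma)$):

$$E[R_i] = e^{\mu_i + \Sigma_{ii}/2} - 1, \qquad \operatorname{Cov}(R_i, R_j) = e^{\mu_i + \mu_j + (\Sigma_{ii} + \Sigma_{jj})/2}\left(e^{\Sigma_{ij}} - 1\right)$$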
Try an example:
m1 <- c(1,2)
S1 <- matrix(c(1,0.2,0.2,1),nrow=2)
Generate multivariate log-normal returns:
set.seed(1001)
r1 <- exp(MASS::mvrnorm(200000,mu=m1,Sigma=S1))-1
colMeans(r1)
## [1] 3.485976 11.214211
var(r1)
##         [,1]     [,2]
## [1,] 34.4021  12.4062
## [2,] 12.4062 263.7382
Compare with expected results from formulae:
linreturn(m1,S1)
## $mean
## [1] 3.481689 11.182494
## $vcov
##          [,1]      [,2]
## [1,] 34.51261  12.08818
## [2,] 12.08818 255.01563
