Why are my estimated and theoretical results for Sobol sensitivity analysis different? - r

I am working on Sobol sensitivity analysis. I am trying to compute the first-order and total-effect indices both by estimation and theoretically.
First, I computed the estimated values by following the steps in the Wikipedia article "Variance-based sensitivity analysis".
Here is the code:
set.seed(123)
x_1<-ceiling(1000*runif(1000)) ## generate x1 from range [1,1000]
x_2<-ceiling(100*runif(1000)) ## generate x2 from range [1,100]
x_3<-ceiling(10*runif(1000)) ## generate x3 from range [1,10]
x<-cbind(x_1,x_2,x_3) ## combine as one matrix
A<-matrix(x[1:500,],ncol=3) ## divide this one matrix into two
B<-matrix(x[501:1000,],ncol=3)
AB1<-cbind(B[,1],A[,-1]) ## replace the first column of sample A by the first column of sample B
AB2<-cbind(A[,1],B[,2],A[,3]) ## replace the second column of sample A by the second column of sample B
AB3<-cbind(A[,-3],B[,3]) ## replace the third column of sample A by the third column of sample B
trial<-function(x){ ## define the trial function:
#x1+x2*x3^2
x[,1]+(x[,2])*(x[,3])^2
}
Y_A<-trial(A) ## the output of A
Y_B<-trial(B) ## the output of B
Y_AB1<-trial(AB1) ## the output of AB1
Y_AB2<-trial(AB2) ## the output of AB2
Y_AB3<-trial(AB3) ## the output of AB3
Y<-matrix(cbind(Y_A,Y_B),ncol=1) ## the matrix of total outputs
S1<-mean(Y_B*(Y_AB1-Y_A))/var(Y) ## first order effect of x1
St1<-(sum((Y_A-Y_AB1)^2)/(2*500))/var(Y) ## total order effect of x1
S2<-mean(Y_B*(Y_AB2-Y_A))/var(Y) ## first order effect of x2
St2<-(sum((Y_A-Y_AB2)^2)/(2*500))/var(Y) ## total order effect of x2
S3<-mean(Y_B*(Y_AB3-Y_A))/var(Y) ## first order effect of x3
St3<-(sum((Y_A-Y_AB3)^2)/(2*500))/var(Y) ## total order effect of x3
S<-matrix(c(S1,St1,S2,St2,S3,St3),nrow=2,ncol=3) ## define the results matrix
rownames(S)<-list("first order","total")
colnames(S)<-list("X1","X2","X3")
S ## print result
The results are:
X1 X2 X3
first order 0.01734781 0.2337758 0.5261082
total 0.01523861 0.4078107 0.6471387
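As a cross-check of these manual estimates (not part of my original code), the same kind of Jansen/Saltelli estimators are implemented in the sensitivity package; a minimal sketch, assuming that package is installed and reusing the A and B matrices defined above:
library(sensitivity)
trial_df <- function(X) X[,1] + X[,2]*X[,3]^2 ## same trial function, applied to the design data frame
sa <- soboljansen(model = trial_df, X1 = as.data.frame(A), X2 = as.data.frame(B))
print(sa) ## compare the first-order (S) and total (T) estimates with the matrix S above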
And then, I wanted to compute the theoretical results to validate the above estimated results. I follow the equations in the picture below:
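The picture is not reproduced here; for reference, the standard definitions from that Wikipedia article, which both the estimation above and the theoretical calculation below aim at, are
S_i = \operatorname{Var}_{X_i}\!\big(E_{X_{\sim i}}[Y \mid X_i]\big) / \operatorname{Var}(Y),
\qquad
S_{T_i} = E_{X_{\sim i}}\!\big[\operatorname{Var}_{X_i}(Y \mid X_{\sim i})\big] / \operatorname{Var}(Y),
and the estimators used in the code above are
\operatorname{Var}_{X_i}\!\big(E[Y \mid X_i]\big) \approx \frac{1}{N}\sum_{j=1}^{N} Y_B^{(j)}\big(Y_{AB_i}^{(j)} - Y_A^{(j)}\big),
\qquad
E\big[\operatorname{Var}(Y \mid X_{\sim i})\big] \approx \frac{1}{2N}\sum_{j=1}^{N} \big(Y_A^{(j)} - Y_{AB_i}^{(j)}\big)^2 .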
In order to compute the partial variance, I approached it in two ways: Monte Carlo integration and doing the integration by hand (I just don't trust the computer...).
Here is the code:
pimc=function(n){ ## Monte Carlo integration over x1, with x2 and x3 fixed at their means
a=runif(n)
g=(a+mean(x_2)*(mean(x_3)^2))^2
pimc=mean(g)
pimc
}
pimc(10000)/var(Y)
## doing the integration by hand
(1/3+mean(x_2)*(mean(x_3))^2+((mean(x_2))^2)*((mean(x_3))^4))/var(Y)
## first order effect x1
(mean(x_1)^2+mean(x_1)*mean(x_3)^2+(1/3)*mean(x_3)^4)/var(Y)
## first order effect x2
(mean(x_1)^2+(2/3)*(mean(x_1)*mean(x_2))+(1/5)*(mean(x_2)^2))/var(Y)
## first order effect x3
The results are the following:
[1,] 0.4414232 ## first-order index of x1 computed by Monte Carlo integration
[1,] 0.4414238 ## first-order index of x1 computed by hand
[1,] 0.0505044 ## first-order index of x2 computed by hand
[1,] 0.05086958 ## first-order index of x3 computed by hand
The results are consistent between the Monte Carlo integration and the hand calculation, but they are quite different from the estimated results. Why is this?
When we take a partial integral, we only look at one parameter and treat the other parameters as constants, so I chose to use the mean of each parameter as that constant. Is that wrong?

Related

Minimal depth interaction from randomForestExplainer package

When using the minimal depth interaction feature of the randomForestExplainer package in R, I'm getting some hard-to-interpret results.
I simulated some data (x1, x2, ..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n*p)+5, nrow=n)
colnames(xrandom)<- paste0("x",2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace=T))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1==1] <- y[d$x1==1]+5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data=d,localImp=T)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
  variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1       x1            x1       4.670732           0       x1:x1              1.703252
2       x1            x2       2.606190         221       x2:x1              1.703252
So, my question is: if x1:x1 has 0 occurrences (which is expected), then how can it also have a mean_min_depth?
Surely if it has 0 occurrences, it can't possibly have a minimum depth? [Or rather, the min depth should be 0 or NA.]
What's going on here? Am I misinterpreting something?
Thanks
My understanding is that this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces NAs with the mean depth of the maximal subtree whose root is x1. Details below.
What is this argument mean_sample for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees: for an interaction that only shows up in a small number of trees, taking the mean of the conditional minimum depth over just those trees ignores the fact that the interaction is not that important, so a small mean conditional minimum depth does not necessarily mean the interaction matters. To address this, specifying mean_sample = "all_trees" replaces the conditional minimum depth for trees where the interaction of interest is absent by the mean depth of the maximal subtree of the root variable. Basically, if we are looking at the interaction x1:x2, then for a tree where this interaction is absent it fills in the depth of the deepest (maximal) subtree whose root is x1. This gives a (hopefully large) numeric value to mean_min_depth of interaction x1:x2, thus making it less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. Now this is the default choice for mean_sample. My understanding is that it's similar to all_trees, but it tries to down-weight the contribution of replacing missing values. The motivation is that all_trees pulls mean_min_depth towards the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of replacing missing values, top_trees only calculates the mean conditional minimal depth on a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Let's say in your example that, out of those 500 trees, only 300 have any interaction x1:whatever; then we only consider those 300 trees when filling in the value for x1:x1. Because there are 0 occurrences of this interaction, replacing 500 NAs vs replacing 300 NAs with the same value doesn't affect the mean, so it's the same value 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf

R / Rolling Regression with extended Data Frame

Hello, I'm currently working on a regression analysis with the following code:
r2.out <- numeric(ncol(Ret1)) ## preallocate the results vector
for (i in 1:ncol(Ret1)){
r2.out[i]=summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of each column in the data frame against the first column and returns the R^2 of these regressions. At the moment each regression uses all data points of a column. What I need now is for the code to use a rolling window of data points instead, i.e. to calculate the R^2 over a rolling window of 30 days across the entire time frame. The output should be a matrix with all the R^2 values per rolling window for each (1,i) pair.
The code below does the rolling-regression part but does not run the regression for each (1,i) pair:
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared ## note: ignores the rolling window x and uses a fixed i
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq which takes the indexes to use, ix and the x variable name xname. We then sapply over the xnames and for each one rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
x1 x2 x3 x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply, replacing the last line of code with:
rollapply(1:n, 3, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, 3, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
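Applied to the question's setting, the same pattern would look roughly like the sketch below; it assumes Ret1 is a numeric data frame whose first column is the response, and uses the 30-day window from the question:
library(zoo)
xnames <- names(Ret1)[-1] ## regress column 1 against every other column
w <- 30 ## rolling window of 30 observations
rsq <- function(ix, xname) summary(lm(Ret1[ix, 1] ~ Ret1[ix, xname]))$r.squared
r2.roll <- sapply(xnames, function(xname) rollapply(1:nrow(Ret1), w, rsq, xname = xname))
## r2.roll is an (nrow(Ret1) - w + 1) x length(xnames) matrix of rolling R^2 values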

Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset, and then plot the number of principal components vs. classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. SVM).
EDIT: My question is how to feed the output of PCA into a classifier.
(The question included screenshots of the cleaned data and of the uncleaned original data, not reproduced here.)
I will show you a small example of how to use the kpca function of the kernlab package here.
I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set to show you how.
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the PCA you need to do:
kpc <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 4)
which is the same way you use it (here applied to the random df above). However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably pick 100 principal components and choose the first n principal components according to the highest eigenvalues. Let's see this in my random example after running the previous code:
In order to see the eigenvalues:
> kpc@eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to choose the n components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the SVM below.)
In order to access the principal components, i.e. the PCA output, you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix with 4 columns, i.e. the value of the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, and that is why you use @ in kpc@pcv.
You then need to use the above matrix to feed an SVM in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
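For the second part of the question (plotting the number of principal components against classification accuracy), here is a rough sketch along the same lines. It uses the random df from above; the component range and the in-sample accuracy are purely illustrative assumptions, and on the real data you would want cross-validated accuracy instead:
library(kernlab)
library(e1071)
n_comp <- 2:10 ## illustrative range for the number of principal components
acc <- sapply(n_comp, function(k) {
  kp <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = k)
  fit <- svm(kp@pcv, as.factor(df$y))
  mean(predict(fit, kp@pcv) == as.factor(df$y)) ## in-sample accuracy, for illustration only
})
plot(n_comp, acc, type = "b", xlab = "number of principal components", ylab = "classification accuracy")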
And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.

How to use glmnet in R for classification problems

I want to use glmnet in R for classification problems.
The sample data is as follows:
y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
0,0.65818014,38,1,0.085113375,3042,2,1,0,0,0
y is a binary response (0 or 1).
I used the following R code:
prr=cv.glmnet(x,y,family="binomial",type.measure="auc")
yy=predict(prr,newx, s="lambda.min")
However, the yy predicted by glmnet is scattered over roughly [-24, 5].
How can I restrict the output values to [0, 1] so that I can use them for classification?
I have read the manual again and found that type="response" in the predict method will produce what I want:
lassopre2=predict(prr,newx, type="response")
will output values between [0,1]
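To make the difference concrete, here is a small self-contained sketch on simulated data (the data and sizes are made-up assumptions, not the question's dataset), showing the three prediction types of a binomial cv.glmnet fit:
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 11), ncol = 11) ## simulated predictors
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2])) ## simulated binary response
prr <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
newx <- x[1:5, ]
predict(prr, newx, s = "lambda.min", type = "link") ## linear predictor, any real value
predict(prr, newx, s = "lambda.min", type = "response") ## fitted probabilities in [0, 1]
predict(prr, newx, s = "lambda.min", type = "class") ## predicted 0/1 class labels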
A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(fit)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.63000
## [2,] 2 0.0553 1.49000
## [3,] 2 0.1460 1.35000
## [4,] 2 0.2210 1.23000
It shows, from left to right, the number of nonzero coefficients (Df), the percent of null deviance explained (%Dev) and the value of λ (Lambda). Although by default glmnet calls for 100 values of lambda, the program stops early if %Dev does not change sufficiently from one lambda to the next (typically near the end of the path).
We can obtain the actual coefficients at one or more λ's within the range of the sequence:
coef(fit,s=0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.150928
## V1 1.320597
## V2 .
## V3 0.675110
## V4 .
## V5 -0.817412
Here is the original explanation by Hastie, for more information.

log covariance to arithmetic covariance matrix function?

Is there a function that can convert a covariance matrix built using log-returns into a covariance matrix based on simple arithmetic returns?
Motivation: We'd like to use a mean-variance utility function where expected returns and variances are specified in arithmetic terms. However, estimating returns and covariances is often performed with log-returns because of the additivity property of log returns, and we assume asset prices follow a lognormal stochastic process.
Meucci describes a process for generating an arithmetic-returns-based covariance matrix for a generic/arbitrary distribution of lognormal returns on page 5 of the appendix.
Here's my translation of the formulae:
linreturn <- function(mu,Sigma) {
  ## mean of arithmetic returns: E[R_i] = exp(mu_i + Sigma_ii/2) - 1
  m <- exp(mu+diag(Sigma)/2)-1
  ## covariance: Cov(R_i,R_j) = exp(mu_i + mu_j + (Sigma_ii + Sigma_jj)/2) * (exp(Sigma_ij) - 1)
  x1 <- outer(mu,mu,"+")
  x2 <- outer(diag(Sigma),diag(Sigma),"+")/2
  S <- exp(x1+x2)*(exp(Sigma)-1)
  list(mean=m,vcov=S)
}
edit: fixed -1 issue based on comments.
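For reference, these are the standard lognormal moment identities that linreturn computes, assuming the log-return vector satisfies \log(1+R) \sim N(\mu, \Sigma) (my notation, not from the original post):
E[R_i] = \exp\!\big(\mu_i + \Sigma_{ii}/2\big) - 1
\operatorname{Cov}(R_i, R_j) = \exp\!\big(\mu_i + \mu_j + (\Sigma_{ii} + \Sigma_{jj})/2\big)\,\big(\exp(\Sigma_{ij}) - 1\big)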
Try an example:
m1 <- c(1,2)
S1 <- matrix(c(1,0.2,0.2,1),nrow=2)
Generate multivariate log-normal returns:
set.seed(1001)
r1 <- exp(MASS::mvrnorm(200000,mu=m1,Sigma=S1))-1
colMeans(r1)
## [1] 3.485976 11.214211
var(r1)
## [,1] [,2]
## [1,] 34.4021 12.4062
## [2,] 12.4062 263.7382
Compare with expected results from formulae:
linreturn(m1,S1)
## $mean
## [1] 3.481689 11.182494
## $vcov
## [,1] [,2]
## [1,] 34.51261 12.08818
## [2,] 12.08818 255.01563
