How to use the mgarchBEKK package in R?

I am trying to estimate a BEKK GARCH model using the mgarchBEKK package, which is available here.
library(quantmod)
library(rugarch)
library(mgarchBEKK)
eps<- read.csv("C.csv", header=TRUE)
> head(eps)
v1 v2
1 -0.001936598 0.001968415
2 -0.000441797 -0.002724438
3 0.003752762 -0.010221719
4 -0.004511632 -0.014637860
5 -0.001426905 0.010597786
6 0.007435739 -0.005880712
> tail(eps)
v1 v2
1954 -0.043228944 0.0000530712
1955 0.082546871 -0.0028188110
1956 0.025058992 0.0058264010
1957 0.001751445 -0.0298050150
1958 -0.007973320 -0.0037243560
1959 -0.005207348 0.0012664230
## Simulate a BEKK process:
simulated <- simulateBEKK(2,1959, c(1,1), params = NULL)
## Prepare the input for the estimation process:
simulated1 <- do.call(cbind, simulated$eps)
## Estimate with default arguments:
estimated <- BEKK(simulated1)
H IS SINGULAR!...
H IS SINGULAR!...
Warning message:
In BEKK(simulated1) : negative inverted hessian matrix element
## Show diagnostics:
diagnoseBEKK(estimated)
## Likewise, you can estimate an mGJR process:
estimated2 <- mGJR(simulated1[,1], simulated1[,2])
I don't know what the problem in my code is: the results report 3968 as the number of estimated series, instead of 2 series.

You are estimating a model.
What does that mean?
To quote my statistics professors, the philosophy is to "find a stochastic model which may have created the observed series."
That is exactly what mgarchBEKK is doing. The model is fitted to the data you supplied (your v1 and v2 series). In simple terms, this means that many different parameter combinations are tried out and, of those (in your case 3968 tries), the combination that "fits best" is what you see in your results.
I did the same with 3 time series of the length 8596. My results look something like this:
Number of estimated series : 25788
Length of estimated series : 8596
Estimation Time : 3.258482
So the number of estimated series is way above the 3 vectors I used; note that it equals 3 × 8596, the number of series times their length.
The estimates look something like this (since this is a bi- or multivariate model-estimation you have parameter matrices, not single values as you would have with a one dimensional GARCH model):
C estimates:
[,1] [,2] [,3]
[1,] 0.9797469 0.2189191 0.202451941
[2,] 0.0000000 1.0649323 0.003050169
[3,] 0.0000000 0.0000000 0.896492130
ARCH estimates:
[,1] [,2] [,3]
[1,] 0.29110077 -0.008445699 0.008570904
[2,] -0.02109381 0.419092657 0.325321939
[3,] -0.01280835 -0.057648910 0.482502301
GARCH estimates:
[,1] [,2] [,3]
[1,] -0.27770297 0.03587415 -0.73029389
[2,] -0.05172256 -0.25601327 0.01918367
[3,] 0.07945086 0.03364686 -0.50664759
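As a rough illustration of how these three matrices enter the model, here is a minimal base-R sketch of one step of a BEKK(1,1) covariance recursion. It assumes the common parameterization H_t = C'C + A' e e' A + G' H G; whether mgarchBEKK uses exactly this transposition convention is an assumption on my part, and bekk_step, H0 and eps0 are hypothetical names and starting values chosen only for the demonstration.

```r
# Parameter matrices copied from the estimates shown above
C <- matrix(c(0.9797469, 0.2189191, 0.202451941,
              0,         1.0649323, 0.003050169,
              0,         0,         0.896492130), nrow = 3, byrow = TRUE)
A <- matrix(c( 0.29110077, -0.008445699, 0.008570904,
              -0.02109381,  0.419092657, 0.325321939,
              -0.01280835, -0.057648910, 0.482502301), nrow = 3, byrow = TRUE)
G <- matrix(c(-0.27770297,  0.03587415, -0.73029389,
              -0.05172256, -0.25601327,  0.01918367,
               0.07945086,  0.03364686, -0.50664759), nrow = 3, byrow = TRUE)

# One step of the BEKK(1,1) recursion (assumed convention, see above):
# H_t = C'C + A' eps_{t-1} eps_{t-1}' A + G' H_{t-1} G
bekk_step <- function(H_prev, eps_prev, C, A, G) {
  crossprod(C) +
    t(A) %*% tcrossprod(eps_prev) %*% A +
    t(G) %*% H_prev %*% G
}

H0   <- diag(3)            # hypothetical starting covariance
eps0 <- c(0.5, -1, 0.2)    # hypothetical previous residual vector
H1   <- bekk_step(H0, eps0, C, A, G)
H1
```

Each term is a symmetric quadratic form, so the conditional covariance matrix stays symmetric, which is the reason the model is specified with full matrices rather than single coefficients.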
I can't describe the math behind this fitting; as far as I know, some form of maximum likelihood estimation is used.
I'm a rookie in statistics, so if anything I said is wrong please feel free to correct me.

Related

Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset and then plot the number of principal components vs. classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. an SVM).
EDIT: My question is how to feed the output of the PCA into a classifier.
data looks like this (cleaned data)
uncleaned original data looks like this
I will show you a small example on how to use the kpca function of the kernlab package here:
I checked the colon-cancer file but it needs a bit of cleaning to be able to use it so I will use a random data set to show you how:
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the PCA you need to do:
library(kernlab)
kpc <- kpca(~., data=df[,-1], kernel="rbfdot", kpar=list(sigma=0.2), features=4)
which is the same way you use it. However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably compute 100 principal components and then choose the first n of them according to the highest eigenvalues. Let's see this in my random example after running the previous code:
In order to see the eigenvalues:
> kpc@eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to choose the n components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the svm below.)
In order to access the principal components i.e. the PCA output you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix with 4 columns (the number given in the features argument), which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, and that is why you use @ in kpc@pcv.
You then need to feed the above matrix into an svm in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
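To get from here to the "number of principal components vs. accuracy" plot asked about in the question, one possible sketch is shown below. It uses random stand-in data (not the colon-cancer file), and it takes the projected training data via kernlab's rotated() accessor rather than the @pcv slot; that choice, the 5-fold cross-validation, and the variable names are my assumptions, not something prescribed above.

```r
library(kernlab)  # kpca(), rotated()
library(e1071)    # svm()

set.seed(42)
# Random stand-in data, similar to the example above (not the colon-cancer data)
y  <- factor(rep(c(-1, 1), c(50, 50)))
df <- data.frame(matrix(runif(100 * 5), nrow = 100))

# Fit kernel PCA once with the largest number of components of interest
kpc <- kpca(~ ., data = df, kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = 4)
pcs <- rotated(kpc)  # training data projected onto the components

# 5-fold cross-validated SVM accuracy using the first k components
acc <- sapply(seq_len(ncol(pcs)), function(k) {
  fit <- svm(pcs[, 1:k, drop = FALSE], y, cross = 5)
  fit$tot.accuracy
})

plot(seq_along(acc), acc, type = "b",
     xlab = "Number of principal components",
     ylab = "Cross-validated accuracy (%)")
```

With random data the accuracy should hover around chance level; on real data the curve is what you would inspect to pick the number of components. For a held-out test set, predict(kpc, newdata) can be used to project new observations before passing them to the classifier.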
And that's it! A very good explanation I found on the internet about pca can be found here in case you or anyone else reading this wants to find out more.

Simulating data from multivariate distribution in R based on Winbugs/JAGS script

I am trying to simulate data, based on part of a JAGS/Winbugs script. The script comes from Eaves & Erkanli (2003, see, http://psych.colorado.edu/~carey/pdffiles/mcmc_eaves.pdf, page 295-296).
The part of the script I want to base my simulations on is as follows (different variable names than in the original paper):
for(fam in 1 : nmz ){
a2mz[fam, 1:N] ~ dmnorm(mu[1:N], tau.a[1:N, 1:N])
a1mz[fam, 1:N] ~ dmnorm(a2mz[fam, 1:N], tau.a[1:N, 1:N])
}
#Prior
tau.a[1:N, 1:N] ~ dwish(omega.g[,], N)
I want to simulate data in R for the parameters a2mz and a1mz as given in the script above.
So basically, I want to simulate data from -N- (e.g. = 3) multivariate distributions with -fam- (e.g. 10) persons with sigma tau.a.
To make this more illustrative: The purpose is to simulate genetic effects for -fam- (e.g. 10) families. The genetic effect is the same for each family (e.g. monozygotic twins), with a variance of tau.a (e.g. 0.5). Of these genetic effects, 3 'versions' (3 multivariate distributions) have to be simulated.
What I tried in R to simulate the data as given in the JAGS/Winbugs script is as follows:
library(MASS)
nmz = 10 #number of families, here e.g. 10
var_a = 0.5 #tau.g in the script
a2_mz <- mvrnorm(3, mu = rep(0, nmz), Sigma = diag(nmz)*var_a)
This simulates data for the a2mz parameter as referred to in the JAGS/Winbugs script above:
> print(t(a2_mz))
[,1] [,2] [,3]
[1,] -1.1563683 -0.4478091 -0.15037563
[2,] 0.5673873 -0.7052487 0.44377336
[3,] 0.2560446 0.9901964 -0.65463341
[4,] -0.8366952 0.4924839 -0.56891991
[5,] 0.7343780 0.5429955 0.87529201
[6,] 0.5592868 -0.3899988 -0.33709105
[7,] -1.8233663 -0.7149141 -0.18153049
[8,] -0.8213804 -1.4397075 -0.09159725
[9,] -0.7002797 -0.3996970 -0.29142215
[10,] 1.1084067 0.3884869 -0.46207940
However, when I then try to use these data to simulate data for a1mz (third line of the JAGS/Winbugs script), something goes wrong and I am not sure what:
a1_mz <- mvrnorm(3, mu = t(a2_mz), Sigma = c(diag(nmz)*var_a, diag(nmz)*var_a, diag(nmz)*var_a))
This results in the error:
Error in eigen(Sigma, symmetric = TRUE, EISPACK = EISPACK) :
non-square matrix in 'eigen'
Can anyone give me any hints or tips on what I am doing wrong?
Many thanks,
Best regards,
inga
mvrnorm() takes a mean vector and a covariance matrix as input, and that's not what you're feeding it. I'm not sure I understand your question, but if you want to simulate 3 samples from 3 different multivariate normal distributions with the same variance and different means, then just use:
a1_mz <- array(dim = c(dim(a2_mz), 3))
for (i in 1:3) a1_mz[,,i] <- mvrnorm(3, t(a2_mz)[,i], diag(nmz)*var_a)
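If the goal is instead a more literal translation of the JAGS/Winbugs loop, a sketch along the following lines may be closer. It assumes one N-dimensional draw per family, and it works with a covariance matrix Sigma_a, whereas dmnorm in JAGS/Winbugs is parameterized by the precision matrix (so Sigma_a plays the role of the inverse of tau.a). The value 0.5 and the diagonal structure are taken from the question's example; the names are hypothetical.

```r
library(MASS)  # mvrnorm()

set.seed(123)
nmz <- 10                  # number of families
N   <- 3                   # dimension of each multivariate normal
mu  <- rep(0, N)
Sigma_a <- diag(N) * 0.5   # assumed covariance, i.e. solve(tau.a)

# a2mz[fam, ] ~ MVN(mu, Sigma_a): one row per family
a2mz <- mvrnorm(nmz, mu = mu, Sigma = Sigma_a)

# a1mz[fam, ] ~ MVN(a2mz[fam, ], Sigma_a): adding zero-mean noise to each row
# is equivalent to drawing around a family-specific mean
a1mz <- a2mz + mvrnorm(nmz, mu = rep(0, N), Sigma = Sigma_a)

dim(a1mz)
```

Adding zero-mean noise avoids calling mvrnorm() once per family and sidesteps the problem in the question, where a matrix was passed as mu.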

How to use glmnet in R for classification problems

I want to use the glmnet in R to do classification problems.
The sample data is as follows:
y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
0,0.65818014,38,1,0.085113375,3042,2,1,0,0,0
y is a binary response (0 or 1).
I used the following R code:
prr=cv.glmnet(x,y,family="binomial",type.measure="auc")
yy=predict(prr,newx, s="lambda.min")
However, the predicted yy by glmnet is scattered between [-24,5].
How can I restrict the output values to [0,1] so that I can use them for classification?
I have read the manual again and found that type="response" in the predict method will produce what I want:
lassopre2=predict(prr,newx, s="lambda.min", type="response")
This outputs values in [0,1].
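The reason the default output is spread over something like [-24,5] is that, for a binomial model, the default prediction is on the link (log-odds) scale. A small base-R sketch of the mapping, using made-up link values rather than real glmnet output, and a 0.5 cutoff that is only one possible choice:

```r
# Hypothetical link-scale predictions (log-odds), spanning the range in the question
yy_link <- c(-24, -1.2, 0, 0.8, 5)

# Inverse logit: maps log-odds to probabilities in [0, 1]
prob <- plogis(yy_link)

# Classify with a 0.5 probability cutoff (equivalently, link value > 0)
pred_class <- as.integer(prob > 0.5)

data.frame(yy_link, prob, pred_class)
```

glmnet's predict method also accepts type="class" for binomial models, which returns the predicted label directly.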
A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(fit)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.63000
## [2,] 2 0.0553 1.49000
## [3,] 2 0.1460 1.35000
## [4,] 2 0.2210 1.23000
It shows from left to right the number of nonzero coefficients (Df), the percent (of null) deviance explained (%Dev) and the value of λ (Lambda). Although by default glmnet calls for 100 values of lambda, the program stops early if %Dev does not change sufficiently from one lambda to the next (typically near the end of the path).
We can obtain the actual coefficients at one or more λ's within the range of the sequence:
coef(fit,s=0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.150928
## V1 1.320597
## V2 .
## V3 0.675110
## V4 .
## V5 -0.817412
Here is the original explanation for more information by Hastie

using eigenvalues to test for singularity: identifying collinear columns

I am trying to check if my matrix is singular using the eigenvalues approach (i.e. if one of the eigenvalues is zero then the matrix is singular). Here is the code:
z <- matrix(c(-3,2,1,4,-9,6,3,12,5,5,9,4),nrow=4,ncol=3)
eigen(t(z)%*%z)$values
I know the eigenvalues are sorted in descending order. Can someone please let me know if there is a way to find out what eigenvalue is associated to what column in the matrix? I need to remove the collinear columns.
It might be obvious in the example above but it is just an example intended to save you time from creating a new matrix.
Example:
z <- matrix(c(-3,2,1,4,-9,6,3,12,5,5,9,4),nrow=4,ncol=3)
m <- crossprod(z) ## slightly more efficient than t(z) %*% z
Compute the eigenvalues; a zero eigenvalue (here the third one) means the corresponding eigenvector describes a collinear combination of columns:
ee <- eigen(m)
(evals <- zapsmall(ee$values))
## [1] 322.7585 124.2415 0.0000
Now examine the corresponding eigenvectors, which are listed as columns corresponding to their respective eigenvalues:
(evecs <- zapsmall(ee$vectors))
##            [,1]       [,2]       [,3]
## [1,] -0.2975496 -0.1070713  0.9486833
## [2,] -0.8926487 -0.3212138 -0.3162278
## [3,] -0.3385891  0.9409343  0.0000000
The third eigenvalue is zero; the first two elements of the third eigenvector (evecs[,3]) are non-zero, which tells you that columns 1 and 2 are collinear.
Here's a way to automate this test:
testcols <- function(ee) {
  ## split eigenvector matrix into a list, by columns
  evecs <- split(zapsmall(ee$vectors), col(ee$vectors))
  ## for zero eigenvalues, list the non-zero eigenvector components
  mapply(function(val, vec) {
    if (val != 0) NULL else which(vec != 0)
  }, zapsmall(ee$values), evecs)
}
testcols(ee)
## [[1]]
## NULL
## [[2]]
## NULL
## [[3]]
## [1] 1 2
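You can also use tmp <- svd(z) to compute a singular value decomposition. The singular values are stored in tmp$d as a vector (diag(tmp$d) displays them as a diagonal matrix); their squares are the eigenvalues of t(z) %*% z, so a singular value that is numerically zero indicates collinearity. This also works with a non-square matrix.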
> diag(tmp$d)
[,1] [,2] [,3]
[1,] 17.96548 0.00000 0.000000e+00
[2,] 0.00000 11.14637 0.000000e+00
[3,] 0.00000 0.00000 8.787239e-16
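A minimal sketch of how the SVD route can also identify the collinear columns, using the same z. The tolerance rule and the variable names are my own choices for the illustration, not part of either answer above.

```r
z <- matrix(c(-3,2,1,4,-9,6,3,12,5,5,9,4), nrow = 4, ncol = 3)
sv <- svd(z)

# Squared singular values of z match the eigenvalues of t(z) %*% z
ev <- eigen(crossprod(z), symmetric = TRUE)$values
same <- isTRUE(all.equal(sv$d^2, ev))

# Singular values below a numerical tolerance are treated as zero
tol <- max(dim(z)) * max(sv$d) * .Machine$double.eps
null_idx <- which(sv$d < tol)

# Non-zero entries of the matching right singular vector flag the collinear columns
null_vec <- zapsmall(sv$v[, null_idx])
collinear_cols <- which(null_vec != 0)
collinear_cols
```

Here column 2 of z is exactly 3 times column 1, which is why those two columns carry the non-zero weights in the null vector.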

log covariance to arithmetic covariance matrix function?

Is there a function that can convert a covariance matrix built using log-returns into a covariance matrix based on simple arithmetic returns?
Motivation: We'd like to use a mean-variance utility function where expected returns and variances are specified in arithmetic terms. However, returns and covariances are often estimated with log-returns because of the additivity property of log-returns, and we assume asset prices follow a lognormal stochastic process.
Meucci describes a process to generate an arithmetic-returns-based covariance matrix for a generic/arbitrary distribution of lognormal returns on Appendix page 5.
Here's my translation of the formulae:
linreturn <- function(mu,Sigma) {
  m <- exp(mu+diag(Sigma)/2)-1
  x1 <- outer(mu,mu,"+")
  x2 <- outer(diag(Sigma),diag(Sigma),"+")/2
  S <- exp(x1+x2)*(exp(Sigma)-1)
  list(mean=m,vcov=S)
}
edit: fixed -1 issue based on comments.
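For reference, the formulae that the function implements can be written out as follows. This is my reading of the code above, assuming log-returns Y ~ N(mu, Sigma) and arithmetic returns R_i = exp(Y_i) - 1; these are the standard lognormal moments:

```latex
\begin{aligned}
\mathbb{E}[R_i] &= \exp\!\left(\mu_i + \tfrac{1}{2}\Sigma_{ii}\right) - 1, \\
\operatorname{Cov}(R_i, R_j) &= \exp\!\left(\mu_i + \mu_j + \tfrac{1}{2}\left(\Sigma_{ii} + \Sigma_{jj}\right)\right)\left(\exp(\Sigma_{ij}) - 1\right).
\end{aligned}
```

The first line corresponds to m in the code, the exponent in the second line to x1 + x2, and the last factor to exp(Sigma) - 1. Subtracting 1 shifts the mean but leaves the covariance unchanged.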
Try an example:
m1 <- c(1,2)
S1 <- matrix(c(1,0.2,0.2,1),nrow=2)
Generate multivariate log-normal returns:
set.seed(1001)
r1 <- exp(MASS::mvrnorm(200000,mu=m1,Sigma=S1))-1
colMeans(r1)
## [1] 3.485976 11.214211
var(r1)
## [,1] [,2]
## [1,] 34.4021 12.4062
## [2,] 12.4062 263.7382
Compare with expected results from formulae:
linreturn(m1,S1)
## $mean
## [1] 3.481689 11.182494
## $vcov
## [,1] [,2]
## [1,] 34.51261 12.08818
## [2,] 12.08818 255.01563

Resources