I am looking for a way to create a specified correlation between 2 variables, regardless of their distribution, given that the ordering is allowed to change. The motivation has to do with Bayesian statistics.
Imagine variable a which holds 100 random normal numbers, while
variable b holds the numbers 1...100.
There are 100 factorial possible permutations of variable b, and most of the time these permutations cover correlations with a anywhere between -0.95 and 0.95.
I wrote a little script in R to try to find the correlation in an iterative way.
Iterate through all the indexes, checking whether the previous correlation is
lower or higher than the sought correlation.
If the correlation is too low it will switch the number belonging to the index with the number belonging to a random index lower.
If the correlation is too high it will switch the number belonging to the index with the number belonging to a random index higher.
It will then check whether the new correlation is better than the old one, and keep the one closest to the wanted correlation.
It will keep going over all the indices in order (from 1 to 100), and after every iteration it checks whether the correlation is within the wanted correlation +/- tolerance; if so, it returns the permuted variable.
Usually the specified correlation is found within a tolerance of 0.0005 in around 2000 iterations.
Index in the picture represents iterations.
My question is how to do this permutation in a smarter way, such that the correlation is found more quickly.
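For readers who want something to run, here is a minimal sketch of a swap-based search in the spirit of the steps above. It is a simplified variant, not the exact script: instead of picking a random lower or higher index depending on the sign of the error, it proposes a swap of two random positions and keeps it only if the correlation moves closer to the target. The function name permute_to_cor and the max_iter argument are hypothetical.

```r
# Simplified variant (assumption): propose a random swap of two positions
# in y and keep it only if it brings cor(x, y) closer to the target.
permute_to_cor <- function(x, y, target, tol = 5e-4, max_iter = 1e5) {
  best <- abs(cor(x, y) - target)
  for (iter in seq_len(max_iter)) {
    if (best < tol) break
    idx <- sample(length(y), 2)      # two distinct random positions
    cand <- y
    cand[idx] <- y[rev(idx)]         # swap them
    err <- abs(cor(x, cand) - target)
    if (err < best) {                # accept only improvements
      y <- cand
      best <- err
    }
  }
  y
}

set.seed(42)
a <- rnorm(100)
b <- 1:100
b2 <- permute_to_cor(a, b, target = 0.5)
cor(a, b2)
```

Because only improving swaps are accepted, the search can in principle stall before reaching the tolerance; max_iter caps the work in that case.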
Update, based on flodel's idea of proposing several candidates at each iteration: here it actually tests all candidates. While this is fine for my variables of length 100, sampling a subset of candidates would be preferable for longer vectors.
AnnealCor <- function(x, y, corpop, tol) {
  while (abs(cor(x, y) - corpop) > tol) {
    for (i in seq_along(y)) {
      # correlation obtained by swapping y[i] with each candidate y[j]
      correlation <- numeric(length(y))
      for (j in seq_along(y)) {
        switcher <- y
        switcher[c(i, j)] <- y[c(j, i)]
        correlation[j] <- cor(x, switcher)
      }
      # keep the swap that lands closest to the wanted correlation
      tokeep <- which.min(abs(correlation - corpop))
      y[c(i, tokeep)] <- y[c(tokeep, i)]
      if (abs(cor(x, y) - corpop) < tol) break
    }
  }
  return(y)
}
Benchmark time based on 100 repetitions has a median of 200 milliseconds.
I would like to simulate data for some cases (e.g. nPerson = 1000 observations) at
some consecutive timesteps (e.g. ts = 3) for N intercorrelated variables (e.g. N = 5).
The simulation should be based on a correlation matrix (corrMat, nrows = N, ncols = N).
corrMat should be identical for all timesteps.
I already found out that the MASS package has a function to create
random data fitting the constraints given by corrMat.
library(MASS)
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)
Now I would like to simulate t2 as a function of t1 and corrMat.
The data of t2 therefore should correlate according to corrMat
and they should also have same variance as the variables of t1.
One important constraint: for the initial values corrMat[i,i] = 1, but
for consecutive timesteps it should be possible that corrMat[i,i] < 1,
because each variable depends on itself one timestep before,
but a perfect correlation is not intended.
Maybe there is a variance decomposition of the correlation matrix,
that calculates an error variance for each of the n variables at the
next time step, so that one could calculate the
values at timestep t+1 as sum of the weighted correlations of the
variables at timestep t and then adding a random error, distributed
according to the error variance (with mean of error = 0) that replicates
the correlation matrix again at t+1.
Assuming normal errors:
getRand <- function (range) {
return (rnorm(1,mean=0, sd=range) )
}
This is the (very simplified) code for the i-th variable x_i, where x[t, i] holds variable i at timestep t:
x[t + 1, i] <- 0
for (j in 1:N) {
  x[t + 1, i] <- x[t + 1, i] + corrMat[i, j] * x[t, j]
}
x[t + 1, i] <- x[t + 1, i] + getRand(sdErr)
So the question would be more specific: how to calculate sdErr?
For simplification I try to assume, that the variance for all variables
should be 1.
Thank you for any hint, how to get one step further!
I will do a mathematical formulation of the problem to stats.stackexchange.com,
as mikeck suggested to discuss details of the correlation problems more
in depth.
I am still interested in finding a general formula to calculate sdErr
to use it in the calculation of x_i[t+1].
But meanwhile I found a useful practical solution to the specific question "how to calculate sdErr?" without a formula for sdErr:
(1) simply calculate all variables WITHOUT errors (according to the equation above).
(2) calculate variances of the new variables
(3) calculate (for each i) differences var(x_i[t]) - var(x_i[t+1]) = sdErr ^ 2
So a random error with this sdErr can be added to each variable for each new observation.
This should lead to observations at t+1 which at least have the same variances as the observations in t.
Details concerning the question of whether the model definition is adequate
will be part of another post.
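The three practical steps above can be sketched as follows. This is only an illustration under stated assumptions: the weight matrix W (with diagonal below 1) and all numbers are hypothetical, and the errors are assumed to be independent of the values at timestep t. Under that independence assumption, the variance of the error-free update for variable i is the i-th diagonal entry of W %*% corrMat %*% t(W), so with unit variances sdErr[i]^2 = 1 minus that entry. The sketch computes sdErr both ways.

```r
library(MASS)

set.seed(1)
nPerson <- 1000
N <- 3
corrMat <- matrix(0.3, N, N)
diag(corrMat) <- 1
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)

# Assumption: hypothetical transition weights with diagonal < 1
W <- matrix(0.1, N, N)
diag(W) <- 0.6

# (1) next timestep WITHOUT errors: x_i[t+1] = sum_j W[i, j] * x_j[t]
t2_noerr <- t1 %*% t(W)

# (2) + (3) variance gap between timestep t and the error-free timestep t+1
varGap <- apply(t1, 2, var) - apply(t2_noerr, 2, var)
stopifnot(all(varGap >= 0))   # otherwise no error variance can close the gap
sdErr <- sqrt(varGap)

# Same quantity from the closed form (valid because all variances are 1)
sdErrFormula <- sqrt(1 - diag(W %*% corrMat %*% t(W)))

# add independent normal errors, one column per variable
err <- sapply(sdErr, function(s) rnorm(nPerson, mean = 0, sd = s))
t2 <- t2_noerr + err
round(apply(t2, 2, var), 2)
```

If the weights are taken to be corrMat itself with a unit diagonal, the variance gap is negative, which is one way to see why corrMat[i,i] < 1 is needed for consecutive timesteps. Note that this only restores the variances; it does not by itself guarantee that the correlations between the variables at t+1 reproduce corrMat.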
I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances 1/d. As it is, for the Iris dataset I get almost exactly 66% accuracy (no matter the metric used) since value no. 3 ("virginica") almost never shows up, and I want to make it better with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours with those distances.
I've tried creating vectors of classes and distances to K nearest neighbours and then taking weighted mean from it:
inverted <- function(vals, distances)
{
inv_distances <- 1 / distances
# eliminate division-by-zero errors
inv_distances <- ifelse((inv_distances < 0.01), 0.01, inv_distances)
weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clear this weighting scheme for me.
EDIT:
I've debugged above code, since it multiplied by weight too late (therefore not eliminating distance 0 and causing NaNs). I've also changed it to harmonic series weights, not using distance (so first neighbour has weight 1, second 1/2, third 1/3 etc.). I still don't know exactly how it works and what other weights may be.
inverted <- function(vals)
{
weights <- 1 / seq(length(vals))
res <- weighted.mean(vals, weights)
res
}
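One common reading of "weighting the classes of the K nearest neighbours" is a weighted vote rather than a weighted mean of class codes: each neighbour adds its weight to its own class, and the class with the largest total wins. The sketch below assumes that reading; the helper name weighted_vote and the eps floor are hypothetical. Distances are clamped before inversion, so a distance of 0 never produces Inf or NaN, and the weights do not need to sum to 1 because only their relative size matters for the vote.

```r
# Hypothetical helper: distance-weighted vote among the K nearest neighbours.
# classes:   class labels of the K nearest neighbours
# distances: their distances to the query point
weighted_vote <- function(classes, distances, eps = 1e-8) {
  w <- 1 / pmax(distances, eps)       # clamp first, so 1/0 never occurs
  scores <- tapply(w, classes, sum)   # total weight collected by each class
  names(scores)[which.max(scores)]    # label with the largest total weight
}

weighted_vote(c("setosa", "virginica", "virginica"), c(0.1, 0.5, 0.6))
```

In this call the single close "setosa" neighbour (weight 10) outweighs the two more distant "virginica" neighbours (weights 2 and about 1.67), which an unweighted majority vote would not do.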
Objective function to be maximized : pos%*%mu where pos is the weights row vector and mu is the column vector of mean returns of d stocks
Constraints: 1) ones%*%pos = 1 where ones is a row vector of 1's of size 1*d (d is the number of stocks)
2) pos%*%cov%*%t(pos) = rb^2 # where cov is the covariance matrix of size d*d and rb is risk budget which is the free parameter whose values will be changed to draw the efficient frontier
I want to write a code for this optimization problem in R but I can't think of any function or library for help.
PS: solve.QP in library quadprog has been used to minimize covariance subject to a target return. Can this function also be used to maximize return subject to a risk budget? How should I specify the Dmat matrix and dvec vector for this problem?
EDIT :
library(quadprog)
mu <- matrix(c(0.01,0.02,0.03),3,1)
cov # predefined covariance matrix of size 3*3
pos <- matrix(c(1/3,1/3,1/3),1,3) # random weights vector
edr <- pos%*%mu # expected daily return on portfolio
m1 <- matrix(1,1,3) # constraint no.1 ( sum of weights = 1 )
m2 <- pos%*%cov # constraint no.2
Amat <- rbind(m1,m2)
bvec <- matrix(c(1,0.1),2,1)
solve.QP(Dmat= ,dvec= ,Amat=Amat,bvec=bvec,meq=2)
How should I specify Dmat and dvec ? I want to optimize over pos
Also, I think I have not specified constraint no.2 correctly. It should make the variance of portfolio equal to the risk budget.
(Disclaimer: There may be a better way to do this in R. I am by no means an expert in anything related to R, and I'm making a few assumptions about how R is doing things, notably that you're using an interior-point method. Also, there is likely an R package for what you're trying to do, but I don't know what it is or how to use it.)
Minimising risk subject to a target return is a linearly-constrained problem with a quadratic objective, looking like this:
min x^T Q x
subject to sum x_i = 1
sum ret_i x_i >= target
(and x >= 0 if you want to be long-only).
Maximising return subject to a risk budget is quadratically-constrained, however; it looks like this:
max ret^T x
subject to sum x_i = 1
x^T Q x <= riskbudget
(and maybe x >= 0).
Convex quadratic terms in the objective impose less of a computational cost in an interior-point method compared to introducing a convex quadratic constraint. With a quadratic objective term, the Q matrix just shows up in the augmented system. With a convex quadratic constraint, you need to optimise over a more complicated cone containing a second-order cone factor and you need to be careful about how you solve the linear systems that arise.
I would suggest you use the risk-minimisation formulation repeatedly, doing a binary search on the target parameter until you've found a portfolio approximately maximising return subject to your risk budget. I am suggesting this approach because it is likely sufficient for your needs.
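A minimal sketch of that binary search with quadprog, under stated assumptions: short positions are allowed (no x >= 0 constraint), the covariance matrix Sigma and all numbers are made up for illustration, and the helper names min_risk and max_return_for_risk are hypothetical. solve.QP minimises 1/2 x^T Dmat x - dvec^T x subject to t(Amat) %*% x >= bvec, with the first meq constraints treated as equalities, so the risk-minimisation problem uses Dmat = 2 * Sigma and dvec = 0.

```r
library(quadprog)

# Minimum-risk portfolio for a given target return (risk-minimisation form)
min_risk <- function(target, mu, Sigma) {
  d <- length(mu)
  Amat <- cbind(rep(1, d), as.numeric(mu))  # columns: sum(x) = 1, mu'x >= target
  sol <- solve.QP(Dmat = 2 * Sigma, dvec = rep(0, d),
                  Amat = Amat, bvec = c(1, target), meq = 1)
  x <- sol$solution
  list(x = x, risk = sqrt(sum(x * (Sigma %*% x))), ret = sum(mu * x))
}

# Bisection on the target return: lo must be attainable within the risk
# budget rb, hi must not be (both are assumptions the caller has to satisfy).
max_return_for_risk <- function(mu, Sigma, rb, lo, hi, iters = 60) {
  for (k in seq_len(iters)) {
    mid <- (lo + hi) / 2
    if (min_risk(mid, mu, Sigma)$risk <= rb) lo <- mid else hi <- mid
  }
  min_risk(lo, mu, Sigma)
}

mu <- c(0.01, 0.02, 0.03)
Sigma <- diag(c(0.01, 0.04, 0.09))   # made-up covariance matrix
rb <- 0.15                           # risk budget (standard deviation)
best <- max_return_for_risk(mu, Sigma, rb, lo = 0, hi = 1)
best
```

Repeating this for a grid of rb values traces out the efficient frontier, one point per risk budget.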
If you really want to solve your problem directly, I would suggest using an interface to Todd, Toh, and Tutuncu's SDPT3. This really is overkill; SDPT3 permits you to formulate and solve symmetric cone programs of your choosing. I would also note that portfolio optimisation problems are particularly special cases of symmetric cone programs; other approaches exist that are reportedly very successful. Unfortunately, I'm not studied up on them.
I want to generate a process where in every step there is a realisation of a Poisson random variable. This realisation should be saved, and then the next Poisson random variable should be realised and added to the sum of all previous realisations. Furthermore, there should be a chance in every step that this process stops. Hope that makes sense to you guys... Any thought is appreciated!
More compactly, pick a single geometrically distributed random number for the total number of steps achieved before stopping, then use cumsum to sum that many Poisson deviates:
stopping.prob <- 0.3 ## for example
lambda <- 3.5 ## for example
n <- rgeom(1, stopping.prob) + 1 ## constant probability per step of stopping; rgeom counts the steps survived before the stop
cumsum(rpois(n,lambda))
You are very vague on the parameters of your simulation but how's this?
Lambda for random Poisson number.
lambda <- 5
This is the threshold value when the function exits.
th <- 0.999
Create a vector of length 1000, filled with NA so that unused slots can be told apart from genuine Poisson zeros.
bin <- rep(NA_real_, 1000)
Run the darn thing. It basically rolls a "die" (values generated are between 0 and 1). If the value is below th, it stores a random Poisson number. Otherwise the loop stops.
for (i in seq_along(bin)) {
  if (runif(1) < th) {
    bin[i] <- rpois(1, lambda = lambda)
  } else {
    break  # stopping criterion hit: leave the loop without raising an error
  }
}
Remove the unused slots, if any.
bin <- bin[!is.na(bin)]
You can use cumsum to cumulatively sum values.
cumsum(bin)
I am new to R and cointegration so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the west power system in Canada/US. The frequency is hourly (common in power) and cointegrated combinations can have as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see ca.jo tries to find linear combinations of the 3 variables but by forcing the coefficient on the first variable (in this case V1) to be 1 (i.e. the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as a dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them are cointegrated (i.e. V1 ~ V2 + V3) then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words the coefficient of the variable that are NOT cointegrated should be zero and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I noticed that I just answered a 4-year-old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing that the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you need the lag length for the VAR as input to the procedure as well). The procedure will then investigate the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated then the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the similarity with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of less than or equal to r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
# Load the packages that provide VAR() and ca.jo()
library(vars)
library(urca)
# We fit data to a VAR to obtain the optimal VAR length. Use SC information criterion to find optimal model.
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# obtain lag length of VAR that best fits the data (ca.jo needs K >= 2)
lagLength <- max(2, varest$p)
# Perform Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
# teststat and cval are slots of the S4 object returned by ca.jo
testStatistics <- res@teststat
criticalValues <- res@cval
# If the test statistic for r <= 0 (the last entry) is greater than the corresponding critical value, then r <= 0 is rejected and we have at least one cointegrating vector
# We use the 10% significance level (first column of the critical values) to make our decision
if (testStatistics[length(testStatistics)] >= criticalValues[nrow(criticalValues), 1]) {
  # Keep the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!!
  cointVector <- res@V[1:ncol(yourData), which.max(res@lambda)]
}
This piece of code checks if there is at least one cointegrating vector (that is, whether r <= 0 is rejected) and then picks out the vector with the highest cointegrating properties, or in other words the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations, that is why you have your 3 different vectors. It is my understanding that the method just scales/normalizes the vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations or linear transformations of them, or combinations of the cointegrating vectors, will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
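As a rough illustration of testing the smaller combinations first, the sketch below runs the trace test on every pair of variables in a small simulated data set. Everything here is an assumption made for the example: the simulated series, the lag K = 2, the 10% level, and the helper name has_cointegration. With 1500-2000 real series the number of pairs is large, so some pre-filtering would be needed in practice.

```r
library(urca)

# Hypothetical helper: TRUE if the trace test rejects r = 0 at the 10% level
has_cointegration <- function(dat, K = 2) {
  res <- ca.jo(dat, type = "trace", ecdet = "const", K = K, spec = "longrun")
  last <- length(res@teststat)            # last entry corresponds to r = 0
  res@teststat[last] >= res@cval[last, 1] # first column holds the 10% values
}

set.seed(2)
n <- 400
trend <- cumsum(rnorm(n))                 # shared random-walk component
dat <- data.frame(V1 = trend + rnorm(n),  # V1 and V2 share the trend
                  V2 = trend + rnorm(n),
                  V3 = cumsum(rnorm(n)))  # V3 is an unrelated random walk

pairs <- combn(names(dat), 2, simplify = FALSE)
found <- sapply(pairs, function(p) has_cointegration(dat[, p]))
names(found) <- sapply(pairs, paste, collapse = "~")
found
```

Pairs flagged here could then be extended with further variables, keeping the extra variable only if it is wanted in the relationship.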
I've been searching for an answer to this and I think I found one so I'm sharing with you hoping it's the right solution.
By using the Johansen test you test for the rank (number of cointegration vectors), and it also returns the eigenvectors, and the alphas and betas that build said vectors.
In theory if you reject r=0 and accept r=1 (test statistic for r=0 > critical value and test statistic for r=1 < critical value) you would search for the highest eigenvalue and from that build your vector. In this case, if the highest eigenvalue was the first, it would be V1*1+V2*(-0.26)+V3*(-0.64).
This would generate the cointegration residuals for these variables.
Again, I'm not 100% sure, but I'm pretty sure the above is how it works.
Nonetheless, you can always use the cajorls function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the cointegration rank (https://cran.r-project.org/web/packages/urca/urca.pdf).
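A small sketch of both steps on simulated data, assuming a rank of 1 and no deterministic term in the cointegration relation (ecdet = "none"); the data and variable names are made up for illustration.

```r
library(urca)

set.seed(1)
n <- 500
common <- cumsum(rnorm(n))                 # shared random-walk trend
dat <- data.frame(V1 = common + rnorm(n),
                  V2 = 0.5 * common + rnorm(n),
                  V3 = cumsum(rnorm(n)))

res <- ca.jo(dat, type = "trace", ecdet = "none", K = 2, spec = "longrun")

# Eigenvector belonging to the largest eigenvalue, normalised to V1
beta <- res@V[, which.max(res@lambda)]

# Cointegration residuals: the linear combination V1*beta[1] + V2*beta[2] + V3*beta[3]
spread <- as.matrix(dat) %*% beta

# Restricted VECM for an assumed cointegration rank of 1
vecm <- cajorls(res, r = 1)
vecm$beta
```

The first element of beta is 1 because urca normalises the eigenvectors to the first variable, which matches the "normalised to first column" output shown in the question.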
If someone could confirm / correct this, it would be appreciated.