I'm trying to understand Cramér's V and whether it fits my data (after a chi-squared test returns a significant association between variables), and I'm also looking for alternatives (I've heard a Spearman correlation might be better in some cases..?)
1st example: I have a yes/no response from users and there are 2 conditions (2x2 matrix)
library(rcompanion)
x <- matrix(c(144,120,66,90), nrow = 2)
colnames(x) <- c("No", "Yes")
rownames(x) <- c("Cond1", "Cond2")
cramerV(x, ci = TRUE, conf = 0.95)   # conf is a proportion (0.95, not 95); ci = TRUE adds a bootstrap confidence interval
2nd example: similar, but there are 3 answer options (so a 2x3 table).
library(rcompanion)
x <- matrix(c(57,27,87,93,66,90), nrow = 2)
colnames(x) <- c("NA", "No", "Yes")
rownames(x) <- c("Cond1", "Cond2")
cramerV(x, ci = TRUE, conf = 0.95)   # same as above: conf is a proportion and ci = TRUE requests the interval
Is this the correct use of Cramér's V? I'm struggling to understand what the p-value is actually telling me: how does a single p-value represent how strong the association is across all those values? Are there better alternatives? Thanks.
General information:
Cramér's V measures the strength of the statistical relationship between two nominally scaled variables (e.g. eye colour: green, blue, brown).
To obtain Cramér's V, the chi-square value (X²) is standardized by the sample size and the table dimensions. This makes relationships between different variables and tables comparable. (It is similar to Pearson's contingency coefficient, which is also a standardized measure of association based on chi-square (X²).)
Your example 1:
Here we want to assess the association between condition (Cond1/Cond2) and answer (Yes/No).
We need the chi-square value to calculate Cramér's V, therefore
we first determine the chi-square value and then convert it into Cramér's V.
We get Cramér's V = 0.1183.
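For reference, here is that conversion as a minimal sketch in base R (it uses the uncorrected chi-square statistic, which reproduces the 0.1183 above; that this matches cramerV's internal choice is an assumption on my part):
x <- matrix(c(144, 120, 66, 90), nrow = 2,
            dimnames = list(c("Cond1", "Cond2"), c("No", "Yes")))
chi2 <- chisq.test(x, correct = FALSE)$statistic   # X^2 without continuity correction
n <- sum(x)                                        # total sample size
k <- min(dim(x))                                   # smaller of (number of rows, number of columns)
V <- sqrt(chi2 / (n * (k - 1)))                    # Cramer's V = sqrt(X^2 / (n * (k - 1)))
unname(V)                                          # 0.1183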
Interpretation:
Cramer's V is always between 0 and 1.
0 means no association
1 means complete/very strong association
Rule of thumb:
Cramer's V
0 - 0.2 -> weak association
0.2 - 0.6 -> moderate association
0.6 - 1 -> strong association
In this example, with a Cramér's V of 0.12, there is a weak association between condition and answer.
Note:
As we use nominally scaled variables, we can assess the strength of the association but not its direction!
The interpretation is the same for example 2.
It does not matter how many categories the nominally scaled variables have.
The association here is 0.19 -> weak.
In your examples you can use Cramér's V to assess the strength of the association.
I'm estimating a Phillips Curve model and, as such, need to take into account the unemployment gap, which is the difference between actual unemployment and the NAIRU (here, the NAIRU is the unobservable variable).
I need to impose the following constraint: some of the coefficients (say, beta 1 and beta 2) in the Z matrix (which relates to the NAIRU) must be the same as in the D matrix (which accounts for unemployment). However, it doesn't seem possible to impose such a constraint on the Z and D matrices simultaneously.
Can you guys help me?
I've tried setting the same "names" for the coefficients in the Z and D matrices, but that didn't work.
I am learning to use R to cluster data points and I created a toy example. I use silhouette statistics to determine an optimal cluster number, but the optimal number it determines is not what I expect. I include all my steps and data below. I wonder if I have misunderstood/misused anything? I would really appreciate any comments!
First, the data matrix "m", loaded from a file, looks like this; each row is the feature vector of an object:
Then R code:
d <- dist(m, method="euclidean")
The distance matrix looks like this:
Next perform clustering:
clustering <- hclust(d, "average")
Then calculate silhouette, for all possible cluster numbers, i.e., 1<=i <=10:
library(cluster)   # for silhouette()

sub <- cutree(clustering, k=i)   # replace i with 1, 2, 3 ... 10
si <- silhouette(sub, d)
sm <- summary(si, FUN=mean)
sm   # print the summary, including the mean silhouette width
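The same steps can be wrapped in a single loop over k = 2..10 (the silhouette is undefined for a single cluster, hence the NaN for i=1 below):
library(cluster)

avg_sil <- sapply(2:10, function(k) {
  sub <- cutree(clustering, k = k)          # cluster labels when cutting the tree into k clusters
  mean(silhouette(sub, d)[, "sil_width"])   # mean silhouette width for this k
})
names(avg_sil) <- 2:10
avg_sil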
For example, I get the following mean silhouette values for each i:
i=1, NaN
i=2, 0.19
i=3, 0.157
....
i=8, 0.09
...
The maximum is i=2, suggesting there are two clusters, as below:
i.e.,
cluster1 = {4}
cluster2 = {all else}
I wonder why it is not predicting 3 clusters as below, which is what I expect to be reasonable:
cluster1 = {4}
cluster2 = {1,2,5,6,7}
cluster3 = {3,8,9,10}
I obtain this outcome by looking at the feature vectors of each object and grouping objects based on the fact that they have at least one feature in common with a non-zero value. Therefore, I cannot understand why cluster2 and cluster3 should be merged, as the highest silhouette value suggests.
Euclidean distance always considers all features; it does not treat 0 values as special in any way.
Given the large number of 0 values, you should use a different distance and/or algorithm.
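For example (just a sketch, reusing the m matrix from the question), the built-in "binary" distance in dist() treats features as present/absent and ignores features that are 0 for both objects:
d_bin <- dist(m, method = "binary")   # Jaccard-style distance: shared zeros no longer count as similarity
clustering_bin <- hclust(d_bin, "average")

library(cluster)
si_bin <- silhouette(cutree(clustering_bin, k = 3), d_bin)
summary(si_bin, FUN = mean)           # check whether k = 3 now scores better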
I am trying to build a second-order Markov chain model; right now I am trying to find the transition matrix from the following data.
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
Now, I know how to fit the first-order Markov transition matrix using the function markovchainFit(dat) in the markovchain package.
Is there any way to fit the second-order transition matrix?
How do I evaluate the Markov chain models, i.e. should I choose the first-order model or the second-order model?
This function should produce a Markov chain transition matrix to any lag order that you wish.
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
Markovmatrix <- function(X, l = 1){
  # pair each state with the state l steps later along each row of X,
  # tabulate those transitions, then convert the counts to row proportions
  tt <- table(X[, -c((ncol(X) - l + 1):ncol(X))], c(X[, -c(1:l)]))
  tt <- tt / rowSums(tt)
  return(tt)
}
Markovmatrix(as.matrix(dat),1)
Markovmatrix(as.matrix(dat),2)
where l is the lag.
e.g. for the 2nd-order (lag l = 2) matrix, the output is:
A B C D
A 0.2422803 0.2185273 0.2446556 0.2945368
B 0.2426304 0.2108844 0.2766440 0.2698413
C 0.2146119 0.2716895 0.2123288 0.3013699
D 0.2480000 0.2560000 0.2320000 0.2640000
As for how to test which order of model to use, there are several suggestions. One, put forward by Gottman and Roy (1990) in their introductory book on sequential analysis, is to use information value. There is a chapter on that; most of the chapter is available online.
You can also perform a likelihood-ratio chi-square test. This is very similar to an ordinary chi-square test in that you are comparing observed to expected frequencies of transitions. However, the statistic is G² = 2 Σ O · ln(O / E), where O and E are the observed and expected transition counts.
The degrees of freedom are the square of the number of codes minus one. In your case you have 4 codes, so (4 - 1)² = 9. You can then look up the associated p-value.
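A rough sketch of that test in R (lr_test is a hypothetical helper, not part of the markovchain package; obs is a matrix of observed transition counts, and the expected counts here come from the row/column margins, i.e. the independence model):
lr_test <- function(obs) {
  expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected counts from the margins
  G2    <- 2 * sum(obs * log(obs / expct), na.rm = TRUE)  # G^2 = 2 * sum(O * ln(O / E))
  df    <- (nrow(obs) - 1)^2                              # (number of codes - 1)^2
  p     <- pchisq(G2, df, lower.tail = FALSE)
  list(G2 = G2, df = df, p.value = p)
}

# Example: observed lag-1 transition counts (the table before converting to proportions)
Xm <- as.matrix(dat)
counts <- table(Xm[, -ncol(Xm)], c(Xm[, -1]))
lr_test(counts)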
I hope this helps.
Objective function to be maximized: pos %*% mu, where pos is the row vector of weights and mu is the column vector of mean returns of the d stocks.
Constraints: 1) ones %*% t(pos) = 1, where ones is a row vector of 1's of size 1*d (d is the number of stocks), i.e. the weights sum to 1.
2) pos %*% cov %*% t(pos) = rb^2, where cov is the covariance matrix of size d*d and rb is the risk budget, the free parameter whose values will be varied to draw the efficient frontier.
I want to write code for this optimization problem in R, but I can't think of a suitable function or library.
PS: solve.QP in the quadprog library has been used to minimize the portfolio variance subject to a target return. Can this function also be used to maximize return subject to a risk budget? How should I specify the Dmat matrix and dvec vector for this problem?
EDIT :
library(quadprog)
mu <- matrix(c(0.01,0.02,0.03),3,1)
cov # predefined covariance matrix of size 3*3
pos <- matrix(c(1/3,1/3,1/3),1,3) # random weights vector
edr <- pos%*%mu # expected daily return on portfolio
m1 <- matrix(1,1,3) # constraint no.1 ( sum of weights = 1 )
m2 <- pos%*%cov # constraint no.2
Amat <- rbind(m1,m2)
bvec <- matrix(c(1,0.1),2,1)
solve.QP(Dmat= ,dvec= ,Amat=Amat,bvec=bvec,meq=2)
How should I specify Dmat and dvec ? I want to optimize over pos
Also, I think I have not specified constraint no. 2 correctly. It should make the variance of the portfolio equal to the risk budget (rb^2).
(Disclaimer: There may be a better way to do this in R. I am by no means an expert in anything related to R, and I'm making a few assumptions about how R is doing things, notably that you're using an interior-point method. Also, there is likely an R package for what you're trying to do, but I don't know what it is or how to use it.)
Minimising risk subject to a target return is a linearly-constrained problem with a quadratic objective, looking like this:
min x^T Q x
subject to sum x_i = 1
sum ret_i x_i >= target
(and x >= 0 if you want to be long-only).
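If it helps, here is a sketch of how that first formulation maps onto quadprog's solve.QP (hypothetical inputs mu, cov and target; solve.QP minimises (1/2) b'Db - d'b subject to A'b >= b0, with the first meq constraints treated as equalities):
library(quadprog)

# Hypothetical helper: minimum-variance weights for a given target return
min_risk <- function(mu, cov, target) {
  d    <- length(mu)
  Dmat <- 2 * cov               # (1/2) w' Dmat w  =  w' cov w
  dvec <- rep(0, d)             # no linear term in the objective
  Amat <- cbind(rep(1, d),      # sum(w) = 1      (equality, via meq = 1)
                mu,             # mu' w >= target
                diag(d))        # w >= 0          (long-only)
  bvec <- c(1, target, rep(0, d))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}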
Maximising return subject to a risk budget is quadratically-constrained, however; it looks like this:
max ret^T x
subject to sum x_i = 1
x^T Q x <= riskbudget
(and maybe x >= 0).
Convex quadratic terms in the objective impose less of a computational cost in an interior-point method compared to introducing a convex quadratic constraint. With a quadratic objective term, the Q matrix just shows up in the augmented system. With a convex quadratic constraint, you need to optimise over a more complicated cone containing a second-order cone factor and you need to be careful about how you solve the linear systems that arise.
I would suggest you use the risk-minimisation formulation repeatedly, doing a binary search on the target parameter until you've found a portfolio approximately maximising return subject to your risk budget. I am suggesting this approach because it is likely sufficient for your needs.
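A sketch of that binary search, reusing the hypothetical min_risk helper from the block above (rb is the risk budget, i.e. the allowed standard deviation; this assumes rb is at least the risk of the global minimum-variance portfolio):
max_return_for_risk <- function(mu, cov, rb, tol = 1e-8) {
  lo <- min(mu)                            # a target return that is always attainable
  hi <- max(mu)                            # the largest attainable target return
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    w   <- min_risk(mu, cov, mid)
    if (t(w) %*% cov %*% w <= rb^2) lo <- mid else hi <- mid   # within budget: push the target up
  }
  min_risk(mu, cov, lo)                    # weights of the (approximately) return-maximising portfolio
}
Each bisection step solves one small QP, so this stays cheap for a handful of assets.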
If you really want to solve your problem directly, I would suggest using an interface to Todd, Toh, and Tutuncu's SDPT3. This really is overkill; SDPT3 lets you formulate and solve symmetric cone programs of your choosing. I would also note that portfolio optimisation problems are rather special cases of symmetric cone programs; other approaches exist that are reportedly very successful. Unfortunately, I'm not studied up on them.
I am new to R and cointegration, so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the west power system in Canada/US. The frequency is hourly (common in power systems), and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see, ca.jo tries to find linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1 (i.e. treats it as the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as the dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them were cointegrated (i.e. V1 ~ V2 + V3), then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I noticed that I just answered a 4-year-old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you also need the lag length of the VAR as input to the procedure). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the comparability with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of at most r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # for VAR()
library(urca)   # for ca.jo()

# Fit a VAR to obtain the optimal lag length; use the SC information criterion
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")

# Lag length of the VAR that best fits the data (ca.jo needs K >= 2)
lagLength <- max(2, varest$p)

# Perform the Johansen procedure for cointegration
# Allow an intercept in the cointegrating vector: data without zero mean
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval

# If the test statistic for r <= 0 exceeds the corresponding critical value, r <= 0 is
# rejected and we have at least one cointegrating vector.
# We use the 90% confidence level (first column of the critical values) for the decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1]) {
  # Keep the eigenvector with the largest eigenvalue. Note: we throw away the constant!!
  cointVector <- res@V[1:ncol(yourData), which.max(res@lambda)]
}
This piece of code checks whether there is at least one cointegrating vector (i.e. whether the null hypothesis r = 0 is rejected) and then extracts the vector with the strongest cointegrating properties, in other words the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you get your 3 different vectors. It is my understanding that the method just scales/normalises each vector on its first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, or linear transformations of them (combinations of the cointegrating vectors), will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one, so I'm sharing it with you, hoping it's the right solution.
By using the Johansen test you test for the rank (number of cointegrating vectors), and it also returns the eigenvectors, and the alphas and betas that build said vectors.
In theory, if you reject r=0 and accept r=1 (the test statistic for r=0 is greater than the critical value and that for r=1 is less than it), you would search for the highest eigenvalue and build your vector from it. In this case, if the highest eigenvalue were the first, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
Again, I'm not 100% sure, but I'm pretty sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.