Design Covariance Matrix in a simulation study in R in an efficient way

In my simulation study I need to come up with a covariance matrix for multivariate data.
My data:
dataset <- data.frame(observation = rep(1:8, 2),
                      plot = rep(1:4, each = 2),
                      time = rep(1:2, 8),
                      treatment = rep(c("A", "B", "A", "B"), each = 4),
                      OutputVariable = rep(c("P", "Q"), each = 8))
This dataset is multivariate: for every observation (1:8) there is more than one result. In this case, we observe a value for OutputVariable P and for OutputVariable Q at the same time. Note that the actual outputs are not in this dataset, as I will generate them at a later stage.
The desired covariance matrix would be 16x16, where CovarMat[2,9] holds the covariance between the second row (observation 2 of variable P) and the 9th row (observation 1 of variable Q) of the dataset.
The value of, for instance, CovarMat[2,9] is based on rules like these:
CovarMat[2,9] = 0
If dataset$plot[2] == dataset$plot[9], then CovarMat[2,9] = CovarMat[2,9] + 1.5
If dataset$time[2] == dataset$time[9], then CovarMat[2,9] = CovarMat[2,9] + 1.5
If (dataset$plot[2] == dataset$plot[9]) & (dataset$time[2] == dataset$time[9]), then CovarMat[2,9] = CovarMat[2,9] + 3
If abs(dataset$time[2] - dataset$time[9]) == 1, then CovarMat[2,9] = CovarMat[2,9] + 2
Using for-loops that's easy enough (and that's what I did up to now). But my current dataset has 13,200 lines, so CovarMat would consist of 174,240,000 cells. Therefore, I am in desperate need of a more efficient way.
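For what it's worth, here is a hedged sketch of one vectorised approach using outer(): each rule becomes a whole-matrix comparison instead of a double loop (samePlot, sameTime and adjTime are hypothetical helper names, not from the original post). Note that for 13,200 rows each outer() call allocates a matrix on the order of 0.7 GB (logical) to 1.4 GB (numeric), so the result may need to be built block-wise, but no explicit loops are required:
# Hedged sketch: vectorised construction of CovarMat via outer()
samePlot <- outer(dataset$plot, dataset$plot, "==")
sameTime <- outer(dataset$time, dataset$time, "==")
adjTime  <- abs(outer(dataset$time, dataset$time, "-")) == 1
CovarMat <- 1.5 * samePlot + 1.5 * sameTime + 3 * (samePlot & sameTime) + 2 * adjTime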

Related

Interpreting the psych::cor.smoother function

I've tried to contact William Revelle about this but he isn't responding.
In the psych package there is a function called cor.smoother, which determines whether or not a correlation matrix is positive definite. Its explanation is as follows:
"cor.smoother examines all of nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigen value decomposition. It reports those variables, which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations great than cut. These are all hints as to what might be wrong with a correlation matrix."
It is really the statement in bold that I am hoping someone can interpret in a more understandable way for me.
A belated answer to your question.
Correlation matrices are said to be improper (or more accurately, not positive semi-definite) when at least one of the eigenvalues of the matrix is less than 0. This can happen if you have some missing data and are using pairwise-complete correlations. It is particularly likely to happen if you are doing tetrachoric or polychoric correlations based upon data sets with some or even a lot of missing data.
(A correlation matrix, R, may be decomposed into a set of eigenvectors (X) and eigenvalues (lambda) where R = X lambda X'. This decomposition is the basis of components analysis and factor analysis, but that is more than you want to know.)
The cor.smooth function finds the eigen values and then adjusts the negative ones by making them slightly positive (and adjusting the other ones to compensate for this change).
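Conceptually, cor.smooth does something like this minimal sketch (the idea only, not the psych source code; the function name is illustrative):
# Raise negative eigenvalues to a small positive value, rescale so the
# eigenvalues still sum to nvar, rebuild, and restore the unit diagonal
smooth_cor_sketch <- function(R, eps = 1e-12) {
  e <- eigen(R, symmetric = TRUE)
  lambda <- pmax(e$values, eps)
  lambda <- lambda * length(lambda) / sum(lambda)
  cov2cor(e$vectors %*% diag(lambda) %*% t(e$vectors))
}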
The cor.smoother function attempts to identify the variables that are making the matrix improper. It does this by considering all the matrices generated by dropping one variable at a time and seeing which of those are not positive semi-definite (i.e. have eigenvalues < 0). Ideally, this will identify one variable that is messing things up. A sketch of the idea follows.
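A minimal sketch of that drop-one check (again, not the psych source; it assumes R is the improper correlation matrix):
# Count the negative eigenvalues of each leave-one-out submatrix; variables
# whose removal leaves zero negative eigenvalues are the likely culprits
neg_after_drop <- sapply(seq_len(ncol(R)), function(i)
  sum(eigen(R[-i, -i], symmetric = TRUE, only.values = TRUE)$values < 0))
names(neg_after_drop) <- colnames(R)
neg_after_drop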
An example of this is in the burt data set where the sorrow-tenderness correlation was probably mistyped and the .87 should be .81.
cor.smoother(burt) #identifies tenderness and sorrow as likely culprits

R: Putting Variables in order by a different variable

Once again I have been set another programming task, most of which I have done, so a quick run-through: I had to take n samples from a multivariate normal distribution with dimension p (call it X) and put them into a matrix (Matx), where the first two values in each row were summed along with a value randomly drawn from the standard normal distribution (call this vector Y). Then we had to order Y numerically and split it up into H groups, and then I had to find the mean of each row in the matrix, ordering those means by which Y group they were associated with. I've struggled a fair bit and have now hit a brick wall. Quite confusing, I understand; if anyone could help it'd be greatly appreciated!
Task: Return the p x H matrix which has in the first column the mean of the observations in the first group, and in the Hth column the mean of the observations in the Hth group.
Code:
library(MASS)
# 36 draws from N(0,1); mu and Sigma are scalars here, so this is univariate
x <- mvrnorm(36, 0, 1)
Matx <- matrix(x, ncol = 6, byrow = TRUE)   # 6 rows of 6 values each
v <- rnorm(6)
# sum the first two values of each row plus a standard-normal draw
y1 <- sum(x[1:2], v[1])
y2 <- sum(x[7:8], v[2])
y3 <- sum(x[13:14], v[3])   # was x[12:13], which straddles rows 2 and 3
y4 <- sum(x[19:20], v[4])
y5 <- sum(x[25:26], v[5])
y6 <- sum(x[31:32], v[6])
y <- c(y1, y2, y3, y4, y5, y6)
out <- order(y)
# H = 3 groups of two rows each, by the ordering of y
h1 <- out[1:2]
h2 <- out[3:4]
h3 <- out[5:6]
# row means of Matx (row i is x[(6*i - 5):(6*i)])
d <- rowMeans(Matx)[order(out)]
d
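For reference, here is a hedged sketch of the whole task done vectorially (assuming n = 6 observations, p = 6 dimensions and H = 3 groups to match the code above; grp and the other names are hypothetical):
library(MASS)
n <- 6; p <- 6; H <- 3
X <- mvrnorm(n, mu = rep(0, p), Sigma = diag(p))   # n draws from a p-variate standard normal
y <- X[, 1] + X[, 2] + rnorm(n)                    # first two coordinates plus N(0,1) noise
grp <- rep(1:H, each = n / H)[rank(y)]             # split the ordering of y into H equal groups
# p x H matrix: column h is the coordinate-wise mean of the observations in group h
sapply(split(seq_len(n), grp), function(idx) colMeans(X[idx, , drop = FALSE]))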

Fisher's exact test (R) - simulated p-value does not vary

I have a problem using Fisher's exact test in R with a simulated p-value, but I don't know if it's caused by the technique (R) or if it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
When I shorten the matrix, it works if the number of rows is lower than 19. Nevertheless, it is not a matter of the number of cells in the matrix: after transforming it into a matrix with 3 columns it still does not work, although it does when using the same numbers in just two columns.
Varying simulated p-values:
>a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
>b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
>c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
>fisher.test(a,simulate.p.value=TRUE)$p.value
The number of cells in a and c is the same, but the simulation only works with matrix a.
Does anyone know if it is a statistical issue or an R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is computed as the number of simulated (and the original) matrices that are as extreme or more extreme than the original, divided by the total number of matrices. If none of the randomly generated matrices is as or more extreme, then the p-value will just be 1 (the original matrix is as extreme as itself) divided by the total number of matrices, which is B + 1 (the B simulated plus the 1 original matrix). If you run the function with enough samples (high enough B) then you will start to see some of the random matrices being as or more extreme, and therefore varying p-values, but the time to do so is probably not reasonable.
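A hedged illustration using matrix b from the question: with B replicates the smallest possible simulated p-value is 1 / (B + 1), reached when none of the simulated tables is as extreme as the observed one.
B <- 2000                                                # fisher.test's default
fisher.test(b, simulate.p.value = TRUE, B = B)$p.value   # 1 / (B + 1) here
1 / (B + 1)                                              # 0.0004997501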

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix whose dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. In other words, in each row the first column is n and the remaining k columns are the probabilities of the k categories. Another point is that the matrix is sparse: in each row there are many columns with a value of 0.
Here's a toy matrix I created:
mat <- rbind(c(5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
             c(2, 0.2, 0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0),
             c(22, 0.4, 0.6, 0, 0, 0, 0, 0, 0, 0, 0),
             c(5, 0.5, 0.2, 0, 0.1, 0.2, 0, 0, 0, 0, 0),
             c(4, 0.4, 0.15, 0.15, 0.15, 0.15, 0, 0, 0, 0, 0),
             c(10, 0.6, 0.1, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws <- apply(mat, 1, function(x) rmultinom(samples, x[1], x[-1]))
where, say, samples = 100000.
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive, at least in terms of RAM. Is there a more efficient solution to this problem in R?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
# per-category variances; column j of the result corresponds to row j of mat
draws <- apply(mat, 1, function(x) apply(rmultinom(samples, x[1], x[-1]), 1, var))
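As a sanity check (a standard multinomial identity, not part of the original answer), the exact variance of each count is n * p_i * (1 - p_i), so the simulated values can be compared against it directly:
# Exact multinomial variances: Var(X_i) = n * p_i * (1 - p_i)
exact <- apply(mat, 1, function(x) x[1] * x[-1] * (1 - x[-1]))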

R, cointegration, multivariate, ca.jo(), Johansen

I am new to R and cointegration so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the western power system in Canada/US. The frequency is hourly (common in power) and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
          V1.l2      V2.l2      V3.l2
V1.l2  1.0000000  1.0000000  1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678  0.5429844
As you can see, ca.jo finds linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1 (i.e. the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as a dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them are cointegrated (i.e. V1 ~ V2 + V3) then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to try to take a stab at this one. EDIT: I noticed that I just answered a four-year-old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you need the lag length of the VAR as input to the procedure as well). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice their comparability with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero, and each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of at most r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # VAR()
library(urca)   # ca.jo()
# Fit data to a VAR to obtain the optimal VAR lag length; use the SC information criterion
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# The Johansen procedure needs at least 2 lags
lagLength <- max(2, varest$p)
# Perform the Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval
# If the test statistic for r <= 0 is greater than the corresponding critical value,
# then r <= 0 is rejected and we have at least one cointegrating vector
# We use the 90% confidence level (first column of the critical values) to decide
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1])
{
  # Return the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!
  return(res@V[1:ncol(yourData), which.max(res@lambda)])
}
This piece of code checks whether there is at least one cointegrating vector (r <= 0 is rejected) and then returns the vector with the strongest cointegrating properties, in other words the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you get your 3 different vectors. It is my understanding that the method just scales/normalises each vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, or linear transformations of them, will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one, so I'm sharing it with you hoping it's the right solution.
By using the Johansen test you test for the rank (the number of cointegrating vectors), and it also returns the eigenvectors and the alphas and betas that build said vectors.
In theory, if you reject r = 0 and accept r = 1 (the test statistic for r = 0 is greater than its critical value and the one for r = 1 is below it), you would search for the highest eigenvalue and build your vector from its eigenvector. In this case, if the highest eigenvalue were the first, it would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
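Following that logic, a hedged sketch of how those residuals might be computed (reusing res and yourData from the code above; resid_coint is a hypothetical name, and the constant row of V is dropped as before):
w <- res@V[1:ncol(yourData), which.max(res@lambda)]   # cointegrating weights, normalised to the first variable
resid_coint <- as.matrix(yourData) %*% w              # should look stationary if cointegration holds
plot(resid_coint, type = "l")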
Again, I'm not 100% sure, but pretty sure the above is how it works.
Nonetheless, you can always use the cajorls function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of cointegrating ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.
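For what it's worth, a hedged usage sketch (assuming res is the ca.jo object from the earlier answer and rank 1 was found):
vecm <- cajorls(res, r = 1)   # restricted VECM with one cointegrating vector
vecm$beta                     # the normalised cointegrating vector
vecm$rlm                      # the OLS-estimated VECM equations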
