Simulation to find random sequences - r

Using R, I am trying to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which gave p-value = 0.2892. Colleagues used the rle function (run length encoding in R), among other approaches, to simulate the probability that random allocation would generate the observed sequence. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find R code that replicates their findings; any help on how to set up such a simulation is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) ;
randtests::runs.test(Age);
X <- rle(Age);X$lengths

What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even just checking position-wise identity (the two vectors agree in 10 positions in the original) using permutations within one group, I'm seeing only 2 or 3 draws per million with a similar number of identical values. I.e., something like:
set.seed(123)
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).
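For instance, a permutation test on the correlation between the two columns could look like the sketch below. This is one possible approach, not necessarily what the article's authors did; it only assumes group1 and group2 as defined above.

set.seed(123)
obs_cor  <- cor(group1, group2)
# permute one group many times and record how often the correlation is at least as extreme
perm_cor <- replicate(1e6, cor(sample(group1), group2))
mean(abs(perm_cor) >= abs(obs_cor))   # permutation p-value (an add-one correction is optional)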


Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations/dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large sample-size differences, I took a subset of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case AT_EABL originally had 1,221 observations).
It has now been suggested that I use bootstrapping to check whether the parameter estimates from my subsets are similar to the full-dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subsetting and calculate the average of the coefficients so I can compare them to the coefficients from my model fitted to the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset),
                                      nrow(AT_EABL_subset), replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,
                                 CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           binomial, data = boot.sample)
}
sample_coef_intercept <- c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <- c(sample_coef_x1, model_bootstrap$coefficients[2])
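For what it's worth, here is a minimal sketch of how such a loop is often written for models fitted with glmer. The fitted object is an S4 merMod, so $coefficients is not defined and the fixed effects are extracted with fixef(); resampling rows with replace = TRUE is what makes this a non-parametric bootstrap rather than a reshuffle. Variable names and the formula are taken from the question; this is a sketch under those assumptions, not a tested solution.

library(lme4)

n_boot <- 200
sample_coef_intercept <- numeric(n_boot)
sample_coef_x1 <- numeric(n_boot)

for (i in seq_len(n_boot)) {
  # resample rows WITH replacement
  boot.sample <- AT_EABL_subset[sample(nrow(AT_EABL_subset), replace = TRUE), ]
  model_bootstrap <- glmer(
    cbind(YOUNG_HOST_TOTAL_ATLEAST,
          CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
      as.factor(YEAR) + (1 | LatLong),
    family = binomial, data = boot.sample)
  # fixef() returns the fixed-effect estimates of a merMod fit
  sample_coef_intercept[i] <- fixef(model_bootstrap)[1]
  sample_coef_x1[i] <- fixef(model_bootstrap)[2]
}

mean(sample_coef_intercept)   # compare with the intercept from the full-data model
mean(sample_coef_x1)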

Calculation of Precision, Recall and F-Score with confusionMatrix

We have developed an algorithm that detects the number of repetitions per resistance exercise machine from accelerometer data. People always performed 10 repetitions, 2 sets per machine.
n people x 10 repetitions x 2 sets = total number of repetitions performed.
Now, I wanted to calculate the precision, recall and F-score with confusionMatrix from the caret package.
I made an xlsx file with two rows representing the real (upper row) and the algorithmically predicted (lower row) number of repetitions, as depicted in the picture:
I coded the following:
reps_prec_phone1<- read.xlsx("Reps_for_Precision_Recall_FSCORE.xlsx", sheet = "2Vec_Phone1", startRow = 0, colNames = FALSE)
reps_prec_pred_phone1<-as.factor(reps_prec_phone1[1,])
reps_prec_real_Phone1<-as.factor(reps_prec_phone1[2,])
result_phone1 <- confusionMatrix(reps_prec_pred_phone1, reps_prec_real_Phone1, mode="prec_recall")
The result looks like this:
As you can see in the confusionMatrix, 385 sets (1 set consists of 10 repetitions) were counted instead of 3850 repetitions. Now I am wondering how, methodologically, I can get confusionMatrix to count the number of repetitions instead of the number of sets.
In my case the error rate is 1 - Accuracy = 2.5%. Since 1 set consists of 10 repetitions, i.e. sets and repetitions differ by a factor of 10, I could simply divide the error rate by 10 and recalculate the accuracy as 1 - 0.0025 = 0.9975. However,
does anyone know how to solve this issue with confusionMatrix?
Thank you in advance for your brain power & experience!
There's a theoretical mistake here.
A confusion matrix compares observed values with predicted values. R converts your data to a factor, so your values {10, 11} are interpreted as the levels of that factor, not as numeric counts, and R then tallies the matches. Put shortly, a confusion matrix does not do what you think it does.
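A tiny made-up example of that behaviour (hypothetical values, not the questioner's data): each element of the two factor vectors contributes exactly one count to the table, so a value of 10 is treated as the label "10", not as 10 repetitions.

library(caret)

real      <- factor(c(10, 10, 10, 10, 10), levels = c(10, 11))
predicted <- factor(c(10, 10, 11, 10, 10), levels = c(10, 11))

confusionMatrix(predicted, real, mode = "prec_recall")
# the table contains 5 counts in total: one per set, not one per repetition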
Also, any model will produce biased predictions because your data is extremely unbalanced; put shortly, there is almost nothing to predict.
So you don't have a programming problem; it's more a theoretical one. Cross Validated (Stack Exchange) is a good place to clear up these ideas.

Fisher's exact test (R) - simulated p-value does not vary

I have a problem using Fisher's exact test in R with a simulated p-value, but I don't know whether it is caused by the implementation (R) or whether it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
When I shorten the matrix, the simulated p-value varies as long as the number of rows is lower than 19. Nevertheless, it is not a matter of the number of cells in the matrix: after reshaping the same numbers into a matrix with 3 columns it still does not work, although it does work with the same numbers in just two columns.
Varying simulated p-values:
a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
fisher.test(a,simulate.p.value=TRUE)$p.value
The numbers of cells in a and c are the same, but the simulation only works with matrix a.
Does anyone know whether this is a statistical issue or an R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is computed as the number of simulated matrices (plus the original) that are as extreme or more extreme than the original, divided by the total number of matrices. If none of the randomly generated matrices is as or more extreme, then the p-value is just 1 (the original matrix is as extreme as itself) divided by the total number of matrices, which is $B+1$ (the B simulated matrices plus the 1 original matrix). If you run the function with enough samples (a high enough B) you will eventually start to see some of the random matrices being as or more extreme, and therefore varying p-values, but the time needed to do so is probably not reasonable.
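One way to see this in action is sketched below. B = 1e5 is an arbitrary, larger-than-default number of replicates (the call takes correspondingly longer), and the matrix is the one from the question.

d <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)

# default B = 2000 replicates: the smallest attainable p-value is 1/2001
fisher.test(d, simulate.p.value = TRUE)$p.value

# more replicates lower that floor to 1/(B+1); the estimate stays at the floor
# as long as no simulated table is as extreme as the observed one
fisher.test(d, simulate.p.value = TRUE, B = 1e5)$p.value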

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk normality test. My data is in CSV format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in `[.data.frame`(x, complete.cases(x)) :
undefined columns selected
Why isn't R selecting the right column, and how do I do that?
What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects as input a numeric vector that corresponds to the sample you would like to test, and it is the only required input. Since you have a data.frame, you'll have to pass the desired column to the function, as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing "against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one could test the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.
With the null hypothesis that shapiro.test actually uses, a p-value <= 0.05 rejects the null hypothesis that the samples come from a normal distribution. To put it loosely, it says there is only a small chance that the samples came from a normal distribution. The side effect of this kind of hypothesis testing is that such a rejection happens only rarely. To illustrate, take for example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data:  x
# W = 0.9601, p-value = 0.08995
So, according to this test, this (particular) sample runif(50, min=2, max=4) is consistent with coming from a normal distribution, even though it was drawn from a uniform distribution. What I am trying to say is that there are many, many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to acceptance of the "NULL hypothesis" most of the time, and this might be misleading.
Another issue, which I'd like to quote here from @PaulHiemstra's comments, concerns the effect of large sample sizes:
An additional issue with the Shapiro-Wilk test is that when you feed it more data, the chance of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit protects this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
If the NULL hypothesis were the opposite, i.e. that the samples do not come from a normal distribution, and you got a p-value < 0.05, then you would conclude that it is very unlikely that these samples do not come from a normal distribution (you would reject that NULL hypothesis). That loosely translates to: it is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting it). I believe this is what Ian Fellows was also trying to explain in his post. Please correct me if I've got something wrong!
@PaulHiemstra also comments on practical situations (regression, for example) where one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyse your data correctly.
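For completeness, a minimal sketch of that diagnostic-plot workflow on simulated data (the data here is invented purely for illustration):

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)   # toy data that roughly satisfies the lm assumptions

fit <- lm(y ~ x)

# the four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location and residuals vs leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(fit)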
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A Cook's distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
Hope this clears things up a bit.
You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)
You failed to specify the exact columns (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)
Set the data up as a vector and then pass it to the function.

R, cointegration, multivariate, co.ja(), johansen

I am new to R and cointegration so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the west power system in Canada/US. The frequency is hourly (common in power systems), and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see, ca.jo tries to find linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1 (i.e. treats it as the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as a dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them are cointegrated (i.e. V1 ~ V2 + V3) then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I noticed that I just answered a 4-year-old question. Hopefully it will still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing that the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you need the lag length of the VAR as input to the procedure as well). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, then the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the similarity to the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of less than or equal to r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # VAR()
library(urca)   # ca.jo()

# Fit a VAR to the data to obtain the optimal lag length; use the SC information criterion.
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# Obtain the lag length of the VAR that best fits the data
lagLength <- max(2, varest$p)

# Perform the Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval

# If the test statistic for r <= 0 is greater than the corresponding critical value,
# then r <= 0 is rejected and we have at least one cointegrating vector.
# We use the 90% confidence level to make our decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1])
{
  # Return the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!
  return(res@V[1:ncol(yourData), which.max(res@lambda)])
}
This piece of code checks whether there is at least one cointegrating vector (i.e. whether r <= 0 is rejected) and then returns the vector with the strongest cointegrating properties, in other words the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you get your 3 different vectors. It is my understanding that the method simply scales/normalizes each vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations or linear transformations of them, or combinations of the cointegrating vectors, will be stationary. However, I'm not aware of how to perform these extra checks in R.
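For what it's worth, the urca package does ship a likelihood-ratio test for linear restrictions on the cointegrating vectors, blrtest(). A sketch, assuming a hypothetical ca.jo fit named cajo_fit on a three-variable system estimated with ecdet = "none" (with ecdet = "const", as above, the constant adds an extra row to beta and hence to H), and a hypothetical restriction that the third variable does not enter the cointegrating relation:

library(urca)

# H maps the restricted parameters into the full beta space: beta = H %*% phi.
# Here: first two coefficients free, third restricted to zero (hypothetical example).
H <- matrix(c(1, 0,
              0, 1,
              0, 0), nrow = 3, byrow = TRUE)

restricted <- blrtest(z = cajo_fit, H = H, r = 1)   # cajo_fit: a ca.jo object, r: assumed rank
summary(restricted)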
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one, so I'm sharing it here in the hope that it's the right solution.
With the Johansen test you test for the rank (the number of cointegrating vectors); it also returns the eigenvectors, and the alphas and betas that build said vectors.
In theory, if you reject r = 0 and accept r = 1 (test statistic for r = 0 > critical value and for r = 1 < critical value), you would search for the highest eigenvalue and build your vector from its eigenvector. In this case, if the highest eigenvalue were the first, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
Again, I'm not 100% certain, but I'm pretty sure the above is how it works.
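A small sketch of that last step, assuming a hypothetical ca.jo fit stored in cajo_fit that was estimated with ecdet = "none" on the data matrix yourData, so that cajo_fit@V has exactly one row per variable:

# eigenvector associated with the largest eigenvalue = strongest cointegrating relation
coint_vec <- cajo_fit@V[, which.max(cajo_fit@lambda)]

# the cointegrating combination of the series; it should look (close to) stationary
# if the variables really are cointegrated
coint_resid <- as.matrix(yourData) %*% coint_vec
plot.ts(coint_resid)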
Nonetheless, you can always use the cajorls function from the urca package to estimate the VECM automatically. You only need to feed it a ca.jo object and specify the cointegration rank (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.
