Calculation of Precision,Recall and F-Score with confusionMatrix - r

We have developed an algorithm that detects number of repetions per resistance exercise machine out of accelerometer data. People performed always 10 repetitions 2x per machine.
n people x 10 repetitions x 2 sets = total amount of repetitions performed .
Now, I wanted to calculate the precision, recall and f-score with confusionMatrix from the caret package.
I made an xlsx file with two rows representing real (upper row) and algorithmically predicted number of repetitions (lower row) as depicted in the picture:
I coded the following:
reps_prec_phone1<- read.xlsx("Reps_for_Precision_Recall_FSCORE.xlsx", sheet = "2Vec_Phone1", startRow = 0, colNames = FALSE)
reps_prec_pred_phone1<-as.factor(reps_prec_phone1[1,])
reps_prec_real_Phone1<-as.factor(reps_prec_phone1[2,])
result_phone1 <- confusionMatrix(reps_prec_pred_phone1, reps_prec_real_Phone1, mode="prec_recall")
The result looks like this:
As you can see in the confusionMatrix, 385 sets (1 set consists of 10 repetitions) instead of 3850 repetitions were counted. Now I am wondering, methodologically how can I get confusionMatrix to calculate the number of repetitions instead of the number of sets.
In my case the error rate is 1-Accuracy = 2.5%. As 1 set consists of 10 repetitions. As set vs repetition is a factor of 10, I could simply divide the error rate by 10 and recalulate the accuracy 1-0.0025 = 0.9975. However,
does anyone know how to solve this issue with confusionMatrix?
Thank you in advance for your brain power & experience!

There's a theoretical mistake.
A confusion matrix is made to compare observations given with predicted values, R convert your data as factor, then your values {10,11} are interpreted as the levels of that factor not as numeric values, then R count the ties. In short put, you have a wrong idea about what a confusion matrix is.
Also, any model will perform biased predictions because you have extremely unbalanced data, in short put, there's nothing to predict.
Then, you don't have a programming problem it's more a theoretical one. Visit Stack Exchange to clear your ideas.
Visit me!

Related

Simulation to find random sequences

With R I can try to find the probability that the Age vector below resulted from random sampling. I used the runs test (from randtests package) with resulted in p-value = 0.2892. Other colleagues used the rle functune (run length encoding in R) or others to simulate whether the probabilities of random allocation generating the observed sequences. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings. any help is highly appreciated on how to simulate to replicate their findings.
Update: I received advice from statistician that I can do this using non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) ;
randtests::runs.test(Age);
X <- rle(Age);X$lengths
What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even checking identity along position (10 entries in original) with permutations within a group, I'm seeing only 2 or 3 draws per million that have a similar number of identical values. I.e., something like:
set.seed(123)
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).

fisher's exact test (R) - simulated p-value does not vary

I have a problem using fisher’s exact test in R with a simulated p-value, but I don’t know if it’s a caused by “the technique” ( R ) or if it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test:
p = 1 / (B+1)
(B = number of replicates used in the Monte Carlo test)
When I shorten the matrix it works if the number of rows is lower than 19. Nevertheless it is not a matter of number of cells in the matrix. After transforming it into a matrix with 3 columns it still does not work, although it does when using the same numbers in just two columns.
Varying simulated p-values:
>a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=2,nrow=18)
>b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704),ncol=2,nrow=19)
>c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21),ncol=3,nrow=12)
>fisher.test(a,simulate.p.value=TRUE)$p.value
Number of cells in a and b are the same, but the simulation only works with matrix a.
Does anyone know if it is a statistical issue or a R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is being computed as the number of simulated (and the original) matrices that are as extreme or more extreme than the original. If none of the randomly generated matrices are as or more extreme then the p-value will just be 1 (the original matrix is as extreme as itself) divided by the total number of matrices which is $B+1$ (the B simulated and the 1 original matrix). If you run the function with enough samples (high enough B) then you will start to see some of the random matrices as or more extreme and therefor varying p-values, but the time to do so is probably not reasonable.

Boxplot including outliers in R, make the whole ranges being compared.

I am comparing several values using R, they are 8 variables stored in 1000 length vectors. That means, 1000*8 matrix, 8 columns represent 8 variables.
Then I call
boxplot(test),
I got like:
The mean values of 8 variables are very close to each other. Which makes the comparison and interpretation very hard. Can I include all the outliers in my plot ? Then the whole range would be easier to compare ? Or any other suggestions could be given to distinguish these variables ?
Here is the boxplot in question (since the OP doesn't have the rep to post pictures):
It looks like the medians (and likely also the means) are pretty much identical, but the variances differ between the eight categories, with category 1 having the lowest and 8 the highest variance. Depending on the real question involved, these two pieces of information (similar median/mean, different variance) may already be enough.
If you want a formal significance test whether the variances are equal, you can use Hartley's or Bartlett's test. If you want to formally test equality of means with unequal variances (so ANOVA is not appropriate), look here.

Error probability function

I have DNA amplicons with base mismatches which can arise during the PCR amplification process. My interest is, what is the probability that a sequence contains errors, given the error rate per base, number of mismatches and the number of bases in the amplicon.
I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6]
that proposes this formula to calculate the probability mass function in such cases.
I implemented the formula with R as shown here
pcr.prob <- function(k,N,eps){
v = numeric(k)
for(i in 1:k) {
v[i] = choose(N,k-i) * (eps^(k-i)) * (1 - eps)^(N-(k-i))
}
1 - sum(v)
}
From the article, suggest we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85e10-5 misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequences was generated by three independent PCR errors equals P = 0.0011.
However when I use my implementation of the formula I get a different value.
pcr.prob(3,800,0.0000185)
[1] 5.323567e-07
What could I be doing wrong in my implementation? Am I misinterpreting something?
Thanks
I think they've got the right number (0.00113), but badly explained in their paper.
The calculation you want to be doing is:
pbinom(3, 800, 1-(1-1.85e-5)^30, lower=FALSE)
I.e. what's the probability of seeing less than three modifications in 800 independent bases, given 30 amplifications that each have a 1.85e-5 chance of going wrong. I.e. you're calculating the probability it doesn't stay correct 30 times.
Somewhat statsy, may be worth a move…
Thinking about this more, you will start to see floating-point inaccuracies when working with very small probabilities here. I.e. a 1-x where x is a small number will start to go wrong when the absolute value of x is less than about 1e-10. Working with log-probabilities is a good idea at this point, specifically the log1p function is a great help. Using:
pbinom(3, 800, 1-exp(log1p(-1.85e-5)*30), lower=FALSE)
will continue to work even when the error incorporation rate is very low.

R, cointegration, multivariate, co.ja(), johansen

I am new to R and cointegration so please have patience with me as I try to explain what it is that I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the west power system in Canada/US. THe frequency is hourly (common in power) and cointegrated combinations can be as few as N variables and a maximum of M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see ca.jo tries to find linear combinations of the 3 variables but by forcing the coefficient on the first variable (in this case V1) to be 1 (i.e. the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as a dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer than the number of variables in the y(t) vector. So if there were 5 variables and 3 of them are cointegrated (i.e. V1 ~ V2 + V3) then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words the coefficient of the variable that are NOT cointegrated should be zero and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to try to take a stab at this one.. EDIT: I noticed that I just answered to a 4 year old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go in great detail about the whole procedure but will try to give some general insight. The first thing that the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (This is why you need the lag length for the VAR as input to the procedure as well). The procedure will then investigate the non-lagged component matrix of the VECM by looking at its rank: If the variables are not cointegrated then the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the johansen VECM equations is to notice the comparibility with the ADF procedure for each distinct row of the model.
Furthermore, The rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which
is equal to its corresponding eigenvector. Hence, An eigenvalue significantly different
from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: The max statistic or the trace statistic. The trace test tests the null hypothesis of less than or equal to r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, The maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
# We fit data to a VAR to obtain the optimal VAR length. Use SC information criterion to find optimal model.
varest <- VAR(yourData,p=1,type="const",lag.max=24, ic="SC")
# obtain lag length of VAR that best fits the data
lagLength <- max(2,varest$p)
# Perform Johansen procedure for cointegration
# Allow intercepts in the cointegrating vector: data without zero mean
# Use trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData,type="trace",ecdet="const",K=lagLength,spec="longrun")
testStatistics <- res#teststat
criticalValues <- res#criticalValues
# chi^2. If testStatic for r<= 0 is greater than the corresponding criticalValue, then r<=0 is rejected and we have at least one cointegrating vector
# We use 90% confidence level to make our decision
if(testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1],1])
{
# Return eigenvector that has maximum eigenvalue. Note: we throw away the constant!!
return(res#V[1:ncol(yourData),which.max(res#lambda)])
}
This piece of code checks if there is at least one cointegrating vector (r<=0) and then returns the vector with the highest cointegrating properties or in other words, the vector with the highest eigenvalue (lamda).
Regarding your question: the procedure does not "force" anything. It checks all combinations, that is why you have your 3 different vectors. It is my understanding that the method just scales/normalizes the vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test test significance of your components. Johansen allows a researcher to test a hypothesis about one or more
coefficients in the cointegrating relationship by viewing the hypothesis as
a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations or linear transformations of them, or combinations of the cointegrating vectors, will be stationary. However, I'm not aware on how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one so I'm sharing with you hoping it's the right solution.
By using the johansen test you test for the ranks (number of cointegration vectors), and it also returns the eigenvectors, and the alphas and betas do build said vectors.
In theory if you reject r=0 and accept r=1 (value of r=0 > critical value and r=1 < critical value) you would search for the highest eigenvalue and from that build your vector. On this case, if the highest eigenvalue was the first, it would be V1*1+V2*(-0.26)+V3*(-0.64).
This would generate the cointegration residuals for these variables.
Again, I'm not 100%, but preety sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a cajo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.

Resources