I'm trying to calculate the AUC for a largish data set and having trouble finding an implementation that both handles values that aren't just 0s or 1s and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0s and 1s, and the pROC package will give me an answer but can take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1, but are not necessarily exactly 1 or 0.
EDIT: both the answers and the predictions fall between 0 and 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR, have a look at its example data set ROCR.simple.
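For instance, a minimal sketch of the Ex.2-style situation (binary labels, continuous predictions) with ROCR:

library(ROCR)
data(ROCR.simple)   # predictions in [0, 1], labels in {0, 1}
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
performance(pred, "auc")@y.values[[1]]   # the AUC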
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you think of an optimized algorithm (as the ROCR developers did), it'll probably take a long time, too. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that grouping the reference (actual) values together loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values really stand for actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information of whether the 5 actually positive cases are among the 8 predicted positive ones. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30 % and 70 % correct recognition: the number of true positives can be anywhere from 3 to 5, so the accuracy (TP + TN) / 10 ranges from 3/10 to 7/10.
There is an illustration discussing this for the sensitivity (i.e. correct recognition, e.g. of click-through events) in the softclassval poster; you can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of mean absolute, mean squared, root mean squared etc. errors can be used as well.
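A minimal sketch of such weighted error measures (the weight vector w is an arbitrary choice here, not something prescribed by softclassval):

# weighted mean absolute error and weighted root mean squared error for
# soft (in [0, 1]) actual and prediction values
wmae <- function(actual, prediction, w = rep(1, length(actual))) {
  sum(w * abs(actual - prediction)) / sum(w)
}
wrmse <- function(actual, prediction, w = rep(1, length(actual))) {
  sqrt(sum(w * (actual - prediction)^2) / sum(w))
}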
However, all those different ways to express the same performance characteristic of the model (e.g. sensitivity = % correct recognitions of actual click-through events) do have different meanings, and while they coincide with the usual calculation for unambiguous reference and prediction situations, they react differently to ambiguous references / partial reference class memberships.
Note also, as you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance (e.g. Mean Absolute Error, Root Mean Square Error)?
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
Related
I am applying an exploratory factor analysis to a dataset using the factanal() function in R. After applying the scree test I found that 2 factors should be retained from the 20 features.
Trying to find what this uniqueness represents, I found the following from here
"A high uniqueness for a variable usually means it doesn’t fit neatly into our factors. ..... If we subtract the uniquenesses from 1, we get a quantity called the communality. The communality is the proportion of variance of the ith variable contributed by the m common factors. ......In general, we’d like to see low uniquenesses or high communalities, depending on what your statistical program returns."
I understand that if the uniqueness value is high, the variable is not well captured by the common factors and would be better represented by its own specific factor. But what is a good threshold for this uniqueness measure? All of my features show a value greater than 0.3, and most of them range from 0.3 to 0.7. Does this mean that my factor analysis doesn't work well on my data? I have tried rotation; the results are not very different. What else should I try then?
You can partition an indicator variable's variance into its...
Uniqueness (u2 = 1 - h2): the variance that is not explained by the common factors
Communality (h2): the variance that is explained by the common factors
Which values can be considered "good" depends on your context. You should look for examples in your application domain to know what you can expect. In the package psych, you can find some examples from psychology:
library(psych)
# 2-factor maximum likelihood solution without rotation, on the Thurstone.33 data
m0 <- fa(Thurstone.33, 2, rotate = "none", fm = "mle")
m0           # the printout includes communalities (h2) and uniquenesses (u2)
m0$loadings  # unrotated factor loadings
When you run the code, you can see that the communalities are around 0.6. The absolute factor loadings of the unrotated solution vary between 0.27 and 0.85.
An absolute value of 0.4 is often used as an arbitrary cutoff for acceptable factor loadings in psychological domains.
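If it helps to see where those numbers come from, communalities and uniquenesses can be recomputed from the loadings of the orthogonal (unrotated) solution; a small sketch continuing with m0 from above:

# communality = row sum of squared loadings (orthogonal factors);
# uniqueness = 1 - communality
h2 <- rowSums(unclass(m0$loadings)^2)
u2 <- 1 - h2
round(cbind(h2, u2), 2)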
I'm doing a meta-regression analysis of Granger non-causality tests for my Master's thesis. The effects of interest are F- and chi-square distributed, so to use them in a meta-regression they must be converted to normal variates. Right now I'm using the probit function (the inverse of the standard normal cumulative distribution) for this, which is basically qnorm() applied to the p-values (as far as I know).
My problem is now, the underlying studies sometimes report p-values of 0 or 1. Transforming them with qnorm() gives me Inf and -Inf values.
My solution approach is to replace p-values of 0 with values near 0, for example 1e-180, and p-values of 1 with values near 1, for example 0.9999999999999999 (only 16 nines are possible, because R rounds anything with more nines to 1).
Does anybody know a better solution for this problem? Is this mathematically reasonable? Excluding the 0 and 1 p-values would change the results completely and is therefore, in my opinion, wrong.
My code sample right now:
df$p_val[df$p_val == 0] <- 1e-180               # replace exact zeros
df$p_val[df$p_val == 1] <- 0.9999999999999999   # replace exact ones
df$probit <- -qnorm(df$p_val)                   # probit transform, sign flipped
The minus in front of the qnorm helps intuition, so that positive values are associated with rejecting the null hypothesis of non-causality at higher levels of significance.
I would be really glad for support / hints / etc.!
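For what it's worth, a machine-precision-based variant of that clamping (my own suggestion, not taken from the underlying studies) would be:

# clamp reported p-values away from exactly 0 and 1 before the probit transform
eps <- .Machine$double.eps                      # about 2.2e-16
df$p_val  <- pmin(pmax(df$p_val, eps), 1 - eps)
df$probit <- -qnorm(df$p_val)                   # now finite (about +/- 8.1 at the bounds)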
I am doing an lm() regression in R on stock quotations. I used exponential weights for the regression: the older the data, the less weight. My weights formula is alpha^(seq(685, 1, by = -1)) (the data length is 685), and to find alpha I tried every value between 0.9 and 1.1 with a step of 0.0001 and chose the alpha that minimizes the difference between the predicted values and the real values. This alpha is equal to 0.9992, so I would like to know if it is statistically different from 1.
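For concreteness, the weighting described above might be set up like this (y, x and stock_data are placeholders for my actual series of 685 observations):

# exponentially decaying weights: the oldest observation gets alpha^685,
# the most recent one gets alpha^1
n <- 685
alpha <- 0.9992
w <- alpha^seq(n, 1, by = -1)
fit <- lm(y ~ x, data = stock_data, weights = w)   # placeholders for the real model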
In other words, I would like to know if the weights are different from 1. Is it possible to achieve that, and if so, how could I do it?
I don't really know whether this question should be asked on stats.stackexchange, but it involves R, so I hope it is not misplaced.
I have to explain to my class how to do basic arithmetic coding on a small message. I've been investigating lots of documents and reading a lot and I can say I theoretically understand how this method works, but still have some questions.
I'm stepping through these examples (first example, second page): we have the message 'eaii!', and we want to code it using the arithmetic method.
In the example, it sets
Symbol Probability Range
a .2 [0 , 0.2)
e .3 [0.2 , 0.5)
i .1 [0.5 , 0.6)
o .2 [0.6 , 0.8)
u .1 [0.8 , 0.9)
! .1 [0.9 , 1.0)
My first question is: how did it set the probabilities? My logic tells me that if I have two 'i' symbols, then that symbol should have the highest probability, shouldn't it?
Also, how did it determine which range to start from and the ranges that follow?
Another example was coding the message 'abc', which was set like this:
Symbol Probability Range
a .7 [0 , 0.7)
b .1 [0.7 , 0.8)
c .2 [0.8 , 1.0)
I also don't understand why the first symbol has a substantially greater probability than the others, and even if it were a matter of order of appearance, I don't understand how it was set to 0.7, i.e. why not 0.8 or 0.5.
I hope I made myself clear and I'd appreciate any kind of help.
They are imagining a fixed model for the data that was established long before that specific message is to be encoded. The model was in principle constructed from a large ensemble of such messages, so there is no reason to believe that eaii! by itself should match the probabilities in the model. Of course, the model is just for illustration purposes, and no more real than the eaii! message. (Though I think I said exactly that the other day when I was pulling something out of the oven.)
The order of the symbols in the model is arbitrary. It just needs to be the same model on both ends. It is of course important that the probabilities add up to one.
The second model is simply another arbitrary model to illustrate how a symbol can be coded in less than a bit, when it has a probability greater than 1/2. For that model, each a in a series of a's would take a little over half a bit.
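To make the mechanics concrete, here is a small sketch (my own illustration, not taken from the linked example) of how the fixed model above narrows the interval for 'eaii!'; any number inside the final interval encodes the whole message:

# interval narrowing for "eaii!" under the fixed model from the question
model <- list(a = c(0.0, 0.2), e = c(0.2, 0.5), i = c(0.5, 0.6),
              o = c(0.6, 0.8), u = c(0.8, 0.9), "!" = c(0.9, 1.0))
low <- 0; high <- 1
for (s in c("e", "a", "i", "i", "!")) {
  width <- high - low
  high  <- low + width * model[[s]][2]   # new upper bound (uses the old low)
  low   <- low + width * model[[s]][1]   # new lower bound
  cat(sprintf("%s -> [%.5f, %.5f)\n", s, low, high))
}

The message ends up in roughly [0.23354, 0.2336). And for the second model, a symbol with probability 0.7 costs about -log2(0.7) ≈ 0.51 bits, which is the "less than a bit" point above.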
I think the probability is for the sample.
To determine the probabilities for a specific sample, just count each character and divide, for each one, its number of occurrences by the total number of characters (the probabilities then sum to 1).
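For example, counting the symbols of 'eaii!' itself (a sketch of the per-sample counting just described):

msg <- strsplit("eaii!", "")[[1]]
probs <- table(msg) / length(msg)   # empirical probabilities, they sum to 1
probs                               # "!", "a", "e" get 0.2 each; "i" gets 0.4
cumsum(probs)                       # cumulative sums give the range boundaries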
Note that the arithmetic algorithm is efficient when there is a huge repetition of the same character.
For example:
aaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaabaaaaa will compress very well
abfpababbahnajdapkalamkmdamlkapaaapokpokpdq will not compress very well (try Huffman instead)
I have DNA amplicons with base mismatches which can arise during the PCR amplification process. My interest is, what is the probability that a sequence contains errors, given the error rate per base, number of mismatches and the number of bases in the amplicon.
I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6]
that proposes this formula to calculate the probability mass function in such cases.
I implemented the formula with R as shown here
# P(at least k errors in N bases), with per-base error probability eps:
# 1 - sum_{j=0}^{k-1} choose(N, j) * eps^j * (1 - eps)^(N - j)
pcr.prob <- function(k, N, eps) {
  v <- numeric(k)
  for (i in 1:k) {
    v[i] <- choose(N, k - i) * eps^(k - i) * (1 - eps)^(N - (k - i))
  }
  1 - sum(v)
}
From the article: suppose we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85e-5 misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequence was generated by three independent PCR errors equals P = 0.0011.
However when I use my implementation of the formula I get a different value.
pcr.prob(3,800,0.0000185)
[1] 5.323567e-07
What could I be doing wrong in my implementation? Am I misinterpreting something?
Thanks
I think they've got the right number (0.00113), but it's badly explained in their paper.
The calculation you want to be doing is:
pbinom(3, 800, 1 - (1 - 1.85e-5)^30, lower.tail = FALSE)
I.e., what's the probability of seeing more than three modified bases among 800 independent bases, where each base has a 1 - (1 - 1.85e-5)^30 chance of being miscopied, i.e. of not staying correct through all 30 amplification cycles.
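Spelling out the intermediate quantity (these numbers just restate the calculation above):

# probability that a given base is miscopied at least once across the 30 cycles
p_base <- 1 - (1 - 1.85e-5)^30               # about 5.5e-4
# probability of more than three such bases among the 800
pbinom(3, 800, p_base, lower.tail = FALSE)   # about 0.0011, the paper's value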
Somewhat statsy, may be worth a move…
Thinking about this more, you will start to see floating-point inaccuracies when working with very small probabilities here: 1 - x, where x is a small number, starts to go wrong when the absolute value of x is less than about 1e-10. Working with log-probabilities is a good idea at this point; specifically, the log1p function is a great help. Using:
pbinom(3, 800, 1 - exp(log1p(-1.85e-5) * 30), lower.tail = FALSE)
will continue to work even when the error incorporation rate is very low.
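If even the outer 1 - exp(...) becomes a concern at still smaller rates, base R's expm1() (the counterpart of log1p) can be used as a further refinement of the same idea (my suggestion, not part of the original answer):

# -expm1(n * log1p(-x)) evaluates 1 - (1 - x)^n without ever forming a number
# close to 1, so it keeps its precision even for extremely small x
pbinom(3, 800, -expm1(30 * log1p(-1.85e-5)), lower.tail = FALSE)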