Perform qnorm() for p-values of 0 and 1 - r

I'm doing a meta-regression analysis of Granger non-causality tests for my Master's thesis. The effects of interest are F- and chi-square distributed, so to use them in a meta-regression they must be converted to normal variates. Right now I'm using the probit function (the inverse of the standard normal cumulative distribution) for this, which, as far as I know, is basically just qnorm() applied to the p-values.
My problem now is that the underlying studies sometimes report p-values of exactly 0 or 1. Transforming them with qnorm() gives me Inf and -Inf values.
My solution so far is to replace p-values of 0 with values near 0, for example 1e-180,
and p-values of 1 with values near 1, for example 0.9999999999999999 (only 16 nines are possible, because R rounds anything with more nines to 1).
Does anybody know a better solution to this problem? Is this mathematically reasonable? Excluding the 0 and 1 p-values would change the results completely and is therefore, in my honest opinion, wrong.
My code sample right now:
df$p_val[df$p_val == 0] <- 1e-180
df$p_val[df$p_val == 1] <- 0.9999999999999999
df$probit <- -qnorm(df$p_val)
The minus in front of the qnorm helps intuition, so that positive values are associated with rejecting the null hypothesis of non-causality at higher levels of significance.
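For reference, the same clamping idea can also be written with R's machine constants instead of hand-picked values (just a sketch of the workaround above, not a claim that these bounds are the "right" ones):

# Sketch: clamp p-values to the representable open interval (0, 1)
eps_lo <- .Machine$double.xmin          # smallest positive double (~2.2e-308)
eps_hi <- 1 - .Machine$double.neg.eps   # largest representable value below 1
df$p_val  <- pmin(pmax(df$p_val, eps_lo), eps_hi)
df$probit <- -qnorm(df$p_val)           # finite for every row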
I would be really glad for support / hints / etc.!

Related

Fisher's Exact Test

In this post https://stats.stackexchange.com/questions/94909/course-of-action-for-2x2-tables-with-0s-in-cell-and-low-cell-counts, the OP said that they got a p-value of 0.5152 when conducting a Fisher's exact test on the following data:
Control Cases
A 8 0
B 14 0
But I am getting a p-value of 1 and an odds ratio of 0 for these data. My R code is:
a <- matrix(c(8,14,0,0),2,2)
(res <- fisher.test(a))
Where am I making a mistake?
Good afternoon :)
https://en.wikipedia.org/wiki/Fisher%27s_exact_test
Haven't used these in a while, but I'm assuming it's your column of two 0's:
p = choose(14, 14) * choose(8, 8)/ choose(22, 22)
which is 1.0. For odds ratio, read here: https://en.wikipedia.org/wiki/Odds_ratio
The 0's end up in either the numerators or the denominators. I think this makes sense, as a column of 0's effectively means you have a group with no observations in it.
You get the strange p-value = 1 and OR = 0 because one or more of your counts is 0. The table should not be analysed with the chi-square equation, which through multiplication yields chi values of 0 for those respective cells:
Chi-square equation, cell by cell: chi^2 = sum over all cells of (observed - expected)^2 / expected.
Instead, you should use Fisher's exact test (fisher.test()), which to some extent can correct for the very low cell counts (normally you should use Fisher's whenever at least 20% of cells have a count < 5). Source: https://www.ncbi.nlm.nih.gov/pubmed/23894860. Using the chi-square analysis would require you to apply Yates' correction (e.g. chisq.test(matrix, correct = TRUE)).
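For reference, a quick sketch reproducing both tests on the table from the question:

a <- matrix(c(8, 14, 0, 0), nrow = 2)
fisher.test(a)                 # p-value = 1, odds ratio estimate 0, as observed
chisq.test(a, correct = TRUE)  # the zero column gives expected counts of 0,
                               # so the statistic is undefined (NaN) and R warns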

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk normality test. My data is in CSV format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in `[.data.frame`(x, complete.cases(x)) :
  undefined columns selected
Why isn't R selecting the right column, and how do I do that?
What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects a numeric vector as input, corresponding to the sample you would like to test, and it is the only required input. Since you have a data.frame, you'll have to pass the desired column as input to the function, as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, shapiro.test tests the NULL hypothesis that the samples came from a normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a normal distribution. As Ian Fellows nicely put it, you are testing against the assumption of normality. In other words (correct me if I am wrong), it would be much better if one could test the NULL hypothesis that the samples do not come from a normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.
In the case of the null hypothesis of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from a normal distribution. Put loosely, there is only a rare chance that the samples came from a normal distribution. The side effect of this form of hypothesis testing is that such a rejection rarely happens. To illustrate, take this example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data: runif(50, min = 2, max = 4)
# W = 0.9601, p-value = 0.08995
So, this (particular) sample runif(50, min=2, max=4) comes from a normal distribution according to this test, even though it was drawn from a uniform distribution. What I am trying to say is that there are many, many cases under which the "extreme" requirement (p < 0.05) is not satisfied, which leads to acceptance of the NULL hypothesis most of the time, which might be misleading.
Another issue I'd like to quote here, from @PaulHiemstra's comments, about the effect of large sample sizes:
An additional issue with the Shapiro-Wilk test is that when you feed it more data, the chances of the null hypothesis being rejected become larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit guards against this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
If the NULL hypothesis were the opposite, meaning that the samples do not come from a normal distribution, and you got a p-value < 0.05, then you would conclude that it is very rare that these samples do not come from a normal distribution (and reject that NULL hypothesis). That loosely translates to: it is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting it). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've got something wrong!
@PaulHiemstra also comments about practical situations (regression, for example) in which one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm, this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that coughs up a few numbers (hey, p < 0.05!) but requires a lot of experience and skill in judging how to analyse your data correctly.
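A minimal sketch of that workflow (the data here is simulated and purely illustrative):

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, Q-Q, scale-location, residuals vs leverage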
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale-location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A Cook's distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
Hope this clears things up a bit.
You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)
You failed to specify the exact column (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)
Set the data as a vector and then pass it to the function.

How is NaN handled in Pearson correlation user-user similarity matrix in a recommender system?

I am generating a user-user similarity matrix from user-rating data (specifically the MovieLens 100K data). Computing the correlation leads to some NaN values. I have tested this on a smaller dataset:
User-Item rating matrix
I1 I2 I3 I4
U1 4 0 5 5
U2 4 2 1 0
U3 3 0 2 4
U4 4 4 0 0
User-User Pearson Correlation similarity matrix
U1 U2 U3 U4 U5
U1 1 -1 0 -nan 0.755929
U2 -1 1 1 -nan -0.327327
U3 0 1 1 -nan 0.654654
U4 -nan -nan -nan -nan -nan
U5 0.755929 -0.327327 0.654654 -nan 1
For computing the Pearson correlation, only co-rated items are considered between two users. (See "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", Gediminas Adomavicius and Alexander Tuzhilin.)
How can I handle the NaN values?
EDIT
Here is the code with which I compute the Pearson correlation in R. R is the user-item rating matrix, containing ratings on a 1-5 scale, where 0 means not rated. S is the user-user correlation matrix.
for (i in 1:nrow(R)) {
  cat("user: ", i, "\n")
  for (k in 1:nrow(R)) {
    if (i != k) {
      # items rated (non-zero) by both user i and user k
      corated_list <- which(R[i, ] != 0 & R[k, ] != 0)
      # centre each user's co-rated ratings on that user's own mean
      ui <- R[i, corated_list] - mean(R[i, corated_list])
      uk <- R[k, corated_list] - mean(R[k, corated_list])
      temp <- sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2))
      S[i, k] <- ifelse(is.nan(temp), 0, temp)
    } else {
      S[i, k] <- 0
    }
  }
}
Note that in the S[i,k] <- ifelse (is.nan (temp), 0, temp) line i am replacing the NaNs with 0.
I recently developed a recommender system in Java for user-user and user-item matrices. Firstly, as you have probably already found, recommender systems are difficult. For my implementation I utilised the Apache Commons Math library, which is fantastic; you are using R, which is probably relatively similar in how it calculates Pearson's.
Your question was: how can I handle NaN values? It was followed by an edit saying that you are replacing NaN with 0.
My answer is this:
You shouldn't really handle NaN values as 0, because doing so says that there is absolutely no correlation between those users or users/items. This might be the case, but it is likely not always the case. Ignoring this will skew your recommendations.
Firstly, you should be asking yourself: "Why am I getting NaN values?" Here are some reasons from the Wikipedia page on NaN detailing why you might get a NaN value:
There are three kinds of operations that can return NaN:
Operations with a NaN as at least one operand.
Indeterminate forms
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
The standard has alternative functions for powers:
The standard pow function and the integer exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
Real operations with complex results, for example:
The square root of a negative number.
The logarithm of a negative number
The inverse sine or cosine of a number that is less than −1 or greater than +1.
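A few of these are easy to reproduce in R (each of the following expressions returns NaN):

0 / 0
Inf - Inf
0 * Inf
sqrt(-1)   # warning: NaNs produced
log(-1)    # warning: NaNs produced
asin(2)    # inverse sine outside [-1, 1]; warning: NaNs produced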
You should debug your application and step through each step to see which of the above reasons is the offending cause.
Secondly, understand that Pearson's correlation can be calculated in a number of different ways, and you need to consider whether you are calculating it across a sample or a population and then find the appropriate method of calculating it, e.g. for a sample:
cor(X, Y) = Σ[(x_i - E(X))(y_i - E(Y))] / [(n - 1) s(X) s(Y)]
where E(X) is the mean of the X values,
E(Y) is the mean of the Y values,
s(X), s(Y) are the standard deviations,
the standard deviation is the positive square root of the variance,
variance = sum((x_i - mean)^2) / (n - 1),
where mean is the sample mean and
n is the number of sample observations.
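As a quick sanity check, this formula matches R's built-in cor() on a pair of toy vectors (a sketch; x and y are just made-up ratings):

x <- c(4, 2, 1, 5)
y <- c(5, 1, 2, 4)
manual <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
all.equal(manual, cor(x, y))   # TRUE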
This is probably where your NaNs are appearing, i.e. dividing by 0 for items that are not rated. If you can, I would suggest not using the value 0 to mean "not rated"; use null instead. I would do this for 2 reasons:
1. The 0 is probably what is cocking up your results with NaNs, and
2. Readability / understandability. Your scale is 1-5, so 0 should not feature; it confuses things. So avoid it if possible.
Thirdly, from a recommender standpoint, think about things from a recommendation point of view. If you have 2 users and they only have 1 rating in common, say U1 and U4 for I1 in your smaller dataset, is that 1 item in common really enough to offer recommendations on? The answer is, of course, no. So can I also suggest you set a minimum threshold of ratings in common, to ensure that the quality of the recommendations is better. The minimum you can set for this threshold is 2, but consider setting it a bit higher. If you read the MovieLens research, they set it somewhere between 5 and 10 (I can't remember off the top of my head). The higher you set this, the less coverage you will get, but you will achieve "better" (lower error score) recommendations. You've probably done your reading of the academic literature, so you will probably have picked up on this point, but I thought I would mention it anyway.
On the above point: look at U4 and compare them with every other user. Notice how U4 does not have more than 1 item in common with any other user. Now, hopefully, you will notice that the NaNs appear exclusively with U4. If you have followed this answer, you will hopefully now see that the reason you are getting NaNs is that you cannot actually compute Pearson's with just 1 item in common :).
Finally, one thing that slightly bothers me about the sample dataset above is the number of correlations that are 1's and -1's. Think about what that is actually saying about these users' preferences, then sense-check it against the actual ratings. E.g. look at the U1 and U2 ratings: for Item 1 they agree exactly (both rated it a 4), yet for Item 3 they disagree strongly (U1 rated it 5, U2 rated it 1), so it seems strange that the Pearson correlation between these two users is -1 (i.e. that their preferences are completely opposite). This is clearly not the case; really the Pearson score should be a bit above or a bit below 0. This issue links back to the points about using 0 on the scale and about comparing only a small number of items.
Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them; you will need to read up on that, but essentially they amount to things like using the average score for that item or the average rating for that user. Both methods have their downsides, and personally I don't really like either of them. My advice is to only calculate Pearson correlations between users when they have 5 or more items in common, and to ignore items with ratings of 0 (or, better, null).
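Here is a minimal sketch of that advice, adapted from the loop in your question (same R and S as above, 0 = not rated; min_overlap is an assumed threshold you would tune to your data):

min_overlap <- 5
S <- matrix(NA_real_, nrow = nrow(R), ncol = nrow(R))
for (i in 1:nrow(R)) {
  for (k in 1:nrow(R)) {
    if (i == k) next
    corated <- which(R[i, ] != 0 & R[k, ] != 0)
    if (length(corated) < min_overlap) next        # too few co-rated items: leave NA
    # cor() returns NA (with a warning) if one user's co-rated ratings are constant
    S[i, k] <- cor(R[i, corated], R[k, corated])   # Pearson by default
  }
}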
So to conclude.
NaN does not equal 0, so do not set it to 0.
0's in your scale are better represented as null.
You should only calculate Pearson correlations when the number of items in common between two users is > 1, preferably greater than 5-10.
Only calculate the Pearson correlation for two users over their commonly rated items; do not include items in the score that have not been rated by the other user.
Hope that helps and good luck.

R fast AUC function for non-binary dependent variable

I'm trying to calculate the AUC for a largish data set and am having trouble finding a function that both handles values that aren't just 0's or 1's and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0's and 1's, and the pROC package will give me an answer but could take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1, but they are not necessarily exactly 1 or 0.
EDIT: both the answers and predictions fall between 0 - 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR; have a look at its example data set ROCR.simple.
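For instance, a quick check along those lines (a sketch using ROCR's prediction() and performance(); ROCR.simple ships with the package):

library(ROCR)
data(ROCR.simple)   # continuous predictions in (0, 1), binary labels
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
performance(pred, "auc")@y.values[[1]]   # AUC from continuous scores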
If your reference is in [0, 1], you could have a look at my package softclassval (disclaimer: I am the author). You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you come up with an optimized algorithm (as the ROCR developers did), it will probably take a long time, too. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as these are ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that grouping the reference (or actual) values together loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values really came from 10 underlying tests, i.e. actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information of whether the same 5 out of the 10 were meant or not. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30 % and 70 % correct recognition!
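A quick sketch to verify that range, with the margins fixed at 5/10 actual and 8/10 predicted positives:

# TP + FP = 8 and FP + TN = 5, so TN = TP - 3 and TP can only be 3, 4 or 5
tp <- 3:5
tn <- tp - 3
(tp + tn) / 10   # accuracy: 0.3 0.5 0.7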
Here's an illustration where the sensitivity is discussed (i.e. correct recognition, e.g. of click-throughs); it is part of a poster. You can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of the mean absolute error, mean squared error, root mean squared error, etc. can be used as well.
However, all these different ways of expressing the same performance characteristic of the model (e.g. sensitivity = % of correctly recognized actual click-through events) have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently to ambiguous references / partial reference class memberships.
Note also, as you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance, e.g. the mean absolute error or the root mean squared error?
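For example (a minimal sketch; actual and predicted are assumed to be numeric vectors of equal length, both in [0, 1]):

mae  <- mean(abs(actual - predicted))
rmse <- sqrt(mean((actual - predicted)^2))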
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html

R, cointegration, multivariate, ca.jo(), Johansen

I am new to R and cointegration, so please have patience with me as I try to explain what I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the western power system in Canada/US. The frequency is hourly (common in power systems), and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see, ca.jo tries to find linear combinations of the 3 variables, but it forces the coefficient on the first variable (in this case V1) to be 1 (i.e. the dependent variable). My understanding was that ca.jo would try to find all combinations such that every variable is selected as the dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer variables than are in the y(t) vector. So if there were 5 variables and 3 of them were cointegrated (i.e. V1 ~ V2 + V3), then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I noticed that I just answered a 4-year-old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure, but I will try to give some general insight. The first thing the Johansen procedure does is create a VECM from the VAR model that best corresponds to the data (this is why you need the lag length of the VAR as input to the procedure). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, then the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the comparability with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of less than or equal to r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # for VAR()
library(urca)   # for ca.jo()

# Fit a VAR to obtain the optimal lag length. Use the SC information
# criterion to find the optimal model.
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")

# Lag length of the VAR that best fits the data (ca.jo needs K >= 2)
lagLength <- max(2, varest$p)

# Perform the Johansen procedure for cointegration:
# - allow intercepts in the cointegrating vector (data without zero mean)
# - use the trace statistic (null hypothesis: number of cointegrating vectors <= r)
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval

# If the test statistic for r <= 0 is greater than the corresponding critical
# value, then r <= 0 is rejected and we have at least one cointegrating vector.
# We use the 90% confidence level to make our decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1])
{
  # Return the eigenvector that has the maximum eigenvalue.
  # Note: we throw away the constant!
  return(res@V[1:ncol(yourData), which.max(res@lambda)])
}
This piece of code checks whether there is at least one cointegrating vector (r <= 0 rejected) and then returns the vector with the strongest cointegrating properties, or in other words, the vector with the highest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you have your 3 different vectors. It is my understanding that the method just scales/normalizes each vector to the first variable.
Regarding your other question: The procedure will calculate the vectors for which the residual has the strongest mean reverting / stationarity properties. If one or more of your variables does not contribute further to these properties then the component for this variable in the vector will indeed be 0. However, if the component value is not 0 then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, or linear transformations of them, or combinations of the cointegrating vectors, will be stationary. However, I'm not aware of how to perform these extra checks in R.
Probably, the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option to not add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. It will depend on your requirements whether or not this is the behaviour you want.
I've been searching for an answer to this and I think I found one so I'm sharing with you hoping it's the right solution.
By using the Johansen test you test for the rank (the number of cointegration vectors), and it also returns the eigenvectors and the alphas and betas that build said vectors.
In theory, if you reject r=0 and accept r=1 (the test statistic for r=0 is greater than its critical value and the one for r=1 is smaller than its critical value), you would search for the highest eigenvalue and build your vector from the corresponding eigenvector. In this case, if the highest eigenvalue were the first one, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables.
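A purely illustrative sketch of that step, assuming yourData contains columns V1, V2 and V3 and using the eigenvector printed above (beta normalised on V1):

beta <- c(1, -0.2597057, -0.6443270)
ect  <- as.matrix(yourData[, c("V1", "V2", "V3")]) %*% beta
plot(ect, type = "l")   # should look stationary if the cointegrating relation holds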
Again, I'm not 100% sure, but I'm pretty sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.
