I'm computing Spearman's rho on small sets of paired rankings.
Spearman's rho is well known for not handling ties properly. For example, taking two sets of 8 rankings, even if 6 of the values in one of the two sets are tied, the correlation is still very high:
> cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman")
Spearman's rank correlation rho
S = 19.8439, p-value = 0.0274
sample estimates:
rho
0.7637626
Warning message:
Cannot compute exact p-values with ties
The p-value <.05 seems like a pretty high statistical significance for this data.
Is there a ties-corrected version of Spearman in R?
What is the best formula to date to compute it with a lot of ties?
Well, Kendall tau rank correlation is also a non-parametric test for statistical dependence between two ordinal (or rank-transformed) variables--like Spearman's, but unlike Spearman's, it can handle ties.
More specifically, there are three Kendall tau statistics--tau-a, tau-b, and tau-c. tau-b is specifically adapted to handle ties.
The tau-b statistic handles ties (i.e., both members of the pair have the same ordinal value) by a divisor term, which represents the geometric mean between the number of pairs not tied on x and the number not tied on y.
Kendall's tau is not Spearman's--they are not the same, but they are also quite similar. You'll have to decide, based on context, whether the two are similar enough such that one can be substituted for the other.
For instance, tau-b:
Kendall_tau_b = (P - Q) / sqrt( (P + Q + X0) * (P + Q + Y0) )
P: number of concordant pairs ('concordant' means the ranks of each member of the pair of data points agree)
Q: number of discordant pairs
X0: number of pairs tied only on x (so P + Q + X0 is the number of pairs not tied on y)
Y0: number of pairs tied only on y (so P + Q + Y0 is the number of pairs not tied on x)
There is in fact a variant of Spearman's rho that explicitly accounts for ties. In situations in which I needed a non-parametric rank correlation statistic, I have always chosen tau over rho. The reason is that rho sums the squared errors, whereas tau sums the absolute discrepancies. Given that both tau and rho are competent statistics and we are left to choose, a linear penalty on discrepancies (tau) has always seemed to me a more natural way to express rank correlation. That's not a recommendation; your context might be quite different and dictate otherwise.
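For a quick comparison on the data from the question, here is a minimal sketch (as far as I'm aware, R's cor() and cor.test() with method="kendall" compute the tau-b variant, so the tie correction described above is applied automatically):
## a sketch on the vectors from the question; method = "kendall" gives tau-b
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 0, 0, 7, 8)
cor(x, y, method = "kendall")                      # tau-b estimate
cor.test(x, y, method = "kendall", exact = FALSE)  # normal approximation for the p-value, since y has ties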
I think exact=FALSE does the trick.
cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman", exact=FALSE)
Spearman's rank correlation rho
data: c(1, 2, 3, 4, 5, 6, 7, 8) and c(0, 0, 0, 0, 0, 0, 7, 8)
S = 19.8439, p-value = 0.0274
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7637626
cor.test with method="spearman" actually calculates the Spearman coefficient corrected for ties.
I've checked it by "manually" calculating tie-corrected and tie-uncorrected Spearman coefficients from equations in Zar 1984, Biostatistical Analysis. Here's the code - just substitute your own variable names to check for yourself:
ym <- data.frame(lousy, dors) ## my data
## ranking variables
ym$l <- rank(ym$lousy)
ym$d <- rank(ym$dors)
## calculating squared differences between ranks
ym$d2d <- (ym$l-ym$d)^2
## calculating variables for equations 19.35 and 19.37 in Zar 1984
lice <- as.data.frame(table(ym$lousy))
lice$t <- lice$Freq^3-lice$Freq
dorsal <- as.data.frame(table(ym$dors))
dorsal$t <- dorsal$Freq^3-dorsal$Freq
n <- nrow(ym)
sum.d2 <- sum(ym$d2d)
Tx <- sum(lice$t)/12
Ty <- sum(dorsal$t)/12
## calculating the coefficients
rs1 <- 1 - (6*sum.d2/(n^3-n)) ## "standard" Spearman cor. coeff. (uncorrected for ties) - eq. 19.35
rs2 <- ((n^3-n)/6 - sum.d2 - Tx - Ty)/sqrt(((n^3-n)/6 - 2*Tx)*((n^3-n)/6 - 2*Ty)) ## Spearman cor.coeff. corrected for ties - eq.19.37
##comparing with cor.test function
cor.test(ym$lousy,ym$dors, method="spearman") ## cor.test gives tie-corrected coefficient!
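As a quick check on the vectors from the original question, here is a sketch applying the same Zar equations to that data:
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 0, 0, 7, 8)
rx <- rank(x); ry <- rank(y)
n <- length(x)
sum.d2 <- sum((rx - ry)^2)
Tx <- sum(table(x)^3 - table(x)) / 12   ## 0, no ties in x
Ty <- sum(table(y)^3 - table(y)) / 12   ## 17.5, six tied zeros
((n^3 - n)/6 - sum.d2 - Tx - Ty) / sqrt(((n^3 - n)/6 - 2*Tx) * ((n^3 - n)/6 - 2*Ty))
## 0.7637626, the same value that cor.test(x, y, method = "spearman") reports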
Ties-corrected Spearman
Using method="spearman" gives you the ties-corrected Spearman. Spearman's rho is, by definition, simply Pearson's sample correlation coefficient computed on the ranks of the sample data. So it works both in the presence and in the absence of ties.
You can see that after replacing your original data with their ranks (midranks for ties) and using method="pearson", you will get the same result:
> cor.test(rank(c(1,2,3,4,5,6,7,8)), rank(c(0,0,0,0,0,0,7,8)), method="pearson")
Pearson's product-moment correlation
data: rank(c(1, 2, 3, 4, 5, 6, 7, 8)) and rank(c(0, 0, 0, 0, 0, 0, 7, 8))
t = 2.8983, df = 6, p-value = 0.0274
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1279559 0.9546436
sample estimates:
cor
0.7637626
Note that there is a simplified no-ties formula for Spearman's rho, which the cor.test() implementation in fact uses in the absence of ties, but it is equivalent to the definition above.
P-value
In the case of ties in the data, exact p-values are computed neither for the Spearman nor for the Kendall measure (within the cor.test() implementation), hence the warning. As mentioned in Eduardo's post, to avoid the warning you should set exact=FALSE.
The paper "A new rank correlation coefficient with application to the consensus ranking problem" is aimed to solve the ranking with tie problem. It also mentions that Tau-b should not be used as a ranking correlation measure for measuring agreement between weak orderings.
Emond, E. J. and Mason, D. W. (2002), A new rank correlation coefficient with application to the consensus ranking problem. J. Multi‐Crit. Decis. Anal., 11: 17-28. doi:10.1002/mcda.313
I was having a similar problem, and by reading the answers here and the help file on R I saw that, when you have ties, you have to add the parameter exact = FALSE to the cor.test() function. By adding this, it does not try to calculate an exact p-value, but instead "the test statistic is the estimate scaled to zero mean and unit variance, and is approximately normally distributed".
The result, in my case, was exactly the same, but without the warning about ties.
cor.test(x, y, method = "spearman", exact = FALSE)
The R package ConsRank contains an implementation of Emond and Mason's Tau_X. This appears to be the (mathematically) best currently known method for handling ties.
See the docs, which give the usage as
Tau_X(X, Y=NULL)
where X can be a matrix.
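For example, here is a minimal usage sketch on the vectors from the first question, assuming the Tau_X() signature quoted above; I pass midranks so that the six tied zeros are encoded as ties:
library(ConsRank)
x <- rank(c(1, 2, 3, 4, 5, 6, 7, 8))
y <- rank(c(0, 0, 0, 0, 0, 0, 7, 8))  ## midranks encode the tied values
Tau_X(x, y)                           ## Emond and Mason's tau_x for the two rankings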
As pointed out by @wibeasley, Emond and Mason (2002) proposed Tau_X, a new rank correlation coefficient which appears to be superior to Kendall's Tau-b. NelsonGon was concerned that the paper is from 2002, predating the question by a few years, but seems to have overlooked that Spearman's correlation dates from 1904 and Kendall's Tau from 1938.
Related
I was reading this topic on Rbloggers about the use of the Wilcoxon rank sum test: https://www.r-bloggers.com/wilcoxon-mann-whitney-rank-sum-test-or-test-u/
Especially this part, here I quote:
"We can finally compare the intervals tabulated on the tables of Wilcoxon for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52)".
How can I get these "tabulated" values?
I understand they used a table where the values are reported following the size of each samples, but I was wondering if there was a way to get them in R.
It is important because, as I understand the post, once you have a p-value > 0.05 and so cannot reject the null hypothesis H0, you can actually confirm H0 by comparing the "computed" and "tabulated" intervals.
So what I would need is the tabulated intervals, using R.
tl;dr
You can get confidence intervals for a Mann-Whitney-Wilcoxon test by specifying conf.int=TRUE.
Don't believe everything you read on the internet ...
If by "confirm" you mean "make sure that the computation is true", you don't need to double-check by consulting the original tables; the p-value should be enough to decide whether you can reject H0 or not. You can trust R for standard, widely used statistical methods. (I also show below how to repeat the computation with a different implementation from the coin package, which is a nearly independent check.)
if by "confirm" you mean "accept the null hypothesis", please don't do this; this is a fundamental violation of frequentist statistical theory, which says that you can reject a null hypothesis, but that you can never accept the null. Wide confidence intervals and p-values greater than a given threshold are evidence that the conclusion is uncertain (we can't be sure whether the null or the alternative is true), not that the null is true. The concluding text of the blog post referred to ("we conclude by accepting the hypothesis H0 of equality of means") is statistically incorrect.
A better way to interpret the uncertainty is to look at the confidence intervals. You can compute these for the Wilcoxon test: from ?wilcox.test:
... (if argument ‘conf.int’ is true [and a two-sample test is being performed]), a nonparametric
confidence interval and an estimator for ... the difference of the location parameters
‘x-y’ is computed.
> a = c(6, 8, 2, 4, 4, 5)
> b = c(7, 10, 4, 3, 5, 6)
> wilcox.test(b,a, conf.int=TRUE, correct=FALSE)
data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.999975 4.000016
sample estimates:
difference in location
0.9999395
The high p-value (0.5174) says that we really can't tell whether the values in a or b have significantly different ranks. The difference in location gives us the estimated difference between the median ranks, and the confidence interval gives the confidence interval on this difference. In this case, for a sample size of 12, the estimated difference in ranks is 1 (group b has slightly higher ranks than group a), and the confidence interval is (-2, 4) (the data are consistent with group b having slightly lower or much higher ranks than group a). It is admittedly rather difficult to interpret the substantive meaning of these values - that's one of the disadvantages of rank-based nonparametric tests ...
You can assume that the p-value computed by wilcox.test() is a reasonable summary of the evidence against the null hypothesis; there's no need to look up ranges in the tables. If you're worried about wilcox.test() in base R, you can try wilcox_test() from the coin package:
library("coin")  ## provides wilcox_test()
dd <- data.frame(f = rep(c("a", "b"), each = 6), x = c(a, b))
wilcox_test(x ~ f, data = dd, conf.int = TRUE)  ## asymptotic test
which gives nearly identical results to wilcox.test(), and
wilcox_test(x ~ f, data = dd, conf.int = TRUE, distribution = "exact")
which gives a slightly different p-value, but essentially the same confidence intervals.
of historical interest only
As for the tables: I found them on Google Books by doing a Google Scholar search for author:katti author:wilcox. There you can read the description of how they were computed; this wouldn't be impossible to replicate, but it seems unnecessary since p-values and confidence intervals are available via other methods. Digging through, you find this:
The number 0.0206 in the red box indicates that the interval (26,52) corresponds to a one-tail p-value of 0.0206 (2-tailed = 0.0412); that's the closest you can get with a discrete range. The next closest range is given in the line below [(27,51), one-tailed p=0.0325, two-tailed=0.065]. In the 21st century you should never have to do this procedure.
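If you really do want to reproduce the tabulated value in R, the exact rank-sum distribution is available via the d/p/qwilcox functions, which work on the Mann-Whitney U scale (the rank sum minus its minimum m(m+1)/2, i.e. minus 21 for two groups of 6). A sketch:
## the tabulated rank-sum bound 26 corresponds to U = 26 - 21 = 5 on R's Mann-Whitney scale
pwilcox(5, 6, 6)      ## one-tailed probability, about the 0.0206 quoted above
2 * pwilcox(5, 6, 6)  ## two-tailed value, about 0.041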
I am trying to output a correlation matrix for various locations. The row names 'PC1', PC2' etc. represent principal components. Since the percentage variance explained (and thus the weights) of principal components decreases from PC1 to PC4, I need to run Pearson correlation such that it takes the weights of PC's into account.
In other words, row 1 is more important in determining the correlation among locations than row 2, and row 2 is more important than row 3, and so on...
A simple weight vector for the 4 rows can be as follows:
w <- c(1.00, 0.75, 0.50, 0.25)
I did go through this, but I am not fully clear on the solution, and unlike that question, I need to find the correlation within the columns of a SINGLE matrix while weighting its rows.
OK, this is very easy to do in R using cov.wt (available in the stats package):
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
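A self-contained sketch with made-up data (four PC rows, three hypothetical location columns), just to show the shape of the call:
set.seed(1)
DF <- data.frame(loc_A = rnorm(4), loc_B = rnorm(4), loc_C = rnorm(4))  ## rows = PC1..PC4
w <- c(1.00, 0.75, 0.50, 0.25)                 ## row weights, PC1 weighted most heavily
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
weighted_corr$cor                              ## weighted correlation matrix between the locations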
I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p-values for a number of subsets of a larger data set. I bumped into a subset where all of my numeric values are equal to 36. Because it is part of a loop, the ANOVA is still executed and reports a seemingly infinitely small p-value of 1.2855e-134. Correct me if I am wrong, but the smaller the p-value, the higher the probability that the difference between the factors is significant?
For simplicity, this is the subset:
[screenshot: SUBSET_FOR_ANOVA]
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]]
p_value <- p_value[1]
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
y = 36)
summary(aov(y~gr, data = df))
Gives:
Df Sum Sq Mean Sq F value Pr(>F)
gr 1 1.260e-27 1.262e-27 1 0.319
Residuals 198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value around 0.3 or so. The F statistic is (by definition) always 1, since the between and within group variances are equal.
Are these results misleading? To some extent, yes. The estimated SS and MS should be 0; aov calculates them as very, very small values. Some other statistical tests in R and in some packages check for zero variance and would produce an error, but aov apparently does not.
However, more importantly, I would say your data violate the assumptions of ANOVA, and therefore any result cannot be trusted as a basis for conclusions. The expectation in R when it comes to statistical tests is usually that it is up to the user to employ the tests in the correct circumstances.
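One pragmatic way to protect a loop against such degenerate subsets is to check for (near) zero variance before calling aov. A minimal sketch using the column names from the question (safe_anova_p is a hypothetical helper, not an existing function):
safe_anova_p <- function(df) {
  ## constant response: an ANOVA p-value is not meaningful, so return NA instead
  if (var(df$GOOD_PTS) < .Machine$double.eps) return(NA_real_)
  fit <- aov(GOOD_PTS ~ MACH, data = df)
  summary(fit)[[1]][["Pr(>F)"]][1]
}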
I am trying to find a structural break in the mean of a time series that is skewed, fat-tailed, and heteroskedastic. I apply the Andrews (1993) supF-test via the strucchange package. My understanding is that this is valid even with my nonspherical disturbances, but I would like to confirm this via bootstrapping. I would like to estimate the max t-stat from a difference-in-means test at each possible breakpoint (just like the Andrews F-stat) and then bootstrap the critical value. In other words, I want to find the max t-stat in the time-ordered data, then scramble the data and find the max t-stat in the scrambled data, 10,000 times, and then compare the max t-stat from the time-ordered data to a critical value given by the rank 9,500 max t-stat from the unordered data.
Below I generate example data and apply the Andrews supF-test. Is there any way to "correct" the Andrews test for nonspherical disturbances? Is there any way to do the bootstrap I am trying to do?
library(strucchange)
Thames <- ts(matrix(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)), ncol = 1), frequency = 12, start = c(1985, 1))
fs.thames <- Fstats(Thames ~ 1)
sctest(fs.thames)
I'm adding a second answer to analyze the simulated Thames data provided.
Regarding the points from my first general methodological answer:
(1) In this case, a log() transformation is clearly appropriate to deal with the extreme skewness of the observations.
(2) As the data are heteroscedastic, the inference should be based on HC or HAC covariances. Below I employ the Newey-West HAC estimator, although the data are just heteroscedastic but not autocorrelated. The HAC-corrected inference affects the supF test and the confidence intervals for the breakpoint estimates. The breakpoints themselves and the corresponding segment-specific intercepts are estimated by OLS, i.e., treating the heteroscedasticity as a nuisance term.
(3) I did not add any bootstrap or permutation inference as the asymptotic inference appears to be convincing enough in this case.
First, we simulate the data using a particular seed. (Note that other seeds may not lead to such clear-cut breakpoint estimates when analyzing the series in levels.)
library("strucchange")
set.seed(12)
Thames <- ts(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)),
frequency = 12, start = c(1985, 1))
Then we compute the sequence of HAC-corrected Wald/F statistics and estimate the optimal breakpoints (for m = 1, 2, 3, ... breaks) via OLS. To illustrate how much better this works for the series in logs rather than in levels, both versions are shown.
fs_lev <- Fstats(Thames ~ 1, vcov = NeweyWest)
fs_log <- Fstats(log(Thames) ~ 1, vcov = NeweyWest)
bp_lev <- breakpoints(Thames ~ 1)
bp_log <- breakpoints(log(Thames) ~ 1)
The visualization below shows the time series with the fitted intercepts in the first row, the sequence of Wald/F statistics with the 5% critical value of the supF test in the second row, and the residual sum of squares and BIC for the selection of the number of breakpoints in the last row. The code to replicate the graphic is at the end of this answer.
Both supF tests are clearly significant but in levels (sctest(fs_lev)) the test statistic is "only" 82.79 while in logs (sctest(fs_log)) it is 282.46. Also, the two peaks pertaining to the two breakpoints can be seen much better when analyzing the data in logs.
Similarly, the breakpoint estimates are somewhat better and the confidence intervals much narrower for the log-transformed data. In levels, we get:
confint(bp_lev, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_lev, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 NA 125 NA
## 2 202 242 263
plus an error message and warnings which all reflect that the asymptotic inference is not a useful approximation here. In contrast, the confidence intervals are quite reasonable for the analysis in logs. Due to the increased variance in the middle segment, its start and end are somewhat more uncertain than for the first and last segment:
confint(bp_log, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_log, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 107 119 121
## 2 238 240 250
##
## Corresponding to breakdates:
## 2.5 % breakpoints 97.5 %
## 1 1993(11) 1994(11) 1995(1)
## 2 2004(10) 2004(12) 2005(10)
Finally, the replication code for the figure above is included here. The confidence intervals for the breakpoints in levels cannot be added to the graphic due to the error mentioned above; hence, only the log-transformed series has its confidence intervals drawn.
par(mfrow = c(3, 2))
plot(Thames, main = "Thames")
lines(fitted(bp_lev, breaks = 2), col = 4, lwd = 2)
plot(log(Thames), main = "log(Thames)")
lines(fitted(bp_log, breaks = 2), col = 4, lwd = 2)
lines(confint(bp_log, breaks = 2, vcov = NeweyWest))
plot(fs_lev, main = "supF test")
plot(fs_log, main = "supF test")
plot(bp_lev)
plot(bp_log)
(1) Skewness and heavy tails. As usual in linear regression models, the asymptotic justification for the inference does not depend on normality and also holds for any other error distribution given zero expectation, homoscedasticity, and lack of correlation (the usual Gauss-Markov assumptions). However, if you have a well-fitting skewed distribution for your data of interest, then you might be able to increase efficiency by basing your inference on the corresponding model. For example, the glogis package provides some functions for structural change testing and dating based on a generalized logistic distribution that allows for heavy tails and skewness. Windberger & Zeileis (2014, Eastern European Economics, 52, 66–88, doi:10.2753/EEE0012-8775520304) used this to track changes in skewness of inflation dynamics over time. (See ?breakpoints.glogisfit for a worked example.) Furthermore, if the skewness itself is not really of interest then a log or sqrt transformation might also be good enough to make the data more "normal".
(2) Heteroscedasticity and autocorrelation. As usual in linear regression models, the standard errors (or, more broadly, the covariance matrix) are not consistent in the presence of heteroscedasticity and/or autocorrelation. One can either try to include this explicitly in the model (e.g., an AR model) or treat it as a nuisance term and employ heteroscedasticity and autocorrelation consistent (HAC) covariance matrices (e.g., Newey-West or Andrews' quadratic spectral kernel HAC). The function Fstats() in strucchange allows plugging in such estimators, e.g., from the sandwich package. See ?durab for an example using vcovHC().
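Concretely, the pattern for point (2) might look like the sketch below, using vcovHC from sandwich on the Thames series from the question (the second answer above does the same with NeweyWest):
library("strucchange")
library("sandwich")
fs_hc <- Fstats(Thames ~ 1, vcov = vcovHC)  ## supF statistics with an HC-corrected covariance
sctest(fs_hc)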
(3) Bootstrap and permutation p-values. The "scrambling" of the time series you describe above sounds more like applying permutations (i.e., sampling without replacement) rather than bootstrap (i.e., sampling with replacement). The former is feasible if the errors are uncorrelated or exchangeable. If you are regressing just on a constant, then you can employ the function maxstat_test() from the coin package to carry out the supF test. The test statistic is computed in a somewhat different way, however, this can be shown to be equivalent to the supF test in the constant-only case (see Zeileis & Hothorn, 2013, Statistical Papers, 54, 931–954, doi:10.1007/s00362-013-0503-4). If you want to perform the permutation test in a more general model, then you would have to do the permutations "by hand" and simply store the test statistic from each permutation. Alternatively, the bootstrap can be applied, e.g., via the boot package (where you would still need to write your own small function that computes the test statistic from a given bootstrap sample). There are also some R packages (e.g., tseries) that implement bootstrap schemes for dependent series.
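And here is a rough sketch of the "scramble the data" permutation scheme described in the question, done by hand as suggested above. In the constant-only model the supF statistic is the maximal squared t statistic, so comparing supF values is equivalent to comparing max |t|; 1,000 permutations are used here just to keep it quick, and the question's 10,000 works the same way:
library("strucchange")
sup_f <- function(y) max(Fstats(y ~ 1)$Fstats)       ## supF = max squared t, constant-only model
obs  <- sup_f(log(Thames))                           ## statistic for the time-ordered series
perm <- replicate(1000, sup_f(sample(log(Thames))))  ## statistic after scrambling the time order
mean(perm >= obs)                                    ## permutation p-value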
I'm getting different results from R and SAS when I try to calculate a weighted variance. Does anyone know what might be causing this difference?
I create vectors of weights and values and then calculate the weighted variance using the wtd.var function from the Hmisc package:
library(Hmisc)
wt <- c(5, 5, 4, 1)
x <- c(3.7,3.3,3.5,2.8)
wtd.var(x,weights=wt)
I get an answer of:
[1] 0.0612381
But if I try to reproduce these results in SAS I get a quite different result:
data test;
input wt x;
cards;
5 3.7
5 3.3
4 3.5
1 2.8
;
run;
proc means data=test var;
var x;
weight wt;
run;
Results in an answer of
0.2857778
You probably have a difference in how the variance is calculated. SAS gives you an option, VARDEF, which may help here.
proc means data=test var vardef=WDF;
var x;
weight wt;
run;
That, on your dataset, gives a variance matching R's. Both are 'right', depending on how you choose to calculate the weighted variance. (At my shop we calculate it a third way, of course...)
Complete text from PROC MEANS documentation:
VARDEF=divisor specifies the divisor to use in the calculation of the
variance and standard deviation. The following table shows the
possible values for divisor and associated divisors.
Possible Values for VARDEF=

    Value          Divisor                    Formula for Divisor
    DF             degrees of freedom         n - 1
    N              number of observations     n
    WDF            sum of weights minus one   (sum of the w_i) - 1
    WEIGHT | WGT   sum of weights             sum of the w_i
The procedure computes the variance as CSS/Divisor, where CSS
is the corrected sums of squares and equals Sum((Xi-Xbar)^2). When you
weight the analysis variables, CSS equals sum(Wi*(Xi-Xwbar)^2), where
Xwbar is the weighted mean.
Default: DF
Requirement: To compute the standard error of the mean, confidence limits for the mean, or the Student's t-test, use the default value of VARDEF=.
Tip: When you use the WEIGHT statement and
VARDEF=DF, the variance is an estimate of Sigma^2, where the
variance of the ith observation is Sigma^2/wi and wi is the
weight for the ith observation. This method yields an estimate of the
variance of an observation with unit weight.
Tip: When you use the
WEIGHT statement and VARDEF=WGT, the computed variance is
asymptotically (for large n) an estimate of Sigma^2/wbar, where
wbar is the average weight. This method yields an asymptotic
estimate of the variance of an observation with average weight.
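To see how the two divisors lead to the two reported numbers, here is a small by-hand sketch in R following the CSS/Divisor description above:
wt <- c(5, 5, 4, 1)
x  <- c(3.7, 3.3, 3.5, 2.8)
xwbar <- sum(wt * x) / sum(wt)     ## weighted mean
css   <- sum(wt * (x - xwbar)^2)   ## corrected (weighted) sum of squares
css / (sum(wt) - 1)                ## VARDEF=WDF -> 0.0612381, matches Hmisc::wtd.var
css / (length(x) - 1)              ## VARDEF=DF (SAS default) -> 0.2857778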