Comparing multiple classifiers: Nemenyi + Holm test in R

I am new to R and am trying to reproduce the results from (1). Table 6 of the paper gives the AUCs of 4 classifiers on 14 data sets:
auc <- matrix(c(
0.763 , 0.768 , 0.771 , 0.798 ,
0.599 , 0.591 , 0.590 , 0.569 ,
0.954 , 0.971 , 0.968 , 0.967 ,
0.628 , 0.661 , 0.654 , 0.657 ,
0.882 , 0.888 , 0.886 , 0.898 ,
0.936 , 0.931 , 0.916 , 0.931 ,
0.661 , 0.668 , 0.609 , 0.685 ,
0.583 , 0.583 , 0.563 , 0.625 ,
0.775 , 0.838 , 0.866 , 0.875 ,
1.000 , 1.000 , 1.000 , 1.000 ,
0.940 , 0.962 , 0.965 , 0.962 ,
0.619 , 0.666 , 0.614 , 0.669 ,
0.972 , 0.981 , 0.975 , 0.975 ,
0.957 , 0.978 , 0.946 , 0.970),
nrow = 14,
byrow = TRUE,
dimnames = list(1 : 14, c("C4.5", "C4.5+m", "C4.5+cf", "C4.5+m+cf"))
)
> friedman.test(auc)
Friedman chi-squared = 10.952, df = 3, p-value = 0.01199
The paper reports (page 13) chi-square = 9.28 and F_F = 3.69. How do I obtain these values from the test above?
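One possible way to relate these numbers, as a sketch only: the formulas below are the Friedman statistic and the F_F correction as described in (1), computed from average ranks without a tie correction, whereas friedman.test() in R corrects for ties. The names ranks, R, chi2, FF and tie_sum are my own. Values computed from the matrix as typed here may not match the paper's 9.28 and 3.69 exactly if the typed AUCs or the tie handling differ from the paper's table.

```r
# AUC matrix as typed in the question (rows = data sets, columns = classifiers)
auc <- matrix(c(
  0.763, 0.768, 0.771, 0.798,
  0.599, 0.591, 0.590, 0.569,
  0.954, 0.971, 0.968, 0.967,
  0.628, 0.661, 0.654, 0.657,
  0.882, 0.888, 0.886, 0.898,
  0.936, 0.931, 0.916, 0.931,
  0.661, 0.668, 0.609, 0.685,
  0.583, 0.583, 0.563, 0.625,
  0.775, 0.838, 0.866, 0.875,
  1.000, 1.000, 1.000, 1.000,
  0.940, 0.962, 0.965, 0.962,
  0.619, 0.666, 0.614, 0.669,
  0.972, 0.981, 0.975, 0.975,
  0.957, 0.978, 0.946, 0.970),
  nrow = 14, byrow = TRUE,
  dimnames = list(1:14, c("C4.5", "C4.5+m", "C4.5+cf", "C4.5+m+cf")))

N <- nrow(auc)  # number of data sets
k <- ncol(auc)  # number of classifiers

# Rank classifiers within each data set: rank 1 = highest AUC, ties share the average rank
ranks <- t(apply(-auc, 1, rank))
R <- colMeans(ranks)  # average rank per classifier

# Friedman statistic without tie correction, and the derived F_F statistic
chi2 <- 12 * N / (k * (k + 1)) * (sum(R^2) - k * (k + 1)^2 / 4)
FF   <- (N - 1) * chi2 / (N * (k - 1) - chi2)
p_FF <- pf(FF, k - 1, (k - 1) * (N - 1), lower.tail = FALSE)

# Tie correction: sum of (t^3 - t) over every group of tied ranks in every row
tie_sum <- sum(apply(ranks, 1, function(r) {
  t <- as.vector(table(r))
  sum(t^3 - t)
}))
chi2_tie <- chi2 / (1 - tie_sum / (N * k * (k^2 - 1)))

c(chi2 = chi2, FF = FF, p_FF = p_FF, chi2_tie = chi2_tie)
```

Under these assumptions, chi2_tie should agree with the 10.952 printed by friedman.test(auc), which suggests that the gap between the two chi-square values comes mainly from tie handling.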
The next step is the Nemenyi test, for which I used the PMCMR package.
> library(PMCMR)
> posthoc.friedman.nemenyi.test(auc)
Pairwise comparisons using Nemenyi multiple comparison test
with q approximation for unreplicated blocked data
data: auc
          C4.5  C4.5+m C4.5+cf
C4.5+m    0.089 -      -
C4.5+cf   0.972 0.227  -
C4.5+m+cf 0.062 0.999  0.170
P value adjustment method: none
Can I get the critical value of 2.569 and the corresponding critical distance (CD) of 1.25 from somewhere?
How can I apply the Holm test?
(1) Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." The Journal of Machine Learning Research 7 (2006): 1-30.
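For the critical value and critical distance, a minimal sketch in base R follows. It assumes that the Nemenyi critical value is the studentized range quantile divided by sqrt(2), and that CD = q_alpha * sqrt(k(k+1)/(6N)) as in (1). The Holm step uses p.adjust(); the p-values passed to it are hypothetical placeholders, not results from the paper or from the data above.

```r
k     <- 4     # number of classifiers
N     <- 14    # number of data sets
alpha <- 0.05

# Critical value: studentized range quantile with infinite df, divided by sqrt(2)
q_alpha <- qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)

# Critical distance between average ranks
cd <- q_alpha * sqrt(k * (k + 1) / (6 * N))

round(c(q_alpha = q_alpha, CD = cd), 3)

# Holm step-down adjustment of unadjusted p-values
# (hypothetical values, e.g. from comparing each classifier with a control)
p_raw  <- c(0.01, 0.04, 0.03)
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```

With k = 4 and N = 14 this gives values that round to the 2.569 and 1.25 quoted in the question. Two classifiers differ significantly under the Nemenyi test when their average ranks differ by at least CD.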

Related

Split String in R for several signs

How can I split a string like
x = "0.989(0.975)&0.964(0.937)&0.877(0.771)&&0.962(0.903)&0.971(0.867)&0.932(0.828)&&0.984(0.892)&0.937(0.869)&0.910(0.722)&&0.970(0.867)&0.942(0.811)&0.875(0.747)"
to get all the numbers as a numeric vector like
y = c(0.989, 0.975, 0.964, 0.937, 0.877)
and so on.
I want to eliminate the parentheses, the "&" and the "&&".
Use gsub with scan: gsub replaces every run of characters other than digits and "." with a single "," delimiter, and scan then reads the result in one go.
out <- scan(text = gsub("[^.0-9]+", ",", x), what = numeric(),
sep=",", quiet = TRUE)
str(out)
#num [1:25] 0.989 0.975 0.964 0.937 0.877 0.771 0.962 0.903 0.971 0.867 ...
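Note that str() reports 25 elements although the string holds 24 numbers. The extra element is presumably an NA caused by the trailing delimiter that replaces the final ")". A small variation, sketched here under that assumption, trims leading and trailing delimiters before calling scan (the name cleaned is my own):

```r
x <- "0.989(0.975)&0.964(0.937)&0.877(0.771)&&0.962(0.903)&0.971(0.867)&0.932(0.828)&&0.984(0.892)&0.937(0.869)&0.910(0.722)&&0.970(0.867)&0.942(0.811)&0.875(0.747)"

# Collapse every run of non-numeric characters into one comma,
# then drop commas at the very start or end so no empty field remains
cleaned <- gsub("^,+|,+$", "", gsub("[^.0-9]+", ",", x))

out <- scan(text = cleaned, what = numeric(), sep = ",", quiet = TRUE)
length(out)
anyNA(out)
```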
Another option using regmatches + as.numeric
as.numeric(regmatches(x, gregexpr("\\d+\\.\\d+", x))[[1]])
gives
[1] 0.989 0.975 0.964 0.937 0.877 0.771 0.962 0.903 0.971 0.867 0.932 0.828
[13] 0.984 0.892 0.937 0.869 0.910 0.722 0.970 0.867 0.942 0.811 0.875 0.747

How does the function influence.measures actually obtain the values in the columns dfb.1_, dfb.X and dffit?

In an exam I was given the code below and asked to find the values marked with ??????.
The code in the question was:
CODE
attach(mydat)
mydat
lr1 = glm(Y ~ X, data = mydat, family = binomial(link = logit))
summary(lr1)
influence.measures(lr1)
OUTPUT
> mydat
X Y
1 1.74 0
2 1.90 0
3 1.91 0
4 1.97 1
5 2.02 1
6 2.27 0
7 2.32 1
8 2.39 0
9 2.42 0
10 3.07 0
> lr1 = glm ( Y ~X , data = mydat , family = binomial ( link = logit ))
> summary ( lr1 )
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.741      4.754   0.366    0.714
X             -1.194      2.202  -0.542    0.588
> influence.measures(lr1)
Influence measures of
glm(formula = Y ~ X, family = binomial(link = logit), data = mydat) :
    dfb.1_   dfb.X  dffit cov.r cook.d   hat inf
1  -0.5825  0.5258 -0.675 1.418 0.2232 0.303
2  -0.2805  0.2359 -0.397 1.301 0.0772 0.177
3   ??????  ?????? -0.386 1.296 0.0728 0.171
4   0.3203 -0.2515  0.547 0.962 0.1780 0.142
5   0.2426 -0.1736  0.511 0.935 0.1586 0.124
6   0.0613 -0.0955 -0.244 1.302 0.0281 0.116
7  -0.2203  0.2995  0.596 0.826 0.2346 0.128
8   0.1368 -0.1700 -0.272 1.369 0.0342 0.150
9   0.1548 -0.1878 -0.281 1.391 0.0366 0.162
10  0.5586 -0.5949 -0.628 2.496 0.1709 0.526
However, when I solved this by hand using the formulas for DFBETAS and DFFITS for logistic regression,
my answers were different from the output of influence.measures.
I am sure that there is nothing wrong with my calculations,
so I wonder how the function influence.measures actually calculates DFBETAS and DFFITS.
You can type the function's name at the prompt to see its source and how the values are calculated. The line for the dfbetas columns is
dfbetas <- cf/outer(infl$sigma, sqrt(diag(xxi)))
Look earlier in the function for the calculations that compute cf, infl, and xxi. You'll probably need to spend a while reading help pages ?outer, ?diag, and the functions used in the earlier calculations.
You can use stats:::dfbetas.lm to see a more compact version of the same calculation.
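As a sketch of how to check this on the exam data, the code below refits the model and compares the first columns of influence.measures() with dfbetas() and dffits(). It assumes the ten rows printed in the question are the complete data set. As far as I understand the help page ?influence.measures, the values for a glm are one-step approximations based on the final weighted least-squares fit, not the result of refitting without each observation, which may explain the difference from a hand calculation that uses another formula.

```r
# Data as printed in the question
mydat <- data.frame(
  X = c(1.74, 1.90, 1.91, 1.97, 2.02, 2.27, 2.32, 2.39, 2.42, 3.07),
  Y = c(0, 0, 0, 1, 1, 0, 1, 0, 0, 0)
)

lr1 <- glm(Y ~ X, data = mydat, family = binomial(link = logit))

im <- influence.measures(lr1)  # list whose infmat component holds the table
db <- dfbetas(lr1)             # dispatches to the lm method for glm objects
ft <- dffits(lr1)

# Row 3 contains the values hidden behind ?????? in the exam output
im$infmat[3, ]

# The first two columns of infmat should match dfbetas(), the third dffits()
all.equal(unname(im$infmat[, 1:2]), unname(db))
all.equal(unname(im$infmat[, 3]), unname(ft))
```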

R effects package Error: non-conformable arguments

I am using the effect function from the effects R package on a Cox model. There is a default method for this function, so it should in principle work for any model.
When I call it, I get this error:
> eff_cf <- effect("TP53:MDM2", model)
Error in mod.matrix %*% mod$coefficients[!is.na(coef(mod))] :
non-conformable arguments
My model looks like this:
> model
Call:
coxph(formula = Surv(times, patient.vital_status) ~ TP53 + MDM2 +
TP53:MDM2, data = clinForPlot)
coef exp(coef) se(coef) z p
TP53Other -0.163 0.850 0.217 -0.752 4.5e-01
TP53WILD -1.086 0.337 0.277 -3.928 8.6e-05
MDM2(1183.7,1674.7] -0.669 0.512 0.235 -2.851 4.4e-03
MDM2(1674.7,2248.5] -0.744 0.475 0.305 -2.444 1.5e-02
MDM2(2248.5,50339] -0.867 0.420 0.375 -2.308 2.1e-02
TP53Other:MDM2(1183.7,1674.7] 0.394 1.483 0.412 0.958 3.4e-01
TP53WILD:MDM2(1183.7,1674.7] 0.133 1.142 0.413 0.323 7.5e-01
TP53Other:MDM2(1674.7,2248.5] -0.192 0.825 0.517 -0.372 7.1e-01
TP53WILD:MDM2(1674.7,2248.5] 0.546 1.726 0.433 1.260 2.1e-01
TP53Other:MDM2(2248.5,50339] -0.140 0.869 0.650 -0.215 8.3e-01
TP53WILD:MDM2(2248.5,50339] 0.786 2.195 0.484 1.623 1.0e-01
Likelihood ratio test=72.8 on 11 df, p=3.54e-11 n= 1321, number of events= 258
And the model and the data.frame used for model can be reproduced using this code
library(archivist)
model <- loadFromGithubRepo("68eeefba87be70364eb3801cec58eb3d",
                            user = "MarcinKosinski",
                            repo = "Museum",
                            value = TRUE)
clinForPlot <- loadFromGithubRepo("cfa5145e6b98964d5f8b760bf749e426",
                                  user = "MarcinKosinski",
                                  repo = "Museum",
                                  value = TRUE)
Any idea how to fix this and what is wrong?

How to set the level above which to display factor loadings from factanal() in R?

I ran a factor analysis on the state.x77 data set, which ships with R. After running the analysis, I inspected the factor loadings.
> output = factanal(state.x77, factors=3, rotation="promax")
> ld = output$loadings
> ld
Loadings:
Factor1 Factor2 Factor3
Population 0.161 0.239 -0.316
Income -0.149 0.681
Illiteracy 0.446 -0.284 -0.393
Life Exp -0.924 0.172 -0.221
Murder 0.917 0.103 -0.129
HS Grad -0.414 0.731
Frost 0.107 1.046
Area 0.387 0.585 0.101
Factor1 Factor2 Factor3
SS loadings 2.274 1.519 1.424
Proportion Var 0.284 0.190 0.178
Cumulative Var 0.284 0.474 0.652
It looks like R suppresses all loadings with an absolute value below 0.1 by default. Is there a way to set this threshold by hand, say to 0.3 instead of 0.1?
Try this:
print(output$loadings, cutoff = 0.3)
See ?print.loadings for the details.
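A short sketch of the same idea: the cutoff only affects printing, so the stored loadings are unchanged. The sort argument of print.loadings and the manual blanking below (the name ld_blank is my own) are optional extras, not something the question asked for.

```r
output <- factanal(state.x77, factors = 3, rotation = "promax")

# Print only loadings with absolute value of at least 0.3,
# grouping variables by the factor they load on most strongly
print(output$loadings, cutoff = 0.3, sort = TRUE)

# To work with the thresholded values themselves, blank them manually
ld <- unclass(output$loadings)             # plain numeric matrix
ld_blank <- ifelse(abs(ld) < 0.3, NA, ld)  # NA where the loading is small
round(ld_blank, 3)
```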

Spearman correlation loop in R

A previous post explained how to run a Chi-squared test in a loop over all pairs of columns in R: Chi Square Analysis using for loop in R.
I wanted to adapt this code to do the same thing for a Spearman correlation.
By changing a few of the variables, I was able to calculate Pearson correlations with this code:
library(plyr)
combos <- combn(ncol(fullngodata),2)
adply(combos, 2, function(x) {
test <- cor.test(fullngodata[, x[1]], fullngodata[, x[2]])
out <- data.frame("Row" = colnames(fullngodata)[x[1]]
, "Column" = colnames(fullngodata[x[2]])
, "cor" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
})
But since my data are on an ordinal scale, I need to use the Spearman correlation.
I thought I could get this by just adding the method="spearman" argument, but this does not seem to work. If I use the code:
library(plyr)
combos <- combn(ncol(fullngodata),2)
adply(combos, 2, function(x) {
test <- cor.test(fullngodata[, x[1]], fullngodata[, x[2]], method="spearman")
out <- data.frame("Row" = colnames(fullngodata)[x[1]]
, "Column" = colnames(fullngodata[x[2]])
, "Chi.Square" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
})
I get the response:
Error in data.frame(Row = colnames(fullngodata)[x[1]], Column =
colnames(fullngodata[x[2]]), :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In cor.test.default(fullngodata[, x[1]], fullngodata[, x[2]], method = "spearman") :
Cannot compute exact p-values with ties
What am I doing wrong?
Try the rcor.test function in the ltm package.
library(ltm)
mat <- matrix(rnorm(1000), 100, 10, dimnames = list(NULL, LETTERS[1:10]))
rcor.test(mat, method = "spearman")
A B C D E F G H I J
A ***** -0.035 0.072 0.238 -0.097 0.007 -0.010 -0.031 0.039 -0.090
B 0.726 ***** -0.042 -0.166 0.005 0.025 0.007 -0.231 0.005 0.006
C 0.473 0.679 ***** 0.046 0.074 -0.020 0.091 -0.183 -0.040 -0.084
D 0.017 0.098 0.647 ***** -0.060 -0.151 -0.175 -0.068 0.039 0.181
E 0.338 0.960 0.466 0.553 ***** 0.254 0.055 -0.031 0.072 -0.059
F 0.948 0.805 0.843 0.133 0.011 ***** -0.014 -0.121 0.153 0.048
G 0.923 0.941 0.370 0.081 0.588 0.892 ***** -0.060 -0.050 0.011
H 0.759 0.021 0.069 0.501 0.756 0.230 0.555 ***** -0.053 -0.193
I 0.700 0.963 0.690 0.701 0.476 0.130 0.621 0.597 ***** -0.034
J 0.373 0.955 0.406 0.072 0.561 0.633 0.910 0.055 0.736 *****
upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values
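The problem is that cor.test returns NULL for parameter when you run the Spearman test. From ?cor.test: parameter: the degrees of freedom of the test statistic in the case that it follows a t distribution. Because test$parameter is NULL, the "df" column has zero rows while the other columns have one, which is exactly what the error "arguments imply differing number of rows: 1, 0" reports.
You can see the NULL in the following example: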
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
str(cor.test(x, y, method = "spearman"))
List of 8
$ statistic : Named num 48
..- attr(*, "names")= chr "S"
$ parameter : NULL
$ p.value : num 0.0968
$ estimate : Named num 0.6
..- attr(*, "names")= chr "rho"
$ null.value : Named num 0
..- attr(*, "names")= chr "rho"
$ alternative: chr "two.sided"
$ method : chr "Spearman's rank correlation rho"
$ data.name : chr "x and y"
- attr(*, "class")= chr "htest"
Solution: if you remove the following line from your code, it should work:
, "df"= test$parameter
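Put together, the corrected loop might look like the sketch below. It uses base R's lapply in place of plyr::adply, reports the rho estimate alongside the S statistic, and passes exact = FALSE so that cor.test does not warn about ties. The data frame here is randomly generated stand-in data, since the real fullngodata is not shown, and the column names q1 to q3 are hypothetical.

```r
set.seed(42)

# Hypothetical ordinal data standing in for the real fullngodata
fullngodata <- data.frame(
  q1 = sample(1:5, 40, replace = TRUE),
  q2 = sample(1:5, 40, replace = TRUE),
  q3 = sample(1:5, 40, replace = TRUE)
)

combos <- combn(ncol(fullngodata), 2)  # one column of indices per pair

res <- do.call(rbind, lapply(seq_len(ncol(combos)), function(i) {
  x <- combos[, i]
  test <- cor.test(fullngodata[, x[1]], fullngodata[, x[2]],
                   method = "spearman", exact = FALSE)
  # No "df" column: test$parameter is NULL for the Spearman test
  data.frame(
    Row     = colnames(fullngodata)[x[1]],
    Column  = colnames(fullngodata)[x[2]],
    rho     = round(unname(test$estimate), 3),
    S       = round(unname(test$statistic), 3),
    p.value = round(test$p.value, 3)
  )
}))

res
```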
