Spearman correlation loop in R

A previous post explained how to do a Chi-squared loop in R on all your data-pairs: Chi Square Analysis using for loop in R.
I wanted to use this code to do the same thing for a Spearman correlation.
I've already tried altering a few of the variables, and I was able to compute the Pearson correlations with this code:
library(plyr)
combos <- combn(ncol(fullngodata), 2)
adply(combos, 2, function(x) {
  test <- cor.test(fullngodata[, x[1]], fullngodata[, x[2]])
  out <- data.frame("Row" = colnames(fullngodata)[x[1]],
                    "Column" = colnames(fullngodata[x[2]]),
                    "cor" = round(test$statistic, 3),
                    "df" = test$parameter,
                    "p.value" = round(test$p.value, 3))
  return(out)
})
But since I work with data on an ordinal scale, I need the Spearman correlation. I thought I could get it by simply adding method = "spearman", but that does not seem to work. If I use the code:
library(plyr)
combos <- combn(ncol(fullngodata), 2)
adply(combos, 2, function(x) {
  test <- cor.test(fullngodata[, x[1]], fullngodata[, x[2]], method = "spearman")
  out <- data.frame("Row" = colnames(fullngodata)[x[1]],
                    "Column" = colnames(fullngodata[x[2]]),
                    "Chi.Square" = round(test$statistic, 3),
                    "df" = test$parameter,
                    "p.value" = round(test$p.value, 3))
  return(out)
})
I get the response:
Error in data.frame(Row = colnames(fullngodata)[x[1]], Column = colnames(fullngodata[x[2]]), :
  arguments imply differing number of rows: 1, 0
In addition: Warning message:
In cor.test.default(fullngodata[, x[1]], fullngodata[, x[2]], method = "spearman") :
Cannot compute exact p-values with ties
What am I doing wrong?

Try the rcor.test function in the ltm package.
mat <- matrix(rnorm(1000), 100, 10, dimnames = list(NULL, LETTERS[1:10]))
rcor.test(mat, method = "spearman")
A B C D E F G H I J
A ***** -0.035 0.072 0.238 -0.097 0.007 -0.010 -0.031 0.039 -0.090
B 0.726 ***** -0.042 -0.166 0.005 0.025 0.007 -0.231 0.005 0.006
C 0.473 0.679 ***** 0.046 0.074 -0.020 0.091 -0.183 -0.040 -0.084
D 0.017 0.098 0.647 ***** -0.060 -0.151 -0.175 -0.068 0.039 0.181
E 0.338 0.960 0.466 0.553 ***** 0.254 0.055 -0.031 0.072 -0.059
F 0.948 0.805 0.843 0.133 0.011 ***** -0.014 -0.121 0.153 0.048
G 0.923 0.941 0.370 0.081 0.588 0.892 ***** -0.060 -0.050 0.011
H 0.759 0.021 0.069 0.501 0.756 0.230 0.555 ***** -0.053 -0.193
I 0.700 0.963 0.690 0.701 0.476 0.130 0.621 0.597 ***** -0.034
J 0.373 0.955 0.406 0.072 0.561 0.633 0.910 0.055 0.736 *****
upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values

The problem is that cor.test returns NULL for the parameter component when you run the Spearman test. From ?cor.test: "parameter: the degrees of freedom of the test statistic in the case that it follows a t distribution." Spearman's S statistic does not follow a t distribution, so there are no degrees of freedom to report and parameter is NULL.
You can see this in the following example:
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
str(cor.test(x, y, method = "spearman"))
List of 8
$ statistic : Named num 48
..- attr(*, "names")= chr "S"
$ parameter : NULL
$ p.value : num 0.0968
$ estimate : Named num 0.6
..- attr(*, "names")= chr "rho"
$ null.value : Named num 0
..- attr(*, "names")= chr "rho"
$ alternative: chr "two.sided"
$ method : chr "Spearman's rank correlation rho"
$ data.name : chr "x and y"
- attr(*, "class")= chr "htest"
Solution: if you remove the following line from your code, it should work:
, "df"= test$parameter

Related

Split a string in R on several delimiters

How can I split a string like
x = "0.989(0.975)&0.964(0.937)&0.877(0.771)&&0.962(0.903)&0.971(0.867)&0.932(0.828)&&0.984(0.892)&0.937(0.869)&0.910(0.722)&&0.970(0.867)&0.942(0.811)&0.875(0.747)"
to get all the numbers in a numeric vector like
y = c(0.989, 0.975, 0.964, 0.937, 0.877)
and so on.
I want to eliminate the parentheses, the "&" and "&&".
Use gsub with scan: gsub replaces every character other than digits and . with a single delimiter (,), and scan then reads the result in one pass.
out <- scan(text = gsub("[^.0-9]+", ",", x), what = numeric(),
            sep = ",", quiet = TRUE)
str(out)
#num [1:25] 0.989 0.975 0.964 0.937 0.877 0.771 0.962 0.903 0.971 0.867 ...
Another option uses regmatches + as.numeric:
as.numeric(regmatches(x, gregexpr("\\d+\\.\\d+", x))[[1]])
gives
[1] 0.989 0.975 0.964 0.937 0.877 0.771 0.962 0.903 0.971 0.867 0.932 0.828
[13] 0.984 0.892 0.937 0.869 0.910 0.722 0.970 0.867 0.942 0.811 0.875 0.747
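If the stringr package is available, the same regex idea works there too (this variant is a suggestion, not from the answers above):
library(stringr)
as.numeric(str_extract_all(x, "\\d+\\.\\d+")[[1]])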

Quickly sum a big list of lists?

I have 10,000 lists (results of a simulation), each containing 22,500 lists (one per pixel in an image), each of which holds a vector of length 55.
# Simple Example
m <- replicate(4, list(runif(55)))
m2 <- replicate(3, list(m))
str(m2,list.len = 3)
List of 3
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
# my function
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
#by hand
identical(m2[[1]][[1]] + m2[[2]][[1]] + m2[[3]][[1]], m3[[1]] )
I wrote a nested lapply with Reduce to sum the lists. On a small example like the one above it is fast, but on my real data it is really slow.
#slow code
m <- replicate(22500, list(runif(55)))
m2 <- replicate(10000, list(m))
str(m2,list.len = 3)
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
How can I speed this up, or should I change data structures?
Thanks.
This gives some improvement (>2x):
split(Reduce(`+`, lapply(m2, unlist)), rep(seq_along(m2[[1]]), lengths(m2[[1]])))
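If a matrix is an acceptable result, a variant of the same idea reshapes the summed vector instead of splitting it; the byrow reshape and the final check are my own sketch, not part of the answer above:
v <- Reduce(`+`, lapply(m2, unlist))          # sum the unlisted simulations
m3_mat <- matrix(v, ncol = 55, byrow = TRUE)  # row i = summed vector of pixel i
identical(m3_mat[1, ], m3[[1]])               # should be TRUE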
Since your data is essentially rectangular, had you stored it in this shape:
library(data.table)
d <- rbindlist(lapply(m2, function(x) transpose(as.data.table(x))),
               idcol = TRUE)[, id.in := seq_len(.N), by = .id]
# .id V1 V2 ... V55 id.in
#1: 1 0.4605065 0.09744975 ... 0.8620728 1
#2: 1 0.6666742 0.10435471 ... 0.3991940 2
#3: 2 0.4605065 0.09744975 ... 0.8620728 1
#4: 2 0.6666742 0.10435471 ... 0.3991940 2
#5: 3 0.4605065 0.09744975 ... 0.8620728 1
#6: 3 0.6666742 0.10435471 ... 0.3991940 2
You could do the aggregation even faster by doing:
d[, lapply(.SD, sum), by = id.in]
But if the list is your starting point, the conversion would take up the majority of the time.

Comparing multiple classifiers: Nemenyi + Holm test in R

As an R newbie, I am trying to reproduce the results from (1). Table 6 gives the AUCs of 4 classifiers on 14 data sets:
auc <- matrix(c(
0.763 , 0.768 , 0.771 , 0.798 ,
0.599 , 0.591 , 0.590 , 0.569 ,
0.954 , 0.971 , 0.968 , 0.967 ,
0.628 , 0.661 , 0.654 , 0.657 ,
0.882 , 0.888 , 0.886 , 0.898 ,
0.936 , 0.931 , 0.916 , 0.931 ,
0.661 , 0.668 , 0.609 , 0.685 ,
0.583 , 0.583 , 0.563 , 0.625 ,
0.775 , 0.838 , 0.866 , 0.875 ,
1.000 , 1.000 , 1.000 , 1.000 ,
0.940 , 0.962 , 0.965 , 0.962 ,
0.619 , 0.666 , 0.614 , 0.669 ,
0.972 , 0.981 , 0.975 , 0.975 ,
0.957 , 0.978 , 0.946 , 0.970),
nrow = 14,
byrow = TRUE,
dimnames = list(1 : 14, c("C4.5", "C4.5+m", "C4.5+cf", "C4.5+m+cf"))
)
> friedman.test(auc)

	Friedman rank sum test

data:  auc
Friedman chi-squared = 10.952, df = 3, p-value = 0.01199
The paper says (page 13) chi-square = 9.28 and FF = 3.69. Where do I get these values from the test above?
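FF is not part of friedman.test's output. Demšar (2006) derives it from the Friedman chi-square as FF = (N - 1) * chisq / (N * (k - 1) - chisq), which follows an F distribution with (k - 1) and (k - 1)(N - 1) degrees of freedom. A minimal sketch of that conversion, plugging in the paper's chi-square of 9.28:
N <- 14; k <- 4                                   # data sets, classifiers
chisq_F <- 9.28                                   # the paper's Friedman chi-square
FF <- (N - 1) * chisq_F / (N * (k - 1) - chisq_F)
FF                                                # ~3.69, matching the paper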
The next step is the Nemenyi test, for which I used the PMCMR package.
> library(PMCMR)
> posthoc.friedman.nemenyi.test(auc)
Pairwise comparisons using Nemenyi multiple comparison test
with q approximation for unreplicated blocked data
data: auc
C4.5 C4.5+m C4.5+cf
C4.5+m 0.089 - -
C4.5+cf 0.972 0.227 -
C4.5+m+cf 0.062 0.999 0.170
P value adjustment method: none
Can I get the critical value of 2.569 and the corresponding critical distance (CD) of 1.25 from somewhere?
How can I apply the Holm test?
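For the critical value, Demšar (2006) uses the Studentized range quantile divided by sqrt(2), and the critical distance is CD = q_alpha * sqrt(k * (k + 1) / (6 * N)). A base-R sketch of both, plus a Holm-corrected comparison against C4.5 as the control (my own sketch under those assumptions, not something PMCMR provides directly):
k <- 4; N <- 14
q_alpha <- qtukey(0.95, nmeans = k, df = Inf) / sqrt(2)  # 2.569
CD <- q_alpha * sqrt(k * (k + 1) / (6 * N))              # ~1.25

# Holm-style comparison with a control: z = (R_i - R_ctrl) / SE
R  <- colMeans(t(apply(-auc, 1, rank)))  # average ranks; best AUC gets rank 1
se <- sqrt(k * (k + 1) / (6 * N))
z  <- (R[-1] - R[1]) / se                # each C4.5 variant vs plain C4.5
p.adjust(2 * pnorm(-abs(z)), method = "holm")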
(1) Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." The Journal of Machine Learning Research 7 (2006): 1-30.

Mysterious source of output in R?

I am using the following code with the mtcars data and the factanal function for factor analysis. Printing fit$loadings shows the proportion of variance, but it does not seem to be present in str(fit$loadings):
> fit <- factanal(mtcars, 3, rotation="varimax")
> fit$loadings
Loadings:
Factor1 Factor2 Factor3
mpg 0.643 -0.478 -0.473
cyl -0.618 0.703 0.261
disp -0.719 0.537 0.323
hp -0.291 0.725 0.513
drat 0.804 -0.241
wt -0.778 0.248 0.524
qsec -0.177 -0.946 -0.151
vs 0.295 -0.805 -0.204
am 0.880
gear 0.908 0.224
carb 0.114 0.559 0.719
Factor1 Factor2 Factor3
SS loadings 4.380 3.520 1.578
Proportion Var 0.398 0.320 0.143 <<<<<<<<<<<<< I NEED THESE NUMBERS AS A VECTOR
Cumulative Var 0.398 0.718 0.862
>
> str(fit$loadings)
loadings [1:11, 1:3] 0.643 -0.618 -0.719 -0.291 0.804 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
..$ : chr [1:3] "Factor1" "Factor2" "Factor3"
How can I get the proportion-of-variance vector from fit$loadings? Thanks for your help.
Let obj <- fit$loadings. Here is the complete path to the result.
Writing fit$loadings (or obj) actually calls print(obj). So, after looking at str, you might want to check what the specific print method does with obj. To know which method to look for, we check class(obj) and get "loadings".
Typing print.loadings gives nothing because the function is not exported. Since factanal is in the stats package, we call stats:::print.loadings and get the complete source code of the method. Inspecting it shows that we can get the desired result as follows.
colSums(obj^2) / nrow(obj)
# Factor1 Factor2 Factor3
# 0.3982190 0.3199652 0.1434125
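As a convenience, the same computation can be wrapped in a tiny helper (propVar is a hypothetical name, not from the answer):
propVar <- function(loadings) colSums(loadings^2) / nrow(loadings)
propVar(fit$loadings)
# Factor1 Factor2 Factor3
# 0.398   0.320   0.143   (rounded)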

llply subset by a function applied to an element in the list

I have a list that I am failing to subset. This is what the list looks like:
dune.envfit <-
structure(list(vectors = structure(list(arrows = structure(c(0.280345610462046,
0.23065528788472, -0.529798086330133, 0.143997109594625, -0.164377286767545,
-0.135605646341866, -0.963240459395111, -0.317899182650768, 0.128920508855905,
0.219446302740393, 0.716407585418307, 0.787931102449639, 0.346775848714473,
-0.237555622845223, -0.436638573595095, -0.218629125777214, -0.316702141990071,
0.0503363265919834, 0.959899129437392, 0.973035527702261, -0.848123804477229,
0.989578108300903, -0.986397540343111, -0.990762892260406, -0.26864068452916,
-0.948124522238495, 0.991654931110784, 0.975624579545622, -0.697681998875642,
0.615763410566507, -0.937948032008361, -0.971373937294294, -0.899637013494243,
-0.975808026899703, -0.948525040923486, 0.998732323610899), .Dim = c(18L,
2L), .Dimnames = list(c("TIC.mg.L", "Alkalinity.mg.L", "TOC.mg.L",
"DIC.mg.L", "DOC.mg.L", "Na", "K", "Mg", "Ca", "Dissolved.N.mg.L",
"Total.N.mg.L", "SI75", "FI", "HIX", "BIX", "EX.max", "MAXI",
"MaxEX"), c("MDS1", "MDS2")), decostand = "normalize"), r = structure(c(0.59685535501548,
0.185756858794004, 0.774378247785189, 0.19777058682652, 0.773575290577961,
0.355910385488688, 0.808105377304065, 0.757415261345049, 0.808139825741362,
0.756826061659786, 0.766773822773421, 0.297255075467242, 0.0482776303145529,
0.458034365410462, 0.533467657077853, 0.757925150420371, 0.757494019430756,
0.452226922967628), .Names = c("TIC.mg.L", "Alkalinity.mg.L",
"TOC.mg.L", "DIC.mg.L", "DOC.mg.L", "Na", "K", "Mg", "Ca", "Dissolved.N.mg.L",
"Total.N.mg.L", "SI75", "FI", "HIX", "BIX", "EX.max", "MAXI",
"MaxEX")), permutations = 999, pvals = c(0.014, 0.418, 0.003,
0.367, 0.003, 0.1, 0.002, 0.003, 0.001, 0.003, 0.003, 0.158,
0.836, 0.042, 0.02, 0.003, 0.003, 0.095)), .Names = c("arrows",
"r", "permutations", "pvals"), class = "vectorfit"), factors = NULL,
na.action = function (object, ...)
UseMethod("na.action")), .Names = c("vectors", "factors",
"na.action"), class = "envfit")
I would like to pull out every element in the list tied to a p-value below 0.05. I know the pvals are somewhere in the first element of the list. They appear as the column named Pr(>r), even though in the str() of the list they are called pvals:
> dune.envfit[[1]]
MDS1 MDS2 r2 Pr(>r)
TIC.mg.L 0.99242 -0.12290 0.5970 0.017 *
Alkalinity.mg.L 0.91283 0.40833 0.1860 0.414
TOC.mg.L -0.55732 0.83030 0.7743 0.005 **
DIC.mg.L 0.42268 0.90628 0.1972 0.394
DOC.mg.L -0.52551 -0.85079 0.7742 0.003 **
Na -0.38616 -0.92243 0.3543 0.096 .
K -0.36924 0.92934 0.8080 0.001 ***
Mg -0.91182 0.41058 0.7574 0.004 **
Ca 0.35178 0.93608 0.8076 0.001 ***
Dissolved.N.mg.L 0.86231 0.50638 0.7572 0.003 **
Total.N.mg.L 0.26924 -0.96307 0.7671 0.005 **
SI75 0.41623 -0.90926 0.2971 0.165
FI 0.18448 -0.98284 0.0486 0.824
HIX -0.95753 -0.28834 0.4569 0.056 .
BIX -0.65163 0.75853 0.5327 0.027 *
EX.max -0.86522 -0.50139 0.7572 0.006 **
MAXI -0.91477 0.40398 0.7571 0.006 **
MaxEX 0.11461 0.99341 0.4542 0.070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
P values based on 999 permutations.
So far I have been trying to apply llply like so:
dune.envfit<-llply(dune.envfit, dune.envfit$vectors$pvals, summarize, function(x) x<0.05)
But this is not working; I get the error "Error in llply(dune.envfit, dune.envfit$vectors$pvals, summarize, function(x) x < : .fun is not a function." I do not know how to write a function for llply-type calls, and I believe that is what the error is telling me.
Thank you.
After discussing some possibilities, I now have only the pvals I want. However, for later coding I need to know the name of the new list that contains these pvals.
str(dune.envfit_subset)
List of 2
$ : num [1:12, 1:2] 0.1125 -0.2401 -0.0644 -0.7822 -0.1268 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "TIC.mg.L" "TOC.mg.L" "DOC.mg.L" "K" ...
.. ..$ : chr [1:2] "MDS1" "MDS2"
$ :List of 3
..$ r : Named num [1:12] 0.597 0.774 0.774 0.808 0.759 ...
.. ..- attr(*, "names")= chr [1:12] "TIC.mg.L" "TOC.mg.L" "DOC.mg.L" "K" ...
..$ permutations: num [1:12] 999 NA NA NA NA NA NA NA NA NA ...
..$ pvals : num [1:12] 0.024 0.002 0.002 0.002 0.004 0.004 0.005 0.005 0.044 0.029 ...
It sounds like you want to subset all the values such that the associated p-value is less than 0.05.
Give this a try.
> dv <- dune.envfit$vectors
> dvSubset <- c(arrows = list(dv$arrows[w <- which(dv$pvals < 0.05), ]),
+               lapply(dv[-1], `[`, w))
It looks like you could also do
> as.data.frame(dvSubset)[-4]
which returns a data frame of the desired subsets with the permutations column removed
> str(dvSubset)
List of 4
$ arrows : num [1:12, 1:2] 0.28 -0.53 -0.164 -0.963 -0.318 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "TIC.mg.L" "TOC.mg.L" "DOC.mg.L" "K" ...
.. ..$ : chr [1:2] "MDS1" "MDS2"
$ r : Named num [1:12] 0.597 0.774 0.774 0.808 0.757 ...
..- attr(*, "names")= chr [1:12] "TIC.mg.L" "TOC.mg.L" "DOC.mg.L" "K" ...
$ permutations: num [1:12] 999 NA NA NA NA NA NA NA NA NA ...
$ pvals : num [1:12] 0.014 0.003 0.003 0.002 0.003 0.001 0.003 0.003 0.042 0.02 ...
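If what you need later are the names, they are preserved in the subset; a quick check (a hypothetical follow-up, not part of the original answer):
names(dvSubset)            # "arrows" "r" "permutations" "pvals"
rownames(dvSubset$arrows)  # the variables with pvals < 0.05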
