Poorly implemented two-sample Kolmogorov-Smirnov test (kstest2) in Matlab?

Am I missing something obvious, or is Matlab's kstest2 giving very poor p-values?
By "very poor" I mean that I suspect it may even be wrongly implemented.
The help page of kstest2 states that the function calculates an asymptotic p-value, though I did not find any reference to which method is used exactly. Anyway, the description further states:
asymptotic p-value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2, such that (n1*n2)/(n1 + n2) ≥ 4
Example 1
Let's take Example 6 from Lehmann and D'Abrera (1975):
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
[h,p,ks2stat] = kstest2(sampleA, sampleB, 'Tail', 'unequal');
(n1*n2)/(n1 + n2) = 4 in this case, so the p-value should be reasonably accurate.
Matlab yields p = 0.0497, while the solution given in the book is 0.0870.
To validate the solution I used R, which I trust more than Matlab, especially for statistics.
Using ks.test from the stats package and ks.boot from the Matching package:
ks.test(sampleA, sampleB, alternative = "two.sided")
ks.boot(sampleA, sampleB, alternative = "two.sided")
Both give p = 0.0870.
Example 2
Let's use kstest2's own example to compare Matlab and R results for a larger sample size:
rng(1); % For reproducibility
x1 = wblrnd(1,1,1,50);
x2 = wblrnd(1.2,2,1,50);
[h,p,ks2stat] = kstest2(x1,x2);
This yields p = 0.0317. Now, using the same x1 and x2 vectors, R gives p = 0.03968.
That is about a 20% difference, even though a very accurate result is expected: (n1*n2)/(n1 + n2) = 25.
Am I missing or messing something up?
Is it possible that Matlab's kstest2 performs as poorly as these examples indicate? Which approximation or algorithm does kstest2 use? (I can see the implemented code of kstest2, but a reference to a book or paper would be much better for understanding what is going on.)
I am using Matlab 2016a.
Lehmann and D'Abrera (1975). Nonparametrics: Statistical Methods Based on Ranks. 1st edition. Springer.

I think that the correct test to compare with R's ks.test is Octave's kolmogorov_smirnov_test_2:
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
kolmogorov_smirnov_test_2(sampleA, sampleB)
pval: 0.0878664
The difference appears to be the use of ks versus lambda, i.e.
ks = sqrt (n) * d;
pval = 1 - kolmogorov_smirnov_cdf (ks);
versus
lambda = max((sqrt(n) + 0.12 + 0.11/sqrt(n)) * d , 0);
pval = 1 - kolmogorov_smirnov_cdf (lambda);
I presume the different test statistics arise from differences in the research papers cited by these two functions. If you want a deeper dive into the statistical theory, you may want to ask on Cross Validated.
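To see the two approximations side by side, here is a minimal R sketch that evaluates the asymptotic Kolmogorov tail series for Example 1 with and without the small-sample correction quoted above (the ks_tail helper is my own shorthand, not the internal code of either implementation):
sampleA <- c(6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9)
sampleB <- c(4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3)
## two-sample KS statistic D = max |F_A(x) - F_B(x)| and effective sample size
xs <- sort(c(sampleA, sampleB))
D  <- max(abs(ecdf(sampleA)(xs) - ecdf(sampleB)(xs)))
n  <- length(sampleA) * length(sampleB) / (length(sampleA) + length(sampleB))
## asymptotic Kolmogorov tail probability, series truncated at 100 terms
ks_tail <- function(x, terms = 100) {
  k <- seq_len(terms)
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}
ks_tail(sqrt(n) * D)                            # ~0.088, plain statistic (Octave/R style)
ks_tail((sqrt(n) + 0.12 + 0.11 / sqrt(n)) * D)  # ~0.050, corrected statistic (kstest2 style)
With an effective sample size of only 4, the correction term inflates the statistic enough to noticeably lower the p-value, which accounts for the gap in Example 1.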

Related

Clustering values with given threshold

I have several vectors:
a <- c(1.1, 2.9, 3.9, 5.2)
b <- c(1.0, 1.9, 4.0, 5.1)
c <- c(0.9, 2.1, 3.1, 4.1, 5.0, 11.13)
They can have different lengths.
I want to combine them into a single general vector, averaging values that are similar across all of the vectors or across any pair of them, and keeping the original value if it occurs in only one vector. For the averaging I would like to use a threshold of 0.2.
My explanation may be a bit confusing, but here is the general vector I want to obtain:
d <- c(1, 2, 3, 4, 5.1, 11.13)
I have around 12 vectors and about 2000 values in each vector.
I will be glad for any help.
This seems like a clustering problem (clustering by distance). You can try the code below:
library(igraph)
v <- sort(c(a, b, c))
tapply(
  v,
  membership(components(graph_from_adjacency_matrix(
    as.matrix(dist(v)) <= 0.2 + sqrt(.Machine$double.eps)
  ))),
  mean
)
which gives
1 2 3 4 5 6
1.00 2.00 3.00 4.00 5.10 11.13
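If the same grouping has to be applied to all of the roughly 12 vectors at once, the snippet above can be wrapped into a small helper. This is a minimal sketch; cluster_means is just a name chosen here, and the logic is the same as in the answer:
library(igraph)
cluster_means <- function(..., threshold = 0.2) {
  v <- sort(c(...))
  adj <- +(as.matrix(dist(v)) <= threshold + sqrt(.Machine$double.eps))  # 0/1 adjacency of "close" values
  grp <- components(graph_from_adjacency_matrix(adj, mode = "undirected"))$membership
  unname(tapply(v, grp, mean))                                           # one averaged value per group
}
cluster_means(a, b, c)  # should reproduce the averaged vector shown above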

Summary function for principal coordinates analysis

The R function prcomp() for performing principal components analysis provides a handy summary() method which, as expected, offers a summary of the results at a glance:
data(cars)
pca <- prcomp(cars)
summary(pca)
Importance of components:
                           PC1     PC2
Standard deviation     26.1252 3.08084
Proportion of Variance  0.9863 0.01372
Cumulative Proportion   0.9863 1.00000
I would like a similar summary for displaying the results of a principal coordinates analysis, usually performed using the cmdscale() function:
pco <- cmdscale(dist(cars), eig=TRUE)
But this function does not provide a summary() method and therefore there is no direct way of displaying the results as percent variances, etc.
Has anyone out there already developed some simple method for summarizing such results from PCO?
Thanks in advance for any assistance you can provide!
With best regards,
Instead of the basic cmdscale, one can use capscale or dbrda from the package vegan. These functions generalize PCO, as they "also perform unconstrained principal coordinates analysis, optionally using extended dissimilarities" (quoted from the ?capscale help page). This is much more than a workaround.
The following gives an example:
library(vegan)
A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
pco <- capscale(dist(A) ~ 1) # uses euclidean distance
# or
pco <- capscale(A ~ 1) # with same result
pco # shows some main results
summary(pco) # shows more results, including eigenvalues
The summary function has several arguments, documented in ?summary.cca. Parts of the result can be extracted and formatted by the user; the eigenvalues, for example, are available as pco$CA$eig.
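If you prefer to stay with base cmdscale(), a prcomp()-style summary can also be assembled by hand from the returned eigenvalues. Here is a minimal sketch; the layout is modelled on summary(prcomp(...)) and is my own, not part of any package, and negative eigenvalues (which can appear with non-Euclidean dissimilarities) are dropped:
pco <- cmdscale(dist(cars), eig = TRUE)
eig <- pco$eig
eig <- eig[eig > sqrt(.Machine$double.eps) * max(eig)]  # keep the clearly positive eigenvalues
n   <- nrow(pco$points)
summ <- rbind(`Standard deviation`     = sqrt(eig / (n - 1)),
              `Proportion of Variance` = eig / sum(eig),
              `Cumulative Proportion`  = cumsum(eig) / sum(eig))
colnames(summ) <- paste0("Dim", seq_along(eig))
round(summ, 4)
For the Euclidean cars example this should reproduce the prcomp() figures shown above; for other dissimilarities it reports each dimension's share of the positive eigenvalues.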

Cohen d for mixed effects model

Has anyone worked with the powerlmm package?
I want to calculate statistical power for my model with the study_parameters command. It takes multiple values as input, one of which is Cohen's d.
How do I figure out what the Cohen's d for my model is? Also, for which effects exactly should I specify Cohen's d: the fixed effects, the random effects, or the entire model?
p <- study_parameters(n1 = 6,
                      n2 = per_treatment(control = 27, treatment = 55),
                      sigma_subject_intercept = 4.56,
                      sigma_subject_slope = 0.22,
                      sigma_error = 5.39,
                      cor_subject = -0.19,
                      dropout = dropout_manual(0.0000000, 0.2666667, 0.3466667,
                                               0.4666667, 0.47, 0.7333333),
                      effect_size = cohend(0.5),
                      standardizer = "posttest_SD"))
get_power(p)
Some details about the study:
2 groups: Treatment and Control
6 Timepoints
n(control)=27, n(treatment)=55
Thanks in advance for your answers.
You get Cohen's d from the output of the function. I see you made a small error in your code: delete the ')' after cohend(0.5, so that standardizer = "posttest_SD" becomes an argument of cohend(), and it works.
By definition, you are interested in the effect size of the fixed effect.
Hope it works out for you.
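For reference, a corrected call might look like the sketch below; this assumes, as in the powerlmm documentation, that standardizer is an argument of cohend() rather than of study_parameters():
library(powerlmm)
p <- study_parameters(n1 = 6,
                      n2 = per_treatment(control = 27, treatment = 55),
                      sigma_subject_intercept = 4.56,
                      sigma_subject_slope = 0.22,
                      sigma_error = 5.39,
                      cor_subject = -0.19,
                      dropout = dropout_manual(0.0000000, 0.2666667, 0.3466667,
                                               0.4666667, 0.47, 0.7333333),
                      effect_size = cohend(0.5, standardizer = "posttest_SD"))
get_power(p)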

How to use an equation with my data in R?

I am struggling with a portion of the data analysis for some research I have carried out. Other researchers have used an equation to estimate population growth rate that I would like to implement, but I am hitting a wall trying to do so. Below is the equation:
r' = ln( (1/N0) * sum_x( Ax * f(wx) ) ) / ( D + sum_x( x * Ax * f(wx) ) / sum_x( Ax * f(wx) ) )
where N0 is the initial number of females in a cohort, Ax is the number of females emerging on day x, wx is a measure of mean female size on day x per replicate, f(wx) is a function relating fecundity to female size, and D is the time (in days) for a female to reproduce.
N0 (n = 15) and D (7) are fixed numbers that I can put into the equation. f(wx) is a function that I have: f(w) = 91.85*w - 181.40. Below is a small sample of my data:
df <- data.frame(replicate = c('1', '1', '2', '2', '3', '3', '4', '4'),
                 size = c(5.1, 4.9, 4.7, 4.6, 5.1, 2.4, 4.3, 4.4),
                 day_emerging = c('6', '7', '6', '7', '6', '8', '7', '6'))
I am sorry if this is a bad question for this site; I am just lost as to how to handle it. I need R to be able to evaluate the equation for the different days. I'm not sure whether that is actually possible with my current data format, because R will have to work out how many females emerged on day x and then perform the other calculations for that day. So maybe this is impossible.
Thank you very much for any advice you can offer.
Here is a base R solution. Hope this is what you are after:
dfs <- split(df,df$day_emerging)
p <- sum(sapply(dfs, function(v) nrow(v)*f(mean(v$size))))
q <- sum(sapply(dfs, function(v) nrow(v)*as.numeric(unique(v$day_emerging))*f(mean(v$size))))
res <- log(p/n)/(D + q/p)
which gives
> res
[1] 0.5676656
DATA
n <- 15
D <- 7
f <- function(x) 91.85*x-181.4
df <- data.frame(replicate = c('1', '1', '2', '2', '3', '3', '4', '4'),
                 size = c(5.1, 4.9, 4.7, 4.6, 5.1, 2.4, 4.3, 4.4),
                 day_emerging = c('6', '7', '6', '7', '6', '8', '7', '6'))
The answer to this is not particularly R-specific, but rather a skill in and of itself: translating formal mathematical language into something that works in R (or Python, Matlab, etc.).
This is a skill that's worth developing. In Python-like pseudocode:
numerator = math.log((1 / n_0) * sum(A * f(w)))
denominator = D + sum(X * A * f(w)) / sum(A * f(w))
r_prime = numerator / denominator
As you can see, there are a lot of unknown variables that you'll have to set beforehand. Also, things like f(w) will need to be defined as helper functions earlier in the script so they can be used. In general, you just want to be able to break your equation down into small parts that you can verify are correct.
It very much helps to do some unit testing with these things - package the equation as a function (or set of small functions that you'll use together) and feed it data that you've run through the equation and verified in another way - by hand, or by a more familiar package. This way, you only have to worry about expressing it in the correct syntax and will know when you've gotten everything correct.
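Following that advice, the base R answer above can be packaged as a single function. This is a minimal sketch; r_prime and its argument names are chosen here, and it assumes a data frame with the size and day_emerging columns shown in the question:
r_prime <- function(df, N0, D, f) {
  by_day <- split(df, df$day_emerging)
  Af <- sapply(by_day, function(v) nrow(v) * f(mean(v$size)))  # A_x * f(w_x) for each emergence day
  x  <- as.numeric(names(by_day))                              # emergence day x
  log(sum(Af) / N0) / (D + sum(x * Af) / sum(Af))
}
r_prime(df, N0 = 15, D = 7, f = function(w) 91.85 * w - 181.4)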

Using the apcluster package in R, is it possible to "score" unclustered data points?

I am new to R and I have a request that I am not sure is possible. We have a number of retail locations that my boss would like to use affinity propagation to group into clusters. We will not be clustering based on geographic location. Once he has found a configuration he likes, he wants to be able to input other locations to determine which of those set clusters they should fall into.
The only solution I have been able to come up with is to use the same options and re-cluster with the original points and the new ones added in; however, I believe that this might change the outcome.
Am I understanding this right, or are there other options?
Sorry for the late answer, I just incidentally stumbled over your question.
I agree with Anony-Mousse's answer that clustering is the first step and classification is the second. However, I'm not sure whether this is the best option here. Elena601b is obviously talking about a task with truly spatial data, so my impression is that the best approach is to cluster first and then to "classify" new points/samples/locations by looking for the closest cluster exemplar. Here is some code for synthetic data:
## if not available, run the following first:
## install.packages("apcluster")
library(apcluster)
## create four synthetic 2D clusters
cl1 <- cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04))
cl2 <- cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, .05))
cl3 <- cbind(rnorm(20, 0.50, 0.03), rnorm(20, 0.72, 0.03))
cl4 <- cbind(rnorm(25, 0.50, 0.03), rnorm(25, 0.42, 0.04))
x <- rbind(cl1, cl2, cl3, cl4)
## run apcluster() (you may replace the Euclidean distance by a different
## distance, e.g. driving distance, driving time)
apres <- apcluster(negDistMat(r=2), x, q=0)
## create new samples
xNew <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))
## auxiliary predict() function
predict.apcluster <- function(s, exemplars, newdata)
{
    simMat <- s(rbind(exemplars, newdata),
                sel = (1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
    unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
## ... the result is a vector of indices to which exemplar/cluster each
## data sample is assigned
I will probably add such a predict() method in a future release of the package (I am the maintainer of the package). I hope that helps.
Clustering is not a drop-in replacement for classification.
Few clustering algorithms can meaningfully integrate new information.
The usual approach for your problem, however, is simple (a minimal sketch follows the list):
1. Do the clustering.
2. Use the cluster labels as class labels.
3. Train a classifier.
4. Apply the classifier to the new data.
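Here is a minimal sketch of that route, reusing x, xNew and apres from the answer above; class::knn is just one possible classifier, and it is assumed that apcluster's labels(apres, type = "enum") accessor gives the per-point cluster numbers:
library(class)
## class labels from the affinity propagation result
cl <- factor(labels(apres, type = "enum"))
## train a simple k-nearest-neighbour classifier on the clustered points
## and apply it to the new locations
knn(train = x, test = xNew, cl = cl, k = 5)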
