coin::wilcox_test versus wilcox.test in R - r

In trying to figure out which one is better to use I have come across two issues.
1) The W statistic given by wilcox.test is different from that of coin::wilcox_test. Here's my output:
wilcox_test:
Exact Wilcoxon Mann-Whitney Rank Sum Test
data: data$variableX by data$group (yes, no)
Z = -0.7636, p-value = 0.4489
alternative hypothesis: true mu is not equal to 0
wilcox.test:
Wilcoxon rank sum test with continuity correction
data: data$variable by data$group
W = 677.5, p-value = 0.448
alternative hypothesis: true location shift is not equal to 0
I'm aware that there are actually two values for W and that the smaller one is usually reported. When wilcox.test is called with a comma instead of "~" I can get the other value, but this comes up as W = 834.5. From what I understand, coin::statistic() can return three different statistics ("linear", "standardized", and "test"), where "linear" is the normal W and "standardized" is just W converted to a z-score. None of these match the W I get from wilcox.test, though (linear = 1055.5, standardized = 0.7636288, test = -0.7636288). Any ideas what's going on?
2) I like the options in wilcox_test for "distribution" and "ties.method", but it seems that you can not apply a continuity correction like in wilcox.test. Am I right?

I encountered the same issue when trying to apply Wendt's formula to compute effect sizes with the coin package, and obtained aberrant r values because the linear statistic returned by wilcox_test() is unadjusted.
A great explanation is already given here, so I will simply address how to obtain adjusted U statistics with the wilcox_test() function. Let's use the following data frame:
d <- data.frame( x = c(rnorm(n = 60, mean = 10, sd = 5), rnorm(n = 30, mean = 16, sd = 5)),
g = factor(c(rep("a",times = 60), rep("b",times = 30))) ) # factor(): wilcox_test() needs a factor as grouping variable
We can perform the same test with wilcox.test() and wilcox_test():
library(coin) # provides wilcox_test()
w1 <- wilcox.test( formula = x ~ g, data = d )
w2 <- wilcox_test( formula = x ~ g, data = d )
Which will output two distinct statistics:
> w1$statistic
W
321
> w2@statistic@linearstatistic
[1] 2151
The values are indeed totally different (albeit the tests are equivalent).
To obtain a U statistic identical to that of wilcox.test(), subtract from wilcox_test()'s linear statistic the minimal value that the sum of the ranks of the reference sample can take, which is n_1(n_1+1)/2, where n_1 is the size of the reference sample.
Both commands take the first level of the grouping factor g as the reference (by default the levels are ordered alphabetically).
First get the size of the reference sample:
n1 <- table(w2@statistic@x)[1]
Then
w2@statistic@linearstatistic - n1*(n1+1)/2 == w1$statistic
should return TRUE.
Voilà.
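The relationship can also be checked with base R alone, without coin. The following is a minimal sketch on simulated data (the seed and the helper variable names are my own choices, not part of the original answer): it computes the rank sum of the reference group by hand, subtracts n_1(n_1+1)/2, and compares the result with the W reported by wilcox.test(). The same arithmetic fits the numbers in the question: 1055.5 - 677.5 = 378 = 27 * 28 / 2, and 677.5 + 834.5 = 1512 = 27 * 56, which is consistent with group sizes of 27 and 56.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

d <- data.frame(
  x = c(rnorm(60, mean = 10, sd = 5), rnorm(30, mean = 16, sd = 5)),
  g = factor(rep(c("a", "b"), times = c(60, 30)))
)

w1 <- wilcox.test(x ~ g, data = d)

# The reference sample is the first factor level ("a" here)
is_ref <- d$g == levels(d$g)[1]
n1 <- sum(is_ref)
n2 <- sum(!is_ref)

# Unadjusted rank sum of the reference sample (the "linear" statistic)
rank_sum_ref <- sum(rank(d$x)[is_ref])

# Subtract the smallest possible rank sum to get the W of wilcox.test()
W_by_hand <- rank_sum_ref - n1 * (n1 + 1) / 2

# The "other" W (groups swapped) is the complement with respect to n1 * n2
W_other <- n1 * n2 - W_by_hand

isTRUE(all.equal(unname(w1$statistic), W_by_hand))
```

The last line should evaluate to TRUE; W_other is the value wilcox.test() would report with the two groups swapped.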

It seems that one function reports Mann-Whitney's U and the other the Wilcoxon rank sum, which is defined in several different ways in the literature. The two are essentially equivalent; just compare the p-values. If you want the continuity correction in wilcox.test, use the argument correct = TRUE (this is already the default).
Check https://stats.stackexchange.com/questions/79843/is-the-w-statistic-outputted-by-wilcox-test-in-r-the-same-as-the-u-statistic

Related

What does the output of the function mvrnorm of MASS mean?

Using the mvrnorm() from the MASS package, now we can simulate realizations of multivariate normal distributions. This function works as follows:
library(MASS)
MASS::mvrnorm(
n = 10, # Number of realizations,
mu = c(1, 5), # Parameter vector mu,
Sigma = my_cov_matrix(1, 3, 0.2) # Parameter matrix Sigma
)
What does this output mean? Why are there two columns with ten random variables each?
The task is as follows:
Now, I am to create a function my_mvrnorm(n, mu_1, mu_2, sigma_1, sigma_2, rho), which simulates n realizations of the corresponding multivariate normal distribution, depending on mu and the matrix Sigma, and stores them in a tibble with the column names X and Y. In addition, this tibble is to contain a third column rho, in which all entries are filled with rho.
This should look like the following then:
But I couldn't write a function yet, because I don't quite understand what the values in table X and Y should be. Can someone help me?
Attempt:
my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho){
mu = c(mu_1, mu_2)
sigma = my_cov_matrix(sigma_1, sigma_2, rho)
tb <- tibble(
X = ,
Y = ,
rho = rep(rho, n)
)
return(tb)
}
The n = 10 specification asks for 10 samples. The mu = c(1, 5) specification gives two means, so each sample has two components and you get a 10 x 2 matrix as the result. If you check, the first column has a mean close to 1, and the second a mean close to 5. Is my_cov_matrix defined somewhere else?
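A minimal sketch of how the attempt could be completed: the X column is simply the first column of the matrix returned by mvrnorm() and the Y column is the second. Since my_cov_matrix() is not shown in the question, the version below is a hypothetical stand-in that assumes sigma_1 and sigma_2 are standard deviations and rho is the correlation; if the real helper is defined differently, use that one instead.

```r
library(MASS)    # mvrnorm()
library(tibble)  # tibble()

# HYPOTHETICAL helper (not from the question): assumes sigma_1 and sigma_2
# are standard deviations and rho is the correlation between X and Y.
my_cov_matrix <- function(sigma_1, sigma_2, rho) {
  cov_xy <- rho * sigma_1 * sigma_2
  matrix(c(sigma_1^2, cov_xy,
           cov_xy,    sigma_2^2), nrow = 2)
}

my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho) {
  draws <- MASS::mvrnorm(
    n     = n,
    mu    = c(mu_1, mu_2),
    Sigma = my_cov_matrix(sigma_1, sigma_2, rho)
  )
  # mvrnorm() returns a plain vector when n = 1, so force an n x 2 matrix
  draws <- matrix(draws, ncol = 2)
  tibble(
    X   = draws[, 1],  # realizations of the first component
    Y   = draws[, 2],  # realizations of the second component
    rho = rho          # length-1 value, recycled to n rows
  )
}

set.seed(1)  # arbitrary seed
sim <- my_mvrnorm(n = 1000, mu_1 = 1, mu_2 = 5,
                  sigma_1 = 1, sigma_2 = 3, rho = 0.2)
colMeans(sim[, c("X", "Y")])  # should be close to 1 and 5
```

Each row of the tibble is one realization of the pair (X, Y), which is why mvrnorm() returns two columns for a two-dimensional mu.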

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
data = (Dataframe1 %>%
gather(key = "Column1", value = "response", "Column1", "Column2")),
paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB) .
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
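To make the equivalence between the formula call and the two-vector call concrete, here is a small sketch on made-up data (the values and the seed are hypothetical; only the names DataframeA, DataframeB, Column1 and Column2 come from the question). It runs an unpaired test both ways and checks that the t statistic is the same.

```r
set.seed(1)  # arbitrary seed; the data below are simulated for illustration
DataframeA <- data.frame(Column1 = rnorm(12, mean = 5))
DataframeB <- data.frame(Column2 = rnorm(15, mean = 6))

# Two-vector interface: pass the columns directly
res_vectors <- t.test(DataframeA$Column1, DataframeB$Column2)

# Formula interface: stack both columns into one long data frame first
long <- data.frame(
  response = c(DataframeA$Column1, DataframeB$Column2),
  group    = factor(rep(c("Column1", "Column2"),
                        times = c(nrow(DataframeA), nrow(DataframeB))))
)
res_formula <- t.test(response ~ group, data = long)

# Both calls describe the same comparison, so the statistics agree
all.equal(unname(res_vectors$statistic), unname(res_formula$statistic))
```

Note that the two data frames do not need the same number of rows here, because the test is not paired.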
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did in order to turn it into a Welch's test, I'd say it was to set var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have unequal numbers of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the groups' means (which you can get with m1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)) and n1 and n2 the sample sizes (n2 = length(DataframeB$Column2)).
lx <- n1- 1 # degrees of freedom of group 1 (observations minus one)
ly <- n2- 1 # degrees of freedom of group 2 (observations minus one)
md <- abs(m1-m2) ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d
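The same computation can be wrapped in a small function that takes the two columns directly, so m1, s1, n1 and so on do not have to be created by hand. This is only a sketch: the function name cohens_d and the example values are my own, and it uses the pooled standard deviation exactly as in the snippet above.

```r
# Cohen's d with a pooled standard deviation (same formula as above)
cohens_d <- function(x, y) {
  lx  <- length(x) - 1                 # degrees of freedom, group 1
  ly  <- length(y) - 1                 # degrees of freedom, group 2
  md  <- abs(mean(x) - mean(y))        # absolute mean difference
  csd <- sqrt((lx * var(x) + ly * var(y)) / (lx + ly))  # pooled sd
  md / csd
}

# Made-up example data; only the column names come from the question
DataframeA <- data.frame(Column1 = c(4.1, 5.3, 6.2, 5.8, 4.9))
DataframeB <- data.frame(Column2 = c(6.0, 7.1, 6.6, 7.4, 5.9, 6.8))

d_value <- cohens_d(DataframeA$Column1, DataframeB$Column2)
d_value
```

Because the function works on plain vectors, the two columns can come from different data frames with different numbers of rows.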
This should work for you
res = t.test(DataframeA$Column1, DataframeB$Column2, alternative = "two.sided", var.equal = FALSE)

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable with the highest (signed) Pearson correlation with the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
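One caveat: which.max() on the raw coefficients picks the largest positive correlation, so a strong negative relationship would be overlooked. If the strength of the relationship matters regardless of its sign, a variant that ranks by absolute value may be more suitable. The sketch below is an assumption about what is wanted, with a hypothetical function name and simulated data.

```r
# Variant that ranks by the absolute value of the Pearson correlation
find_max_abs_corr <- function(vars, ref) {
  val <- sapply(vars, cor, y = ref)  # one coefficient per variable
  val[which.max(abs(val))]           # keep the sign, rank by magnitude
}

set.seed(111)  # arbitrary seed; simulated data for illustration only
x <- rnorm(100)
y <- rnorm(100)
ref <- -0.9 * y + 0.1 * rnorm(100)   # ref is strongly NEGATIVELY related to y

res <- find_max_abs_corr(list(x = x, y = y), ref)
res  # selects y, with a negative coefficient
```

With the signed version, x would be chosen whenever its sample correlation with ref happens to be positive, even though y is far more strongly related.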

r statistic hypothesis how to test

Produce 10 random observations from a normal distribution with mean 80 and standard deviation 30. Let's pretend that we do not know the mean μ of the distribution.
Using the sample, test the two hypotheses
H0 : µ = 80 vs H1 : µ ≠ 80.
Repeat the process 100 times and record only the p-value each time.
Use the 5% significance level to comment on your results.
Show all the p-values.
Here is what I did:
t<-c( rnorm(10, mean = 80, sd = 30))
t.test (y, mu = 80)
t.test(y, mu =80, alternative = "greater")$p.value
t.test(y, mu = 80, alternative = "less")$p.value
Notes:
Suppose that the data of a sample are stored in a vector y.
The command
t.test(y, mu = 9)
performs a two-sided hypothesis test; specifically, it tests whether the mean of the distribution the data come from is equal to 9. For a one-sided test the command is
t.test(y, mu = 9, alternative = "greater") or t.test(y, mu = 9, alternative = "less")
accordingly.
These commands give the full results of the test, including the confidence interval. If you want only the p-value, add $p.value at the end of the command. For example, the command
t.test(y, mu = 9)$p.value
gives only the p-value of the two-sided test.
[EDIT: I'm assuming this is for a school assignment and that you are very new to R.]
It is not entirely clear what your question is. However, your code seems to contain some errors.
You create 10 random observations with mean 80 and sd 30. You assign those observations to a vector, t. This is not a smart idea to begin with, because t is the name of the base R function for transposing a matrix; it is not a good idea to mask built-in names like this.
You then perform the test using the t.test command. Note that in R, unlike in, say, Python, the "." does not refer to a method of an object. So when you call t.test(y ... ), you are performing a t-test on a vector of observations y, which you have not defined.
The notes you post assume that your vector of observations is, in fact, called y. That name is just the notes' choice: if you run ?t.test in the R console, you will see that the first argument of t.test is called x (y is the optional second sample), and whatever vector you pass in the first position is used as the sample.
What you probably want is this:
y<-c( rnorm(10, mean = 80, sd = 30))
t.test (y, mu = 80)
t.test(y, mu =80, alternative = "greater")$p.value
t.test(y, mu = 80, alternative = "less")$p.value
But note that you could have used any reasonable variable name for the vector of observations - you would just want to call t.test on the correct vector. For instance,
sample_observations <- c( rnorm(10, mean = 80, sd = 30))
t.test (sample_observations, mu = 80)
[EDIT: There appeared to be unicode in the pasted code snippets. That's fixed now]
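The assignment also asks for the test to be repeated 100 times, keeping only the p-value each time. One possible way to do that is with replicate(); the sketch below is not part of the original answer, and the seed and variable names are arbitrary choices.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Draw a fresh sample of 10 and run the two-sided test, 100 times over
p_values <- replicate(
  100,
  t.test(rnorm(10, mean = 80, sd = 30), mu = 80)$p.value
)

p_values                     # all 100 p-values

# How often is H0 rejected at the 5% level?
n_rejected <- sum(p_values < 0.05)
n_rejected / length(p_values)
```

Since the samples really do come from a distribution with mean 80, H0 is true, and the proportion of rejections should be in the neighbourhood of 5%.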

Z-scores rounded to infinity for small p-values in R

I am working with a genome-wide association study dataset, with p-values ranging from 1E-30 to 1. I have an R data frame "data" which includes a variable "p" for the p-values.
I need to perform genomic correction of the p-values, which I am doing using the following code:
p=data$p
Zsq = qchisq(1-p, 1)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = 1-pchisq(newZsq, 1)
In the command on the second line, where I use the qchisq function to convert p-values to z-scores, z-scores for p-values < 1E-16 are being rounded to infinity. This means the p-values for my most significant data points are rounded to 0 after the genomic correction, and I lose their ranking.
Is there any way around this?
Read help(".Machine"). Then set lower.tail=FALSE and avoid taking differences with 1:
p <- 1e-17
Zsq = qchisq(p, 1, lower.tail=FALSE)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = pchisq(newZsq, 1, lower.tail=FALSE)
#[1] 0.4994993
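To see why this works, note that 1 - p is computed in double precision, where anything closer to 1 than about .Machine$double.eps (roughly 2.2e-16) is stored as exactly 1, so qchisq() receives 1 and returns Inf. Asking for the upper tail directly never forms 1 - p. The sketch below uses a made-up vector of p-values in place of data$p and keeps the 0.456 constant from the question.

```r
# Made-up p-values standing in for data$p, sorted in ascending order
p <- c(1e-30, 1e-20, 1e-10, 0.01, 0.5, 0.9)

# Original approach: 1 - p rounds to exactly 1 for the smallest p-values
naive <- qchisq(1 - p, 1)
naive            # first two entries are Inf

# Upper-tail approach: no subtraction from 1, so precision is retained
Zsq    <- qchisq(p, 1, lower.tail = FALSE)
lambda <- median(Zsq) / 0.456
newZsq <- Zsq / lambda
Newp   <- pchisq(newZsq, 1, lower.tail = FALSE)

Newp             # all finite and distinct
order(Newp)      # same ranking as the original p-values
```

The corrected p-values stay strictly increasing in the original p-values, so the ranking of the most significant data points is preserved.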
