Mann-Whitney U test in R

I have a CSV file with several thousand samples whose gene expression after different treatments should be compared:
ID U1      U2      U3      H1      H2      H3
1  5.95918 6.07211 6.01437 5.89113 5.89776 5.95443
2  6.56789 5.98897 6.67844 5.78987 6.01789 6.12789
...
I was asked to do a Mann-Whitney U test, and R gives me results when I use this:
results <- apply(data, 1, function(x) { wilcox.test(x[1:3], x[4:6])$p.value })
However, I only get values like 0.1 or 0.5.
When I added alternative = "greater", I got values like 0.35000 or 0.05000, and a few samples got p-values like 0.14314 (that's a value I am okay with).
So I am wondering why R gives me such strange p-values (0.35000, ...) and how I can fix it to get "normal" p-values.

You are doing a non-parametric test, where the test statistic is derived from the ranks. With 3 observations per group there are only choose(6, 3) = 20 possible rank assignments, so the exact test statistic, and hence the p-value, can take just a few distinct values. (The occasional in-between values such as 0.14314 come from rows with tied values, where wilcox.test falls back on a normal approximation.)
Example:
set.seed(42)
x <- matrix(rnorm(3000), ncol=6)
ps <- apply(x, 1, function(a) wilcox.test(a[1:3], a[4:6])$p.value)
table(ps)
# ps
#  0.1  0.2  0.4  0.7    1
#   54   45  108  141  152
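
Since the two-sided exact p-value for group sizes 3 and 3 is a function of the Wilcoxon distribution, the full set of attainable values can be enumerated with pwilcox. A minimal sketch (the doubling and capping mirrors what wilcox.test does for the exact two-sided test):
p <- sapply(0:9, function(w)   # W can only take the values 0:9 for 3 vs. 3
  min(1, 2 * min(pwilcox(w, 3, 3), 1 - pwilcox(w - 1, 3, 3))))
sort(unique(p))
# [1] 0.1 0.2 0.4 0.7 1.0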

Related

Choose thresholds for 100% sensitivity in glm and lda (WBCD, R)

I'm working on the Wisconsin Breast Cancer Dataset; my aim is to build a model that features good accuracy and 100% sensitivity. I know that in order to achieve this, I have to work with the thresholds. The problem is that I don't understand how thresholds work and how I can properly choose them.
I'm studying from the famous Intro to SL (with Applications in R) book, but I'm not able to find an explanation of how to choose the threshold in chapter 4.
Here is the code I've written so far:
df <- subset(df, select = -c(X, id))  # Selecting features
set.seed(4)
# Train and test split
nrows <- NROW(df)
index <- sample(1:nrows, 0.7 * nrows)
traindf <- df[index, ]
testdf <- df[-index, ]
glm.fit <- glm(diagnosis ~ ., data = traindf, family = binomial)
glm.probs <- predict(glm.fit, testdf, type = "response")
glm.pred <- rep("B", dim(testdf)[1])
glm.pred[glm.probs > .5] <- "M"
table(glm.pred, testdf[, 1])
Now, this gives me
glm.pred   B   M
       B 108   3
       M   4  56
What I want is a 0 in the top right of the table (malignant cases predicted as benign), but changing the threshold doesn't get me there.
How can I fix the problem?
The same happens with the lda function (which I omit here).
Thanks
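
One way to look at it, sketched below under the assumption that glm.probs holds the fitted probabilities of the malignant class and that testdf$diagnosis is the true label: sweep candidate thresholds and keep those that yield zero false negatives on the test set. Note that lowering the threshold trades specificity for sensitivity, and that a threshold should really be tuned on the training or a validation set, not on the test set.
# Sweep thresholds and compute the sensitivity (true positive rate) of each.
thresholds <- seq(0.01, 0.99, by = 0.01)
sens <- sapply(thresholds, function(t) {
  pred <- ifelse(glm.probs > t, "M", "B")
  sum(pred == "M" & testdf$diagnosis == "M") / sum(testdf$diagnosis == "M")
})
thresholds[sens == 1]  # thresholds achieving 100% sensitivity, if any exist
If no threshold reaches sensitivity 1 short of predicting everything as "M", the model itself assigns some malignant cases very low probabilities, and no amount of threshold tuning will fix that.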

Grouping in R changes mean substantially

I have a file containing the predictions of two models (A and B) on a binary classification problem. Now I'd like to understand how well they predict the observations they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are, with an identical number of observations in each group. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups = 100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file = "conf.csv", sep = ",", header = TRUE)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels = FALSE, ordered_result = TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels = FALSE, ordered_result = TRUE)
# Test of original mean
mean(conf$model_a_acc)  # 0.78
mean(conf$model_b_acc)  # 0.777
# Test for mean in aggregated data; they should be similar
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN = mean)))  # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN = mean)))  # 0.7526
table(conf$conf_group_model_a)
#    1    2    3    4    5    6    7    8    9   10
# 2515 2628 2471 2128 1792 1321  980  627  398  140
The groups you are using are unbalanced, as the table above shows. Taking the mean of each of those groups with tapply is fine; however, simply taking the unweighted mean of those group means afterwards is not the way to go.
You need to weight the group means by their sizes if you want to follow that process.
Something like this does it:
weighted.mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN = mean),
              table(conf$conf_group_model_a))  # equals mean(conf$model_a_acc), i.e. 0.78
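
If the intent was groups of identical size (as stated in the question) rather than the equal-width bins that cut(x, n) produces, a quantile-based cut is one option. A sketch, assuming ties in the confidence scores are rare enough that the decile breakpoints are distinct:
# Equal-count groups via quantile breakpoints (deciles for n_groups = 10)
qbreaks <- quantile(conf$model_a_conf, probs = seq(0, 1, length.out = n_groups + 1))
conf$conf_group_model_a <- cut(conf$model_a_conf, breaks = qbreaks,
                               include.lowest = TRUE, labels = FALSE)
table(conf$conf_group_model_a)  # group sizes should now be roughly equal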

R: contingency table for two random variables

I have two Beta variables, B1 ~ Beta(8, 5) and B2 ~ Beta(4, 7), and I have generated 1000 samples from each of them.
If I had to use a two-way contingency table to test for the independence of the two sets of samples:
1) Would it be appropriate to use a contingency table?
2) Any pointers on how to approach this correctly?
(I created a frequency table, but all I see is this: all the samples of B1 arranged in rows, all the samples of B2 arranged in columns, and 0 written in almost every cell.)
Beta random variables can take any value continuously from 0 to 1, so a contingency table of the raw samples does not make much sense.
You could look at the covariance, or a plot, or bin the data and then look at a contingency table. Perhaps something like:
> set.seed(1)
> B1 <- rbeta(1000, shape1=8, shape2=5)
> B2 <- rbeta(1000, shape1=4, shape2=7)
> cov(B1,B2)
[1] 0.0003400774
> plot(B1, B2)
> CT <- table(cut(B1,4), cut(B2,4))
> print(CT)
                (0.0518,0.246] (0.246,0.44] (0.44,0.635] (0.635,0.829]
  (0.214,0.401]             15           30           11             3
  (0.401,0.587]             77          173           83            12
  (0.587,0.774]            106          231          126            20
  (0.774,0.96]              25           54           30             4
> chisq.test(CT)

        Pearson's Chi-squared test

data:  CT
X-squared = 2.4747, df = 9, p-value = 0.9816

Warning message:
In chisq.test(CT) : Chi-squared approximation may be incorrect
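
The warning arises because some of the expected cell counts are small (the corner cells in particular). If the approximation is a concern, one option is to ask chisq.test for a Monte Carlo p-value instead, which does not rely on the chi-squared approximation:
> chisq.test(CT, simulate.p.value = TRUE, B = 10000)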

Simulating a t-test in R

I am looking for a way to simulate the power of a simple t-test at different sample sizes. My idea is to generate 400 random samples from a normal distribution, each with mean 5 and variance 1, and perform a t-test on each one of the hypothesis that the true mean is 4, i.e. the t-test takes the form:
t = (mean(x) - 4) * sqrt(n) / sd(x)  # for each sample x consisting of n observations
For comparison I would like the first 100 samples to consist of 10 observations, the next 100 of 100, the next 100 of 1000, and the last 100 of 5000, which I think is the upper limit. A t-test has to be performed on each and every sample.
Lastly, I would like to see for what percentage of each sample group (call them n10, n100, n1000, n5000, depending on how many observations they comprise) my (false) hypothesis is rejected.
Could you please help me write the corresponding R code? I know the individual commands but have trouble putting it all together. This is a nice exercise, and hopefully I shall then be able to modify it a bit and use it for different purposes as well.
Thank you in advance.
Here's a one-liner for 400 t-tests with n = 10:
simulations <- replicate(400, t.test(rnorm(10, mean = 5, sd = 1), mu = 4),
                         simplify = FALSE)
Then you can analyze it:
table(sapply(simulations, "[[", "p.value") < .05)
# FALSE  TRUE
#    75   325
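
That is, the empirical power at n = 10 is the rejection proportion:
mean(sapply(simulations, "[[", "p.value") < .05)  # 325/400 = 0.8125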
I'm still learning R, too, so handle with care:
n <- 5
N <- 100
samplesizes <- as.integer(10^(1:n))  # 10, 100, ..., 100000
set.seed(1)
# Generate N replicates of each sample size. (The roles of the true mean and
# the hypothesized mean are swapped relative to the question; since their
# distance is still 1, the power is the same.)
samples <- replicate(N, mapply(rnorm, samplesizes, mean = 4, sd = sqrt(1)))
# Perform t-tests
t.tests <- lapply(samples, function(x) t.test(x, mu = 5, alternative = "two.sided"))
# Get p-values
t.test.pvalues <- sapply(t.tests, function(x) x$p.value)
rejected <- t.test.pvalues < .05
sampleIndices <- rep(1:n, N)
res <- aggregate(rejected, list(sample = sampleIndices), FUN = function(x) sum(x) / length(x))
names(res)[2] <- "percRejected"
print(res, row.names = FALSE)
# sample percRejected
#      1         0.84
#      2         1.00
#      3         1.00
#      4         1.00
#      5         1.00
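
As a sanity check, the analytic power at the smallest sample size agrees with the simulated value:
# Theoretical power of a two-sided one-sample t-test with n = 10,
# a true-mean offset of 1, and sd = 1:
power.t.test(n = 10, delta = 1, sd = 1, type = "one.sample")$power
# approximately 0.80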

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the confusionMatrix function, like so:
table(predicted, actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always cover all the possibilities (e.g. only A, B, D). The resulting output of the table function does not include the missing outcome and looks like this:
    A   B   C   D
A  n1  n2  n3  n4
B  n5  n6  n7  n8
D  n9 n10 n11 n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows filled with zeros, or use the confusionMatrix function differently so that it treats missing outcomes as zero?
As a note: since I am randomly selecting the data I test with, there are times when a category is also missing from the actual results, not just the predicted ones. I don't believe this changes the solution.
You can use union to ensure that both factors share the same levels:
library(caret)
# Sample data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5)  # contains levels 1, 2, 3, 4, 5, 6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4)  # contains levels 1, 2, 3, 4
u <- sort(union(predicted, reference))  # sorted, so the levels come out in order
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
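
Since the levels are now unified, the factors can also be passed to confusionMatrix directly, without building the table first:
confusionMatrix(factor(predicted, u), factor(reference, u))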
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with table objects. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spat an error at me because its dependencies weren't right in the first place) is why I'd suggest creating your own function:
# Create a confusion matrix from the given outcomes, whose rows correspond
# to the predicted and the columns to the actual classes.
createConfusionMatrix <- function(act, pred) {
  # You've mentioned that neither actual nor predicted may give a complete
  # picture of the available classes, hence:
  numClasses <- max(act, pred)
  # Sort predicted and actual as it simplifies what's next. You can make this
  # faster by storing `order(act)` in a temporary variable.
  pred <- pred[order(act)]
  act  <- act[order(act)]
  # split() groups the predictions by actual class; tabulate() then counts
  # the predicted classes within each group, one column per actual class.
  sapply(split(pred, act), tabulate, nbins = numClasses)
}
# Generate random data since you've not provided an actual example.
actual    <- sample(1:4, 1000, replace = TRUE)
predicted <- sample(c(1L, 2L, 4L), 1000, replace = TRUE)  # class 3 never predicted
print(createConfusionMatrix(actual, predicted))
which will give you:
      1  2  3  4
[1,] 85 87 90 77
[2,] 78 78 79 95
[3,]  0  0  0  0
[4,] 89 77 82 83
Note the all-zero third row: class 3 appears among the actual values but was never predicted.
I had the same problem and here is my solution:
tab <- table(my_prediction, my_real_label)
if (nrow(tab) != ncol(tab)) {
  missings <- setdiff(colnames(tab), rownames(tab))
  # Zero-filled rows for the classes that were never predicted
  missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(tab))
  rownames(missing_mat) <- missings
  tab <- as.table(rbind(as.matrix(tab), missing_mat))
  # Reorder the rows so that they line up with the columns
  tab <- tab[colnames(tab), ]
}
my_conf <- confusionMatrix(tab)
Cheers
Cankut
