R : contingency table for two random variables - r

I have two Beta variables B1(8, 5) and B2(4, 7) and I have generated 1000 samples for each of them.
If I had to use a 2-Way contingency table to test for the independence of the two sets of samples,
1) would it be appropriate to use a contingency table?
2) Any pointers to approach this correctly?
(I created a frequency table, but all I see is the samples of X arranged in rows, the samples of Y arranged in columns, and 0 written in nearly every cell.)

Beta random variables are continuous on (0, 1), so a contingency table built directly on the raw samples does not make much sense: with 1000 essentially distinct values per variable, each cell holds at most one pair, which is why you see 0 almost everywhere. You could look at the covariance, or a plot, or bin the data and then look at a contingency table. Perhaps something like
> set.seed(1)
> B1 <- rbeta(1000, shape1=8, shape2=5)
> B2 <- rbeta(1000, shape1=4, shape2=7)
> cov(B1,B2)
[1] 0.0003400774
> plot(B1, B2)
> CT <- table(cut(B1,4), cut(B2,4))
> print(CT)
                (0.0518,0.246] (0.246,0.44] (0.44,0.635] (0.635,0.829]
  (0.214,0.401]             15           30           11             3
  (0.401,0.587]             77          173           83            12
  (0.587,0.774]            106          231          126            20
  (0.774,0.96]              25           54           30             4
> chisq.test(CT)
Pearson's Chi-squared test
data: CT
X-squared = 2.4747, df = 9, p-value = 0.9816
Warning message:
In chisq.test(CT) : Chi-squared approximation may be incorrect
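The warning is about the chi-squared approximation: several bins have small expected counts. One option among several (a sketch, reusing the same simulated data as above) is to ask chisq.test for a Monte Carlo p-value, which sidesteps the asymptotic approximation entirely:

```r
set.seed(1)
B1 <- rbeta(1000, shape1 = 8, shape2 = 5)
B2 <- rbeta(1000, shape1 = 4, shape2 = 7)
CT <- table(cut(B1, 4), cut(B2, 4))

# Monte Carlo p-value: simulate B tables under independence instead of
# relying on the chi-squared approximation, so no warning is emitted
res <- chisq.test(CT, simulate.p.value = TRUE, B = 2000)
res
```

With simulated p-values the degrees of freedom are reported as NA; the conclusion here (no evidence against independence) is unchanged.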


0 observations of N variables

I have a dataset, which looks like:
y  Age   Height
0  Aage  Aheight
1  Bage  Bheight
All variables are divided into at least two categories.
When I open the dataset with the code:
DM_input = read.csv(file="C:/Users/user/Desktop/test.CSV",header = TRUE, sep = ",")
R correctly shows: 5040 observations of 11 variables.
When I try to split the dataset into test and train sets with the following code:
> train <- DM_input[DM_input$rand <= 0.7, c(2,3,4,5,6,7,8,9,10)]
> test <- DM_input[DM_input$rand > 0.7, c(2,3,4,5,6,7,8,9,10)]
I get 0 observations of 11 variables, and the tables are empty.
I do not understand why that is happening; I removed special characters, but it did not help.
Thanks
I think that sample.int could help you split the dataset.
Here is an example:
data(iris)
# number of rows of dataset
size_iris <- nrow(iris)
# set the proportion of sample split to 0.7
size_sample <- floor(0.7*size_iris)
# set a reproducible random result
set.seed(2020)
# sample the dataset
mysample <- sample.int(n = size_iris, size = size_sample, replace = FALSE)
train <- iris[mysample,]
test <- iris[-mysample,]
# checking sizes
size_iris
[1] 150
nrow(train)
[1] 105
nrow(test)
[1] 45
There is a similar question here with a lot of good answers: How to split data into training/testing sets using sample function
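As an aside on the original symptom: getting 0 observations from those subsets is exactly what happens when the column used in the filter does not exist, since `DM_input$rand` is then NULL and the comparison has length zero. A minimal hypothetical reproduction (column names invented for illustration):

```r
# Hypothetical reproduction: a data frame without a "rand" column
DM_input <- data.frame(y = c(0, 1), Age = c(30, 40), Height = c(170, 180))

DM_input$rand          # NULL: the column does not exist
DM_input$rand <= 0.7   # logical(0): a zero-length condition

# Subsetting by a zero-length condition keeps no rows
train <- DM_input[DM_input$rand <= 0.7, ]
nrow(train)            # 0
```

So it is worth checking `names(DM_input)` before filtering; a typo in the column name produces empty tables with no error or warning.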

McNemar test in R - sparse data

I'm attempting to run a good-sized dataset through R, using the McNemar test to determine whether there is a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table such as
    test1
      y  n
  y  34  2
  n  12 16
but if I try and run something more like:
  34  0
  12  0
it errors, telling me 'x' and 'y' must have the same number of levels (minimum 2).
I should clarify that I've tried converting my wide data to a 2x2 matrix using the table function, but rather than appearing as above, it drops the final column, giving me:
    test1
      y
  y  34
  n  12
I've also run mcnemar.test using the factor-object option, which gives me the same error, so I'm assuming it does something similar internally. I'm wondering whether there is either a way to force the table function to generate the second column despite there being no observations in those categories, or a way to make the test overlook this missing data?
Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
#     x2
# x1    y   n
#   y   0  34
#   n   0  12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08
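Since one of the discordant cells is 0, the continuity-corrected chi-squared is still an approximation. An exact alternative (not part of the original answer, just a common companion to McNemar's test) is a binomial test on the discordant pairs:

```r
# Exact McNemar: under H0 the 34 + 0 discordant pairs split 50/50,
# so test 34 successes out of 34 trials against p = 0.5
bt <- binom.test(34, 34 + 0, p = 0.5)
bt
```

This needs only the two off-diagonal counts, so it works even when table() refuses to build a square 2x2.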

Difference between runif and sample in R?

In terms of the probability distributions they use: I know that runif gives fractional numbers and sample gives whole numbers, but what I am interested in is whether sample also uses the uniform probability distribution.
Consider the following code and output:
> set.seed(1)
> round(runif(10,1,100))
[1] 27 38 58 91 21 90 95 66 63 7
> set.seed(1)
> sample(1:100, 10, replace=TRUE)
[1] 27 38 58 91 21 90 95 67 63 7
This strongly suggests that when asked to do the same thing, the two functions give essentially the same output (note they differ at the 8th value, 66 vs 67, which the Edit below resolves). The main differences are in the defaults: if you don't change those defaults, both give something called a uniform distribution, though for sample it is a discrete uniform and, by default, without replacement.
Edit
The more correct comparison is:
> ceiling(runif(10,0,100))
[1] 27 38 58 91 21 90 95 67 63 7
instead of using round.
We can even step that up a notch:
> set.seed(1)
> tmp1 <- sample(1:100, 1000, replace=TRUE)
> set.seed(1)
> tmp2 <- ceiling(runif(1000,0,100))
> all.equal(tmp1,tmp2)
[1] TRUE
Of course if the probs argument to sample is used (with not all values equal), then it will no longer be uniform.
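One caveat worth adding (not in the original answer): since R 3.6.0 the default algorithm behind sample() changed from "Rounding" to "Rejection", so the exact match with ceiling(runif()) above only reproduces under the old behaviour:

```r
# R >= 3.6.0: opt back into the pre-3.6.0 sampling algorithm
# (this emits a deprecation warning; it is for reproduction only)
suppressWarnings(RNGkind(sample.kind = "Rounding"))

set.seed(1)
tmp1 <- sample(1:100, 1000, replace = TRUE)
set.seed(1)
tmp2 <- ceiling(runif(1000, 0, 100))
all.equal(tmp1, tmp2)

# restore the current default
RNGkind(sample.kind = "Rejection")
```

Under the current default, the two streams no longer match element for element, although both remain discrete uniform draws.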
sample draws from a fixed, finite set of values; if a length-1 numeric n is passed as the first argument, it samples from 1:n and returns integers.
On the other hand, runif returns a sample from a real-valued range.
> sample(c(1,2,3), 1)
[1] 2
> runif(1, 1, 3)
[1] 1.448551
sample() runs faster than ceiling(runif())
This is useful to know if doing many simulations or bootstrapping.
Crude time-trial script that benchmarks 4 equivalent calls:
n <- 100    # sample size
m <- 10000  # simulations
system.time(sample(n, size = n*m, replace = TRUE))  # faster than ceiling/runif
system.time(ceiling(runif(n*m, 0, n)))
system.time(ceiling(n * runif(n*m)))
system.time(floor(runif(n*m, 1, n+1)))
The proportional time advantage increases with n and m, but watch that you don't fill memory!
BTW, don't use round() to convert a uniformly distributed continuous variable to uniformly distributed integers, since the endpoint values get selected only half as often as they should.
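The endpoint bias from round() is easy to see empirically (a quick sketch with an arbitrary seed): values in [1, 1.5) round to 1, an interval only half as wide as the one mapping to an interior integer such as 50.

```r
set.seed(1)
x <- round(runif(1e6, 1, 100))
counts <- table(x)

# the endpoint 1 is hit about half as often as an interior value like 50
counts[["1"]] / counts[["50"]]  # roughly 0.5
```

The same halving happens at the upper endpoint 100, which is why ceiling or floor with shifted bounds is the right conversion.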

Mann-Whitney-U test

I have a csv with several thousand samples whose gene expression after different treatments should be compared:
ID  U1       U2       U3       H1       H2       H3
1   5.95918  6.07211  6.01437  5.89113  5.89776  5.95443
2   6.56789  5.98897  6.67844  5.78987  6.01789  6.12789
...
I was asked to do a Mann-Whitney U test, and R gives me results when I use this:
results <- apply(data, 1, function(x) { wilcox.test(x[1:3], x[4:6])$p.value })
However, I just get values like 0.1 or 0.5.
When I added alternative="greater" I got values like 0.35000 or 0.05000, and a few samples got p-values like 0.14314 (the kind of value I would expect).
So I am wondering why R is giving me such coarse p-values (0.35000, ...) and how I can fix it to get "normal" p-values.
You are doing a non-parametric test whose test statistic is derived from the ranks. With a sample size of 3 per group, there are only a few possible distinct values of the test statistic, and hence only a few possible p-values.
Example:
set.seed(42)
x <- matrix(rnorm(3000), ncol=6)
ps <- apply(x, 1, function(a) wilcox.test(a[1:3], a[4:6])$p.value)
table(ps)
# ps
#  0.1  0.2  0.4  0.7    1
#   54   45  108  141  152
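To see why 0.1 is the floor: with 3 observations per group there are choose(6, 3) = 20 equally likely rank assignments under the null hypothesis, so the most extreme possible arrangement has an exact two-sided p-value of 2/20 = 0.1:

```r
choose(6, 3)                   # 20 possible ways to assign the ranks
p <- wilcox.test(1:3, 4:6)$p.value
p                              # the most extreme case: 2/20 = 0.1
```

With such small groups the only fix for the coarse p-values is more samples per group, not a different test option.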

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been passing my predictions as well as my actual values to the table function to build the table used by confusionMatrix, like so:
table(predicted, actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always cover all the possibilities (e.g. only A, B, D). The resulting output of the table function omits the missing outcome and looks like this:
   A    B    C    D
A  n1   n2   n3   n4
B  n5   n6   n7   n8
D  n9   n10  n11  n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows with zeros or use the confusionMatrix function differently so it will view missing outcomes as zero?
As a note: Since I am randomly selecting my data to test with, there are times that a category is also not represented in the actual result as opposed to just the predicted. I don't believe this will change the solution.
You can use union to ensure both factors share the same levels:
library(caret)
# Sample Data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5) # Levels 1,2,3,4,5,6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4) # Levels 1,2,3,4
u <- union(predicted, reference)
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with a table object. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spat an error at me because its dependencies weren't set up right in the first place) is why I'd suggest creating your own function:
# Create a confusion matrix from the given outcomes. With the
# sapply/split combination below, the columns correspond to the actual
# classes and the rows to the predicted classes.
createConfusionMatrix <- function(act, pred) {
  # Neither actual nor predicted alone may give a complete picture of
  # the available classes, hence:
  numClasses <- max(act, pred)
  # Sort predicted by actual as it simplifies what's next. You can make
  # this faster by storing `order(act)` in a temporary variable.
  pred <- pred[order(act)]
  act  <- act[order(act)]
  sapply(split(pred, act), tabulate, nbins = numClasses)
}

# Generate random data since you've not provided an actual example.
actual    <- sample(1:4, 1000, replace = TRUE)
predicted <- sample(c(1L, 2L, 4L), 1000, replace = TRUE)
print( createConfusionMatrix(actual, predicted) )
which will give you:
       1   2   3   4
[1,]  85  87  90  77
[2,]  78  78  79  95
[3,]   0   0   0   0
[4,]  89  77  82  83
I had the same problem and here is my solution:
tab <- table(my_prediction, my_real_label)
if (nrow(tab) != ncol(tab)) {
  missings <- setdiff(colnames(tab), rownames(tab))
  missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(tab))
  tab <- as.table(rbind(as.matrix(tab), missing_mat))
  # note: this assumes the missing levels sort after the existing row
  # levels, so that rows and columns end up in the same order
  rownames(tab) <- colnames(tab)
}
my_conf <- confusionMatrix(tab)
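Both fixes above boil down to the same idea: give table() factors that share one complete level set, so the result is square by construction, with explicit zero rows for unseen classes. A caret-free sketch of that idea (the class labels here are hypothetical):

```r
predicted <- c("A", "B", "D", "A", "B")  # class C is never predicted
actual    <- c("A", "B", "C", "D", "A")

# build the full level set from both vectors, then tabulate
levs <- sort(union(predicted, actual))
tab <- table(factor(predicted, levels = levs),
             factor(actual,   levels = levs))

dim(tab)  # 4 x 4, with an all-zero row for "C"
```

The resulting square table can then be passed straight to confusionMatrix.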
