McNemar test in R - sparse data

I'm attempting to run a good sized dataset through R, using the McNemar test to determine whether I have a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table of
   test1
    y  n
  y 34  2
  n 12 16
but if I try and run something more like:
34 0
12 0
it errors, telling me "'x' and 'y' must have the same number of levels (minimum 2)".
I should clarify that I've tried converting my wide data to a 2x2 matrix using the table function, but rather than appearing as above, it drops the final column, giving me:
   test1
    y
  y 34
  n 12
I've also run mcnemar.test using the factor-object option, which gives me the same error, so I'm assuming it does something similar. I'm wondering whether there is a way to force the table function to generate the 2nd column despite there being no observations that would fall under either of those categories, or whether there is a way to make the test overlook this missing data?

Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
#    x2
# x1  y  n
#  y  0 34
#  n  0 12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08
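For what it's worth, once the levels are forced as above, the two-factor interface that errored in the question should work too; a minimal sketch reusing x1 and x2 from the example:
# With explicit levels set, the factor interface no longer complains
# about a missing level and gives the same result as the table call:
mcnemar.test(x1, x2)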

Related

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
       data = (Dataframe1 %>%
                 gather(key = "Column1", value = "response", "Column1", "Column2")),
       paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB).
You can think of each column of a dataframe as a vector (an ordered list of values). The way you have used t.test is with a formula (y ~ x), which essentially says: given the dataframe specified in data, perform a t-test to assess the significance of the difference in means of the variable response between the paired groups defined by Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
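To make that concrete, here is a small sketch with made-up numbers (g1, g2, and long are hypothetical names) showing that the formula interface and the two-vector interface give the identical result:
# Hypothetical data: the same two-sample t-test written both ways
g1 <- c(5.1, 4.9, 6.2, 5.8)
g2 <- c(4.8, 4.7, 5.9, 5.2)
long <- data.frame(response = c(g1, g2),
                   group = rep(c("g1", "g2"), each = 4))
t.test(response ~ group, data = long)  # formula interface
t.test(g1, g2)                         # two-vector interface; same output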
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did to turn it into a Welch's test, the key was setting var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have an unequal number of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this Stack Exchange question, from which I copy the code:
For context, m1 and m2 are the group means (m1 <- mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 <- sd(DataframeB$Column2)), and n1 and n2 the sample sizes (n2 <- length(DataframeB$Column2))
lx <- n1 - 1                  # degrees of freedom for group 1
ly <- n2 - 1                  # degrees of freedom for group 2
md <- abs(m1 - m2)            ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd / (lx + ly)
csd <- sqrt(csd)              ## common sd computation
cd <- md / csd                ## cohen's d
This should work for you:
res <- t.test(DataframeA$Column1, DataframeB$Column2, alternative = "two.sided", var.equal = FALSE)
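Putting it all together, here is a hedged end-to-end sketch; the two data frames below are made-up stand-ins for your DataframeA and DataframeB:
set.seed(42)
# Hypothetical stand-ins for the real data frames
DataframeA <- data.frame(Column1 = rnorm(30, mean = 10))
DataframeB <- data.frame(Column2 = rnorm(40, mean = 11))
# Welch's t-test (unpaired, unequal variances)
res <- t.test(DataframeA$Column1, DataframeB$Column2,
              alternative = "two.sided", var.equal = FALSE)
# Ingredients for Cohen's d, as in the snippet above
m1 <- mean(DataframeA$Column1); s1 <- sd(DataframeA$Column1); n1 <- nrow(DataframeA)
m2 <- mean(DataframeB$Column2); s2 <- sd(DataframeB$Column2); n2 <- nrow(DataframeB)
csd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 - 1 + n2 - 1))  # common sd
cd  <- abs(m1 - m2) / csd                                             # Cohen's d
c(p_value = res$p.value, cohens_d = cd)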

cor function with NA values due to 0 variance

Beginner R user here. I am using the cor function to get Kendall's tau-b rank correlation coefficient between 2 columns of a dataframe. Examples of such columns are as follows:
A B
1 1
1 2
1 3
When I use cor(d, method = "kendall"), the result is NA for the correlation between A and B. Shouldn't it be 0? And if not, is there a way to replace this NA result with 0 using a parameter in the cor function?
Consider what would happen if we slightly perturb the constant column. We get vastly different solutions depending on the particular perturbation used. In fact we can get any correlation we like with different perturbations. As a result it really makes no sense to use any particular value for the correlation and it would be best left as NA.
x <- c(1, 1, 1)
y <- 1:3
cor(x + (1:3) * 1e-10, y, method = "spearman")
## [1] 1
cor(x - (1:3) * 1e-10, y, method = "spearman")
## [1] -1
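The same instability appears with Kendall's method, which the question actually uses, so the NA is not specific to Spearman:
cor(x + (1:3) * 1e-10, y, method = "kendall")
## [1] 1
cor(x - (1:3) * 1e-10, y, method = "kendall")
## [1] -1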

Minimal depth interaction from randomForestExplainer package

When using the minimal depth interaction feature of the randomForestExplainer package in R, I'm getting some hard-to-interpret results.
I simulated some data (x1, x2,..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n * p) + 5, nrow = n)
colnames(xrandom) <- paste0("x", 2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace=T))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1==1] <- y[d$x1==1]+5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data = d, localImp = TRUE)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
  variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1       x1            x1       4.670732           0       x1:x1              1.703252
2       x1            x2       2.606190         221       x2:x1              1.703252
So, my question is: if x1:x1 has 0 occurrences (which is expected), then how can it also have a mean_min_depth?
Surely if it has 0 occurrences, then it can't possibly have a minimum depth? [or rather, the min depth = 0 or NA]
What's going on here? Am I misinterpreting something?
Thanks
My understanding is that this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces NAs with the mean depth of the maximal subtree whose root is x1. Details below.
What is this argument mean_sample for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees: for an interaction that only shows up in a small number of trees, taking the mean of the conditional minimum depth ignores the fact that the interaction is not that important. In this case, a small mean conditional minimum depth doesn't mean the interaction matters. To address this, specifying mean_sample = "all_trees" replaces the conditional minimum depth for the interaction of interest by the mean depth of the maximal subtree of the root variable. Basically, if we are looking at the interaction x1:x2, then for a tree where this interaction is absent we fill in the depth of the deepest maximal subtree whose root is x1. This gives a (hopefully large) numeric value to mean_min_depth of x1:x2, thus making it look less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. This is the default choice for mean_sample. My understanding is that it's similar to all_trees but tries to down-weight the contribution of the replaced missing values. The motivation is that all_trees pulls mean_min_depth toward the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of the replaced values, top_trees calculates the mean conditional minimal depth on only a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Say in your example that, out of those 500 trees, only 300 contain any interaction x1:whatever; then we only consider those 300 trees when filling in a value for x1:x1. Because there are 0 occurrences of this interaction, replacing 500 NAs vs replacing 300 NAs with the same value doesn't affect the mean, so it's the same value 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
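If a picture helps, a hedged sketch of how to visualize the effect of these choices, assuming the plot_min_depth_interactions helper exported by randomForestExplainer:
# Plot an interactions frame to see how the chosen mean_sample option
# ranks the interactions; rerun min_depth_interactions with each option
# and compare the resulting plots.
plot_min_depth_interactions(interactions_frame)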
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to but different from R - How to perform the same operation on multiple variables and nearly the same as this question Apply t-test on many columns in a dataframe split by factor but the solution of that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
head((df <- data.frame(a,b)))
  a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity correction

data:  X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2557295  0.1638177
sample estimates:
   prop 1    prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
  replicate(1e4, {
    a <- sample(letters[1:4], 100, replace = T)
    b <- sample(0:1, 100, replace = T)
    df <- data.frame(a, b)
    cTbls <- lapply(unique(a), function(x) table(df$a == x, df$b))
    results <- lapply(cTbls, prop.test, correct = FALSE)
    # sapply (not lapply) so that any() receives a logical vector, not a list
    any(sapply(results, function(x) x$p.value < .05))
  })
) / 1e4
[1] 0.1642
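One hedged way to apply such a correction to the results list built above is base R's p.adjust:
# Collect the raw p-values and adjust them; Holm's method controls the
# family-wise error rate ("BH" would control the false discovery rate)
pvals <- sapply(results, function(x) x$p.value)
p.adjust(pvals, method = "holm")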
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
# as.numeric() guards against dfNew[1, x] being a 1x1 tibble rather than a scalar
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, as.numeric(dfNew[1, x])),
           rep(0, length(b) - as.numeric(dfNew[1, x]))), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new) - 1), function(x)
  t.test(new[, x], new[, ncol(new)]))
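As a side note, the dummies can be avoided entirely: a logical comparison already acts as the 0/1 indicator for a level, so a hedged base R alternative is to loop over the levels directly:
# One two-sample t-test per level of a, no dummy columns needed;
# (df$a == l) is the 0/1 indicator for level l
lapply(unique(df$a), function(l) t.test(as.numeric(df$a == l) ~ df$b))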

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the confusionMatrix function as so:
table(predicted,actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not include the missing outcome and looks like this:
   A   B   C   D
A  n1  n2  n3  n4
B  n5  n6  n7  n8
D  n9 n10 n11 n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows with zeros or use the confusionMatrix function differently so it will view missing outcomes as zero?
As a note: Since I am randomly selecting my data to test with, there are times that a category is also not represented in the actual result as opposed to just the predicted. I don't believe this will change the solution.
You can use union to ensure that both factors carry the same levels, so that table produces a square matrix and missing combinations simply get zero counts:
library(caret)
# Sample Data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5) # Levels 1,2,3,4,5,6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4) # Levels 1,2,3,4
u <- union(predicted, reference)
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
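Because u contains every class observed in either vector, the two factor calls share an identical level set, so table returns a square matrix in which the classes one side never takes show up as all-zero rows or columns, which confusionMatrix accepts.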
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with table objects. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spat an error at me because they don't get the dependencies right in the first place) is why I'd suggest creating your own function:
# Create a confusion matrix from the given outcomes, whose rows correspond
# to the predicted and the columns to the actual classes.
createConfusionMatrix <- function(act, pred) {
  # You've mentioned that neither actual nor predicted may give a complete
  # picture of the available classes, hence:
  numClasses <- max(act, pred)
  # Sort predicted and actual as it simplifies what's next. You can make this
  # faster by storing `order(act)` in a temporary variable.
  pred <- pred[order(act)]
  act  <- act[order(act)]
  sapply(split(pred, act), tabulate, nbins = numClasses)
}
# Generate random data since you've not provided an actual example.
actual <- sample(1:4, 1000, replace=TRUE)
predicted <- sample(c(1L,2L,4L), 1000, replace=TRUE)
print( createConfusionMatrix(actual, predicted) )
which will give you:
      1  2  3  4
[1,] 85 87 90 77
[2,] 78 78 79 95
[3,]  0  0  0  0
[4,] 89 77 82 83
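If you then want caret's statistics on top of this, one hedged option (it assumes, as in this example, that every class appears in actual, so the column names line up with the row bins) is:
# Label the rows (predicted classes 1..numClasses) and hand the result
# to caret as a contingency table:
m <- createConfusionMatrix(actual, predicted)
rownames(m) <- colnames(m)
confusionMatrix(as.table(m))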
I had the same problem and here is my solution:
tab <- table(my_prediction, my_real_label)
if (nrow(tab) != ncol(tab)) {
  # Append all-zero rows for the classes that were never predicted.
  # Caution: this assumes the missing classes sort to the end of the
  # column order; otherwise the rows end up mislabeled.
  missings <- setdiff(colnames(tab), rownames(tab))
  missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(tab))
  tab <- as.table(rbind(as.matrix(tab), missing_mat))
  rownames(tab) <- colnames(tab)
}
my_conf <- confusionMatrix(tab)
Cheers
Cankut
