Binary outcome, different trial #s across low N? - r

I have a sample of 4 individuals, all who have a varying number of trials (I work with a special population so what I get is what I get!)
The outcome is a binary yes/no
I want to know:
did the total sample select yes more often than chance?
did each individual select yes more often than chance?
Here is dummy data in R.
SbjEL <- data.frame(Sbj = c('EL'),
TrialNum = c(1:12),
Choice = c(0,0,1,1,1,1,1,1,1,1,1, NA))
SbjKZ <- data.frame(Sbj = c('KZ'),
TrialNum = c(1:12),
Choice = c(0,1,1,1,1,1,1,1,1,1,1, 1))
SbjMA <- data.frame(Sbj = c('MA'),
TrialNum = c(1:12),
Choice = c(0,0,1,1,1,1,1,1,1,1,1, 1))
SbjTC <- data.frame(Sbj = c('TC'),
TrialNum = c(1:12),
Choice = c(1,1,1,1,1,1,1,1, NA,NA,NA, NA))
For a different experiment with the same sample, I had more trials and did a one sample t test for the sample, and a binomial distribution to see what # of trials of Yes would be higher than chance.
# Did group select YES more than chance? --> 43 yes/48
Response_v <- c(21,22)
t.test(Response_v, mu = 12, alternative = "two.sided")
# How many YES selections would be more often than chance?
# 24 trials were completed --> 21 yes / 24
binom.test(21, 24, 1/2)
My issue is this starts to fall apart when I get down to 8-12 trials.
Any ideas? I am lost

A t-test is not appropriate here for either Q1 or Q2. With large samples you can use some approximations, but your counts are very small. So, you’re on the right track with the binomial test, but not the t-test.
For your Q1: you first ought to decide how the subjects are assumed to relate to each other. Are you pretty confident that each is providing an estimate of the same Bernoulli probability, p? Or instead, a-priori do you want to allow the possibility that subjects have different p’s? There are further questions to answer, overlapping with those you need to consider for Q2.
For your Q2: The exact method of choice depends on a number of things: For example, do you want to incorporate prior information (e.g. using historical data as a reference)? If not, there are purely frequentist methods to use off the shelf. Next, do you expect the yes/no’s to be independent, or are they more like a ‘signal’ in which the order matters? Third, is it possible that there is a mixture of Bernoullis for any of the subjects? And so on. These questions can be considered through software such as that found at www.datatrie.com/advisor
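As a concrete starting point for the purely frequentist route, here is a minimal sketch using exact binomial tests on the dummy data from the question. It assumes the trials are independent within each subject, drops the NA trials, and uses a one-sided alternative ("more often than chance"). The pooled test additionally assumes all four subjects share the same p, which is exactly the assumption questioned above. The Holm correction across subjects is my own addition, not something from the question.

```r
# Yes counts and completed (non-NA) trials per subject, read off the dummy data
yes    <- c(EL = 9,  KZ = 11, MA = 10, TC = 8)
trials <- c(EL = 11, KZ = 12, MA = 12, TC = 8)

# Q2: one exact one-sided binomial test per subject (H0: p = 0.5)
p_each <- mapply(
  function(x, n) binom.test(x, n, p = 0.5, alternative = "greater")$p.value,
  yes, trials
)

# Optional: adjust the four p-values for multiple testing (Holm)
p_each_holm <- p.adjust(p_each, method = "holm")

# Q1: pooled test, valid only if every subject has the same underlying p
pooled <- binom.test(sum(yes), sum(trials), p = 0.5, alternative = "greater")

print(round(cbind(raw = p_each, holm = p_each_holm), 4))
print(pooled$p.value)
```

With only 8 trials, a subject must say yes on at least 7 of them to reach p < 0.05 in this one-sided test, so the small trial counts limit power rather than break the method.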

Related

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
cat_col = 'outcome',
id_col = 'bird') %>%
arrange(.folds)
cvms::cross_validate(
data,
"outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
family="binomial",
fold_cols = ".folds",
control = NULL,
REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
cat_col = 'outcome',
id_col = 'bird') %>%
arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder).
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, aic) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
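If you drop the outcome balancing, all the folds need to guarantee is that each bird stays in one fold. Here is a rough base-R sketch of that idea, so you can see what the ID-level constraint amounts to. It is not how groupdata2 assigns folds internally, and the outcome values are made up; it also ignores the crossed obsname issue entirely.

```r
set.seed(42)

# Made-up data: 7 birds with 4 or 5 resighting surveys each
bird <- rep(1:7, times = c(5, 5, 5, 5, 5, 5, 4))
df <- data.frame(
  bird    = factor(bird),
  outcome = rbinom(length(bird), 1, 0.5)
)

k   <- 3
ids <- levels(df$bird)

# Shuffle the bird IDs, then deal them out to folds 1..k in turn
fold_of_id <- setNames(rep_len(seq_len(k), length(ids)), sample(ids))

# Every row inherits the fold of its bird
df$.folds <- factor(unname(fold_of_id[as.character(df$bird)]))

# Check: each bird should appear in exactly one fold
folds_per_bird <- tapply(df$.folds, df$bird, function(f) length(unique(f)))

# Inspect how the outcomes happen to be spread over the folds
print(table(fold = df$.folds, outcome = df$outcome))
```

The final table shows whatever outcome balance falls out by chance. If it looks poor you can reshuffle with a different seed, which is a crude version of trying several groupings and keeping the best one.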
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraints that fold() does with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in all folds.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
library(dplyr)  # provides %>% and arrange()
# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
data = df,
n = 3,
group_cols="bird",
cat_col="outcome",
col_name = ".folds"
) %>%
arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
data=data,
group_cols=".folds",
cat_cols="outcome"
)$Groups
> # A tibble: 3 × 6
>   .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
>   <fct>      <fct>     <int>    <int>      <dbl>      <dbl>
> 1 .folds     1            14        3          7          7
> 2 .folds     2            10        2          6          4
> 3 .folds     3            10        2          4          6
summarize_balances() shows us that we created 3 folds with 14 rows in the first fold and 10 in the other folds. There are 3 unique bird levels in the first fold and 2 in the others (normally only unique within the group, but here we know that birds are only in one group, as that is how collapse_groups() works with its group_cols argument).
The levels of the outcome variable (here # outc_0 and # outc_1) are reasonably well balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more https://rdrr.io/cran/groupdata2/man/collapse_groups.html

I need to find all predictors (p-value < 0.05) from my dataset using loops. Is there any way to do it?

I am new to R and I am using glm() function to fit a logistic model. I have 5 columns. I need to find all possible predictors using a loop based on their p-values(less than 0.05).
My dataset has 40,000 entries containing numerical and categorical variables, and it looks more or less like this:
"Age"  "Sex"     "Occupation"  "Education"    "Income"
50     Male      Farmer        High School    False
30     Female    Maid          High School    False
25     Male      Engineer      Graduate       True
The target variable "Income" denotes whether the person earns more or less than 30K: True means they earn more than 30K, False means they earn less. I would like to find the predictor variables that can be used to predict the target using loops. Also, can I find the best 3 predictors based on their p-values?
Thanks in Advance!
If I understood your question correctly, you are looking for a way to fit univariable models for each column of your data frame (although I am not sure whether you also want to test every combination of these variables, including interactions).
My suggestion is to use the purrr::map function and create a list with one data frame per predictor. Check the following example based on your information:
library(tidyr)
library(purrr)
## Sample data
df <- data.frame(
Age = rnorm(n = 40000,
mean = mean(c(50,30,25)),
sd(c(50,30,25))),
Occupation = sample(x = c("Farmer", "Maid", "Engineer"),
size = 40000,
replace = TRUE),
Education = sample(x = c("High School", "Graduate", "UnderGraduate"),
size = 40000,
replace = TRUE),
Income = as.logical(rbinom(40000, 1, 0.5))
)
## Build one small data frame per predictor: that predictor plus Income
predictors <- setdiff(names(df), "Income")
list_df <- lapply(predictors, function(v) df[c(v, "Income")])
names(list_df) <- predictors
## Use map to fit a univariable logistic model to each data frame
list_models <- map(.x = list_df,
                   .f = ~glm(Income ~ ., data = .x, family = binomial))
You can access each model with list_models[[i]] or by name, e.g. list_models[["Age"]].
Now to the second part of your question, concerning p-values. Every project is unique and so are its metrics, so I suggest you double-check your use of p-values. Granted, they are important, but a p-value only tells you how surprising the observed test statistic would be if the null hypothesis were true, and the threshold you compare it to (such as 0.05) depends on context. They are a fundamental tool for statistical decisions (not only for t-tests, but also F-tests and so on). But for ranking predictors? Hmm, I would say that is a little odd. Just saying :)
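If you do still want a single p-value per predictor, one option is a likelihood-ratio test of each univariable model against the intercept-only model, which gives one p-value even for a categorical predictor with several levels. This is only a sketch under assumptions: the data are simulated (only Age truly matters here), the column names mirror the question, and no correction for multiple testing is applied.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the real data; only Age is related to Income
df <- data.frame(
  Age       = rnorm(n, mean = 35, sd = 10),
  Sex       = sample(c("Male", "Female"), n, replace = TRUE),
  Education = sample(c("High School", "Graduate"), n, replace = TRUE)
)
df$Income <- rbinom(n, 1, plogis(-3.5 + 0.1 * df$Age)) == 1

predictors <- setdiff(names(df), "Income")

# Intercept-only model used as the reference for every comparison
null_fit <- glm(Income ~ 1, data = df, family = binomial)

# One likelihood-ratio p-value per predictor
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Income"), data = df, family = binomial)
  anova(null_fit, fit, test = "Chisq")[2, "Pr(>Chi)"]
})

# Predictors sorted by p-value, and those below 0.05
print(sort(p_values))
print(names(p_values)[p_values < 0.05])
```

Keep in mind that with 40,000 rows even tiny effects will fall below 0.05, which is another reason a p-value is a poor measure of how useful a predictor is.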

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find the best cutoff point in Value for determining the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar variables and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is clearly better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/ better approach to solving my problem? I have several values I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your cutoff change as you adjust the threshold, so you can select the threshold with the best trade-off between the two (by default, coords() with x = "best" picks the threshold that maximizes Youden's J, i.e. sensitivity + specificity - 1). For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
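If what you care about is literally the number of correct assignments, as in the question, you can also scan the candidate cutoffs directly. This is a minimal base-R sketch on made-up toy data (not the 70 observations above), assuming the rule "predict B when Value is below the cutoff". Maximizing the raw count of correct assignments is a different criterion from the ROC "best" threshold, so the two need not agree.

```r
# Toy data (made up for illustration): species B tends to have lower values
value   <- c(3.2, 3.5, 3.6, 3.7, 3.4, 3.9, 4.0, 4.1, 4.3)
species <- c("B", "B", "B", "B", "A", "A", "A", "A", "A")

# Candidate cutoffs: midpoints between consecutive distinct sorted values
v    <- sort(unique(value))
cuts <- head(v, -1) + diff(v) / 2

# Rule: predict "B" when value < cutoff, otherwise "A";
# count how many observations each cutoff classifies correctly
n_correct <- sapply(cuts, function(cutoff) {
  predicted <- ifelse(value < cutoff, "B", "A")
  sum(predicted == species)
})

# Cutoff with the most correct assignments (first one in case of ties)
best_cut     <- cuts[which.max(n_correct)]
best_correct <- max(n_correct)

print(data.frame(cutoff = cuts, n_correct = n_correct))
```

To apply this to the real data, replace value and species with D$Value and D$Species. With unequal group sizes (47 vs 23), raw accuracy favors the larger group, which is one reason to prefer a sensitivity/specificity-based criterion.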

R - Compare performance of two types while controlling for interaction

I have been programming in R and have a dataset containing the results (success or not) of two Machine Learning algorithms which have been tried out using different amounts of parameters. An example is provided below:
type  success  parameter_amount
a1    0        15639
a1    0        18623
a1    1        19875
a2    1        12513
a2    1        10256
a2    0        12548
I now want to compare both algorithms to see which one has the best overall performance. But there is a catch. It is known that the higher the parameter_amount, the higher the chances for success. When checking out the parameter amounts both algorithms were tested on, one can also notice that a1 has been tested with higher parameter amounts than a2 was. This would make simply counting the amount of successes of both algorithms unfair.
What would be a good approach to handle this scenario?
I will give you an answer, but without any guarantee that what I'm telling you is right. For a more precise answer you would need to give more information about the algorithms and the setup. I would also suggest migrating this question to Cross Validated.
Your question is really a statistical one. In statistics we look for sparsity: at a given level of performance, we prefer a simpler model to a very complex one because we are worried about over-fitting: https://statisticsbyjim.com/regression/overfitting-regression-models/.
One way to do what you want is to compare performance as a function of model complexity, as in this toy example:
library(tidyverse)
library(ggplot2)
set.seed(123)
# number of estimation for each models
n <- 1000
performance_1 <- round(runif(n))
complexity_1 <- round(rnorm(n, mean = n, sd = 50))
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2 <- round(rnorm(n, mean = n, sd = 50))
df <- data.frame(performance = c(performance_1, performance_2),
complexity = c(complexity_1, complexity_2),
models = as.factor(c(rep(1, n), rep(2, n))))
temp <- df %>% group_by(complexity, models) %>% summarise(perf = sum(performance))
ggplot(temp, aes(x = complexity, y = perf, group = models, fill = models)) +
geom_smooth() +
theme_classic()
This only works if you have many data points. In your case, complexity is the number of parameters fitted. In this toy example, the first model looks better because it performs better at each level of complexity.
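A different way to adjust for the parameter amount, which I am naming plainly as an alternative to the plot above, is a logistic regression with both type and parameter amount as predictors. The coefficient for type then compares the two algorithms at equal parameter amounts. This is a sketch on simulated data; the effect sizes and the parameter_k column (parameter amount in thousands) are assumptions for illustration, and it relies on the two algorithms having been tried on overlapping parameter ranges.

```r
set.seed(1)
n <- 1500

# Made-up data: a1 is tried with more parameters (in thousands) than a2
dat <- data.frame(
  type        = factor(rep(c("a1", "a2"), each = n)),
  parameter_k = c(runif(n, 14, 20), runif(n, 10, 16))
)

# Simulated truth: more parameters help, and a2 is better at equal parameter counts
log_odds    <- -6 + 0.4 * dat$parameter_k + 0.8 * (dat$type == "a2")
dat$success <- rbinom(nrow(dat), 1, plogis(log_odds))

# Naive comparison: raw success rate per algorithm
raw_rates <- tapply(dat$success, dat$type, mean)

# Adjusted comparison: effect of type holding parameter amount fixed
fit         <- glm(success ~ type + parameter_k, data = dat, family = binomial)
type_effect <- coef(fit)["typea2"]

print(raw_rates)
print(summary(fit)$coefficients)
```

In this simulation the raw rates favor a1 simply because it was run with more parameters, while the adjusted coefficient for a2 is positive. If the effect of parameter amount might differ between algorithms, you could add a type:parameter_k interaction term.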

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a two-sample paired t-test on 35 variables by 1 grouping variable (with two levels).
For example: the data frame contains "age" "body weight" and "group" variables.
Now I suppose I can do the t-test for each variable with this code:
t.test(age~group)
But is there a way to test all 35 variables with one piece of code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, each t-test you perform has a 5% chance of a Type 1 error (incorrectly rejecting a true null hypothesis). Running 35 individual t-tests compounds that risk: assuming the tests are independent and every null hypothesis is true, the probability of at least one Type 1 error is
Pr(at least one Type 1 Error) = 1 - (0.95)^35 = 0.834
Meaning that you have about an 83.4% chance of falsely rejecting at least one null hypothesis. Basically, by running so many t-tests, there is a very high probability that at least one of them is going to give a false positive.
Just FYI.
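The usual remedy is to adjust the p-values for the number of tests. Here is a short sketch that reproduces the number above and applies Bonferroni and Holm corrections with base R's p.adjust(). The three raw p-values are hypothetical, standing in for the smallest of the 35.

```r
alpha   <- 0.05
n_tests <- 35

# Chance of at least one false positive across 35 independent tests
fwer <- 1 - (1 - alpha)^n_tests

# Bonferroni: per-test threshold that keeps the familywise rate at or below alpha
bonferroni_alpha <- alpha / n_tests

# Hypothetical raw p-values (the three smallest out of the 35 tests)
p_raw <- c(0.001, 0.02, 0.04)

# Adjusted p-values; n tells p.adjust how many tests were run in total
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = n_tests)
p_holm <- p.adjust(p_raw, method = "holm", n = n_tests)

print(round(fwer, 3))
print(rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm))
```

Only the first of these hypothetical p-values stays below 0.05 after either adjustment. Holm is never less powerful than Bonferroni, so it is usually the better default of the two.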
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
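lapply(dat[1:3], function(x)
  t.test(x[dat$group == 1], x[dat$group == 2], paired = TRUE))
In the command above, 1:3 gives the positions of the columns containing the variables. The argument paired = TRUE is necessary to perform a paired t-test. The two groups are passed as separate vectors because recent versions of R (4.4.0 and later, as far as I know) no longer accept paired = TRUE in the formula interface t.test(x ~ group). Note that pairing relies on the rows of the two groups being in matching order.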
