R: Percentiles of a data frame with a non-zero subset of observations

I would like to calculate the percentiles of the following tibble.
Each of the 3 variables has 10 observations, but I only hold the non-zero subset (5 rows), i.e.
n <- 10
tibb <- tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y)
(The excluded observations are all zero)
Therefore the mean is the sum of the fields / 10 (as opposed to / 5):
meantibb <- tibb %>% group_by() %>%
summarise_if(is.numeric, sum, na.rm = TRUE) / n
meantibb
How do I get the following percentiles of x, y and z in the tibble please?
perciles <- c(0.5, 0.75)
percentiles <- function(p) quantile(p, perciles)
Thank you

You could create a data set including the zeroes
missingRowCount <- n - nrow(tibb)
colCount <- ncol(tibb)
# Name the columns up front: as_tibble() expects a matrix with column names (as.tibble() is deprecated)
zeroTibb <- matrix(0, nrow = missingRowCount, ncol = colCount, dimnames = list(NULL, colnames(tibb))) %>% as_tibble()
allTibb <- dplyr::bind_rows(tibb, zeroTibb)
Once you have the full data, you can run the following to get a tibble of percentiles:
percTibble <- sapply(allTibb, percentiles) %>%
as_tibble()
The assumption here is that the data is not going to be too large when the zeroes are included.

You're close. Computing the mean (and then the percentiles) is simpler if you first reshape the data to long format with gather() and then group by the three variables.
library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr
n <- 10
tibb <- tibble(x = 1:5, y = 1, z = x ^ 2 + y)
tibb %>%
gather("fctr", "value") %>%
group_by(fctr) %>%
summarise(mean = sum(value) / n,
perc_50 = quantile(value, 0.5),
perc_75 = quantile(value, 0.75))
However, I'm not sure whether you want the percentiles of the non-zero subset or of the entire dataset. The code above takes quantiles of the non-zero subset only (while the mean divides by n), and the choice changes the results, e.g.
> x = 1:5
> quantile(x, 0.1)
10%
1.4
> test <- c(0,0,0,0,0,1,2,3,4,5)
> quantile(test, 0.1)
10%
0
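If the percentiles are meant to cover all n observations, the zeros can also be padded per column instead of binding a block of zero rows. The following is only a sketch under that assumption; padded_quantile is a hypothetical helper name that does not appear in the question, and it uses base R so it runs without extra packages.

```r
n <- 10
probs <- c(0.5, 0.75)

# Same data as the question, built as a plain data frame
dat <- data.frame(x = 1:5, y = 1)
dat$z <- dat$x^2 + dat$y

# Hypothetical helper: append the excluded zeros, then take quantiles
padded_quantile <- function(v, n, probs) {
  stopifnot(length(v) <= n)
  quantile(c(v, rep(0, n - length(v))), probs = probs, names = FALSE)
}

# One column of percentiles per variable
res <- sapply(dat, padded_quantile, n = n, probs = probs)
rownames(res) <- paste0(probs * 100, "%")
res
```

With the question's data, each column is treated as its 5 values plus 5 zeros, so the results differ from quantiles of the non-zero subset alone.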

Related

Removing outliers from statistical testing of stat_compare_means

I have a larger dataset that has to be presented as boxplots. There may be outliers within each group, and I want to run the statistical tests only after excluding them. Sample df and code below:
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
score = c(2,3,4,5,100,5,8,9,10,95))
df %>% ggplot(aes(x=name,y=score)) + geom_boxplot() +
stat_compare_means(comparisons = list(c("Bob","Tom")),method="t.test", paired=F)
I use stat_compare_means because the larger dataset has many more groups and facets, which makes removing outliers manually very tedious (unless it can be done for the whole dataset at once). Is it possible to make the function ignore the outliers when computing the statistical tests? Thanks
If you drop the outliers only from the statistical test, the plot will show a test result computed without outliers on top of a boxplot that still includes them, which is misleading. So you could remove the outliers beforehand and then run the t-test. Below, the first plot overlays the outlier-free t-test on the boxplot with outliers, and the second plot uses the outlier-free data for both the boxplot and the t-test:
library(dplyr)
library(ggpubr)
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
score = c(2,3,4,5,100,5,8,9,10,95))
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
val <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - val)] <- NA
y[x > (qnt[2] + val)] <- NA
y
}
df2 <- df %>%
group_by(name) %>%
mutate(score = remove_outliers(score)) %>%
ungroup()
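# Keep only the non-outlier rows; negative indexing with which() would drop every row if nothing were flagged
df_clean <- df2 %>% filter(!is.na(score))
df %>% ggplot(aes(x=name,y=score)) +
geom_boxplot() +
stat_compare_means(data = df_clean, comparisons = list(c("Bob","Tom")),
method="t.test",
paired=FALSE)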
df2 %>% ggplot(aes(x=name,y=score)) +
geom_boxplot() +
stat_compare_means(comparisons = list(c("Bob","Tom")),
method="t.test",
paired=F)
#> Warning: Removed 2 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 2 rows containing non-finite values (stat_signif).
Created on 2022-08-10 by the reprex package (v2.0.1)
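The same idea can be checked without any plotting. This is a minimal base R sketch, assuming the 1.5 * IQR rule from remove_outliers above and Welch's t-test (the t.test default); is_outlier and df_clean are illustrative names introduced here.

```r
df <- data.frame(name = rep(c("Bob", "Tom"), each = 5),
                 score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95))

# Flag values more than 1.5 * IQR outside the quartiles
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule within each group (ave() returns 0/1, hence as.logical)
df$outlier <- as.logical(ave(df$score, df$name, FUN = is_outlier))
df_clean <- df[!df$outlier, ]

# Two-sample t-test on the filtered data
tt <- t.test(score ~ name, data = df_clean)
tt$p.value
```

Filtering once, up front, means the same cleaned data frame can feed both the plot and the test, which avoids showing a statistic that does not match the plotted points.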

ggplot - use data passed to ggplot to calculate the mean of the data in subsequent geom calls [duplicate]

I was wondering why variable mean_y is not recognized by my
geom_hline(yintercept = unique(mean_y)) call?
library(tidyverse)
set.seed(20)
n_groups <- 2
n_in_group <- 20
sd_e = 2
groups <- gl(n_groups, n_in_group, labels = c("T","C"))
age <-rnorm(length(groups), 25, 3)
betas <- c(5,0,0,2)
dat <- data.frame(groups=groups,age=age)
X <- model.matrix(~ groups * age, data = dat)
lin_pred <- as.vector(X %*% betas)
dat$y <- rnorm(nrow(X), lin_pred, sd_e)
dat %>% group_by(groups) %>% mutate(mean_y = mean(y)) %>%
ungroup() %>%
ggplot()+aes(x = age, y = y) +
geom_point(aes(color=groups)) +
geom_hline(yintercept = unique(mean_y)) # Error in unique(mean_y) :
# object 'mean_y' not found
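Variables from the data need to be mapped inside aes(); outside aes(), mean_y is looked up in the calling environment, where it does not exist. Try: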
geom_hline(aes(yintercept = mean_y))

Simulation to visualize power does not predict built-in power functions

I built a Shiny web app with an interface for calculating the sample size needed to differentiate between test groups. I wanted to build a simulation that visualizes how power works. My understanding is that power is the probability of distinguishing the test groups from each other when they are in fact different. Using the built-in power calculator, I find that I need 10 replicates to distinguish 2 sample groups when %CV = 15, %improvement = 20 and power = 80%. When I try to simulate this using random number generators, I get a power of 40-55% on average. See the code below.
cv <- 15 #coefficient of variation is the standard deviation divided by the mean
percent_increase <- 20 #percent increase to detect
mean1 <- 40
mean2 <- mean1 + (mean1*(percent_increase/100))
sd1 <- (cv*mean1)/100
sd2 <- (cv*mean2)/100
pooled_sd <- sqrt((sd1^2 + sd2^2)/2)
difference <- (percent_increase/100)*mean1
pwrt <- power.t.test(delta=difference,sd=sd1,power=0.8,sig.level = .05, type="two.sample", alternative = "two.sided")
print(paste("Number of replicates needed is", pwrt$n))
#Simulate sample populations and tag which samples are different from each other.
record_test <- c()
for(i in 1:1000){
rep_sequence <- seq(2,50, by=4)
data_list <- list() # empty data list
for(r in 1:length(rep_sequence)){
d <- rnorm(rep_sequence[r], mean = mean1, sd = pooled_sd)
d2 <- rnorm(rep_sequence[r], mean = mean2, sd = pooled_sd)
df <- data.frame("value"=d, "sample"="Sample 1")
df2 <- data.frame("value"=d2, "sample"="Sample 2")
df3 <- rbind(df, df2)
df3$rep_n <- rep_sequence[r]
data_list[[r]] <- df3
}
all_data <- do.call(rbind, data_list)
all_data_summ <- all_data %>%
group_by(sample, rep_n) %>%
summarise(N= n(),
mean = mean(value),
sd = sd(value)
) %>%
ungroup() %>%
mutate(se = sd / sqrt(N),
ci_lower = mean - qnorm(0.975)*se,
ci_upper = mean + qnorm(0.975)*se,
#ci_lower = mean - qt(1 - (0.05/2), N -1)* se,
#ci_upper = mean + qt(1 - (0.05/2), N -1)* se,
)
different_tag <- (all_data_summ %>% filter(sample == 'Sample 1') %>% select(N, ci_upper) ) %>% #Sample 1 set
left_join(all_data_summ %>% filter(sample == 'Sample 2') %>% select(N, ci_lower), by=c("N"="N")) %>% #Sample 2 set
rename(ci_upper_s1 = ci_upper, ci_lower_s2 = ci_lower) %>%
mutate(different = ifelse(ci_lower_s2 > ci_upper_s1 ,'yes', 'no'))
all_data_summ1 <- all_data_summ %>%
left_join(different_tag %>% select(N, different), by=c("N"="N"))
replicate_n <- 10 #at n=10 the power should be 80%
test_result <- all_data_summ1[all_data_summ1$rep_n == replicate_n, ]$different[[1]]
record_test <- c(record_test, test_result)
ggplot(all_data_summ1, aes(rep_n, mean)) +
geom_errorbar(aes(x=rep_n, ymin=ci_lower, ymax= ci_upper, group=sample),position=position_dodge(width=1.5), width=1.5,size=1, colour="red") +
geom_point(data= filter(all_data_summ1, different == "yes"), colour="black", size=8,stroke=2, aes(rep_n,mean,group=sample),position=position_dodge(width=1.5), shape=0) +
geom_point(position=position_dodge(width=1.5), size=4, pch=21, aes(fill=sample)) +
scale_x_continuous(breaks = rep_sequence) +
labs(x="Replication", y="Average", title= paste0('Sample Means with 95% Confidence Interval Bars at CV=',cv, '%')) +
theme_gray(base_size = 25) +
theme(plot.title = element_text(hjust = 0.5))
}
print(table(record_test)/length(record_test))
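For comparison, power can also be estimated by running the test that power.t.test assumes, a two-sample t-test with equal variances, on each simulated data set and counting rejections. Note that checking whether two 95% confidence intervals overlap is a more conservative criterion than a t-test at the 5% level, so the two approaches need not agree. The sketch below is not a diagnosis of the code above: it assumes both groups share sd1 (as in the power.t.test call), n = 10 per group and 2000 simulated experiments, and sim_power is a hypothetical helper.

```r
set.seed(123)
cv <- 15
percent_increase <- 20
mean1 <- 40
mean2 <- mean1 * (1 + percent_increase / 100)
sd1 <- cv * mean1 / 100

# Hypothetical helper: share of simulated experiments with p < alpha
sim_power <- function(n, n_sim = 2000, alpha = 0.05) {
  rejected <- replicate(n_sim, {
    g1 <- rnorm(n, mean = mean1, sd = sd1)
    g2 <- rnorm(n, mean = mean2, sd = sd1)
    t.test(g1, g2, var.equal = TRUE)$p.value < alpha
  })
  mean(rejected)
}

# Analytic power at n = 10 versus the simulated rejection rate
analytic <- power.t.test(n = 10, delta = mean2 - mean1, sd = sd1,
                         sig.level = 0.05)$power
simulated <- sim_power(10)
c(analytic = analytic, simulated = simulated)
```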

Maximum likelihood with order statistics in R

I'm testing the Maximum Likelihood method, when only the maximum value of a sample is provided. I'm assuming the sample is from a Gaussian Distribution.
First I generate 10,000 random numbers with mean = 2.45 and sd = 1:
library(tidyverse)
set.seed(91)
n <- 10000
mean <- 2.45
sd <- 1
random_numbers <- rnorm(n, mean, sd)
Then I extract the max value, and I assume that's the only value I know.
maximum <- random_numbers[which.max(random_numbers)]
Then I evaluate the density of that maximum under different mean values. I use the density of the sample maximum, n * F(x)^(n-1) * f(x), where F and f are the normal CDF and PDF:
mean_space <- seq(0, 10, by = 0.01)
densities <- n * (pnorm(maximum, mean_space, sd)^(n-1)) * dnorm(maximum, mean_space, sd)
df <- data.frame(x = mean_space, y = densities)
g <- ggplot(df, aes(x = x, y = y)) +
geom_line() +
geom_vline(xintercept = mean)
print(g)
df %>% filter(y == max(y))
However, I'm getting density values higher than 1, which I think is not correct.
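Values above 1 are not by themselves a sign of an error: a probability density only has to integrate to 1, and it can exceed 1 wherever the distribution is concentrated on a narrow interval. As a hedged check, the sketch below evaluates the density of the maximum as a function of x for a fixed mean and sums it numerically. mu and sigma are the question's values; the integration range and step size are assumptions chosen so that the tails outside the range are negligible.

```r
n <- 10000
mu <- 2.45
sigma <- 1

# Density of the maximum of n iid normal draws: n * F(x)^(n-1) * f(x)
max_density <- function(x) {
  n * pnorm(x, mu, sigma)^(n - 1) * dnorm(x, mu, sigma)
}

# Riemann sum over a range that covers essentially all of the mass
step <- 0.001
x <- seq(mu + 2, mu + 8, by = step)
dens <- max_density(x)

peak <- max(dens)        # highest density value on the grid
area <- sum(dens) * step # numerical integral of the density
c(peak = peak, area = area)
```

Because the normal family is a location family, the curve over candidate means in the question has the same shape as this density mirrored around the observed maximum, so its peak can exceed 1 as well.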

How to identify which part of a column is richest in ones?

If I randomly generate a binary matrix like below
Mat <- matrix(sample(0:1, 200*50, replace = TRUE),200,50)
Each column has 200 rows, and I set thresholds such as the top 50 rows and the bottom 30 rows.
How can I check whether the top 50 rows of each column contain more 1s than the bottom 30 rows or the middle rows?
How can I then plot the results?
By doing
f <- function(x, u = 200, d = 200){
res <- list(NA)
for(i in 1:ncol(x)){
res[[i]] <- c(sum(x[1:u,i] == 1), sum(x[(u+1):(nrow(x)- d),i] == 1), sum(x[(nrow(x)-d+1):nrow(x),i] == 1))
}
res <- do.call(rbind, res)
res
}
then calculate
res_value <- f(output)
the res_values can be found here
https://gist.github.com/anonymous/a1f68b9798affe630e65
df <- data.frame(cbind(c(t(res_value)), rep(1:50, each = 3)), X3 = rep(1:3))
ggplot(df, aes(x = factor(X2), y = X1, fill = as.factor(X3))) + geom_bar(position="fill", stat = "identity")
I got a warning like below
Warning message:
In cbind(c(t(res_value)), rep(1:50, each = 3)) :
number of rows of result is not a multiple of vector length (arg 2)
and of course the plot is like below which is not good at all
What about this? First write a function that counts the ones in each of the three groups defined by the thresholds (u and d), then plot the result as a filled bar plot:
f <- function(x, u = 50, d = 30){
res <- vector("list", ncol(x))
for(i in seq_len(ncol(x))){
# counts of ones in the top u rows, the middle rows and the bottom d rows
res[[i]] <- c(sum(x[1:u,i] == 1), sum(x[(u+1):(nrow(x)- d),i] == 1), sum(x[(nrow(x)-d+1):nrow(x),i] == 1))
}
res <- do.call(rbind, res)
res
}
res <- f(Mat)
df <- data.frame(cbind(c(t(res)), rep(1:50, each = 3)), X3 = rep(1:3))
ggplot(df, aes(x = factor(X2), y = X1, fill = as.factor(X3))) + geom_bar(position="fill", stat = "identity")
Group 1 is the upper group, 2 the middle and 3 the bottom group. If you want the exact numbers plotted instead of normalized values, you can set position = "stack".
This is how to view the matrix...
image(Mat)
You can try something like this:
Mat <- matrix(sample(0:1, 200*50, replace = TRUE),200,50)
high_t<-70
bottom_t<-70
sums <- rbind(colSums(Mat[1:high_t,]),colSums(Mat[(high_t+1):(nrow(Mat)-bottom_t),]),colSums(Mat[(nrow(Mat)-bottom_t+1):nrow(Mat),]))
res <- apply(sums,2,which.max)
For each interval, use colSums to sum the columns, then rbind the results and use which.max to find which interval has the most 1s, 1 for top, 2 for middle and 3 for bottom.
I changed your thresholds because the middle always wins if you choose 50 and 30 (the middle then has 120 rows)
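Because the three intervals have different lengths, comparing raw counts favours the longest one. One alternative, sketched here as an assumption and not as part of the answer above, is to compare the proportion of ones per interval with colMeans while keeping the original 50/30 thresholds; top_n, bottom_n and richest are illustrative names.

```r
set.seed(1)
Mat <- matrix(sample(0:1, 200 * 50, replace = TRUE), 200, 50)

top_n <- 50
bottom_n <- 30

# Row indices of the three intervals
top <- seq_len(top_n)
bottom <- seq(nrow(Mat) - bottom_n + 1, nrow(Mat))
middle <- setdiff(seq_len(nrow(Mat)), c(top, bottom))

# Proportion of ones per interval and column (3 x 50 matrix)
props <- rbind(top = colMeans(Mat[top, ]),
               middle = colMeans(Mat[middle, ]),
               bottom = colMeans(Mat[bottom, ]))

# Interval with the highest proportion of ones in each column
richest <- rownames(props)[apply(props, 2, which.max)]
table(richest)
```

Using proportions puts the 50-row, 120-row and 30-row intervals on the same scale, so the middle interval no longer wins simply because it is longer.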
library(reshape2)
library(gplots)
library(ggplot2)
Mat <- matrix(sample(0:1, 200*50, replace = TRUE), 200, 50)
low_cut <- 50
high_cut <- 30
lows <- apply(Mat, 2, function(x) sum(x[1:low_cut]))
highs <- apply(Mat, 2, function(x) sum(x[(length(x) - high_cut + 1):length(x)])) # last high_cut rows; without the + 1 this would take high_cut + 1 rows
totals <- colSums(Mat)
mids <- totals - lows - highs
results <- data.frame(id = 1:NCOL(Mat),
lows = lows,
mids = mids,
highs = highs)
excludeVars <- names(results) %in% c('id')
image(as.matrix(results[!excludeVars]))
heatmap.2(as.matrix(results[!excludeVars]),
trace = "none")
melted_results <- melt(results, id.vars = "id")
ggplot(melted_results, aes(x=variable, y=id)) +
geom_tile(aes(fill=value))

Resources