Generate samples from Bernoulli(p) using R

Using R, I need to generate data from the Bernoulli(p) distribution for various sample sizes (n = 10, 15, 20, 25, 30, 50, 100, 150, 200) and for p = 0.01, 0.4, 0.8.
I know how to do it for one case using the rbinom function. For the first scenario, for example: rbinom(n=10, size=1, p=0.01).
My aim is to build a function that computes all these scenarios, so that I don't have to run each of them individually.

The following function will give you a list of lists. I tried to give the elements appropriate names.
ff <- function(n,probs) {
res <- lapply(n, function(i) {
setNames(lapply(probs, function(p) {
rbinom(n=i, size=1, p=p)
}),paste0("p=",probs))
})
names(res) <- paste0("n=",n)
res
}
Just call it like ff(n=c(10,15,20),probs = c(0.01,0.4,0.8)) and you will get a list of length 3 (one element per n), each element being a list of length 3 (one element per probability) that holds the Bernoulli sample vectors.
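As a quick check of the structure, the sketch below repeats ff so that it runs on its own and then pulls out a single scenario by name. The seed and the names res, x and prop_ones are illustrative assumptions, not part of the original answer.

```r
# ff as defined above, repeated so the snippet runs on its own
ff <- function(n, probs) {
  res <- lapply(n, function(i) {
    setNames(lapply(probs, function(p) {
      rbinom(n = i, size = 1, prob = p)
    }), paste0("p=", probs))
  })
  names(res) <- paste0("n=", n)
  res
}

set.seed(1)  # arbitrary seed, only for reproducibility
res <- ff(n = c(10, 15, 20), probs = c(0.01, 0.4, 0.8))

# One scenario: the Bernoulli(0.4) sample of size 15
x <- res[["n=15"]][["p=0.4"]]
length(x)  # number of draws in this scenario
mean(x)    # observed proportion of ones

# Observed proportion of ones for every (n, p) scenario:
# rows are probabilities, columns are sample sizes
prop_ones <- sapply(res, function(by_p) sapply(by_p, mean))
prop_ones
```

Indexing by name keeps the lookup readable when the grid of scenarios grows.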

You can generate a data frame of your (n, p) combinations with expand.grid and then use the map2 function from the purrr package (part of the tidyverse) to generate a list of outputs, one per (n, p) pair:
library(tidyverse)
n <- c(10, 15, 20, 25, 30, 50, 100, 150, 200)
p <- c(0.01, 0.4, 0.8)
nps <- expand.grid(n = n, p = p)
# Draw n Bernoulli(p) values (size = 1) for each (n, p) pair
outlist <- map2(nps$n, nps$p, function(x, y) rbinom(n = x, size = 1, prob = y))
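If the goal is Bernoulli draws (size = 1) whose sample length is taken from n, the same grid can also be walked with base R's Map. A minimal sketch, where grid and bern are names chosen here for illustration and the seed is arbitrary:

```r
n <- c(10, 15, 20, 25, 30, 50, 100, 150, 200)
p <- c(0.01, 0.4, 0.8)
grid <- expand.grid(n = n, p = p)

set.seed(1)  # arbitrary seed, only for reproducibility

# One Bernoulli(p) sample of length n per row of the grid
bern <- Map(function(size_i, p_i) rbinom(n = size_i, size = 1, prob = p_i),
            grid$n, grid$p)
names(bern) <- paste0("n=", grid$n, ",p=", grid$p)

# Each element should hold as many draws as its n
all(lengths(bern) == grid$n)
```

Naming the elements after their (n, p) pair makes it easy to see which scenario each vector belongs to.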

Related

R bootstrapping for two data frames, column by column

I want to bootstrap while comparing two data frames column by column, where the data frames have different numbers of rows.
I have two data frames in which the rows hold values from experiments and the columns are named after the datasets (data1, data2, data3, data4).
emp.data1 <- data.frame(
data1 = c(234,0,34,0,46,0,0,0,2.26,0, 5,8,93,56),
data2 = c(1.40,1.21,0.83,1.379,2.60,9.06,0.88,1.16,0.64,8.28, 5,8,93,56),
data3 =c(0,34,43,0,0,56,0,0,0,45,5,8,93,56),
data4 =c(45,0,545,34,0,35,0,35,0,534, 5,8,93,56),
stringsAsFactors = FALSE
)
emp.data2 <- data.frame(
data1 = c(45, 0, 0, 45, 45, 53),
data2 = c(23, 0, 45, 12, 90, 78),
data3 = c(72, 45, 756, 78, 763, 98),
data4 = c(1, 3, 65, 78, 9, 45),
stringsAsFactors = FALSE
)
I am trying to bootstrap (nboot = 1000). Values are sampled at random, with replacement, from emp.data1 (14 x 4), while emp.data2 (6 x 4) is left unchanged. For example, for the first column (data1): take the column sum of the 6 values in emp.data2, draw 6 random non-zero values from emp.data1 and take their sum, divide the first sum by the second, and store the ratio. Repeat this 1000 times and take the median at the end. I want to do this for each column of the data frame. The sample code below runs, but I am not able to restrict the draws from emp.data1 to non-zero values.
nboot <- 1e3
boot_temp_emp<- c()
n_data1 <- nrow(emp.data1); n_data2 <- nrow(emp.data2)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(emp.data2)/colSums(emp.data1[boot,])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
boot_data<- apply(boot_temp_emp, 2, median)
The script above produces output, but the rows drawn via emp.data1[boot,] contain zeros, and those zeros are included in the sums. I want each column's sum to be taken over randomly selected non-zero values, drawn separately for each column. I tried the script below, but it does not remove the zero values and does not give the desired output. Could someone please help me correct it?
nboot <- 1e3
boot_temp_emp<- c()
for (i in colnames(emp.data2)){
for (j in seq_len(nboot)){
data1=emp.data1[i]
data2=emp.data2[i]
n_data1 <- nrow(data1); n_data2 <- nrow(data2)
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(data2[i])/colSums(data1[boot, ,drop = FALSE])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
}
boot_data<- apply(boot_temp_emp, 2, median)
Thank you
Here is a solution.
Write a function to make the code clearer. This function takes the following arguments.
x the input data.frame emp.data1;
s2 the column sums of emp.data2;
n the number of vector elements to sample from emp.data1's columns, with a default value of 6.
Then create a results matrix, pre-compute the column sums of emp.data2 and call the function in a loop.
boot_fun <- function(x, s2, n = 6){
# the loop makes sure there is no divide by zero
nrx <- nrow(x)
repeat{
i <- sample(nrx, n, replace = TRUE)
s1 <- colSums(x[i, ])
if(all(s1 != 0)) break
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results[i, ] <- boot_fun(emp.data1, sums2)
}
ratios_medians <- apply(results, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-24 by the reprex package (v2.0.1)
Edit
Following the comments here is a revised version of the bootstrap function. It makes sure there are no zeros in the sampled vectors, before computing their sums.
boot_fun2 <- function(x, s2, n = 6){
nrx <- nrow(x)
ncx <- ncol(x)
s1 <- numeric(ncx)
for(j in seq.int(ncx)) {
repeat{
i <- sample(nrx, n, replace = TRUE)
if(all(x[i, j] != 0)) {
s1[j] <- sum(x[i, j])
break
}
}
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results2 <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results2[i, ] <- boot_fun2(emp.data1, sums2)
}
ratios_medians2 <- apply(results2, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results2[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians2[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-27 by the reprex package (v2.0.1)
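As a possible alternative to the repeat loop, the non-zero values of each column can be sampled directly. Rejecting every draw that contains a zero leaves draws that are uniform over the non-zero entries, so the sketch below should follow the same distribution as boot_fun2 without retries (the individual numbers will differ because the random stream is used differently). The names boot_fun3, results3 and ratios_medians3 are hypothetical and not part of the original answer.

```r
emp.data1 <- data.frame(
  data1 = c(234, 0, 34, 0, 46, 0, 0, 0, 2.26, 0, 5, 8, 93, 56),
  data2 = c(1.40, 1.21, 0.83, 1.379, 2.60, 9.06, 0.88, 1.16, 0.64, 8.28, 5, 8, 93, 56),
  data3 = c(0, 34, 43, 0, 0, 56, 0, 0, 0, 45, 5, 8, 93, 56),
  data4 = c(45, 0, 545, 34, 0, 35, 0, 35, 0, 534, 5, 8, 93, 56)
)
emp.data2 <- data.frame(
  data1 = c(45, 0, 0, 45, 45, 53),
  data2 = c(23, 0, 45, 12, 90, 78),
  data3 = c(72, 45, 756, 78, 763, 98),
  data4 = c(1, 3, 65, 78, 9, 45)
)

# Hypothetical variant of boot_fun2: sample the non-zero values directly
boot_fun3 <- function(x, s2, n = 6) {
  s1 <- vapply(x, function(col) {
    nz <- col[col != 0]  # keep only the non-zero values of this column
    # sample indices rather than values, so a single non-zero value is safe
    sum(nz[sample.int(length(nz), n, replace = TRUE)])
  }, numeric(1))
  s2 / s1
}

set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)

# replicate() returns one column per bootstrap draw, hence the transpose
results3 <- t(replicate(nboot, boot_fun3(emp.data1, sums2)))
ratios_medians3 <- apply(results3, 2, median)
ratios_medians3
```

Sampling indices with sample.int sidesteps the behaviour of sample() when it is handed a single number.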

How to edit the loop to save the mean of each sample to a matrix in R?

set.seed(89235)
values<-c(10, 5, 10, 25, 50, 100, 500, 1000)
n=length(values)
for (i in 1:n){
mymean<- mean(rnorm(values[i], mean=0, sd=1))
cat("sample size:",values[i],"mean:", mymean, fill=TRUE)
}
I have created the loop above, but how can I change it so that it also saves the mean of each sample to a matrix?
My instinct for this approach is to append mymean to a list in the loop and from there convert the list to any data format you need. I've included how to convert to a matrix from there but you could go to a data.frame as well.
set.seed(89235)
values<-c(10, 5, 10, 25, 50, 100, 500, 1000)
n=length(values)
mylist = list()
for (i in 1:n){
mymean<- mean(rnorm(values[i], mean=0, sd=1))
mylist = append(mylist, mymean) # append each iteration of mymean to mylist
cat("sample size:",values[i],"mean:", mymean, fill=TRUE)
}
matrix(unlist(mylist), ncol =1, nrow =length(mylist))
First, define mymean as a vector with n elements. Then store each mean at the matching index of mymean on every iteration. Do not forget to use the index of mymean when printing the result to the console.
set.seed(89235)
values <- c(10, 5, 10, 25, 50, 100, 500, 1000)
n <- length(values)
mymean <- numeric(n) # pre-allocate a numeric vector of length n
for (i in 1:n){
mymean[i]<- mean(rnorm(values[i], mean=0, sd=1))
cat("sample size:",values[i],"mean:", mymean[i], fill=TRUE)
}
(mymean)
A simpler approach is to use the sapply function.
set.seed(89235)
values <- c(10, 5, 10, 25, 50, 100, 500, 1000)
mymean <- sapply(values, function(input) {
mean.value <- mean(rnorm(input, mean=0, sd=1))
cat("sample size:",input,"mean:", mean.value, fill=TRUE)
return(mean.value)
})
(mymean)
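Since the question asks for a matrix, the means can also be bound to their sample sizes in one step. A minimal sketch using vapply, where means and result are names picked here for illustration:

```r
set.seed(89235)
values <- c(10, 5, 10, 25, 50, 100, 500, 1000)

# vapply guarantees exactly one numeric value per sample size
means <- vapply(values, function(k) mean(rnorm(k, mean = 0, sd = 1)), numeric(1))

# Two-column matrix: the sample size and the mean of that sample
result <- cbind(sample_size = values, mean = means)
result
```

Keeping the sample size next to each mean avoids having to match positions back to values later.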

Plotting statistical power vs replicates and calculating mean of coefficients

I need to plot the statistical power vs. the number of replicates and in this case the number of replicates (n) is 3, but I can't figure out how to plot it.
This is what I have:
library(car)
n <- 3
nsims <- 1000
p = coef = vector()
for (i in 1:nsims) {
treat <- rnorm(n, mean = 460, sd = 110)
cont <- rnorm(n, mean = 415, sd = 110)
df <- data.frame(
y = c(treat, cont),
x = rep(c("treat", "cont"), each = n)
)
model <- glm(y ~ x, data = df)
p[i] = Anova(model)$P
coef[i] = coef(model)[2]
}
hist(p, col = 'skyblue')
sum(p < 0.05)/nsims
Can someone help me plot this?
Also, I need to calculate the mean of the coefficients using only the models where p < 0.05. This simulates the following process: if you perform the experiment and p > 0.05, you report 'no effect', but if p < 0.05 you report 'significant effect'. I'm not sure how to set that up from what I have.
Would I just do this?
mean(coef)
But I don't know how to include only those with p < 0.05.
Thank you!
Disclaimer: I spend a decent amount of time simulating experiments for work so I have strong opinions on this.
If that's everything because it's for a study assignment, then fine. If you are planning to go further with this, I recommend adding the tidyverse to your arsenal.
Encapsulating functionality
First, allow me to put a single iteration into a function, to decouple its logic from the subsetting of the results (the encapsulation).
sim <- function(n) {
treat <- rnorm(n, 460, 110)
cont <- rnorm(n, 415, 110)
data <- data.frame(y = c(treat, cont), x = rep(c("treat", "cont"), each = n))
model <- glm(y ~ x, data = data)
p <- car::Anova(model)$P
coef <- coef(model)[2]
data.frame(n, p, coef)
}
Now we can simulate
nsims <- 1000
sims <- do.call(
rbind,
# n is now a parameter of sim(), rather than fixed at 3 as in the question.
lapply(
rep(c(3, 5, 10, 20, 50, 100), each = nsims),
sim
)
)
# Aggregations
power_smry <- aggregate(p ~ n, sims, function(x) {mean(x < 0.05)})
coef_smry <- aggregate(coef ~ n, sims[sims$p < 0.05, ], mean)
# Plots
plot(p ~ n, data = power_smry)
If you do this in the tidyverse, this is one possible approach:
library(tidyverse)
crossing(
n = c(3, 5, 10, 20, 50, 100)
# Add any number of other inputs here that you want to explore (like lift).
) %>%
rowwise() %>%
# This looks complicated but will be less so if you have multiple
# varying hyperparameters defined in crossing.
# Note: rerun() is deprecated as of purrr 1.0.0, which suggests map() instead.
mutate(results = list(bind_rows(rerun(nsims, sim(n))))) %>%
pull(results) %>%
bind_rows() %>%
group_by(n) %>%
# The more metrics you want to summarize in different ways the easier compared to base.
summarize(
power = mean(p < 0.05),
coef = mean(coef[p < 0.05])
)
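For readers without car or the tidyverse installed, the same two summaries can be sketched in base R. Note the swap: this version uses a pooled-variance t.test instead of car::Anova on a glm, so the p-values are not identical to those above. The names sim_t, sims_t, power_t and coef_t are hypothetical, and nsims is reduced only to keep the run short.

```r
# Hypothetical helper: one simulated experiment, reduced to p-value and effect
sim_t <- function(n) {
  treat <- rnorm(n, mean = 460, sd = 110)
  cont  <- rnorm(n, mean = 415, sd = 110)
  # Assumption: pooled-variance t-test in place of car::Anova on a glm
  p <- t.test(treat, cont, var.equal = TRUE)$p.value
  # Difference in group means, which is what coef(model)[2] estimates
  data.frame(n = n, p = p, coef = mean(treat) - mean(cont))
}

set.seed(1)   # arbitrary seed
nsims <- 200  # kept small so the sketch runs quickly
sims_t <- do.call(
  rbind,
  lapply(rep(c(3, 5, 10, 20, 50, 100), each = nsims), sim_t)
)

# Power: share of simulations with p < 0.05, per number of replicates
power_t <- aggregate(p ~ n, sims_t, function(x) mean(x < 0.05))

# Mean estimated effect among the "significant" simulations only
coef_t <- aggregate(coef ~ n, sims_t[sims_t$p < 0.05, ], mean)

power_t
coef_t
```

Calling plot(p ~ n, data = power_t, type = "b") then draws power against the number of replicates. The logical subset sims_t$p < 0.05 is the piece the question asks about; for the original vectors it is simply mean(coef[p < 0.05]). Conditioning on significance typically inflates the average effect at small n.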

Separating Parameters in Repeated Function Calls

Given a vector of values, I want to call a function once for each value:
values = 1:10
rnorm(100, mean=values, sd=1)
With mean = values, the means are recycled across the 100 draws in the sequence (1,2,3,4,5,6,7,8,9,10). How can I get a matrix instead, in which each row or column holds 100 observations generated from a single element of my vector?
ie:
rnorm(100, mean=1, sd=1)
rnorm(100, mean=2, sd=1)
rnorm(100, mean=3, sd=1)
rnorm(100, mean=4, sd=1)
# ...
It's not clear from your question, but I took it that you wanted a single matrix with 10 rows and 100 columns. That being the case you can do:
matrix(rnorm(1000, rep(1:10, each = 100)), nrow = 10, byrow = TRUE)
Or modify akrun's answer by using sapply instead of lapply
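A minimal sketch of that sapply variant, with m and tm as illustrative names and an arbitrary seed. Because every call returns a vector of the same length, sapply simplifies the result to a matrix with one column per mean:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
values <- 1:10

# One column of 100 draws per mean
m <- sapply(values, function(mu) rnorm(100, mean = mu, sd = 1))
dim(m)                  # rows = observations, columns = means
round(colMeans(m), 1)   # each column mean should sit near its target

# Transpose if one row per mean is preferred
tm <- t(m)
dim(tm)
```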
An option is lapply from base R
lapply(1:10, function(i) rnorm(100, mean = i, sd = 1))
Or Map from base R:
Map(function(i) rnorm(100, mean = i, sd = 1), 1:10)
Using map, I can apply a function to each value of the vector values:
library(purrr)
values = 1:10
map_dfc(
.x = values,
.f = ~rnorm(100,mean = .x,sd = 1)
)
In this case the result is a 100x10 data.frame.

Set Acceptable Region for My Skewness Test in R

I am writing the function below to conduct a test of skewness for a vector of sample sizes (10, 20, 50, 100) with 1000 replicates each.
library(moments)
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
My conditions
I reject the null hypothesis when the statistic meets either of two (2) conditions:
less than -2
or greater than +2.
What I have
But in my R function I can only describe the second condition.
What I want
How do I include both the first and the second condition in my function?
Perhaps wrapping the statistic in abs would be the easiest way to cover both conditions:
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(abs(unlist(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
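One caveat with table(): it only reports outcomes that actually occur, so a sample size with no rejections yields a single count and the results no longer line up into a rectangular table. The sketch below counts both outcomes explicitly instead. To stay self-contained it uses a hypothetical helper skew, a moment-based stand-in (m3 / m2^1.5) that is assumed here to play the role of moments::skewness; out2 and sizes are likewise illustrative names.

```r
# Hypothetical stand-in for moments::skewness: third central moment
# divided by the second central moment raised to the power 1.5
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)  # arbitrary seed, only for reproducibility
sizes <- c(10, 20, 50, 100)

out2 <- t(sapply(sizes, function(n) {
  s <- replicate(1000, skew(rgamma(n = n, shape = 3, rate = 0.5)))
  reject <- abs(s) > 2  # TRUE when s < -2 or s > 2
  # Explicit counts keep both columns even when one outcome never occurs
  c(not_rejected = sum(!reject), rejected = sum(reject))
}))
row.names(out2) <- sizes
out2
```

Each row sums to the number of replicates, which gives a quick consistency check.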
