Monte Carlo simulations in list-columns in R and purrr

I have the following single case Monte Carlo simulation:
runs <- 100000
sim <- rnorm(n=runs,mean = 0,sd=1)
summary(sim)
But I would like to do the above with purrr in a list-column. For example, suppose I had the following data:
a <- tribble(
  ~group, ~mean, ~sd,  ~n,
  1,      10,    5,    1e5,
  2,      20,    6,    1e5,
  3,      30,    7,    1e5
)
How can I produce another column of, say, 1e5 rnorm draws per group? In the end I may want to make histograms or compute summary statistics from that list-column.
I have tried the following.
a %>%
  pmap_df(rnorm)
But got the following error, which I did not understand.
Error in .f(group = .l[[c(1L, i)]], mean = .l[[c(2L, i)]], sd = .l[[c(3L, : unused argument (group = .l[[c(1, i)]])
EDIT:
For some additional clarity, I have been successful at implementing what I want with lists, but not with data frames, as follows:
mu <- list(10, 20, 30)
sigma <- list(5, 6, 7)
n <- list(1e5, 1e5, 1e5)
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>%
  pmap(rnorm) %>%
  map(quantile)
where quantile is just an example of a function; I might choose to do mean or sd at a later date.
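A minimal sketch of one way to get the list-column (assuming the tribble a above and the tidyverse loaded; note that rnorm() has no group argument, which is what produces the "unused argument" error, so group must stay out of the pmap() call):
a_sim <- a %>%
  mutate(sim = pmap(list(n = n, mean = mean, sd = sd), rnorm))

# summary statistics per group from the list-column
a_sim %>%
  mutate(quantiles = map(sim, quantile))

# or unnest for histograms
a_sim %>%
  unnest(sim) %>%
  ggplot(aes(sim)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ group)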

Related

Plotting statistical power vs replicates and calculating mean of coefficients

I need to plot the statistical power vs. the number of replicates; in this case the number of replicates (n) is 3, but I can't figure out how to plot it.
This is what I have:
library(car)
n <- 3
nsims <- 1000
p = coef = vector()
for (i in 1:nsims) {
  treat <- rnorm(n, mean = 460, sd = 110)
  cont <- rnorm(n, mean = 415, sd = 110)
  df <- data.frame(
    y = c(treat, cont),
    x = rep(c("treat", "cont"), each = n)
  )
  model <- glm(y ~ x, data = df)
  p[i] = Anova(model)$P
  coef[i] = coef(model)[2]
}
hist(p, col = 'skyblue')
sum(p < 0.05)/nsims
Can someone help me plot this?
Also, I need to calculate the mean of the coefficients using only models where p < 0.05. This is simulating the following process: if you perform the experiment and p > 0.05, you report 'no effect', but if p < 0.05 you report 'significant effect'. But I'm not sure how to set that up from what I have.
Would I just do this?
mean(coef)
But I don't know how to include only those with p < 0.05.
Thank you!
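For the subsetting itself, a minimal base-R sketch using the p and coef vectors produced by the loop above: logical indexing keeps only the "significant" runs.
mean(coef[p < 0.05])  # mean coefficient among runs with p < 0.05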
Disclaimer: I spend a decent amount of time simulating experiments for work so I have strong opinions on this.
If this is everything you need because it's for a study assignment, then fine; but if you are planning to go further with this, I recommend adding the tidyverse to your arsenal.
Encapsulating functionality
First, let me put a single iteration into a function to decouple its logic from the result subsetting (the encapsulation).
sim <- function(n) {
  treat <- rnorm(n, 460, 110)
  cont <- rnorm(n, 415, 110)
  data <- data.frame(y = c(treat, cont), x = rep(c("treat", "cont"), each = n))
  model <- glm(y ~ x, data = data)
  p <- car::Anova(model)$P
  coef <- coef(model)[2]
  data.frame(n, p, coef)
}
Now we can simulate
nsims <- 1000
sims <- do.call(
  rbind,
  # We now vary the sample size parameter n instead of fixing it as in the original post.
  lapply(
    rep(c(3, 5, 10, 20, 50, 100), each = nsims),
    sim
  )
)
# Aggregations
power_smry <- aggregate(p ~ n, sims, function(x) {mean(x < 0.05)})
coef_smry <- aggregate(coef ~ n, sims[sims$p < 0.05, ], mean)
# Plots
plot(p ~ n, data = power_smry)
If you do this in the tidyverse, this is one possible approach:
crossing(
  n = c(3, 5, 10, 20, 50, 100)
  # Add any number of other inputs here that you want to explore (like lift).
) %>%
  rowwise() %>%
  # This looks complicated but will be less so if you have multiple
  # varying hyperparameters defined in crossing.
  mutate(results = list(bind_rows(rerun(nsims, sim(n))))) %>%
  pull(results) %>%
  bind_rows() %>%
  group_by(n) %>%
  # The more metrics you want to summarize in different ways, the easier this is compared to base R.
  summarize(
    power = mean(p < 0.05),
    coef = mean(coef[p < 0.05])
  )
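For the plotting part of the question, one possible ggplot2 sketch (assuming the sims data frame built with do.call(rbind, ...) above and the tidyverse loaded; power_by_n is just a name chosen here, not something from the original answer):
power_by_n <- sims %>%
  group_by(n) %>%
  summarize(power = mean(p < 0.05))

ggplot(power_by_n, aes(x = n, y = power)) +
  geom_point() +
  geom_line() +
  labs(x = "replicates per group (n)",
       y = "power (share of runs with p < 0.05)")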

Function that will generate iter samples of size n from a gamma distribution with shape parameter alpha and rate parameter beta

The function needs to return the mean and standard deviation of each sample.
This is what I have:
sample_gamma <- function(alpha, beta, n, iter) {
  mean = alpha/beta
  var = alpha/(beta)^2
  sd = sqrt(var)
  gamma = rgamma(n, shape = alpha, scale = 1/beta)
  sample_gamma = data.frame(mean = replicate(n = iter, expr = mean))
}
I'm very lost for this. I also need to create a data frame for this function.
Thank you for your time.
Edit:
sample_gamma <- function(alpha, beta, n, iter) {
  output <- rgamma(iter, alpha, 1/beta)
  output_1 <- matrix(output, ncol = iter)
  means <- apply(output_1, 2, mean)
  sds <- apply(output_1, 2, sd)
  mystats <- data.frame(means, sds)
  return(mystats)
}
This works except for the sds. It's returning NAs.
The NAs in your edit come from drawing only iter values with rgamma(iter, ...) and then reshaping them into a matrix with iter columns: each column holds a single observation, and sd() of a single observation is NA. As for the rest, it's not entirely clear to me what you want, but say you want to create 10 samples of size 1000 with alpha = 1 and beta = 2. Then you can create a single stream of rgamma realizations, reshape them into a matrix, get your stats with apply, and finally create a data frame from those vectors:
output <- rgamma(10*1000, 1, 1/2)
output <- matrix(output, ncol = 10)
means <- apply(output, 2, mean)
sds <- apply(output, 2, sd)
mystats <- data.frame(means, sds)
You could wrap your function around that code, replacing the hard values with parameters.
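A minimal sketch of that wrapper (mirroring the answer's rate = 1/beta parameterization; note that the original question used scale = 1/beta, i.e. beta as the rate, so check which one is intended):
sample_gamma <- function(alpha, beta, n, iter) {
  # draw all iter * n values in one stream, reshape to one column per sample
  output <- matrix(rgamma(n * iter, shape = alpha, rate = 1/beta), ncol = iter)
  data.frame(means = apply(output, 2, mean),
             sds   = apply(output, 2, sd))
}

# example: 10 samples of size 1000, as in the answer
sample_gamma(alpha = 1, beta = 2, n = 1000, iter = 10)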

Stacking lapply results

I am using the following code to generate data, and I am estimating regression models across a list of variables (covar1 and covar2). I have also created confidence intervals for the coefficients and merged them together.
I have been examining all sorts of examples here and on other sites, but I can't seem to accomplish what I want. I want to stack the results for each covar into a single data frame, labeling each cluster of results by the covar it is attributable to (i.e., "covar1" and "covar2"). Here is the code for generating data and results using lapply:
##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))
#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
#create covariates
set.seed(123)
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
data <- data.frame(cbind(outcome, running.var, treat.ind, covar1, covar2))
data$treat.ind <- as.factor(data$treat.ind)
#Bundle the covariates names together
covars <- c("covar1", "covar2")
#loop over them using a convenient feature of the "as.formula" function
models <- lapply(covars, function(x){
  regres <- lm(as.formula(paste(x, " ~ running.var + treat.ind", sep = "")), data = data)
  ci <- confint(regres, level = 0.95)
  regres_ci <- cbind(summary(regres)$coefficient, ci)
})
names(models) <- covars
print(models)
Any nudge in the right direction, or a link to a post I just haven't come across, is greatly appreciated.
You can use do.call where the second argument is a list:
do.call(rbind, models)
I made a (possible) improvement to your lapply function. This way you can save the estimated parameters and the variable names in a data.frame:
models <- lapply(covars, function(x){
  regres <- lm(as.formula(paste(x, " ~ running.var + treat.ind", sep = "")), data = data)
  ci <- confint(regres, level = 0.95)
  regres_ci <- data.frame(covar = x,
                          param = rownames(summary(regres)$coefficient),
                          summary(regres)$coefficient, ci)
})
do.call(rbind,models)
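A hedged tidyverse alternative to do.call(rbind, ...): once each list element is a data.frame (as in the improved lapply above), dplyr::bind_rows() stacks them, and its .id argument can label each block from the list names when the elements do not already carry a covar column:
library(dplyr)
names(models) <- covars                  # as in the question
stacked <- bind_rows(models, .id = "covar_label")
head(stacked)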

perform ttest on a data.frame

I am trying to perform a t-test (and get the p.value) on a data.frame: one column contains the groups (good vs bad) and the rest of the columns are numeric.
I generated a toy dataset here:
W <- rep(letters[seq( from = 1, to = 2)], 25)
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
test_data <- data.frame(W, X, Y, Z)
Then I transform the data into long format (melt() here presumably comes from the reshape2 package):
library(reshape2)
melt_testdata <- melt(test_data)
And performed the t.test
lapply(unique(melt_testdata$variable), function(x){
  Good <- subset(melt_testdata, W == 'a' & variable == x)$variable
  Bad <- subset(melt_testdata, W == 'b' & variable == x)$variable
  t.test(Good, Bad)$p.value
})
But instead of getting the t.test results, I got the following error messages:
Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
missing value where TRUE/FALSE needed In addition: Warning messages:
1: In mean.default(x) : argument is not numeric or logical: returning NA
2: In var(x) :
Calling var(x) on a factor x is deprecated and will become an error.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
3: In mean.default(y) : argument is not numeric or logical: returning NA
4: In var(y) :
Calling var(x) on a factor x is deprecated and will become an error.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
Then I tried to write a loop (my first time):
good <- matrix(,50)
bad <- matrix(,50)
cnt=3
out <- rep(0,cnt)
for (i in 2:4) {
  good[i] <- subset(test_data, W == 'a', select = test_data[, i])
  bad[i] <- subset(test_data, W == 'b', select = test_data[, i])
  out[i] <- print(t.test(good[[i]], bad[[i]])$p.value)
}
Still not getting p.values .......
This is the error message:
Error in x[j] : only 0's may be mixed with negative subscripts
I appreciate any help in any method, thanks!
(The errors above come from pulling out the factor column variable, rather than the numeric value column, in the subset() calls, so t.test is handed factors instead of numbers.) I think you'll have better luck with the formula method of t.test. Try:
library(broom)
library(magrittr)
library(dplyr)
W <- rep(letters[seq( from = 1, to = 2)], 25)
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
test_data <- data.frame(W, X, Y, Z)
lapply(test_data[c("X", "Y", "Z")],
function(x, y) t.test(x ~ y),
y = test_data[["W"]]) %>%
lapply(tidy) %>%
do.call("rbind", .) %>%
mutate(variable = rownames(.))
Edit:
With stricter adherence to the dplyr philosophy, you can use the following, which is actually a bit cleaner looking.
library(broom)
library(dplyr)
library(tidyr)
W <- rep(letters[seq( from = 1, to = 2)], 25)
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
test_data <- data.frame(W, X, Y, Z)
test_data %>%
  gather(variable, value, X:Z) %>%
  group_by(variable) %>%
  do(., tidy(t.test(value ~ W, data = .)))
Here is a solution using dplyr and the formula argument to t.test. do works on each group defined by the group_by. glance extracts values from the t.test output and makes them into a data.frame.
library(tidyverse)
library(broom)
melt_testdata %>%
  group_by(variable) %>%
  do(glance(t.test(value ~ W, data = .)))
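If only the p-values are needed (as in the original question), a small follow-up on the last pipeline, using the p.value column that broom::glance() returns for htest objects:
melt_testdata %>%
  group_by(variable) %>%
  do(glance(t.test(value ~ W, data = .))) %>%
  select(variable, p.value)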

Using Reduce() to calculate percentiles or variance in R

In the same way that I calculate the average of each position across the parallel vectors combined in the list, I would like to compute percentiles (0.05 and 0.95), the variance, or the standard error.
LOC_GI_1950a <- rnorm(100,5,2)
LOC_GI_1951a <- rnorm(100,7,3)
LOC_GI_1952a <- rnorm(100,1,2)
LOC_GI_1953a <- rnorm(100,2,3)
LOC_GI_1954a <- rnorm(100,5,2)
LOC_GI_1955a <- rnorm(100,7,3)
LOC_GI_1956a <- rnorm(100,8,2)
LOC_GI_1957a <- rnorm(100,2,5)
LOC_GI_1958a <- rnorm(100,5,1)
LOC_GI_1959a <- rnorm(100,7,1)
LOC_GI_1960a <- rnorm(100,1,2)
LOC_GI_1961a <- rnorm(100,6,3)
LOC_GI_Annuala <- list(LOC_GI_1950a,LOC_GI_1951a,LOC_GI_1952a,LOC_GI_1953a,LOC_GI_1954a,
LOC_GI_1955a,LOC_GI_1956a,LOC_GI_1957a,LOC_GI_1958a,LOC_GI_1959a,
LOC_GI_1960a,LOC_GI_1961a)
LOC_GI_AnnualAvga <- Reduce("+",LOC_GI_Annuala)/length(LOC_GI_Annuala)
We can convert the list to an array and then use apply to get the mean, var, etc. of each corresponding position (below, v1 stands for the list LOC_GI_Annuala; each length-100 vector is reshaped to 10 x 10, so the result comes back as a 10 x 10 matrix):
apply(array(unlist(v1), c(10, 10, 12)), c(1,2), mean)
apply(array(unlist(v1), c(10, 10, 12)), c(1,2), var)
As @RuiBarradas mentioned, quantile can also be used with apply:
c(apply(array(unlist(v1), c(10, 10, 12)), c(1,2), quantile, probs = 0.95))
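A hedged alternative that keeps the results as plain length-100 vectors (same positions as the Reduce() average in the question): bind the list into a 100 x 12 matrix and apply over rows.
m <- do.call(cbind, LOC_GI_Annuala)   # 100 positions x 12 years

pos_mean <- apply(m, 1, mean)         # matches the Reduce() average
pos_var  <- apply(m, 1, var)
pos_se   <- apply(m, 1, sd) / sqrt(ncol(m))
pos_q    <- apply(m, 1, quantile, probs = c(0.05, 0.95))  # 2 x 100 matrix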
