Simulating a t-test in R

I am looking for a way to simulate the power of a simple t-test for different sample sizes. My idea is to generate 400 random samples from a normal distribution, each with mean 5 and variance 1, and perform a t-test on each one against the hypothesis that the true mean is 4, i.e. the t-test would take the form:
t=(mean(x)-4)*sqrt(n)/sd(x) # for each sample x which consists of n observations.
For comparison, I would like the first 100 samples to consist of 10 observations each, the next 100 of 100 observations, the next 100 of 1000, and the last 100 of 5000, which I think is the upper limit. A t-test then has to be performed on every sample.
Lastly, I would like to see, for each sample group (call them n10, n100, n1000 and n5000, according to how many observations they comprise), the percentage of samples on which my (false) hypothesis is rejected.
Could you please help me write the corresponding R code? I know the individual commands but have trouble putting them all together. This is a nice exercise, and hopefully I will then be able to modify it a bit and use it for other purposes as well.
Thank you in advance.

Here's a one-liner for 400 t.test() calls, each with n = 10:
simulations <- replicate(400, t.test(rnorm(10, mean=5, sd=1), mu=4),
                         simplify=FALSE)
Then you can analyze it:
table(sapply(simulations, "[[", "p.value") < .05)
FALSE  TRUE
   75   325
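If you just want the rejection proportion rather than counts, take the mean of the logical vector:
mean(sapply(simulations, "[[", "p.value") < .05)
# 325/400 = 0.8125 for the run above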

I'm still learning R, too, so handle with care:
n <- 5
N <- 100
samplesizes <- as.integer(10^(1:n))
set.seed(1)
# generate samples (mean 4, tested against mu = 5 below; this mirrors the
# question's mean-5-vs-mu-4 setup and has the same power)
samples <- replicate(N, mapply(rnorm, samplesizes, mean=4, sd=sqrt(1)))
# perform t-tests
t.tests <- lapply(samples, function(x) t.test(x, mu=5, alternative="two.sided"))
# get p-values
t.test.pvalues <- sapply(t.tests, function(x) x$p.value)
rejected <- t.test.pvalues < .05  # reject the null when the p-value is below .05
sampleIndices <- rep(1:n, N)
res <- aggregate(rejected, list(sample=sampleIndices), FUN=function(x) sum(x)/length(x) )
names(res)[2] <- "percRejected"
print(res, row.names=F)
# sample percRejected
# 1 0.84
# 2 1.00
# 3 1.00
# 4 1.00
# 5 1.00
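For the exact group sizes in the question (100 samples each of n = 10, 100, 1000 and 5000, drawn with mean 5 and tested against mu = 4), a sketch along the same lines (again, handle with care) might be:
sizes <- c(10, 100, 1000, 5000)
reps  <- 100
set.seed(1)
power_est <- sapply(sizes, function(n) {
  pvals <- replicate(reps, t.test(rnorm(n, mean = 5, sd = 1), mu = 4)$p.value)
  mean(pvals < .05)   # share of samples on which the (false) null mu = 4 is rejected
})
names(power_est) <- paste0("n", sizes)
power_est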

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R, and we're studying how to use R for paired t-tests. I have come across questions where we're given two sets of data and are asked to find whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test whether the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given two datasets of before and after weights for 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3 kg against the alternative that the mean difference in weight is less than 3 kg? Is there a way to do this? I would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you need more theoretical backing. Is it OK to add/subtract the 3 kg from either x or y and then use the t-test to check for similarity? I think this would at least tell you which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
# [1] 0.8279801
t_dec$p.value
# [1] 0.0956033
# the test with the higher p-value has the closer distribution, so
# +3 kg is more likely than -3 kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).
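If the difference is taken as after minus before, one reading of the question (H0: mean gain >= 3 kg against H1: mean gain < 3 kg) can be written as a paired one-sided test with mu=3; the data below are made up purely for illustration:
# Illustrative numbers only; after = before + roughly 3 kg gain
set.seed(1)
before <- round(rnorm(10, 70, 10), 1)
after  <- before + rnorm(10, mean = 3, sd = 2)
# H0: mean(after - before) >= 3  vs  H1: mean(after - before) < 3
t.test(after, before, mu = 3, alternative = "less", paired = TRUE)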

get means across samples from bootstrap

I want to get the means and SDs across 20 bootstrap samples, but I'm not sure how to do that. My current code can give me the means within each sample, not across samples.
## create data
data <- round(rnorm(100, 5, 3))
data[1:10]
## obtain 20 bootstrap samples
## display the first of the bootstrap samples
resamples <- lapply(1:20, function(i) sample(data, replace = T))
resamples[1]
## calculate the means for each bootstrap sample
r.mean <- sapply(resamples, mean)
r.mean
## calculate the sd of the distribution of means
sqrt(var(r.mean))
From the above code, I got 20 means, one for each of the resampled datasets, and the SD of the distribution of those means. How can I get 100 means, each taken across the 20 samples? And the same for the standard deviation?
Many thanks!!
Though the answer by @konvas is probably what you want, I would still take a look at the base package boot when it comes to bootstrapping.
See if the following example can get you closer to what you are trying to do.
set.seed(6929) # Make the results reproducible
data <- round(rnorm(100, 5, 3))
boot_mean <- function(data, indices) mean(data[indices])
boot_sd <- function(data, indices) sd(data[indices])
Runs <- 100
r.mean <- boot::boot(data, boot_mean, Runs)
r.sd <- boot::boot(data, boot_sd, Runs)
r.mean$t
r.sd$t
sqrt(var(r.mean$t))
# [,1]
#[1,] 0.3152989
sd(r.mean$t)
#[1] 0.3152989
Now look at the distributions of the bootstrapped means and standard deviations.
op <- par(mfrow = c(1, 2))
hist(r.mean$t)
hist(r.sd$t)
par(op)
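As a rough sanity check, the bootstrap standard error of the mean should land close to the usual plug-in estimate sd(data)/sqrt(n):
sd(data) / sqrt(length(data))   # compare with sd(r.mean$t) above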
Make a matrix with your samples
mat <- do.call(rbind, resamples)
Then
rowMeans(mat)
will give you the "within sample" mean and
colMeans(mat)
the "across sample" mean. For other quantities, e.g. standard deviation you can use apply, e.g. apply(mat, 1, sd) or functions from the matrixStats package, e.g. matrixStats::rowSds(mat).

Speed up Simulation in R with Code Optimization

The generic version of what I am trying to do is to conduct a simulation study where I manipulate a few variables to see how that impacts a result. I'm having some speed issues with R. The latest simulation worked with a few iterations (10 per experiment). However, when I moved to a large scale (10k per experiment) version, the simulation has been running for 14 hours (and is still running).
Below is the code (with comments) that I am running. Being a rookie with R, I am struggling to optimize the simulation to be efficient. My hope is to learn from the comments and suggestions provided here, optimize this code, and reuse those suggestions for future simulation studies.
Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame to store my results (called Results). I loop over three variables: Effect size, sample size, and iterations (10k).
Within the loops, I initialize four NULL components: p.test, p.rep, d.test, and d.rep. The former two capture the p-value of the initial t-test and the p-value of the replication (replicated under similar conditions). The latter two calculate the effect size (Cohen's d).
I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and pass the result to R's t-test function (a paired-samples t-test). I store the results in a list called Trial and rbind it to the Results data frame. This process is repeated 10k times until completion.
# Set Simulation Parameters
## Effect Sizes (ES is equal to mean difference when SD equals Variance equals 1)
effect_size_range <- seq(0, 2, .1) ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10) ## SS
## Iterations for each ES-SS Combination
iter <- 10000
# Initialize the Vector of Results
Results <- data.frame()
# Set Random Seed
set.seed(12)
# Loop over the Different ESs
for (ES in effect_size_range) {
  # Loop over the Different Sample Sizes
  for (SS in sample_size_range) {
    # Create p-value Vectors
    p.test <- NULL
    p.rep <- NULL
    d.test <- NULL
    d.rep <- NULL
    # Loop over the iterations
    for (i in 1:iter) {
      # Generate Test Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
      # Generate Replication Data
      DVcontrol <- rnorm(iter, mean=0, sd=1)
      DVexperiment <- rnorm(iter, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.rep[i] <- mean(DVdiff) / sd(DVdiff)
    }
    # Results
    Trial <- list(ES=ES, SS=SS,
                  d.test=mean(d.test), d.rep=mean(d.rep),
                  p.test=mean(p.test), p.rep=mean(p.rep),
                  r=cor(p.test, p.rep, method="kendall"),
                  r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method="kendall"))
    Results <- rbind(Results, Trial)
  }
}
Thanks in advance for your comments and suggestions,
Josh
The general approach to optimization is to run a profiler to determine what portion of the code the interpreter spends the most time in, and then to optimize that portion. Let's say your code resides in a file called test.R. In R, you can profile it by running the following sequence of commands:
Rprof() ## Start the profiler
source( "test.R" ) ## Run the code
Rprof( NULL ) ## Stop the profiler
summaryRprof() ## Display the results
(Note that these commands will generate a file Rprof.out in the directory of your R session.)
If we run the profiler on your code (with iter <- 10, rather than iter <- 10000), we get the following profile:
# $by.self
#                  self.time self.pct total.time total.pct
# "rnorm"               1.56    24.53       1.56     24.53
# "t.test.default"      0.66    10.38       2.74     43.08
# "stopifnot"           0.32     5.03       0.86     13.52
# "rbind"               0.32     5.03       0.52      8.18
# "pmatch"              0.30     4.72       0.34      5.35
# "mean"                0.26     4.09       0.42      6.60
# "var"                 0.24     3.77       1.38     21.70
From here, we observe that rnorm and t.test are your most expensive operations (shouldn't really be a surprise as these are in your inner-most loop).
Once you have figured out where the expensive function calls are, the actual optimization consists of two steps:
Optimize the function, and/or
Optimize the number of times the function is called.
Since t.test and rnorm are built-in R functions, your only option for Step 1 above is to look for alternative packages that may have faster implementations of sampling from the normal distribution and/or running multiple t tests. Step 2 is really about restructuring your code in a way that does not recompute the same thing multiple times. For example, the following lines of code do not depend on i:
# Generate Test Data
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
Does it make sense to move these outside the loop, or do you really need a new sample of your test data for each different value of i?
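As one example of Step 2, the inner loop can be restructured to avoid calling t.test() altogether: draw all iter samples for a given ES/SS combination at once and compute the one-sided p-values with vectorized arithmetic. The function below is only a sketch of that idea, not the original code:
# Sketch: vectorized replacement for the inner loop of one ES/SS combination.
# Each column of `diffs` is one simulated experiment's vector of differences;
# the difference of two independent N(., 1) draws has sd sqrt(2).
one_sided_pvals <- function(iter, SS, ES) {
  diffs <- matrix(rnorm(iter * SS, mean = ES, sd = sqrt(2)), nrow = SS)
  m <- colMeans(diffs)
  s <- apply(diffs, 2, sd)                       # or matrixStats::colSds(diffs)
  tstat <- m / (s / sqrt(SS))
  pt(tstat, df = SS - 1, lower.tail = FALSE)     # alternative = "greater" p-values
}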

Grouping in R changes mean substantially

I have a file containing the predictions for two models (A and B) on a binary classification problem. Now I'd like to understand how well they predict the observations they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are. Each of these groups should have an identical number of observations. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups=100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file="conf.csv", sep=',', header=T)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels=FALSE, ordered_result=TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels=FALSE, ordered_result=TRUE)
# Test of original mean.
mean(conf$model_a_acc) # 0.78
mean(conf$model_b_acc) # 0.777
# Test for mean in aggregated data. They should be similar.
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean))) # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN=mean))) # 0.7526
Edited to clarify slightly.
table(conf$conf_group_model_a)
   1    2    3    4    5    6    7    8    9   10
2515 2628 2471 2128 1792 1321  980  627  398  140
The groups you are using are unbalanced. Taking the mean of each of those groups with tapply is fine; however, simply taking the unweighted mean of those group means afterwards is not the way to go.
You need to weight the group means by the group sizes if you want to follow the process you have.
Something quick and dirty like this:
sum(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean) * table(conf$conf_group_model_a) / nrow(conf))
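Equivalently, base R's weighted.mean() does the same weighting:
grp_means <- tapply(conf$model_a_acc, conf$conf_group_model_a, mean)
grp_sizes <- as.numeric(table(conf$conf_group_model_a))
weighted.mean(grp_means, w = grp_sizes)   # recovers mean(conf$model_a_acc)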

R: Distribution of Random Samples vs. 1 Random Sample

I have a question about random sampling.
Are the two following results (A and B) statistically the same?
nobs <- 1000
A <- rt(n=nobs, df=3, ncp=0)
simulations <- 50
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3, ncp=0) ))
I thought they would be, but now I've been going back and forth.
Any help would be appreciated.
Thanks
With some small changes, you can even make them numerically equal. You only need to seed the RNG and omit the ncp parameter, relying on its default value (0) instead:
nobs <- 1000
set.seed(42)
A <- rt(n=nobs, df=3)
simulations <- 50
set.seed(42)
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3) ))
all.equal(A, B)
#[1] TRUE
Why don't you get equal results when you specify ncp=0?
Because then rt assumes that you actually want a non-central t-distribution, and the values are calculated as rnorm(n, ncp)/sqrt(rchisq(n, df)/df). That means that when creating 1000 values at once, rnorm is called once and rchisq is called once afterwards. If you instead create 50 batches of 20 values, the calls to these RNGs alternate, so the RNG states for the rnorm and rchisq calls differ from the first case.
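To see this directly, repeat the comparison with ncp=0 spelled out; following the explanation above, the chunked version should no longer match:
set.seed(42)
A2 <- rt(n = 1000, df = 3, ncp = 0)
set.seed(42)
B2 <- unlist(lapply(rep.int(20, 50), function(y) rt(n = y, df = 3, ncp = 0)))
isTRUE(all.equal(A2, B2))   # expected FALSE: the rnorm/rchisq calls interleave differently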
