get means across samples from bootstrap - r

I want to get the means and sds across 20 bootstrap samples, but I'm not sure how to do that. My current code gives me the means within each sample, not across samples.
## create data
data <- round(rnorm(100, 5, 3))
data[1:10]
## obtain 20 bootstrap samples
## display the first of the bootstrap samples
resamples <- lapply(1:20, function(i) sample(data, replace = T))
resamples[1]
## calculate the mean of each bootstrap sample
r.mean <- sapply(resamples, mean)
r.mean
## calculate the sd of the distribution of means
sqrt(var(r.mean))
From the above code I get 20 means, one from each of the sampled data sets, and the sd of the distribution of those means. How can I get 100 means, one for each position across the 20 samples? And the same for the standard deviation?
Many thanks!!

Though the answer by @konvas is probably what you want, I would still take a look at the boot package (which ships with R) when it comes to bootstrapping.
See if the following example can get you closer to what you are trying to do.
set.seed(6929) # Make the results reproducible
data <- round(rnorm(100, 5, 3))
boot_mean <- function(data, indices) mean(data[indices])
boot_sd <- function(data, indices) sd(data[indices])
Runs <- 100
r.mean <- boot::boot(data, boot_mean, Runs)
r.sd <- boot::boot(data, boot_sd, Runs)
r.mean$t
r.sd$t
sqrt(var(r.mean$t))
# [,1]
#[1,] 0.3152989
sd(r.mean$t)
#[1] 0.3152989
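For reference, the boot object also stores the statistic computed on the original data in t0, and printing the object reports the bootstrap bias and standard error. A quick check using the objects created above (names as above):
r.mean$t0  # mean of the original sample
r.mean     # prints the original value, bias and std. error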
Now, see the distribution of the bootstrapped means and standard deviations.
op <- par(mfrow = c(1, 2))
hist(r.mean$t)
hist(r.sd$t)
par(op)

Make a matrix with your samples
mat <- do.call(rbind, resamples)
Then
rowMeans(mat)
will give you the "within sample" mean and
colMeans(mat)
the "across sample" mean. For other quantities, e.g. standard deviation you can use apply, e.g. apply(mat, 1, sd) or functions from the matrixStats package, e.g. matrixStats::rowSds(mat).

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project using a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following a Gaussian distribution. Draw 50 random samples of 5 individuals from this distribution with replacement. Calculate the mean and standard deviation of x for each sample. Create an interaction variable y by multiplying the mean and standard deviation of x for each sample. Plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 individuals from the population (without replacement within a draw)
# I assume the population is "reset" between each draw of 5;
# if you also want replacement within each draw of 5,
# simply set replace = TRUE below
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
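A quick check of this equivalence (a sketch; x1 and x2 are just illustrative names, and any fixed seed will do):
set.seed(1); x1 <- rnorm(3)
set.seed(1); x2 <- rnorm(3, mean = 0, sd = 1)
identical(x1, x2)  # TRUE, since mean = 0 and sd = 1 are the defaults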
Somewhat more efficient version
In R, explicit loops are often slower than vectorised code and are therefore commonly avoided. In the case of your question it does not make any noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
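For completeness, here is a compact variant of the same idea that skips the intermediate data.frame and computes y directly for each draw (a sketch, reusing stat_pop, N and n from above; y and s are just illustrative names):
y <- replicate(N, {
  s <- sample(stat_pop, n, replace = FALSE)
  mean(s) * sd(s)
})
hist(y)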
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate standard normal distribution R" to find out about the function rnorm()).
Run ?rnorm in the R console to get help on the function.

Increase precision when standardizing test dataset

I am dealing with a dataset in R divided into train and test. I preprocess the data by centering and dividing by the standard deviation, so I want to store the mean and sd values of the training set and scale the test set with the same values. However, the precision I obtain with the scale function is much better than when I use the colMeans and apply(x, 2, sd) functions.
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now If I compare the mean of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained using scale has column means of 0, while the matrix obtained by subtracting the mean with colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?
The custom function has a bug in the matrix layout: colMeans(a) is a length-3 vector, and R recycles it down the rows (column-major order), so the wrong mean is subtracted from most elements. You need to transpose the matrix with t() before subtracting the vector, then transpose the result back. Try the following:
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))
colMeans(a_scale)
colMeans(a_scale_custom)
see also: How to divide each row of a matrix by elements of a vector in R
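An alternative to the double transpose is sweep(), which applies a vector along a chosen margin of a matrix. A sketch on the same data (a_centered and a_scale_sweep are just illustrative names):
a_centered    <- sweep(a, 2, colMeans(a), "-")                # subtract each column's mean
a_scale_sweep <- sweep(a_centered, 2, apply(a, 2, sd), "/")   # divide by each column's sd
colMeans(a_scale_sweep)   # effectively zero, like scale()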

Calculate trend in array using linear regression

I have an array of dimensions c(54,71,360) which contains climatological data. The first two dimensions describe the grid of the region, while the third one serves as the time dimension. So in this case, there are 360 time steps (months).
Here is code to produce a sample array:
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation. This is why the linear regression of every grid cell against time needs to be performed. And this is exactly what I am struggling with.
To clearly show what I wish to do, here is an example with one grid cell, taken from the array as a vector of length 360:
grid_cell <- my_array[1,1,]
The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:
time_vec <- 1:360
Since I am only interested in the slope coefficient, it can be done this way:
trend <- lm(grid_cell ~ time_vec)$coefficients[2]
This leads to a value of 1.347029e-05 in this case.
I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71), meaning one trend value for each grid cell.
I tried the following, which did not work:
trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])
I receive the error message: Error in model.frame.default: variable lengths differ.
This is kind of surprising, since both the third dimension of the array and time_vec have length 360.
Anybody with an idea how to achieve this?
Of course I am also open for other solutions which may work totally differently, as long as they lead to the same result.
The problems with the code in the question are that
(1) the third argument of apply should be a function, whereas the question's code provides an expression instead of a function, and
(2) it applies lm many times. We show how to do it invoking lm only once, and in the second alternative we don't use lm at all; this gives one and two orders of magnitude speedups, as shown in the Performance section below.
It is easier to illustrate with smaller data, as shown in the Note at the end. To use it on your example, just replace dims with the commented-out line shown in the Note.
1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2]) times.
m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682
2) Alternatively, we can avoid lm entirely by using the closed-form expression for the slope coefficient, slope = cov(x, y) / var(x), like this:
m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
Performance
We see that the single lm call runs about 47x faster than apply, and eliminating lm runs about 230x faster than apply. Because apply is brutally slow on my laptop I only used 3 replications, but if you have a faster machine or more patience you can increase that. The main conclusions are unlikely to change much, though.
library(microbenchmark)
set.seed(5)
dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
microbenchmark(times = 3L,
apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
lm = { m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
},
cov = { m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
})
giving:
Unit: milliseconds
  expr        min         lq        mean     median         uq        max neval cld
 apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479     3   b
    lm   264.5883   275.7611   476.82077   286.9338   582.9370   878.9402     3  a
   cov    56.9120    57.8830    58.71573    58.8540    59.6176    60.3812     3  a
Note
Test data.
set.seed(5)
# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
There is an anonymous function missing in the question's regression code. Here I use the lambda shorthand \(x), introduced in R 4.1.0; on older versions, write function(x) instead.
I also use the recommended extractor coef.
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])

calculate variance of all samples in r studio

I have 30 random samples taken from a data set. I need to calculate sample mean and sample variance for each sample, and arrange them in a table with 3 columns titled "sample", "mean", and "variance".
My dataset is:
lab6data <- c(2,5,4,6,7,8,4,5,9,7,3,4,7,12,4,10,9,7,8,11,8,
6,13,9,6,7,4,5,2,3,10,13,4,12,9,6,7,3,4,2)
I made samples like:
observations <- matrix(lab6data, 30, 5)
and means for every sample separately by:
means <- rowMeans(observations)
Can you please help me to find the variance for every sample separately?
You can calculate the variance per row using apply:
apply(observations, 1, var)
Or use rowVars from the matrixStats package.
Note that matrixStats::rowVars will be much faster (see @HenrikB's comment below) than apply(..., 1, var), in the same way that rowMeans is faster than apply(..., 1, mean).
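To get the table with columns "sample", "mean" and "variance" that the question asks for, the pieces can be combined like this (a sketch using the observations matrix defined above; result is just an illustrative name):
result <- data.frame(sample   = seq_len(nrow(observations)),
                     mean     = rowMeans(observations),
                     variance = apply(observations, 1, var))
head(result)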
We can use pmap to apply the function on each row of the data.frame
library(purrr)
varS <- pmap_dbl(as.data.frame(observations), ~ var(c(...)))
cbind(observations, varS)
data
observations <- matrix(lab6data, 10, 4)

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix. In that case sapply(stand.musci, sd) loops over each individual element rather than over columns, and the sd of a single number is NA, which is why you see an NA for every data point.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
A z-transformation is quite easy to do manually.
See below, using a small vector of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
The logarithm transformation is done using the log() function, which defaults to the natural logarithm; for the base-10 log used in your original code, call log(data, 10).
log(data)
Hope this helps!