Problem with creating a lot of new vectors - R

I want to do the following:
Draw 50 numbers from a normal distribution with mean = 10 and standard deviation = 20, and repeat this 100 times.
For each draw I want to compute its standard deviation and arithmetic mean.
At the end I want to create a vector of length 100 containing the absolute value of the difference between the standard deviation and the arithmetic mean (i.e. I want to create some vector x such that x[i] = |a - b|, where a is the standard deviation of the 50 numbers in the i-th draw, and b is the mean of the 50 numbers in the i-th draw).
What I did:
Creating the 100 draws from the normal distribution above:
replicate(100, rnorm(50, 10, 20), simplify = FALSE)
Now I have a problem. I know that I can use the functions mean and sd to compute the arithmetic mean and standard deviation, but I have to treat the numbers from each draw as a vector. What I mean:
Numbers drawn in the first draw - vector 1
Numbers drawn in the second draw - vector 2
And so on.
Then I can compute their arithmetic mean and standard deviation.
Then I can compute |a - b| (defined above), and at the end I will create the vector with x[i] = |a - b|.
I have an idea but I don't know how to write it.

This is a matter of assigning the result of replicate to a variable (of class "list", since simplify = FALSE) and then sapply-ing the mean and sd functions over it.
set.seed(1234) # Make the results reproducible
repl <- replicate(100, rnorm(50, 10, 20), simplify = FALSE)
mu <- sapply(repl, mean)
s <- sapply(repl, sd)
D <- abs(s - mu)
head(D)
#[1] 16.761930 7.953432 6.833691 12.491605 5.490149 6.850794
A one-liner could be
D2 <- sapply(repl, function(x) abs(sd(x) - mean(x)))
identical(D, D2)
#[1] TRUE
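As a quick sanity check (my addition, not part of the original answer): the draws come from N(10, 20), so sd(x) should be near 20 and mean(x) near 10, which means the values in D should scatter around |20 - 10| = 10, as the head(D) output above suggests:
mean(D)  # roughly 10 for draws from N(10, 20)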

Related

How to create a loop to find the mean of values found by a nested loop?

Below I have code that finds the relative standard deviation of a bootstrap population, bootstrapped from sample sizes ranging between 2 and 30.
I would like to create a loop that runs this loop for 10 iterations, finding the mean standard deviation for each sample size (2 to 30) and putting it into a data frame, so that instead of the output being n = 2:30 with the corresponding standard deviation, the standard deviation is instead a mean standard deviation (from the 10 loops). I hope that makes sense.
library(boot)  # for boot(); 'Random', 'average', and base.csv come from the asker's environment

n_range <- 2:29
bResultsRan <- vector("double", 28)
set.seed(30)
for (b in n_range) {
  bRowsRan <- Random[sample(nrow(Random), b), ]
  base <- read.table("base.csv", header = TRUE, sep = ",")
  base$area <- 5036821
  base$quadrea <- base$area * 16
  bootRan <- boot(data = bRowsRan$count, average, R = 1000)
  base$data <- bootRan$t
  base$popsize <- base$data * base$quadrea
  bValue <- sd(base$popsize) / mean(base$popsize)
  bResultsRan[[b - 1]] <- bValue
}
BRRan <- data.frame(n = n_range, bResultsRan)
plot(BRRan)
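No answer is shown for this one, but a minimal sketch of the requested outer loop might look like the following. It assumes the boot package plus the Random data frame and average function from the question, and it uses the fact that multiplying the bootstrap values by a constant (area * 16) does not change sd/mean, so base.csv is not needed for the relative sd itself:
library(boot)  # assumed, as in the question

n_range <- 2:29
n_iter <- 10
# one row per sample size, one column per iteration
resultsRan <- matrix(NA_real_, nrow = length(n_range), ncol = n_iter)
set.seed(30)
for (j in seq_len(n_iter)) {
  for (b in n_range) {
    bRowsRan <- Random[sample(nrow(Random), b), ]  # 'Random' from the question
    bootRan <- boot(data = bRowsRan$count, average, R = 1000)
    # sd(c * x) / mean(c * x) == sd(x) / mean(x) for a positive constant c,
    # so the relative sd can be computed on the bootstrap values directly
    resultsRan[b - 1, j] <- sd(bootRan$t) / mean(bootRan$t)
  }
}
BRRan <- data.frame(n = n_range, meanRSD = rowMeans(resultsRan))
plot(BRRan)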

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the Gaussian distribution; draw 50 random samples of 5 individuals from this distribution with replacement; calculate the mean and standard deviation of x for each sample; create an interaction variable named y, calculated by multiplying the mean and standard deviation of x for each sample; and plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for (i in 1:N) {
  # draw 5 samples from the population (without replacement)
  # I assume you want to replace for each turn of taking 5.
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns. In each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized alternatives, so they are commonly avoided. In the case of your question it makes no noticeable difference; however, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
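As an aside (my addition, not part of the original answer), the same y values can be computed directly on the n-by-N matrix that replicate() returns, one sample per column, without building a data.frame first:
y <- apply(smpls, 2, function(x) mean(x) * sd(x))  # one value per sample (column)
hist(y)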
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Increase precision when standardizing test dataset

I am dealing with a dataset in R divided into train and test sets. I preprocess the data by centering and dividing by the standard deviation, and I want to store the mean and sd values of the training set to scale the test set using the same values. However, the precision obtained with the scale function is much better than when I use the colMeans and apply(x, 2, sd) functions.
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now if I compare the means of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained using scale has column means of essentially zero (around 10^-16), while the matrix obtained by subtracting the mean using colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?
The custom function has a bug in the matrix layout: a - colMeans(a) recycles the vector of column means down each column (R stores matrices in column-major order), so the wrong mean gets subtracted from most elements. You need to transpose the matrix with t() before subtracting the vector, then transpose it back. Try the following:
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))
colMeans(a_scale)
colMeans(a_scale_custom)
see also: How to divide each row of a matrix by elements of a vector in R
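Alternatively (a sketch of my own, not from the original answer), sweep() applies a vector along a chosen margin explicitly, which avoids the double transpose:
a_centered <- sweep(a, 2, colMeans(a), "-")                  # subtract each column's mean
a_scale_sweep <- sweep(a_centered, 2, apply(a, 2, sd), "/")  # divide by each column's sd
all.equal(a_scale_custom, a_scale_sweep)                     # should be TRUE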

Generate a matrix with certain values such that its standard deviation is 1?

I'm currently going through an 'Introduction to R' book and I am completely stuck at the following question:
Create a 5x5 matrix (M), all its entries drawn from the uniform distribution, with sd 1 and the mean being the column number of the element (so mean(M[, i]) == i and sd == 1).
I have to make use of the sapply() function.
I was thinking about something like this:
m <- matrix(runif(25), nrow = 5, ncol = 5)
sapply(m, function(x) sd(x) == 1)
But that part already doesn't work and I'm just stuck.
Help would be appreciated!
The mean can be set by the following:
my_uniform <- function(col_nbr) {
  runif(5, min = col_nbr - sqrt(12)/2, max = col_nbr + sqrt(12)/2)
}
M <- sapply(1:5, my_uniform)
This gives each column an expected sd of 1 and an expected mean equal to its column number. For a uniform distribution on (a, b), the formula for the mean is (a + b) / 2 and the formula for the sd is (b - a) / sqrt(12); here b - a = sqrt(12), so the sd is 1.
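A quick numerical check of this (my addition, using a larger sample so the estimates stabilize):
M_big <- sapply(1:5, function(j) runif(1e5, min = j - sqrt(12)/2, max = j + sqrt(12)/2))
round(colMeans(M_big), 2)      # close to 1 2 3 4 5
round(apply(M_big, 2, sd), 2)  # close to 1 1 1 1 1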
A uniform distribution can only simulate values within a range, each with equal probability; the expected mean, as n goes to infinity, is the midpoint between min and max.
The mean and the standard deviation are not parameters of runif(), so they cannot be set directly in the function call. What you can do is simulate such that the middle value (i.e. the mean) is the number you are expecting, although with a range of width 1 the standard deviation will be 1/sqrt(12), not 1:
set.seed(1)
numrow<-5
numcol<-5
Mat<-matrix(NA, nrow = numrow, ncol = numcol)
for (i in 1:numcol) {
  Mat[, i] <- runif(numrow, min = i - 0.5, max = i + 0.5)
}
Mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7655087 2.398390 2.705975 3.997699 5.434705
# [2,] 0.8721239 2.444675 2.676557 4.217619 4.712143
# [3,] 1.0728534 2.160798 3.187023 4.491906 5.151674
# [4,] 1.4082078 2.129114 2.884104 3.880035 4.625555
# [5,] 0.7016819 1.561786 3.269841 4.277445 4.767221
For the formulas of the expected mean and expected variance (and therefore the standard deviation), see https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
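A quick check of that claim (my addition): with a range of width 1, the column sds come out near 1/sqrt(12), not 1:
apply(Mat, 2, sd)  # each roughly 0.29 = 1/sqrt(12)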
Since the mean of a uniform distribution is 0.5 * (a + b), defining the upper limit like this results in each column's mean being the column number (note, though, that the sd of column i is then 2 * i / sqrt(12), not 1):
sapply(1:5, function(x){runif(5, min = 0, max = x*2)})
See this little Monte Carlo experiment:
mean(runif(50000, min = 0, max = 1*2))
You have to find the range (a, b) that fits each (mean, sd) pair first. The mean of a uniform distribution is
mu = (a + b) / 2, where the mu values here are indexed from 1:5.
The sd of a uniform distribution is (b - a) / sqrt(12).
The sd is fixed at 1, so use the sd equation to solve for b: b = a + sqrt(12).
Then plug b into the mu equation to solve for a: a = mu - sqrt(12)/2, and hence b = mu + sqrt(12)/2.
Now you have the a, b parameters of the uniform distribution.
The sapply call then looks like this:
z <- sapply(1:5, function(x) runif(5, min = x - sqrt(12)/2, max = x + sqrt(12)/2))
Running summary(z) will give you the output stats. Because of the small sample size, the sample means will be off. To test, change the runif sample size from 5 to 100000 and run summary(z) again; you will see that the values converge to the column means.

How dnorm() works with a vector of quantiles in an sapply loop

I am working through Statistical Rethinking by Richard McElreath and am confused by how some code he uses on p.84 works. The code uses Bayesian grid approximation to derive two model parameters, mu and sigma, to estimate the distribution of height in a sample.
Here is the code
First we make a list of candidate mu values
mu.list <- seq(from = 140, to = 160, length.out = 200)
Then a list of candidate sigma values
sigma.list <- seq(from = 4, to = 9, length.out = 200) # grid of candidate sigma values
Then we make a data frame with every possible combination of mu and sigma.
post <- expand.grid(mu = mu.list, sigma = sigma.list) # expand grid so every mu is matched with every sigma
Which is a dataset with 40000 rows.
nrow(post)
[1] 40000
Now say we have a sample of measured heights, containing 5 measurements.
heights <- c(151.76, 139.70, 136.52, 156.84, 145.41)
Now for the part I don't understand: a reasonably complex sapply loop that calculates a log-likelihood for each of the 40000 candidate combinations of mu and sigma, based on the sample of five height measurements.
postVec <- sapply(1:nrow(post), function (i) sum( dnorm(
heights, # vector of heights
mean = post$mu[i], # candidate mean height value from corresponding position in grid
sd = post$sigma[i], # candidate sigma value from corresponding position in the grid
log = TRUE) ) # make values logs
)
What we get from this loop is a vector 40000 values long, one value for each row of the post data frame.
length(postVec)
[1] 40000
What I don't understand is that if we take the dnorm() out of the loop and use single values for the mean and sd, but pass the same 5-value sample vector of heights in the first argument, like so
dnorm( heights, mean = 140, sd = 4, log = TRUE )
We get five values
[1] -6.627033 -2.308045 -2.683683 -11.167283 -3.219861
So my question is: why does the sapply loop passed into the postVec vector above yield 40000 values, not 5 x 40000 = 200000 values?
Why does the dnorm() function return five values outside the sapply() loop but (seemingly) only one value within it?
You are missing the sum before dnorm: in each of the 40000 cases it sums those 5 values so as to compute the log-likelihood of the whole heights vector rather than of the individual observations.
For instance, without sum for just two combinations we indeed have
sapply(1:2, function (i) dnorm(
heights,
mean = post$mu[i],
sd = post$sigma[i],
log = TRUE)
)
# [,1] [,2]
# [1,] -6.627033 -6.553479
# [2,] -2.308045 -2.310245
# [3,] -2.683683 -2.705858
# [4,] -11.167283 -11.061820
# [5,] -3.219861 -3.186194
while with sum we have column sums of the above matrix:
sapply(1:2, function (i) sum(dnorm(
heights,
mean = post$mu[i],
sd = post$sigma[i],
log = TRUE)
))
# [1] -26.00591 -25.81760
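As an aside (my addition, not from the original answer), the same 40000-long vector of log-likelihoods can be built without indexing into post by row, e.g. with mapply:
postVec2 <- mapply(function(m, s) sum(dnorm(heights, mean = m, sd = s, log = TRUE)),
                   post$mu, post$sigma)
all.equal(postVec, postVec2)  # should be TRUE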
