Increase precision when standardizing test dataset - r

I am working with a dataset in R that is split into train and test sets. I preprocess the data by centering and dividing by the standard deviation, so I want to store the mean and sd values of the training set and use them to scale the test set. However, the precision obtained with the scale function is much better than when I use the colMeans and apply(x, 2, sd) functions.
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now if I compare the column means of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained using scale has column means of essentially 0, while the matrix obtained by subtracting the mean using colMeans has errors on the order of 10^-3. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?

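The problem is not precision: the custom version misapplies R's recycling rule. Arithmetic between a matrix and a vector recycles the vector down the columns (column-major order), so a - colMeans(a) subtracts the three means from consecutive elements within each column rather than subtracting one mean per column. Transpose the matrix with t() before subtracting and dividing, then transpose the result back. Try the following: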
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))
colMeans(a_scale)
colMeans(a_scale_custom)
See also: How to divide each row of a matrix by elements of a vector in R
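As a minimal sketch of the recycling issue and of the original goal (reusing training statistics on a test set), the example below uses sweep() instead of the double transpose. The train/test matrices and the names mu and sigma are made up for illustration and are not from the question.

```r
# Toy matrix: a vector is recycled down the columns, not across them
m <- matrix(1:6, nrow = 2)              # 2 rows, 3 columns
wrong <- m - colMeans(m)                # recycled in column-major order
right <- sweep(m, 2, colMeans(m))       # subtracts one mean per column (FUN defaults to "-")

# Illustrative train/test split (assumed data)
set.seed(5)
train <- matrix(rnorm(30000, mean = 10, sd = 5), 10000, 3)
test  <- matrix(rnorm(3000,  mean = 10, sd = 5), 1000, 3)

# Store the training statistics once
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same centering and scaling to both sets
train_scaled <- sweep(sweep(train, 2, mu, "-"), 2, sigma, "/")
test_scaled  <- sweep(sweep(test,  2, mu, "-"), 2, sigma, "/")

# The training result agrees with scale() up to floating-point error
all.equal(train_scaled, scale(train), check.attributes = FALSE)
```

If using scale() is acceptable after all, scale(test, center = mu, scale = sigma) applies the stored values in one call.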

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project, using a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what its distribution might look like. Below is what I'm trying to do; any help is much appreciated.
1. Generate a set of 1000 individuals with a variable x following the Gaussian distribution.
2. Draw 50 random samples of 5 individuals from this distribution with replacement.
3. Calculate the mean and standard deviation of x for each sample.
4. Create an interaction variable named y, calculated by multiplying the mean and standard deviation of x for each sample.
5. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
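A quick way to convince yourself of this is to reset the seed before each call and compare the results. The variable names below are only for illustration.

```r
# Same seed before each call, so the three calls draw the same numbers
set.seed(1); named      <- rnorm(5, mean = 0, sd = 1)
set.seed(1); positional <- rnorm(5, 0, 1)
set.seed(1); defaults   <- rnorm(5)

# TRUE when all three calls are equivalent
identical(named, positional) && identical(named, defaults)
```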
Somewhat more efficient version
In R, explicit loops that fill an object row by row are often slower than vectorized alternatives. For a problem of this size it makes no noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
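If "with replacement" is meant within each sample of 5, a compact sketch could compute y directly inside replicate(). This is one possible reading of the question, not necessarily what was intended; the seed is arbitrary.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
stat_pop <- rnorm(1000)           # population of 1000 individuals
N <- 50                           # number of samples
n <- 5                            # individuals per sample

# For each of the N samples: draw n values with replacement,
# then multiply the sample mean by the sample standard deviation
y <- replicate(N, {
  smpl <- sample(stat_pop, n, replace = TRUE)
  mean(smpl) * sd(smpl)
})

hist(y)
```

With only 50 samples the histogram will be noisy; increasing N gives a smoother picture of the distribution.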
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Problem with creating a lot of new vectors

I want to do the following:
Draw 50 numbers from a normal distribution with
mean = 10 and standard deviation = 20, and repeat this 100 times.
For each draw, I want to compute its standard deviation and arithmetic mean.
At the end, I want to create a vector of length 100 containing the absolute value of the difference between the standard deviation and the arithmetic mean, i.e., a vector x with x[i] = |a - b|, where a is the standard deviation of the 50 numbers in the i-th draw and b is the mean of the 50 numbers in the i-th draw.
What I did:
Creating the 100 draws from the normal distribution above:
replicate(100, rnorm(50, 10, 20), simplify = FALSE)
Now I have a problem. I know that I can use the functions "mean" and "sd" to compute the arithmetic mean and standard deviation, but I have to define the numbers I drew as vectors. What I mean:
Numbers from the first draw - vector 1
Numbers from the second draw - vector 2
And so on.
Then I can compute their arithmetic mean and standard deviation.
Then I can compute |a - b| (defined above), and at the end create the vector with x[i] = |a - b|.
I have an idea, but I don't know how to write it.
This is a matter of assigning the result of replicate to a variable (of class "list", since simplify = FALSE) and then sapply the mean and sd functions.
set.seed(1234) # Make the results reproducible
repl <- replicate(100, rnorm(50, 10, 20), simplify = FALSE)
mu <- sapply(repl, mean)
s <- sapply(repl, sd)
D <- abs(s - mu)
head(D)
#[1] 16.761930 7.953432 6.833691 12.491605 5.490149 6.850794
A one-liner could be
D2 <- sapply(repl, function(x) abs(sd(x) - mean(x)))
identical(D, D2)
#[1] TRUE
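As an alternative sketch, leaving simplify at its default makes replicate return a 50 x 100 matrix with one draw per column, so the column-wise helpers can be used. The names m and D3 are made up for this example.

```r
set.seed(1234)                           # same seed as above
m <- replicate(100, rnorm(50, 10, 20))   # 50 x 100 matrix, one draw per column

# Column-wise standard deviation minus column-wise mean, in absolute value
D3 <- abs(apply(m, 2, sd) - colMeans(m))

length(D3)   # one value per draw
```

Since the random numbers are generated in the same order, D3 should agree with D up to floating-point error.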

Convert uniform draws to normal distributions with known mean and std in R

I am using the sensitivity package in R. In particular, I want to use sobolroalhs, as it uses a sampling procedure for inputs that allows for evaluating models with a large number of parameters. The function samples all inputs uniformly on [0,1]. The documentation states that other distributions need to be obtained as follows:
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
Some of my input parameters should be sampled from a normal distribution with a non-zero mean and a given standard deviation. For others, I want to sample uniformly within a defined range (e.g. [0.03,0.07] instead of [0,1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure breaks the sampling design of the package, and it produced odd results for the sensitivity indices. Hence, I think I need to keep the uniform draws from the sobolroalhs function and use each sampled value in [0, 1] when drawing from the desired distribution (as some kind of density draw, I think?). Does this make sense to anyone, and does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
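The following sketch illustrates both routes without needing the sensitivity package. Here u is a stand-in for one column of x$X (an assumption for the example), and the target parameters are the ones mentioned in the question.

```r
set.seed(1)
u <- runif(1000)                              # stand-in for a uniform [0,1] design column

# Quantile-function route
x_norm <- qnorm(u, mean = 579, sd = 21)       # normal with mean 579, sd 21
x_unif <- qunif(u, min = 0.03, max = 0.07)    # uniform on [0.03, 0.07]

# Manual location-scale route gives the same values up to rounding
all.equal(x_norm, 579 + 21 * qnorm(u))
all.equal(x_unif, 0.03 + 0.04 * u)
range(x_unif)                                 # stays inside [0.03, 0.07]
```

Because each transformed value is a monotone function of the original uniform draw, the structure of the design is carried over rather than replaced by fresh random numbers.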

R boxplot with already computed mean, confidence intervals and min max

I am trying to generate a boxplot in R using already computed confidence intervals, minima and maxima. For times 1,2,3,4,5 (x-axis), I have MN, an array of 5 elements, each giving the mean at one time point. I also have CI1, CI2, MINIM, and MAXM, each an array of 5 elements, one for each time step, representing the upper CI, lower CI, minimum and maximum.
I want to draw 5 boxes, one at each time step.
I have tried the usual boxplot function, but I could not get it to work with already computed CIs, minima and maxima.
It would be great if the method worked with the normal plot function, though ggplot would be fine too.
Since you have not posted data, I will use the builtin iris dataset, keeping the first 4 columns.
data(iris)
iris2 <- iris[-5]
The function boxplot computes the statistics it uses and then calls bxp to do the drawing, passing it those computed values.
If you want a different set of statistics you will have to compute them and pass them to bxp manually.
I am assuming that by CI you mean normal 95% confidence intervals. For that you need to compute the standard errors and the mean values first.
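s <- apply(iris2, 2, sd) / sqrt(nrow(iris2)) # standard errors of the column means
mn <- colMeans(iris2)
ci1 <- mn - qnorm(0.975)*s # lower limit of the two-sided 95% interval
ci2 <- mn + qnorm(0.975)*s # upper limit of the two-sided 95% interval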
minm <- apply(iris2, 2, min)
maxm <- apply(iris2, 2, max)
Now have boxplot create the data structure used by bxp, a list whose stats component is a matrix with one column per box.
bp <- boxplot(iris2, plot = FALSE)
Then overwrite the stats matrix with the values computed earlier, one row per statistic.
bp$stats <- matrix(c(
minm,
ci1,
mn,
ci2,
maxm
), nrow = 5, byrow = TRUE)
Finally, plot it.
bxp(bp)
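If the summaries are already available, as in the question, a sketch along the following lines builds the list for bxp by hand instead of calling boxplot first. The numbers are invented purely for illustration, and the sketch assumes bxp only needs the stats, n and names components for a plain plot without outliers or notches.

```r
# Hypothetical precomputed summaries for time points 1..5 (made-up numbers)
MINIM <- c(1.0, 1.2, 0.9, 1.5, 1.1)   # minimum
CI2   <- c(2.0, 2.3, 2.1, 2.6, 2.2)   # lower CI, as named in the question
MN    <- c(2.5, 2.8, 2.6, 3.1, 2.7)   # mean
CI1   <- c(3.0, 3.3, 3.1, 3.6, 3.2)   # upper CI, as named in the question
MAXM  <- c(4.0, 4.4, 4.2, 4.9, 4.3)   # maximum

# Rows of stats: lower whisker, box bottom, middle line, box top, upper whisker
z <- list(
  stats = rbind(MINIM, CI2, MN, CI1, MAXM),
  n     = rep(1, 5),                  # placeholder group sizes, one per box
  names = as.character(1:5)           # x-axis labels
)

bxp(z, xlab = "Time", ylab = "Value")
```

Note that the middle line then shows the mean rather than the median, so the figure caption should say so.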

Calculate each element in a matrix based on elements in two other matrices

I'm trying to populate a matrix with i x j entries from a random normal distribution based on the means and standard deviations stored in two other matrices. Is there a way to use rnorm pulling each entry from the two "data" matrices (the two matrices with the means and standard deviations) without using a loop?
Sure, just do it:
means <- matrix(1:4, 2, 2)
sds <- matrix((1:4)/1000, 2, 2)
result <- matrix(rnorm(4, mean = means, sd = sds), 2, 2)
or (following the comment from Frank below)
result <- array(rnorm(length(means), mean = means, sd = sds),
dim = dim(means))
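To check that each entry really uses the mean and standard deviation at the same position, one can make the standard deviations tiny so that every draw lands next to its own mean. The matrices below are made up for this sketch.

```r
set.seed(1)
means <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)   # 2 x 3 matrix of means
sds   <- matrix(0.001, nrow = 2, ncol = 3)             # very small spread

# rnorm walks through means and sds in column-major order,
# and array() refills the result in the same order
result <- array(rnorm(length(means), mean = means, sd = sds),
                dim = dim(means))

all(round(result) == means)   # TRUE: each draw sits next to its own mean
```

This works because rnorm is vectorized over mean and sd, and both matrices are read and refilled in the same column-major order.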
