Calculate variance of all samples in RStudio

I have 30 random samples taken from a data set. I need to calculate sample mean and sample variance for each sample, and arrange them in a table with 3 columns titled "sample", "mean", and "variance".
My dataset is:
lab6data <- c(2,5,4,6,7,8,4,5,9,7,3,4,7,12,4,10,9,7,8,11,8,
6,13,9,6,7,4,5,2,3,10,13,4,12,9,6,7,3,4,2)
I made samples like:
observations <- matrix(lab6data, 30, 5)
and means for every sample separately by:
means <- rowMeans(observations)
Can you please help me to find the variance for every sample separately?

You can calculate the variance per row using apply:
apply(observations, 1, var)
Or use rowVars from the matrixStats package.
Note that matrixStats::rowVars will be much faster (see @HenrikB's comment below) than apply(..., 1, var), in the same way that rowMeans is faster than apply(..., 1, mean).
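For the three-column table the question asks for, here is a minimal sketch (assuming observations holds one sample per row; see the data note in the next answer for the corrected matrix shape):
# one row per sample: its index, mean, and variance
results <- data.frame(
  sample   = seq_len(nrow(observations)),
  mean     = rowMeans(observations),
  variance = apply(observations, 1, var)
)
results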

We can use pmap to apply the function on each row of the data.frame:
library(purrr)
varS <- pmap_dbl(as.data.frame(observations), ~ var(c(...)))
cbind(observations, varS)
Data (the 10 x 4 shape lets the 40 values fill the matrix exactly; the original matrix(lab6data, 30, 5) call asks for 150 cells and would recycle the data with a warning):
observations <- matrix(lab6data, 10, 4)

Related

Increase precision when standardizing test dataset

I am dealing with a dataset in R divided into train and test. I preprocess the data by centering it and dividing by the standard deviation, so I want to store the mean and sd values of the training set to scale the test set with the same values. However, the precision obtained when I use the scale function is much better than when I use colMeans and apply(x, 2, sd).
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now, if I compare the means of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained using scale has column means of effectively 0, while the matrix obtained by subtracting the mean using colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?
The custom code has a bug in how R recycles the vector: in (a - colMeans(a)), the length-3 vector of column means is recycled down each column (matrices are column-major), so it is not subtracted from the columns it belongs to. You need to transpose the matrix before subtracting the vector with t(), then transpose it back. Try the following:
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))
colMeans(a_scale)
colMeans(a_scale_custom)
see also: How to divide each row of a matrix by elements of a vector in R
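As an aside, base R's sweep() gives the same result without the double transpose (a sketch, not part of the original answer):
# subtract the column means along margin 2, then divide by the column sds
centered <- sweep(a, 2, colMeans(a), "-")
a_scale_sweep <- sweep(centered, 2, apply(a, 2, sd), "/")
colMeans(a_scale_sweep) # effectively zero, matching scale()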

get means across samples from bootstrap

I want to get the means and sds across 20 sampled data sets, but I'm not sure how to do that. My current code can give me the means within each sample, not across samples.
## create data
data <- round(rnorm(100, 5, 3))
data[1:10]
## obtain 20 bootstrap samples
## display the first of the bootstrap samples
resamples <- lapply(1:20, function(i) sample(data, replace = T))
resamples[1]
## calculate the means for each bootstrap sample
r.mean <- sapply(resamples, mean)
r.mean
## calculate the sd of the distribution of means
sqrt(var(r.mean))
From the above code, I got 20 means, one from each of the sampled data sets, and the sd of the distribution of the means. How can I get 100 means, each taken across the 20 samples, and the same for the standard deviation?
Many thanks!!
Though the answer by @konvas is probably what you want, I would still take a look at the base package boot when it comes to bootstrapping.
See if the following example can get you closer to what you are trying to do.
set.seed(6929) # Make the results reproducible
data <- round(rnorm(100, 5, 3))
boot_mean <- function(data, indices) mean(data[indices])
boot_sd <- function(data, indices) sd(data[indices])
Runs <- 100
r.mean <- boot::boot(data, boot_mean, Runs)
r.sd <- boot::boot(data, boot_sd, Runs)
r.mean$t
r.sd$t
sqrt(var(r.mean$t))
# [,1]
#[1,] 0.3152989
sd(r.mean$t)
#[1] 0.3152989
Now, see the distribution of the bootstrapped means and standard errors.
op <- par(mfrow = c(1, 2))
hist(r.mean$t)
hist(r.sd$t)
par(op)
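If you also want a confidence interval for the bootstrapped mean, boot pairs naturally with boot.ci; a small sketch (the percentile type is one choice among several):
# 95% percentile bootstrap confidence interval for the mean
boot::boot.ci(r.mean, type = "perc")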
Make a matrix with your samples
mat <- do.call(rbind, resamples)
Then
rowMeans(mat)
will give you the "within sample" mean and
colMeans(mat)
the "across sample" mean. For other quantities, e.g. standard deviation you can use apply, e.g. apply(mat, 1, sd) or functions from the matrixStats package, e.g. matrixStats::rowSds(mat).

How to get the intervals for ntile()

I was trying to figure out if there is a way to get the intervals that ntile() uses.
I have a sample that I want to use as a basis for getting the percentile values of a larger sample, and I was hoping to find a way to get the value of the intervals for when I use ntile().
Any enlightenment on this would be appreciated.
I really want to put this as a comment, but I still can't comment.
How about using quantile to generate the interval, like this:
# create fake data; 100 samples randomly picked from 1 to 500
fakeData <- runif(100, 1, 500)
# create percentile values; tweak the probs to specify the quantile that you want
x <- quantile(fakeData, probs = seq(0, 1, length.out = 100))
Then you can apply those intervals to the larger data set (e.g., using cut, which might give the same result as dplyr's ntile()).
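A small sketch of that second step (the four-bin split and the larger sample here are assumptions for illustration):
# quartile breaks from the reference sample
breaks <- quantile(fakeData, probs = seq(0, 1, length.out = 5))
largerData <- runif(1000, 1, 500)
# include.lowest keeps the minimum in bin 1; values outside the breaks become NA
bins <- cut(largerData, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(bins, useNA = "ifany")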

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
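For what it's worth, base scale() already operates column-wise on a matrix, so a third option (not in the original answer) is simply:
scaled <- scale(dat)  # centers and scales each column of the matrix
colMeans(scaled)      # effectively zero
apply(scaled, 2, sd)  # 1 1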
A z-transformation is quite easy to do manually.
See below, using a small vector of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- (data - mean(data)) / sd(data)
z
mean(z) # effectively 0; an exact mean(z) == 0 check can fail due to floating-point error
sd(z)   # 1; prefer all.equal() over == for floating-point comparisons
The logarithm transformation (assuming you mean a natural logarithm) is done using the log() function.
log(data)
Hope this helps!
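Tying this back to the question, a minimal sketch of the same pipeline without vegan (musci and the column range come from the question, which uses base-10 logs):
log.musci <- log10(musci[, 4:13]) # base-10 log, as in the question
stand.musci <- scale(log.musci)   # center and scale each column
colMeans(stand.musci)             # effectively 0
apply(stand.musci, 2, sd)         # 1 for every variable
Beware that any zero or negative raw value becomes -Inf or NaN under the log and will propagate into the means and sds.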

How to use weighted gini function within aggregate function?

I am trying to calculate the Gini coefficient with sample weights for different groups in my data. I prefer to use aggregate because I later use the output from aggregate to plot the coefficients. I found alternative ways to do it, but in those cases the output wasn't exactly what I needed.
library(reldist) #to get gini function
dat <- data.frame(country=rep(LETTERS, each=10)[1:50], replicate(3, sample(11, 10)), year=sample(c(1990:1994), 50, TRUE),wght=sample(c(1:5), 50, TRUE))
dat[51,] <- c(NA,11,2,6,1992,3) #add one more row with NA for country
gini(dat$X1) #usual gini for all
gini(dat$X1,weight=dat$wght) #gini with weight, that's what I actually need
print(a1<-aggregate( X1 ~ country+year, data=dat, FUN=gini))
#Works perfectly fine without weight.
But now, how can I specify the weight option within aggregate? I know there are other ways (as shown here):
print(b1<-by(dat,list(dat$country,dat$year), function(x)with(x,gini(x$X1,x$wght)))[])
#By function works with weight but now the output has NAs in it
print(s1<-sapply(split(dat, dat$country), function(x) gini(x$X1, x$wght)))
#This seems to be a good alternative but I couldn't find a way to split it by two variables
library(plyr)
print(p1<-ddply(dat,.(country,year),summarise, value=gini(X1,wght)))
#yet another alternative but now the output includes NAs for the missing country
If someone could show me way to use weighted gini function within aggregate that would be very helpful, as it produces the output exactly in the way I need. Otherwise, I guess I will work with one of the alternatives.
#using aggregate
aggregate(X1 ~ country + year, data = dat, FUN = gini, weights = dat$wght)
# gives a different answer than data.table and dplyr: the formula method subsets
# X1 for each group but passes the full-length weights vector unchanged to every
# gini() call, so the weights never line up with each group's values
#using data.table
library(data.table)
DT<-data.table(dat)
DT[,list(mygini=gini(X1,wght)),by=.(country,year)]
#Using dplyr
library(dplyr)
dat %>%
group_by(country,year)%>%
summarise(mygini=gini(X1,wght))
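If you do want aggregate itself to produce matched value/weight pairs, one workaround (a sketch, not from the original answer) is to aggregate row indices so each group subsets both columns together:
# aggregate over row indices so values and weights stay paired within each group
res <- aggregate(seq_len(nrow(dat)),
                 by = list(country = dat$country, year = dat$year),
                 FUN = function(i) gini(dat$X1[i], weights = dat$wght[i]))
names(res)[3] <- "mygini"
res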
