Call sapply on vector of densities? - r

I'm generating a set of densities using functions like dweibull and dunif. I'd like to be able to extract their moments using sapply. When I call the mean function on an individual distribution, I get the mean value as expected. But when I do this with sapply, it instead seems to call mean() on every individual value in the density (i.e. for every value of x I supplied to dweibull or dunif). Is there a way to get the correct behavior here? Thank you!
weib <- dweibull(seq(from=0, to=25, length=100), shape=1, scale=5)
unif <- dunif(seq(from=1, to=10, length=100), min=1, max=10)
mean(weib) #Works!
dists <- c(weib, unif)
means <- sapply(dists, mean) #Returns a very long list of values, not the mean of weib and unif

You should store the data in a list. c(weib, unif) creates a single combined vector, so sapply(dists, mean) returns the mean of each single number, i.e. the number itself.
dists <- list(weib, unif)
means <- sapply(dists, mean)
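To make the difference concrete, here is a minimal sketch that compares the two containers. It assumes the same weib and unif vectors as in the question; the names flat and dists and the use of vapply are choices made for this illustration only.

```r
# Same densities as in the question
weib <- dweibull(seq(from = 0, to = 25, length.out = 100), shape = 1, scale = 5)
unif <- dunif(seq(from = 1, to = 10, length.out = 100), min = 1, max = 10)

# c() flattens both vectors into one numeric vector of length 200
flat <- c(weib, unif)
length(flat)

# list() keeps the two vectors as separate elements
dists <- list(weib = weib, unif = unif)
length(dists)

# vapply behaves like sapply but checks that each result is one number
means <- vapply(dists, mean, numeric(1))
means
```

Naming the list elements carries the names through to the result, which makes the output easier to read.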

You can also use a data frame:
dists <- data.frame(weib, unif)
sapply(dists, mean)
weib unif
0.04034828 0.11111111
lapply(dists,mean)
$weib
[1] 0.04034828
$unif
[1] 0.1111111
apply(dists, 2,mean)
weib unif
0.04034828 0.11111111

We may also use map_dbl from the purrr package:
dists <- list(weib, unif)
library(purrr)
means <- map_dbl(dists, mean)
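Note that mean(weib) averages the density values on the grid, which is not the same thing as the mean of the distribution itself. If the first moment of the distribution is what is needed, one option is to integrate x * f(x) numerically. The following is a rough sketch under that assumption, using the parameters from the question; the names dens, supports and dist_means are made up for this illustration.

```r
# Density functions stored in a list (hypothetical names), one per distribution
dens <- list(
  weib = function(x) dweibull(x, shape = 1, scale = 5),
  unif = function(x) dunif(x, min = 1, max = 10)
)

# Integration limits for each density
supports <- list(weib = c(0, Inf), unif = c(1, 10))

# First moment: integral of x * f(x) over the support
dist_means <- sapply(names(dens), function(nm) {
  integrate(function(x) x * dens[[nm]](x),
            lower = supports[[nm]][1],
            upper = supports[[nm]][2])$value
})
dist_means
```

For these parameters the analytical means are 5 (a Weibull with shape 1 is an exponential whose mean equals the scale) and 5.5, so the numerical result can be checked by hand.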

Related

Summing N normal distributions

I am trying to determine the distribution of the sum of N univariate distributions.
Can you suggest a function that allows me to dynamically input any N number of distributions?
This works:
library(distr)
var1 <- Norm(mean=14, sd=1)
var2 <- Norm(mean=10, sd=1)
var3 <- Norm(mean=9, sd=1)
conv <- convpow(var1+var2+var3,1)
This (obviously) doesn't work since pasting the list together creates a messy character string, however this is the framework for my ideal function:
convolution_multi <- function(mean_list = c(14,10,9,10,50)){
distribution_list <- lapply(X = mean_list, Norm, sd=1)
conv_out <- convpow(paste(distribution_list,collapse="+"),1)
return(conv_out)
}
Thanks for your help!
You can use Reduce to add the random variables together one after another. After that you can call convpow:
new_var <- Reduce("+", distribution_list)
convpow(new_var, 1)
That said, the call to convpow does nothing here, because the first convolution power of a distribution is the distribution itself:
> identical(convpow(new_var, 1), new_var)
[1] TRUE
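To show what Reduce does without needing the distr package, here is a small base R sketch that folds "+" over plain numbers. It also uses the known fact that a sum of independent normal variables is again normal, with the means and the variances added, which gives a way to sanity-check the parameters of the convolved distribution. The variable names are chosen for this illustration.

```r
mean_list <- c(14, 10, 9, 10, 50)
sd_each <- 1

# Reduce folds a binary function over a list: (((a + b) + c) + d) + e
total_mean <- Reduce(`+`, as.list(mean_list))

# accumulate = TRUE keeps the intermediate partial sums
partial_means <- Reduce(`+`, as.list(mean_list), accumulate = TRUE)

# For independent normals the variances add, so the sd of the sum is
sum_sd <- sqrt(length(mean_list) * sd_each^2)

total_mean
partial_means
sum_sd
```

With distr, the same fold is applied to the Norm objects instead of the numbers, as in the answer above.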

Creating a function that determines the impact of an outlier

My big-picture goal is to demonstrate the difference outliers can have on a dataset's average. I'm trying to create a function that uses the size of an outlier "k" as an input and outputs the average. Basically, the function needs to take any value "k" (which is the outlier) and return the average of vector x if the first value of x were replaced with k. For example, say the dataset is the heights of a population of students. The first value is supposed to be 71.3 cm but the kid accidentally put 713 cm. In this case, I want my function to tell me what would be the average of my vector if there was an outlier of value 713 (k = 713). So far I have the following, where x is the name of the dataset of heights.
average_err <- function(k) {
x[1] <- k
mean(x[1])
}
Then calculate the average if there was an outlier of 713
average_err(713)
However, my output is always identical to my input. Will someone please help me?
I would suggest:
average_err <- function(x,k) {
mean(c(x,k))
}
In the above, instead of replacing one of the x-values with an outlier, you're adding an outlier to the existing x-vector. As @SteveM suggested, you should also have the function take x as an argument:
x <- rnorm(25)
average_err(x, 100)
# [1] 3.627824
You could also build it to print both the mean of the original x, x with k and the difference:
average_err <- function(x,k) {
m1 <- mean(x)
m2 <- mean(c(x,k))
d <- m2-m1
out <- data.frame(mean = c(m1, m2, d))
rownames(out) <- c("x", "x,k", "difference")
out
}
average_err(x,100)
# mean
# x -0.2270631
# x,k 3.6278239
# difference 3.8548870
I'm not sure I understand correctly, but I would replace "mean(x[1])" with "mean(x)" in your case. If you write mean(x[1]), you take the average of one value only, the one you have just replaced with the outlier k.
average_err <- function(k) {
x[1] <- k
mean(x)
}
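The two suggestions can be combined: replace the first value, as the question asks, and pass x in as an argument so the function does not depend on a global variable. A minimal sketch, with a made-up heights vector and a hypothetical function name:

```r
# Replace the first value of x with k and return the mean.
# The assignment changes a local copy only; the caller's vector is untouched.
average_with_outlier <- function(x, k) {
  x[1] <- k
  mean(x)
}

heights <- c(71.3, 68.0, 70.2, 69.5)  # hypothetical data

mean(heights)                       # mean without the outlier
average_with_outlier(heights, 713)  # mean with 71.3 mistyped as 713
heights[1]                          # still 71.3
```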

Confusion about calculating sample correlation in r

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload), and then comparing the result with R's built-in cor() function.
I calculate the sample correlation with
cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))
Which gives me the result 0.5693599
Then I try using R's cor() function
cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")
which gives me the result:
Nload Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000
Which is a different result. Can anyone see where I've gone wrong?
This happens because when you call sd() on a single vector, it cannot check if the data is pairwise complete. Example:
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)
So here we have
df[seq(2), ]
x y
1 1.0879645 NA
2 -0.3919369 0.2191193
We see that while the second row is pairwise complete (all columns used for your computation are not NA), the first row is not. However, if you calculate sd() on just a single column, it doesn't have any information about the pairs. So in your case, sd(df$x) will use all the available data, although it should avoid the first row.
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
But if you remove the first row from your computation, the result is equal
df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
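The same idea can be wrapped in a small helper that restricts every term of the formula to the rows where both values are present, without modifying the data frame. This is only a sketch; manual_cor is a hypothetical name and the data are simulated.

```r
# Manual sample correlation using only the pairwise complete rows
manual_cor <- function(a, b) {
  ok <- complete.cases(a, b)  # TRUE where neither a nor b is NA
  cov(a[ok], b[ok]) / (sd(a[ok]) * sd(b[ok]))
}

set.seed(1)
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA

manual  <- manual_cor(x, y)
builtin <- cor(x, y, use = "pairwise.complete.obs")
all.equal(manual, builtin)
```

Applied to the question's data, the call would be manual_cor(D$Nload, D$Pload).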

get means across samples from bootstrap

I want to get the means and sds across 20 resampled data sets, but I am not sure how to do that. My current code gives me the means within each sample, not across samples.
## create data
data <- round(rnorm(100, 5, 3))
data[1:10]
## obtain 20 bootstrap samples
resamples <- lapply(1:20, function(i) sample(data, replace = TRUE))
## display the first of the bootstrap samples
resamples[1]
## calculate the mean of each bootstrap sample
r.mean <- sapply(resamples, mean)
r.mean
## calculate the sd of the distribution of means
sqrt(var(r.mean))
From the above code, I get 20 means, one for each resampled data set, and the sd of the distribution of those means. How can I get 100 means, each taken across the 20 samples, and the same for the standard deviation?
Many thanks!!
Though the answer by @konvas is probably what you want, I would still take a look at the boot package, one of the recommended packages that ship with R, when it comes to bootstrapping.
See if the following example can get you closer to what you are trying to do.
set.seed(6929) # Make the results reproducible
data <- round(rnorm(100, 5, 3))
boot_mean <- function(data, indices) mean(data[indices])
boot_sd <- function(data, indices) sd(data[indices])
Runs <- 100
r.mean <- boot::boot(data, boot_mean, Runs)
r.sd <- boot::boot(data, boot_sd, Runs)
r.mean$t
r.sd$t
sqrt(var(r.mean$t))
# [,1]
#[1,] 0.3152989
sd(r.mean$t)
#[1] 0.3152989
Now look at the distributions of the bootstrapped means and standard deviations.
op <- par(mfrow = c(1, 2))
hist(r.mean$t)
hist(r.sd$t)
par(op)
Make a matrix with your samples
mat <- do.call(rbind, resamples)
Then
rowMeans(mat)
will give you the "within sample" mean and
colMeans(mat)
the "across sample" mean. For other quantities, e.g. standard deviation you can use apply, e.g. apply(mat, 1, sd) or functions from the matrixStats package, e.g. matrixStats::rowSds(mat).
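Put together with the resampling step from the question, the matrix approach might look like the sketch below. The seed and the result names are arbitrary choices for this illustration. Each row of mat is one bootstrap sample, so the matrix has 20 rows and 100 columns.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
data <- round(rnorm(100, 5, 3))
resamples <- lapply(1:20, function(i) sample(data, replace = TRUE))

# Stack the 20 samples as rows: 20 x 100 matrix
mat <- do.call(rbind, resamples)
dim(mat)

within_means <- rowMeans(mat)       # 20 values, one per sample
across_means <- colMeans(mat)       # 100 values, one per position
across_sds   <- apply(mat, 2, sd)   # 100 standard deviations across samples
```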

Standard Deviation Loop in R

I want to create a loop that takes the standard deviation of positions 1 through 3 in "y" then takes standard deviation of positions 4 through 6 etc.
Here is the code I have come up with so far, but I am stuck because the index "i" into the new vector increases by those same values.
Here is a hypothetical dataset.
x <-rep(1:10, each =3)
y <- rnorm(30, mean=4,sd=1)
data <- cbind(x,y)
sd.v = NULL
for (i in c(1,4,7,10)){
sd.v[i] <- sd(y[c(i,i+1,i+2)])
}
I am more interested in writing a loop than in using apply, sapply, tapply or something else.
If you really want a loop, here is an approach:
set.seed(42)
y <- rnorm(30, mean=4,sd=1)
sd.y <- numeric(10)  # preallocate one slot per group of three
for(i in 1:10){
sd.y[i] <- sd(y[(1+(i-1)*3):(3+(i-1)*3)])
}
sd.y
# [1] 0.9681038 0.3783425 1.1031686 1.1799477 0.6867556 1.6987277
# [7] 1.8859794 1.4993717 1.2956209 1.1116502
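An alternative loop that stays closer to the one in the question keeps the start positions in their own vector and uses a separate counter for the output, so the result has no gaps. This is a sketch; starts and j are names introduced here.

```r
set.seed(42)
y <- rnorm(30, mean = 4, sd = 1)

starts <- seq(1, length(y), by = 3)  # 1, 4, 7, ..., 28
sd.v <- numeric(length(starts))      # one slot per group of three

for (j in seq_along(starts)) {
  i <- starts[j]                 # position in y
  sd.v[j] <- sd(y[i:(i + 2)])    # store at the group index, not at i
}
sd.v
```

Separating the position in y (i) from the position in the result (j) is what avoids the NA entries that sd.v[i] would leave behind.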

Resources