"Setting the seed at 747, generate m=500 samples of dimension n=880 from a population X, with Exponential distribution of expected value 1/λ=1/0.52, i.e. X∼Exp(λ=0.52).
For each of the generated samples, construct an approximate confidence interval for λ. Consider the confidence level 1−α=0.99. Indicate the mean of the amplitude of the m=500 confidence intervals obtained."
In this exercise I did:
m_ic <- function(seed, m, n, lambda, gama) {
  set.seed(seed)
  return(mean(replicate(m, (2*(qnorm((1+gama)/2)/sqrt(n)))/(mean(rexp(n, lambda))))))
}
m_ic(seed=747, m=500, n=880, lambda=0.52, gama=0.99)
But what if n∈{100,200,300,…,4000}, for example? How could I do the loop and put the results in a data frame? And how could I then represent them in a plot with geom_line() or geom_point()?
You can use sapply() to provide a sequence of different values of n to your m_ic() function; save these in a vector and plot, like this:
n_vals = seq(100,4000,100)
m_ic_values = sapply(n_vals, \(n) m_ic(seed=747, m=500, n=n, lambda=0.52, gama=0.99))
library(ggplot2)
ggplot(NULL, aes(x=n_vals, y=m_ic_values)) +
geom_point() +
geom_line()
Output:
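If you also want the results in a data frame, as you asked, a small variant of the above (same m_ic() function and parameters) could be:
library(ggplot2)
n_vals <- seq(100, 4000, 100)
ic_df <- data.frame(
  n = n_vals,
  mean_amplitude = sapply(n_vals, \(n) m_ic(seed = 747, m = 500, n = n, lambda = 0.52, gama = 0.99))
)
ggplot(ic_df, aes(n, mean_amplitude)) +
  geom_point() +
  geom_line()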
Your function m_ic is computing Normal confidence intervals, but the exponential distribution is far from normal, and better confidence intervals are Gamma intervals, as you can see here. The function gamma_ic below computes these intervals; the code that computes their amplitude is then repeated in a sapply loop.
First, the intervals' mean amplitude with n = 880.
gamma_ic <- function(x, conf = 0.95){
  # Exact interval from the pivot lambda*mean(x) ~ Gamma(n, n),
  # since sum(x) ~ Gamma(n, lambda) for an i.i.d. exponential sample
  n <- length(x)
  qlo <- (1 - conf)/2
  qhi <- 1 - (1 - conf)/2
  qq <- qgamma(c(qlo, qhi), n, n)/mean(x)
  c(lower = qq[1], upper = qq[2])
}
n <- 880
m <- 500
lambda <- 0.52
set.seed(747)
x <- replicate(m, rexp(n, rate = lambda))
ci <- apply(x, 2, gamma_ic, conf = 0.99)
mean(apply(ci, 2, diff))
#> [1] 0.09059922
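A quick sanity check, added here and not part of the original answer: with these settings the empirical coverage of the 99% Gamma intervals should be close to 0.99.
set.seed(747)
covered <- replicate(m, {
  ci_i <- gamma_ic(rexp(n, rate = lambda), conf = 0.99)
  ci_i["lower"] <= lambda && lambda <= ci_i["upper"]
})
mean(covered)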
Now the amplitudes for n from 100 to 4000 with increments of 100.
n_vec <- seq(100L, 4000L, by = 100L)
ampl <- sapply(n_vec, \(n) {
  set.seed(747)
  x <- replicate(m, rexp(n, rate = lambda))
  ci <- apply(x, 2, gamma_ic, conf = 0.99)
  mean(apply(ci, 2, diff))
})
ampldata <- data.frame(n = n_vec, amplitude = ampl)
library(ggplot2)
ggplot(ampldata, aes(n, amplitude)) +
geom_line() +
geom_point() +
theme_bw()
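As a visual check I added (on the assumption that the normal-approximation amplitude from the question, 2*qnorm(0.995)*lambda/sqrt(n), should track the Gamma-interval amplitudes for large n), you can overlay that curve:
ampldata$normal_approx <- 2 * qnorm(0.995) * lambda / sqrt(ampldata$n)
ggplot(ampldata, aes(n)) +
  geom_line(aes(y = amplitude)) +
  geom_point(aes(y = amplitude)) +
  geom_line(aes(y = normal_approx), linetype = 2) +
  theme_bw()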
I have run a multiple imputation (m=45, 10 iterations) using the MICE package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
  x <- replicate(ncol(data), rbinom(nrow(data), 1, prob))
  for(k in 1:ncol(data)) {
    data[, k] <- ifelse(x[, k] == 1, NA, data[, k])
  }
  data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function that computes Cronbach's alpha and an estimate of its standard error, which can then be used in a call to pool.scalar() later on. Since there is no readily available formula for analytically estimating the standard error of alpha, we also need to deploy a non-parametric bootstrap to estimate it.
The function cronbach_fun() takes the following arguments:
list_compl_data: a completed data set extracted from a mids object.
boot: a logical indicating whether a non-parametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
  n <- nrow(list_compl_data); p <- ncol(list_compl_data)
  total_variance <- var(rowSums(list_compl_data))
  item_variance <- sum(apply(list_compl_data, 2, sd)^2)
  # Cronbach's alpha from the item variances and the total-score variance
  alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
  out <- list(alpha = alpha)
  boot_alpha <- numeric(B)
  if (boot) {
    # non-parametric bootstrap: resample rows and recompute alpha
    for (i in seq_len(B)) {
      boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
      total_variance <- var(rowSums(boot_dat))
      item_variance <- sum(apply(boot_dat, 2, sd)^2)
      boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
    }
    out$var <- var(boot_alpha)
  }
  if (ci) {
    out$ci <- quantile(boot_alpha, c(.025, .975))
  }
  return(out)
}
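As a quick standalone illustration (my addition; a smaller B just to keep it fast), the function returns a list with the alpha point estimate, its bootstrap variance and, if requested, a percentile interval:
cronbach_fun(na.omit(df), boot = TRUE, B = 1e3, ci = TRUE)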
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
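For reference (not part of the original answer), pool.scalar() applies Rubin's rules; for a scalar estimate the pooled quantities used above reduce to the following (names follow ?pool.scalar):
qbar <- mean(Q)               # pooled point estimate of alpha
ubar <- mean(U)               # average within-imputation variance
b <- var(Q)                   # between-imputation variance
tvar <- ubar + (1 + 1/m) * b  # total variance, the t used in pool_estimates()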
Suppose I am seeking to integrate the following function from 0 to 10:
How would I accomplish this in R?
Functions
# Functional form
fn <- function(t) -100*(t)^2 + 20000
# First derivative w.r.t. t
fn_dt <- function(t) -200*t
# Density function phi
phi <- approxfun(density(rnorm(35, 15, 7)))
# Delta t
delta <- 5
How about the following:
First off, we choose a fixed seed for reproducibility.
# Density function phi
set.seed(2017);
phi <- approxfun(density(rnorm(35, 15, 7)))
We define the integrand.
integrand <- function(x) {
f1 <- -500 * x^2 + 100000;
f2 <- phi(x);
f2[is.na(f2)] <- 0;
return(f1 * f2)
}
By default, approxfun returns NA if x falls outside the interval [min(x), max(x)]; since phi is a density estimate that is essentially zero outside that range, we can safely replace the NAs with 0.
Let's plot the integrand
library(ggplot2);
ggplot(data.frame(x = 0), aes(x)) + stat_function(fun = integrand) + xlim(-50, 50);
We use integrate to calculate the integral; here I assume you are interested in the interval [-Inf, +Inf].
integrate(integrand, lower = -Inf, upper = Inf)
#-39323.06 with absolute error < 4.6
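Since the original question asked about the interval [0, 10], the same integrand can be restricted to that range simply by changing the limits (shown only for completeness):
integrate(integrand, lower = 0, upper = 10)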
I am interested in building an R function that I can use to test the limits of the Taylor series approximation. I am aware that there are limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 4 and a standard deviation (sd) of 1. y has a mean of 5 and a sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean ratio of y/x, like this
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
[1] 1.328125
I am however stuck on how to estimate the Standard Deviation of the ratio? I realize I have to use a Taylor expansion, but not how to use it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)
[1] 1.362142
There is an analytical expression for the PDF of the ratio of two Gaussians, due to David Hinkley (e.g. see Wikipedia), so we could compute all the moments, the mean, etc. I typed it in, and apparently the ratio clearly doesn't have a finite second moment, thus it doesn't have a finite standard deviation. Note that I've denoted your Y Gaussian as my X, and your X as my Y (the formulas assume X/Y). I get a mean value of the ratio pretty close to what you got from simulation, but the last integral is infinite, sorry. You could sample more and more values, but the sampled std. dev. grows as well, as noted by @G.Grothendieck.
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
a <- function(x) {
sqrt( (x/s.x)^2 + (1.0/s.y)^2 )
}
b <- function(x) {
(m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
u <- b(x)^2 - c*a(x)^2
l <- 2.0*a(x)^2
exp( u / l )
}
# PDF for the ratio of the two different gaussians
PDF <- function(x) {
r <- b(x)/a(x)
q <- pnorm(r) - pnorm(-r)
(r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first momentum
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second moment - infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by @G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
With your example the approximation is considerably worse, because the CV of Y is close to 1 -- I initially thought it looked OK, but now I see that it's biased as well as not capturing the variability very well. (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
sd(rnorm(1e4,me.y,sd.y)/rnorm(1e4,me.x,sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
For what it's worth, I took the code in @SeverinPappadeux's answer and made it into a function gratio(mx, my, sx, sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent), but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)
I'm trying to estimate a linear model with a log-normally distributed error term. I already have working code for a linear model with normally distributed errors:
library(Ecdat)
library(assertthat)
library(maxLik)
# Load the data
data(Wages1)
# Check what R says
summary(lm(wage ~ school + exper + sex, data = Wages1))
# Use maxLik from package maxLik
# The likelihood function
my_log_lik_pos <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(my_data), 1))
return(sum(log(dnorm(y, mean = xb, sd = sigma))))
}
# Bind the data
my_data <- cbind(Wages1$wage, 1, Wages1$school, Wages1$exper, Wages1$sex)
my_problem <- maxLik(my_log_lik_pos, data = my_data,
start = rep(1,5), method = "BFGS")
summary(my_problem)
I get approximately the same results. Now I try to do the same, but using the log-normal likelihood. For this, I have to first simulate some data:
true_beta <- c(0.1, 0.2, 0.3, 0.4, 0.5)
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
my_data_2 <- cbind(ys, my_data[, -1])
And the log-likelihood function:
my_log_lik_lognorm <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(data), 1))
return(sum(log(dlnorm(y, mean = xb, sd = sigma))))
}
my_problem2 <- maxLik(my_log_lik_lognorm, data = my_data_2,
start = rep(0.2,5), method = "BFGS")
summary(my_problem2)
The estimated parameters should be around the values of true_beta, but for some reason I find completely different values. I tried different methods and different starting values, but to no avail. I'm sure that I'm missing something obvious, but I don't see what.
Am I right to assume that the log-likelihood of the log-normal distribution is:
sum(log(dlnorm(y, meanlog = ..., sdlog = ...)))
Unless I'm mistaken, this is the definition of the log-likelihood (sum of the logs of the densities).
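As an aside, dlnorm() (like the other d* functions) has a log argument, so inside my_log_lik_lognorm the return line can equivalently, and a bit more stably, be written as:
return(sum(dlnorm(y, meanlog = xb, sdlog = sigma, log = TRUE)))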
I found the issue: it seems the problem is not my log-likelihood function. When I try to estimate the model with glm:
summary(glm(ys ~ school + exper + sex, family=gaussian(link="log"), data=Wages1))
I get the same result as with maxLik and my log-likelihood. It would seem the problem came from how I simulated the data:
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
The correct way to simulate the data is to pass the linear predictor as the meanlog argument of rlnorm(), so that log(y), rather than y itself, has normally distributed errors, matching the dlnorm likelihood:
ys <- rlnorm(nrow(my_data), my_data[, -1] %*% head(true_beta, -1), tail(true_beta, 1))
Now everything works!
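A quick check I added (not in the original post): under the corrected simulation, log(ys) follows an ordinary linear model, so OLS on the log scale should roughly recover head(true_beta, -1).
fit_check <- lm(log(ys) ~ my_data[, -1] - 1)  # the design matrix already includes an intercept column
coef(fit_check)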
Assume A follows an Exponential distribution and B follows a Gamma distribution.
How can I plot the PDF of 0.5*(A+B)?
This is fairly straightforward using the "distr" package:
library(distr)
A <- Exp(rate=3)
B <- Gammad(shape=2, scale=3)
conv <- 0.5*(A+B)
plot(conv)
plot(conv, to.draw.arg=1)
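A small extra, on the assumption that you also want the density values themselves: distr exposes the density of a distribution object through its d() accessor, so you can evaluate and plot it on your own grid.
x <- seq(0, 8, by = 0.01)
plot(x, d(conv)(x), type = "l", xlab = "x", ylab = "density of 0.5*(A+B)")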
Edit by JD Long
Resulting plot looks like this:
If you're just looking for a fast graph, I usually do the quick-and-dirty simulation approach: I do some draws, slam a kernel density on the draws and plot that bad boy:
numDraws <- 1e6
gammaDraws <- rgamma(numDraws, 2)
expDraws <- rexp(numDraws)
combined <- .5 * (gammaDraws + expDraws)
plot(density(combined))
output should look a little like this:
Here is an attempt at doing the convolution (which @Jim Lewis refers to) in R. Note that there are probably much more efficient ways of doing this.
lower <- 0
upper <- 20
t <- seq(lower,upper,0.01)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t,shape = 8, rate = 2)
## C has the same distribution as (A + B)/2
dC <- function(x, lower, upper, exp.rate, gamma.rate, gamma.shape){
  integrand <- function(Y, X, exp.rate, gamma.rate, gamma.shape){
    dexp(Y, rate = exp.rate)*dgamma(2*X-Y, rate = gamma.rate, shape = gamma.shape)*2
  }
  out <- NULL
  for(ix in seq_along(x)){
    out[ix] <-
      integrate(integrand, lower = lower, upper = upper,
                X = x[ix], exp.rate = exp.rate,
                gamma.rate = gamma.rate, gamma.shape = gamma.shape)$value
  }
  return(out)
}
fC <- dC(t, lower=lower, upper=upper, exp.rate=0.4, gamma.rate=2, gamma.shape=8)
## plot the resulting distribution
plot(t,fA,
ylim = range(fA,fB,na.rm=TRUE,finite = TRUE),
xlab = 'x',ylab = 'f(x)',type = 'l')
lines(t,fB,lty = 2)
lines(t,fC,lty = 3)
legend('topright', c('A ~ exp(0.4)','B ~ gamma(8,2)', 'C ~ (A+B)/2'),lty = 1:3)
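As a cross-check I added (the seed is my arbitrary choice): a large simulation with the same parameters should trace the fC curve closely when overlaid on the plot above.
set.seed(1)
simC <- 0.5*(rexp(1e5, rate = 0.4) + rgamma(1e5, shape = 8, rate = 2))
lines(density(simC), col = "red")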
I'm not an R programmer, but it might be helpful to know that for independent random variables with PDFs f1(x) and f2(x), the PDF of their sum is given by the convolution of the two input PDFs: (f1 * f2)(x) = ∫ f1(y) f2(x − y) dy.
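A minimal sketch of that idea on a grid, using stats::convolve() with the same parameters as the example above (the grid spacing and upper limit are my choices):
dx <- 0.01
t <- seq(0, 40, by = dx)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t, shape = 8, rate = 2)
# discrete approximation of the convolution integral f_{A+B}(s) = int fA(y) fB(s - y) dy
fS <- convolve(fA, rev(fB), type = "open")[seq_along(t)] * dx
# C = (A + B)/2, hence f_C(x) = 2 * f_{A+B}(2x); evaluate on the halved grid
plot(t/2, 2*fS, type = "l", xlab = "x", ylab = "f(x)", xlim = c(0, 20))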