R. Bootstrapping max value from a vector - r

I have a data frame df with a X column with normally distributed values along 1,000,000 rows. The max value in X = 0.8. Using R (and perhaps the "boot" package), I would like to do bootstrapping with replacement to estimate how unlikely is to get max(df$X)=0.8 from my data. For this, I could take n bootstrap samples from X and calculate the max value of each sample. Then I can take the standard deviation of each max(sample) and see how far is 0.8 from this st dev. Does anyone know how to do this bootstrapping with R?. Any suggestion is welcomed !

Bootstrapping from x, where x is a normal random variable. statistic function needs to be provided which requires at least data and indices as its arguments. check the R documentation of boot package for more details.
max_x function below checks if the max(x) is same as maximum of a bootsrapped sample. Note that the test data (x) considered in below code has a different maximum value, but conceptual framework remains the same:
set.seed(101)
x <- rnorm(1000, mean= 0.4, sd= 0.2) # normally distributed test data
max_x <- function(data, indices){ m <- max(data[indices])
if (m == max(x)) { return(1)
} else{ return(0)}
}
results <- boot(data = x, statistic = max_x, R = 1000) # 1000 replications
mean(results$t == 1) # probability of max getting sampled
# 0.618
results
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call:
# boot(data = x, statistic = max_x, R = 1000)
# Bootstrap Statistics :
# original bias std. error
# t1* 1 -0.382 0.4861196
plot(results)

Related

Statistically identical lm() and t.test() yield marginally different effect sizes

I'm encountering an unexpected lack of correspondence between the estimated effects of an unpaired t.test() and lm() in base R's stats. While an independent-samples t-test correctly yields t = 0 and a manual calculation of OLS slope also outputs an estimate of 0, lm() gives a near-zero approximation (specifically, -3.239e-14).
This pattern holds across machines and OS (Mac and Windows) versions as tested by my coworkers. Would anyone be able to explain the discrepancy? The most similar post does not seem to apply here.
# Populating an example dataframe
id <- rep(1, 16)
dose <- rep(c(rep("Low", 4), rep("High", 4)), 2)
domain <- c(rep("Domain1", 8), rep("Domain2", 8))
dv <- c(100, 0, 100, 100, 0, 100, 100, 100, 0, 0, 0, 0, 0, 100, 50, 0)
df <- data.frame(id, dose, domain, dv)
library(dplyr) ## needed for %>% and filter()
# Independent-samples t-test
t.test(dv ~ dose,
data = filter(df, domain == "Domain1") %>% droplevels(),
paired = F)
# data: dv by dose
# t = 0, df = 6, p-value = 1
# Linear regression of factor IV and numeric DV
summary(lm(as.numeric(dv) ~ as.factor(dose),
data = filter(df, domain == "Domain1")))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 7.500e+01 2.500e+01 3 0.024 *
# as.factor(dose)Low -3.239e-14 3.536e+01 0 1.000
# Manual calculation of OLS slope
X <- model.matrix(dv ~ dose, data = filter(df, domain == "Domain2"))
y <- with(filter(df, domain == "Domain1"), dv)
R <- t(X) %*% X
solve(R) %*% t(X) %*% y
# [,1]
# (Intercept) 75
# doseLow 0
The t-statistic is exactly zero for t.test() (assign the results to tt and try tt$statistic == 0), while after assigning the results of summary() to ss we get
ss$coefficients["as.factor(dose)Low", "Estimate"]
[1] -3.014578e-14
(extracting the values and printing them avoids any rounding/formatting that might be done in the print() methods for these objects).
As discussed in the comments, this discrepancy is a more-or-less unavoidable aspect of floating point imprecision, and the fact that the two functions do mathematically equivalent calculations using very different numerical algorithms. (In hindsight it's not surprising that the t-test, which just subtracts the two means and scales them by the pooled SD, gets a more precise answer than lm(), which does some fancy linear algebra (QR decomposition: see lm.fit()) to arrive at the same answer.)
Unfortunately, it's not easy to adjust the printing precision low enough to make the coefficient print as zero: even
printCoefmat(coef(ss), digits = 1)
(a way to get just the summary table) still prints -3e-14.
A hack, using zapsmall() to round the coefficient values in such a way that coefficients that are small (in absolute value) relative to the largest coefficient get set to zero:
ss$coefficients[, "Estimate"] <- zapsmall(ss$coefficients[, "Estimate"])
printCoefmat(coef(ss))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.000 25.000 3 0.02401 *
## as.factor(dose)Low 0.000 35.355 0 1.00000
You could also do this upstream with the output of lm(), or by creating a wrapper function to use in place of lm():
zlm <- function(...) {
m <- lm(...)
m$coefficients <- zapsmall(m$coefficients)
m
}
Unfortunately there's not really any nice way to make things behave this way universally; it's just too hard to know exactly when it's safe/desirable to round small values to zero ...

my location-scale estimator function not working with polynomial mean

I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and standard deviation. On simulated data my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial my function cannot recover the true parameters. Can anybody point me to a solution?
I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them, I'm interested in why my function is not working.
Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123)
*Edit: added library()
library(tidyverse)
my_poly_loglik <- function(pars, #parameters
outcome, #outcome variable
poly_mean){ #data frame of polynomials
#modelling the mean - adding intercept column
mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]
#modelling the standard deviation on exponential scale
sd_mdl = exp(pars[length(pars)])
#computing log likelihood
sum_log_likelihood <- sum(dnorm(outcome,
mean = mean_mdl,
sd = sd_mdl,
log = TRUE),
na.rm = TRUE)
#since optim() is minimizing we want the -log likelihood
return(-sum_log_likelihood)
}
#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2 #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean)
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)
#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2)
#Optimising my likelihood function
optim_res <- optim(pars,
fn = my_poly_loglik,
outcome = mydf$y,
poly_mean = polymean)
if (optim_res$convergence != 0) stop("optim_res value is not 0!")
#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
"truemean" = mean,
"z" = z)
#my mean (black) and true mean (red)
plot_df %>%
ggplot(aes(x = z, y = mymean)) +
geom_line() +
geom_line(aes(y = truemean), color = "red")
#Works!
#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd
this is not a complete solution but it might help others find the correct answer.
The code looks good overall and the issue emerges only with a high range of the z values. In fact, scaling them or generating data from a considerably lower range leads to the correct solution. Furthermore, checking the hessian shows that the covariance matrix of the estimates is not positive semidefinite and slightly reducing the range results in correlations of the mean parameters close to 1. (This is a bit puzzling since a normal linear model with the same parametrization does not suffer from the same issue -- I know it does not optimize the likelihood directly, but still a bit unintuitive to me).
So, a temporal solution might be rescaling the predictors / using an orthogonal parametrization? But that does not really explain core of the issue.

R : Calculate a P-value of a random distribution

I want to get the P-value of two randomly distributed observations x and y, for example :
> set.seed(0)
> x <- rnorm(1000, 3, 2)
> y <- rnorm(2000, 4, 3)
or:
> set.seed(0)
> x <- rexp(50, 10)
> y <- rexp(100, 11)
let's say that T is my test-statistic defined as mean(x) - mean(y) = 0 (this is H0), the P-value is then defined as : p-value = P[T>T_observed | H0 holds].
I tried doing this :
> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y)
then to calculate the p-value i tried this:
> T <- replicate(10000, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE))) # this is
supposed to get the null distribution of mean(x) - mean(y)
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5%
obviously this doesn't seem to work, what am i missing ?
Your intention is very good -- to calculate statistical significance via bootstrap sampling (aka bootstrapping). However, the mean(sample(x,1000,TRUE))-mean(sample(z,2000,TRUE)) can't work because this is taking an average of 1000 samples of z - an average of 2000 samples of z. This will most certainly be quite close to 0 regardless of the true means of x and y.
I would suggest the following:
diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE))
2000 samples (with replacement) of both x and y are taken and the difference is calculated. Of course you can increase confidence too by adding replications as you suggested. As opposed to pvalue, I prefer confidence intervals (CI) as I think they are more informative (and equivalent in statistical accuracy to p-values). The CIs can then be calculated as follows using the means and standard errors:
stderror <- sd(diff)/sqrt(length(x))
upperCI <- mean(diff)+stderror
lowerCI <- mean(diff)-stderror
cat(lowerCI, upperCI)
Since the CI does not include 0, the null hypothesis is rejected. Notice that the result will be close to t-test (for your normal example) CI results in R:
t <- t.test(x, y)
cat(t$conf.int)

How to find interval prbability for a given distribution?

Suppose I have some data and I fit them to a gamma distribution, how to find the interval probability for Pr(1 < x <= 1.5), where x is an out-of-sample data point?
require(fitdistrplus)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
fit <- fitdist(a, "gamma",lower = c(0, 0))
Someone does not like my above approach, which is conditional on MLE; now let's see something unconditional. If we take direct integration, we need a triple integration: one for shape, one for rate and finally one for x. This is not appealing. I will just produce Monte Carlo estimate instead.
Under Central Limit Theorem, MLE are normally distributed. fitdistrplus::fitdist does not give standard error, but we can use MASS::fitdistr which would performs exact inference here.
fit <- fitdistr(a, "gamma", lower = c(0,0))
b <- fit$estimate
# shape rate
#1.739737 1.816134
V <- fit$vcov ## covariance
shape rate
shape 0.1423679 0.1486193
rate 0.1486193 0.2078086
Now we would like to sample from parameter distribution and get samples of target probability.
set.seed(0)
## sample from bivariate normal with mean `b` and covariance `V`
## Cholesky method is used here
X <- matrix(rnorm(1000 * 2), 1000) ## 1000 `N(0, 1)` normal samples
R <- chol(V) ## upper triangular Cholesky factor of `V`
X <- X %*% R ## transform X under desired covariance
X <- X + b ## shift to desired mean
## you can use `cov(X)` to check it is very close to `V`
## now samples for `Pr(1 < x < 1.5)`
p <- pgamma(1.5, X[,1], X[,2]) - pgamma(1, X[,1], X[,2])
We can make a histogram of p (and maybe do a density estimation if you want):
hist(p, prob = TRUE)
Now, we often want sample mean for predictor:
mean(p)
# [1] 0.1906975
Here goes an example that uses MCMC techniques and a Bayesian mode of inference to estimate the posterior probability that a new observation falls in the interval (1:1.5). This is an unconditional estimate, as opposed to the conditional estimate obtained by integrating the gamma-distribution with maximum-likelihood parameter estimates.
This code requires that JAGS be installed on your computer (free and easy to install).
library(rjags)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
# Specify the model in JAGS language using diffuse priors for shape and scale
sink("GammaModel.txt")
cat("model{
# Priors
shape ~ dgamma(.001,.001)
rate ~ dgamma(.001,.001)
# Model structure
for(i in 1:n){
a[i] ~ dgamma(shape, rate)
}
}
", fill=TRUE)
sink()
jags.data <- list(a=a, n=length(a))
# Give overdispersed initial values (not important for this simple model, but very important if running complicated models where you need to check convergence by monitoring multiple chains)
inits <- function(){list(shape=runif(1,0,10), rate=runif(1,0,10))}
# Specify which parameters to monitor
params <- c("shape", "rate")
# Set-up for MCMC run
nc <- 1 # number of chains
n.adapt <-1000 # number of adaptation steps
n.burn <- 1000 # number of burn-in steps
n.iter <- 500000 # number of posterior samples
thin <- 10 # thinning of posterior samples
# Running the model
gamma_mod <- jags.model('GammaModel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(gamma_mod, n.burn)
gamma_samples <- coda.samples(gamma_mod,params,n.iter=n.iter, thin=thin)
# Summarize the result
summary(gamma_samples)
# Compute improper (non-normalized) probability distribution for x
x <- rep(NA, 50000)
for(i in 1:50000){
x[i] <- rgamma(1, gamma_samples[[1]][i,1], rate = gamma_samples[[1]][i,2])
}
# Find which values of x fall in the desired range and normalize.
length(which(x>1 & x < 1.5))/length(x)
Answer:
Pr(1 < x <= 1.5) = 0.194
So pretty close to the conditional estimate, but this is not guaranteed to generally be the case.
You can just use pgamma with estimated parameters in fit.
b <- fit$estimate
# shape rate
#1.739679 1.815995
pgamma(1.5, b[1], b[2]) - pgamma(1, b[1], b[2])
# [1] 0.1896032
Thanks. But how about P(x > 2)?
Check out the lower.tail argument:
pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
By default, pgamma(q) evaluates Pr(x <= q). Setting lower.tail = FALSE gives Pr(x > q). So you can do:
pgamma(2, b[1], b[2], lower.tail = FALSE)
# [1] 0.08935687
Or you can also use
1 - pgamma(2, b[1], b[2])
# [1] 0.08935687

Fitting a lognormal distribution to truncated data in R

For a brief background, I am insterested in describing a distribution of fire sizes, which is presumed to follow a lognormal distribution (many small fires and few large fires). For my specific application I am only interested in the fires that fall within a certain range of sizes (> min, < max). So, I am attempting to fit a lognormal distribution to a data set that has been censored on both ends. In essence, I want to find the parameters of the lognormal distribution (mu and sigma) that best fits the full distribution prior to censoring. Can I fit the distribution taking into account that I know I am only looking a a portion of the distribution?
I have done some experimentation, but have become stumped. Here's an example:
# Generate data #
D <- rlnorm(1000,meanlog = -0.75, sdlog = 1.5)
# Censor data #
min <- 0.10
max <- 20
Dt <- D[D > min]
Dt <- Dt[Dt <= max]
If I fit the non-censored data (D) using either fitdistr (MASS) or fitdist (fitdistrplus) I obviously get approximately the same parameter values as I entered. But if I fit the censored data (Dt) then the parameter values do not match, as expected. The question is how to incorporate the known censoring. I have seen some references elsewhere to using upper and lower within fitdistr, but I encounter an error that I'm not sure how to resolve:
> fitt <- fitdist(Dt, "lognormal", lower = min, upper = max)
Error in fitdist(Dt, "lognormal", lower = min, upper = max) :
The dlognormal function must be defined
I will appreciate any advice, first on whether this is the appropriate way to fit a censored distribution, and if so, how to go about defining the dlognormal function so that I can make this work. Thanks!
Your data is not censored (that would mean that observations outside the interval
are there, but you do not know their exact value)
but truncated (those observations have been discarded).
You just have to provide fitdist with the density and the cumulative distribution function
of your truncated distribution.
library(truncdist)
dtruncated_log_normal <- function(x, meanlog, sdlog)
dtrunc(x, "lnorm", a=.10, b=20, meanlog=meanlog, sdlog=sdlog)
ptruncated_log_normal <- function(q, meanlog, sdlog)
ptrunc(q, "lnorm", a=.10, b=20, meanlog=meanlog, sdlog=sdlog)
library(fitdistrplus)
fitdist(Dt, "truncated_log_normal", start = list(meanlog=0, sdlog=1))
# Fitting of the distribution ' truncated_log_normal ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog -0.7482085 0.08390333
# sdlog 1.4232373 0.0668787

Resources