I'm using R to make some calculations. This question is about R but also about statistics.
Say I have a dataset of paired samples consisting of a subject's blood platelet concentration after injection of placebo and then again after injection of medication, for a number of subjects. I want to estimate the mean difference for the paired samples. I'm just learning about the t distribution. If I wanted a 95% confidence interval for the mean difference using a Z-test, I could simply use:
mydata$diff <- mydata$medication - mydata$placebo
mu0 <- mean(mydata$diff)
sdmu <- sd(mydata$diff) / sqrt(length(mydata$diff))
qnorm(c(0.025, 0.975), mu0, sdmu)
After much confusion and cross-checking with the t.test function, I've figured out that I can get the 95% confidence interval for a t-test with:
qt(c(0.025, 0.975), df=19) * sdmu + mu0
My understanding of this is as follows:
Tstatistic = (mu - mu0)/sdmu
Tcdf^-1(0.025) <= (mu - mu0) / sdmu <= Tcdf^-1(0.975)
=>
sdmu * Tcdf^-1(0.025) + mu0 <= mu <= sdmu * Tcdf^-1(0.975) + mu0
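As a sanity check, here is a minimal sketch with simulated paired data for 20 subjects (the data and effect size are made up purely for illustration); the qt-based interval matches what t.test reports:
set.seed(1)
# hypothetical paired data for 20 subjects, for illustration only
mydata <- data.frame(placebo = rnorm(20, mean = 100, sd = 10))
mydata$medication <- mydata$placebo + rnorm(20, mean = 5, sd = 8)
mydata$diff <- mydata$medication - mydata$placebo
mu0 <- mean(mydata$diff)
sdmu <- sd(mydata$diff) / sqrt(length(mydata$diff))
qt(c(0.025, 0.975), df = length(mydata$diff) - 1) * sdmu + mu0
t.test(mydata$medication, mydata$placebo, paired = TRUE)$conf.int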
The reason this is confusing is that if I were using a Z-test, I would write it like this:
qnorm(c(0.025, 0.975), mu0, sdmu)
and it's not until I tried to figure out how to use the t distribution that I realised I could move the normal distribution parameters out of the function too:
qnorm(c(0.025, 0.975), 0, 1) * sdmu + mu0
Trying to wrap my head around what this means mathematically: does it mean that the Z-statistic (mu - mu0)/sdmu is always normally distributed with mean 0 and standard deviation 1?
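A quick numerical check of that location-scale identity, using the mu0 and sdmu from above:
all.equal(qnorm(c(0.025, 0.975), mu0, sdmu),
          qnorm(c(0.025, 0.975), 0, 1) * sdmu + mu0)
#> [1] TRUE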
What has me stumped is that I'd like to move the t distribution parameters into the arguments to the function to cut down on the enormous mental overhead of thinking about this transformation.
However, according to my version of the R function qt's documentation, in order to do this, I would need to calculate the non-centrality parameter ncp. According to (my version of) the documentation, the ncp is explained as follows:
Let T = (mX - m0) / (S/sqrt(n)) where mX is the mean and S the sample standard deviation (sd) of X_1, X_2, …, X_n which are i.i.d. N(μ, σ^2). Then T is distributed as non-central t with df = n - 1 degrees of freedom and non-centrality parameter ncp = (μ - m0) * sqrt(n)/σ.
I can't wrap my head around this at all. At first it seems to fit into my framework because Tstatistic = (mu - m0) / sdmu. But isn't μ what I want the qt function (which is Tcdf^-1) to return? How can it appear in the ncp, which I need to give as an input? And what about σ? What do μ and σ mean in this context?
Basically, how can I get the same result as qt(c(0.025, 0.975), df=19) * sdmu + mu0, without any terms outside of the function call, and could I have an explanation of how it works?
Let me try to explain without using any formulae.
First of all, the student t distribution and the normal distribution are two distinct probability distributions and (in most situations) are not supposed to give you the same results.
The t distribution is the appropriate probability distribution to test for a difference between two normally distributed samples. Since we do not know the population sd, we have to use the one estimated from the sample, and the resulting statistic is no longer normally distributed; it is t-distributed.
The z-distribution can be used to approximate the test; in that case we use the z-distribution as an approximation of the t-distribution. However, it is recommended not to do this with low degrees of freedom. The reason is that the more degrees of freedom a t distribution has, the more similar it becomes to a normal distribution. Textbooks usually say that a t distribution with df > 30 is similar enough to the normal distribution for the approximation to be acceptable. To do that, you would first have to normalise your data so that mean = 0 and sd = 1, and then you can do the approximation using the z-distribution.
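To see how quickly the t quantiles approach the normal one as the degrees of freedom grow, a small sketch:
sapply(c(5, 10, 30, 100, 1000), function(df) qt(0.975, df))
# roughly 2.57, 2.23, 2.04, 1.98, 1.96
qnorm(0.975)
# roughly 1.96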
I usually recommend not to use this approximation. It was a reasonable crutch when calculations had to be done on paper using your head, a pen, and a bunch of tables. There are many workarounds in basic statistics that were meant to give you a reasonable result with less computational effort. With modern computers that is usually obsolete (in most cases, at least).
The z distribution, by the way, is defined (by convention) as a normal distribution N(0, 1) i.e. a normal distribution with mean = 0 and sd = 1.
Finally, about the different ways these distributions are specified. The normal distribution is actually the only probability distribution I know of that you can specify by setting mean and sd directly (there are dozens of distributions, in case you're interested). The non-centrality parameter has a similar effect to the mean of the normal distribution: in a plot it moves the t-distribution along the x-axis. But it also changes the shape and skews the distribution, so that the mean and the ncp move away from each other.
This code will show how the ncp changes the shape and location of the t-distribution:
x <- seq(-5, 15, 0.1)
plot(x, dt(x, df = 10, ncp = 0), type = "l")
for (ncp in 1:6) {
  lines(x, dt(x, df = 10, ncp = ncp))
}
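The same point numerically, as a quick sketch: a noncentral t is not simply a central t shifted by its ncp, because the ncp also changes the shape.
qt(0.975, df = 10, ncp = 2)  # upper quantile of the noncentral t
qt(0.975, df = 10) + 2       # central t quantile merely shifted by 2; the two values differ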
Related
I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming alpha=2 is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100,shape=2,rate=3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  posterior <- (((beta_1^alpha)^n) / gamma(alpha)^n) *
    prod(dat^(alpha - 1)) * exp(-beta_1 * sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from = 2, to = 6)
A tricky question and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but then realised one has to be more careful with the posterior and the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so these two parts determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n + 1 and rate parameter sum(dat). Since the mode of a gamma distribution with shape k and rate r is (k - 1)/r, and sum(dat) is about 66 for this seed, we get a mode of 300/66 (about 4.55). This coincides with the "posterior plot" produced by your code (again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1).
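To see this numerically, a small sketch reusing n and dat from the question's code; under the flat prior the posterior is Gamma(3*n + 1, sum(dat)):
shape_post <- 3 * n + 1          # 301
rate_post  <- sum(dat)           # about 66 for this seed
(shape_post - 1) / rate_post     # posterior mode, about 4.55
curve(dgamma(x, shape_post, rate_post), from = 2, to = 6)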
I hope LifeisBetter now =).
But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution that includes the rate parameter from the prior distribution of beta (betaPrior = 3):
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the mean of beta is at ~4.39 rather than ~4.55 because of the informative prior that had a mean of 1.
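For completeness, the posterior mean can be read directly off the gamma parameters used above:
(alpha * (n + 1)) / (sum(dat) + betaPrior)   # posterior mean of beta, roughly 4.39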
I'm trying to fit a lognormal distribution to some count data using Colin Gillespie's poweRlaw package in R. I'm aware that the lognormal distribution is continuous and count data is discrete, however, the package contains classes and methods for both continuous and discrete versions of the lognormal distribution.
When I fit xmin (the threshold below which count values are disregarded) and the log mean and log sd parameters, and then bootstrap the results to get a p value, I get a vector memory exhaustion error. I found that this happens when the package-internal function sample_p_helper tries to generate random numbers from the fitted distribution. The fitted log mean and log sd parameters are so low that the rejection sampling algorithm tries to generate literally billions of numbers to get anything above xmin, hence the memory issue.
Input:
library(poweRlaw)
counts <- c(54, 64, 126, 161, 162, 278, 281, 293, 296, 302, 322, 348, 418, 511, 696, 793, 1894)
dist <- dislnorm$new(counts) # Create discrete lnorm object
dist$setXmin(estimate_xmin(dist)) # Get xmin and parameters
bs <- bootstrap_p(dist) # Run bootstrapping
Error message:
Expected total run time for 100 sims, using 1 threads is 24.6 seconds.
Error in checkForRemoteErrors(val) :
one node produced an error: vector memory exhausted (limit reached?)
The question then becomes why such low and poor-fitting log mean and log sd parameter values are being fitted in the first place.
I noticed that if I fit the continuous version of the lognormal distribution, the error does not occur and the parameter values seem more reasonable (in fact, the p value suggests the data are compatible with the lognormal distribution):
dist_cont <- conlnorm$new(counts)
dist_cont$setXmin(estimate_xmin(dist_cont))
bs <- bootstrap_p(dist_cont)
bs
Looking at the source code for the package, I noticed the likelihood functions for the discrete vs continuous lognormal distributions are different. Specifically, the part where joint probability is calculated.
The continuous version looks how I'd expect:
########################################################
#Log-likelihood
########################################################
conlnorm_tail_ll = function(x, pars, xmin) {
  if(is.vector(pars)) pars = t(as.matrix(pars))
  n = length(x)
  joint_prob = colSums(apply(pars, 1,
                             function(i) dlnorm(x, i[1], i[2], log=TRUE)))
  prob_over = apply(pars, 1, function(i)
    plnorm(xmin, i[1], i[2], log.p=TRUE, lower.tail=FALSE))
  joint_prob - n*prob_over
}
However, in the discrete version, joint probability is calculated differently:
########################################################
#Log-likelihood
########################################################
dis_lnorm_tail_ll = function(xv, xf, pars, xmin) {
  if(is.vector(pars)) pars = t(as.matrix(pars))
  n = sum(xf)
  p = function(par) {
    m_log = par[1]; sd_log = par[2]
    plnorm(xv-0.5, m_log, sd_log, lower.tail=FALSE) -
      plnorm(xv+0.5, m_log, sd_log, lower.tail=FALSE)
  }
  if(length(xv) == 1L) {
    joint_prob = sum(xf * log(apply(pars, 1, p)))
  } else {
    joint_prob = colSums(xf * log(apply(pars, 1, p)))
  }
  prob_over = apply(pars, 1, function(i)
    plnorm(xmin-0.5, i[1], i[2],
           lower.tail = FALSE, log.p = TRUE))
  return(joint_prob - n*prob_over)
}
There's a similar difference between the discrete and continuous implementations of the exponential distribution, but not between the discrete and continuous power law distributions. In the continuous version, joint_prob is calculated with a relatively simple call to dlnorm, but the discrete version calls plnorm instead. Further, it calls plnorm twice, first on the observed data values - 0.5 and then on the observed values + 0.5, and subtracts the latter from the former.
So, at last, my questions:
Why does poweRlaw calculate joint probability in this way in the discrete implementation of the lognormal distribution? I'm sure it's been written in this way for a reason and it's just my mathematical ignorance, but I don't really understand it.
Is it safe to use poweRlaw's continuous lognormal distribution instead, even though my data is discrete, since it seems to work well enough anyway?
Any other clues as to what might be going wrong with my data when trying to fit the discrete lognormal distribution? I'm thinking there might be a scaling issue somewhere but having a hard time getting my head around it.
Does my comically small dataset play into things at all? I'm trying to fit a distribution to just 8 values above xmin, which is way too few for maximum likelihood to be reliable, I know.
Thanks for bearing with me through this lengthy post. I'm aware this is as much a statistics question as a coding question. Any helpful nudges in the right direction are very much appreciated! Cheers.
dlnorm() gives the probability density value. Remember, densities integrate to one but don't sum to one. So to work out the discrete distribution we take the probability mass either side of an integer; there will be a normalising constant as well. For the continuous (CTN) case, the likelihood is just a product of dlnorm() values (a sum on the log scale), which is easier and faster.
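For example, a small sketch with made-up meanlog and sdlog values: the probability attached to an integer k is the lognormal mass between k - 0.5 and k + 0.5, which is close to, but not the same thing as, the density at k.
meanlog <- 5; sdlog <- 1; k <- 54   # hypothetical values, for illustration only
plnorm(k - 0.5, meanlog, sdlog, lower.tail = FALSE) -
  plnorm(k + 0.5, meanlog, sdlog, lower.tail = FALSE)  # P(k - 0.5 < X <= k + 0.5)
dlnorm(k, meanlog, sdlog)                              # density at k; similar, but not a probability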
"Safe" is a hard word to define. For this data, the CTN and discrete give visually the same fit. But neither fit well.
The estimated parameters values for the discrete distribution gives a truncated lognormal in the very extreme tails. Simulating data in that region is challenging
Yep, your data is the problem. But that's also the issue when the model doesn't work ;)
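To see why the simulation struggles, here is a rough sketch with hypothetical parameter values (not the package's actual internals): rejection sampling above xmin needs on the order of 1/p draws for each accepted value, where p is the fitted tail probability above xmin.
meanlog <- 0.5; sdlog <- 1; xmin <- 54   # hypothetical values, for illustration only
p <- plnorm(xmin, meanlog, sdlog, lower.tail = FALSE)
p      # tiny tail probability above xmin
1 / p  # expected number of draws needed per accepted sample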
For an analysis I need to compute an "F-pseudosigma", also called the "pseudo standard deviation". I tried to look whether it's in any R package, but can't find it myself.
There isn't much info on it to begin with.
Do any of you know a package that provides it, or whether it is calculated by a function in some package?
I have to admit that I hadn't heard of F-pseudosigma (or pseudo sigma) before, but a bit of research suggests that it is simply the difference between the third and first quartiles, scaled by a constant (roughly 1.35).
That can be easily translated into a custom R function
fpseudosig <- function(x) unname(diff(quantile(x, c(0.25, 0.75))) / 1.35)
For example, let's generate some random data x ~ N(0, 1)
set.seed(2018)
x <- rnorm(100)
Then
fpseudosig(x)
#[1] 0.9703053
References
(in no particular order)
Irwin, Exploratory Data Analysis for Beginners: "Instead of the using the standard deviation in an RSD calculation, one might consider using the sample-data deviation (F-pseudosigma). This is a nonparametric statistic analogous to the standard deviation that is calculated by using the 25th and 75th percentiles in a data set. It is resistant to the effect of extreme outliers."
https://bqs.usgs.gov/srs/SRS_Spr04/statrate.htm: "The F-pseudosigma is calculated by dividing the fourth-spread (analogous to interquartile range) by 1.349; therefore the smaller the F-pseudosigma the more precise the determinations. The 1.349 value is derived from the number of standard deviations that encompasses 50% of the data."
http://mkseo.pe.kr/stats/?p=5: "Simply put, given the first quartile H1 and the third quartile H3, pseudo sigma is (H3-H1)/1.35. Why? It’s because H1= μ – 0.675σ and H3 = μ + 0.675σ if X ∼N. Therefore, H3-H1=1.35σ, resulting in σ = (H3-H1)/1.35. We call H3-H1 as IQR(Inter Quartile Range)."
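The 1.349 (or 1.35) constant quoted above can be verified directly in R:
diff(qnorm(c(0.25, 0.75)))  # width of the middle 50% of a standard normal
#> [1] 1.34898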
I have a bunch of random variables (X1, ..., Xn) which are i.i.d. Exp(1/2) and represent the duration of time of a certain event. So this distribution obviously has an expected value of 2, but I am having problems defining it in R. I did some research and found something about a so-called Monte Carlo simulation, but I don't seem to find what I am looking for in it.
An example of what I want to estimate is: let's say we have 10 random variables (X1, ..., X10) distributed as above, and we want to determine, for example, the probability P(X1 + ... + X10 <= 25).
Thanks.
You don't actually need Monte Carlo simulation in this case because:
If Xi ~ Exp(λ) then the sum (X1 + ... + Xk) ~ Erlang(k, λ) which is just a Gamma(k, 1/λ) (in (k, θ) parametrization) or Gamma(k, λ) (in (α,β) parametrization) with an integer shape parameter k.
From Wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions)
So, P(X1 + ... + X10 <= 25) can be computed by
pgamma(25, shape=10, rate=0.5)
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer to your Monte Carlo estimation of the desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples, putting them into a 1000 * 10 matrix. We take row sums and get a vector of 1000 entries. The proportion of values less than or equal to 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! Can I use replicate with this code, to make it look like this: F <- function(n, B=1000) mean(replicate(B,(rexp(10, rate = 0.5))))? I can't get the right result, though.
replicate here generates a matrix, too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo method for your example, see the other answer. The exponential distribution is a special case of the gamma distribution, and the latter has an additivity property.
I am giving you the Monte Carlo method because you named it in your question, and it is applicable beyond your example.
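For what it's worth, a quick sketch cross-checking the two answers on your example; the exact gamma probability and a Monte Carlo estimate agree closely:
pgamma(25, shape = 10, rate = 0.5)                          # exact
set.seed(1)
mean(colSums(replicate(1e5, rexp(10, rate = 0.5))) <= 25)   # Monte Carlo estimate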
I am using R's glm to model Poisson data binned by year. So I have x[i] counts with T[i] exposure in each year i. The glm with poisson family and log link produces model coefficients a, b for y = a + bx.
What I need is the standard error of (a + bx) not the standard error of a or the standard error of b. The documentation describing a solution I am trying to implement says this should be calculated by the software because it is not straightforward to calculate from the parameters for a and b. Perhaps SAS does the calc, but I am not recognizing it in R.
I am working through section 7.2.4.5 of the Handbook of Parameter Estimation (NUREG/CR-6823, a public document) and looking at eq. 7.2. I am also not a statistician, so I am finding this very hard to follow.
The game here is to find the 90 percent simultaneous confidence interval on the model output, not the confidence interval at each year, i.
Adding this here so I can show some code. The first answer below appears to get me pretty close. A statistician here put together the following function to construct the confidence bounds. This appears to work.
# trend line simultaneous confidence intervals
# according to HOPE 7.2.4.5
# (relies on data, model, and n from the calling environment)
HOPE = function(x, ...){
  t = data$T
  mle <- predict(model, newdata=data.frame(x=data$x), type="response")
  se = as.data.frame(predict(model, newdata=data.frame(x=data$x), type="link", se.fit=TRUE))[,2]
  chi = qchisq(.90, df=n-1)
  upper = (mle + (chi * se))/t
  lower = (mle - (chi * se))/t
  return(as.data.frame(cbind(mle, t, upper, lower)))
}
I think you need to provide the argument se.fit=TRUE when you create the prediction from the model:
hotmod<-glm(...)
predz<-predict(hotmod, ..., se.fit=TRUE)
Then you should be able to find the estimated standard errors using:
predz$se.fit
Now if you want to do it by hand on this software, it should not be as hard as you suggest:
covmat<-vcov(hotmod)
coeffs<-coef(hotmod)
Then the standard error of the fitted a + b*x at a given covariate value x comes from the design row x0 = c(1, x) and the covariance matrix (not from the coefficient vector itself):
x0 <- c(1, x)  # x is the year value of interest
sqrt(t(x0) %*% covmat %*% x0)
The operator %*% can be used for matrix multiplication in this software.
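Here is a self-contained sketch on made-up yearly data (the year index x, exposures T, and counts are all invented for illustration), showing that predict's se.fit on the link scale matches the by-hand sqrt(x0' V x0) calculation for every design row x0 = c(1, x):
set.seed(1)
dat <- data.frame(x = 1:20, T = runif(20, 50, 150))              # hypothetical exposures
dat$count <- rpois(20, lambda = dat$T * exp(-2 + 0.05 * dat$x))  # hypothetical counts
hotmod <- glm(count ~ x + offset(log(T)), family = poisson, data = dat)
# standard error of the linear predictor a + b*x from predict()
predz <- predict(hotmod, se.fit = TRUE)
head(predz$se.fit)
# the same thing by hand
X <- model.matrix(hotmod)
V <- vcov(hotmod)
head(sqrt(diag(X %*% V %*% t(X))))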