What does it mean to put an `rnorm` as an argument of another `rnorm` in R?

I have difficulty understanding what it means when the output of one rnorm() is used as an argument of another rnorm(). (I'll explain more below.)
For example, in the first line of my R code below I call rnorm() and store the result in mu.
mu consists of 10,000 values.
Now I pass mu itself as the mean argument of a new rnorm() call, stored in "distribution".
My question is: how can mu, which itself contains 10,000 values, be used as the mean argument of this new rnorm() call?
P.S.: The mean of a normal distribution is a single number, and with ONE single mean we get a single normal curve. How come using 10,000 values of mu still results in a single density curve?
mu <- rnorm( 1e4 , 178 , 20 ) ; plot( density(mu) )
distribution <- rnorm( 1e4 , mu , 1 ) ; plot( density(distribution) )

Your distribution is defined conditionally (each draw uses its own mu), while the density you draw with plot(density(distribution)) is a marginal density.
Statistically speaking, you first have a normal random variable mu ~ N(178, 20), then another random variable y | mu ~ N(mu, 1). The plot you produce is the marginal density of y.
Mathematically, $p(y)$ is the integral of the joint density $p(y \mid \mu)\, p(\mu)$ with $\mu$ integrated out: $p(y) = \int p(y \mid \mu)\, p(\mu)\, d\mu$.
@李哲源ZheyuanLi, ahhh! So when we use a vector as the mean or sd argument of rnorm(), the single, final plot is the result of that integral, right?
It means you are sampling from the marginal distribution; the density estimate built from those samples approximates the integral by Monte Carlo.
This kind of thing is often seen in Bayesian computation. Toy R code on Bayesian inference for the mean of a normal distribution [snowfall data] gives a full example, but there the integral is computed by numerical integration.
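To make this concrete, here is a minimal sketch (my addition, not from the original thread): drawing rnorm(1e4, mu, 1) with a vector mu is the same as drawing one y for each mu, i.e. sampling from the marginal of y. For this particular example the marginal is itself normal, N(178, sqrt(20^2 + 1^2)), so the simulated density and the analytic curve should overlap.
set.seed(42)
mu <- rnorm(1e4, 178, 20)
y  <- rnorm(1e4, mu, 1)                      # one y_i per mu_i, i.e. y_i | mu_i ~ N(mu_i, 1)
plot(density(y), main = "marginal density of y")
curve(dnorm(x, 178, sqrt(20^2 + 1^2)), add = TRUE, lty = 2, col = 2)   # analytic marginal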


Simulating a likelihood ratio test (LRT) p-value using the Monte Carlo method

I'm trying to figure out my assignment to simulate an LRT p-value using the Monte Carlo method. As far as I understand, the LRT is supposed to test which of two nested models is the "better", more accurate one.
I know how to perform such a test:
nested <- glm(finalgrade~absences,data=grades)
complex <- glm(finalgrade~absences+age,data=grades)
lrtest(nested, complex)
From there I can get my p-value and perform some calculations, like type I and type II error rates or the power of the test, and see how they change depending on the number of simulations.
My question is how I am supposed to simulate the random data. It doesn't have to be grades or school-related data; that was just a showcase of my understanding.
I was thinking about making a data frame with 3 to 4 columns, with one column being a dependent variable (0/1) and the rest being random numbers generated from the normal distribution or some other distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function but it didn't really help me understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
dependent <- sample(c(0, 1), replace = TRUE, size = n)
pvalue <- c()
for (i in 1:1000) {
  independent_x <- rnorm(n, mean = 2, sd = 0.2)
  independent_y <- rnorm(n, mean = 7, sd = 0.5)
  nested <- lm(dependent ~ independent_x)
  complex <- lm(dependent ~ independent_x + independent_y)
  test <- lrtest(nested, complex)
  pvalue <- c(pvalue, test[["Pr(>Chisq)"]][2])   # p-value of the LRT
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of computing it or estimating it with formulas. The trick is to use simulation in place of those computations.
Your problem does not seem to be so much how to perform the simulations, but rather which distribution you are interested in and want to simulate, and which boundary conditions you need to fix. Which computation or estimate do you want to replace with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihood $\mathcal{L}$ where one of the hypotheses is a composite hypothesis and you select the highest likelihood among them.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\text{sup}_{\theta_{age} \neq 0}\mathcal{L}( \theta_{age} | \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\mathcal{L}( \hat\theta_{age} | \text{some data})} $$ where the supremum is found by using the likelihood for the maximum likelihood estimator $ \hat\theta_{age} $
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on some distribution and link function) or the simpler lm (which assumes a Gaussian conditional distribution for the data).
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute what the distribution of the outcomes of $\Lambda$ is. You also need to fix some further data and parameters:
The independent variables. These you probably want to fix at some values that relate to your practical problem. You want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is an error in these independent variables; in that case you may also simulate these variables.
The variance/dispersion/noise-level of the conditional distribution. This you may vary to see how this influences the statistic. Or you have some value of interest, for instance if you have data for which you estimated the noise.
The other coefficients. These you may likewise vary or keep fixed depending on the situation, whether you want to model a particular situation or a broader range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5,000 simulations because I am using an online R editor and compiler; on your own computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
  ### compute model
  Z = X %*% theta1
  p = 1/(1+exp(-Z))
  ### simulate dependent variable
  Y = rbinom(length(p), 1, p)
  ### compute (log)likelihood ratio
  mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
  mod0 = glm(Y ~ 1 + X[,2], family = binomial)
  logratio = -2*(logLik(mod0)-logLik(mod1))
  return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim, 1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform: we had an estimate based on a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known and one uses the data and some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out which exact computation (and under which given information/conditions) you want to replace with simulation.
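As a small follow-up sketch (my addition, reusing sim(), theta1, X, and Lsim from the code above): once you have simulated the null distribution of the statistic, the Monte Carlo p-value for an observed value is simply the proportion of simulated statistics at least as large, which you can compare with the chi-squared approximation.
### here the "observed" statistic is itself simulated under the null, purely for illustration
obs <- sim(theta1, X)
p_mc <- mean(Lsim >= obs)                             # Monte Carlo p-value
p_chisq <- pchisq(obs, df = 1, lower.tail = FALSE)    # large-sample chi-squared approximation
c(monte_carlo = p_mc, chi_squared = p_chisq)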

How do I use the pgamma() function in R to compute the CDF of a gamma distribution?

I want to compute the cumulative distribution function in R for data that follows a gamma distribution. I understood how to do this with a lognormal distribution using the equation from Wikipedia; however, the gamma equation seems more complicated and I decided to use the pgamma() function.
I'm new to this and don't understand the following:
1. Why do I get three different values out of pgamma(), and how does it make sense that they are negative?
2. Am I supposed to take the log of all the quantiles, just as I used log(mean) and log(standard deviation) when doing calculations with the lognormal distribution?
3. How do I conceptually understand the CDF calculated by pgamma()? It made sense for the lognormal that I was calculating the probability that X takes a value <= x, but there is no "x" in this pgamma() call.
Really appreciate the help in understanding this.
shape <- 1.35721347
scale <- 1/0.01395087
quantiles <- c(3.376354, 3.929347, 4.462594)
pgamma(quantiles, shape = shape, scale = scale, log.p = TRUE)
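A quick note on the code above (my addition): pgamma() is vectorised over its first argument, so the three quantiles yield three values of the CDF P(X <= x). With log.p = TRUE those are log-probabilities, which is why they are negative; dropping log.p (or exponentiating) gives ordinary probabilities in (0, 1), and no log-transform of the quantiles is needed.
pgamma(quantiles, shape = shape, scale = scale)                      # CDF values, in (0, 1)
exp(pgamma(quantiles, shape = shape, scale = scale, log.p = TRUE))   # same values, via exp of the log CDF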

How to generate a random variable from two different distributions in R

Suppose a random variable Z is drawn from one of two different distributions with equal probability: a standard normal N(0,1) or an exponential exp(1) with rate = 1. I want to generate the random variable Z.
So in R my approach is Z = 0.5X + 0.5Y, so that Z comes from combining N(0,1) and exp(1). The R code will be:
x<-rnorm(1)
y<-rexp(1)
z<-0.5*x+0.5*y
My question is: can I obtain Z by just adding up x and y weighted by their probabilities, or do I have to consider the correlation between the variables?
Unfortunately not. You need another variable U, which is a Bernoulli random variable with p=0.5 and independent of X and Y. Define Z = U*X+(1-U)*Y. In R, you can do
x<-rnorm(1)
y<-rexp(1)
u<-rbinom(1,1,0.5)
z<-u*x+(1-u)*y
Averaging X and Y results in a totally different distribution, not the mixture of distributions you want.
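A small sketch (my addition) showing the difference over many draws: u*x + (1-u)*y traces out the 50/50 mixture density, whereas 0.5*x + 0.5*y is a weighted sum with a visibly different, smoother density.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- rexp(n, rate = 1)
u <- rbinom(n, 1, 0.5)
z_mix <- u*x + (1 - u)*y            # draws from the mixture
z_avg <- 0.5*x + 0.5*y              # weighted sum, a different random variable
plot(density(z_mix), main = "mixture vs. weighted sum")
lines(density(z_avg), lty = 2, col = 2)
curve(0.5*dnorm(x) + 0.5*dexp(x), add = TRUE, lty = 3, col = 4)   # true mixture density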

How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's distances of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's Distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model: it removes each observation, one at a time, refits the model, and reports, roughly speaking, how much the model coefficients change with or without a given observation.
Reproducible example:
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
require(nlstools)
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks influential points above a certain threshold as "influential" in the resulting plot. The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion; it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.
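As a hedged follow-up sketch (my addition, assuming the nls1 fit from the reproducible example above): nlstools also provides other diagnostics, such as nlsResiduals for residual plots and nlsBoot for a non-parametric bootstrap of the parameter estimates.
require(nlstools)
res1 <- nlsResiduals(nls1)            # standardised and autocorrelation residual plots
plot(res1)
boot1 <- nlsBoot(nls1, niter = 200)   # non-parametric bootstrap of the fitted model
summary(boot1)                        # bootstrap estimates and confidence intervals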

How to fit frequency distributions in R?

Is there a function that could be used to fit a frequency distribution in R? I'm aware of fitdistr but as far as I can tell it only works for data vectors (random samples). Also, I know that converting between the two formats is trivial but frequencies are so large that memory is a concern.
For example, fitdistr may be used the following way:
x<-rpois(100, lambda=10)
fitdistr(x,"poisson")
Is there a function that would do the same fitting on a frequency table? Something along these lines:
freqt <- as.data.frame(table(x))
fitfreqtable(freqt$x, weights=freqt$Freq, "poisson")
Thanks!
There's no built-in function that I know of for fitting a distribution to a frequency table. Note that, in theory, a continuous distribution is inappropriate for a table, since the data is discrete. Of course, for large enough N and a fine enough grid, this can be ignored.
You can build your own model-fitting function using optim or any other optimizer, if you know the density that you're interested in. I did this here for a gamma distribution (which was a bad assumption for that particular dataset, but never mind that).
Code reproduced below.
negll <- function(par, x, y)
{
  shape <- par[1]
  rate <- par[2]
  mu <- dgamma(x, shape, rate) * sum(y)
  -2 * sum(dpois(y, mu, log=TRUE))
}
optim(c(1, 1), negll, x=seq_along(g$count), y=g$count, method="L-BFGS-B", lower=c(.001, .001))
$par
[1] 0.73034879 0.00698288
$value
[1] 62983.18
$counts
function gradient
32 32
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
For fitting a Poisson distribution, you only need the mean of your sample: the sample mean is the maximum-likelihood estimate of lambda, which is the only parameter of the Poisson distribution. Example:
set.seed(1111)
sample <- rpois(n = 10000, lambda = 10)
mean(sample)
[1] 10.0191
which is almost equal to the lambda value used to create the sample (lambda = 10). The small difference (0.0191) is due to the randomness of the Poisson random number generator. As you increase n, the difference gets smaller.
Alternatively, you can fit the distribution using an optimization method:
library(fitdistrplus)
fitdist(sample,"pois")
Fitting of the distribution ' pois ' by maximum likelihood
Parameters:
estimate Std. Error
lambda 10.0191 0.03165296
but it's only a waste of time.
For theoretical information on fitting frequency data, you can see my answer here.
The function fitmixturegrouped from the package ForestFit does the job for other distribution models using frequency-by-group data.
It can fit simple or mixture distribution models based on "gamma", "log-normal", "skew-normal", and "weibull".
For a Poisson distribution, the population mean is the only parameter that is needed. Applying a simple summary function to your data would suffice (as suggested by ntzortzis).
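To tie this back to the frequency-table setting of the question, a minimal sketch (my addition): the Poisson maximum-likelihood estimate is the sample mean, and it can be computed directly from the frequency table as a weighted mean, without expanding the table back into a long vector.
x <- rpois(100, lambda = 10)
freqt <- as.data.frame(table(x))
vals <- as.numeric(as.character(freqt$x))            # table() stores the values as a factor
lambda_hat <- sum(vals * freqt$Freq) / sum(freqt$Freq)
lambda_hat                                           # identical to mean(x), using only the table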
