Bootstrap and sample mean in R

I have a sample from an exponential distribution, let's say x<-rexp(30,0.2).
I resample this 5000 times with replacement and compute the sample mean:
resample<-replicate(5000,mean(sample(x,30,replace=TRUE)))
I plot the following histogram to see the distribution of T(X*) - T(X):
hist(resample-mean(x),freq=FALSE)
I know that since I have a sequence of iid Exponentials, the sum of the sequence has a Gamma distribution; dividing by the number of Exponential rv's I'm considering (i.e., 30) then gives the distribution of the sample mean.
How can I overlay this Gamma distribution to the previous histogram?
I'm trying to use the following:
res.range <- seq(min(resample), max(resample), .001)
lines(res.range, dgamma(res.range, shape = 1, rate = 0.2/30))
but it seems it doesn't work.
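For reference, here is a minimal sketch of a working overlay, assuming the intended comparison is with the exact sampling distribution of the mean (which the bootstrap distribution approximates). The sum of n iid Exp(rate) variables is Gamma(shape = n, rate = rate), so the mean is Gamma(shape = n, rate = n * rate); and since the histogram shows resample - mean(x), the x-values of the density curve must be shifted by mean(x) as well:

x <- rexp(30, 0.2)
resample <- replicate(5000, mean(sample(x, 30, replace = TRUE)))
hist(resample - mean(x), freq = FALSE)

# mean of 30 iid Exp(0.2) draws: Gamma(shape = 30, rate = 30 * 0.2)
# the histogram is centred at mean(x), so shift the curve accordingly
res.range <- seq(min(resample), max(resample), .001)
lines(res.range - mean(x), dgamma(res.range, shape = 30, rate = 30 * 0.2))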

Related

Risk aggregation of losses in R using a Poisson distribution for loss frequency and an ecdf for loss severity

I am really new to this and I have no idea how to use the ecdf function in R. Below I have written out everything step by step:
1. Define the frequency of losses using a Poisson distribution.
2. Generate an ecdf function to be used for the severity of losses.
3. Linearly interpolate the ecdf function.
4. Take the inverse transform of the linearly interpolated ecdf function.
For example, I can use freq <- rpois(10, 5) to generate the random number of losses, but then I have to use this vector to carry out steps 2-4, and I have no idea how to do that. For step 2, my problem is how to take that Poisson output as an input and then compute the severities using the ecdf function. If anybody knows, please help me.
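Not a complete answer, but a minimal sketch of steps 1-4 under one assumption: the ecdf is built from a hypothetical vector severity_data of observed losses (severity_data, Fhat, Finv, and agg_loss below are all made-up names for illustration):

set.seed(1)
severity_data <- rlnorm(200, meanlog = 8, sdlog = 1)  # stand-in for observed losses

# Step 2: empirical CDF of the severities
Fhat <- ecdf(severity_data)

# Steps 3-4: linearly interpolate the inverse of the ecdf (the quantile
# function), so that inverse-transform sampling Finv(U), U ~ Uniform(0, 1),
# works for any U
xs <- sort(severity_data)
Finv <- approxfun(Fhat(xs), xs, rule = 2)  # rule = 2 clamps outside the data range

# Step 1 plus aggregation: each period's aggregate loss is the sum of N
# severity draws, where N ~ Poisson(5)
freq <- rpois(10, 5)
agg_loss <- sapply(freq, function(n) sum(Finv(runif(n))))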

Is this the right way to generate the following heavy-tailed distribution?

I'm trying to generate an error distribution that follows a heavy-tailed distribution, based on this statement: "(2) A heavy-tailed distribution, i.e., t-df = 2, a t-distribution with 2 degrees of freedom. This distribution has C95 = 3.61, where C95 is a measure of the distribution's tail weight, defined as the ratio of the 95th percentile point to the 75th percentile point."
And this is how I did it.
e<-rt(n,c(0.75,0.95),2)
I'm not quite sure if I did it correctly. Is this the right way to generate the heavy-tailed distribution mentioned in the above statement?
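For what it's worth, a minimal sketch (n below is an assumed sample size): rt() expects the degrees of freedom as its second argument, and percentile points belong in qt(), so drawing the errors and checking C95 could look like:

n <- 1000           # assumed sample size
e <- rt(n, df = 2)  # n draws from a t-distribution with 2 degrees of freedom

# tail-weight check: ratio of the 95th to the 75th percentile point
C95 <- qt(0.95, df = 2) / qt(0.75, df = 2)  # compare with the quoted C95 = 3.61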

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is a list of current samples, then dataset[rand(1:length(dataset),sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it could be more readable, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also good, but it requires choosing a parameter (the kernel and its bandwidth), which expresses a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each draw using a sample from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset), sample_size)] + w * randn(sample_size)

Convergence rate in R

I have simulated some sample paths S_n = sum(X_i), where the X_i are drawn from different distributions.
I then look at the proportion of time S_n is positive, where n is the length of each sample path; I have simulated a large number of sample paths.
This proportion follows the arcsine density, which I have checked with a P-P plot and a Q-Q plot.
But I would like to see which distribution for the X_i converges fastest to the limiting arcsine distribution.
How do I do this?
Thanks,
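One way to make "converges fastest" concrete is the following sketch (all names below are made up, and it assumes mean-zero increments, as the arcsine limit requires): for each candidate increment distribution and a range of path lengths n, simulate many paths, record the proportion of time each path is positive, and track the Kolmogorov-Smirnov distance to the arcsine CDF as n grows.

# CDF of the arcsine distribution on [0, 1]
arcsine_cdf <- function(q) 2 / pi * asin(sqrt(q))

# KS distance between simulated proportion-positive values and the arcsine law
ks_to_arcsine <- function(rincr, n, n_paths = 2000) {
  prop_pos <- replicate(n_paths, {
    s <- cumsum(rincr(n))
    mean(s > 0)
  })
  # prop_pos only takes values k/n, so ks.test warns about ties for small n
  ks.test(prop_pos, arcsine_cdf)$statistic
}

ns <- c(100, 1000, 10000)
# e.g. standard normal increments vs. centred exponential increments
sapply(ns, function(n) ks_to_arcsine(rnorm, n))
sapply(ns, function(n) ks_to_arcsine(function(m) rexp(m) - 1, n))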

How to set up weighted least squares in R for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated by city, so I have many thousands of observations.
My model is somewhat heteroscedastic though. I want to run a weighted least-squares where each observation is weighted by the city’s population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It’s unclear to me, however, what would be the best syntax. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population)), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population), then you want ...weights=1/population.
As to which of those is most appropriate... that's a question for CrossValidated!
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide the residuals by the square roots of the populations:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance among the residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2, are similar, then you're doing something right. If they are drastically different, then your assumption may be violated.
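Putting the pieces together, here is a minimal self-contained sketch with simulated data standing in for the census data (all variable names are hypothetical, and the error variance is constructed to be proportional to population, so weights = 1/population is the correct choice by design):

set.seed(42)
n <- 5000
population <- round(runif(n, 1e3, 1e6))
x1 <- rnorm(n)
# error sd proportional to sqrt(population), i.e. variance proportional to population
life_expectancy <- 75 + 2 * x1 + rnorm(n, sd = sqrt(population) / 500)

model <- lm(life_expectancy ~ x1, weights = 1/population)

# standardize the residuals and compare their spread in small vs. large cities
standardized_residuals <- residuals(model) / sqrt(population)
var(standardized_residuals[population < median(population)])
var(standardized_residuals[population > median(population)])  # should be similar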
