Convergence rate in R

I have simulated a large number of sample paths S_n = sum(X_i), where the X_i are drawn from several different distributions.
For each path I then look at the proportion of the n steps at which S_n is positive, where n is the length of the sample path.
This proportion follows the arcsine density, which I have verified with a P-P plot and a Q-Q plot.
But I would like to see which distribution for the X_i converges fastest to the limiting arcsine distribution.
How do I do this?
Thanks,
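One possible approach, not from the original thread: measure how far the empirical distribution of the proportion of positive partial sums is from the arcsine CDF, F(x) = (2/pi)*asin(sqrt(x)), and track that distance as the path length n grows, separately for each step distribution. A minimal R sketch under those assumptions (the function names and the choice of a Kolmogorov-Smirnov-type distance are mine):

arcsine_cdf <- function(x) (2 / pi) * asin(sqrt(x))

ks_to_arcsine <- function(n, n_paths = 2000, rstep = rnorm) {
  # proportion of the n steps at which the partial sum is positive, per path
  prop_pos <- replicate(n_paths, mean(cumsum(rstep(n)) > 0))
  # maximum gap between the empirical CDF and the arcsine CDF on a grid
  grid <- seq(0.01, 0.99, by = 0.01)
  max(abs(ecdf(prop_pos)(grid) - arcsine_cdf(grid)))
}

ns <- c(50, 200, 1000)
sapply(ns, ks_to_arcsine)  # standard normal steps
sapply(ns, ks_to_arcsine, rstep = function(n) sample(c(-1, 1), n, replace = TRUE))  # +/-1 steps

Plotting the distances against n on a log-log scale, one curve per step distribution, then makes the convergence rates directly comparable.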

Related

Using GAMLSS, the difference between fitDist() and gamlss()

When using the GAMLSS package in R, there are many different ways to fit a distribution to a set of data. My data is a single vector of values, and I am fitting a distribution to these values.
My question is this: what is the main difference between using fitDist() and gamlss() since they give similar but different answers for parameter values, and different worm plots?
Also, using the function confint() works for gamlss() fitted objects but not for objects fitted with fitDist(). Is there any way to produce confidence intervals for parameters fitted with the fitDist() function? Is there an accuracy difference between the two procedures? Thanks!
m1 <- fitDist()

fits many distributions and chooses the best according to a generalized Akaike information criterion, GAIC(k), with penalty k for each fitted parameter in the distribution, where k is specified by the user, e.g.:

k = 2 for AIC,
k = log(n) for BIC,
k = 4 for a Chi-squared test (rounded from 3.84, the 5% critical value of a Chi-squared distribution with 1 degree of freedom), which is my preference.

m1$fits

gives the full results from the best to the worst distribution according to GAIC(k).
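A minimal sketch of how these pieces fit together, assuming the data vector is called y; refitting the chosen family with gamlss() is one way to get confidence intervals, since, as noted in the question, confint() works on gamlss() fitted objects:

library(gamlss)

m1 <- fitDist(y, k = 4, type = "realline")  # GAIC(k = 4) over real-line families
m1$fits                                     # all fitted distributions, best first

# Refit the best family with gamlss() so that confint() applies;
# note the intervals are for the coefficients on the link scale (e.g. log(sigma)).
best <- names(m1$fits)[1]
m2 <- gamlss(y ~ 1, family = best)
confint(m2)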

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is a list of current samples, then dataset[rand(1:length(dataset),sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it could be more readable, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also good, but it requires choosing a parameter (the kernel and its width), which expresses a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each draw using a sample from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)

Significance level of ACF and PACF in R

I want to obtain the limits that determine the significance of the autocorrelation coefficients and partial autocorrelation coefficients, but I don't know how to do it.
I obtained the partial autocorrelogram using pacf(data). I want R to print the values indicated in the figure.
The limits that determine the significance of the autocorrelation coefficients are: ±(exp(2*1.96/√(N-3)) - 1)/(exp(2*1.96/√(N-3)) + 1).
Here N is the length of the time series, and I used the 95% confidence level.
The correlation values that correspond to the m % confidence intervals chosen for the test are given by 0 ± i/√N where:
N is the length of the time series
i is the number of standard deviations we expect m % of the correlations to lie within under the null hypothesis that there is zero autocorrelation.
Since the observed correlations are assumed to be normally distributed:
i=2 for a 95% confidence level (acf's default),
i=3 for a 99% confidence level,
and so on as dictated by the properties of a Gaussian distribution
Figure A1, Page 1011 here provides a nice example of how the above principle applies in practice.
After investigating the acf and pacf functions and the psychometric library with its CIz and CIr functions, I found this simple code to do the task.
Compute the confidence interval on the Fisher-z scale:
ciz = c(-1,1)*(-qnorm((1-alpha)/2)/sqrt(N-3))
Here alpha is the confidence level (typically 0.95) and N is the number of observations.
Compute the confidence interval for r:
cir = (exp(2*ciz)-1)/(exp(2*ciz)+1)
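Putting the two answers together, a minimal sketch (assuming data is a numeric time series) that computes both the large-sample ±i/√N limits that acf() and pacf() draw by default and the Fisher-z limits, and overlays the latter on the partial autocorrelogram:

N <- length(data)
alpha <- 0.95  # confidence level

# large-sample limits drawn by acf()/pacf() by default
clim <- c(-1, 1) * qnorm((1 + alpha) / 2) / sqrt(N)

# Fisher-z limits, back-transformed to the correlation scale
ciz <- c(-1, 1) * (-qnorm((1 - alpha) / 2) / sqrt(N - 3))
cir <- (exp(2 * ciz) - 1) / (exp(2 * ciz) + 1)

pacf(data)
abline(h = cir, lty = 3, col = "red")  # add the Fisher-z limits to the plot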

Bootstrap and sample mean

I have a sample from an exponential distribution, let's say x<-rexp(30,0.2).
I resample this 5000 times with replacement and compute the sample mean:
resample<-replicate(5000,mean(sample(x,30,replace=TRUE)))
I draw the following histogram to see the distribution of T(X*) - T(X):
hist(resample-mean(x),freq=FALSE)
I know that, since I have a sequence of iid Exponentials, the sum of this sequence has a Gamma distribution, so the sample mean is that Gamma scaled by the number of Exponential rv's I'm considering (i.e., 30).
How can I overlay this Gamma distribution to the previous histogram?
I'm trying to use the following:
res.range <- seq(min(resample), max(resample), .001)
lines(res.range, dgamma(res.range, shape=1, rate=0.2/30))
but it doesn't seem to work.
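One way to make the overlay line up (a sketch, not from the original thread): the mean of 30 iid Exp(0.2) draws is Gamma(shape = 30, rate = 30 * 0.2), not Gamma(shape = 1, rate = 0.2/30), and because the histogram shows resample - mean(x), the density has to be drawn at correspondingly shifted points:

set.seed(1)
x <- rexp(30, 0.2)
resample <- replicate(5000, mean(sample(x, 30, replace = TRUE)))
hist(resample - mean(x), freq = FALSE)

# mean of 30 iid Exp(rate = 0.2) draws is Gamma(shape = 30, rate = 30 * 0.2);
# subtract mean(x) from the x-coordinates to match the centered histogram
res.range <- seq(min(resample), max(resample), 0.001)
lines(res.range - mean(x), dgamma(res.range, shape = 30, rate = 30 * 0.2))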

How to set a weighted least-squares in r for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated by cities, so I have many thousands of observations.
My model is somewhat heteroscedastic, though. I want to run a weighted least squares where each observation is weighted by the city's population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It's unclear to me, however, what the best syntax would be. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population)), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population), then you want ...weights=1/population.
As to which of those is most appropriate... that's a question for CrossValidated!
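A quick way to convince yourself of how lm() uses the weights (a sketch with made-up data, since none is given here): a fit with weights w must match an unweighted fit in which both sides of the model are scaled by sqrt(w), because both minimize sum(w * e^2).

set.seed(1)
x <- rnorm(100)
w <- runif(100, 1, 5)
y <- 1 + 2 * x + rnorm(100)

# these two fits give the same slope and intercept
coef(lm(y ~ x, weights = w))
coef(lm(I(y * sqrt(w)) ~ 0 + sqrt(w) + I(x * sqrt(w))))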
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide the residuals by the square roots of the populations:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance among the residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2, are similar, then you're doing something right. If they are drastically different, then your assumption may be violated.
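Here is a self-contained sketch of that check on simulated data, where the error variance is proportional to the population by construction (all names below are made up for illustration):

set.seed(42)
n <- 5000
population <- round(runif(n, 1e3, 1e6))
x <- rnorm(n)
e <- rnorm(n, sd = 0.01 * sqrt(population))  # Var(e) proportional to population
life_expectancy <- 70 + 2 * x + e

model <- lm(life_expectancy ~ x, weights = 1 / population)
standardized_residuals <- model$residuals / sqrt(population)

variance1 <- var(standardized_residuals[population < median(population)])
variance2 <- var(standardized_residuals[population > median(population)])
c(variance1, variance2)  # should be similar when the assumption holds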
