Is this the right way to generate the following heavy-tailed distribution? - r

I'm trying to generate an error distribution that follows a heavy-tailed distribution, based on the statement:
(2) A heavy-tailed distribution, i.e., t-df=2, a t-distribution with 2 degrees of freedom. This distribution has C95=3.61, where C95 is a measure of the distribution's tail weight, defined as the ratio of the 95th percentile point to the 75th percentile point.
And this is how I did it.
e<-rt(n,c(0.75,0.95),2)
I'm not quite sure if I did it correctly. Is this the right way to generate the heavy-tailed distribution mentioned in the statement above?
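For reference, the second argument of rt() is the degrees of freedom (rt(n, df)); there is no argument for quantile points. A sketch of the usual call, with the C95 tail-weight ratio checked directly from the quantile function qt():

n <- 1000                                   # sample size, chosen here just for illustration
e <- rt(n, df = 2)                          # n draws from a t-distribution with 2 df
qt(0.95, df = 2) / qt(0.75, df = 2)         # C95 ratio for t-df=2, roughly 3.6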

Related

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is a list of current samples, then dataset[rand(1:length(dataset),sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it could be more readable, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also good, but it requires choosing a parameter (the kernel and its width), which amounts to a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each point using a draw from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)

How can I add a kurtosis term to this Gaussian bell-curve formula?

I am working with a formula to generate a curve in Python. I am tuning its parameters so that I can visually match it to an underlying curve (already plotted).
What I need is to adjust its kurtosis, which is currently not a parameter of the formula:
import numpy

def gaussian(x, peak_x, peak_y, sigma):
    return numpy.exp(-numpy.power(x - peak_x, 2.) / (2 * numpy.power(sigma, 2.))) * peak_y
I would need to expand the function to this signature:
def gaussian(x, peak_x, peak_y, sigma, KURTOSIS)
But I don't know where and how to change the formula.
I'm not sure what you mean by adding a kurtosis term to the Gaussian bell curve.
The Gaussian bell curve (also called the normal distribution) has zero excess kurtosis. Once you specify the mean and the variance, the curve is uniquely determined.
Assuming that you want to fit a Gaussian-like distribution to your data, I would suggest using one of the Pearson distributions, specifically Pearson type VII. The Pearson family gives you the liberty to set the mean, variance, skewness and kurtosis, so you can get a very close fit. However, if I understand your requirement correctly, you won't even need that level of flexibility: the Student's t-distribution should suffice.
You can find the equation on the Wikipedia page and tune the kurtosis by tuning the ν (degrees of freedom) parameter.
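If it helps, here is a minimal sketch of that idea (the function t_bell and its normalization are my own illustration, written in R rather than Python): a bell curve built on the Student's t density, where a small nu gives heavy tails (high kurtosis) and a large nu recovers the Gaussian shape.

t_bell <- function(x, peak_x, peak_y, sigma, nu) {
  z <- (x - peak_x) / sigma
  peak_y * dt(z, df = nu) / dt(0, df = nu)  # dividing by dt(0, nu) makes the peak height peak_y
}

x <- seq(-5, 5, by = 0.01)
plot(x, t_bell(x, 0, 1, 1, nu = 2), type = "l")   # heavy-tailed
lines(x, t_bell(x, 0, 1, 1, nu = 1e6), lty = 2)   # practically Gaussian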

Bootstrap and sample mean

I have a sample from an exponential distribution, let's say x<-rexp(30,0.2).
I resample this 5000 times with replacement and compute the sample mean:
resample<-replicate(5000,mean(sample(x,30,replace=TRUE)))
I draw the following histogram to see the distribution of T(X*) - T(X):
hist(resample-mean(x),freq=FALSE)
I know that since I have a sequence of iid Exponentials, their sum has a Gamma distribution, and the sample mean is that sum scaled by the number of Exponential rv's I'm considering (i.e., 30).
How can I overlay this Gamma distribution to the previous histogram?
I'm trying to use the following:
res.range<-seq(min(resample),max(resample),.001)
lines(res.range, dgamma(res.range,shape=1,rate=0.2/30))
but it seems it doesn't work.
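For what it's worth, here is a sketch of how the overlay could be written, assuming the comparison is against the exact distribution of the sample mean: the sum of 30 iid Exp(0.2) variables is Gamma(shape = 30, rate = 0.2), so their mean is Gamma(shape = 30, rate = 30 * 0.2 = 6), and because the histogram shows resample - mean(x), the density must be evaluated at a shifted argument.

# mean of 30 iid Exp(0.2) draws is Gamma(shape = 30, rate = 6)
res.range <- seq(min(resample - mean(x)), max(resample - mean(x)), .001)
# shift by mean(x) because the histogram is of resample - mean(x)
lines(res.range, dgamma(res.range + mean(x), shape = 30, rate = 6))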

How to set a weighted least-squares in r for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated by city, so I have several thousand observations.
My model is somewhat heteroscedastic though. I want to run a weighted least-squares where each observation is weighted by the city’s population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It’s unclear to me, however, what would be the best syntax. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line) * 1/sqrt(population), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line) * 1/population, then you want ...weights=1/population. (A quick numerical check of this is sketched below.)
As to which of those is most appropriate... that's a question for CrossValidated!
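The promised check, on made-up data: the coefficients from lm with a weights argument match the closed-form solution of the normal equations for minimizing sum(w*e^2).

set.seed(2)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)
w <- runif(20)                              # arbitrary positive weights
X <- cbind(1, x)                            # design matrix with intercept

coef(lm(y ~ x, weights = w))                # lm with the weights argument
solve(t(X) %*% (w * X), t(X) %*% (w * y))   # argmin of sum(w * e^2), via the normal equations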
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide the residuals by the square roots of the populations:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance among the residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2, are similar, then you're doing something right. If they are drastically different, then your assumption may be violated.
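Here is the whole check end-to-end on simulated data (a sketch; the names life_exp, x1 and population are hypothetical, and the errors are generated with variance deliberately proportional to population, so the assumption holds by construction):

set.seed(1)
n <- 500
population <- sample(1e3:1e6, n)
x1 <- rnorm(n)
life_exp <- 70 + 2 * x1 + rnorm(n, sd = 0.005 * sqrt(population))  # Var(error) proportional to population

fit <- lm(life_exp ~ x1, weights = 1/population)
standardized_residuals <- residuals(fit) / sqrt(population)

var(standardized_residuals[population < median(population)])   # these two variances
var(standardized_residuals[population > median(population)])   # should come out similar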

How can the length of the averaged normal be seen as a function of the angular deviation?

Recently I read NVidia's Mipmapping_Normal_Maps, which says we can use the un-renormalized averaged normal to compute the standard deviation of the angle between the averaged normal and the sample normals.
As a first step, it assumes a Gaussian distribution of the angular deviation and gives a figure (sorry, I cannot post an image as a new user; please refer to Figure 2 in that paper).
My question is: how is the length of the averaged normal represented as a function of the standard deviation of the angle (the original Gaussian density, the red curve in the figure)?
I believe the answer to your question is equation (1) in the paper. It shows that the length of the averaged normal is equal to the reciprocal of 1 + sigma^2. Sigma is the standard deviation; sigma^2 is called the variance.
At any rate, if you know the standard deviation, that's your value for sigma in the equations. Square it to get the variance, sigma^2.
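In symbols (a sketch of the algebra, assuming this reading of equation (1), with \bar{N} the un-renormalized averaged normal and \sigma the standard deviation of the angular deviation):

|\bar{N}| = \frac{1}{1 + \sigma^2}
\quad\Longleftrightarrow\quad
\sigma = \sqrt{\frac{1 - |\bar{N}|}{|\bar{N}|}}

So the length of the averaged normal and the angular standard deviation determine each other: the shorter the averaged normal, the wider the spread of the sampled normals.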
