I wonder what would be the best and most correct way to estimate entropy from a probability density function in R. I have some real values that are not probabilities, and I would like to get some measure of the "unevenness" of those values, so I was thinking about entropy. Would something like this work:
entropy(density(dat$X), unit='log2'),
assuming that I am using the entropy function from the entropy package.
Are there other ways of estimating uncertainty from a real-valued vector?
Many thanks! PM
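A minimal sketch of the approach described above, assuming an illustrative vector x in place of dat$X and an arbitrary bin count; note that entropy() works on counts or frequencies, so the values are discretized first rather than passing the density object directly:

library(entropy)

set.seed(1)
x <- rgamma(1000, shape = 2)            # stand-in for dat$X

counts <- discretize(x, numBins = 30)   # bin the real values into counts
entropy(counts, unit = "log2")          # plug-in entropy estimate, in bits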
Related
I have a data set where observations come from highly distinct groups. Each group may have a wildly different distribution, so I am trying to find the best distribution using fitdist from fitdistrplus, then use gamlssML from the gamlss package to find the best parameters.
My issue is with transforming the data after this step. For some of the distributions, like the Box-Cox t, I can find the equation for normalizing the data using the BCT coefficients, but for many of these distributions I cannot.
Does gamlss have a function that normalizes the data after fitting? The documentation only provides the transformations for a small number of distributions: https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf
Thanks a lot
The normalised data values (for any distribution) are exactly equal to the (normalised quantile) residuals from a gamlss fit,
m1 <- gamlss(...)
which can be accessed by
residuals(m1)
or
m1$residuals
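A minimal sketch of that workflow; the vector y and the BCT family below are illustrative assumptions, not part of the question:

library(gamlss)

set.seed(1)
y <- rBCT(500, mu = 10, sigma = 0.3, nu = 1, tau = 5)  # stand-in data

m1 <- gamlss(y ~ 1, family = BCT)  # intercept-only fit of the chosen distribution
z  <- residuals(m1)                # normalised quantile residuals = the "normalised" data

qqnorm(z); qqline(z)               # z should look standard normal if the distribution fits well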
When using the GAMLSS package in R, there are many different ways to fit a distribution to a set of data. My data is a single vector of values, and I am fitting a distribution over these values.
My question is this: what is the main difference between using fitDist() and gamlss() since they give similar but different answers for parameter values, and different worm plots?
Also, using the function confint() works for gamlss() fitted objects but not for objects fitted with fitDist(). Is there any way to produce confidence intervals for parameters fitted with the fitDist() function? Is there an accuracy difference between the two procedures? Thanks!
m1 <- fitDist()
fits many distributions and chooses the best according to a generalized Akaike information criterion, GAIC(k), with penalty k for each fitted parameter in the distribution, where k is specified by the user, e.g.
k = 2 for AIC,
k = log(n) for BIC,
k = 4 for a Chi-squared test (rounded from 3.84, the 5% critical value of a Chi-squared distribution with 1 degree of freedom), which is my preference.
m1$fits
gives the full results from the best to the worst distribution according to GAIC(k).
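As a concrete illustration (a sketch only; the vector y, the type argument and k = 4 are assumptions made for the example):

library(gamlss)

set.seed(1)
y <- rgamma(500, shape = 2, scale = 3)      # stand-in for the data vector

m1 <- fitDist(y, k = 4, type = "realplus")  # GAIC penalty k = 4, distributions on (0, Inf)

m1$family    # the chosen (best-fitting) distribution
m1$fits      # GAIC(k) of every fitted distribution, ordered best to worst
summary(m1)  # parameter estimates for the chosen distribution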
I am really new to this and I have no idea how to use the ecdf function in R. Below I have mentioned everything step by step:
1. The frequency of losses is defined using a Poisson distribution.
2. Generate an ecdf function that is going to be used for the severity of losses.
3. Linearly interpolate the ecdf function.
4. Take the inverse transform of the linearly interpolated ecdf function.
For example, I can use freq <- rpois(10, 5) to generate the random number of loss frequencies, but then I have to use this vector to carry out steps 2-4, and I have no idea how to do that. For step 2, my problem is how to use that Poisson output as an input and then compute the severity using the ecdf function. If anybody knows this, please help me.
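A minimal sketch of steps 2-4; the severity data sev and its log-normal shape are assumptions made only so there is something to interpolate:

set.seed(1)
sev  <- rlnorm(1000, meanlog = 8, sdlog = 1)          # assumed severity observations
freq <- rpois(10, 5)                                  # step 1: losses per period

F_sev <- ecdf(sev)                                    # step 2: empirical CDF of severity

xs    <- sort(sev)
ps    <- F_sev(xs)                                    # ecdf evaluated at its knots
F_inv <- approxfun(ps, xs, rule = 2)                  # steps 3-4: the inverse of the linearly
                                                      # interpolated ecdf is itself piecewise linear

losses <- lapply(freq, function(n) F_inv(runif(n)))   # inverse-transform sample of severities
total  <- sapply(losses, sum)                         # aggregate loss per period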
I am using the rcorr function from the Hmisc package in R to compute Pearson correlation coefficients and corresponding p-values when analyzing the correlation of several fishery-landings time series. The data aren't really important here; what I would like to know is how the p-values are calculated. The documentation states that the asymptotic P-values are approximated by using the t or F distributions, but could someone point me to more information on this, or to an equation describing exactly how these values are calculated?
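For a Pearson correlation the usual asymptotic test converts r into a t statistic with n - 2 degrees of freedom, which appears to be what the documentation refers to. A sketch comparing that hand calculation with rcorr's output (the simulated x and y are purely illustrative):

library(Hmisc)

set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

r <- cor(x, y)
n <- length(x)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic for testing rho = 0
p_hand <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided asymptotic p-value

rcorr(cbind(x, y))$P[1, 2]                  # should agree with p_hand
p_hand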
I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is a list of current samples, then dataset[rand(1:length(dataset),sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it could be more readable, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also good, but it might require choosing a parameter (the kernel and its width), which expresses a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each draw with a sample from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)