Fitting power-law distributions with the poweRlaw package in R

With the poweRlaw library, once alpha and xmin have been estimated with estimate_xmin, which formula does the package use to compute the fitted values?
I mean, assuming that y = C·x^(-alpha), my question is how the package computes the normalization constant C from alpha and xmin.

The normalising constant is fairly easy to calculate. See Clauset et al.'s power-law paper (in particular Table 2.1). For the continuous case, C = (alpha - 1) * xmin^(alpha - 1); the discrete case involves the Hurwitz zeta function, C = 1/zeta(alpha, xmin).
You can also examine the package's R code for the discrete and continuous implementations.
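As a rough sketch (this is not poweRlaw's internal code) of how the constant turns alpha and xmin into fitted values; the alpha and xmin below are placeholders standing in for the estimate_xmin output:
# placeholder estimates; substitute the output of estimate_xmin()
alpha <- 2.5
xmin <- 1
# continuous case: C = (alpha - 1) * xmin^(alpha - 1)
C_cont <- (alpha - 1) * xmin^(alpha - 1)
# discrete case: C = 1 / zeta(alpha, xmin); the Hurwitz zeta is
# approximated here by truncating its infinite sum
C_disc <- 1 / sum((xmin:1e5)^(-alpha))
# fitted values y = C * x^(-alpha) for the continuous case
x <- seq(xmin, 100, length.out = 200)
y <- C_cont * x^(-alpha)
plot(x, y, log = "xy", type = "l")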

Related

Weighted mixture model of two distributions where weight depends on the value of the distribution?

I'm trying to replicate the precipitation mixture model from this paper: http://dx.doi.org/10.1029/2006WR005308
f(r) is the gamma PDF, g(r) is the generalized Pareto PDF, and w(r) is the weighting function, which depends on the value r being considered. I've looked at R packages like distr and mixtools that handle mixture models, but I only see examples where w is a constant; I haven't found any implementation where the mixture weight is a function of the value. I'm struggling to create valid custom functions to represent the combined density h(r), so if someone could point me to a package, that would be super helpful.
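As a starting point, here is a minimal base-R sketch of a mixture with a value-dependent weight. It assumes the hypothetical form h(r) = (1 - w(r))*f(r) + w(r)*g(r) with a numerical renormalization (check the paper for the exact form), and the f, g, and w below are illustrative placeholders rather than the paper's actual parameterizations:
# illustrative placeholder components (not the paper's parameterizations)
f <- function(r) dgamma(r, shape = 2, rate = 1)                             # gamma PDF
g <- function(r) ifelse(r >= 0, (1/5) * (1 + 0.2 * r / 5)^(-1/0.2 - 1), 0)  # generalized Pareto PDF
w <- function(r) plogis(r, location = 3, scale = 0.5)                       # weight in [0, 1], increasing in r
# assumed mixture form, renormalized numerically so h integrates to 1
h_unnorm <- function(r) (1 - w(r)) * f(r) + w(r) * g(r)
Z <- integrate(h_unnorm, 0, Inf)$value
h <- function(r) h_unnorm(r) / Z
curve(h(x), 0, 15)  # inspect the combined density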

Fit a Weibull cumulative distribution to mass passing data in R

I have some particle-size mass-passing cumulative data for crushed rock material to which I would like to fit a Weibull distribution using R. I have managed to do this in Excel using the WEIBULL.DIST() function with the cumulative switch set to TRUE.
I then used Excel's Solver to derive the alpha and beta parameters, using RMSE to get the best fit. I would like to reproduce the result in R.
(see attached spreadsheet here)
The particle data and cumulative mass-passing fractions are the following vectors:
d.mm <- c(20.001, 6.964, 4.595, 2.297, 1.741, 1.149,
          0.871, 0.574, 0.287, 0.082, 0.062, 0.020)
m.pct <- c(1.00, 0.97, 0.78, 0.49, 0.27, 0.20, 0.14,
           0.11, 0.07, 0.03, 0.025, 0.00)
This is the plot to which I would like to fit the Weibull result:
plot(log10(d.mm),m.pct)
... computing the function for a vector of diameter values, as per the spreadsheet:
d.wei <- c(seq(0.01,0.1,0.01),seq(0.2,1,0.1),seq(2,30,1))
The values I've determined as best for the Weibull alpha and beta in Excel using Solver are 1.41 and 3.31, respectively.
So my question is: how can I reproduce this analysis in R (not necessarily the Solver part), fitting the Weibull to this dataset?
The nonlinear least squares function nls is the R analogue of Excel's Solver.
pweibull calculates the cumulative distribution function of the Weibull distribution. The comments in the code below explain the solution step by step:
d.mm <- c(20.001, 6.964, 4.595, 2.297, 1.741, 1.149,
          0.871, 0.574, 0.287, 0.082, 0.062, 0.020)
m.pct <- c(1.00, 0.97, 0.78, 0.49, 0.27, 0.20, 0.14,
           0.11, 0.07, 0.03, 0.025, 0.00)

# create a data frame to hold the data
df <- data.frame(m.pct, d.mm)

# diameters at which to predict
d.wei <- c(seq(0.01, 0.1, 0.01), seq(0.2, 1, 0.1), seq(2, 30, 1))

# the solver (nls requires starting values)
# alpha is used for the shape and beta for the scale
fit <- nls(m.pct ~ pweibull(d.mm, shape = alpha, scale = beta),
           data = df, start = list(alpha = 1, beta = 2))
print(summary(fit))

# extract the shape and scale estimates
print(summary(fit)$parameters[, 1])

# predict new values from the model
y <- predict(fit, newdata = data.frame(d.mm = d.wei))

# plot the data and the fitted curve for comparison
plot(log10(d.mm), m.pct)
lines(log10(d.wei), y, col = "blue")
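Since the question used RMSE as the Solver objective in Excel, one quick comparable check of the nls fit is:
# root-mean-square error of the fitted model, comparable to the Solver objective
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)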

Estimating a probability distribution and sampling from it in Julia

I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is an array of the observed samples, then dataset[rand(1:length(dataset), sample_size)] is a set of new samples from the empirical distribution. With the sample function from the StatsBase package, it can be written more readably:
using StatsBase
new_sample = sample(dataset, sample_size)
Finally, kernel density estimation is also good, but it requires choosing a kernel and its bandwidth, which amounts to a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each draw with a sample from the kernel.
For example, if the kernel is a Normal distribution with width w, the perturbed sample can be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)

Discretizing a continuous probability distribution

Recognizing that this may be as much a statistical question as a coding question, let's say I have a normal distribution created using Distributions.jl:
using Distributions
mydist = Normal(0, 0.2)
Is there a good, straightforward way that I should go about discretizing such a distribution in order to get a PMF as opposed to a PDF?
In R, I found that the actuar package contains a function to discretize a continuous distribution. I failed to find anything similar for Julia, but thought I'd check here before rolling my own.
There isn't a built-in function to do it, but you can use a range object combined with the cdf and diff functions to compute the values:
using Distributions
mydist = Normal(0, 0.2)
r = -3:0.1:3
d = diff(cdf.(mydist, r))  # probability mass in each interval (r[i], r[i+1]]
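Note that the entries of d sum to cdf(mydist, 3) - cdf(mydist, -3), slightly less than 1, because the tails beyond ±3 are cut off; if you need a proper PMF, rescale d by its sum (d = d / sum(d)).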

Specifying probability weights in R *without* using the Lumley survey package

I would really appreciate any help with specifying probability weights in R without using the Lumley survey package. I am conducting mediation analysis in R using the Imai et al. mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)
However, I'm unsure whether this is weighting the data correctly, because it yields standard errors that differ from those I get in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering whether the weights option in R might not be for inverse probability weights, but I was unable to determine this from the documentation, this forum, or elsewhere. If I am missing something, I apologize; I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the weights argument of lm should be inversely proportional to the variance of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct a previous answer: I looked up the manual entry on weights and found the following description for weights in lm:
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata): they replicate each observation the number of times given by its weight. Probability weights, on the other hand, refer to the probability that an observation's group is included in the sample. Using them adjusts the impact of each observation on the coefficients, but not on the standard errors, since they do not change the number of observations represented in the sample.
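One common workaround: Stata's [pweight=] regression reports the same point estimates as lm() with weights but pairs them with robust (sandwich) standard errors, so computing those in R should reconcile the two outputs. A sketch using the sandwich and lmtest packages:
library(sandwich)  # heteroskedasticity-robust covariance estimators
library(lmtest)    # coeftest() for tests with a custom vcov

# same fit as above; the point estimates already match Stata's [pweight=] run
fit <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
          data = unifiedanalysis, weights = designweight)

# HC1 robust standard errors (the small-sample adjustment Stata's robust VCE uses)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))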
