Sampling from a pareto distribution in R - getting values below scale parameter - r

I am sampling from a Pareto distribution using rpareto from package actuar.
What I find strange is that I keep getting values below the defined scale parameter. I was under the impression they should all be above it.
tail <- rpareto(2,scale=y_0,shape=a_hat)
and I get in my vector values below y_0. Does anyone have an explanation?
Thanks!

Check the formulation used by the function in the ?rpareto help page. Perhaps they parameterized the distribution differently than you do. From the help page:
So they say that the returned values with be >0. The support does not depend on the parameters.
If you are comparing this to the version on wikipedia they are basically doing a translation of
y = x - scale

Related

acf function producing slightly inaccurate result

The autocorrelation command acf is slightly off the correct value. the command i am giving is acf(X,lag.max = 1) and the output is 0.881 while the same autocorrelation calculation done using the command cor(X[1:41],X[2:42]) gives the value 0.9452. I searched a lot to understand if there is any syntax I am missing or if the two computations have some fundamental difference but could not find suitable resources. Please help.
The equation for autocorrelation used by acf() and in all time series textbooks is
where
See https://otexts.com/fpp2/autocorrelation.html for example.
The equation for regular correlation is slightly different:
where
The simpler formula is better because it uses all available data in the denominator.
Note also that you might expect the top formula to include the multiplier $T/(T-k)$ given the different number of terms in the summations. But this is not included to ensure that the corresponding autocorrelation matrix is nonnegative definite.

Response variable out of range -> using gamlss in r LOGNO

I am new to the function GAMLSS in r, and when I run my code I always get this error: Response Variable out of range
After looking into the data frame, I realized the issue was one of response variables was 0.0000.
I was wondering if someone could explain to me why 0 is out of range and possible solutions to go around it (ex. such as replacement the values)?
LOGNO family corresponds to the log-normal distribution, which is defined for positive values only.
The possible solutions might be (but highly depend on the context):
use another distribution, which better models the response variable and allows zero values
sometimes zero values are reported if they are below the limit of detection (LOD).
In this case, one has a censored data set, and you may look for review of the methods, how to tackle it. A pragmatic approach is to substitute zeros with values like LOD/2, reviewed, for example, here. However, it may result in a very biased estimation.

Including an offset variable [or constraining a coefficient to 1] using mlogit R

I´ve been estimating some MNL models in R using mlogit. The package works very well but it seems that it does not allow to include offset variables. I read the package documentation in order to see whether it allowed to constrain a coefficient when it estimated the model. It seems that there is one alternative using mlogit.optim. Unfortunately, it does not specify how it must be used.
So, my point is: does any of you know how to either (a) including an offset variable or (b) how to constrain a coefficient (coeff A = 1) using mlogit?
Thanks in advance!
Best,
Dr. Wall

How can optimization be used as a solver?

In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, and where the censoring would be applied for all data past a certain censoring time, and end up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values—it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented were examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that is has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started most anywhere).
The function that I passed in is ?pweibull. It is the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted.15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, and the difference was -.10 (i.e., < 0), the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This would push the optimizer back towards the scale parameter that makes pweibull output (.15-.15)^2 = 0.
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.

Histogram matching - image processing - c/c++

I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist1[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is of type multi-modal;
Hist2's distribution is of type uni-modal with single prominent peak.
My questions are
Is there any way that i could determine the type of distribution programmatically?
How to quantify whether these two histograms are similar/dissimilar?
Thanks
Raj,
I posted a C function in your other question ( automatically compare two series -Dissimilarity test ) that will compute divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a gaussian distribution and use something like the R-squared value to determine if the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try doing an autocorrelation and using the peak positive value in the autocorrelation as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimation of what "similarity".
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.

Resources