How to produce a vector based on two probability distributions? - r

Let's assume we have 3 distributions for different types of damage within an insurance company. A Weibull distribution with these parameters:
weibull.data<-rweibull(2000,1,2000)
A lognormal one with these parameters
lnormal.data<-rlnorm(10000,7,0.09)
And finally a Fréchet distribution for the extreme values, which looks like this:
frechet.data<-rfrechet(15, loc=6, scale=1200000, shape=10000)
For each of them, I calculated the ruin probability. Now I want to estimate the ruin probability of the convolution of the 3 distributions, but I don't know how to "combine" them in a logical way. The vector that I need is a combination of the 3.
Excuse my English, I'm a French native speaker :-)
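A minimal simulation sketch (not from the original post): if the three damage types are independent and the aggregate damage is the sum of one draw from each, the convolution can be approximated by simulating equally sized samples and adding them row-wise. rfrechet() is not in base R, so a package such as evd (whose rfrechet() matches the arguments used above) is assumed here, and the common sample size n is arbitrary.
library(evd)                                  # assumed source of rfrechet()
set.seed(1)
n <- 10000                                    # one common sample size for all three
weibull.data <- rweibull(n, shape = 1, scale = 2000)
lnormal.data <- rlnorm(n, meanlog = 7, sdlog = 0.09)
frechet.data <- rfrechet(n, loc = 6, scale = 1200000, shape = 10000)
total.damage <- weibull.data + lnormal.data + frechet.data   # sample from the convolution
quantile(total.damage, c(0.50, 0.95, 0.995))                 # e.g. tail quantiles of aggregate damage
The total.damage vector can then be plugged into the same ruin-probability calculation already used for each distribution separately.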

Related

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC when compared to other models, etc.:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple and I apologize if I'm overlooking something obvious. Please let me know if there is further code that I should share for more clarity; I didn't want to make the initial post crazy long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
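On question (1), a hypothetical illustration with plain lm() and made-up data (not the brms model above): orthogonal and raw polynomials parameterize the same curve, so the fitted values agree exactly, but the raw columns are nearly collinear, which is typically what stalls the sampler when raw = TRUE.
# Made-up year/count data; lm() is used only to illustrate the parameterization.
set.seed(1)
year  <- 2000:2020
count <- rpois(length(year), lambda = exp(0.1 * (year - 2010)))
fit_orth <- lm(count ~ poly(year, 2))
fit_raw  <- lm(count ~ poly(year, 2, raw = TRUE))
all.equal(fitted(fit_orth), fitted(fit_raw))   # TRUE: identical fitted curve
cor(poly(year, 2, raw = TRUE))                 # raw terms are almost perfectly correlated
On question (2), since the two parameterizations describe the same curve, a trend summary does not have to come from a single coefficient: one option is to take the posterior predictions already used for the marginal effects plot and report the posterior distribution of the year-over-year (or annualized first-to-last-year) percent change.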

Is it possible to specify the correlation between two distributions?

For context, say two academic exams, one in the morning and one in the afternoon, were conducted. I'm only given the summary statistics (mean, median, skew and kurtosis) for the scores on both exams, so I'm unable to say exactly how many students passed, but I can estimate it by fitting the moments and creating a custom Pearson distribution. I can estimate, for example, how many students passed the first and the second exam, and give the estimate a standard deviation to quantify my error.
What I would like to do is estimate the number of students who pass the course, defined as having the average score of the morning and afternoon exams be over 60%. If the performance of students on the two tests is completely independent, I suppose this would be easy: I just generate scores for both tests in the form of two lists, average them, count the number of items over 60%, and repeat, say, 10,000 times.
If the two tests are completely dependent, I suppose I would have to sort both lists, because the student scoring highest on the morning exam should also score highest on the afternoon exam. What I'm missing is how to measure the degree of randomness/interdependence (maybe it has something to do with entropy?) in between, where students who score highly on exam 1 also tend to score highly on exam 2, and whether there is a package in R that I can use to specify an arbitrary degree of entropy between two variables.
A well-known concept for measuring how different two distributions are is the Kullback–Leibler (KL) divergence:
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.
If you need a symmetric measure, you can use the Jensen-Shannon divergence instead.
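As a concrete toy illustration (the two probability vectors below are made up), both quantities take only a few lines of base R:
p <- c(0.1, 0.4, 0.5)                      # made-up distribution P
q <- c(0.2, 0.3, 0.5)                      # made-up distribution Q
kl <- function(p, q) sum(p * log(p / q))   # KL(P || Q); asymmetric, zero only when P == Q
m   <- (p + q) / 2
jsd <- 0.5 * kl(p, m) + 0.5 * kl(q, m)     # Jensen-Shannon divergence; symmetric
kl(p, q)
jsd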
For the implementation of KL divergence, you can use this package in R.
A special case of KL divergence is mutual information, which is a better measure of the interdependence you're looking for. Instead of measuring the divergence from a reference distribution, mutual information is the KL divergence between the joint distribution and the product of the marginal distributions. Mutual information is also equal to the sum of the marginal entropies minus the joint entropy, which means you can estimate the individual and joint entropies first and then combine them to estimate mutual information.
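A small made-up discrete example of that definition, computed directly from a joint probability table in base R:
pxy <- matrix(c(0.30, 0.10,
                0.15, 0.45), nrow = 2, byrow = TRUE)   # made-up joint P(X, Y)
px <- rowSums(pxy)                                     # marginal of X
py <- colSums(pxy)                                     # marginal of Y
mi <- sum(pxy * log(pxy / outer(px, py)))              # KL(joint || product of marginals)
mi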
Here is one implementation of mutual information for R, although there have been many other estimators introduced:
https://github.com/majianthu/copent

Function to produce a single metric to compare the shape of two distributions (predictions vs actuals)

I am assessing the accuracy of a model that predicts count data.
My actual data has quite an unusual distribution: although I have a large amount of data, the shape is unlike any standard distribution (Poisson, normal, negative binomial, etc.).
As part of my assessment, I want a metric for how well the distribution of the predictions match the distribution of actual data. I've tried using standard model performance metrics, such as MAE or RMSE, but they don't seem to capture how well the predictions match the expected distribution.
My initial idea was to split the predictions into deciles, and calculate what proportion fall in each decile. This would be a very rough indication of the underlying distribution. I would then calculate the same for my 'actuals' and sum the absolute differences between the proportions.
This works to some extent, but feels a bit clunky, and the split into deciles feels arbitrary. Is there a function in R to produce a single metric for how well two distributions match?
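One readily available single-number option is the Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs; the decile idea above is essentially a binned total variation distance. A minimal sketch with made-up counts (ks.test() will warn about ties with count data, but the statistic is still returned):
set.seed(42)
actuals     <- rnbinom(5000, size = 1.2, mu = 8)   # stand-ins for the real data
predictions <- rnbinom(5000, size = 2.0, mu = 7)
ks.test(actuals, predictions)$statistic            # single number in [0, 1]; 0 = identical CDFs
breaks <- unique(c(-Inf, quantile(actuals, probs = seq(0, 1, 0.1)), Inf))
p_act  <- table(cut(actuals, breaks))     / length(actuals)
p_pred <- table(cut(predictions, breaks)) / length(predictions)
sum(abs(p_act - p_pred)) / 2                       # binned total variation distance in [0, 1]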

Figure 2.5 in Elements of Statistical Learning

I ran into some difficulty calculating the Bayes decision boundary of Figure 2.5. In the package ElemStatLearn, the probability at each point is already calculated and contour() is used to draw the boundary. Can anyone tell me how to calculate the probability?
In the traditional Bayes decision problem, the mixture components are usually normal distributions, but in this example the samples are generated in two steps, so I have some difficulty calculating the distribution. Thank you very much.
Section 2.3.3 of ESL (accessible online) states how the data were generated. Each class is a mixture of 10 Gaussian distributions with equal covariance, and each of the 10 means is itself drawn from another bivariate Gaussian, as specified in the text. To calculate the exact decision boundary of the simulation in Figure 2.5, you would need to know the particular 20 means (10 for each class) that were generated to produce the data, but those values are not provided in the text.
However, you can generate a new pair of mixture models and calculate the probability for each of the two classes (BLUE & ORANGE) that you generate. Since each of the 10 distributions in a class is equally likely, the class-conditional probability p(x|BLUE) is just the average of the densities of the 10 distributions in the BLUE model.
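A minimal sketch of that calculation using mvtnorm and freshly simulated means (so the numbers will not match the book's figure, whose particular means are unknown):
library(mvtnorm)                      # for rmvnorm() and dmvnorm()
set.seed(1)
means_blue   <- rmvnorm(10, mean = c(1, 0), sigma = diag(2))   # BLUE means drawn from N((1,0)', I)
means_orange <- rmvnorm(10, mean = c(0, 1), sigma = diag(2))   # ORANGE means drawn from N((0,1)', I)
# p(x | class): equal-weight mixture of 10 components, each N(m_k, I/5) as in the text
class_density <- function(x, means, sigma = diag(2) / 5) {
  mean(sapply(1:nrow(means), function(k) dmvnorm(x, mean = means[k, ], sigma = sigma)))
}
x <- c(0.5, 0.5)                                   # any grid point
p_blue   <- class_density(x, means_blue)
p_orange <- class_density(x, means_orange)
p_blue / (p_blue + p_orange)                       # posterior P(BLUE | x) with equal priors;
                                                   # the Bayes boundary is the 0.5 contour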

Generating random values from non-normal and correlated distributions

I have a random variable X that is a mixture of a binomial and two normals (see the first chart for what the probability density function would look like)
and I have another random variable Y of similar shape but with different values for each normally distributed side.
X and Y are also correlated; here's an example of data that could be plausible:
    X   Y
1   0  -20
2  -5    2
3 -30    6
4   7   -2
5   7    2
As you can see, this is simply to show that my random variables are either small and positive (often) or large and negative (rare), and that they have a certain covariance.
My problem is: I would like to be able to sample correlated random values from these two distributions.
I could use a Cholesky decomposition to generate correlated normally distributed random variables, but the random variables we are talking about here are not normal, but rather a mixture of a binomial and two normals.
Many thanks!
Note: you don't have a mixture of a binomial and two normals, but rather a mixture of two normals. Even though for some reason in your previous post you did not want to use a two-step generation process (first generate a Bernoulli variable telling you which component to sample from, then sample from that component), that is typically what you would want to do with a mixture distribution. This process naturally generalizes to a mixture of two bivariate normal distributions: first pick a component, then generate a pair of correlated normal values. Your description does not make it clear whether you are fitting some data with this distribution or just trying to simulate such a distribution; the difficulty of getting the covariance matrices for the two components will depend on your situation.
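A minimal sketch of that two-step process for a mixture of two bivariate normals; the mixing weight, means, and covariance matrices below are made up rather than fitted to anything:
library(MASS)                                    # for mvrnorm()
set.seed(1)
n      <- 10000
w_rare <- 0.1                                    # probability of the rare, large-negative component
mu1 <- c(5, 3);     Sigma1 <- matrix(c(4, 1.5, 1.5, 4), 2)       # frequent small positives
mu2 <- c(-25, -15); Sigma2 <- matrix(c(100, 60, 60, 100), 2)     # rare large negatives
z  <- rbinom(n, 1, w_rare)                       # step 1: pick a component for each draw
xy <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("X", "Y")))
xy[z == 0, ] <- mvrnorm(sum(z == 0), mu1, Sigma1)   # step 2: sample from that component
xy[z == 1, ] <- mvrnorm(sum(z == 1), mu2, Sigma2)
cor(xy)                                          # overall correlation from the components plus mixing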
