Figure 2.5 in Elements of Statistical Learning - r

I met some difficulty in calculating the Bayes decision boundary of Figure 2.5. The package ElemStatLearn already contains the probability calculated at each point and uses contour to draw the boundary. Can anyone tell me how to calculate that probability?
In a traditional Bayes decision problem the mixture components are usually plain normal distributions, but in this example the samples are generated in two steps, so I have some difficulty in working out the distribution.
Thank you very much.

Section 2.3.3 of ESL (accessible online) states how the data were generated. Each class is a mixture of 10 Gaussian distributions with equal covariance, and each of the 10 means is itself drawn from another bivariate Gaussian, as specified in the text. To calculate the exact decision boundary of the simulation in Figure 2.5 you would need to know the particular 20 means (10 for each class) that were generated to produce the data, but those values are not provided in the text.
However, you can generate a new pair of mixture models and calculate the probability of each of the two classes (BLUE & ORANGE) that you generate. Since each of the 10 distributions in a class is equally likely, the class-conditional density p(x|BLUE) is just the average of the densities of the 10 distributions in the BLUE model.
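For example, here is a minimal sketch (not the book's code) of regenerating such a mixture and contouring the Bayes boundary; since the class priors are equal, the boundary is where the two class-conditional densities are equal. The seed and grid range are arbitrary:

library(mvtnorm)   # rmvnorm()/dmvnorm(); MASS::mvrnorm() would also work for the means

set.seed(1)
# 10 means per class, drawn as described in ESL Section 2.3.3
m_blue   <- rmvnorm(10, mean = c(1, 0), sigma = diag(2))
m_orange <- rmvnorm(10, mean = c(0, 1), sigma = diag(2))

# class-conditional density p(x | class): average of 10 Gaussians N(m_k, I/5)
mix_density <- function(x, means, sigma = diag(2) / 5) {
  mean(apply(means, 1, function(m) dmvnorm(x, mean = m, sigma = sigma)))
}

# evaluate p(x | BLUE) - p(x | ORANGE) on a grid; the Bayes boundary is the zero contour
xs <- seq(-3, 4, length.out = 100)
ys <- seq(-3, 4, length.out = 100)
grid <- expand.grid(x1 = xs, x2 = ys)
dens_diff <- apply(grid, 1, function(x) mix_density(x, m_blue) - mix_density(x, m_orange))
contour(xs, ys, matrix(dens_diff, nrow = length(xs)), levels = 0, drawlabels = FALSE)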

Smoothed partial residuals of a covariate in a point process model in spatstat

I am using spatstat to build point process models with the ppm function, but I have problems in validation when I use the partial residual plot parres to understand the effect of a covariate.
The model is fitted to 1022 bird occurrence locations (a point pattern called ois.ppp), using the habitat availability (a raster called FB0lin, which has been normalized and log-transformed), the sampling effort (a raster called Nbdate, also normalized) and the accessibility of places (a raster called pAccess, also normalized) across the study area. The objective is to evaluate the fit of a Gibbs point process model with a Geyer interaction together with habitat availability, sampling effort and accessibility. The eps argument was also used to create a set of dummy points on a grid with a 100 x 100 m resolution.
The model used is:
mod.ois.echlin = ppm(ois.ppp, ~ FB0lin + Nbdate + pAccess, interaction = Geyer(r=401,sat=9), eps=100)
The Geyer parameters were identified using:
rs=expand.grid(r=seq(1,1001, by=50), sat=1:40)
term.interlin=profilepl(rs, Geyer, ois.ppp,~FB0lin+Nbdate+pAccess)
Then I use the parres function:
res.FB0.echlin=parres(mod.ois.echlin, covariate="FB0lin")
plot(res.FB0.echlin,main="FB0 LinCost", legend=FALSE)
The problem is that the fitted values do not seem optimal (see figure below). The fitted curve should take lower values, inside the confidence interval, but lies outside it, which probably affects the quality of the point process model.
My questions are:
Have you ever seen such a result, and is it normal?
Is it possible to correct it?
Figure: Smoothed partial residuals - FB0lin
Any advice would be much appreciated.
The diagnostic is working correctly. It indicates that, as a function of the predictor variable FB0lin, the fitted model (represented by the dashed straight line) overestimates the true intensity (represented by the thick black curve with grey confidence bands) by a constant amount. The linear relationship (linear dependence of the log intensity on the covariate) seems to be adequate, in the sense that you don't need to replace it with a more complicated relationship, which is the main question the partial residuals are designed to answer.
In other words, the diagnostic says the model is adequate except that it overestimates the log intensity by a constant amount, which means it overestimates the intensity by a constant factor. This could be due to the way the other predictors Nbdate and pAccess enter the model, or to the choice of interpoint interaction. To investigate that, you need to try other tools, as discussed in Chapter 11 of the spatstat book.
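For what it's worth, here is a possible sketch of a couple of standard spatstat follow-up checks along those lines, assuming the covariate images Nbdate and pAccess used in the model call are still available in the workspace:

library(spatstat)

# lurking-variable plots for the other covariates: a systematic trend here would
# suggest that Nbdate or pAccess is the source of the constant offset
lurking(mod.ois.echlin, Nbdate,  type = "raw")
lurking(mod.ois.echlin, pAccess, type = "raw")

# overall residual diagnostics for the fitted Gibbs model
diagnose.ppm(mod.ois.echlin)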

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships by species, species-condition relationships, and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So a 25 m tree is more likely to be a cedar (height range 5 - 30 m) than a mulberry (height range 2 - 8 m), and more likely to be a cedar (50% of the population) than an oak (same height range but 2% of the population), and hence would be assigned, say, a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportions in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider a classification tree or an ensemble of trees, methods which can be found in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and you need to slot your observations into those categories, so I think those packages would work for your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the individual attribute values for each tree, first run a regression on the measured trees with all of your measured variables, including species, as predictors. Then take the species labels you predicted and predict out-of-sample for the unmeasured trees, using the predicted label as a predictor for the attribute of interest, say crown width. That way the regression predicts the average value for that species (plus some error), while incorporating any other information you have on that out-of-sample tree.
If you want a Bayesian method, consider a hierarchical regression for these out-of-sample predictions. Hierarchical models sometimes predict better because they tend to be fairly conservative. Look at the rstanarm package for some examples.
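To make the two-stage idea concrete, a rough sketch with randomForest; the data frames measured and unmeasured and the column names species, height and crown_width are placeholders, not anything from the question:

library(randomForest)

# stage 1: classify species from height on the measured trees
measured$species <- factor(measured$species)
rf_species <- randomForest(species ~ height, data = measured)
unmeasured$species <- predict(rf_species, newdata = unmeasured)

# stage 2: regress another attribute on height plus (predicted) species
rf_crown <- randomForest(crown_width ~ height + species, data = measured)
unmeasured$crown_width <- predict(rf_crown, newdata = unmeasured)

Note that predict() returns the most likely species; to reproduce the behaviour you describe (occasionally assigning oak to a 25 m tree), you could instead sample each species from the class probabilities returned by predict(rf_species, newdata = unmeasured, type = "prob").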
I suggest looking into Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over the causal relationships between variables. The Bayesian network structure can be specified by hand or learned from the data by an algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
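As a rough illustration only (the data frame trees and its column names are hypothetical, and the exact predict() arguments should be checked against the bnlearn documentation), one could discretize the continuous attributes so that every node has a table CPD, learn the structure, and then query the species given height:

library(bnlearn)

# discretize the continuous attributes so every node has a table CPD
trees_disc <- discretize(trees[, c("height", "crown_width", "condition")],
                         method = "quantile", breaks = 4)
trees_disc$species <- factor(trees$species)

dag <- hc(trees_disc)            # learn the structure from the measured data
fit <- bn.fit(dag, trees_disc)   # fit the conditional probability tables

# infer species for a new tree from its (discretized) height only;
# `from` restricts the evidence to height -- see ?predict.bn.fit
predict(fit, node = "species", data = new_tree_disc, method = "bayes-lw", from = "height")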
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model height and width as a joint Gaussian distribution; I'm not sure about condition. Anyway, with a joint distribution for the variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the distribution of the other variables conditional on height (it will also be a mixture distribution). Given that conditional mixture, you can sample from it as usual: pick a bump with probability equal to its mixing weight, and then sample from the selected bump.
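A possible sketch of that sampling scheme, assuming you have already estimated per-species parameters from the measured trees; the data frame pars (one row per species) and its columns are hypothetical: proportion prop, height mean/sd mu_h/sd_h, width mean/sd mu_w/sd_w, and height-width correlation rho:

impute_from_height <- function(h, pars) {
  # conditional species weights given height: prior proportion * Gaussian likelihood of h
  w <- pars$prop * dnorm(h, pars$mu_h, pars$sd_h)
  s <- sample(nrow(pars), 1, prob = w)                  # pick a bump (species)
  # width | height within that bump, from the bivariate-Gaussian conditional formula
  mu_cond <- pars$mu_w[s] + pars$rho[s] * pars$sd_w[s] / pars$sd_h[s] * (h - pars$mu_h[s])
  sd_cond <- pars$sd_w[s] * sqrt(1 - pars$rho[s]^2)
  data.frame(species = pars$species[s], crown_width = rnorm(1, mu_cond, sd_cond))
}

Applying impute_from_height() to each unmeasured height then yields a species draw and a crown-width draw from the conditional mixture; condition could be handled the same way once you choose a distribution for it.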
Sounds like a good problem. Good luck and have fun.

Imbalanced training dataset and regression model

I have a large dataset (>300,000 observations) that represent the distance (RMSD) between proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.
My problem is that I'm more interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances, so I want to make sure it predicts the distances of close matches as accurately as possible. What is the best sampling strategy to achieve this without distorting the data too much? Unfortunately, this skewed distribution reflects the real-world data distribution on which I will validate and test the model.
The following is my data distribution, where the first column is the distance bin and the second column is the number of observations in that bin:
Distance Observations
0 330
1 1903
2 12210
3 35486
4 54640
5 62193
6 60728
7 47874
8 33666
9 21640
10 12535
11 6592
12 3159
13 1157
14 349
15 86
16 12
The first thing I would try here is building a regression model of the log of the distance, since this will compress the range of larger distances. If you're using a generalised linear model this corresponds to the log link function; for other methods you can do it manually by estimating a regression function f of your inputs x and exponentiating the result:
y = exp( f(x) )
Remember to train on the log of the distance for each pair.
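For instance, a minimal sketch with a random forest; the data frames train and test and the distance column name are assumptions, and log1p/expm1 are used so the d = 0 bin is handled cleanly:

library(randomForest)   # any regression method would work here

train$log_dist <- log1p(train$distance)          # log(1 + d) also copes with the d = 0 bin
fit <- randomForest(log_dist ~ . - distance, data = train)

pred_distance <- expm1(predict(fit, newdata = test))   # back-transform to the original scale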
Popular techniques for dealing with an imbalanced distribution in regression include:
Random over/under-sampling (a minimal sketch follows this list).
The Synthetic Minority Over-sampling Technique for Regression (SMOTER), which has an R package implementing it.
The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
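A minimal sketch of the first option (random over-sampling by inverse bin frequency; the data frame train and its distance column are assumptions):

# inverse-frequency resampling: rare (close) distances are drawn more often
bin    <- as.character(round(train$distance))   # integer distance bin per row, as in the table
counts <- table(bin)                            # observations per bin
w      <- 1 / counts[bin]                       # weight = 1 / bin frequency

idx <- sample(nrow(train), replace = TRUE, prob = w)
train_balanced <- train[idx, ]

Fit the model on train_balanced, but evaluate it on the original, untouched distribution, since that is what it will face in validation and testing.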
PS: The table you show makes it look like a classification problem rather than a regression problem.
As previously mentioned, I think what might help you given your problem is Synthetic Minority Over-Sampling Technique for Regression (SMOTER).
If you're a Python user, I'm currently working to improve my implementation of the SMOGN algorithm, a variant of SMOTER. https://github.com/nickkunz/smogn
Also, there are a few examples on Kaggle that have applied SMOGN to improve their prediction results. https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases

Generating random values from non-normal and correlated distributions

I have a random variable X that is a mixture of a binomial and two normals (see the first chart for what the probability density function would look like),
and I have another random variable Y of similar shape but with different values for each normally distributed side.
X and Y are also correlated; here's an example of data that could be plausible:
X Y
1. 0 -20
2. -5 2
3. -30 6
4. 7 -2
5. 7 2
As you can see, this is just meant to show that my random variables are either small and positive (often) or large and negative (rare), and that they have some covariance.
My problem is: I would like to be able to sample correlated random values from these two distributions.
I could use the Cholesky decomposition to generate correlated normally distributed random variables, but the random variables we are talking about here are not normal; rather, each is a mixture of a binomial and two normals.
Many thanks!
Note, you don't have a mixture of a binomial and two normals, but rather a mixture of two normals. Even though for some reason in your previous post you did not want to use a two-step generation process (first generate a Bernoulli variable telling you which component to sample from, and then sample from that component), that is typically what you would do with a mixture distribution. This process naturally generalizes to a mixture of two bivariate normal distributions: first pick a component, and then generate a pair of correlated normal values. Your description does not make it clear whether you are fitting this distribution to some data or just trying to simulate it; how difficult it is to obtain the covariance matrices of the two components depends on which situation you are in.
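A minimal sketch of that two-step scheme for a mixture of two bivariate normals (all parameter values below are made up purely for illustration):

library(MASS)   # mvrnorm()

set.seed(1)
n  <- 10000
p  <- 0.85                                          # P(small-positive component)
mu1 <- c(5, 3);     S1 <- matrix(c(4, 1.5, 1.5, 2), 2)     # "small positive" bump
mu2 <- c(-25, -20); S2 <- matrix(c(60, 20, 20, 40), 2)     # "large negative" bump

z  <- rbinom(n, 1, p)                               # step 1: pick a component per draw
xy <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("X", "Y")))
xy[z == 1, ] <- mvrnorm(sum(z == 1), mu1, S1)       # step 2: correlated normal pair
xy[z == 0, ] <- mvrnorm(sum(z == 0), mu2, S2)

cov(xy)   # the overall covariance mixes within- and between-component contributions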

How to produce a vector based on two probability distributions?

Let's assume we have 3 distributions for different types of damage within an insurance company. A Weibull distribution with these parameters:
weibull.data<-rweibull(2000,1,2000)
A lognormal one with these parameters:
lnormal.data<-rlnorm(10000,7,0.09)
and finally a Fréchet distribution for the extreme values, looking like this:
library(evd)   # assumed source of rfrechet(); adjust if you use a different package
frechet.data<-rfrechet(15, loc=6, scale=1200000, shape=10000)
For each of them, I calculated the ruin probability. Now I want to estimate the ruin probability of the convolution of the 3 distributions, but I don't know how to "combine" them in a logical way. The vector that I need is a combination of the three.
Excuse my English, I'm a French native speaker :-)
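One way to read "the convolution of the 3 distributions" is as the distribution of the total damage, i.e. the sum of one draw of each type, which you can approximate by Monte Carlo. A hedged sketch, reusing the parameters above (rfrechet is assumed to come from the evd package, and one loss of each type per period is assumed; if the three types occur with different frequencies, as the different sample sizes 2000/10000/15 suggest, you would need to thin or weight the draws accordingly):

library(evd)   # assumed source of rfrechet()

set.seed(1)
n <- 1e5
total <- rweibull(n, 1, 2000) +
         rlnorm(n, 7, 0.09) +
         rfrechet(n, loc = 6, scale = 1200000, shape = 10000)

# the empirical distribution of `total` approximates the convolution,
# e.g. the probability that total damage exceeds a capital level u
u <- 1.21e6   # placeholder capital level
mean(total > u)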
