How do you get N pairs of values, which represent a joint probability (2d density) of a much larger pairs of values?
I do MCMC sampling on parameters of a function, and I want to visualize the posterior density of that function by plotting, say, 20 semi-transparent lines that visualizes the density of the function. I know that I can just make a sufficiently large sample to approximate the density (like this), but this could clutter the graph. Rather, I'd imagine something like percentiles would work. E.g. a pair for each 2% change in percentile should accurately portray the density using 50 pairs.
Specifically, I'm sampling from a bayesian model with a two-parameter geometric series. The density is not necessarily monotonically increasing (there can be several peaks). This is a bit complicated, but just to get started, a multivariate normal would be more minimal to work with:
library(MASS)
pairs = mvrnorm(10000, c(1,3), rbind(c(0.2, 0.1), c(0.1, 0.2)))
# Easy to just sample the rows to get an approximate representation:
pairs[sample(nrow(pairs), size=50, replace=FALSE)
# But how to get more certainty that the samples are really representative?
# This would probably start with the density
dens = kde2d(pairs[,1], pairs[,2], n=100)
Here, dens$z is a 100*100 matrix containing a density for each of the combinations of dens$x and dens$y, i.e. the (binned) pairs in pairs. Here's an image(dens) visualization:
Related
So, this is something I think I'm complicating far too much but it also has some of my other colleagues stumped as well.
I've got a set of areas represented by polygons and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the {x-min(x)}/{max(x)-min(x)} method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest are almost always the one sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure on how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n= 500
df <- data.frame("ID" = 1:n,"AREA" = replicate(n,sum(rexp(n=8,rate=0.1))))
# Generate the sampling probability somehow based upon the AREA values with smaller areas having higher sample probability::
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size=1, prob=prob_vector)```
There is no one best solution for this question as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right skewed distribution of areas (0-100 units) and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000,1,3)*40
area.dist[area.dist>100]=100
hist(area.dist,main="Probability functions")
area = seq(0,100,0.1)
prob_vector1 = 1-(area-min(area))/(max(area)-min(area)) ## linear
prob_vector2 = .8-(.6*(area-min(area))/(max(area)-min(area))) ## low slope
prob_vector3 = 1/(1+((area-min(area))/(max(area)-min(area))))**4 ## strong curve
prob_vector4 = .4/(.4+((area-min(area))/(max(area)-min(area)))) ## low curve
legend("topright",c("linear","low slope","strong curve","low curve"), col = c("red","green","blue","orange"),lwd=1)
lines(area,prob_vector1*500,col="red")
lines(area,prob_vector2*500,col="green")
lines(area,prob_vector3*500,col="blue")
lines(area,prob_vector4*500,col="orange")
The output is:
The red line is your solution, the other ones are adjustments to make it weaker. Just change numbers in the probability function until you get one that fits your expectations.
I am attempting to reproduce the above function in R. The numerator has the product of the probability density function (pdf) of "y" at time "t". The omega_t is simply the weight (which for now lets ignore). The i stands for each forecast of y (along with the density) derived for model_i, at time t.
The denominator is the integral of the above product. My question is: How to estimate the densities. To get the density of the variable one needs some datapoints. So far I have this:
y<-c(-0.00604,-0.00180,0.00292,-0.0148)
forecastsy_model1<-c(-0.0183,0.00685) # respectively time t=1 and t=2 of the forecasts
forecastsy_model2<-c(-0.0163,0.00931) # similarly
all.y.1<-c(y,forecasty_model1) #together in one vector
all.y.2<-c(y,forecasty_model2) #same
However, I am not aware how to extract the density of x1 for time t=1, or t=6, in order to do the products. I have considered this to find the density estimated using this:
dy1<-density(all.y.1)
which(dy1$x==0.00685)
integer(0) #length(dy1$x) : 512
with dy1$x containing the n coordinates of the points where the density is estimated, according to the documentation. Shouldn't n be 6, or at least contain the points of y that I have supplied? What is the correct way to extract the density (pdf) of y?
There is an n argument in density which defaults to 512. density returns you estimated density values on a relatively dense grid so that you can plot the density curve. The grid points are determined by the range of your data (plus some extension) and the n value. They produce a evenly spaced grid. The sampling locations may not lie exactly on this grid.
You can use linear interpolation to get density value anywhere covered by this grid:
Find the probability density of a new data point using "density" function in R
Exact kernel density value for any point in R
I am generating 2D kernal density distributions for every pair of numeric columns in a data set, using kde2d function in the MASS package.
This takes the following parameters:
kde2d(x, y, h, n=25, lims = c(range(x), range(y)))
where n is the "Number of grid points in each direction. Can be scalar or a length-2 integer vector".
I want to optimize the dimensions of the grid for every pair of columns. At the moment, I used a fixed dimensions of 10x10. Does anyone know a formula for optimizing the grid size so I can generate optimal density estimations for each pair of columns?
Thanks
The parameter n in this function does not influence your density estimation but only the graphical representation, i.e. it should only depend on the size of the plot you want to create but not on the data.
On the other hand your density estimation is indeed influenced by the choice og bandwith h. To choose an optimal bandwith you will need to know (or assume) the distribution of your data
First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this with R, I figured it would be best to post it here. Please apologise if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250.000 observations, assigning a continuous (non-integer) value in line with a kernel density estimate derived from empirical data (discrete), with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
I have been using the mahal classifier function (Dismo package in r) in several of my analyses and recently I have discovered that it seems to give apparently wrong distance results for points that are identical to points used in training of the classifier. For background, from what I understand of mahalanobis-based classifiers, is that they use Mahalanobis distance to describe the similarity of a unclassified point by measuring the point's distance from the center of mass of the training set (while accounting for differences in scale and covariance, etc.). The mahalanobis distance score varies from –inf to 1, where one indicates no distance between the unclassified point and the centroid defined by the training set. However, I found that, for all points with identical predictor values than the training points, I still get a score of 1, as if the routine is working as a nearest neighbor classifier. This is a very troubling behavior because it has the potential to artificially increase the confidence of my overall classification.
Has anyone encountered this behavior? Any ideas on how to fix/ avoid this behavior?
I have written a small script below that showcases the odd behavior clearly:
rm(list = ls()) #remove all past worksheet variables
library(dismo)
logo <- stack(system.file("external/rlogo.grd", package="raster"))
#presence data (points that fall within the 'r' in the R logo)
pts <- matrix(c(48.243420, 48.243420, 47.985820, 52.880230, 49.531423, 46.182616,
54.168232, 69.624263, 83.792291, 85.337894, 74.261072, 83.792291, 95.126713,
84.565092, 66.275456, 41.803408, 25.832176, 3.936132, 18.876962, 17.331359,
7.048974, 13.648543, 26.093446, 28.544714, 39.104026, 44.572240, 51.171810,
56.262906, 46.269272, 38.161230, 30.618865, 21.945145, 34.390047, 59.656971,
69.839163, 73.233228, 63.239594, 45.892154, 43.252326, 28.356155), ncol=2)
# fit model
m <- mahal(logo, pts)
#using model, predict train data
training_vals=extract(logo, pts)
x <- predict(m, training_vals)
x #results show a perfect 1 prediction, which is highly unlikely
Now, I try to make predictions for values that are an average for directly adjacent point pairs
I do this because given that:
(1) each point for each pair used to train the model have a perfect suitability and
(2) that at least some of these average points are likely to be as close to the center of the mahalanobis centroid than the original pairs
(3) I would expect at least a few of the average points to have a perfect suitability as well.
#pick two adjacent points and fit model
adjacent_pts=pts
adjacent_pts[,2]=adjacent_pts[,2]+1
adjacent_training_vals=extract(logo, adjacent_pts)
new_pts=rbind(pts, adjacent_pts)
plot(logo[[1]]) #plot predictor raster and response point pairs
points(new_pts[,1],new_pts[,2])
#use model to predict mahalanobis score for new training data (point pairs)
m <- mahal(logo, new_pts)
new_training_vals=extract(logo, new_pts)
x <- predict(m, new_training_vals)
x
As expected from the odd behavior described, all training points have a distance score of 1. However, lets try to predict points that are an average of each pair:
mid_vals=(adjacent_training_vals+training_vals)/2
x <- predict(m, mid_vals)
x #NONE DO!
This for me is further indication that the Mahal routine will give a perfect score for any data point that has equal values to any of the points used to train the model
This below is uncessessary, but just another way to prove the point:
Here I predict the same original train data with a near insignificant 'budge' of values for only one of the predictors and show that the resulting scores change quite significantly.
mod_training_vals=training_vals
mod_training_vals[,1]=mod_training_vals[,1]*1.01
x <- predict(m, mod_training_vals)
x #predictions suddenly are far from perfect predictions