Density plot in R sometimes gives frequency, other times probabilities? - r

Plotting the density of some of my data yields frequencies on the Y axis, while plotting the density of other data yields probabilities(?) on the Y axis. Is there an equivalent of freq=FALSE for density() like there is for hist() so I can have control over this? I've tried searching around for this specific issue, but I almost always end up getting hist() documentation instead of finding the answer to this specific question. Thank you!

Adding such a parameter to density would be statistically unwise for the reasons articulated by @MrFlick. If you want to convert a density estimate to the same scale as the observations, you can multiply it by the length of the vector used for the density calculation. The density then becomes a "per x unit" estimate of "frequency". Compare the two plots:
set.seed(123); x <- sample(1:10, size = 5)
#> x
#[1] 3 8 4 7 6
plot(density(x))
plot(density(x)$x, 5*density(x)$y, type = "l")
The "per unit of x" estimate is now in the correct (approximate) range of 0.5, and its integral should be roughly equal to the count (here 5). It is only by accident that a y value of a density is ever similar to a probability; what always holds is that the integral of the density is unity.
Perhaps you are looking for the ecdf function? Instead of returning a density, it provides a mechanism for constructing a cumulative probability function.
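For instance, a minimal sketch of ecdf on the same toy data (the exact values drawn depend on the R version's sampling RNG, so none are assumed here):

```r
# Toy data as in the answer above
set.seed(123)
x <- sample(1:10, size = 5)

# ecdf() returns a step function F with F(q) = P(X <= q)
F <- ecdf(x)
F(5)          # proportion of observations <= 5
F(max(x))     # always 1 at the sample maximum
```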

Related

Probability Density (pdf) extraction in R

I am attempting to reproduce the above function in R. The numerator has the product of the probability density function (pdf) of "y" at time "t". The omega_t is simply the weight (which for now let's ignore). The i stands for each forecast of y (along with its density) derived from model_i at time t.
The denominator is the integral of the above product. My question is: how do I estimate the densities? To get the density of a variable one needs some data points. So far I have this:
y<-c(-0.00604,-0.00180,0.00292,-0.0148)
forecastsy_model1<-c(-0.0183,0.00685) # respectively time t=1 and t=2 of the forecasts
forecastsy_model2<-c(-0.0163,0.00931) # similarly
all.y.1<-c(y,forecastsy_model1) #together in one vector
all.y.2<-c(y,forecastsy_model2) #same
However, I am not sure how to extract the density of y at, say, time t=1 or t=6 in order to do the products. I have considered finding the estimated density using this:
dy1<-density(all.y.1)
which(dy1$x==0.00685)
integer(0) #length(dy1$x) : 512
with dy1$x containing the n coordinates of the points where the density is estimated, according to the documentation. Shouldn't n be 6, or at least contain the points of y that I have supplied? What is the correct way to extract the density (pdf) of y?
There is an n argument in density which defaults to 512. density returns estimated density values on a relatively dense grid so that you can plot the density curve. The grid points are determined by the range of your data (plus some extension) and the value of n, and they form an evenly spaced grid. Your sampling locations will generally not lie exactly on this grid.
You can use linear interpolation to get the density value anywhere covered by this grid:
Find the probability density of a new data point using "density" function in R
Exact kernel density value for any point in R
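Concretely, a minimal sketch using base R's approx(), built from the vectors given in the question:

```r
# Data from the question: observations plus model 1's forecasts
y <- c(-0.00604, -0.00180, 0.00292, -0.0148)
forecastsy_model1 <- c(-0.0183, 0.00685)
all.y.1 <- c(y, forecastsy_model1)

dy1 <- density(all.y.1)

# Linearly interpolate the estimated density at an arbitrary point,
# e.g. the forecast value 0.00685, which need not be a grid point
d_at <- approx(dy1$x, dy1$y, xout = 0.00685)$y
d_at
```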

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focused forum. However, as I'm planning to implement this in R, I figured it would be best to post it here. Apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning a continuous (non-integer) value in line with a kernel density estimate derived from empirical (discrete) data, with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with an average spacing of mean(diff(dens$x)) ≈ 0.0045 between points.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
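Putting those two pieces together, one hedged sketch: draw candidate values anywhere in the estimated range and resample them weighted by the interpolated density (the prob argument to sample does not need to sum to 1):

```r
set.seed(4396)
dat <- round(rnorm(1000, 100, 10))

# Kernel density estimate and an interpolating function for it
dens <- density(dat, n = 2^14)
dens.func <- approxfun(dens)

# Candidate points anywhere within the estimated range, not just grid points
cand <- runif(1e5, min(dens$x), max(dens$x))

# Resample candidates with probability proportional to the interpolated density
kern.samp <- sample(cand, 1e4, replace = TRUE, prob = dens.func(cand))
```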

Does R have something similar to TransformedDistribution in Mathematica?

I have a random variable X and a transformation f and I would like to know the probability distribution function of f(X), at least approximately. In Mathematica there is TransformedDistribution, but I could not find something similar in R. As I said, some kind of approximative solution would be fine, too.
You can check the distr package. For instance, say that y = x^2+2x+1, where x is normally distributed with mean 2 and standard deviation 5. You can:
require(distr)
x<-Norm(2,5)
y<-x^2+2*x+1
# y@r gives random samples. We make a histogram.
hist(y@r(10000))
# y@d and y@p are the density and the cumulative distribution functions
y@d(80)
#[1] 0.002452403
y@p(80)
#[1] 0.8891796

How to reuse my kernel density estimation function in R?

I use density() to do KDE, like:
#Rscript#
x <- c(rep(1,3),rep(2,4),rep(3,5))
density(x)
Am I supposed to get a probability density function? If so, how do I reuse it to obtain the probability of a value, e.g. the probability P(x <= 2) under my KDE function?
Thanks for sharing your ideas!
Because density() gives you the continuous KDE, the probability of any exact value is zero. You can only get information like P(x <= 1). In your case hist() may be the better choice.
EDIT:
Please have a look here
https://stats.stackexchange.com/questions/78711/how-to-find-estimate-probability-density-function-from-density-function-in-r
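If you do want P(X <= 2) from the KDE itself, one sketch (following the approach in the linked answer) is to interpolate the estimated density and integrate it numerically up to 2:

```r
x <- c(rep(1, 3), rep(2, 4), rep(3, 5))
d <- density(x)

# Interpolate the density curve; return 0 outside the estimated grid
f <- approxfun(d$x, d$y, yleft = 0, yright = 0)

# P(X <= 2) under the KDE = integral of the density from the grid's
# lower edge up to 2 (extra subdivisions for the piecewise-linear f)
p <- integrate(f, lower = min(d$x), upper = 2, subdivisions = 500)$value
p
```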

How to show the value of the AUC from geom_density/stat_density

I have produced some density plots using ggplot2 and stat_density. My colleague mentioned he wasn't convinced that the area under each curve would sum to 1. So, I set out to calculate the area under the curve, and I am wondering if there might be a better approach than what I did.
Here is an example of what I did:
data(iris)
p<-ggplot(iris,aes(x=Petal.Length))+
stat_density(aes(colour=Species),geom="line",position="identity")
q<-print(p)
q<-q$data[[1]]
# calculate interval between density estimates for a given point.
# assume it is the same interval for all estimates
interval<-q$x[2]-q$x[1]
# calculate AUC by summing interval*height for the density estimate at each point
tapply(q$density*interval,
q$group,
sum)
The result:
1 2 3
0.9913514 1.0009785 0.9817040
It seems to work decently, but I wonder if there is a better way of doing this. In particular, my calculation of the interval (i.e. dx, I suppose) seems like it could be a problem, especially if the different density curves use different intervals.
Your way is already good.
Another way to do it is the trapezoid rule, e.g. trapz() from the pracma package:
library(pracma)
data <- data.frame(x = q$x, y = q$y)
by(data, q$group, FUN = function(d) trapz(d$x, d$y))
The results are nearly the same:
INDICES: 1
[1] 0.9903457
INDICES: 2
[1] 1.000978
INDICES: 3
[1] 0.9811152
This is because at the grid resolution needed to make the graph of the densities look reasonable (interval in your code), you are very close to what you would get if you could do the actual integral.
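If you would rather not add a package dependency, the trapezoid rule is one line of base R; the sketch below checks it against a standard normal KDE, whose area should be very close to 1:

```r
# Trapezoid rule: sum of interval width times average endpoint height
trapz_base <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

set.seed(1)
d <- density(rnorm(1e4))
auc <- trapz_base(d$x, d$y)
auc   # should be very close to 1
```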