I calculated the cumulative probability (CDF) of my data from the probability of exceedance (EDF). No problem at all.
However, does anyone know of a command to transform this data into a probability density (PDF)?
I have already tested using the histogram function, but it does not work correctly.
x <- c(0.00000000, 0.03505324, 0.07005407, 0.10512053, 0.14021308,
       0.17533767, 0.21051443, 0.24570116, 0.28090087, 0.31592221,
       0.35092739, 0.38591441, 0.42085712, 0.45599341, 0.49119521, 0.52646341,
       0.56159558, 0.59673546, 0.63172464, 0.66674853, 0.70177413, 0.73712542,
       0.77225123, 0.80750715, 0.84250460, 0.87720473, 0.91172191, 0.94588810,
       0.98056348)
Is density() the function you are looking for?
You can plot with
plot(density(x))
You can see x and y values with:
density(x)$x
density(x)$y
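Note that density() treats its input as raw observations. If the values you have are instead a CDF evaluated on a grid, one alternative is to differentiate the CDF numerically. The following is only a minimal sketch under that assumption: `grid` and `Fx` are hypothetical names, and `Fx` is generated from pnorm() purely as a stand-in for your own CDF values.

```r
# Hypothetical example: CDF values known on an evenly spaced grid.
grid <- seq(-4, 4, length.out = 201)   # assumed evaluation points
Fx   <- pnorm(grid)                    # stand-in for your computed CDF

# The PDF is the derivative of the CDF; approximate it by finite differences.
pdf_est <- diff(Fx) / diff(grid)

# Each difference belongs to the midpoint of its interval.
mid <- (head(grid, -1) + tail(grid, -1)) / 2

plot(mid, pdf_est, type = "l", xlab = "x", ylab = "estimated density")
```

A finer grid gives a smoother estimate; a noisy CDF may need smoothing before differencing.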
I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.
I then compute the area under the curve to derive percentages.
I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard coded input).
Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?
I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. ogive). However, I can't determine how to derive a percentage from this cumulative frequency graph.
What I don't want is an ECDF. I know how to do that, and it is not quite what I am trying to capture. Rather, I want the integral of an ogive between two bounds.
I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate (or more generally for any case where you have the y values at equally spaced x-values (though you can, of course, generalize to variable x intervals as well)):
library(zoo)
# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)
# How to extract the x and y values of the density estimate
#dens$y
#dens$x
# x interval
dx = median(diff(dens$x))
# mean height for each pair of y values
h = rollmean(dens$y, 2)
# Area under curve
sum(h*dx) # 1.000943
# Cumulative area
# cumsum(h*dx)
# Plot density, showing points at which density is calculated
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
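To get a percentage between two bounds from the smoothed curve, the cumulative area can be turned into an interpolated function and evaluated at both bounds. This is a sketch rather than a definitive method: it uses base R only (no zoo), the name `kde_cdf` is made up for illustration, and the bounds 0 and 10 are arbitrary.

```r
# Kernel density estimate of the same illustrative mixture as above.
set.seed(67839)
y    <- c(rnorm(500, 5, 2), rnorm(200, 20, 3))
dens <- density(y, n = 512)

# Cumulative area by the trapezoid rule, starting at 0 on the left edge.
dx  <- diff(dens$x)
h   <- (head(dens$y, -1) + tail(dens$y, -1)) / 2
cdf <- c(0, cumsum(h * dx))

# Interpolate so the smoothed CDF can be evaluated at any point.
# rule = 2 returns the boundary value (about 0 or 1) outside the grid.
kde_cdf <- approxfun(dens$x, cdf, rule = 2)

# Share of the distribution between two bounds (hypothetical: 0 and 10).
p <- kde_cdf(10) - kde_cdf(0)
p
```

Because the total area is only approximately 1, you may want to divide `cdf` by its last value if the result has to end at exactly 1.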
UPDATE to include ecdf function
To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that the plot of this data looks the same below as it does above. The second is a plot of the ECDF of a plain vanilla normal distribution, mean=0, sd=1.
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)
plot(ecdf(rnorm(1000)))
I would like to create a Student's t distribution density plot with a mean of 0.02 instead of 0. Is that possible?
The distribution should have 2 degrees of freedom.
I tried the following:
X<-rnorm(100000,mean=0.02, sd=(1/sqrt(878)))
pop.mean<-mean(X)
t<-sapply(1:10000, function(x) (mean(sample(X,100))-pop.mean)/(1/sqrt(878)))
plot(density(t))
Is this approach correct?
If it is correct, how can I get the real densities, not just the approximation?
Your statement and example contradict each other somewhat.
Do you want a non-central t distribution which is based on a normal with mean 0.02? This is what your example suggests, but note that the non-central t is not just a shifted t, it is now skewed.
If you want the non-central t then you can plot it with a command like:
curve(dt(x,2,0.02), from=-5, to=6)
Or, do you want a shifted t distribution? A distribution that is symmetric around 0.02 with the shape of a t distribution?
You can plot the curve shifted by using a command like:
curve(dt(x-0.02,2), from=-5, to=6 )
The curve function has an add argument that you could use to plot both on the same plot if you want to compare them (not much difference in this case), changing the color on one of them would be suggested.
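For example, the two curves could be overlaid roughly like this. The colors, line types, and legend are arbitrary choices, and the final lines just measure how far apart the two densities are on a grid.

```r
# Non-central t: 2 degrees of freedom, non-centrality parameter 0.02.
curve(dt(x, df = 2, ncp = 0.02), from = -5, to = 6,
      ylab = "density", col = "blue")

# Central t shifted so that it is symmetric around 0.02.
curve(dt(x - 0.02, df = 2), add = TRUE, col = "red", lty = 2)

legend("topright", legend = c("non-central t", "shifted t"),
       col = c("blue", "red"), lty = c(1, 2))

# Largest pointwise gap between the two densities on a grid.
grid    <- seq(-5, 6, by = 0.01)
max_gap <- max(abs(dt(grid, df = 2, ncp = 0.02) - dt(grid - 0.02, df = 2)))
max_gap
```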
I have two data sets that I am comparing using a kde2d contour plot on a log10 scale.
Here I will use the following example data sets:
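library(MASS)  # provides kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
# note: this variable masks the name of stats::density()
density <- kde2d(a, b, n = 100)
filled.contour(density, color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))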
This produces the following plot,
Now my question is: what do the z values on the legend actually mean? I know they show where most of the data lies, but 0-15 confuses me. I thought it could be a percentage, but without the log10 scale I have values ranging from 0-1. I have also produced plots with scales 1-1.2 and 1-2 using my real data.
The colors represent the values of the estimated density function, here apparently ranging from 0 to 15. Just like with your other question about the odd looking linear regression, I can relate to your confusion.
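You just have to understand that a density's integral over the full domain has to be 1, so you can use it to calculate the probability of an observation falling into a specific region. The density values themselves are not probabilities, and they can exceed 1 when the data occupy a small area, as they do on the log10 scale.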
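A quick numerical check of this can be sketched as follows. It is only an illustration: the seed is arbitrary, `dens2d`, `cell`, `total`, and `peak` are made-up names, and the grid is widened by an assumed 0.5 on each side so that the tails of the estimate are not cut off.

```r
library(MASS)  # provides kde2d()

set.seed(1)
a <- log10(rweibull(1000, 8, 2))
b <- log10(rgamma(1000, 6, 3))

# Widen the grid beyond the data range so the tails are not cut off.
lims   <- c(range(a) + c(-0.5, 0.5), range(b) + c(-0.5, 0.5))
dens2d <- kde2d(a, b, n = 100, lims = lims)

# Area of one grid cell.
cell <- diff(dens2d$x[1:2]) * diff(dens2d$y[1:2])

# The density sums to about 1 over the whole grid ...
total <- sum(dens2d$z) * cell
# ... even though individual density values can be well above 1,
# because the data occupy a small area on the log10 scale.
peak <- max(dens2d$z)
c(total = total, peak = peak)
```

The probability of a region is obtained the same way, by summing `dens2d$z` over only the grid cells inside that region and multiplying by `cell`.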
I'm trying to plot a histogram of the Cauchy distribution in R using the following code:
X = rcauchy(10^5)
hist(X)
and no matter what options I try in the hist() function, I can never see more than two bars on my histogram (basically one for negative values and one for positive values).
It works fine, however, when I use the normal distribution (or others).
This results from the properties of the distribution.
Most values are relatively close to zero, but very large absolute values are much more probable than for the normal distribution. About 1.3 % of the values have an absolute value greater than 50, and about 0.13 % greater than 500.
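These tail probabilities can be computed exactly with pcauchy(). A small sketch, where the cutoffs are arbitrary and `tail_prob` is just an illustrative name:

```r
# Exact two-sided tail probabilities P(|X| > c) for the standard Cauchy.
cutoffs   <- c(1, 5, 50, 500)
tail_prob <- 2 * pcauchy(cutoffs, lower.tail = FALSE)
round(tail_prob, 4)

# With 10^5 draws this implies roughly this many values beyond each cutoff;
# a handful of huge values is enough to stretch hist()'s default breaks.
round(1e5 * tail_prob)
```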
Try plotting only part of the values:
hist(X[abs(X)<1])
hist(X[abs(X)<5])
hist(X[abs(X)<50])
hist(X)
You can also look at the cumulative distribution function:
plot(ecdf(X))
And check the boxplot:
boxplot(X)
I have a 500 x 2 matrix: the left column gives the index of the X observations, and the right column gives the probability with which this X occurs, so it is a typical probability density relationship.
My question is how to plot the histogram the right way, so that the x-axis is the x index and the y-axis is the density (0.01-1.00). The bandwidth of the estimator is 0.33.
Thanks in advance!
For orientation, the end of the data looks like this:
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
@everyone,
Yes, I have made the estimation before. The bandwidth is what I mentioned, and the data is ordered from low to high values, so the probability is about 0.22 at the beginning, about 0.48 at the peak, and 0.15 at the end.
The density line plots like a charm, but in addition I have to plot a histogram! So how can I do this, ordering the blocks properly (how should the data be split into boxes, etc.)?
Any suggestions?
Here is part of the data AFTER the estimation. All values are discrete, so I assume a histogram can be created, hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
appreciate the fast responses!(bow)
What would do the job is to create a histogram just for the indexes, grouping them into boxes of, for instance, 25 or 50 each, and computing the average probability for each box (25, 50, 100, 150, 200, 250, etc.)?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command. The only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))
x.bins = rep(1:50, each = 10)
bars = aggregate(x, by = list(x.bins), FUN = sum)[,2]
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function adding them together as I have done is incorrect. Mathematically I don't think you can produce the binned probability for a range using only a few points from within that range.
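If the rows are closely spaced points on a continuous density, one possible approximation is to integrate the curve numerically within each bin instead of adding up the heights. The sketch below assumes exactly that; the matrix `M` is a hypothetical stand-in built from dnorm(), and the bin size of 50 intervals is arbitrary.

```r
# Hypothetical stand-in for the 500 x 2 matrix: column 1 holds x values,
# column 2 holds density values at those x values.
xs <- seq(-3, 3, length.out = 500)
M  <- cbind(xs, dnorm(xs))

# Probability mass of each small interval between neighbouring points
# (trapezoid rule: interval width times the mean of the two heights).
width <- diff(M[, 1])
mass  <- width * (head(M[, 2], -1) + tail(M[, 2], -1)) / 2

# Group the 499 intervals into bins of 50 consecutive intervals each
# (the last bin gets the remaining 49) and add up the mass per bin.
bin  <- ceiling(seq_along(mass) / 50)
prob <- tapply(mass, bin, sum)

barplot(prob, names.arg = round(tapply(M[-1, 1], bin, max), 2),
        xlab = "upper edge of bin", ylab = "probability")
```

The bar heights here are probabilities per bin, not densities, so they add up to roughly the area under the curve over the plotted range.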
Assuming M is the matrix, wouldn't this just be:
plot(x=M[ , 1], y = M[ , 2] )
You have already done the density estimation since this is not the original data.