I have two sets of data and have plotted their probability density functions. Now I want the area between the two density curves within a certain x range.
I tried integrating the area, using the trapezoidal rule, and so on, following these questions:
Calculating the area between a curve and a straight line without finding the function
Error calculating the area between two lines using "integrate"
How to measure area between 2 distribution curves in R / ggplot2
but none of them worked for me.
Here is the link to the data I am working on.
https://sheet.zoho.com/sheet/editor.do?doc=1ff030ea1af35f06f8303927d7ea62b3c4b04bdae021555e8cc43ed0569cb2aaceb26368f93db4d15ac66cf7662d9a7873e889e1763139a49ffd68e7843e0b44
dens.pre=density(TX/10)
dens.post=density(TX30/10)
plot(dens.pre,col="green")
lines(dens.post,col="red")
locator()
#$x
#[1] 18.36246
#$y
#[1] 0.05632428
abline(v=18.3,col="red")
I want to find the area between the two curves for x > 18.3.
Area between the curves:
With the trapezoidal rule you could calculate it like this:
d0 <- dens.pre
d1 <- dens.post
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
# defining x range of the density overlap
ovrng <- c(18.3, min(max(d0$x), max(d1$x)))
# dividing it to sections (for example n=500)
i <- seq(min(ovrng), max(ovrng), length.out=500)
# calculating the distance between the density curves
h1 <- f0(i)-f1(i)
h2 <- f1(i)-f0(i)
#and using the formula for the area of a trapezoid we add up the areas
area1<-sum( (h1[-1]+h1[-length(h1)]) /2 *diff(i) *(h1[-1]>=0)) # for the regions where d0>d1 (h1 = f0-f1 is positive)
area2<-sum( (h2[-1]+h2[-length(h2)]) /2 *diff(i) *(h2[-1]>=0)) # for the regions where d1>d0 (h2 = f1-f0 is positive)
area_total <- area1 + area2
area_total
However, since you are only interested in the area where one curve remains below the other over the whole range, this can be shortened:
d0 <- dens.pre
d1 <- dens.post
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
# defining x range of the density overlap
ovrng <- c(18.3, min(max(d0$x), max(d1$x)))
# dividing it to sections (for example n=500)
i <- seq(min(ovrng), max(ovrng), length.out=500)
# calculating the distance between the density curves
h1 <- f1(i)-f0(i)
#and using the formula for the area of a trapezoid we add up the areas where d1>d0
area<-sum( (h1[-1]+h1[-length(h1)]) /2 *diff(i) *(h1[-1]>=0+0))
area
#We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
# mark every 5th grid point where d1 is above d0 (h1 > 0)
jj<-which(h1>0 & seq_along(h1) %% 5==0); j<-i[jj]
segments(j, f1(j), j, f1(j)-h1[jj])
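A more compact variant of the same idea, as a rough sketch: applying the trapezoidal rule to the absolute difference |f1 - f0| gives the total area between the curves regardless of which one is on top. The samples pre and post below are simulated stand-ins (an assumption) for TX/10 and TX30/10, and trapz is a hypothetical helper, not a base R function.

```r
# Simulated stand-ins for TX/10 and TX30/10 (assumption: the real data are not used here)
set.seed(42)
pre  <- rnorm(500, mean = 15, sd = 3)
post <- rnorm(500, mean = 17, sd = 3)

d0 <- density(pre)
d1 <- density(post)
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)

# Hypothetical helper: trapezoidal rule for y sampled at increasing x
trapz <- function(x, y) sum(diff(x) * (y[-1] + y[-length(y)]) / 2)

# Integration range: from the cutoff to the end of the shared support
lower <- 18.3
upper <- min(max(d0$x), max(d1$x))
stopifnot(lower < upper)
i <- seq(lower, upper, length.out = 500)

# Area between the curves, counting both "d1 above d0" and "d0 above d1"
area_between <- trapz(i, abs(f1(i) - f0(i)))
area_between
```

Unlike masking whole trapezoids by sign, this treats a grid interval that contains a crossing as a single trapezoid of the absolute difference, so the result depends only mildly on the grid resolution.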
There are other (and more detailed) solutions here and here
Related
I have two numerical variables whose densities I estimated with the density function in R. The code is the following:
d0<-density(T0,n=2^14)
df_density0<-data.frame(x=d0$x,y=d0$y,stringsAsFactors = FALSE)
d1<-density(T1,n=2^14)
df_density1<-data.frame(x=d1$x,y=d1$y,stringsAsFactors = FALSE)
Initially, I left the number of equally spaced points n at its default value of 512, but then I noticed that the area under the density curve d1 was not equal to 1 (it was around 13). I therefore increased n until the AUC (area under the curve) was close to 1 for both density curves, checking it this way:
library(zoo)
x <- df_density0$x
y <- df_density0$y
id <- order(x)
AUC0 <- sum(diff(x[id])*rollmean(y[id],2))
x <- df_density1$x
y <- df_density1$y
id <- order(x)
AUC1 <- sum(diff(x[id])*rollmean(y[id],2))
For n=2^14 I obtained AUC0 and AUC1 equal to 0.9999... I plotted these density curves and I obtained the following graphic (blue is for T0 and red is for T1):
Since these graphs were indistinguishable from each other, I selected a logarithmic scale for the vertical axis:
Is it possible that I get such a result? Should I change the kernel field in the density function?
Ps. To draw these graphs I exported d0 and d1 in two csv files to import them into LaTeX. Anyway, I obtained the same plots in R. Moreover,
> nrow(T0)
[1] 9760
> nrow(T1)
[1] 1963
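As a sanity check, the same trapezoid area can be computed in base R without zoo. The sketch below uses simulated data (an assumption, since T0 is not available here) and the default n = 512; trapz is a hypothetical helper name.

```r
set.seed(1)
sample0 <- rnorm(9760)   # simulated stand-in for T0 (assumption)
d <- density(sample0)    # default n = 512

# Hypothetical helper: trapezoidal rule for y sampled at increasing x
trapz <- function(x, y) sum(diff(x) * (y[-1] + y[-length(y)]) / 2)

auc <- trapz(d$x, d$y)
auc
```

For simulated data like this the area comes out close to 1 even with the default n, so it may be worth running the same check on the real data before tuning n.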
I would like to match points in 3-dimensional space.
Therefore, I am using the Hungarian Method described in this question: Finding the best matching pairwise points from 2 vectors
Here is my example using R:
# packages
library(rgl)
library(clue)
library(plyr)
library(fields)
set.seed(1)
a <- c(rep(2,7), 3,4,5,6,3,4,5,6,7,7,7,7,7,7) # x values
b <- c(rep(3,7),3,3,3,3, 3,3,3,3,3,3,3,3,3,3) # y values
c <- c(seq(1,7),1,1,1,1,7,7,7,7,1,2,3,4,5,6) # z values
# transform the points
set.seed(2)
a1 <- a + seq(1,length(a))
b1 <- b + 8
c1 <- c + 9
# plot the data
plot3d(a,b,c, col="red", pch=16,size=10)
plot3d(a1,b1,c1, lwd=10, col="blue", pch=16,size=10, add=TRUE)
# run the Hungarian Method
A <- cbind(a,b,c)
B <- cbind(a1,b1,c1)
distances <- rdist(A,B) # calculate Euclidean Distance between points
min.dist <- solve_LSAP(distances) # minimizing the sum of distance
min.dist.num <- as.numeric(min.dist)
# plot the minimized lines between point sets
for (ii in 1:dim(B)[1]){
D <- c(A[ii,1], B[min.dist.num[ii],1])
R <- c(A[ii,2], B[min.dist.num[ii],2])
W <- c(A[ii,3], B[min.dist.num[ii],3])
segments3d(D,R,W,col=2,lwd=1)
}
# calculate the share of points that is matched correctly
sum(1:dim(B)[1]==min.dist.num)/dim(B)[1]* 100
The problem here is that only 5% of the points are matched correctly (see last line of the code). In my view, the main trouble is that the algorithm does not take the structure of the object (a square) into account.
Question: Is there any method that performs better for this sample data?
In my original data, the dimensional structure of the points is way more complicated. I have a cloud of data and within this cloud there are multiple subfigures.
I am primarily looking for a solution in R, but other implementations (e.g. MATLAB, Excel, Java) are also welcome.
I have a multi-parameter function on which I infer the parameters using MCMC. This means that I have many samples of the parameters, and I can plot the functions:
# Simulate some parameters. Really, I get these from MCMC sampling.
first = rnorm(1000) # a
second = rnorm(1000) # b
# The function (geometric)
geometric = function(x, a, b) b*(1 - a^(x + 1)/a)
# Plot curves. Perhaps not the most efficient way, but it works.
curve(geometric(x, first[1], second[1]), ylim=c(-3, 3)) # first curve
for(i in 2:length(first)) {
curve(geometric(x, first[i], second[i]), add=TRUE, col='#00000030') # add others
}
How do I make this into a density plot instead of plotting the individual curves? For example, it's hard to see just how much denser it is around y=0 than around other values.
The following would be nice:
The ability to draw observed values on top (points and lines).
Drawing a contour line in the density, e.g. the 95% Highest Posterior Density interval or the 2.5 and 97.5 quantiles.
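To make the second wish concrete, here is a minimal sketch of a pointwise quantile band: each sampled curve is evaluated on a common grid and the 2.5%, 50% and 97.5% quantiles are taken at every x. This shows quantiles only, not a Highest Posterior Density interval, and the names xs, curves and band are introduced here for illustration.

```r
set.seed(1)
first  <- rnorm(1000)  # a (simulated, as in the question)
second <- rnorm(1000)  # b
geometric <- function(x, a, b) b * (1 - a^(x + 1) / a)

xs <- seq(0, 1, length.out = 101)  # grid over the default x range of curve()

# One row per x value, one column per parameter draw
curves <- sapply(seq_along(first), function(k) geometric(xs, first[k], second[k]))

# Pointwise quantiles; a negative a yields NaN at non-integer x,
# so those values are dropped with na.rm = TRUE
band <- apply(curves, 1, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)

# Shaded 95% band with the pointwise median on top
plot(xs, band[2, ], type = "l", ylim = range(band), xlab = "x", ylab = "geometric(x)")
polygon(c(xs, rev(xs)), c(band[1, ], rev(band[3, ])), col = "#00000030", border = NA)
lines(xs, band[2, ])
```

Observed values could then be drawn on top with points() or lines() in the usual way.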
I am trying to make a surface plot for data that is in a very long list of x,y,z points. To do this, I am dividing the data into a grid of 10k squares and finding the max value of z within each square. From my understanding, each z value should be stored in a matrix where each element of the matrix corresponds to a square on the grid. Is there an easier way to do this than the code below? That last line is already pretty long and it is only one square.
x<-(sequence(101)-1)*max(eff$CFaR)/100
y<-(sequence(101)-1)*max(eff$EaR)/100
effmap<-matrix(ncol=length(x)-1, nrow=length(y)-1)
someMatrix <- max(eff$Cost[which(eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]>=y[20] & eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]< y[91])])
So this is my interpretation of what you are trying to accomplish...
df <- read.csv("effSample.csv") # downloaded from your link
df <- df[c("CFaR","EaR","Cost")] # remove unnecessary columns
df$x <- cut(df$CFaR,breaks=100,labels=FALSE) # establish bins: CFaR
df$y <- cut(df$EaR,breaks=100,labels=FALSE) # establish bins: EaR
df.max <- expand.grid(x=1:100,y=1:100) # template; 10,000 grid cells
# maximum cost in each grid cell - NOTE: most of the cells are *empty*
df.max <- merge(df.max,aggregate(Cost~x+y,df,max),all.x=TRUE)
z <- matrix(df.max$Cost,nr=100,nc=100) # Cost vector -> matrix
# colors based on z-value
palette <- rev(rainbow(20)) # palette of 20 colors
zlim <- range(z[!is.na(z)])
colors <- palette[19*(z-zlim[1])/diff(zlim) + 1]
# create the plot
library(rgl)
open3d(scale=c(1,1,10)) # CFaR and EaR range ~ 10 X Cost range
x.values <- min(df$CFaR)+(0:99)*diff(range(df$CFaR))/100
y.values <- min(df$EaR)+(0:99)*diff(range(df$EaR))/100
surface3d(x.values,y.values,z,col=colors)
axes3d()
title3d(xlab="CFaR",ylab="EaR",zlab="Cost")
The code above generates a rotatable 3D plot, so the image is just a screen shot. Notice how there are lots of "holes". This is (partially) because you provided only part of your data. However, it is important to realize that just because you imagine 10,000 grid cells (e.g., a 100 X 100 grid), does not mean that there will be data in every cell.
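A shorter route to the same kind of matrix, sketched under the assumption that the data frame has the columns CFaR, EaR and Cost: tapply over two binned factors returns the per-cell maximum directly as a matrix. The data below are simulated stand-ins, not the original eff data.

```r
set.seed(1)
# Simulated stand-in for the eff data (assumption: same column names)
eff <- data.frame(CFaR = runif(5000), EaR = runif(5000), Cost = rnorm(5000))

nbin <- 100
# Bin both coordinates; keeping all levels keeps empty bins in the result
xb <- factor(cut(eff$CFaR, breaks = nbin, labels = FALSE), levels = 1:nbin)
yb <- factor(cut(eff$EaR,  breaks = nbin, labels = FALSE), levels = 1:nbin)

# z[i, j] is the maximum Cost in CFaR bin i and EaR bin j; empty cells are NA
z <- tapply(eff$Cost, list(xb, yb), max)
dim(z)
```

Because rows correspond to CFaR bins and columns to EaR bins, the matrix can be passed to a surface function together with the bin positions without reshaping.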
I need to find the volume from the surface down to a specific contour in R. Taking the contour example from the R help files:
x <- 10*1:nrow(volcano)
y <- 10*1:ncol(volcano)
contour(x,y,volcano)
Given the resulting graph, how do I find the volume from a specific contour line up to the surface?
In practice, I will use bkde2D to get a density map for a scatter plot. From this I can make the contour plot, but I would like to determine the volume defined by various density cutoffs in the resulting plot.
The function contour only draws the contour lines; it does not return any values. What you need is the function contourLines.
cL <- contourLines(x,y,volcano)
From there, you can calculate the area enclosed by each contour line in the following way (the shoelace formula):
area <- rep(0,length(cL))
for(i in 1:length(cL)){
d <- data.frame(cL[[i]]$x,cL[[i]]$y)
sa <- sb <- 0
for(j in 1:(nrow(d)-1)){
sa <- sa+d[j,1]*d[j+1,2]
sb <- sb+d[j,2]*d[j+1,1]
}
area[i] <- abs((sa-sb)/2)
}
area
[1] 1.413924e+05 3.109685e+04 2.431528e+04 2.049473e+04 6.705976e+04 3.202145e+05 1.720469e+03
[8] 2.926802e+05 2.335421e+05 1.834791e+05 1.326162e+05 4.672784e+02 9.419792e+04 5.121851e+03
[15] 5.126860e+04 3.660862e-01 1.216750e+03 2.051307e+04 4.670745e+02 4.146927e+03
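The same polygon area can be written without the inner loop. The sketch below is a vectorized shoelace formula; shoelace_area is a hypothetical helper name, and unlike the loop above it also adds the closing edge from the last vertex back to the first, which contributes nothing when the contour is already closed.

```r
# Hypothetical helper: shoelace formula for a polygon given by its vertices
shoelace_area <- function(x, y) {
  xn <- c(x[-1], x[1])  # next vertex, wrapping around to close the polygon
  yn <- c(y[-1], y[1])
  abs(sum(x * yn - xn * y)) / 2
}

x <- 10 * 1:nrow(volcano)
y <- 10 * 1:ncol(volcano)
cL <- contourLines(x, y, volcano)

# One area per contour line
area <- vapply(cL, function(l) shoelace_area(l$x, l$y), numeric(1))
area
```

For contour lines that run into the edge of the grid, the implicit closing edge cuts straight across, so those areas are only approximate.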
Now, if you want the volume between two contour lines (say between levels 120 and 130):
level1 <- 120
level2 <- 130
level <- unlist(lapply(cL,function(x)x$level)) # level of each contour line
base <- (1:length(cL))[level==level1]
top <- (1:length(cL))[level==level2]
vol <- (level[top]-level[base])*(area[base]+area[top])/2
vol
[1] 2631111
And that's as far as I can go because I don't see how to proceed if the next contour line is split into several sectors.
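One way around split contours, sketched here as a crude Riemann sum rather than an exact answer, is to skip the contour polygons and work on the grid itself: every cell contributes its height above the cutoff times the cell area. The spacing of 10 in both directions is taken from the x and y definitions in the question; the cutoff of 120 is only an example.

```r
dx <- 10      # grid spacing in x, as in x <- 10*1:nrow(volcano)
dy <- 10      # grid spacing in y, as in y <- 10*1:ncol(volcano)
level <- 120  # example cutoff

# Height above the cutoff (0 where the surface is below it), times cell area
vol_above <- sum(pmax(volcano - level, 0)) * dx * dy
vol_above
```

The same expression should apply to a density matrix such as the one returned by bkde2D, with dx and dy set to the spacing of its grid, though that is not tested here.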