How to calculate a probability from a density plot? - r

I have the following question: is it possible to calculate a probability from a density plot?
For example, I have the following data frame:
test<- data.frame(
Gruppe = rep(c("Aktien","Aktien"),
times=c(136, 37)),
Zufriedenheit = c(f_keineErf, f_Erf))
and I draw a density plot with the ggplot function:
ggplot(test, aes(x=Zufriedenheit)) +geom_density()
How can I calculate the probability of, for example, getting a value above 70?
Thank you!

Your data is not included in the question, so let's make up a small random sample:
library(ggplot2)
set.seed(69)
df <- data.frame(x = rnorm(10))
Now we can create a density plot as per your example:
p <- ggplot(df, aes(x)) +
geom_density() +
xlim(c(-5, 5))
p
Now, we can actually find the x and y coordinates of this line using the base R function density and extracting its x and y components into a data frame:
dens <- density(df$x)
d <- data.frame(x = dens$x, y = dens$y)
head(d)
#> x y
#> 1 -3.157056 0.0009453767
#> 2 -3.144949 0.0010145927
#> 3 -3.132841 0.0010870523
#> 4 -3.120733 0.0011665920
#> 5 -3.108625 0.0012488375
#> 6 -3.096517 0.0013382316
Plotting these points as a red dashed geom_line shows that they trace the same curve as geom_density:
p + geom_line(data = d, aes(x, y), col = "red", linetype = 2, size = 2)
Now suppose we want to know the probability of having a value of more than one. We can show the area we are interested in like this:
p + geom_area(data = d[d$x >= 1,], aes(x, y), fill = "red")
Since the x values in our data frame d are all equally spaced, the red area's share of the total area under the line is simply the ratio of the sum of the y values at x values greater than one to the grand sum of y:
sum(d$y[d$x > 1])/sum(d$y)
#> [1] 0.1599931
So the estimated probability of getting an x value greater than 1 is about 0.16, or 16%.
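As a cross-check on the ratio above, the same tail area can be approximated with the trapezoidal rule, which does not rely on the grid being evenly spaced. This is only a sketch: `trap_area` is a helper name made up here, and the result is the probability under the kernel density estimate, not under the true distribution.

```r
set.seed(69)
x <- rnorm(10)
dens <- density(x)

# Hypothetical helper: trapezoidal area under the points (x, y)
trap_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

total <- trap_area(dens$x, dens$y)   # should be close to 1
keep <- dens$x > 1                   # grid points in the upper tail
p_above <- trap_area(dens$x[keep], dens$y[keep]) / total
p_above
```

For the data in the question, the same idea would apply with `density(test$Zufriedenheit)` and a cutoff of 70. The empirical proportion `mean(test$Zufriedenheit > 70)` is a simpler alternative that needs no density estimate at all.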
Created on 2020-08-17 by the reprex package (v0.3.0)

Related

How to set up y lim and add annotation by group for linear regression in ggplot2

I am drawing linear regression plots by group. The demo data and code are below. I have two groups, A and B, and I would like to draw the regression line for each group separately: the regression line for group A on page 1 of the PDF, and the regression line for group B on page 2. I also want to adjust the y limits by group. For group A, the y limits should run from mean(Y) - 10 to mean(Y) + 10 in steps of 1, and the y label should be Y1. For group B, the y limits should run from -1 to 1 in steps of 0.1, and the y label should be Y2. Finally, I want to add the slope, the intercept, their standard errors and the median of Y at X = 1 as an annotation on the plot. The annotation should appear at the bottom of each page and look like this:
slope: 2
se for slope: 1
intercept: 20
se for intercept: 1
baseline median: 1
How can I achieve this in ggplot?
Now my pdf file has two pages. But the Y lim is not adjusted by group. I also cannot add annotation by group.
library(ggplot2)
library(ggforce) # provides facet_wrap_paginate()
# generate dummy data
set.seed(2)
df1 <- data.frame(group = rep("A",100),
x = rep(seq(1,5,1),20))
df1$y <- 2*df1$x + 20 + abs(rnorm(100))
df2 <- data.frame(group = rep("B",100),
x = rep(seq(1,5,1),20))
df2$y <- 0.001*df2$x + 0.01 + abs(rnorm(100))
df <- as.data.frame(rbind(df1,df2))
my_breaks <- function(x){'if'(mean(x) > 10, seq(mean(x)-10, mean(x)+10, 1),seq(-1,1,0.1))}
pdf("Regression_example.pdf")
for(i in 1:2){
print(ggplot(df, aes(x, y)) + geom_point() + facet_wrap_paginate(~group, ncol = 1, nrow=1, page = i) + geom_smooth(method='lm') + scale_y_continuous(breaks = my_breaks))
}
dev.off()
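The numbers requested for the annotation can all be read from a per-group `lm` fit. The following is a minimal sketch, not a full solution: `annotation_for` is a hypothetical helper, and it only builds the text. One possible way to place it would be to plot each group's subset inside the loop and pass the string to `labs(caption = ...)`, which also makes it possible to set `scale_y_continuous(limits = ..., breaks = ...)` per group.

```r
# Same dummy data as in the question
set.seed(2)
df1 <- data.frame(group = "A", x = rep(1:5, 20))
df1$y <- 2 * df1$x + 20 + abs(rnorm(100))
df2 <- data.frame(group = "B", x = rep(1:5, 20))
df2$y <- 0.001 * df2$x + 0.01 + abs(rnorm(100))
df <- rbind(df1, df2)

# Hypothetical helper: build the annotation text for one group's data
annotation_for <- function(d) {
  # Coefficient table with columns "Estimate", "Std. Error", ...
  est <- coef(summary(lm(y ~ x, data = d)))
  paste(
    paste("slope:", signif(est["x", "Estimate"], 3)),
    paste("se for slope:", signif(est["x", "Std. Error"], 3)),
    paste("intercept:", signif(est["(Intercept)", "Estimate"], 3)),
    paste("se for intercept:", signif(est["(Intercept)", "Std. Error"], 3)),
    paste("baseline median:", signif(median(d$y[d$x == 1]), 3)),
    sep = "\n"
  )
}

# One annotation string per group, named "A" and "B"
labels <- vapply(split(df, df$group), annotation_for, character(1))
cat(labels[["A"]], "\n")
```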

Defining a line between two points on a scatterplot in ggplot

Apologies if this question is a duplicate; looking up "drawing a line between two points" on Stack Overflow gave me some ideas, but I'm not sure how to apply them to my specific problem.
Let's say that my data take the shape of:
x <- runif(n = 10)
y <- runif(n = 10)
graphData <- data.frame(x, y)
graphData
x y
1 0.3328235 0.30122890
2 0.4886130 0.06072057
3 0.9544738 0.94772694
4 0.4829024 0.72059627
5 0.8903502 0.14229430
6 0.9144382 0.54928466
7 0.6087350 0.95409124
8 0.4106898 0.58548335
9 0.1470947 0.40451028
10 0.9352998 0.64789348
Then I do a scatterplot of those data:
library(ggplot2)
p <- ggplot(graphData, aes(x = x, y = y)) +
geom_point()
p
What I want is to draw exactly one line, connecting the point that has the highest y-value with the point that has the highest x-value. (The example makes it look like those could be the same point, but in my real-life data, the odds of that happening are infinitesimally small.)
Also, I'm not just drawing the line on a plot; I will also need to provide that line as a formula, to be used in a separate analysis. Thoughts?
I would try:
library(dplyr) # for filter() and %>%
p <- ggplot(graphData, aes(x = x, y = y)) +
geom_point()+
geom_smooth(data = . %>% filter(x == max(x) | y == max(y)), method = lm)
p
and then call the formula of the line:
lm(y ~ x, data = graphData %>% filter(x == max(x) | y == max(y)))
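Because only two points are involved, the fitted line passes through both of them exactly, so the slope and intercept can also be computed directly without `lm`. A minimal sketch, assuming the two points are distinct and do not share the same x value (the seed and the names `p_maxx`, `p_maxy`, `slope` and `intercept` are chosen here for illustration):

```r
set.seed(1)
graphData <- data.frame(x = runif(10), y = runif(10))

# The two points of interest
p_maxx <- graphData[which.max(graphData$x), ]  # point with the highest x
p_maxy <- graphData[which.max(graphData$y), ]  # point with the highest y

# A vertical line has no slope-intercept form, so guard against it
stopifnot(p_maxx$x != p_maxy$x)

# Line through the two points: y = intercept + slope * x
slope <- (p_maxy$y - p_maxx$y) / (p_maxy$x - p_maxx$x)
intercept <- p_maxx$y - slope * p_maxx$x
c(intercept = intercept, slope = slope)
```

These two numbers should agree with the coefficients returned by the `lm` call above when it is run on the same data.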

Plot one data frame column against all other columns using ggplots and showing densities in R

I have a data frame with 20 columns, and I want to plot one specific column (called BB) against each single column in the data frame. The plots I need are probability density plots, and I’m using the following code to generate one plot (plotting columns BB vs. AA as an example):
library(data.table) # for fread()
library(ggplot2)
library(viridis) # for scale_color_viridis()
mydata = as.data.frame(fread("filename.txt")) #read my data as data frame
#function to calculate density
get_density <- function(x, y, n = 100) {
dens <- MASS::kde2d(x = x, y = y, n = n)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
#define the x and y of the plot; x = column called AA; y = column called BB
xy1 <- data.frame(
x = mydata$AA,
y = mydata$BB
)
#call function get_density to calculate density for the defined x and y
xy1$density <- get_density(xy1$x, xy1$y)
#Plot
ggplot(xy1) + geom_point(aes(x, y, color = density), size = 3, pch = 20) + scale_color_viridis() +
labs(title = "BB vs. AA") +
scale_x_continuous(name="AA") +
scale_y_continuous(name="BB")
I would appreciate it if someone could suggest a method to produce multiple plots of BB against every other column, using the above density function and ggplot command. I tried adding a loop, but found it too complicated, especially when defining the x and y to be plotted or calling the density function.
Since you don't provide sample data, I'll demo on mtcars. We convert the data to long format, calculate the densities, and make a faceted plot. We plot the mpg column against all others.
library(dplyr)
library(tidyr)
mtlong = gather(mtcars, key = "var", value = "value", -mpg) %>%
group_by(var) %>%
mutate(density = get_density(value, mpg))
ggplot(mtlong, aes(x = value, y = mpg, color = density)) +
geom_point(pch = 20, size = 3) +
labs(x = "") +
facet_wrap(~ var, scales = "free")
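If a loop feels more natural than reshaping, the same long table can be built with `lapply` in base R. This is a sketch under the same assumption as above (mtcars standing in for the real data, mpg playing the role of BB); `others` and `long` are names introduced here, and `get_density` is copied from the question.

```r
# Density helper from the question (requires the MASS package)
get_density <- function(x, y, n = 100) {
  dens <- MASS::kde2d(x = x, y = y, n = n)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  ii <- cbind(ix, iy)
  return(dens$z[ii])
}

# Every column except the one plotted on the y axis
others <- setdiff(names(mtcars), "mpg")

# One small data frame per column, stacked into a single long table
long <- do.call(rbind, lapply(others, function(v) {
  data.frame(
    var = v,
    value = mtcars[[v]],
    mpg = mtcars$mpg,
    density = get_density(mtcars[[v]], mtcars$mpg)
  )
}))
head(long)
```

The resulting `long` data frame has the same columns as `mtlong`, so it could be passed to the faceted ggplot call in the same way.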

How to compute area under the curve (intersection may occur) with the original dataset?

I have a bunch of datasets of x’s and y’s. For each dataset, I plot the points (x, y) in R, and the resulting plots are generally similar to either type A or type B. Type B has an intersection, while type A does not.
My question: Given a new dataset, is it possible to calculate (in R) the red shaded area under the curve as indicated in type A and type B plot without knowing the visualization?
The main challenges are:
1) How to determine whether the dataset will generate type A or type B in R?
2) How to compute the red shaded area in type B using the dataset with R?
Here is the code producing the dataset that generated type B curve.
set.seed(300)
predicted_value_A = c(rbeta(300, 9, 2), rbeta(700, 2, 4), rbeta(10000, 2, 4))
predicted_value_B = c(rbeta(1000, 4, 3), rbeta(10000, 2, 3))
real_value = c(rep(1, 1000), rep(0, 10000))
library(ROCR)
library(ggplot2)
predB <- prediction(predicted_value_B, real_value)
perfB <- performance(predB, measure = "mat", x.measure = "f")
yB <- attr(perfB, "y.values")[[1]]
yB <- (yB + 1)/2
xB <- attr(perfB, "x.values")[[1]]
# dataset that generates type B curve
dfB <- data.frame(X = xB, Y= yB)
ggplot(dfB, aes(x=X, y=Y, ymin=0, ymax=1, xmin=0, xmax=1 )) + geom_point(size = 0.2, shape = 21, fill="white")+
ggtitle("Type B curve") +
theme(plot.title=element_text(hjust=0.5))
Here is a bit of code that shades the plot from a set of (x, y) points using an approximation with small rectangles. It assumes evenly spaced x values, and enough of them that the rectangular approximation works well.
# sample dataset
x <- seq(0,2,length.out=1000)
y1 <- x
y2 <- sin(x*pi)+x
# plot
plot(x,y1,type='l',ylab='y')
lines(x,y2)
# shade the plot
## not efficient but works
dx <- x[2]-x[1]
area <- 0
# shade plot and calculate area
## uses a rectangular strip approximation
## assumes even spacing in x. Could also calculate the dx in each step if it changes
for (i in seq_along(x)) {
if (y1[i] < y2[i]) {
cord.x <- c(x[i]-dx/2,x[i]-dx/2,x[i]+dx/2,x[i]+dx/2)
cord.y <- c(y1[i],y2[i],y2[i],y1[i])
} else {
cord.x <- c(x[i]-dx/2,x[i]-dx/2,x[i]+dx/2,x[i]+dx/2)
cord.y <- c(y2[i],y1[i],y1[i],y2[i])
}
# draw the polygons
polygon(cord.x, cord.y, col = 'pink', border = NA)
# sum to the area
area <- area + abs(y2[i]-y1[i])*dx
}
area
(Figure: sample shaded plot by rectangular approximation.)
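Without the plotting, the loop above reduces to a few vectorised lines, and the sign of `y2 - y1` indicates whether the curves cross, which is one way to tell the two plot types apart. A sketch on the same sample data, using the trapezoidal rule instead of rectangles (the names `gap`, `s` and `crossings` are introduced here):

```r
x <- seq(0, 2, length.out = 1000)
y1 <- x
y2 <- sin(x * pi) + x

# Absolute vertical gap between the two curves at each x
gap <- abs(y2 - y1)

# Trapezoidal rule; diff(x) also handles unevenly spaced x values
area <- sum(diff(x) * (head(gap, -1) + tail(gap, -1)) / 2)
area

# Count sign changes of y2 - y1, ignoring points where the curves touch
s <- sign(y2 - y1)
crossings <- sum(diff(s[s != 0]) != 0)
crossings > 0   # TRUE when the curves intersect somewhere
```

For this sample the exact area is 4/pi, so the result can be compared against that value.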

R: Calculate and plot difference between two density contours

I have two datasets with two continuous variables: duration and waiting.
library("MASS")
data(geyser)
geyser1 <- geyser[1:150,]
geyser2 <- geyser[151:299,]
geyser2$duration <- geyser2$duration - 1
geyser2$waiting <- geyser2$waiting - 20
For each dataset I output a 2D density plot
ggplot(geyser1, aes(x = duration, y = waiting)) +
xlim(0.5, 6) + ylim(40, 110) +
stat_density2d(aes(alpha=..level..),
geom="polygon", bins = 10)
ggplot(geyser2, aes(x = duration, y = waiting)) +
xlim(0.5, 6) + ylim(40, 110) +
stat_density2d(aes(alpha=..level..),
geom="polygon", bins = 10)
I now want to produce a plot which indicates the regions where the two plots have the same density (white), negative differences (gradation from white to blue where geyser2 is denser than geyser1) and positive differences (gradation from white to red where geyser1 is denser than geyser2).
How to compute and plot the difference of the densities?
You can do this by first using kde2d to calculate the densities and then subtracting them from each other. Then you do some data reshaping to get it into a form that can be fed to ggplot2.
library(ggplot2)
library(reshape2) # For melt function
library(scales) # For muted function
# Calculate the common x and y range for geyser1 and geyser2
xrng = range(c(geyser1$duration, geyser2$duration))
yrng = range(c(geyser1$waiting, geyser2$waiting))
# Calculate the 2d density estimate over the common range
d1 = kde2d(geyser1$duration, geyser1$waiting, lims=c(xrng, yrng), n=200)
d2 = kde2d(geyser2$duration, geyser2$waiting, lims=c(xrng, yrng), n=200)
# Confirm that the grid points for each density estimate are identical
identical(d1$x, d2$x) # TRUE
identical(d1$y, d2$y) # TRUE
# Calculate the difference between the 2d density estimates
diff12 = d1
diff12$z = d2$z - d1$z
## Melt data into long format
# First, add row and column names (x and y grid values) to the z-value matrix
rownames(diff12$z) = diff12$x
colnames(diff12$z) = diff12$y
# Now melt it to long format
diff12.m = melt(diff12$z, id.var=rownames(diff12))
names(diff12.m) = c("Duration","Waiting","z")
# Plot difference between geyser2 and geyser1 density
ggplot(diff12.m, aes(Duration, Waiting, z=z, fill=z)) +
geom_tile() +
stat_contour(aes(colour=..level..), binwidth=0.001) +
scale_fill_gradient2(low="red",mid="white", high="blue", midpoint=0) +
scale_colour_gradient2(low=muted("red"), mid="white", high=muted("blue"), midpoint=0) +
coord_cartesian(xlim=xrng, ylim=yrng) +
guides(colour=FALSE)
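The reshaping step can also be done in base R, which avoids the reshape2 dependency. A minimal sketch, assuming a coarser grid (n = 50) to keep it quick; `diff_z` and `long` are names introduced here. It relies on `expand.grid` varying its first argument fastest, which matches the column-major order of the z matrix (rows index x, columns index y).

```r
library(MASS)  # kde2d() and the geyser data

geyser1 <- geyser[1:150, ]
geyser2 <- geyser[151:299, ]
geyser2$duration <- geyser2$duration - 1
geyser2$waiting <- geyser2$waiting - 20

# Common range so both estimates share the same grid
xrng <- range(c(geyser1$duration, geyser2$duration))
yrng <- range(c(geyser1$waiting, geyser2$waiting))
d1 <- kde2d(geyser1$duration, geyser1$waiting, lims = c(xrng, yrng), n = 50)
d2 <- kde2d(geyser2$duration, geyser2$waiting, lims = c(xrng, yrng), n = 50)

# Positive where geyser2 is denser, negative where geyser1 is denser
diff_z <- d2$z - d1$z

# Long format: one row per grid cell
long <- expand.grid(Duration = d1$x, Waiting = d1$y)
long$z <- as.vector(diff_z)
head(long)
```

The `long` data frame has the same three columns as `diff12.m` and could be used in its place in the plotting code.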

Resources