How to draw a CDF and PDF with dual y-axes in R

I want to plot the cumulative distribution function (CDF) and the probability density function (PDF) in one figure, sharing an x-axis, with a separate y-axis scale for each curve on the left and right sides. I tried using sec_axis() in ggplot2, but the resulting plot PR only shows one line properly; the other line does not follow the secondary y-axis. What should I do?
library(ggplot2)
new_data <- rnorm(n = 100, mean = 8000, sd = 1000)
new_data <- as.data.frame(new_data)
names(new_data) <- c("CADP")
m1 <- ggplot(new_data, aes(x = CADP)) + geom_density()
p1 <- ggplot(new_data, aes(x = CADP)) + stat_ecdf(colour = "red") +
  labs(x = "CADP")
PR <- p1 + scale_y_continuous(expand = c(0, 0), limits = c(0, 1),
    sec.axis = sec_axis(~ . * 4e-04, breaks = seq(0, 4e-04, 1e-04))) +
  geom_density(colour = "blue")

The two lines produced by stat_ecdf and geom_density are on vastly different scales. The CDF's peak value (1) is about 2500 times the peak value of the PDF (about 0.0004 for a normal distribution with a standard deviation of 1000). If you want to see both lines clearly on the same plot, you need to apply a transformation to one of them, either dividing the CDF by about 2500 or multiplying the PDF by about 2500. You need to do this even if you have a secondary axis.
Remember that a secondary axis is just an inert annotation that gets stuck onto the side of the plot: it doesn't change the size, scale or shape of any of the objects you have plotted. The secondary axis is labelled in such a way that it allows you to 'pretend' that some of your data is on a different scale. The way this works is that you apply the transformation to your data to make it fit on your plot, and pass the inverse transformation as a function to sec_axis.
Although you can apply a transformation to the output of stat_ecdf, it is just as easy to create the transformed CDF yourself and plot it with geom_step:
library(ggplot2)
new_data <- data.frame(CADP = rnorm(100, 8000, 1000))
ggplot(new_data, aes(x = CADP)) +
  geom_density(colour = 'blue') +
  geom_step(aes(x = sort(CADP),
                y = 0.0004 * seq_along(CADP) / nrow(new_data))) +
  scale_y_continuous(name = 'PDF',
                     sec.axis = sec_axis(~.x / 0.0004, name = 'CDF'))
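The factor 0.0004 above is hard-coded for this particular mean and standard deviation. As a rough sketch of the same idea, the factor could instead be estimated from the data as the peak of a kernel density estimate. The names pdf_peak and cdf_df are made up for this illustration, and the peak from density() may differ slightly from the curve geom_density() draws.

```r
library(ggplot2)

set.seed(42)
new_data <- data.frame(CADP = rnorm(100, 8000, 1000))

# Peak of the estimated PDF; the CDF peaks at 1, so this is the scale factor
pdf_peak <- max(density(new_data$CADP)$y)

# Empirical CDF computed by hand: sorted values against cumulative proportion
cdf_df <- data.frame(CADP = sort(new_data$CADP),
                     cdf  = seq_len(nrow(new_data)) / nrow(new_data))

p <- ggplot(new_data, aes(x = CADP)) +
  geom_density(colour = "blue") +
  # Transformation: squeeze the CDF down to the height of the PDF
  geom_step(data = cdf_df, aes(y = cdf * pdf_peak), colour = "red") +
  # Inverse transformation: label the right-hand axis in CDF units
  scale_y_continuous(name = "PDF",
                     sec.axis = sec_axis(~ . / pdf_peak, name = "CDF"))
print(p)
```

The multiplication in geom_step() and the division in sec_axis() have to use the same factor, otherwise the right-hand axis labels no longer match the red curve.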

Related

Extend line length with geom_line

I want to represent three lines on a graph overlain with datapoints that I used in a discriminant function analysis. From my analysis, I have two points that fall on each line and I want to represent these three lines. The lines represent the probability contours of the classification scheme and exactly how I got the points on the line are not relevant to my question here. However, I want the lines to extend further than the points that define them.
df <-
data.frame(Prob = rep(c("5", "50", "95"), each=2),
Wing = rep(c(107,116), 3),
Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))
ggplot()+
geom_line(data=df, aes(x=Bill, y=Wing, group=Prob, color=Prob))
The above df is a dataframe for my points from which the three lines are constructed. I want the lines to extend from y=105 to y=125.
Thanks!
There are probably more idiomatic ways of doing it, but this is one way to get it done.
In short, you calculate the linear formula y = mx + c for the line through each pair of points:
library(dplyr)
library(ggplot2)
df_withFormula <- df |>
  group_by(Prob) |>
  # This mutate() call creates the slope and intercept needed by geom_abline() in the plotting stage.
  mutate(increaseBill = Bill - lag(Bill),
         increaseWing = Wing - lag(Wing),
         slope = increaseWing / increaseBill,
         intercept = Wing - slope * Bill)
# increaseBill, increaseWing and slope could all be combined into one calculation, but I thought it was easier to understand this way.
ggplot(df_withFormula, aes(Bill, Wing, color = Prob)) +
  # Added just so the plot has something to draw on top of. You could remove this and define the limits manually instead (expand_limits() would work).
  geom_point() +
  # This plots the three lines. The rows with NA are automatically ignored. More explicit handling of the NAs could be done in the data prep stage.
  geom_abline(aes(slope = slope, intercept = intercept, color = Prob)) +
  # This is the crucial part: it lets you define the range of the plot window. As ablines are infinite, you can define whatever limits you want.
  expand_limits(y = c(105, 125))
Hope this helps you get the graph you want.
This depends very much on the structure of your data, though it could be changed to fit different shapes.
This is similar to the approach by @James in that I compute the slopes and intercepts from the given data and use geom_abline to plot the lines, but it uses
summarise instead of mutate to get rid of the NA values,
and geom_blank instead of geom_point so that only the lines are displayed and not the points. (Note: having another geom is crucial to set the scale or range of the data and for the lines to show up.)
library(dplyr)
library(ggplot2)
df_line <- df |>
  group_by(Prob) |>
  summarise(slope = diff(Wing) / diff(Bill),
            intercept = first(Wing) - slope * first(Bill))
ggplot(df, aes(x = Bill, y = Wing)) +
  geom_blank() +
  geom_abline(data = df_line, aes(slope = slope, intercept = intercept, color = Prob)) +
  scale_y_continuous(limits = c(105, 125))
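Both answers rely on the same y = mx + c arithmetic. As an illustrative alternative (not taken from either answer), the formula can be inverted to find the Bill value at Wing = 105 and Wing = 125, giving explicit end points that geom_line can draw. The helper extend_line and the result df_ext are hypothetical names used only for this sketch, which needs base R alone.

```r
# Two known points per probability contour (data from the question)
df <- data.frame(Prob = rep(c("5", "50", "95"), each = 2),
                 Wing = rep(c(107, 116), 3),
                 Bill = c(36.92055, 36.12167, 31.66012,
                          30.86124, 26.39968, 25.6008))

# Hypothetical helper: fit Wing = slope * Bill + intercept through the two
# points, then solve for Bill at the requested Wing values
extend_line <- function(d, wing_range = c(105, 125)) {
  slope <- diff(d$Wing) / diff(d$Bill)
  intercept <- d$Wing[1] - slope * d$Bill[1]
  data.frame(Prob = d$Prob[1],
             Wing = wing_range,
             Bill = (wing_range - intercept) / slope)
}

# Apply the helper to each contour and stack the results
df_ext <- do.call(rbind, lapply(split(df, df$Prob), extend_line))
df_ext
```

Passing df_ext instead of df to the geom_line call from the question should then draw each line from y = 105 to y = 125.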

How to plot density of points in one dimension with different factors in ggplot2

I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
library(ggplot2)
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# 1000 labels per task, so that task_number has the same length as S
data = data.frame(task_number = as.factor(c(replicate(1000, 1),
                                            replicate(1000, 2))),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
  group_by(task_number) %>%
  # Use approxfun to interpolate the density back to
  # the original points
  mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
  geom_point() +
  scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value. However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
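For readers who prefer to avoid dplyr, the per-group density step from the first answer can be sketched in base R with ave(). This is an assumed equivalent rather than code from either answer, and the simulated data only mimics the question's setup.

```r
set.seed(1)
data <- data.frame(task_number = factor(rep(1:2, each = 1000)),
                   S = rnorm(2000, 1, 10))

# For each task: estimate the kernel density of S, then interpolate that
# estimate back onto the original S values with approxfun()
data$dens <- ave(data$S, data$task_number,
                 FUN = function(s) approxfun(density(s))(s))

head(data)
```

The new dens column can then be mapped to colour in geom_point, exactly as in the dplyr version.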

How do I use the rug function for an exponential distribution in R?

I am just starting out in R. I want to plot time intervals (exponentially distributed) on the x-axis, with a tick mark wherever an interval ends. If I have a sequence of times, say (0.2, 0.8, 0.9, 1.0), then the tick marks on the x-axis would be at 0.2, 0.8, 0.9 and 1.0 respectively. With big data samples, I want my graph to look something like:
So after using,
set.seed(1)
x <- rexp(50, 0.2)
How should I proceed from here? Do I have to use the rug function (which I am trying to learn how to use)? Can I also put time stamps on this graph?
Edit
So I have modified my command and used:
x <- c(cumsum(rexp(50, 0.2)))
y <- rep(0, length(x))
plot(x,y)
rug(x)
and I have been able to get this:
This result does the job, if that is all there is to it. However, is there a command I can use to turn this result (the second picture) into something like the first picture? I would like the tick marks to sit on a single horizontal line instead of in a full plot. Or is that not possible?
With ggplot you can use geom_rug() to add a rug to the plot. First the data need to be made into a data.frame:
library("tidyverse")
set.seed(1)
x <- rexp(50, 0.2)
ggplot(data.frame(x), aes(x = x)) + geom_rug()
The rug is rather short by default (3% of the plot height). It can be lengthened with the length argument, e.g. geom_rug(length = unit(0.1, "npc")).
The opposite would be to use geom_vline, which gives lines the full length of the y-axis:
#ggplot(data.frame(x), aes(xintercept = x)) + geom_vline() #doesn't work
ggplot(data.frame(x)) + geom_vline(aes(xintercept = x))
rug() requires only a vector of values that describes where to draw the tickmarks (rugs). In case of plotting values x on the x-axis, those form the input data for the rug function. Type ?rug to get further help.
# generate y values
y <- rexp(50, 0.2)
# split plotting area into two columns - optional
par(mfrow = c(1, 2))
plot(x, y)
rug(x)
# plot with both axes in log scale to show that rug adjusts to axes scale
plot(x, y, log = "xy")
rug(x)
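To get only tick marks on a single horizontal line, as the question asks, one possible base-graphics sketch is to open an empty plot and draw the line and ticks by hand. The tick height, the y-limits and the choice to label every 10th event are arbitrary values picked for illustration.

```r
set.seed(1)
x <- cumsum(rexp(50, 0.2))   # event times: running sum of exponential intervals

# Empty canvas: no points, no box, no y-axis
plot(x, rep(0, length(x)), type = "n", bty = "n", yaxt = "n",
     xlab = "Time", ylab = "", ylim = c(-1, 1))
abline(h = 0)                 # the horizontal time line
segments(x, -0.1, x, 0.1)     # one vertical tick per event time

# Optional time stamps above every 10th event
idx <- seq(10, length(x), by = 10)
text(x[idx], 0.25, labels = round(x[idx], 1), cex = 0.7)
```

Because the ticks are ordinary segments, their height and position can be changed freely, which rug() itself only allows to a limited extent through its ticksize and side arguments.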

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame, making multiple complex plots that each include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits so that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
  my_col = cols[i]
  p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
  print(p1)
  p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
  print(p2)
  p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
  print(p3)
  p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
  print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
maxDensities <- sapply(df, function(i) max(hist(i, plot = FALSE)$density))
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out:
cut bins the data (not really needed with integer data like yours, but probably needed with real data)
table then creates a table with the count in each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
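The two suggestions above can be combined into a small helper that returns the second-highest bin count for one column. This is only a sketch: the function name second_tallest is hypothetical, and hist() may choose different bins than geom_histogram() unless the same breaks are given to both.

```r
# Hypothetical helper: bin x without plotting, then take the 2nd highest count
second_tallest <- function(x, breaks = 10) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  sort(counts, decreasing = TRUE)[2]
}

# Toy data with one dominant peak at 1, similar in spirit to the question
x <- c(rep(1, 80), rep(2, 5), rep(3, 3), 4:10)
peak2 <- second_tallest(x, breaks = seq(0.5, 10.5, by = 1))
peak2   # 5: the bin for the value 2 is the second tallest
```

Inside the plotting loop, the result could be used as the upper limit in coord_cartesian(ylim = c(0, peak2)), which zooms in without dropping the tall bar, whereas scale_y_continuous(limits = ...) removes bars that exceed the limit.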

How to convert a bar histogram into a line histogram in R

I've seen many examples of density plots, but a density plot's y-axis shows probability density. What I am looking for is a line plot (like a density plot) whose y-axis shows counts (like a histogram).
I can do this in excel where I manually make the bins and the frequencies and make a bar histogram and then I can change the chart type to a line - but can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars using:
library(ggplot2)
df <- data.frame(value = rnorm(1000))  # example data
ggplot(df, aes(x = value)) +
  geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(df, aes(x = value)) +
  geom_freqpoly()
For more information, see the ggplot2 reference.
To adapt the example on the ?stat_density help page:
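library(ggplot2)
# The movies data set now lives in the ggplot2movies package
m <- ggplot(ggplot2movies::movies, aes(x = rating))
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts (after_stat() replaces the older ..count.. notation).
m + geom_density(aes(y = after_stat(count)))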
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
library(fitdistrplus)
noise <- 2
#
# the noise is tagged onto the end using runif
# just to demo issues w/ real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
                       meanlog = 0.25,
                       sdlog = 1) +
  (noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
#
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
fitlnorm <- fitdist(subset(noisylognorm,
                           noisylognorm > 0),
                    "lnorm")
fitlnorm_density <- density(rlnorm(10000,
                                   meanlog = fitlnorm$estimate[1],
                                   sdlog = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
            noisylognorm < 25),
     breaks = seq(-1, 25, 0.5),
     col = "lightblue",
     xlim = c(0, 25),
     xlab = "value",
     ylab = "frequency",
     main = paste0("Log Normal Distribution\n",
                   "noise = ", noise))
lines(fitlnorm_density$x,
      10000 * fitlnorm_density$y * 0.5,
      type = "l",
      col = "red")
Note the * 0.5 in the lines function. This accounts for the width of the hist() bars: the expected count in a bin is the number of observations times the density times the bin width, and the bins here are 0.5 wide.
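The bin-width scaling can be checked numerically. The sketch below swaps in a standard normal sample for simplicity (an assumption made for illustration, not the lognormal data above) and compares the observed bin counts with n × density × bin width.

```r
set.seed(1)
n <- 10000
binwidth <- 0.5
x <- rnorm(n)

# Bin the sample without drawing anything
h <- hist(x, breaks = seq(-6, 6, by = binwidth), plot = FALSE)

# Expected count per bin: n * density at the bin midpoint * bin width
expected <- n * dnorm(h$mids) * binwidth

# Observed and expected counts side by side for the central bins
round(cbind(mid = h$mids, observed = h$counts, expected = expected)[10:15, ], 1)
```

Without the bin-width factor, the expected column would be twice as large as the observed counts for 0.5-wide bins.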
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
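For continuous data, a count-based line can also be built by hand in base R, which mirrors the Excel workflow described in the question (make the bins, count, then draw a line). This is a minimal sketch; the choice of about 30 bins is arbitrary.

```r
set.seed(1)
data <- rnorm(1000)

# Bin the data without drawing bars; h$mids are bin centres, h$counts the counts
h <- hist(data, breaks = 30, plot = FALSE)

# Join the bin counts with a line instead of drawing bars
plot(h$mids, h$counts, type = "l", xlab = "Value", ylab = "Count")
```

This is essentially what geom_freqpoly() does in ggplot2, apart from differences in how the bins are chosen.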
