I've seen many examples of a density plot, but the density plot's y-axis is the probability density. What I am looking for is a line plot (like a density plot) whose y-axis contains counts (like a histogram).
I can do this in Excel, where I manually make the bins and the frequencies, make a bar histogram, and then change the chart type to a line, but I can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars using:
# 'data' is a data frame; 'x' is a placeholder for its numeric column
ggplot(data, aes(x = x)) +
  geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
# 'data' is a data frame; 'x' is a placeholder for its numeric column
ggplot(data, aes(x = x)) +
  geom_freqpoly()
For more info --
ggplot2 reference
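For readers who prefer base graphics, the same frequency-polygon idea can be sketched without ggplot2. This is only an illustration on simulated data: the variable names, the seed and the choice of roughly 30 breaks are assumptions, not part of the original answer.

```r
# Frequency polygon in base R: bin with hist(), then join the bin counts with a line.
set.seed(1)                      # illustrative data, not from the question
x <- rnorm(1000)

# plot = FALSE computes the bins without drawing any bars
h <- hist(x, breaks = 30, plot = FALSE)

# h$mids are the bin midpoints, h$counts the number of observations per bin
plot(h$mids, h$counts, type = "l",
     xlab = "Value", ylab = "Count")
```

Because the y-values are the raw bin counts, the axis reads exactly like a histogram's, only drawn as a line.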
To adapt the example on the ?stat_density help page:
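library(ggplot2)
library(ggplot2movies)  # the movies data set moved to this package in ggplot2 2.0.0
m <- ggplot(movies, aes(x = rating))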
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = after_stat(count)))  # written as ..count.. before ggplot2 3.3.0
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
noise <- 2
#
# the noise is tagged onto the end using runif
# just to demo issues w/real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
                       meanlog = 0.25,
                       sdlog = 1) +
  (noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
#
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
library(fitdistrplus)
fitlnorm <- fitdist(subset(noisylognorm,
                           noisylognorm > 0),
                    "lnorm")
fitlnorm_density <- density(rlnorm(10000,
                                   meanlog = fitlnorm$estimate[1],
                                   sdlog = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines() call. It accounts for the width of the hist() bars: the expected count in a bin is roughly (number of observations) × density × bin width, and the breaks here are 0.5 apart.
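To avoid hard-coding that factor, the bin width can be read back from the object that hist() returns. The following is a minimal sketch on freshly simulated log-normal data (the data, seed and variable names are assumptions for illustration; it does not reuse the fitted model above).

```r
# Scale a density curve to histogram counts using the histogram's own bin width.
set.seed(1)
x <- rlnorm(10000, meanlog = 0.25, sdlog = 1)
x <- x[x < 25]                          # keep the plotted range only

h <- hist(x, breaks = seq(0, 25, 0.5),
          col = "lightblue", xlab = "value", ylab = "frequency",
          main = "Histogram with count-scaled density")

bin_width <- diff(h$breaks)[1]          # 0.5 here, taken from the breaks
d <- density(x)

# expected count per bin is about n * bin width * density
lines(d$x, length(x) * bin_width * d$y, col = "red")
```

If the breaks are changed, the red curve rescales automatically because bin_width is derived from h$breaks.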
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
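If a connected line is wanted instead of the vertical spikes that plot() draws for a table, the tabulated counts can be plotted directly. This is a small sketch without magrittr; the seed and the intermediate name tab are assumptions for illustration.

```r
# Count data drawn as a connected line rather than spikes.
set.seed(1)
my.count.data <- rpois(n = 10000, lambda = 3)

tab <- table(my.count.data)             # one count per distinct value

# names(tab) holds the observed values as strings, so convert them back to integers
plot(as.integer(names(tab)), as.vector(tab), type = "b",
     xlab = "Value", ylab = "Count")
```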
Related
I am having difficulty successfully plotting a histogram using ggplot in R and would appreciate help on how to do this.
Some background: I have carried out a simulation in R that simulates the outbreak dynamics for an epidemic, and now I want to create a final size distribution plot over 10,000 epidemic simulations.
What I have done so far: I have simulated 10,000 outbreaks, found the final size of each outbreak, and saved these values in f. typeof(f) returns double; a small overview of f is the following:
> tail(f)
[1] 4492 1 2 1 1 4497
I have then created a (correct) distribution plot over these with the help of the code below, but now instead want to create this using ggplot to get a nicer histogram.
h = hist(f)
h$density = h$counts/sum(h$counts)
plot(h,freq = FALSE,
ylim = c(0,1))
My attempt: I attempted to do this on my own via the following code, but I don't get a correct result. I will post the images of these two plots below. The first one is the correct one: as you can see, the y-values add up to one. The second one is what I get using ggplot, and here the values on the y-axis are not correct. What can I do to create a graph like the first but with ggplot instead? I am guessing that this has something to do with setting y to be the density, which for some reason doesn't quite match.
ggplot(data=NULL, aes(x = f)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white")
The images:
Your desired output has proportions, not density, on the y-axis. In your ggplot you mapped y to ..density.., which divides each proportion by the bin width, so the bar heights do not sum to one. To get the same result as the base plot, use geom_histogram(aes(y = ..count../sum(..count..))).
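As a quick check of that suggestion, here is a sketch using the six values shown in the question's tail(f) output. It uses after_stat(), the current spelling of the ..count.. notation (ggplot2 3.3.0 or later); the choice of bins = 5 and the name bar_heights are assumptions for illustration.

```r
library(ggplot2)

# the six values printed by tail(f) in the question
f <- c(4492, 1, 2, 1, 1, 4497)

p <- ggplot(data.frame(f = f), aes(x = f)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 bins = 5, colour = 1, fill = "white") +
  ylab("Proportion")
print(p)

# the heights of the drawn bars; as proportions they should add up to one
bar_heights <- layer_data(p, 1)$y
sum(bar_heights)
```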
The base R function hist chooses its breaks automatically (by default with Sturges' rule). The resulting number of bins can be reused in ggplot like this:
library(ggplot2)
f <- c(4492, 1, 2, 1, 1, 4497)
h <- hist(f, freq = FALSE)
h$breaks
#> [1] 0 1000 2000 3000 4000 5000
ggplot(data = NULL, mapping = aes(x = f, y=..density..)) +
geom_histogram(bins = length(h$breaks) - 1)
Created on 2023-01-07 by the reprex package (v2.0.1)
I want to draw a plot with a cumulative distribution function (CDF) and a probability density function (PDF) that share the x-axis and have their respective y-axis scales on the left and right sides. I tried using sec_axis() in ggplot2, but PR only shows one line, and the y-axis of the other line is invalid. What should I do?
new_data <- rnorm(n=100,mean=8000,sd=1000)
new_data <- as.data.frame(new_data)
names(new_data) <- c("CADP")
m1 <- ggplot(new_data,aes(x=CADP))+geom_density()
p1 <- ggplot(new_data, aes(x =CADP))+stat_ecdf(colour="red")+
labs(
x="CADP")
PR <- p1+scale_y_continuous(expand=c(0,0),limits=c(0,1),
sec.axis = sec_axis(~.*4e-04,breaks=seq(0,4e-04,1e-04)))+
geom_density(colour="blue")
The two lines produced by stat_ecdf and geom_density are on vastly different scales. The cdf's peak value is about 2500 times the peak value of the pdf. If you want to see both lines clearly on the same plot, you need to apply a transformation to one of them, either dividing the cdf by about 2500 or multiplying the pdf by about 2500. You need to do this even if you have a secondary axis.
Remember that a secondary axis is just an inert annotation that gets stuck onto the side of the plot: it doesn't change the size, scale or shape of any of the objects you have plotted. The secondary axis is labelled in such a way that it allows you to 'pretend' that some of your data is on a different scale. The way this works is that you apply the transformation to your data to make it fit on your plot, and pass the inverse transformation as a function to sec_axis.
Although you can apply a transformation to the output of stat_ecdf, it is just as easy to create the transformed cdf yourself and plot it with geom_step:
library(ggplot2)
new_data <- data.frame(CADP = rnorm(100, 8000, 1000))
ggplot(new_data,aes(x = CADP)) +
geom_density(colour = 'blue') +
geom_step(aes(x = sort(CADP),
y = 0.0004 * seq_along(CADP)/nrow(new_data))) +
scale_y_continuous(name = 'PDF',
sec.axis = sec_axis(~.x / 0.0004, name = 'CDF'))
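The factor 0.0004 above is tied to this particular mean and standard deviation. One possible variation, sketched here under the assumption that the peak of a kernel density estimate is a reasonable scale, derives the factor from the data. The names k and cdf_df are hypothetical.

```r
library(ggplot2)

set.seed(1)
new_data <- data.frame(CADP = rnorm(100, 8000, 1000))

# scale factor: the peak of the estimated pdf, so the cdf tops out at the same height
k <- max(density(new_data$CADP)$y)

# empirical cdf computed by hand, as in the answer above
cdf_df <- data.frame(CADP = sort(new_data$CADP),
                     cdf  = seq_len(nrow(new_data)) / nrow(new_data))

p <- ggplot(new_data, aes(x = CADP)) +
  geom_density(colour = "blue") +
  geom_step(data = cdf_df, aes(y = cdf * k), colour = "red") +   # transformed cdf
  scale_y_continuous(name = "PDF",
                     sec.axis = sec_axis(~ . / k, name = "CDF")) # inverse transform
print(p)
```

The cdf is multiplied by k to fit on the pdf's scale, and the secondary axis divides by k so that its labels run from 0 to 1.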
I am just starting out in R. I want to plot interval of times (the distribution is exponential) on x axis, with a tick mark in place every time the interval ends. If I have a string of times say (0.2, 0.8, 0.9 , 1.0) then the tick marks on the x axis would be on 0.2, 0.8, 0.9 and 1 respectively. With big data samples, I want my graph to look something like:
So after using,
set.seed(1)
x <- rexp(50, 0.2)
How can I go about it further, might I have to use rug function (which I am trying to learn how to use)? Can I also put time stamps on this graph?
Edit
So I have modified my command and used:
x <- c(cumsum(rexp(50, 0.2)))
y <- rep(0, length(x))
plot(x,y)
rug(x)
and I have been able to get this:
This result does the work, if it's just about that. However, is there a line of command I can use to edit this result as shown in the second picture, and get a result as shown in the first picture? I would like to just get these tick marks on a horizontal line instead of the whole plot. Or it's not possible?
With ggplot you can use geom_rug() to add a rug to the plot. First the data need to be made into a data.frame.
library("tidyverse")
set.seed(1)
x <- rexp(50, 0.2)
ggplot(data.frame(x), aes(x = x)) + geom_rug()
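The rug is rather short: by default its length is a small proportion of the plot height. Recent versions of ggplot2 let you change it with the length argument, e.g. geom_rug(length = unit(0.1, "npc")).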
The opposite would be to use geom_vline which will give lines the full length of the y-axis
#ggplot(data.frame(x), aes(xintercept = x)) + geom_vline() #doesn't work
ggplot(data.frame(x)) + geom_vline(aes(xintercept = x))
rug() requires only a vector of values that describes where to draw the tickmarks (rugs). In case of plotting values x on the x-axis, those form the input data for the rug function. Type ?rug to get further help.
# generate y values
y <- rexp(50, 0.2)
# split plotting area into two columns - optional
par(mfrow = c(1, 2))
plot(x, y)
rug(x)
# plot with both axes in log scale to show that rug adjusts to axes scale
plot(x, y, log = "xy")
rug(x)
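For the follow-up in the question's edit (tick marks on a single horizontal line rather than a full plot), one base R possibility is to draw an empty plot without axes and add the line and ticks by hand. This is only a sketch; the tick height of 0.1 and the name event_times are arbitrary choices for illustration.

```r
# Event times drawn as tick marks on a single horizontal line.
set.seed(1)
event_times <- cumsum(rexp(50, 0.2))     # cumulative sums turn intervals into times

# empty canvas: no points, no box, no default axes
plot(range(event_times), c(-1, 1), type = "n",
     axes = FALSE, xlab = "Time", ylab = "")

abline(h = 0)                                       # the time line
segments(event_times, -0.1, event_times, 0.1)       # one tick per event
axis(1)                                             # time stamps along the bottom
```

Labels for individual events could be added with text() at the same x positions, at the cost of clutter when events are close together.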
I've written an R script that loops through a data.frame, making multiple complex plots that include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits so that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in seq_along(cols)) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
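# second-highest bin count within each column (plot = FALSE avoids drawing)
secondPeaks <- sapply(df, function(i) sort(hist(i, plot = FALSE)$counts, decreasing = TRUE)[2])
# use the value for the column being plotted as the upper y-limit, e.g.:
myYlim <- secondPeaks["X1"]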
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out:
cut will bin the data (not really needed with integer data like you have, but probably needed with real data)
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
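The steps above can also be applied to the bins ggplot2 itself computes, which avoids any mismatch between cut() and geom_histogram(). The sketch below simulates one column in the style of the question; using layer_data() to read the counts and coord_cartesian() to zoom are suggestions, not part of the original answers.

```r
library(ggplot2)

set.seed(5)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE,
                             prob = c(0.8, rep(0.01, 9))))

p1 <- ggplot(df, aes(x = X1)) + geom_histogram(bins = 10)

# bin counts exactly as ggplot2 computed them for this layer
counts <- layer_data(p1, 1)$count
second_peak <- sort(counts, decreasing = TRUE)[2]

# zoom the y-axis to the second-highest bar, with 10% headroom
p2 <- p1 + coord_cartesian(ylim = c(0, second_peak * 1.1))
print(p2)
```

Unlike scale_y_continuous(limits = ...), coord_cartesian() only zooms, so the tallest bar is clipped at the top of the panel instead of being dropped from the plot.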
I'd like to plot the frequency of a variable, color coded for 2 factor levels. For example, blue bars should be the hist of level A and green the hist of level B, both in the same graph. Is this possible with the hist command? The help of hist does not allow for a factor. Is there another way around?
I managed to do this manually with barplots, but I want to ask if there is a more automatic method.
Many thanks
EC
PS. I don't need density plots
In case the other answers haven't covered it, here is one way that works. I had to deal with stacking histograms recently, and here's what I did:
# only samples that have V1 as "Yes" in my dataset are added to the subset
data_sub <- subset(data, V1 == "Yes")
hist(data$HL)
hist(data_sub$HL, col = "red", add = TRUE)
Hopefully, this is what you meant?
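A variation on this idea, sketched here on simulated data with two hypothetical levels A and B, bins both groups on shared breaks so the bars line up. It then either overlays them with semi-transparent colours or places them side by side with barplot(). All names and colours are illustrative assumptions.

```r
# Histograms of one variable for two factor levels, using shared breaks.
set.seed(1)
d <- data.frame(
  value = c(rnorm(200, mean = 0), rnorm(200, mean = 2)),
  group = factor(rep(c("A", "B"), each = 200))
)

# common breaks covering the whole range, so both histograms use the same bins
breaks <- pretty(range(d$value), n = 20)
hA <- hist(d$value[d$group == "A"], breaks = breaks, plot = FALSE)
hB <- hist(d$value[d$group == "B"], breaks = breaks, plot = FALSE)

# option 1: overlay with semi-transparent fills
colA <- rgb(0, 0, 1, 0.5)
colB <- rgb(0, 1, 0, 0.5)
plot(hA, col = colA, ylim = c(0, max(hA$counts, hB$counts)),
     main = "Histogram by group", xlab = "Value")
plot(hB, col = colB, add = TRUE)
legend("topright", legend = levels(d$group), fill = c(colA, colB))

# option 2: side-by-side bars per bin
counts <- rbind(A = hA$counts, B = hB$counts)
barplot(counts, beside = TRUE, names.arg = hA$mids,
        col = c("blue", "green"), legend.text = TRUE,
        xlab = "Bin midpoint", ylab = "Count")
```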
It's rather unclear what you have as a data layout. A histogram requires that you have a variable that is ordinal or continuous so that breaks can be created. If you also have a separate grouping factor you can plot histograms conditional on that factor. A nice worked example of such a grouping and overlaying a density curve is offered in the second example on the help page for the histogram function in the lattice package.
A nice resource for learning the relative merits of lattice and ggplot2 plotting is the Learning R blog. This is from the first of a multipart series on side-by-side comparison of the two plotting systems:
library(lattice)
library(ggplot2)
data(Chem97, package = "mlmRev")
#The lattice method:
pl <- histogram(~gcsescore | factor(score), data = Chem97)
print(pl)
# The ggplot method:
pg <- ggplot(Chem97, aes(gcsescore)) + geom_histogram(binwidth = 0.5) +
facet_wrap(~score)
print(pg)
I don't think you can do that easily with a bar histogram, as you would have to "interlace" the bars from both factor levels. That would need some kind of "discretization" of the continuous x axis: it would have to be split into categories, and each category would hold two bars, one per factor level.
But it is quite easy and without problems if you are just fine with plotting the density line function:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
dx <- density(x)
dy <- density(y)
plot(dx, xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y),
type = "l", col = "red")
lines(dy, col = "blue")
It's very possible.
I didn't have data to work with but here's an example of a histogram with different colored bars. From here you'd need to use my code and figure out how to make it work for factors instead of tails.
BASIC SETUP
# Template only; fill in the parts in angle brackets:
# histogram <- hist(scale(<vector>), breaks = <number of breaks>, plot = FALSE)
# plot(histogram, col = ifelse(abs(histogram$breaks) < <number of SDs>, <color 1>, <color 2>))
EXAMPLE
x<-rnorm(1000)
histogram <- hist(scale(x), breaks=20 , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < 2, "red", "green"))
I agree with the others that a density plot is more useful than merging colored bars of a histogram, particularly if the group's values are intermixed. This would be very difficult and wouldn't really tell you much. You've got some great suggestions from others on density plots, here's my 2 cents for density plots that I sometimes use:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
# Group is made a factor explicitly so that levels(DF$Group) works in R >= 4.0
DF <- data.frame("Group" = factor(rep(c("y", "x"), each = 1000)), "Value" = c(y, x))
library(sm)
with(DF, sm.density.compare(Value, Group, xlab="Grouping"))
title(main="Comparative Density Graph")
legend(-9, .4, levels(DF$Group), fill=c("red", "darkgreen"))