I am having difficulty successfully plotting a histogram using ggplot in R and would appreciate help on how to do this.
Some background: I have carried out a simulation in R that simulates the outbreak dynamics for an epidemic, and now I want to create a final size distribution plot over 10,000 epidemic simulations.
What I have done so far: I have simulated 10,000 outbreaks and in each of these cases I have found the number of the final size of the outbreak and saved these in f. From typeof(f) I get the answer double, a small overview of f is the following:
> tail(f)
[1] 4492 1 2 1 1 4497
I have then created a (correct) distribution plot over these with the help of the code below, but now instead want to create this using ggplot to get a nicer histogram.
h = hist(f)
h$density = h$counts/sum(h$counts)
plot(h,freq = FALSE,
ylim = c(0,1))
My attempt: I attempted to do this on my own via the following code but I don't get a correct result. I will post the images of these two plots below where the first one is the correct one, as you can se the y-values together add up to one which is correct, and the second one is what I get using ggplot, here the values on the y-axis is not correct. What can I do to create a graph like the first but with ggplot instead? I am guessing that this has something to do with that I set y to be the density and for some reason it doesn't quite match.
ggplot(data=NULL, aes(x = f)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white")
The images:
Your desired output does not have density on the y-axis, but percentages. Your ggplot has density on the y-axis, that's the default for histograms. To get the same results with ggplot you need to use geom_histogram(aes(y=..count../sum(..count..))
The base R function hist calculates the optimal number of bins used to plot the frequencies. The number can be re-used in ggplot like this:
library(ggplot2)
f <- c(4492, 1, 2, 1, 1, 4497)
h <- hist(f, freq = FALSE)
h$breaks
#> [1] 0 1000 2000 3000 4000 5000
ggplot(data = NULL, mapping = aes(x = f, y=..density..)) +
geom_histogram(bins = length(h$breaks) - 1)
Created on 2023-01-07 by the reprex package (v2.0.1)
Related
I Simulated 10 realisations with 200 observations from an MA(3) process and plot them. The code I used is the following and the result in the picture:
Now I want to estimate the autocorrelation function (ACF) for each realisation and plot those in the same plot as seen below:
I want to plot them like this by creating lines of different colours by joining the points ... not the classical ACF plot with the spikes. I tried some things but I cannot figure out how to do it. Any help please?
The code I used for the first part (I am working with R in RStudio):
sim1 <- replicate(10,arima.sim(model = list(ma=c(0.7,-0.7,-0.4), order=c(0,0,3)), n = 200))
matplot(sim1, lty=1, type="l", col=rainbow(11))
I don't know how to do this without using ggplot2, so here it is:
First, some adaptations. I decreased the number of simulations to 3, so it's easier to understand the graphs; I added an Time column to serve as the x-axis; I Created the ACF for each simulation with apply, and stored it in a new df; And i pivoted the data to long for use of ggplot:
sim1 <- cbind("Time"=1:200, replicate(3,arima.sim(model = list(ma=c(0.7,-0.7,-0.4), order=c(0,0,3)), n = 200)))
acfsim1 <- cbind("Lag"=1:200, apply(sim1[,2:4], 2, function(x){acf(x, 200, plot=FALSE)$acf}))
dfsim1 = tidyr::pivot_longer(as.data.frame(sim1), cols=2:4)
dfacfsim1 = tidyr::pivot_longer(as.data.frame(acfsim1), cols=2:4)
Now for the plots. First the one you already had:
library(ggplot2)
ggplot(dfsim1, aes(y=value, color=name, x=Time)) +
geom_line()
Output:
Now the graph that you want, first using a line graph:
ggplot(dfsim1, aes(y=value, color=name, x=Lag)) +
geom_segment(data=dfacfsim1)
Output:
Now with a segment graph:
ggplot(dfsim1, aes(y=value, color=name, x=Lag)) +
geom_segment(data=dfacfsim1, aes(xend=Lag, yend=0))
Output:
I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
data = data.frame(task_number = as.factor(c(replicate(100, 1),
replicate(100, 2))),
S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
group_by(task_number) %>%
# Use approxfun to interpolate the density back to
# the original points
mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course come up with a meausure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
I've seen many examples of a density plot but the density plot's y-axis is the probability. What I am looking for a is a line plot (like a density plot) but the y-axis should contain counts (like a histogram).
I can do this in excel where I manually make the bins and the frequencies and make a bar histogram and then I can change the chart type to a line - but can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars using:
ggplot(data) +
geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(data) +
geom_freqpoly()
For more info --
ggplot2 reference
To adapt the example on the ?stat_density help page:
m <- ggplot(movies, aes(x = rating))
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = ..count..))
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
noise <- 2
#
# the noise is tagged onto the end using runif
# just do demo issues w/real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
mean = 0.25,
sd = 1) +
(noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
#
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
fitlnorm <- fitdist(subset(noisylognorm,
noisylognorm > 0),
"lnorm")
fitlnorm_density <- density(rlnorm(10000,
mean = fitlnorm$estimate[1],
sd = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines function. As far as I can tell, this is necessary to account for the width of the hist() bars.
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
I have a couple of box and whisker plots in R. In both, the x-axis corresponds to one categorical variable whilst the grouping colours correspond to the other.
If I draw both plots with an untransformed y-axis, they are both fine. However, if I try to square-root transform the y-axis (using: coord_trans(y = "sqrt")), one of those graph is still fine whilst the other drops the lines corresponding to the median in most boxes (except those for which there are only two groups and where the boxes are therefore slightly wider, see "Numbers" 1 and 2 on the first plot). Further, for the graph that does not draw properly, if I reduce the number of categories on my x-axis (hence getting the boxes wider again), the median lines appear again.
Is this a bug with coord_trans (if so, how can I get around it) or a problem with my code?
Thank you very much for any suggestion.
library(car)
library(gplots)
library(plyr)
library(ggplot2)
library(gridExtra)
library(gdata)
Category=factor(c(rep(1, times =3240), rep(2, times =2160)),
labels=c("A","B"), levels=c(1,2))
ID=factor(rep(seq(from = 1, to = 45),each = 120))
Months=factor(rep(seq(from = 1, to = 3), each = 40, times = 45),
labels=c("Jan","Feb","Mar"),levels=c(1:3))
Obs=rnorm(5400, mean=25, sd=15)
Data=data.frame(Category,ID,Months,Obs)
Data=subset(Data, (Data$Category=="B") | !(Data$ID%in%c(1,2)) |
(Data$Months%in%c("Jan","Feb")))
for (j in 1:2)
{
sel=which(Data$Category==unique(levels(Data$Category))[j])
Observ=Data$Obs[sel]
Month=Data$Months[sel]
Number=droplevels(Data$ID[sel])
Number=droplevels(Number)
Data_used=data.frame(Number,Month,Observ)
plot1 = ggplot(Data_used, aes(Number, Observ)) +
geom_boxplot(aes(fill=Month, drop=FALSE), na.rm=TRUE) +
scale_y_continuous(breaks = c(0,20,40,60,80,100), limits=c(0,115)) +
coord_trans(y = "sqrt")
plot(plot1)
}
#Dennis is correct in his comment that scale_y_sqrt() will correct this. Because median and quartiles are order statistics it doesn't matter whether the data are transformed before or after calculating them.
I'd like to plot the frequency of a variable color coded for 2 factor levels for example blue bars should be the hist of level A and green the hist of level B both n the same graph? Is this possible with the hist command? The help of hist does not allow for a factor. Is there another way around?
I managed to do this by barplots manually but i want to ask if there is a more automatic method
Many thanks
EC
PS. I dont need density plots
Just in case the others haven't answered this is a way that satisfies. I had to deal with stacking histograms recently, and here's what I did:
data_sub <- subset(data, data$V1 == "Yes") #only samples that have V1 as "yes" in my dataset #are added to the subset
hist(data$HL)
hist(data_sub$HL, col="red", add=T)
Hopefully, this is what you meant?
It's rather unclear what you have as a data layout. A histogram requires that you have a variable that is ordinal or continuous so that breaks can be created. If you also have a separate grouping factor you can plot histograms conditional on that factor. A nice worked example of such a grouping and overlaying a density curve is offered in the second example on the help page for the histogram function in the lattice package.
A nice resource for learning relative merits of lattice and ggplot2 plotting is the Learning R blog. This is from the first of a multipart series on side-by=side comparison of the two plotting systems:
library(lattice)
library(ggplot2)
data(Chem97, package = "mlmRev")
#The lattice method:
pl <- histogram(~gcsescore | factor(score), data = Chem97)
print(pl)
# The ggplot method:
pg <- ggplot(Chem97, aes(gcsescore)) + geom_histogram(binwidth = 0.5) +
facet_wrap(~score)
print(pg)
I don't think you can do that easily with a bar histogram, as you would have to "interlace" the bars from both factor levels... It would need some kind of "discretization" of the now continuous x axis (i.e. it would have to be split in "categories" and in each category you would have 2 bars, for each factor level...
But it is quite easy and without problems if you are just fine with plotting the density line function:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
dx <- density(x)
dy <- density(y)
plot(dx, xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y),
type = "l", col = "red")
lines(dy, col = "blue")
It's very possible.
I didn't have data to work with but here's an example of a histogram with different colored bars. From here you'd need to use my code and figure out how to make it work for factors instead of tails.
BASIC SETUP
histogram <- hist(scale(vector)), breaks= , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2))
#EXAMPLE
x<-rnorm(1000)
histogram <- hist(scale(x), breaks=20 , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < 2, "red", "green"))
I agree with the others that a density plot is more useful than merging colored bars of a histogram, particularly if the group's values are intermixed. This would be very difficult and wouldn't really tell you much. You've got some great suggestions from others on density plots, here's my 2 cents for density plots that I sometimes use:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
DF <- data.frame("Group"=c(rep(c("y","x"), each=1000)), "Value"=c(y,x))
library(sm)
with(DF, sm.density.compare(Value, Group, xlab="Grouping"))
title(main="Comparative Density Graph")
legend(-9, .4, levels(DF$Group), fill=c("red", "darkgreen"))