How to compare two histograms in R? - r

I want to compare two histograms in a single plot in R, but I couldn't work out how to implement it.
My histograms are based on two sub-dataframes, which come from one dataset split according to type (Action, Adventure, Family).
My histograms are:
split_action <- split(df, df$type)
dataset_action <- split_action$Action
hist(dataset_action$year)
split_adventure <- split(df, df$type)
dataset_adventure <- split_adventure$Adventure
hist(dataset_adventure$year)
I want to see how much the two distributions overlap, comparing them by year in the same histogram. Thank you in advance.

Using the iris dataset, suppose you want to make a histogram of sepal length for each species. First, you can make one data frame per species by subsetting.
irissetosa<-subset(iris,Species=='setosa',select=c('Sepal.Length','Species'))
irisversi<-subset(iris,Species=='versicolor',select=c('Sepal.Length','Species'))
irisvirgin<-subset(iris,Species=='virginica',select=c('Sepal.Length','Species'))
Then make a histogram for each of these 3 data frames. Don't forget to set the argument add=TRUE for the second and third histograms, because you want to draw them on the same plot.
hist(irissetosa$Sepal.Length,col='red')
hist(irisversi$Sepal.Length,col='blue',add=TRUE)
hist(irisvirgin$Sepal.Length,col='green',add=TRUE)
You will get the three histograms drawn on the same axes, so you can see which parts overlap.
Admittedly, the result is not very readable, because the opaque bars hide each other.
Another way to see the overlap is to plot density estimates with the density() function.
plot(density(irissetosa$Sepal.Length),col='red')
lines(density(irisversi$Sepal.Length),col='blue')
lines(density(irisvirgin$Sepal.Length),col='green')
Then you will have something like this
Hope it helps!!
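The base-R overlay above tends to be easier to read when every histogram shares the same breaks and the fill colours are semi-transparent. Here is a minimal sketch of that idea applied to data shaped like the question's; the data frame below is a made-up stand-in (the real df is not shown in the question), and the 5-year bin width is an arbitrary choice.

```r
# Hypothetical stand-in for the question's df; only the column names
# `type` and `year` are taken from the question, the values are invented.
set.seed(42)
df <- data.frame(
  type = rep(c("Action", "Adventure"), each = 200),
  year = c(sample(1980:2020, 200, replace = TRUE),
           sample(1960:2010, 200, replace = TRUE))
)

# Shared breaks so that both histograms use identical bins
brks <- seq(1960, 2020, by = 5)
h_action    <- hist(df$year[df$type == "Action"],    breaks = brks, plot = FALSE)
h_adventure <- hist(df$year[df$type == "Adventure"], breaks = brks, plot = FALSE)

# Common y limit so neither histogram is clipped
y_max <- max(h_action$counts, h_adventure$counts)

# rgb(..., alpha) gives semi-transparent fills, so overlapping bars stay visible
plot(h_action, col = rgb(1, 0, 0, 0.4), ylim = c(0, y_max),
     main = "Year by type", xlab = "year")
plot(h_adventure, col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topleft", legend = c("Action", "Adventure"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)))
```

Computing the histograms with plot = FALSE first makes it possible to work out a common y limit before anything is drawn.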

You don't need to split the data if using ggplot. The key is to use transparency ("alpha") and change the value of the "position" argument to "identity" since the default is "stack".
Using the iris dataset:
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram(binwidth=0.2, alpha=0.5, position="identity") +
theme_minimal()
It's not easy to see the overlap, so a density plot may be a better choice if that's the main objective. Again, use transparency to avoid obscuring overlapping plots.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_density(alpha=0.5) +
xlim(3.9,8.5) +
theme_minimal()
So for your data, the command would be something like this:
ggplot(data=df, aes(x=year, fill=type)) +
geom_histogram(alpha=0.5, position="identity")
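If a number is wanted in addition to the picture, one possible way to quantify "how much overlap" is to bin both samples with the same breaks and sum the smaller of the two bin proportions. This is not something the answers above propose; it is only a sketch, overlap_share() is a hypothetical helper name, and the result depends on the chosen breaks.

```r
# Share of two samples that falls into common bins
# (0 = no shared bins, 1 = identical binned distributions).
# overlap_share() is a hypothetical helper, not part of any package.
overlap_share <- function(x, y, breaks) {
  px <- hist(x, breaks = breaks, plot = FALSE)$counts / length(x)
  py <- hist(y, breaks = breaks, plot = FALSE)$counts / length(y)
  sum(pmin(px, py))
}

brks      <- seq(4, 8, by = 0.5)   # covers the full range of Sepal.Length
setosa    <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

ov <- overlap_share(setosa, virginica, brks)
ov
```

For the question's data the same call would take the year values of the two types and a set of year breaks.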

Related

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" values (that is, the non-numeric values). Critically, you want to keep the binning of a histogram, so simply switching to a discrete scale and drawing an ordinary barplot is not an option.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to NA. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and one column: df_na$n = 20 in this case.
library(dplyr)
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
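The coercion step can be checked in isolation. The small vector below is invented for illustration; as.numeric() turns anything it cannot parse into NA and emits a warning, which is suppressed here because the NAs are intended.

```r
# Invented example values; ">10" cannot be parsed as a number
time_raw <- c("1", "2.5", ">10", "7", ">10")

# as.numeric() warns about the NAs it introduces; here they are expected
time_num <- suppressWarnings(as.numeric(time_raw))

is.na(time_num)               # marks the former ">10" entries
n_over <- sum(is.na(time_num))  # number of ">10" observations
n_over
```

One caveat: any other unparseable entry (a typo, an empty string) would also end up counted as ">10", so it may be worth checking the raw values first.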
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
library(ggplot2)
library(cowplot) # provides plot_grid(), used below
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
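To make the last point concrete, here is a small sketch of pulling the per-bin counts out of a histogram with ggplot_build(). The object names demo, p, built and bin_counts are only illustrative, and the data is a fresh uniform sample rather than the df used in this answer.

```r
library(ggplot2)

set.seed(123)
demo <- data.frame(time = runif(100, 0, 10))  # illustrative data only

p <- ggplot(demo, aes(x = time)) +
  geom_histogram(bins = 12)

# ggplot_build() computes the stats; $data holds one data frame per layer
built      <- ggplot_build(p)
bin_counts <- built$data[[1]]$count

max(bin_counts)  # tallest bar, usable as a y-axis upper limit
```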
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.
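A base-R variation on the same idea, offered as a sketch rather than a replacement: cut() with an open-ended last break puts everything above 10 into one labelled bin, and table() on the resulting factor keeps empty levels, so bars do not go missing when a value happens not to be sampled. The variable names are illustrative.

```r
set.seed(1)
x <- sample(1:12, 50, replace = TRUE)

# Intervals (0,1], (1,2], ..., (9,10], (10,Inf]; the last one is the overflow bin
binned <- cut(x, breaks = c(0:10, Inf), labels = c(1:10, ">10"))

# table() on a factor reports every level, including those with zero count
counts <- table(binned)
counts

barplot(counts, xlab = "time", ylab = "count")
```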

5 dimensional plot in r

I am trying to plot a 5 dimensional plot in R. I am currently using the rgl package to plot my data in 4 dimensions, using 3 variables as the x,y,z, coordinates, another variable as the color. I am wondering if I can add a fifth variable using this package, like for example the size or the shape of the points in the space. Here's an example of my data, and my current code:
set.seed(1)
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
colnames(df) <- c("var1","var2","var3","var4","var5")
require(rgl)
plot3d(df$var1, df$var2, df$var3, col=as.numeric(df$var4), size=0.5, type='s',xlab="var1",ylab="var2",zlab="var3")
I hope it is possible to do the 5th dimension.
Many thanks,
Here is a ggplot2 option. I usually shy away from 3D plots as they are hard to interpret properly. I also almost never put in 5 continuous variables in the same plot as I have here...
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12))
While this is a bit messy, you can actually reasonably read all 5 dimensions for most points.
A better approach to multi-dimensional plotting opens up if some of your variables are categorical. If all your variables are continuous, you can turn some of them to categorical with cut and then use facet_wrap or facet_grid to plot those.
For example, here I break up var3 and var4 into quintiles and use facet_grid on them. Note that I also keep the color aesthetics as well to highlight that most of the time turning a continuous variable to categorical in high dimensional plots is good enough to get the key points across (here you'll notice that the fill and border colors are pretty uniform within any given grid cell):
df$var4.cat <- cut(df$var4, quantile(df$var4, (0:5)/5), include.lowest=T)
df$var3.cat <- cut(df$var3, quantile(df$var3, (0:5)/5), include.lowest=T)
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12)) +
facet_grid(var3.cat ~ var4.cat)
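The quintile binning used for the facets can be inspected on its own. The sketch below repeats the cut()/quantile() step on a single made-up vector (v and v_cat are illustrative names) and tabulates how many points land in each bin.

```r
set.seed(1)
v <- sample(1:200, 1000, replace = TRUE)  # same shape as one column of df

# Six quantile cut points define five bins
q <- quantile(v, (0:5) / 5)

# include.lowest = TRUE keeps the minimum value inside the first bin
# instead of turning it into NA
v_cat <- cut(v, q, include.lowest = TRUE)

table(v_cat)  # roughly 200 points per bin
```

Without include.lowest = TRUE the smallest observation would fall outside the first interval and be dropped from the faceted plot.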

normalizing ggplot2 densities with facet_wrap in R

I am making a series of density plots with geom_density from a dataframe, and showing it by condition using facet_wrap, as in:
ggplot(iris) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
When I do this, the y-axis scale seems to not represent percent of each Species in a panel, but rather the percent of all the total datapoints across all species.
My question is: How can I make it so the ..count.. variable in geom_density refers to the count of items in each Species set of each panel, so that the panel for virginica has a y-axis corresponding to "Fraction of virginica data points"?
Also, is there a way to get ggplot2 to output the values it uses for ..count.. and sum(..count..) so that I can verify what numbers it is using?
edit: I misunderstood geom_density; it looks like even for a single Species, ..count../sum(..count..) is not a percentage:
ggplot(iris[iris$Species == 'virginica',]) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
so my revised question: how can I get the density plot to be the fraction of data in each bin? Do I have to use stat_density for this or geom_histogram? I just want the y-axis to be percentage / fraction of data points
Unfortunately, what you are asking ggplot2 to do is define separate y's for each facet, which it syntactically cannot do AFAIK.
So, in response to your mentioning in the comment thread that you "just want a histogram fundamentally", I would suggest instead using geom_histogram or, if you're partial to lines instead of bars, geom_freqpoly:
ggplot(iris, aes(Sepal.Width, ..count..)) +
geom_histogram(aes(colour=Species, fill=Species), binwidth=.2) +
geom_freqpoly(colour="black", binwidth=.2) +
facet_wrap(~Species)
Note: geom_freqpoly works just as well in place of geom_histogram in my above example. I just added both in one plot for sake of efficiency.
Hope this helps.
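For the histogram route, a possible way to put per-species fractions on the y-axis is sketched below. It assumes ggplot2 3.3.0 or later, where after_stat() is available, and relies on the density computed by stat_bin being normalised within each group, so that density times bin width is the fraction of that group's points in the bin.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Width)) +
  # density * width = fraction of the panel's data points in each bin
  geom_histogram(aes(y = after_stat(density * width), fill = Species),
                 binwidth = 0.2) +
  facet_wrap(~Species) +
  ylab("Fraction of data points in each species")

# Verify: bar heights should sum to 1 within every panel
bars       <- ggplot_build(p)$data[[1]]
panel_sums <- tapply(bars$y, bars$PANEL, sum)
panel_sums

p
```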
EDIT: Alright, I managed to work out a quick-and-dirty way of getting what you want. It requires that you install and load plyr. Apologies in advance; this is likely not the most efficient way to do this in terms of RAM usage, but it works.
First, let's get iris out in the open (I use RStudio so I'm used to seeing all my objects in a window):
d <- iris
Now, we can use ddply to count the number of individuals belonging to each unique measurement of what will become your x-axis (here I used Sepal.Length instead of Sepal.Width, to give myself a bit more range, simply for seeing a bigger difference between groups when plotted).
library(plyr) # provides ddply()
new <- ddply(d, c("Species", "Sepal.Length"), summarize, count=length(Sepal.Length))
Note that ddply automatically sorts the output data.frame according to the quoted variables.
Then we can divvy up the data.frame into each of its unique conditions--in the case of iris, each of the three species (I'm sure there's a much smoother way to go about this, and if you're working with really large amounts of data it's not advisable to keep creating subsets of the same data.frame because you could max out your RAM)...
set <- new[which(new$Species%in%"setosa"),]
ver <- new[which(new$Species%in%"versicolor"),]
vgn <- new[which(new$Species%in%"virginica"),]
... and use ddply again to calculate proportions of individuals falling under each measurement, but separately for each species.
prop <- rbind(ddply(set, c("Species"), summarize, prop=set$count/sum(set$count)),
ddply(ver, c("Species"), summarize, prop=ver$count/sum(ver$count)),
ddply(vgn, c("Species"), summarize, prop=vgn$count/sum(vgn$count)))
Then we just put everything we need into one dataset and remove all the junk from our workspace.
new$prop <- prop$prop
rm(list=ls()[which(!ls()%in%c("new", "d"))])
And we can make our figure with facet-specific proportions on the y. Note that I'm now using geom_line since ddply has automatically ordered your data.frame.
ggplot(new, aes(Sepal.Length, prop)) +
geom_line(aes(colour=Species)) +
facet_wrap(~Species)
# let's check our work. each should equal 50
sum(new$count[which(new$Species%in%"setosa")])
sum(new$count[which(new$Species%in%"versicolor")])
sum(new$count[which(new$Species%in%"virginica")])
#... and each of these should equal 1
sum(new$prop[which(new$Species%in%"setosa")])
sum(new$prop[which(new$Species%in%"versicolor")])
sum(new$prop[which(new$Species%in%"virginica")])
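The same per-species proportions can also be computed without plyr or manual subsetting. The sketch below uses only base R (table() and ave()); the object names counts and check are illustrative, and it is meant as an alternative to, not a description of, the steps above.

```r
# Count how often each Sepal.Length value occurs within each species
counts <- as.data.frame(table(Species      = iris$Species,
                              Sepal.Length = iris$Sepal.Length))
counts <- counts[counts$Freq > 0, ]  # drop combinations that never occur

# ave() applies the function within each species and returns a vector
# aligned with the original rows
counts$prop <- ave(as.numeric(counts$Freq), counts$Species,
                   FUN = function(n) n / sum(n))

# Each species' proportions should sum to 1
check <- tapply(counts$prop, counts$Species, sum)
check
```

Note that as.data.frame(table(...)) stores Sepal.Length as a factor, so it would need converting back with as.numeric(as.character(...)) before being used on a continuous x-axis.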
Maybe using table() and barplot() you might be able to get what you need. I'm still not sure if this is what you are after...
barplot(table(iris[iris$Species == 'virginica',1]))
With ggplot2
tb <- table(iris[iris$Species == 'virginica',1])
tb <- as.data.frame(tb)
ggplot(tb, aes(x=Var1, y=Freq)) + geom_bar(stat="identity")
Passing the argument scales='free_y' to facet_wrap() should do the trick.

ggplot2: Is there a way to overlay a single plot to all facets in a ggplot

I would like to use ggplot and faceting to construct a series of density plots grouped by a factor. Additionally, I would like to layer another density plot on each of the facets that is not subject to the constraints imposed by the facet.
For example, the faceted plot would look like this:
require(ggplot2)
ggplot(diamonds, aes(price)) + facet_grid(.~clarity) + geom_density()
and then I would like to have the following single density plot layered on top of each of the facets:
ggplot(diamonds, aes(price)) + geom_density()
Furthermore, is ggplot with faceting the best way to do this, or is there a preferred method?
One way to achieve this is to make a new data frame, diamonds2, that contains just the price column, and then use two geom_density() calls: one that uses the original diamonds and a second that uses diamonds2. Because diamonds2 has no clarity column, all of its values are used in every facet.
diamonds2<-diamonds["price"]
ggplot(diamonds, aes(price)) + geom_density()+facet_grid(.~clarity) +
geom_density(data=diamonds2,aes(price),colour="blue")
UPDATE - as suggested by @BrianDiggs, the same result can be achieved without making a new data frame, by transforming the data inside geom_density().
ggplot(diamonds, aes(price)) + geom_density()+facet_grid(.~clarity) +
geom_density(data=transform(diamonds, clarity=NULL),aes(price),colour="blue")
Another approach would be to plot data without faceting. Add two calls to geom_density() - in one add aes(color=clarity) to have density lines in different colors for each level of clarity and leave empty second geom_density() - that will add overall black density line.
ggplot(diamonds,aes(price))+geom_density(aes(color=clarity))+geom_density()
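The reason the overlay appears in every facet can be verified on a smaller dataset. The sketch below repeats the "drop the faceting column" approach on iris (chosen here only because it is small; iris_all, p and overlay are illustrative names) and inspects the built plot to confirm that the second layer is drawn in all panels.

```r
library(ggplot2)

# A copy of the data without the faceting variable
iris_all <- iris[, "Sepal.Length", drop = FALSE]

p <- ggplot(iris, aes(Sepal.Length)) +
  geom_density() +                                # one curve per facet
  facet_grid(. ~ Species) +
  geom_density(data = iris_all, colour = "blue")  # overall curve, repeated

# Layer 2 is the overlay; it should have rows for every panel
overlay <- ggplot_build(p)$data[[2]]
table(overlay$PANEL)

p
```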

R: Plot multiple box plots using columns from data frame

I would like to plot an INDIVIDUAL box plot for each unrelated column in a data frame. I thought I was on the right track with boxplot.matrix from the sfsmisc package, but it seems to do the same as boxplot(as.matrix(plotdata)), which is to plot everything in a shared boxplot with a shared scale on the axis. I want (say) 5 individual plots.
I could do this by hand like:
par(mfrow=c(2,2))
boxplot(data$var1)
boxplot(data$var2)
boxplot(data$var3)
boxplot(data$var4)
But there must be a way to use the data frame columns?
EDIT: I used iterations, see my answer.
You could use the reshape package to simplify things
data <- data.frame(v1=rnorm(100),v2=rnorm(100),v3=rnorm(100), v4=rnorm(100))
library(reshape)
meltData <- melt(data)
boxplot(data=meltData, value~variable)
or even then use ggplot2 package to make things nicer
library(ggplot2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales="free")
From ?boxplot we see that we have the option to pass multiple vectors of data as elements of a list, and we will get multiple boxplots, one for each vector in our list.
So all we need to do is convert the columns of our matrix to a list:
m <- matrix(1:25,5,5)
boxplot(x = as.list(as.data.frame(m)))
If you really want separate panels each with a single boxplot (although, frankly, I don't see why you would want to do that), I would instead turn to ggplot and faceting:
library(reshape) # provides melt()
m1 <- melt(as.data.frame(m))
library(ggplot2)
ggplot(m1,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
I used iteration to do this. I think perhaps I wasn't clear in the original question. Thanks for the responses nonetheless.
par(mfrow=c(2,5))
for (i in seq_along(plotdata)) {
boxplot(plotdata[,i], main=names(plotdata)[i])
}
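A self-contained variant of the loop, for anyone who wants to try it without the original plotdata: the data frame below is invented, with columns on deliberately different scales, and the layout is derived from the number of columns. The previous par() settings are saved and restored so later plots are not affected.

```r
# Invented data with unrelated scales (the real plotdata is not shown)
set.seed(1)
plotdata <- data.frame(a = rnorm(50),
                       b = runif(50, 0, 100),
                       c = rexp(50))

# One row of panels, one panel per column; par() returns the old settings
old_par <- par(mfrow = c(1, ncol(plotdata)))

# boxplot() returns its summary statistics invisibly; keep them for inspection
stats <- lapply(names(plotdata), function(nm) {
  boxplot(plotdata[[nm]], main = nm)
})

par(old_par)  # restore the previous layout
```

Because each boxplot() call draws into its own panel, every column gets its own y-axis scale.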

Resources