Modify Histogram to display only a fraction of total data - r

I hope I have not overlooked an answer to this question:
I want to use ggplot to make a histogram of only a fraction of the total data. Here's my example:
df<-iris
ggplot(data=df, aes(x=Sepal.Length, y=..density..*100)) +
geom_histogram(binwidth=0.1) +
ylab("percent")
This gives a histogram of all rows.
Now I want to limit the data passed to the plot to, for instance, rows with a Petal.Width of 0.2. The histogram I am after should therefore show only the ratio "count with Petal.Width=0.2 divided by total count".
Thanks for helping a ggplot rookie! With base plot I managed a workaround, but I failed here.

I think what you want to do is to subset the data you're calling in the plot:
ggplot(data=df[df$Petal.Width == 0.2,], aes(x=Sepal.Length, y=..density..*100)) +
geom_histogram(binwidth=0.1) +
ylab("percent")
Some other ways to subset data using ggplot are described in this post: Subset and ggplot2
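Note that subsetting alone rescales the bars to the subset: ..density.. is computed only from the rows that reach the plot, so the total count never enters. If the bars should instead read as a percentage of all rows, one possible sketch is to divide the bin counts by the full row count yourself. It assumes ggplot2 3.3.0 or later for after_stat(); the names total_n and sub are made up for illustration.

```r
library(ggplot2)

df <- iris
total_n <- nrow(df)                  # denominator: all rows, not just the subset
sub <- df[df$Petal.Width == 0.2, ]   # rows that should appear in the histogram

# Bar height = bin count within the subset / total row count, as a percentage
p <- ggplot(sub, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(count) / total_n * 100), binwidth = 0.1) +
  ylab("percent of all rows")
print(p)
```

With this scaling the bar heights add up to 100 * nrow(sub) / total_n rather than to 100.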

Related

How to compare two histograms in R?

I want to compare two histograms in one graph in R, but I couldn't work out how to implement it.
My histograms are based on two sub-dataframes, which are split from the dataset according to a type (Action, Adventure Family).
My histograms are:
split_action <- split(df, df$type)
dataset_action <- split_action$Action
hist(dataset_action$year)
split_adventure <- split(df, df$type)
dataset_adventure <- split_adventure$Adventure
hist(dataset_adventure$year)
I want to see how much they overlap, comparing them by year in the same histogram. Thank you in advance.
Using the iris dataset, suppose you want to make a histogram of sepal length for each species. First, you can make one data frame per species by subsetting.
irissetosa<-subset(iris,Species=='setosa',select=c('Sepal.Length','Species'))
irisversi<-subset(iris,Species=='versicolor',select=c('Sepal.Length','Species'))
irisvirgin<-subset(iris,Species=='virginica',select=c('Sepal.Length','Species'))
and then, make the histogram for these 3 data frames. Don't forget to set the argument "add" as TRUE (for the second and third histogram), because you want to combine the histograms.
hist(irissetosa$Sepal.Length,col='red')
hist(irisversi$Sepal.Length,col='blue',add=TRUE)
hist(irisvirgin$Sepal.Length,col='green',add=TRUE)
You will get something like this.
Then you can see which parts overlap.
I know the result is not very good, though.
Another way to see which parts overlap is to use the density function.
plot(density(irissetosa$Sepal.Length),col='red')
lines(density(irisversi$Sepal.Length),col='blue')
lines(density(irisvirgin$Sepal.Length),col='green')
Then you will have something like this
Hope it helps!!
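One reason the overlaid base histograms above can be hard to read is that each hist() call picks its own breaks, and the axes are fixed by the first plot. A possible sketch, assuming a shared bin width of 0.25 (an arbitrary choice) and semi-transparent fill colours, computes all three histograms on common breaks before drawing them:

```r
# Common breaks that span the full range of the variable
x_all <- iris$Sepal.Length
brks <- seq(floor(min(x_all)), ceiling(max(x_all)), by = 0.25)

# One histogram object per species, computed without plotting
h <- lapply(split(iris$Sepal.Length, iris$Species), hist,
            breaks = brks, plot = FALSE)

# Shared y-limit so no histogram is clipped
ymax <- max(sapply(h, function(z) max(z$counts)))

# Semi-transparent red, blue, green so overlapping bars stay visible
cols <- c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4), rgb(0, 1, 0, 0.4))

plot(h[[1]], col = cols[1], xlim = range(brks), ylim = c(0, ymax),
     main = "Sepal length by species", xlab = "Sepal.Length")
plot(h[[2]], col = cols[2], add = TRUE)
plot(h[[3]], col = cols[3], add = TRUE)
legend("topright", legend = names(h), fill = cols)
```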
You don't need to split the data if using ggplot. The key is to use transparency ("alpha") and change the value of the "position" argument to "identity" since the default is "stack".
Using the iris dataset:
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram(binwidth=0.2, alpha=0.5, position="identity") +
theme_minimal()
It's not easy to see the overlap, so a density plot may be a better choice if that's the main objective. Again, use transparency to avoid obscuring overlapping plots.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_density(alpha=0.5) +
xlim(3.9,8.5) +
theme_minimal()
So for your data, the command would be something like this:
ggplot(data=df, aes(x=year, fill=type)) +
geom_histogram(alpha=0.5, position="identity")

ggplot2: Time-series plot by continuous variable, color/fill by group

I have searched considerably for what I want to accomplish, but I haven't run across examples or plots that are specifically what I'm looking for, so I am reaching out to the community.
What I have (data downloadable here):
Time-series data (each record 2 hours apart and spanning nearly a year) with associated elevation and property ownership.
library(ggplot2)
data <- read.csv("dataex.csv")
data$timestamp <-as.POSIXct(as.character(data$timestamp),format="%m/%d/%Y %H:%M", tz="GMT")
What I want (via ggplot):
A line or bar plot showing elevation (y-axis) across time (x-axis) for each data record, colored by ownership (for a line plot, filling the area under the line; for a bar plot, filling the bar). I've tried iterations of geom_line, geom_bar, and geom_area; the geom_bar attempt below is the closest I have come. I'd like at least one of the following options to work!
Option A - The closest I have come to achieving this (plotting per data record) is with the following code:
ggplot(data, aes(x=timestamp, y=elev, fill=OWNER)) + geom_bar(stat="identity")
However, I'd like the bars to be touching each other, but if I adjust the width in geom_bar(), everything disappears. (Also, if I run the above code on other batches of similar data, it will only show a fraction of the bars, likely because they have more data records.) It seems like it's just too much data to plot. So I tried another route...
Option B - Plotting by day, which turns out to be more informative, showing each day the variability in ownership.
ggplot(data, aes(x=as.Date(Date, format='%Y-%m-%d'), y=elev, fill=OWNER)) + geom_bar(stat="identity", width=1)
However, this sums the y-axis, so the elevation is not interpretable. I could divide the y-axis by 12 (the typical number of records per day) but there are occasional days with fewer than 12 records, which causes the y-axis to be incorrect. Is there a function or a way to divide the y-axis by the respective number of records per day that is being represented in the plot? Or does someone have advice for a better solution?
Something like:
library(readr)
library(dplyr)
library(ggplot2)
library(ggalt)
readr::read_csv("~/Desktop/dataex.csv") %>%
mutate(timestamp=lubridate::mdy_hm(timestamp)) %>%
select(timestamp, elev, Owner=OWNER) -> df
ggplot(df, aes(timestamp, elev, group=Owner, color=Owner)) +
geom_segment(aes(xend=timestamp, yend=0), size=0.1) +
scale_x_datetime(expand=c(0,0), date_breaks="2 months") +
scale_y_continuous(expand=c(0,0), limits=c(0,2250), labels=scales::comma) +
ggthemes::scale_color_tableau() +
hrbrmisc::theme_hrbrmstr(grid="Y") +
labs(x=NULL, y="Elevation") +
theme(legend.position="bottom") +
theme(axis.title.y=element_text(angle=0, margin=margin(r=-20)))
?
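For Option B in the question, one way to divide by the actual number of records on each day is to scale every record by its own day's record count before plotting, so that each stacked bar sums to that day's mean elevation. The sketch below is only an illustration: dataex.csv is not available here, so dat is a small made-up stand-in that reuses the question's column names Date, elev and OWNER, and n_day and elev_share are hypothetical helper columns.

```r
library(ggplot2)

# Made-up stand-in for dataex.csv: three days with 12, 12 and 9 records
set.seed(1)
dat <- data.frame(
  Date  = as.Date("2015-01-01") + rep(0:2, times = c(12, 12, 9)),
  elev  = round(runif(33, 1500, 2000)),
  OWNER = sample(c("Private", "Public"), 33, replace = TRUE)
)

# Number of records on the same day as each record
dat$n_day <- ave(dat$elev, format(dat$Date), FUN = length)

# Each record contributes elev / n_day, so a day's stack sums to its mean
dat$elev_share <- dat$elev / dat$n_day

p <- ggplot(dat, aes(x = Date, y = elev_share, fill = OWNER)) +
  geom_col(width = 1) +
  ylab("mean elevation per day")
print(p)
```

The height of each coloured segment then reflects both the owner's elevations and how many of the day's records fall on that owner's land.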

How do you place a curve over a histogram with multiple factors?

I think I am making this out to be much more complicated than it is, but I haven't been able to figure it out.
I essentially have created a bar plot from some count data (vector counts), by day (vector day). Within the plot I am factoring color by treat (vector treat). What I would like to do is overlay some sort of curve instead of having the bars. I am most interested in showing where the peaks are in the vector "counts" for each treat.
Here are the vectors within my dataframe, which I have named "data"
treat<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4)
counts<-c(9,12,11,5,3,2,0,2,0,0,1,1,0,10,4,7,6,1,4,1,1,0,0,0,0,0,12,5,15,3,4,2,0,0,1,0,0,0,0,4,6,11,7,7,4,1,1,0,0,0,0,0)
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,2,3,4,5,6,7,8,9,10,11,12,13,1,2,3,4,5,6,7,8,9,10,11,12,13,1,2,3,4,5,6,7,8,9,10,11,12,13)
q=ggplot(data, aes(x=factor(day), y=counts, fill=factor(treat), color=factor(treat)))
q+geom_bar(stat= "identity", position=position_dodge(), width=.75)
Thank you for taking a look!
If you want to show the peaks for count in each treat, then you might consider using facets with either facet_wrap or facet_grid. An example with facet_wrap:
ggplot(data, aes(x=factor(day), y=counts, color=factor(treat), group=factor(treat))) +
geom_line() +
facet_wrap(~treat)
which gives you the following plot:
Is this what you're looking for?
data = data.frame(treat, counts, day)
ggplot(data, aes(x=factor(day), y=counts, group=factor(treat), color=factor(treat))) +
geom_line(lwd=1) +
geom_point()
The result is a bit hard to read for these data, but maybe it will look better with your real data. Or you can use faceting, as shown by @Jaap.
Also, you can simplify your data creation as follows:
treat = rep(1:4, each=13)
day = rep(1:13, 4)

In ggplot2, how can I make a bar chart of proportions across factors (and add error bars)?

I'm struggling with making a graph of proportion of a variable across a factor in ggplot.
Taking mtcars data as an example and stealing part of a solution from this question I can come up with
ggplot(mtcars, aes(x = as.factor(cyl))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent_format())
This graph gives me proportion of each cyl category in the whole dataset.
What I'd like to get though is the proportion of cars in each cyl category, that have automatic transmission (binary variable am).
On top of each bar I would like to add an error bar for the proportion.
Is it possible to do it with ggplot only? Or do I have to first prepare a data frame with summaries and use it with identity option of bar graphs?
I found some examples on Cookbook for R web page, but they deal with continuous y variable.
I think that it would be easier to make new data frame and then use it for plotting. Here I calculated proportions and lower/upper confidence interval values (took them from prop.test() result).
library(plyr)
mt.new<-ddply(mtcars,.(cyl),summarise,
prop=sum(am)/length(am),
low=prop.test(sum(am),length(am))$conf.int[1],
upper=prop.test(sum(am),length(am))$conf.int[2])
ggplot(mt.new,aes(as.factor(cyl),y=prop,ymin=low,ymax=upper))+
geom_bar(stat="identity")+
geom_errorbar()
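One detail to watch: in mtcars, am is coded 0 = automatic and 1 = manual, so sum(am)/length(am) above is the proportion of manual cars. A sketch of the same idea for automatic cars, using base R instead of plyr, might look like the following; the name auto_by_cyl is made up for illustration, and the intervals are still the ones prop.test() returns.

```r
library(ggplot2)

# One row per cyl group: share of automatic cars (am == 0) and its interval
auto_by_cyl <- do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(d) {
  n_auto <- sum(d$am == 0)
  # prop.test() can warn about small samples; the warning is suppressed here
  ci <- suppressWarnings(prop.test(n_auto, nrow(d)))$conf.int
  data.frame(cyl = factor(d$cyl[1]), prop = n_auto / nrow(d),
             low = ci[1], upper = ci[2])
}))

p <- ggplot(auto_by_cyl, aes(x = cyl, y = prop)) +
  geom_col() +
  geom_errorbar(aes(ymin = low, ymax = upper), width = 0.2) +
  scale_y_continuous(labels = scales::percent)
print(p)
```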

normalizing ggplot2 densities with facet_wrap in R

I am making a series of density plots with geom_density from a dataframe, and showing it by condition using facet_wrap, as in:
ggplot(iris) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
When I do this, the y-axis scale seems to not represent percent of each Species in a panel, but rather the percent of all the total datapoints across all species.
My question is: How can I make it so the ..count.. variable in geom_density refers to the count of items in each Species set of each panel, so that the panel for virginica has a y-axis corresponding to "Fraction of virginica data points"?
Also, is there a way to get ggplot2 to output the values it uses for ..count.. and sum(..count..) so that I can verify what numbers it is using?
Edit: I misunderstood geom_density. It looks like even for a single Species, ..count../sum(..count..) is not a percentage:
ggplot(iris[iris$Species == 'virginica',]) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
So my revised question: how can I get the density plot to show the fraction of data in each bin? Do I have to use stat_density for this, or geom_histogram? I just want the y-axis to be the percentage / fraction of data points.
Unfortunately, what you are asking ggplot2 to do is define separate y's for each facet, which it syntactically cannot do AFAIK.
So, in response to your mentioning in the comment thread that you "just want a histogram fundamentally", I would suggest instead using geom_histogram or, if you're partial to lines instead of bars, geom_freqpoly:
ggplot(iris, aes(Sepal.Width, ..count..)) +
geom_histogram(aes(colour=Species, fill=Species), binwidth=.2) +
geom_freqpoly(colour="black", binwidth=.2) +
facet_wrap(~Species)
Note: geom_freqpoly works just as well in place of geom_histogram in my above example. I just added both in one plot for sake of efficiency.
Hope this helps.
EDIT: Alright, I managed to work out a quick-and-dirty way of getting what you want. It requires that you install and load plyr. Apologies in advance; this is likely not the most efficient way to do this in terms of RAM usage, but it works.
First, let's get iris out in the open (I use RStudio so I'm used to seeing all my objects in a window):
d <- iris
Now, we can use ddply to count the number of individuals belonging to each unique measurement of what will become your x-axis (here I used Sepal.Length instead of Sepal.Width, to give myself a bit more range, simply for seeing a bigger difference between groups when plotted).
new <- ddply(d, c("Species", "Sepal.Length"), summarize, count=length(Sepal.Length))
Note that ddply automatically sorts the output data.frame according to the quoted variables.
Then we can divvy up the data.frame into each of its unique conditions--in the case of iris, each of the three species (I'm sure there's a much smoother way to go about this, and if you're working with really large amounts of data it's not advisable to keep creating subsets of the same data.frame because you could max out your RAM)...
set <- new[which(new$Species%in%"setosa"),]
ver <- new[which(new$Species%in%"versicolor"),]
vgn <- new[which(new$Species%in%"virginica"),]
... and use ddply again to calculate proportions of individuals falling under each measurement, but separately for each species.
prop <- rbind(ddply(set, c("Species"), summarize, prop=set$count/sum(set$count)),
ddply(ver, c("Species"), summarize, prop=ver$count/sum(ver$count)),
ddply(vgn, c("Species"), summarize, prop=vgn$count/sum(vgn$count)))
Then we just put everything we need into one dataset and remove all the junk from our workspace.
new$prop <- prop$prop
rm(list=ls()[which(!ls()%in%c("new", "d"))])
And we can make our figure with facet-specific proportions on the y. Note that I'm now using geom_line since ddply has automatically ordered your data.frame.
ggplot(new, aes(Sepal.Length, prop)) +
geom_line(aes(colour=Species)) +
facet_wrap(~Species)
# let's check our work. each should equal 50
sum(new$count[which(new$Species%in%"setosa")])
sum(new$count[which(new$Species%in%"versicolor")])
sum(new$count[which(new$Species%in%"virginica")])
#... and each of these should equal 1
sum(new$prop[which(new$Species%in%"setosa")])
sum(new$prop[which(new$Species%in%"versicolor")])
sum(new$prop[which(new$Species%in%"virginica")])
Maybe using table() and barplot() you might be able to get what you need. I'm still not sure if this is what you are after...
barplot(table(iris[iris$Species == 'virginica',1]))
With ggplot2
tb <- table(iris[iris$Species == 'virginica',1])
tb <- as.data.frame(tb)
ggplot(tb, aes(x=Var1, y=Freq)) + geom_bar(stat="identity")
Passing the argument scales='free_y' to facet_wrap() should do the trick.
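Another possible route to a per-panel fraction, sketched here under the assumption of ggplot2 3.3.0 or later, is to multiply the computed density by the bin width inside after_stat(). Binning is done separately for each panel, so the bars in each panel should sum to 1. layer_data() returns the values ggplot2 actually computed, which also covers the question about inspecting them.

```r
library(ggplot2)

# density * width = count / sum(count) within each panel
p <- ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(aes(y = after_stat(density * width), fill = Species),
                 binwidth = 0.2) +
  facet_wrap(~Species) +
  ylab("fraction of species")
print(p)

# Inspect the computed values: one row per bin and panel
computed <- layer_data(p, 1)
tapply(computed$y, computed$PANEL, sum)   # per-panel totals
```

By contrast, scales='free_y' only lets each panel choose its own axis range; it does not change the plotted values.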
