I want to reduce clutter when plotting bars for different categories. That is, I use facets to compare the same categorical variable across the levels of another categorical variable.
For example, I use the tips dataset from reshape2:
library(reshape2)
library(ggplot2)
ggplot(tips, aes(x=time)) +
geom_bar(shape=1) +
facet_grid(. ~ sex)
The result is:
My desired change is that "Dinner" and "Lunch" only appear below the "Female" facet. I tried
scale_x_discrete(labels = c("with", "without", "", ""))
but of course without effect since there are only two categories within the variable time, so why take more than two elements in the labels vector? How can I accomplish my desired graph without the "draw two graphs and combine them"-workaround?
You can modify components of a ggplot using ggplot_build and ggplot_gtable (here p is the plot from the question, stored in a variable):
x <- ggplot_gtable(ggplot_build(p))
If you look at str(x), you can then figure out where to change labels:
x$grobs[[8]]$children$axis$grobs[[2]]$label <- c('', '')
plot(x)
However, it's important to note that this may not work with future versions of ggplot2 if they decide to change the internal structure of plots.
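As a sketch of a variant that depends less on hard-coded indices, the axis grob can be looked up by its layout name instead of by position. This assumes the bottom axis of the second facet is named "axis-b-2" in the gtable layout, which is still internal structure and may change; mtcars stands in for the tips data so that the example needs no extra package. Note that it removes the tick marks as well as the labels.

```r
library(ggplot2)
library(grid)

# Stand-in for the question's plot: bars of one factor, faceted by another
p <- ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  facet_grid(. ~ am)

g <- ggplotGrob(p)

# Assumption: the bottom axis under the second facet is called "axis-b-2"
idx <- which(g$layout$name == "axis-b-2")

# Replace that axis (labels and ticks) with an empty grob
g$grobs[[idx]] <- nullGrob()

grid.newpage()
grid.draw(g)
```

If the name is not found, printing g$layout$name shows what the axes are called in the installed version.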
I want to compare two histograms in one graph in R, but couldn't work out how to implement it.
My histograms are based on two sub-dataframes, and these datasets are divided according to a type (Action, Adventure, Family).
My first histogram is:
split_action <- split(df, df$type)
dataset_action <- split_action$Action
hist(dataset_action$year)
split_adventure <- split(df, df$type)
dataset_adventure <- split_adventure$Adventure
hist(dataset_adventure$year)
I want to see how much they overlap, comparing them by year in the same histogram. Thank you in advance.
Using the iris dataset, suppose you want to make a histogram of sepal length for each species. First, you can make 3 data frames for each species by subsetting.
irissetosa<-subset(iris,Species=='setosa',select=c('Sepal.Length','Species'))
irisversi<-subset(iris,Species=='versicolor',select=c('Sepal.Length','Species'))
irisvirgin<-subset(iris,Species=='virginica',select=c('Sepal.Length','Species'))
and then, make the histogram for these 3 data frames. Don't forget to set the argument "add" as TRUE (for the second and third histogram), because you want to combine the histograms.
hist(irissetosa$Sepal.Length,col='red')
hist(irisversi$Sepal.Length,col='blue',add=TRUE)
hist(irisvirgin$Sepal.Length,col='green',add=TRUE)
you will have something like this
Then you can see which part is overlapping...
But, I know, it's not so good.
Another way to see which part is overlapping is by using density function.
plot(density(irissetosa$Sepal.Length),col='red')
lines(density(irisversi$Sepal.Length),col='blue')
lines(density(irisvirgin$Sepal.Length),col='green')
Then you will have something like this
Hope it helps!!
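One possible way to make the base-graphics overlay easier to read is to give all three histograms the same breaks and semi-transparent fill colours. The following is only a sketch: the break width of 0.25 and the colours are arbitrary choices, not part of the answer above.

```r
# Common breaks so the bars of all species line up
breaks <- seq(4, 8, by = 0.25)

# Semi-transparent fills (the fourth argument of rgb() is alpha)
cols <- c(setosa     = rgb(1, 0, 0, 0.4),
          versicolor = rgb(0, 0, 1, 0.4),
          virginica  = rgb(0, 1, 0, 0.4))

# Compute the histograms without drawing them
sepal <- split(iris$Sepal.Length, iris$Species)
h <- lapply(sepal, hist, breaks = breaks, plot = FALSE)

# A shared y range so that no histogram is cut off
ymax <- max(sapply(h, function(x) max(x$counts)))

plot(h[[1]], col = cols[1], xlim = range(breaks), ylim = c(0, ymax),
     main = "Sepal length by species", xlab = "Sepal.Length")
plot(h[[2]], col = cols[2], add = TRUE)
plot(h[[3]], col = cols[3], add = TRUE)
legend("topright", legend = names(h), fill = cols)
```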
You don't need to split the data if using ggplot. The key is to use transparency ("alpha") and change the value of the "position" argument to "identity" since the default is "stack".
Using the iris dataset:
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram(binwidth=0.2, alpha=0.5, position="identity") +
theme_minimal()
It's not easy to see the overlap, so a density plot may be a better choice if that's the main objective. Again, use transparency to avoid obscuring overlapping plots.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_density(alpha=0.5) +
xlim(3.9,8.5) +
theme_minimal()
So for your data, the command would be something like this:
ggplot(data=df, aes(x=year, fill=type)) +
geom_histogram(alpha=0.5, position="identity")
I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also expanded-grid to include different x possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.
Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label:
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg
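For the boxplot in the question, a similar result can be sketched by iterating over column names rather than column values and referring to the columns through ggplot2's .data pronoun. The helper box_plot, the toy data frame, and its columns len and width are hypothetical and only illustrate the shape of the question's data (a factor Group with levels A to F).

```r
library(ggplot2)

# Hypothetical helper: one boxplot of column `attr`, grouped by `group`
box_plot <- function(df, attr, group = "Group") {
  ggplot(df, aes(x = .data[[group]], y = .data[[attr]], fill = .data[[group]])) +
    geom_boxplot() +
    labs(title = paste("Plot of", attr, "by", group),
         x = group, y = attr, fill = group) +
    theme_classic()
}

# Toy data standing in for the question's data frame
df <- data.frame(Group = factor(rep(LETTERS[1:6], each = 5)),
                 len   = rnorm(30),
                 width = rnorm(30))

# Iterate over names; setNames() makes the list indexable by column name
attrs <- setdiff(names(df), "Group")
plots <- lapply(setNames(attrs, attrs), box_plot, df = df)

plots$len
```

Because the function receives the column name, the same pattern should carry over to histograms or density plots by swapping the geom.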
I have two factors and two continuous variables, and I use this to create a two-way facet plot using ggplot2. However, not all of my factor combinations have data, so I end up with dummy facets. Here's some dummy code to produce an equivalent output:
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
facet_grid(row~col)
This produces this figure
Is there any way to remove the facets that don't plot any data? And, ideally, move the x and y axis labels up or right to the remaining plots? As shown in this GIMPed version
I've searched here and elsewhere and unless my search terms just aren't good enough, I can't find the same problem anywhere. Similar issues are often with unused factor levels, but here no factor level is unused, just factor level combinations. So facet_grid(drop=TRUE) or ggplot(data=droplevels(dummy)) doesn't help here. Combining the factors into a single factor and dropping unused levels of the new factor can only produce a 1-dimensional facet grid, which isn't what I want.
Note: my actual data has a third factor level which I represent by different point colours. Thus a single-plot solution allowing me to retain a legend would be ideal.
It's not too difficult to rearrange the graphical objects (grobs) manually to achieve what you're after.
Load the necessary libraries.
library(grid);
library(gtable);
Turn your ggplot2 plot into a grob.
gg <- ggplot(data = dummy, aes(x = x,y = y)) +
geom_point() +
facet_grid(row ~ col);
grob <- ggplotGrob(gg);
Working out which facets to remove, and which axes to move where depends on the grid-structure of your grob. gtable_show_layout(grob) gives a visual representation of your grid structure, where numbers like (7, 4) denote a panel in row 7 and column 4.
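Besides the visual layout, the same information can be read from the layout table of the grob. The sketch below rebuilds the question's plot and lists the name and grid position (t, l, b, r) of every panel and axis; it assumes the layout names start with "panel-" and "axis-", as they do in the steps that follow.

```r
library(ggplot2)

dummy <- data.frame(x = rnorm(60), y = rnorm(60),
                    col = rep(c("A", "B", "C", "B", "C", "C"), each = 10),
                    row = rep(c("a", "a", "a", "b", "b", "c"), each = 10))

grob <- ggplotGrob(
  ggplot(dummy, aes(x = x, y = y)) +
    geom_point() +
    facet_grid(row ~ col)
)

# Keep only the panels and axes and show where each one sits in the grid
keep <- grepl("^(panel|axis)-", grob$layout$name)
grob$layout[keep, c("name", "t", "l", "b", "r")]
```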
Remove the empty facets.
# Remove facets
idx <- which(grob$layout$name %in% c("panel-2-1", "panel-3-1", "panel-3-2"));
for (i in idx) grob$grobs[[i]] <- nullGrob();
Move the x axes up.
# Move x axes up
# axis-b-1 needs to move up 4 rows
# axis-b-2 needs to move up 2 rows
idx <- which(grob$layout$name %in% c("axis-b-1", "axis-b-2"));
grob$layout[idx, c("t", "b")] <- grob$layout[idx, c("t", "b")] - c(4, 2);
Move the y axes to the right.
# Move y axes right
# axis-l-2 needs to move 2 columns to the right
# axis-l-3 needs to move 4 columns to the right
idx <- which(grob$layout$name %in% c("axis-l-2", "axis-l-3"));
grob$layout[idx, c("l", "r")] <- grob$layout[idx, c("l", "r")] + c(2, 4);
Plot.
# Plot
grid.newpage();
grid.draw(grob);
Extending this to more facets is straightforward.
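For larger grids, the names of the empty panels can be derived from the data rather than typed by hand. This is a sketch under the assumption that panels are named "panel-<row>-<column>", as in the code above, and that facet rows and columns follow the sorted factor levels.

```r
dummy <- data.frame(x = rnorm(60), y = rnorm(60),
                    col = rep(c("A", "B", "C", "B", "C", "C"), each = 10),
                    row = rep(c("a", "a", "a", "b", "b", "c"), each = 10))

# Cross-tabulate the two facetting variables; zero cells are empty facets
tab <- table(dummy$row, dummy$col)
empty <- which(tab == 0, arr.ind = TRUE)

# First column of `empty` is the facet row, second is the facet column
empty_names <- sprintf("panel-%d-%d", empty[, 1], empty[, 2])
empty_names
```

The resulting vector can then be used in place of the hard-coded panel names when selecting the grobs to replace.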
Maurits Evers' solution worked great, but is quite cumbersome to modify.
An alternative solution is to use facet_manual from {ggh4x}.
This is not equivalent though as it uses facet_wrap, but allows appropriate placement of the facets.
# devtools::install_github("teunbrand/ggh4x")
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
design <- "
ABC
#DE
##F
"
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
ggh4x::facet_manual(vars(row,col), design = design, labeller = label_both)
Created on 2022-02-25 by the reprex package (v2.0.0)
One possible solution, of course, would be to create a plot for each factor combination separately and then combine them using grid.arrange() from gridExtra. This would probably lose my legend and would be an all around pain, would love to hear if anyone has any better suggestions.
This particular case looks like a job for ggpairs (link to a SO example). I haven't used it myself, but for paired plots this seems like the best tool for the job.
In a more general case, where you're not looking for pairs, you could try creating a column with a composite (pasted) factor and facet_grid or facet_wrap by that variable (example)
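A minimal sketch of the composite-factor idea, using the dummy data from the question: interaction() with drop = TRUE keeps only the combinations that actually occur, so facet_wrap draws six panels and no empty ones. The column name combo is an arbitrary choice, and the two-dimensional arrangement of the grid is lost.

```r
library(ggplot2)

dummy <- data.frame(x = rnorm(60), y = rnorm(60),
                    col = rep(c("A", "B", "C", "B", "C", "C"), each = 10),
                    row = rep(c("a", "a", "a", "b", "b", "c"), each = 10))

# One factor level per observed row/col combination, e.g. "a.A"
dummy$combo <- interaction(dummy$row, dummy$col, drop = TRUE)

ggplot(dummy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~combo)
```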
I am making a series of density plots with geom_density from a dataframe, and showing it by condition using facet_wrap, as in:
ggplot(iris) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
When I do this, the y-axis scale seems to not represent percent of each Species in a panel, but rather the percent of all the total datapoints across all species.
My question is: How can I make it so the ..count.. variable in geom_density refers to the count of items in each Species set of each panel, so that the panel for virginica has a y-axis corresponding to "Fraction of virginica data points"?
Also, is there a way to get ggplot2 to output the values it uses for ..count.. and sum(..count..) so that I can verify what numbers it is using?
edit: I misunderstood geom_density it looks like even for a single Species, ..count../sum(..count..) is not a percentage:
ggplot(iris[iris$Species == 'virginica',]) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
so my revised question: how can I get the density plot to be the fraction of data in each bin? Do I have to use stat_density for this or geom_histogram? I just want the y-axis to be percentage / fraction of data points
Unfortunately, what you are asking ggplot2 to do is define separate y's for each facet, which it syntactically cannot do AFAIK.
So, in response to your mentioning in the comment thread that you "just want a histogram fundamentally", I would suggest instead using geom_histogram or, if you're partial to lines instead of bars, geom_freqpoly:
ggplot(iris, aes(Sepal.Width, ..count..)) +
geom_histogram(aes(colour=Species, fill=Species), binwidth=.2) +
geom_freqpoly(colour="black", binwidth=.2) +
facet_wrap(~Species)
Note: geom_freqpoly works just as well in place of geom_histogram in my above example. I just added both in one plot for sake of efficiency.
Hope this helps.
EDIT: Alright, I managed to work out a quick-and-dirty way of getting what you want. It requires that you install and load plyr. Apologies in advance; this is likely not the most efficient way to do this in terms of RAM usage, but it works.
First, let's get iris out in the open (I use RStudio so I'm used to seeing all my objects in a window):
d <- iris
Now, we can use ddply to count the number of individuals belonging to each unique measurement of what will become your x-axis (here I used Sepal.Length instead of Sepal.Width, to give myself a bit more range, simply for seeing a bigger difference between groups when plotted).
new <- ddply(d, c("Species", "Sepal.Length"), summarize, count=length(Sepal.Length))
Note that ddply automatically sorts the output data.frame according to the quoted variables.
Then we can divvy up the data.frame into each of its unique conditions--in the case of iris, each of the three species (I'm sure there's a much smoother way to go about this, and if you're working with really large amounts of data it's not advisable to keep creating subsets of the same data.frame because you could max out your RAM)...
set <- new[which(new$Species%in%"setosa"),]
ver <- new[which(new$Species%in%"versicolor"),]
vgn <- new[which(new$Species%in%"virginica"),]
... and use ddply again to calculate proportions of individuals falling under each measurement, but separately for each species.
prop <- rbind(ddply(set, c("Species"), summarize, prop=set$count/sum(set$count)),
ddply(ver, c("Species"), summarize, prop=ver$count/sum(ver$count)),
ddply(vgn, c("Species"), summarize, prop=vgn$count/sum(vgn$count)))
Then we just put everything we need into one dataset and remove all the junk from our workspace.
new$prop <- prop$prop
rm(list=ls()[which(!ls()%in%c("new", "d"))])
And we can make our figure with facet-specific proportions on the y. Note that I'm now using geom_line since ddply has automatically ordered your data.frame.
ggplot(new, aes(Sepal.Length, prop)) +
geom_line(aes(colour=Species)) +
facet_wrap(~Species)
# let's check our work. each should equal 50
sum(new$count[which(new$Species%in%"setosa")])
sum(new$count[which(new$Species%in%"versicolor")])
sum(new$count[which(new$Species%in%"virginica")])
#... and each of these should equal 1
sum(new$prop[which(new$Species%in%"setosa")])
sum(new$prop[which(new$Species%in%"versicolor")])
sum(new$prop[which(new$Species%in%"virginica")])
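A shorter route to facet-specific fractions may be to let the binning stat do the normalisation. This is a sketch, assuming ggplot2 3.3.0 or later for after_stat(): the stat's density is normalised within each group, so density times bin width gives the fraction of that group's points in each bin. ggplot_build() exposes the computed values, which also addresses the question of how to check the numbers ggplot2 uses.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(aes(y = after_stat(density * width), fill = Species),
                 binwidth = 0.2) +
  facet_wrap(~Species) +
  ylab("Fraction of data points in panel")

p

# Inspect the values computed for the layer
built <- ggplot_build(p)$data[[1]]
head(built[, c("PANEL", "x", "count", "density", "y")])

# Bar heights within each panel should add up to 1
tapply(built$y, built$PANEL, sum)
```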
Maybe using table() and barplot() you might be able to get what you need. I'm still not sure if this is what you are after...
barplot(table(iris[iris$Species == 'virginica',1]))
With ggplot2
tb <- table(iris[iris$Species == 'virginica',1])
tb <- as.data.frame(tb)
ggplot(tb, aes(x=Var1, y=Freq)) + geom_bar(stat="identity")
Passing the argument scales='free_y' to facet_wrap() should do the trick.
Is there a way to specify that the bars of a stacked bar graph in ggplot should be ordered by the total of the four factors, from least to greatest? (So in the code below, I want to order by the total of all of the variables.) I have the total for each x value in a data frame that I melted to create the data frame from which I formed the graph.
The code that I am using to graph is:
ggplot(md, aes(x=factor(fullname), fill=factor(variable))) + geom_bar()
My current graph looks like this:
http://i.minus.com/i5lvxGAH0hZxE.png
The end result is I want to have a graph that looks a bit like this:
http://i.minus.com/kXpqozXuV0x6m.jpg
My data looks like this:
(source: minus.com)
and I melt it to this form where each student has a value for each category:
melted data http://i.minus.com/i1rf5HSfcpzri.png
before using the following line to graph it
ggplot(data=md, aes(x=fullname, y=value, fill=variable), ordered=TRUE) + geom_bar()+ opts(axis.text.x=theme_text(angle=90))
Now, I'm not really sure that I understand the way Chi does the ordering, or whether I can apply it to the data from either of the frames that I have. Maybe it's helpful that the data is ordered in the original data frame that I have, the one that I show first.
UPDATE: We figured it out. See this thread for the answer:
Order Stacked Bar Graph in ggplot
I'm not sure about the way your data were generated (i.e., whether you use a combination of cast/melt from the reshape package, which is what I suspect given the default name of your variables), but here is a toy example where sorting is done outside the call to ggplot. There might be a far better way to do that; browse SO as suggested by @Andy.
v1 <- sample(c("I","S","D","C"), 200, replace=TRUE)
v2 <- sample(LETTERS[1:24], 200, replace=TRUE)
my.df <- data.frame(v1, v2)
idx <- order(apply(table(v1, v2), 2, sum))
library(ggplot2)
ggplot(my.df, aes(x=factor(v2, levels=LETTERS[1:24][idx], ordered=TRUE),
fill=v1)) + geom_bar() + theme(axis.text.x=element_text(angle=90)) +
labs(x="fullname")
To sort in the reverse direction, add decreasing=TRUE to the order command. Also, as suggested by @Andy, you might overcome the problem with overlapping x-labels by adding + coord_flip() instead of the theme() option.
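As an alternative sketch of the same idea, base R's reorder() can set the factor levels by a per-level summary in one step. Here the summary is the number of rows per level, which matches what geom_bar() counts in the toy example; set.seed(1) is only there to make the toy data reproducible.

```r
library(ggplot2)

set.seed(1)
my.df <- data.frame(v1 = sample(c("I", "S", "D", "C"), 200, replace = TRUE),
                    v2 = sample(LETTERS[1:24], 200, replace = TRUE))

# Order the levels of v2 by how many rows each level has (ascending)
my.df$v2 <- reorder(my.df$v2, my.df$v2, FUN = length)

ggplot(my.df, aes(x = v2, fill = v1)) +
  geom_bar() +
  coord_flip() +
  labs(x = "fullname")
```

For melted data with a value column, as in the question, the analogous call would presumably be reorder(fullname, value, FUN = sum), so that bars are ordered by their stacked total.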