ggplot geom_bar with stat = "sum" - r

I want to plot a bar chart summing a variable along two dimensions, one will be spread along x, and the other will be spread vertically (stacked).
I would expect the two following instructions to do the same, but they don't and only the 2nd one gives the desired output (where I aggregate the data myself).
I'd like to understand what's going on in the first case, and if there's a way to use ggplot2 's built-in aggregation features to get the right output.
library(ggplot2)
library(dplyr)
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_bar(stat="sum",na.rm=TRUE)
yielding this plot:
p2 <- ggplot(diamonds %>%
group_by(cut,color) %>%
summarize_at("price",sum,na.rm=TRUE),
aes(cut,price,fill=color)) +
geom_bar(stat="identity",na.rm=TRUE)
yielding this picture:
Here's where the top of our bars should be, p1 doesn't give these values:
diamonds %>% group_by(cut) %>% summarize_at("price",sum,na.rm=TRUE)
# # A tibble: 5 x 2
# cut price
# <ord> <int>
# 1 Fair 7017600
# 2 Good 19275009
# 3 Very Good 48107623
# 4 Premium 63221498
# 5 Ideal 74513487

You might be misunderstanding the stat option for geom_bar. stat = "sum" does not add up the y values: it is the stat behind geom_count, and it counts the number of observations at each x/y location. Since you want the values for each cut summed within each bar, with each bar split by how much of that total falls in each color, you can simplify the call to geom_col, which uses the values as bar heights and stacks them by default, and therefore "sums" all the values within each category. For example, the following will give the desired output:
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_col(na.rm=TRUE)
Alternatively, if you want to use geom_bar with a stat call, then you want to use the "identity" stat:
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_bar(stat = "identity", na.rm=TRUE)
For more information, consider this thread: https://stackoverflow.com/a/27965637/6722506
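As for the built-in aggregation the question asks about, one option is stat_summary with a summing function. This is a sketch rather than part of the answer above: it assumes ggplot2 3.3.0 or later (where the argument is named fun), and the names p3, built and bar_tops are only illustrative.

```r
library(ggplot2)

# stat_summary() groups the rows by x (cut) and fill (color), applies sum()
# to price within each group, and position = "stack" stacks the results.
p3 <- ggplot(diamonds, aes(cut, price, fill = color)) +
  stat_summary(fun = sum, geom = "bar", position = "stack")

# Read the top of each stacked bar back from the built plot
built <- layer_data(p3)
bar_tops <- tapply(built$ymax, as.integer(built$x), max)
```

bar_tops can then be compared against the per-cut totals listed in the question.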

R: how to filter within aes()

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response  Gender  n
1         1       84
1         2       79
2         1       42
2         2       74
3         1       84
3         2       79
etc.
I want to plot these in a column chart: on the y axis I want the n (or its proportions), and on the x axis I want two separate bars per response: one for gender 1 and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")
Maybe something like this:
library(RColorBrewer)
library(dplyr) # provides %>%
library(ggplot2)
df %>%
ggplot(aes(x=factor(Response), y=n, fill=factor(Gender)))+
geom_col(position=position_dodge())+
scale_fill_brewer(palette = "Set1")+
theme_light()
Your approach does not work because you are assigning the x and y variables as if they came from different datasets (one for x and one for y). In line with the solution from TarJae, think of them as the axes of a diagram: assign to the x axis the categorical variable you are comparing, and to the y axis the numerical variable that determines the height of the bars. Finally, you want to compare groups by color, so each group gets a different color; that is where you include your grouping variable (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "identity", position = "dodge")
I am adding stat = "identity" because the default in geom_bar is to count the occurrences in your data (which is what you would want if your data were not already aggregated). I am adding position = "dodge" to keep the bars from being stacked. I recommend looking at this resource for more information: https://r4ds.had.co.nz/index.html
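To make the two points above concrete (filter the data before it reaches ggplot, and map whole columns inside aes()), here is a minimal sketch. It rebuilds only the six rows shown in the question; the names df_prop, p_both and p_one are made up for illustration, and the proportion is assumed to mean "share within each gender".

```r
library(dplyr)
library(ggplot2)

# The rows shown in the question's table
df <- data.frame(
  Response = c(1, 1, 2, 2, 3, 3),
  Gender   = c(1, 2, 1, 2, 1, 2),
  n        = c(84, 79, 42, 74, 84, 79)
)

# Assumption: "proportions" means the share of each response within a gender
df_prop <- df %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Both genders side by side
p_both <- ggplot(df_prop,
                 aes(x = factor(Response), y = prop, fill = factor(Gender))) +
  geom_col(position = "dodge")

# Only one gender: filter the data first, then map columns in aes()
p_one <- df_prop %>%
  filter(Gender == 1) %>%
  ggplot(aes(x = factor(Response), y = prop)) +
  geom_col()
```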

ggplot2 - How to plot length of time using geom_bar?

I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach is to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(dplyr) # provides %>%
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so that we have one row per rectangle we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y but geom_rect prefers numeric values. Since Crop is a factor now, we use the numeric codes of its levels to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
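One detail the reshaped data leaves open: the South/Soybean planting row has Begin = 245 and End = 1. The sketch below assumes (this is an interpretation, not something stated in the thread) that End < Begin means the period runs past day 365 into the next year, and splits such a row into two rectangles. The rects and split_rects names are hypothetical.

```r
library(dplyr)

# Two rows in the same shape as plotdata, values copied from the question
rects <- data.frame(
  Region = c("South", "South"),
  Crop   = c("Soybean", "Soybean"),
  Type   = c("Planting", "Harvest"),
  Begin  = c(245, 1),
  End    = c(1, 122)
)

# Assumption: End < Begin marks a period that wraps around the year end
wraps <- rects$End < rects$Begin

split_rects <- bind_rows(
  rects[!wraps, ],                      # periods inside one year, unchanged
  mutate(rects[wraps, ], End = 365),    # Begin up to the end of the year
  mutate(rects[wraps, ], Begin = 1)     # start of the year up to End
)
```

split_rects can then be passed to geom_rect in place of the original rows, so no rectangle is drawn backwards across the axis.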

ggplot2: specifying different scales for rows in facet layout for bar plots

My data are visualized in the package ggplot2 via bar plots with several (~10) facets. I want first to split these facets in several rows. I can use function facet_grid() or facet_wrap() for this. In the minimal example data here I build 8 facets in two rows (4x2). However I need to adjust scales for different facets, namely: first row contains data on small scale, and in the second row values are bigger. So I need to have same scale for all data in the first row to compare them along the row, and another scale for the second row.
Here is the minimal example and possible solutions.
#loading necessary libraries and example data
library(dplyr)
library(tidyr)
library(ggplot2)
trial.facets<-read.csv(text="period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")
#arranging data to long format with omission of the "period" variable
trial.facets.tidied<-trial.facets %>% gather(key=newvar,value=newvalue,-period)
And now plotting itself:
#First variant
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_grid(.~period)
#Second variant:
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_wrap(~period,nrow=2,scales="free")
The results for the first and second variants are as follows:
In both examples we have either free scales for all graphs, or fixed for all graphs. Meanwhile the first row (first 4 facets) needs to be scaled somewhat to 5, and the second row - to 15.
To use facet_grid() instead, I can add a dummy variable "row" that specifies which row each letter should belong to. The new dataset, trial.facets.row (only three lines shown), would look as follows:
period,xx,yy,row
C,3.2,0.5,1
D,2.5,1.5,1
E,11,13,2
Then I can perform the same rearrangement into long format, omitting variables "period" and "row":
trial.facets.tidied.2<-trial.facets.row %>% gather(key=newvar,value=newvalue,-period,-row)
Then I arrange facets along variables "row" and "period" in the hope to use the option scales="free_y" to adjust scales only across rows:
ggplot(trial.facets.tidied.2,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_grid(row~period,scales="free_y")
and - surprise: the problem with scales is solved, however, I get two groups of empty bars, and whole data is again stretched across a long strip:
None of the manual pages and handbooks I found (which usually use the mpg and mtcars datasets) cover this situation with unwanted empty facets.
I used a combination of your first method (facet_wrap) & second method (leverage on dummy variable for different rows):
# create fake variable "row"
trial.facets.row <- trial.facets %>% mutate(row = ifelse(period %in% c("A", "B", "C", "D"), 1, 2))
# rearrange to long format
trial.facets.tidied.2<-trial.facets.row %>% gather(key=newvar,value=newvalue,-period,-row)
# specify the maximum height for each row
trial.facets.tidied.3<-trial.facets.tidied.2 %>%
group_by(row) %>%
mutate(max.height = max(newvalue)) %>%
ungroup()
ggplot(trial.facets.tidied.3,
aes(x=newvar, y=newvalue,position="dodge"))+
geom_bar(stat = "identity") +
geom_blank(aes(y=max.height)) + # add blank geom to force facets on the same row to the same height
facet_wrap(~period,nrow=2,scales="free")
Note: based on this reproducible example, I'm assuming that all your plots already share a common ymin at 0. If that's not the case, simply create another dummy variable for min.height & add another geom_blank to your ggplot.
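One way to check that the geom_blank trick did what it should is to read the trained y limits back from the built plot. This is a sketch, not part of the answer above: it rebuilds the same plot (using geom_col as a stand-in for geom_bar(stat = "identity")) and uses layer_scales(), which addresses a panel by its row i and column j; the y_limits helper is a hypothetical name.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

trial.facets <- read.csv(text = "period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")

p <- trial.facets %>%
  mutate(row = ifelse(period %in% c("A", "B", "C", "D"), 1, 2)) %>%
  gather(key = newvar, value = newvalue, -period, -row) %>%
  group_by(row) %>%
  mutate(max.height = max(newvalue)) %>%
  ungroup() %>%
  ggplot(aes(x = newvar, y = newvalue)) +
  geom_col() +
  geom_blank(aes(y = max.height)) +
  facet_wrap(~period, nrow = 2, scales = "free")

# y limits of the panel in layout row i, column j (before axis expansion)
y_limits <- function(i, j) layer_scales(p, i = i, j = j)$y$get_limits()

top_row    <- lapply(1:4, function(j) y_limits(1, j))
bottom_row <- lapply(1:4, function(j) y_limits(2, j))
```

If the approach works, every element of top_row is the same pair of limits, and likewise for bottom_row.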
Looking over SO I encountered a solution which might be a bit tricky - from here
The idea is to create a second fake dataset which would plot a single point at each facet. This point will be drawn in the position, corresponding to the highest desired value for y scale in every case. So heights of scales can be manually adjusted for each facet. Here is the solution for the dataset in question. We want y scale (maximum y value) 5 for the first row, and 17 for the second row. So create
df3=data.frame(newvar = rep("xx",8),
period = c("A","B","C","D","E","F","G","H"),
newvalue = c(5,5,5,5,17,17,17,17))
And now superimpose the new data on our graph using geom_point() .
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+
geom_bar(stat ="identity") +
facet_wrap(~period,nrow=2,scales="free_y")+
geom_point(data=df3,aes(x=newvar,y=newvalue),alpha=1)
Here is what we get:
Here I intentionally draw this extra point to make things clear. Next we need to make it invisible, which can be achieved by setting alpha=0 instead of 1 in the last command.
This approach draws an invisible line at the maximum for each row
#loading necessary libraries and example data
library(dplyr)
library(tidyr)
library(ggplot2)
trial.facets<-read.csv(text="period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")
# define desired number of columns
n_col <- 4
# assign a row number: integer division of the position by the number of columns
trial.facets$row <- seq(0, nrow(trial.facets)-1) %/% n_col
# determine the max by row, and round up to nearest multiple of 5
# join back to original
trial.facets.max <- trial.facets %>%
group_by(row) %>%
summarize(maxvalue = (1 + max(xx, yy) %/% 5) * 5 )
trial.facets <- trial.facets %>% inner_join(trial.facets.max)
# make long format carrying period, row and maxvalue
trial.facets.tidied<-trial.facets %>% gather(key=newvar,value=newvalue,-period,-row,-maxvalue)
# plot an invisible line at the max
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+
geom_bar(stat ="identity") +
geom_hline(aes(yintercept=maxvalue), alpha = 0) +
facet_wrap(~period,ncol=n_col,scales="free")

Transform a ggplot stacked bar into pie chart or alternative

I am having trouble deciding how to graph the data I have.
It consists of overlapping quantities that represent a population, hence my decision to use a stacked bar.
These represent six population divisions ("groups") wherein group 1 and group 2 are the main division. Groups 4 to 6 are subgroups of two, and these are subgroups of each other. Its simple diagram is below:
Note: groups 1 and 2 complete the entire population or group 1 + group 2 = 100%.
I want all of this information in one chart, but I do not know which chart type to use or how to implement it.
So far I have the one below, which is wrong because Group 1 is included in the main bar.
require(ggplot2)
require(reshape)
tab <- data.frame(
set=c("XXX","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
dat <- melt(tab)
dat$time <- factor(dat$group,levels=dat$group)
ggplot(dat,aes(x=set)) +
geom_bar(aes(weight=value,fill=group),position="fill",color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd")
What do you guys suggest to visualize it? I want to use R and ggplot for consistency and uniformity with the other graphs I have made already.
Using facets you can divide your plot into two:
# changed value of set for group 1
tab <- data.frame(
set=c("UUU","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
# explicitly defined id.vars
dat <- melt(tab, id.vars=c('set','group'))
dat$time <- factor(dat$group,levels=dat$group)
# added facet_wrap, in geom_bar aes changed weight to y,
# added stat="identity", changed position="stack"
ggplot(dat,aes(x=set)) +
geom_bar(aes(y=value,fill=group),position="stack", stat="identity", color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd") +
facet_wrap(~set, scales="free_x")
My guess is that what you need is a treemap. Please correct me if I misunderstood your question.
Background reading: Treemapping.
If a treemap is what you need, you can use either the portfolio package or googleVis.

Ordering bar plots with ggplot2 according to their size, i.e. numerical value

This question asks about ordering a bar graph according to an unsummarized table. I have a slightly different situation. Here's part of my original data:
experiment,pvs_id,src,hrc,mqs,mcs,dmqs,imcs
dna-wm,0,7,9,4.454545454545454,1.4545454545454546,1.4545454545454541,4.3939393939393945
dna-wm,1,7,4,2.909090909090909,1.8181818181818181,0.09090909090909083,3.9090909090909087
dna-wm,2,7,1,4.818181818181818,1.4545454545454546,1.8181818181818183,4.3939393939393945
dna-wm,3,7,8,3.4545454545454546,1.5454545454545454,0.4545454545454546,4.272727272727273
dna-wm,4,7,10,3.8181818181818183,1.9090909090909092,0.8181818181818183,3.7878787878787876
dna-wm,5,7,7,3.909090909090909,1.9090909090909092,0.9090909090909092,3.7878787878787876
dna-wm,6,7,0,4.909090909090909,1.3636363636363635,1.9090909090909092,4.515151515151516
dna-wm,7,7,3,3.909090909090909,1.7272727272727273,0.9090909090909092,4.030303030303029
dna-wm,8,7,11,3.6363636363636362,1.5454545454545454,0.6363636363636362,4.272727272727273
I only need a few variables from this, namely mqs and imcs, grouped by their pvs_id, so I create a new table:
m = melt(t, id.var="pvs_id", measure.var=c("mqs","imcs"))
I can plot this as a bar graph where one can see the correlation between MQS and IMCS.
ggplot(m, aes(x=pvs_id, y=value)) +
geom_bar(aes(fill=variable), position="dodge", stat="identity")
However, I'd like the resulting bars to be ordered by the MQS value, from left to right, in decreasing order. The IMCS values should be ordered with those, of course.
How can I accomplish that? Generally, given any molten dataframe — which seems useful for graphing in ggplot2 and today's the first time I've stumbled over it — how do I specify the order for one variable?
It's all in making pvs_id a factor and supplying the appropriate levels to it:
dat$pvs_id <- factor(dat$pvs_id, levels = dat[order(-dat$mqs), 2])
m = melt(dat, id.var="pvs_id", measure.var=c("mqs","imcs"))
ggplot(m, aes(x=pvs_id, y=value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity")
This produces the following plot:
EDIT:
Since pvs_id was numeric, it is placed on the axis in order of its value, whereas a factor is plotted in the order of its levels. So even though it has numeric labels, pvs_id is really a nominal variable and should be a factor. As for dat[order(-dat$mqs), 2]: the order function with a negative sign sorts the data frame from largest to smallest along the variable mqs. You are interested in that order for the pvs_id variable, so you index that column, which is the second column. If you take it apart you'll see it gives you:
> dat[order(-dat$mqs), 2]
[1] 6 2 0 5 7 4 8 3 1
Now you supply that to the levels argument of factor and this orders the factor as you want it.
With newer tidyverse functions, this becomes much more straightforward (or at least, easy to read for me):
library(tidyverse)
d %>%
mutate_at("pvs_id", as.factor) %>%
mutate(pvs_id = fct_reorder(pvs_id, mqs)) %>%
gather(variable, value, c(mqs, imcs)) %>%
ggplot(aes(x = pvs_id, y = value)) +
geom_col(aes(fill = variable), position = position_dodge())
What it does is:
create a factor if not already
reorder it according to mqs (you may use desc(mqs) for reverse-sorting)
gather into individual rows (same as melt)
plot as geom_col (same as geom_bar with stat="identity")
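Since the question asks for decreasing order, here is a small sketch of the reordering step on its own, using the .desc argument of fct_reorder as an alternative to desc(mqs). The data frame reuses the pvs_id and mqs columns from the sample in the question, with mqs rounded to two decimals for brevity.

```r
library(forcats)

# pvs_id and mqs from the question's sample data (mqs rounded)
d <- data.frame(
  pvs_id = 0:8,
  mqs    = c(4.45, 2.91, 4.82, 3.45, 3.82, 3.91, 4.91, 3.91, 3.64)
)

# fct_reorder() needs a factor (or character), not a number, hence factor();
# .desc = TRUE puts the level with the largest mqs first
d$pvs_id <- fct_reorder(factor(d$pvs_id), d$mqs, .desc = TRUE)

levels(d$pvs_id)
```

The level order is what ggplot2 uses for the x axis, so the bar for pvs_id 6 (the highest mqs) comes first and the bar for pvs_id 1 (the lowest) comes last.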
