I'm trying to understand the default behavior of ggplot2::facet_wrap(), in terms of how the panel layout is decided as the number of facets increases.
I've read the ?facet_wrap help file, and also googled this topic with limited success. In one SO post, facet_wrap() was said to "return a symmetrical matrix of plots", but I did not find anything that explained what exactly the default behavior would be.
So next I made a series of plots which had increasing numbers of facets (code shown further down).
The pattern in the image makes it seem like facet_wrap() tries to "make a square"...
Questions
Is that correct? Does facet_wrap try to render the facet panels so in totality they are most like a square, in terms of the number of elements in the rows and columns?
If not, what is it actually doing? Do graphical parameters factor in?
Code that made the plot
# load libraries
library(ggplot2)
library(ggpubr)
# plotting function
facetPlots <- function(facets, groups = 8){
# sample data
df <- data.frame(Group = sample(LETTERS[1:groups], 1000, replace = T),
Value = sample(1:10000, 1000, replace = T),
Facet = factor(sample(1:facets, 1000, replace = T)))
# get means
df <- aggregate(list(Value = df$Value),
list(Group = df$Group, Facet = df$Facet), mean)
# plot
p1 <- ggplot(df, aes(x= Group, y= Value, fill = Group))+
geom_bar(stat="identity", show.legend = FALSE)+
facet_wrap(. ~ Facet) +
theme_bw()+
theme(strip.text.x = element_text(size = 6,
margin = margin(.1, 0, .1, 0, "cm")),
axis.text.x=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
plot.margin = unit(c(3,3,3,3), "pt"))
p1
}
# apply function to list
plot_list <- lapply(c(1:25), facetPlots)
# unify into single plot
plot <- ggpubr::ggarrange(plotlist = plot_list)
Here is how the default numbers of rows and columns are calculated:
ncol <- ceiling(sqrt(n))
nrow <- ceiling(n/ncol)
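Apparently, facet_wrap tends to prefer wider grids, since "most displays are roughly rectangular" (according to the documentation). Hence, the number of columns is greater than or equal to the number of rows. Note that the formula is only an approximation for small n: as far as I can tell from the ggplot2 source, wrap_dims() takes its default layout from grDevices::n2mfrow() with rows and columns swapped, so three facets, for example, are drawn in a single row of three rather than in a 2 x 2 grid.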
For your example:
n <- c(1:25)
ncol <- ceiling(sqrt(n))
nrow <- ceiling(n/ncol)
data.frame(n, ncol, nrow)
Here are the computed numbers of rows/cols:
# n ncol nrow
# 1 1 1
# 2 2 1
# 3 2 2
# 4 2 2
# 5 3 2
# 6 3 2
# 7 3 3
# 8 3 3
# 9 3 3
# 10 4 3
# 11 4 3
# 12 4 3
# 13 4 4
# 14 4 4
# 15 4 4
# 16 4 4
# 17 5 4
# 18 5 4
# 19 5 4
# 20 5 4
# 21 5 5
# 22 5 5
# 23 5 5
# 24 5 5
# 25 5 5
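The formula can be wrapped in a small function and checked against the two properties claimed above (enough cells for every panel, and at least as many columns as rows). This is only a sketch: formula_dims is a hypothetical helper name, not part of ggplot2, and the final call to base R's grDevices::n2mfrow() is included purely for comparison at n = 3.

```r
# Hypothetical helper: the layout implied by the formula in this answer
formula_dims <- function(n) {
  ncol <- ceiling(sqrt(n))
  nrow <- ceiling(n / ncol)
  c(nrow = nrow, ncol = ncol)
}

n <- 1:25
dims <- t(sapply(n, formula_dims))  # 25 x 2 matrix with columns nrow, ncol

# Every layout has room for all panels and is at least as wide as it is tall
stopifnot(all(dims[, "nrow"] * dims[, "ncol"] >= n))
stopifnot(all(dims[, "ncol"] >= dims[, "nrow"]))

# Base R's layout helper returns c(rows, cols); for 3 plots it gives 3 x 1
grDevices::n2mfrow(3)
```

If an exact layout matters, passing nrow or ncol to facet_wrap() explicitly avoids relying on the default.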
Related
I want to accentuate the area in the faceted density plots above the measure threshold of 2 (e.g., red shading for x >= 2). This solution works well for a single facet factor, but I have two factors. How do I specify the levels for the two factors when using ggplot_build? Or do I need to use a different approach?
Here's a bit of the dataframe (the dataframe is 750 rows):
mode.f task.f mgds
1 1 A 1.1413636
2 1 A 0.9105000
3 2 A 1.0320000
4 2 A 1.1811429
14 1 C 1.4646000
15 1 C 1.7505000
16 2 C 1.3968000
17 1 D 1.0668333
18 1 D 1.0084000
19 1 D 1.1622500
20 2 D 1.3452500
21 2 D 1.0132000
22 3 C 0.6960000
23 3 C 0.9180000
24 3 D 1.0128000
25 3 D 0.6670000
26 2 E 2.9190000
27 2 E 1.3755000
28 2 E 1.4080000
29 1 E 1.3878000
30 1 E 1.4816667
Here's the code that works for a single facet factor:
mp <- ggplot(df,aes(x=mgds))+
geom_density(color=NA,fill="gray30",alpha=.4)+
facet_wrap(~mode.f)+
theme_bw()+
theme(strip.background =
element_rect(fill="gray95",color="gray60"),
strip.text = element_text(colour="black",size=10),
panel.border = element_rect(color="gray60"))+
labs(x="MGD (s)",y="Density")
# build the plot once and reuse the computed density layer
built <- ggplot_build(mp)$data[[1]]
# data.frame() replaces the deprecated data_frame()
to_fill <- data.frame(
x = built$x,
y = built$y,
mode.f = factor(built$PANEL, levels =
c(1,2,3), labels = c("1","2","3")))
mp + geom_area(data = to_fill[to_fill$x >= 2, ],
aes(x=x, y=y), fill = "red")
Here's the code for the facet_grid plots in which I want the area beyond the x=2 threshold to be a different color:
ggplot(df,aes(x=mgds))+
geom_density(color=NA,fill="gray30",alpha=.4)+
facet_grid(mode.f~task.f)+
theme_bw()+
theme(strip.background = element_rect(fill="gray95",color="gray60"),
strip.text = element_text(colour="black",size=10),
panel.border = element_rect(color="gray60"))+
geom_vline(xintercept=2,linetype="longdash",color="gray50")+
labs(x="Measure",y="Density")
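One possible way around ggplot_build for two facet factors (a sketch, not taken from the original post) is to compute the density curves per mode.f x task.f combination yourself and shade the part at or beyond the threshold with a second geom_area. The toy data below are made up to stand in for the real data frame, and curves from density() with default settings may differ slightly from what geom_density draws.

```r
# Toy data standing in for the 750-row data frame (values are made up)
set.seed(1)
df <- data.frame(
  mode.f = factor(rep(1:3, each = 100)),
  task.f = factor(rep(c("A", "C", "D", "E"), times = 75)),
  mgds   = rgamma(300, shape = 4, rate = 3)
)

# One density curve per mode.f x task.f combination
groups <- split(df, list(df$mode.f, df$task.f), drop = TRUE)
dens <- do.call(rbind, lapply(groups, function(g) {
  d <- density(g$mgds)
  data.frame(mode.f = g$mode.f[1], task.f = g$task.f[1], x = d$x, y = d$y)
}))

# Plot only if ggplot2 is available; the red layer covers x >= 2
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dens, aes(x = x, y = y)) +
    geom_area(fill = "gray30", alpha = 0.4) +
    geom_area(data = dens[dens$x >= 2, ], fill = "red") +
    facet_grid(mode.f ~ task.f)
}
```

Because both facet variables are ordinary columns of dens, no panel-to-level mapping is needed.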
I have a histogram generated from the obsoleteness column inside my.data:
his.obsoleteness <- (ggplot(my.data, aes(x=obsoleteness))
+ geom_histogram(bins = 15,fill = "white", color="Red" )
+ stat_bin(aes(y=..count.., label=..count..), geom="text",bins = 15))
I want to add a new column to my.data that represents the "bucket number". For example, if a row is in the first (smallest) bucket, the column should have the value 1; if it's in the second bucket, the value should be 2; and so on. Is there an easy way of doing this other than manually assigning the value for every bucket interval?
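Histogram bins have equal width, whereas the quantile breaks used here give bins with roughly equal counts. The two agree approximately for uniformly distributed data such as this sample, so you could try:
# sample data
my.data <- data.frame(obsoleteness = sample(1:10000, 1000, replace = TRUE))
bins <- 15
# rightmost.closed = TRUE keeps the maximum in bin 15 instead of a 16th bin
my.data$bin <- findInterval(my.data$obsoleteness,
quantile(my.data$obsoleteness,
probs = seq(0, 1, by = 1/bins)),
rightmost.closed = TRUE)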
# > head(my.data,10)
# obsoleteness bin
#1 4101 7
#2 1702 3
#3 6354 10
#4 7710 12
#5 3575 6
#6 2686 5
#7 6598 10
#8 6983 11
#9 9414 15
#10 9431 15
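If equal-width buckets are what you need, as in a histogram, base R's cut() can produce them directly. This is a minimal sketch; bin_width is a column name chosen for illustration, and the exact boundaries ggplot2 uses for bins = 15 may differ slightly from the ones cut() picks.

```r
set.seed(42)
my.data <- data.frame(obsoleteness = sample(1:10000, 1000, replace = TRUE))
bins <- 15

# A single number for `breaks` splits the data range into that many
# equal-width intervals; labels = FALSE returns the bin index (1..bins)
my.data$bin_width <- cut(my.data$obsoleteness, breaks = bins, labels = FALSE)

# Number of rows per bucket
table(my.data$bin_width)
```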
I was looking for a clear explanation of the 'labels are constructed using "(a,b]" interval notation' statement in the cut help file, which seemed to lack one.
So I tested cut on some simple examples as follows:
df <- data.frame(c(1,2,3,4,5,6,7,99))
names(df) <- 'x'
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE)
df
x cut
# 1 <NA>
# 2 <NA>
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So '(' means x > the break on the left and ']' means x <= the (next) break on the right. A value at or below the lowest break is flagged as NA; similarly, a value that exceeds the highest break is also flagged as NA.
Next testing the option include.lowest = TRUE
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE, include.lowest = TRUE)
df
x cut
# 1 <NA>
# 2 [2,4]
# 3 [2,4]
# 4 [2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So here, for the first bin between the first two breaks, the '[' on the left means >= (first break) and the ']' means <= (second break). Subsequent bins are treated as above.
Next the NA values can be addressed by using -Inf and/or +Inf in the breaks as follows:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE)
df
x cut
# 1 [-Inf,2]
# 2 [-Inf,2]
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 (8, Inf]
Setting right = FALSE flips which end of each interval is closed: intervals become closed on the left and open on the right, [a,b), as in the example below:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = FALSE)
df
# x cut
# 1 [-Inf,2)
# 2 [2,4)
# 3 [2,4)
# 4 [4,6)
# 5 [4,6)
# 6 [6,8)
# 7 [6,8)
# 99 [8, Inf)
Finally the labels option allows custom names for the thresholds should you so wish ...
lbls <- c('x<=2','2<x<=4','4<x<=6','6<x<=8','x>8')
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE, labels = lbls)
df
x cut
# 1 x<=2
# 2 x<=2
# 3 2<x<=4
# 4 2<x<=4
# 5 4<x<=6
# 6 4<x<=6
# 7 6<x<=8
# 99 x>8
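The boundary behaviour described above can be summarised by looking only at values that sit exactly on a break. This is a small sketch with made-up inputs; the comments show the labels each call returns.

```r
x <- c(2, 4)            # both values lie exactly on a break
breaks <- c(2, 4, 6)

# Default (right = TRUE): intervals are (a,b], so 2 falls outside
as.character(cut(x, breaks))                          # NA      "(2,4]"

# include.lowest = TRUE closes the first interval on the left
as.character(cut(x, breaks, include.lowest = TRUE))   # "[2,4]" "[2,4]"

# right = FALSE: intervals are [a,b), so 4 moves to the next bin
as.character(cut(x, breaks, right = FALSE))           # "[2,4)" "[4,6)"
```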
I am trying to adjust the colour scale of a geom_tile plot.
A short version of my data (in data.frame format) is:
mydat <-
Sc K n minC
A 2 1 NA
A 2 2 37.453023
A 2 3 23.768316
A 2 4 17.628376
A 3 1 NA
A 3 2 12.693124
A 3 3 8.884226
A 3 4 7.436250
A 10 1 2.128121
A 10 2 2.116539
A 10 3 2.737923
A 10 4 3.509773
A 20 1 1.104592
A 20 2 1.840195
A 20 3 2.717198
A 20 4 3.616501
B 2 1 NA
B 2 2 25.090085
B 2 3 15.924186
B 2 4 11.811022
B 3 1 NA
B 3 2 8.827183
B 3 3 6.179484
B 3 4 5.175331
B 10 1 2.096934
B 10 2 2.064984
B 10 3 2.662373
B 10 4 3.407246
B 20 1 1.096871
B 20 2 1.802418
B 20 3 2.649153
B 20 4 3.517776
My code to prepare the data to plot is the following:
mydat$Sc <- factor(mydat$Sc, levels =c("A", "B"))
mydat$K <- factor(mydat$K, levels =c("2", "3","10","20"))
library(reshape2) # provides melt()
mydat.m <- melt(mydat,id.vars=c("Sc","K","n"), measure.vars=c("minC"))
I want to plot with geom_tile the value of minC with K and n as axis and different facets for Sc with the following:
mydat.m.p <- ggplot(mydat.m, aes(x=n, y=K))
mydat.m.p +
geom_tile(data=mydat.m, aes(fill=value)) +
scale_fill_gradient(low="palegreen", high="lightcoral") +
facet_wrap(~ Sc, ncol=2)
This gives me a plot for each Sc factor. However, the colour scale does not reflect what I want to portray, because a few high values make all the low values look the same.
I want to adjust to a relevant scale in 4 breaks, i.e., 1-2, 2-3, 3-5, >5.
Looking at other questions there was a suggestion to use the cut function and scale fill manual as:
mydat.m$value1 <- cut(mydat.m$value, breaks = c(1:5, Inf), right = FALSE)
Then use the following in geom_tile:
scale_fill_manual(breaks = c("[1,2)", "[2,3)", "[3,5)", "[5, Inf)"),
values = c("darkgreen", "palegreen", "lightcoral", "red"))
However, I am not sure how this can be applied to a data.frame with other factors and in long format.
You're almost there. Simply use cut before melting:
mydat$minC.cut <- cut(mydat$minC, breaks = c(1:3, 5, Inf), right = FALSE)
mydat.cut <- melt(mydat, id.vars=c("Sc", "K", "n"), measure.vars=c("minC.cut"))
Now, you don't need to specify breaks since we took care of that already.
ggplot(mydat.cut, aes(x=n, y=K)) +
geom_tile(aes(fill=value)) +
facet_wrap(~ Sc, ncol=2) +
scale_fill_manual(values = c("darkgreen", "palegreen", "lightcoral", "red"))
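To see what cut() does with these breaks before plotting, it can help to try it on a handful of values. The sketch below uses a few rounded minC values from the question; fill_cols is a hypothetical name, and naming the colour vector by level is just one way to keep colours and bins aligned.

```r
# A few rounded minC values from the question, including a missing one
minC <- c(NA, 37.45, 2.13, 1.10, 3.51, 8.88)

# Four left-closed bins: [1,2), [2,3), [3,5) and 5 or more
binned <- cut(minC, breaks = c(1:3, 5, Inf), right = FALSE)
table(binned, useNA = "ifany")

# Naming the colours by level ties each colour to its bin explicitly;
# a named vector like this can be passed as `values` to scale_fill_manual()
fill_cols <- setNames(c("darkgreen", "palegreen", "lightcoral", "red"),
                      levels(binned))
```

NA values stay NA after cut(), so tiles with missing minC get the scale's na.value colour rather than one of the four bins.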
I have to find out the cumulative frequency, converted to percentage, of a continuous variable by factor.
For example:
data <- data.frame(n = sample(1:12),
d = seq(10, 120, by = 10),
Site = rep(c("FirstSite", "SecondSite"), 6),
Plot = rep(c("Plot1", "Plot1", "Plot2", "Plot2"), 3)
)
data <- with(data, data[order(Site,Plot),])
data <- transform(data, G = ((pi * (d/2)^2) * n) / 10000)
data
n d Site Plot G
1 7 10 FirstSite Plot1 0.05497787
5 9 50 FirstSite Plot1 1.76714587
9 12 90 FirstSite Plot1 7.63407015
3 10 30 FirstSite Plot2 0.70685835
7 5 70 FirstSite Plot2 1.92422550
11 1 110 FirstSite Plot2 0.95033178
2 3 20 SecondSite Plot1 0.09424778
6 8 60 SecondSite Plot1 2.26194671
10 6 100 SecondSite Plot1 4.71238898
4 4 40 SecondSite Plot2 0.50265482
8 2 80 SecondSite Plot2 1.00530965
12 11 120 SecondSite Plot2 12.44070691
I need the cumulative frequency of column G by the factors Plot ~ Site in order to plot a geom_step ggplot of G against d for each plot and site.
I have managed to compute the cumulative sum of G by factor with:
data.ss <- by(data[, "G"], data[,c("Plot", "Site")], function(x) cumsum(x))
# Gtot
(data.ss.tot <- sapply(data.ss, max))
[1] 9.456194 3.581416 7.068583 13.948671
Now I need to express each Plot G in the range [0..1] where 1 is Gtot for each Plot. I imagine I should divide G by its Plot Gtot, then apply a new cumsum to it. How to do it?
Please note that I have to plot this cumulative frequency against d not G itself, so it is not a proper ecdf.
Thank you.
I usually use ddply (from the plyr package) and transform to do this type of thing:
> library(plyr)
> library(ggplot2)
> data = ddply(data, c('Site', 'Plot'), transform, Gsum=cumsum(G), Gtot=sum(G))
> qplot(x=d, y=Gsum/Gtot, facets=Plot~Site, geom='step', data=data)
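The same cumulative fraction can also be computed without plyr. The sketch below uses base R's ave(); Gfrac is a column name chosen for illustration, and the data are regenerated with a fixed seed, so the numbers will not match the table in the question.

```r
set.seed(1)
data <- data.frame(n = sample(1:12),
                   d = seq(10, 120, by = 10),
                   Site = rep(c("FirstSite", "SecondSite"), 6),
                   Plot = rep(c("Plot1", "Plot1", "Plot2", "Plot2"), 3))

# Sort by d within each group so the cumulative sum runs along d
data <- data[order(data$Site, data$Plot, data$d), ]
data$G <- (pi * (data$d / 2)^2 * data$n) / 10000

# ave() applies FUN within each Site x Plot group and keeps the row order
data$Gfrac <- ave(data$G, data$Site, data$Plot,
                  FUN = function(g) cumsum(g) / sum(g))
```

Gfrac then runs from just above 0 to 1 within each Site and Plot combination and can be plotted against d with geom_step.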