ggplot: remove NA factor level in legend - r

How can I omit the NA level of a factor from a legend?
From the nycflights13 database, I created a new continuous variable called tot_delay, and then created a factor called delay_class with 4 levels. When I plot, I filter out NA values, but they still appear in the legend. Here's my code:
library(nycflights13); library(ggplot2)
library(dplyr)  # provides filter() and %>%
flights$tot_delay = flights$dep_delay + flights$arr_delay
flights$delay_class <- cut(flights$tot_delay,
c(min(flights$tot_delay, na.rm = TRUE), 0, 20 , 120,
max(flights$tot_delay, na.rm = TRUE)),
labels = c("none", "short","medium","long"))
filter(flights, !is.na(tot_delay)) %>%
ggplot() +
geom_bar(mapping = aes(x = carrier, fill = delay_class), position = "fill")

The parent example isn't a good illustration of the problem (of course unexpected NA values should be tracked down and eliminated), but since this is the top result on Google, it should be noted that there is now an option in scale_XXX_XXX to prevent NA levels from displaying in the legend: set na.translate = F. For example:
# default
ggplot(data = data.frame(x = c(1,2,NA), y = c(1,1,NA), a = c("A","B",NA)),
aes(x, y, colour = a)) + geom_point(size = 4)
# with na.translate = F
ggplot(data = data.frame(x = c(1,2,NA), y = c(1,1,NA), a = c("A","B",NA)),
aes(x, y, colour = a)) + geom_point(size = 4) +
scale_colour_discrete(na.translate = F)
This works in ggplot2 3.1.0.

You have one data point where delay_class is NA, but tot_delay isn't. This point is not being caught by your filter. Changing your code to:
filter(flights, !is.na(delay_class)) %>%
ggplot() +
geom_bar(mapping = aes(x = carrier, fill = delay_class), position = "fill")
does the trick:
Alternatively, if you absolutely must have that extra point, you can override the fill legend as follows:
filter(flights, !is.na(tot_delay)) %>%
ggplot() +
geom_bar(mapping = aes(x = carrier, fill = delay_class), position = "fill") +
scale_fill_manual( breaks = c("none","short","medium","long"),
values = scales::hue_pal()(4) )
UPDATE: As pointed out in @gatsky's answer, all discrete scales also include the na.translate argument. The feature has actually existed since ggplot2 2.2.0; I just wasn't aware of it when I posted my answer. For completeness, its usage in the original question would look like this:
filter(flights, !is.na(tot_delay)) %>%
ggplot() +
geom_bar(mapping = aes(x = carrier, fill = delay_class), position = "fill") +
scale_fill_discrete(na.translate=FALSE)
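A likely explanation for the stray NA described in this answer is that cut() builds intervals that are open on the left by default, so a value equal to the lowest break falls outside every interval. The following is a minimal sketch on a made-up vector (the values in delays are illustrative, not taken from flights) comparing the default behaviour with cut()'s include.lowest = TRUE argument:

```r
# Hypothetical delays; the minimum (-30) is also used as the lowest break
delays <- c(-30, -5, 10, 60, 200)
breaks <- c(min(delays), 0, 20, 120, max(delays))
labels <- c("none", "short", "medium", "long")

# Default: intervals are (lo, hi], so the minimum itself is classified as NA
default_class <- cut(delays, breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [lo, hi]
closed_class <- cut(delays, breaks, labels = labels, include.lowest = TRUE)

print(data.frame(delays, default_class, closed_class))
```

If this is indeed the cause, adding include.lowest = TRUE when building delay_class would remove the NA at its source instead of hiding it in the legend.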

I like @Artem's method above, i.e., getting to the bottom of why there are NAs in your data frame. However, sometimes you know there are NAs and you just want to exclude them. In that case, simply using na.omit should work:
na.omit(flights) %>% ggplot() +
geom_bar(mapping = aes(x = carrier, fill = delay_class), position = "fill")
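One caveat with na.omit() is that it drops rows with a missing value in any column, which can discard more data than intended. A small sketch on a toy data frame (toy and its columns are made up for illustration) comparing it with a filter on a single column:

```r
# Toy data: delay_class has one NA, and an unrelated column has another
toy <- data.frame(
  carrier     = c("AA", "UA", "DL", "AA"),
  delay_class = c("none", NA, "short", "long"),
  tailnum     = c("N1", "N2", NA, "N4")
)

all_complete <- na.omit(toy)                    # drops rows 2 and 3
class_known  <- toy[!is.na(toy$delay_class), ]  # drops row 2 only

nrow(all_complete)
nrow(class_known)
```

For a wide table such as flights, filtering on the plotted column alone is probably the more conservative choice.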

Related

How can I change the size of a bar in a grouped bar chart when one group has no data? [duplicate]

Is there a way to set a constant width for geom_bar() in the event of missing data in the time series example below? I've tried setting width in aes() with no luck. Compare May '11 to June '11 width of bars in the plot below the code example.
colours <- c("#FF0000", "#33CC33", "#CCCCCC", "#FFA500", "#000000" )
iris$Month <- rep(seq(from=as.Date("2011-01-01"), to=as.Date("2011-10-01"), by="month"), 15)
d<-aggregate(iris$Sepal.Length, by=list(iris$Month, iris$Species), sum)
d$quota<-seq(from=2000, to=60000, by=2000)
colnames(d) <- c("Month", "Species", "Sepal.Width", "Quota")
d$Sepal.Width<-d$Sepal.Width * 1000
g1 <- ggplot(data=d, aes(x=Month, y=Quota, color="Quota")) + geom_line(size=1)
g1 + geom_bar(data=d[c(-1:-5),], aes(x=Month, y=Sepal.Width, width=10, group=Species, fill=Species), stat="identity", position="dodge") + scale_fill_manual(values=colours)
Some new options for position_dodge() and the new position_dodge2(), both introduced in ggplot2 3.0.0, can help.
You can use preserve = "single" in position_dodge() to base the widths off a single element, so the widths of all bars will be the same.
ggplot(data = d, aes(x = Month, y = Quota, color = "Quota")) +
geom_line(size = 1) +
geom_col(data = d[c(-1:-5),], aes(y = Sepal.Width, fill = Species),
position = position_dodge(preserve = "single") ) +
scale_fill_manual(values = colours)
Using position_dodge2() changes the way things are centered, centering each set of bars at each x axis location. It has some padding built in, so use padding = 0 to remove.
ggplot(data = d, aes(x = Month, y = Quota, color = "Quota")) +
geom_line(size = 1) +
geom_col(data = d[c(-1:-5),], aes(y = Sepal.Width, fill = Species),
position = position_dodge2(preserve = "single", padding = 0) ) +
scale_fill_manual(values = colours)
The easiest way is to supplement your data set so that every combination is present, even if it has NA as its value. Taking a simpler example (as yours has a lot of unneeded features):
dat <- data.frame(a=rep(LETTERS[1:3],3),
b=rep(letters[1:3],each=3),
v=1:9)[-2,]
ggplot(dat, aes(x=a, y=v, colour=b)) +
geom_bar(aes(fill=b), stat="identity", position="dodge")
This shows the behavior you are trying to avoid: in group "B", there is no group "a", so the bars are wider. Supplement dat with a dataframe with all the combinations of a and b:
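# unique() rather than levels(): since R 4.0.0, data.frame() keeps strings as character, so levels() would return NULL
dat.all <- rbind(dat, cbind(expand.grid(a=unique(dat$a), b=unique(dat$b)), v=NA))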
ggplot(dat.all, aes(x=a, y=v, colour=b)) +
geom_bar(aes(fill=b), stat="identity", position="dodge")
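Note that the rbind() approach appends an NA row for every combination, including those that already have a value. A variant that only fills the gaps, sketched here with base R's merge() (an alternative construction, not the answer's original code; dat.filled is a name introduced for illustration), keeps exactly one row per combination:

```r
dat <- data.frame(a = rep(LETTERS[1:3], 3),
                  b = rep(letters[1:3], each = 3),
                  v = 1:9)[-2, ]

# Every combination of a and b, kept as character columns
grid <- expand.grid(a = unique(dat$a), b = unique(dat$b),
                    stringsAsFactors = FALSE)

# Left join: combinations that are missing from dat get v = NA
dat.filled <- merge(grid, dat, by = c("a", "b"), all.x = TRUE)
dat.filled
```

The result can be passed to ggplot() in place of dat.all.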
I had the same problem but was looking for a solution that works with the pipe (%>%). Using tidyr::spread and tidyr::gather from the tidyverse does the trick. I use the same data as @Brian Diggs, but with uppercase variable names to avoid duplicate variable names when transforming to wide:
library(tidyverse)
dat <- data.frame(A = rep(LETTERS[1:3], 3),
B = rep(letters[1:3], each = 3),
V = 1:9)[-2, ]
dat %>%
spread(key = B, value = V, fill = NA) %>% # turn data to wide, using fill = NA to generate missing values
gather(key = B, value = V, -A) %>% # go back to long, with the missings
ggplot(aes(x = A, y = V, fill = B)) +
geom_col(position = position_dodge())
Edit:
There is actually an even simpler solution to that problem in combination with the pipe. Using tidyr::complete gives the same result in one line:
dat %>%
complete(A, B) %>%
ggplot(aes(x = A, y = V, fill = B)) +
geom_col(position = position_dodge())

Perform transformation inside ggplot2 function to produce negative values

With the code below I can produce a stacked bar chart, as shown in the first graph.
library(ggplot2)
vehicle<- sample(rep(c("Cars","Cycles","Motobike"),times=c(20,50,30)))
team<-sample(rep(c("TeamA","TeamB"),times=c(50,50)))
df<-data.frame(team,vehicle, stringsAsFactors = FALSE)
ggplot(data = df, aes(x = as.factor (vehicle), fill =team)) +
geom_bar(mapping = aes(y = stat(count)/sum(..count..)*100),
position = "stack")
What I want to do is to produce a transformation within the geom_bar(mapping = aes(y = stat(count)/sum(..count..)*100),position = "stack") part that says if it is team B, then the count becomes a minus number. I want to do this so I can reproduce something like the 2nd graph.
Can anyone help amend the code to get the desired result?
Note: the second graph is created using the code below but I don't want to have to add two separate geom_bars because it means the % is incorrect on the y axis.
ggplot(data = df, aes(x = as.factor (vehicle), fill =team)) +
geom_bar(data = subset(df, team=="TeamA"),
mapping = aes(y = stat(count)/sum(..count..)*100),
position = "stack")+
geom_bar(data = subset(df, team=="TeamB"),
mapping = aes(y = - stat(count)/sum(..count..)*100),
position = "stack") +
labs(x = "", y="")
I think it's easier to prepare the data before you feed it into ggplot. I realize the numbers don't quite match up here but I'll let you deal with that.
library(tidyverse)
library(ggplot2)
vehicle<- sample(rep(c("Cars","Cycles","Motobike"),times=c(20,50,30)))
team<-sample(rep(c("TeamA","TeamB"),times=c(50,50)))
df<-data.frame(team,vehicle, stringsAsFactors = FALSE) %>%
group_by(team, vehicle) %>%
summarize(count = n()) %>%
mutate(newcount = if_else(team == 'TeamA', count, -count))
ggplot(data = df, aes(x = as.factor(vehicle), y = newcount, fill =team)) +
geom_bar(position = "stack", stat ='identity')
I managed to do it by using an ifelse directly in the function which achieved what I was after.
set.seed (105)
vehicle<- sample(rep(c("Cars","Cycles","Motorbike"),times=c(20,50,30)))
team<-sample(rep(c("TeamA","TeamB"),times=c(50,50)))
df<-data.frame(team,vehicle, stringsAsFactors = FALSE)
ggplot(data = df, aes(x = as.factor (vehicle), fill =team,
y= ifelse(test = team == "TeamB",
yes = -1/nrow(df)*100, no = 1/nrow(df)*100)))+
geom_bar(stat="identity")
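The sign flip can also be checked outside ggplot. A minimal base R sketch, reusing the df construction from this answer (the name signed_pct is introduced here for illustration), that computes each vehicle/team share of all rows and negates TeamB:

```r
set.seed(105)
vehicle <- sample(rep(c("Cars", "Cycles", "Motorbike"), times = c(20, 50, 30)))
team <- sample(rep(c("TeamA", "TeamB"), times = c(50, 50)))
df <- data.frame(team, vehicle, stringsAsFactors = FALSE)

# Percentage of all rows that fall in each vehicle/team cell
signed_pct <- 100 * table(df$vehicle, df$team) / nrow(df)

# Negate TeamB so that its bars would point downwards
signed_pct[, "TeamB"] <- -signed_pct[, "TeamB"]
signed_pct
```

Because every cell is divided by nrow(df), the absolute values still add up to 100, which is the property the question asked to keep on the y axis.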

How to level a ggplot2 histogram with two classes, with independent levels for each class?

Suppose I have this data:
xy <- data.frame(x = c(1,2,3,4,5,2,3,4), y = c(rep('A',5), rep('B',3)))
So, when I type
ggplot(xy, aes(x = x, fill = y)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge")
I get this graphic:
But I wanted to see the levels independently leveled, i.e., the red bars leveled to 0.2 and the blue bars leveled to 0.333. How can I achieve it?
Also, how can I set the y-axis to show the numbers in percentage instead of decimals?
Many thanks in advance.
This seems to do the job. It uses ..density.. rather than ..count.., a rather ugly way of counting the number of levels in the A/B factor column, and the scales package to format the y-axis labels as percentages:
ggplot(xy, aes(x = x, fill = y)) +
geom_histogram(aes(y=..density../sum(..density..)*length(unique(xy$y)), group = y), position = "dodge") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
As an alternative to calculating everything inside ggplot, you can first calculate the relative frequencies and then plot them with geom_col. preserve = "single" keeps the bars at equal width:
library(ggplot2)
library(dplyr)
xy <- data.frame(x = c(1,2,3,4,5,2,3,4),
y = c(rep('A',5),rep('B',3)))
xy <- xy %>%
group_by(y, x) %>%
summarise(count = n()) %>% # the result is still grouped by y
mutate(rel_freq = count / sum(count)) # share of each x within its class y
ggplot(xy, aes(x = x, y = rel_freq, fill = y)) +
geom_col(position = position_dodge2(preserve = "single")) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
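For a quick check of the per-class proportions without plotting, base R's prop.table() with margin = 1 normalises each row of a contingency table separately. A minimal sketch on the same data (raw_xy is a name introduced here to avoid overwriting xy):

```r
raw_xy <- data.frame(x = c(1, 2, 3, 4, 5, 2, 3, 4),
                     y = c(rep("A", 5), rep("B", 3)))

# Rows are classes (y), columns are x values; each row sums to 1
rel_freq <- prop.table(table(raw_xy$y, raw_xy$x), margin = 1)
round(rel_freq, 3)
```

Each non-zero entry in row A should be 0.2 and each non-zero entry in row B should be one third, matching the bar heights the question asks for.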

ggplot2, error in filling the area under lines

I have this data set and I want to fill the area under each line. However I get an error saying:
Error: stat_bin() must not be used with a y aesthetic.
Additionally, I need to use alpha value for transparency. Any suggestions?
library(reshape2)
library(ggplot2)
dat <- data.frame(
a = rnorm(12, mean = 2, sd = 1),
b = rnorm(12, mean = 4, sd = 2),
month = c("JAN","FEB","MAR",'APR',"MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"))
dat$month <- factor(dat$month,
levels = c("JAN","FEB","MAR",'APR',"MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"),
ordered = TRUE)
dat <- melt(dat, id="month")
ggplot(data = dat, aes(x = month, y = value, colour = variable)) +
geom_line() +
geom_area(stat ="bin")
I want to fill the area under each line
This means we will need to specify the fill aesthetic.
I get an error saying "Error: stat_bin() must not be used with a y aesthetic."
This means we will need to delete your stat ="bin" code.
Additionally, I need to use alpha value for transparency.
This means we need to put alpha = <some value> in the geom_area layer.
Two other things: (1) since you have a factor on the x-axis, we need to specify a grouping so ggplot knows which points to connect. In this case we can use variable as the grouper. (2) The default "position" of geom_area is to stack the areas rather than overlap them. Because you ask about transparency I assume you want them overlapping, so we need to specify position = 'identity'.
ggplot(data = dat, aes(x = month, y = value, colour = variable)) +
geom_line() +
geom_area(aes(fill = variable, group = variable),
alpha = 0.5, position = 'identity')
To get lines across categorical variables, use the group aesthetic:
ggplot(data = dat, aes(x = month, y = value, colour = variable, group = variable)) +
#geom_line(position = 'stack') + # redundant, but this is where lines are drawn
geom_area(alpha = 0.5)
To change the color inside, use the fill aesthetic.

How to suppress warnings when plotting with ggplot

When passing missing values to ggplot, it's very kind, and warns us that they are present. This is acceptable in an interactive session, but when writing reports, you do not want the output to get cluttered with warnings, especially if there are many of them. The example below has one label missing, which produces a warning.
library(ggplot2)
library(reshape2)
mydf <- data.frame(
species = sample(c("A", "B"), 100, replace = TRUE),
lvl = factor(sample(1:3, 100, replace = TRUE))
)
labs <- melt(with(mydf, table(species, lvl)))
names(labs) <- c("species", "lvl", "value")
labs[3, "value"] <- NA
ggplot(mydf, aes(x = species)) +
stat_bin() +
geom_text(data = labs, aes(x = species, y = value, label = value, vjust = -0.5)) +
facet_wrap(~ lvl)
If we wrap suppressWarnings around the last expression, we get a summary of how many warnings there were. For the sake of argument, let's say that this isn't acceptable (but is indeed very honest and correct). How to (completely) suppress warnings when printing a ggplot2 object?
You need to wrap suppressWarnings() around the print() call, not around the creation of the ggplot() object:
R> suppressWarnings(print(
+ ggplot(mydf, aes(x = species)) +
+ stat_bin() +
+ geom_text(data = labs, aes(x = species, y = value,
+ label = value, vjust = -0.5)) +
+ facet_wrap(~ lvl)))
R>
It might be easier to assign the final plot to an object and then print().
plt <- ggplot(mydf, aes(x = species)) +
stat_bin() +
geom_text(data = labs, aes(x = species, y = value,
label = value, vjust = -0.5)) +
facet_wrap(~ lvl)
R> suppressWarnings(print(plt))
R>
The reason for the behaviour is that the warnings are only generated when the plot is actually drawn, not when the object representing the plot is created. R will auto print during interactive usage, so whilst
R> suppressWarnings(plt)
Warning message:
Removed 1 rows containing missing values (geom_text).
doesn't work because, in effect, you are calling print(suppressWarnings(plt)), whereas
R> suppressWarnings(print(plt))
R>
does work because suppressWarnings() can capture the warnings arising from the print() call.
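The same mechanism can be reproduced without ggplot2. In this sketch, lazy_plot is a made-up S3 class whose print method emits a warning, standing in for a plot that warns only when it is drawn; count_warnings() is a hypothetical helper that counts the warnings escaping an expression:

```r
# A stand-in for a plot object: harmless to create, warns when printed
lazy_plot <- structure(list(), class = "lazy_plot")
print.lazy_plot <- function(x, ...) {
  warning("Removed 1 rows containing missing values")
  cat("<plot drawn>\n")
  invisible(x)
}

# Count (and silence) the warnings that escape from expr
count_warnings <- function(expr) {
  n <- 0
  withCallingHandlers(
    expr,
    warning = function(w) {
      n <<- n + 1
      invokeRestart("muffleWarning")
    }
  )
  n
}

# Returning the object raises nothing, so there is nothing to suppress
count_warnings(suppressWarnings(lazy_plot))
# Printing outside suppressWarnings() lets the warning escape
count_warnings(print(suppressWarnings(lazy_plot)))
# Printing inside suppressWarnings() catches it
count_warnings(suppressWarnings(print(lazy_plot)))
```

Only the last form both draws the object and keeps the warning from reaching the caller.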
A more targeted plot-by-plot approach would be to add na.rm=TRUE to your plot calls.
E.g.:
ggplot(mydf, aes(x = species)) +
stat_bin() +
geom_text(data = labs, aes(x = species, y = value,
label = value, vjust = -0.5), na.rm=TRUE) +
facet_wrap(~ lvl)
In your question, you mention report writing, so it might be better to set the global warning level. Be aware that this silences every warning in the session, not only those from ggplot2:
options(warn=-1)
the default is:
options(warn=0)
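If the global option is used, it may be safer to restore the previous level afterwards. options() returns the old values invisibly, so they can be saved and reinstated. A minimal sketch (quiet_sqrt is a hypothetical function, used only to have something that would normally warn):

```r
quiet_sqrt <- function(x) {
  old <- options(warn = -1)  # silence warnings and remember the old setting
  on.exit(options(old))      # restore it when the function exits
  sqrt(x)                    # sqrt(-1) would normally warn about NaNs
}

before <- getOption("warn")
result <- quiet_sqrt(-1)
after <- getOption("warn")
c(before = before, after = after)
```

The same save-and-restore pattern can be placed around a single plotting chunk in a report, so that warnings elsewhere remain visible.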

Resources