I am creating a box plot in which I have used scale_x_reordered() after adjusting the order of factors on the x axis.
I am now trying to change the label of one factor. I had previously been doing this using:
scale_x_discrete(labels=c("old_label" = "new_label"))
However, I cannot use both scale_x_discrete() and scale_x_reordered() in the same plot. Does anyone know of a fix so that I can change a label and keep scale_x_reordered?
My ggplot is based off of this very helpful example: linked here
The change I am trying to make is equivalent to manually changing the name "Michael" to "Mike".
To achieve your desired result, I suggest recoding your factor before applying reorder_within.
The reason is that reorder_within transforms the factor levels to make the reordering within facets work. Inside scale_x_reordered a re-transformation is applied via the labels argument to show the original levels or labels. That's why you can't make use of the labels argument.
In the following example taken from the link you posted I make use of dplyr::recode(name, "Michael" = "Mike") just before reorder_within:
library(tidyverse)
library(babynames)
library(tidytext)

top_names <- babynames %>%
  filter(year >= 1950,
         year < 1990) %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  count(name, wt = n, sort = TRUE) %>%
  ungroup

top_names %>%
  group_by(decade) %>%
  top_n(15) %>%
  ungroup %>%
  mutate(decade = as.factor(decade),
         name = recode(name, "Michael" = "Mike"),
         name = reorder_within(name, n, decade)) %>%
  ggplot(aes(name, n, fill = decade)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~decade, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Number of babies per decade",
       x = NULL,
       title = "What were the most common baby names in each decade?",
       subtitle = "Via US Social Security Administration")
#> Selecting by n
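The level transformation described above can be sketched in base R. This is only an illustration of the idea, not tidytext's actual source: the toy values for name, decade and n are made up, the helper strip_suffix is hypothetical, and the "___" separator is assumed to match the package default.

```r
# Toy data (made up for illustration).
name   <- c("Michael", "David", "Michael", "James")
decade <- c("1950", "1950", "1960", "1960")
n      <- c(800, 700, 600, 900)

sep <- "___"  # assumed separator

# Step 1: make every name unique within its facet by appending the facet
# value, then order the combined levels by n.
combined <- reorder(paste(name, decade, sep = sep), n)
levels(combined)
# "Michael___1960" "David___1950" "Michael___1950" "James___1960"

# Step 2: the axis labels strip the suffix again (hypothetical helper).
strip_suffix <- function(x) gsub(paste0(sep, ".+$"), "", x)
strip_suffix(levels(combined))
# "Michael" "David" "Michael" "James"

# A named labels vector keyed by "Michael" has nothing to match against,
# because the scale only ever sees the suffixed levels.
"Michael" %in% levels(combined)
# FALSE
```

Recoding before the suffix is appended, as in the answer, sidesteps the problem because the suffixed level is then already "Mike___1950".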
Sorry in advance, I'm an R newbie. I was working on Divvy Bike Share data (see here for details). Here is a subset of my df:
I wanted to visualize the total ridership count (how many times bikes are used) aggregated by day of the week. I tried two blocks of code whose only difference is the grouping before summarize(): the second one also groups by "month". I don't understand what causes the difference in y-axis values between the two graphs.
p1 <- df %>%
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1

p2 <- df %>%
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
  theme(axis.title.x = element_blank()) +
  scale_fill_discrete(name = "") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2
I also inspected the tables generated before plotting, using the following blocks:
df %>%
  group_by(member_casual, day_of_week) %>%
  summarize(total_rides = n())

df %>%
  group_by(member_casual, day_of_week, month) %>%
  summarize(total_rides = n())
I understand that adding more variables to group_by makes the resulting table more finely categorized, or "grouped". However, the total should always be the same, no? For example, if you add up all the casual & Sunday rows (split across 12 months) in tibble 2, you get exactly the number in tibble 1, 392107, which is the number shown in p1, not p2. This added to my confusion.
In short, I have two questions:
Why do p1 and p2 differ, and how can I avoid such errors in the future?
Where do the numbers in p2 come from?
Any advice would be greatly appreciated. Thank you!
You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:
library(dplyr)
library(ggplot2)

mpg %>%
  count(drv, year, cyl) %>%
  ggplot(aes(year, n, fill = drv)) +
  geom_col(
    position = "dodge",
    color = "black",
    alpha = .5
  )
NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).
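To put numbers on the overplotting, the same mpg example can be summarized per bar position. This is a sketch rather than part of the original answer; the column names visible and total are made up for illustration. The tallest single count is what stays visible, while the sum is what the asker expected to see.

```r
library(dplyr)
library(ggplot2)

counts <- mpg %>% count(drv, year, cyl)

# For each drv/year bar position, compare the tallest single bar (the one
# that stays visible when bars are drawn on top of each other) with the
# total across all cyl values.
heights <- counts %>%
  group_by(drv, year) %>%
  summarize(visible = max(n), total = sum(n), .groups = "drop")
heights

# One way to plot the totals: collapse the extra variable before plotting,
# so that each fill group has exactly one row per x value.
p_total <- ggplot(heights, aes(factor(year), total, fill = drv)) +
  geom_col(position = "dodge")
p_total
```

In the bike-share case this corresponds to leaving month out of group_by(), or summing total_rides over month before calling ggplot().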
I'm new to using R so please bear with me as my code might not look the best. So I want to combine these two line graphs together since right now I have written code for each item that I am analyzing. This is the dataset I am using: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-01/readme.md I used the "Arable_Land" dataset!
##USA Arable Land
plot_arable_land_USA <- arable_land %>%
  filter(Code == "USA") %>%
  select(c(Year, Code, `Arable land needed to produce a fixed quantity of crops ((1.0 = 1961))`)) %>%
  pivot_longer(-c(Year, Code)) %>%
  ggplot(aes(x = Year, y = value, color = name, group = name)) +
  geom_line() +
  facet_wrap(. ~ name, scales = 'free_y') +
  theme_light() +
  theme(legend.position = 'none')
ggplotly(plot_arable_land_USA)

##Canada Arable Land
plot_arable_land_CAN <- arable_land %>%
  filter(Code == "CAN") %>%
  select(c(Year, Code, `Arable land needed to produce a fixed quantity of crops ((1.0 = 1961))`)) %>%
  pivot_longer(-c(Year, Code)) %>%
  ggplot(aes(x = Year, y = value, color = name, group = name)) +
  geom_line() +
  facet_wrap(. ~ name, scales = 'free_y') +
  theme_light() +
  theme(legend.position = 'none')
ggplotly(plot_arable_land_CAN)
Ideally, I would like one graph that shows both: one line (in purple) for the USA and another line (in brown) for Canada.
Thank you!
Try this. Reshaping the data to long format, as you did, is good practice. In your case, use filter() to keep both countries, then reshape to long and build the plot. The key is mapping both color and group to Code so that each country gets its own line. You can set the colors with scale_color_manual(); I kept the facet so that the variable name appears as the panel title. Here is the code:
library(plotly)
library(tidyverse)
#Code
plot_arable_land_CAN <- arable_land %>%
  select(-Entity) %>%
  filter(Code %in% c('USA','CAN')) %>%
  pivot_longer(-c(Code,Year)) %>%
  ggplot(aes(x = Year, y = value, color = Code, group = Code)) +
  geom_line() +
  facet_wrap(. ~ name, scales = 'free_y') +
  theme_light() +
  theme(legend.position = 'none') +
  # Unnamed values follow the alphabetical order of Code: CAN = brown, USA = purple
  scale_color_manual(values = c('brown','purple'))
#Transform
ggplotly(plot_arable_land_CAN)
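If relying on the alphabetical order of Code feels fragile, the colors can be tied to the country codes by name. The following is a minimal sketch on a made-up toy data frame (the toy values are invented and are not the arable land data); only the named values vector in scale_color_manual() is the point.

```r
library(ggplot2)

# Toy data, invented for illustration only.
toy <- data.frame(
  Year  = rep(2000:2002, times = 2),
  Code  = rep(c("USA", "CAN"), each = 3),
  value = c(1.00, 0.90, 0.80, 1.00, 0.95, 0.90)
)

p <- ggplot(toy, aes(x = Year, y = value, color = Code, group = Code)) +
  geom_line() +
  # Named values are matched to the levels of Code by name, not by position.
  scale_color_manual(values = c(USA = "purple", CAN = "brown"))
p
```

With named values the mapping stays the same if more countries are added to the filter or the level order changes.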
Using the mpg dataset I want to produce a scatterplot that shows for every manufacturer one point with the grouped (by manufacturer) mean of displ.
The following works so far:
ggplot(mpg %>%
         group_by(manufacturer) %>%
         summarise(mean_displ = mean(displ))) +
  geom_point(aes(x = manufacturer, y = mean_displ)) +
  guides(x = guide_axis(angle = 90))
Now I want to show the points in ascending order according to their displ value. Or: I want to sort the manufacturer variable on the x-axis according to the corresponding mean_displ value.
I tried to insert an arrange(mean_displ) statement in my dplyr chain. No success.
So I introduced a dummy variable x that produces the plot I want, but now the labeling is gone.
ggplot(mpg %>%
         group_by(manufacturer) %>%
         summarise(mean_displ = mean(displ)) %>%
         arrange(mean_displ) %>%
         mutate(x = 1:15)) +
  geom_point(aes(x = x, y = mean_displ))
How can I get the latter plot but with the labeling from above?
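fct_reorder from the forcats package can order the levels of a factor by another variable. ggplot2 orders a discrete axis by factor levels (alphabetically for character columns), not by row order, which is why arrange() had no effect.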
library(tidyverse)

ggplot(mpg %>%
         group_by(manufacturer) %>%
         summarise(mean_displ = mean(displ))) +
  geom_point(aes(x = fct_reorder(manufacturer, mean_displ), y = mean_displ)) +
  guides(x = guide_axis(angle = 90))
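To see what fct_reorder changes, it can help to apply it in the data pipeline and inspect the factor levels before plotting. This is a sketch along the lines of the answer above; the object name means is made up for illustration.

```r
library(dplyr)
library(forcats)
library(ggplot2)

means <- mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_displ = mean(displ)) %>%
  arrange(mean_displ) %>%
  # Turn the character column into a factor whose levels follow mean_displ.
  mutate(manufacturer = fct_reorder(manufacturer, mean_displ))

# The levels, not the row order, decide the position on the x axis.
levels(means$manufacturer)

p <- ggplot(means, aes(x = manufacturer, y = mean_displ)) +
  geom_point() +
  guides(x = guide_axis(angle = 90))
p
```

Doing the reordering in mutate() keeps the aes() call short and makes the ordering reusable for other layers.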
I have a dataset with two variables: 1) ID, 2) Infection Status (Binary:1/0).
I would like to use ggplot2 to
Create a stacked percentage bar graph with the various IDs on the vertical axis (arranged alphabetically with A starting on top), and the percent on the horizontal axis. I can't seem to find code that will automatically sort the IDs alphabetically; my original dataset has quite a number of categories, so arranging them manually would be difficult.
I also hope to have the infected category (1) to be red and towards the left of the blue non-infected category (0). Is it also possible to change the sub-heading of the legend box from "Non_infected" to "Non-infected"?
I hope that the displayed ID in the plot will include the count of the number of times the ID appeared in the dataset. E.g. "A (n=6)", "B (n=3)"
My sample code is as follows:
ID <- c("A","A","A","A","A","A",
        "B","B","B",
        "C","C","C","C","C","C","C",
        "D","D","D","D","D","D","D","D","D")
Infection <- sample(c(1, 0), size = length(ID), replace = TRUE)
df <- data.frame(ID, Infection)

library(ggplot2)
library(dplyr)
library(reshape2)

df.plot <- df %>%
  group_by(ID) %>%
  summarize(Infected = sum(Infection)/n(),
            Non_Infected = 1-Infected)

df.plot %>%
  melt() %>%
  ggplot(aes(x = ID, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "stack") +
  xlab("ID") +
  ylab("Percent Infection") +
  scale_fill_discrete(guide = guide_legend(title = "Infection Status")) +
  coord_flip()
Right now I managed to get this output:
I hope to get this:
Thank you so much!
First, we need to add a per-ID count to the summarized data frame.
df.plot <- df %>%
  group_by(ID) %>%
  summarize(Infected = sum(Infection)/n(),
            Non_Infected = 1-Infected,
            count = n())
Then, we augment our ID column, turn the Infection Status into a factor variable, use forcats::fct_rev to reverse the ID ordering, and use scale_fill_manual to control your legend.
df.plot %>%
  mutate(ID = paste0(ID, " (n=", count, ")")) %>%
  select(-count) %>%
  melt() %>%
  mutate(variable = factor(variable, levels = c("Non_Infected", "Infected"))) %>%
  ggplot(aes(x = forcats::fct_rev(ID), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "stack") +
  xlab("ID") +
  ylab("Percent Infection") +
  scale_fill_manual("Infection Status",
                    values = c("Infected" = "#F8766D", "Non_Infected" = "#00BFC4"),
                    labels = c("Non-Infected", "Infected")) +
  coord_flip()
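As a possible alternative (a sketch, not the answer's method), the proportions can be left to ggplot2 by using position = "fill" on the raw rows, which avoids the summarize and melt steps. It assumes ggplot2 3.3.0 or later so that geom_bar() accepts a y-only mapping; the names plot_df, label and status are made up for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # only to make the random sample reproducible
ID <- rep(c("A", "B", "C", "D"), times = c(6, 3, 7, 9))
df <- data.frame(ID, Infection = sample(c(1, 0), size = length(ID), replace = TRUE))

plot_df <- df %>%
  add_count(ID, name = "count") %>%   # rows per ID, repeated on every row
  mutate(
    label  = paste0(ID, " (n=", count, ")"),
    # Reversed alphabetical levels put "A" at the top of the y axis.
    label  = factor(label, levels = rev(sort(unique(label)))),
    status = factor(Infection, levels = c(0, 1),
                    labels = c("Non-infected", "Infected"))
  )

p <- ggplot(plot_df, aes(y = label, fill = status)) +
  geom_bar(position = "fill") +       # bar segments are scaled to proportions
  scale_x_continuous(labels = scales::percent) +
  scale_fill_manual("Infection Status",
                    values = c("Infected" = "#F8766D", "Non-infected" = "#00BFC4")) +
  labs(x = "Percent Infection", y = "ID")
p
```

Because the status labels are set on the factor itself, the legend text and the color names stay in sync without a separate labels argument.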
The names of the countries are long and overlap each other in the x-axis labels. How can I make them readable?
ggplot(results, aes(x = Nationality, horiz=TRUE)) +
  theme_solarized() +
  geom_bar() +
  labs(y = "Number of Medals",
       title = "Number of Medals by Country")
Welcome to Stack Overflow. Here are some suggestions for dealing with the many values. Both methods use the forcats package, which is part of the tidyverse. You can read more about it here: https://r4ds.had.co.nz/factors.html
First, some fake data & replicating your problem
library(tidyverse)

df <-
  mpg %>%
  arrange(manufacturer) %>%
  mutate(
    n = row_number(),
    vehicle = paste(year, manufacturer, model)
  ) %>%
  uncount(n)

# this replicates your problem
ggplot(df, aes(vehicle)) +
  geom_bar() +
  coord_flip()
Option 1: consolidate
df %>%
  mutate(
    vehicle = # making heavy use of forcats here
      fct_lump(vehicle, 35) %>% # keep only the 35 most frequent values, others in "Other" category
      fct_infreq() %>% # order them by frequency
      fct_rev() # reverse the order
  ) %>%
  ggplot(aes(vehicle)) +
  geom_bar() +
  coord_flip()
Option 2: facet
Someone may have a more elegant way of getting these groups but I use this method quite a bit
df %>%
  mutate(
    vehicle = # similar methods to earlier
      fct_infreq(vehicle) %>%
      fct_rev(),
    num_fct = as.integer(vehicle), # generates a number for each factor
    facet = (max(num_fct)-num_fct) %/% 20 # will make groups of 20, but they need to be in descending order within each facet
  ) %>%
  ggplot(aes(vehicle)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~facet, scales = "free_y", nrow = 1) +
  theme(
    strip.background = element_blank(),
    strip.text = element_blank()
  )
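The facet arithmetic in Option 2 is easier to follow on plain integers. This is a small worked sketch; the 45 factor codes are a made-up example and not the number of vehicles in the data.

```r
# Pretend there are 45 factor levels, coded 1..45 as as.integer() would give.
num_fct <- 1:45

# Integer division of the distance from the largest code puts the 20 largest
# codes in facet 0, the next 20 in facet 1, and the remainder in facet 2.
facet <- (max(num_fct) - num_fct) %/% 20

table(facet)
# facet
#  0  1  2
# 20 20  5

range(num_fct[facet == 0])
# 26 45
```

Since the largest codes belong to the most frequent vehicles after fct_infreq() and fct_rev(), facet 0 holds the 20 most frequent ones.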
Hope this helps.