I'm trying to create a barplot by month that includes two columns, with each column stacked. For each month, the first column would be the total number of video visits, split by vid_new and vid_return. The second column would be the total number of phone visits, split by phone_charge and phone_nocharge.
I still haven't been able to get the side-by-side bars right. This code uses the data frame in the second picture, but it counts the occurrences of the words "video" and "phone" instead of using the Count column, which produces the third picture.
plot <- ggplot(data=new_df, aes(x=Month, y = count, fill = gen_type)) +
geom_bar(stat = "identity", position = "dodge")
Below is a pic of the data I'm working with. I've converted it into a few different forms to try new methods, but have not been able to produce this graph.
How can I make a barplot that is both grouped and stacked in ggplot? What data structure do I need to make it?
Thanks in advance for your advice!
You can try any of these options, which reshape your data to long format and create an additional variable to identify the visit types. Here is the code using tidyverse functions:
library(ggplot2)
library(dplyr)
library(tidyr)
#Data
df <- data.frame(Month=c(rep('Mar',4),rep('Apr',4),rep('May',2)),
spec_type=c('vid_new','vid_return','phone_charge','phone_nocharge',
'vid_new','vid_return','phone_charge','phone_nocharge',
'vid_new','vid_return'),
Count=c(7,85,595,56,237,848,2958,274,205,1079))
#Plot 1
df %>% mutate(Month=factor(Month,levels = unique(Month),ordered = T)) %>%
mutate(Dup=spec_type) %>%
separate(Dup,c('Type','Class'),sep='_') %>% select(-Class) %>%
ggplot(aes(x=Type,y=Count,fill=spec_type))+
geom_bar(stat = 'identity',position = 'stack')+
facet_wrap(.~Month,strip.position = 'bottom')+
theme(strip.placement = 'outside',
strip.background = element_blank())
Output:
Or this:
#Plot 2
df %>% mutate(Month=factor(Month,levels = unique(Month),ordered = T)) %>%
mutate(Dup=spec_type) %>%
separate(Dup,c('Type','Class'),sep='_') %>% select(-Class) %>%
ggplot(aes(x=Type,y=Count,fill=spec_type))+
geom_bar(stat = 'identity',position = 'fill')+
facet_wrap(.~Month,strip.position = 'bottom',scales = 'free')+
theme(strip.placement = 'outside',
strip.background = element_blank())
Output:
Or this:
#Plot 3
df %>% mutate(Month=factor(Month,levels = unique(Month),ordered = T)) %>%
mutate(Dup=spec_type) %>%
separate(Dup,c('Type','Class'),sep='_') %>% select(-Class) %>%
ggplot(aes(x=Type,y=Count,fill=spec_type))+
geom_bar(stat = 'identity',position = position_dodge2(preserve = 'single'))+
facet_wrap(.~Month,strip.position = 'bottom',scales = 'free')+
theme(strip.placement = 'outside',
strip.background = element_blank())
Output:
To see the bars by month, you can use facet_wrap() and place the strip labels below the panels so that they read like axis labels.
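On the data-structure part of the question: all three plots rely on a long table with one row per month and visit subtype, plus a derived Type column. Below is a minimal sketch of that reshaping step on its own, using the same sample data. The names long_df and totals are chosen here for illustration only.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the answer above
df <- data.frame(
  Month = c(rep("Mar", 4), rep("Apr", 4), rep("May", 2)),
  spec_type = c("vid_new", "vid_return", "phone_charge", "phone_nocharge",
                "vid_new", "vid_return", "phone_charge", "phone_nocharge",
                "vid_new", "vid_return"),
  Count = c(7, 85, 595, 56, 237, 848, 2958, 274, 205, 1079)
)

# Split spec_type into Type ("vid"/"phone") and Class, keeping the original column
long_df <- df %>%
  mutate(Month = factor(Month, levels = unique(Month))) %>%
  separate(spec_type, into = c("Type", "Class"), sep = "_", remove = FALSE) %>%
  select(-Class)

# Height of each stacked bar: total Count per month and visit type
totals <- long_df %>%
  group_by(Month, Type) %>%
  summarise(Total = sum(Count), .groups = "drop")

print(totals)
```

Type goes on the x axis inside each month facet, and spec_type drives the fill, so each bar is stacked by subtype.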
Sorry in advance, as I'm an R newbie. I was working on Divvy Bike Share data (see here for details). Here is a subset of my df:
I wanted to visualize the total ridership count (how many times bikes are used) as compressed and shown in a week. I tried two blocks of codes, with the only difference being summarize() - the second one has "month" inside the function. I don't understand what resulted in this difference in y-axis values in the two graphs.
p1 <- df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "k"))
p1
p2 <- df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = scales::label_number(scale = 1e-3, suffix = "k"))
p2
I tested the tables generated before a plot is visualized, so I tried the following blocks:
df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n())
df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n())
I think I understand that adding more variables to group_by makes the resulting table more finely categorized or "grouped". However, the total should always be the same, no? For example, if you add up all the casual Sunday rows (split into 12 months) in tibble 2, you get exactly the number in tibble 1, 392107, which is the number shown in p1, not p2. This only added to my confusion.
So, in short, I have two questions:
Why do p1 and p2 differ? How can I avoid such errors in the future?
Where do the numbers in p2 come from?
Any advice would be greatly appreciated. Thank you!
You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:
library(dplyr)
library(ggplot2)
mpg %>%
count(drv, year, cyl) %>%
ggplot(aes(year, n, fill = drv)) +
geom_col(
position = "dodge",
color = "black",
alpha = .5
)
NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).
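One way to avoid the problem is to make sure the table passed to ggplot has exactly one row per bar. The sketch below, again using mpg with cyl as the "extra" variable, compares the tallest sub-count (what the overplotted bars show) with the summed count (what was intended), then plots the summed version. The names by_cyl, totals and p_fixed are illustrative.

```r
library(dplyr)
library(ggplot2)

# One row per drv/year/cyl: several rows share the same bar position
by_cyl <- mpg %>%
  count(drv, year, cyl)

# Collapse to one row per drv/year before plotting
totals <- by_cyl %>%
  group_by(drv, year) %>%
  summarise(
    tallest = max(n),  # height visible when the bars are overplotted
    total   = sum(n),  # height you actually wanted
    .groups = "drop"
  )

print(totals)

# Plot the summed counts: one bar per drv within each year
p_fixed <- ggplot(totals, aes(factor(year), total, fill = drv)) +
  geom_col(position = "dodge")
```

Alternatively, keep the extra variable out of group_by() or count() altogether if it is not mapped to any aesthetic or facet.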
I've written a function to create a plot with data that changes based on a filter:
library(tidyverse)
library(plotly)
one<- as.numeric(NA)
two<- 25
three<- 35
four<- 40
five<- 0
dat<- data.frame(one, two, three, four, five)
get_plot <- function(x, a){
data<- x[, a]
p<- data %>%
pivot_longer(everything(), names_to="variable", values_to="value") %>%
ggplot(aes(x = reorder(variable, value), y = value, fill = variable, text = paste0(value*100, "%"))) +
geom_bar(stat = "identity",position = "dodge")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
legend.position = "none")+
coord_flip()
ggplotly(p, tooltip = c("text")) %>% config(displaylogo = FALSE,
modeBarButtons = list(list("toImage")))
}
get_plot(dat, a= c(1:5))
Depending on the filter, I sometimes end up with a chart containing categories that have no values, as in the image below. How do I exclude categories from the plot when they have no value? (Edited to make a simpler reprex.)
Update: after thinking about it more this morning, doing more googling, and trying things, I figured out a solution. I inserted these select statements into the function, and they get me what I wanted!
p<- data %>%
select(where(~!any(is.na(.)))) %>%
select(where(~!any(.== 0))) %>%
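An alternative sketch, assuming the same one-row data frame as in the reprex: instead of dropping columns before reshaping, drop the empty rows after pivot_longer() with a single filter(). The name kept is illustrative.

```r
library(dplyr)
library(tidyr)

# Same shape as the reprex: one row, one column per category
dat <- data.frame(one = NA_real_, two = 25, three = 35, four = 40, five = 0)

# Reshape first, then keep only categories with a non-missing, non-zero value
kept <- dat %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value), value != 0)

print(kept)
```

Inside get_plot(), the filter() call would go right after pivot_longer() and before ggplot(). Unlike the select(where(...)) version, this checks each value separately, so it also behaves sensibly if the data ever has more than one row.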
Good afternoon, dear pros.
I ask you to suggest answers to some questions that I could not understand. There is the following set of data:
df<-data.frame(num = c(1:20),
gender=unlist(strsplit("MFFMMMFMFMMMMMFFFMFM","")),
age= sample(1:100, 20, replace=T),
entrance=sample(1:4, 20, replace=T))
The data is sorted and grouped.
df <- group_by(df, df$entrance, df$gender)
Actually, according to the data, a graph is built
ggplot(df, aes(x= df$entrance)) +
geom_bar(aes(fill=df$gender), position = "dodge")+
scale_fill_discrete(name = "Title", labels = c("F", "M"))+
xlab("Distribution of residents by entrances, taking into account gender")+
ylab("Number of residents")
Actually, here's the result:
What don't I like about it?
I would like that if there is no data in the category, then the column is not drawn, but 0 is indicated.
Also I would like to have data labels like in the picture below.
Well, I would also like options for placing these values:
bottom (as drawn);
in the center of the strip;
on top of the strip.
Try this. You can get something similar to what you expect using facets and changing some aesthetic elements:
library(ggplot2)
library(dplyr)
library(tidyr) # provides complete()
#Plot
df %>% group_by(entrance,gender) %>%
summarise(N=n()) %>%
mutate(gender=factor(gender,levels = c('F','M'))) %>%
# add a row with N = 0 for any gender missing within an entrance
complete(gender,fill = list(N=0)) %>%
ggplot(aes(x= factor(gender),y=N,fill=gender)) +
geom_bar(stat='identity', position = position_dodge2(0.9))+
geom_text(aes(label=N,y=0),vjust=-0.5,position = position_dodge2(0.9))+
scale_fill_discrete(name = "Title", labels = c("F", "M"))+
facet_wrap(.~entrance,nrow = 1,strip.position = 'bottom')+
xlab("Distribution of residents by entrances, taking into account gender")+
ylab("Number of residents")+
theme(strip.placement = 'outside',
strip.background = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Output:
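The plot above puts the labels at the bottom of each bar (y = 0). For the other two placements asked about, the label's y position can be computed from the count. Below is a minimal sketch under that assumption; the helper label_y() and the names counts and p are hypothetical and not part of ggplot2, and set.seed() is added only to make the sample data reproducible.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

set.seed(1)  # only to make the random sample data reproducible
df <- data.frame(
  num = 1:20,
  gender = unlist(strsplit("MFFMMMFMFMMMMMFFFMFM", "")),
  age = sample(1:100, 20, replace = TRUE),
  entrance = sample(1:4, 20, replace = TRUE)
)

# Count residents per entrance and gender, adding explicit zero rows
counts <- df %>%
  count(entrance, gender, name = "N") %>%
  complete(entrance, gender, fill = list(N = 0))

# Hypothetical helper: y position of the label for a bar of height n
label_y <- function(n, where = c("bottom", "center", "top")) {
  where <- match.arg(where)
  switch(where,
         bottom = rep(0, length(n)),
         center = n / 2,
         top    = n)
}

# Labels just above the top of each bar; zero-count categories show "0"
p <- ggplot(counts, aes(x = gender, y = N, fill = gender)) +
  geom_col() +
  geom_text(aes(label = N, y = label_y(N, "top")), vjust = -0.5) +
  facet_wrap(~entrance, nrow = 1, strip.position = "bottom")
```

For centered labels, pass "center" and set vjust = 0.5 so the text sits on the midpoint rather than above it.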
I have a dataset with two variables: 1) ID, 2) Infection Status (Binary:1/0).
I would like to use ggplot2 to
Create a stacked percentage bar graph with the various IDs on the vertical axis (arranged alphabetically, with A at the top) and the percentage on the horizontal axis. I can't find code that sorts the IDs alphabetically automatically; my original dataset has quite a number of categories, so arranging them manually would be difficult.
I also hope to have the infected category (1) to be red and towards the left of the blue non-infected category (0). Is it also possible to change the sub-heading of the legend box from "Non_infected" to "Non-infected"?
I hope that the displayed ID in the plot will include the count of the number of times the ID appeared in the dataset. E.g. "A (n=6)", "B (n=3)"
My sample code is as follows:
ID <- c("A","A","A","A","A","A",
"B","B","B",
"C","C","C","C","C","C","C",
"D","D","D","D","D","D","D","D","D")
Infection <- sample(c(1, 0), size = length(ID), replace = T)
df <- data.frame(ID, Infection)
library(ggplot2)
library(dplyr)
library(reshape2)
df.plot <- df %>%
group_by(ID) %>%
summarize(Infected = sum(Infection)/n(),
Non_Infected = 1-Infected)
df.plot %>%
melt() %>%
ggplot(aes(x = ID, y = value, fill = variable)) + geom_bar(stat = "identity", position = "stack") +
xlab("ID") +
ylab("Percent Infection") +
scale_fill_discrete(guide = guide_legend(title = "Infection Status")) +
coord_flip()
Right now I managed to get this output:
I hope to get this:
Thank you so much!
First, we need to add a count to your original data.frame.
df.plot <- df %>%
group_by(ID) %>%
summarize(Infected = sum(Infection)/n(),
Non_Infected = 1-Infected,
count = n())
Then, we augment our ID column, turn the Infection Status into a factor variable, use forcats::fct_rev to reverse the ID ordering, and use scale_fill_manual to control your legend.
df.plot %>%
mutate(ID = paste0(ID, " (n=", count, ")")) %>%
select(-count) %>%
melt() %>%
mutate(variable = factor(variable, levels = c("Non_Infected", "Infected"))) %>%
ggplot(aes(x = forcats::fct_rev(ID), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "stack") +
xlab("ID") +
ylab("Percent Infection") +
scale_fill_manual("Infection Status",
values = c("Infected" = "#F8766D", "Non_Infected" = "#00BFC4"),
labels = c("Non-Infected", "Infected"))+
coord_flip()
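The reshaping step above uses reshape2::melt(). As a hedged alternative, the same long table can be built with tidyr::pivot_longer(), which names the id and value columns explicitly. This is a sketch on the question's sample data; set.seed() and the name plot_df are additions for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # only to make the random sample data reproducible
ID <- rep(c("A", "B", "C", "D"), times = c(6, 3, 7, 9))
Infection <- sample(c(1, 0), size = length(ID), replace = TRUE)
df <- data.frame(ID, Infection)

plot_df <- df %>%
  group_by(ID) %>%
  summarise(
    Infected = mean(Infection),   # same as sum(Infection) / n()
    Non_Infected = 1 - Infected,
    count = n(),
    .groups = "drop"
  ) %>%
  mutate(ID = paste0(ID, " (n=", count, ")")) %>%
  select(-count) %>%
  # long format: one row per ID and infection status
  pivot_longer(c(Infected, Non_Infected),
               names_to = "variable", values_to = "value")

print(plot_df)
```

plot_df has the same variable and value columns that melt() produces, so it can replace the melt() step in the pipeline above, with the factor() and ggplot() calls unchanged.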
I'm trying to use ggplot to create sequence plots, for the sake of keeping the same visual style within my paper using sequence analysis. I do:
library(ggplot2)
library(TraMineR)
library(dplyr)
library(tidyr)
data(mvad)
mvad_seq<-seqdef(mvad,15:length(mvad))
mvad_trate<-seqsubm(mvad_seq,method="TRATE")
mvad_dist<-seqdist(mvad_seq,method="OM",sm=mvad_trate)
cluster<-cutree(hclust(d=as.dist(mvad_dist),method="ward.D2"),k=6)
mvad$cluster<-cluster
mvad_long<-gather(select(mvad,id,contains("."),-matches("N.Eastern"),-matches("S.Eastern")),
key="Month",value="state",
Jul.93, Aug.93, Sep.93, Oct.93, Nov.93, Dec.93, Jan.94, Feb.94, Mar.94,
Apr.94, May.94, Jun.94, Jul.94, Aug.94, Sep.94, Oct.94, Nov.94, Dec.94, Jan.95,
Feb.95, Mar.95, Apr.95, May.95, Jun.95, Jul.95, Aug.95, Sep.95, Oct.95, Nov.95,
Dec.95, Jan.96, Feb.96, Mar.96, Apr.96, May.96, Jun.96, Jul.96, Aug.96, Sep.96,
Oct.96, Nov.96, Dec.96, Jan.97, Feb.97, Mar.97, Apr.97, May.97, Jun.97, Jul.97,
Aug.97, Sep.97, Oct.97, Nov.97, Dec.97, Jan.98, Feb.98, Mar.98, Apr.98, May.98,
Jun.98, Jul.98, Aug.98, Sep.98, Oct.98, Nov.98, Dec.98, Jan.99, Feb.99, Mar.99,
Apr.99, May.99, Jun.99)
mvad_long<-left_join(mvad_long,select(mvad,id,cluster))
ggplot(data=mvad_long,aes(x=Month,y=id,fill=state))+geom_tile()+facet_wrap(~cluster)
I try to plot the sequences by cluster, and this gives me the following plot:
As you can see, there are gaps for the ids that don't belong to the cluster represented by each facet. I would like to get rid of these gaps, so that the sequences show up stacked just as with the seqIplot() function of TraMineR as in the next figure:
Any suggestions of how to proceed?
Two small changes:
mvad_long$id <- as.factor(mvad_long$id)
ggplot(data=mvad_long,aes(x=Month,y=id,fill=state))+
geom_tile()+facet_wrap(~cluster,scales = "free_y")
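ggplot was treating id as a numeric variable rather than a factor, and the y scale was fixed (shared across facets) by default. With id as a factor and scales = "free_y", each facet shows only the ids that belong to its cluster.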
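The same two changes can be tried without TraMineR on a toy data set. The sketch below uses made-up ids, clusters and states purely for illustration; toy and p are hypothetical names.

```r
library(ggplot2)

# Toy data: 6 ids observed over 4 months, split into two clusters
toy <- expand.grid(id = 1:6, month = 1:4)
toy$cluster <- ifelse(toy$id %% 2 == 0, "even", "odd")
toy$state <- ifelse((toy$id + toy$month) %% 2 == 0, "school", "work")

# Change 1: make id discrete so unused ids can be dropped per facet
toy$id <- factor(toy$id)

# Change 2: free the y scale so each facet keeps only its own ids
p <- ggplot(toy, aes(x = month, y = id, fill = state)) +
  geom_tile() +
  facet_wrap(~cluster, scales = "free_y")
```

With a numeric id, or with the default fixed scales, every facet would reserve space for all six ids and the gaps would come back.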
An update: I needed to convert the month into a date for it to work. The full solution follows:
library(ggplot2)
library(TraMineR)
library(dplyr)
library(tidyr)
library(lubridate)
data(mvad)
mvad_seq <- seqdef(mvad, 15:length(mvad))
mvad_trate <- seqsubm(mvad_seq, method = "TRATE")
mvad_dist <- seqdist(mvad_seq, method = "OM", sm = mvad_trate)
cluster <- cutree(hclust(d = as.dist(mvad_dist), method = "ward.D2"), k = 6)
mvad$cluster <- cluster
mvad_long <- mvad %>%
select(id, matches("\\.\\d\\d")) %>%
gather(key = "month", value = "state", -id) %>%
inner_join(
mvad %>%
select(id, cluster),
by = "id"
) %>%
mutate(
id = factor(id),
date = myd(paste0(month, "01"))
)
mvad_long %>%
ggplot(aes(x = date, y = id, fill = state, color = state)) +
geom_tile() +
facet_wrap(~cluster, scales = "free_y", ncol = 2) +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()
) +
scale_fill_brewer(palette = "Accent") +
scale_colour_brewer(palette = "Accent") +
labs(x = "", y = "")