ggplot geom_col: making certain axis count integers rather than summing - r

I am currently making a hate crime case study. For my plot I am using one zip-code as my y-axis and plotting how many crimes and what group is being targeted on the x-axis using geom-col. The problem is my y-axis is adding the zip-codes together rather than counting each frequency of how many times the zip-code shows up. Here is my dataset looks like:
structure(list(ID = 1:5, CRIME_TYPE = c("VANDALISM", "ASSAULT", "VANDALISM", "ASSAULT",
"OTHER"), BIAS_MOTIVATION_GROUP = c("ANTI-BLACK ",
"ANTI-BLACK ", "ANTI-FEMALE HOMOSEXUAL (LESBIAN) ",
"ANTI-MENTAL DISABILITY ", "ANTI-JEWISH "),
ZIP_CODE = c(40291L, 40219L, 40243L, 40212L, 40222L
)), row.names = c(NA, 5L), class = "data.frame")
Here is my code:
library(ggplot2)
df <- read.csv(file = "LMPD_OP_BIAS.csv", header = T)
library(tidyverse)
hate_crime <- df %>%
filter(ZIP_CODE == "40245")
hate_crime_plot <- hate_crime %>%
ggplot(., aes(x = BIAS_MOTIVATION_GROUP, y = ZIP_CODE, fill =
BIAS_MOTIVATION_GROUP)) +
geom_col() + labs(x = "BIAS_MOTIVATION_GROUP", fill = "BIAS_MOTIVATION_GROUP") +
theme_minimal() +
theme(axis.text.x=element_text (angle =45, hjust =1))
print(hate_crime_plot)
hate_crime_ploter <- hate_crime %>%
ggplot(., aes(x = UOR_DESC, y = ZIP_CODE, fill =
UOR_DESC)) +
geom_col() + labs(x = "UOR_DESC", fill = "UOR_DESC") +
theme_minimal() +
theme(axis.text.x=element_text (angle =45, hjust =1))
print(hate_crime_ploter)
For full data visit here: visit site to download data set

Alright, I think you've got a couple issues here. What's happening in your code is you're asking ggplot to make a bar plot with a categorical variable (BIAS_MOTIVATION_GROUP and UOR_DESC) on the x-axis and a continuous variable (ZIP_CODE) on the y-axis. Since there are more than one row per x-y combination, ggplot adds things together by x value, which is what you'd expect out of a bar plot. Long story short, I wonder if what you actually want is a histogram here. Your dataset (hate_crime) only has one value of ZIP_CODE, so I'm not sure what plotting ZIP on the y-axis is supposed to visualize. A histogram would look like this:
hate_crime %>%
ggplot(., aes(x = UOR_DESC, , fill = UOR_DESC)) +
geom_histogram(stat = "count") +
labs(x = "UOR_DESC", fill = "UOR_DESC") +
theme_minimal() +
theme(axis.text.x=element_text (angle =45, hjust =1))
If, instead, you're trying to visualize how often each ZIP code shows up in each category, you'd have to approach things differently. Perhaps you're looking for something like this?
df %>%
ggplot(aes(x = UOR_DESC, fill = factor(ZIP_CODE))) +
geom_histogram(stat = "count") +
theme(axis.text.x=element_text (angle =45, hjust =1))

Related

How do I correctly connect data points ggplot

I am making a stratigraphic plot but somehow, my data points don't connect correctly.
The purpose of this plot is that the values on the x-axis are connected so you get an overview of the change in d18O throughout time (age, ma).
I've used the following script:
library(readxl)
R_pliocene_tot <- read_excel("Desktop/R_d18o.xlsx")
View(R_pliocene_tot)
install.packages("analogue")
install.packages("gridExtra")
library(tidyverse)
R_pliocene_Rtot <- R_pliocene_tot %>%
gather(key=param, value=value, -age_ma)
R_pliocene_Rtot
R_pliocene_Rtot %>%
ggplot(aes(x=value, y=age_ma)) +
geom_path() +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
which leads to the following figure:
Something is wrong with the geom_path function, I guess, but I can't figure out what it is.
Though the comment seem solve the problem I don't think the question asked was answered. So here is some introduction about ggplot2 library regard geom_path
library(dplyr)
library(ggplot2)
# This dataset contain two group with random value for y and x run from 1->20
# The param is just to replicate the question param variable.
df <- tibble(x = rep(seq(1, 20, by = 1), 2),
y = runif(40, min = 1, max = 100),
group = c(rep("group 1", 20), rep("group 2", 20)),
param = rep("a param", 40))
df %>%
ggplot(aes(x = x, y = y)) +
# In geom_path there is group aesthetics which help the function to know
# which data point should is in which path.
# The one in the same group will be connected together.
# here I use the color to help distinct the path a bit more.
geom_path(aes(group = group, color = group)) +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
In your data which work well with group = 1 I guessed all data points belong to one group and you just want to draw a line connect all those data point. So take my data example above and draw with aesthetics group = 1, you can see the result that have two line similar to the above example but now the end point of group 1 is now connected with the starting point of group 2.
So all data point is now on one path but the order of how they draw is depend on the order they appear in the data. (I keep the color just to help see it a bit clearer)
df %>%
ggplot(aes(x = x, y = y)) +
geom_path(aes(group = 1, color = group)) +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
Hope this give you better understanding of ggplot2::geom_path

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)

How to make a dual axis in ggplot R

I have made a time series plot for total count data of 4 different species. As you can see the results with sharksucker have a much higher count than the other 3 species. To see the trends of the other 3 species they need to plotted separately (or on a smaller y axis). However, I have a figure limit in my masters paper. So, I was trying to create a dual axis plot or have the y axis split into two. Does anyone know of a way I could do this?
library(tidyverse)
library(reshape2)
dat <- read_xlsx("ReefPA.xlsx")
dat1 <- dat
dat1$Date <- format(dat1$Date, "%Y/%m")
plot_dat <- dat1 %>%
group_by(Date) %>%
summarise(Sharksucker_Remora = sum(Sharksucker_Remora)) %>%
melt("Date") %>%
filter(Date > '2018-01-01') %>%
arrange(Date)
names(plot_dat) <- c("Date", "Species", "Count")
ggplot(data = plot_dat) +
geom_line(mapping = aes(x = Date, y = Count, group = Species, colour = Species)) +
stat_smooth(method=lm, aes(x = Date, y = Count, group = Species, colour = Species)) +
scale_colour_manual(values=c(Golden_Trevally="goldenrod2", Red_Snapper="firebrick2", Sharksucker_Remora="darkolivegreen3", Juvenile_Remora="aquamarine2")) +
xlab("Date") +
ylab("Total Presence Per Month") +
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The thing is, the problem you're trying to solve doesn't seem like a 2nd Y axis issue. The problem here is of relative scale of the species. You might want to think of something like standardizing the initial species presence to 100 and showing growth or decline from there.
Another option would be faceting by species.

Plotting a bar chart with years grouped together

I am using the fivethirtyeight bechdel dataset, located here https://github.com/rudeboybert/fivethirtyeight, and am attempting to recreate the first plot shown in the article here https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/. I am having trouble getting the years to group together similarly to how they did in the article.
This is the current code I have:
ggplot(data = bechdel, aes(year)) +
geom_histogram(aes(fill = clean_test), binwidth = 5, position = "fill") +
scale_fill_manual(breaks = c("ok", "dubious", "men", "notalk", "nowomen"),
values=c("red", "salmon", "lightpink", "dodgerblue",
"blue")) +
theme_fivethirtyeight()
I see where you were going with using the histogram geom but this really looks more like a categorical bar chart. Once you take that approach it's easier, after a bit of ugly code to get the correct labels on the year columns.
The bars are stacked in the wrong order on this one, and there needs to be some formatting applied to look like the 538 chart, but I'll leave that for you.
library(fivethirtyeight)
library(tidyverse)
library(ggthemes)
library(scales)
# Create date range column
bechdel_summary <- bechdel %>%
mutate(date.range = ((year %/% 10)* 10) + ((year %% 10) %/% 5 * 5)) %>%
mutate(date.range = paste0(date.range," - '",substr(date.range + 5,3,5)))
ggplot(data = bechdel_summary, aes(x = date.range, fill = clean_test)) +
geom_bar(position = "fill", width = 0.95) +
scale_y_continuous(labels = percent) +
theme_fivethirtyeight()
ggplot

How to use percentage as label in stacked bar plot?

I'm trying to display percentage numbers as labels inside the bars of a stacked bar plot in ggplot2. I found some other post from 3 years ago but I'm not able to reproduce it: How to draw stacked bars in ggplot2 that show percentages based on group?
The answer to that post is almost exactly what I'm trying to do.
Here is a simple example of my data:
df = data.frame('sample' = c('cond1','cond1','cond1','cond2','cond2','cond2','cond3','cond3','cond3','cond4','cond4','cond4'),
'class' = c('class1','class2','class3','class1','class2','class3','class1','class2','class3','class1','class2','class3'))
ggplot(data=df, aes(x=sample, fill=class)) +
coord_flip() +
geom_bar(position=position_fill(reverse=TRUE), width=0.7)
I'd like for every bar to show the percentage/fraction, so in this case they would all be 33%. In reality it would be nice if the values would be calculated on the fly, but I can also hand the percentages manually if necessary. Can anybody help?
Side question: How can I reduce the space between the bars? I found many answers to that as well but they suggest using the width parameter in position_fill(), which doesn't seem to exist anymore.
Thanks so much!
EDIT:
So far, there are two examples that show exactly what I was asking for (big thanks for responding so quickly), however they fail when applying it to my real data. Here is the example data with just another element added to show what happens:
df = data.frame('sample' = c('cond1','cond1','cond1','cond2','cond2','cond2','cond3','cond3','cond3','cond4','cond4','cond4','cond1'),
'class' = c('class1','class2','class3','class1','class2','class3','class1','class2','class3','class1','class2','class3','class2'))
Essentially, I'd like to have only one label per class/condition combination.
I think what OP wanted was labels on the actual sections of the bars. We can do this using data.table to get the count percentages and the formatted percentages and then plot using ggplot:
library(data.table)
library(scales)
dt <- setDT(df)[,list(count = .N), by = .(sample,class)][,list(class = class, count = count,
percent_fmt = paste0(formatC(count*100/sum(count), digits = 2), "%"),
percent_num = count/sum(count)
), by = sample]
ggplot(data=dt, aes(x=sample, y= percent_num, fill=class)) +
geom_bar(position=position_fill(reverse=TRUE), stat = "identity", width=0.7) +
geom_text(aes(label = percent_fmt),position = position_stack(vjust = 0.5)) + coord_flip()
Edit: Another solution where you calculate the y-value of your label in the aggregate. This is so we don't have to rely on position_stack(vjust = 0.5):
dt <- setDT(df)[,list(count = .N), by = .(sample,class)][,list(class = class, count = count,
percent_fmt = paste0(formatC(count*100/sum(count), digits = 2), "%"),
percent_num = count/sum(count),
cum_pct = cumsum(count/sum(count)),
label_y = (cumsum(count/sum(count)) + cumsum(ifelse(is.na(shift(count/sum(count))),0,shift(count/sum(count))))) / 2
), by = sample]
ggplot(data=dt, aes(x=sample, y= percent_num, fill=class)) +
geom_bar(position=position_fill(reverse=TRUE), stat = "identity", width=0.7) +
geom_text(aes(label = percent_fmt, y = label_y)) + coord_flip()
Here is a solution where you first calculate the percentages using dplyr and then plot them:
UPDATED:
options(stringsAsFactors = F)
df = data.frame(sample = c('cond1','cond1','cond1','cond2','cond2','cond2','cond3','cond3','cond3','cond4','cond4','cond4'),
class = c('class1','class2','class3','class1','class2','class3','class1','class2','class3','class1','class2','class3'))
library(dplyr)
library(scales)
df%>%
# count how often each class occurs in each sample.
count(sample, class)%>%
group_by(sample)%>%
mutate(pct = n / sum(n))%>%
ggplot(aes(x = sample, y = pct, fill = class)) +
coord_flip() +
geom_col(width=0.7)+
geom_text(aes(label = paste0(round(pct * 100), '%')),
position = position_stack(vjust = 0.5))
Use scales
library(scales)
ggplot(data=df, aes(x=sample, fill=class)) +
coord_flip() +
geom_bar(position=position_fill(reverse=TRUE), width=0.7) +
scale_y_continuous(labels =percent_format())

Resources