Ridge plot: sort by value / rank - r

I have a data set which I uploaded here as a gist in CSV format.
It is the extracted form of the PDFs provided in the YouGov article "How good is 'good'?". People were asked to rate words (e.g. "perfect", "bad") with a score between 0 (very negative) and 10 (very positive). The gist contains exactly that data, i.e. for every word (column: Word) it stores for every ranking from 0 to 10 (column: Category) the number of votes (column: Total).
I would usually try to visualize the data with matplotlib and Python since I lack knowledge in R, but it seems that ggridges can create way nicer plots than I see myself doing with Python.
Using:
library(ggplot2)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov, aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
I was able to create this plot (which is still far from perfect):
Ignoring the fact that I have to tweak the aesthetics, there are three things I struggle to do:
Sort the words by their average rank.
Color the ridge by the average rank.
Or color the ridge by the category value, i.e. with varying color.
I tried to adapt the suggestions from this source, but ultimately failed because my data seems to be in the wrong format: Instead of having single instances of votes, I already have the aggregated vote count for each category.
I hope to end up with a result closer to this plot, which satisfies criteria 3 (source):

It took me a little while to get there myself. The key for me was understanding the data and how to order Word based on the average Category score. So let's look at the data first:
> YouGov
# A tibble: 440 x 17
ID Word Category Total Male Female `18 to 35` `35 to 54` `55+`
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Incr~ 0 0 0 0 0 0 0
2 1 Incr~ 1 1 1 1 1 1 0
3 2 Incr~ 2 0 0 0 0 0 0
4 3 Incr~ 3 1 1 1 1 1 1
5 4 Incr~ 4 1 1 1 1 1 1
6 5 Incr~ 5 5 6 5 6 5 5
7 6 Incr~ 6 6 7 5 5 8 5
8 7 Incr~ 7 9 10 8 10 7 10
9 8 Incr~ 8 15 16 14 13 15 16
10 9 Incr~ 9 20 20 20 22 18 19
# ... with 430 more rows, and 8 more variables: Northeast <dbl>,
# Midwest <dbl>, South <dbl>, West <dbl>, White <dbl>, Black <dbl>,
# Hispanic <dbl>, `Other (NET)` <dbl>
Every Word has a row for every Category (or score, 0-10). The Total provides the number of responses for that Word/Category combination. So although there were no responses where the word "Incredible" scored zero, there is still a row for it.
To order the words by their average score, we first calculate the product of Category and Total for each Word-Category combination; let's call it total_score. From there, we can treat Word as a factor and reorder it by the mean total_score using fct_reorder() from forcats. After that, you can plot your data just as you did.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
YouGov %>%
mutate(total_score = Category*Total) %>%
mutate(Word = fct_reorder(.f = Word, .x = total_score, .fun = mean)) %>%
ggplot(aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
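By treating Word as a factor we reordered the words by their mean total_score, which is proportional to the vote-weighted mean Category as long as every word has the same number of rows and the same overall vote total. ggplot also orders the colors accordingly, so we don't have to modify them ourselves unless you'd prefer a different color palette.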
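If the vote totals do not add up to the same number for every word, a vote-weighted mean is the safer sort key, and the same per-word average can also drive the fill color (the second item in the question). Below is a minimal sketch on made-up toy data; `toy`, `avg_score`, and the vote counts are assumptions for illustration, not values from the gist:

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(ggridges)

# Hypothetical aggregated votes: one row per Word/Category pair
toy <- tibble(
  Word     = rep(c("bad", "okay", "great"), each = 3),
  Category = rep(c(0, 5, 10), times = 3),
  Total    = c(80, 15, 5,  20, 60, 20,  5, 15, 80)
)

# Vote-weighted average score per word, then order the factor by it
toy_sorted <- toy %>%
  group_by(Word) %>%
  mutate(avg_score = weighted.mean(Category, Total)) %>%
  ungroup() %>%
  mutate(Word = fct_reorder(Word, avg_score))

# One fill color per ridge, taken from the word's average score
p <- ggplot(toy_sorted,
            aes(x = Category, y = Word, height = Total,
                group = Word, fill = avg_score)) +
  geom_density_ridges(stat = "identity", scale = 3) +
  scale_fill_viridis_c(option = "C")
```

Printing `p` draws the ridges; swapping `toy` for `YouGov` should work the same way, since the column names match the question's data.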

The other solution is exactly correct. I just wanted to point out that you can call fct_reorder() from within aes() for an even more compact solution. However, you need to do it twice if you want to change fill color by position along the y axis.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = fct_reorder(Word, Category*Total, .fun = sum)
)) +
geom_density_ridges(stat = "identity", scale = 3) +
theme(legend.position = "none")
Created on 2020-01-19 by the reprex package (v0.3.0)
If instead you want to color by x position, you can do something like the following. It just doesn't look as nice as the temperature example because the x values come in discrete steps.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = stat(x)
)) +
geom_density_ridges_gradient(stat = "identity", scale = 3) +
theme(legend.position = "none") +
scale_fill_viridis_c(option = "C")
Created on 2020-01-19 by the reprex package (v0.3.0)

Related

how to count and group categorical data by range in r

I have data from a questionnaire that has a column for year of birth. The range of years was too large, so my plot became confusing. I'm now trying to group the years by decade and then chart them, but I don't know how to group them.
My data looks like this:
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
and my plot is like:
However, I want my data by group as:
birth_year count
(1920-1930]: 5
(1931-1940]: 8
(1941-1950]: 4
(1951-1960]: 3
(1961-1970]: 5
(1971-1980]: 5
(1981-1990]: 5
(1991-2000]: 9
and then plot as a range group.
We can use cut() to group the data, and then plot with ggplot().
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
birth_year$yearGroup <- cut(as.integer(birth_year$years),breaks = 8,dig.lab = 4,
include.lowest = FALSE)
library(ggplot2)
ggplot(birth_year,aes(x = yearGroup)) + geom_bar()
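Alternatively, cut_width() from ggplot2 with a width of 10 gives decade bins (this also needs dplyr for the pipe and count()):
library(dplyr)
birth_year %>%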
mutate(val=cut_width(as.numeric(years),10,boundary = 1920, dig.lab=-1))%>%
count(val)
val n
1 [1920,1930] 5
2 (1930,1940] 8
3 (1940,1950] 4
4 (1950,1960] 3
5 (1960,1970] 5
6 (1970,1980] 5
7 (1980,1990] 5
8 (1990,2000] 9
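Note that `cut(..., breaks = 8)` splits the observed range into eight equal-width bins, which need not fall on decade boundaries. The following base-R sketch uses explicit decade breaks instead, assuming the years run from 1920 to 2000. The `as.character()` call is a guard in case the column was read as a factor, which older R versions do by default:

```r
birth_year <- data.frame(years = c(
  "1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
  "1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
  "1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
  "1920","1931","1977","1999","1967","1992","1998","1984"
), stringsAsFactors = FALSE)

# Explicit decade boundaries: 1920, 1930, ..., 2000
decade_breaks <- seq(1920, 2000, by = 10)

birth_year$decade <- cut(
  as.integer(as.character(birth_year$years)),
  breaks = decade_breaks,
  include.lowest = TRUE,  # keep 1920 itself in the first bin
  dig.lab = 4             # print 1920 rather than 1.92e+03
)

# Number of respondents per decade
decade_counts <- table(birth_year$decade)
```

`decade_counts` can be printed directly, or `birth_year$decade` can be mapped to x in `geom_bar()` as in the first snippet.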

Arranging the stacks in a stacked bargraph according to the value of one variable

I am trying to write a function that outputs a stacked bar graph whose bars are ordered from the greatest to the smallest percentage of one specific variable. I have not been able to find a general way to do this, and my ultimate goal is to have this process require as little human input as possible.
My data looks like this
Swimming_style Comfort_level_label Comfort_level_scale n Total_n Percentage
Front Crawl Excellent 3 7 10 70
Front Crawl Good 2 3 10 30
Backstroke Excellent 3 4 10 40
Backstroke Good 2 4 10 40
Backstroke Fair 1 1 10 10
Backstroke Poor 0 1 10 10
Brest stroke Excellent 3 6 10 60
Brest stroke Fair 1 4 10 40
Butterfly Good 2 7 10 70
Butterfly Fair 1 1 10 10
Butterfly Poor 0 2 10 20
So far, this is my code:
data <- arrange(data, Comfort_level_label, (Percentage))
data$Swimming_style <- factor(data$Swimming_style, levels = unique(data$Swimming_style))
ggplot(data, aes( x = Swimming_style, y = n, fill = Comfort_level_label)) +
geom_bar(position = "fill",stat = "identity") +
scale_y_continuous(labels = scales::percent_format())+
coord_flip()
Which outputs this:
But what I need the graph to do is sort by the Excellent rating from most Excellent on top to least or no Excellent on the bottom, and I'm having trouble doing exactly that.
It can be a little ambiguous, but the important thing when prioritizing is to ensure you always have exactly one of each of the factors.
library(dplyr)
SS <- dat %>%
arrange(-Comfort_level_scale, -n) %>%
group_by(Swimming_style) %>%
slice(1) %>%
ungroup() %>%
arrange(Comfort_level_scale, n) %>%
pull(Swimming_style)
library(ggplot2)
dat %>%
mutate(Swimming_style = factor(Swimming_style, levels = SS)) %>%
ggplot(aes( x = Swimming_style, y = n, fill = Comfort_level_label)) +
geom_bar(position = "fill",stat = "identity") +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip()
BTW: should Brest stroke be Breast stroke?
library(dplyr)
dat <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Swimming_style Comfort_level_label Comfort_level_scale n Total_n Percentage
Front_Crawl Excellent 3 7 10 70
Front_Crawl Good 2 3 10 30
Backstroke Excellent 3 4 10 40
Backstroke Good 2 4 10 40
Backstroke Fair 1 1 10 10
Backstroke Poor 0 1 10 10
Brest_stroke Excellent 3 6 10 60
Brest_stroke Fair 1 4 10 40
Butterfly Good 2 7 10 70
Butterfly Fair 1 1 10 10
Butterfly Poor 0 2 10 20") %>%
mutate(Swimming_style = gsub("_", " ", Swimming_style))
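Another way to get the same ordering, sketched here as an alternative rather than the method used above, is to compute each style's Excellent percentage directly and treat styles without an Excellent row as 0. The column names follow the question's data; `style_order` and `excellent_pct` are names made up for this example:

```r
library(dplyr)

# Same values as the question's table (only the columns needed here)
dat <- data.frame(
  Swimming_style = c(rep("Front Crawl", 2), rep("Backstroke", 4),
                     rep("Brest stroke", 2), rep("Butterfly", 3)),
  Comfort_level_label = c("Excellent", "Good",
                          "Excellent", "Good", "Fair", "Poor",
                          "Excellent", "Fair",
                          "Good", "Fair", "Poor"),
  Percentage = c(70, 30, 40, 40, 10, 10, 60, 40, 70, 10, 20),
  stringsAsFactors = FALSE
)

# Excellent share per style; sum() over an empty selection gives 0,
# so styles with no Excellent row sort first
style_order <- dat %>%
  group_by(Swimming_style) %>%
  summarise(excellent_pct = sum(Percentage[Comfort_level_label == "Excellent"])) %>%
  arrange(excellent_pct) %>%
  pull(Swimming_style)

# Use the computed order as factor levels before plotting
dat$Swimming_style <- factor(dat$Swimming_style, levels = style_order)
```

With `coord_flip()`, the last factor level is drawn at the top, so sorting in ascending order puts the style with the most Excellent ratings on top.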

Swimmerplot in R with 'empty' space between stacked bars (ggplot)

Problem Description
I am trying to make a swimmerplot in R using ggplot. However, I encounter a problem when I would like to have 'empty' space between two stacked bars of the plot: the bars are arranged next to one another.
Code & Sample data
I have the following sample data:
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
When I use the following code to generate a swimmerplot, I end up with a swimmerplot of 3 subjects. Subject 2 received only 1 treatment (treatment 1), so this displays correctly.
However, subject 1 received 3 treatments: treatment 1 from time point 0 up to time point 3, then nothing from 3 to 8, then treatment 2 from 8 until 10 etc...
However, the plot draws all treatments of patients 1 and 3 back to back, without the 'empty' intervals in between.
# Plot: bars
bars <- map(unique(df$patient)
, ~geom_bar(stat = "identity", position = "stack", width = 0.6
, data = df %>% filter(patient == .x)))
# Create plot
ggplot(data = df, aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) +
coord_flip()
Question
How do I include empty spaces between two non-consecutive treatments in this swimmerplot?
I don't think geom_bar is the right geom in this case. It's really meant for showing frequencies or counts and you can't explicitly control their start or end coordinates.
geom_segment is probably what you want:
library(tidyverse)
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
# Add end of treatment
df_wrangled <- df %>%
mutate(end = start + duration)
ggplot(df_wrangled) +
geom_segment(
aes(x = patient, xend = patient, y = start, yend = end, color = keytreat),
size = 8
) +
coord_flip()
Created on 2019-03-29 by the reprex package (v0.2.1)

How to make a bar plot using ggplot that uses multiple columns for the x-axis?

I am trying to use multiple column names as the x-axis in a barplot. So each column name will be the "factor" and the data it contains is the count for that.
I have tried iterations of this:
ggplot(aes( x = names, y = count)) + geom_bar()
I tried concatenating the x values I want to show with aes(c(col1, col2))
but the aesthetics length does not match and won't work.
library(dplyr)
library(ggplot2)
head(dat)
Sample Week Response_1 Response_2 Response_3 Response_4 Vaccine_Type
1 1 1 300 0 2000 100 1
2 2 1 305 0 320 15 1
3 3 1 310 0 400 35 1
4 4 1 400 1 410 35 1
5 5 1 405 0 180 35 2
6 6 1 410 2 800 75 2
dat %>%
group_by(Week) %>%
ggplot(aes(c(Response_1, Response_2, Response_3, Response_4))) +
geom_boxplot() +
facet_grid(.~Week)
dat %>%
group_by(Week) %>%
ggplot(aes(Response_1, Response_2, Response_3, Response_4)) +
geom_boxplot() +
facet_grid(.~Week)
> Error: Aesthetics must be either length 1 or the same as the data
> (24): x
Both of these failed (kind of expected based on aes length error code), but hopefully you know the direction I was aiming for and can help out.
Goal is to have 4 separate groups, each with their own boxplot (1 for every response). And also have them faceted by week.
Using the simple code below got me mostly what I want. Unfortunately, I don't think it would be as easy to add points and other features to the plot as it is with ggplot.
boxplot(dat[,3:6], use.cols = TRUE)
And I could pretty easily just filter by the different weeks and use mfrow for faceting. Not as informative as ggplot, but gets the job done. If anyone else has other workarounds, I'd be interested in seeing.
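One common ggplot approach is to reshape the response columns into long format first, so that the column name becomes a single variable that can be mapped to x. Below is a hedged sketch using the six rows printed above as stand-in data; `dat_long`, `Response`, and `Count` are names chosen for this example:

```r
library(tidyr)
library(ggplot2)

# Stand-in for the question's data: the six rows shown by head(dat)
dat <- data.frame(
  Sample       = 1:6,
  Week         = 1,
  Response_1   = c(300, 305, 310, 400, 405, 410),
  Response_2   = c(0, 0, 0, 1, 0, 2),
  Response_3   = c(2000, 320, 400, 410, 180, 800),
  Response_4   = c(100, 15, 35, 35, 35, 75),
  Vaccine_Type = c(1, 1, 1, 1, 2, 2)
)

# One row per Sample/Response pair instead of one column per response
dat_long <- pivot_longer(
  dat,
  cols      = starts_with("Response_"),
  names_to  = "Response",
  values_to = "Count"
)

# One boxplot per response column, with the raw points on top, faceted by week
p <- ggplot(dat_long, aes(x = Response, y = Count)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5) +
  facet_grid(. ~ Week)
```

Because the data are long, further layers or a `fill = factor(Vaccine_Type)` mapping can be added in the usual ggplot way.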

R - ggplot column or bar graph in 'dodge' position gives me a fat bar when there is no y value for that x

I want to plot percent survival per treatment (percent.o2). When y == 0% for a treatment, I get a fat bar. I'd like them to be the same width. Any advice appreciated.
Data looks like this:
> plotData
# A tibble: 12 x 4
# Groups: Percent.O2 [7]
Status Percent.O2 n percent
<fct> <fct> <int> <dbl>
1 Dead 1 144 1.00
2 Dead 3 141 0.979
3 Dead 7 144 1.00
4 Dead 10 105 0.729
5 Dead 13 69 0.958
6 Dead Control 12 0.167
7 Dead Control2 2 0.0278
8 Still_kicking 3 3 0.0208
9 Still_kicking 10 39 0.271
10 Still_kicking 13 3 0.0417
11 Still_kicking Control 60 0.833
12 Still_kicking Control2 70 0.972
Here's my code for the plot:
> ggplot(plotData, aes(x = Percent.O2, y = percent, fill = Status)) + geom_col(position = "dodge")
You can use the complete function from the tidyr package (which is loaded as part of the tidyverse suite of packages) to add rows for the missing levels and fill them with zero (NA, the default fill value, would work too). I've also added percent labels on the y-axis using percent from the scales package.
library(tidyverse)
library(scales)
ggplot(plotData %>%
complete(Status, nesting(Percent.O2), fill=list(n=0, percent=0)),
aes(x = Percent.O2, y = percent, fill = Status)) +
geom_col(position = "dodge") +
scale_y_continuous(labels=percent)
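The printed tibble is grouped by Percent.O2, and newer tidyr versions apply complete() within each group, so the sketch below ungroups first. This is a precaution under that assumption, not something stated in the answer. The data frame is rebuilt from the values shown in the question, and `completed` is a name made up for this example:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

o2_levels <- c("1", "3", "7", "10", "13", "Control", "Control2")

# Rebuilt from the question's printout, including the grouping
plotData <- tibble(
  Status     = factor(c(rep("Dead", 7), rep("Still_kicking", 5))),
  Percent.O2 = factor(c(o2_levels, "3", "10", "13", "Control", "Control2"),
                      levels = o2_levels),
  n          = c(144L, 141L, 144L, 105L, 69L, 12L, 2L, 3L, 39L, 3L, 60L, 70L),
  percent    = c(1.00, 0.979, 1.00, 0.729, 0.958, 0.167, 0.0278,
                 0.0208, 0.271, 0.0417, 0.833, 0.972)
) %>%
  group_by(Percent.O2)

# Add the missing Status/Percent.O2 combinations as explicit zero rows
completed <- plotData %>%
  ungroup() %>%
  complete(Status, Percent.O2, fill = list(n = 0L, percent = 0))

# Every x position now has two bars, so all bars get the same width
p <- ggplot(completed, aes(x = Percent.O2, y = percent, fill = Status)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
```

An alternative that leaves the data untouched is `geom_col(position = position_dodge(preserve = "single"))`, which keeps the width of a single bar constant when a group is missing.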
