Swimmerplot in R with 'empty' space between stacked bars (ggplot)

Problem Description
I am trying to make a swimmerplot in R using ggplot. However, I encounter a problem when I would like to have 'empty' space between two stacked bars of the plot: the bars are arranged next to one another.
Code & Sample data
I have the following sample data:
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
When I use the following code to generate a swimmerplot, I end up with a swimmerplot of 3 subjects. Subject 2 received only 1 treatment (treatment 1), so this displays correctly.
However, subject 1 received 3 treatments: treatment 1 from time point 0 up to time point 3, then nothing from 3 to 8, then treatment 2 from 8 until 10 etc...
As plotted, the treatments for patients 1 and 3 appear back to back instead of being separated by 'empty' intervals.
# Plot: bars (map, filter and %>% come from the tidyverse)
library(tidyverse)
bars <- map(unique(df$patient),
            ~ geom_bar(stat = "identity", position = "stack", width = 0.6,
                       data = df %>% filter(patient == .x)))
# Create plot
ggplot(data = df, aes(x = patient,
                      y = duration,
                      fill = reorder(keytreat, -start))) +
  bars +
  guides(fill = guide_legend("ordering")) +
  coord_flip()
Question
How do I include empty spaces between two non-consecutive treatments in this swimmerplot?

I don't think geom_bar is the right geom in this case. It's really meant for showing frequencies or counts and you can't explicitly control their start or end coordinates.
geom_segment is probably what you want:
library(tidyverse)
# Sample data
df <- read.table(text="patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header=TRUE)
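# Add end of treatment
df_wrangled <- df %>%
  mutate(end = start + duration)
# Draw one thick segment per treatment, from its start to its end
ggplot(df_wrangled) +
  geom_segment(
    aes(x = patient, xend = patient, y = start, yend = end, color = keytreat),
    size = 8  # in ggplot2 >= 3.4.0, use linewidth = 8 (size is deprecated for lines)
  ) +
  coord_flip()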
Created on 2019-03-29 by the reprex package (v0.2.1)
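A quick way to confirm that the gaps survive the wrangling step is to compute them directly. The following is a minimal base R sketch, not part of the original answer: the column names prev_end and gap_before are introduced here for illustration, and it assumes the rows are already sorted by patient and start, as they are in the sample data.

```r
# Sample data from the question
df <- read.table(text = "patient start keytreat duration
sub-1 0 treat1 3
sub-1 8 treat2 2
sub-1 13 treat3 1.5
sub-2 0 treat1 4.5
sub-3 0 treat1 4
sub-3 4 treat2 8
sub-3 13.5 treat3 2", header = TRUE)

# End of each treatment
df$end <- df$start + df$duration

# End of the previous treatment within the same patient (NA for the first one)
df$prev_end <- ave(df$end, df$patient, FUN = function(x) c(NA, head(x, -1)))

# Untreated time before each treatment; 0 means the treatments are consecutive
df$gap_before <- df$start - df$prev_end

print(df[, c("patient", "keytreat", "start", "end", "gap_before")])
```

Any positive gap_before corresponds to an empty stretch that a start/end based geom can show, but that stacked bars of duration cannot.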

Related

how to count and group categorical data by range in r

I have data from a questionnaire that has a column for year of birth. The range of the data was too large, so my plot became confusing. I'm now trying to take the years, group them by decade, and then chart them, but I don't know how to group them.
My data looks like this:
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
and my plot is like:
However, I want my data by group as:
birth_year count
(1920-1930]: 5
(1931-1940]: 8
(1941-1950]: 4
(1951-1960]: 3
(1961-1970]: 5
(1971-1980]: 5
(1981-1990]: 5
(1991-2000]: 9
and then plot as a range group.
We can use cut() to group the data, and then plot with ggplot().
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
birth_year$yearGroup <- cut(as.integer(birth_year$years),breaks = 8,dig.lab = 4,
include.lowest = FALSE)
library(ggplot2)
ggplot(birth_year,aes(x = yearGroup)) + geom_bar()
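Alternatively, cut_width() from ggplot2 bins the years into 10-year intervals, which count() from dplyr then tallies:
library(dplyr)
birth_year %>%
  mutate(val = cut_width(as.numeric(years), 10, boundary = 1920, dig.lab = -1)) %>%
  count(val)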
val n
1 [1920,1930] 5
2 (1930,1940] 8
3 (1940,1950] 4
4 (1950,1960] 3
5 (1960,1970] 5
6 (1970,1980] 5
7 (1980,1990] 5
8 (1990,2000] 9
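For comparison, the same decade counts can be obtained in base R by giving cut() explicit break points instead of a number of bins. This is a sketch under the assumption that decades should be closed on the right, with 1920 included in the first bin; the variable names are chosen here for illustration.

```r
# Birth years from the question, converted from character to integer
years <- as.integer(c(
  "1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
  "1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
  "1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
  "1920","1931","1977","1999","1967","1992","1998","1984"
))

# Explicit decade boundaries: 1920, 1930, ..., 2000
decade <- cut(years,
              breaks = seq(1920, 2000, by = 10),
              include.lowest = TRUE,  # put 1920 itself into the first bin
              dig.lab = 4)            # print 1920 rather than 1.92e+03

# Number of respondents per decade
counts <- table(decade)
print(counts)
```

With breaks = 8, cut() instead splits the observed range into eight equal-width bins, so the boundaries generally do not fall on round decades.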

Build a plot made up of multiple plots

I have 5 different survfit() plots of different models that I am trying to combine into one plot in the style of a landmark analysis plot, as seen below.
At the moment they are just plot(survfit(model, newdata = )). How could I combine them so that I have the line for days 0-100 of survfit 1, 100-200 of survfit 2, etc.?
Let's create a model using the built-in lung data from the survival package:
library(survival)
library(tidyverse)
mod1 <- survfit(Surv(time, status) ~ sex, data = lung)
This model actually contains all we need to make the plot. We can convert it to a data frame as follows:
df <- as.data.frame(unclass(mod1)[c(2:7, 15:16)])
df$sex <- rep(c("Male", "Female"), times = mod1$strata)
head(df)
#> time n.risk n.event n.censor surv std.err lower upper sex
#> 1 11 138 3 0 0.9782609 0.01268978 0.9542301 1.0000000 Male
#> 2 12 135 1 0 0.9710145 0.01470747 0.9434235 0.9994124 Male
#> 3 13 134 2 0 0.9565217 0.01814885 0.9230952 0.9911586 Male
#> 4 15 132 1 0 0.9492754 0.01967768 0.9133612 0.9866017 Male
#> 5 26 131 1 0 0.9420290 0.02111708 0.9038355 0.9818365 Male
#> 6 30 130 1 0 0.9347826 0.02248469 0.8944820 0.9768989 Male
With a bit of data manipulation, we can define 100-day periods and renormalize the curves at the start of each period. Then we can plot using geom_step.
df %>%
  filter(time < 300) %>%
  group_by(sex) %>%
  mutate(period = factor(100 * floor(time / 100))) %>%
  group_by(sex, period) %>%
  mutate(surv = surv / first(surv)) %>%
  ggplot(aes(time, surv, color = sex, group = interaction(period, sex))) +
  geom_step(size = 1) +
  geom_vline(xintercept = c(0, 100, 200, 300), linetype = 2) +
  scale_color_manual(values = c("deepskyblue4", "orange")) +
  theme_minimal(base_size = 16) +
  theme(legend.position = "top")
Created on 2022-09-03 with reprex v2.0.2
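The renormalization step can be illustrated on its own with a few made-up numbers. The sketch below uses base R and a hypothetical single-group curve (the values are invented for illustration and do not come from the lung data); surv_rescaled is a name introduced here.

```r
# Hypothetical survival estimates for one group (illustrative values only)
curve <- data.frame(
  time = c(10, 50, 120, 180, 220),
  surv = c(0.90, 0.80, 0.60, 0.40, 0.30)
)

# Assign each time point to a 100-day period: 0, 100, 200, ...
curve$period <- 100 * floor(curve$time / 100)

# Within each period, divide by the first survival value in that period,
# so every segment restarts at 1
curve$surv_rescaled <- ave(curve$surv, curve$period, FUN = function(s) s / s[1])

print(curve)
```

Note that this divides by the first observed estimate inside each period, which is what first(surv) does in the pipeline above; it is not necessarily the survival probability exactly at day 100 or 200.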

Ridge plot: sort by value / rank

I have a data set which I uploaded here as a gist in CSV format.
It is the extracted form of the PDFs provided in the YouGov article "How good is 'good'?". People were asked to rate words (e.g. "perfect", "bad") with a score between 0 (very negative) and 10 (very positive). The gist contains exactly that data, i.e. for every word (column: Word) it stores for every ranking from 0 to 10 (column: Category) the number of votes (column: Total).
I would usually try to visualize the data with matplotlib and Python since I lack knowledge in R, but it seems that ggridges can create way nicer plots than I see myself doing with Python.
Using:
library(readr)  # provides read_csv()
library(ggplot2)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov, aes(x = Category, y = Word, height = Total, group = Word, fill = Word)) +
  geom_density_ridges(stat = "identity", scale = 3)
I was able to create this plot (which is still far from perfect):
Ignoring the fact that I have to tweak the aesthetics, there are three things I struggle to do:
Sort the words by their average rank.
Color the ridge by the average rank.
Or color the ridge by the category value, i.e. with varying color.
I tried to adapt the suggestions from this source, but ultimately failed because my data seems to be in the wrong format: Instead of having single instances of votes, I already have the aggregated vote count for each category.
I hope to end up with a result closer to this plot, which satisfies criteria 3 (source):
It took me a little while to get there myself. The key for me was understanding the data and how to order Word based on the average Category score. So let's look at the data first:
> YouGov
# A tibble: 440 x 17
ID Word Category Total Male Female `18 to 35` `35 to 54` `55+`
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Incr~ 0 0 0 0 0 0 0
2 1 Incr~ 1 1 1 1 1 1 0
3 2 Incr~ 2 0 0 0 0 0 0
4 3 Incr~ 3 1 1 1 1 1 1
5 4 Incr~ 4 1 1 1 1 1 1
6 5 Incr~ 5 5 6 5 6 5 5
7 6 Incr~ 6 6 7 5 5 8 5
8 7 Incr~ 7 9 10 8 10 7 10
9 8 Incr~ 8 15 16 14 13 15 16
10 9 Incr~ 9 20 20 20 22 18 19
# ... with 430 more rows, and 8 more variables: Northeast <dbl>,
# Midwest <dbl>, South <dbl>, West <dbl>, White <dbl>, Black <dbl>,
# Hispanic <dbl>, `Other (NET)` <dbl>
Every Word has a row for every Category (or score, 0-10). The Total provides the number of responses for that Word/Category combination. So although there were no responses where the word "Incredible" scored zero, there is still a row for it.
Before we calculate the average score for each Word we calculate the product of Category and Total for each Word-Category combination, let's call it Total Score. From there, we can treat Word as a factor, and reorder based on the average Total Score using forcats. After that, you can plot your data just as you did.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
YouGov %>%
  mutate(total_score = Category * Total) %>%
  mutate(Word = fct_reorder(.f = Word, .x = total_score, .fun = mean)) %>%
  ggplot(aes(x = Category, y = Word, height = Total, group = Word, fill = Word)) +
  geom_density_ridges(stat = "identity", scale = 3)
By treating Word as a factor, we reordered the Words based on the mean of their Total Score. ggplot also orders the colors accordingly, so we don't have to modify them ourselves unless you'd prefer a different color palette.
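Because the data are already aggregated, the average score of a word is a weighted mean: sum(Category * Total) / sum(Total). Ordering by the mean (or sum) of Category * Total gives the same ranking only when every word's Total values add up to roughly the same amount, which I am assuming holds for percentage data like this. A minimal base R sketch on invented toy votes (the words and numbers below are hypothetical, not from the gist):

```r
# Hypothetical aggregated votes: three words, three score categories each
votes <- data.frame(
  Word     = rep(c("terrible", "good", "perfect"), each = 3),
  Category = rep(c(0, 5, 10), times = 3),
  Total    = c(70, 25, 5,    # terrible
               10, 60, 30,   # good
               5, 15, 80)    # perfect
)

# Weighted mean score per word: sum(score * votes) / sum(votes)
avg_score <- sapply(split(votes, votes$Word),
                    function(d) sum(d$Category * d$Total) / sum(d$Total))

# Reorder the factor levels from lowest to highest average score
votes$Word <- factor(votes$Word, levels = names(sort(avg_score)))

print(sort(avg_score))
print(levels(votes$Word))
```

A factor ordered this way can be mapped to y in the ridge plot in place of the fct_reorder() call.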
The other solution is exactly correct. I just wanted to point out that you can call fct_reorder() from within aes() for an even more compact solution. However, you need to do it twice if you want to change fill color by position along the y axis.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
       aes(
         x = Category,
         y = fct_reorder(Word, Category * Total, .fun = sum),
         height = Total,
         fill = fct_reorder(Word, Category * Total, .fun = sum)
       )) +
  geom_density_ridges(stat = "identity", scale = 3) +
  theme(legend.position = "none")
Created on 2020-01-19 by the reprex package (v0.3.0)
If instead you want to color by x position, you can do something like the following. It just doesn't look as nice as the temperature example because the x values come in discrete steps.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
       aes(
         x = Category,
         y = fct_reorder(Word, Category * Total, .fun = sum),
         height = Total,
         fill = stat(x)
       )) +
  geom_density_ridges_gradient(stat = "identity", scale = 3) +
  theme(legend.position = "none") +
  scale_fill_viridis_c(option = "C")
Created on 2020-01-19 by the reprex package (v0.3.0)

geom_hline with multiple points and facet_wrap

I am trying to plot horizontal lines at specific points of my data. For each of my axes (SA, VLA, HLA), I would like a horizontal line whose y-intercept is the value at the first equivalent iteration (i.e. 0). My question will become clearer with data.
iterations subsets equivalent_iterations axis ratio1 ratio2
0 0 0 SA 0.023569024 0.019690577
0 0 0 SA 0.023255814 0.019830028
0 0 0 VLA 0.025362319 0.020348837
0 0 0 HLA 0.022116904 0.021472393
2 2 4 SA 0.029411765 0.024911032
2 2 4 SA 0.024604569 0.022838499
2 2 4 VLA 0.026070764 0.022727273
2 2 4 HLA 0.027833002 0.027888446
4 15 60 SA 0.019746121 0.014403292
4 15 60 SA 0.018691589 0.015538291
4 15 60 VLA 0.021538462 0.01686747
4 15 60 HLA 0.017052375 0.017326733
16 5 80 SA 0.019021739 0.015021459
16 5 80 SA 0.020527859 0.015384615
16 5 80 VLA 0.023217247 0.017283951
16 5 80 HLA 0.017391304 0.016298021
and this is my plot using ggplot
ggplot(df) +
  aes(x = equivalent_iterations, y = ratio1, color = equivalent_iterations) +
  geom_point() +
  facet_wrap(~axis) +
  expand_limits(x = 0, y = 0)
What I want is, for each axis SA, VLA, HLA (i.e. each facet_wrap panel), a horizontal line through the first point (which is at 0 equivalent iterations), with the y-intercept given by ratio1 (column 5) in the first 4 rows. Any help will be greatly appreciated. Thank you in advance.
You can treat it like any other geom_*. Just create a new column with the value of ratio1 at which you want to plot the horizontal line. I do this by subsetting the data to the rows where iterations == 0 (note SA has 2 of these) and joining the ratio1 column onto the original data frame. This column can then be passed to the aesthetics call in geom_hline().
library(tidyverse)
df %>%
  left_join(df %>%
              filter(iterations == 0) %>%
              select(axis, intercept = ratio1)) %>%
  ggplot(aes(x = equivalent_iterations, y = ratio1,
             color = equivalent_iterations)) +
  geom_point() +
  geom_hline(aes(yintercept = intercept)) +
  facet_wrap(~axis) +
  expand_limits(x = 0, y = 0)
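Since SA has two rows at iterations == 0, the join above duplicates the SA rows and draws two lines in that facet. If a single line per facet is preferred, one option is to collapse the baseline rows to one value per axis first. The sketch below does this in base R by averaging, which is an assumption on my part (taking the first or the maximum value would work the same way); the names baseline and intercepts are introduced here.

```r
# Rows of the question's data where iterations == 0
baseline <- read.table(text = "axis ratio1
SA 0.023569024
SA 0.023255814
VLA 0.025362319
HLA 0.022116904", header = TRUE)

# One intercept per axis: average the baseline ratio1 values
intercepts <- aggregate(ratio1 ~ axis, data = baseline, FUN = mean)
names(intercepts)[names(intercepts) == "ratio1"] <- "intercept"

print(intercepts)
```

Passing this data frame to the layer, as in geom_hline(data = intercepts, aes(yintercept = intercept)), should then draw one line per facet, because it carries the axis column that facet_wrap() splits on.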

R - ggplot column or bar graph in 'dodge' position gives me a fat bar when there is no y value for that x

I want to plot percent survival per treatment (percent.o2). When y == 0% for a treatment, I get a fat bar. I'd like them to be the same width. Any advice appreciated.
Data looks like this:
> plotData
# A tibble: 12 x 4
# Groups: Percent.O2 [7]
Status Percent.O2 n percent
<fct> <fct> <int> <dbl>
1 Dead 1 144 1.00
2 Dead 3 141 0.979
3 Dead 7 144 1.00
4 Dead 10 105 0.729
5 Dead 13 69 0.958
6 Dead Control 12 0.167
7 Dead Control2 2 0.0278
8 Still_kicking 3 3 0.0208
9 Still_kicking 10 39 0.271
10 Still_kicking 13 3 0.0417
11 Still_kicking Control 60 0.833
12 Still_kicking Control2 70 0.972
Here's my code for the plot:
> ggplot(plotData, aes(x = Percent.O2, y = percent, fill = Status)) + geom_col(position = "dodge")
You can use the complete function from the tidyr package (which is loaded as part of the tidyverse suite of packages) to add rows for the missing levels and fill them with zero (NA, the default fill value, would work too). I've also added percent labels on the y-axis using percent from the scales package.
library(tidyverse)
library(scales)
ggplot(plotData %>%
         complete(Status, nesting(Percent.O2), fill = list(n = 0, percent = 0)),
       aes(x = Percent.O2, y = percent, fill = Status)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent)
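To see what filling in the missing combinations amounts to, here is a base R sketch of the same idea on a small made-up subset (the three rows below are abridged from the question's table; plot_subset, all_combos and filled are names introduced here). It builds every Status and Percent.O2 pair with expand.grid(), merges the observed values in, and sets the unobserved ones to zero.

```r
# Small illustrative subset: treatment "1" has no Still_kicking row
plot_subset <- data.frame(
  Status     = c("Dead", "Dead", "Still_kicking"),
  Percent.O2 = c("1", "3", "3"),
  percent    = c(1.00, 0.979, 0.0208)
)

# Every combination of status and treatment
all_combos <- expand.grid(
  Status     = unique(plot_subset$Status),
  Percent.O2 = unique(plot_subset$Percent.O2),
  stringsAsFactors = FALSE
)

# Keep all combinations; unobserved ones get NA, which we replace with 0
filled <- merge(all_combos, plot_subset, all.x = TRUE)
filled$percent[is.na(filled$percent)] <- 0

print(filled)
```

With a row present for every combination, position = "dodge" reserves a slot for each status at each treatment, so the remaining bars keep a common width.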
