Plotting normal distributions in a ridgeline plot with ggridges - r

I'm a little embarrassed to ask this question, but I've spent the better part of my work day trying to find a solution, and yet here I am...
What I'm aiming for is a simple ridgeline plot of several normal distributions which are calculated from given means and SDs in my data, like in this example:
case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3
All the examples I've found work with series of measurements, like the example with the temperatures in Lincoln, NE (https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html), and I cannot get them to work.
As for my experience with R: I am not a complete idiot when it comes to data analysis, and proper visualization is something I am eager to learn, but unfortunately right now I just need a solution to my problem.
Thank you very much for your help!

Edit -- added precise theoretical answer.
Here's a way using dnorm to construct exact normal curves to those specifications:
library(tidyverse); library(ggridges)
n = 100
df3 <- df %>%
mutate(low = caseMean - 3 * caseSD, high = caseMean + 3 * caseSD) %>%
uncount(n, .id = "row") %>%
mutate(x = (1 - row/n) * low + row/n * high,
norm = dnorm(x, caseMean, caseSD))
ggplot(df3, aes(x, case_number, height = norm)) +
geom_ridgeline(scale = 3)
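The expression (1 - row/n) * low + row/n * high above is a linear interpolation between low and high. As a minimal self-contained sketch of the same idea, assuming the sample data from the question, the grid can also be built per case with seq(). The helper object name curves is made up for illustration, and unlike the code above this grid includes both endpoints:

```r
library(ggplot2)
library(ggridges)

# Sample data from the question
df <- data.frame(
  case_number = c("case1", "case2", "case3"),
  caseMean = c(0, 1, 3),
  caseSD = c(1, 2, 3)
)

n <- 100  # grid points per curve

# For each case, evaluate the normal density on an evenly spaced grid
# spanning mean +/- 3 SD ("curves" is an illustrative name)
curves <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  x <- seq(df$caseMean[i] - 3 * df$caseSD[i],
           df$caseMean[i] + 3 * df$caseSD[i],
           length.out = n)
  data.frame(case_number = df$case_number[i],
             x = x,
             norm = dnorm(x, df$caseMean[i], df$caseSD[i]))
}))

# geom_ridgeline() draws the supplied heights directly, with no density estimation
p <- ggplot(curves, aes(x, case_number, height = norm)) +
  geom_ridgeline(scale = 3)
print(p)
```

Because the heights come straight from dnorm(), the curves are exact rather than estimated from random draws.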
Similar to Sada93's answer, using dplyr and tidyr:
library(tidyverse); library(ggridges)
n = 50000
df2 <- df %>%
uncount(n) %>%
mutate(value = rnorm(n(), caseMean, caseSD))
ggplot(df2, aes(x = value, y = case_number)) + geom_density_ridges()
sample data:
df <- read.table(
header = TRUE,
stringsAsFactors = FALSE,
text = "case_number caseMean caseSD
case1 0 1
case2 1 2
case3 3 3")

You need to create a new data frame with the actual distribution values and then use ggridges as follows:
library(ggplot2)
library(ggridges)
data = data.frame(case = c("case1","case2","case3"),caseMean = c(0,1,3),caseSD = c(1,2,3))
#Create 100 rows for each mean and SD
data_plot = data.frame(case = character(),value = numeric())
n = 100
for(i in 1:nrow(data)){
case = data$case[i]
mean = data$caseMean[i]
sd = data$caseSD[i]
val = rnorm(n,mean,sd)
data_plot = rbind(data_plot,
data.frame(case = rep(case,n),
value = val))
}
ggplot(data = data_plot,aes(x = value,y = case))+geom_density_ridges()

Related

How to group datasets in a boxplot?

I have been trying to figure out how to group 9 datasets into 3 different groups (1, 2, and 3).
I have 3 different data frames that look like this:
ID1 ID2 dN dS Omega Label_ID1 Label_ID2 Group
QJY77946 NP_073551 0.0293 0.0757 0.3872 229E-CoV 229E-CoV Intra
QJY77954 NP_073551 0.0273 0.0745 0.3668 229E-CoV 229E-CoV Intra
...
So, the only columns that I'm interested in are three: dN, dS, and Omega.
My main goal is to take these three columns from my data frames and plot them in a boxplot using RStudio.
To do that, I first extract the 3 columns of each data frame with these lines:
dN_1 <- df_1$dN
dS_1 <- df_1$dS
Omega_1 <- df_1$Omega
Then, to generate the plot I use this line (option 1):
boxplot(dN_S, dS_S, Omega_S, dN_M, dS_M, Omega_M, dN_E, dS_E, Omega_E,
main = "Test",
xlab = "Frames",
ylab = "Distribution",
col = "red")
My goal is to group these 9 boxes into 3 separate groups.
I know that using ggplot2 could be easier, so my option 2 is to use these lines:
df_1 %>%
ggplot(aes(y=dN_S)) +
geom_boxplot(
color = "blue",
fill = "blue",
alpha = 0.2,
notch = T,
notchwidth = 0.8)
However, as you can see, I couldn't find a way to plot all groups in the same plot.
So how can I group my data in the boxplot using option 1 or option 2? The second option is less developed, but perhaps someone could help with that too.
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
set.seed(123)
df_s <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_m <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df_e <- data.frame(dN = runif(20),
dS = runif(20),
Omega = runif(20))
df <-
list(df_s, df_m, df_e) %>%
set_names(c("S", "M", "E")) %>%
map_dfr(bind_rows, .id = "df") %>%
pivot_longer(-df)
ggplot(df)+
geom_boxplot(aes(x = name, y = value))+
facet_wrap(~df, nrow = 1)
Created on 2021-09-24 by the reprex package (v2.0.0)
One way to accomplish this is by providing ggplot() another aesthetic, like fill. Here's a small reproducible example:
library(tidyverse)
df <- tibble(category = rep(letters[1:4], 5),
time = c(rep("before", 10), rep("after", 10)),
num = rnorm(20))
df %>%
ggplot() +
geom_boxplot(aes(x=category, y=num, fill = time))
Let me know if you're looking for something else.
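For the base-graphics route (option 1 in the question), here is a minimal sketch, assuming simulated stand-ins for the three data frames. The names frames, make_frame, cols, and pos are made up for illustration. The idea is to pass explicit box positions through the at argument of boxplot() and leave gaps between groups:

```r
set.seed(123)

# Simulated stand-ins for the three data frames (the names S, M, E are assumptions)
make_frame <- function() {
  data.frame(dN = runif(20), dS = runif(20), Omega = runif(20))
}
frames <- list(S = make_frame(), M = make_frame(), E = make_frame())

# Flatten into a named list of 9 vectors: S.dN, S.dS, S.Omega, M.dN, ...
cols <- unlist(lapply(frames, as.list), recursive = FALSE)

# Skipping positions 4 and 8 leaves a gap between the three groups
pos <- c(1:3, 5:7, 9:11)

boxplot(cols,
        at = pos,
        names = rep(c("dN", "dS", "Omega"), times = 3),
        col = rep(c("red", "green", "blue"), each = 3),
        main = "Test",
        ylab = "Distribution")

# Label each group underneath its three boxes
mtext(names(frames), side = 1, line = 3, at = c(2, 6, 10))
```

Colouring by group is optional; the visual grouping comes from the gaps in pos.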

How to plot % positive cases (y-axis) by collection date (x-axis) and by other factors (R)?

Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date means more cases that day), I want to display Number of positive cases/total (positive + negative) cases for that day as a percent on the y-axis, with collection dates along the x-axis. Then I want to break down by region. Goal is to look like this but in terms of daily positives/# of tests rather than just positives vs negatives. I also want to add a horizontal line on every graph at 20%.
I have tried manipulating the data before, inside, and after the ggplot call:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close. But the percents are wrong because they are just taking the proportion of that day against counts of all days instead of per day.
Then I tried using tally() in the following command to count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure I understood what you want, but this may be helpful anyway.
The data: Since you are new here, I should let you know that providing a simple, reproducible version of your data makes it easier for the rest of us to answer. To do this, you can simulate a data frame or any other object, or use the dput function on your data.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
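The question also asked for percentages on the y-axis. A minimal sketch, assuming the scales package is installed and using simulated data with the same column names; the object name daily is made up for illustration. It computes the daily positive share as the mean of a logical vector and formats the axis with scales::percent_format():

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Simulated data with the same column names as in the question
data <- data.frame(
  CollectionDate = sample(
    seq(as.Date("2020-01-01"), by = "day", length.out = 15),
    size = 120, replace = TRUE),
  TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
  Region = sample(c("Region 1", "Region 2"), size = 120, replace = TRUE)
)

# The mean of a logical vector is the share of TRUE values,
# i.e. positives / (positives + negatives) per day and region
daily <- data %>%
  group_by(CollectionDate, Region) %>%
  summarise(positive_pro = mean(TestResult == "Positive"), .groups = "drop")

p <- ggplot(daily, aes(x = CollectionDate, y = positive_pro)) +
  geom_col() +
  # horizontal reference line at 20%
  geom_hline(yintercept = 0.2, linetype = "dashed") +
  # show 0.2 as "20%" on the axis
  scale_y_continuous(labels = scales::percent_format()) +
  facet_wrap(~Region)
print(p)
```

Days without any tests in a region simply have no bar, since no row exists for that combination.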
I can get you halfway there (refer to the comments in the code for clarifications). This code is for the counts per day per region (plotted separately for each region). I think you can tweak things further to calculate the counts per day per county too; and whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
df_final %<>%
spread(key = TestResult, value = n) %>%
mutate(Total = Negative + Positive, #mutate() evaluates in order, so store the raw total first
Negative = Negative/Total,
Positive = Positive/Total)
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:

Line density heatmap in R

Problem description
I have thousands of lines (~4000) that I want to plot. However, it is infeasible to plot all of them using geom_line() and rely on, for example, alpha=0.1 to show where the density of lines is high and where it is not. I came across something similar in Python, where the second plot in the answers looks especially nice, but I do not know whether something similar can be achieved in ggplot2. I am after something like this:
An example dataset
It would make much more sense to demonstrate this with a set showing a pattern, but for now I just generated random sine curves:
library(dplyr)
set.seed(1)
gen.dat <- function(key) {
c <- sample(seq(0.1,1, by = 0.1), 1)
time <- seq(c*pi,length.out=100)
val <- sin(time)
time = 1:100
data.frame(time,val,key)
}
dat <- lapply(seq(1,10000), gen.dat) %>% bind_rows()
Tried heatmap
I tried a heatmap like answered here, however this heatmap will not consider the connection of points over the complete axis (like in a line) but rather show the "heat" per time point.
Question
How can we, in R using ggplot2, plot a heatmap of lines similar to that shown in the first figure?
Looking closely, one can see that the graph to which you are linking consists of many, many, many points rather than lines.
The ggpointdensity package produces a similar visualisation. Note that with so many data points there are noticeable performance issues. I am using the development version because it contains the method argument, which allows different smoothing estimators and apparently copes better with larger numbers of points. There is a CRAN version too.
You can adjust the smoothing with the adjust argument.
I have increased the density of x values in your code to make the points look more like lines, and slightly reduced the number of 'lines' in the plot.
library(tidyverse)
#devtools::install_github("LKremer/ggpointdensity")
library(ggpointdensity)
set.seed(1)
gen.dat <- function(key) {
c <- sample(seq(0.1,1, by = 0.1), 1)
time <- seq(c*pi,length.out=500)
val <- sin(time)
time = seq(0.02,100,0.1)
data.frame(time,val,key)
}
dat <- lapply(seq(1, 1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val)) +
geom_pointdensity(size = 0.1, adjust = 10)
#> geom_pointdensity using method='kde2d' due to large number of points (>20k)
Created on 2020-03-19 by the reprex package (v0.3.0)
update
Thanks user Robert Gertenbach for creating some more interesting sample data. Here the suggested use of ggpointdensity on this data:
library(tidyverse)
library(ggpointdensity)
gen.dat <- function(key) {
has_offset <- runif(1) > 0.5
time <- seq(1, 1000, length.out = 1000)
val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
rgamma(1, 20, 20)
data.frame(time,val,key)
}
dat <- lapply(seq(1,1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val, group = key)) +
stat_pointdensity(geom = "line", size = 0.05, adjust = 10) +
scale_color_gradientn(colors = c("blue", "yellow", "red"))
Created on 2020-03-24 by the reprex package (v0.3.0)
Your data will result in a quite uniform polkadot density.
I generated some slightly more interesting data like this:
gen.dat <- function(key) {
has_offset <- runif(1) > 0.5
time <- seq(1, 1000, length.out = 1000)
val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
rgamma(1, 20, 20)
data.frame(time,val,key)
}
dat <- lapply(seq(1,1000), gen.dat) %>% bind_rows()
We then get a 2d density estimate. kde2d doesn't have a predict function, so we model it with a LOESS fit:
dens <- MASS::kde2d(dat$time, dat$val, n = 400)
dens_df <- data.frame(with(dens, expand_grid( y, x)), z = as.vector(dens$z))
fit <- loess(z ~ y * x, data = dens_df, span = 0.02)
dat$z <- predict(fit, with(dat, data.frame(x=time, y=val)))
Plotting it then gets this result:
ggplot(dat, aes(time, val, group = key, color = z)) +
geom_line(size = 0.05) +
theme_minimal() +
scale_color_gradientn(colors = c("blue", "yellow", "red"))
This is all highly reliant on:
The number of series
The resolution of series
The density of kde2d
The span of loess
so your mileage may vary
I came up with the following solution using geom_segment(). However, I'm not sure geom_segment() is the way to go: it only counts segments whose endpoint values are exactly the same, whereas in a heatmap (as in my question) values near each other also contribute to the 'heat'.
# Simple stats to get all possible line segments
vals <- unique(dat$time)
min.val = min(vals)
max.val = max(vals)
# Get all possible line segments
comb.df <- data.frame(
time1 = min.val:(max.val - 1),
time2 = (min.val + 1): max.val
)
# Join the original data to all possible line segments
comb.df <- comb.df %>%
left_join(dat %>% select(time1 = time, val1 = val, key )) %>%
left_join(dat %>% select(time2 = time, val2 = val, key ))
# Count how often each line segment occurs in the data
comb.df <- comb.df %>%
group_by(time1, time2, val1, val2) %>%
summarise(n = n_distinct(key))
# ggplot2 to plot segments
ggplot(comb.df %>% arrange(n)) +
geom_segment(aes(x = time1, y = val1, xend = time2, yend = val2, color = n), alpha =0.9) +
scale_colour_gradient( low = 'green', high = 'red') +
theme_bw()
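Another option, not taken from the answers above, is to skip the density estimate and simply bin the densely sampled points with ggplot2's geom_bin2d(). A minimal sketch, assuming sample data similar to the more interesting set generated earlier; the helper gen_line and the choice of 200 series and 150 bins are assumptions for illustration:

```r
library(ggplot2)

set.seed(1)

# Hypothetical helper: one noisy sine curve per key, sampled at 1000 time points
gen_line <- function(key) {
  time <- seq(1, 1000, length.out = 1000)
  val <- sin(time / 100 + rnorm(1, sd = 0.2)) * rgamma(1, 20, 20)
  data.frame(time, val, key)
}
dat <- do.call(rbind, lapply(1:200, gen_line))

# Count how many sampled points fall into each cell of a 150 x 150 grid;
# cells crossed by many curves get a higher count and a hotter colour
p <- ggplot(dat, aes(time, val)) +
  geom_bin2d(bins = 150) +
  scale_fill_gradientn(colours = c("blue", "yellow", "red")) +
  theme_minimal()
print(p)
```

This only approximates line density: it works when each series is sampled densely enough that consecutive points land in the same or adjacent bins, so the bin count and the sampling resolution need to be tuned together.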

Control relative sizes of discrete scale in ggplot2

I'm trying to generate a polar violin plot with ggplot2. I'd like to control the relative size of each category (the width of each category of the factor on the x axis, which then translates to angle once I make the coordinates polar).
Is there any way to do this?
Example code:
library(tidyverse)
means <- runif(n = 10, min=0.1, max=0.6)
sds <- runif(n = 10, min=0.2, max=0.4)
frame <- data.frame(
cat = sample(1:10, size=10000, replace=TRUE),
value = rnorm(10000)
) %>%
mutate(
mn = means[cat],
sd = sds[cat],
value = (value * sd) + mn,
cat = factor(cat)
)
frame %>%
ggplot(aes(x = cat, y = value)) + geom_violin() +
coord_polar()
Any help or advice is appreciated.
Alternatively (and perhaps better), I'd like to be able to make a polar coordinates chart that isn't centered. Where the angles are the same for each discrete category, but the points converge, say, 1/3 of the way from the bottom of the circle, rather than in the center of the circle.
Based on comments, I'm redoing my previous answer. If what you want is a fan/weed leaf shape, you can add dummy data for additional cat values. In this example, I just doubled the number of levels in cat, but you could change this. Then I set the x breaks to only show the values that actually have data, but let the dummy values take up space to change the shape. Still not sure if this is what you meant but it's interesting to try.
library(tidyverse)
means <- runif(n = 10, min=0.1, max=0.6)
sds <- runif(n = 10, min=0.2, max=0.4)
frame <- data.frame(
cat = sample(1:10, size=10000, replace=TRUE),
value = rnorm(10000)
) %>%
mutate(
mn = means[cat],
sd = sds[cat],
value = (value * sd) + mn,
cat = factor(cat)
)
frame %>%
mutate(cat = as.integer(cat)) %>%
bind_rows(tibble(cat = 11:20, value = NA)) %>%
ggplot(aes(x = as.factor(cat), y = value)) +
geom_violin(scale = "area") +
coord_polar(start = -pi / 2) +
scale_x_discrete(breaks = 1:10)
#> Warning: Removed 10 rows containing non-finite values (stat_ydensity).
Created on 2018-05-08 by the reprex package (v0.2.0).

R - ggplot2 parallel categorical plot

I am working with categorical longitudinal data. My data has 3 simple variables such as :
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
Where variable is basically time, and value are the 3 possible categories one id can take.
I am interested in producing a "parallel" longitudinal graph, similar to this with ggplot2
I am struggling a bit to get it right. What I came up for now is this :
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transition" (the id lines).
I wondered if you could help me with the following:
a) making the id lines visible, or making them "thicker" according to the number of ids going from one state to the other
b) I would also like to re-size the points according to the number of ids in each state. I tried to do it with geom_point(aes(colour=value, size = nn), size=4) but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(ggplot2)
library(reshape2) # provides melt()
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# transform to PP file #
dt = dt %>% melt(id.vars = c('id'))
# number the time points 1..n within each id #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph # 
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)
You were so close with geom_point(aes(colour=value, size = nn), size=4). The problem was that you redefined size after defining it in aes(), so ggplot overwrote the variable mapping with the constant 4. Assuming you want to use nn to scale line thickness as well, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lag value for the line thickness, I would suggest adding that as a new column in dt0.
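One way to sketch that idea is to count the transitions between consecutive time points with lead() and draw them with geom_segment(), so that the thickness reflects the number of ids making each move. This assumes simulated sequences like those in the question, tidyr for reshaping, and ggplot2 3.4.0 or later for the linewidth aesthetic. The object names wide, long, states, and transitions are made up for illustration:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(10)
# Simulated sequences: 1000 ids, 5 time points, 3 states (as in the question)
wide <- as.data.frame(replicate(
  5, sample(c("a", "b", "c"), 1000, replace = TRUE, prob = c(0.1, 0.2, 0.7))
))
wide$id <- 1:1000

# Long format with a numeric time index per id
long <- wide %>%
  pivot_longer(-id, names_to = "variable", values_to = "value") %>%
  group_by(id) %>%
  mutate(variable = row_number()) %>%
  ungroup() %>%
  arrange(id, variable)

# Number of ids in each state at each time point (point size)
states <- long %>% count(variable, value, name = "nn")

# Number of ids moving from one state to the next (line thickness)
transitions <- long %>%
  group_by(id) %>%
  mutate(next_variable = lead(variable), next_value = lead(value)) %>%
  ungroup() %>%
  filter(!is.na(next_value)) %>%
  count(variable, value, next_variable, next_value, name = "n_trans")

p <- ggplot() +
  geom_segment(data = transitions,
               aes(x = variable, y = value,
                   xend = next_variable, yend = next_value,
                   linewidth = n_trans),
               colour = "grey70", alpha = 0.6) +
  geom_point(data = states, aes(variable, value, colour = value, size = nn)) +
  scale_colour_brewer(palette = "Set1") +
  theme_minimal()
print(p)
```

Aggregating first means only a handful of segments are drawn (at most 9 per pair of time points) instead of 1000 overlapping lines.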

Resources