Problem description
I have thousands of lines (~4000) that I want to plot. However, it is infeasible to plot them all with geom_line() and, say, alpha = 0.1 to show where the density of lines is high and where it is low (a sketch of this naive approach follows the example data below). I came across something similar in Python, and especially the second plot in the answers there looks really nice, but I do not know whether something similar can be achieved with ggplot2. So, something like this:
An example dataset
It would make much more sense to demonstrate this with a dataset showing a pattern, but for now I have just generated random sine curves:
library(tidyverse)  # for %>% and bind_rows()

set.seed(1)
gen.dat <- function(key) {
  shift <- sample(seq(0.1, 1, by = 0.1), 1)  # random phase shift per curve
  time <- seq(shift * pi, length.out = 100)
  val <- sin(time)
  time <- 1:100                              # common x axis for all curves
  data.frame(time, val, key)
}
dat <- lapply(seq(1, 10000), gen.dat) %>% bind_rows()
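For reference, the naive approach ruled out above would look something like the sketch below; with thousands of keys it is slow and the overplotting is hard to read:

ggplot(dat, aes(time, val, group = key)) +
  geom_line(alpha = 0.1)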
Tried heatmap
I tried a heatmap as answered here; however, this heatmap does not consider the connection of points across the complete axis (as a line does) but rather shows the "heat" per time point.
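The linked answer is not reproduced here, but a per-time-point heatmap of that kind could be sketched with geom_bin2d(); this is an assumption about the linked approach, for illustration only:

ggplot(dat, aes(time, val)) +
  geom_bin2d(bins = 100)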
Question
How can we, in R with ggplot2, plot a heatmap of lines similar to the one shown in the first figure?
Looking closely, one can see that the graph to which you are linking consists of many, many, many points rather than lines.
The ggpointdensity package does a similar visualisation. Note that with so many data points there are considerable performance issues. I am using the developer version because it contains the method argument, which allows different smoothing estimators and apparently helps deal better with large numbers of points. There is a CRAN version too.
You can adjust the smoothing with the adjust argument.
I have increased the x-interval density of your code to make it look more like lines, though I have slightly reduced the number of 'lines' in the plot.
library(tidyverse)
#devtools::install_github("LKremer/ggpointdensity")
library(ggpointdensity)
set.seed(1)
gen.dat <- function(key) {
  shift <- sample(seq(0.1, 1, by = 0.1), 1)   # random phase shift per curve
  time <- seq(shift * pi, length.out = 500)
  val <- sin(time)
  time <- seq(0.02, 100, length.out = 500)    # 500 x positions, matching length(val)
  data.frame(time, val, key)
}
dat <- lapply(seq(1, 1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val)) +
  geom_pointdensity(size = 0.1, adjust = 10)
#> geom_pointdensity using method='kde2d' due to large number of points (>20k)
Created on 2020-03-19 by the reprex package (v0.3.0)
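As a usage note on the adjust argument (the value below is just an example): a smaller adjust gives a more local, less smoothed density estimate.

ggplot(dat, aes(time, val)) +
  geom_pointdensity(size = 0.1, adjust = 1)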
Update
Thanks to user Robert Gertenbach for creating some more interesting sample data. Here is the suggested use of ggpointdensity on that data:
library(tidyverse)
library(ggpointdensity)
gen.dat <- function(key) {
  has_offset <- runif(1) > 0.5
  time <- seq(1, 1000, length.out = 1000)
  val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
    rgamma(1, 20, 20)
  data.frame(time, val, key)
}
dat <- lapply(seq(1, 1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val, group = key)) +
  stat_pointdensity(geom = "line", size = 0.05, adjust = 10) +
  scale_color_gradientn(colors = c("blue", "yellow", "red"))
Created on 2020-03-24 by the reprex package (v0.3.0)
Your data will result in a rather uniform polka-dot density.
I generated some slightly more interesting data like this:
gen.dat <- function(key) {
  has_offset <- runif(1) > 0.5
  time <- seq(1, 1000, length.out = 1000)
  val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
    rgamma(1, 20, 20)
  data.frame(time, val, key)
}
dat <- lapply(seq(1, 1000), gen.dat) %>% bind_rows()
We then get a 2D density estimate. kde2d() doesn't have a predict function, so we model its output with a LOESS fit:
library(tidyr)  # for expand_grid()

dens <- MASS::kde2d(dat$time, dat$val, n = 400)
# flatten the density grid; expand_grid(y, x) matches the column-major layout of dens$z
dens_df <- data.frame(with(dens, expand_grid(y, x)), z = as.vector(dens$z))
fit <- loess(z ~ y * x, data = dens_df, span = 0.02)
dat$z <- predict(fit, with(dat, data.frame(x = time, y = val)))
Plotting it then gives this result:
ggplot(dat, aes(time, val, group = key, color = z)) +
  geom_line(size = 0.05) +
  theme_minimal() +
  scale_color_gradientn(colors = c("blue", "yellow", "red"))
This is all highly reliant on:
The number of series
The resolution of series
The density of kde2d
The span of loess
so your mileage may vary
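As a sketch of what such tuning might look like (the values below are purely illustrative, not recommendations):

# coarser density grid and a smoother LOESS fit
dens2 <- MASS::kde2d(dat$time, dat$val, n = 200)
dens2_df <- data.frame(with(dens2, expand_grid(y, x)), z = as.vector(dens2$z))
fit2 <- loess(z ~ y * x, data = dens2_df, span = 0.1)
dat$z2 <- predict(fit2, with(dat, data.frame(x = time, y = val)))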
I came up with the following solution using geom_segment(). However, I am not sure geom_segment() is the way to go, since it only counts pairwise values that are exactly the same, whereas in a heatmap (as in my question) values near each other should also contribute to the 'heat'.
# Range of the time axis, used to enumerate the segments
vals <- unique(dat$time)
min.val <- min(vals)
max.val <- max(vals)

# Get all possible line segments (pairs of consecutive time points)
comb.df <- data.frame(
  time1 = min.val:(max.val - 1),
  time2 = (min.val + 1):max.val
)

# Join the original data to all possible line segments
comb.df <- comb.df %>%
  left_join(dat %>% select(time1 = time, val1 = val, key)) %>%
  left_join(dat %>% select(time2 = time, val2 = val, key))

# Count how often each line segment occurs in the data
comb.df <- comb.df %>%
  group_by(time1, time2, val1, val2) %>%
  summarise(n = n_distinct(key))

# Plot the segments, coloured by frequency
ggplot(comb.df %>% arrange(n)) +
  geom_segment(aes(x = time1, y = val1, xend = time2, yend = val2, color = n), alpha = 0.9) +
  scale_colour_gradient(low = 'green', high = 'red') +
  theme_bw()
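One way to soften the exact-match limitation mentioned above (my own suggestion, not part of the original solution) is to bin val to a coarse grid before counting, so that nearby segments fall into the same bucket:

# round val before counting, so segments that are merely close also accumulate 'heat'
dat_binned <- dat %>% mutate(val = round(val, 1))
# ...then rebuild comb.df from dat_binned exactly as above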
Related
I am trying to add the trend line from an SMA (standardized major axis) fit to my ggplot. However, when I extract the coefficients from the SMA and pass them to geom_abline(), the line extends over the entire plot instead of clipping to the data. The natural solution would be to use geom_segment() instead, manually calculating the endpoints of the line. However, when I do this the lines do not match each other, and neither matches the SMA fit. What is going on here?
I am aware that you can use the plot function directly on an sma object, but I would prefer to use ggplot.
Note: this is my first time asking a question so my apologies if I'm missing something!
Edit: I am using a log-log axis, which I suspect may be part of the issue.
Reproducible version below:
library(tidyverse)
library(smatr) #for the SMA
# sample data set; seed fixed for reproducibility
set.seed(1)
x <- rlnorm(100, meanlog = 10)
var <- rlnorm(100, meanlog = 10)
df <- data.frame(x = x, y = x + var)
# fit using an SMA
sm <- sma(x ~ y, data = df, log = "xy")
# get sma coefficients into a data.frame
bb <- data.frame(coef(sm))
bb <- bb %>%
  rownames_to_column(var = "Coef") %>%
  pivot_wider(names_from = "Coef", values_from = "coef.sm.")
## calculate end coordinates for segment
bb$min_x <- min(df$x, na.rm = TRUE)
bb$max_x <- max(df$x, na.rm = TRUE)
bb <- bb %>%
  mutate(min_y = (slope * min_x) + elevation) %>%
  mutate(max_y = (slope * max_x) + elevation)
# plot into ggplot
p1 <- ggplot(df, aes(x = x, y = y)) +
  geom_point(shape = 21) +
  scale_y_continuous(trans = 'log10') +
  scale_x_continuous(trans = 'log10') +
  geom_abline(data = bb, aes(intercept = elevation, slope = slope), color = "blue")
p1 + geom_segment(data = bb, aes(x = min_x, xend = max_x, y = min_y, yend = max_y), color = "orange")
#this is the plot from the smatr package for comparison
plot(sm)
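A note on the likely cause (this is my reading, following the Edit's suspicion about the log-log axes): sma(..., log = "xy") fits on the log10 scale, so elevation and slope describe log10(y) = elevation + slope * log10(x). geom_abline() draws in the transformed panel coordinates, so those coefficients line up with it directly, whereas geom_segment() endpoints are given in data space and must be back-transformed. A sketch of the corrected endpoint calculation:

bb <- bb %>%
  mutate(min_y = 10^(elevation + slope * log10(min_x)),
         max_y = 10^(elevation + slope * log10(max_x)))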
I wanted to plot multiple lines in one graph, but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? I am new to RStudio and was assigned to pick up someone's work, so I have been doing a lot of trial and error, but I haven't been lucky for the past few days. I hope someone can help me with this! Thank you so much.
library(tidyverse)  # for %>%, filter() and tibble()

ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000) {
  # create some empty vectors for later use in the loop
  ecdf_med <- c()
  ecdf_obs <- c()
  for (i in 1:length(OUR_threshold)) {
    # filter on the OUR threshold, then select the capture column and create an ecdf function
    ecdf_fun <- HRP_rESS_no %>%
      filter(ESS > OUR_threshold[i]) %>%
      .$TSS_con %>%
      ecdf()
    # extract the ecdf data into a tibble, then linearly interpolate the curve
    ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
    ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
    # find the x positions that match the desired capture, look up the corresponding
    # probabilities in y, and take the median in case of multiple hits; store the result
    # at position i, as dictated by the loop counter
    ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x, 1) == des_cap)])
    # record the number of observations remaining after the filtering
    ecdf_obs[i] <- HRP_rESS_no %>%
      filter(ESS > OUR_threshold[i]) %>%
      .$TSS_con %>%
      length()
    # flush the ecdf data; the ecdf is encoded as a function with enclosed parameters,
    # so reset them every iteration to avoid pesky bugs
    rm(ecdf_data)
  }
  # create a tibble with all the loop data
  ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = ecdf_med * 100, nobs = ecdf_obs)
  return(ecdf_out)
}
ratio_threshold <- seq(0, 115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
  ecdf.shift(ratio_threshold, .) %>%
  filter(nobs > 2) %>%
  ggplot(aes(x = OUR_ratio_cutoff, y = prob)) +
  geom_line() +
  geom_point() +
  theme_bw(base_size = 12) +
  theme(panel.grid = element_blank()) +
  scale_y_continuous(limits = c(0, 100),
                     breaks = seq(0, 300, by = 5),
                     expand = c(0, 0)) +
  scale_x_continuous(limits = c(0, 120),
                     breaks = seq(0, 110, by = 10),
                     expand = c(0, 0)) +
  labs(x = "ESS mg TSS/L",
       y = "Probability of contactor MLSS > 400 mg TSS/L")
plot(t)
The easiest approach would be to loop over your different t values first, bind the resulting data frames into one big data frame, and use that for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no), so I have stripped the function down to its core: creating a data frame that produces different "lines" depending on your t value. I simply used t as a slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
  data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <- map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
  bind_rows(.id = "t")
ggplot(t_df, aes(x, y, color = t)) +
  geom_line() +
  geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)
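On the color sub-question: because .id = "t" stores t as a character column, the lines are colored discretely, and you can pin specific colors with scale_color_manual() (the palette below is an arbitrary example):

ggplot(t_df, aes(x, y, color = t)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c("1" = "red", "2" = "blue", "3" = "darkgreen",
                                "4" = "orange", "5" = "purple"))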
I'm currently finishing off my Master's project and need to include some graphics for the write-up. Without boring you too much, I have some data associated with AR(1) parameters ranging from 0.1 to 0.9 in 0.1 increments. As such, I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours; it will not be used).
I used this code.
ggplot(opt_lens_geom, aes(x = l_1024, fill = factor(rho))) +
  geom_histogram() +
  coord_flip() +
  facet_grid(. ~ rho, scales = "free_x")
I would also like to draw a trend line through the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like one graph, but I have had issues with the endpoints matching up, since each facet is drawn independently. Can anyone give me some advice on how to do this? I am not particularly partial to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the nine rhos would work to get started, like:
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100), l_1024 = rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
  mutate(rho_f = factor(rho)) %>%
  ggplot(aes(l_1024, rho_f)) +
  stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
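For example, the same call with scale = 1 added (a minimal variation of the code above):

opt_lens_geom %>%
  mutate(rho_f = factor(rho)) %>%
  ggplot(aes(l_1024, rho_f)) +
  stat_density_ridges(quantiles = 2, quantile_lines = TRUE, scale = 1)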
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
# plyr::summarise must be namespaced here, since only ggplot2 is loaded
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))
ggplot(data = df, aes(x = val, fill = factor(rho))) +
  geom_histogram() +
  coord_flip() +
  geom_vline(data = med, aes(xintercept = m), colour = 'black') +
  facet_wrap(~ factor(rho))
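Applied to the simulated opt_lens_geom data from the question, the same pattern would look like this (a sketch, reusing the question's rnorm data):

opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
                            l_1024 = rnorm(900))
med <- plyr::ddply(opt_lens_geom, "rho", plyr::summarise, m = median(l_1024))
ggplot(opt_lens_geom, aes(x = l_1024, fill = factor(rho))) +
  geom_histogram() +
  coord_flip() +
  geom_vline(data = med, aes(xintercept = m), colour = 'black') +
  facet_wrap(~ factor(rho))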
You could do a variant on this using geom_violin instead of histograms, although you wouldn't get labelled counts, just an idea of the relative density. An example with made-up data:
df <- data.frame(
  rho = rep(c(0.1, 0.2, 0.3), each = 50),
  val = sample(1:10, 150, replace = TRUE)
)
# shift the groups so the medians differ visibly
df$val <- df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
  geom_violin() +
  stat_summary(aes(group = 1), colour = "black",
               geom = "line", fun = "median")  # fun (formerly fun.y) picks the summary statistic
This produces a violin for each value of rho, and joins the medians for each violin.
In R, I would like to create a box plot that also shows all data points. There are numerous posts and websites where you can find this information, but they all seem to show the data points in ‘jitter’ or ‘random’ style. Here is example code using the ToothGrowth dataset with ggplot2 in R.
library(datasets)
data(ToothGrowth)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
library(ggplot2)
ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_boxplot(notch = TRUE) +
  geom_jitter(position = position_jitter(0.2))
However, I would like to have the data points ordered from the lowest at the lower-left to highest at the top-right. Please see example in this link:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3966983/figure/F1/ (freely accessible). Specifically, I refer to Figure 1a, top portion (‘Purity’).
Might anyone have suggestions? I would much appreciate it. Thank you.
I don't know if this is what you are after, but maybe you can take inspiration from the code below.
set.seed(1234)
n <- 20
x <- rnorm(n)
# single boxplot: overlay the sorted points on evenly spaced x coordinates
boxplot(x)
points(seq(0.75, 1.25, length.out = n), sort(x))

# several boxplots: repeat the trick for each group
df1 <- sapply(1:4, function(i) rnorm(n, mean = i))
df1 <- as.data.frame(df1)
df1 <- reshape2::melt(df1)
boxplot(value ~ variable, df1)
sp <- split(df1, df1$variable)
for (i in 1:4) {
  points(seq(i - 0.25, i + 0.25, length.out = n), sort(sp[[i]]$value))
}
Edit.
A ggplot2 solution uses a similar trick to define the points' x-axis coordinates. The only "strange" thing is relying on R's internal representation of factors as consecutive integers starting at 1. This must be seen as a hack, but a reliable one; I don't believe it will ever change.
library(tidyverse)  # loads ggplot2 and dplyr
df1 %>%
  group_by(variable) %>%
  arrange(value) %>%
  mutate(xcoord = seq(-0.25, 0.25, length.out = n())) %>%
  ggplot(aes(x = variable, y = value, group = variable)) +
  geom_boxplot() +
  geom_point(aes(x = xcoord + as.integer(variable)))
Here is a workable example of the data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overall frequency of z across x. Unlike a typical CDF, the function should not reach 1.0, but should instead plateau at
sum(df1$z)/length(df1$z)
a ymax of 0.36 (721/2000).
Using ggplot2, we can create a CDF of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But I want to extend this plot to show the cumulative percentage of z (as a function of x).
The end result should look like this:
EDIT
With some very crude data manipulation I am able to generate something similar to a CDF plot, but there must be a more elegant and easier method using ggplot and friends:
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
  summarise(z_num = sum(z)) %>%
  mutate(cum_perc_z = cumsum(z_num) / nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
  geom_step(colour = "red", size = 0.8) +
  geom_hline(yintercept = max(plot_data$cum_perc_z))
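If you also want the plateau labelled with its value (an optional embellishment, not part of the original answer), annotate() can add it:

ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
  geom_step(colour = "red", size = 0.8) +
  geom_hline(yintercept = max(plot_data$cum_perc_z), linetype = "dashed") +
  annotate("text", x = 0, y = max(plot_data$cum_perc_z), hjust = 0, vjust = -0.5,
           label = round(max(plot_data$cum_perc_z), 3))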