created a nested cdf that doesn't reach 1 - r

Here is some workable example of data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overal fequency of z across x.
unlike a typical cdf, the function should not reach 1.0, but instead
sum(df1$z)/length(df1$z)
a ymax of 0.36 (721/2000).
using ggplot2 we can create a cdf of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But i want to extend this plot to show the cumulative percentage of z (as a function of 'x')
The end result should like like
EDIT
with some very poor data manipulation I am able to generate the something similiar to a cdf plot, but there must be a more beautiful and easy method using various packages and ggplot
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)

You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))

Related

Is there a programatic way to pass specific ranges for the y-axis on a ggplot2 plot?

I've got plots that are being generated automatically based on some user inputs. Most of the time, the plots work fine. However, some users have requested to ensure that there is always an axis label on each end of the plotted data. For example, this plot:
sample_data <-
data.frame(
x = rep(LETTERS[1:3], each = 3)
, y = 1:9 + 0.5
)
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
)
Has no label above the top point or below the bottom point. I can add them easily enough with expand_limits:
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
expand_limits(y = c(2, 10))
However, because these plots are being automatically generated, I cannot manually add the next axis point each time. I've tried passing only.loose = TRUE to labeling:extended, but that still doesn't change the displayed values (any more than entering the values that I want would):
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(breaks = breaks_extended(only.loose = TRUE))
In addition, some of the plots are more complex than this (e.g., with or without confidence intervals, additional grouping, etc.), and the data is prepared for the plot using dplyr and piped directly into ggplot (with %>%). So, even something like recalculating the values is non-trivial.
In fact, even in this case, it fails because adding the expanded points to capture the next set of labels changes the labeling.
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(breaks = breaks_extended(n = 5
, only.loose = TRUE)) +
expand_limits(y =
sample_data %>%
group_by(x) %>%
summarise(my_mean = mean(y)) %>%
pull(my_mean) %>%
range() %>%
{labeling::extended(.[1], .[2], 5
, only.loose = TRUE)}
)
It appears that this happens because
labeling::extended(2.5, 8.5, 5, only.loose = TRUE)
returns the range 2 to 9 by 1's, while:
labeling::extended(2, 9, 5, only.loose = TRUE)
returns the range 2 to 10 by 2's. Somehow, breaks_extended is throwing in some added variation, though whether I track it down or not doesn't change much. I could work around this by calculating the breaks first, but (again) this is for a fairly complicated set of plots.
I feel like I am missing some sort of obvious point, but it keeps eluding me.
Yes there is a programmatic way to set the limits on y-scales and that is to provide a function to the limits argument. It is given the natural data limits as input that you can then edit programmatically. The same goes for breaks, except the input are the limits.
Example below, how this code should look exactly is up to your specifications.
library(ggplot2)
sample_data <- data.frame(
x = rep(LETTERS[1:3], each = 3),
y = 1:9 + 0.5
)
ggplot(sample_data,
aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(
limits = function(x) {
lower <- floor(x[1])
lower <- ifelse(x[1] - lower < 0.5, lower - 1, lower)
upper <- ceiling(x[2])
upper <- ifelse(upper - x[2] <= 0.5, upper + 1, upper)
c(lower, upper)
},
breaks = function(x) {
scales::breaks_pretty()(x)
}
)
#> Warning: Removed 3 rows containing missing values (geom_segment).
Created on 2021-03-23 by the reprex package (v1.0.0)
Inspired by teunbrand, I built a function that generates the limits, then checks to ensure that the expansion (including the 5% buffer) does not change the output of pretty
my_lims_expand <- function(x){
prev_pass <-
range(pretty(x))
curr_pass <-
pretty(c(prev_pass[1] - 0.05 * diff(prev_pass)
, prev_pass[2] + 0.05 * diff(prev_pass)))
last_under <-
tail(which(curr_pass < min(x)), 1)
first_over <-
head(which(curr_pass > max(x)), 1)
out <-
range(curr_pass[last_under:first_over])
confirm_out <-
range(pretty(out))
while(!all(out == confirm_out)){
prev_pass <- curr_pass
curr_pass <-
pretty(c(prev_pass[1] - 0.05 * diff(prev_pass)
, prev_pass[2] + 0.05 * diff(prev_pass)))
last_under <-
tail(which(curr_pass < min(x)), 1)
first_over <-
head(which(curr_pass > max(x)), 1)
out <-
range(curr_pass[last_under:first_over])
confirm_out <-
range(pretty(out))
}
return(out)
}
Then, I can use that function for limits:
ggplot(sample_data,
aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(
limits = my_lims_expand
, breaks = pretty
)
to generate the desired plot:

trouble adding SMA fit line to ggplot - geom_segment() nor geom_abline() don't match

I am trying to add the trendline from an SMA (standardized major axis) fit to my ggplot. However, when I extract the coefficients from the SMA and give them to geom_abline() the line extends over the entire plot instead of clipping to the data. The natural solution to this would be use a geom_segment() instead, manually calculating the endpoints of the line. However, when I do this the lines don't match each other and neither match the SMA fit. What's going on here?
I am aware that you can use the plot function directly on an sma object but I would prefer to use ggplot
Note: this is my first time asking a question so my apologies if I'm missing something!
Edit: I am using a log-log axis, which I suspect may be part of the issue.
Reproducible version below:
library(tidyverse)
library(smatr) #for the SMA
# sample data set
x <- rlnorm(100, meanlog = 10)
var <- rlnorm(100, meanlog = 10)
df <- data.frame(x=x, y=x+var)
# fit using an SMA
sm <- sma(x~y, data = df, log = "xy")
# get sma coefficients into a data.frame
bb <- data.frame(coef(sm))
bb <- bb %>%
rownames_to_column(var = "Coef") %>%
pivot_wider(names_from = "Coef", values_from = "coef.sm.")
## calculate end coordinates for segment
bb$min_x <- min(df$x, na.rm = TRUE)
bb$max_x <- max(df$x, na.rm = TRUE)
bb <- bb %>%
mutate(min_y = (slope*min_x) + elevation) %>%
mutate(max_y = (slope*max_x) + elevation)
# plot into ggplot
p1 <- ggplot(df, aes(x=x, y=y)) +
geom_point(shape=21) +
scale_y_continuous(trans = 'log10')+
scale_x_continuous(trans = 'log10') +
geom_abline(data=bb,aes(intercept=elevation,slope=slope), color = "blue")
p1 + geom_segment(data=bb, aes(x=min_x, xend=max_x, y=min_y, yend=max_y), color = "orange")
#this is the plot from the smatr package for comparison
plot(sm)

Is there a way I could plot t = 300, 350, 450, and 500 lines in one graph?

enter image description hereI wanted to plot multiple lines in one graph but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? Just new to Rstudio and was assigned to pick up someones work so I've been doing a lot of trial and error but I haven't been lucky for the past few days. Hope someone could help me with this! Thank you so much
ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000){
#create some empty vectors for later use in the loop
ecdf_med = c()
ecdf_obs = c()
for (i in 1:length(OUR_threshold)){
# filter out the OUR threshold data, then select only the capture column and create a ecdf function
ecdf_fun <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
ecdf()
# extract the ecdf data and put in tibble dataframe, then create a linear interpolation of the curve.
ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
# find the vector numbers in x which correspond with the desired capture. Then find correlate the vectornumbers with probability numbers in the y vectors. Take the median value in case multiple hits. Put this number in a vector with designed vectornumber as ditacted by the loopnumber i.
ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x,1) == des_cap)])
# calculate the number of observations when the filtering takes place.
ecdf_obs[i] <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
length()
# Flush the ecdf data. The ecdf is encoded as a function with global paramaters, so you want to reset them everytime the loop is done to avoid pesky bugs to appear.
rm(ecdf_data)
}
#create a tibble dataframe with all the loop data.
ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = (ecdf_med)*100, nobs = ecdf_obs)
return(ecdf_out)
}
ratio_threshold <- seq(0,115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
ecdf.shift(ratio_threshold, .) %>%
filter(nobs > 2) %>%
ggplot(aes( x = OUR_ratio_cutoff, y = prob)) +
geom_line() +
geom_point() +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,300, by = 5),
expand = c(0,0)) +
scale_x_continuous(limits = c(0,120),
breaks = seq(0,110, by = 10),
expand = c(0,0)) +
labs(x = "ESS mg TSS/L",
y = "Probability of contactor MLSS > 400 mg TSS/L ")
plot(t)
Easiest would be to loop over your different t values first and bring the resulting data frames into one big data frame, and use this for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no). So I have stripped down the function to the core - creating a data frame which makes different "lines" depending on your t value. I just used it as slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <-
map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
bind_rows(, .id = "t")
ggplot(t_df, aes(x, y, color = t)) +
geom_line() +
geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)

Line density heatmap in R

Problem description
I have thousands of lines (~4000) that I want to plot. However it is infeasible to plot all lines using geom_line() and just use for example alpha=0.1 to illustrate where there is a high density of lines and where not. I came across something similar in Python, especially the second plot of the answers looks really nice, but I do not now if something similar can be achieved in ggplot2. Thus something like this:
An example dataset
It would make much more sense to demonstrate this with a set showing a pattern, but for now I just generated random sinus curves:
set.seed(1)
gen.dat <- function(key) {
c <- sample(seq(0.1,1, by = 0.1), 1)
time <- seq(c*pi,length.out=100)
val <- sin(time)
time = 1:100
data.frame(time,val,key)
}
dat <- lapply(seq(1,10000), gen.dat) %>% bind_rows()
Tried heatmap
I tried a heatmap like answered here, however this heatmap will not consider the connection of points over the complete axis (like in a line) but rather show the "heat" per time point.
Question
How can we in R, using ggplot2 plot a heatmap of lines simmilar to that shown in the first figure?
Looking closely, one can see that the graph to which you are linking consists of many, many, many points rather than lines.
The ggpointdensity package does a similar visualisation. Note with so many data points, there are quite some performance issues. I am using the developer version, because it contains the method argument which allows to use different smoothing estimators and apparently helps deal better with larger numbers. There is a CRAN version too.
You can adjust the smoothing with the adjust argument.
I have increased the x interval density of your code, to make it look more like lines. Have slightly reduced the number of 'lines' in the plot though.
library(tidyverse)
#devtools::install_github("LKremer/ggpointdensity")
library(ggpointdensity)
set.seed(1)
gen.dat <- function(key) {
c <- sample(seq(0.1,1, by = 0.1), 1)
time <- seq(c*pi,length.out=500)
val <- sin(time)
time = seq(0.02,100,0.1)
data.frame(time,val,key)
}
dat <- lapply(seq(1, 1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val)) +
geom_pointdensity(size = 0.1, adjust = 10)
#> geom_pointdensity using method='kde2d' due to large number of points (>20k)
Created on 2020-03-19 by the reprex package (v0.3.0)
update
Thanks user Robert Gertenbach for creating some more interesting sample data. Here the suggested use of ggpointdensity on this data:
library(tidyverse)
library(ggpointdensity)
gen.dat <- function(key) {
has_offset <- runif(1) > 0.5
time <- seq(1, 1000, length.out = 1000)
val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
rgamma(1, 20, 20)
data.frame(time,val,key)
}
dat <- lapply(seq(1,1000), gen.dat) %>% bind_rows()
ggplot(dat, aes(time, val, group=key)) +stat_pointdensity(geom = "line", size = 0.05, adjust = 10) + scale_color_gradientn(colors = c("blue", "yellow", "red"))
Created on 2020-03-24 by the reprex package (v0.3.0)
Your data will result in a quite uniform polkadot density.
I generated some slightly more interesting data like this:
gen.dat <- function(key) {
has_offset <- runif(1) > 0.5
time <- seq(1, 1000, length.out = 1000)
val <- sin(time / 100 + rnorm(1, sd = 0.2) + (has_offset * 1.5)) *
rgamma(1, 20, 20)
data.frame(time,val,key)
}
dat <- lapply(seq(1,1000), gen.dat) %>% bind_rows()
We then get a 2d density estimate. kde2d doesn't have a predict function so we model it with a LOESS
dens <- MASS::kde2d(dat$time, dat$val, n = 400)
dens_df <- data.frame(with(dens, expand_grid( y, x)), z = as.vector(dens$z))
fit <- loess(z ~ y * x, data = dens_df, span = 0.02)
dat$z <- predict(fit, with(dat, data.frame(x=time, y=val)))
Plotting it then gets this result:
ggplot(dat, aes(time, val, group = key, color = z)) +
geom_line(size = 0.05) +
theme_minimal() +
scale_color_gradientn(colors = c("blue", "yellow", "red"))
This is all highly reliant on:
The number of series
The resolution of series
The density of kde2d
The span of loess
so your mileage may vary
I came up with the following solution, using geom_segment(), however I'm not sure if geom_segment() is the way to go as it then only checks if pairwise values are exactly the same whereas in a heatmap (as in my question) values near each other also affect the 'heat' rather than being exactly the same.
# Simple stats to get all possible line segments
vals <- unique(dat$time)
min.val = min(vals)
max.val = max(vals)
# Get all possible line segments
comb.df <- data.frame(
time1 = min.val:(max.val - 1),
time2 = (min.val + 1): max.val
)
# Join the original data to all possible line segments
comb.df <- comb.df %>%
left_join(dat %>% select(time1 = time, val1 = val, key )) %>%
left_join(dat %>% select(time2 = time, val2 = val, key ))
# Count how often each line segment occurs in the data
comb.df <- comb.df %>%
group_by(time1, time2, val1, val2) %>%
summarise(n = n_distinct(key))
# ggplot2 to plot segments
ggplot(comb.df %>% arrange(n)) +
geom_segment(aes(x = time1, y = val1, xend = time2, yend = val2, color = n), alpha =0.9) +
scale_colour_gradient( low = 'green', high = 'red') +
theme_bw()

How to produce a meaningful draftsman/correlation plot for discrete values

One of my favorite tools for exploratory analysis is pairs(), however in the case of a limited number of discrete values, it falls flat as the dots all align perfectly. Consider the following:
y <- t(rmultinom(n=1000,size=4,prob=rep(.25,4)))
pairs(y)
It doesn't really give a good sense of correlation. Is there an alternative plot style that would?
If you change y to a data.frame you can add some 'jitter' and with the col option you can set the transparency level (the 4th number in rgb):
y <- data.frame(y)
pairs(sapply(y,jitter), col = rgb(0,0,0,.2))
Or you could use ggplot2's plotmatrix:
library(ggplot2)
plotmatrix(y) + geom_jitter(alpha = .2)
Edit: Since plotmatrix in ggplot2 is deprecated use ggpairs (GGally package mentioned in #hadley's comment above)
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
Here is an example using corrplot:
M <- cor(y)
corrplot.mixed(M)
You can find more examples in the intro
http://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Here are a couple of options using ggplot2:
library(ggplot2)
## re-arrange data (copied from plotmatrix function)
prep.plot <- function(data) {
grid <- expand.grid(x = 1:ncol(data), y = 1:ncol(data))
grid <- subset(grid, x != y)
all <- do.call("rbind", lapply(1:nrow(grid), function(i) {
xcol <- grid[i, "x"]
ycol <- grid[i, "y"]
data.frame(xvar = names(data)[ycol], yvar = names(data)[xcol],
x = data[, xcol], y = data[, ycol], data)
}))
all$xvar <- factor(all$xvar, levels = names(data))
all$yvar <- factor(all$yvar, levels = names(data))
return(all)
}
dat <- prep.plot(data.frame(y))
## plot with transparent jittered points
ggplot(dat, aes(x = x, y=y)) +
geom_jitter(alpha=.125) +
facet_grid(xvar ~ yvar) +
theme_bw()
## plot with color representing density
ggplot(dat, aes(x = factor(x), y=factor(y))) +
geom_bin2d() +
facet_grid(xvar ~ yvar) +
theme_bw()
I don't have enough credits yet to comment on #Vincent 's post - when doing
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
I get
Error in stop_if_params_exist(obj$params) :
'params' is a deprecated argument. Please 'wrap' the function to supply arguments. help("wrap", package = "GGally")
So it seems, based on the indicated help page, that it would need to be in this case here:
ydf <- as.data.frame(y)
regularPlot <- ggpairs(ydf, lower = list(continuous = wrap(ggally_points, alpha = .2, position = "jitter")))
regularPlot

Resources