I am trying to avoid plotting lines beyond the first and last zero to avoid this overlap. Please note that this is just a toy sample of a much bigger data set, and that simply filtering out the 0s does not work in this case.
library(dplyr)
library(ggplot2)
dta <-
data.frame(grp = c(rep("a",10), rep("b",10),rep("c",10)),
lines = c(rep(seq(1,10,1),3)),
vc = c(c(0,0,0,0,.3,.3,.1, 0,0,0),
c(.1,.3,.3,.3,.1, 0,0,0,0,0),
c(0,0,0,0,0, 0,0,0,0,0)))
dta %>%
ggplot(aes(lines, vc, color = grp))+
geom_line()+
scale_x_continuous(
breaks = seq(0, 10, 1)
)+
scale_y_continuous(
limits = c(-0.01, 1),
breaks = seq(0, 1, 0.1)
)
Any ideas on how to remove these lines, please? For example, the blue line should stop at x = 6.
If I set the 0s to NA, the lines do not go down to the x-axis.
dta %>%
mutate(vc = ifelse(vc==0, NA, vc)) %>%
ggplot(aes(lines, vc, color = grp))+
geom_line()+
scale_x_continuous(
breaks = seq(0, 10, 1)
) +
scale_y_continuous(
limits = c(-0.01, 1),
breaks = seq(0, 1, 0.1)
)
I need the blue line to go down to the x-axis and then stop. This goes for all other lines.
Here is a working solution with tidyverse:
library(dplyr)
library(ggplot2)
dta %>%
group_by(grp) %>%
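# a zero becomes NA only when both of its neighbours are zero as well;
# default = 0 treats positions outside the series as zero
mutate(across(-lines,
~ ifelse(lag(., default = 0) == 0 & . == 0 & lead(., default = 0) == 0, NA, .))) %>%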
ggplot(aes(lines, vc, color = grp)) +
geom_line()
Produces this plot:
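To see what the lag()/lead() condition does, here is a minimal base R sketch on the vc values of group "a". The names prev_vc, next_vc and masked are made up for illustration, and padding the ends with 0 is an assumption about how the edges should be treated.

```r
# Toy vector taken from group "a" in the question.
vc <- c(0, 0, 0, 0, .3, .3, .1, 0, 0, 0)

# Base R stand-ins for lag()/lead(), padding the ends with 0
# (assumption: values outside the series count as zero).
prev_vc <- c(0, head(vc, -1))
next_vc <- c(tail(vc, -1), 0)

# A zero is dropped only when both neighbours are zero as well,
# so the zeros touching the nonzero run are kept.
masked <- ifelse(prev_vc == 0 & vc == 0 & next_vc == 0, NA, vc)
print(masked)
```

Only positions 4 to 8 stay non-missing, so the line drops to the x-axis on both sides and then stops.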
This solution is somewhat verbose but, I believe, does what you need. It can be applied to a grouped data frame. For each group, given a column name as input, it trims away the rows at the beginning and end where that column is equal to zero, but, importantly, it retains one zero at the beginning and one at the end.
function definition
The function uses tidy evaluation for the name of the column by which to trim the data frame. The statements using which() find the runs of zeroes at the beginning and end, if present, and retain the last zero before the nonzero entries and the first one after them.
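trim_zero <- function(data, column) {
x0 <- pull(data, {{ column }}) == 0
# assumption: a group made up entirely of zeroes has nothing to plot
if (all(x0)) return(data[0, ])
# last zero before the first nonzero entry (or row 1 if there is none)
beginning_0 <- max(which(x0)[which(x0) < min(which(!x0))], 1)
# first zero after the last nonzero entry (or the last row if there is none)
ending_0 <- min(which(x0)[which(x0) > max(which(!x0))], length(x0))
data[beginning_0:ending_0, ]
}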
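The index logic can be checked without a data frame. Below is a small base R sketch; keep_range is a hypothetical helper, not part of the answer, that returns the row positions the trimming should keep. Returning nothing for an all-zero series is an assumption.

```r
# Hypothetical helper: positions to keep for one group's values.
keep_range <- function(x) {
  nz <- which(x != 0)
  # assumption: an all-zero series keeps no rows at all
  if (length(nz) == 0) return(integer(0))
  first <- max(min(nz) - 1, 1)          # one zero before the nonzero run, if any
  last  <- min(max(nz) + 1, length(x))  # one zero after the nonzero run, if any
  first:last
}

range_a <- keep_range(c(0, 0, 0, 0, .3, .3, .1, 0, 0, 0))   # group "a"
range_b <- keep_range(c(.1, .3, .3, .3, .1, 0, 0, 0, 0, 0)) # group "b"
range_c <- keep_range(rep(0, 10))                           # group "c"
print(range_a)
print(range_b)
```

For group "a" this keeps rows 4 to 8, and for group "b" rows 1 to 6, which matches the requirement that the line for "b" stops at x = 6.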
applying the function to your data
require(dplyr)
require(ggplot2)
dta_trimmed <- dta %>%
group_by(grp) %>%
group_modify(~ trim_zero(., vc))
ggplot(dta_trimmed, aes(lines, vc, color = grp))+
geom_line()+
scale_x_continuous(
breaks = seq(0, 10, 1)
)+
scale_y_continuous(
limits = c(-0.01, 1),
breaks = seq(0, 1, 0.1)
)
library(tidyverse)
dta <-
data.frame(grp = c(rep("a",10), rep("b",10)),
lines = c(rep(seq(1,10,1),2)),
vc = c(c(0,0,0,0,.3,.3,.1, 0,0,0),
c(.1,.3,.3,.3,.1, 0,0,0,0,0)))
dta %>%
filter(vc > 0) %>%
ggplot(aes(lines, vc, color = grp))+
geom_line()
Created on 2021-06-05 by the reprex package (v2.0.0)
Related
I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.
The example function below e.g. returns a vector of three values (the mean and x sds below or above the mean). I can now plot these values in geom_vline(xintercept) and see them in my graph.
#example function
sds_around_the_mean <- function(x, multiplier = 1) {
mean <- mean(x, na.rm = TRUE)
sd <- sd(x, na.rm = TRUE)
tibble(low = mean - multiplier * sd,
mean = mean,
high = mean + multiplier * sd) %>%
pivot_longer(cols = everything()) %>%
pull(value)
}
Reproducible data
#data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))
df <- bind_rows(lst(normal, outliers), .id = "type")
df %>%
ggplot(aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
linetype = "dashed")
The problem is that, as you can see, I would have to specify df$data in several places.
This becomes more error-prone when I apply any change to the original df that I pipe into ggplot, e.g. filtering out outliers before plotting. I would have to apply the same changes again at multiple places.
E.g.
df %>% filter(type == "normal")
#also requires
df$data
#to be changed to
df$data[df$type == "normal"]
#in geom_vline to obtain the correct input values for the xintercept.
So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.
You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol both in the ggplot command and when calculating the sds_around_the_mean. This will make it dynamic.
df %>%
{ggplot(data = ., aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
linetype = "dashed")}
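A minimal sketch of the same mechanism outside ggplot2, assuming the magrittr pipe (the %>% that dplyr re-exports): inside curly braces the piped value is not inserted as the first argument, and every . refers to it.

```r
library(magrittr)

# Inside { }, the left-hand side is available as `.` wherever it is needed.
result <- c(1, 2, 3) %>% { sum(.) / length(.) }
print(result)
```

Any filter() placed before the braces therefore changes every . at once, which is what keeps the histogram and the vertical lines in sync.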
I'm trying to make a plot, and show different colors when p > 0.5, but when I use the color aes, the line appears to be disconnected.
library(tidyverse)
data <- tibble(n = 1:365)
prob <- function (x) {
pr <- 1
for (t in seq_len(x)) {  # seq_len() avoids 2:1 counting down when x = 1
pr <- pr * ((365 - t + 1) / 365)
}
return(1 - pr)
}
data %>%
mutate(prob = map_dbl(n, prob)) %>%
filter(n < 100) %>%
ggplot(aes(x = n, y = prob, color = prob > 0.5)) + geom_line() +
scale_x_continuous(breaks = seq(0,100,10))
Does anyone know why? Removing the color aes() gives a single, unbroken line.
This happens because mapping color to prob > 0.5 splits your data into two groups, and geom_line() draws a separate line for each group. Since prob is only evaluated at integer values of n, there is a gap between the two: the first group has max(prob) = .476 and the second has min(prob) = .507. The (vertical) gap in the line plot is the gap between these two numbers.
You can see this if you filter the modified data for values close to .5:
data %>%
mutate(prob = map_dbl(n, prob)) %>%
filter(n < 100) %>%
filter(between(prob, .4, .6))
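The two values on either side of the threshold can also be checked without the loop. This is a short sketch using cumprod(), assuming the same 365-day formula as the prob() function in the question; the name p is only for illustration.

```r
# Probability of at least one shared birthday among n people, n = 1..99.
n <- 1:99
p <- 1 - cumprod((365 - n + 1) / 365)

# The last value below 0.5 and the first value above it.
print(round(p[22:23], 3))
```

The jump from n = 22 to n = 23 crosses 0.5, so those two neighbouring points end up in different color groups.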
If we modify your example:
data2 <- data %>%
mutate(prob = map_dbl(n, prob)) %>%
filter(n < 100)
#bringing extremes closer together
data2$prob[22] <- .49999999999999
data2$prob[23] <- .50000000000001
data2 %>%
ggplot(aes(x = n, y = prob, color = prob >= 0.5)) + geom_line() +
scale_x_continuous(breaks = seq(0,100,10))
The gap becomes significantly smaller:
However, it is still present (mostly horizontally), because the x variable is also discrete: no segment is drawn between n = 22 and n = 23, which belong to different groups.
A simple way of fixing this is to add the dummy aesthetic group = 1 inside aes(), which overrides the default grouping by the color variable.
data %>%
mutate(prob = map_dbl(n, prob)) %>%
filter(n < 100) %>%
#add 'group = 1' below
ggplot(aes(x = n, y = prob, color = prob >= 0.5, group = 1)) + geom_line() +
scale_x_continuous(breaks = seq(0,100,10))
Here is code to give context to my question:
set.seed(1); tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+geom_text(
aes(label = n, y = n + 0.05),
position = position_dodge(1),
vjust = 0)
I want ALL of the levels of the variable x to be displayed on the x-axis (LETTERS[1:7]). For each Level with n>0, I want the value to display atop the bar for that level. For each level with n==0, I want the value label to NOT be displayed. Currently, the plot displays the 0 for 'empty' factor levels c("C","F"), and I want to suppress the display of '0's for those levels, but still display "C", and "F" on the x-axis.
I hope someone might be able to help me.
Thanks.
A simple ifelse() will do it. You can enter any text you like, for example ifelse(n > 0, n, "No Data").
library( tidyr)
library( ggplot2)
library( dplyr )
set.seed(1); tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+
geom_text(
aes(label = ifelse( n>0, n , ""), y = n + 0.05),
position = position_dodge(1),
vjust = 0)
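One thing worth knowing about this approach: mixing numbers and "" in ifelse() gives a character vector, which is fine for labels. A minimal sketch with made-up counts:

```r
# Made-up counts for three factor levels; the middle one is empty.
n <- c(2, 0, 1)

# Numbers are coerced to character because the "no" value is a string.
labels <- ifelse(n > 0, n, "")
print(labels)
```

The empty string is still passed to geom_text(), it just draws nothing visible.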
You can pass a function to the data argument of geom_text(). For this example, you can take a subset of the piped data (referred to as .x):
set.seed(1);
tibble(x=factor(sample(LETTERS[1:7],7,replace = T),levels = LETTERS[1:7])) %>% group_by_all() %>% count(x,.drop = F) %>%
ggplot(mapping = aes(x=x,y=n))+geom_bar(stat="identity")+
geom_text(data=~subset(.x,n>0),
aes(label = n, y = n + 0.05),
position = position_dodge(1),
vjust = 0)
I have a data set similar to the one below where I have a lot of data for certain groups and then only single observations for other groups. I would like my single observations to show up as points but the other groups with multiple observations to show up as lines (no points). My code is below:
EDIT: I'm attempting to find a way to do this without using multiple datasets in the geom_* calls because of the issues it causes with the legend. There was an answer that has since been deleted that was able to handle the legend but didn't get rid of the points on the lines. I would potentially like a single legend with points only showing up if they are a single observation.
library(tidyverse)
dat <- tibble(x = runif(10, 0, 5),
y = runif(10, 0, 20),
group = c(rep("Group1", 4),
rep("Group2", 4),
"Single Point 1",
"Single Point 2")
)
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point() +
geom_line()
Created on 2019-04-02 by the reprex package (v0.2.1)
Only plot the data with 1 point in geom_point() and the data with >1 point in geom_line(). These can be precalculated in mutate().
dat = dat %>%
group_by(group) %>%
mutate(n = n() )
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) )
Having the legend match this is trickier. This is the sort of thing that the override.aes argument in guide_legend() can be useful for.
In your case I would separately calculate the number of observations in each group first, since that is what the line vs point is based on.
sumdat = dat %>%
group_by(group) %>%
summarise(n = n() )
The result is in the same order as the factor levels in the legend, which is why this works.
Now we need to remove lines and keep points whenever the group has only a single observation. A linetype of 0 stands for a blank line and a shape of NA stands for no shape. I use an ifelse() statement for linetype and shape in override.aes, based on the number of observations per group.
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) ) +
guides(color = guide_legend(override.aes = list(linetype = ifelse(sumdat$n == 1, 0, 1),
shape = ifelse(sumdat$n == 1, 19, NA) ) ) )
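To make the override.aes values concrete, here is a base R sketch of the two vectors for the example groups. It uses table() instead of summarise() purely to stay self-contained; the names n_per_group, linetype and shape are illustrative.

```r
# Group labels as in the example data.
group <- c(rep("Group1", 4), rep("Group2", 4),
           "Single Point 1", "Single Point 2")

# Observations per group, in sorted order of the group names.
n_per_group <- as.vector(table(group))

# One legend key per group: line only for multi-point groups, point only for singletons.
linetype <- ifelse(n_per_group == 1, 0, 1)
shape    <- ifelse(n_per_group == 1, 19, NA)
print(linetype)
print(shape)
```

Both vectors have one entry per legend key, so their order has to match the order of the keys in the legend.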
I'm trying to label the outliers in a geom_boxplot using ggrepel::geom_label_repel. It works nicely when there's only one grouping variable, but when I try it for multiple grouping variables I run into a problem. The position argument in ggrepel doesn't seem to work very consistently for some reason, see this example:
library(tidyverse)
library(ggrepel)
set.seed(1337)
df <- tibble(x = rnorm(500),
g1 = factor(sample(c('A','B'), 500, replace = TRUE)),
g2 = factor(sample(c('A','B'), 500, replace = TRUE)),
rownames = 1:500)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
df_outliers <- df %>% group_by(g1, g2) %>% mutate(outlier=is_outlier(x))
ggplot(df_outliers, aes(x=g1, y=x, fill=g2)) +
geom_boxplot(width=0.3, position = position_dodge(0.5)) +
ggrepel::geom_label_repel(data=. %>% filter(outlier),
aes(label=rownames), position = position_dodge(0.8))
Is there a way to make the labels point to the accompanying dots using ggrepel?
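For reference, the 1.5 × IQR rule in is_outlier() can be checked on a small made-up vector. The function is repeated here only so the snippet runs on its own.

```r
# Same rule as in the question: beyond 1.5 * IQR from the quartiles.
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Quartiles are 2 and 4, so the fences are -1 and 7.
flags <- is_outlier(c(1, 2, 3, 4, 100))
print(flags)
```

Only the value 100 falls outside the fences. Because df_outliers applies the function after group_by(g1, g2), the fences are computed separately for each box.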
You can try this:
ggplot(df_outliers,
aes(x=g1, y=x, fill=g2, label=rownames)) +
geom_boxplot(width = 0.3, position = position_dodge(0.5)) +
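geom_label_repel(data = . %>%
filter(outlier) %>%
# convert to character first so that "" is a valid fill value
mutate(rownames = as.character(rownames)) %>%
group_by(g1) %>%
complete(g2, fill = list(x = 0, rownames = "")),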
position = position_dodge(0.5),
box.padding = 1,
min.segment.length = 0,
show.legend = FALSE)
Explanations:
The data source for geom_label_repel() follows aosmith's suggestion to add the missing B-A combination, filling in 0 for x (any number would do, as long as it's not the default NA) and "" for rownames (ggrepel won't plot empty labels, but will take them into account when dodging).
box.padding is set to 1 (increased from the default 0.25) to push the labels further away, so that the line segments are more visible.
min.segment.length is set to 0 (decreased from the default 0.5) to force line segments to be plotted, no matter how short they are.
(show.legend = FALSE is optional. I just don't like seeing "a" letter show up in the legend.)