How to put labels on a plotted line at certain x points - r

I'm new to data science and would appreciate any help.
I need to put labels from a column (integers) onto the plotted line at certain x points, like x=20, then x=50, then x=80, or at least every 20-30 steps. I am using geom_text, but it puts a label on every point it draws, and I need labels only on certain points so that the plot is readable.
The code is:
ggp<-cut_offs %>%
ggplot(mapping=aes(x=IL6, y=RhoTSHT3_in_less))+
geom_line(color="blue")+
geom_point(col=ifelse(cut_offs$pvalTSHT3_in_less<0.05, "red", "black"))+
ylim(0.1,0.7)+
geom_text(aes(label=n_more))
So, I guess I need to change my last line of code to something like this:
geom_text(aes(label=ifelse(x in labels, cut_offs$n_more, "")))
where labels is a list of the x points where I want to put labels.
Currently every point gets a label, which makes the graph unreadable. Of course the line above does not work; how do I write this in R?

We don't have your actual data to demonstrate an answer, but I have constructed a very similar set with the same names, range and approximate shape as your own (see footnote).
Using this, we see that your code produces much the same set of problems:
library(tidyverse)
cut_offs %>%
ggplot(aes(IL6, RhoTSHT3_in_less)) +
geom_line(color = "blue")+
geom_point(col = ifelse(cut_offs$pvalTSHT3_in_less < 0.05, "red", "black"))+
ylim(0.1, 0.7) +
geom_text(aes(label = n_more))
To label, say, only every 25th measurement along the x axis, we can do:
cut_offs %>%
ggplot(aes(IL6, RhoTSHT3_in_less)) +
geom_line(color = "blue")+
geom_point(col = ifelse(cut_offs$pvalTSHT3_in_less < 0.05, "red", "black"))+
ylim(0.1, 0.7) +
geom_text(data = . %>% filter(row_number() %% 25 == 1), aes(label = n_more),
nudge_y = 0.05)
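If you want labels at specific x values (such as the 20, 50 and 80 mentioned in the question) rather than every nth row, one option is to pick the rows nearest to those values and pass them as the data for geom_text. This is only a sketch under assumptions: label_at and label_rows are hypothetical names, and the data is the same made-up stand-in as in the footnote.

```r
library(ggplot2)

# Made-up stand-in data with the same column names as the question
set.seed(1)
cut_offs <- data.frame(
  IL6 = seq(0, 500, length.out = 251),
  RhoTSHT3_in_less = c(seq(0.45, 0.22, length.out = 20) + rnorm(20, 0, 0.02),
                       runif(231, 0.2, 0.25)),
  n_more = sample(300, 251),
  pvalTSHT3_in_less = runif(251, 0, 0.2)
)

# Hypothetical x positions that should carry a label
label_at <- c(20, 50, 80)

# For each requested position, take the row whose IL6 is closest to it
nearest <- sapply(label_at, function(v) which.min(abs(cut_offs$IL6 - v)))
label_rows <- cut_offs[nearest, ]

p <- ggplot(cut_offs, aes(IL6, RhoTSHT3_in_less)) +
  geom_line(color = "blue") +
  geom_point(col = ifelse(cut_offs$pvalTSHT3_in_less < 0.05, "red", "black")) +
  ylim(0.1, 0.7) +
  # Only the selected rows are passed to the text layer
  geom_text(data = label_rows, aes(label = n_more), nudge_y = 0.05)
p
```

If the requested values are known to occur exactly in the x column, subsetting with IL6 %in% label_at is shorter; the nearest-row lookup is just more forgiving when they do not.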
Footnote - data used
set.seed(1)
cut_offs <- data.frame(IL6 = seq(0, 500, len = 251),
RhoTSHT3_in_less = c(seq(0.45, 0.22, len = 20) +
rnorm(20, 0, 0.02),
runif(231, .2, .25)),
n_more = sample(300, 251),
pvalTSHT3_in_less = runif(251, 0, 0.2))

Related

Annotate points of a geom_boxplot, that fulfill specified conditions?

Say I have a boxplot that I created with ggplot(), and this boxplot has points above the upper whisker and below the lower whisker. Suppose I want to annotate only a subset of those points, for example only points that correspond to variable values of 50 and above or 5 and below. How would I do that?
EDIT
For clarification: instead of merely pointing out that specific points are above or below a specified threshold, I meant annotating each point individually, i.e. labelling the points that are above and below the threshold with their respective values. So if a value like 70 is above the upper threshold of 50, I'd like the point to be annotated directly next to it with "70".
EDIT 2
Following the advice in the comments, I have encountered this problem:
As you can see, the coloured points, which are supposed to be identical to the points identified as outliers by the stat_summary() function, are in fact not identical. Some points even touch the whiskers.
The coloured points and the boxplots were produced like this:
# Function that enables individualizing boxplots
{
Individualized_Boxplot_Quantiles <- function(x) {
r <- quantile(x, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))
names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
r
}
Definition_of_Outliers = function(x)
{
subset(x,
quantile(x,0.99) < x | quantile(x,0.01) > x)
}
}
Data_Above_99th_Percentile = filter(Data,variable_of_interest > quantile(Data$variable_of_interest, probs = 0.99))
Data_Below_1st_Percentile = filter(Data,variable_of_interest < quantile(Data$variable_of_interest,probs = 0.01))
# creation of the individualized boxplots
stat_summary(fun.data = Individualized_Boxplot_Quantiles,
geom="boxplot",
lwd = 0.1) +
stat_summary(fun.y = Definition_of_Outliers,
geom="point",
size = 0.5) +
geom_point(data = Data_Above_99th_Percentile,
colour = "red",
size = 0.5) +
geom_point(data = Data_Below_1st_Percentile,
colour = "red",
size = 0.5)
I would overplot some points in a new geom_point layer using a distinct color by passing the appropriate subset of the data, then add text labels with the same subset.
set.seed(1)
df <- data.frame(x = 'Data', y = rnorm(1000, 26, 7))
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_boxplot() +
ylim(c(0, 60)) +
geom_point(data = subset(df, y > 50 | y < 5), color = 'red') +
geom_text(data = subset(df, y > 50 | y < 5), aes(label = round(y, 2)),
nudge_x = 0.08)
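The same subsetting idea can be combined with the custom 1%/99% whiskers from the question. The sketch below is an assumption-laden illustration, not the asker's actual setup: the two groups A and B and the names whisker_quantiles and outliers are made up. The point it shows is that the cut-offs are computed within each group, so the labelled points are selected with the same per-box quantiles that stat_summary() uses to draw the whiskers.

```r
library(ggplot2)
library(dplyr)

# Made-up data with two groups (hypothetical stand-in for the real data)
set.seed(1)
df <- data.frame(g = rep(c("A", "B"), each = 500),
                 y = c(rnorm(500, 26, 7), rnorm(500, 30, 5)))

# Whiskers at the 1st and 99th percentiles, box at the quartiles
whisker_quantiles <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.25, 0.5, 0.75, 0.99), names = FALSE)
  data.frame(ymin = q[1], lower = q[2], middle = q[3], upper = q[4], ymax = q[5])
}

# Points outside the whiskers, with cut-offs computed separately per group
outliers <- df %>%
  group_by(g) %>%
  filter(y < quantile(y, 0.01) | y > quantile(y, 0.99)) %>%
  ungroup()

p <- ggplot(df, aes(g, y)) +
  stat_summary(fun.data = whisker_quantiles, geom = "boxplot") +
  geom_point(data = outliers, colour = "red", size = 0.5) +
  # Label each outlying point with its own value
  geom_text(data = outliers, aes(label = round(y, 1)), nudge_x = 0.1, size = 3)
p
```

If the coloured points were instead selected with quantiles of the whole data set, they would generally not line up with whiskers computed per box.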

Layering violin plots with geom_violin to compare distributions

I am trying to compare the distributions of a continuous variable across groups using violin plots. Pretty easy. However, I would like to make comparisons across distributions easier by showing the distribution for one of the groups (the reference) in grey with a low alpha value in the background. Something like this but with a violin plot:
My current approach plots the data twice. For the first geom_violin, I duplicate the data for the reference group and plot it in grey. For the second geom_violin, I use the actual data d. In this example, the two violin plots in grey and blue should look the same for the group "blue". However, they are NOT the same even though they are based on exactly the same data for group "blue".
How can I resolve this problem? Or is there another better approach to do this?
library(tidyverse)
d <- tibble(
group = sample(c("green", "blue"), 1000, replace = TRUE, prob = c(0.7, 0.3)),
x = ifelse(group == "green", rnorm(1000, 1, 1), rnorm(1000, 0, 3))
)
dblue <- filter(d, group == "blue")
dblue <- bind_rows(dblue, mutate(dblue, group = "green"))
ggplot(d, aes(x = factor(group), y = x)) +
geom_violin(data = dblue, fill = alpha("#333333", 0.2), color = alpha("#333333", 0)) +
geom_violin(fill = alpha("#0072B2", 0.8), color = alpha("#0072B2", 0))
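Add scale = "width" to the second geom_violin. The default, scale = "area", gives all violins in a layer the same area, so how wide the blue violin is drawn depends on what else is in that layer; scale = "width" gives every violin the same maximum width: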
ggplot(d, aes(x = factor(group), y = x)) +
geom_violin(data = dblue, fill = alpha("#333333", 0.2), color = alpha("#333333", 0)) +
geom_violin(fill = alpha("#0072B2", 0.8), color = alpha("#0072B2", 0),
scale = "width")
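One way to check that the two layers now agree is to compare the computed violin widths with layer_data(). This is a rough sketch using made-up data built with base R in place of the tibble above; ref, act, max_ref and max_act are hypothetical names.

```r
library(ggplot2)

# Made-up data shaped like the question's example
set.seed(42)
group <- sample(c("green", "blue"), 1000, replace = TRUE, prob = c(0.7, 0.3))
d <- data.frame(group = group,
                x = ifelse(group == "green", rnorm(1000, 1, 1), rnorm(1000, 0, 3)))

# Reference data: the blue group duplicated under both group labels
dblue <- d[d$group == "blue", ]
dblue <- rbind(dblue, transform(dblue, group = "green"))

p <- ggplot(d, aes(x = factor(group), y = x)) +
  geom_violin(data = dblue, fill = alpha("#333333", 0.2), color = NA) +
  geom_violin(fill = alpha("#0072B2", 0.8), color = NA, scale = "width")

# Computed data for the grey (layer 1) and blue (layer 2) violins
ref <- layer_data(p, 1)
act <- layer_data(p, 2)

# "blue" is the first factor level, so it sits at x = 1 in both layers
max_ref <- max(ref$violinwidth[ref$x == 1])
max_act <- max(act$violinwidth[act$x == 1])
c(reference = max_ref, actual = max_act)
```

If the two maxima match, the grey and blue violins for the "blue" group should overlap.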

Divide the y axis so that the part with a score < 25 occupies most of the plot in ggplot

I want to divide the y axis of the attached figure so that the part with a score < 25 occupies most of the figure, while the remaining scores take up a small upper part.
From searching, I gather that I should use scale_y_discrete(limits. I tried p<- p+scale_y_continuous(breaks = 1:20, labels = c(1:20,"//",40:100)) but it doesn't work yet.
I used the attached data, and this is my code:
p<-ggscatter(data, x = "Year" , y = "Score" ,
color = "grey", shape = 21, size = 3, # Points color, shape and size
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
add = "loess", #reg.line
conf.int = T,
cor.coef = F, cor.method = "pearson",
xlab = "Year" , ylab= "Score")
p<-p+ coord_cartesian(xlim = c(1980, 2020));p
Here is as close as I could get to a fake axis break with a resized upper area of the plot. I still think it's a bad idea, and if this were my plot I'd much prefer a more straightforward axis transform.
First, we'd need a function that generates a transform that squeezes all values above some threshold:
library(ggplot2)
library(scales)
# Define new transform
my_transform <- function(threshold = 25, squeeze_factor = 10) {
force(threshold)
force(squeeze_factor)
my_transform <- trans_new(
name = "trans_squeeze",
transform = function(x) {
ifelse(x > threshold,
((x - threshold) * (1 / squeeze_factor)) + threshold,
x)
},
inverse = function(x) {
ifelse(x > threshold,
((x - threshold) * squeeze_factor) + threshold,
x)
}
)
return(my_transform)
}
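Before plugging such a transform into a scale, it can help to check that the forward and inverse functions undo each other. The sketch below restates the transform compactly so that it runs on its own; squeeze_trans and the scores vector are hypothetical and chosen only for illustration.

```r
library(scales)

# Compact restatement of the squeeze transform defined above
squeeze_trans <- function(threshold = 25, squeeze_factor = 10) {
  force(threshold)
  force(squeeze_factor)
  trans_new(
    name = "trans_squeeze",
    # Values above the threshold are compressed by squeeze_factor
    transform = function(x) {
      ifelse(x > threshold, (x - threshold) / squeeze_factor + threshold, x)
    },
    # The inverse stretches them back out
    inverse = function(x) {
      ifelse(x > threshold, (x - threshold) * squeeze_factor + threshold, x)
    }
  )
}

tr <- squeeze_trans(25, 10)
scores <- c(0, 10, 25, 30, 75, 100)   # made-up values
squeezed <- tr$transform(scores)      # 0, 10, 25, 25.5, 30, 32.5
restored <- tr$inverse(squeezed)      # should reproduce scores
all.equal(restored, scores)
```

With a threshold of 25 and a factor of 10, a score of 75 lands at 30 on the transformed axis, which is why the region above 25 takes up so little space.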
Next we apply that transformation to the y-axis and add a fake axis break. I've used vanilla ggplot2 code as I find the ggscatter() approach confusing.
ggplot(data, aes(Year, Score)) +
geom_point(color = "grey", shape = 21, size = 3) +
geom_smooth(method = "loess", fill = "lightgray") +
# Add fake axis lines
annotate("segment", x = -Inf, xend = -Inf,
y = c(-Inf, Inf), yend = c(24.5, 25.5)) +
# Apply transform to y-axis
scale_y_continuous(trans = my_transform(25, 10),
breaks = seq(0, 80, by = 10)) +
scale_x_continuous(limits = c(1980, 2020), oob = oob_keep) +
theme_classic() +
# Turn real y-axis line off
theme(axis.line.y = element_blank())
You might find it informative to read Hadley Wickham's view on discontinuous axes. People sometimes mock weird y-axes.

Advice on how to plot side-by-side histograms with a line graph going through them in ggplot2

I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like it was one graph, but I have had issues with the endpoints matching up, since each facet is a separate panel. Can anyone give me some advice on how to do this? I am not particularly attached to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))  # plyr is not attached, so qualify summarise too
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun = "median")  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
This produces a violin for each value of rho, and joins the medians for each violin.

Split dataframe and Create multipanel scatterplots from list of data frames

I have a dataframe like so:
set.seed(453)
year= as.factor(c(rep("1998", 20), rep("1999", 16)))
lepsp= c(letters[seq(from = 1, to = 20 )], c('a','b','c'),letters[seq(from =8, to = 20 )])
freq= c(sample(1:15, 20, replace=T), sample(1:18, 16,replace=T))
df<-data.frame(year, lepsp, freq)
df<-
df %>%
group_by(year) %>%
mutate(rank = dense_rank(-freq))
Frequencies freq of each lepsp within each year are ranked in the rank column. Larger freq values correspond to the smallest rank value and smaller freq values have the largest rank values. Some rankings are repeated if levels of lepsp have the same abundance.
I would like to split the df into multiple subsets by year. Then I would like to plot each subsetted dataframe in a multipanel figure. Essentially this is to create species abundance curves. The x-axis would be rank and the y-axis needs to be freq.
In my real dataframe I have 22 years of data. I would prefer the graphs to be displayed as 2 columns of 4 rows for a total of 8 graphs per page. Essentially I would have to repeat the solution offered here 3 times.
I also need to demarcate the 25%, 50% and 75% quartiles with vertical lines to look like this (desired result):
It would be great if each graph specified the year to which it belonged, but since all axes have the same names, I do not want the x and y labels to be repeated for each graph.
I have tried to plot multiple lines on the same graph but it gets messy.
year.vec<-unique(df$year)
plot(sort(df$freq[df$year==year.vec[1]],
decreasing=TRUE),bg=1,type="b", ylab="Abundance", xlab="Rank",
pch=21, ylim=c(0, max(df$freq)))
for (i in 2:22){
points(sort(df$freq[df$year==year.vec[i]], decreasing=TRUE), bg=i,
type="b", pch=21)
}
legend("topright", legend=year.vec, pt.bg=1:22, pch=21)
I have also tried a loop, however it does not produce an output and is missing some of the arguments I would like to include:
jpeg('pract.jpg')
par(mfrow = c(6, 4)) # 4 rows and 2 columns
for (i in unique(levels(year))) {
plot(df$rank,df$freq, type="p", main = i)
}
dev.off()
Update
(Attempted result)
I found the following code after my post which gets me a little closer, but is still missing all the features I would like:
library(reshape2)
library(ggplot2)
library (ggthemes)
x <- ggplot(data = df2, aes(x = rank, y = rabun)) +
geom_point(aes(fill = "dodgerblue4")) +
theme_few() +
ylab("Abundance") + xlab("Rank") +
theme(axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
plot.title = element_blank(), # we don't want individual plot titles as the facet "strip" will give us this
legend.position = "none", # we don't want a legend either
panel.border = element_rect(fill = NA, color = "darkgrey", size = 1.25, linetype = "solid"),
axis.ticks = element_line(colour = 'darkgrey', size = 1.25, linetype = 'solid')) # here, I just alter to colour and thickness of the plot outline and tick marks. You generally have to do this when faceting, as well as alter the text sizes (= element_text() in theme also)
x
x <- x + facet_wrap( ~ year, ncol = 4)
x
I prefer base R to modify graph features, and have not been able to find a method using base R that meets all my criteria above. Any help is appreciated.
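Here's a ggplot approach (it assumes dplyr and tidyr are loaded). First off, I made some more data to get the 3x2 layout. Since year is a factor in your example, it has to be made numeric before anything can be added to it:
df = ungroup(df) %>% mutate(year = as.numeric(as.character(year)))
df = rbind(df, mutate(df, year = year + 4), mutate(df, year = year + 8))
Then we do a little manipulation to generate the quantiles and labels by group: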
df_summ =
df %>% group_by(year) %>%
do(as.data.frame(t(quantile(.$rank, probs = c(0, 0.25, 0.5, 0.75)))))
names(df_summ)[2:5] = paste0("q", 0:3)
df_summ_long = gather(df_summ, key = "q", value = "value", -year) %>%
inner_join(data.frame(q = paste0("q", 0:3), lab = c("Common", "Rare-75% -->", "Rare-50% -->", "Rare-25% -->"), stringsAsFactors = FALSE))
With the data in good shape, plotting is fairly simple:
library(ggthemes)
library(ggplot2)
ggplot(df, aes(x = rank, y = freq)) +
geom_point() +
theme_few() +
labs(y = "Abundance (% of total)", x = "Rank") +
geom_vline(data = df_summ_long[df_summ_long$q != "q0", ], aes(xintercept = value), linetype = 4, size = 0.2) +
geom_text(data = df_summ_long, aes(x = value, y = Inf, label = lab), size = 3, vjust = 1.2, hjust = 0) +
facet_wrap(~ year, ncol = 2)
There's some work left to do - mostly in the rarity text overlapping. It might not be such an issue with your actual data, but if it is you could pull the max y values into df_summ_long and stagger them a little bit, actually using y coordinates instead of just Inf to get it at the top like I did.
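For the 22 years mentioned in the question, with 8 panels (4 rows by 2 columns) per page, one possible approach is to split the years into chunks of 8 and print one faceted plot per chunk into a multi-page PDF. This is only a sketch on made-up data; the file name, the page size and the names pages and out_file are assumptions.

```r
library(ggplot2)

# Made-up data: 22 years with 15 species each
set.seed(453)
df <- data.frame(year = rep(1998:2019, each = 15),
                 freq = sample(1:18, 22 * 15, replace = TRUE))

# Dense rank within each year: the largest freq gets rank 1
df$rank <- ave(-df$freq, df$year, FUN = function(v) match(v, sort(unique(v))))

# Split the years into pages of at most 8 panels
years <- sort(unique(df$year))
pages <- split(years, ceiling(seq_along(years) / 8))

# Write one faceted plot per page (hypothetical output location)
out_file <- file.path(tempdir(), "abundance_curves.pdf")
pdf(out_file, width = 7, height = 10)
for (yrs in pages) {
  p <- ggplot(df[df$year %in% yrs, ], aes(rank, freq)) +
    geom_point() +
    facet_wrap(~ year, ncol = 2, nrow = 4) +  # the strip shows the year
    labs(x = "Rank", y = "Abundance")         # one shared pair of axis titles
  print(p)
}
dev.off()
```

The last page holds only the remaining 6 years, so two panel positions stay empty. The quantile lines and labels from the answer above could be added to p in the same loop.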
