Colour points for x-axis plot R - r

I want to give the points relating to the first 130 x-axis values a different colour than the rest (up to 250). So basically, divide the points vertically with two different colours. Is this possible and how would you go about it?

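Welcome to SO!
I would use ggplot2 and map the colour to a logical condition.
Here are some examples (they split on mpg; for your case, put the condition on the x variable instead):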
library(ggplot2)
ggplot(mtcars,aes(hp,mpg,color = mpg < 20)) +
geom_point()
ggplot(mtcars,aes(hp,mpg,color = mpg < 20)) +
geom_point() +
theme(legend.position = 'none')
ggplot(mtcars,aes(hp,mpg,color = mpg < 20)) +
geom_point() +
labs(color = 'mpg less than 20')
ggplot(mtcars,aes(hp,mpg,color = mpg < 20)) +
geom_point() +
scale_color_manual(values = c('purple4','springgreen4'))
Good luck!

You can use row_number() to rank the x values and give the first 130 ranks a different colour.
library(ggplot2)
library(dplyr)
data(mpg)
mpg %>%
mutate(colour=row_number(displ)<=130) %>%
ggplot(aes(x=displ, y=cty, col=colour)) +
geom_point(show.legend=FALSE) + theme_bw()
It seems there is a tie at about 3.5.

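Using base R, you can try the following (it colours the first 130 rows, so sort the data by the x variable first if needed):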
plot(iris$Sepal.Length, iris$Sepal.Width, col = rep(1:2, times = c(130, nrow(iris)-130)))
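If the split should depend on the x value itself rather than on the row position, a condition on x can be turned into a colour vector. The following is a minimal base R sketch, not taken from any of the answers above; the data frame df, its columns and the colour names are made up for illustration.

```r
# Hypothetical data: x runs from 1 to 250, y is random noise
set.seed(42)
df <- data.frame(x = 1:250, y = rnorm(250))

# Colour by a condition on the x value itself, not on the row position
point_col <- ifelse(df$x <= 130, "steelblue", "tomato")

plot(df$x, df$y, col = point_col, pch = 19, xlab = "x", ylab = "y")
abline(v = 130.5, lty = 2)  # optional: mark where the colours change
```

The same logical condition (x <= 130) can be passed to the color aesthetic in the ggplot2 examples above.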

Related

How to add percentages on top of a histogram when data is grouped

This is not my data (for confidentiality reasons), but I have tried to create a reproducible example using a dataset included in the ggplot2 library. I have a histogram summarizing the value of some variable by group (a factor with 2 levels). First, I did not want the counts but proportions of the total, so I used this code:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>% as.data.frame() %>% filter(cut=="Premium" | cut=="Ideal")
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="Count") +
theme_bw() + theme(legend.position="none")
It gave me this as a result.
The issue is that I would like to print the numeric percentages on top of the bins and haven't found a way to do so.
As I saw it done for printing counts elsewhere, I attempted to print them using stat_bin(), including the same y and label values as the y in geom_histogram, thinking it would print the right numbers:
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
stat_bin(aes(y=after_stat(width*density),label=after_stat(width*density*100)),geom="text",vjust=-.5) +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="%") +
theme_bw() + theme(legend.position="none")
However, it prints far more values than there are bins, the values do not appear consistent with the bar heights, and they are not positioned according to vjust=-.5, which should make them appear slightly above the bars.
What am I missing here? I know that if there was no grouping variable/facet_wrap, I could use after_stat(count/sum(count)) instead of after_stat(width*density) and it seems that it would have fixed my issue. But I need the histograms for both groups to appear next to each other. Thanks in advance!
You have to use the same arguments in stat_bin() as in geom_histogram() when adding your labels, so that both layers use the same binning and the labels align with the bars:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>%
as.data.frame() %>%
filter(cut == "Premium" | cut == "Ideal")
ggplot(df_example, aes(x = z, fill = cut)) +
geom_histogram(aes(y = after_stat(width * density)),
binwidth = 1, center = 0.5, col = "black"
) +
stat_bin(
aes(
y = after_stat(width * density),
label = scales::number(after_stat(width * density), scale = 100, accuracy = 1)
),
geom = "text", binwidth = 1, center = 0.5, vjust = -.25
) +
facet_wrap(~cut) +
scale_x_continuous(breaks = seq(0, 9, by = 1)) +
scale_y_continuous(labels = scales::number_format(scale = 100)) +
scale_fill_manual(values = c("#CC79A7", "#009E73")) +
labs(x = "Depth (mm)", y = "%") +
theme_bw() +
theme(legend.position = "none")
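The reason width * density gives a per-group proportion is that a density is defined as count / (n * width), so multiplying by the bin width leaves count / n. The sketch below checks that identity with base R's hist() on a few made-up values; it illustrates the arithmetic only and is not how stat_bin() is implemented.

```r
# Hypothetical values for a single group
z <- c(0.5, 1.2, 1.7, 2.3, 2.8, 2.9)

# Bins of width 1: (0,1], (1,2], (2,3]
h <- hist(z, breaks = 0:3, plot = FALSE)

# Proportion per bin, computed two ways
prop_from_density <- h$density * diff(h$breaks)  # width * density
prop_from_counts  <- h$counts / sum(h$counts)    # count / n

print(round(100 * prop_from_density))
```

Because the histogram is computed separately for each facet, the proportions sum to 1 within each group.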

Move ggrepel / geom_text_repel's labels away from lines drawn with geom_vline() and geom_hline()

ggrepel provides an excellent series of functions for annotating ggplot2 graphs and the examples page contains lots of nice hints of how to expand its functionality, including moving the labels generated away from both the axes of the plot, other labels, and so on.
However, one thing that isn't covered is moving the labels away from manually drawn lines with geom_hline() and geom_vline(), as may occur, for example, in making an annotated volcano plot.
Here's a simple MWE to highlight the problem:
library("tidyverse")
library("ggrepel")
dat <- subset(mtcars, wt > 2.75 & wt < 3.45)
dat$car <- rownames(dat)
ggplot(dat, aes(wt, mpg, label = car)) +
geom_point(color = "red") +
geom_text_repel(seed = 1) + #Seed for reproducibility
geom_vline(xintercept = 3.216) + #Deliberately chosen "bad" numbers
geom_hline(yintercept = 19.64) + theme_bw()
This produces the following output:
Note how the lines overlap the text of the labels and obscure it (is that "Horret 4 Drive" or "Hornet 4 Drive"?)
Jiggling the points about a bit post facto you can make a far nicer fit – I have simply shifted some of the labels a tiny bit to get them off the line.
Is it possible to get ggrepel to do this automatically? I know the example given isn't totally stable (other seeds give acceptable results) but for complex plots with a large number of points it definitely is a problem.
Edit: If you're curious, a far less "minimum" working example would be the below (taken from bioconductor):
download.file("https://raw.githubusercontent.com/biocorecrg/CRG_RIntroduction/master/de_df_for_volcano.rds", "de_df_for_volcano.rds", method="curl")
tmp <- readRDS("de_df_for_volcano.rds")
de <- tmp[complete.cases(tmp), ]
de$diffexpressed <- "NO"
# if log2Foldchange > 0.6 and pvalue < 0.05, set as "UP"
de$diffexpressed[de$log2FoldChange > 0.6 & de$pvalue < 0.05] <- "UP"
# if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN"
de$diffexpressed[de$log2FoldChange < -0.6 & de$pvalue < 0.05] <- "DOWN"
# Create a new column "delabel" to de, that will contain the name of genes differentially expressed (NA in case they are not)
de$delabel <- NA
de$delabel[de$diffexpressed != "NO"] <- de$gene_symbol[de$diffexpressed != "NO"]
#Actually do plot
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=delabel)) +
geom_point() +
theme_minimal() +
geom_text_repel() +
scale_color_manual(values=c("blue", "black", "red")) +
geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red")
This produces the below, where the text-overlapping-lines problem is quite obvious:
I don't think there's a built-in way to do this.
A non-elegant hack off the top of my head is to add invisible points along the intercept lines which the labels will then repel away from.
dat <- subset(mtcars, wt > 2.75 & wt < 3.45)
dat$car <- rownames(dat)
xintercept = 3.216
yintercept = 19.64
dat %>%
mutate(alpha = 1) %>%
bind_rows(.,
tibble(wt = seq(from = min(.$wt), to = max(.$wt), length.out = 20), mpg = yintercept, car = '', alpha = 0),
tibble(wt = xintercept, mpg = seq(from = min(.$mpg), to = max(.$mpg), length.out = 20), car = '', alpha = 0)
) %>%
ggplot(aes(wt, mpg, label = car, alpha = alpha)) +
geom_point(color = "red") +
geom_text_repel(seed = 1) + #Seed for reproducibility
geom_vline(xintercept = xintercept) +
geom_hline(yintercept = yintercept) + theme_bw() +
scale_alpha_identity()
One (admittedly unorthodox) solution would be to plot "invisible" text along the intercept lines and thus trick geom_text_repel into staying away from them. The complication is that you have to add several filler rows to your data set and then modify the plot to render the filler as invisible. But the end result is pretty clean:
dat2 <- bind_rows(
data.frame(wt = seq(min(dat$wt), max(dat$wt), length = 20), mpg = 19.64, car = 'O'),
data.frame(mpg = seq(min(dat$mpg), max(dat$mpg), length = 20), wt = 3.216, car = 'O'),
dat
)
ggplot(dat2, aes(wt, mpg, label = car)) +
geom_point(data = filter(dat2, car != 'O'), color = "red") +
geom_text_repel(aes(color = car == 'O'), seed = 1, show.legend = F) + #Seed for reproducibility
geom_vline(xintercept = 3.216) + #Deliberately chosen "bad" numbers
geom_hline(yintercept = 19.64) +
scale_color_manual(values = c('black', 'transparent')) +
theme_bw()
I'm not sure if there is any function that allows ggrepel to do this automatically. One way to hack around this is to create multiple subsets of the data and add a nudge to the labels. Here I used the volcano plot as an example.
library(ggplot2)
library(ggrepel)
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=delabel)) +
geom_point() +
theme_minimal() +
geom_text_repel(data = subset(de, log2FoldChange < -0.6),
nudge_x = -0.05) +
geom_text_repel(data = subset(de, log2FoldChange > 0.6),
nudge_x = 0.08) +
scale_color_manual(values=c("blue", "black", "red")) +
geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red")
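The filler-point idea from the first two answers can be factored into a small helper. This is a sketch under assumptions: line_anchors() is a hypothetical name, and it relies on ggrepel's documented behaviour that points with an empty-string label are not labelled but still repel the other labels.

```r
# Hypothetical helper: evenly spaced points along one vertical and one
# horizontal line, to be used as unlabelled obstacles for the repel layer.
line_anchors <- function(x_range, y_range, xintercept, yintercept, n = 20) {
  rbind(
    data.frame(x = seq(x_range[1], x_range[2], length.out = n), y = yintercept),
    data.frame(x = xintercept, y = seq(y_range[1], y_range[2], length.out = n))
  )
}

dat <- subset(mtcars, wt > 2.75 & wt < 3.45)
dat$car <- rownames(dat)

anchors <- line_anchors(range(dat$wt), range(dat$mpg),
                        xintercept = 3.216, yintercept = 19.64)

# Filler rows carry an empty label, so nothing is drawn for them
filler   <- data.frame(wt = anchors$x, mpg = anchors$y, car = "")
labelled <- rbind(dat[, c("wt", "mpg", "car")], filler)
```

One way to use it is to keep geom_point() on dat and pass labelled only to the text layer, e.g. geom_text_repel(data = labelled, seed = 1), so the filler points are never drawn and no alpha or colour trick is needed.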

How to change the limits from scale_y_continuous depending on the plot in R?

I want to draw boxplots with the number of observations on top. The problem is that depending on the information and the outliers, the y-axis changes. For that reason, I want to change the limits of scale_y_continuous automatically. Is it possible to do this?
This is a reproducible example:
library(dplyr)
library(ggplot2)
myFreqs <- mtcars %>%
group_by(cyl, am) %>%
summarise(Freq = n())
myFreqs
p <- ggplot(mtcars, aes(factor(cyl), drat, fill=factor(am))) +
stat_boxplot(geom = "errorbar") +
geom_boxplot() +
stat_summary(geom = 'text', label = paste("n = ", myFreqs$Freq), fun = max, position = position_dodge(width = 0.77), vjust=-1)
p
The idea is to add at least 1 to the maximum y value of the tallest boxplot (in the example above, the second boxplot, with n = 8), so that the labels fit.
I have tried to change the y-axis with scale_y_continuous like this:
p <- p + scale_y_continuous(limits = c(0, 5.3))
p
However, I don't want to put the limits myself, I want to find a way to modify the limits according to the plots that I have. (Because... what if the information changes?).
Is there a way to do something like this? With min and max --> scale_y_continuous(limits = c(min(x), max(x)))
Thanks very much in advance
Thanks to @teunbrand and @caldwellst I got the solution that I needed.
There are 3 solutions that work perfectly:
1-
p + scale_y_continuous(limits = function(x){
c(min(x), (max(x)+0.1))
})
2-
library(tidyverse)
p + scale_y_continuous(limits = ~ c(min(.x), max(.x) + 0.1))
3-
p + scale_y_continuous(limits = function(x){
c(min(x), ceiling(max(x) * 1.1))
})
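The three variants share one pattern: scale_y_continuous() accepts a function for limits, which receives the automatic limits and returns new ones. As a sketch, the padding can be made relative to the data range so it adapts when the data changes; pad_top() and the example range below are hypothetical.

```r
# Hypothetical helper: returns a limits function that keeps the lower limit
# and extends the upper limit by a fraction of the automatic range.
pad_top <- function(frac = 0.1) {
  function(x) c(min(x), max(x) + frac * diff(range(x)))
}

# Stand-alone check with a made-up automatic range
lims <- pad_top(0.1)(c(2.76, 4.93))
print(lims)
```

With the plot from the question this would be used as p + scale_y_continuous(limits = pad_top(0.1)). An alternative that leaves the limits alone is to widen the expansion instead, e.g. scale_y_continuous(expand = expansion(mult = c(0.05, 0.15))).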

How To Center Axes in ggplot2

In the following plot, which is a simple scatter plot + theme_apa(), I would like both axes to pass through 0.
I tried some of the solutions proposed in the answers to similar questions, but none of them worked.
A MWE to reproduce the plot:
library(papaja)
library(ggplot2)
library(MASS)
plot_two_factor <- function(factor_sol, groups) {
the_df <- as.data.frame(factor_sol)
the_df$groups <- groups
p1 <- ggplot(data = the_df, aes(x = MR1, y = MR2, color = groups)) +
geom_point() + theme_apa()
}
set.seed(131340)
n <- 30
group1 <- mvrnorm(n, mu=c(0,0.6), Sigma = diag(c(0.01,0.01)))
group2 <- mvrnorm(n, mu=c(0.6,0), Sigma = diag(c(0.01,0.01)))
factor_sol <- rbind(group1, group2)
colnames(factor_sol) <- c("MR1", "MR2")
groups <- as.factor(rep(c(1,2), each = n))
print(plot_two_factor(factor_sol, groups))
The papaja package can be installed via
devtools::install_github("crsh/papaja")
What you request cannot be achieved in ggplot2, and for a good reason: if you include axis and tick labels within the plotting area, they will sooner or later overlap with points or lines representing data. I used the answers by @phiggins and @Job Nmadu as a starting point. I changed the order of the geoms to make sure the "data" are plotted on top of the axes. I changed the theme to theme_minimal() so that axes are not drawn outside the plotting area. I modified the offsets used for the data to better demonstrate how the code works.
library(ggplot2)
library(magrittr) # provides the %>% pipe used below
iris %>%
ggplot(aes(Sepal.Length - 5, Sepal.Width - 2, col = Species)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_point() +
theme_minimal()
This gets as close as possible to answering the question using ggplot2.
Using package 'ggpmisc' we can slightly simplify the code.
library(ggpmisc)
iris %>%
ggplot(aes(Sepal.Length - 5, Sepal.Width - 2, col = Species)) +
geom_quadrant_lines(linetype = "solid") +
geom_point() +
theme_minimal()
This code produces exactly the same plot as shown above.
If you want to always have the origin centered, i.e., symmetrical plus and minus limits in the plots irrespective of the data range, then package 'ggpmisc' provides a simple solution with function symmetric_limits(). This is how quadrant plots for gene expression and similar bidirectional responses are usually drawn.
iris %>%
ggplot(aes(Sepal.Length - 5, Sepal.Width - 2, col = Species)) +
geom_quadrant_lines(linetype = "solid") +
geom_point() +
scale_x_continuous(limits = symmetric_limits) +
scale_y_continuous(limits = symmetric_limits) +
theme_minimal()
The grid can be removed from the plotting area by adding + theme(panel.grid = element_blank()) after theme_minimal() to any of the three examples.
Loading 'ggpmisc' just for function symmetric_limits() is overkill, so here I show its definition, which is extremely simple:
symmetric_limits <- function (x)
{
max <- max(abs(x))
c(-max, max)
}
For the record, the following also works as above.
iris %>%
ggplot(aes(Sepal.Length-6.2, Sepal.Width-3.2, col = Species)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0)
Setting xlim and ylim should work.
library(tidyverse)
# default
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, col = Species)) +
geom_point()
# setting xlim and ylim
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, col = Species)) +
geom_point() +
xlim(c(0,8)) +
ylim(c(0,4.5))
Created on 2020-06-12 by the reprex package (v0.3.0)
While the question is not very clear, PoGibas seems to think that this is what the OP wanted.
library(tidyverse)
# centred data with explicit symmetric limits
iris %>%
ggplot(aes(Sepal.Length-6.2, Sepal.Width-3.2, col = Species)) +
geom_point() +
xlim(c(-2.5,2.5)) +
ylim(c(-1.5,1.5)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0)
Created on 2020-06-12 by the reprex package (v0.3.0)

Plotting a bar chart with years grouped together

I am using the fivethirtyeight bechdel dataset, located here https://github.com/rudeboybert/fivethirtyeight, and am attempting to recreate the first plot shown in the article here https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/. I am having trouble getting the years to group together similarly to how they did in the article.
This is the current code I have:
ggplot(data = bechdel, aes(year)) +
geom_histogram(aes(fill = clean_test), binwidth = 5, position = "fill") +
scale_fill_manual(breaks = c("ok", "dubious", "men", "notalk", "nowomen"),
values=c("red", "salmon", "lightpink", "dodgerblue",
"blue")) +
theme_fivethirtyeight()
I see where you were going with the histogram geom, but this really looks more like a categorical bar chart. Once you take that approach it gets easier, apart from a bit of ugly code to get the correct labels on the year columns.
The bars are stacked in the wrong order on this one, and there needs to be some formatting applied to look like the 538 chart, but I'll leave that for you.
library(fivethirtyeight)
library(tidyverse)
library(ggthemes)
library(scales)
# Create date range column
bechdel_summary <- bechdel %>%
mutate(date.range = ((year %/% 10)* 10) + ((year %% 10) %/% 5 * 5)) %>%
mutate(date.range = paste0(date.range," - '",substr(date.range + 5,3,5)))
ggplot(data = bechdel_summary, aes(x = date.range, fill = clean_test)) +
geom_bar(position = "fill", width = 0.95) +
scale_y_continuous(labels = percent) +
theme_fivethirtyeight()
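The five-year grouping itself is plain integer arithmetic: (year %/% 10) * 10 + ((year %% 10) %/% 5 * 5) is the same as (year %/% 5) * 5. The sketch below checks that on a few made-up years and builds labels with an inclusive end year (start + 4), which is an assumption about the desired label style and differs from the start + 5 used above.

```r
# Hypothetical sample of release years
year <- c(1970, 1974, 1975, 1989, 2013)

# Start of the five-year bin each year falls into
start <- (year %/% 5) * 5

# Same result as the two-step formula used in the answer
start_two_step <- (year %/% 10) * 10 + ((year %% 10) %/% 5 * 5)

# Labels with an inclusive end year, e.g. "1970 - '74"
label <- paste0(start, " - '", substr(start + 4, 3, 4))
print(label)
```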

Resources