Automatically highlighting outliers in ggplot - R

I have a data frame df. While plotting it with ggplot, can we also highlight outliers? Below is the sample code:
df <- data.frame(col=runif(100, min=0, max=100000))
df$D <- c(1:100)
ggplot(df,aes(x=D,y=col))+geom_line()
Is there a way to highlight outliers here?

We can define a function for this. line_outlier_plot has four arguments: df has the same format as your example data frame; outlier_color and normal_color specify the colors of the points; drop indicates whether unused categories are dropped from the legend.
We have to define how to determine an outlier. Here, I decided that an outlier is a value more than 3 standard deviations above or below the mean. You can define your own approach to determine the outlier by modifying the condition in the ifelse statement.
library(ggplot2)
line_outlier_plot <- function(df, outlier_color = "red", normal_color = "black", drop = FALSE){
  # Assign a label to show if it is an outlier or not
  df$label <- ifelse(df$col > mean(df$col) + 3 * sd(df$col) |
                       df$col < mean(df$col) - 3 * sd(df$col), "Outlier", "Normal")
  df$label <- factor(df$label, levels = c("Normal", "Outlier"))
  # Set the color palette
  pal <- c("Outlier" = outlier_color, "Normal" = normal_color)
  p <- ggplot(df, aes(x = D, y = col)) +
    geom_line() +
    geom_point(aes(color = label)) +
    scale_color_manual(values = pal, drop = drop)
  return(p)
}
Below is an example of the plot using this function.
set.seed(155)
df <- data.frame(col=rnorm(1000))
df$D <- c(1:1000)
line_outlier_plot(df)
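The mean plus or minus 3 standard deviations rule is only one possible definition of an outlier. As a minimal sketch of swapping in a different rule, the hypothetical helper is_outlier_iqr() below (not part of the original answer) flags values that lie more than k interquartile ranges below the first quartile or above the third quartile. The default k = 1.5 is a common convention and is an assumption here.

```r
# Hypothetical helper for illustration; not from the original answer.
# Flags values outside the fences Q1 - k * IQR and Q3 + k * IQR.
is_outlier_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Small made-up vector with one obviously extreme value
x <- c(10, 12, 11, 13, 12, 11, 95)
flags <- is_outlier_iqr(x)

# Same labelling step as in line_outlier_plot, driven by the new rule
label <- factor(ifelse(flags, "Outlier", "Normal"),
                levels = c("Normal", "Outlier"))
table(label)
```

Inside line_outlier_plot, the condition passed to ifelse could then be replaced by is_outlier_iqr(df$col), leaving the rest of the function unchanged.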

Related

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half of the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering how to work with predefined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
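The reordering step group = g[order(x)] matters because the plotted values are sorted while the group vector is not. The same idea can be sketched in base R, since qqplot() also sorts both of its inputs before plotting. The simulated data, the group labels, and the known shape parameters below are assumptions made to keep the example self-contained (the original estimates the parameters with beta.mle). Note that the theoretical quantiles use ppoints(n) with n equal to the number of observations, so that each sorted point corresponds to exactly one observation.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for data$values and data$group (assumptions)
values <- rbeta(n, shape1 = 2, shape2 = 5)
group <- sample(c("Group1", "Group2"), n, replace = TRUE)

# Theoretical quantiles, one per observation
theo <- qbeta(ppoints(n), shape1 = 2, shape2 = 5)

# qqplot() sorts its inputs, so reorder the groups the same way
ord <- order(values)
qq <- qqplot(theo, values, plot.it = FALSE)

palette_by_group <- c(Group1 = "red", Group2 = "blue")
cols <- unname(palette_by_group[group[ord]])

plot(qq$x, qq$y, col = cols,
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = names(palette_by_group),
       col = palette_by_group, pch = 1)
```

Passing col = data$group directly to qqplot() colors the sorted points in the original row order, which is why the colors in the question look unrelated to the groups.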

Boxplots aren't colouring or plotting labels properly in R, why?

My Tukey test significance LABELS and the colours do not plot over each sample's box plot. Why?
It seems the labels are plotted at different y positions along the same x position (s1)?
Reproducible dataset here:
library(multcompView)
df <- data.frame('Sample'=c("s1","s1","s1","s1","s1","s2","s2","s2","s2","s2","s3","s3","s3","s3","s4","s4","s5","s5"), 'value'=c(-0.1098,-0.1435,-0.1046,-0.1308,-0.1523,-0.1219,-0.1114,-0.1328,-0.1589,-0.1567,-0.1395,-0.1181,-0.1448,-0.124,-0.1929,-0.1996,-0.1981,-0.1917))
anova_df <- aov(df$value ~ df$Sample )
tukey_df <- TukeyHSD(anova_df, 'df$Sample', conf.level=0.95)
# I need to group the treatments that are not different from each other together.
TUKEY <- tukey_df
generate_label_df <- function(TUKEY, variable){
# Extract labels and factor levels from Tukey post-hoc
Tukey.levels <- TUKEY[[variable]][,4]
Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])
#I need to put the labels in the same order as in the boxplot :
Tukey.labels$Sample=rownames(Tukey.labels)
Tukey.labels=Tukey.labels[order(Tukey.labels$Sample) , ]
return(Tukey.labels)
}
# Apply the function on my dataset
LABELS <- generate_label_df(TUKEY , "df$Sample")
# A panel of colors to draw each group with the same color :
my_colors <- c(
rgb(143,199,74,maxColorValue = 255),
rgb(242,104,34,maxColorValue = 255),
rgb(111,145,202,maxColorValue = 255))
# Draw the basic boxplot
a <- boxplot(df$value ~ df$Sample , ylim=c(min(df$value) , 1.1*max(df$value)) , col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
# I want to write the letter over each box. Over is how high I want to write it.
over <- 0.1*max(a$stats[nrow(a$stats),] )
#Add the labels
text(c(1:nlevels(df$Sample)), a$stats[nrow(a$stats),]+over, LABELS[,1] , col=my_colors[as.numeric(LABELS[,1])] )
Current output:
Desired plot-like (colours and LABELS):
First, LABELS$Letters is a character vector. You can get as.numeric(LABELS[,1]) to work if you make it a factor first.
Second, your y-limit needs some work for negative values. There is a function you might find useful called extendrange, which is used in many plotting functions.
The expression c(1:nlevels(df$Sample)) would also work if df$Sample were a factor, which it is not.
Also, if you are plotting text at a specific location, you can adjust the text using either text(..., pos = ) or text(..., adj = ) to shift the position.
LABELS$Letters <- factor(LABELS$Letters)
a <- boxplot(df$value ~ df$Sample , ylim = extendrange(df$value), col=my_colors[as.numeric(LABELS[,1])] , ylab="Value" , main="")
text(seq_along(a$names), apply(a$stats, 2, max), LABELS[,1], col=my_colors[as.numeric(LABELS[,1])], pos = 3)
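To see why the original y-limit fails for negative values, here is a small sketch using the value column from the question. Multiplying a negative maximum by 1.1 moves the upper limit down, below the data, whereas extendrange() widens the range on both sides (by 5% of its width with the default f = 0.05).

```r
# The 'value' column from the question's data frame
value <- c(-0.1098, -0.1435, -0.1046, -0.1308, -0.1523, -0.1219,
           -0.1114, -0.1328, -0.1589, -0.1567, -0.1395, -0.1181,
           -0.1448, -0.124, -0.1929, -0.1996, -0.1981, -0.1917)

# y-limit as written in the question: the upper bound falls below max(value)
old_ylim <- c(min(value), 1.1 * max(value))

# y-limit from extendrange(): padded on both sides of the data
new_ylim <- extendrange(value)

old_ylim
new_ylim
range(value)
```

With the old limits, the tops of the highest boxes and any labels placed above them fall outside the plotting region.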
If you don't mind changing your workflow and using the tidyverse library, this is how you could achieve your goal:
library(tidyverse)
# join df and LABELS into one data table
inner_join(df, LABELS, by = "Sample") %>%
# calculate max value for each Sample group (it will be used to place the labels)
group_by(Sample) %>%
mutate(placement = max(value)) %>%
ungroup() %>%
# make a plot
ggplot(aes(Sample, value, fill = Letters))+
geom_boxplot()+
geom_text(aes(y = placement, label = Letters, col = Letters), nudge_y = 0.01, size = 6)+
theme_minimal()+
theme(legend.position = "none")

color ggplot line based on increasing or decreasing

I am trying to change the color of a line segment based on whether or not it is greater than the point before it. What am I doing wrong here?
Example:
from [1,4] to [2,5] would be green because the y value is increasing.
from [1,4] to [2,1] would be red because the y value is decreasing.
My code:
set.seed(42)
df<-NULL
df$x<-rnorm(10,0,10)
df$y<-rnorm(10,0,10)
df$colors<-cbind(lapply(1:length(df$x),function(i){
ifelse(df$x[i]>df$x[i-1],"#CC6666","#9999CC")
}))
df<-data.frame(df)
ggplot()+
geom_line(aes(x=df$x,y=df$y),size=1.5,colour=df$color)+
scale_color_manual(values=df$color)
Would something like this work for you? I rearranged your example data a bit; we can use geom_path() and dplyr::lead() to get the colors matched correctly, plus a little ggplot hack to make the labels nice again and ditch the NA:
set.seed(42)
df <- data.frame( x = rnorm(10,0,10),
y = rnorm(10,0,10) )
# base R
df <- df[order(df$x),]
df$color <- sapply(1:nrow(df), function(i) df$y[i] > df$y[i+1])
df$group <- "1"
# tidyverse alternative to the base R steps above
library(tidyverse)
df <- arrange(df, x) %>%
mutate(color = y > lead(y),
group = "1") # group everything together or else we get two lines
ggplot()+
geom_path(data = df,
aes(x=x, y=y, color = color, group = group),size=1.5) +
scale_color_manual(values = c("#CC6666","#9999CC"), na.value = "white",
labels = c("increase", "decrease", ""))
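The code above draws a single path and colors each stretch by a logical flag. An alternative, shown here only as a sketch and not as the answerer's method, is to build one row per line segment and draw it with geom_segment(), which avoids the NA for the last point. The data frame seg, the column direction, and the green and red colors (taken from the question's wording) are choices made for this illustration.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(10, 0, 10),
                 y = rnorm(10, 0, 10))
df <- df[order(df$x), ]
n <- nrow(df)

# One row per segment: from point i to point i + 1
seg <- data.frame(x = df$x[-n], y = df$y[-n],
                  xend = df$x[-1], yend = df$y[-1])

# A segment is an increase when its end point is higher than its start
seg$direction <- ifelse(seg$yend > seg$y, "increase", "decrease")

p <- ggplot(seg, aes(x = x, y = y, xend = xend, yend = yend,
                     colour = direction)) +
  geom_segment() +
  scale_colour_manual(values = c(increase = "green", decrease = "red"))
p
```

Because every row is a complete segment, there is no trailing NA to hide and no need for a dummy group.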

Could using grid.arrange affect plot output in R?

For a function that accepts a data frame, a plot switch value, and a bin size vector, I'm trying to create plots for each count-bin and each density-bin size combination for every numerical variable column when a plot switch value of 'grid' is passed in. I have it working so that I get individual plots for each combination, but when I tried to put the plots for each numerical variable on a grid, I saw variations in the actual plots, even though as far as I can tell I altered neither the calculations nor the plot commands.
For instance, notice that, even accounting for scale variations, the y-density plots are different in the images below. I tried looking at the documentation for grid.arrange and list just to make sure I wasn't missing something, but I still can't figure it out. Did I somehow change the plot command without realizing it?
Tested with:
data(diamonds, package = "ggplot2")
test <- diamonds[1:100,]
hstPlts(test,ps='grid') #without grid
hst1(test,ps='grid') #with grid
hstPlts <- function(df,ps = 'off',bnSize=c(30)){
#hstPlts() accepts any data frame and, if the ps parameter is 'on' or
#'grid', plots a pair of blue histograms with a vertical red line at the
#mean for every numerical variable at each number of bins specified in
#the bin vector parameter. If the plot switch is set to 'grid', there
#should be a grid for each count-bin combination and a separate grid for
#each density-bin size combination.
#parameters:
#df : data frame
#ps : plot switch defaulted to 'off' with two additional value options -
#'on' and 'grid'
#bnSize : vector containing numeric values representing the number of bins
#for histograms
num_var <- df[sapply(df,is.numeric)] #extract numeric columns
for(i in 1:length(bnSize)){ #iterate through each bin size
for(j in 1:(ncol(num_var))){ #iterate through each numeric variable
if(ps=='on' | ps =='grid'){ #conditional for both 'on' and 'grid'
#values
bnWid <- (max(num_var[,j]) - min(num_var[,j]))/bnSize[i] # compute
#bin widths
vrMn <- mean(num_var[[j]]) #compute column means for red line
mean = sprintf("%8.3f ", vrMn) #set up label for mean line with
#formatted decimals
cntPlt <- ggplot(num_var, aes(x=num_var[,j])) + #plot count
#histogram with numeric variable on x-axis
geom_histogram(colour = "blue", fill = "blue", binwidth =
bnWid) +
#detail bar fill colors and histogram bin widths
geom_vline(xintercept = mean(num_var[,j]),colour="red") +
#mean line
labs(x=colnames(num_var)[j]) #label x-axis
densPlt <- cntPlt + aes(y = ..density..) + labs(y = "density")
#create corresponding density histogram
print(cntPlt)
print(densPlt)
}
if(ps == 'grid'){ #conditional for just 'grid' value
grdPlt <- ggplot(num_var, aes(x=num_var[,j])) + #create plot for
#each count-bin combination and a separate grid for
#each density-bin size combination
geom_histogram(colour = "blue", fill = "blue", binwidth =
bnWid) +
#detail bar fill colors and histogram bin widths
labs(x=colnames(num_var)[j]) #label x-axis
print(grdPlt)
print(grdPlt + aes(y = ..density..) + labs(y = "density"))
}
}
}
}
hst1 <- function(df,ps = 'off',bsizes=c(30)){
numvar <- df[sapply(df,is.numeric)]
if(ps=='on'){
for(i in 1:length(bsizes)){
for(j in 1:(ncol(numvar))){
bwidth <- (max(numvar[,j]) - min(numvar[,j]))/bsizes[i]
var_mean <- mean(numvar[[j]])
mean = sprintf("%8.3f ", var_mean)
counts <- ggplot(numvar, aes(x=numvar[,j])) +
geom_histogram(colour = "blue", fill = "blue", binwidth
= bwidth) +
geom_vline(xintercept = mean(numvar[,j]),colour="red") +
labs(x=colnames(numvar)[j]) # Labeling the x axis
dense <- counts + aes(y = ..density..) + labs(y = "density")
print(counts)
print(dense)
}
}
}
else if(ps == 'grid'){
for(i in 1:length(bsizes)){
cntLst <- list()
denseLst <- list()
for(j in 1:(ncol(numvar))){
bwidth <- (max(numvar[,j]) - min(numvar[,j]))/bsizes[i]
cntGrPlt <- ggplot(numvar, aes(x=numvar[,j])) +
geom_histogram(colour = "blue", fill = "blue", binwidth =
bwidth) +
labs(x=colnames(numvar)[j])
cntLst[[length(cntLst)+1]] <- cntGrPlt
denseLst[[j]] <- cntGrPlt + aes(y = ..density..) + labs(y =
"density")
}
grid.arrange(grobs=cntLst,ncol=2)
grid.arrange(grobs=denseLst,ncol=2)
}
}
}
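One plausible cause, offered as an assumption since no confirmed diagnosis appears here: ggplot2 evaluates the expressions inside aes() lazily, when a plot is printed or drawn, not when it is created. In hstPlts each plot is printed inside the loop, while j still has the intended value. In hst1 the plots are stored in a list and drawn by grid.arrange after the loop has finished, so aes(x=numvar[,j]) would be evaluated with the final value of j for every stored plot, while eagerly evaluated pieces such as binwidth and labs keep their per-column values. A minimal sketch of one way around this is to build each plot inside a function, so the column name is fixed in that call's own environment. The helper make_hist and its bins argument are hypothetical names for this illustration.

```r
library(ggplot2)

# Hypothetical helper: 'var' is bound inside each call, so the lazily
# evaluated aes() expression refers to the intended column.
make_hist <- function(data, var, bins = 30) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(colour = "blue", fill = "blue", bins = bins) +
    labs(x = var)
}

data(diamonds, package = "ggplot2")
test <- as.data.frame(diamonds[1:100, ])
numvar <- test[sapply(test, is.numeric)]

# One plot per numeric column, each with its own column name
plots <- lapply(names(numvar), make_hist, data = numvar)

# The list can then be arranged as in the question, for example:
# gridExtra::grid.arrange(grobs = plots, ncol = 2)
```

The corresponding density histograms could be derived from each element of plots in the same way the question derives them from cntGrPlt.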

Custom scatterplot matrix using facet_grid in ggplot2

I'm trying to write a custom scatterplot matrix function in ggplot2 using facet_grid. My data have two categorical variables and one numeric variable.
I'd like to facet (make the scatterplot rows/cols) according to one of the categorical variables and change the plotting symbol according to the other categorical.
I do so by first constructing a larger dataset that includes all combinations (combs) of the categorical variable from which I'm creating the scatterplot panels.
My questions are:
How to use geom_rect to white-out the diagonal and upper panels in facet_grid (I can only make the middle ones black so far)?
How can you move the titles of the facets to the bottom and left hand sides respectively?
How does one remove tick axes and labels for the top left and bottom right facets?
Thanks in advance.
require(ggplot2)
# Data
nC <- 5
nM <- 4
dat <- data.frame(
Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
# Change factors to characters
dat <- within(dat, {
Control <- as.character(Control)
measure <- as.character(measure)
})
# Check, lapply(dat, class)
# Define scatterplot() function
scatterplotmatrix <- function(data,...){
controls <- with(data, unique(Control))
measures <- with(data, unique(measure))
combs <- expand.grid(1:length(controls), 1:length(measures), 1:length(measures))
# Add columns for values
combs$value1 = 1
combs$value2 = 0
for ( i in 1:NROW(combs)){
combs[i, "value1"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,2]], select = value)
combs[i, "value2"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,3]], select = value)
}
for ( i in 1:NROW(combs)){
combs[i,"Control"] <- controls[combs[i,1]]
combs[i,"Measure1"] <- measures[combs[i,2]]
combs[i,"Measure2"] <- measures[combs[i,3]]
}
# Final pairs plot
plt <- ggplot(combs, aes(x = value1, y = value2, shape = Control)) +
geom_point(size = 8, colour = "#F8766D") +
facet_grid(Measure2 ~ Measure1) +
ylab("") +
xlab("") +
scale_x_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
geom_rect(data = subset(combs, subset = Measure1 == Measure2), colour='white', xmin = -Inf, xmax = Inf,ymin = -Inf,ymax = Inf)
return(plt)
}
# Call
plt1 <- scatterplotmatrix(dat)
plt1
I'm not aware of a way to move the panel strips (the labels) to the bottom or left. Also, it's not possible to format the individual panels separately (e.g., turn off the tick marks for just one facet). So if you really need these features, you will probably have to use something other than, or in addition to ggplot. You should really look into GGally, although I've never had much success with it.
As far as leaving some of the panels blank, here is a way.
nC <- 5; nM <- 4
set.seed(1) # for reproducible example
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
scatterplotmatrix <- function(data,...){
require(ggplot2)
require(data.table)
require(plyr) # for .(...)
DT <- data.table(data,key="Control")
gg <- DT[DT,allow.cartesian=T]
setnames(gg,c("Control","H","x","V","y"))
fmt <- function(x) format(x,nsmall=1)
plt <- ggplot(gg, aes(x,y,shape = Control)) +
geom_point(subset=.(as.numeric(H)<as.numeric(V)),size=5, colour="#F8766D") +
facet_grid(V ~ H) +
ylab("") + xlab("") +
scale_x_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05)) +
scale_y_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05))
return(plt)
}
scatterplotmatrix(dat)
The main feature of this is the use of subset=.(as.numeric(H)<as.numeric(V)) in the call to geom_point(...). This subsets the dataset so you only get a point layer when the condition is met, i.e. in facets where as.numeric(H)<as.numeric(V). This works because I've left the H and V columns as factors, and as.numeric(...) operating on a factor returns the integer level codes, not the level names.
The rest is just a more compact (and much faster) way of creating what you called combs.
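The answer above relies on behaviour that may differ in current software: since R 4.0, data.frame() no longer converts strings to factors by default, and recent ggplot2 releases do not appear to accept a subset argument in layers. The following is a sketch of the same idea under those assumptions, using base merge() for the self-join, an explicit factor() for the measure column, and a pre-filtered data frame for the point layer. It also uses the switch argument of facet_grid(), available in ggplot2 2.0 and later, to move the strips to the bottom and left. The names gg and lower are chosen for this illustration.

```r
library(ggplot2)

nC <- 5; nM <- 4
set.seed(1)
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
                  measure = factor(rep(letters[1:nM], each = nC)),
                  value = runif(nC * nM))

# Self-join on Control: every pair of measures for each control
gg <- merge(dat, dat, by = "Control")
names(gg) <- c("Control", "H", "x", "V", "y")

# Keep points only for panels where the column measure precedes the row measure
lower <- subset(gg, as.integer(H) < as.integer(V))

plt <- ggplot(gg, aes(x, y, shape = Control)) +
  geom_blank() +  # full data, so that all panels and scales are set up
  geom_point(data = lower, size = 5, colour = "#F8766D") +
  facet_grid(V ~ H, switch = "both") +
  xlab("") + ylab("")
plt
```

The diagonal and the other half of the matrix are left as empty panels because no points are supplied for them.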

Resources