Conditional stat_summary for ggplot in R

I'd like to write some conditional stats in my graph if the data is bigger than a certain value.
With the kind help of Jack Ryan (Cut data and access groups to draw percentile lines), I could create the following script that groups data into hours and plots the result:
# Read example data
A <- read.csv(url('http://people.ee.ethz.ch/~hoferr/download/data-20130812.csv'))
# Libraries
library(doBy)
library(ggplot2)
library(plyr)
library(reshape2)
library(MASS)
library(scales)
# Sample size function
give.n <- function(x){
return(c(y = min(x) - 0.2, label = length(x)))
}
# Calculate gaps
gaps <- rep(NA, length(A$Timestamp))
times <- A$Timestamp
loss <- A$pingLoss
gap.start <- 1
gap.end <- 1
for(i in 2:length(A$Timestamp))
{ #For all rows
if(is.na(A$pingRTT.ms.[i]))
{ #Currently no connection
if(!is.na(A$pingRTT.ms.[i-1]))
{ #Connection lost now
gap.start <- i
}
if(!is.na(A$pingRTT.ms.[i+1]))
{ # Connection restores next time
gap.end <- i+1
gaps[gap.start] <- as.numeric(A$Timestamp[gap.end]-A$Timestamp[gap.start], units="secs")
loss[gap.start] <- gap.end - gap.start
}
}
}
H <- data.frame(times, gaps, loss)
H <- H[complete.cases(H),]
C <- H
C$dates <- strptime(C$times, "%Y-%m-%d %H:%M:%S")
C$h1 <- C$dates$hour
# Calculate percentiles
cuts <- c(1, .75, .5, .25, 0)
c <- ddply(C, .(h1), function (x) { summarise(x, y = quantile(x$gaps, cuts)) } )
c$cuts <- cuts
c <- dcast(c, h1 ~ cuts, value.var = "y")
c.h1.melt <- melt(c, id.vars = "h1")
p <- ggplot(c.h1.melt, aes(x = h1, y = value, color = variable)) +
geom_point(size = 4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,23)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Hour of day") + ylab("Ping gaps [s]")
p
p <- ggplot(c.m1.melt, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 1) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median, angle = 90, size=4) +
stat_summary(fun.data = max.n, geom = "text", fun.y = max, colour = "red", angle = 90, size=4) +
scale_colour_brewer(palette="RdYlBu", name="Percentile", guide = guide_legend(reverse=TRUE)) +
scale_x_continuous(breaks=0:23, limits = c(0,24)) +
annotation_logticks(sides = "lr") +
theme_bw() +
scale_y_log10(breaks=c(1e0,1e1,1e2,1e3,1e4), labels = trans_format("log10", math_format(10^.x)), limits=c(1e0,1e4)) +
xlab("Time of day") + ylab("Ping gaps [s]")
p
This creates an hourly grouped plot of gaps with the length of the longest gaps written right next to the data points:
Below is the minutely grouped plot. The numbers are unreadable, which is why I'd like to add conditional stats: only if the gap is longer than 5 minutes, or only for the ten longest gaps, or something like this.
I tried to just change the stat function to
max.n.filt <- function(x){
filter = 300
if ( x > filter ) {
return(c(y = max(x) + 0.4, label = round(max(10^x),2)))
} else {
return(c(y=x, label = ""))
}
}
and use this for the minutely grouped plot. But I got this error:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Calls: print ... print.ggplot -> ggplot_gtable -> Map -> mapply -> <Anonymous>
In addition: Warning message:
Removed 6 rows containing missing values (geom_point).
In addition, in the hourly plot, I'd like to write the number of samples per hour right next to the length of the gaps. I think I can add a new column to the c data frame, but unfortunately I can't find a way to do this.
Any help is very much appreciated.

See ?stat_summary.
fun.data : Complete summary function. Should take data frame as input
and return data frame as output
Your function max.n.filt uses an if() statement that tries to evaluate the condition x > filter. But when length(x) > 1, the if() statement only evaluates the condition for the first value of x (and since R 4.2.0 a condition of length greater than one is an error rather than a warning). When used on a data frame, this will return a list cobbled together from the original input x and whatever label the if() statement returns.
> max.n.filt(data.frame(x=c(10,15,400)))
$y.x
[1] 10 15 400
$label
[1] ""
Try a function that uses ifelse() instead:
max.n.filt2 <- function(x){
filter = 300 # whatever threshold
y = ifelse( x > filter, max(x) + 1, x[,1] )
label = ifelse( x > filter, round(max(x),2), NA )
return(data.frame(y=y[,1], label=label[,1]))
}
> max.n.filt2(data.frame(x=c(10,15,400)))
y label
1 10 NA
2 15 NA
3 401 400
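Another option is to decide once per group whether a label is shown at all. As far as I know, stat_summary() calls fun.data with the vector of y values of each group, and with scale_y_log10() those values are already on the log10 scale. The following is a minimal sketch under those assumptions; the function name label_max_if_above and the default threshold are made up for illustration:

```r
# Hypothetical helper: one label per group, shown only when the group's
# maximum (back-transformed from log10) exceeds the threshold.
label_max_if_above <- function(y, threshold = 300) {
  top <- max(y)                 # largest value in this group (log10 scale)
  big <- 10^top > threshold     # compare on the original scale
  data.frame(
    y = top + 0.4,              # place the text slightly above the point
    label = if (big) as.character(round(10^top, 2)) else NA_character_
  )
}

label_max_if_above(log10(c(10, 15, 400)))  # label is "400"
label_max_if_above(log10(c(10, 15, 20)))   # label is NA
```

It could then be passed as stat_summary(fun.data = label_max_if_above, geom = "text", ...); groups whose label is NA should get no text.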
Alternatively, you might just find it easier to use geom_text(). I can't reproduce your example, but here's a simulated dataset:
set.seed(101)
sim_data <- expand.grid(m1=1:1440, variable=factor(c(0,0.25,0.5,0.75,1)))
sim_data$sample_size <- sapply(1:1440, function(.) sample(1:25, 1, replace=T))
sim_data$value = t(sapply(1:1440, function(.) quantile(rgamma(sim_data$sample_size, 0.9, 0.5),c(0,0.25,0.5,0.75,1))))[1:(1440*5)]
Just use the subset argument in geom_text() to select those points you wish to label:
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=round(value)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.5)
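The subset argument relies on plyr's .() helper and, as far as I know, is no longer accepted by geom_text() since ggplot2 2.0.0. A hedged equivalent is to pass a filtered data frame through the data argument. The sketch below uses a small made-up data set (toy_data is hypothetical) and also shows how to keep only the N largest values, which covers the "ten longest gaps" variant of the question:

```r
set.seed(101)
# Made-up data: 200 time points, a median series and a maximum series
toy_data <- data.frame(
  m1 = rep(1:200, times = 2),
  variable = factor(rep(c(0.5, 1), each = 200)),
  value = c(runif(200, 0, 10), runif(200, 10, 40))
)

# Option 1: label only the maxima above a threshold
to_label <- subset(toy_data, variable == 1 & value > 25)

# Option 2: label only the ten largest maxima
maxima <- subset(toy_data, variable == 1)
top10 <- head(maxima[order(maxima$value, decreasing = TRUE), ], 10)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy_data,
                       ggplot2::aes(x = m1 / 60, y = value, colour = variable)) +
    ggplot2::geom_point(size = 2) +
    ggplot2::geom_text(data = top10, ggplot2::aes(label = round(value)),
                       angle = 90, size = 4, colour = "red", hjust = -0.5)
}
```

Swapping top10 for to_label in the geom_text() layer gives the threshold-based version.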
If you have a column of sample sizes, those can be incorporated into label with paste():
ggplot(sim_data, aes(x = m1/60, y = value, color = variable)) +
geom_point(size = 4) + geom_text(aes(label=paste(round(value),", N=",sample_size)), subset = .(variable == 1 & value > 25), angle = 90, size = 4, colour = "red", hjust = -0.25)
(or create a separate column in your data with whatever labels you want.) If you're asking about how to retrieve the sample sizes, you could modify your call to ddply() like this:
...
c2 <- ddply(C, .(h1), function (x) { cbind(summarise(x, y = quantile(x$gaps, cuts)), n=nrow(x)) } )
c2$cuts <- cuts
c2 <- dcast(c2, h1 + n ~ cuts, value.var = "y")
c2.h1.melt <- melt(c2, id.vars = c("h1","n"))
...
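If plyr is not available, the same kind of table (quantiles plus the per-hour sample size) can be built with base R. This is a minimal sketch on made-up data; gap_data and its values are hypothetical stand-ins for the C data frame above:

```r
# Made-up gaps (seconds) for two hours of the day
gap_data <- data.frame(h1 = c(0, 0, 0, 1, 1),
                       gaps = c(2, 10, 30, 5, 50))
cuts <- c(1, .75, .5, .25, 0)

# One block of rows per hour: the quantiles plus the number of samples n
per_hour <- do.call(rbind, lapply(split(gap_data, gap_data$h1), function(d) {
  data.frame(h1 = d$h1[1],
             cuts = cuts,
             y = quantile(d$gaps, cuts, names = FALSE),
             n = nrow(d))
}))

# A label such as "30, N=3" can then be built for the maximum of each hour
per_hour$label <- paste0(round(per_hour$y), ", N=", per_hour$n)
per_hour
```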

Related

Is there a programmatic way to pass specific ranges for the y-axis on a ggplot2 plot?

I've got plots that are being generated automatically based on some user inputs. Most of the time, the plots work fine. However, some users have requested to ensure that there is always an axis label on each end of the plotted data. For example, this plot:
sample_data <-
data.frame(
x = rep(LETTERS[1:3], each = 3)
, y = 1:9 + 0.5
)
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
)
Has no label above the top point or below the bottom point. I can add them easily enough with expand_limits:
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
expand_limits(y = c(2, 10))
However, because these plots are being automatically generated, I cannot manually add the next axis point each time. I've tried passing only.loose = TRUE to labeling::extended, but that still doesn't change the displayed values (any more than entering the values that I want would):
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(breaks = breaks_extended(only.loose = TRUE))
In addition, some of the plots are more complex than this (e.g., with or without confidence intervals, additional grouping, etc.), and the data is prepared for the plot using dplyr and piped directly into ggplot (with %>%). So, even something like recalculating the values is non-trivial.
In fact, even in this case, it fails because adding the expanded points to capture the next set of labels changes the labeling.
ggplot(
sample_data
, aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(breaks = breaks_extended(n = 5
, only.loose = TRUE)) +
expand_limits(y =
sample_data %>%
group_by(x) %>%
summarise(my_mean = mean(y)) %>%
pull(my_mean) %>%
range() %>%
{labeling::extended(.[1], .[2], 5
, only.loose = TRUE)}
)
It appears that this happens because
labeling::extended(2.5, 8.5, 5, only.loose = TRUE)
returns the range 2 to 9 by 1's, while:
labeling::extended(2, 9, 5, only.loose = TRUE)
returns the range 2 to 10 by 2's. Somehow, breaks_extended is throwing in some added variation, though whether I track it down or not doesn't change much. I could work around this by calculating the breaks first, but (again) this is for a fairly complicated set of plots.
I feel like I am missing some sort of obvious point, but it keeps eluding me.
Yes, there is a programmatic way to set the limits on y-scales: provide a function to the limits argument. It is given the natural data limits as input, which you can then edit programmatically. The same goes for breaks, except that the input is the limits.
An example is below; exactly how this code should look is up to your specifications.
library(ggplot2)
sample_data <- data.frame(
x = rep(LETTERS[1:3], each = 3),
y = 1:9 + 0.5
)
ggplot(sample_data,
aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(
limits = function(x) {
lower <- floor(x[1])
lower <- ifelse(x[1] - lower < 0.5, lower - 1, lower)
upper <- ceiling(x[2])
upper <- ifelse(upper - x[2] <= 0.5, upper + 1, upper)
c(lower, upper)
},
breaks = function(x) {
scales::breaks_pretty()(x)
}
)
#> Warning: Removed 3 rows containing missing values (geom_segment).
Created on 2021-03-23 by the reprex package (v1.0.0)
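To see what the limits function above does to a given range, it can be pulled out and called directly. In this small sketch the name expand_to_next_integer is made up, and the input c(2.5, 8.5) is the range of the group means in sample_data (whether the scale passes exactly this range to the function is an assumption):

```r
# Same logic as the anonymous limits function above, given a name
expand_to_next_integer <- function(x) {
  lower <- floor(x[1])
  # step one integer further down if the data sits close to the lower integer
  lower <- ifelse(x[1] - lower < 0.5, lower - 1, lower)
  upper <- ceiling(x[2])
  # step one integer further up if the data sits close to the upper integer
  upper <- ifelse(upper - x[2] <= 0.5, upper + 1, upper)
  c(lower, upper)
}

expand_to_next_integer(c(2.5, 8.5))  # lower stays at 2, upper moves to 10
expand_to_next_integer(c(2.2, 8.9))  # lower moves to 1, upper moves to 10
```

Note that the two comparisons are not symmetric (< for the lower end, <= for the upper end), which is why 2.5 keeps its lower limit while 8.5 gets an extra unit on top.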
Inspired by teunbrand, I built a function that generates the limits, then checks to ensure that the expansion (including the 5% buffer) does not change the output of pretty:
my_lims_expand <- function(x){
prev_pass <-
range(pretty(x))
curr_pass <-
pretty(c(prev_pass[1] - 0.05 * diff(prev_pass)
, prev_pass[2] + 0.05 * diff(prev_pass)))
last_under <-
tail(which(curr_pass < min(x)), 1)
first_over <-
head(which(curr_pass > max(x)), 1)
out <-
range(curr_pass[last_under:first_over])
confirm_out <-
range(pretty(out))
while(!all(out == confirm_out)){
prev_pass <- curr_pass
curr_pass <-
pretty(c(prev_pass[1] - 0.05 * diff(prev_pass)
, prev_pass[2] + 0.05 * diff(prev_pass)))
last_under <-
tail(which(curr_pass < min(x)), 1)
first_over <-
head(which(curr_pass > max(x)), 1)
out <-
range(curr_pass[last_under:first_over])
confirm_out <-
range(pretty(out))
}
return(out)
}
Then, I can use that function for limits:
ggplot(sample_data,
aes(x = x, y = y)) +
stat_summary(
fun = "mean"
) +
scale_y_continuous(
limits = my_lims_expand
, breaks = pretty
)
to generate the desired plot:

How to make for-loop work in order to automate plotting

I have
a data.frame df
df = data.frame(year=c(2018,2019,2020), value1=rnorm(3,1,0.5), value2=rnorm(3,2,0.5))
a ggplot-function called ScatterPlot (function code see below)
a for loop that I want to use to run the ggplot-function over my df
My intent is to plot (scatter) value1 over years and value2 over years somewhat automatically (using scatterplot function and my for loop).
For some reason, the for loop below only generates one plot (the last one in my df). Can someone tell me what I am missing?
for loop:
# Create the loop.vector (all the columns)
loop.vector <- ncol(df)-1
for (i in loop.vector) { # Loop over loop.vector
# store data in column.i as x
x <- df[i]
x = unlist(x) #necessary. otherwise ggplot will generate an error
plotname = colnames(df[i])
#plot
jpeg(filename=paste0("/R-Outputs/plots/",plotname,".jpeg"))
plot= ScatterPlot(df,df$year,"year", x, plotname)
print(plot)
dev.off()
}
Scatterplot Function (this works):
ScatterPlot <- function(df, x, x_var_label,y, y_var_label) {
# Input:
# df: a data frame
# x: a column from df in the form of a character vector
# y: a column from df in the form of a character vector
#
# Output:
# a ggplot2 plot
require(ggplot2)
x_title = x_var_label
y_title = y_var_label
time_labels = c("2018", "2019", "2020")
ggplot(data = df, aes(x = x, y = y)) +
geom_point(col="#69b3a2",fill="#69b3a2",alpha=0.5, size = 0) +
geom_line()+
geom_smooth(method = "lm", se = FALSE, size = 0.8, col="red") +
xlab(label = x_title) +
ylab(label = y_title) +
theme_bw()+
theme(axis.text.x=element_text(angle=45, hjust = 1))+
labs(title = paste0(y_title," over time"))+
scale_x_continuous("year", labels = as.character(time_labels),
breaks = as.integer((time_labels)))
}
You don't need to pass both values as well as column name. Pass only the column name in the function ScatterPlot.
library(ggplot2)
ScatterPlot <- function(df, x_var_label,y_var_label) {
# Input:
# df: a data frame
# x_var_label: name of the x column in df, as a character string
# y_var_label: name of the y column in df, as a character string
#
# Output:
# a ggplot2 plot
time_labels = c("2018", "2019", "2020")
ggplot(data = df, aes(x = .data[[x_var_label]], y = .data[[y_var_label]])) +
geom_point(col="#69b3a2",fill="#69b3a2",alpha=0.5, size = 0) +
geom_line()+
geom_smooth(method = "lm", se = FALSE, size = 0.8, col="red") +
xlab(label = x_var_label) +
ylab(label = y_var_label) +
theme_bw()+
theme(axis.text.x=element_text(angle=45, hjust = 1))+
labs(title = paste0(y_var_label," over time"))+
scale_x_continuous("year", labels = time_labels,
breaks = as.integer(time_labels))
}
To call this function in a loop, something like this should work.
#column names to loop over
loop.vector <- names(df[-1])
plot <- vector('list', length(loop.vector))
for (i in seq_along(loop.vector)) { # Loop over loop.vector
jpeg(filename=paste0("/R-Outputs/plots/",loop.vector[i],".jpeg"))
plot[[i]] = ScatterPlot(df,"year", loop.vector[i])
print(plot[[i]])
dev.off()
}
We are also saving individual plots in a list which you can verify with plot[[1]], plot[[2]] etc.
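As for why the original loop produced a single plot: ncol(df) - 1 is a single number, so for (i in loop.vector) runs exactly once. Here is a small sketch of the difference, using a data frame of the same shape as in the question (the values are made up):

```r
df <- data.frame(year = c(2018, 2019, 2020),
                 value1 = c(1.1, 0.9, 1.3),
                 value2 = c(2.2, 1.8, 2.1))

# Original approach: the "vector" is just the number 2
single <- ncol(df) - 1
visited_single <- character(0)
for (i in single) {
  visited_single <- c(visited_single, names(df)[i])
}
visited_single   # one iteration only

# Looping over the column names visits every value column
visited_names <- character(0)
for (col in names(df)[-1]) {
  visited_names <- c(visited_names, col)
}
visited_names    # both value columns
```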

How to find clusters of values over threshold for timeseries

I have timeseries and need to find clusters of values over threshold and plot that cluster on separate plot.
My code example. Unfortunately I don't know how to generate well clustered values.
library(ggplot2)
library(tibble)
library(scales) # for date_format()
#generate sample data
Sys.setlocale("LC_ALL","English")
set.seed(8)
Values <- sample(0:100,24241, replace = T)
Values <- rpois(24241, lambda=60)
start <- as.POSIXct("2012-01-15 06:10:00")
interval <- 15
end <- start + as.difftime(4, units="days") + as.difftime(5, units = "hours")
DateTimes <- seq(from=start, by=interval, to=end)
my_data_sample <- tibble(datetime = DateTimes, Value = Values)
threshold <- 82
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
Here is what I need:
I need to find clusters of values over the threshold (consecutive or near each other), sort those clusters by length in seconds (longest clusters) or by sum of values (most powerful clusters), and plot, say, the top 3 of those time periods on separate plots.
Any suggestions how to do that?
You can find runs that follow some expectation using run-length encoding (RLE). At the RLE level, you can filter out runs that are too short on either side. You can play with the run_threshold value until it matches your data.
# Put some actual deviating runs in the data
my_data_sample$Value[5001:5100] <- rpois(100, lambda = 80)
my_data_sample$Value[10001:11000] <- rpois(1000, lambda = 80)
threshold <- 82
rle <- rle(my_data_sample$Value > threshold)
# Find sub-threshold values in between super-threshold values,
# convert these to other class
run_threshold <- 20
rle$values[!rle$values & rle$lengths < run_threshold] <- TRUE
# Restructure rle
rle <- rle(inverse.rle(rle))
# Find short super-threshold values to filter
run_threshold <- 5
rle$values[rle$values & rle$lengths < run_threshold] <- FALSE
rle <- rle(inverse.rle(rle))
# Find run starts and ends
rle_start <- {rle_end <- cumsum(rle$lengths)} - rle$lengths + 1
# Format as data.frame for ggplot
rle_df <- data.frame(
min = my_data_sample$datetime[rle_start],
max = my_data_sample$datetime[rle_end],
value = rle$values
)
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_rect(aes(xmin = min, xmax = max, ymin = 0, ymax = 10, fill = value),
data = rle_df, inherit.aes = FALSE) +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
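Once the runs are known, ranking them by duration or by sum of values is a matter of building one row per run and sorting. The sketch below uses deterministic made-up values so the result is easy to follow, and it skips the gap-merging step shown above; in the real data the same idea would be applied to rle_start and rle_end after that step. All names here are illustrative:

```r
interval <- 15                       # seconds between observations
value <- rep(60, 400)
value[51:70]   <- 90                 # 20 points above the threshold
value[101:160] <- 95                 # 60 points above the threshold
value[201:230] <- 200                # 30 points, but very large values
datetime <- as.POSIXct("2012-01-15 06:10:00", tz = "UTC") +
  interval * (seq_along(value) - 1)
threshold <- 82

runs <- rle(value > threshold)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
keep <- runs$values                  # TRUE for runs above the threshold

clusters <- data.frame(
  start = datetime[run_start[keep]],
  end = datetime[run_end[keep]],
  n_points = runs$lengths[keep]
)
# Length of each cluster in seconds (first to last observation)
clusters$duration_secs <- as.numeric(
  difftime(clusters$end, clusters$start, units = "secs"))
# Sum of the values inside each cluster
clusters$total <- mapply(function(s, e) sum(value[s:e]),
                         run_start[keep], run_end[keep])

top_by_duration <- head(clusters[order(clusters$duration_secs, decreasing = TRUE), ], 3)
top_by_total <- head(clusters[order(clusters$total, decreasing = TRUE), ], 3)

# Data for a separate plot of the longest cluster
first <- top_by_duration[1, ]
cluster_data <- data.frame(datetime, value)[
  datetime >= first$start & datetime <= first$end, ]
```

Each row of top_by_duration or top_by_total can be used in the same way to subset the series and draw one plot per cluster.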

stat_function not transitioning over transition_states

I'm trying to write my own Central Limit Theorem demonstration using ggplot2 and am unable to get my stat_function to display a changing normal distribution.
Below is my code. I want the normal distribution in stat_function to transition through different states; specifically, I'm hoping for it to change the standard deviation to correspond with each value in dataset. Any help would be greatly appreciated.
#library defs
library(gganimate)
library(ggplot2)
library(transformr)
#initialization for distribution, rolls, and vectors
k = 2
meanr = 1/k
sdr = 1/k
br = sdr/10
rolls <- 200
avg <- 1
dataset <- 1
s <- 1
#loop through to create vectors of sample statistics from 200 samples of size i
#avg is sample average, s is standard deviations of sample means, and dataset is the indexes to run the transition states
for (i in c(1:40)){
for (j in 1:rolls){
avg <- c(avg,mean(rexp(i,k)))
}
dataset <- c(dataset, rep(i,rolls))
s <- c(s,rep(sdr/sqrt(i),rolls))
}
#remove initialized vector information as it was only created to start loops
avg <- avg[-1]
dataset <- dataset[-1]
s <- s[-1]
#dataframe
a <- data.frame(avgf=avg, datasetf = dataset,sf = s)
#plot histogram, density function, and normal distribution
ggplot(a,aes(x=avg,y=s))+
geom_histogram(aes(y = ..density..), binwidth = br,fill='beige',col='black')+
geom_line(aes(y = ..density..,colour = 'Empirical'),lwd=2, stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal', y = s),lwd=2,args=list(mean=meanr,sd = mean(s)))+
scale_y_continuous(labels = scales::percent_format()) +
scale_color_discrete(name = "Densities", labels = c("Empirical", "Normal"))+
labs(x = 'Sample Average',title = 'Sample Size: {closest_state}')+
transition_states(dataset,4,4)+ view_follow(fixed_x = TRUE)
I think it's difficult to use stat_function here because the dnorm function that you are passing includes a grouped variable (mean(s)). There is no way to indicate that you wish to group s by the dataset column, and the transition_states function doesn't filter the whole data frame. You could use transition_filter to filter the whole data frame, but this would be laborious.
It's not much work to just add a dnorm to your input data frame and plot it as a line, particularly since the rest of your code can be simplified substantially. Here's a fully reproducible example:
library(gganimate)
library(ggplot2)
library(transformr)
k <- 2
meanr <- sdr <- 1/k
br <- sdr/10
rolls <- 200
a <- do.call(rbind, lapply(1:40, function(i){
data.frame(avg = replicate(rolls, mean(rexp(i, k))),
dataset = rep(i, rolls),
x = seq(0, 2, length.out = rolls),
s = dnorm(seq(0, 2, length.out = rolls),
meanr, sdr/sqrt(i))) }))
ggplot(a, aes(x = avg, group = dataset)) +
geom_histogram(aes(y = ..density..), fill = 'beige',
colour = "black", binwidth = br) +
geom_line(aes(y = ..density.., colour = 'Empirical'),
lwd = 2, stat = 'density', alpha = 0.5) +
geom_line(aes(x = x, y = s, colour = "Normal"), size = 2, alpha = 0.5) +
scale_y_continuous(labels = scales::percent_format()) +
coord_cartesian(xlim = c(0, 2)) +
scale_color_discrete(name = "Densities", labels = c("Empirical", "Normal")) +
labs(x = 'Sample Average', title = 'Sample Size: {closest_state}') +
transition_states(dataset, 4, 4) +
view_follow(fixed_x = TRUE, fixed_y = TRUE)
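The sd = sdr/sqrt(i) used for the normal curve above is the standard deviation of the mean of i exponential draws with rate k, i.e. (1/k)/sqrt(i). A quick hedged check by simulation (the number of replicates and the sample sizes are arbitrary choices for illustration):

```r
set.seed(42)
k <- 2
sdr <- 1 / k
sample_sizes <- c(1, 10, 40)

# Standard deviation of 5000 simulated sample means, per sample size
empirical_sd <- sapply(sample_sizes, function(i) {
  sd(replicate(5000, mean(rexp(i, k))))
})
theoretical_sd <- sdr / sqrt(sample_sizes)

data.frame(sample_sizes, empirical_sd, theoretical_sd)
```

The two columns should agree closely, which is why the normal curve narrows as the animation moves to larger sample sizes.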

R/ggplot2: Collapse or remove segment of y-axis from scatter-plot

I'm trying to make a scatter plot in R with ggplot2, where the middle of the y-axis is collapsed or removed, because there is no data there. I did it in photoshop below, but is there a way to create a similar plot with ggplot?
This is the data with a continuous scale:
But I'm trying to make something like this:
Here is the code:
ggplot(data=distance_data) +
geom_point(
aes(
x = mdistance,
y = maxZ,
shape = factor(subj),
color = factor(side),
size = (cSA)
)
) +
scale_size_continuous(range = c(4, 10)) +
theme(
axis.text.x = element_text(colour = "black", size = 15),
axis.text.y = element_text(colour = "black", size = 15),
axis.title.x = element_text(colour = "black", size= 20, vjust = 0),
axis.title.y = element_text(colour = "black", size= 20),
legend.position = "none"
) +
ylab("Z-score") +
xlab("Distance")
You could do this by defining a coordinate transformation. A standard example is logarithmic coordinates, which can be achieved in ggplot by using scale_y_log10().
But you can also define custom transformation functions by supplying the trans argument to scale_y_continuous() (and similarly for scale_x_continuous()). To this end, you use the function trans_new() from the scales package. It takes as arguments the transformation function and its inverse.
I discuss first a special solution for the OP's example and then also show how this can be generalised.
OP's example
The OP wants to shrink the interval between -2 and 2. The following defines a function (and its inverse) that shrinks this interval by a factor 4:
library(scales)
trans <- function(x) {
ifelse(x > 2, x - 1.5, ifelse(x < -2, x + 1.5, x/4))
}
inv <- function(x) {
ifelse(x > 0.5, x + 1.5, ifelse(x < -0.5, x - 1.5, x*4))
}
my_trans <- trans_new("my_trans", trans, inv)
This defines the transformation. To see it in action, I define some sample data:
x_val <- 0:250
y_val <- c(-6:-2, 2:6)
set.seed(1234)
data <- data.frame(x = sample(x_val, 30, replace = TRUE),
y = sample(y_val, 30, replace = TRUE))
I first plot it without transformation:
p <- ggplot(data, aes(x, y)) + geom_point()
p + scale_y_continuous(breaks = seq(-6, 6, by = 2))
Now I use scale_y_continuous() with the transformation:
p + scale_y_continuous(trans = my_trans,
breaks = seq(-6, 6, by = 2))
If you want another transformation, you have to change the definition of trans() and inv() and run trans_new() again. You have to make sure that inv() is indeed the inverse of trans(). I checked this as follows:
x <- runif(100, -100, 100)
identical(x, trans(inv(x)))
## [1] TRUE
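Because identical() compares floating-point numbers exactly, a tolerance-based comparison is a safer way to run this check, and it is worth testing both directions. A small sketch using the same trans() and inv() definitions (repeated here so the block runs on its own):

```r
trans <- function(x) {
  ifelse(x > 2, x - 1.5, ifelse(x < -2, x + 1.5, x/4))
}
inv <- function(x) {
  ifelse(x > 0.5, x + 1.5, ifelse(x < -0.5, x - 1.5, x*4))
}

set.seed(1)
x <- runif(100, -100, 100)

# Round trips in both directions, compared with a numerical tolerance
round_trip_1 <- isTRUE(all.equal(x, trans(inv(x))))
round_trip_2 <- isTRUE(all.equal(x, inv(trans(x))))
c(round_trip_1, round_trip_2)
```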
General solution
The function below defines a transformation where you can choose the lower and upper end of the region to be squished, as well as the factor to be used. It directly returns the trans object that can be used inside scale_y_continuous:
library(scales)
squish_trans <- function(from, to, factor) {
trans <- function(x) {
if (any(is.na(x))) return(x)
# get indices for the relevant regions
isq <- x > from & x < to
ito <- x >= to
# apply transformation
x[isq] <- from + (x[isq] - from)/factor
x[ito] <- from + (to - from)/factor + (x[ito] - to)
return(x)
}
inv <- function(x) {
if (any(is.na(x))) return(x)
# get indices for the relevant regions
isq <- x > from & x < from + (to - from)/factor
ito <- x >= from + (to - from)/factor
# apply transformation
x[isq] <- from + (x[isq] - from) * factor
x[ito] <- to + (x[ito] - (from + (to - from)/factor))
return(x)
}
# return the transformation
return(trans_new("squished", trans, inv))
}
The first line in trans() and inv() handles the case when the transformation is called with x = c(NA, NA). (It seems that this did not happen with the version of ggplot2 I used when I originally wrote this answer. Unfortunately, I don't know with which version this started.)
This function can now be used to conveniently redo the plot from the first section:
p + scale_y_continuous(trans = squish_trans(-2, 2, 4),
breaks = seq(-6, 6, by = 2))
The following example shows that you can squish the scale at an arbitrary position and that this also works for other geoms than points:
df <- data.frame(class = LETTERS[1:4],
val = c(1, 2, 101, 102))
ggplot(df, aes(x = class, y = val)) + geom_bar(stat = "identity") +
scale_y_continuous(trans = squish_trans(3, 100, 50),
breaks = c(0, 1, 2, 3, 50, 100, 101, 102))
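To see numerically what the squished scale does in the bar chart example, the forward mapping can be evaluated on a few values. This is a simplified sketch of the same piecewise formula as a plain function (the name squish is made up, and the NA handling and the scales wrapper are left out):

```r
# Values between from and to are compressed by factor;
# values at or above to are shifted down so the scale stays continuous
squish <- function(x, from, to, factor) {
  out <- x
  inside <- x > from & x < to
  above <- x >= to
  out[inside] <- from + (x[inside] - from) / factor
  out[above] <- from + (to - from) / factor + (x[above] - to)
  out
}

squish(c(0, 3, 50, 100, 102), from = 3, to = 100, factor = 50)
# 0 and 3 are unchanged; 50 maps to 3.94, 100 to 4.94 and 102 to 6.94
```

So the 97 units between 3 and 100 take up less than 2 units of axis length, while distances outside that interval are preserved.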
Let me close by stressing what others have already mentioned in the comments: this kind of plot could be misleading and should be used with care!
