I have a figure created with facet_wrap visualizing the estimated density of many groups. Some of the groups have a much smaller variance than others, which makes the x-axis unreadable for some panels. Minimal reproducible example:
library(tidyverse)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.00001)
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group,scales="free")
The obvious solution to the problem is to increase the figure size, so that everything becomes readable. However, there are too many panels to make this a useful solution. My favourite solution would be to control the number of axis ticks, for example allow for only two ticks on all x-axes. Is there a way to accomplish this?
Edit after suggestions:
Adding + scale_x_continuous(n.breaks = 2) looks like it should exactly do what I want, but it actually does not:
Following the answer in the suggested question Change the number of breaks using facet_grid in ggplot2, I end up with two axis ticks, but undesirably many decimal points:
equal_breaks <- function(n = 3, s = 0.5, ...){
function(x){
# rescaling
d <- s * diff(range(x)) / (1+2*s)
seq(min(x)+d, max(x)-d, length=n)
}
}
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group,scales="free") + scale_x_continuous(breaks=equal_breaks(n=3, s=0.05), expand = c(0.05, 0))
You can add if(seq[2]-seq[1] < 10^(-r)) seq else round(seq, r) to the function equal_breaks developed here.
By doing so, you will round your labels on the x-axis only if the difference between them is above a threshold 10^(-r).
equal_breaks <- function(n = 3, s = 0.05, r = 0,...){
function(x){
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n)
if(seq[2]-seq[1] < 10^(-r)) seq else round(seq, r)
}
}
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=3, s=0.05, r=0))
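To see what the threshold does, the break function can be called directly on an x-range instead of going through ggplot. The sketch below repeats the definition above, with the inner variable renamed to `brks` (an illustrative choice so that `seq` is not shadowed), and applies it to two made-up ranges, one wide and one very narrow.

```r
# Same logic as equal_breaks above; `brks` is a renamed local variable.
equal_breaks <- function(n = 3, s = 0.05, r = 0, ...) {
  function(x) {
    d <- s * diff(range(x)) / (1 + 2 * s)
    brks <- seq(min(x) + d, max(x) - d, length.out = n)
    # round only if neighbouring breaks are at least 10^(-r) apart
    if (brks[2] - brks[1] < 10^(-r)) brks else round(brks, r)
  }
}

brk_fun <- equal_breaks(n = 3, s = 0.05, r = 0)

# Hypothetical wide range: the spacing exceeds 1, so the breaks are rounded
wide <- brk_fun(c(-4, 4))

# Hypothetical narrow range: the spacing is below 1, so the breaks stay unrounded
narrow <- brk_fun(c(1.99996, 2.00004))

print(wide)
print(narrow)
```

With r = 0 the wide range ends up with integer breaks, while the narrow range keeps all of its decimals, which is why the narrow panels can still show long labels.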
As you rightly pointed out, this answer gives only two alternatives for the number of digits. Another possibility is to return round(seq, -floor(log10(abs(seq[2]-seq[1])))), which gets the "optimal" number of digits for every facet.
equal_breaks <- function(n = 3, s = 0.1,...){
function(x){
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n)
round(seq, -floor(log10(abs(seq[2]-seq[1]))))
}
}
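x3 <- rnorm(1e4,mean=2,sd=0.01) # third group with an intermediate spread (assumed here; not defined earlier in this answer)
data.frame(x=c(x1,x2,x3),group=c(rep("1",length(x1)),rep("2",length(x2)),rep("3",length(x3)))) %>%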
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=3, s=0.1))
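The expression -floor(log10(abs(step))) can be checked in isolation to see how the number of digits follows the spacing between breaks. This is a minimal sketch; `digits_for_step` is a hypothetical helper name and the spacings are made-up values.

```r
# Hypothetical helper: number of digits handed to round() for a given
# spacing between neighbouring breaks.
digits_for_step <- function(step) -floor(log10(abs(step)))

d_units <- digits_for_step(3.6)      # spacing of a few units
d_hundredths <- digits_for_step(0.02)  # spacing of a few hundredths
d_tiny <- digits_for_step(3.6e-5)    # spacing in a very narrow facet

print(c(d_units, d_hundredths, d_tiny))

# Rounding the breaks of a narrow facet with the derived number of digits
brks <- c(1.9999636, 2.0000000, 2.0000364)
rounded <- round(brks, digits_for_step(brks[2] - brks[1]))
print(rounded)
```

The number of digits grows as the spacing shrinks, so every facet keeps exactly one significant digit of its break spacing.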
Thanks so much for so many helpful suggestions and great answers! I figured out a solution that works for arbitrarily complex datasets (at least I hope so) by modifying the approach by @Maël and borrowing the great function by RHertel from Count leading zeros between the decimal point and first nonzero digit.
Rounding to the first significant decimal point leads to highly asymmetric ticks in some cases, therefore I rounded to the second significant decimal point.
library(tidyverse)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.000001)
x3 <- rnorm(1e4,mean=2,sd=0.01)
zeros_after_period <- function(x) {
if (isTRUE(all.equal(round(x),x))) return (0) # y would be -Inf for integer values
y <- log10(abs(x)-floor(abs(x)))
ifelse(isTRUE(all.equal(round(y),y)), -y-1, -ceiling(y))} # corrects case ending with ..01
equal_breaks <- function(n,s){
function(x){
x=x*10000
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n) / 10000
round(seq,zeros_after_period(seq[2]-seq[1])+2)
}
}
data.frame(x=c(x1,x2,x3),group=c(rep("1",length(x1)),rep("2",length(x2)),rep("3",length(x3)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=2, s=0.1))
Apologies for answering my own question ... but that would not have been possible without the great help from the community :-)
One option to achieve your desired result would be to use a custom breaks and limits function which builds on scales::breaks_extended to first get pretty breaks for the range and then makes use of seq to get the desired number of breaks. However, depending on the desired number of breaks this simple approach will not ensure that we end up with pretty breaks:
library(ggplot2)
set.seed(123)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.00001)
mylimits <- function(x) range(scales::breaks_extended()(x))
mybreaks <- function(n = 3) {
function(x) {
breaks <- mylimits(x)
seq(breaks[1], breaks[2], length.out = n)
}
}
d <- data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2))))
ggplot(d) +
geom_density(aes(x=x)) +
scale_x_continuous(breaks = mybreaks(n = 3), limits = mylimits) +
facet_wrap(~group,scales="free")
When plotting the ratio between two variables, their relative order is often of no concern, yet depending on which variable is in the numerator, its relative size is constrained either to (0,1) or (1, Inf), which is somewhat unintuitive and breaks symmetry. I want to plot ratios "symmetrically", without resorting to symmetric log-scale, by having a y-axis that goes like 1/4, 1/3, 1/2, 1, 2, 3, 4 or, equivalently, 4^-1, 3^-1, 2^-1, 1, 2, 3, 4 in regular intervals. I've come up with the following:
symmult <- function(x){
isf <- is.finite(x) & (x>0)
xf <- x[isf]
xf <- ifelse(xf>=1,
xf-1,
1-(1/xf))
x[isf] <- xf
x[!isf] <- NA
x[!is.finite(x)] <- NA
return(x)
}
symmultinv <- function(x){
isf <- is.finite(x)
xf <- x[isf]
xf <- ifelse(x[isf]>=0,
x[isf]+1,
-1/(x[isf]-1))
x[isf] <- xf
x[!isf] <- NA
x[!is.finite(x)] <- NA
return(x)
}
library(scales) # provides trans_new()
sym_mult_trans = function(){trans_new("sym_mult", symmult, symmultinv )}
x <- c(-4:-2, 1:4)
x[x<1] <- 1/abs(x[x<1])
ggplot() +
geom_point(aes(x=x, y=x)) +
scale_y_continuous(trans="sym_mult")
The transformation works, but I cannot get the axis labels etc. to work for any 0<x<1, without setting them manually. Any help would be greatly appreciated.
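As a quick check that the transformation places the ratios at regular intervals, the two maps can be restated compactly and applied to the ratios from the question. This is only a sketch for positive, finite inputs; `sym_fwd` and `sym_inv` are hypothetical names, not part of the code above.

```r
# Compact restatement of the forward and inverse maps (positive inputs only)
sym_fwd <- function(x) ifelse(x >= 1, x - 1, 1 - 1 / x)
sym_inv <- function(y) ifelse(y >= 0, y + 1, 1 / (1 - y))

ratios <- c(1/4, 1/3, 1/2, 1, 2, 3, 4)

pos <- sym_fwd(ratios)   # positions on the transformed axis
back <- sym_inv(pos)     # round trip back to the ratios

print(pos)
print(back)
```

The ratios 1/4 ... 4 land on equally spaced positions around 0, and the inverse recovers them, so only the breaks and labels remain to be sorted out.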
You can create bespoke 'breaks' and 'format' functions that you can use inside trans_new (or pass to scale_y_continuous directly via its breaks and labels parameters).
For the breaks function, remember it will take as input a length-two numeric vector representing the range of the y axis. You must then convert this to a number of appropriate breaks. Here, if the minimum of the range is less than one, we take its reciprocal, find the pretty breaks between one and that number, then take the reciprocal of the output. We concatenate that onto pretty breaks between 1 and our range maximum:
# Define breaks function
symmult_breaks <- function(x) {
c(1 / extended_breaks(5)(c(1/x[x < 1], 1)),
extended_breaks(5)(c(1, x[x >= 1])))
}
For the labelling function, remember, it needs to take as input the vector of numbers produced by our breaks function. We can paste a 1/ in front of the reciprocal of numbers less than one, but leave numbers of 1 or more unaltered:
# Define labelling function
symmult_labs <- function(x) {
labs <- character(length(x))
labs[x >= 1] <- as.character(x[x >= 1])
labs[x < 1] <- paste("1", as.character(1/x[x < 1]), sep = "/")
labs
}
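To see what the labelling function returns, it can be called on a few break values by hand. The sketch below repeats the definition so that it runs on its own; the input values are made up.

```r
# Same labelling function as above, repeated so the example is self-contained
symmult_labs <- function(x) {
  labs <- character(length(x))
  labs[x >= 1] <- as.character(x[x >= 1])
  labs[x < 1]  <- paste("1", as.character(1 / x[x < 1]), sep = "/")
  labs
}

# Hypothetical break values on both sides of 1
example_labels <- symmult_labs(c(0.25, 0.5, 1, 2, 4))
print(example_labels)
```

Values below one come back as fractions such as "1/4", while values of one or more are passed through unchanged.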
So your full new transformation becomes:
# Use our four functions to define the whole transformation:
sym_mult_trans <- function() {
trans_new(name = "sym_mult",
transform = symmult,
inverse = symmultinv,
breaks = symmult_breaks,
format = symmult_labs)
}
And your plot becomes:
ggplot() +
geom_point(aes(x = x, y = x)) +
scale_y_continuous(trans = "sym_mult")
I need to scan nearly a million datapoints and determine whether they lie below or above a threshold. I have the threshold defined globally and a simple predefined function
function.lower.penalty <- function(i,j){
  # if the gate condition is met, set the gate flag:
  if( i < j ){ n <- 1 }else{ n <- 0 }
  return(n)
}
that I call with mapply, which will write a 0/1 flag column in my dataframe:
df[, paste0("outside.highpass")] <- mapply(function.lower.penalty,i="somesignal.found.in.df", j="*some.threshold.found.in.df*" )
This is pretty straightforward: I can flag dozens of signals with their respective thresholds like this in a second big dataframe. Depending on how the threshold is written, the code flags the signals as either below or above the threshold (so I also have a function.higher.penalty).
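For a constant or per-row threshold, the same 0/1 flags can usually be produced without mapply, because comparison operators in R are already vectorised. A minimal sketch, assuming hypothetical column names `signal` and `threshold` in place of the real ones:

```r
# Hypothetical data: `signal` and `threshold` stand in for the real columns
df <- data.frame(signal = c(0.05, 0.30, 0.19, 0.20),
                 threshold = 0.2)

# `<` compares whole columns at once, so no per-element function call is needed
df$outside.highpass <- as.integer(df$signal < df$threshold)

print(df)
```

On a million rows this avoids one R function call per datapoint, which is typically where most of the time goes.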
Now I was asked to make a more complex threshold that has the shape of a multisegmented line.
What is the fastest way to flag datapoints given that I have only the corner points of the multisegmented line (I can estimate them from how the line was drawn), visible in the picture?
Until now I had a predefined threshold (gray 0.2) and used mapply to scan the signal drawn on the x-axis. I just used a function to return 0 or 1 if datapoint was smaller or bigger than the threshold. Now I need a multisegmented line like the one drawn in red to do the same job.
Edit: Using the suggestion from det I was able to flag datapoints in the dataframe. However, it seems that some datapoints close to the defined line are wrongly assigned (see the linked plot). I am wondering how to work around this, or whether it is just a drawing error.
You can create a function that returns a piecewise linear function based on the points:
picewiseLinear <- function(x.var, y.var){
stopifnot(length(x.var) == length(y.var), sum(duplicated(x.var)) == 0)
p <- order(x.var)
x.var <- x.var[p]
y.var <- y.var[p]
k <- diff(y.var) / diff(x.var)
l <- -1 * k * head(x.var, -1) + head(y.var, -1)
function(x){
ind <- findInterval(x, x.var)
if(!all(between(ind, 1, length(x.var) - 1))) stop("wrong input")
x * k[ind] + l[ind]
}
}
For example:
point_df <- tribble(
~x, ~y,
3, 0,
5, 2,
3, 3,
5, 4
)
f <- picewiseLinear(point_df$y, point_df$x)
(in your picture the line is piecewise linear, but with x as the dependent variable, which is why x and y are swapped in the call above)
and on example dataset you get something like this:
set.seed(123)
tibble(
x = runif(1000, 0, 6),
y = runif(1000, 0, 4)
) %>%
mutate(color = ifelse(x > f(y), "red", "blue")) %>%
ggplot(aes(x, y)) +
geom_point(aes(color = color)) +
scale_color_identity() +
geom_path(data = point_df)
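Base R's approxfun also builds a piecewise linear interpolator from corner points, so it might serve as a dependency-free alternative to the helper above. The sketch below is an assumption-laden illustration, not the answer's method: it reuses the corner points of point_df, and the points being flagged are made up.

```r
# Corner points of the threshold line (same values as point_df above)
corner_x <- c(3, 5, 3, 5)
corner_y <- c(0, 2, 3, 4)

# x as a piecewise linear function of y; outside the y-range it returns NA
f_approx <- approxfun(corner_y, corner_x)

# Hypothetical points to flag
pts <- data.frame(x = c(4.5, 3.5, 4.5), y = c(1, 1, 2.5))
pts$right_of_line <- pts$x > f_approx(pts$y)

print(pts)
```

Unlike the helper above, which stops with an error for inputs outside the corner points, this version yields NA there, so out-of-range datapoints would need to be handled explicitly.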
I need to modify the scale of the y-axis in a ggplot2 graphic : I want to express the y-axis in thousands and not in units. For example, the labels have to be 0 ; 1,000 ; 2,000 ; 3,000 instead of 0 ; 1000000 ; 2000000 ; 3000000.
Please, don't tell me to divide my data by 1000 !
My question is the same as ggplot2 axis transformation by constant factor. But the solution provided there modifies the labels parameter of the scale_y_continuous function, whereas I need this parameter to be set to comma. With this solution I get the following breaks: 0 ; 1000 ; 2000 ; 3000 ... Breaks are expressed in thousands and not in millions, which is good, but I lose the comma labels. I want to see 1,000 ; 2,000 ; 3,000 and not 1000 ; 2000 ; 3000...
So modifying the labels parameter of the scale_y_continuous function isn't useful. That's why I think I have to work with the trans parameter of the scale_y_continuous function instead of the labels parameter.
There are a lot of built-in transformations in the scales package that can be passed to the trans parameter and solve similar problems (look at log_trans for example). So I tried to build my own homothetic transformation, with the code below.
library(ggplot2)
var0 <- c(1:100)
var1 <- 1000000*rnorm(100)
homothetic_breaks<- function (n = 5, base = 1000)
{
function(x) {
rng <- (range(x, na.rm = TRUE)/base)
min <- floor(rng[1])
max <- ceiling(rng[2])
if (max == min)
return(base*min)
by <- floor((max - min)/n) + 1
base*seq(min, max, by = by)
}
}
homothetic_trans <- function(base = 1000) {
trans <- function(x) x/base
inv <- function(x) x*base
trans_new(paste0("diviseur_par_", format(base)), trans, inv,
homothetic_breaks(base=base), domain = c(-Inf, Inf))
}
data <- data.frame(var0,var1)
p <- ggplot(data,aes(var0,var1))+geom_path()
p <- p + scale_y_continuous(trans=homothetic_trans,labels = comma)
p
When I run this code I get the following message:
"Error: Input to str_c should be atomic vectors", and the breaks of the y axis are the same as the ones I get when I run the following code:
library(ggplot2)
var1 <- 1000*rnorm(100)
var0 <- c(1:100)
data <- data.frame(var0,var1)
p <- ggplot(data,aes(var0,var1))+geom_path()
p
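One possibility that avoids a custom transformation is a single labelling function that both divides by 1000 and inserts the comma separators, since the labels parameter accepts any function of the break values. This is a sketch rather than a confirmed fix for the error above; `thousands_label` is a hypothetical helper built on base R's format, and the plotting part only runs if ggplot2 is installed.

```r
# Hypothetical labelling helper: divide by 1000, then add comma separators
thousands_label <- function(x) {
  format(x / 1000, big.mark = ",", scientific = FALSE, trim = TRUE)
}

example_labels <- thousands_label(c(0, 1e6, 2e6, 3e6))
print(example_labels)

# Possible use in the plot from the question (skipped if ggplot2 is missing)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  set.seed(1)
  data <- data.frame(var0 = 1:100, var1 = 1000000 * rnorm(100))
  p <- ggplot(data, aes(var0, var1)) +
    geom_path() +
    scale_y_continuous(labels = thousands_label)
}
```

The data stay in units; only the text drawn next to each break changes.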
I am currently working on a script that will take in Windows Perfmon Data, and plot graphs from this data, as I have found the PAL tool far too slow.
This is my first pass and is quite basic at the moment.
I am struggling with the scaling of the y axis. I am currently getting horrible graphs like this:
How can I scale the Y axis appropriately so that there are reasonable breaks etc with data between 0 and 1. (e.g 0.0000123,0.12,0.98,0.00000024) etc?
I was hoping for something dynamic like:
scale_y_continuous(breaks = c(min(d[,i]), 0, max(d[,i])))
Error in Summary.factor(c(1L, 105L, 181L, 125L, 699L, 55L, 270L, 226L, :
min not meaningful for factors
Any help appreciated.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlot_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
p <- qplot(d[,"Time"],y=d[,i], data=d, xlab="Time",ylab="", main=colnames(d[i]))
p <- p + geom_hline()
p <- p + scale_y_continuous(breaks = c(min(d[,i]), 0, max(d[,i])))
print(p)
}
dev.off()
In order to get reasonable breaks between 0 and 1, you can for example use:
scale_y_continuous(breaks=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
A rewritten plot-part of your code:
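ggplot(d, aes(x=Time, y=d[,i])) +
geom_point() + # draw the data, as qplot did by default
geom_hline(yintercept = 0) + # recent ggplot2 versions need an explicit yintercept; 0 is assumed here
scale_y_continuous(breaks=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)) +
labs(title=colnames(d[i]), x="Time",y="")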
And a more dynamic way of setting the breaks:
scale_y_continuous(breaks=seq(from=round(min(d[,i]),1), to=round(max(d[,i]),1), by=0.1))
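However, the error message shows that the y-variables are factors, so you have to convert them to numeric first. Note that as.numeric applied directly to a factor returns the internal level codes, so use as.numeric(as.character(...)) or read the file with stringsAsFactors=FALSE.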
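The difference between converting a factor directly and converting it via its labels is easy to see on a small made-up vector:

```r
# Hypothetical factor column, as read.table may produce for text-like input
f <- factor(c("0.12", "0.98", "0.0000123"))

codes <- as.numeric(f)                 # integer level codes, not the values
vals  <- as.numeric(as.character(f))   # the numbers actually stored as labels

print(codes)
print(vals)
```

The first result is just the position of each label among the sorted levels, so using it for min, max or breaks would silently give meaningless axis values.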
Here is the code I ended up with after a bit of playing in case anyone wants to be able to do the same:
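The key to making it dynamic was the following (note the as.numeric to avoid any errors; it returns the actual values here because the file is read with stringsAsFactors=FALSE in the full script below):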
ynumeric <- as.numeric(d[,i])
ymin <- min(ynumeric,na.rm = TRUE)
ymax <- max(ynumeric,na.rm = TRUE)
#generate sequence of 10
ybreaks <- seq(ymin, ymax, length.out = 10)
#Then passing this to the y_continuous function
p <- p + scale_y_continuous(breaks=c(ybreaks))
I hope to expand this in the future to be somewhere in the region of PALs complexity, but using R for efficiency.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE,stringsAsFactors=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlotData_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
ynumeric <- as.numeric(d[,i])
ymin <- min(ynumeric,na.rm = TRUE)
ymax <- max(ynumeric,na.rm = TRUE)
#generate sequence of 10
ybreaks <- seq(ymin, ymax, length.out = 10)
print(ybreaks)
print(paste(ymin,ymax))
p <- qplot(d[,"Time"],y=ynumeric, data=d, xlab="Time",ylab="", main=colnames(d[i]))
p <- p + geom_smooth(size=3,se=TRUE) + theme_bw()
p <- p + scale_y_continuous(breaks=c(ybreaks))
print(p)
}
dev.off()
I am simulating a random walk starting from coordinates(0,0). When I do it with a loop it works well:
require(ggplot2)
n <- 1000 #number of walks
# first solution, w/ loop... works but is SLOOOW
coord <- data.frame (x=0, y=0, step=0) #origin
for (i in 1:n){
dir <- sample(c("w", "e", "n", "s"), 1) #random direction
step <- sample(1:4, 1) #how far to go in each walk
startx <- coord[nrow(coord), 1]
starty <- coord[nrow(coord), 2]
endx <- ifelse (dir=="w", startx-step, ifelse(dir=="e", startx+step, startx))
endy <- ifelse (dir=="n", starty+step, ifelse(dir=="s", starty-step, starty))
newcoord <- data.frame (x=endx, y=endy, step=step)
coord <- rbind(coord, newcoord)
}
rw <- ggplot(coord, aes(x=x, y=y))
rw + geom_path() +
ggtitle(paste(n, "walks")) +
geom_point(aes(x=0, y =0), color="green", size=I(5)) +
geom_point(aes(x=endx, y =endy), color="red", size=I(5))
However, with n>10,000 it gets very slow, so I would like to avoid the loop and use some form of 'apply', but I can't figure out how to add the values of coordinates from rows n and n-1. Please help, thank you.
# second solution
d <- data.frame(dir=sample(c("w", "e", "n", "s"), n, replace=T), step=sample(1:4, n, replace=T))
xy <- data.frame(x=0, y=0)
x. <- data.frame(x=with(d, ifelse (dir=="w", -step, ifelse(dir=="e", step, 0))))
y. <- data.frame(y=with(d, ifelse (dir=="s", -step, ifelse(dir=="n", step, 0))))
x.y. <- cbind(x.,y.)
xy <- rbind(xy, x.y.)
head(xy)
# ... stuck here
data.table is fast for this kind of problem...
library(data.table)
walk.dt.f<-function(n=10000L, stepsize=1L:4L) {
# lookup table with direction vector info
dir.dt<-data.table(dir=c("w", "e", "n", "s"), xdir=c(-1L,1L,0L,0L), ydir=c(0L,0L,1L,-1L), key="dir")
# initial position for random walk table
walk.ini.dt<-data.table(rowid=0L,dir="n",step=0L)
# generate table with random walk info
walk.dt<-rbindlist(list(data.table(rowid=1L:n, dir=sample(dir.dt[,dir],n,replace=T), step=sample(stepsize,n,replace=T)), walk.ini.dt))
# join the two tables, and multiply the step info by the direction vectors
setkey(walk.dt,dir)
walk.dt[dir.dt,c("xstep","ystep"):=list(step*xdir,step*ydir)]
# update the key and reorder the rows
setkey(walk.dt,rowid)
# update the walk info table with the cumulative position
walk.dt[,c("x","y"):=list(cumsum(xstep),cumsum(ystep))]
}
system.time(walk.dt.f(10000L))
## user system elapsed
## 0 0 0
system.time(walk.dt.f(1e6L))
## user system elapsed
## 0.25 0.00 0.25
Edit: Set the starting position at (0,0)
I think you are getting close. If you read the comments already posted you can make it much faster. So I recommend not looking at this:
n=10000
x.=sample(-4:4,n,rep=T)
y.=sample(-4:4,n,rep=T)
x=cumsum(x.)
y=cumsum(y.)
coord=data.frame(x,y)
Then plot exactly how you did:
rw <- ggplot(coord, aes(x=x, y=y))
rw + geom_path() +
ggtitle(paste(n, "walks")) +
geom_point(aes(x=0, y =0), color="green", size=I(5)) +
geom_point(aes(x=x[n], y =y[n]), color="red", size=I(5)) # end point of the walk; startx/starty are not defined in this version
update: the plotting is quite slow for n bigger than 10^5. Maybe base graphics would be faster.
update2: this is almost exactly as slow/fast as joran's response.
Gah! In the hopes that this will further my goal of squashing out the stupid "for loops are inherently slow" canard for R, here is a re-working of your first version, still using a for loop, that is nearly 40 times faster.
I haven't even considered whether your implementation of a random walk makes sense at all. My point here is simply to point out how you could achieve the results of your original code, much much faster, while still using a "slow" for loop.
#My version
foo <- function(n){
coord <- matrix(NA,nrow = n,ncol = 3) #origin
coord[1,] <- c(0,0,0)
dir <- sample(c("w", "e", "n", "s"), n,replace = TRUE) #random direction
step <- sample(1:4, n,replace = TRUE) #how far to go in each walk
for (i in 2:n){
startx <- coord[i-1, 1]
starty <- coord[i-1, 2]
endx <- ifelse (dir[i]=="w", startx-step[i], ifelse(dir[i]=="e", startx+step[i], startx))
endy <- ifelse (dir[i]=="n", starty+step[i], ifelse(dir[i]=="s", starty-step[i], starty))
coord[i,] <- c(endx,endy,step[i])
}
}
#Your version
foo2 <- function(n){
coord <- data.frame (x=0, y=0, step=0) #origin
for (i in 1:n){
dir <- sample(c("w", "e", "n", "s"), 1) #random direction
step <- sample(1:4, 1) #how far to go in each walk
startx <- coord[nrow(coord), 1]
starty <- coord[nrow(coord), 2]
endx <- ifelse (dir=="w", startx-step, ifelse(dir=="e", startx+step, startx))
endy <- ifelse (dir=="n", starty+step, ifelse(dir=="s", starty-step, starty))
newcoord <- data.frame (x=endx, y=endy, step=step)
coord <- rbind(coord, newcoord)
}
}
system.time(foo(10000))
user system elapsed
0.353 0.001 0.353
> system.time(foo2(10000))
user system elapsed
11.374 2.061 13.308
All I've done here is:
STOP. USING. RBIND. And pre-allocate.
Switch to matrices.
Move sample calls outside of loop.
Since you are simulating a 2-D random walk, there are 4x4 possible displacements (four directions times four step sizes), which can be encoded with 16 numbers. To reduce the computation needed to map these codes to a direction and a step size I played a little trick: I did not encode the steps as 1:16, but as c(-8:-1,4:11), so that integer division by 4 yields the direction (-2, -1, 1 or 2) and the remainder yields the step size:
d <- sample(c(-8:-1,4:11),n,replace=T)
delta <- d%%4+1
dir <- d%/%4
xd <- dir
xd[xd%%2 ==0]=0
yd <- dir
yd[xd%%2 ==1]=0
yd <- yd/2
x=c(0,xd*delta)
y=c(0,yd*delta)
x=cumsum(x)
y=cumsum(y)
coords<-data.frame(x,y)
This version uses only vectorized operations and has very little overhead. I think it performs close to the data.table based solution given before.
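A quick way to check an encoding of this kind is to decode every code once and look at the resulting displacements. The sketch below does that for the codes c(-8:-1, 4:11), which is taken here as an assumption; `decode_steps` is a hypothetical helper that follows the same modulo and integer-division idea.

```r
# Hypothetical helper: map each code to a displacement (dx, dy)
decode_steps <- function(d) {
  delta <- d %% 4 + 1    # step size 1..4
  dir   <- d %/% 4       # -2 = south, -1 = west, 1 = east, 2 = north
  dx <- ifelse(abs(dir) == 1, dir, 0) * delta
  dy <- ifelse(abs(dir) == 2, dir / 2, 0) * delta
  data.frame(code = d, dx = dx, dy = dy)
}

steps <- decode_steps(c(-8:-1, 4:11))
print(steps)

# Number of distinct displacements produced by the 16 codes
n_distinct <- nrow(unique(steps[, c("dx", "dy")]))
print(n_distinct)
```

If any code decoded to a zero displacement, or two codes decoded to the same move, the walk would be biased, so a table like this is a cheap sanity check before timing anything.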