Can I tell ggpairs to use log scales?

Can I provide a parameter to the ggpairs function in the GGally package to use log scales for some, not all, variables?

You can't pass such a parameter directly (one reason is that the function that draws the scatter plots, ggally_points, is predefined without a scale), but you can change the scale afterwards using getPlot and putPlot. For instance:
library(ggplot2)
library(GGally)
custom_scale <- ggpairs(data.frame(x=exp(rnorm(1000)), y=rnorm(1000)),
                        upper=list(continuous='points'), lower=list(continuous='points'))
subplot <- getPlot(custom_scale, 1, 2) # retrieve the chart in row 1, column 2 (top right)
subplotNew <- subplot + scale_y_log10() # change the scale to log
subplotNew$type <- 'logcontinuous' # otherwise ggpairs comes back to a fixed scale
subplotNew$subType <- 'logpoints'
custom_scale <- putPlot(custom_scale, subplotNew, 1, 2) # put the modified chart back

This is essentially the same answer as Jean-Robert's, but it looks much simpler (more approachable). I don't know whether it is a new feature, but it doesn't look like you need getPlot or putPlot anymore.
custom_scale[1,2]<-custom_scale[1,2] + scale_y_log10() + scale_x_log10()
Here is a function to apply it across a big matrix. Supply the number of rows in the plot (x) and the plot object (g).
scalelog2 <- function(x=2, g){
  for (i in 2:x){ # below the diagonal
    for (j in 1:(i-1)) {
      g[i,j] <- g[i,j] + scale_x_continuous(trans='log2') +
        scale_y_continuous(trans='log2')
    }
  }
  for (i in 1:x){ # the bottom row
    g[(x+1),i] <- g[(x+1),i] + scale_y_continuous(trans='log2')
  }
  for (i in 1:x){ # the diagonal
    g[i,i] <- g[i,i] + scale_x_continuous(trans='log2')
  }
  return(g)
}

It's probably better to use a linear scale and log-transform the variables as appropriate before supplying them to ggpairs, because this avoids ambiguity about how the correlation coefficients were computed (before or after the log transform).
This can easily be achieved, e.g. like this:
library(tidyverse)
log10_vars <- vars(ends_with(".Length")) # define variables to be transformed
iris %>% # use standard R example dataframe
mutate_at(log10_vars, log10) %>% # log10 transform selected columns
rename_at(log10_vars, sprintf, fmt="log10 %s") %>% # rename variables accordingly
GGally::ggpairs(aes(color=Species))
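To see why the order of operations matters, here is a minimal base-R sketch. The simulated variables x and y are made up for illustration and are not part of the answer above. It compares the Pearson correlation computed on the raw values with the one computed after a log10 transform:

```r
set.seed(1)
z <- rnorm(1000)
x <- exp(z)                      # log-normally distributed variable (hypothetical data)
y <- z + rnorm(1000, sd = 0.1)   # roughly linear in log(x), not in x

cor_raw <- cor(x, y)             # correlation on the original scale
cor_log <- cor(log10(x), y)      # correlation after the log10 transform

# The two coefficients differ, so a plot should make clear which one it reports
print(c(raw = cor_raw, log10 = cor_log))
```

Transforming the columns first, as in the pipeline above, means the coefficients shown by ggpairs and the axes of the panels refer to the same (transformed) values.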

Related

Remove whiskers in boxplot made by ggboxplot (of ggpubr package)?

Sometimes it is appropriate to show a boxplot without the whiskers.
In geom_boxplot (ggplot2) we can achieve this with coef=0.
Is there a way to achieve this in ggboxplot (ggpubr v0.5.0, current version at the time of writing)?
I note that ggboxplot has much in common with geom_boxplot,
such as the ability to use outlier.shape=NA in each case to suppress
outliers. It seems that there should be an easy way to also suppress the whiskers.
I cannot find a way to do this directly in ggboxplot, which is a bit strange because it passes the ellipsis on to a geom_boxplot call, so I am not sure why coef=0 does not reach it and suppress the whiskers.
As a stopgap, you can modify the ggplot object created by ggboxplot and remove whiskers that way.
The following function shows this:
ggboxplot_whisker_opt <- function(...)
{
  opts <- list(...)
  # check if the user specified a whisker argument and set options accordingly
  if("whisker" %in% names(opts))
  {
    whisk <- opts$whisker
    opts$whisker <- NULL
  } else {
    whisk <- TRUE
  }
  pl <- do.call(ggboxplot,opts) # create plot by calling ggboxplot with all user options
  if(!whisk)
  {
    pl_list <- ggplot_build(pl) # get the built (listed) version of the ggplot object to modify
    pl_list$data[[1]]$ymin <- NA # remove the ymin/ymax that specify the whiskers
    pl_list$data[[1]]$ymax <- NA
    pl <- ggplot_gtable(pl_list) # convert to a gtable for drawing (no longer a ggplot object)
  }
  # draw the plot
  plot(pl)
}
We can now call that function with whisker=TRUE/FALSE, or without the argument, and it produces plots accordingly:
set.seed(123)
x <- rnorm(100)
labels <- round(runif(100,1,2))
df <- data.frame(labels=labels,
value=x)
ggboxplot_whisker_opt(df,"labels","value")
# is the same as
ggboxplot_whisker_opt(df,"labels","value",whisker=TRUE)
ggboxplot_whisker_opt(df,"labels","value",whisker=FALSE)

In R, how can I tell if the scales on a ggplot object are log or linear?

I have many ggplot objects where I wish to print some text (which varies from plot to plot) in the same relative position on each plot, regardless of scale. What I have come up with to make it simple is to:
1. Define a rescale function (call it sx) that takes the relative position I want and returns that position on the plot's x axis.
sx <- function(pct, range=xr){
  position <- range[1] + pct*(range[2]-range[1])
}
2. Make the plot without the text (call it plt).
3. Use the ggplot_build function to find the x scale's range.
xr <- ggplot_build(plt)$layout$panel_params[[1]]$x.range
4. Then add the text to the plot.
plt <- plt + annotate("text", x=sx(0.95), ....)
This works well for me, though I'm sure there are other solutions folks have derived. I like the solution because I only need to add one step (step 3) to each plot. And it's a simple modification to the annotate command (x goes to sx(x)).
If someone has a suggestion for a better method I'd like to hear about it. There is one thing about my solution though that gives me a little trouble and I'm asking for a little help:
My problem is that I need a separate function for log scales (call it lx). It's a bit of a pain because every time I want to change the scale I need to modify the annotate commands (change sx to lx), and occasionally there are many. This could easily be solved in the sx function if there were a way to tell what the type of scale is. For instance, is there a parameter in ggplot_build objects that describes the log/lin nature of the scale? That seems to be the best place to find it (that's where I'm pulling the scale's range from), but I've looked and cannot figure it out. If there were, I could add a command to step 3 above to define the scale type, and add a tag to the sx function in step 1. That would save me some tedious work.
So, just to reiterate: does anyone know how to tell the scaling (type of scale: log or linear) of a ggplot object? such as using the ggplot_build command's object?
Suppose we have a list of pre-built plots:
linear <- ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) +
geom_point()
log <- linear + scale_y_log10()
linear <- ggplot_build(linear)
log <- ggplot_build(log)
plotlist <- list(a = linear, b = log)
We can grab information about their position scales in the following way:
out <- lapply(names(plotlist), function(i) {
  # Grab plot, panel parameters and scales
  plot <- plotlist[[i]]
  params <- plot$layout$panel_params[[1]]
  scales <- plot$plot$scales$scales
  # Only keep (continuous) position scales
  keep <- vapply(scales, function(x) {
    inherits(x, "ScaleContinuousPosition")
  }, logical(1))
  scales <- scales[keep]
  # Grab relevant transformations
  out <- lapply(scales, function(scale) {
    data.frame(position = scale$aesthetics[1],
               # And now for the actual question:
               transformation = scale$trans$name,
               plot = i)
  })
  out <- do.call(rbind, out)
  # Grab relevant ranges
  ranges <- params[paste0(out$position, ".range")]
  out$min <- sapply(ranges, `[`, 1)
  out$max <- sapply(ranges, `[`, 2)
  out
})
out <- do.call(rbind, out)
Which will give us:
out
  position transformation plot       min      max
1        x       identity    a 1.8800000 4.520000
2        y       identity    a 4.1200000 8.080000
3        y         log-10    b 0.6202605 0.910835
4        x       identity    b 1.8800000 4.520000
Or if you prefer a straightforward answer:
log$plot$scales$scales[[1]]$trans$name
[1] "log-10"
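Putting this together with the question's sx helper: the table above shows that the range of the log-10 scale is reported in transformed units (0.62 to 0.91 rather than roughly 4.2 to 8.1), so a relative position has to be computed in that space and then mapped back to data units. Below is a minimal sketch of a single helper that does this. The name rel_pos is hypothetical, and it assumes only the "identity" and "log-10" transformations need to be handled:

```r
# Hypothetical helper: relative position (0..1) -> coordinate in data units.
# `range` is the panel range as reported by ggplot_build (transformed units),
# `trans_name` is the transformation name, e.g. scale$trans$name.
rel_pos <- function(pct, range, trans_name = "identity") {
  pos <- range[1] + pct * (range[2] - range[1])   # position in transformed space
  switch(trans_name,
         "identity" = pos,        # linear scale: nothing to undo
         "log-10"   = 10^pos,     # log10 scale: map back to data units
         stop("unsupported transformation: ", trans_name))
}

rel_pos(0.5, c(1, 3))             # linear scale
rel_pos(0.5, c(0, 2), "log-10")   # halfway along a log10 axis spanning 1 to 100
```

With such a helper, the annotate call would not need to change when the scale changes; only the transformation name read from the built plot would differ.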

Save plots as R objects and displaying in grid

In the following reproducible example I try to create a function that builds a ggplot distribution plot and saves it as an R object, with the intention of displaying two plots in a grid.
ggplothist <- function(dat,var1)
{
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data=dat, aes(dat[,var1]))
  distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
  output <- list(distribution,var1,dat)
  return(output)
}
Call to function:
set.seed(100)
df <- data.frame(x = rnorm(100, mean=10),y =rep(1,100))
output1 <- ggplothist(dat=df,var1='x')
output1[1]
All fine until now.
Then I want to make a second plot (note mean=100 instead of the previous 10):
df2 <- data.frame(x = rep(1,1000),y = rnorm(1000, mean=100))
output2 <- ggplothist(dat=df2,var1='y')
output2[1]
Then I try to replot the first distribution, with mean 10.
output1[1]
Why do I get the same distribution as the second plot?
If, however, I take the information contained inside the function, return it, and reset it as global variables, it works.
var1=as.numeric(output1[2]);dat=as.data.frame(output1[3]);p1 <- output1[1]
p1
If anyone can explain why this happens I would like to know. It seems that in order to draw the intended distribution I have to reset the data.frame and variable to what was used to draw the plot. Is there a way to save the plot as an object without having to do this? Luckily I can replot the first distribution,
but I can't plot them both at the same time:
var1=as.numeric(output2[2]);dat=as.data.frame(output2[3]);p2 <- output2[1]
grid.arrange(p1,p2)
ERROR: Error in gList(list(list(data = list(x = c(9.66707664902549, 11.3631137069225, :
only 'grobs' allowed in "gList"
The answer to "Grid of multiple ggplot2 plots which have been made in a for loop" suggests using a list to contain the plots:
ggplothist <- function(dat,var1)
{
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data=dat, aes(dat[,var1]))
  distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
  plot(distribution)
  pltlist <- list()
  pltlist[["plot"]] <- distribution
  output <- list(pltlist,var1,dat)
  return(output)
}
output1 <- ggplothist(dat=df,var1='x')
p1<-output1[1]
output2 <- ggplothist(dat=df2,var1='y')
p2<-output2[1]
output1[1]
Will produce the distribution with mean=100 again instead of mean=10
and:
grid.arrange(p1,p2)
will produce the same Error
Error in gList(list(list(plot = list(data = list(x = c(9.66707664902549, :
only 'grobs' allowed in "gList"
As a last attempt I try to use recordPlot() to record everything about the plot into an object. The following is now inside the function.
ggplothist <- function(dat,var1)
{
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data=dat, aes(dat[,var1]))
  distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
  plot(distribution)
  distribution <- recordPlot()
  output <- list(distribution,var1,dat)
  return(output)
}
This function produces the same errors as before: it depends on resetting the dat and var1 variables to what is needed for drawing the distribution, and similarly it can't be put inside a grid.
I've tried similar things like arrangeGrob() from the question "R saving multiple ggplot2 plots as R-object in list and re-displaying in grid", but with no luck.
I would really like a solution that creates an R object containing the plot, which can be redrawn by itself and used inside a grid without having to reset the variables used to draw the plot each time. I would also like to understand why this is happening, as I don't find it intuitive at all.
The only solution I can think of is to draw the plot as a png file, save it somewhere, and have the function return the path so that it can be reused. Is that what other people are doing?
Thanks for reading, and sorry for the long question.
I found a solution in the question "How can I reference the local environment within a function, in R?". Inserting
localenv <- environment()
inside the function and referencing it in the ggplot call
distribution <- ggplot(data=dat, aes(dat[,var1]), environment = localenv)
made it all work, even with grid.arrange!
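Separately from the environment fix, it may be worth checking how the plot is pulled out of the returned list. In the code above, output1[1] uses single brackets, which return a list of length one rather than the plot itself, and the quoted gList error is what one would expect when a plain list is handed to grid.arrange. A minimal base-R sketch of the difference, using a stand-in object (the class name fake_plot is made up, and no ggplot2 is needed):

```r
# Stand-in for a plot object; "fake_plot" is a hypothetical class for illustration
fake <- structure(list(), class = "fake_plot")
output <- list(fake, 1L, data.frame(x = 1:3))

single <- output[1]     # single brackets: still a list (of length one)
double <- output[[1]]   # double brackets: the element itself

class(single)
class(double)
```

Under that assumption, p1 <- output1[[1]] would be the form to pass on to a function that expects the plot object.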

Looping over attributes vector to produce combined graphs

Here is some code that tries to compute the marginal effects of each of the predictors in a model (using the effects package) and then plot the results. To do this, I am looping over the "term.labels" attribute of the glm terms object.
library(DAAG)
library(effects)
formula = pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain + meanmin + meanmax
summary(logitFrogs <- glm(formula = formula, data = frogs, family = binomial(link = "logit")))
par(mfrow = c(4, 2))
for (predictorName in attr(logitFrogs$terms, "term.labels")) {
print(predictorName)
effLogitFrogs <- effect(predictorName, logitFrogs)
plot(effLogitFrogs)
}
This produces no picture at all. On the other hand, explicitly stating the predictor names does work:
effLogitFrogs <- effect("distance", logitFrogs)
plot(effLogitFrogs)
What am I doing wrong?
Although you call plot(), it actually dispatches to plot.eff(), which produces a lattice plot, so the par() settings are ignored. One solution is to use allEffects() and then plot(), which calls plot.efflist(). With this function you do not need a for loop because all the plots are made automatically.
effLogitFrogs <- allEffects(logitFrogs)
plot(effLogitFrogs)
EDIT - solution with a for loop
There is an "ugly" solution that uses a for() loop. For this we also need the grid package. First, store the number of rows and columns in variables (for now it works only with 1 or 2 columns). Then grid.newpage() and pushViewport() set up the graphical window.
The predictor names are stored in a vector outside the loop. Using pushViewport() and popViewport(), all plots are put in the same graphical window.
library(lattice)
library(grid)
n.col <- 2
n.row <- 4
grid.newpage()
pushViewport(viewport(layout = grid.layout(n.row,n.col)))
predictorName <- attr(logitFrogs$terms, "term.labels")
for (i in 1:length(predictorName)) {
  print(predictorName[i])
  effLogitFrogs <- effect(predictorName[i], logitFrogs)
  pushViewport(viewport(layout.pos.col=ceiling(i/n.row), layout.pos.row=ifelse(i-n.row<=0,i,i-n.row)))
  p <- plot(effLogitFrogs)
  print(p,newpage=FALSE)
  popViewport(1)
}
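The row/column arithmetic in the loop is what limits it to one or two columns. As a hedged sketch (grid_cell is a hypothetical helper, not part of the effects or grid packages), the same column-by-column fill can be written with integer division and modulo so that it works for any number of columns:

```r
# Hypothetical helper: map a running plot index to a (row, col) cell,
# filling the layout column by column as in the loop above
grid_cell <- function(i, n.row) {
  c(row = (i - 1) %% n.row + 1,    # position within the current column
    col = (i - 1) %/% n.row + 1)   # which column we are in
}

# Seven predictors in a layout with four rows
cells <- t(sapply(1:7, grid_cell, n.row = 4))
print(cells)
```

For up to two columns this gives the same cells as the ceiling()/ifelse() expressions; the resulting row and col values would be passed as layout.pos.row and layout.pos.col.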
Adding print to your loop resolves the problem.
print(plot(effLogitFrogs))
plot calls plot.eff, which creates the plot without printing it.
allEffects generates an object of class efflist. When we plot this object, plot.efflist is called, and it prints the plot itself, so there is no need to call print as with plot.eff.
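The underlying R behavior is that a value is auto-printed only at the top level of the console; inside a for loop it is computed and silently discarded unless print() is called. Lattice-style plot objects are drawn by their print method, which is why nothing appeared. A small base-R sketch of the rule, using numbers instead of plots:

```r
# Inside a loop, a bare expression is evaluated but not auto-printed
silent <- capture.output(for (i in 1:2) i)

# An explicit print() call produces output on every iteration
printed <- capture.output(for (i in 1:2) print(i))

length(silent)    # no lines were captured
printed           # one captured line per iteration
```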

How to plot a violin scatter boxplot (in R)?

I just came by the following plot:
And wondered how it can be done in R (or other software)?
Update 10.03.11: Thank you to everyone who participated in answering this question - you gave wonderful solutions! I've compiled all the solutions presented here (as well as some others I've come across online) in a post on my blog.
Make.Funny.Plot does more or less what I think it should do. It should be adapted to your own needs and might be optimized a bit, but this should be a nice start.
Make.Funny.Plot <- function(x){
  unique.vals <- length(unique(x))
  N <- length(x)
  N.val <- min(N/20,unique.vals)
  if(unique.vals>N.val){
    x <- ave(x,cut(x,N.val),FUN=min)
    x <- signif(x,4)
  }
  # construct the outline of the plot
  outline <- as.vector(table(x))
  outline <- outline/max(outline)
  # determine some correction to make the V shape,
  # based on the range
  y.corr <- diff(range(x))*0.05
  # Get the unique values
  yval <- sort(unique(x))
  plot(c(-1,1),c(min(yval),max(yval)),
       type="n",xaxt="n",xlab="")
  for(i in 1:length(yval)){
    n <- sum(x==yval[i])
    x.plot <- seq(-outline[i],outline[i],length=n)
    y.plot <- yval[i]+abs(x.plot)*y.corr
    points(x.plot,y.plot,pch=19,cex=0.5)
  }
}
N <- 500
x <- rpois(N,4)+abs(rnorm(N))
Make.Funny.Plot(x)
EDIT: corrected so it always works.
I recently came upon the beeswarm package, which bears some similarity.
The bee swarm plot is a one-dimensional scatter plot like "stripchart", but with closely-packed, non-overlapping points.
Here's an example:
library(beeswarm)
beeswarm(time_survival ~ event_survival, data = breast,
method = 'smile',
pch = 16, pwcol = as.numeric(ER),
xlab = '', ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER),
title = 'ER', pch = 16, col = 1:2)
(source: eklund at www.cbs.dtu.dk)
I have come up with code similar to Joris's, but I think this is more than a stem plot: the y value in each series is the absolute value of the distance to the in-bin mean, and the x value reflects whether the value is lower or higher than the mean.
Example code (it sometimes throws warnings but works):
px<-function(x,N=40,...){
  x<-sort(x);
  #Cutting in bins
  cut(x,N)->p;
  #Calculate the means over bins
  sapply(levels(p),function(i) mean(x[p==i]))->meansl;
  means<-meansl[p];
  #Calculate the mins over bins
  sapply(levels(p),function(i) min(x[p==i]))->minl;
  mins<-minl[p];
  #Each dot is one value.
  #X is the order of a value inside its bin, shifted so that values lower than the bin mean go below 0
  X<-rep(0,length(x));
  for(e in levels(p)) X[p==e]<-(1:sum(p==e))-1-sum((x-means)[p==e]<0);
  #Y is the bin minimum + the absolute difference between a value and its bin mean
  plot(X,mins+abs(x-means),pch=19,cex=0.5,...);
}
Try the vioplot package:
library(vioplot)
vioplot(rnorm(100))
(with an awful default color ;-)
There is also wvioplot() in the wvioplot package, for weighted violin plots, and beanplot, which combines violin and rug plots. Violin plots are also available through the lattice package; see ?panel.violin.
Since this hasn't been mentioned yet, there is also ggbeeswarm, a relatively new R package based on ggplot2.
It adds another geom to ggplot, to be used instead of geom_jitter or the like.
In particular, geom_quasirandom (see the second example below) produces really good results, and I have in fact adopted it as my default plot.
Also noteworthy is the package vipor (VIolin POints in R), which produces plots using standard R graphics and is in fact used by ggbeeswarm behind the scenes.
set.seed(12345)
install.packages('ggbeeswarm')
library(ggplot2)
library(ggbeeswarm)
ggplot(iris,aes(Species, Sepal.Length)) + geom_beeswarm()
ggplot(iris,aes(Species, Sepal.Length)) + geom_quasirandom()
#compare to jitter
ggplot(iris,aes(Species, Sepal.Length)) + geom_jitter()

Resources