Dash in column name yields "object not found" Error - r

I have a function to generate scatter plots from data, where an argument is provided to select which column to use for coloring the points. Here is a simplified version:
library(ggplot2)
plot_gene <- function (df, gene) {
ggplot(df, aes(x, y)) +
geom_point(aes_string(col = gene)) +
scale_color_gradient()
}
where df is a data.frame with columns x, y, and then a bunch of gene names. This works fine for most gene names; however, some have dashes and these fail:
print(plot_gene(df, "Gapdh")) # great!
print(plot_gene(df, "H2-Aa")) # Error: object "H2" not found
It appears the gene variable is getting parsed ("H2-Aa" becomes H2 - Aa). How can I get around this? Is there a way to indicate that a string should not go through eval in aes_string?
Reproducible Input
If you need some input to play with, this fails like my data:
df <- data.frame(c(1,2), c(2,1), c(1,2), c(2,1))
colnames(df) <- c("x", "y", "Gapdh", "H2-Aa")
For my real data, I am using read.table(..., header=TRUE) and get column names with dashes because the raw data files have them.

Normally R tries very hard to make sure the column names in your data.frame are valid variable names. Using non-standard column names (those that are not valid variable names) leads to problems with functions that rely on non-standard evaluation. When forced to use such variable names, you often have to wrap them in backticks. In the normal case
ggplot(df, aes(x, y)) +
geom_point(aes(col = H2-Aa)) +
scale_color_gradient()
# Error in FUN(X[[i]], ...) : object 'H2' not found
would return an error but
ggplot(df, aes(x, y)) +
geom_point(aes(col = `H2-Aa`)) +
scale_color_gradient()
would work.
You can paste in backticks if you really want
geom_point(aes_string(col = paste0("`", gene, "`")))
or you could treat it as a symbol from the get-go and use aes_q instead
geom_point(aes_q(col = as.name(gene)))
The latest release of ggplot2 supports unquoting with !! instead of using aes_string or aes_q, so you could do
geom_point(aes(col = !!rlang::sym(gene)))
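To see why the dash causes trouble, the following base-R sketch compares what the parser makes of the string with what as.name() produces. It uses only base functions; the variable names parsed and sym are illustrative and not part of the original answer.

```r
gene <- "H2-Aa"

# Parsing the string treats the dash as subtraction: the result is a call
# to `-` with the two symbols H2 and Aa as its arguments.
parsed <- parse(text = gene)[[1]]
is.call(parsed)            # TRUE
as.character(parsed[[1]])  # "-"

# as.name() keeps the whole string as a single symbol, which is the kind
# of object that aes_q() or !!rlang::sym() work with.
sym <- as.name(gene)
is.symbol(sym)             # TRUE

# Evaluating the symbol in the data frame finds the column.
df <- data.frame(c(1, 2), c(2, 1), c(1, 2), c(2, 1))
colnames(df) <- c("x", "y", "Gapdh", "H2-Aa")
eval(sym, df)              # 2 1

# With check.names = TRUE (the default in data.frame() and read.table())
# the name would instead be rewritten to a syntactic one.
make.names(gene)           # "H2.Aa"
```

If renaming the columns is acceptable, leaving check.names at its default when reading the data avoids the problem altogether, at the cost of "H2-Aa" becoming "H2.Aa".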

Related

How do I loop a ggplot2 function to export and save about 40 plots?

I am trying to loop a ggplot2 plot with a linear regression line over it. It works when I type the y column name manually, but the loop method I am trying does not work. It is definitely not a dataset issue.
I've tried many solutions from various websites on how to loop a ggplot and the one I've attempted is the simplest I could find that almost does the job.
The code that works is the following:
plots <- ggplot(Everything.any, mapping = aes(x = stock_VWRETD, y = stock_10065)) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
But I do not want to do this another 40 times (and then 5 times more for other reasons). The code that I found online and tried to adapt to my needs is the following:
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in seq_along(nm)){
plots <- ggplot(z, mapping = aes(x = stock_VWRETD, y = nm[i])) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1",nm[i],".png",sep=" "))
}
}
plotRegression(Everything.any)
I expected a normal scatter plot of stock returns against market returns. Instead, the y-axis shows a single value, the name of the respective column, and the market returns are plotted along that one y value as if on a number line. Please let me know what I am doing wrong.
Desired plot and actual plot: (images not included)
Sample Data is available on Google Drive here:
https://drive.google.com/open?id=1Xa1RQQaDm0pGSf3Y-h5ZR0uTWE-NqHtt
The problem is that when you assign variables to aesthetics in aes, you mix bare names and strings. In this example, both X and Y are supposed to be variables in z:
aes(x = stock_VWRETD, y = nm[i])
You refer to stock_VWRETD using a bare name (as required with aes), however for y=, you provide the name as a character vector produced by colnames. See what happens when we replicate this with the iris dataset:
ggplot(iris, aes(Petal.Length, 'Sepal.Length')) + geom_point()
Since aes expects variable names to be given as bare names, it doesn't interpret 'Sepal.Length' as a variable in iris but as a separate vector (consisting of a single character value) which holds the y-values for each point.
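The difference between a string and a bare name can be checked without ggplot2. This is a small base-R sketch of the evaluation step only, not of what aes does internally; col is an illustrative variable name.

```r
# A quoted name is a character constant: evaluating it in the data
# returns the string itself, not a column.
eval(quote("Sepal.Length"), iris)         # "Sepal.Length"

# A bare name is a symbol: evaluating it in the data looks up the column.
head(eval(quote(Sepal.Length), iris), 3)  # 5.1 4.9 4.7

# Converting the string to a symbol first gives the column again.
col <- "Sepal.Length"
identical(eval(as.name(col), iris), iris$Sepal.Length)  # TRUE
```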
What can you do? Here are 2 options that both give the proper plot
1) Use aes_string and change both variable names to character:
ggplot(iris, aes_string('Petal.Length', 'Sepal.Length')) + geom_point()
2) Use square bracket subsetting to manually extract the appropriate variable:
ggplot(iris, aes(Petal.Length, .data[['Sepal.Length']])) + geom_point()
You need to use aes_string instead of aes and put double quotes around your x variable; then you can use the loop variable i directly. You can also simplify the for loop by iterating over the column names themselves. Here is an example using iris.
library(ggplot2)
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in nm){
plots <- ggplot(z, mapping = aes_string(x = "Sepal.Length", y = i)) +
geom_point()+
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1_",i,".png",sep=""))
}
}
myiris<-iris
plotRegression(myiris)
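In current ggplot2 releases aes_string() is deprecated, and the .data pronoun from option 2 can be used for both axes. The sketch below is one way to write the loop under that assumption. The names plot_regressions and x_col are illustrative, and the function returns the plots instead of saving them so that it has no side effects.

```r
plot_regressions <- function(z, x_col) {
  # Only loop over numeric columns other than the x variable.
  y_cols <- setdiff(names(z)[vapply(z, is.numeric, logical(1))], x_col)
  plots <- lapply(y_cols, function(y_col) {
    ggplot2::ggplot(z, ggplot2::aes(x = .data[[x_col]], y = .data[[y_col]])) +
      ggplot2::geom_point() +
      ggplot2::geom_smooth(method = "lm", formula = y ~ x) +
      ggplot2::labs(x = x_col, y = y_col)
  })
  names(plots) <- y_cols
  plots
}

# Usage, guarded so the snippet also runs where ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plots <- plot_regressions(iris, x_col = "Sepal.Length")
  print(names(plots))
  # To write files, one option is:
  # for (nm in names(plots)) {
  #   ggplot2::ggsave(paste0("regression1_", nm, ".png"), plots[[nm]])
  # }
}
```

Skipping the x column and non-numeric columns avoids regressing a variable on itself or on a factor such as Species.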

How to add a character string as a code to ggplot object

Part of my ggplot code is stored in a character vector. I would like to use this code as additional geoms for my ggplot.
Example1:
DF=data.frame(x=seq(1:10), y=seq(1:20))
a='geom_line()'# This is a string that should be converted to RCode
So far I tried:
ggplot(DF, aes(x,y))+geom_point()+a
Error: Don't know how to add a to a plot
ggplot(DF, aes(x,y))+geom_point()+as.name(a)
Error: Don't know how to add as.name(a) to a plot
ggplot(DF, aes(x,y))+geom_point()+eval(parse(text=a))
Error in geom_line() + geom_line(y = 1) :
non-numeric argument to binary operator
ggplot(DF, aes(x,y))+geom_point()+deparse(substitute(a))
Error: Don't know how to add deparse(substitute(a)) to a plot
Example 2:
DF=data.frame(x=seq(1:10), y=seq(1:20))
a='geom_line()+geom_line(y=1)'
You are probably wondering why I would want to do that in the first place. In a for loop, I created expressions and stored them in a list as characters. Later, I pasted all the expressions together into a single string. Now I would like to add this string to a ggplot command. Any suggestions?
Edit: Example 1 was successfully solved. But Example 2 stays unsolved.
The parse function has a text argument, and that is where you need to pass a. Try:
ggplot(DF, aes(x,y)) + geom_point() + eval(parse(text = a))
More info here:
http://adv-r.had.co.nz/Expressions.html#parsing-and-deparsing
In the case of multiple statements, it is possible to deparse the original expression, append the new code, and then evaluate the whole thing:
original <- deparse(quote(ggplot(DF, aes(x,y)) + geom_point()))
new_call <- paste(original, '+', a)
eval(parse(text = new_call))
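The reason Example 2 fails with eval(parse(text = a)) alone is that the whole string parses to one call to `+` between two layers. A possible alternative to pasting everything into one string is to split that call into its operands and evaluate them separately. The helper split_plus below is a hypothetical name, and the sketch uses only base R.

```r
a <- "geom_line()+geom_line(y=1)"

# The string parses to a single call to `+` whose operands are the two
# layer calls, so evaluating it adds one layer to another, not to a plot.
expr <- parse(text = a)[[1]]
as.character(expr[[1]])   # "+"

# Recursively split nested binary `+` calls into their operands.
split_plus <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_plus(e[[2]]), split_plus(e[[3]]))
  } else {
    list(e)
  }
}

parts <- split_plus(expr)
vapply(parts, deparse, character(1))  # "geom_line()" "geom_line(y = 1)"

# With ggplot2 attached, the pieces could then be evaluated into a list,
# which can be added to a plot in one step:
# ggplot(DF, aes(x, y)) + geom_point() + lapply(parts, eval)
```

Only the top-level `+` calls are split, so a `+` that appears inside the arguments of a layer stays untouched.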
You can also write a function that returns these pieces of code as a list. Please see: https://homepage.divms.uiowa.edu/~luke/classes/STAT4580/dry.html
Here is the relevant code, quoted from that page:
Defining a theme_slopegraph function to do the theme adjustment allows
the adjustments to be easily reused:
theme_slopechart = function(toplabels = TRUE) {
thm <- theme(...)
list(thm, ...) # add multiple codes
#...
}
p <- basic_barley_slopes ## from twonum.R
p + theme_slopechart()

Please explain ..density [duplicate]

I can not find the documentation for the double dots around density
set.seed(1234)
df <- data.frame(cond = factor(rep(c("A","B"), each=200)), rating = c(rnorm(200),rnorm(200, mean=.8)))
print(head(df))
print(ggplot(df, aes(x=rating)) +
geom_histogram(aes(y=..density..), # Histogram with density instead of count on y-axis
binwidth=.5,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
geom_vline(aes(xintercept=mean(rating, na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=1))
Do you know what operator they represent?
Edit
I know what it does when used in a geom, I would like to know what it is.
For instance, the single dot operator is defined as
> .
function (..., .env = parent.frame())
{
structure(as.list(match.call()[-1]), env = .env, class = "quoted")
}
<environment: namespace:plyr>
If I redefine density, then ..density.. has a different effect, so it seems XX -> ..XX.. is an operator. I would like to find how it is defined.
Unlike many other languages, in R, the dot is perfectly valid in identifiers. In this case, ..count.. is an identifier. However, there is special code in ggplot2 to detect this pattern, and to strip the dots. It feels unlikely that real code would use identifiers formatted like that, and so this is a neat way to distinguish between defined and calculated aesthetics.
The relevant code is at the end of layer.r:
# Determine if aesthetic is calculated
is_calculated_aes <- function(aesthetics) {
match <- "\\.\\.([a-zA-z._]+)\\.\\."
stats <- rep(FALSE, length(aesthetics))
grepl(match, sapply(aesthetics, deparse))
}
# Strip dots from expressions
strip_dots <- function(aesthetics) {
match <- "\\.\\.([a-zA-z._]+)\\.\\."
strings <- lapply(aesthetics, deparse)
strings <- lapply(strings, gsub, pattern = match, replacement = "\\1")
lapply(strings, function(x) parse(text = x)[[1]])
}
It is used further up above in the map_statistic function. If a calculated aesthetic is present, another data frame (one that contains e.g. the count column) is used for the plot.
The single dot . is just another identifier, defined in the plyr package. As you can see, it is a function.
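The detection and stripping steps can be tried in isolation with base R. The pattern below is a simplified version of the one quoted above (an assumption: only letters, dots and underscores appear between the double dots), and pattern and txt are illustrative names.

```r
pattern <- "\\.\\.([a-zA-Z._]+)\\.\\."

# ..density.. is an ordinary symbol as far as the R parser is concerned.
is.symbol(quote(..density..))   # TRUE

# Detecting and stripping the dots works on the deparsed text.
txt <- deparse(quote(..density..))
grepl(pattern, txt)             # TRUE
gsub(pattern, "\\1", txt)       # "density"

# A plain column name does not match.
grepl(pattern, "rating")        # FALSE
```

Newer ggplot2 releases (3.3.0 onward, as far as I know) document after_stat(density) as the replacement for the ..density.. notation, which makes the intent explicit instead of relying on a naming pattern.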

passing varying columns to aes inside a function

I am trying to write a function that calls ggplot with varying arguments to the aes:
hmean <- function(data, column, Label=label){
ggplot(data,aes(column)) +
geom_histogram() +
facet_wrap(~Antibody,ncol=2) +
ggtitle(paste("Mean Antibody Counts (Log2) for ",Label," stain"))
}
hmean(Log2Means,Primary.Mean, Label="Primary")
Error in eval(expr, envir, enclos) : object 'column' not found
Primary.Mean is the varying argument (I have multiple means). Following various posts here I have tried
passing the column name quoted and unquoted (which yields either an "unexpected string constant" or the "object not found" error)
setting up a local environment (foo <- environment() followed by an environment= arg in ggplot)
creating a new copy of the data set using data2$column <- data[,column]
None of these appear to work within ggplot. How do I write a function that works?
I will be calling it with different data.frames and columns:
hmean(Log2Means, Primary.mean, Label="Primary")
hmean(Log2Means, Secondary.mean, Label="Secondary")
hmean(SomeOtherFrame, SomeColumn, Label="Pretty Label")
Your example is not reproducible, but this will likely work:
hmean <- function(data, column, Label=label){
ggplot(data, do.call("aes", list(x = substitute(column))) ) +
geom_histogram() +
facet_wrap(~Antibody,ncol=2) +
ggtitle(paste("Mean Antibody Counts (Log2) for ",Label," stain"))
}
hmean(Log2Means,Primary.Mean, Label="Primary")
If you need more arguments to aes, do like this:
do.call("aes", list(y = substitute(function_parameter), x = quote(literal_parameter)))
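What substitute() contributes here can be seen in a small base-R sketch. The helper capture_column and the data frame d are hypothetical and serve only as illustration.

```r
# Hypothetical helper: captures the bare column name the caller typed.
capture_column <- function(data, column) {
  expr <- substitute(column)   # the unevaluated argument, here a symbol
  list(name = deparse(expr), values = eval(expr, data))
}

d <- data.frame(Primary.Mean = c(1.5, 2.5), Antibody = c("a", "b"))
res <- capture_column(d, Primary.Mean)
res$name     # "Primary.Mean"
res$values   # 1.5 2.5
```

Passing the captured symbol to aes via do.call has the same effect as typing the column name inside aes by hand. In ggplot2 3.0.0 or later with a recent rlang, aes({{ column }}) is, as far as I know, the documented way to forward a bare column name from a function argument.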
You could try this:
hmean <- function(data, column, Label=label){
# cool trick?
data$pColumn <- data[, column]
ggplot(data,aes(pColumn)) +
geom_histogram() +
facet_wrap(~Antibody,ncol=2) +
ggtitle(paste("Mean Antibody Counts (Log2) for ",Label," stain"))
}
hmean(Log2Means,'Primary.Mean', Label="Primary")
I eventually got it to work with an aes_string() call: aes_string(x=foo, y=y, colour=color), where y and color were also defined outside of ggplot().

Resources