I started using R recently, and have been confused by ggplot, which my class is using. I'm used to the + operator just adding two values, but I see that in ggplot you can do things such as:
ggplot(data = bechdel, aes(x = domgross_2013)) +
geom_histogram(bins = 10, color="purple", fill="white") +
labs(title = "Domestic Growth of Movies", x = " Domestic Growth")
Here we are adding two function calls together. What exactly is happening here? Is ggplot "overriding" the + operator (maybe like how you can override the == operator in Dart) in order to do something different? Or does the + operator mean something different in R than it does in the other programming languages I am used to?
I'll answer the first question. You should ask the second question in a separate posting.
R lets you override most operators. The easiest way to do it is using the "S3" object system. This is a very simple system where you attach an attribute named "class" to the object, and that affects how R processes some functions. (The ones this applies to are called "generic functions". There are other functions that don't pay any attention to the class.)
Each ggplot2 function returns an object with a class. You can use the class() function to get the class. For example, class(ggplot(data = mtcars)) is a character vector containing c("gg", "ggplot"), and class(geom_histogram(bins = 10, color="purple", fill="white")) is the vector c("LayerInstance","Layer","ggproto","gg"). (The exact vectors depend on the ggplot2 version you have installed.)
If you ask for methods("+") you'll see all the classes with methods defined for addition, and that includes "gg", so R will call that method to process the addition in the expression you used.
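To make the mechanism concrete, here is a minimal base-R sketch of the same idea. The "recipe" class and the new_recipe() helper are hypothetical names invented for this illustration; they are not part of ggplot2, and ggplot2's own method does considerably more work.

```r
# Hypothetical toy class "recipe"; not part of ggplot2.
new_recipe <- function(steps = character()) {
  structure(list(steps = steps), class = "recipe")
}

# S3 method for `+`: R dispatches here when the operands have class "recipe".
"+.recipe" <- function(e1, e2) {
  # Combine the steps of both operands into a new recipe object.
  new_recipe(c(e1$steps, e2$steps))
}

r <- new_recipe("boil water") + new_recipe("add pasta") + new_recipe("drain")
r$steps   # the three steps, in the order they were added
class(r)  # still a "recipe", so further `+` calls keep dispatching
```

Because the result of each addition is again a "recipe", the additions can be chained, which is the same reason a long ggplot2 expression can keep adding pieces.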
The + operator is part of the philosophy of ggplot2. It's inspired by The Grammar of Graphics, which is worth reading. Essentially, you build a plot by adding one layer after another.
Try taking this one step at a time in your code and it should make sense!
library(ggplot2)  # needed for the unqualified labs(), aes() and geom_histogram() calls
one <- ggplot2::ggplot(data = mtcars) +
labs(title = "Mtcars", subtitle = "Blank Canvas")
two <- ggplot2::ggplot(data = mtcars, aes(x = mpg)) +
labs(title = "Mtcars", subtitle = "+ Aesthetic Mapping")
three <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram()
library(patchwork)
one + two + three
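Another way to see what each + does is to inspect the plot object after every step. The sketch below assumes that the layers of a plot can be read with $layers, which holds in the ggplot2 versions I am aware of but is an internal detail rather than a documented guarantee.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = mpg))            # data and mapping, no layers yet
with_hist <- base + geom_histogram(bins = 10)   # `+` returns a new plot object
with_labs <- with_hist + labs(title = "Mtcars") # labels are not layers

length(base$layers)       # number of layers before adding the histogram
length(with_hist$layers)  # number of layers after adding it
length(with_labs$layers)  # unchanged by labs()
```

Note that base itself is not modified: each + produces a new object, so intermediate plots can be stored and reused.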
A tilde (~) in R generally denotes an anonymous function or formula, if I understand correctly. In ggplot2, you can use facet_wrap() to split your plot into facets based on a factor variable with multiple levels. There are two different ways to express this, and they both produce similar results:
# load starwars and tidyverse
library(tidyverse)
data(starwars)
With a ~:
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( ~ gender, nrow = 1)
With vars():
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( vars(gender), nrow = 1)
How are vars() and ~ equivalent in ggplot2? How is ~ being used in a manner that is analogous, or equivalent to, its typical usage as an anonymous function or formula in R? It doesn't seem like it's a function here? Can someone help clarify how vars() and ~ for facet_wrap() denote the same thing?
The two plots should be identical.
In ggplot2, vars() is just a quoting function that takes inputs to be evaluated, which in this case is the variable name used to form the faceting groups. In other words, the column you supplied, usually a variable with more than one level, will be automatically quoted, then evaluated in the context of the data to form small panels of plots. I recommend using vars() inputs when you want to create a function to wrap around facet_wrap(); it’s a lot easier.
The ~, on the other hand, is R's ordinary formula operator, and facet_wrap() accepts a one-sided formula as a compact way of naming the faceting variable(s). facet_wrap(~ variable_name) does not imply the estimation of some formulaic expression. Rather, the formula is never evaluated as a model; the function only reads the variable names out of it, so the one-sided formula is just a way to hand over the name of the column. It's confusing because we usually use the ~ to denote a relationship between y and x. The closest analogy is facet_grid(), where the formula has the form rows ~ columns: the variable to the left of the ~ defines the rows of panels and the variable to the right defines the columns. facet_wrap() has only one sequence of panels, so the left-hand side is left empty. Note that the faceting variables are separate from the x and y aesthetics specified inside the aes() call. Layering on facet_wrap(~ ...) is just a quick way to split the data into one panel per level of the faceting variable.
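As a rough illustration of why vars() is convenient inside a wrapper function, here is a small sketch. The function name facet_hist and its arguments are made up for this example, and it uses mtcars rather than starwars so that it runs with ggplot2 alone; the column names are passed as strings and looked up with the .data pronoun.

```r
library(ggplot2)

# Hypothetical wrapper: histogram of `x_col`, one panel per level of `facet_col`.
# Both columns are given as strings and resolved through the .data pronoun.
facet_hist <- function(data, x_col, facet_col) {
  ggplot(data, aes(x = .data[[x_col]])) +
    geom_histogram(bins = 10, fill = "blue", alpha = .2) +
    theme_minimal() +
    facet_wrap(vars(.data[[facet_col]]), nrow = 1)
}

p <- facet_hist(mtcars, "mpg", "cyl")
p
```

Doing the same with a formula would mean building the formula from a string first, which is possible but less direct.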
I am trying to loop a ggplot2 plot with a linear regression line over it. It works when I type the y column name manually, but the loop method I am trying does not work. It is definitely not a dataset issue.
I've tried many solutions from various websites on how to loop a ggplot and the one I've attempted is the simplest I could find that almost does the job.
The code that works is the following:
plots <- ggplot(Everything.any, mapping = aes(x = stock_VWRETD, y = stock_10065)) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
But I do not want to do this another 40 times (and then 5 times more for other reasons). The code that I found online and have tried to adapt to my needs is the following:
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in seq_along(nm)){
plots <- ggplot(z, mapping = aes(x = stock_VWRETD, y = nm[i])) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1",nm[i],".png",sep=" "))
}
}
plotRegression(Everything.any)
I expected the usual Stock Returns vs Market Returns scatter plot. Instead, the y-axis shows a single value, which is the name of the respective column, and the market returns are plotted normally along the x-axis but all on one horizontal line at that single y value. Please let me know what I am doing wrong.
Desired Plot:
Actual Plot:
Sample Data is available on Google Drive here:
https://drive.google.com/open?id=1Xa1RQQaDm0pGSf3Y-h5ZR0uTWE-NqHtt
The problem is that when you assign variables to aesthetics in aes, you mix bare names and strings. In this example, both X and Y are supposed to be variables in z:
aes(x = stock_VWRETD, y = nm[i])
You refer to stock_VWRETD using a bare name (as required with aes), however for y=, you provide the name as a character vector produced by colnames. See what happens when we replicate this with the iris dataset:
ggplot(iris, aes(Petal.Length, 'Sepal.Length')) + geom_point()
Since aes expects variable names to be given as bare names, it doesn't interpret 'Sepal.Length' as a variable in iris but as a separate vector (consisting of a single character value) which holds the y-values for each point.
What can you do? Here are 2 options that both give the proper plot:
1) Use aes_string and change both variable names to character:
ggplot(iris, aes_string('Petal.Length', 'Sepal.Length')) + geom_point()
2) Use the .data pronoun with double square bracket subsetting to extract the appropriate variable by name:
ggplot(iris, aes(Petal.Length, .data[['Sepal.Length']])) + geom_point()
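Applied to the loop in the question, option 2 might look like the following sketch. The helper name plot_regressions is hypothetical, iris stands in for the asker's data, and the plots are returned in a list rather than saved, so the ggsave() call is only indicated in a comment.

```r
library(ggplot2)

# Hypothetical helper: one scatter plot with a regression line per column
# in `y_cols`, each plotted against the fixed column `x_col`.
plot_regressions <- function(data, x_col, y_cols) {
  plots <- lapply(y_cols, function(y_col) {
    ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ x) +
      labs(x = x_col, y = y_col)
  })
  names(plots) <- y_cols
  plots
}

plots <- plot_regressions(iris, "Petal.Length",
                          c("Sepal.Length", "Sepal.Width", "Petal.Width"))

# To write the files, something like this could follow:
# for (nm in names(plots)) ggsave(paste0("regression_", nm, ".png"), plots[[nm]])
```

Using lapply() instead of a for loop also gives each plot its own copy of y_col, which avoids surprises if the plots are only drawn after the loop has finished.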
You need to use aes_string instead of aes, with double quotes around your x variable; then you can use your i variable directly. You can also simplify your for loop. Here is an example using iris.
library(ggplot2)
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in nm){
plots <- ggplot(z, mapping = aes_string(x = "Sepal.Length", y = i)) +
geom_point()+
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1_",i,".png",sep=""))
}
}
myiris<-iris
plotRegression(myiris)
I am trying to plot the gene expression of "gene A" among several groups.
I am using ggplot2 to draw it, but I cannot get it to work:
p <- ggplot(MAPK_plot, aes(x = group, y = gene_A)) + geom_violin(trim = FALSE , aes( colour = gene_A)) + theme_classic()
And I want to get the figure like this from https://www.researchgate.net/publication/313728883_Neuropilin-1_Is_Expressed_on_Lymphoid_Tissue_Residing_LTi-like_Group_3_Innate_Lymphoid_Cells_and_Associated_with_Ectopic_Lymphoid_Aggregates
You would have to provide data to get a more specific answer, tailored to your problem. But I do not want you to get demotivated by the down-votes you have received so far and, based on your link, maybe this example can give you some food for thought.
Nice job on figuring out that you have to use geom_violin. Further, you will need some form of faceting / multi-panels. Finally, to do the full annotation like in the given link, you need to make use of the grid package functionality (which I do not use here).
I am not familiar with gene-expression data sets, so I use an IMDB movie rating data set for this example (stored in the package ggplot2movies).
library(ggplot2)
library(ggplot2movies)
library(data.table)
mv <- copy(movies)
setDT(mv)
# make some variables for our plotting example
mv[, year_10 := cut_width(year, 10)]
mv[, rating_10yr_avg := mean(rating), by = year_10]
mv[, length_3gr := cut_number(length, 3)]
ggplot(mv,
aes(x = year_10,
y = rating)) +
geom_violin(aes(fill = rating_10yr_avg),
scale = "width") +
facet_grid(rows = vars(length_3gr))
Please do not take this answer as encouragement to leave out the data relevant to your problem when posting.
I'm using R and ggplot2 to analyze some statistics from basketball games. I'm new to R and ggplot, and I like the results I'm getting, given my limited experience. But as I go along, I find that my code gets repetitive, which I dislike.
I created several plots similar to this one:
Code:
efgPlot <- ggplot(gmStats, aes(EFGpct, Nrtg)) +
stat_smooth(method = "lm") +
geom_point(aes(colour=plg_ShortName, shape=plg_ShortName)) +
scale_shape_manual(values=as.numeric(gmStats$plg_ShortName))
Only difference between the plots is the x-value; next plot would be:
orPlot <- ggplot(gmStats, aes(ORpct, Nrtg)) +
stat_smooth(method = "lm") + ... # from here all is the same
How could I refactor this, such that I could do something like:
efgPlot <- getPlot(gmStats, EFGpct, Nrtg)
orPlot <- getPlot(gmStats, ORpct, Nrtg)
Update
I think my way of refactoring this isn't really "R-ish" (or ggplot-ish if you will); based on baptiste's comment below, I solved this without refactoring anything into a function; see my answer below.
The key to this sort of thing is using aes_string rather than aes (untested, of course):
getPlot <- function(data,xvar,yvar){
p <- ggplot(data, aes_string(x = xvar, y = yvar)) +
stat_smooth(method = "lm") +
geom_point(aes(colour=plg_ShortName, shape=plg_ShortName)) +
scale_shape_manual(values=as.numeric(data$plg_ShortName))
print(p)
invisible(p)
}
aes_string allows you to pass variable names as strings, rather than expressions, which is more convenient when writing functions. Of course, you may not want to hard code the color and shape scales, in which case you could use aes_string again for those.
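For readers on a recent ggplot2, where aes_string is deprecated, the call style asked for in the question (bare column names) can be sketched with the embrace operator {{ }} inside aes(). This is an alternative to the approach above, not the original answer's method; get_plot is a hypothetical name, and mtcars stands in for gmStats, so the colour and shape mappings are left out.

```r
library(ggplot2)

# Hypothetical refactoring: xvar and yvar are passed as bare column names
# and forwarded into aes() with the embrace operator.
get_plot <- function(data, xvar, yvar) {
  ggplot(data, aes(x = {{ xvar }}, y = {{ yvar }})) +
    stat_smooth(method = "lm", formula = y ~ x) +
    geom_point()
}

wt_plot <- get_plot(mtcars, wt, mpg)
hp_plot <- get_plot(mtcars, hp, mpg)
```

If the column names arrive as strings instead, aes(x = .data[[xvar]], y = .data[[yvar]]) plays the same role.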
Although Joran's answer helped me a lot (and it accurately answers my question), I eventually solved this according to baptiste's suggestion:
# get the variables I need from the stats data frame:
forPlot <- gmStats[c("wed_ID","Nrtg","EFGpct","ORpct","TOpct","FTTpct",
"plg_ShortName","Home")]
# melt to long format (assumption: melt() comes from the reshape2 package, loaded beforehand):
forPlot.m <- melt(forPlot, id=c("wed_ID", "plg_ShortName", "Home","Nrtg"))
# use facet_wrap to create 4 plots:
p <- ggplot(forPlot.m, aes(value, Nrtg)) +
geom_point(aes(shape=plg_ShortName, colour=plg_ShortName)) +
scale_shape_manual(values=as.numeric(forPlot.m$plg_ShortName)) +
stat_smooth(method="lm") +
facet_wrap(~variable,scales="free")
Which gives me:
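Since gmStats is not available, here is a self-contained sketch of the same reshape-then-facet idea using mtcars. The choice of columns is arbitrary, and the long format is built with base R instead of melt() so that only ggplot2 is required.

```r
library(ggplot2)

# Columns of mtcars to compare against mpg (a stand-in for the basketball stats).
predictors <- c("wt", "hp", "disp")

# Stack the predictors into long format: one row per (car, predictor) pair.
long <- do.call(rbind, lapply(predictors, function(v) {
  data.frame(variable = v, value = mtcars[[v]], mpg = mtcars$mpg)
}))

# One panel per predictor, each with its own x scale.
p <- ggplot(long, aes(value, mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~ variable, scales = "free")
p
```

The plot code is written once, and adding another predictor only means adding its name to the predictors vector.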
I'm plotting lots of similar graphs, so I thought I'd write a function to simplify the task. I'd like to pass it a data.frame and the name of the column to be plotted. Here is what I have tried:
plot_individual_subjects <- function(var, data)
{
require(ggplot2)
ggplot(data, aes(x=Time, y=var, group=Subject, colour=SubjectID)) +
geom_line() + geom_point() +
geom_text(aes(label=Subject), hjust=0, vjust=0)
}
Now if var is a string it will not work. It will not work either if I change the aes part of the ggplot command to y=data[,var]; it then complains about not being able to subset a closure.
So what is the correct way/best practice to solve this and similar problems? How can I pass column names easily and safely to functions that would like to do processing on data.frames?
Bad Joran, answering in the comments!
You want to use aes_string, which allows you to pass variable names as strings. In your particular case, since you only seem to want to modify the y variable, you probably want to reorganize which aesthetics are mapped in which geoms. For instance, maybe something like this:
ggplot(data, aes_string(y = var)) +
geom_line(aes(x = Time,group = Subject,colour = SubjectID)) +
geom_point(aes(x = Time,group = Subject,colour = SubjectID)) +
geom_text(aes(x = Time,group = Subject,colour = SubjectID,label = Subject),hjust =0,vjust = 0)
or perhaps the other way around, depending on your tastes.
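In current ggplot2, where aes_string is deprecated, the function from the question can be kept almost as written by looking the column up through the .data pronoun. The sketch below is an adaptation rather than the original answer's code: it uses the built-in Orange data set, with age and Tree standing in for the question's Time and Subject columns, and drops the geom_text layer for brevity.

```r
library(ggplot2)

# Adapted sketch: `var` is a string naming the column to put on the y axis.
# age and Tree are columns of the built-in Orange data set.
plot_individual_subjects <- function(var, data) {
  ggplot(data, aes(x = age, y = .data[[var]], group = Tree, colour = Tree)) +
    geom_line() +
    geom_point()
}

p <- plot_individual_subjects("circumference", Orange)
p
```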