Misplaced points in ggplot - r

I'm reading in a file like so:
genes<-read.table("goi.txt",header=TRUE, row.names=1)
control<-log2(1+(genes[,1]))
experiment<-log2(1+(genes[,2]))
And plotting them as a simple scatter in ggplot:
ggplot(genes, aes(control, experiment)) +
xlim(0, 20) +
ylim(0, 20) +
geom_text(aes(control, experiment, label=row.names(genes)),size=3)
However the points are incorrectly placed on my plot (see attached image)
This is my data:
control expt
gfi1 0.189634 3.16574
Ripply3 13.752000 34.40630
atonal 2.527670 4.97132
sox2 16.584300 42.73240
tbx15 0.878446 3.13560
hes8 0.830370 8.17272
Tlx1 1.349330 7.33417
pou4f1 3.763400 9.44845
pou3f2 0.444326 2.92796
neurog1 13.943800 24.83100
sox3 17.275700 26.49240
isl2 3.841100 10.08640
As you can see, 'Ripply3' is clearly in the wrong position on the graph!
Am I doing something really stupid?

The aes() function used by ggplot looks first inside the data frame you provide via data = genes. This is why you can (and should) specify variables only by bare column names like control; ggplot will automatically know where to find the data.
But R's scoping rules mean that if nothing by that name is found there, R looks in the enclosing environment, then in that environment's parent, and so on up to the global environment, until it finds something by that name.
So aes(control, experiment) looks for variables named control and experiment inside the data frame genes. It finds the original, untransformed control variable, but of course there is no experiment variable in genes. So it continues up the chain of environments until it hits the global environment, where you have defined the isolated variable experiment and uses that.
You meant to do something more like this:
genes$controlLog <- log2(1+(genes[,1]))
genes$exptLog <- log2(1+(genes[,2]))
followed by:
ggplot(genes, aes(controlLog, exptLog)) +
xlim(0, 20) +
ylim(0, 20) +
geom_text(aes(controlLog, exptLog, label=row.names(genes)),size=3)
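The lookup order described above can be reproduced without ggplot. The following is a minimal base-R sketch, not ggplot's actual internals: it evaluates each name with the data frame as the first place to look and the calling environment as the fallback. The toy values and the use of eval() are assumptions made purely for illustration.

```r
# Toy data frame with the same column names as the question's file
genes <- data.frame(control = c(0.19, 13.75), expt = c(3.17, 34.41))

# Stray vectors defined outside the data frame, as in the question
control <- log2(1 + genes$control)
experiment <- log2(1 + genes$expt)

# 'control' exists as a column, so the untransformed column is used
x_used <- eval(quote(control), genes, environment())

# 'experiment' is not a column, so the lookup falls through to the stray vector
y_used <- eval(quote(experiment), genes, environment())

identical(x_used, genes$control)  # TRUE: raw values
identical(y_used, experiment)     # TRUE: log-transformed values
```

One axis therefore ends up on the raw scale and the other on the log scale, which is why labels such as Ripply3 land in unexpected places.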

Related

Assigned variable is changing when object is modified - ggplot [duplicate]

I'm trying to copy a ggplot object and then change some properties of the new copy, for instance setting the line colour to red.
Assume this code:
df = data.frame(cbind(x=1:10, y=1:10))
a = ggplot(df, aes(x=x, y=y)) + geom_line()
b = a
Then, if I change the colour of line of variable a
a$layers[[1]]$geom_params$colour = "red"
it also changes the colour of b
> b$layers[[1]]$geom_params$colour
[1] "red" # why it is not "black"?
I would like to have two different objects a and b with different characteristics. To do this the correct way, I would need to call the plot again for b using b = ggplot(df, aes(x=x, y=y)) + geom_line(). However, at this point in the algorithm, there is no way to know the plot command ggplot(df, aes(x=x, y=y)) + geom_line()
Do you know what's wrong with this? Are ggplot objects treated in a different manner?
Thanks!
The issue here is that ggplot uses the proto library to mimic OO-style objects. The proto library relies on environments to collect variables for objects. Environments are passed by reference which is why you are seeing the behavior you are (and also a reason no one would probably recommend changing the properties of a layer that way).
Anyway, adapting an example from the proto documentation, we can try to make a deep copy of the layers of the ggplot object. This should "disconnect" them. Here's such a helper function:
duplicate.ggplot <- function(x) {
  require(proto)
  r <- x
  # rebuild each layer as a new proto object so it no longer shares state
  r$layers <- lapply(r$layers, function(x) {
    as.proto(as.list(x), parent=x)
  })
  r
}
so if we run
df = data.frame(cbind(x=1:10, y=1:10))
a = ggplot(df, aes(x=x, y=y)) + geom_line()
b = a
c = duplicate.ggplot(a)
a$layers[[1]]$geom_params$colour = "red"
then plot all three, we get
which shows we can change "c" independently from "a"
Ignoring the specifics of ggplot, there's a simple trick to make a deep copy of (almost) any object in R:
obj_copy <- unserialize(serialize(obj, NULL))
This serializes the object to a binary representation suitable for writing to disk and then reconstructs the object from that representation. It's equivalent to saving the object to a file and then loading it again (i.e. saveRDS followed by readRDS), only it never actually saves to a file. It's probably not the most efficient solution, but it should work for just about any object that can be saved to a file.
You can define a deepcopy function using this trick:
deepcopy <- function(p) {
unserialize(serialize(p, NULL))
}
This seems to successfully break the links between related ggplots.
Obviously, this will not work for objects that cannot be serialized, such as big matrices from the bigmemory package.
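The reference behaviour behind both answers can be seen with a bare environment, independent of ggplot or proto. This is a minimal sketch; the names e1, e2, and e3 are made up for illustration.

```r
# An environment standing in for a layer's internal state
e1 <- new.env(parent = emptyenv())
e1$colour <- "black"

e2 <- e1                                 # plain assignment: same environment
e3 <- unserialize(serialize(e1, NULL))   # round trip: an independent copy

e1$colour <- "red"

e2$colour  # "red": e2 and e1 are the same object
e3$colour  # "black": the copy is unaffected
```

Plain assignment copies only the reference, whereas the serialize round trip rebuilds the environment and its contents.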

Refresh plot after changing the underlying dataframe

I have several large R scripts in which I construct complex plots. At the end, I want to output the plots as PDF and TikZ file. It looks something like this:
mydata <- ...
p <- ggplot(mydata, ...)
p <- p + ... # many
p <- p + ... # modifications
p <- p + ... # to the plot
ggsave("plot.pdf")
ggsave("plot.tex", device=tikz)
Now, I want to change the name of factor levels between both calls to ggsave, since I want to include some fancy LaTeX stuff in the level names for the TikZ version:
ggsave("plot.pdf")
mydata$myfactor <- revalue(mydata$myfactor, c(small="S", medium="M"))
ggsave("plot.tex", device=tikz)
The problem here is that the change in mydata is not "propagated" to the plot. The TikZ version still uses the old level names. Is there any command to "refresh" the plot from mydata?
I'm aware of some workarounds, e.g., after renaming the factor levels, I could duplicate the whole plot construction. That works, but is inelegant. I think some kind of refresh-plot-from-data command would be most elegant, so that I don't have to repeat the plot specifications.
You haven't given a reproducible example, but I think the %+% operator (which is primarily intended for replacing the internally stored data set with a new, different one) should work to replace the internally stored data set with an updated version.
ggsave("plot.pdf",plot=p)
mydata$myfactor <- revalue(mydata$myfactor, c(small="S", medium="M"))
p <- p %+% mydata
ggsave("plot.tex", plot=p, device=tikz)
(I'm using an explicit plot= specification here for clarity.)
If that doesn't work, I would wrap your plot-construction code in a function, so that you could just call p <- build_plot(mydata) every time you needed to.
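A minimal sketch of the function-wrapping approach follows. The data frame, its value column, and the body of build_plot() are hypothetical stand-ins for the real plot code, and the levels are renamed with base R here rather than revalue().

```r
library(ggplot2)

# Hypothetical stand-in for the real data
mydata <- data.frame(
  myfactor = factor(c("small", "medium", "small"),
                    levels = c("small", "medium")),
  value = c(1, 2, 3)
)

# All plot construction lives in one place
build_plot <- function(data) {
  ggplot(data, aes(x = myfactor, y = value)) +
    geom_point()
}

# Plot for the PDF version, using the original level names
p_pdf <- build_plot(mydata)

# Rename the levels on a copy, then rebuild the plot for the TikZ version
tex_data <- mydata
levels(tex_data$myfactor) <- c("S", "M")
p_tex <- build_plot(tex_data)
```

Each plot can then be passed to ggsave() through its plot= argument, so the plot specification is written only once.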

Ggplot does not show plots in sourced function

I've been trying to draw two plots using R's ggplot library in RStudio. Problem is, when I draw two within one function, only the last one displays (in RStudio's "plots" view) and the first one disappears. Even worse, when I run ggsave() after each plot - which saves them to a file - neither of them appear (but the files save as expected). However, I want to view what I've saved in the plots as I was able to before.
Is there a way I can both display what I'll be plotting in RStudio's plots view and also save them? Moreover, when the plots are not being saved, why does the display problem happen when there's more than one plot? (i.e. why does it show the last one but not the ones before?)
The code with the plotting parts are below. I've removed some parts because they seem unnecessary (but can add them if they are indeed relevant).
HHIplot = ggplot(pergame)
# some ggplot geoms and misc. here
ggsave(paste("HHI Index of all games,",year,"Finals.png"),
path = plotpath, width = 6, height = 4)
HHIAvePlot = ggplot(AveHHI, aes(x = AveHHI$n_brokers))
# some ggplot geoms and misc. here
ggsave(paste("Average HHI Index of all games,",year,"Finals.png"),
path = plotpath, width = 6, height = 4)
I've already taken a look here and here but neither have helped. Adding a print(HHIplot) or print(HHIAvePlot) after the ggsave() lines has not displayed the plot.
Many thanks in advance.
Update 1: The solution suggested below didn't work, although it works for the answer's sample code. I passed the ggplot objects to .GlobalEnv and print() gives me an empty gray box on the plot area (which I imagine is an empty ggplot object with no layers). I think the issue might lie in some of the layers or manipulators I have used, so I've brought the full code for one ggplot object below. Any thoughts? (Note: I've tried putting the assign() line in all possible locations in relation to ggsave() and ggplot().)
HHIplot = ggplot(pergame)
HHIplot +
geom_point(aes(x = pergame$n_brokers, y = pergame$HHI)) +
scale_y_continuous(limits = c(0,10000)) +
scale_x_discrete(breaks = gameSizes) +
labs(title = paste("HHI Index of all games,",year,"Finals"),
x = "Game Size", y = "Herfindahl-Hirschman Index") +
theme(text = element_text(size=15),axis.text.x = element_text(angle = 0, hjust = 1))
assign("HHIplot",HHIplot, envir = .GlobalEnv)
ggsave(paste("HHI Index of all games,",year,"Finals.png"),
path = plotpath, width = 6, height = 4)
I'll preface this by saying that the following is bad practice. It's considered bad practice to break a programming language's scoping rules for something as trivial as this, but here's how it's done anyway.
So within the body of your function you'll create both plots and put them into variables. Then you'll use ggsave() to write them out. Finally, you'll use assign() to push the variables to the global scope.
library(ggplot2)
myFun <- function() {
  #some sample data that you should be passing into the function via arguments
  df <- data.frame(x=1:10, y1=1:10, y2=10:1)
  p1 <- ggplot(df, aes(x=x, y=y1))+geom_point()
  p2 <- ggplot(df, aes(x=x, y=y2))+geom_point()
  ggsave('p1.jpg', p1)
  ggsave('p2.jpg', p2)
  assign('p1', p1, envir=.GlobalEnv)
  assign('p2', p2, envir=.GlobalEnv)
  return()
}
Now, when you run myFun() it will write out your two plots to .jpg files, and also drop the plots into your global environment so that you can just run p1 or p2 on the console and they'll appear in RStudio's Plot pane.
ONCE AGAIN, THIS IS BAD PRACTICE
Good practice would be to not worry about the fact that they're not popping up in RStudio. They wrote out to files, and you know they did, so go look at them there.

Cannot save plots as pdf when ggplot function is called inside a function

I am going to plot a boxplot from a 4-column matrix pl1 using ggplot with dots on each box. The instruction for plotting is like this:
p1 <- ggplot(pl1, aes(x=factor(Edge_n), y=get(make.names(y_label)), ymax=max(get(make.names(y_label)))*1.05))+
geom_boxplot(aes(fill=method), outlier.shape= NA)+
theme(text = element_text(size=20), aspect.ratio=1)+
xlab("Number of edges")+
ylab(y_label)+
scale_fill_manual(values=color_box)+
geom_point(aes(x=factor(Edge_n), y=get(make.names(true_des)), ymax=max(get(make.names(true_des)))*1.05, color=method),
position = position_dodge(width=0.75))+
scale_color_manual(values=color_pnt)
Then, I use print(p1) to print it on an opened pdf. However, this does not work for me and I get the below error:
Error in make.names(true_des) : object 'true_des' not found
Can anyone help?
Your example is not very clear because you give a call but you don't show the values of your variables so it's really hard to figure out what you're trying to do (for instance, is method the name of a column in the data frame pl1, or is it a variable (and if it's a variable, what is its type? string? name?)).
Nonetheless, here's an example that should help set you on the way to doing what you want. Try something like this:
pl1 <- data.frame(Edge_n = sample(5, 20, TRUE), foo = rnorm(20), bar = rnorm(20))
y_label <- 'foo'
ax <- do.call(aes, list(
x=quote(factor(Edge_n)),
y=as.name(y_label),
ymax = substitute(max(y)*1.05, list(y=as.name(y_label)))))
p1 <- ggplot(pl1) + geom_boxplot(ax)
print(p1)
This should get you started to figuring out the rest of what you're trying to do.
Alternatively (under a different interpretation of your question), you may be running into a problem with the environment in which aes evaluates its arguments. See https://github.com/hadley/ggplot2/issues/743 for details. If this is the issue, then the answer might be to override the default value of the environment argument to aes, for instance: aes(x=factor(Edge_n), y=get(make.names(y_label)), ymax=max(get(make.names(y_label)))*1.05, environment=environment())
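As a further alternative that neither answer above mentions, more recent ggplot2 releases with tidy evaluation provide the .data pronoun, which looks up a column by the name stored in a string. This is a hedged sketch assuming such a version is installed; make_boxplot() and the toy columns are hypothetical.

```r
library(ggplot2)

# Toy data in the spirit of the example above
pl1 <- data.frame(Edge_n = sample(5, 20, TRUE), foo = rnorm(20))
y_label <- "foo"

# The column name arrives as a string, so it does not matter in which
# environment the string variable lives when the plot is printed
make_boxplot <- function(data, y_col) {
  ggplot(data, aes(x = factor(Edge_n), y = .data[[y_col]])) +
    geom_boxplot() +
    xlab("Number of edges") +
    ylab(y_col)
}

p1 <- make_boxplot(pl1, y_label)
```

Because the column is resolved inside the data at build time, calling print(p1) from another function or after opening a pdf device should not depend on y_label still being visible.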

Weird ggplot2 error: Empty raster

Why does
ggplot(data.frame(x=c(1,2),y=c(1,2),z=c(1.5,1.5)),aes(x=x,y=y,color=z)) +
geom_point()
give me the error
Error in grid.Call.graphics(L_raster, x$raster, x$x, x$y, x$width, x$height, : Empty raster
but the following two plots work
ggplot(data.frame(x=c(1,2),y=c(1,2),z=c(2.5,2.5)),aes(x=x,y=y,color=z)) +
geom_point()
ggplot(data.frame(x=c(1,2),y=c(1,2),z=c(1.5,2.5)),aes(x=x,y=y,color=z)) +
geom_point()
I'm using ggplot2 0.9.3.1
TL;DR: Check your data -- do you really want to use a continuous color scale with only one possible value for the color?
The error does not occur if you add + scale_color_continuous(guide=FALSE) to the plot. (This turns off the legend.)
ggplot(data.frame(x=c(1,2), y=c(1,2), z=c(1.5,1.5)), aes(x=x,y=y,color=z)) +
geom_point() + scale_color_continuous(guide = FALSE)
The error seems to be triggered in cases where a continuous color scale uses only one color. The current GitHub version already includes the relevant pull request. Install it via:
devtools::install_github("hadley/ggplot2")
But more probably there is an issue with the data: why would you use a continuous color scale with only one value?
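One way to act on that check is to test whether the colour variable is constant before mapping it to a continuous scale. This is only a sketch: the fixed colour "steelblue" and the branching are illustrative choices, not part of the original answer.

```r
library(ggplot2)

df <- data.frame(x = c(1, 2), y = c(1, 2), z = c(1.5, 1.5))

# A continuous colour scale carries no information if z has a single value
z_is_constant <- length(unique(df$z)) == 1

p <- ggplot(df, aes(x = x, y = y))
p <- if (z_is_constant) {
  p + geom_point(colour = "steelblue")   # fixed colour, no legend needed
} else {
  p + geom_point(aes(colour = z))        # genuine continuous mapping
}
```

With a constant z, no continuous colour legend is built at all, which sidesteps the case that triggers the error.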
I got the same behaviour (i.e. the "Empty raster" error) with a value other than 1.5.
Try the following:
ggplot(data.frame(x=c(1,2),y=c(1,2),z=c(0.02,0.02)),aes(x=x,y=y,color=z)) +
geom_point()
And you get again the same error (tried with both 0.9.3.1 and 1.0.0.0 versions) so it looks like a nasty and weird bug.
This definitely sounds like an edge case better suited for a bug report as others have mentioned but here's some generalizable code that might be useful to somebody as a clunky workaround or for handling labels/colors. It's plotting a rescaled variable and using the real values as labels.
require(scales)
z <- c(1.5,1.5)
# rescale z to 0:1
z_rescaled <- rescale(z)
# customizable number of breaks in the legend
max_breaks_cnt <- 5
# break z and z_rescaled by quantiles determined by number of maximum breaks
# and use 'unique' to remove duplicate breaks
breaks_z <- unique(as.vector(quantile(z, seq(0,1,by=1/max_breaks_cnt))))
breaks_z_rescaled <- unique(as.vector(quantile(z_rescaled, seq(0,1,by=1/max_breaks_cnt))))
# make a color palette
Pal <- colorRampPalette(c('yellow','orange','red'))(500)
# plot z_rescaled with breaks_z used as labels
ggplot(data.frame(x=c(1,2),y=c(1,2),z_rescaled),aes(x=x,y=y,color=z_rescaled)) +
geom_point() + scale_colour_gradientn("z",colours=Pal,labels = breaks_z,breaks=breaks_z_rescaled)
This is quite off-topic but I like to use rescaling to send tons of changing variables to a function like this:
colorfunction <- gradient_n_pal(colours = colorRampPalette(c('yellow','orange','red'))(500),
values = c(0:1), space = "Lab")
colorfunction(z_rescaled)

Resources