How to annotate subplots with ggplot from rpy2?

I'm using Rpy2 to plot dataframes with ggplot2. I make the following plot:
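# Assumed setup (not shown in the original post)
from rpy2.robjects import r, Formula
from rpy2.robjects.packages import importr, data
import rpy2.robjects.lib.ggplot2 as ggplot2
iris = data(importr("datasets")).fetch("iris")["iris"]
p = ggplot2.ggplot(iris) + \
    ggplot2.geom_point(ggplot2.aes_string(x="Sepal.Length", y="Sepal.Width")) + \
    ggplot2.facet_wrap(Formula("~Species"))
p.plot()
r["dev.off"]()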
I'd like to annotate each subplot with some statistics about the plot. For example, I'd like to compute the correlation between each x/y subplot and place it on the top right corner of the plot. How can this be done? Ideally I'd like to convert the dataframe from R to a Python object, compute the correlations and then project them onto the scatters. The following conversion does not work, but this is how I'm trying to do it:
# This does not work
#iris_df = pandas.DataFrame({"Sepal.Length": rpy2.robjects.default_ri2py(iris.rx("Sepal.Length")),
# "Sepal.Width": rpy2.robjects.default_ri2py(iris.rx("Sepal.Width")),
# "Species": rpy2.robjects.default_ri2py(iris.rx("Species"))})
# So we access iris using R to compute the correlation
x = iris_py.rx("Sepal.Length")
y = iris_py.rx("Sepal.Width")
# compute r.cor(x, y) and divide up by Species
# Assume we get a vector of length Species saying what the
# correlation is for each Species' Sepal Length/Width
p = ggplot2.ggplot(iris) + \
    ggplot2.geom_point(ggplot2.aes_string(x="Sepal.Length", y="Sepal.Width")) + \
    ggplot2.facet_wrap(Formula("~Species"))
# ...
# How to project correlation?
p.plot()
r["dev.off"]()
But assuming I could actually access the R dataframe from Python, how could I plot these correlations? thanks.

The solution is to create a dataframe with one label per subplot. Its facet column must have the same name as the corresponding column in the dataframe holding the original data (Species in the example above), so that each label lands in the right subplot. The labels can then be plotted with:
p += ggplot2.geom_text(data=labels_df, mapping=ggplot2.aes_string(x="1", y="1", label="labels"))
where labels_df is the dataframe containing the labels and labels is the column name of labels_df with the labels to be plotted. (1,1) in this case will be the coordinate position of the label in each subplot.
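As a rough sketch of how the rows of such a labels dataframe could be computed on the Python side, the following uses only the standard library. The helper names pearson and facet_labels are made up for illustration, the measurements are invented rather than real iris values, and converting the resulting rows into an R data frame for geom_text is left out.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def facet_labels(rows):
    """Build one label row per group from (group, x, y) tuples."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        groups[group][0].append(x)
        groups[group][1].append(y)
    # The "Species" key mirrors the facet column; "labels" holds the text to draw
    return [
        {"Species": group, "labels": "r = {:.2f}".format(pearson(xs, ys))}
        for group, (xs, ys) in sorted(groups.items())
    ]

# Made-up measurements, not the real iris values
rows = [
    ("setosa", 1.0, 2.0), ("setosa", 2.0, 4.0), ("setosa", 3.0, 6.0),
    ("virginica", 1.0, 3.0), ("virginica", 2.0, 2.0), ("virginica", 3.0, 1.0),
]
labels = facet_labels(rows)
```

Each dictionary corresponds to one row of the labels dataframe: one entry per facet, keyed by the same column name used for faceting.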

I found that @user248237dfsf's answer didn't work for me. ggplot got confused between the data frame I was plotting and the data frame I was using for labels.
Instead, I used
import rpy2.robjects as robjects
# look up the ggplot2 package environment (ggplot2 must already be loaded in R)
ggplot2_env = robjects.baseenv['as.environment']('package:ggplot2')
class GBaseObject(robjects.RObject):
    @classmethod
    def new(*args, **kwargs):
        args_list = list(args)
        cls = args_list.pop(0)
        res = cls(cls._constructor(*args_list, **kwargs))
        return res
class Annotate(GBaseObject):
    _constructor = ggplot2_env['annotate']
annotate = Annotate.new
Now, I have something that works just like the standard annotate.
annotate(geom = "text", x = 1, y = 1, label = "MPC")
One minor comment: I don't know if this will work with faceting.

Related

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I applied the DBSCAN algorithm to the built-in iris dataset in R, but I get an error when I try to visualise the output using plot().
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How to rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscan, not loading fpc since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
  geom_point(mapping = aes(x = Species, y = cluster, colour = cluster),
             position = "jitter") +
  labs(title = "DBSCAN")
Here's the image it generates:
If you're looking for something else, please be more specific about what the final plot should look like.

Looping and Saving Scatterplots in R

I have a data frame that consists of 2 columns and 3110 rows. The X column is constant, whereas the Y column changes in each row. I am looking to create a loop that will generate a scatter plot for each row and save the scatter plots to my desktop.
The original code that I would use to create one scatter plot is:
X <- Abundances$s__Coprobacillus_cateniformis
Y <- Abundances$Gene1
plot(X, Y, main = "Species Vs Gene Expression",
xlab = "s__Coprobacillus_cateniformis", ylab = "Gene1",
pch = 19, frame = FALSE)
So, the X variable is a species name, and will stay constant. The Y variable is a gene name, and will change for each of the 3110 plots. I am using the percentage abundances for the gene expression and the species from another data frame called "Abundances".
A short snippet of my data looks like this. It has 2 columns, one called Predictor and one called Response:
Response <- c("ENSG00000000005.5", "ENSG00000001167.10", "ENSG00000001617.7", "ENSG00000003393.10", "ENSG00000004142.7")
Predictor <- c("s__Coprobacillus_cateniformis", "s__Coprobacillus_cateniformis", "s__Coprobacillus_cateniformis", "s__Coprobacillus_cateniformis", "s__Coprobacillus_cateniformis" )
If anyone could help me generate a loop that could create a scatter plot for each individual gene (on the y axis), against the species on the X axis, and then immediately save these plots on my desktop, that would be great!
Thanks.
It's impossible to test without a sample from Abundances, but I think this is on the right track. The key thing to note is that $ doesn't work with strings, but [[ does: Abundances$Gene1 is the same as Abundances[["Gene1"]] is the same as col = "Gene1"; Abundances[[col]].
for(i in seq_along(Response)) {
  # open a png device named after the current gene
  png(filename = paste0("plot_", Response[i], ".png"))
  X <- Abundances[[Predictor[i]]]
  Y <- Abundances[[Response[i]]]
  plot(X, Y, main = "Species Vs Gene Expression",
       xlab = Predictor[i], ylab = Response[i],
       pch = 19, frame = FALSE)
  # close the device so the file is written
  dev.off()
}
If you want the plots on your desktop, set that as the working directory or put the path to your desktop as part of the filename.

Plot a table with box size changing

Does anyone have an idea how this kind of chart is plotted? It looks like a heat map, but instead of colour, the size of each cell indicates the magnitude. I want to plot a figure like this but I don't know how to do it. Can this be done in R or Matlab?
Try scatter:
scatter(x,y,sz,c,'s','filled');
where x and y are the positions of each square, sz is the size (a vector of the same length as x and y), and c is a length(x)-by-3 matrix with the color value for each entry. The labels for the plot can be set with set(gca,properties) or xticklabels:
X=30;
Y=10;
[x,y]=meshgrid(1:X,1:Y);
x=reshape(x,[size(x,1)*size(x,2) 1]);
y=reshape(y,[size(y,1)*size(y,2) 1]);
sz=50;
sz=sz*(1+rand(size(x)));
c=[1*ones(length(x),1) repmat(rand(size(x)),[1 2])];
scatter(x,y,sz,c,'s','filled');
xlab={'ACC';'BLCA';etc}
xticks(1:X)
xticklabels(xlab)
set(get(gca,'XLabel'),'Rotation',90);
ylab={'RAPGEB6';etc}
yticks(1:Y)
yticklabels(ylab)
EDIT: yticks & co are only available from R2016b onwards; on an older version, use set instead:
set(gca,'XTick',1:X,'XTickLabel',xlab,'XTickLabelRotation',90) %rotation only available from R2014b onwards
set(gca,'YTick',1:Y,'YTickLabel',ylab)
In R, you can use ggplot2, which lets you map your values (gene expression in your case?) onto the size variable. Here is a simulation that resembles your data structure:
my_data <- matrix(rnorm(8*26,mean=0,sd=1), nrow=8, ncol=26,
dimnames = list(paste0("gene",1:8), LETTERS))
Then, you can process the data frame to be ready for ggplot2 data visualization:
library(reshape)
dat_m <- melt(my_data, varnames = c("gene", "cancer"))
Now, use ggplot2::geom_tile() to map the values onto the size variable. You may update additional features of the plot.
library(ggplot2)
ggplot(data=dat_m, aes(cancer, gene)) +
  geom_tile(aes(size=value, fill="red"), color="white") +
  scale_fill_discrete(guide=FALSE) + ##hide scale
  scale_size_continuous(guide=FALSE) ##hide another scale
In R, the corrplot package can be used. Specifically, you have to use method = 'square' when creating the plot.
Try this as an example:
library(corrplot)
corrplot(cor(mtcars), method = 'square', col = 'red')

Add to ggplot with element of different length

I'm new to ggplot2 and I'm trying to figure out how I can add a line to an already existing plot I created. The original plot, which is the cumulative distribution of a column of data T1 from a data frame x, has about 100,000 elements in it. I have successfully plotted this using ggplot2 and stat_ecdf() with the code I posted below. Now I want to add another line using a set of (x,y) coordinates, but when I try this using geom_line() I get the error message:
Error in data.frame(x = c(0, 7.85398574631245e-07, 3.14159923334398e-06, :
arguments imply differing number of rows: 1001, 100000
Here's the code I'm trying to use:
> set.seed(42)
> x <- data.frame(T1=rchisq(100000,1))
> ps <- seq(0,1,.001)
> ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
> p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(aes(ts,ps))
That's what produces the error from above. Now here's the code using base graphics that I used to use but that I am now trying to move away from:
plot(ecdf(x$T1),xlab="T1",ylab="Cum. Prob.",xlim=c(0,4),ylim=c(0,1),main="Empirical vs. Theoretical Distribution of T1")
lines(ts,ps)
I've seen some other posts about adding lines in general, but what I haven't seen is how to add a line when the two originating vectors are not of the same length. (Note: I don't want to just use 100,000 (x,y) coordinates.)
As a bonus, is there an easy way, similar to using abline, to add a drop line on a ggplot2 graph?
Any advice would be much appreciated.
ggplot works with data.frames, so you need to put ts and ps into a data.frame and pass that extra data.frame as the data argument in your call to geom_line:
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
tpdf <- data.frame(ts=ts,ps=ps)
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(data=tpdf, aes(ts,ps))

NP chart using ggplot2

How can I generate an NP chart using ggplot2?
I wrote a simple Rscript which generates bar and point charts. I supply the data in a csv file. How many columns do I need to specify, and what arguments do I need to pass to the ggplot functions?
I am very new to R and ggplot2.
EDIT :
This is what is meant by an NP chart.
Current code attempt:
#load library ggplot2
library(ggplot2)
#get arguments
args <- commandArgs(TRUE)
pdfname <- args[1]
graphtype <- args[2]
datafile <- args[3]
#read csv file
tasks <- read.csv(datafile , header = T)
#name the pdf from passed arg 1
pdf(pdfname)
#main magic that generates the graph
qplot(x,y, data=tasks, geom = graphtype)
#clean up
dev.off()
The .csv file has 2 columns, x and y. I call this script with Rscript cne.R 11_16.pdf "point" "data.csv".
Thank you very much @mathematical.coffee, this is what I need, but:
1> I am reading the data from a csv file, which contains the following:
Month,Rate
"Jan","37.50"
"Feb","32.94"
"Mar","25.00"
"Apr","33.33"
"May","33.08"
"Jun","29.09"
"Jul","12.00"
"Aug","10.00"
"Sep","6.00"
"Oct","23.00"
"Nov","9.00"
"Dec","14.00"
2> I want to display the value at each plotted point, display the values of UCL, CL and LCL, and give different labels to the x and y axes.
Problem: when I read the data, it is not in the same order as in the csv file. How can I fix it?
You combine ggplot(tasks,aes(x=x,y=y)) with geom_line and geom_point to get the lines connected by points.
If you additionally want the UCL/LCL/etc drawn you add in a geom_hline (horizontal line).
To add text to these lines you can use geom_text.
An example:
library(ggplot2)
# generate some data to use, say monthly up to a year from today.
n <- 12
tasks <- data.frame(
  x = seq(Sys.Date(), by="month", length.out=n),
  y = runif(n) )
CL = median(tasks$y) # substitute however you calculate CL here
LCL = quantile(tasks$y,.25) # substitute however you calculate LCL here
UCL = quantile(tasks$y,.75) # substitute however you calculate UCL here
# keep the limits in their own data frame so the limit layers get matching rows
lims <- data.frame(x = tasks$x[n],
                   limit = unname(c(UCL,CL,LCL)),
                   lbl = c('UCL','CL','LCL'))
p <- ggplot(tasks,aes(x=x,y=y)) + # store x/y values
  geom_line() + # add line
  geom_point(aes(colour=(y>LCL&y<UCL))) + # add points, coloured by whether they lie inside the limits
  theme(legend.position='none', # remove legends (opts() is no longer available in current ggplot2)
        axis.text.x=element_text(angle=90)) # rotate x axis labels
# Now add in the limits.
# horizontal lines + dashed for upper/lower and solid for the CL
p <- p + geom_hline(data=lims, aes(yintercept=limit,linetype=lbl)) + # draw lines
  geom_text(data=lims, aes(x=x,y=limit,label=lbl), vjust=-0.2, size=3) # draw text
# display
print(p)
which gives:
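The example above uses the median and quartiles as stand-ins for the control limits. For an NP chart with a constant sample size n, the usual convention is a centre line at n·p̄ and limits three standard deviations either side, with the lower limit floored at zero. Below is a minimal sketch of that arithmetic, written in Python for brevity (it carries over directly to R). The function name np_chart_limits, the counts and the sample size of 50 are all made up for illustration; the CSV in the question holds rates rather than counts, so the sample size would have to be known to apply this.

```python
from math import sqrt

def np_chart_limits(defect_counts, sample_size):
    """Return (LCL, CL, UCL) for an NP chart with a constant sample size."""
    # average proportion defective across all samples
    p_bar = sum(defect_counts) / (len(defect_counts) * sample_size)
    center = sample_size * p_bar
    # three standard deviations of a binomial count
    spread = 3 * sqrt(sample_size * p_bar * (1 - p_bar))
    lower = max(0.0, center - spread)  # a count cannot be negative
    upper = center + spread
    return lower, center, upper

# Hypothetical defect counts from six samples of 50 items each
counts = [4, 6, 5, 3, 7, 5]
lcl, cl, ucl = np_chart_limits(counts, sample_size=50)
```

The three returned values would take the place of LCL, CL and UCL in the plotting code.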
