I am a total R beginner, so please excuse the correspondingly basic question.
I am using the ROCR package in R to generate plotting data for ROC curves. I then use ggplot2 to draw the plot. Something like this:
library(ggplot2)
library(ROCR)
inputFile <- read.csv("path/to/file", header=FALSE, sep=" ", colClasses=c('numeric','numeric'), col.names=c('score','label'))
predictions <- prediction(inputFile$score, inputFile$label)
auc <- performance(predictions, measure="auc")@y.values[[1]]
rocData <- performance(predictions, "tpr","fpr")
rocDataFrame <- data.frame(x=rocData@x.values[[1]],y=rocData@y.values[[1]])
rocr.plot <- ggplot(data=rocDataFrame, aes(x=x, y=y)) + geom_path(size=1)
rocr.plot <- rocr.plot + geom_text(aes(x=1, y= 0, hjust=1, vjust=0, label=paste(sep = "", "AUC = ",round(auc,4))),colour="black",size=4)
This works well for drawing a single ROC curve. However, what I would like to do is read in a whole directory worth of input files - one file per classifier test results - and make a ggplot2 multifaceted plot of all the ROC curves, while still printing the AUC score into each plot.
I would like to understand what is the "proper" R-style approach to accomplishing this. I am sure I can hack something together by having one loop go through all files in the directory and create a separate data frame for each, and then having another loop to create multiple plots, and somehow getting ggplot2 to output all these plots onto the same surface. However, that does not let me use ggplot2's built-in faceting, which I believe is the right approach. I am not sure how to get my data into proper shape for faceting use, though. Should I be merging all my data frames into a single one, and giving each merged chunk a name (e.g. filename) and faceting on that? If so, is there a library or recommended practice for making this happen?
Your suggestions are appreciated. I am still wrapping my head around the best practices in R, so I'd rather get expert advice instead of just hacking things up to make code that looks more like ordinary declarative programming languages that I am used to.
EDIT: The thing I am least clear on is whether, when using ggplot2's built-in faceting capabilities, I'd still be able to output a custom string (AUC score) into each plot it will generate.
Here is an example of how to generate a plot as you described. I use the built-in dataset quakes:
The code does the following:
Load the ggplot2 and plyr packages
Add a facet variable to quakes - in this case I summarise by depth of earthquake
Use ddply to summarise the mean magnitude for each depth
Use ggplot with geom_text to label the mean magnitude
The code:
library(plyr)
library(ggplot2)
quakes$level <- cut(quakes$depth, 5,
                    labels=c("Very Shallow", "Shallow", "Medium", "Deep", "Very Deep"))
quakes.summary <- ddply(quakes, .(level), summarise, mag=round(mean(mag), 1))
ggplot(quakes, aes(x=long, y=lat)) +
  geom_point(aes(colour=mag)) +
  geom_text(aes(label=mag), data=quakes.summary, x=185, y=-35) +
  facet_grid(~level) +
  coord_map()
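Applied to the ROC question, the same pattern (one long data frame for the curves, one small summary data frame for the labels) might look like the sketch below. Everything specific here is an assumption made for illustration: the scores are simulated instead of read from files, the classifier names clfA and clfB are made up, and roc_points() and roc_auc() are hypothetical base-R helpers, not ROCR functions. In practice the x.values and y.values slots from ROCR could fill the same columns.

```r
library(ggplot2)

# Hypothetical helper: ROC points from scores and 0/1 labels (base R only).
roc_points <- function(score, label) {
  label <- label[order(score, decreasing = TRUE)]
  data.frame(
    fpr = c(0, cumsum(label == 0) / sum(label == 0)),
    tpr = c(0, cumsum(label == 1) / sum(label == 1))
  )
}

# Hypothetical helper: trapezoidal area under the ROC points.
roc_auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

set.seed(42)
# Simulated stand-ins for the per-classifier input files.
inputs <- lapply(c(clfA = 0.5, clfB = 1.5), function(shift) {
  label <- rbinom(200, 1, 0.5)
  data.frame(score = rnorm(200, mean = shift * label), label = label)
})

roc_list <- lapply(inputs, function(d) roc_points(d$score, d$label))

# One long data frame, with a column naming the classifier to facet on.
roc_all <- do.call(rbind, lapply(names(roc_list), function(n) {
  data.frame(classifier = n, roc_list[[n]])
}))

# One row per facet: the label text and where to draw it.
auc_labels <- data.frame(
  classifier = names(roc_list),
  fpr = 1,
  tpr = 0,
  label = sprintf("AUC = %.4f", sapply(roc_list, roc_auc))
)

roc_plot <- ggplot(roc_all, aes(x = fpr, y = tpr)) +
  geom_path() +
  geom_text(aes(label = label), data = auc_labels, hjust = 1, vjust = 0) +
  facet_wrap(~classifier)
```

Because auc_labels carries the same classifier column as roc_all, each label lands in its own facet, which is the same trick as quakes.summary above.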
I'm having some trouble visualizing what I am hoping to understand from a set of Proximity Ligation Assay Experiments (PLA). I was able to quantify dots within each cell to yield a table like this:
RawData Table
After a lot of wrangling, I was able to filter out data points considered outliers and was able to yield something close to what I was looking for:
However, the trouble here is I need to transform the Y-axis into a summary statistic that counts the amount of objects and plots them on the boxplot, as opposed to the # of each individual object itself.
I have tried wrangling around with dplyr a bit using the following code, but have not been able to successfully reproduce a boxplot similar to the one above:
filtered_output_PLA_dots %>%
  count(FileName_RawData) %>%
  ggplot() +
  geom_boxplot(aes(x = factor(FileName_RawData, levels = plot1_order), y = n))
Is there a better way to do this that I haven't looked into? Any advice would be great! Alternatively, some pointers to learning resources and tutorials would also be appreciated!
I am plotting two histograms in R by using the following code.
x1<-rnorm(100)
x2<-rnorm(50)
h1<-hist(x1)
h2<-hist(x2)
plot(h1, col=rgb(0,0,1,.25), xlim=c(-4,4), ylim=c(0,0.6), main="", xlab="Index", ylab="Percent",freq = FALSE)
plot(h2, col=rgb(1,0,0,.25), xlim=c(-4,4), ylim=c(0,0.6), main="", xlab="Index", ylab="Percent",freq = FALSE,add=TRUE)
legend("topright", c("H1", "H2"), fill=c(rgb(0,0,1,.25),rgb(1,0,0,.25)))
The code produces the following output.
I need a visually appealing version of the above plot, and I want to use ggplot2. I am looking for something like this (see the Change fill colors section). However, I think ggplot2 only works with data frames, and I do not have data frames in this case. How can I create a good-looking histogram plot in ggplot2? Please let me know. Thanks in advance.
You can (and should) put your data into a data.frame if you want to use ggplot. Ideally for ggplot, the data.frame should be in long format. Here's a simple example:
df1 = rbind(data.frame(grp='x1', x=x1), data.frame(grp='x2', x=x2))
ggplot(df1, aes(x, fill=grp)) +
  geom_histogram(color='black', alpha=0.5)
There are lots of options to change the appearance to your liking. If you want to have the histograms stacked or grouped, or shown as percent versus count, or as densities etc., you will find many resources in previous questions showing how to implement each of those options.
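As one possible variation, here is a sketch that overlays the two histograms on a density scale instead of stacking counts, which is closer to the base-graphics plot in the question. It assumes ggplot2 3.3.0 or later (for after_stat()); the bin count and the colours are arbitrary choices, not part of the original answer.

```r
library(ggplot2)

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(50)

# Long format: one row per observation, with a column saying which sample it is.
df1 <- rbind(data.frame(grp = "x1", x = x1),
             data.frame(grp = "x2", x = x2))

hist_plot <- ggplot(df1, aes(x, fill = grp)) +
  # position = "identity" overlays the groups instead of stacking them;
  # after_stat(density) rescales each group so its bars integrate to 1.
  geom_histogram(aes(y = after_stat(density)), position = "identity",
                 bins = 20, colour = "black", alpha = 0.5) +
  scale_fill_manual(values = c(x1 = "blue", x2 = "red")) +
  labs(x = "Index", y = "Density", fill = NULL)
```

Putting the samples on a density scale matters here because the two groups have different sizes (100 and 50), so raw counts are not directly comparable.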
I love the JMP variability plot. (link) It is a powerful tool.
In the example, the plot has two x-axis labels, one for part number and one for operator.
Here the JMP variability plot displays more than 2 levels of variables. The following splits by oil amount, batch size, and popcorn type. It can take some work to find the right sequence to show strongest separation, but this is an excellent tool for communication of information.
How does one do this, the multiple-level x-labels, with R using the ggplot2 library?
The best that I can find is this (link, link), which separates based on cylinder count, but does not make the x-axis labels.
My example code is this:
#reproducible
set.seed(2372064)
#data (I'm used to reading my own, not using built-in)
data(mtcars)
# attach(mtcars) is not needed: columns are reached via mtcars[...] and aes()
#impose factors as factors
fact_idx <- c(2,8:11)
for(i in fact_idx){
  mtcars[,i] <- as.factor(mtcars[,i])
}
#boxplot
p <- ggplot(mtcars, aes(gear, mpg, fill=cyl)) +
  geom_boxplot(notch = TRUE)
p
The plot this gives is:
How do I make the x-axis labels indicate both gears and cylinders?
In jmp I get this:
You could use R-package VCA which comes with function varPlot implementing variability charts similar to JMP. There are multiple examples provided in the help. Your example would look like this:
library(VCA)
dat <- mtcars[order(mtcars$cyl, mtcars$gear),]
# default
varPlot(mpg~cyl/gear, dat)
# nicely formatted
varPlot(mpg~cyl/gear, dat,
        BG=list(var="gear", col=paste0("gray", c(90,80,70)),
                col.table=TRUE),
        VLine=list(var="cyl"), Mean=NULL,
        MeanLine=list(var=c("cyl", "gear"), col=c("blue", "orange"),
                      lwd=c(2,2)),
        Points=list(pch=16, cex=1))
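If you would rather stay within ggplot2, one common workaround (not part of the VCA answer above, and only an approximation of the JMP layout) is to facet on the outer variable and move the facet strips below the axis, so that they read as a second row of x-axis labels. A minimal sketch, assuming cyl is the outer level and gear the inner one:

```r
library(ggplot2)

# Treat both grouping variables as factors without modifying mtcars itself.
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

nested_plot <- ggplot(cars, aes(gear, mpg)) +
  geom_boxplot() +
  # switch = "x" moves the cyl strips to the bottom; free scales and space
  # drop the gear levels that do not occur within a given cyl.
  facet_grid(~ cyl, scales = "free_x", space = "free_x",
             switch = "x", labeller = label_both) +
  theme(strip.placement = "outside",      # strips sit below the axis labels
        panel.spacing = unit(0, "lines")) # panels touch, like one plot
```

The inner labels (gear) remain the ordinary axis text and the outer labels (cyl) are the strips, so deeper nesting would need additional faceting variables.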
This is the first time I have an R question that I couldn't already find on Stack Overflow. Forgive me if I found nothing only because there is a specific term for this kind of plot that I'm not aware of (is there?).
I'd like to display data as a cumulative frequency. Since my focus is on the edges of the distribution, it is helpful to scale the y-axis to a normal distribution. The result should look something like this:
I've read about quantile-quantile plots, but honestly I can't figure out how to apply them if I want to preserve the X-axis.
I tried both base graphics and ggplot2, but can't figure it out. My current solution is therefore, for example
plot(ecdf(trees$Volume))
or
ggplot(data=trees, aes(Volume)) + stat_ecdf()
I think you are looking for the scales package and the probability_trans() function:
Without transforming the y scales:
require(ggplot2)
ggplot(data = trees,
aes(Volume)) +
stat_ecdf()
With transformation of y axis:
ggplot(data = trees,
aes(Volume)) +
stat_ecdf() +
scale_y_continuous(trans = scales::probability_trans("norm"))
You can read more about these in the documentation with ?probability_trans.
The probability_trans() function takes standard R probability names to scale your axis with.
You can also create a new transformation with trans_new() if you need something completely custom.
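To make it concrete what the transformation does, here is a small sketch. For "norm" the forward transform is the normal quantile function and the inverse is the normal CDF, so equal steps on the axis correspond to equal steps in standard deviations. The break positions are my own choice for illustration.

```r
library(ggplot2)

tr <- scales::probability_trans("norm")

probs <- c(0.01, 0.1, 0.5, 0.9, 0.99)
z <- tr$transform(probs)   # same values as qnorm(probs)
back <- tr$inverse(z)      # same values as pnorm(z), i.e. probs again

# Explicit breaks at round probabilities make the stretched tails readable.
prob_plot <- ggplot(trees, aes(Volume)) +
  stat_ecdf() +
  scale_y_continuous(trans = tr, breaks = probs)
```

Note that probabilities of exactly 0 and 1 map to -Inf and Inf under this transform, so the very ends of the ECDF cannot be drawn on such an axis.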
The qpplot.das function from the StatDA package by Peter Filzmoser might be a "base R" way for you.
library(StatDA)
qpplot.das(trees$Volume, qdist = qnorm, xlab = "Volume", line = FALSE)
The StatDA package was used for all calculations and graphics for the book Statistical Data Analysis Explained by Reimann, Filzmoser, Garret and Dutter. All R scripts are online, also examples for the QP plots.
Is there a way to transform data in ggplot2 in the aes declaration of a geom?
I have a plot conceptually similar to this one:
test=data.frame("k"=rep(1:3,3),"ce"=rnorm(9),"comp"=as.factor(sort(rep(1:3,3))))
plot=ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
Suppose I would like to add a line calculated as the maximum of the values between the three comp for each k point, with only the plot object available. I have tried several options (e.g. using aggregate in the aes declaration, stat_function, etc.) but I could not find a way to make this work.
At the moment I am working around the problem by extracting the data frame with ggplot_build, but I would like to find a direct solution.
Is
require(plyr)
max.line = ddply(test, .(k), summarise, ce = max(ce))
plot = ggplot(test, aes(y=ce,x=k))
plot = plot + geom_line(aes(lty=comp))
plot = plot + geom_line(data=max.line, color='red')
something like what you want?
Thanks to JLLagrange and jlhoward for your help. However both solutions require the access to the underlying data.frame, which I do not have. This is the workaround I am using, based on the previous example:
data=ggplot_build(plot)$data[[1]]
cemax=with(data,aggregate(y,by=list(x),max))
plot+geom_line(data=cemax,aes(x=Group.1,y=x),colour="green",alpha=.3,lwd=2)
This does not require direct access to the dataset, but to me it is a very inefficient and inelegant solution. Obviously if there is no other way to manipulate the data, I do not have much of a choice :)
EDIT (Response to OP's comment):
OK I see what you mean now - I should have read your question more carefully. You can achieve what you want using stat_summary(...), which does not require access to the original data frame. It also solves the problem I describe below (!!).
library(ggplot2)
set.seed(1)
test <- data.frame(k=rep(1:3,3),ce=rnorm(9),comp=factor(rep(1:3,each=3)))
plot <- ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
##
plot + stat_summary(fun=max, geom="line", col="red")  # 'fun' replaces 'fun.y', deprecated since ggplot2 3.3.0
Original Response (Requires access to original df)
One unfortunate characteristic of ggplot is that aggregating functions (like max, min, mean, sd, sum, and so on) used in aes(...) operate on the whole dataset, not subgroups. So
plot + geom_line(aes(y=max(ce)))
will not work - you get the maximum of all test$ce vs. k, which is not useful.
One way around this, which is basically the same as @JLLagrange's answer (but doesn't use external libraries), is:
plot+geom_line(data=aggregate(ce~k,test,max),colour="red")
This creates a data frame dynamically aggregating ce by k using the max function.
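As a sanity check, the two routes can be compared directly: the values that stat_summary computes inside the plot should match what aggregate computes from the data frame. The sketch below assumes ggplot2 3.3.0 or later (for the fun argument) and uses layer_data() to pull the computed layer out of the plot; the object names are mine.

```r
library(ggplot2)

set.seed(1)
test <- data.frame(k = rep(1:3, 3), ce = rnorm(9),
                   comp = factor(rep(1:3, each = 3)))
plot <- ggplot(test, aes(y = ce, x = k)) + geom_line(aes(lty = comp))

# Route 1: let ggplot2 compute the per-k maximum from the plot object alone.
with_max <- plot + stat_summary(fun = max, geom = "line", colour = "red")
from_layer <- layer_data(with_max, 2)   # layer 2 is the stat_summary layer

# Route 2: compute the same thing from the original data frame.
from_aggregate <- aggregate(ce ~ k, test, max)

same <- isTRUE(all.equal(as.numeric(from_layer$y), from_aggregate$ce))
```

If same is TRUE, the stat_summary line is the per-k maximum across the three comp series, obtained without touching test.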