Transformation of aesthetic inputs in R and ggplot2 - r

Is there a way to transform data in ggplot2 in the aes declaration of a geom?
I have a plot conceptually similar to this one:
test=data.frame("k"=rep(1:3,3),"ce"=rnorm(9),"comp"=as.factor(sort(rep(1:3,3))))
plot=ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
Suppose I would like to add a line calculated as the maximum of the values between the three comp for each k point, with only the plot object available. I have tried several options (e.g. using aggregate in the aes declaration, stat_function, etc.) but I could not find a way to make this work.
At the moment I am working around the problem by extracting the data frame with ggplot_build, but I would like to find a direct solution.

Is
require(plyr)
max.line = ddply(test, .(k), summarise, ce = max(ce))
plot = ggplot(test, aes(y=ce,x=k))
plot = plot + geom_line(aes(lty=comp))
plot = plot + geom_line(data=max.line, color='red')
something like what you want?

Thanks to JLLagrange and jlhoward for your help. However, both solutions require access to the underlying data.frame, which I do not have. This is the workaround I am using, based on the previous example:
data=ggplot_build(plot)$data[[1]]
cemax=with(data,aggregate(y,by=list(x),max))
plot+geom_line(data=cemax,aes(x=Group.1,y=x),colour="green",alpha=.3,lwd=2)
This does not require direct access to the dataset, but to me it is a very inefficient and inelegant solution. Obviously if there is no other way to manipulate the data, I do not have much of a choice :)

EDIT (Response to OP's comment):
OK I see what you mean now - I should have read your question more carefully. You can achieve what you want using stat_summary(...), which does not require access to the original data frame. It also solves the problem I describe below (!!).
library(ggplot2)
set.seed(1)
test <- data.frame(k=rep(1:3,3),ce=rnorm(9),comp=factor(rep(1:3,each=3)))
plot <- ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
##
# fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
plot + stat_summary(fun=max, geom="line", colour="red")
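One way to check what stat_summary(...) computes, without touching the original data frame, is to inspect the layer data of the built plot. The sketch below assumes ggplot2 3.3.0 or later (for the fun argument) and uses layer_data(); the object names p, summary_layer and expected are only illustrative.

```r
library(ggplot2)

set.seed(1)
test <- data.frame(k = rep(1:3, 3), ce = rnorm(9), comp = factor(rep(1:3, each = 3)))
p <- ggplot(test, aes(y = ce, x = k)) +
  geom_line(aes(lty = comp)) +
  stat_summary(fun = max, geom = "line", colour = "red")

# Layer 2 is the stat_summary layer; its y column holds one summarised value per x.
summary_layer <- layer_data(p, 2)
summary_layer <- summary_layer[order(summary_layer$x), ]

# Reference values computed directly from the data, for comparison only.
expected <- as.numeric(tapply(test$ce, test$k, max))
print(all.equal(summary_layer$y, expected))
```

Because the lty=comp mapping lives only in the geom_line layer, the summary layer is not split by comp and yields a single value per k.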
Original Response (Requires access to original df)
One unfortunate characteristic of ggplot is that aggregating functions (like max, min, mean, sd, sum, and so on) used in aes(...) operate on the whole dataset, not subgroups. So
plot + geom_line(aes(y=max(ce)))
will not work - you get the maximum of all test$ce vs. k, which is not useful.
One way around this, which is basically the same as @JLLagrange's answer (but doesn't use external libraries), is:
plot+geom_line(data=aggregate(ce~k,test,max),colour="red")
This creates a data frame dynamically aggregating ce by k using the max function.
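The difference between the two behaviours can be seen without plotting at all. A minimal sketch in base R, reusing the test data from above (the names overall_max and max_by_k are illustrative):

```r
set.seed(1)
test <- data.frame(k = rep(1:3, 3), ce = rnorm(9), comp = factor(rep(1:3, each = 3)))

# What aes(y = max(ce)) effectively sees: a single value for the whole dataset.
overall_max <- max(test$ce)

# What the red line needs: one maximum per value of k.
max_by_k <- aggregate(ce ~ k, test, max)

print(overall_max)
print(max_by_k)
```

The second object has one row per k, which is why it can be handed to geom_line as its own data argument.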

Related

R: how to make multiple plots from one CSV, grouping by a column

I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and, from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.
Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2.
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
require(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (dat$b), and that it should group (draw separate lines) by the a column (dat$a). Colour specifies that you want to colour by the a column as well.
The resulting graph (image not included) shows one coloured line per group.
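For comparison, the base-graphics route mentioned at the start of this answer can be sketched as follows, using the sample CSV from the question. This is only one possible approach; the names csv_text, users and cols are illustrative, and the three-colour palette is an arbitrary choice that is recycled if more users appear.

```r
csv_text <- "user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852"

data <- read.csv(text = csv_text)
data <- data[order(data$user, data$size), ]   # sort along the x axis within each user

users <- as.character(unique(data$user))
cols  <- rep_len(c("red", "blue", "green"), length(users))

# Empty plot spanning the full range, so no user's data is cut off.
plot(range(data$size), range(data$time), type = "n", xlab = "size", ylab = "time")
for (i in seq_along(users)) {
  d <- data[data$user == users[i], ]
  lines(d$size, d$time, type = "o", col = cols[i])
}
legend("topleft", legend = users, col = cols, lty = 1, pch = 1)
```

Since users is derived from the data, a new label in the CSV adds a line and a legend entry without any change to the code.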

Why do I get an error when trying to create an overlapping density plot using ggplot2?

I want to create an overlapping density plot. I decided to use ggplot2.
My data are in data frame format.
Here is how they look:
Ge<-data.frame(Ge)
dim(Ge)
#[1] 100 1
Ge[1:4,]
#[1] 6.005409 38.681342 102.079283 185.672611
dim(Tr)
#[1] 100 1
Tr[1:4,]
#[1] 12.8678547 1.3034715 1.1372413 0.7973491
Here is my code to create plot:
library(ggplot2)
ggplot() + geom_density(aes(x=x), colour="red", data=Tr) +
geom_density(aes(x=x), colour="blue", data=Ge)
But this is the error I get:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: stat_density requires the following missing aesthetics: x
Could someone help me solve this?
You should be using a single data frame wherever possible with ggplot. That is the logic behind the syntax, though it is unintuitive at first. In your sample code, Tr and Ge are really two levels of one grouping factor, and there is a single set of values which you're representing on a common x-axis.
The reshape2 package has a handy tool for combining separate data into a format suitable for ggplot plotting, melt. Check out the package documentation, but see below for working code and a sample output.
require(ggplot2)
require(reshape2)
Ge=runif(n=100)
Tr=runif(n=100)
data=data.frame(Ge,Tr)
names(data)=c('Ge','Tr')
data=melt(data,id.vars=NULL)
ggplot(data,aes(x=value,fill=variable))+geom_density(alpha=.4)
There is a book by Hadley Wickham which covers all of this information in excellent detail.
Update
I have more closely replicated the OP's code (straying away from best practices) and still get a functional plot, though with a warning.
Ge=data.frame(runif(n=100))
Tr=data.frame(runif(n=120))
ggplot()+geom_density(aes(data=Ge,x=Ge[,1]),color='red')+
geom_density(aes(data=Tr,x=Tr[,1]),color='blue')
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
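When the two samples have different lengths, as in the update (100 and 120 values), data.frame(Ge, Tr) followed by melt is no longer an option. A minimal base-R sketch of building the single long data frame directly; the name long is illustrative, and the column names value and variable are chosen to match what melt produced earlier:

```r
set.seed(1)
Ge <- runif(100)
Tr <- runif(120)  # different length on purpose

# One row per observation, with a label saying which sample it came from.
long <- data.frame(
  value    = c(Ge, Tr),
  variable = rep(c("Ge", "Tr"), times = c(length(Ge), length(Tr)))
)
print(table(long$variable))
```

The result can be passed to the same call as before, ggplot(long, aes(x=value, fill=variable)) + geom_density(alpha=.4), with no data frames inside aes(...).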

How to plot one column vs the rest in R

I have a data set where the [,1] is time and then the next 14 are magnitudes. I would like to scatter plot all the magnitudes vs time on one graph, where each different column is gridded (layered on top of one another)
I want to use the raw data to make these graphs. I can make them separately, but I would like to only have to do this process once.
data set called A, the only independent variable is time (the first column)
df<-data.frame(time=A[,1],V11=A[,2],V08=A[,3],
V21=A[,4],V04=A[,5],V22=A[,6],V23=A[,7],
V24=A[,8],V25=A[,9],V07=A[,10],xxx=A[,11],
V26=A[,12],PV2=A[,13],V27=A[,14],V28=A[,15],
NV1=A[,16])
I tried the code mentioned by @VlooO, but it scrunched the graphs, making them too hard to decipher, and each had its own axes. All my graphs can be on the same axes, just separated by their headings.
Looking at ggplot2, I think it would be a perfect tool for what I want.
ggplot(data=df.melt,aes(x=time,y=???))
I am confused about what my y should be, since I want to reference each different column.
Thanks R community
I hope I understand you correctly:
df<-data.frame(time=rnorm(10),A=rnorm(10),B=rnorm(10),C=rnorm(10))
par(mfrow=c(length(df)-1,1))
sapply(2:length(df), function(x){
plot(df[,c(1,x)])
})
The result would be one panel per magnitude column, stacked vertically (image not included).
Here are some hints, since you provide neither a reproducible example nor what you have tried:
Use list.files to go through all your documents
Use lapply to loop over the result of the previous step and read your data
Put your data in the long format using melt from reshape2 and the variable time as id.
Use ggplot2 to plot using the variable as aes color/group.
library(ggplot2)
library(reshape2)
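invisible(lapply(list.files(pattern=...), function(x) {
  # header=TRUE assumes each file has a header row that includes a 'time' column
  dt = read.table(x, header=TRUE)
  dt.l = melt(dt,id.vars='time')
  print(ggplot(dt.l)+geom_line(aes(x=time,y=value,color=variable)))
}))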
If you don't need ggplot2, then the matplot function for base graphics can be used to do what you want in one command.
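A minimal sketch of the matplot route, using made-up data in place of the asker's data set A (the column names A, B and C and the plotting symbols are arbitrary choices):

```r
set.seed(42)
df <- data.frame(time = 1:10, A = rnorm(10), B = rnorm(10), C = rnorm(10))

# Every non-time column is drawn against time on one shared set of axes.
y <- as.matrix(df[, -1])
matplot(df$time, y, type = "p", pch = 1:3, col = 1:3,
        xlab = "time", ylab = "magnitude")
legend("topright", legend = colnames(y), pch = 1:3, col = 1:3)
```

This puts all series in a single panel; it does not give each column its own heading the way faceting does.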
SOLUTION:
After looking through a bunch more problems and playing around a bit more with ggplot2 I found a code that works pretty great. After I made my data frame (stated above), here is what i did
df.m <- melt(df, "time")
ggplot(df.m, aes(time, value, colour = variable)) + geom_line() +
  facet_wrap(~ variable, ncol = 2)
I would post the image but I don't have enough reputation points yet.
I still don't really understand why "value" is placed into the y position in aes(time, value,...). If anyone could provide an explanation, that would be greatly appreciated. My last question is whether anyone knows how to make the subgraph titles smaller.
Can I use cex.lab=, cex.main= in ggplot2?
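On the question of why value ends up as y: melt reshapes the wide data frame into a long one with a single column of measurements. A minimal base-R sketch of that reshaping, on a tiny made-up data frame (only the column names time, V11 and V08 are taken from the question; the numbers are invented):

```r
df <- data.frame(time = 1:3, V11 = c(10, 11, 12), V08 = c(20, 21, 22))

# Base-R equivalent of melt(df, "time"): stack every non-time column.
value_cols <- setdiff(names(df), "time")
df.m <- data.frame(
  time     = rep(df$time, times = length(value_cols)),
  variable = factor(rep(value_cols, each = nrow(df)), levels = value_cols),
  value    = unlist(df[value_cols], use.names = FALSE)
)
print(df.m)
```

Every magnitude now lives in the single column value, with variable recording which original column it came from, so aes(time, value, colour = variable) maps that one column to y. As for the titles, facet label size in ggplot2 is set through the theme, for example theme(strip.text = element_text(size = 8)), rather than through cex.lab or cex.main.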

Plotting content from multiple data frames into a single ggplot2 surface

I am a total R beginner here, with corresponding level of sophistication of this question.
I am using the ROCR package in R to generate plotting data for ROC curves. I then use ggplot2 to draw the plot. Something like this:
library(ggplot2)
library(ROCR)
inputFile <- read.csv("path/to/file", header=FALSE, sep=" ", colClasses=c('numeric','numeric'), col.names=c('score','label'))
predictions <- prediction(inputFile$score, inputFile$label)
auc <- performance(predictions, measure="auc")@y.values[[1]]
rocData <- performance(predictions, "tpr","fpr")
rocDataFrame <- data.frame(x=rocData@x.values[[1]],y=rocData@y.values[[1]])
rocr.plot <- ggplot(data=rocDataFrame, aes(x=x, y=y)) + geom_path(size=1)
rocr.plot <- rocr.plot + geom_text(aes(x=1, y= 0, hjust=1, vjust=0, label=paste(sep = "", "AUC = ",round(auc,4))),colour="black",size=4)
This works well for drawing a single ROC curve. However, what I would like to do is read in a whole directory worth of input files - one file per classifier test results - and make a ggplot2 multifaceted plot of all the ROC curves, while still printing the AUC score into each plot.
I would like to understand what is the "proper" R-style approach to accomplishing this. I am sure I can hack something together by having one loop go through all files in the directory and create a separate data frame for each, and then having another loop to create multiple plots, and somehow getting ggplot2 to output all these plots onto the same surface. However, that does not let me use ggplot2's built-in faceting, which I believe is the right approach. I am not sure how to get my data into proper shape for faceting use, though. Should I be merging all my data frames into a single one, and giving each merged chunk a name (e.g. filename) and faceting on that? If so, is there a library or recommended practice for making this happen?
Your suggestions are appreciated. I am still wrapping my head around the best practices in R, so I'd rather get expert advice instead of just hacking things up to make code that looks more like ordinary declarative programming languages that I am used to.
EDIT: The thing I am least clear on is whether, when using ggplot2's built-in faceting capabilities, I'd still be able to output a custom string (AUC score) into each plot it will generate.
Here is an example of how to generate a plot as you described. I use the built-in dataset quakes:
The code does the following:
Load the ggplot2 and plyr packages
Add a facet variable to quakes - in this case I summarise by depth of earthquake
Use ddply to summarise the mean magnitude for each depth
Use ggplot with geom_text to label the mean magnitude
The code:
library(plyr)
library(ggplot2)
quakes$level <- cut(quakes$depth, 5,
labels=c("Very Shallow", "Shallow", "Medium", "Deep", "Very Deep"))
quakes.summary <- ddply(quakes, .(level), summarise, mag=round(mean(mag), 1))
ggplot(quakes, aes(x=long, y=lat)) +
geom_point(aes(colour=mag)) +
geom_text(aes(label=mag), data=quakes.summary, x=185, y=-35) +
facet_grid(~level) +
coord_map()
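To apply the same pattern to several ROC curves, the per-file data frames first need to be stacked into one, with a column identifying the source. The sketch below uses base R and two made-up curves in place of real ROCR output; the names curves, all_curves, labels and trapezoid are hypothetical, and the trapezoidal rule merely stands in for the AUC that ROCR would report.

```r
# Hypothetical stand-ins for per-file ROC data frames (columns x and y).
x <- seq(0, 1, by = 0.1)
curves <- list(
  classifier_a = data.frame(x = x, y = sqrt(x)),
  classifier_b = data.frame(x = x, y = x)
)

# Stack them, tagging each row with the name of the list element it came from.
all_curves <- do.call(rbind, Map(function(d, name) cbind(source = name, d),
                                 curves, names(curves)))
rownames(all_curves) <- NULL

# One label row per facet, playing the role of quakes.summary above.
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
labels <- data.frame(
  source = names(curves),
  auc    = sapply(curves, function(d) trapezoid(d$x, d$y)),
  row.names = NULL
)
print(labels)
```

all_curves would then go to ggplot with a facet on source, and labels to geom_text(data = labels, ...), in the same way quakes and quakes.summary are used in the answer.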

ggplot2 - possible to reorder x's by value of computed y (stat_summary)?

Is it possible to reorder x values using a computed y via stat_summary?
I would think that this should work:
stat_summary( aes( x = reorder( XVarName , ..y.. ) ) )
but I get the following error:
"Error: stat_summary requires the following missing aesthetics: x"
I've seen a number of your posts, and I think this may be helpful for you. When generating a plot, always save it to a unique variable.
Create your plots without regard for ordering at first, until you're comfortable just creating the plots. Then, work your way into the structure of the ggplot objects to get a better understanding of what's in them. Then, figure out what you should be sorting.
plot1 <- ggplot() + ...
You can push plots to the viewport by typing out the object name that you've saved them to:
plot1
Creating a ggplot object (or variable) allows you the opportunity to review the structure of the plot. Which, incidentally, can answer a number of the questions that you've been having so far.
str(plot1)
It is still fairly simple to reorder a plot after you've saved it as a variable/object, albeit with slightly longer names:
plot1$data$variable_tobe_recoded <- factor(...)
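What goes inside factor(...) depends on the desired order. One option, sketched here on made-up data, is base R's reorder(), which sorts the levels of a factor by a summary of another variable; the data frame d and its columns g and y are hypothetical.

```r
d <- data.frame(g = rep(c("a", "b", "c"), each = 3),
                y = c(5, 6, 7, 1, 2, 3, 8, 9, 10))

# Order the levels of g by the mean of y within each level.
d$g <- reorder(d$g, d$y, FUN = mean)
print(levels(d$g))
```

The group means are 6, 2 and 9, so the levels come out as b, a, c. Doing this on the data before plotting, or on the data stored in the saved plot object, avoids having to refer to a computed y inside aes(...).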
