Trouble visualizing with geom_boxplot()


I'm having some trouble visualizing what I am hoping to understand from a set of Proximity Ligation Assay Experiments (PLA). I was able to quantify dots within each cell to yield a table like this:
RawData Table
After a lot of wrangling, I was able to filter out data points considered outliers and was able to yield something close to what I was looking for:
However, the trouble is that I need the Y-axis to show a summary statistic, namely the number of objects counted, rather than the # of each individual object itself.
I have tried wrangling around with dplyr a bit using the following code, but have not been able to successfully reproduce a boxplot similar to the one above:
filtered_output_PLA_dots %>%
count(FileName_RawData) %>%
ggplot() +
geom_boxplot(aes(x = factor(FileName_RawData, levels = plot1_order), y = n))
Is there a better way to do this that I haven't looked into? Any advice would be great! Alternatively, some pointers to learning resources and tutorials would also be appreciated!
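One possible reading of the problem: count(FileName_RawData) returns a single n per file, so each box has only one value to summarise. A minimal sketch, assuming the table has some column identifying the cell each dot belongs to (called cell_id here, a hypothetical name, and the data below are simulated), would count per file and per cell before plotting:

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for filtered_output_PLA_dots: one row per detected dot.
# `cell_id` is a hypothetical column naming the cell each dot belongs to.
set.seed(42)
filtered_output_PLA_dots <- data.frame(
  FileName_RawData = sample(c("control.tif", "treated.tif"), 300, replace = TRUE),
  cell_id          = sample(1:10, 300, replace = TRUE)
)
plot1_order <- c("control.tif", "treated.tif")

# Count dots per cell within each image, so every image contributes
# many values of n (one per cell) instead of a single total.
dots_per_cell <- filtered_output_PLA_dots %>%
  count(FileName_RawData, cell_id)

p <- ggplot(dots_per_cell) +
  geom_boxplot(aes(x = factor(FileName_RawData, levels = plot1_order), y = n)) +
  labs(x = "Image", y = "Dots per cell")
print(p)
```

If the real table uses a different grouping column, only the second argument to count() should need to change.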

Related

Making individual histograms for multiple categories from one sheet in R

I have a data set with multiple categories of study type for pond data. The column of overall categories is organized with each type having individual values that follow. I can make a histogram for each when I produce individual sheets to use. I have dug around for a while, but cannot find how to make the same histogram for the study types from the overall data set.
Piece of data sheet that I am working with. As you can see, there are multiple study types that we have each with their own data.
Basically, I want to pull each individual study type and the num_divided to make a histogram for the types. My end goal is to make one image with the 9 different histograms stacked above one another. Each having the same x-axis values and their individual names on the left-hand side.
The trouble I am running into is that when I make the histograms from the separated sheets, I cannot make the stacked image I want. I apologize in advance if this lacks some information, but I also thank anyone that offers advice.

ggplot2 is the best option.
You didn't give reproducible data but it's easy to make some. Here are 9 studies each with 100 values:
set.seed(111)
dat <- data.frame(study = rep(letters[1:9], each = 100), num_divided = rnorm(900))
What you want is a facetted plot.
library(ggplot2)
ggplot(dat, aes(x = num_divided)) + geom_histogram() + facet_grid(study ~ .)
If you don't know much about ggplot2, a good starting point is the R Cookbook.

R: how to make multiple plots from one CSV, grouping by a column

I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.

Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2.
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
require(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (dat$b) and to group (draw separate lines) by the a column (dat$a). Colour specifies that you want to group colour by the a column as well.
The resulting graph has one coloured line, with points, for each value of a, and a legend mapping colours to those values.
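Applied to the CSV from the question, the same idea might look like the sketch below. The rows are inlined with read.csv(text = ...) only so the example is self-contained; reading from 'data.csv' as in the question should behave the same way.

```r
library(ggplot2)

# The rows from the question's CSV, inlined so the example runs on its own.
csv_text <- "user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852"
data <- read.csv(text = csv_text)

# geom_line() connects points in order of x, so no manual sorting is needed.
# Mapping colour to user gives one line per user and a legend automatically.
p <- ggplot(data, aes(x = size, y = time, group = user, colour = user)) +
  geom_line() +
  geom_point()
print(p)
```

Because the colour and grouping come from the user column, a new user in the file gets a new line and legend entry without any change to the code.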

Transformation of aesthetic inputs in R and ggplot2

Is there a way to transform data in ggplot2 in the aes declaration of a geom?
I have a plot conceptually similar to this one:
test=data.frame("k"=rep(1:3,3),"ce"=rnorm(9),"comp"=as.factor(sort(rep(1:3,3))))
plot=ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
Suppose I would like to add a line calculated as the maximum of the values between the three comp for each k point, with only the plot object available. I have tried several options (e.g. using aggregate in the aes declaration, stat_function, etc.) but I could not find a way to make this work.
At the moment I am working around the problem by extracting the data frame with ggplot_build, but I would like to find a direct solution.

Is
require(plyr)
max.line = ddply(test, .(k), summarise, ce = max(ce))
plot = ggplot(test, aes(y=ce,x=k))
plot = plot + geom_line(aes(lty=comp))
plot = plot + geom_line(data=max.line, color='red')
something like what you want?

Thanks to JLLagrange and jlhoward for your help. However both solutions require the access to the underlying data.frame, which I do not have. This is the workaround I am using, based on the previous example:
data=ggplot_build(plot)$data[[1]]
cemax=with(data,aggregate(y,by=list(x),max))
plot+geom_line(data=cemax,aes(x=Group.1,y=x),colour="green",alpha=.3,lwd=2)
This does not require direct access to the dataset, but to me it is a very inefficient and inelegant solution. Obviously if there is no other way to manipulate the data, I do not have much of a choice :)

EDIT (Response to OP's comment):
OK I see what you mean now - I should have read your question more carefully. You can achieve what you want using stat_summary(...), which does not require access to the original data frame. It also solves the problem I describe below (!!).
library(ggplot2)
set.seed(1)
test <- data.frame(k=rep(1:3,3),ce=rnorm(9),comp=factor(rep(1:3,each=3)))
plot <- ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
##
plot + stat_summary(fun=max, geom="line", col="red")  # use fun.y=max in ggplot2 versions before 3.3.0
Original Response (Requires access to original df)
One unfortunate characteristic of ggplot is that aggregating functions (like max, min, mean, sd, sum, and so on) used in aes(...) operate on the whole dataset, not subgroups. So
plot + geom_line(aes(y=max(ce)))
will not work - you get the maximum of all test$ce vs. k, which is not useful.
One way around this, which is basically the same as @JLLagrange's answer (but doesn't use external libraries), is:
plot+geom_line(data=aggregate(ce~k,test,max),colour="red")
This creates a data frame dynamically aggregating ce by k using the max function.

How to plot one column vs the rest in R

I have a data set where the [,1] is time and then the next 14 are magnitudes. I would like to scatter plot all the magnitudes vs time on one graph, where each different column is gridded (layered on top of one another)
I want to use the raw data to make these graphs and can make them separately, but would like to only have to do this process once.
data set called A, the only independent variable is time (the first column)
df<-data.frame(time=A[,1],V11=A[,2],V08=A[,3],
V21=A[,4],V04=A[,5],V22=A[,6],V23=A[,7],
V24=A[,8],V25=A[,9],V07=A[,10],xxx=A[,11],
V26=A[,12],PV2=A[,13],V27=A[,14],V28=A[,15],
NV1=A[,16])
I tried the code mentioned by @VlooO but it scrunched the graphs, making them too hard to decipher, and each had its own axes. All my graphs can be on the same axes, just separated by their headings.
Looking at ggplot2, I think it would be a perfect tool for what I want.
ggplot(data=df.melt,aes(x=time,y=???))
I am confused about what my y should be, since I want to reference each different column.
Thanks R community

Hope i understand you correctly:
df<-data.frame(time=rnorm(10),A=rnorm(10),B=rnorm(10),C=rnorm(10))
par(mfrow=c(length(df)-1,1))
sapply(2:length(df), function(x){
plot(df[,c(1,x)])
})
The result would be

Here are some hints, since you don't provide a reproducible example or show what you have tried:
Use list.files to go through all your documents
Use lapply to loop over the result of the previous step and read your data
Put your data in the long format using melt from reshape2 and the variable time as id.
Use ggplot2 to plot using the variable as aes color/group.
library(ggplot2)
library(reshape2)
invisible(lapply(list.files(pattern=...), function(x) {
  # read one file, reshape it to long format, and plot it
  dt = read.table(x)
  dt.l = melt(dt, id.vars='time')
  print(ggplot(dt.l) + geom_line(aes(x=time, y=value, color=variable)))
}))

If you don't need ggplot2, then the matplot function for base graphics can be used to do what you want in one command.
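A minimal sketch of the matplot approach, using a simulated data frame as a stand-in for the real one (the column names A, B and C are placeholders):

```r
# Simulated stand-in for df: a time column plus three magnitude columns.
set.seed(1)
df <- data.frame(time = 1:10, A = rnorm(10), B = rnorm(10), C = rnorm(10))

# Plot every column except the first against time, one colour per column.
mags <- as.matrix(df[, -1])
cols <- seq_len(ncol(mags))
matplot(df$time, mags, type = "p", pch = 1, col = cols,
        xlab = "time", ylab = "magnitude")
legend("topright", legend = colnames(mags), col = cols, pch = 1)
```

All series share one set of axes here; if separate panels are needed, the ggplot2 faceting route is probably the better fit.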

SOLUTION:
After looking through a bunch more problems and playing around a bit more with ggplot2 I found a code that works pretty great. After I made my data frame (stated above), here is what i did
library(reshape2)
library(ggplot2)
df.m <- melt(df, "time")
ggplot(df.m, aes(time, value, colour = variable)) + geom_line() +
  facet_wrap(~ variable, ncol = 2)
I would post the image but I don't have enough reputation points yet.
I still don't really understand why "value" is placed in the y position in aes(time, value, ...). If anyone could provide an explanation, that would be greatly appreciated. My last question is whether anyone knows how to make the subgraph titles smaller.
Can I use cex.lab=, cex.main= in ggplot2?
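A possible explanation, sketched with simulated data since the original data set A isn't shown: melt() stacks all non-id columns into a single column named value, and records the source column name in variable, which is why value goes in the y position. ggplot2 does not use the base-graphics cex.lab= or cex.main= arguments; the facet title size can instead be set through theme(strip.text = ...). The column names below are borrowed from the question, and the size of 8 is an arbitrary choice.

```r
library(reshape2)
library(ggplot2)

# Simulated stand-in for df: a time column plus three magnitude columns.
set.seed(1)
df <- data.frame(time = 1:10, V11 = rnorm(10), V08 = rnorm(10), V21 = rnorm(10))

# melt() stacks every non-id column into two new columns:
# `variable` (the original column name) and `value` (the number it held).
df.m <- melt(df, id.vars = "time")
print(head(df.m, 3))

p <- ggplot(df.m, aes(time, value, colour = variable)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 2) +
  theme(strip.text = element_text(size = 8))  # smaller facet titles
print(p)
```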

Plotting content from multiple data frames into a single ggplot2 surface

I am a total R beginner here, with corresponding level of sophistication of this question.
I am using the ROCR package in R to generate plotting data for ROC curves. I then use ggplot2 to draw the plot. Something like this:
library(ggplot2)
library(ROCR)
inputFile <- read.csv("path/to/file", header=FALSE, sep=" ", colClasses=c('numeric','numeric'), col.names=c('score','label'))
predictions <- prediction(inputFile$score, inputFile$label)
auc <- performance(predictions, measure="auc")@y.values[[1]]
rocData <- performance(predictions, "tpr","fpr")
rocDataFrame <- data.frame(x=rocData@x.values[[1]],y=rocData@y.values[[1]])
rocr.plot <- ggplot(data=rocDataFrame, aes(x=x, y=y)) + geom_path(size=1)
rocr.plot <- rocr.plot + geom_text(aes(x=1, y= 0, hjust=1, vjust=0, label=paste(sep = "", "AUC = ",round(auc,4))),colour="black",size=4)
This works well for drawing a single ROC curve. However, what I would like to do is read in a whole directory worth of input files - one file per classifier test results - and make a ggplot2 multifaceted plot of all the ROC curves, while still printing the AUC score into each plot.
I would like to understand what is the "proper" R-style approach to accomplishing this. I am sure I can hack something together by having one loop go through all files in the directory and create a separate data frame for each, and then having another loop to create multiple plots, and somehow getting ggplot2 to output all these plots onto the same surface. However, that does not let me use ggplot2's built-in faceting, which I believe is the right approach. I am not sure how to get my data into proper shape for faceting use, though. Should I be merging all my data frames into a single one, and giving each merged chunk a name (e.g. filename) and faceting on that? If so, is there a library or recommended practice for making this happen?
Your suggestions are appreciated. I am still wrapping my head around the best practices in R, so I'd rather get expert advice instead of just hacking things up to make code that looks more like ordinary declarative programming languages that I am used to.
EDIT: The thing I am least clear on is whether, when using ggplot2's built-in faceting capabilities, I'd still be able to output a custom string (AUC score) into each plot it will generate.

Here is an example of how to generate a plot as you described. I use the built-in dataset quakes:
The code does the following:
Load the ggplot2 and plyr packages
Add a facet variable to quakes - in this case I summarise by depth of earthquake
Use ddply to summarise the mean magnitude for each depth
Use ggplot with geom_text to label the mean magnitude
The code:
library(plyr)
library(ggplot2)
quakes$level <- cut(quakes$depth, 5,
labels=c("Very Shallow", "Shallow", "Medium", "Deep", "Very Deep"))
quakes.summary <- ddply(quakes, .(level), summarise, mag=round(mean(mag), 1))
ggplot(quakes, aes(x=long, y=lat)) +
geom_point(aes(colour=mag)) +
geom_text(aes(label=mag), data=quakes.summary, x=185, y=-35) +
facet_grid(~level) +
coord_map()
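The same pattern could be adapted to the ROC case roughly as follows. This is only a sketch under stated assumptions: the curves are simulated rather than computed with ROCR, the names classifier_a and classifier_b are hypothetical stand-ins for file names, and the AUC is approximated with the trapezoid rule on the simulated points. The key idea is one stacked data frame with a column to facet on, plus a second one-row-per-facet data frame holding the label text.

```r
library(ggplot2)

# Simulated stand-ins for per-file ROC data frames (x = fpr, y = tpr).
# In practice each would be built from one input file.
fpr <- seq(0, 1, by = 0.05)
curves <- list(
  classifier_a = data.frame(x = fpr, y = fpr^0.3),
  classifier_b = data.frame(x = fpr, y = fpr^0.6)
)

# Stack the frames, tagging each row with the name of its source.
roc_all <- do.call(rbind, lapply(names(curves), function(nm) {
  data.frame(classifier = nm, curves[[nm]])
}))

# One row of label text per facet, positioned at the bottom-right corner.
# The AUC here is a trapezoid-rule approximation on the simulated curve.
auc_labels <- do.call(rbind, lapply(names(curves), function(nm) {
  d <- curves[[nm]]
  auc <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
  data.frame(classifier = nm, x = 1, y = 0,
             label = paste0("AUC = ", round(auc, 4)))
}))

p <- ggplot(roc_all, aes(x = x, y = y)) +
  geom_path() +
  geom_text(data = auc_labels, aes(label = label), hjust = 1, vjust = 0) +
  facet_wrap(~ classifier)
print(p)
```

Because auc_labels carries the same classifier column as roc_all, each label lands in its own facet.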
