R: Plot multiple box plots using columns from data frame

R: Plot multiple box plots using columns from data frame - r

I would like to plot an INDIVIDUAL box plot for each unrelated column in a data frame. I thought I was on the right track with boxplot.matrix from the sfsmsic package, but it seems to do the same as boxplot(as.matrix(plotdata) which is to plot everything in a shared boxplot with a shared scale on the axis. I want (say) 5 individual plots.
I could do this by hand like:
par(mfrow=c(2,2))
boxplot(data$var1
boxplot(data$var2)
boxplot(data$var3)
boxplot(data$var4)
But there must be a way to use the data frame columns?
EDIT: I used iterations, see my answer.

You could use the reshape package to simplify things
data <- data.frame(v1=rnorm(100),v2=rnorm(100),v3=rnorm(100), v4=rnorm(100))
library(reshape)
meltData <- melt(data)
boxplot(data=meltData, value~variable)
or even then use ggplot2 package to make things nicer
library(ggplot2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")

From ?boxplot we see that we have the option to pass multiple vectors of data as elements of a list, and we will get multiple boxplots, one for each vector in our list.
So all we need to do is convert the columns of our matrix to a list:
m <- matrix(1:25,5,5)
boxplot(x = as.list(as.data.frame(m)))
If you really want separate panels each with a single boxplot (although, frankly, I don't see why you would want to do that), I would instead turn to ggplot and faceting:
m1 <- melt(as.data.frame(m))
library(ggplot2)
ggplot(m1,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()

I used iteration to do this. I think perhaps I wasn't clear in the original question. Thanks for the responses none the less.
par(mfrow=c(2,5))
for (i in 1:length(plotdata)) {
boxplot(plotdata[,i], main=names(plotdata[i]), type="l")
}

Related

How to compare two histograms in R?

I want to compare two histograms in a graph in R, but couldn't imagined and implemented.
My histograms are based on two sub-dataframes and these datasets divided according to a type (Action, Adventure Family)
My first histogram is:
split_action <- split(df, df$type)
dataset_action <- split_action$Action
hist(dataset_action$year)
split_adventure <- split(df, df$type)
dataset_adventure <- split_adventure$Adventure
hist(dataset_adventure$year)
I want to see how much overlapping is occured, their comparison based on year in the same histogram. Thank you in advence.

Using the iris dataset, suppose you want to make a histogram of sepal length for each species. First, you can make 3 data frames for each species by subsetting.
irissetosa<-subset(iris,Species=='setosa',select=c('Sepal.Length','Species'))
irisversi<-subset(iris,Species=='versicolor',select=c('Sepal.Length','Species'))
irisvirgin<-subset(iris,Species=='virginica',select=c('Sepal.Length','Species'))
and then, make the histogram for these 3 data frames. Don't forget to set the argument "add" as TRUE (for the second and third histogram), because you want to combine the histograms.
hist(irissetosa$Sepal.Length,col='red')
hist(irisversi$Sepal.Length,col='blue',add=TRUE)
hist(irisvirgin$Sepal.Length,col='green',add=TRUE)
you will have something like this
Then you can see which part is overlapping...
But, I know, it's not so good.
Another way to see which part is overlapping is by using density function.
plot(density(irissetosa$Sepal.Length),col='red')
lines(density(irisversi$Sepal.Length),col='blue')
lines(density(irisvirgin$Sepal.Length,col='green'))
Then you will have something like this
Hope it helps!!

You don't need to split the data if using ggplot. The key is to use transparency ("alpha") and change the value of the "position" argument to "identity" since the default is "stack".
Using the iris dataset:
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram(binwidth=0.2, alpha=0.5, position="identity") +
theme_minimal()
It's not easy to see the overlap, so a density plot may be a better choice if that's the main objective. Again, use transparency to avoid obscuring overlapping plots.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_density(alpha=0.5) +
xlim(3.9,8.5) +
theme_minimal()
So for your data, the command would be something like this:
ggplot(data=df, aes(x=year, fill=type)) +
geom_histogram(alpha=0.5, position="identity")

Apply ggplot2 across columns

I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also expanded-grid to include different ````x``` possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.

Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg

graphing multiple data series in R ggplot

I am trying to plot (on the same graph) two sets of data versus date from two different data frames. Both data frames have the same exact dates for each of the two measurements. I would like to plot these two sets of data on the same graph, with different colors. However, I can't get them on the same graph at all. R is already reading the date as date. I tried this:
qplot( date , NO3, data=qual.arn)
+ qplot( qual.arn$date , qual.arn$DIS.O2, "O2(aq)" , add=T)
and received this error.
Error in add_ggplot(e1, e2, e2name) :
argument "e2" is missing, with no default
I tried using the ggplot function instead of qplot, but I couldn't even plot one graph this way.
ggplot(date=qual.no3.s, aes(date,NO3))
Error: ggplot2 doesn't know how to deal with data of class uneval
PLEASE HELP. Thank you!

Since you didn't provide any data (please do so in future), here's a made up dataset for demonstrate a solution. There are (at least) two ways to do this: the right way and the wrong way. Both yield equivalent results in this very simple case.
# set up minimum reproducible example
set.seed(1) # for reproducible example
dates <- seq(as.Date("2015-01-01"),as.Date("2015-06-01"), by=1)
df1 <- data.frame(date=dates, NO3=rpois(length(dates),25))
df2 <- data.frame(date=dates, DIS.O2=rnorm(length(dates),50,10))
ggplot is designed to use data in "long" format. This means that all the y-values (the concentrations) are in a single column, and there is separate column which identifies the corresponding category ("NO3" or "DIS.O2" in your case). So first we merge the two data-sets based on date, then use melt(...) to convert from "wide" (categories in separate columns) to "long" format. Then we let ggplot worry about legends, colors, etc.
library(ggplot2)
library(reshape2) # for melt(...)
# The right way: combine the data-sets, then plot
df.mrg <- merge(df1,df2, by="date", all=TRUE)
gg.df <- melt(df.mrg, id="date", variable.name="Component", value.name="Concentration")
ggplot(gg.df, aes(x=date, y=Concentration, color=Component)) +
geom_point() + labs(x=NULL)
The "wrong" way to do this is by making separate calls to geom_point(...) for each layer. In your particular case this might be simpler, but in the long run it's better to use the other method.
# The wrong way: plot two sets of points
ggplot() +
geom_point(data=df1, aes(x=date, y=NO3, color="NO2")) +
geom_point(data=df2, aes(x=date, y=DIS.O2, color="DIS.O2")) +
scale_color_manual(name="Component",values=c("red", "blue")) +
labs(x=NULL, y="Concentration")

Create Lollipop-like plot with R

I have a .csv file that looks like that:
Pos,ReadsME_016,ReadsME_017,ReadsME_018,ReadsME_019,ReadsME_020,ReadsME_021
95952794,62.36,62.06,55.56,51,60.35,44.27
95952795,100,100,100,100,100,100
95952833,0,0,-,0,-,-
95952846,0,0,-,0,0,-
95952876,0,-,0,0,0,0
95952877,38.89,28.98,25.67,36.99,37.91,16.86
95952878,100,100,100,100,100,100
95952884,0,-,0,-,-,0
95952897,18.7,20.52,20.94,16.43,22.68,12.55
95952898,100,100,75,80,-,100
95952902,10.88,8.93,10.22,10.63,13.51,6.06
95952903,100,100,100,75,-,100
95952915,10.75,8.7,7.91,8.35,15.12,8.88
What I want is to create a plot that is similar to this one:
http://www.scfbm.org/content/9/1/11/figure/F2
However, all my attempts failed. Unfortunately, the tool is yet not available and I cannot read the source code.
I've thought of ggplot and melt, but I do not come close to this graph. How can I achieve that all read samples (ReadsME_016,ReadsME_017,..) are listed on the x-axes and the positions are listed on the y-axes? I don’t know how to deal with both x- & y-axes being categorical while the plotted values should show percentages?
dataset <- melt(dataset, id.vars="Pos")
ggplot(dataset, aes(x=value, y=Pos, colour=variable)) + geom_point()
Here is the complete .csv file:
Pos,ReadsME_016,ReadsME_017,ReadsME_018,ReadsME_019,ReadsME_020,ReadsME_021,ReadsME_022,ReadsME_023,ReadsME_024,ReadsME_025,ReadsME_026,ReadsME_027,ReadsME_028,ReadsME_030,ReadsME_031,ReadsME_032
95952794,62.36,62.06,55.56,51.0,60.35,44.27,53.73,61.69,57.04,64.16,61.48,59.42,66.93,49.71,55.23,66.67
95952795,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,-,100.0,100.0,100.0,100.0,-
95952833,0.0,0.0,-,0.0,-,-,100.0,-,-,-,-,0.0,-,-,0.0,-
95952846,0.0,0.0,-,0.0,0.0,-,0.0,0.0,-,-,-,0.0,-,-,-,-
95952876,0.0,-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-
95952877,38.89,28.98,25.67,36.99,37.91,16.86,29.65,35.38,35.43,36.87,34.04,33.91,35.04,19.09,38.35,0.0
95952878,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,-,100.0,100.0,100.0,100.0,-
95952884,0.0,-,0.0,-,-,0.0,-,-,100.0,-,-,0.0,-,-,-,-
95952897,18.7,20.52,20.94,16.43,22.68,12.55,18.3,22.28,21.05,22.55,24.81,20.63,22.05,13.06,22.8,0.0
95952898,100.0,100.0,75.0,80.0,-,100.0,80.0,100.0,100.0,-,-,-,100.0,-,100.0,-
95952902,10.88,8.93,10.22,10.63,13.51,6.06,9.62,15.73,14.08,18.65,13.28,16.44,15.02,8.92,11.11,100.0
95952903,100.0,100.0,100.0,75.0,-,100.0,100.0,100.0,100.0,-,-,100.0,100.0,100.0,100.0,-
95952915,10.75,8.7,7.91,8.35,15.12,8.88,7.32,9.76,11.45,8.99,10.57,14.07,10.36,6.35,10.04,0.0
95952916,100.0,100.0,100.0,100.0,-,100.0,100.0,100.0,100.0,-,-,100.0,100.0,-,100.0,-
95952925,10.39,8.33,8.59,10.51,14.19,10.99,6.98,11.56,13.93,15.0,14.29,16.26,9.76,5.86,12.96,0.0
95952926,100.0,100.0,100.0,100.0,-,100.0,100.0,100.0,100.0,-,-,-,100.0,-,100.0,-
95952937,19.53,14.97,11.97,14.43,19.26,17.18,19.48,12.31,21.17,21.57,23.08,26.24,16.38,13.47,21.82,0.0
95952938,100.0,100.0,100.0,100.0,-,100.0,100.0,-,-,-,-,-,-,-,100.0,-
95952825,-,0.0,-,-,-,-,-,-,-,-,0.0,-,-,0.0,0.0,-
95952975,-,0.0,-,-,-,-,-,-,0.0,-,-,-,-,-,-,-
95952669,-,-,0.0,-,-,0.0,0.0,-,-,-,-,-,-,-,0.0,-
95952718,-,-,0.0,0.0,0.0,-,0.0,-,-,-,0.0,-,-,0.0,0.0,-
95952868,-,-,0.0,-,0.0,-,-,0.0,-,-,0.0,-,-,-,-,-
95952957,-,-,0.0,-,-,-,-,0.0,0.0,0.0,-,0.0,-,-,-,-
95952976,-,-,0.0,-,0.0,0.0,0.0,100.0,-,0.0,-,-,-,-,0.0,-
95952681,-,-,-,0.0,-,0.0,-,0.0,-,-,-,-,-,0.0,-,-
95952779,-,-,-,0.0,-,-,-,-,-,-,-,-,-,-,-,-
95952811,-,-,-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-,-,-,0.0,-
95952821,-,-,-,0.0,-,-,-,-,-,-,-,-,-,-,-,-
95952823,-,-,-,0.0,-,-,-,-,-,-,-,-,-,-,-,-
95952859,-,-,-,0.0,0.0,-,-,0.0,0.0,-,0.0,-,-,0.0,0.0,-
95952882,-,-,-,0.0,-,-,-,-,-,-,0.0,-,-,-,-,-
95953023,-,-,-,0.0,-,0.0,-,-,-,-,-,-,-,-,-,-
95953058,-,-,-,0.0,-,0.0,-,-,-,-,-,-,-,-,-,-
95952664,-,-,-,-,-,0.0,0.0,-,-,0.0,-,-,-,-,0.0,-
95952801,-,-,-,-,-,0.0,-,-,-,-,-,-,-,-,-,-
95952968,-,-,-,-,-,-,0.0,-,-,0.0,-,-,-,-,-,-
95952797,-,-,-,-,-,-,-,-,0.0,-,-,-,-,-,-,-
95952851,-,-,-,-,-,-,-,-,-,-,0.0,-,-,-,-,-
95952894,-,-,-,-,-,-,-,-,-,-,0.0,-,-,-,-,-
95952807,-,-,-,-,-,-,-,-,-,-,-,-,-,0.0,-,-
95952712,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0.0,-

First, you want to make sure you are reading in your data properly. You have non-numeric values (specifically "-") mixed in with numeric values. I'm assuming those are missing values. Make sure you let R know that with na.strings="-". Then, to get something more consistent with the example plot, i changed your variables around
library(reshape2) # for melt()
library(ggplot2) # for ggplot()
dataset <- read.table("file.txt", header=TRUE, sep=",", na.strings="-")
ggplot(melt(dataset, id.vars="Pos"),
aes(x=Pos, y=variable, colour=cut(value, breaks=5))) +
geom_point()

Animation, adding geom

I want to create some kind of animation with ggplot2 but it doesn't work as I want to. Here is a minimal example.
print(p <- qplot(c(1, 2),c(1, 1))+geom_point())
print(p <- p + geom_point(aes(c(1, 2),c(2, 2)))
print(p <- p + geom_point(aes(c(1, 2),c(3, 3)))
Adding extra points by hand is no problem. But now I want to do it in some loop to get an animation.
for(i in 4:10){
Sys.sleep(.3)
print(p <- p + geom_point(aes(c(1, ),c(i, i))))
}
But now only the new points added are shown, and points of the previous iterations are deleted. I want the old ones still to be visible. How can I do this?

Either of these will do what you want, I think.
# create df dynamically
for (i in 1:10) {
df <- data.frame(x=rep(1:2,i),y=rep(1:i,each=2))
Sys.sleep(0.3)
print(ggplot(df, aes(x,y))+geom_point() + ylim(0,10))
}
# create df at the beginning, then subset in the loop
df <- data.frame(x=rep(1:2,10), y=rep(1:10,each=2))
for (i in 1:10) {
Sys.sleep(0.3)
print(ggplot(df[1:(2*i),], aes(x,y))+geom_point() +ylim(0,10))
}
Also, your code will cause the y-axis limits to change for each plot. Using ylim(...) keeps all the plots on the same scale.
EDIT Response to OP's comment.
One way to create animations is using the animations package. Here's an example.
library(ggplot2)
library(animation)
ani.record(reset = TRUE) # clear history before recording
df <- data.frame(x=rep(1:2,10), y=rep(1:10,each=2))
for (i in 1:10) {
plot(ggplot(df[1:(2*i),], aes(x,y))+geom_point() +ylim(0,10))
ani.record() # record the current frame
}
## now we can replay it, with an appropriate pause between frames
oopts = ani.options(interval = 0.5)
ani.replay()
This will "record" each frame (using ani.record(...)) and then play it back at the end using ani.replay(...). Read the documentation for more details.
Regarding the question about why your code fails, the simple answer is: "this is not the way ggplot is designed to be used." The more complicated answer is this: ggplot is based on a framework which expects you to identify a default dataset as a data frame, and then associate (map) various aspects of the graph (aesthetics) with columns in the data frame. So if you have a data frame df with columns A and B, and you want to plot B vs. A, you would write:
ggplot(data=df, aes(x=A, y=B)) + geom_point()
This code identifies df as the dataset, and maps the aesthetic x (the horizontal axis) with column A and y with column B. Taking advantage of the default order of the arguments, you could also write:
ggplot(df, aes(A,B)) + geom_point()
It is possible to specify things other than column names in aes(...) but this can and often does lead to unexpected (even bizarre) results. Don't do it!.
The reason, basically, is that ggplot does not evaluate the arguments to aes(...) immediately, but rather stores them as expressions in a ggplot object, and evaluates them when you plot or print that object. This is why, for example, you can add layers to a plot and ggplot is able to dynamically rescale the x- and y-limits, something that does not work with plot(...) in base R.