R violin plot overlay 2 dataframes - r

Say I have two data frames:
M1 <- data.frame(sample(1:3, 500, replace = TRUE), ncol = 5)
M2 <- data.frame(sample(1:3, 500, replace = TRUE), ncol = 5)
and I want to overlay them as violin plots, as seen here:
Overlay violin plots ggplot2
but I have two data frames like the ones above (only bigger), not one data frame with 3 columns as in that example.
I have tried the advice to use melt, as seen here:
Violin plot of a data frame
but I can't get it to overlay two data frames.
Help is much appreciated.

Like this?
library(ggplot2)
library(reshape2)
set.seed(1)
M1 <- data.frame(matrix(sample(1:5, 500, replace = TRUE), ncol = 5))
M2 <- data.frame(matrix(sample(2:4, 500, replace = TRUE), ncol = 5))
M1.melt <- melt(M1)
M2.melt <- melt(M2)
ggplot() +
geom_violin(data=M1.melt, aes(x=variable,y=value),fill="lightblue",colour="blue")+
geom_violin(data=M2.melt, aes(x=variable,y=value),fill="lightgreen",colour="green")
There are several issues. First, data.frame(...) does not take an ncol argument, so your code just generates a pair of 2-column data frames, with the second column called ncol and all of its values equal to 5. If you want 5 columns (do you?), you have to use matrix(...) as above.
Second, you do need to use melt(...) to reorganize the data frames from "wide" format (categories in 5 different columns) to "long" format (all data in 1 column, called value, with categories distinguished by a second column, called variable).
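Both points can be checked directly. The following is a minimal sketch, assuming reshape2 is installed; the object names bad, good and good.melt are made up for illustration and are not part of the original code.

```r
library(reshape2)
set.seed(1)

# data.frame() treats ncol = 5 as just another column, recycled to 500 rows
bad <- data.frame(sample(1:3, 500, replace = TRUE), ncol = 5)
dim(bad)          # 500 rows, 2 columns
unique(bad$ncol)  # every value is 5

# matrix() does understand ncol, so this gives 100 rows and 5 columns
good <- data.frame(matrix(sample(1:3, 500, replace = TRUE), ncol = 5))
dim(good)

# melt() stacks the 5 columns into one 'value' column,
# with 'variable' recording the column each value came from
good.melt <- melt(good)
str(good.melt)
```

With no id variables, melt() treats every column as a measured variable and prints a message saying so.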
Another way to do this combines the two dataframes first:
M3 <- rbind(M1,M2)
M3$group <- rep(c("A","B"),each=100)
M3.melt <- melt(M3, id="group")
ggplot(M3.melt, aes(x=variable, y=value, fill=group)) +
geom_violin(position="identity")
Note that this generates a slightly different plot because ggplot scales the width of the violins together, whereas in the earlier plot they were scaled separately.
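If the shared width scaling is not wanted, one option is the scale argument of geom_violin(). The sketch below is only an illustration under that assumption: scale = "width" gives every violin the same maximum width, and alpha (added here, not in the original) makes the overlap visible.

```r
library(ggplot2)
library(reshape2)
set.seed(1)

M1 <- data.frame(matrix(sample(1:5, 500, replace = TRUE), ncol = 5))
M2 <- data.frame(matrix(sample(2:4, 500, replace = TRUE), ncol = 5))

# Combine the two data frames and label the rows by origin
M3 <- rbind(M1, M2)
M3$group <- rep(c("A", "B"), each = nrow(M1))
M3.melt <- melt(M3, id.vars = "group")

# scale = "width": each violin is scaled to the same maximum width
p <- ggplot(M3.melt, aes(x = variable, y = value, fill = group)) +
  geom_violin(position = "identity", scale = "width", alpha = 0.5)
print(p)
```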
EDIT (Response to OP's comment)
To put the fill colors in a legend, you have to make them part of an aesthetic scale: put fill=... inside the call to aes(...) as follows.
ggplot() +
geom_violin(data=M1.melt, aes(x=variable,y=value,fill="M1"),colour="blue")+
geom_violin(data=M2.melt, aes(x=variable,y=value,fill="M2"),colour="green")+
scale_fill_manual(name="Data Set",values=c(M1="lightblue",M2="lightgreen"))

Related

How to efficiently draw lots of graphs in R from data in a wide format?

I'm trying to draw 18 graphs using R and the ggplot2 package. My data look like this:
v1 v2 v3 ... v18 subject group
534 543 512 ... 410 1 (6.5, 18]
437 576 465 ... 420 2 (0, 6.5]
466 487 492 ... 501 3 (18, 55]
And I need to create a "faceted" histogram showing distributions for all of the groups in one frame (i.e. to conveniently present all of the subgroups' distributions) like this:
I came up with this code for a single plot:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
But since there are 18 variables (v1, v2,...), I'm looking for a way to write an efficient function/loop/command that would draw all the 18 graphs without me having to copy/paste and change the variable name 18 times. Like this:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = v2)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = v3)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
I know that the solution probably lies in looping and it seems like a useful skill to have, so I'm also using this opportunity to learn this right.
Thank you, any help is appreciated! (And thanks to all the suggestions so far!)
This is where I've gotten so far with the kind help of the user below:
for (v in c(v1,v2)) {
pdf("plots.pdf")
histograms <- ggplot(data = data, aes (x = v)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
print(histograms)
}
dev.off()
EDIT: A significantly revised answer is provided now that the needs have been clarified.
The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.
My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.
A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18". Each variable starts as random integers between 400 and 600 and is then divided by a small random integer to add variation.
# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data
# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))
Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.
# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)
# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify
# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)
# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)
In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables: one holding the data (values) and a factor indicating the variable from which each value came (ind). Typically they get renamed but I will simply work with the default names here.
As noted above, the simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function will recycle shorter arguments only when the longest length is an exact multiple of theirs. The following code takes advantage of this property to build the "long" data.frame.
# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
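To see what stack() and the recycling in data.frame() do, here is a tiny sketch in base R. The three-row data frame is made up for illustration; only the technique matches the code above.

```r
# A toy "wide" data set: one id column and two measured variables
wide <- data.frame(id = c("a", "b", "c"),
                   v01 = c(1, 2, 3),
                   v02 = c(4, 5, 6))

# Columns whose names start with 'v' are the ones to stack
sel <- grepl("^v", names(wide))

# stack() returns a data frame with 'values' and a factor 'ind'
stacked <- stack(wide[sel])

# The 3 id rows are recycled to match the 6 stacked rows (6 is a multiple of 3)
long <- data.frame(wide[!sel], stacked)
long
```

The id column repeats as a, b, c, a, b, c, so each stacked value stays paired with the individual it came from.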
My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.
The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.
library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i)
ggplot(data = subset(long, ind == i), aes(x = values))
+ geom_histogram(bins = 10)
+ facet_wrap(~ group, nrow = 2)
+ labs(caption = paste("Variable", i)))
names(plotList) <- levels(long$ind) # name the list elements for convenience
Now to examine each of the 18 plots (this may not work in RStudio):
opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option
To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.
for (v in levels(long$ind)) {
fname <- paste(v, "pdf", sep = ".")
fname <- file.path("~", fname) # change this to specify a directory
pdf(fname, width = 6.5, height = 7, paper = "letter")
print(plotList[[v]])
dev.off()
}
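For comparison, the loop can also run over the column names of the wide data and write every plot to one multi-page PDF. This is only a sketch under assumptions: the small data frame df is a made-up stand-in for the real data, the output goes to a temporary directory, and the .data pronoun is used inside aes() to look up a column by its name.

```r
library(ggplot2)
set.seed(1)

# Hypothetical stand-in for the wide data (two variables, three groups)
df <- data.frame(v1 = rnorm(90, 500, 30),
                 v2 = rnorm(90, 480, 30),
                 group = gl(3, 30, labels = c("(0, 6.5]", "(6.5, 18]", "(18, 55]")))

# Loop over column *names*, not over the column values
vars <- grep("^v", names(df), value = TRUE)

out <- file.path(tempdir(), "plots.pdf")
pdf(out, width = 6.5, height = 7)  # open the device once, before the loop
for (v in vars) {
  p <- ggplot(df, aes(x = .data[[v]])) +
    geom_histogram(bins = 10) +
    facet_wrap(~ group, nrow = 2) +
    labs(x = v)
  print(p)                         # each print() adds one page
}
dev.off()                          # close the device once, after the loop
```

Opening the device inside the loop would overwrite the file on every iteration, which is why pdf() and dev.off() sit outside it here.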
And just to add another possible approach, here's a solution with lattice showing 6 groups of variables per plot. (Personally, I'm a fan of this simpler approach.)
library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
plot(histogram(~values | group + ind, data = long,
subset = ind %in% i, as.table = TRUE))
par(opar)

boxplots with missing values in R - ggplot

I am trying to make boxplots for a matrix (athTp) with 6 variables (columns) but with many missing values:
ggplot(athTp)+geom_boxplot()
But maybe I am doing something wrong...
I also tried to make several box plots and then arrange them in a grid, but the final plot was very small (at the desired dimensions), losing many of the details.
q1 <- ggplot(athTp,aes(x="V1", y=athTp[,1]))+ geom_boxplot()
..continue with other 5 columns
grid.arrange(q1,q2,q3,q4,q5,q6, ncol=6)
ggsave("plot.pdf",plot = qq, width = 8, height = 8, units = "cm")
Do you have any ideas?
Thanks in advance!
# ok so your data has 6 columns like this
set.seed(666)
dat <- data.frame(matrix(runif(60,1,20),ncol=6))
names(dat) <- letters[1:6]
head(dat)
# so let's get in long format like ggplot likes
library(reshape2)
longdat <- melt(dat)
head(longdat)
# and try your plot call again specifying that we want a box plot per column
# which is now indicated by the "variable" column
# [remember you should specify the x and y axes with `aes()`]
library(ggplot2)
ggplot(longdat, aes(x=variable, y=value)) + geom_boxplot(aes(colour = variable))
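The same approach should carry over to a matrix with missing values. The sketch below is an illustration only: athTp is filled with made-up numbers and random NAs, the names row and column are chosen here, and na.rm = TRUE is used so that geom_boxplot() drops the missing values without a warning.

```r
library(ggplot2)
library(reshape2)
set.seed(42)

# Hypothetical stand-in for the athTp matrix: 6 columns, 15 values set to NA
athTp <- matrix(runif(60, 1, 20), ncol = 6)
athTp[sample(60, 15)] <- NA

# melt() on a matrix returns the row index, the column index and the value
longdat <- melt(athTp, varnames = c("row", "column"))
longdat$column <- factor(longdat$column)

# One box per original column; missing values are removed silently
p <- ggplot(longdat, aes(x = column, y = value)) +
  geom_boxplot(na.rm = TRUE)

# Save the single plot object at the desired size
out <- file.path(tempdir(), "plot.pdf")
ggsave(out, plot = p, width = 8, height = 8, units = "cm")
```

Because all six boxes live in one plot, there is no need for grid.arrange(), and the saved figure uses the full 8 x 8 cm.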

Universal scale bar for paneled levelplots

I would like to have multiple heatmaps/levelplots in a single plot, with a universal scale bar. I have the plots arranged, and I think I'm close to the answer, but I want to make sure I don't mess the scale up.
#Fake data
library(lattice)   # provides levelplot()
library(gridExtra) # provides grid.arrange()
fill = rnorm(100,4)
matA = matrix(fill, ncol=10)
matB = matrix(fill * 2, ncol=10)
# Plotting
a=levelplot(matA, colorkey=FALSE)
b=levelplot(matB, colorkey=list(col=rainbow(1000), at=seq(0,6, length.out=1000)))
grid.arrange(a,b,ncol=2)
Thanks for any help!
Instead of using grid.arrange, you may rearrange your data so that you can use the formula method of levelplot (the x argument is then a formula). This allows you to easily create a plot with different panels based on a grouping variable g, with a common scale. Here g ('L1') corresponds to the different matrices.
library(reshape2)
library(lattice)
# put your matrices in a list and melt them to one data frame.
l <- list(matA, matB)
df <- melt(l)
# plot
levelplot(value ~ Var1 * Var2 | L1, data = df,
col.regions = rainbow(100))
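To make the shared scale explicit, the colour breaks can be computed from the range of all values and passed through the at argument. This is a sketch under assumptions: the list names A and B and the object breaks are introduced here for illustration.

```r
library(lattice)
library(reshape2)
set.seed(1)

# Same fake data as in the question
fill <- rnorm(100, 4)
matA <- matrix(fill, ncol = 10)
matB <- matrix(fill * 2, ncol = 10)

# Naming the list elements gives readable panel labels in L1
df <- melt(list(A = matA, B = matB))
df$L1 <- factor(df$L1)

# 101 break points spanning both matrices, so 100 colour bins are shared
breaks <- seq(min(df$value), max(df$value), length.out = 101)

p <- levelplot(value ~ Var1 * Var2 | L1, data = df,
               at = breaks, col.regions = rainbow(100))
print(p)
```

Both panels and the single colour key then use the same breaks, so equal colours mean equal values across the two matrices.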

R: Loop pairs of columns in a dataframe

Is it possible to plot pairs of columns in a single plot with a loop? For example, if I have a data frame of time series with 10 columns (x1, x2.. x10), I would like to create 5 plots: 1st plot will display x1 and x2, the 2nd plot would display x3 and x4 and so on.
Any plotting method would be useful, (zoo, lattice, ggplot2).
I got stuck at creating a loop to plot a single variable:
set.seed(1)
x<- data.frame(replicate(10,rnorm(10, mean = 0, sd = 1)))
cols <- seq(1,10)
library(zoo)
z <- read.zoo(x)
for (i in cols) {
plot(z[,i], screen = 1)
}
Thanks in advance.
How about this with ggplot2 and reshape2:
require(reshape2)
require(ggplot2)
m<-melt(matrix(z,10))
m$facet<-cut(m$Var2,c(0,2,4,6,8,10))
ggplot(m)+geom_line(aes(x=Var1,y=value,group=Var2,color=factor(Var2)))+facet_wrap(~ facet)
It can be done in a single line without a loop, as shown here; the col argument specifies that the odd series are black and the even ones are red. Note that z in the question has 9 columns (since the first column in x is used as the time index), so a 10-column z is used below instead, which is likely what was intended.
library(zoo)
# test data
set.seed(123); z <- zoo(matrix(rnorm(250), 25)); colnames(z) <- make.names(1:10)
plot(z, screen = rep(colnames(z)[c(TRUE, FALSE)], each = 2), col = 1:2)
The output is shown below. To produce a single column add the argument nc=1 or to produce a lattice plot replace plot with xyplot.
ADDED: lattice solution.
Like this? I am not sure how you want to plot it, though.
par(mfrow=c(1,5))
for (i in seq(1,10,by=2)){
plot(x[,i],x[,i+1])
}

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
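# qplot() is deprecated in recent ggplot2 releases; this is the ggplot() equivalent
ggplot(df2, aes(x = x, y = y)) + geom_point() + facet_wrap(~ faceter)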
Let me know if I've misunderstood your question.
-Chase
