Plotting Bacteria according to Food Groups & Abundance in R - r

I have a dataframe that includes four bacteria types: R, B, P, Bi - this is in variable.x
value.y is their abundance and variable.y is various groups they are in.
I would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory". I have made 4 separate files that look like this:
Sample Bacteria Abundance Category Level
30841102 R 0.005293192 1 Low
30841102 P 0.000002570 1 Low
30841102 B 0.005813275 1 Low
30841102 Bi 0.000000000 1 Low
49812105 R 0.003298709 1 Low
49812105 P 0.000000855 1 Low
49812105 B 0.131147541 1 Low
49812105 Bi 0.000350086 1 Low
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
I have tried this code:
library(dplyr)
genus_veg %>% group_by(Genus, Abundance) %>% summarise(Abundance = sum(Abundance)) %>%
ggplot(aes(x = Level, y= Abundance, fill = Genus)) + geom_bar(stat="identity")
But get this error:
Error: cannot modify grouping variable
Any suggestions?

TL;DR Combine individual plots with cowplot
In another interpretation of the super unclear question, this time from:
Plotting Bacteria according to Food Groups & Abundance in R
and
would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory." I have made 4 separate files
You might be asking for:
You want a bar chart
You want 4 plots, one for each of the food categories
x-axis = bacteria type
y-axis = abundance of bacteria
Input
Let's say you have a data frame for each food category. (Again, I'm using dummy data.)
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
## If we have 4 data frames
filter_food <- function(dummydata, foodcat){
dummydata %>%
filter(food == foodcat) %>%
select(-food)
}
dd_fiber <- filter_food(dummydata, "FiberCategory")
dd_fruit <- filter_food(dummydata, "FruitCategory")
dd_veg <- filter_food(dummydata, "VegetablesCategory")
dd_grain <- filter_food(dummydata, "WholegrainCategory")
Where one data frame looks something like
#> dd_grain
# bacteria abundance
#1 R 0.02106203
#2 B 0.10073499
#3 P 0.06624655
#4 Bi 0.00775332
Plot
You can create separate plots. (Here, I'm using a function to generate my plots)
plot_food <- function(dd, title=""){
dd %>%
ggplot(aes(x = bacteria, y = abundance)) +
geom_bar(stat = "identity") +
ggtitle(title)
}
plt_fiber <- plot_food(dd_fiber, "fiber")
plt_fruit <- plot_food(dd_fruit, "fruit")
plt_veg <- plot_food(dd_veg, "veg")
plt_grain <- plot_food(dd_grain, "grain")
And then combine them using cowplot
cowplot::plot_grid(plt_fiber, plt_fruit, plt_veg, plt_grain)
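If the four per-category data frames are only needed for plotting, they could also be stacked back into one data frame and faceted, which avoids the separate cowplot step. This is a minimal sketch, assuming data frames shaped like dd_fiber above; the abundance values and the short food labels are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-ins for the four per-category data frames
bacteria <- c("R", "B", "P", "Bi")
dd_fiber <- data.frame(bacteria = bacteria, abundance = c(0.02, 0.02, 0.03, 0.09))
dd_fruit <- data.frame(bacteria = bacteria, abundance = c(0.03, 0.03, 0.03, 0.05))
dd_veg   <- data.frame(bacteria = bacteria, abundance = c(0.04, 0.01, 0.09, 0.03))
dd_grain <- data.frame(bacteria = bacteria, abundance = c(0.02, 0.10, 0.07, 0.01))

# Stack them; the list names end up in the new "food" column
dd_all <- bind_rows(
  list(fiber = dd_fiber, fruit = dd_fruit, veg = dd_veg, grain = dd_grain),
  .id = "food"
)

# One panel per food category, bacteria type on the x-axis
plt_all <- ggplot(dd_all, aes(x = bacteria, y = abundance)) +
  geom_col() +
  facet_wrap(~food)
plt_all
```

Faceting keeps the y-axis shared across panels by default, which makes the categories directly comparable.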

TL;DR Plotting by facets
How you posed the question is super unclear. So I have interpreted your question from
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
as:
You want a bar chart
You want 4 plots, one for each of the bacteria types: R, B, P, Bi
x-axis = food category
y-axis = abundance of bacteria
Input
Regarding the input data, some of it was unclear: e.g. you did not describe what "Sample", "Level", or "Category" are. Ideally, you would keep all the food categories in one data frame, e.g.
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
of which the output looks like:
#> dummydata
# bacteria food abundance
#1 R FiberCategory 0.021322691
#2 B FiberCategory 0.019182166
#3 P FiberCategory 0.031781431
#4 Bi FiberCategory 0.089764040
#5 R FruitCategory 0.026475389
#6 B FruitCategory 0.031023419
#7 P FruitCategory 0.034371453
#8 Bi FruitCategory 0.046916235
#9 R VegetablesCategory 0.038789068
#10 B VegetablesCategory 0.005269419
#11 P VegetablesCategory 0.085589058
#12 Bi VegetablesCategory 0.029492162
#13 R WholegrainCategory 0.021062029
#14 B WholegrainCategory 0.100734994
#15 P WholegrainCategory 0.066246546
#16 Bi WholegrainCategory 0.007753320
Plot
Once you have the data formatted as above, you can simply do:
dummydata %>%
ggplot(aes(x = food,
y = abundance,
group = bacteria)) +
geom_bar(stat="identity") +
## Split into 4 plots
## Note: can also use 'facet_grid' to do this
facet_wrap(~bacteria) +
theme(
## rotate the x-axis label
axis.text.x = element_text(angle=90, hjust=1, vjust=.5)
)
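As for the error in the question: Abundance is listed in group_by() and then overwritten inside summarise(), and dplyr refuses to modify a grouping variable. Grouping by the columns that define the bars, and summing Abundance within them, avoids that. This is a minimal sketch, assuming columns named Genus, Level and Abundance as in the question's code; the values are made up.

```r
library(dplyr)
library(ggplot2)

# Made-up data with the column names used in the question's code
genus_veg <- data.frame(
  Genus = rep(c("R", "P", "B", "Bi"), times = 2),
  Level = rep(c("Low", "High"), each = 4),
  Abundance = c(0.005, 0.001, 0.006, 0.000, 0.003, 0.002, 0.131, 0.001)
)

# Group by the columns that identify a bar, not by the value being summed
genus_sum <- genus_veg %>%
  group_by(Genus, Level) %>%
  summarise(Abundance = sum(Abundance), .groups = "drop")

ggplot(genus_sum, aes(x = Level, y = Abundance, fill = Genus)) +
  geom_col(position = "dodge")
```

Keeping Level in group_by() also matters because summarise() drops every column that is neither grouped nor summarised, and the plot maps Level to the x-axis.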

Related

ggplot stat-ecdf cumulative distribution custom maximum

I have a df in the following format:
df <- read.table(text="
DAYS STATUS ID
2 Complete A
10 Complete A
15 Complete B
NA Incomplete A
NA Incomplete B
20 Complete C", header=TRUE)
I have plotted the cumulative distribution using:
ggplot(df,aes(x=DAYS, color=ID)) +
stat_ecdf(geom = "step")
Since this is only plotting the completed rows I would like to include the incomplete rows that have an NA for days. By doing this the cumulative distributions for each ID would not reach 100% because some of the rows do not have a value for days.
ID PERCENT_COMPLETE
A .95
B .55
C .5
For example in my full dataset ID A has .95 status complete so the distribution line would reach at max at .95 while B would reach a max at .55.
None of the plotting functions appear to handle NA values in the way you want, so we can pre-calculate the values ourselves using dplyr:
library(ggplot2)
library(dplyr)
df <- read.table(text="
DAYS STATUS ID
2 Complete A
10 Complete A
15 Complete B
NA Incomplete A
NA Incomplete B
20 Complete C", header=TRUE)
incomplete_cdf <- function(x, gmin, gmax) {
cdf <- rle(sort(na.omit(x)))
obsx <- cdf$values
obsy <- cumsum(cdf$lengths)/length(x)
data.frame(x = c(gmin, obsx, gmax) , y=c(0, obsy, tail(obsy, 1)))
}
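df %>%
mutate(gmin = min(DAYS, na.rm=TRUE), gmax = max(DAYS, na.rm=TRUE)) %>%
group_by(ID) %>%
## reframe() (dplyr >= 1.1.0) allows several rows per group; older versions can use summarize() here
reframe(incomplete_cdf(DAYS, first(gmin), first(gmax))) %>%
ggplot(aes(x=x, y=y, color=ID)) +
geom_step()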
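To see why the curves stop short of 1, it can help to run the helper on a single ID. The sketch below repeats incomplete_cdf from above with comments and applies it to the three rows of ID A in the example data; the key point is that the denominator is length(x), which still counts the NA rows.

```r
incomplete_cdf <- function(x, gmin, gmax) {
  cdf <- rle(sort(na.omit(x)))              # distinct observed values and their counts
  obsx <- cdf$values
  obsy <- cumsum(cdf$lengths) / length(x)   # denominator includes the NA rows
  # pad the curve so every ID spans the same x range
  data.frame(x = c(gmin, obsx, gmax), y = c(0, obsy, tail(obsy, 1)))
}

# ID "A" from the example: two completed rows and one incomplete row
cdf_a <- incomplete_cdf(c(2, 10, NA), gmin = 2, gmax = 20)
cdf_a
max(cdf_a$y)  # 2/3, because only 2 of the 3 rows have a DAYS value
```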

adding rows to a tibble based on mostly replicating existing rows

I have data that only shows a variable if it is not 0. However, I would like to have gaps representing these 0s in the graph.
(I will be working from a large dataframe, but have created an example data based on how I will be manipulating it for this purpose.)
library(tidyverse)
library(ggplot2)
A <- tibble(
name = c("CTX_M", "CblA_1"),
rpkm = c(350, 4),
sample = "A"
)
B <- tibble(
name = c("CTX_M", "OXA_1", "ampC"),
rpkm = c(324, 357, 99),
sample = "B"
)
plot <- bind_rows(A, B)
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
Sample A and B both have CTX_M; however, the other three "names" are only present in either sample A or sample B. When I run the code, the output graph shows two bars for sample A and three bars for sample B.
Is there a way for me to add CblA_1 to sample B with rpkm=0, and OXA_1 and ampC to sample A with rpkm=0, while maintaining sample separation? The tibble would then contain a row for every name in every sample (order not important), and the graph would show a gap wherever rpkm is 0.
You can use complete from tidyr.
plot <- plot %>% complete(name,sample,fill=list(rpkm=0))
# A tibble: 8 x 3
name sample rpkm
<chr> <chr> <dbl>
1 ampC A 0
2 ampC B 99
3 CblA_1 A 4
4 CblA_1 B 0
5 CTX_M A 350
6 CTX_M B 324
7 OXA_1 A 0
8 OXA_1 B 357
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
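For readers unfamiliar with complete(): it expands the data to every combination of the listed columns, and fill supplies the value used in the rows it had to create. A small sketch that rebuilds the example data and checks the result (the object names plot_data and filled are only for illustration):

```r
library(tidyr)
library(dplyr)

plot_data <- bind_rows(
  tibble(name = c("CTX_M", "CblA_1"), rpkm = c(350, 4), sample = "A"),
  tibble(name = c("CTX_M", "OXA_1", "ampC"), rpkm = c(324, 357, 99), sample = "B")
)

# 4 distinct names x 2 samples = 8 rows; missing combinations get rpkm = 0
filled <- plot_data %>% complete(name, sample, fill = list(rpkm = 0))

nrow(filled)
filter(filled, rpkm == 0)   # the three rows that complete() added
```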

Plot a multivariate histogram in R

I would like to plot 6 different variables with their corresponding calculated statistical data. The following dataframe may serve as an example
X aggr_a aggr_b count
<chr> <dbl> <dbl> <dbl>
1 A 470676 594423 58615
2 B 549142 657291 67912
3 C 256204 311723 26606
4 D 248256 276593 40201
5 E 1581770 1717788 250553
6 F 1932096 2436769 385556
I would like to plot each row as a category, with its statistics as the bars within that category.
Can I use ggplot2 for this kind of graph?
All the available resources seem to cover the univariate case only.
library(tidyverse)
df = read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header=T)
df %>%
gather(type,value,-X) %>% # reshape dataset
ggplot(aes(X,value,fill=type))+
geom_bar(position = "dodge", stat = "identity")
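In current tidyr, gather() is superseded by pivot_longer(), and geom_col() is shorthand for geom_bar(stat = "identity"). An equivalent sketch of the same reshape-then-plot idea, using the data from the question:

```r
library(tidyr)
library(ggplot2)

df <- read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header = TRUE)

# One row per (X, statistic) pair: 6 categories x 3 statistics
long <- pivot_longer(df, cols = -X, names_to = "type", values_to = "value")

# Bars for the three statistics sit side by side within each category
ggplot(long, aes(X, value, fill = type)) +
  geom_col(position = "dodge")
```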

Coloring Rarefaction curve lines by metadata (vegan package) (phyloseq package)

First time question asker here. I wasn't able to find an answer to this question in other posts (love stackexchange, btw).
Anyway...
I'm creating a rarefaction curve via the vegan package and I'm getting a very messy plot that has a very thick black bar at the bottom of the plot which is obscuring some low diversity sample lines.
Ideally, I would like to generate a plot with all of my lines (169; I could reduce this to 144) but make a composite graph, coloring by Sample Year and making different types of lines for each Pond (i.e: 2 sample years: 2016, 2017 and 3 ponds: 1,2,5). I've used phyloseq to create an object with all my data, then separated my OTU abundance table from my metadata into distinct objects (jt = OTU table and sampledata = metadata). My current code:
jt <- as.data.frame(t(j)) # transpose it to make it compatible with the following commands
rarecurve(jt
, step = 100
, sample = 6000
, main = "Alpha Rarefaction Curve"
, cex = 0.2
, color = sampledata$PondYear)
# A very small subset of the sample metadata
Pond Year
F16.5.d.1.1.R2 5 2016
F17.1.D.6.1.R1 1 2017
F16.1.D15.1.R3 1 2016
F17.2.D00.1.R2 2 2017
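Here is an example of how to plot a rarefaction curve with ggplot2. I used data that ships with the phyloseq package, which is available from Bioconductor.
To install phyloseq (biocLite() has since been retired in favour of BiocManager):
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")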
library(phyloseq)
other libraries needed
library(tidyverse)
library(vegan)
data:
mothlist <- system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff <- "0.10"
esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
extract OTU table, transpose and convert to data frame
otu <- otu_table(esophman)
otu <- as.data.frame(t(otu))
sample_names <- rownames(otu)
out <- rarecurve(otu, step = 5, sample = 6000, label = T)
Now you have a list in which each element corresponds to one sample.
Clean the list up a bit:
rare <- lapply(out, function(x){
b <- as.data.frame(x)
b <- data.frame(OTU = b[,1], raw.read = rownames(b))
b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
return(b)
})
label list
names(rare) <- sample_names
convert to data frame:
rare <- map_dfr(rare, function(x){
z <- data.frame(x)
return(z)
}, .id = "sample")
Let's see how it looks:
head(rare)
sample OTU raw.read
1 B 1.000000 1
2 B 5.977595 6
3 B 10.919090 11
4 B 15.826125 16
5 B 20.700279 21
6 B 25.543070 26
plot with ggplot2
ggplot(data = rare)+
geom_line(aes(x = raw.read, y = OTU, color = sample))+
scale_x_continuous(labels = scales::scientific_format())
vegan plot:
rarecurve(otu, step = 5, sample = 6000, label = T) #low step size because of low abundance
One can add a column of groupings and color according to that.
Here is an example of how to add another grouping. Let's assume you have a table of the form:
groupings <- data.frame(sample = c("B", "C", "D"),
location = c("one", "one", "two"), stringsAsFactors = F)
groupings
sample location
1 B one
2 C one
3 D two
where samples are grouped according to another feature. You could use lapply or map_dfr to go over groupings$sample and add the matching location to rare as a new column (called loc below).
rare <- map_dfr(groupings$sample, function(x){ #loop over samples
z <- rare[rare$sample == x,] #subset rare according to sample
loc <- groupings$location[groupings$sample == x] #subset groupings according to sample, if more than one grouping repeat for all
z <- data.frame(z, loc) #make a new data frame with the subsets
return(z)
})
head(rare)
sample OTU raw.read loc
1 B 1.000000 1 one
2 B 5.977595 6 one
3 B 10.919090 11 one
4 B 15.826125 16 one
5 B 20.700279 21 one
6 B 25.543070 26 one
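The same labelling can also be done as a join, which is usually shorter than looping over samples. This is a minimal sketch on a small made-up stand-in for rare (named rare_demo here); note that with a join the new column keeps the name location rather than loc.

```r
library(dplyr)

# Small made-up stand-in for the `rare` data frame built above
rare_demo <- data.frame(
  sample = rep(c("B", "C", "D"), each = 2),
  OTU = c(1.0, 6.0, 1.0, 5.5, 1.0, 5.8),
  raw.read = rep(c(1, 6), times = 3)
)
groupings <- data.frame(sample = c("B", "C", "D"),
                        location = c("one", "one", "two"),
                        stringsAsFactors = FALSE)

# Every row of rare_demo picks up the location of its sample
rare_joined <- left_join(rare_demo, groupings, by = "sample")
head(rare_joined)
```

If there are several grouping columns, they all come along in the same join, so nothing needs to be repeated per grouping.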
Let's make a decent plot out of this:
ggplot(data = rare)+
geom_line(aes(x = raw.read, y = OTU, group = sample, color = loc))+
geom_text(data = rare %>% #here we need coordinates of the labels
group_by(sample) %>% #first group by samples
summarise(max_OTU = max(OTU), #find max OTU
max_raw = max(raw.read)), #find max raw read
aes(x = max_raw, y = max_OTU, label = sample), check_overlap = T, hjust = 0)+
scale_x_continuous(labels = scales::scientific_format())+
theme_bw()
I know this is an older question but I originally came here for the same reason and along the way found out that in a recent (2021) update vegan has made this a LOT easier.
This is an absolutely bare-bones example.
Ultimately we're going to be plotting the final result in ggplot so you'll have full customization options, and this is a tidyverse solution with dplyr.
library(vegan)
library(dplyr)
library(ggplot2)
I'm going to use the dune data within vegan and generate a column of made-up metadata for the sites.
data(dune)
metadata <- data.frame("Site" = as.factor(1:20),
"Vegetation" = rep(c("Cactus", "None"), times = 10))
Now we will run rarecurve, but provide the argument tidy = TRUE which will export a dataframe rather than a plot.
One thing to note here is that I have also used the step argument. The default step is 1, which means you get one row per individual per sample in your dataset, and that can make the resulting dataframe huge. Step = 1 for dune gave me over 600 rows. Increasing the step too much will make your curves blocky, so it is a balance between data frame size and resolution for a nice plot.
Then I piped a left join right into the rarecurve call
dune_rare <- rarecurve(dune,
step = 2,
tidy = TRUE) %>%
left_join(metadata, by = "Site")
Now it will be plottable in ggplot, with a color/colour call to whatever metadata you attached.
From here you can customize other aspects of the plot as well.
ggplot(dune_rare) +
geom_line(aes(x = Sample, y = Species, group = Site, colour = Vegetation)) +
theme_bw()
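Applied to the original question (colour by year, line type by pond), the same approach might look like the sketch below. The Pond and Year values are invented and attached to the dune sites purely for illustration, and tidy = TRUE requires a vegan version that supports it, as described above.

```r
library(vegan)
library(dplyr)
library(ggplot2)

data(dune)

# Hypothetical metadata: Pond and Year are invented for this example
metadata <- data.frame(
  Site = rownames(dune),
  Pond = rep(c("1", "2", "5"), length.out = nrow(dune)),
  Year = rep(c("2016", "2017"), each = nrow(dune) / 2)
)

dune_rare <- rarecurve(dune, step = 2, tidy = TRUE) %>%
  mutate(Site = as.character(Site)) %>%   # match the character key in metadata
  left_join(metadata, by = "Site")

# One line per site, coloured by year, with line type set by pond
ggplot(dune_rare) +
  geom_line(aes(x = Sample, y = Species, group = Site,
                colour = Year, linetype = Pond)) +
  theme_bw()
```

Storing Pond and Year as character (or factor) keeps ggplot from treating them as continuous, which linetype does not accept.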

ggplot: Generate facet grid plot with multiple series

I have the following data frame:
Quarter x y p q
1 2001 8.714392 8.714621 3.3648435 3.3140090
2 2002 8.671171 8.671064 0.9282508 0.9034387
3 2003 8.688478 8.697413 6.2295996 8.4379698
4 2004 8.685339 8.686349 3.7520135 3.5278024
My goal is to generate a facet plot where the x and y columns share one panel and p and q share another, instead of 4 facets.
If I do the following:
library(reshape2)
library(ggplot2)
x.df.melt <- melt(x.df[,c('Quarter','x','y','p','q')],id.vars=1)
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=1)) + geom_line()+
facet_grid(variable~., scales='free_y') +
scale_color_discrete(breaks=c('x','y','p','q'))
I get all four series in 4 different facets, but how do I combine x and y into one facet and p and q into another? Preferably with no legends.
One idea would be to create a new grouping variable:
x.df.melt$var <- ifelse(x.df.melt$variable == "x" | x.df.melt$variable == "y", "A", "B")
You can use it for facetting while using variable for grouping:
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=variable)) + geom_line()+
facet_grid(var~., scales='free_y') +
## guide = "none" hides the legend (guide = F is deprecated in recent ggplot2)
scale_color_discrete(breaks=c('x','y','p','q'), guide = "none")
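With more than two series per facet, %in% scales better than chained == comparisons. The sketch below is a self-contained variant of the same idea using the question's data; it swaps in tidyr::pivot_longer for melt and hides the legend through theme(), both of which are substitutions for illustration rather than part of the answer above.

```r
library(tidyr)
library(ggplot2)

x.df <- data.frame(
  Quarter = 2001:2004,
  x = c(8.714392, 8.671171, 8.688478, 8.685339),
  y = c(8.714621, 8.671064, 8.697413, 8.686349),
  p = c(3.3648435, 0.9282508, 6.2295996, 3.7520135),
  q = c(3.3140090, 0.9034387, 8.4379698, 3.5278024)
)

x.df.long <- pivot_longer(x.df, cols = -Quarter,
                          names_to = "variable", values_to = "value")

# Facet label: "A" for the x/y pair, "B" for the p/q pair
x.df.long$var <- ifelse(x.df.long$variable %in% c("x", "y"), "A", "B")

ggplot(x.df.long, aes(Quarter, value, colour = variable, group = variable)) +
  geom_line() +
  facet_grid(var ~ ., scales = "free_y") +
  theme(legend.position = "none")
```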
I think beetroot's answer above is more elegant but I was working on the same problem and arrived at the same place a different way. I think it is interesting because I used a "double melt" (yum!) to line up the x,y/p,q pairs. Also, it demonstrates tidyr::gather instead of melt.
library(tidyr)
library(dplyr)
library(ggplot2)
x.df<- data.frame(Year=2001:2004,
x=runif(4,8,9),y=runif(4,8,9),
p=runif(4,3,9),q=runif(4,3,9))
x.df.melt<-gather(x.df,"item","item_val",-Year,-p,-q) %>%
group_by(item,Year) %>%
gather("comparison","comp_val",-Year,-item,-item_val) %>%
filter((item=="x" & comparison=="p")|(item=="y" & comparison=="q"))
> x.df.melt
# A tibble: 8 x 5
# Groups: item, Year [8]
Year item item_val comparison comp_val
<int> <chr> <dbl> <chr> <dbl>
1 2001 x 8.400538 p 5.540549
2 2002 x 8.169680 p 5.750010
3 2003 x 8.065042 p 8.821890
4 2004 x 8.311194 p 7.714197
5 2001 y 8.449290 q 5.471225
6 2002 y 8.266304 q 7.014389
7 2003 y 8.146879 q 7.298253
8 2004 y 8.960238 q 5.342702
See below for the plotting statement.
One weakness of this approach (and beetroot's use of ifelse) is that the filter statement quickly becomes unwieldy if you have a lot of pairs to compare. In my use case I was comparing mutual fund performances to a number of benchmark indices. Each fund has a different benchmark. I solved this with a table of metadata that pairs the fund tickers with their respective benchmarks, then used left/right_join. In this case:
#create meta data
pair_data<-data.frame(item=c("x","y"),comparison=c("p","q"))
#create comparison name for each item name
x.df.melt2<-x.df %>% gather("item","item_val",-Year) %>%
left_join(pair_data, by = "item")
#join comparison data alongside item data
x.df.melt2<-x.df.melt2 %>%
select(Year,item,item_val) %>%
rename(comparison=item,comp_val=item_val) %>%
right_join(x.df.melt2,by=c("Year","comparison")) %>%
na.omit() %>%
group_by(item,Year)
ggplot(x.df.melt2,aes(Year,item_val,color="item"))+geom_line()+
geom_line(aes(y=comp_val,color="comp"))+
guides(col = guide_legend(title = NULL))+
ylab("Value")+
facet_grid(~item)
Since there is no need for a new grouping variable, we preserve the names of the reference items as labels for the facet plot.

Resources