Stack bar generated by Phyloseq - r

I am using the R package "phyloseq" to analyze bioinformatic data.
otumat = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
otumat
rownames(otumat) <- paste0("OTU", 1:nrow(otumat))
colnames(otumat) <- paste0("Sample", 1:ncol(otumat))
otumat
taxmat = matrix(sample(letters, 70, replace = TRUE), nrow = nrow(otumat), ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus",
"Species")
taxmat
library("phyloseq")
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
OTU
TAX
physeq = phyloseq(OTU, TAX)
physeq
plot_bar(physeq, fill = "Family")
The bar graph this generates does not stack members of the same Family together. For example, there are two separate "I" blocks in sample 10. I know phyloseq draws its plots with ggplot2. Does anyone know what ggplot2 code I can add to plot_bar(physeq, fill = "Family") to stack the same family together in the bar graph?

You need to reorder the levels of the factor used for the x-axis. physeq presumably has a column called "Sample" (I don't have the relevant package installed); reorder the levels of that column.
It should be possible to use a command like this:
physeq$Sample <- factor(physeq$Sample, levels = paste0("Sample", 1:10))
Then it should plot correctly.
You might need to dig to find the relevant part to change.

Actually, with respect, the plot_bar function does already do what you're asking:
# preliminaries
rm(list = ls())
library("phyloseq"); packageVersion("phyloseq")
data("GlobalPatterns")
gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
# the function call that does what you're asking for
plot_bar(gp.ch, fill = "Family")
See the following tutorial for more details and examples:
https://joey711.github.io/phyloseq/plot_bar-examples.html
You can also specify the x-axis grouping.
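If the separate blocks come from the plot drawing one segment per OTU, one way to get a single block per Family is to sum the counts within each Family before plotting. The following is a minimal base-R sketch of that aggregation step only; the toy data and the names family and family_counts are assumptions for illustration, and it does not touch the phyloseq object itself (phyloseq's tax_glom() performs a comparable merge on the object).

```r
# Toy data in the same shape as the question's otumat (values are made up).
set.seed(42)
otumat <- matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10,
                 dimnames = list(paste0("OTU", 1:10), paste0("Sample", 1:10)))

# Hypothetical Family label for each OTU (one entry per row of otumat).
family <- sample(letters[1:4], 10, replace = TRUE)

# Sum the counts of all OTUs that share a Family, sample by sample.
family_counts <- rowsum(otumat, group = family)

# One row per Family, so a stacked bar built from this table
# has exactly one block per Family in every sample.
family_counts
```

The per-sample totals are unchanged by this step; only the number of segments per bar is reduced.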

Plotly gauge graph and crosstalk filtering for flexdashboard

I am trying to create a plotly gauge graph for a flexdashboard which should change value depending on the chosen filter in crosstalk::filter_select().
I have tried and tried but cannot get the filter to work. This is an example with mtcars of what I am trying to do. I noticed that if the SharedData object has only one value, then it works, but otherwise plotly does not show any data.
library(crosstalk) # SharedData, bscols(), filter_select()
library(plotly)    # plot_ly()
mtcars_data <- tibble::rownames_to_column(mtcars, "Car")
shared_mtcars <- SharedData$new(mtcars_data)
row1 <- bscols(filter_select("Car", "Car", shared_mtcars, ~Car, multiple = F)
)
fig <- plot_ly(shared_mtcars,
domain = list(x = c(0, 1), y = c(0, 1)),
value = ~mpg,
title = list(text = "MPG"),
type = "indicator",
mode = "gauge+number")
bscols(row1, fig, widths = 12)
This code results in a graph with no data. If I subset mtcars_data to take the first row or the first two rows (which happen to have the same value for mpg) then it works. If I subset rows 1 and 3, it doesn't.
I might be missing something - in that case would really appreciate any feedback.

Outputting a list of the intersecting genes/Values when making a VennDiagram in R with the VennDiagram package

I made a VennDiagram with five intersecting vectors, each containing a set of gene names.
Does anyone know whether I can somehow export the list of genes, which overlap in the different intersections?
I know I can do that with several online tools, such as Venny or InteractiVenn, but it would be much more convenient in R.
This is the code I use:
venn.diagram(
x = list(set1, set2, set3, set4, set5),
category.names = c("set1", "set2", "set3", "set4", "set5"),
filename= "my_path/venn.png",
output=NULL,
# # Output features
imagetype="png" ,
height = 2000 ,
width = 2000 ,
units = "px",
na = 'stop',
resolution = 300,
compression = "lzw",
lwd = 2,
col = c("#1ABC9C", "#85C1E9", "#CD6155", "#5B2C6F", "#F8C471"),
cat.col = c("#1ABC9C", "#85C1E9", "#CD6155", "#5B2C6F", "#F8C471"),
fill = c(alpha("#1ABC9C",0.3), alpha("#85C1E9",0.3), alpha("#CD6155",0.3), alpha("#5B2C6F",0.3), alpha("#F8C471",0.3)),
cex = 1.5,
fontfamily = "sans",
cat.cex = 1.15,
cat.default.pos = "text",
cat.fontfamily = "sans",
cat.dist= c(0.055),
cat.pos= c(1)
)
Thanks!
I suspect the OP has moved on, but I had the same question.
Here's what I came up with for a five-set example. NB: this uses a different package:
require(nVennR)
require(data.table) # provides setDT(), dcast() and rowid() used below; plyr must also be installed
# wrangle input
Venn <- plotVenn(list("set1"=set1, "set2"=set2, "set3"=set3, "set4"=set4,
"set5"=set5), outFile = "DataSourceVenn.svg") # produces associated diagram
# generate lists of each intersect
intersects <- listVennRegions(Venn)
# pull lists together
intersects <- plyr::ldply(intersects, cbind)
# insert own appropriate col name for V1
colnames(intersects)<-c('Intersect','V1')
# transpose data into columns for each intersect
intersects <- dcast(setDT(intersects), rowid(Intersect) ~ Intersect, value.var =
"V1")[,Intersect:=NULL]
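For readers who only need the gene lists and not a second diagram, the same regions can be derived with base R alone. This is a minimal sketch under assumptions: the gene sets below are made-up stand-ins for set1 to set5, and the names membership, region and regions are illustrative. VennDiagram also ships helpers such as calculate.overlap() and get.venn.partitions(); their help pages are worth checking before relying on this sketch.

```r
# Hypothetical gene sets standing in for set1..set5 from the question.
sets <- list(
  set1 = c("A", "B", "C", "D"),
  set2 = c("B", "C", "E"),
  set3 = c("C", "D", "E", "F"),
  set4 = c("A", "C", "G"),
  set5 = c("C", "F", "G", "H")
)

# Every gene that occurs in at least one set.
all_genes <- sort(unique(unlist(sets)))

# Logical membership matrix: one row per gene, one column per set.
membership <- sapply(sets, function(s) all_genes %in% s)
rownames(membership) <- all_genes

# Label each gene with the exact combination of sets that contain it.
region <- apply(membership, 1, function(m) paste(names(sets)[m], collapse = "&"))

# Split genes by label: each list element is one exclusive area of the diagram.
regions <- split(all_genes, region)
regions
```

Each gene lands in exactly one region, so the result can be written out with, for example, one file or one column per region.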

How to programmatically determine the column indices of principal components using FactoMineR package?

Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1
and Fig2
Explanation of Fig1: Fig1 is a scree plot, a simple line-segment plot that shows the fraction of the total variance in the data explained by each Principal Component (PC). We can see that the first three PCs together account for 43.8% of the total variance. The natural next question is which variables drive these components, which is what Fig2 shows.
Explanation of Fig2: This figure visualizes the contributions of the variables to the results of the Principal Component Analysis (PCA). From it I can see that name, studLoc and finalmark are the most important variables and can be used for further analysis.
Further analysis (where I'm stuck): To obtain the contributions of the aforementioned variables name, studLoc and finalmark, I index into the result object df.princomp (see above) with df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].
I have to specify the column indices [,2:3] and [,4] manually.
What I want: How can I assign the column indices dynamically, so that I do not have to hard-code indices such as [,2:3] when indexing df.princomp?
I've already looked at the similar questions 1, 2, 3 and 4 but could not find a solution. Any help or suggestions would be appreciated.
Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.
If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
There are many ways to extract the contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke too soon and forgot you've got factors here, so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Type factoextra::fviz_contrib at the console and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.
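The idea of replacing hand-typed column indices with computed ones can also be sketched in base R. The matrix below is made up for illustration (real values would come from objects such as df.princomp$quanti.var$contrib), and threshold and dims_by_var are assumed names, not part of FactoMineR or factoextra.

```r
# Made-up contribution matrix (variables x dimensions).
contrib <- matrix(
  c( 1, 30,  5,
    21,  2, 40,
     7, 25, 10,
    17,  3, 35),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ID", "finalmark", "subj1mark", "subj2mark"),
                  c("Dim.1", "Dim.2", "Dim.3"))
)

threshold <- 20  # arbitrary cut-off chosen for illustration

# For each variable, the indices of the dimensions whose contribution
# exceeds the threshold (a named list, one element per variable).
dims_by_var <- sapply(rownames(contrib),
                      function(v) which(contrib[v, ] > threshold),
                      simplify = FALSE)

# Use the computed indices instead of typing the column numbers by hand.
contrib["finalmark", dims_by_var[["finalmark"]], drop = FALSE]
```

Because which() keeps the column names, the result can be used either by position or by name, which stays valid if the number of dimensions changes.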

Skip empty panel using lattice package, R programming

I want to skip an empty panel using the lattice package in R.
set.seed(1)
df1 <- data.frame("treatment" = c(rep("A",16),rep("B",16),rep("C",16)),
"disease_type" = c(rep("1",8),rep("2",8)),
"days_after_application" = rep(c(rep("10-24",4),rep("24-48",4)),6),
"severity" = rnorm(48, mean = 80, sd = 5))
df1[(df1$disease_type == "2" & df1$days_after_application == "24-48"),"severity"] <- NA
library(lattice)
figure1 <- bwplot(treatment~severity|days_after_application+disease_type,
data = df1,layout = c(2,2),
strip = strip.custom(strip.names = TRUE))
jpeg("figure1.jpeg")
print(figure1)
dev.off()
Here is what I get
My question is how I can remove/skip empty panel in the top right WITHOUT changing layout?
I have tried the following code, but it doesn't work.
figure2 <- bwplot(treatment~severity|days_after_application+disease_type,
data = df1,layout = c(2,2),
strip = strip.custom(strip.names = TRUE),
skip = c(FALSE,FALSE,FALSE,TRUE))
jpeg("figure2.jpeg")
print(figure2)
dev.off()
Here is what I got
I also tried the following code, but it is not what I want because I need strips with two levels.
df1[(df1$disease_type == "2" & df1$days_after_application == "24-48"),] <- NA
bwplot(treatment~severity|interaction(days_after_application,disease_type),
data = df1,layout = c(2,2),
strip = strip.custom(strip.names = TRUE))
Thank you!
I got help from a professor at Temple University.
Here is his solution:
figure4 <- bwplot(treatment~severity|days_after_application+disease_type,
data = df1,layout = c(2,2),
strip = strip.custom(strip.names = TRUE),
skip = c(FALSE,FALSE,FALSE,TRUE),
scales=list(alternating=FALSE), ## keep x-scale on bottom
between=list(x=1, y=1)) ## space between panels
pdf("figure4%03d.pdf",onefile = FALSE) ## write each page to its own numbered file
print(figure4)
dev.off()

Scatter/Bubble plot not correctly plotting using rCharts and dimple.js

I am trying to create the simplest of scatter charts using dimple and rCharts. I am curious whether there is something I misunderstand about the 'scatter' type. When I run this code, the y-axis values are off by factors of ten or larger, almost as if the scatter chart were acting as a stacked bar chart instead of a simple scatter plot. The sample data below mimics my data exactly.
testdat1 <- data.frame(Recommend = sample(60:90, 200, replace = T), Quiet = sample(20:60, 200, replace = T),
Owner = as.factor(rep(c(1,2), 100)))
summary(testdat1) # no values exceed 90
dtest <- dPlot(Recommend ~ Quiet, groups = 'Owner', data = testdat1, type = 'scatter')
dtest # plotted y-values reach upwards of 450
Any thoughts?
See my comment, but this block of code might do what you want:
require(rCharts)
testdat1 <- data.frame(Recommend = sample(60:90, 200, replace = T), Quiet = sample(20:60, 200, replace = T),
Owner = as.factor(rep(c(1,2), 100)))
summary(testdat1) # no values exceed 90
dtest <- dPlot(Recommend ~ Quiet, groups = 'Owner', data = testdat1, type = 'bubble')
#will aggregate as avg by default
dtest$xAxis(type="addMeasureAxis")
dtest
#add x,y, and grouping so now only will aggregate where x,y,and group is exact same
#if still a problem, could add a unique id and group on that
dtest$params$groups <- c('Recommend','Quiet','Owner')
dtest # plotted y-values reach upwards of 450
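Why the grouping matters can be illustrated without rCharts. The base-R sketch below uses made-up data and sum as the aggregation function purely for illustration; the aggregation dimple actually applies may differ, and pts, coarse and fine are assumed names.

```r
# Made-up data: four points, two of which share the same Quiet value and Owner.
pts <- data.frame(
  Recommend = c(80, 85, 70, 90),
  Quiet     = c(30, 30, 40, 50),
  Owner     = factor(c(1, 1, 2, 2))
)

# Grouping only by x value and series collapses the two Quiet == 30 rows
# into a single aggregated value.
coarse <- aggregate(pts$Recommend, by = pts[c("Quiet", "Owner")], FUN = sum)
max(coarse$x)  # 165, larger than any single observation

# Adding the y value to the grouping keeps every distinct point separate.
fine <- aggregate(pts$Recommend, by = pts[c("Recommend", "Quiet", "Owner")], FUN = sum)
max(fine$x)    # 90, the largest observed value
```

With 200 rows spread over about 40 distinct x values, many rows share a group, which is consistent with plotted values far above the observed maximum.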
