R: calculate the most abundant taxa using a phyloseq object

I would like to know whether my approach to calculating the average relative abundance of a taxon is correct.
Is the following a correct way to calculate the relative abundance (in percent) of each family (or any other taxonomic rank) in a phyloseq object (GlobalPatterns)?
data("GlobalPatterns")
T <- GlobalPatterns %>%
  tax_glom(., "Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  arrange(OTU) %>%
  rename(OTUsID = OTU) %>%
  select(OTUsID, Family, Sample, Abundance) %>%
  spread(Sample, Abundance)
T$Mean <- rowMeans(T[, c(3:ncol(T))])
FAM <- T[, c("Family", "Mean")]
# order data frame
FAM <- FAM[order(dplyr::desc(FAM$Mean)), ]
rownames(FAM) <- NULL
head(FAM)
Family Mean
1 Bacteroidaceae 7.490944
2 Ruminococcaceae 6.038956
3 Lachnospiraceae 5.758200
4 Flavobacteriaceae 5.016402
5 Desulfobulbaceae 3.341026
6 ACK-M1 3.242808
In this case, Bacteroidaceae was the most abundant family across all samples of GlobalPatterns (26 samples and 19216 OTUs), with an average relative abundance of 7.49% over the 26 samples.
Is it correct to use T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average for any given taxon?

Bacteroidaceae has the highest abundance when all samples are pooled together, that is, when its relative abundance is averaged over the samples.
However, it is the most abundant family in only 2 of the samples.
Nevertheless, no other taxon has a higher abundance in an average sample.
Let's use dplyr verbs for all the steps to make the code more descriptive and consistent:
library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)
data <-
  GlobalPatterns %>%
  tax_glom("Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  as_tibble()
# highest abundance: all samples pooled together
data %>%
  group_by(Family) %>%
  summarise(Abundance = mean(Abundance)) %>%
  arrange(-Abundance)
#> # A tibble: 334 × 2
#> Family Abundance
#> <chr> <dbl>
#> 1 Bacteroidaceae 7.49
#> 2 Ruminococcaceae 6.04
#> 3 Lachnospiraceae 5.76
#> 4 Flavobacteriaceae 5.02
#> 5 Desulfobulbaceae 3.34
#> 6 ACK-M1 3.24
#> 7 Streptococcaceae 2.77
#> 8 Nostocaceae 2.62
#> 9 Enterobacteriaceae 2.55
#> 10 Spartobacteriaceae 2.45
#> # … with 324 more rows
# sanity check: is total abundance of each sample 100%?
data %>%
  group_by(Sample) %>%
  summarise(Abundance = sum(Abundance)) %>%
  pull(Abundance) %>%
  `==`(100) %>%
  all()
#> [1] TRUE
# get most abundant family for each sample individually
data %>%
  group_by(Sample) %>%
  arrange(-Abundance) %>%
  slice(1) %>%
  select(Family) %>%
  ungroup() %>%
  count(Family, name = "n_samples") %>%
  arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#> Family n_samples
#> <chr> <int>
#> 1 Desulfobulbaceae 3
#> 2 Bacteroidaceae 2
#> 3 Crenotrichaceae 2
#> 4 Flavobacteriaceae 2
#> 5 Lachnospiraceae 2
#> 6 Ruminococcaceae 2
#> 7 Streptococcaceae 2
#> 8 ACK-M1 1
#> 9 Enterobacteriaceae 1
#> 10 Moraxellaceae 1
#> 11 Neisseriaceae 1
#> 12 Nostocaceae 1
#> 13 Solibacteraceae 1
#> 14 Spartobacteriaceae 1
#> 15 Sphingomonadaceae 1
#> 16 Synechococcaceae 1
#> 17 Veillonellaceae 1
#> 18 Verrucomicrobiaceae 1
Created on 2022-06-10 by the reprex package (v2.0.0)
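To see what the per-family mean actually measures, here is a minimal base R sketch on a made-up count matrix. The families FamA to FamC, the samples S1 to S4, and all counts are invented for illustration and are not taken from GlobalPatterns. It contrasts the mean of per-sample percentages (what rowMeans() computes in the question) with the percentage obtained by pooling raw counts; the two can differ when sequencing depth varies between samples.

```r
# Toy counts: rows are families, columns are samples (all values invented)
counts <- matrix(
  c(50, 10, 40,      # S1, depth 100
    300, 300, 400,   # S2, depth 1000
    10, 20, 70,      # S3, depth 100
    6, 3, 1),        # S4, depth 10
  nrow = 3,
  dimnames = list(c("FamA", "FamB", "FamC"), paste0("S", 1:4))
)

# Relative abundance in percent, computed within each sample (column)
rel <- prop.table(counts, margin = 2) * 100

# Sanity check with a tolerance instead of an exact `==` on doubles
stopifnot(isTRUE(all.equal(unname(colSums(rel)), rep(100, ncol(rel)))))

# Mean of per-sample percentages: every sample has the same weight
mean_rel <- rowMeans(rel)

# Percentage after pooling raw counts: deeply sequenced samples dominate
pooled <- rowSums(counts) / sum(counts) * 100

# Family with the highest relative abundance in each sample
top_family <- rownames(rel)[apply(rel, 2, which.max)]

print(round(mean_rel, 2))
print(round(pooled, 2))
print(table(top_family))
```

In this toy data the mean of percentages is 37.5, 22.5 and 40 for FamA, FamB and FamC, while the pooled percentages are pulled towards the composition of the deepest sample, S2.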

Related

How can I do Stratified sampling with proportionate size

I have a dataset named "Tree_all_exclusive" with 7607 rows and 39 columns, which contains information about trees such as age, height, name, etc. I can create a sample of size 1200 with the code below, which picks trees at random:
sam1 <- sample_n(Tree_all_exclusive, size = 1200)
But I would like to generate a proportionate stratified sample of 1200 trees, in which the number of trees picked for each type is proportional to that type's share of the dataset.
To do this I am using the code below:
sam3 <- Tree_all_exclusive %>%
  group_by(TaxonNameFull) %>%
  summarise(total_numbers = n()) %>%
  arrange(-total_numbers) %>%
  mutate(pro = total_numbers / 7607) %>%  # 7607 is the total number of trees
  mutate(sz = pro * 1200) %>%             # 1200 is the sample size
  mutate(siz = as.integer(sz) + 1)        # some sizes are 0.01, so make them 1
sam3
s <- stratified(sam3, group = "TaxonNameFull", sam3$siz)
But it gives me the error below:
Error in s_n(indt, group, size) : 'size' should be entered as a named vector.
Could you please point me in a direction to solve this issue?
Also, if there is another way to do proportionate stratified sampling, please guide me.
Thanks a lot.
How about using sample_frac():
library(dplyr)
data(mtcars)
mtcars %>%
  group_by(cyl) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
mtcars %>%
  group_by(cyl) %>%
  sample_frac(.5) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 6
#> 2 6 4
#> 3 8 7
Created on 2023-01-24 by the reprex package (v2.0.1)
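In recent dplyr releases sample_frac() has been superseded by slice_sample(), so a proportionate stratified sample can also be sketched as below. The toy data frame trees and its column taxon are hypothetical stand-ins for Tree_all_exclusive and TaxonNameFull, and the fraction 0.25 stands in for 1200 / 7607.

```r
library(dplyr)

set.seed(42)  # make the random draw reproducible

# Hypothetical stand-in for Tree_all_exclusive: 200 trees of three taxa
trees <- tibble(
  taxon  = rep(c("Oak", "Pine", "Birch"), times = c(100, 60, 40)),
  height = runif(200, min = 2, max = 30)
)

frac <- 0.25  # sampling fraction, e.g. 1200 / 7607 for the data in the question

strat_sample <- trees %>%
  group_by(taxon) %>%
  slice_sample(prop = frac) %>%  # same fraction drawn within every stratum
  ungroup()

# Compare stratum sizes before and after sampling
count(trees, taxon)
count(strat_sample, taxon)
```

Because the size of each stratum has to be a whole number, the total may not hit the target exactly on real data, and very small strata may end up with no rows at all, which is what the `+ 1` in the question tries to avoid.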

Why does `mutate(across(...))` with `scale()` add [,1] to the column header?

This seems too basic to not be found in a search, but maybe I didn't use the correct search terms on Google.
I want to normalize a numeric column. When I modify that column with mutate(across(.., scale)) I get [,1] added to the header. Why is that?
library(dplyr, warn.conflicts = FALSE)
mtcars_mpg_only <-
  mtcars %>%
  as_tibble() %>%
  select(mpg)
mtcars_mpg_only %>%
  as_tibble() %>%
  mutate(across(mpg, scale))
#> # A tibble: 32 x 1
#> mpg[,1]
#> <dbl>
#> 1 0.151
#> 2 0.151
#> 3 0.450
#> 4 0.217
#> 5 -0.231
#> 6 -0.330
#> 7 -0.961
#> 8 0.715
#> 9 0.450
#> 10 -0.148
#> # ... with 22 more rows
But if I use a different function rather than scale() (e.g., log()), then the column header remains as-is:
mtcars_mpg_only %>%
  as_tibble() %>%
  mutate(across(mpg, log))
#> # A tibble: 32 x 1
#> mpg
#> <dbl>
#> 1 3.04
#> 2 3.04
#> 3 3.13
#> 4 3.06
#> 5 2.93
#> 6 2.90
#> 7 2.66
#> 8 3.19
#> 9 3.13
#> 10 2.95
#> # ... with 22 more rows
I know how to remove/rename [,1] after the fact, but my question is why it's created to begin with?
It is because scale returns a matrix whereas log returns a plain vector. The mpg[, 1] is actually a matrix within a data.frame. See ?scale for the definition of its value.
class(scale(mtcars$mpg))
## [1] "matrix" "array"
class(log(mtcars$mpg))
## [1] "numeric"
Convert the matrix to a plain vector to avoid this, e.g.
mtcars_mpg_only %>%
  mutate(across(mpg, ~ c(scale(.))))
# or extracting first column
mtcars_mpg_only %>%
  mutate(across(mpg, ~ scale(.)[, 1]))
# or normalizing using mean and sd
mtcars_mpg_only %>%
  mutate(across(mpg, ~ (. - mean(.)) / sd(.)))
# or without across
mtcars_mpg_only %>%
  mutate(mpg = c(scale(mpg)))
# or using base R
mtcars_mpg_only |>
  transform(mpg = c(scale(mpg)))
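The difference can be inspected directly in base R. The sketch below only uses mtcars$mpg from the examples above; it shows that scale() returns a one-column matrix that carries the centring and scaling values as attributes, and that as.numeric() (like c()) drops both the dimensions and those attributes.

```r
x <- mtcars$mpg

scaled <- scale(x)
dim(scaled)                # 32 rows, 1 column: a matrix, not a vector
names(attributes(scaled))  # includes "scaled:center" and "scaled:scale"

# Dropping the dim attribute yields a plain numeric vector
flat <- as.numeric(scaled)
is.null(dim(flat))

# The values match the manual z-score
manual <- (x - mean(x)) / sd(x)
all.equal(flat, manual)
```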

Is there a way to summarise if column value is x?

I am trying to make a data.frame that displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups, but the output is in long format, with one row per combination of groups. Here is an example using the iris dataset:
data(iris)
x <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.1) %>%
  summarise(mean(Sepal.Length))
b <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.2) %>%
  summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times is tedious, and I am sure there must be an easier way to get the mean(Sepal.Length) for the 'Petal.Length' values 0.1, 0.2, 0.3, etc. in a single output.
N.B. In my data, Petal.Length would actually be characters that represent behaviours, and Sepal.Length would be the duration of time.
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
  group_by(Species, Petal.Length_discrete) %>%
  summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
  group_split(Species, Petal.Length) %>%
  map(~ summarise(.x, mean(Sepal.Length))) %>%
  head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
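If the goal is one row per Species with one column per Petal.Length value, one possible approach is to summarise in long format first and then reshape with tidyr::pivot_wider(). This is a sketch on iris rather than part of the ideas above; the column name mean_sepal is made up for illustration, and combinations that do not occur in the data are filled with NA.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal = mean(Sepal.Length), .groups = "drop") %>%
  # One column per distinct Petal.Length value, one row per Species
  pivot_wider(names_from = Petal.Length, values_from = mean_sepal)

dim(wide)
```

With behaviours stored as character values, as described in the question, the new column names would simply be the behaviour labels.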

R, 3-way table, how to order

I am trying to order a table that has 3 variables, commonly known as a 3-way table.
I have attached a picture of the structure of the table the replicable code will produce.
Is it possible to order this table in a logical way, despite the fact that it is essentially split into three sections/groups? For instance, could you order by the column "No" or the column "Yes" based on the values?
For example, when ordering by "No", England would be ordered as setosa (7), virginica (8), versicolor (16). Wales would be ordered as versicolor (11), setosa (12), virginica... and so on for each section of the table.
# Replicable code using the iris data built into R:
Data <- iris
Data$var2 <- Data$Species
Data$var2 <- sample(Data$var2)
Data$var3 <- Data$Species
Data$var3 <- sample(Data$var3)
# making the example clearer
library(plyr)
Data$var2 <- revalue(Data$var2, c("setosa" = "No", "versicolor" = "No", "virginica" = "Yes"))
Data$var3 <- revalue(Data$var3, c("setosa" = "England", "versicolor" = "Wales", "virginica" = "Scotland"))
# 3-way table:
df <- table(Data$Species, Data$var2, Data$var3)
df
Kind Regards, James Prentice, a person trying to get to grips with R.
You should avoid using table() and array() in R, as they are hard to work with. Also, I recommend you focus on learning dplyr, rather than plyr, as plyr is no longer maintained.
Instead of using table(), work directly with the original data frame:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Data <- iris
Data$Living <- sample(c("No", "Yes"), size = nrow(Data), replace = TRUE)
Data$Country <- sample(c("England", "Wales", "Scotland"), size = nrow(Data), replace = TRUE)
# Results in one data frame
Data %>%
  group_by(Country, Species) %>%
  summarize(Yes = sum(Living == "Yes"), No = sum(Living == "No")) %>%
  ungroup() %>%
  arrange(Country, Yes)
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 England virginica 2 8
#> 2 England versicolor 7 15
#> 3 England setosa 14 5
#> 4 Scotland setosa 5 14
#> 5 Scotland virginica 6 12
#> 6 Scotland versicolor 9 8
#> 7 Wales setosa 4 8
#> 8 Wales versicolor 5 6
#> 9 Wales virginica 14 8
# Results in a list of data frames
Data %>%
  group_by(Country, Species) %>%
  summarize(Yes = sum(Living == "Yes"), No = sum(Living == "No")) %>%
  ungroup() %>%
  arrange(Country, Yes) %>%
  split(., .$Country)
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> $England
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 England virginica 2 8
#> 2 England versicolor 7 15
#> 3 England setosa 14 5
#>
#> $Scotland
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 Scotland setosa 5 14
#> 2 Scotland virginica 6 12
#> 3 Scotland versicolor 9 8
#>
#> $Wales
#> # A tibble: 3 x 4
#> Country Species Yes No
#> <chr> <fct> <int> <int>
#> 1 Wales setosa 4 8
#> 2 Wales versicolor 5 6
#> 3 Wales virginica 14 8
Created on 2021-06-01 by the reprex package (v2.0.0)
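If the counts already exist as a three-way table(), they can also be flattened into a data frame with as.data.frame() and sorted from there. A minimal base R sketch, using the same kind of randomly generated Living and Country columns as above (set.seed(1) is added here only to make the draw reproducible, and the column name n is chosen for illustration):

```r
set.seed(1)  # reproducible random labels

Data <- iris
Data$Living  <- sample(c("No", "Yes"), size = nrow(Data), replace = TRUE)
Data$Country <- sample(c("England", "Wales", "Scotland"), size = nrow(Data), replace = TRUE)

# Three-way contingency table with named dimensions
tab <- table(Species = Data$Species, Living = Data$Living, Country = Data$Country)

# One row per cell of the table; the counts go into column `n`
long <- as.data.frame(tab, responseName = "n")

# Keep the "No" counts and order them within each country
no_counts <- long[long$Living == "No", ]
ordered   <- no_counts[order(no_counts$Country, no_counts$n), ]
ordered
```

Sorting by the "Yes" counts works the same way after changing the filter on Living.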

How do I select the first 84 rows of my x categorical variable (exposed) to compute the mean of my continuous y variable using r?

I've tried using the mean function, as well as summary. I've also tried tapply and tried to select rows but it computes the overall mean.
Something like this?
library(tidyverse)
data <- tibble(
  x = factor(c("A", "B", "C")) %>% sample(100, replace = TRUE),
  y = rnorm(100)
)
data
#> # A tibble: 100 x 2
#> x y
#> <fct> <dbl>
#> 1 B -0.271
#> 2 C -0.361
#> 3 C 1.17
#> 4 A -0.652
#> 5 A 0.770
#> 6 C -0.605
#> 7 B 0.976
#> 8 B 0.392
#> 9 B 1.08
#> 10 A 0.548
#> # ... with 90 more rows
head_means <-
  head(data, 84) %>%
  group_by(x) %>%
  summarize_at("y", mean) %>%
  ungroup()
head_means
#> # A tibble: 3 x 2
#> x y
#> <fct> <dbl>
#> 1 A 0.132
#> 2 B 0.385
#> 3 C -0.110
Created on 2019-10-19 by the reprex package (v0.3.0)
Feel free to incorporate this, or a variant of it, into your question.
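The same result can be sketched in base R with tapply(), as long as it is applied to the first 84 rows only rather than to the full columns. The data frame dat below is simulated, as in the answer above, because the original data were not posted.

```r
set.seed(123)  # reproducible simulated data

dat <- data.frame(
  x = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  y = rnorm(100)
)

first84 <- head(dat, 84)  # keep only rows 1 to 84

# Mean of y within each level of x, computed on the subset only
head_means <- tapply(first84$y, first84$x, mean)
head_means

# For comparison: the group means over all 100 rows
tapply(dat$y, dat$x, mean)
```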
