How can I do Stratified sampling with proportionate size - r

I have a dataset named "Tree_all_exclusive" with 7607 rows and 39 columns, which contains different information about trees such as age, height, name, etc. I am able to create a sample of size 1200 with the code below, which appears to pick trees randomly:
sam1 <- sample_n(Tree_all_exclusive, size = 1200)
But I would like to generate a proportionate stratified sample of 1200 trees, where the number of trees picked for each species is proportional to how common that species is in the data.
To do this I am using the code below:
sam3 <- Tree_all_exclusive %>%
  group_by(TaxonNameFull) %>%
  summarise(total_numbers = n()) %>%
  arrange(-total_numbers) %>%
  mutate(pro = total_numbers / 7607) %>%   # 7607 is the total number of trees
  mutate(sz = pro * 1200) %>%              # 1200 is the sample size
  mutate(siz = as.integer(sz) + 1)         # some sizes are fractions like 0.01, so add 1 to keep at least one tree per species
sam3
s <- stratified(sam3, group = "TaxonNameFull", sam3$siz)
But it gives me the error below:
Error in s_n(indt, group, size) : 'size' should be entered as a named vector.
Would you please point me in a direction to solve this issue?
Also, if there is another way to do proportionate stratified sampling, please guide me.
Thanks a lot.

How about using sample_frac():
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
mtcars %>%
group_by(cyl) %>%
sample_frac(.5) %>%
tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 6
#> 2 6 4
#> 3 8 7
Created on 2023-01-24 by the reprex package (v2.0.1)
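Applied to the question's data, the same idea gives a proportionate stratified sample directly. This is only a sketch based on the objects described above (Tree_all_exclusive with its species column TaxonNameFull); note that sample_frac() rounds the per-group sizes, so the total may land slightly above or below 1200 and very rare species can contribute 0 trees:
library(dplyr)
# sample the same fraction of every species, which keeps the species
# proportions of the full dataset (7607 trees) in the ~1200-tree sample
sam_prop <- Tree_all_exclusive %>%
  group_by(TaxonNameFull) %>%
  sample_frac(1200 / nrow(Tree_all_exclusive)) %>%
  ungroup()
As for the original error: splitstackshape::stratified() wants size as a named vector when the sizes differ per group, and it should be applied to the full dataset rather than to the sam3 summary table. One possible fix, reusing the sam3 table built in the question, might look like this:
library(splitstackshape)
# the names of the size vector must match the values of the grouping column
sizes <- setNames(sam3$siz, sam3$TaxonNameFull)
s <- stratified(Tree_all_exclusive, group = "TaxonNameFull", size = sizes)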

Related

Difference by subgroup using R

I have the following dataset:
I want to calculate the difference between values according to the sub_groups. However, sub_group 1 must always come first: thus 10-0=10; 0-20=-20; 30-31=-1. I want to do this in R.
I know that it would be something like this, but I do not know how to put the sub_group into the code:
library(tidyverse)
df %>%
group_by(group) %>%
summarise(difference= diff(value))
Edited answer after OP's comment:
The OP clarified that the data are not sorted by sub_group within every group, so I added arrange() after group_by(). The OP further clarified that the value at sub_group == 1 should always be the first term of the difference.
Below I demonstrate how to achieve this with an example of 3 sub_groups within every group. The code assumes that the lowest value of sub_group is 1. I drop each group's first sub_group row after taking the difference.
library(tidyverse)
df <- tibble(group = rep(LETTERS[1:3], each = 3),
sub_group = rep(1:3, 3),
value = c(10,0,5,0,20,15,30,31,10))
df
#> # A tibble: 9 × 3
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 1 10
#> 2 A 2 0
#> 3 A 3 5
#> 4 B 1 0
#> 5 B 2 20
#> 6 B 3 15
#> 7 C 1 30
#> 8 C 2 31
#> 9 C 3 10
df |>
group_by(group) |>
arrange(group, sub_group) |>
mutate(value = first(value) - value) |>
slice(2:n())
#> # A tibble: 6 × 3
#> # Groups: group [3]
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 2 10
#> 2 A 3 5
#> 3 B 2 -20
#> 4 B 3 -15
#> 5 C 2 -1
#> 6 C 3 20
Created on 2022-10-18 with reprex v2.0.2
P.S. (from the original answer)
In the example data, you show the wrong difference for group C; it should read -1. I am convinced that most people here would appreciate it if you posted your example data as code, or at least as text that can be copied, instead of a picture.
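If one prefers not to rely on row order at all, a small variation on the above (using the same df) subtracts each value from the group's sub_group == 1 value directly:
library(tidyverse)
df %>%
  group_by(group) %>%
  # difference of every row against the sub_group == 1 value of its group
  mutate(value = value[sub_group == 1] - value) %>%
  # drop the reference row itself
  filter(sub_group != 1) %>%
  ungroup()
This gives the same six rows as the output above (10, 5, -20, -15, -1, 20).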

R calculate most abundant taxa using phyloseq object

I would like to know if my approach to calculating the average relative abundance of a taxon is correct.
Is the following the correct way to calculate the relative abundance (in percent) of each family (or any taxon) in a phyloseq object (GlobalPatterns)?
data("GlobalPatterns")
T <- GlobalPatterns %>%
  tax_glom("Family") %>%
  transform_sample_counts(function(x) 100 * x / sum(x)) %>%
  psmelt() %>%
  arrange(OTU) %>%
  rename(OTUsID = OTU) %>%
  select(OTUsID, Family, Sample, Abundance) %>%
  spread(Sample, Abundance)
T$Mean <- rowMeans(T[, c(3:ncol(T))])
FAM <- T[, c("Family", "Mean" ) ]
#order data frame
FAM <- FAM[order(dplyr::desc(FAM$Mean)),]
rownames(FAM) <- NULL
head(FAM)
Family Mean
1 Bacteroidaceae 7.490944
2 Ruminococcaceae 6.038956
3 Lachnospiraceae 5.758200
4 Flavobacteriaceae 5.016402
5 Desulfobulbaceae 3.341026
6 ACK-M1 3.242808
In this case Bacteroidaceae was the most abundant family across all the samples of GlobalPatterns (26 samples and 19216 OTUs): it had an average relative abundance of 7.49% over the 26 samples.
Is it correct to use T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average for any given taxon?
Bacteroidaceae has the highest abundance if all samples are pooled together.
However, it is the most abundant family in only 2 samples.
Still, no other taxon has a higher abundance in an average sample.
Let's use dplyr verbs for all the steps to get more descriptive and consistent code:
library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)
data <-
GlobalPatterns %>%
tax_glom("Family") %>%
transform_sample_counts(function(x)100* x / sum(x)) %>%
psmelt() %>%
as_tibble()
# highest abundance: all samples pooled together
data %>%
group_by(Family) %>%
summarise(Abundance = mean(Abundance)) %>%
arrange(-Abundance)
#> # A tibble: 334 × 2
#> Family Abundance
#> <chr> <dbl>
#> 1 Bacteroidaceae 7.49
#> 2 Ruminococcaceae 6.04
#> 3 Lachnospiraceae 5.76
#> 4 Flavobacteriaceae 5.02
#> 5 Desulfobulbaceae 3.34
#> 6 ACK-M1 3.24
#> 7 Streptococcaceae 2.77
#> 8 Nostocaceae 2.62
#> 9 Enterobacteriaceae 2.55
#> 10 Spartobacteriaceae 2.45
#> # … with 324 more rows
# sanity check: is total abundance of each sample 100%?
data %>%
group_by(Sample) %>%
summarise(Abundance = sum(Abundance)) %>%
pull(Abundance) %>%
`==`(100) %>%
all()
#> [1] TRUE
# get most abundant family for each sample individually
data %>%
group_by(Sample) %>%
arrange(-Abundance) %>%
slice(1) %>%
select(Family) %>%
ungroup() %>%
count(Family, name = "n_samples") %>%
arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#> Family n_samples
#> <chr> <int>
#> 1 Desulfobulbaceae 3
#> 2 Bacteroidaceae 2
#> 3 Crenotrichaceae 2
#> 4 Flavobacteriaceae 2
#> 5 Lachnospiraceae 2
#> 6 Ruminococcaceae 2
#> 7 Streptococcaceae 2
#> 8 ACK-M1 1
#> 9 Enterobacteriaceae 1
#> 10 Moraxellaceae 1
#> 11 Neisseriaceae 1
#> 12 Nostocaceae 1
#> 13 Solibacteraceae 1
#> 14 Spartobacteriaceae 1
#> 15 Sphingomonadaceae 1
#> 16 Synechococcaceae 1
#> 17 Veillonellaceae 1
#> 18 Verrucomicrobiaceae 1
Created on 2022-06-10 by the reprex package (v2.0.0)

Find cohorts in dataset in r dataframe

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. Candidates in the same cohort take the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all take the same test.
I'm working in R. The data.frame has about 1000 rows and 100 columns. Each column corresponds to a test question. For each row (candidate), all column entries are NA apart from the ones for questions that candidate was shown and answered. The entries for those questions are either 0 or 1 (see picture).
Is there an elegant way to solve this? The only thing I could think of was using dplyr to filter per 15-question subset and check how many rows remain. However, with 100 columns this means checking (I think) 100 choose 15 different possibilities. Many thanks!
data.frame structure
We can infer the cohort based on the NA pattern:
library(tidyverse)
answers <- tribble(
  ~candidate, ~q1, ~q2, ~q3,
  1, 0, NA, NA,
  2, 1, NA, NA,
  3, 0, 0, 0
)
answers
#> # A tibble: 3 x 4
#> candidate q1 q2 q3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA
#> 2 2 1 NA NA
#> 3 3 0 0 0
# infer cohort by NA pattern
cohorts <-
answers %>%
group_by(candidate) %>%
mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups: candidate [3]
#> candidate cohort
#> <dbl> <chr>
#> 1 1 TRUE_NA_NA
#> 2 2 TRUE_NA_NA
#> 3 3 TRUE_TRUE_TRUE
answers %>%
pivot_longer(-candidate) %>%
left_join(cohorts) %>%
# count filled answers per candidate and cohort
group_by(cohort, candidate) %>%
filter(! is.na(value)) %>%
count() %>%
# get the largest cohort
arrange(-n) %>%
pull(cohort) %>%
first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"
Created on 2021-09-21 by the reprex package (v2.0.1)
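Note that the last step above ranks by the number of answered questions per candidate rather than by the number of candidates per cohort, so with the toy data it returns TRUE_TRUE_TRUE even though the TRUE_NA_NA cohort has two members. A sketch that counts candidates per answer pattern directly (reusing the answers tibble above) could look like this:
library(tidyverse)
answers %>%
  # TRUE where a question was answered, FALSE where it was not shown
  mutate(across(-candidate, ~ !is.na(.x))) %>%
  unite(-candidate, col = "cohort") %>%
  count(cohort, name = "n_candidates") %>%
  arrange(desc(n_candidates))
The first row is the largest cohort; on the real data, slice_max(n_candidates, n = 1) would pull it out directly.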

Efficiently removing `NAs` in repeated measures designs using `tidyverse`

This is not so much a question about how to do something as about how to do it efficiently. In particular, I would like to drop NAs in a repeated measures design in such a way that each group ends up with only complete observations.
In the bugs_long dataframe below, the same participant takes part in four conditions and reports their desire to kill bugs in each condition. If I wanted to carry out some repeated measures analysis with this dataset, it typically wouldn't work in the long format, because a different number of observations is found for each group after pairwise exclusion of NAs. So the final dataframe should leave out the following five subjects.
# setup
set.seed(123)
library(ipmisc)
library(tidyverse)
# looking at the NAs
dplyr::filter(bugs_long, is.na(desire))
#> # A tibble: 5 x 6
#> subject gender region education condition desire
#> <int> <fct> <fct> <fct> <chr> <dbl>
#> 1 2 Female North America advance LDHF NA
#> 2 80 Female North America less LDHF NA
#> 3 42 Female North America high HDLF NA
#> 4 64 Female Europe some HDLF NA
#> 5 10 Female Other high HDHF NA
Here is the current roundabout way I am hacking this and getting it to work:
# figuring out the number of levels in the grouping factor
x_n_levels <- nlevels(as.factor(bugs_long$condition))[[1]]
# removing observations that don't have all repeated values
df <-
bugs_long %>%
filter(!is.na(condition)) %>%
group_by(condition) %>%
mutate(id = dplyr::row_number()) %>%
ungroup(.) %>%
filter(!is.na(desire)) %>%
group_by(id) %>%
mutate(n = dplyr::n()) %>%
ungroup(.) %>%
filter(n == x_n_levels) %>%
select(-n)
# did this work? yes
df %>%
group_by(condition) %>%
count()
#> # A tibble: 4 x 2
#> # Groups: condition [4]
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
But I would be surprised if the tidyverse (dplyr + tidyr) didn't have a more efficient way to achieve this, and I would really appreciate it if anyone has a better way of refactoring it.
You're actually making this much more complicated than it needs to be. Once you find the cases to exclude, it's just a simple task of removing rows in your data that match those subjects, i.e. an anti-join. Some useful discussions here and here.
set.seed(123)
library(ipmisc)
library(dplyr)
exclude <- filter(bugs_long, is.na(desire))
full_cases <- bugs_long %>%
anti_join(exclude, by = "subject")
Or do the filtering and anti-joining in one go, similar to what you might do in SQL:
bugs_long %>%
anti_join(filter(., is.na(desire)), by = "subject")
Either way, the number of cases kept checks out:
count(full_cases, condition)
#> # A tibble: 4 x 2
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
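Another compact pattern, assuming (as in the data above) that each subject appears once per condition in bugs_long, is to keep only the subjects that have no missing desire at all:
library(dplyr)
full_cases <- bugs_long %>%
  group_by(subject) %>%
  # keep a subject only if none of its condition rows has a missing desire
  filter(!any(is.na(desire))) %>%
  ungroup()
count(full_cases, condition) then gives the same 88 observations per condition as above.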

How to Output a List of Summaries From Different Grouping Variables When Using dplyr::group_by and dplyr::summarise

library(tidyverse)
Using a simple example from the mtcars dataset, I can group by cyl and get basic counts with this...
mtcars %>% group_by(cyl) %>% summarise(Count = n())
And I can group by both cyl and am...
mtcars %>% group_by(cyl, am) %>% summarise(Count = n())
I can then create a function that will allow me to input multiple grouping variables.
Fun <- function(dat, ...) {
  dat %>%
    group_by_at(vars(...)) %>%
    summarise(Count = n())
}
However, rather than entering multiple grouping variables, I would like to output a list of two summaries, one for counts with cyl as the grouping variable, and one for cyl and am as the grouping variables.
I feel like something similar to the following should work, but I can't seem to figure it out. I'm hoping for an rlang or purrr solution. Help would be appreciated.
Groups <- list("cyl", c("cyl", "am"))
mtcars %>% group_by(!!Groups) %>% summarise(Count = n())
Here's a working, tidyeval-compliant method.
library(tidyverse)
library(rlang)
Groups <- list("cyl", c("cyl", "am"))
Groups %>%
map(function(group) {
syms <- syms(group)
mtcars %>%
group_by(!!!syms) %>%
summarise(Count = n())
})
#> [[1]]
#> # A tibble: 3 x 2
#> cyl Count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
#>
#> [[2]]
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am Count
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
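With more recent dplyr versions (1.0.0 or later, which is an assumption about the setup here), the same result can be sketched without rlang by building the groups with across(all_of()):
library(dplyr)
library(purrr)
Groups <- list("cyl", c("cyl", "am"))
Groups %>%
  map(function(group) {
    mtcars %>%
      group_by(across(all_of(group))) %>%
      summarise(Count = n(), .groups = "drop")
  })
This returns the same list of two summary tibbles as above, just without the grouping attribute on the second one.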
