Find cohorts in dataset in r dataframe - r

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. Candidates in the same cohort take the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all took the same test.
I'm working in R. The data.frame has about 1000 rows and 100 columns. Each column corresponds to one test question. For each row (candidate), all entries are NA except for the questions that candidate was shown and answered; those entries are either 0 or 1 (see picture).
Is there an elegant way to solve this? The only thing I could think of was using dplyr to filter per 15-question subset and check how many rows remain. However, with 100 columns this means checking (I think) 100 choose 15 different possibilities. Many thanks!
[Picture: data.frame structure]

We can infer the cohort based on the NA pattern:
library(tidyverse)
answers <- tribble(
~candidate, ~q1, ~q2, ~q3,
1,0,NA,NA,
2,1,NA,NA,
3,0,0,0
)
answers
#> # A tibble: 3 x 4
#> candidate q1 q2 q3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA
#> 2 2 1 NA NA
#> 3 3 0 0 0
# infer cohort by NA pattern
cohorts <-
answers %>%
group_by(candidate) %>%
mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups: candidate [3]
#> candidate cohort
#> <dbl> <chr>
#> 1 1 TRUE_NA_NA
#> 2 2 TRUE_NA_NA
#> 3 3 TRUE_TRUE_TRUE
answers %>%
pivot_longer(-candidate) %>%
left_join(cohorts) %>%
# count filled answers per candidate and cohort
group_by(cohort, candidate) %>%
filter(! is.na(value)) %>%
count() %>%
# get the largest cohort
arrange(-n) %>%
pull(cohort) %>%
first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"
Created on 2021-09-21 by the reprex package (v2.0.1)
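Note that the last pipeline above ranks by the number of answered questions per candidate, which is why it returns the cohort of candidate 3. If the goal is the cohort with the most candidates, one option is to count candidates per NA-pattern signature instead. A minimal base-R sketch on the same toy data (the names answered, signature, cohort_sizes, largest and members are illustrative, not from the original answer):

```r
# Toy data mirroring the example above
answers <- data.frame(
  candidate = 1:3,
  q1 = c(0, 1, 0),
  q2 = c(NA, NA, 0),
  q3 = c(NA, NA, 0)
)

# TRUE where a question was answered, one row per candidate
answered <- !is.na(answers[, -1])

# One signature string per candidate: the names of the answered questions
signature <- apply(answered, 1, function(r) paste(names(r)[r], collapse = "_"))

# Count candidates per signature and take the most frequent one
cohort_sizes <- sort(table(signature), decreasing = TRUE)
largest <- names(cohort_sizes)[1]

# Candidates who belong to the largest cohort
members <- answers$candidate[signature == largest]
```

This scans each row once, so it avoids enumerating question subsets. Ties between equally large cohorts are not handled; inspecting cohort_sizes would show them.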

Related

How can I do Stratified sampling with proportionate size

I have a dataset named "Tree_all_exclusive" with 7607 rows and 39 columns, which contains information about trees such as age, height, name, etc. I am able to create a sample of size 1200 with the code below, which picks trees at random:
sam1<-sample_n(Tree_all_exclusive, size = 1200)
But I would like to generate a proportionate stratified sample of 1200 trees, where the number of trees picked for each type is proportional to that type's share of the dataset.
To do this I am using the code below:
sam3<-Tree_all_exclusive %>%
group_by(TaxonNameFull)%>%
summarise(total_numbers=n())%>%
arrange(-total_numbers)%>%
mutate(pro = total_numbers/7607)%>% #7607 total number of trees
mutate(sz= pro*1200)%>% #1200 is number of sample
mutate(siz=as.integer(sz)+1) #since some size is 0.01 so making it 1
sam3
s<-stratified(sam3, group="TaxonNameFull", sam3$siz)
But it is giving me the below error:
Error in s_n(indt, group, size) : 'size' should be entered as a named vector.
Could you please point me in the right direction to solve this issue?
Also, if there is any other way to do proportionate stratified sampling, please let me know.
Thanks a lot.
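How about using sample_frac()? Sampling the same fraction from every group gives a proportionate stratified sample (for your data the fraction would be 1200/7607):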
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
mtcars %>%
group_by(cyl) %>%
sample_frac(.5) %>%
tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 6
#> 2 6 4
#> 3 8 7
Created on 2023-01-24 by the reprex package (v2.0.1)
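The error message in the question asks for 'size' as a named vector, which suggests sizes labelled by group. I have not verified the exact stratified() call, so here is a base-R sketch of the same idea on mtcars: compute a size per stratum from its share of the data, then sample within each stratum. The total of 16 and the names target, sizes and picked are assumptions for illustration.

```r
set.seed(123)   # only to make the sketch reproducible
target <- 16    # hypothetical total sample size

# Rows to draw per stratum, proportional to each stratum's share of the data.
# Rounded sizes do not necessarily add up to exactly `target`.
stratum_n <- table(mtcars$cyl)
sizes <- round(stratum_n / sum(stratum_n) * target)

# Draw the requested number of row indices within each stratum
picked <- unlist(lapply(names(sizes), function(g) {
  rows <- which(mtcars$cyl == as.numeric(g))
  rows[sample.int(length(rows), sizes[[g]])]
}))

strat_sample <- mtcars[picked, ]
table(strat_sample$cyl)
```

Because every stratum is rounded separately, very small strata can round to zero; taking pmax(sizes, 1) would be one way to keep at least one row per stratum, at the cost of a slightly larger sample.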

How to do a for loop of a count() function for a datatable in R?

I need to count the number of occurrences of specific values in each column, and then use a for loop to run that count() for every column of the data frame (which has several thousand columns).
For instance, if I have a column consisting of: [0,0,0,1,1,0,0,0,0,0,0,0]. I want it to count the column and return for me the information of:
1 -> 2 counts
0 -> 10 counts
The data frame that I have consists entirely of 0s and 1s. I just need to count how many of each are in every column, but the data frame has a few thousand columns.
Currently, my for loop code doesn't work. It seems to only register the first column and keeps printing that same first-column result over and over again. Thanks everyone!!
s <- 0
yes_filt_high_mutation <- data.frame();
for(c in colnames(high_mutations)[2:ncol(high_mutations)]){ #high_mutations = my dataframe
mutation_results = high_mutations %>% count(high_mutations$c); #Count the # of 0s and 1s in each column
print(c)
print(mutation_results)
s <- s + 1
add_column <- c(c,mutation_results[1,2],mutation_results[2,2])
yes_filt_high_mutation <- rbind(data.frame(yes_filt_high_mutation), add_column)
}
names(yes_filt_high_mutation)[1] <- "Samples"
names(yes_filt_high_mutation)[2] <- "Number of 0's"
names(yes_filt_high_mutation)[3] <- "Number of 1's"
I want my result to be something like this, for each loop result:
So essentially tell me that there are 134 counts of 0 and 2 counts of 1 in Column 1.
high_mutations$Column1 n
1 0 134
2 1 2
I would suggest that you reflect on the desired final format. If your intention is to get a count of observations within each column, you can obtain that with common verbs available in the tidyverse.
library(tidyverse)
select(mtcars, cyl, vs, gear) %>%
pivot_longer(cols = everything()) %>%
group_by(name, value) %>%
summarise(ndist = n())
#> `summarise()` has grouped output by 'name'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 3
#> # Groups: name [3]
#> name value ndist
#> <chr> <dbl> <int>
#> 1 cyl 4 11
#> 2 cyl 6 7
#> 3 cyl 8 14
#> 4 gear 3 15
#> 5 gear 4 12
#> 6 gear 5 5
#> 7 vs 0 18
#> 8 vs 1 14
Created on 2022-04-16 by the reprex package (v2.0.1)
Explanation
For the sake of simplicity, the set of columns is reduced to vs, cyl and gear via the select verb.
The data is transformed to a long format with pivot_longer (from tidyr) to make grouping easier.
The key element is counting occurrences of each name/value combination; if I understood your request, this is your goal. In this case, for column cyl we get 11 instances of value 4, 7 instances of value 6, and so on.
Optional
You can transform that data into a wide format using pivot_wider, but I wouldn't rush that, as nicely formatted long data is frequently easier to work with.
Wider remarks
Looping over columns in a data frame is generally not advisable. R offers a number of optimised, robust and mature approaches to achieve similar objectives. The apply functions in base R or the across verb from the tidyverse are good starting points.
You may wish to reflect on refining your requirements. As was observed in the comments, are you in effect looking for an output similar to table(mtcars$cyl) plus some additional embellishments?
Alternative solution
If you are not too fussed about the output format you could also leverage map.
library(tidyverse)
select(mtcars, cyl, vs, gear) %>%
map(~ table(.x))
#> $cyl
#> .x
#> 4 6 8
#> 11 7 14
#>
#> $vs
#> .x
#> 0 1
#> 18 14
#>
#> $gear
#> .x
#> 3 4 5
#> 15 12 5
Created on 2022-04-16 by the reprex package (v2.0.1)
You will arrive at an identical result, but as a list. You may wish to pack those tables into a data frame, but if you intend to do that, staying with group_by is probably more straightforward.
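As for the loop in the question: high_mutations$c looks for a column literally named "c" rather than the column whose name is stored in the loop variable c; high_mutations[[c]] would use the variable. Since the data contains only 0s and 1s, the counts can also be obtained without a loop. A minimal base-R sketch, assuming a small made-up data frame in place of high_mutations (the column names id, s1 and s2 are hypothetical):

```r
# Hypothetical 0/1 data standing in for high_mutations; first column is an id
high_mutations <- data.frame(
  id = 1:12,
  s1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  s2 = rep(c(0, 1), 6)
)

# Drop the id column, as the question's loop starts at column 2
value_cols <- high_mutations[, -1]

# Comparing the data frame to 0 (or 1) gives a logical matrix;
# colSums then counts the TRUEs per column
counts <- data.frame(
  Samples = names(value_cols),
  zeros   = colSums(value_cols == 0),
  ones    = colSums(value_cols == 1),
  row.names = NULL
)
counts
```

This yields one row per column with both counts, which is the shape the question's rbind loop was trying to build.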

Get number of occurrences of each unique value [duplicate]

This is something I spent some time searching for. There were several good answers on Stack Overflow detailing how you can get the number of unique values, but I couldn't find any that showed how to count the number of occurrences for each value using dplyr.
df %>% select(val) %>% group_by(val) %>% mutate(count = n()) %>% unique()
This first selects the column of interest, groups by it, adds a column holding the number of occurrences of each value, and then keeps only the unique rows, leaving one row per value together with its count.
Here is a reproducible example showcasing how it works:
id <- c(1,2,3,4,5,6,7,8,9,0)
val <- c(0,1,2,3,1,1,1,0,0,2)
df <- data.frame(id=id,val=val)
df
#> id val
#> 1 1 0
#> 2 2 1
#> 3 3 2
#> 4 4 3
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 0
#> 9 9 0
#> 10 0 2
df %>% select(val) %>% group_by(val) %>% mutate(count = n()) %>% unique()
#> # A tibble: 4 x 2
#> # Groups: val [4]
#> val count
#> <dbl> <int>
#> 1 0 3
#> 2 1 4
#> 3 2 2
#> 4 3 1
Created on 2020-06-17 by the reprex package (v0.3.0)
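For comparison, dplyr also provides count(), which does the grouping and tallying in one step. A short sketch on the same example data (the name val_counts is illustrative; the count column is called n by default):

```r
library(dplyr)

# Same example data as above
df <- data.frame(id = c(1:9, 0), val = c(0, 1, 2, 3, 1, 1, 1, 0, 0, 2))

# count() groups by val and tallies the rows of each group
val_counts <- df %>% count(val)
val_counts
```

The result is ungrouped, so no trailing ungroup() or unique() is needed before further processing.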

Efficiently removing `NAs` in repeated measures designs using `tidyverse`

This is not so much a question about how to do something but more about how to do it efficiently. In particular, I would like to drop NAs in a repeated measures design in such a way that each group has only complete observations.
In the bugs_long dataframe below, the same participant takes part in four conditions and reports their desire to kill bugs in each condition. Now, if I wanted to carry out some repeated measures analysis with this dataset, this typically doesn't work in the long format because each group ends up with a different number of observations after the pairwise exclusion of NAs. So the final dataframe should leave out the following five subjects.
# setup
set.seed(123)
library(ipmisc)
library(tidyverse)
# looking at the NAs
dplyr::filter(bugs_long, is.na(desire))
#> # A tibble: 5 x 6
#> subject gender region education condition desire
#> <int> <fct> <fct> <fct> <chr> <dbl>
#> 1 2 Female North America advance LDHF NA
#> 2 80 Female North America less LDHF NA
#> 3 42 Female North America high HDLF NA
#> 4 64 Female Europe some HDLF NA
#> 5 10 Female Other high HDHF NA
Here is the current roundabout way I am hacking this and getting it to work:
# figuring out the number of levels in the grouping factor
x_n_levels <- nlevels(as.factor(bugs_long$condition))[[1]]
# removing observations that don't have all repeated values
df <-
bugs_long %>%
filter(!is.na(condition)) %>%
group_by(condition) %>%
mutate(id = dplyr::row_number()) %>%
ungroup(.) %>%
filter(!is.na(desire)) %>%
group_by(id) %>%
mutate(n = dplyr::n()) %>%
ungroup(.) %>%
filter(n == x_n_levels) %>%
select(-n)
# did this work? yes
df %>%
group_by(condition) %>%
count()
#> # A tibble: 4 x 2
#> # Groups: condition [4]
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
But I would be surprised if the tidyverse (dplyr + tidyr) doesn't have a more efficient way to achieve this, and I would really appreciate it if anyone has a better way of refactoring this.
You're actually making this much more complicated than it needs to be. Once you find the cases to exclude, it's just a simple task of removing the rows in your data that match those subjects, i.e. an anti-join.
set.seed(123)
library(ipmisc)
library(dplyr)
exclude <- filter(bugs_long, is.na(desire))
full_cases <- bugs_long %>%
anti_join(exclude, by = "subject")
Or do the filtering and anti-joining in one go, similar to what you might do in SQL:
bugs_long %>%
anti_join(filter(., is.na(desire)), by = "subject")
Either way, the number of cases kept checks out:
count(full_cases, condition)
#> # A tibble: 4 x 2
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
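The same idea (find the subjects with any missing value, then drop all of their rows) can also be sketched in base R. The toy data below is made up for illustration and only borrows the column names subject, condition and desire from bugs_long:

```r
# Hypothetical long-format data: 3 subjects, 2 conditions each
toy <- data.frame(
  subject   = rep(1:3, each = 2),
  condition = rep(c("A", "B"), times = 3),
  desire    = c(5, 6, NA, 4, 7, 8)
)

# Subjects with at least one missing measurement
incomplete <- unique(toy$subject[is.na(toy$desire)])

# Keep only rows belonging to subjects with complete measurements
complete_cases <- toy[!toy$subject %in% incomplete, ]

# Every condition now has the same number of observations
table(complete_cases$condition)
```

With dplyr, a grouped filter such as group_by(subject) %>% filter(!anyNA(desire)) expresses the same rule without building the exclusion list first.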

Select column that has the fewest NA values

I am working with a data frame that produces two output columns. One column always has more NA values than the other, but not in any predictable fashion. My question is: how can I use dplyr to select the column with the fewest NA values? I was thinking of using which.min to decide, but I'm not sure how to put it all together. Note that both columns contain NA values, and I want to select the one with the fewest.
You can do this with dplyr and purrr.
Inside which.min you first count the NAs in each column with map (this works for as many columns as your data.frame has). The keep part retains only those columns that actually contain NAs. which.min then returns a named vector, whose name we pass to dplyr's select function.
I have indented the code a bit so you can easily see which parts belong where.
library(purrr)
library(dplyr)
df %>% select(names(which.min(df %>%
                                map(function(x) sum(is.na(x))) %>%
                                keep(~ .x > 0)
                              )
                    )
              )
Another option, using dplyr only:
library(dplyr)
df <- tibble(a = c(rep(c(NA, 1:5), 4)), # df with different NA counts/col
b = c(rep(c(NA, NA, 2:5), 4)))
df %>%
summarise_all(funs(sum(is.na(.)))) # NA counts
#> # A tibble: 1 x 2
#> a b
#> <int> <int>
#> 1 4 8
df %>% # answer
select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#> a
#> <int>
#> 1 NA
#> 2 1
#> 3 2
#> 4 3
#> 5 4
#> 6 5
#> 7 NA
#> 8 1
#> 9 2
#> 10 3
#> # ... with 14 more rows
Created on 2018-05-25 by the reprex package (v0.2.0).
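funs() has since been deprecated in dplyr, so here is a sketch that does not rely on it. It uses base R on the same example data, and the names na_counts, best and fewest_na are illustrative:

```r
# Same example data: column a has 4 NAs, column b has 8
df <- data.frame(a = rep(c(NA, 1:5), 4),
                 b = rep(c(NA, NA, 2:5), 4))

# Number of NAs per column, as a named vector
na_counts <- colSums(is.na(df))

# Name of the column with the fewest NAs (first one wins on ties)
best <- names(which.min(na_counts))

# Keep just that column, still as a data frame
fewest_na <- df[, best, drop = FALSE]
```

Inside a dplyr pipeline the stored name could be used as select(df, all_of(best)).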
