Efficiently removing `NAs` in repeated measures designs using `tidyverse`

This is not so much a question about how to do something as about how to do it efficiently. In particular, I would like to drop NAs in a repeated measures design in such a way that each group ends up with only complete observations.
In the bugs_long dataframe below, the same participant takes part in four conditions and reports their desire to kill bugs in each condition. If I wanted to carry out a repeated measures analysis with this dataset, it typically wouldn't work in the long format, because after pairwise exclusion of NAs the groups end up with different numbers of observations. So the final dataframe should leave out the following five subjects.
# setup
set.seed(123)
library(ipmisc)
library(tidyverse)
# looking at the NAs
dplyr::filter(bugs_long, is.na(desire))
#> # A tibble: 5 x 6
#> subject gender region education condition desire
#> <int> <fct> <fct> <fct> <chr> <dbl>
#> 1 2 Female North America advance LDHF NA
#> 2 80 Female North America less LDHF NA
#> 3 42 Female North America high HDLF NA
#> 4 64 Female Europe some HDLF NA
#> 5 10 Female Other high HDHF NA
Here is the current roundabout way I am hacking this and getting it to work:
# figuring out the number of levels in the grouping factor
x_n_levels <- nlevels(as.factor(bugs_long$condition))

# removing observations that don't have all repeated values
df <-
  bugs_long %>%
  filter(!is.na(condition)) %>%
  group_by(condition) %>%
  mutate(id = dplyr::row_number()) %>%
  ungroup() %>%
  filter(!is.na(desire)) %>%
  group_by(id) %>%
  mutate(n = dplyr::n()) %>%
  ungroup() %>%
  filter(n == x_n_levels) %>%
  select(-n)
# did this work? yes
df %>%
  group_by(condition) %>%
  count()
#> # A tibble: 4 x 2
#> # Groups: condition [4]
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
But I would be surprised if the tidyverse (dplyr + tidyr) didn't have a more efficient way to achieve this, and I would really appreciate it if anyone has a better refactoring.

You're actually making this much more complicated than it needs to be. Once you find the cases to exclude, it's just a simple task of removing the rows in your data that match those subjects, i.e. an anti-join.
set.seed(123)
library(ipmisc)
library(dplyr)
exclude <- filter(bugs_long, is.na(desire))

full_cases <- bugs_long %>%
  anti_join(exclude, by = "subject")
Or do the filtering and anti-joining in one go, similar to what you might do in SQL:
bugs_long %>%
  anti_join(filter(., is.na(desire)), by = "subject")
Either way, the number of cases kept checks out:
count(full_cases, condition)
#> # A tibble: 4 x 2
#> condition n
#> <chr> <int>
#> 1 HDHF 88
#> 2 HDLF 88
#> 3 LDHF 88
#> 4 LDLF 88
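Another compact option, a sketch assuming subject identifies the repeated measurements (as it does in bugs_long), is to filter within subject groups, which avoids building the exclusion table at all:
full_cases <- bugs_long %>%
  group_by(subject) %>%
  # keep a subject only if desire is observed in every condition
  filter(!any(is.na(desire))) %>%
  ungroup()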

Related

How can I do Stratified sampling with proportionate size

I have a dataset named "Tree_all_exclusive" with 7607 rows and 39 columns, which contains different information about trees, such as age, height, name etc. I am able to create a sample of size 1200 with the code below, which picks trees randomly:
sam1 <- sample_n(Tree_all_exclusive, size = 1200)
But I would like to generate a proportionate stratified sample of 1200 trees, which picks the number of trees of each type according to that type's share of the dataset.
To do this I am using below code:
sam3 <- Tree_all_exclusive %>%
  group_by(TaxonNameFull) %>%
  summarise(total_numbers = n()) %>%
  arrange(-total_numbers) %>%
  mutate(pro = total_numbers / 7607) %>% # 7607 = total number of trees
  mutate(sz = pro * 1200) %>%            # 1200 = desired sample size
  mutate(siz = as.integer(sz) + 1)       # some sizes are ~0.01, so ensure at least 1
sam3

s <- stratified(sam3, group = "TaxonNameFull", sam3$siz)
But it is giving me the below error:
Error in s_n(indt, group, size) : 'size' should be entered as a named vector.
Would you please point me in any direction to solve this issue?
Also, if there is another way to do stratified sampling with proportionate sizes, please guide me.
Thanks a lot.
How about using sample_frac():
library(dplyr)
data(mtcars)

mtcars %>%
  group_by(cyl) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
mtcars %>%
  group_by(cyl) %>%
  sample_frac(.5) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 6
#> 2 6 4
#> 3 8 7
Created on 2023-01-24 by the reprex package (v2.0.1)
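(In current dplyr, sample_frac() is superseded; slice_sample(prop = .5) does the same thing.)
As for the error in the question itself: as I understand splitstackshape::stratified(), size can be a single proportion (applied within every stratum) or a named vector of per-stratum sizes. A sketch under that assumption, reusing the question's objects:
library(splitstackshape)
# proportionate sample: take roughly 1200/7607 of each taxon
s <- stratified(Tree_all_exclusive, group = "TaxonNameFull", size = 1200 / 7607)
# or pass the per-stratum counts computed in sam3 as a *named* vector
s <- stratified(Tree_all_exclusive, group = "TaxonNameFull",
                size = setNames(sam3$siz, sam3$TaxonNameFull))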

How to create a standardised mean for 2 groups of numerous variables in r?

I am playing around with the brca dataset in R and am trying to create two standardised mean values for each variable: one for the B group and one for the M group. This is so that I can calculate the difference between the standardised means and see which variables differ the most.
I think what I want to do is:
1. scale each variable so they are standardised
2. group by the outcome (either B or M)
3. calculate the mean of each variable for each group
4. pivot from wide to long (at this point I expect B to be one column and M a second column, with each variable's mean as a row and the variable name as the row name)
5. calculate the absolute difference between the B and M means for each variable and store it as a new column
6. arrange by descending difference
Does my logic sound correct?
If so, I 'think' I have managed to do steps 1-3, but I have never done these calculations before, let alone in R, so I have no idea if I am on the right track. Would anyone mind reviewing it and seeing if it looks right?
Secondly, can someone help me with how to complete the pivot to a long table (my step 4)?
library(tidyverse)
library(purrrlyr)
library(ggplot2)

temp <- dslabs::brca
df <- cbind(as.data.frame(temp$x), outcome = temp$y)

scaled_df <- df %>%
  mutate_if(is.numeric, scale) %>%
  group_by(outcome) %>%
  dmap(mean)
Something like this?
suppressPackageStartupMessages({
  library(tidyverse)
  library(purrrlyr)
})

temp <- dslabs::brca
df <- cbind(as.data.frame(temp$x), outcome = temp$y)

scaled_df <- df %>%
  mutate_if(is.numeric, scale) %>%
  group_by(outcome) %>%
  purrrlyr::dmap(mean)

scaled_df %>%
  pivot_longer(-outcome) %>%
  group_by(name) %>%
  summarise(diff_means = diff(value))
#> # A tibble: 30 × 2
#> name diff_means
#> <chr> <dbl>
#> 1 area_mean 1.47
#> 2 area_se 1.13
#> 3 area_worst 1.52
#> 4 compactness_mean 1.23
#> 5 compactness_se 0.605
#> 6 compactness_worst 1.22
#> 7 concave_pts_mean 1.60
#> 8 concave_pts_se 0.843
#> 9 concave_pts_worst 1.64
#> 10 concavity_mean 1.44
#> # … with 20 more rows
Created on 2022-08-04 by the reprex package (v2.0.1)
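To finish steps 5 and 6 from the question (absolute difference, sorted descending), the same pipeline extends naturally; diff() here is one group's mean minus the other's given the factor order, hence the abs():
scaled_df %>%
  pivot_longer(-outcome) %>%
  group_by(name) %>%
  # absolute gap between the two group means, largest first
  summarise(diff_means = abs(diff(value))) %>%
  arrange(desc(diff_means))
And since purrrlyr is retired, a sketch of the same group means using only dplyr's across():
df %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)))) %>%
  group_by(outcome) %>%
  summarise(across(where(is.numeric), mean))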

How to scale by pair of columns together

I want to use the scale function, but apply it to each pair of columns: the mean should be calculated over a pair of columns, not over each column separately.
In detail, this is my data, for example:
phone  phone1_X  phone2  phone2_X  phone3  phone3_X
1      2         3       4         5       6
2      4         6       8         10      12
I want to use the scale function on each pair: phone1 + phone1_X, phone2 + phone2_X, etc. Each pair shares the same base name ("phone1"), but the second column always carries an additional "_X" (a different condition in the experiment).
In the end, I wish to have the original table in z-scores (where, as mentioned, the mean is calculated per pair of columns and not per single column).
Thank you so much!
There might be a more elegant way, but this is how I'd do it.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -phone) %>%
  group_by(phone, name = stringr::str_extract(name, 'phone[0-9]?')) %>%
  summarise(mean_value = mean(value), .groups = 'drop') %>%
  pivot_wider(names_from = name, values_from = mean_value)
#> # A tibble: 2 × 4
#> phone phone1 phone2 phone3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3.5 5.5
#> 2 2 4 7 11
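The question also asked for the original table rewritten as z-scores, not just the pair means. A sketch along the same lines, assuming df as above with phone as the row identifier and sd() as the scaling denominator:
df %>%
  pivot_longer(cols = -phone) %>%
  # standardise within each pair (phone1, phone2, ...) rather than per column
  group_by(pair = stringr::str_extract(name, 'phone[0-9]+')) %>%
  mutate(value = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  select(-pair) %>%
  pivot_wider(names_from = name, values_from = value)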

converting long to wide with columns starting at zero

I have the following data
county <- c("a", "a", "a", "b", "b", "c")
id <- c(1, 2, 3, 4, 5, 6)
data <- data.frame(county, id)
I need to convert from long to wide and get the following output
county <- c("a", "b", "c")
id__0 <- c(1, 4, 6)
id__1 <- c(2, 5, NA)
id__2 <- c(3, NA, NA)
data2 <- data.frame(county, id__0, id__1, id__2)
My main problem is not the long-to-wide conversion itself, but how to make the column numbering start at id__0.
You could add an intermediate variable by grouping according to county and using mutate to build a sequence from 0 upwards for each county, then pivot_wider on that:
library(tidyr)
library(dplyr)
data %>%
  group_by(county) %>%
  mutate(id_count = seq(n()) - 1) %>%
  pivot_wider(county, names_from = id_count, values_from = id,
              names_prefix = "id_")
#> # A tibble: 3 x 4
#> # Groups: county [3]
#> county id_0 id_1 id_2
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 2 3
#> 2 b 4 5 NA
#> 3 c 6 NA NA
Created on 2022-02-10 by the reprex package (v2.0.1)
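If you need the double underscore from the question's expected output (id__0, id__1, ...), only the prefix changes:
data %>%
  group_by(county) %>%
  mutate(id_count = seq(n()) - 1) %>%
  pivot_wider(county, names_from = id_count, values_from = id,
              names_prefix = "id__")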

Find cohorts in dataset in r dataframe

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. People in the same cohort take the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all take the same test.
I'm working in R. The data.frame has about 1000 rows and 100 columns. Each column corresponds to one test question. For each row (candidate), all column entries are NA apart from the ones for the questions that candidate was shown; the entries for those questions are either 0 or 1 (see picture).
Is there an elegant way to solve this? The only thing I could think of was using dplyr and filtering per 15-question subset, then checking how many rows remain. However, with 100 columns this means checking (I think) 100 choose 15 different possibilities. Many thanks!
[image: data.frame structure]
We can infer the cohort based on the NA pattern:
library(tidyverse)
answers <- tribble(
  ~candidate, ~q1, ~q2, ~q3,
  1, 0, NA, NA,
  2, 1, NA, NA,
  3, 0, 0, 0
)
answers
answers
#> # A tibble: 3 x 4
#> candidate q1 q2 q3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA
#> 2 2 1 NA NA
#> 3 3 0 0 0
# infer cohort by NA pattern
cohorts <- answers %>%
  group_by(candidate) %>%
  mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
  unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups: candidate [3]
#> candidate cohort
#> <dbl> <chr>
#> 1 1 TRUE_NA_NA
#> 2 2 TRUE_NA_NA
#> 3 3 TRUE_TRUE_TRUE
answers %>%
  pivot_longer(-candidate) %>%
  left_join(cohorts) %>%
  # count filled answers per candidate and cohort
  group_by(cohort, candidate) %>%
  filter(!is.na(value)) %>%
  count() %>%
  # get the largest cohort
  arrange(-n) %>%
  pull(cohort) %>%
  first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"
Created on 2021-09-21 by the reprex package (v2.0.1)
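A more direct way to size the cohorts, sketched with c_across() in place of the superseded mutate_at() pattern: build the NA-pattern key per row, then count candidates per key:
answers %>%
  rowwise() %>%
  # key each candidate by which questions they were shown
  mutate(cohort = paste(!is.na(c_across(-candidate)), collapse = "_")) %>%
  ungroup() %>%
  # cohort sizes, largest first
  count(cohort, sort = TRUE)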
