R: Using gtsummary on a table where an ID can have multiple values

I have this example data frame:
df <- tibble(ID = c(1, 1, 2), value = c(0, 1, 3), group = c("group0", "group0", "group1")) %>% group_by(value)
     ID value group
  <dbl> <dbl> <chr>
1     1     0 group0
2     1     1 group0
3     2     3 group1
That is, an ID always belongs to exactly one group, but more than one value may be associated with that ID.
I now want to summarise the occurrence of values within the different groups. For that I tried
df %>% gtsummary::tbl_summary(by = "group")
which gives me
However, the N numbers in the header do not match my requirements: I only want to count the number of unique IDs in each group, so for both groups it should be N = 1.
Is there a way to achieve this with gtsummary?
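As a starting point, the number of unique IDs per group can be computed outside gtsummary. The sketch below is not a gtsummary solution in itself; it only derives the N the question asks for with dplyr's n_distinct(). The name id_counts is made up for illustration, and the group_by(value) from the question is left out because it is not needed here.

```r
library(dplyr)

# Example data from the question (without the grouping by value)
df <- tibble(
  ID    = c(1, 1, 2),
  value = c(0, 1, 3),
  group = c("group0", "group0", "group1")
)

# Count distinct IDs per group instead of rows per group
id_counts <- df %>%
  group_by(group) %>%
  summarise(n_ids = n_distinct(ID), .groups = "drop")

id_counts
```

Both groups come out with one unique ID, which is the N the header should show.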

Related

Difference between two variables in R within the same group

I have a dataset with the following structure:
I would like to make the difference between two variables in the same group. Thus, the result I wish to obtain is the following:
Note that the difference must always be equal to or bigger than 0. I would like to solve it using R.
Try group_by() together with diff():
library(tidyverse)
df <- data.frame(group = rep(LETTERS[1:3], each=2),
value = c(20, 5, 0, 30, 10, 2))
df %>%
group_by(group) %>%
summarise(difference= abs(diff(value)))
# A tibble: 3 × 2
group difference
<chr> <dbl>
1 A 15
2 B 30
3 C 8
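abs(diff(value)) relies on each group having exactly two rows. A possible variant, assuming the wanted quantity is the gap between the largest and smallest value in a group, is max(value) - min(value), which is never negative and also returns a single number for groups of any size:

```r
library(dplyr)

df <- data.frame(
  group = rep(LETTERS[1:3], each = 2),
  value = c(20, 5, 0, 30, 10, 2)
)

# Largest minus smallest value per group; always >= 0
result <- df %>%
  group_by(group) %>%
  summarise(difference = max(value) - min(value), .groups = "drop")

result
```

For the example data this gives the same differences as the answer above (15, 30 and 8).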

Compute accordance of column values grouped by another column [duplicate]

This question already has an answer here:
Find out what values occur the most in my collection and its proportion
(1 answer)
Closed 1 year ago.
I have a data frame with a column of IDs spanning multiple rows (col_id) and another column of assessments for this row (col_assessment), like so:
df <- data.frame(col_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
col_assessment = c("Pos", "Pos", "Neu", "Neu", "Neg", "Neu", "Pos", "Neu", "Neg"))
I now want to calculate how much the assessments agree for each ID (i.e. how many of the assessments are the same per ID). For this, I have the following function. (I do not have to use this function and am also open to other solutions.)
compute_ICR <- function(coding_values){
### takes in list of coding values and returns number of the share of agreement (up to 1 if all are in agreement)
most_common_value <- coding_values %>% table() %>% sort(decreasing = TRUE) %>% magrittr::extract(1) %>% names()
share_accordance <- length(which(coding_values == most_common_value)) / coding_values %>% nrow()
# number of matching, most common values divided by number of total values
return(share_accordance)
}
I would now like to apply this to df by group of col_id, like so (not working pseudo-code!)
df %>% group_by(col_id) %>% summarize(share_accordance = compute_ICR(df$col_assessment))
This should give me the following data frame for the above example:
data.frame(col_id = c(1,2,3), share_accordance = c(.6667, 1, .333))
Can someone point out how to achieve this result? Thanks in advance.
I would change the function to -
compute_ICR <- function(x){
sort(table(x), decreasing = TRUE)[1]/length(x)
}
and apply it to each ID.
library(dplyr)
df %>%
group_by(col_id) %>%
summarize(share_accordance = compute_ICR(col_assessment))
# col_id share_accordance
# <dbl> <dbl>
#1 1 0.667
#2 2 0.667
#3 3 0.333
Or in base R -
aggregate(col_assessment~col_id, df, compute_ICR)
As I understand your question, you want the largest proportion of answers per ID? The code below gives this answer regardless of the number of possible values for col_assessment.
library(dplyr)
df %>%
group_by(col_id) %>%
summarise(prop = max(prop.table(table(col_assessment))))
Returns:
col_id prop
<dbl> <dbl>
1 1 0.667
2 2 0.667
3 3 0.333
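Two things stop the pseudo-code in the question from working: df$col_assessment always refers to the whole column rather than the current group, and nrow() returns NULL for a plain vector. A minimal sketch that avoids both, with compute_share as a made-up name, returns the share as a plain number:

```r
library(dplyr)

df <- data.frame(
  col_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  col_assessment = c("Pos", "Pos", "Neu", "Neu", "Neg", "Neu", "Pos", "Neu", "Neg")
)

# Share of the most common value: count of the modal value / number of values
compute_share <- function(x) {
  max(table(x)) / length(x)
}

# Refer to the column by its bare name so summarise() passes one group at a time
shares <- df %>%
  group_by(col_id) %>%
  summarise(share_accordance = compute_share(col_assessment), .groups = "drop")

shares
```

Ties between equally common values do not matter here, because only the size of the largest count is used.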

Summarise using multiple functions with dplyr across()

I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.
Below is an example of what I mean:
require(tidyverse)
df <- tibble(id = c(1,1,2,3,4,4,4),
col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
col4 = c('a','b','b','b','b','b','c') # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct() as per the below:
# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
group_by(id) %>%
mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>%
ungroup()
For simplicity I can now take one row for each id:
# take one row for each test (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:
consistency <- df %>%
summarise(across(contains('distinct'), ~sum(.>1) / n(.)))
But this gives the following error, which I am having trouble interpreting:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want by doing the following:
# calculate consistency for each column by finding the number of distinct values greater
# than 1 and dividing by total rows
# first get the number of distinct values
n_inconsistent <- df %>%
summarise(across(.cols = contains('distinct'), ~sum(.>1)))
# next get the number of rows
n_total <- nrow(df)
# calculate the proportion of tests that have more than one value for each column
consistency <- n_inconsistent %>%
mutate(across(contains('distinct'), ~./n_total))
But this involves intermediate variables and feels inelegant.
You can do it in the following way :
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(starts_with('col'), n_distinct)) %>%
summarise(across(starts_with('col'), ~mean(. > 1), .names = '{.col}_distinct'))
# col1_distinct col2_distinct col3_distinct col4_distinct
# <dbl> <dbl> <dbl> <dbl>
#1 0 0.25 0.25 0.5
First we count the number of unique values in each column per id, and then calculate the proportion of ids for which that count is above 1 in each column.
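As for the error itself: n() takes no arguments, so n(.) fails with "unused argument". A sketch of the one-step version the question was aiming for, assuming one row per id after the first summarise(), calls n() without an argument:

```r
library(dplyr)

df <- tibble(
  id   = c(1, 1, 2, 3, 4, 4, 4),
  col1 = c("a", "a", "b", "b", "c", "c", "c"),
  col2 = c("a", "b", "b", "b", "c", "c", "c"),
  col3 = c("a", "a", "b", "b", "a", "b", "c"),
  col4 = c("a", "b", "b", "b", "b", "b", "c")
)

consistency <- df %>%
  group_by(id) %>%
  # one row per id: number of distinct responses in each column
  summarise(across(col1:col4, n_distinct)) %>%
  # share of ids with more than one distinct response; n() is the row count
  summarise(across(col1:col4, ~ sum(.x > 1) / n()))

consistency
```

sum(.x > 1) / n() is the same quantity as mean(.x > 1), so this is only a different spelling of the answer above.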

Using tidyr top_n to select by a variable and include NAs

I'm trying to use dplyr to group by a variable and identify the nearest location for every place in my dataset. I would also like to include all rows for which distance has not been measured (NA).
# Set up df of place, distance, and destination.
df <- data.frame(place = c('A','B','B','C','C','D','D'),dist = c(NA, 4, 1, 6, 3, 1, 1), dest = 1:7)
# For each place, get the nearest destination.
df %>%
group_by(place) %>%
top_n(1, desc(dist))
# This does not return a row for place A.
Is there a dplyr solution for using top_n to identify rows based on rank that will also include rows that have not been ranked? Thank you in advance.
This works but there are probably more efficient solutions.
coalesce() substitutes a fallback for a missing dist so that the row can still be ranked. The intent of max(dist) + 1 is to place an NA row behind every measured distance in its group, and the final 0 guarantees a non-missing value when the whole group is NA; any number would do there. Note that max(dist) is itself NA whenever the group contains an NA, so for a group that mixes measured and missing distances you would need max(dist, na.rm = TRUE).
If you were ranking without desc(), you would likely use min(dist) instead of max(dist).
df %>%
group_by(place) %>%
top_n(1, desc(coalesce(dist, max(dist)+1, 0)))
  place  dist  dest
  <fct> <dbl> <int>
1 A        NA     1
2 B         1     3
3 C         3     5
4 D         1     6
5 D         1     7
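An alternative that does not depend on a fallback value is to state the two conditions directly: keep a row if its distance is missing, or if it has the smallest rank in its group. This is a sketch using filter() and min_rank() instead of top_n(), under the assumption that every unmeasured row should be kept, including in groups that also have measured distances:

```r
library(dplyr)

df <- data.frame(
  place = c("A", "B", "B", "C", "C", "D", "D"),
  dist  = c(NA, 4, 1, 6, 3, 1, 1),
  dest  = 1:7
)

nearest <- df %>%
  group_by(place) %>%
  # min_rank() leaves NA as NA, so is.na() keeps the unmeasured rows;
  # ties for the smallest distance all get rank 1 and are all kept
  filter(is.na(dist) | min_rank(dist) == 1) %>%
  ungroup()

nearest
```

On the example data this keeps the same five rows as the answer above: place A with its missing distance, the nearest destination for B and C, and both tied destinations for D.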

Threshold exceed check by two tables

The first table is a threshold data frame that holds the threshold for each label:
threshold <- data.frame(label=c("a","b", "c", "a","d", "e", "f"), threshold = c(12, 10, 20, 12, 12, 35, 40))
[Labels can be repeated in this table, but a repeated label always has the same threshold, as "a" does.]
The second table contains a value and a label along with an id:
data_id <- data.frame(id =c(1,2,1,4),label=c("a","b","a","b"), value =c(32.1,0,15.0,10))
I need to check these values against the first table to see, for each unique id, whether the value exceeds the threshold of its label.
[For each id: how many times its values exceeded the threshold of the respective label.]
Finally, I am expecting a table like this:
[the total number of exceeding values for each unique id and label combination]
I can do this by selecting each label with an if condition, but I would like a dynamic approach that takes less time. [I have millions of records.]
I didn't understand your goal clearly but looking at your final data frame, I am assuming you want to get the total number of exceeding values for each unique id & label combination. Below is a possible dplyr solution:
library(dplyr)
final_df <- data_id %>%
left_join(unique(threshold), by = "label") %>%
mutate(check = if_else(value > threshold, 1, 0)) %>%
group_by(id, label) %>%
summarise(exceed = sum(check))
final_df
# # A tibble: 3 x 3
# # Groups: id [?]
# id label exceed
# <dbl> <chr> <dbl>
# 1 1 a 2
# 2 2 b 0
# 3 4 b 0
Please note that you may get a warning while joining the data frames if the labels are stored as factors with different levels, which is the default for data.frame() in R versions before 4.0.0. You may want to set stringsAsFactors = FALSE when creating your data frames for consistency.
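The join only behaves as intended if each label ends up with exactly one threshold after de-duplication; otherwise rows of data_id would be multiplied. The sketch below adds that check before joining. The name thresholds_unique is made up for illustration, and the logical comparison is summed directly instead of going through if_else():

```r
library(dplyr)

threshold <- data.frame(
  label = c("a", "b", "c", "a", "d", "e", "f"),
  threshold = c(12, 10, 20, 12, 12, 35, 40),
  stringsAsFactors = FALSE
)
data_id <- data.frame(
  id = c(1, 2, 1, 4),
  label = c("a", "b", "a", "b"),
  value = c(32.1, 0, 15.0, 10),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, then make sure no label has two thresholds
thresholds_unique <- distinct(threshold)
stopifnot(anyDuplicated(thresholds_unique$label) == 0)

final_df <- data_id %>%
  left_join(thresholds_unique, by = "label") %>%
  group_by(id, label) %>%
  # TRUE counts as 1, so the sum is the number of exceeding values
  summarise(exceed = sum(value > threshold), .groups = "drop")

final_df
```

A value equal to its threshold (id 4, label "b") is not counted, because the comparison is strict.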
