I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.
Below is an example of what I mean:
require(tidyverse)
df <- tibble(id = c(1,1,2,3,4,4,4),
col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
col4 = c('a','b','b','b','b','b','c') # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can test for inconsistent responses within ids using group_by(), across(), and n_distinct(), as below:
# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
group_by(id) %>%
mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>%
ungroup()
For simplicity I can now take one row for each id:
# take one row for each test (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:
consistency <- df %>%
summarise(across(contains('distinct'), ~sum(.>1) / n(.)))
But this gives the following error, which I am having trouble interpreting:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want by doing the following:
# calculate consistency for each column by finding the number of distinct values greater
# than 1 and dividing by total rows
# first get the number of distinct values
n_inconsistent <- df %>%
summarise(across(.cols = contains('distinct'), ~sum(.>1)))
# next get the number of rows
n_total <- nrow(df)
# calculate the proportion of tests that have more than one value for each column
consistency <- n_inconsistent %>%
mutate(across(contains('distinct'), ~./n_total))
But this involves intermediate variables and feels inelegant.
You can do it in the following way:
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(starts_with('col'), n_distinct)) %>%
summarise(across(starts_with('col'), ~mean(. > 1), .names = '{.col}_distinct'))
# A tibble: 1 x 4
# col1_distinct col2_distinct col3_distinct col4_distinct
# <dbl> <dbl> <dbl> <dbl>
#1 0 0.25 0.25 0.5
First we count the number of unique values in each column per id, and then calculate the proportion of ids whose count is above 1 in each column.
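As an aside, the error from the original attempt is easy to decode once you know that n() takes no arguments: calling n(.) is what triggers unused argument (.). Dropping the dot makes the one-liner work directly on the deduplicated data frame:
# n() takes no arguments; inside summarise() it returns the number of rows
consistency <- df %>%
summarise(across(contains('distinct'), ~sum(. > 1) / n()))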
Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)
# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
data.frame(
other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
num1 = rnorm(total_size, mean = 10, sd = 3),
num2 = rnorm(total_size, mean = 60, sd = 9),
num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each was run at a different time and has its own control group). I wish to calculate the mean of each numeric endpoint (columns 5:7 in this example) for the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of its respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like the simplest solution possible, minimizing the creation of new variables, ideally in a single dplyr pipeline.
What I've tried so far
I should say that I have technically solved this but the solution is super ugly with a ton of steps so I'm hoping to get help to find a more elegant solution.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
filter(dose == 0) %>%
group_by(drug, dose) %>%
summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control-group values within each 'drug', then after grouping by 'drug', use mutate() with across() (from dplyr 1.0.0) and divide each column's values (.) by the mean of the values where 'dose' is 0:
library(dplyr) # 1.0.0
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If the dplyr version is < 1.0.0, use mutate_if:
df %>%
group_by(drug) %>%
mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
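A quick sanity check on either version, using the simulated df from the question: after normalisation, the mean of every numeric endpoint within each control group should be exactly 1.
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0]))) %>%
filter(dose == 0) %>%
summarise(across(where(is.numeric), mean)) # num1, num2, num3 should all be 1 for each drug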
I have a data set called test with multiple observations per participant. Every participant has a unique id but several observations (1 row in the data = 1 observation). I have to reduce the data set to 1 row per participant and add two new variables: the number of observations per participant and the sum of points he or she received across those observations.
I have already computed these values, but how can I create and add these two variables to my data set based on this code?
test %>%
group_by(id) %>%
summarize(sum_communities = sum(id/id, na.rm = TRUE))
test %>%
group_by(id) %>%
summarize(sum_points = sum(points, na.rm = TRUE))
I created demo data in the test table; the test_reduced table has the desired output.
library(dplyr)
test = data.frame("Participent" =c("A","A","A","B","B","C","C","C", "C"),
"Observation" = c(4,5,6,4,7,4,6,6,3))
test_reduced = test %>% group_by(Participent) %>%
summarise(count = n(), sum = sum(Observation))
Output:
# A tibble: 3 x 3
Participent count sum
<fct> <int> <dbl>
1 A 3 15
2 B 2 11
3 C 4 19
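If you would rather keep all rows and simply attach the two new variables to the original data (the same value repeated on each of a participant's rows), swap summarise() for a grouped mutate(). A sketch on the same demo data:
test %>%
group_by(Participent) %>%
mutate(count = n(), # number of observations per participant
sum = sum(Observation)) %>% # total of Observation per participant
ungroup()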
I have a question on how to count the occurrences of specified runs (consecutive values) in a data set in R.
I am currently working on continuous-glucose-monitoring data sets. Shortly, each data set has between 1500 to 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrence of glucose values below 3.9 occurring for 15 minutes or more and less than 120 minutes in a row (>3 observations and <24 observations for values <3.9 in a row) on a numeric scale.
I have made a new variable with a factor 1 or 0 for whether the plasma glucose value is below 3.9 or not.
I would then like to count the number of runs of more than three and fewer than twenty-four consecutive 1's.
Is there a function in R for this or what would be the easiest approach?
I'm not sure if I got your data structure right, but maybe the following code can still help.
I'm assuming a data structure that includes a measurement, a person id, and a measurement id.
library(dplyr)
# create dumy-data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5, 2),
person_id = rep(1:10, each = 10),
measure_id = rep(1:10, 10))
data_test$below_criterion = 0 # indicator for measures below crit-value
data_test$below_criterion[which(data_test$measure < 3.9)] = 1 # indicator for measures below crit-value
# indicator, that shows if the current measurement is the first one below crit_val in a possible series
# shift columns, to compare current value with previous one
data_test = data_test %>% group_by(person_id) %>% mutate(prev_below_crit = c(below_criterion[1], below_criterion[1:(n()-1)]))
data_test$start_of_run = 0 # create the indicator variable
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1 # if current value is below crit and previous value is above, this is the start of a series
data_test = data_test %>% group_by(person_id) %>% mutate(grouper = cumsum(start_of_run)) # helper-variable to group all the possible series within a person
data_test = data_test %>% select(measure, person_id, measure_id, below_criterion, grouper) # get rid of the previous created helper-variables
data_results = data_test %>% group_by(person_id, grouper) %>% summarise(count_below_crit = sum(below_criterion)) # count the length of each series by summing up all below_crit indicators within a person and series
data_results = data_results %>% group_by(person_id) %>% filter(count_below_crit >= 3 & count_below_crit <= 24) %>% summarise(n_series = n()) # count all series of the desired length for each person
data_results
data.frame(data_test) # print the full intermediate data for inspection
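For comparison, here is a more compact sketch built on base R's rle(), which encodes a vector as runs of equal values; it assumes the same data_test from above:
data_test %>%
group_by(person_id) %>%
summarise(n_series = {
runs <- rle(below_criterion) # run-length encoding of the 0/1 indicator
sum(runs$values == 1 & runs$lengths >= 3 & runs$lengths <= 24) # runs of 1's of the desired length
})
Unlike the version above, persons with no qualifying series are kept here with a count of 0.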
I have the feeling this has already been asked several times, but I cannot make it work in my case, and I don't know why.
I group_by() my data frame and calculate a mean of the values. Additionally, I have marked one specific row per group, and I want to calculate the ratio of that highlighted row's value to the freshly calculated group mean.
library(dplyr)
df <- data.frame(int=c(5:1,4:1),
highlight=c(T,F,F,F,F,F,T,F,F),
exp=c('a','a','a','a','a','b','b','b','b'))
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?
My expected output would be
exp mean ratio_mean
<fct> <dbl> <dbl>
1 a 3 1.67
2 b 2.5 1.2
This works:
df %>%
group_by(exp) %>%
summarise(mean = mean(int),
l1 = n(),
ratio_mean = int[highlight] / mean)
But what's going wrong with your solution?
nrow(.) counts the number of rows of your whole input data frame, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean again uses the whole input data frame and subsets by the highlight column, but it gets divided by the correct group mean. You are actually returning two values here, as two rows of your original df have highlight = TRUE. This causes a nasty NA column name.
To fix it, we could use do() as suggested by @MikkoMarttila, but this gets a little bit clunky:
df %>%
group_by(exp) %>%
do(summarise(., mean = mean(.$int),
l1 = nrow(.),
ratio_mean = .$int[.$highlight] / mean))
Original output
df %>%
group_by(exp) %>%
summarise(mean=mean(int),
l1=nrow(.),
ratio_mean=.[.$highlight, 'int']/mean)
# A tibble: 2 x 4
# exp mean l1 ratio_mean$ NA
# <fct> <dbl> <int> <dbl> <dbl>
# 1 a 3 9 1.67 2
# 2 b 2.5 9 1 1.2
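Alternatively, from dplyr 1.0.0 onwards you can use cur_data(), which returns the current group's data inside summarise() and is essentially the group-aware version of what . was hoped to be (it was later superseded by pick()):
df %>%
group_by(exp) %>%
summarise(mean = mean(int),
l1 = nrow(cur_data()), # rows in the current group only
ratio_mean = cur_data()$int[cur_data()$highlight] / mean)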
The 1st table is a threshold data frame, which holds the threshold for each label:
threshold <- data.frame(label=c("a","b", "c", "a","d", "e", "f"), threshold = c(12, 10, 20, 12, 12, 35, 40))
(This table can contain repeated labels, but a repeated label always has the same threshold, as with "a" above.)
The 2nd table contains a value and a label along with a unique id:
data_id <- data.frame(id =c(1,2,1,4),label=c("a","b","a","b"), value =c(32.1,0,15.0,10))
For each unique id, I need to check this table against the previous one and count how many times the value exceeded the threshold for the respective label.
Finally, I am expecting a table with the total number of exceeding values for each unique id and label combination.
I can do this by handling each label with an if condition, but I would like a dynamic approach that runs in less time (I have millions of records).
I didn't understand your goal clearly but looking at your final data frame, I am assuming you want to get the total number of exceeding values for each unique id & label combination. Below is a possible dplyr solution:
library(dplyr)
final_df <- data_id %>%
left_join(unique(threshold), by = "label") %>%
mutate(check = if_else(value > threshold, 1, 0)) %>%
group_by(id, label) %>%
summarise(exceed = sum(check))
final_df
# # A tibble: 3 x 3
# # Groups: id [?]
# id label exceed
# <dbl> <chr> <dbl>
# 1 1 a 2
# 2 2 b 0
# 3 4 b 0
Please note that you will get a warning while joining the data frames because the labels were initially defined as factors with different levels. You may want to set stringsAsFactors = FALSE when creating your data frames for consistency.
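For example, a minimal way to rebuild the inputs with character labels, which avoids the warning:
threshold <- data.frame(label = c("a", "b", "c", "a", "d", "e", "f"),
threshold = c(12, 10, 20, 12, 12, 35, 40),
stringsAsFactors = FALSE)
data_id <- data.frame(id = c(1, 2, 1, 4),
label = c("a", "b", "a", "b"),
value = c(32.1, 0, 15.0, 10),
stringsAsFactors = FALSE)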