Threshold exceedance check across two tables in R

The 1st table is a threshold data frame that holds the threshold for each label:
threshold <- data.frame(label = c("a", "b", "c", "a", "d", "e", "f"), threshold = c(12, 10, 20, 12, 12, 35, 40))
[This table can contain repeated labels, but a repeated label always carries the same threshold, e.g. "a".]
The 2nd table contains a value and a label along with a unique id:
data_id <- data.frame(id = c(1, 2, 1, 4), label = c("a", "b", "a", "b"), value = c(32.1, 0, 15.0, 10))
I need to check this table against the previous one to find the values that exceed the respective threshold, considering each unique id.
[For each id, how many times did the value exceed the threshold for the respective label?]
Finally, I am expecting a table like this:
[the total number of exceeding values for each unique id & label combination]
I can do this by picking out each label with an if condition, but I would like a dynamic approach that runs in less time. [I have millions of records.]

I didn't understand your goal completely, but looking at your final data frame, I assume you want the total number of exceeding values for each unique id & label combination. Below is a possible dplyr solution:
library(dplyr)
final_df <- data_id %>%
  left_join(unique(threshold), by = "label") %>%
  mutate(check = if_else(value > threshold, 1, 0)) %>%
  group_by(id, label) %>%
  summarise(exceed = sum(check))
final_df
# # A tibble: 3 x 3
# # Groups:   id [?]
#      id label exceed
#   <dbl> <chr>  <dbl>
# 1     1 a          2
# 2     2 b          0
# 3     4 b          0
Please note that you will get a warning when joining the data frames because the labels are initially defined as factors with different levels. You may want to set stringsAsFactors = FALSE when creating your data frames for consistency.
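For example, a minimal sketch of building both input tables with character labels (note that from R 4.0.0 onward stringsAsFactors already defaults to FALSE):
threshold <- data.frame(label = c("a", "b", "c", "a", "d", "e", "f"),
                        threshold = c(12, 10, 20, 12, 12, 35, 40),
                        stringsAsFactors = FALSE)
data_id <- data.frame(id = c(1, 2, 1, 4),
                      label = c("a", "b", "a", "b"),
                      value = c(32.1, 0, 15.0, 10),
                      stringsAsFactors = FALSE)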

Related

R: Using gtsummary for a table where an ID can have multiple values

I have this example data frame:
df <- tibble(ID = c(1, 1, 2), value = c(0, 1, 3), group = c("group0", "group0", "group1")) %>% group_by(value)
     ID value group
  <dbl> <dbl> <chr>
1     1     0 group0
2     1     1 group0
3     2     3 group1
That is, an ID always belongs to one group, however, there might be more than one value associated with that ID.
I now want to summarise the occurrence of values within the different groups. For that I tried
df %>% gtsummary::tbl_summary(by = "group")
which gives me a summary table. However, the N numbers in its header do not quite match my requirements, because I only want to count the number of unique IDs in each group. Therefore, for both groups it should be N = 1.
Is there a way to achieve this with gtsummary?
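For reference, the Ns the question asks for in the header (the number of unique IDs per group) can be computed directly with dplyr; this is only an illustration of the desired counts, not a gtsummary solution:
library(dplyr)
df %>%
  ungroup() %>%                    # drop the group_by(value) from the definition above
  group_by(group) %>%
  summarise(N = n_distinct(ID))
#   group      N
#   <chr>  <int>
# 1 group0     1
# 2 group1     1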

Summarise using multiple functions with dplyr across()

I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.
Below is an example of what I mean:
require(tidyverse)
df <- tibble(id = c(1,1,2,3,4,4,4),
col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
col4 = c('a','b','b','b','b','b','c') # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct() as per the below:
# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
group_by(id) %>%
mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>%
ungroup()
For simplicity I can now take one row for each id:
# take one row for each test (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:
consistency <- df %>%
summarise(across(contains('distinct'), ~sum(.>1) / n(.)))
But this gives the following error, which I am having trouble interpreting:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want by doing the following:
# calculate consistency for each column by finding the number of distinct values greater
# than 1 and dividing by total rows
# first get the number of distinct values
n_inconsistent <- df %>%
summarise(across(.cols = contains('distinct'), ~sum(.>1)))
# next get the number of rows
n_total <- nrow(df)
# calculate the proportion of tests that have more than one value for each column
consistency <- n_inconsistent %>%
mutate(across(contains('distinct'), ~./n_total))
But this involves intermediate variables and feels inelegant.
You can do it in the following way:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(starts_with('col'), n_distinct)) %>%
  summarise(across(starts_with('col'), ~ mean(. > 1), .names = '{.col}_distinct'))
#   col1_distinct col2_distinct col3_distinct col4_distinct
#           <dbl>         <dbl>         <dbl>         <dbl>
# 1             0          0.25          0.25           0.5
First we count the number of unique values in each column per id, and then calculate the proportion of ids for which that count is greater than 1 in each column.
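As a note on the error in the question: dplyr::n() takes no arguments, so the original one-liner also works on the de-duplicated df once the dot inside n() is removed:
consistency <- df %>%
  summarise(across(contains('distinct'), ~ sum(. > 1) / n()))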

How to calculate a group average excluding the first 300 measurements in group 1 in R

I have many 1-second emission measurements and have grouped them by event number. However, I would like to remove the first 300 measurements from group 1 and calculate the group average from measurement 301 to the last measurement in that group. For the remaining groups, I will just calculate the group average using all measurements; there is no need to drop the first 300 seconds.
I know how to compute group averages without excluding the first 300 measurements in group 1, with something like:
StartsSummary <- ddply(emission, "Group", summarize, CO2_avg = mean(CO2_DC))
emission <- data.frame(Group = c(rep(1, 400), rep(2, 305), rep(3, 200)), CO2_DC = c(rep(0.5, 350), rep(1, 400), rep(1.5, 155)))
I expect the results as:
Group CO2_avg
1 0.75 (excluding first 300 measurements in group 1)
2 1 (include all measurements in group 2)
3 1.3875 (include all measurements in group 3)
You can combine #TonyLadson and #tmfmnk into one filter statement.
library(dplyr)
emission <- data.frame(Group = c(rep(1, 400), rep(2, 305), rep(3, 200)), CO2_DC = c(rep(0.5, 350), rep(1, 400), rep(1.5, 155)))
emission %>%
  group_by(Group) %>%
  filter(!(Group == 1 & row_number() %in% 1:300)) %>%
  summarize(CO2_Avg = mean(CO2_DC))
  Group CO2_Avg
  <dbl>   <dbl>
1     1    0.75
2     2    1
3     3    1.39
Edit: I switched the order of the group_by() and filter() statements. Because row_number() is then evaluated within each group, the statement still works if the groups appear in a different order, or if you wanted to handle, say, the first 100 rows of Group 2 in the same way.
Depending on the size of the real problem, the easiest way would be to do the calculation in two stages
library(tidyverse)
# Mean of groups 2 and 3 using all data
emission %>%
  filter(Group != 1) %>%   # exclude group 1
  group_by(Group) %>%
  summarise(mean(CO2_DC))
# Mean of group 1 excluding the first 300 rows
emission %>%
  filter(Group == 1) %>%
  slice(301:n()) %>%
  summarise(mean(CO2_DC))
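If a single result table is preferred, one way (a sketch, reusing the same emission data frame) is to bind the two stages back together:
library(dplyr)
part_1 <- emission %>%
  filter(Group == 1) %>%
  slice(301:n()) %>%                  # drop the first 300 measurements of group 1
  group_by(Group) %>%
  summarise(CO2_avg = mean(CO2_DC))
part_rest <- emission %>%
  filter(Group != 1) %>%
  group_by(Group) %>%
  summarise(CO2_avg = mean(CO2_DC))
bind_rows(part_1, part_rest) %>% arrange(Group)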

How to count the occurrence of permutations in a data set in R?

I have a question on how to count the occurrence of specified permutations in a data set in R.
I am currently working on continuous-glucose-monitoring data sets. In short, each data set has between 1500 and 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrences of glucose values below 3.9 lasting 15 minutes or more and less than 120 minutes in a row (> 3 observations and < 24 observations below 3.9 in a row) on a numeric scale.
I have made a new variable with a factor 1 or 0 for whether the plasma glucose value is below 3.9 or not.
I would then like to count the number of occurrences of runs of more than three 1's in a row and fewer than twenty-four 1's in a row.
Is there a function in R for this or what would be the easiest approach?
I'm not sure if I got your data structure right, but maybe the following code can still help.
I'm assuming a data structure that includes the measurement, a person id, and a measurement id.
library(dplyr)
# create dummy data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5,2), person_id = rep(1:10, each = 10), measure_id = rep(1:10, 10))
data_test$below_criterion = 0 # indicator for measures below crit-value
data_test$below_criterion[which(data_test$measure < 3.9)] = 1 # indicator for measures below crit-value
# indicator that shows whether the current measurement is the first one below crit_val in a possible series
# shift columns, to compare current value with previous one
data_test = data_test %>% group_by(person_id) %>% mutate(prev_below_crit = c(below_criterion[1], below_criterion[1:(n()-1)]))
data_test$start_of_run = 0 # create the indicator variable
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1 # if current value is below crit and previous value is above, this is the start of a series
data_test = data_test %>% group_by(person_id) %>% mutate(grouper = cumsum(start_of_run)) # helper-variable to group all the possible series within a person
data_test = data_test %>% select(measure, person_id, measure_id, below_criterion, grouper) # get rid of the previous created helper-variables
data_results = data_test %>% group_by(person_id, grouper) %>% summarise(count_below_crit = sum(below_criterion)) # count the length of each series by summing up all below_crit indicators within a person and series
data_results = data_results %>% group_by(person_id) %>% filter(count_below_crit >= 3 & count_below_crit <=24) %>% summarise(n()) # count all series within a desired length for each person
data_results
data.frame(data_test)
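As an alternative, base R's rle() computes run lengths directly, so the runs of consecutive below-threshold measurements can be counted per person more compactly (a sketch using the same dummy data and the 3-to-24 run-length window from above):
library(dplyr)
data_test %>%
  group_by(person_id) %>%
  summarise(n_runs = {
    r <- rle(below_criterion)   # lengths and values of consecutive runs of 0s and 1s
    sum(r$values == 1 & r$lengths >= 3 & r$lengths <= 24)
  })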

Extract top N count from data frame with column showing count while maintaining data frame structure

I have a data frame like the following:
c1 <- c(324, 213, 122, 34)
c2 <- c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD")
c3 <- c("G", "T", "U", "T")
df <- data.frame(count = c1, seq = c2, other = c3)
I want the top N sequences in a data frame. For example, for N = 600, I want the final data frame's count column to sum to 600, meaning that only the top 3 rows of this data frame would remain, and the count of the third row would become 600 - 324 - 213 = 63.
How can I get the output data frame like this?
I would really appreciate it if you could provide a general solution, as the data frame I am working with has over 1000 rows and smaller numbers.
Thanks!
A solution using dplyr. The idea is to arrange the data frame by count in descending order, keep rows until the cumulative count exceeds 600, and then update the count of the last kept row to be 600 minus the sum of the counts of the previous rows. df2 is the final output.
library(dplyr)
df2 <- df %>%
  arrange(desc(count)) %>%
  slice(1:which(cumsum(count) > 600)[1]) %>%
  mutate(count = ifelse(row_number() == n(),
                        600 - sum(count[1:(n() - 1)]),
                        count))
df2
#   count          seq other
# 1   324  SDOIHHFOEKN     G
# 2   213   SDIUFONBSD     T
# 3    63 DSLIHFEIHDFS     U
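Since the question asks for a general solution, the same logic can be wrapped in a small helper that takes the target total as an argument (a sketch; the function name top_counts is made up here, and it assumes the total of count exceeds the target):
library(dplyr)
top_counts <- function(data, target) {
  data %>%
    arrange(desc(count)) %>%
    slice(1:which(cumsum(count) > target)[1]) %>%    # keep rows until the running total passes target
    mutate(count = ifelse(row_number() == n(),
                          target - sum(count[1:(n() - 1)]),
                          count))
}
top_counts(df, 600)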
