Finding/comparing dates when sometimes only NA's are present - R

I'm both new to coding in R (always used SPSS but have to use R for a project) and this website, so bear with me. Hopefully I'm both able to explain the problem and what I've tried.
My data looks somewhat like this:
df <- data.frame(
ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
measurement = c(1, 2, 3, 1, 2, 3, 4, 1, 2),
date_event1 = c(NA, NA, "2021-02-15", NA, NA, NA, "2021-03-01", NA, NA),
date_event2 = c(NA, NA, NA, NA, "2021-03-06", NA, NA, "2022-02-02", "2022-02-02")
)
df
ID measurement date_event1 date_event2
1 1 <NA> <NA>
1 2 <NA> <NA>
1 3 2021-02-15 <NA>
2 1 <NA> <NA>
2 2 <NA> 2021-03-06
2 3 <NA> <NA>
2 4 2021-03-01 <NA>
3 1 <NA> 2022-02-02
3 2 <NA> 2022-02-02
I have patients (identified by ID) with a variable number of measurements (identified by measurement number and their date, so long-format data) and events (coded here as 'event1' and 'event2'). Events can be present for a particular patient and measurement (then coded with the date it occurred) or absent (then coded as NA).
Ultimately, my goal is to calculate intervals (in days) between the first occurrences of the two events, if both are present. If no or only one event took place, the result should be NA. Desired output should look something like this:
ID measurement date_event1 date_event2 interval
1 1 <NA> <NA> NA
1 2 <NA> <NA> NA
1 3 2021-02-15 <NA> NA
2 1 <NA> <NA> **5**
2 2 <NA> 2021-03-06 **5**
2 3 <NA> <NA> **5**
2 4 2021-03-01 <NA> 5
3 1 <NA> 2022-02-02 **NA**
3 2 <NA> 2022-02-02 NA
Main issues here are:
finding the first event with functions such as min() just returns NA when NA's are present;
using min(x, na.rm = TRUE) as a workaround doesn't work for IDs with only NA's: it warns and returns Inf, as the snippet below demonstrates.
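To illustrate both behaviours on a toy vector:
dates <- as.Date(c(NA, "2021-02-15"))
min(dates)                            # NA: the missing value propagates
min(dates, na.rm = TRUE)              # "2021-02-15"
min(as.Date(c(NA, NA)), na.rm = TRUE) # Warning: no non-missing arguments to min; returning Inf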
What I've tried:
df <- df %>%
group_by(ID) %>%
arrange(ID, measurement) %>%
# creating 2 identifier variables if all rows for event1/event2 are NA
mutate(allNA1 = ifelse(all(is.na(date_event1)), 1, 0)) %>%
mutate(allNA2 = ifelse(all(is.na(date_event2)), 1, 0)) %>%
ungroup()
# for simplicity, combining these two identifier variables into 1
df$test <- ifelse(df$allNA1 == 0 & df$allNA2 == 0, 1, NA)
# then using this combined identifier variable to only use mindate on IDs that have at least 1 event
df <- df %>%
group_by(ID) %>%
mutate(mindate = if_else(test == 1, min(date_event1, na.rm=TRUE), NA_real_)) %>%
ungroup()
I haven't gotten to the comparing-dates step as finding the first date still produces the 'no non-missing arguments to min; returning Inf' warnings, even though I'm only mutating if test==1. What am I missing here? Are there easier solutions to my main problem? Thank you in advance!
Edit: forgot to add that the FIRST event should be used. Changes highlighted in bold.
Edit 2: made an error in the example, changed dates and intervals, also removed a column for simplicity.
Edit 3: Harre suggested suppressing warnings. Using the following did not work:
options(warn = -1)
df <- df %>%
group_by(ID) %>%
mutate(interval = abs(min(date_event1, na.rm=TRUE) - min(date_event2, na.rm=TRUE))) %>%
ungroup()
options(warn = 1)
Edit 6/8/22:
Okay, so I managed to circumvent the problem:
df <- df %>%
group_by(ID) %>%
# first recoding all missing values to an extremely late date
mutate(date_event1 = if_else(is.na(date_event1), as.Date("2099-09-09"), date_event1)) %>%
# then finding the earliest date (which is a non-2099 date, if one was present)
mutate(min_date_event1 = min(date_event1)) %>%
# then recoding the 2099 dates back to NA
mutate(min_date_event1 = if_else(min_date_event1 == as.Date("2099-09-09"), as.Date(NA), min_date_event1)) %>%
ungroup()
Which feels really inefficient for something I could do with 1 function in SPSS (AGGREGATE > FIRST). I'll checkmark my question but if anyone has an easier solution, feel free to add suggestions!

Given that your "goal is to calculate intervals (in days) between two events, if two are present. If no or only 1 event took place, the result should be NA", there is no need for anything other than converting to the Date type:
library(dplyr)
df |>
mutate(interval = abs(as.Date(date_event2) - as.Date(date_event1)))
or
library(dplyr)
df |>
mutate(across(starts_with("date"), as.Date),
interval = abs(date_event2 - date_event1))
Output:
ID measurement date_measurement date_event1 date_event2 interval
1 1 1 2020-01-01 <NA> <NA> NA days
2 1 2 2020-01-05 <NA> <NA> NA days
3 1 3 2020-01-10 2021-02-15 <NA> NA days
4 2 1 2021-02-01 <NA> 2021-03-01 NA days
5 2 2 2021-02-15 <NA> 2021-03-05 NA days
6 2 3 2021-03-01 <NA> <NA> NA days
7 2 4 2021-04-01 2021-03-01 2021-03-06 5 days
8 3 1 2022-01-01 <NA> 2022-02-02 NA days
9 3 2 2022-03-01 <NA> <NA> NA days
Update:
In this case you'll just want to suppress the warning, as it doesn't influence the result (shown below). Alternatively, you could arrange by the date and pick the first value using first(), no matter whether it's NA or not; a sketch of that alternative follows the output below.
df |>
group_by(ID) |>
mutate(across(starts_with("date"), as.Date),
across(starts_with("date_event"), ~ suppressWarnings(min(., na.rm = TRUE)), .names = "{.col}_min"),
interval = na_if(abs(date_event2_min - date_event1_min), Inf)) |>
ungroup()
Output:
# A tibble: 9 × 7
ID measurement date_event1 date_event2 date_event1_min date_event2_min interval
<dbl> <dbl> <date> <date> <date> <date> <drtn>
1 1 1 NA NA 2021-02-15 NA NA days
2 1 2 NA NA 2021-02-15 NA NA days
3 1 3 2021-02-15 NA 2021-02-15 NA NA days
4 2 1 NA NA 2021-03-01 2021-03-06 5 days
5 2 2 NA 2021-03-06 2021-03-01 2021-03-06 5 days
6 2 3 NA NA 2021-03-01 2021-03-06 5 days
7 2 4 2021-03-01 NA 2021-03-01 2021-03-06 5 days
8 3 1 NA 2022-02-02 NA 2022-02-02 NA days
9 3 2 NA 2022-02-02 NA 2022-02-02 NA days
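For completeness, here is a minimal sketch of that arrange/first() idea (an illustration, not Harre's original code): sort() drops NAs, and first() on an empty vector falls back to NA, so all-NA groups come out as NA without any warning.
library(dplyr)
df |>
  mutate(across(starts_with("date"), as.Date)) |>
  group_by(ID) |>
  mutate(first_event1 = first(sort(date_event1)),  # earliest non-NA date, or NA if none
         first_event2 = first(sort(date_event2)),
         interval = abs(first_event2 - first_event1)) |>
  ungroup()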

Related

Fill missing values of dates using the number of the weekday

I have a dataset in long format. Every subject in the dataset was observed five times during the week. I have a column with the number of the day of the week in which the observation was supposed to happen/happened and another column with the actual dates of the observations. The latter column has some missing values. I would like to use the information on the first column to fill the missing values in the second column. Here is a toy dataset:
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))
df
# case day date
# 1 1 2023-01-02
# 1 2 2023-01-03
# 1 3 <NA>
# 1 4 <NA>
# 1 5 2023-01-06
# 2 1 <NA>
# 2 2 2021-05-11
# 2 3 2021-05-12
# 2 4 2021-05-13
# 2 5 <NA>
And here is the desired output:
# case day date
#1 1 1 2023-01-02
#2 1 2 2023-01-03
#3 1 3 2023-01-04
#4 1 4 2023-01-05
#5 1 5 2023-01-06
#6 2 1 2021-05-10
#7 2 2 2021-05-11
#8 2 3 2021-05-12
#9 2 4 2021-05-13
#10 2 5 2021-05-14
Does this work for you? No linear models are used.
library(tidyverse)
df2 <-
df %>%
mutate(
ref_date = case_when(
case == 1 ~ as.Date("2023-01-01"),
case == 2 ~ as.Date("2021-05-09")
),
date2 = as.Date(day, origin = ref_date)
)
Output:
> df2
case day date ref_date date2
1 1 1 2023-01-02 2023-01-01 2023-01-02
2 1 2 2023-01-03 2023-01-01 2023-01-03
3 1 3 <NA> 2023-01-01 2023-01-04
4 1 4 <NA> 2023-01-01 2023-01-05
5 1 5 2023-01-06 2023-01-01 2023-01-06
6 2 1 <NA> 2021-05-09 2021-05-10
7 2 2 2021-05-11 2021-05-09 2021-05-11
8 2 3 2021-05-12 2021-05-09 2021-05-12
9 2 4 2021-05-13 2021-05-09 2021-05-13
10 2 5 <NA> 2021-05-09 2021-05-14
I concede that G.G.'s answer has the advantage that you don't need to hardcode the reference date.
P.S. here is a pure tidyverse solution without any hardcoding:
df2 <-
df %>%
mutate(ref_date = date - day) %>%
group_by(case) %>%
fill(ref_date, .direction = "downup") %>%
ungroup() %>%
mutate(date2 = as.Date(day, origin = ref_date))
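A quick sanity check of the reconstruction (using df2 from the pipeline above): every observed date must equal its rebuilt counterpart.
with(df2, all(date2[!is.na(date)] == date[!is.na(date)]))
#> [1] TRUE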
1) Convert case to factor and then use predict with lm to fill in the NA's. No packages are used.
within(df, {
case <- factor(case)
date <- .Date(predict(lm(date ~ case/day), data.frame(case, date)))
})
giving
case day date
1 1 1 2023-01-02
2 1 2 2023-01-03
3 1 3 2023-01-04
4 1 4 2023-01-05
5 1 5 2023-01-06
6 2 1 2021-05-10
7 2 2 2021-05-11
8 2 3 2021-05-12
9 2 4 2021-05-13
10 2 5 2021-05-14
2) Find the mean day and date and then use day to appropriately offset each row.
library(dplyr) # version 1.1.0 or later
df %>%
mutate(date = {
Mean <- Map(mean, na.omit(pick(date, day)))
Mean$date + day - Mean$day
}, .by = case)
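Why the mean trick works: within a case, date = origin + day for every observed row, so averaging over the complete pairs gives mean(date) - mean(day) = origin. A quick check with the observed rows of case 1:
d   <- as.Date(c("2023-01-02", "2023-01-03", "2023-01-06"))  # observed dates, case 1
day <- c(1, 2, 5)                                            # their day numbers
mean(d) - mean(day)
#> [1] "2023-01-01"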

R - Find x days from start date while keeping dates in between

I am trying to find the first date of each category, then subtract 5 days, AND I want to keep the days in between! This is where I am struggling. I tried seq(), but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
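The error occurs because seq()'s from argument must be a single value; piping a whole column of dates into it fails, while one start date at a time works:
seq(as.Date(c("2020-06-08", "2021-07-13")), by = "day", length.out = 6)
#> Error in seq.Date(...) : 'from' must be of length 1
seq(as.Date("2020-06-08") - 5, by = "day", length.out = 6)
#> [1] "2020-06-03" "2020-06-04" "2020-06-05" "2020-06-06" "2020-06-07" "2020-06-08"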
Any help would be greatly appreciated!
library("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates but in a column format! I'm also trying to input NA in the value column if the value doesn't already exist. I want to keep all possible columns, as the dataframe I need to use this on has about 20 columns.
Dates: "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2021-07-08", "2021-07-09", "2021-07-10", "2021-07-11", "2021-07-12", "2021-07-13", "2021-08-11", "2021-08-12", "2021-08-13", "2021-08-14", "2021-08-15", "2021-08-16", "2021-09-14", "2021-09-15", "2021-09-16", "2021-09-17", "2021-09-18", "2021-09-19"
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach but kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these rows)
library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with the data.table generated similarly as above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value=NA in the second input to rbind, you can also just leave it out, and add fill=TRUE within the rbind()
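For instance, the rbind() might then look like this (a sketch of that variant; fill = TRUE pads the missing value column with NA):
rbind(
  setDT(data)[order(date), .SD[1], category],
  data[, .(date = seq(min(date) - 5, by = "day", length.out = 5)), category],
  fill = TRUE
)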
A dplyr version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
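Putting the pieces together for the join-based data.table pipeline, the filled version reads (a sketch combining the snippets above):
library(data.table)
setDT(data)[
  data[, .(date = seq(min(date) - 5, by = "day", length.out = 6)), category],
  on = .(category, date)
][, value := max(value, na.rm = TRUE), category][]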
@Jon Spring's answer inspired this alternative approach:
Here we first get the first day minus 5, as already presented in the question. Then we use bind_rows() as Jon Spring does in his answer. The next step is to identify the original first dates within the date column (we use !duplicated() within filter()). The last main step is to use coalesce():
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10

R: fill in cells with values from different rows

I’m trying to fill NAs in a row with values from a different row. These rows are “linked” by a case number. I want to write an if loop that goes through the entire data frame and does this. But I think I don’t grasp the R language well enough. Can anybody help me?
The data frame:
CASE <- c(1, 2, 3, 4, 5, 6)
SERIAL <-c("AB",NA, NA, "CD", NA, NA)
REF <- c(NA, 1, 1, NA, 4, 4)
PA <- c(4, NA, NA, 2, NA, NA)
PE <- c(NA, 2, NA, NA, 1, NA)
PE2 <- c(NA, NA, 3, NA, NA, 3)
df <- data.frame(CASE, SERIAL, REF, PA, PE, PE2)
CASE SERIAL REF PA PE PE2
1 AB NA 4 NA NA
2 <NA> 1 NA 2 NA
3 <NA> 1 NA NA 3
4 CD NA 2 NA NA
5 <NA> 4 NA 1 NA
6 <NA> 4 NA NA 3
In the row CASE = 1, I want to fill in the empty PE and PE2 with the values from the rows below, which reference the line (by REF = 1). In the line CASE = 4, I want to fill in the empty PE and PE2 with the values from the rows below, which reference the line (by REF = 4). The lines with no serial number only serve to fill lines 1 and 4, so to speak. There is no way to collect the data directly into the corresponding lines. I tried this for loop, but I don't know how to reference the values correctly:
for (i in 1:dim(df)[1]{
if (data$SERIAL[i]==NA){
[data$CASE[data$REF[i]],PE] <- data$PE[i]
[data$CASE[data$REF[i]],PE2] <- data$PE2[i]}
}
)
Expected output:
CASE SERIAL REF PA PE PE2
1 1 AB NA 4 2 3
2 2 <NA> 1 NA 2 NA
3 3 <NA> 1 NA NA 3
4 4 CD NA 2 1 3
5 5 <NA> 4 NA 1 NA
6 6 <NA> 4 NA NA 3
This is a dplyr solution rather than a loop, but perhaps it will work:
df %>%
mutate(REF = ifelse(is.na(REF), CASE, REF)) %>%
group_by(REF) %>%
summarise(SERIAL = first(SERIAL),
across(c(PA, PE, PE2), ~sum(.x, na.rm=TRUE))) %>%
rename("CASE" = "REF")
# # A tibble: 2 x 5
# CASE SERIAL PA PE PE2
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 AB 4 2 3
# 2 4 CD 2 1 3
withSerial = subset(df, !is.na(SERIAL))
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 NA NA
#4 4 CD NA 2 NA NA
noSerialwithRef = subset(df, is.na(SERIAL) & !is.na(REF))
noSerialwithRef
# CASE SERIAL REF PA PE PE2
#2 2 <NA> 1 NA 2 NA
#3 3 <NA> 1 NA NA 3
#5 5 <NA> 4 NA 1 NA
#6 6 <NA> 4 NA NA 3
withSerial$PE = subset(noSerialwithRef, !is.na(PE))$PE
withSerial$PE2 = subset(noSerialwithRef, !is.na(PE2))$PE2
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 2 3
#4 4 CD NA 2 1 3
Update: Added library(tidyr) thanks to Martin Gal and added alternative code suggested by Martin Gal:
Here is another dplyr way:
fill SERIAL
use lead() on the grouped columns
keep only the first row of each group with slice(1)
library(dplyr)
library(tidyr)
df %>%
fill(SERIAL, .direction = "down") %>%
group_by(SERIAL) %>%
mutate(PE = lead(PE),
PE2 = lead(PE2,2)) %>%
slice(1)
# Alternative and better (suggested by Martin Gal):
df %>% fill(-c(CASE, SERIAL), .direction = "up") %>% drop_na()
CASE SERIAL REF PA PE PE2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 AB NA 4 2 3
2 4 CD NA 2 1 3

Index multiple vectors into table in R

I have three vectors:
position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")
I want to create a table (starting from the smallest position) as shown below. I want to account for the number of occurrences of each my_string_allele value at each position and have their corresponding position_ref values in the position_ref columns. What would be the simplest way to do this?
position  T>A  position_ref  G>C  position_ref  C>A  position_ref  A>G  position_ref  G>T  position_ref
6                                                                  1    2222
13        2    12006, 1108                                                            1    4434
20                                               1    1970
24                           1    13807                           1    9030
Here is a spread() method which reshapes the data to the wide format, with mutate_all() counting the number of occurrences.
Data
library(tidyverse)
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
Code
df %>% group_by(position, my_string_allele) %>%
mutate(position_ref = paste(position_ref, collapse = ", ")) %>%
distinct() %>%
spread(my_string_allele, position_ref) %>%
mutate_all(funs(N = if_else(is.na(.), NA_integer_, lengths(str_split(., ", ")))))
Output
position `A>G` `C>A` `G>C` `G>T` `T>A` `A>G_N` `C>A_N` `G>C_N` `G>T_N` `T>A_N`
<dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
1 6 2222 NA NA NA NA 1 NA NA NA NA
2 13 NA NA NA 4434 12006, 1108 NA NA NA 1 2
3 20 NA 1970 NA NA NA NA 1 NA NA NA
4 24 9030 NA 13807 NA NA 1 NA 1 NA NA
(You can sort the columns by their column names to get the output you show in the question.)
Full disclosure: I am adapting part of @DarrenTsai's answer with data.table to provide the number of occurrences as well (since it is missing from his answer). Using data.table:
library(data.table)
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
setDT(df)
df[, `:=`(position_ref = paste(.N, paste(position_ref, collapse = ", "))),
by = c("position", "my_string_allele")] %>%
unique(., by = c("position", "my_string_allele", "position_ref")) %>%
dcast(position ~ my_string_allele, value.var = "position_ref")
Result:
position A>G C>A G>C G>T T>A
1: 6 1 2222 <NA> <NA> <NA> <NA>
2: 13 <NA> <NA> <NA> 1 4434 2 12006, 1108
3: 20 <NA> 1 1970 <NA> <NA> <NA>
4: 24 1 9030 <NA> 1 13807 <NA> <NA>
With dplyr (largely based on @DarrenTsai's answer; please upvote his as well):
library(dplyr)
df %>% group_by(position, my_string_allele) %>%
mutate(position_ref = paste(n(), paste(position_ref, collapse = ", "))) %>%
distinct() %>%
tidyr::spread(my_string_allele, position_ref)
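This yields the same wide table as the data.table result above, just as a tibble. To reproduce the column order shown in the question, you can reorder explicitly (a sketch; res stands for the result of the pipeline above):
res %>% select(position, `T>A`, `G>C`, `C>A`, `A>G`, `G>T`)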

How do I output the max value within a range of rows in a data frame?

Suppose I have the following data and data frame:
sample_data <- c(1:14)
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- as.data.frame(sample_data)
sample_df$sample_data2 <- sample_data2
When I print this data frame, the results are as follows:
sample_data sample_data2
1 1 <NA>
2 2 <NA>
3 3 <NA>
4 4 break
5 5 <NA>
6 6 <NA>
7 7 break
8 8 <NA>
9 9 <NA>
10 10 <NA>
11 11 <NA>
12 12 <NA>
13 13 <NA>
14 14 break
How would I program it so that at every "break", it outputs the max from that row up? For instance, I would want the code to output the set of (4,7,14). Additionally, I would want it so that it only finds the max value between up to the next "break" interval.
I apologize in advance if I used any incorrect nomenclature.
I construct the groups by looking for the word "break" and then shift the result by one row, so that each "break" row stays with its group. Then some dplyr commands get the max of every group.
library(dplyr)
sample_df_new <- sample_df %>%
  # cumsum() numbers the blocks that start after each "break"; prepending 1 and
  # truncating to the original length shifts the numbering by one row, so each
  # "break" row stays grouped with the rows above it
  mutate(group = c(1, cumsum(grepl("break", sample_data2)) + 1)[1:length(sample_data2)]) %>%
  group_by(group) %>%
  summarise(group_max = max(sample_data))
> sample_df_new
# A tibble: 3 x 2
group group_max
<dbl> <dbl>
1 1 4
2 2 7
3 3 14
I have an answer using data.table:
library(data.table)
sample_df <- setDT(sample_df)
sample_df[,group := (rleid(sample_data2)-0.5)%/%2]
sample_df[,.(maxvalues = max(sample_data)),by = group]
group maxvalues
1: 0 4
2: 1 7
3: 2 14
The tricky part is (rleid(sample_data2) - 0.5) %/% 2: rleid() creates an index that increases at each change of value:
sample_data sample_data2 rleid
1: 1 NA 1
2: 2 NA 1
3: 3 NA 1
4: 4 break 2
5: 5 NA 3
6: 6 NA 3
7: 7 break 4
8: 8 NA 5
9: 9 NA 5
10: 10 NA 5
11: 11 NA 5
12: 12 NA 5
13: 13 NA 5
14: 14 break 6
If you subtract 0.5 from that index and take the integer division by 2, you get a constant index for the rows you want, which you can use for the grouping operation:
sample_data sample_data2 group
1: 1 NA 0
2: 2 NA 0
3: 3 NA 0
4: 4 break 0
5: 5 NA 1
6: 6 NA 1
7: 7 break 1
8: 8 NA 2
9: 9 NA 2
10: 10 NA 2
11: 11 NA 2
12: 12 NA 2
13: 13 NA 2
14: 14 break 2
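Running the arithmetic on the six distinct rleid values makes the pairing visible:
r <- 1:6        # the distinct rleid values from the table above
(r - 0.5) %/% 2
#> [1] 0 0 1 1 2 2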
Then it is just taking the maximum for each group. You can easily translate it into dplyr if it is easier for you
Here are 2 ways with base R. The trick is to define a grouping variable, grp.
# TRUE at each "break" row
grp <- !is.na(sample_df$sample_data2) & sample_df$sample_data2 == "break"
# count breaks from the bottom so each block is numbered together with its break row
grp <- rev(cumsum(rev(grp)))
# flip the numbering so groups increase from top to bottom
grp <- -1 * grp + max(grp)
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
aggregate(sample_data ~ grp, sample_df, max, na.rm = TRUE)
Data.
This is simplified data creation code.
sample_data <- 1:14
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- data.frame(sample_data, sample_data2)
Looks like there are lots of different ways of doing this. This is how I went about it:
rows <- which(sample_data2 == "break") #Get the row indices for where "break" appears
findmax <- function(maxrow) {
max(sample_data[1:maxrow])
} #Create a function that returns the max "up to" a given row
sapply(rows, findmax) #apply it for each of your rows
### [1] 4 7 14
Note that this works "up to" the given row. To get the maximum value between the two breaks would probably be easier with one of the other solutions, but you could also do it by looking at the j-1 row to jth row from the rows object.
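A sketch of that between-breaks variant, pairing each entry of rows with the row just after the previous break:
starts <- c(1, head(rows, -1) + 1)                        # first row of each block
mapply(function(i, j) max(sample_data[i:j]), starts, rows)
### [1]  4  7 14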
Depending on whether you want to assess the maximum "sample_data" value between all "sample_data2" == break rows including (e.g. row 1 to row 4) or excluding (e.g. row 1 to row 3) the given "sample_data2" == break row, you can do something like this with the tidyverse:
Excluding the break rows:
library(dplyr)
library(tidyr)
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
filter(is.na(sample_data2)) %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 3.
2 2 6.
3 3 13.
Including the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 4.
2 2 7.
3 3 14.
Both code blocks create an ID variable called "temp" using gl() for the "sample_data2" == break rows and then fill up the NA rows with that ID. The first code block then filters out the "sample_data2" == break rows and assesses the maximum "sample_data" value per group, while the second assesses the maximum "sample_data" value per group including the "sample_data2" == break rows.
