Are all dates in order - r

I am trying to apply diff() to a series of columns containing dates. I want the differences between consecutive dates (date1 and date2, date2 and date3, etc.).
I am interested in:
the actual difference between the dates (in days)
whether all dates of a row are in order (diff >= 0, by row)
I can use diff() on a single series of dates (e.g. on the first row: diff(unlist(df[1,])) ). I just need to apply this per row, I guess using apply(), but for some reason I can't work it out. Some dates are missing, which is allowed in my study.
Hopefully this is very easy for you guys...
df <- structure(list(date1 = structure(c(-10871, -13634, -15937, -15937,
-290, -2323), class = "Date"), date2 = structure(c(16678, NA,16037, 16659,
16538, 16626), class = "Date"), date3 = structure(c(16685,16688, NA, 16659,
16568, 16672), class = "Date"), date4 = structure(c(16701, 16695, 16670,
16661, 16582, 16672), class = "Date"), date5 = structure(c(16709, 16695,
16661, 16667, 16619, 16692), class = "Date")), .Names = c("date1","date2",
"date3", "date4", "date5"), row.names = c("2", "3", "4", "5", "6", "7"),
class = "data.frame")
df

You can try something like this:
apply(df, 1, function(x) identical(sort(as.Date(x)), as.Date(x[!is.na(x)])))
This produces the following output, which says whether each row's dates are in sorted order (missing dates are ignored):
2 3 4 5 6 7
TRUE TRUE FALSE TRUE TRUE TRUE
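As a complement, here is a minimal base R sketch that returns both things the question asks for: the day differences per row and an in-order flag. The data frame dates and its columns d1, d2, d3 are made up for illustration; it is not the df from the question.

```r
# Hypothetical example data (not the asker's df): three date columns, one NA.
dates <- data.frame(
  d1 = as.Date(c("2015-01-01", "2015-01-10", "2015-03-01")),
  d2 = as.Date(c("2015-01-05", NA,           "2015-02-01")),
  d3 = as.Date(c("2015-01-09", "2015-01-20", "2015-04-01"))
)

# Convert to a numeric matrix (days since 1970-01-01) first, so that
# apply() does not coerce the dates to character.
m <- sapply(dates, as.numeric)

# Row-wise differences in days between consecutive columns;
# a difference is NA wherever one of the two dates is missing.
day_diffs <- t(apply(m, 1, diff))

# A row counts as "in order" when its non-missing dates are non-decreasing.
in_order <- apply(m, 1, function(x) !is.unsorted(x[!is.na(x)]))

day_diffs
in_order
```

Dropping the NAs before the order check mirrors the identical(sort(...)) answer above; whether a missing date should instead make the row "unknown" is a design choice for the study.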

This will be simpler and quicker to process in long form I reckon:
dflong <- transform(
stack(lapply(df, as.numeric)),
date = as.Date(values,origin="1970-01-01"),
group = seq_len(nrow(df)),
ind = NULL,
values = NULL
)
dflong <- dflong[order(dflong$group),]
dflong$daysdiff <- with(dflong,
ave(as.numeric(date), group, FUN=function(x) c(NA,diff(x)) )
)
# date group daysdiff
#1 1940-03-28 1 NA
#7 2015-08-31 1 27549
#13 2015-09-07 1 7
#19 2015-09-23 1 16
#25 2015-10-01 1 8
#2 1932-09-03 2 NA
#8 <NA> 2 NA
#14 2015-09-10 2 NA
aggregate(daysdiff ~ group, data=dflong, function(x) any(x < 0, na.rm=TRUE) )
# group daysdiff
#1 1 FALSE
#2 2 FALSE
#3 3 TRUE
#4 4 FALSE
#5 5 FALSE
#6 6 FALSE

Related

How to find the earliest date across multiple columns in R (Issue with NAs)

I have 3 date columns (class-date) and I want to create a new column that will have the earliest of the 3 dates. This is the code I used below:
df1 <- df %>% mutate(timeout= pmin(date1, date2, end_date))
In the case that date1 and date2 are NAs, then I would like the date in end_date to be returned in the timeout column and therefore timeout should not have any NAs. The code above is bringing back NAs. Any assistance will be greatly appreciated.
You can add na.rm = TRUE, then it will ignore the NAs in each row when calculating pmin.
library(dplyr)
df %>%
mutate(timeout = pmin(date1, date2, end_date, na.rm = TRUE))
Output
id date1 date2 end_date timeout
1 1 <NA> <NA> 2008-01-23 2008-01-23
2 1 2007-10-16 2007-11-01 2008-01-23 2007-10-16
3 2 2007-11-30 2007-11-30 2007-11-30 2007-11-30
4 3 2007-08-17 2007-12-17 2008-12-12 2007-08-17
5 3 2008-11-12 2008-12-12 2008-12-12 2008-11-12
Data
df <- structure(list(id = c(1L, 1L, 2L, 3L, 3L), date1 = structure(c(NA,
13802, 13847, 13742, 14195), class = "Date"), date2 = structure(c(NA,
13818, 13847, 13864, 14225), class = "Date"), end_date = c("2008-01-23",
"2008-01-23", "2007-11-30", "2008-12-12", "2008-12-12")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
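To see the effect of na.rm in isolation, here is a small base R sketch. The vectors d1, d2 and end are made-up stand-ins for the three date columns, all of class Date.

```r
# Hypothetical stand-ins for date1, date2 and end_date.
d1  <- as.Date(c(NA, "2007-10-16"))
d2  <- as.Date(c(NA, "2007-11-01"))
end <- as.Date(c("2008-01-23", "2008-01-23"))

# Default behaviour: any NA in a "row" makes the result NA.
without_na_rm <- pmin(d1, d2, end)

# With na.rm = TRUE the NAs are skipped, so the first element falls back to 'end'.
with_na_rm <- pmin(d1, d2, end, na.rm = TRUE)

without_na_rm
with_na_rm
```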

Simple but not easy merge task

I have two incomplete data frames (df_a, df_b): some columns are missing or contain NA values. "by" is the merge key, and df_a has "priority" over df_b.
df_a = structure(list(Datum = structure(c(1635163200, 1635166800, 1635170400, 1635174000), class = c("POSIXct", "POSIXt")), Vorhersage = c(10.297922, 10.155121, 10.044135, 9.699513), Export = c("10.912", "10.47", NA, NA), color = c("rgb(0,128,0)", "rgb(0,128,0)", NA, NA), Status = c("ok", "ok", NA, NA), Plausibilität = c("4", "4", NA, NA), min = c(7.93000000000001, 9.4, 8.7, 8.3), max = c(12.31715325, 12.42822725, 12.51326325, 12.28620625)), row.names = c(NA, -4L), class = "data.frame")
df_b = structure(list(Datum = structure(c(1632510000, 1632513600, 1632517200, 1632520800), class = c("POSIXct", "POSIXt")), Vorhersage = c(14.821988, 14.832919, 14.706179, 14.573527), Referenz = c(16.6, 16.2, 15.9, 16), DWD_Name = c("Elpersbüttel", "Elpersbüttel", "Elpersbüttel", "Elpersbüttel"), Export = c(17.198, 16.713, 16.378, 16.358), color = c("rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)"), Status = c("ok", "ok", "ok", "ok"), Plausibilität = c(4, 4, 4, 4), min = c(13.05, 12.808, 11.631891, 12.312), max = c(17, 17, 16.9, 16.7)), row.names = c(NA, -4L), class = "data.frame")
The desired output is:
Datum Vorhersage Export color Status Plausibilität min max Referenz
1 2021-10-25 14:00:00 10.3 10.912 rgb(0,128,0) ok 4 7.9 12 NA
2 2021-10-25 15:00:00 10.2 10.47 rgb(0,128,0) ok 4 9.4 12 NA
3 2021-10-25 16:00:00 10.0 <NA> <NA> <NA> <NA> 8.7 13 NA
4 2021-10-25 17:00:00 9.7 <NA> <NA> <NA> <NA> 8.3 12 NA
5 2021-09-24 21:00:00 14.8 17.198 rgb(0,128,0) ok 4 13.1 17 17
6 2021-09-24 22:00:00 14.8 16.713 rgb(0,128,0) ok 4 12.8 17 16
7 2021-09-24 23:00:00 14.7 16.378 rgb(0,128,0) ok 4 11.6 17 16
8 2021-09-25 00:00:00 14.6 16.358 rgb(0,128,0) ok 4 12.3 17 16
DWD_Name
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 Elpersbüttel
6 Elpersbüttel
7 Elpersbüttel
8 Elpersbüttel
# for rebuild:
structure(list(Datum = structure(c(1635163200, 1635166800, 1635170400,
1635174000, 1632510000, 1632513600, 1632517200, 1632520800), class = c("POSIXct",
"POSIXt")), Vorhersage = c(10.297922, 10.155121, 10.044135, 9.699513,
14.821988, 14.832919, 14.706179, 14.573527), Export = c("10.912",
"10.47", NA, NA, "17.198", "16.713", "16.378", "16.358"), color = c("rgb(0,128,0)",
"rgb(0,128,0)", NA, NA, "rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)",
"rgb(0,128,0)"), Status = c("ok", "ok", NA, NA, "ok", "ok", "ok",
"ok"), Plausibilität = c("4", "4", NA, NA, "4", "4", "4", "4"
), min = c(7.93000000000001, 9.4, 8.7, 8.3, 13.05, 12.808, 11.631891,
12.312), max = c(12.31715325, 12.42822725, 12.51326325, 12.28620625,
17, 17, 16.9, 16.7), Referenz = c(NA, NA, NA, NA, 16.6, 16.2,
15.9, 16), DWD_Name = c(NA, NA, NA, NA, "Elpersbüttel", "Elpersbüttel",
"Elpersbüttel", "Elpersbüttel")), row.names = c(NA, -8L), class = "data.frame")
Thanks to the help of @r2evans, I tried the following:
by = "Datum"
library(data.table)
colnms <- setdiff(intersect(names(df_a), names(df_b)), by)
setDT(df_a)
setDT(df_b)
merge(df_a, df_b, by = by, all = TRUE
)[, (colnms) := lapply(colnms, function(nm) fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]))
][, c(outer(colnms, c(".x", ".y"), paste0)) := NULL ][]
but I get the following error:
Error in fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]) :
Item 2 is type double but the first item is type character. Please coerce
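The error message says that fcoalesce() was handed a character column and a double column: Export and Plausibilität are character in df_a but numeric in df_b. One way around it, sketched here in base R on a made-up miniature of the two tables (a, b and their values are assumptions, not the real data), is to bring the shared columns to a common type before coalescing:

```r
# Hypothetical miniature of the situation: 'Export' is character in one
# table and numeric in the other.
a <- data.frame(Datum = 1:2, Export = c("10.912", NA), stringsAsFactors = FALSE)
b <- data.frame(Datum = 2:3, Export = c(17.198, 16.713))

key    <- "Datum"
shared <- setdiff(intersect(names(a), names(b)), key)

# Where the classes differ, fall back to character in both tables.
for (nm in shared) {
  if (!identical(class(a[[nm]]), class(b[[nm]]))) {
    a[[nm]] <- as.character(a[[nm]])
    b[[nm]] <- as.character(b[[nm]])
  }
}

# Full join, then coalesce each .x/.y pair with priority for 'a' (.x).
m <- merge(a, b, by = key, all = TRUE)
for (nm in shared) {
  x <- m[[paste0(nm, ".x")]]
  y <- m[[paste0(nm, ".y")]]
  m[[nm]] <- ifelse(is.na(x), y, x)
  m[[paste0(nm, ".x")]] <- NULL
  m[[paste0(nm, ".y")]] <- NULL
}
m
```

The same coercion step applied to df_a and df_b before the merge should also let the fcoalesce() version run, since both items would then have the same type. Converting to character is the conservative choice here; converting the character column to numeric instead would work only if all its values are valid numbers.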
Most of the other answers are good, but many either over-complicate the result (in my opinion) or they perform a left or right join, not the full join as expected in the OP.
Here's a quick solution that uses dynamic column names.
library(data.table)
colnms <- setdiff(intersect(names(df_a), names(df_b)), "by")
colnms
# [1] "a"
setDT(df_a)
setDT(df_b)
merge(df_a, df_b, by = "by", all = TRUE
)[, (colnms) := lapply(colnms, function(nm) fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]))
][, c(outer(colnms, c(".x", ".y"), paste0)) := NULL ][]
# by b c a
# <num> <num> <num> <num>
# 1: 1 1 NA 1
# 2: 2 NA 2 2
# 3: 3 3 3 3
# 4: 4 NA 4 4
Notes:
the normal data.table::[ merge is a left-join only, so we need to use data.table::merge in order to be able to get a full-join with all=TRUE;
because it's using merge, the repeated columns get the .x and .y suffixes, something we can easily capitalize on;
the canonical and most-performant way when using (colnms) := ... is to also include .SDcols=colnms, but that won't work as well here since we need the suffixed columns, not the colnms columns themselves; this is a slight performance penalty but certainly not an anti-pattern (I believe) given what we need to do; and since we could have more than one duplicate column, we have to be careful to do it with each pair at a time, not all of them at once;
the last [-block (using outer) is for removing the duplicate columns; without it, the output would have column names c("by", "a.x", "b", "a.y", "c", "a"). It uses outer because that's a straight-forward way to get 1-or-more colnms and combine .x and .y to each of them; it then uses data.table's := NULL shortcut for removing one-or-more columns.
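As a quick illustration of the outer() step from the last note, this is what it produces for two shared column names (the names "a" and "b" are just an example):

```r
# Example shared column names (hypothetical).
colnms <- c("a", "b")

# outer() pastes every column name with every suffix; c() flattens
# the resulting matrix column by column into a plain character vector.
suffixed <- c(outer(colnms, c(".x", ".y"), paste0))
suffixed
# [1] "a.x" "b.x" "a.y" "b.y"
```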
This isn't the most elegant, but you can make a function that applies your rule to coalesce the values if they occur in both data frames.
# find the unique column names (not called "by")
cols <- union(names(df_a),names(df_b))
cols <- cols[!(cols == "by")]
# merge the data sets
df_merge <- merge(df_a, df_b, by = "by", all = TRUE)
# function to check for the base column names that now have a '.x' and
# a '.y' version. for the columns, fill in the NAs from '.x' with the
# value from '.y'
col_val <- function(col_base, df) {
x <- names(df)
if (all(paste0(col_base, c(".x", ".y")) %in% x)) {
na.x <- is.na(df[[paste0(col_base, ".x")]])
df[[paste0(col_base, ".x")]][na.x] <- df[[paste0(col_base, ".y")]][na.x]
df[[paste0(col_base, ".x")]]
} else {
df[[col_base]]
}
}
# apply this function to every column
cbind(df_merge["by"], sapply(cols, col_val, df = df_merge))
This will give the following result.
by a b c
1 1 1 1 NA
2 2 2 NA 2
3 3 3 3 3
4 4 4 NA 4
I know you specified base, but the natural_join() function is worth mentioning.
library(rqdatatable)
natural_join(df_a, df_b, by = "by", jointype = "FULL")
This gives exactly what you want.
by a b c
1 1 1 1 NA
2 2 2 NA 2
3 3 3 3 3
4 4 4 NA 4
Not a base R answer, but here is one possible solution with the data.table package:
library(data.table)
setDT(df_a)
setDT(df_b)
df_a <- rbind(df_a, list(4, NA, NA))
df_b <- rbind(list(1, NA, NA), df_b)
df_a[df_b, `:=` (a = fifelse(is.na(a), i.a, a), c = c), on = .(by)][]
#> by a b c
#> 1: 1 1 1 NA
#> 2: 2 2 NA 2
#> 3: 3 3 3 3
#> 4: 4 4 NA 4
Edit, with the help of @r2evans: a much more elegant and efficient solution:
df_a[df_b, `:=` (a = fcoalesce(a, i.a), c = c), on = .(by)][]
#> by a b c
#> 1: 1 1 1 NA
#> 2: 2 2 NA 2
#> 3: 3 3 3 3
#> 4: 4 4 NA 4
Created on 2021-10-19 by the reprex package (v2.0.1)
Here is a dynamic solution. It is not bad, but maybe someone knows how to speed it up.
get_complete_df<-function(df_a,df_b, by = "by"){
df_a = unique(df_a)
df_b = unique(df_b)
nam_a = names(df_a)[!(names(df_a) == by)]
nam_b = names(df_b)[!(names(df_b) == by)]
nums_a = unlist(lapply(df_a, is.numeric))
nums_b = unlist(lapply(df_b, is.numeric))
nums = unique(c(names(df_a)[nums_a], names(df_b)[nums_b])) # c() is needed: unique()'s second argument is 'incomparables'
## try to supplement NAs
x = df_b[[by]][df_b[[by]] %in% df_a[[by]]]
y = nam_b[nam_b %in% nam_a]
vna = is.na(df_a[df_a[,1] %in% x,y])
df_a[df_a[,1] %in% x ,y][vna] = df_b[df_b[,1] %in% x,y][vna]
## get complete df
all_names = c(nam_a,nam_b )
all_names = c(by, unique(all_names))
all_by = na.omit(unique(c(df_a[[by]],df_b[[by]]) ))
## build
df_o = as.data.frame(matrix(nrow = length(all_by),ncol = length(all_names)))
names(df_o) = all_names
df_o[[by]] = all_by
## fill in content
df_o[df_o[,1] %in% df_b[,1],names(df_b)] = df_b
df_o[df_o[,1] %in% df_a[,1],names(df_a)] = df_a ## df_a has priority!
# fix numeric:
# why did some(!) num fields changed to chr ?
df_o[,nums] = as.data.frame(apply(df_o[,nums], 2, as.numeric))
df_o
}

Find if a time exists between two different times, type issue

I have a dataframe as so
df <- structure(list(TIME = c("11:15:00", NA, "15:15:00", "12:00:00",
"18:40:00", "18:15:00", "7:10:00", "15:58:00", "10:00:00", "10:00:00"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
And I basically want to create a new variable which tells me if the time is in a certain group.
I wrote the following but it's not correct; I tried changing to as.POSIXct but no dice.
df <- df %>%
mutate(time_groups = ifelse(between(as.POSIXct(TIME),00:00, 5:59), 1,
ifelse(between(as.POSIXct(TIME),06:00, 8:59), 2,
ifelse(between(as.POSIXct(TIME),09:00,11:59), 3,
ifelse(between(as.POSIXct(TIME),12:00,14:59), 4,
ifelse(between(as.POSIXct(TIME),15:00,17:59), 5,
ifelse(between(as.POSIXct(TIME),18:00,23:59), 6,
), NA)
You could use the findInterval function:
library(tidyverse)
library(lubridate)
a <- c("00:00","5:59", "8:59", "11:59", "14:59", "17:59", "23:59")
b <- ymd_hm(paste(Sys.Date(), a))
df %>%
mutate(Interval = findInterval(ymd_hms(paste(Sys.Date(), TIME)), b))
TIME Interval
<chr> <int>
1 11:15:00 3
2 NA NA
3 15:15:00 5
4 12:00:00 4
5 18:40:00 6
6 18:15:00 6
7 7:10:00 2
8 15:58:00 5
9 10:00:00 3
10 10:00:00 3
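An alternative sketch in base R that avoids attaching today's date: convert each time to seconds since midnight and pass the lower bounds of the groups to findInterval(). The helper to_seconds() and the vector times are made up for this illustration; the breaks follow the groups in the question (starting at 00:00, 06:00, 09:00, 12:00, 15:00 and 18:00).

```r
# Example times (a subset of the question's data), including a missing value.
times <- c("11:15:00", NA, "15:15:00", "12:00:00", "18:40:00", "7:10:00")

# Hypothetical helper: "H:MM:SS" strings -> seconds since midnight (NA stays NA).
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3 || anyNA(p)) return(NA_real_)
    sum(as.numeric(p) * c(3600, 60, 1))
  }, numeric(1))
}

# Lower bound of each group, in hours, converted to seconds.
breaks <- c(0, 6, 9, 12, 15, 18) * 3600

# findInterval() returns the index of the group each time falls into.
time_groups <- findInterval(to_seconds(times), breaks)
time_groups
```

Using the lower bounds as breaks means a time such as 05:59:30 still lands in group 1, which a break at "5:59" would not guarantee.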

Calculating age over multiple dataframes based on name of dataframe

I was wondering if someone here can help me with a lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I am wanting to calculate the ages of people in each of the data sets based on the date the data was extracted - so in essence, the age would be the date difference between the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?
If I may suggest a slightly different approach: It might make more sense to compress your list into a single data frame before calculating the ages. Given your data looks something like this, i.e. it is a list of data frames, where the list element names are the dates of access:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
library(tidyverse)
library(lubridate)
tib <- bind_rows(tib_list, .id = "date_extracted") %>%
mutate(date_extracted = dmy(date_extracted),
DOB = dmy(DOB),
age_months = month(date_extracted) - month(DOB)
)
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
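Note that month(date_extracted) - month(DOB) only compares the month numbers, so it gives the expected result as long as both dates fall in the same year. A minimal base R sketch of a year-aware variant follows; the function name months_between() and the example dates are assumptions for illustration. It counts calendar-month boundaries and ignores the day of the month.

```r
# Hypothetical helper: whole calendar months from 'from' to 'to',
# taking the year into account (day of month is ignored).
months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  12 * (to$year - from$year) + (to$mon - from$mon)
}

# Example: an extraction date in January of the following year.
extracted <- as.Date("2020-01-01")
dob       <- as.Date(c("2019-10-01", "2019-02-01", "2020-01-01"))

age_months <- months_between(dob, extracted)
age_months
```

For these example dates a plain difference of month numbers would give -9, -1 and 0, whereas the year-aware version gives 3, 11 and 0.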
This can be solved with lapply as well but we can also use Map in this case to iterate over list and their names after adding all the dataframes in a list. In base R,
Map(function(x, y) {
# parse both dates explicitly as day-month-year; without a format,
# "01-07-2019" would be read as year 1, month 7, day 20
x$DOB <- as.Date(x$DOB, format = "%d-%m-%Y")
transform(x, age = as.integer(format(as.Date(y, format = "%d-%m-%Y"), "%m")) -
as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
# ID DOB age
#1 3 2019-07-01 2
#2 5 2019-06-01 3
#3 7 2019-05-01 4
#4 8 2019-09-01 0
#$`01-10-2019`
# ID DOB age
#1 2 2019-10-01 0
#2 5 2019-06-01 4
#3 8 2019-09-01 1
#4 9 2019-02-01 8
We can also do the same in tidyverse
library(dplyr)
library(lubridate)
# dmy() parses the list names and DOB explicitly as day-month-year
purrr::imap(list_df, ~ .x %>% mutate(age = month(dmy(.y)) - month(dmy(as.character(DOB)))))
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
It's bad practice to use dates and numbers as data frame names. Consider prefixing the date with an "x", as shown below in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = F)
df_list <- lapply(df_list, function(x){
x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
x$Age <- x$report_date - x$DOB
return(x)
}
)
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))

compute an average of the last two columns which differ for all subjects

I'm an R beginner and it's my first post here. I'm struggling with a problem and would love your advice. Basically, I have a dataset with 3 sets of columns that I need to combine in order to obtain the desired outcome: an average of the 2 most recent high-quality observations, where both observations must occur after a cutoff date (say, 3/15/2018). What makes it complex is that the relevant columns that go into the average differ from case to case.
The first set of columns holds the observations themselves. Each case has a different number of them, so subject 1 has 2 observations, subject 2 has 3, etc.
The second set of columns describes the data quality of each of these observations. For example, subject 1 has two good observations, subject 2 has bad data quality for the first observation and good data quality for the 2 later ones, and subject 3 has 3 observations of good quality and one observation (obs_3) of bad quality.
The third set of columns specifies the dates of the observations.
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date obs_3_date obs_4_date desired.average
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16 <NA> <NA> NA
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16 2018-04-10 <NA> 9.5
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18 2018-04-02 2018-04-10 12.0
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08 2018-03-10 2018-03-15 NA
In order to compute an average of TWO latest observations that are of good data quality:
I must first decide which observations are of good quality,
Then, compute an average (and it has to be an average of 2 observations) that occur after 3/15 and they must be the two most recent observations.
Below is my sample dataset. I've tried to do this manually in Excel and it was really painstaking. I'm hoping to do this in R and would very much appreciate your feedback. Thank you!
Here is my sample dataset:
> dput(head(df,5))
structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA,
NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA,
4L), class = "data.frame")
This should also work, and though a bit verbose it doesn't rely on column indices, so should be robust:
library(dplyr)
library(tidyr)
num_date <- as.numeric(as.Date("2018-03-15"))
df <- df[,-ncol(df)]
df_join <- df %>%
gather(Obs, value, 2:ncol(df)) %>%
mutate(
nr = as.numeric(gsub("[^\\d]", "", Obs, perl = TRUE))
) %>%
group_by(subject_id, nr) %>%
filter(!(is.na(value) | (grepl("_dq", Obs) & value == 0) | any(value[grepl("_date", Obs)] <= num_date))) %>%
ungroup() %>%
group_by(subject_id, Obs) %>%
filter(!row_number() < (max(row_number() - 1))) %>%
ungroup() %>%
group_by(subject_id) %>%
mutate(
desired.average = mean(value[grepl("_date|_dq", Obs) == FALSE], na.rm = TRUE)
) %>%
filter(!max(row_number()) == 3) %>%
distinct(subject_id, desired.average)
df <- left_join(df, df_join)
Result:
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08
obs_3_date obs_4_date desired.average
1 <NA> <NA> NA
2 2018-04-10 <NA> 9.5
3 2018-04-02 2018-04-10 12.0
4 2018-03-10 2018-03-15 NA
See if this works for you. Code is annotated briefly.
df=structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA, NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA, 4L), class = "data.frame")
# separate each section
obs=df[,2:5]
dq=df[, 6:9]
dt=sapply(df[, 10:13], as.numeric) # for easier calculations
# remove bad quality
obs[dq==F]=NA
# remove dates before 2018-3-15
obs[dt - as.numeric(as.Date("2018-03-15")) <= 0] = NA
# only leave two most recent dates
dt[is.na(obs)]=NA
dt=t(apply(dt,1,function(x){x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA;x}))
obs[is.na(dt)]=NA
# average
df$avg=apply(obs,1,function(x)ifelse(sum(!is.na(x))>=2, mean(x,na.rm=T), NA))
df
Edits:
Explanation
dt=t(apply(dt,1, function(x){x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA;x}))
The expression x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA might be a little confusing. na.rm=T means that NA values are removed. max(x[x!=max(x)]) is the second-largest number. So x[x < 2nd_largest_num]=NA sets every number except the largest and the second largest to NA. This function is then applied to the data row-wise. The final result is that dt contains only the two largest numbers in each row (the most recent dates in numeric format). All "discarded" values (NA in dt) are then removed from obs in the next line, obs[is.na(dt)]=NA. After all this, obs contains only the two most recent values in each row.
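The same rule can also be written as a small per-subject function, which may be easier to read than the matrix manipulation. This is only a sketch: the function name avg_last_two() and its argument names are made up, and it assumes the observations, quality flags and dates of one subject are passed as three vectors of equal length.

```r
# Hypothetical helper: mean of the two most recent good-quality observations
# dated strictly after 'cutoff'; NA if fewer than two qualify.
avg_last_two <- function(values, good, dates, cutoff) {
  keep <- !is.na(values) & !is.na(good) & good & !is.na(dates) & dates > cutoff
  if (sum(keep) < 2) return(NA_real_)
  # positions of the two latest dates among the kept observations
  recent <- order(dates[keep], decreasing = TRUE)[1:2]
  mean(values[keep][recent])
}

cutoff <- as.Date("2018-03-15")

# Subject 1 from the question: only one observation falls after the cutoff.
s1 <- avg_last_two(
  values = c(5, 6, NA, NA),
  good   = c(TRUE, TRUE, NA, NA),
  dates  = as.Date(c("2018-02-01", "2018-03-16", NA, NA)),
  cutoff = cutoff
)

# Subject 2 from the question: the first observation is of bad quality.
s2 <- avg_last_two(
  values = c(6, 8, 11, NA),
  good   = c(FALSE, TRUE, TRUE, NA),
  dates  = as.Date(c("2018-02-18", "2018-03-16", "2018-04-10", NA)),
  cutoff = cutoff
)

s1  # NA
s2  # 9.5
```

To run it over the whole data frame, the three column groups would have to be pulled out row by row, for example with lapply() over seq_len(nrow(df)).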
