R dplyr solution for summarize_at correlation

I am attempting to calculate the correlation, grouped by MktDate, between each column in a data frame and another column (the security return).
I have attempted a number of dplyr solutions and can't quite get the correlation example to work properly, although I have no issues getting an equivalent example using mean to work.
This works, calculating the mean of the specified columns:
MyMeanTest <- MyDataTest %>%
  filter(MktDate >= '2009-12-31') %>%
  group_by(MktDate) %>%
  summarize_at(c('RtnVol_EM','OCFROI_EM'), mean, na.rm = TRUE)
This does not work. Essentially, I want the correlation of each specified column with the column FwdRet_12M, grouped by MktDate:
MyCorTest <- MyDataTest %>%
  group_by(MktDate) %>%
  summarize_at(c('RtnVol_EM','OCFROI_EM'),
               funs(cor(., MyDataTest$FwdRet_12M, use = "pairwise.complete.obs", "spearman")))
I get the following error message:
Error in summarise_impl(.data, dots) :
  Evaluation error: not all arguments have the same length.
With the code example above, I would expect to end up with something like this:
MktDate,RtnVol_EM,OCFROI_EM...
Here is some sample code that should help clarify the structure of the data and the end objective.
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
When running the pairwise correlation grouped by date on that data set the following should be the result.
MktDate,Fact1,Fact2
12/31/18,.2,.4
1/31/19,.4,-.8
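A likely cause of the error above (my reading of the code, not confirmed): MyDataTest$FwdRet_12M pulls the full, ungrouped column, so its length does not match the number of rows in each MktDate group inside summarize_at. Referring to the column bare lets it be evaluated per group; a minimal sketch of that fix, reusing the question's column names and funs() idiom:
# Sketch: reference FwdRet_12M directly so it is evaluated within each group
MyCorTest <- MyDataTest %>%
  group_by(MktDate) %>%
  summarize_at(c('RtnVol_EM', 'OCFROI_EM'),
               funs(cor(., FwdRet_12M, use = "pairwise.complete.obs", "spearman")))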

One possible approach would be to reshape your data so that you have the variable you always want in the correlation (FwdRet) in one column and the variable that changes in a separate column. Like so:
library(tidyverse)

MyDataTest_reshape <- MyDataTest %>%
  gather(factor, value, -MktDate, -FwdRet)
MyDataTest_reshape
MktDate FwdRet factor value
1 2018-12-31 2 Fact1 10
2 2018-12-31 3 Fact1 30
3 2018-12-31 4 Fact1 20
4 2018-12-31 5 Fact1 15
5 2019-01-31 5 Fact1 12
6 2019-01-31 2 Fact1 25
7 2019-01-31 1 Fact1 26
8 2019-01-31 4 Fact1 28
9 2018-12-31 2 Fact2 100
10 2018-12-31 3 Fact2 500
11 2018-12-31 4 Fact2 300
12 2018-12-31 5 Fact2 400
13 2019-01-31 5 Fact2 150
14 2019-01-31 2 Fact2 400
15 2019-01-31 1 Fact2 430
16 2019-01-31 4 Fact2 420
Then you can take that reshaped data and feed it into your correlation:
MyDataTest_reshape %>%
  group_by(MktDate, factor) %>%
  summarize(correlation = cor(FwdRet, value)) %>%
  spread(factor, correlation)
# A tibble: 2 x 3
# Groups: MktDate [2]
MktDate Fact1 Fact2
<date> <dbl> <dbl>
1 2018-12-31 0.0756 0.529
2 2019-01-31 -0.627 -0.736
You can also do this all in one step, of course:
MyDataTest %>%
  gather(factor, value, -MktDate, -FwdRet) %>%
  group_by(MktDate, factor) %>%
  summarize(correlation = cor(FwdRet, value)) %>%
  spread(factor, correlation)
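As a side note, gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(); a sketch of the same pipeline with the newer verbs (assuming tidyr >= 1.0.0):
# Same reshape-and-correlate pipeline with the newer tidyr verbs
MyDataTest %>%
  pivot_longer(-c(MktDate, FwdRet), names_to = "factor", values_to = "value") %>%
  group_by(MktDate, factor) %>%
  summarize(correlation = cor(FwdRet, value), .groups = "drop") %>%
  pivot_wider(names_from = factor, values_from = correlation)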

This works for me.
library(tidyverse)
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
MyDataTest %>%
  group_by(MktDate) %>%
  summarize_at(c("Fact1", "Fact2"), list(~cor(., FwdRet, use = "pairwise.complete.obs", "spearman")))
#> # A tibble: 2 x 3
#> MktDate Fact1 Fact2
#> <date> <dbl> <dbl>
#> 1 2018-12-31 0.2 0.4
#> 2 2019-01-31 -0.4 -0.8
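For what it's worth, summarize_at() and the funs()/list(~) helpers are superseded in current dplyr; a sketch of the same call with across() (assuming dplyr >= 1.0.0):
# Same result with across(); FwdRet is visible inside the lambda via data masking
MyDataTest %>%
  group_by(MktDate) %>%
  summarize(across(c(Fact1, Fact2),
                   ~ cor(.x, FwdRet, use = "pairwise.complete.obs", method = "spearman")))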

Related

new rows for every day from in-between dates

I have to create a data frame with a single row for every day in the interval between the two dates (date_in to date_out).
I have to use R.
How can I do this?
My data:
id date_in date_out days
1 1 13May2022 0:00:00 03Jul2022 0:00:00 51
2 3 10Nov2020 0:00:00 15Nov2020 0:00:00 5
3 4 25Feb2020 0:00:00 05Apr2020 0:00:00 40
> dput(df)
structure(list(id = c(1L, 3L, 4L), date_in = c("13May2022 0:00:00",
"10Nov2020 0:00:00", "25Feb2020 0:00:00"), date_out = c("03Jul2022 0:00:00",
"15Nov2020 0:00:00", "05Apr2020 0:00:00"), days = c(51, 5, 40
)), class = "data.frame", row.names = c(NA, -3L))
Here is an option. First, parse the dates into Date objects (yours might already be dates), then map out all the dates from start to end, and lastly unnest.
library(tidyverse)
#data
df <- read.csv(textConnection("id, date_in, date_out, days,
1, 13May2022 0:00:00, 03Jul2022 0:00:00, 51,
3, 10Nov2020 0:00:00, 15Nov2020 0:00:00, 5,
4, 25Feb2020 0:00:00, 05Apr2020 0:00:00, 40")) |>
select(-X)
#solution
df |>
  mutate(across(starts_with("date"), \(x) lubridate::dmy_hms(x) |>
                  lubridate::date()),
         full_date = map2(date_in, date_out, \(x, y) seq(x, y, by = "1 day"))) |>
  unnest_longer(full_date) |>
  select(id, date = full_date)
#> # A tibble: 99 x 2
#> id date
#> <int> <date>
#> 1 1 2022-05-13
#> 2 1 2022-05-14
#> 3 1 2022-05-15
#> 4 1 2022-05-16
#> 5 1 2022-05-17
#> 6 1 2022-05-18
#> 7 1 2022-05-19
#> 8 1 2022-05-20
#> 9 1 2022-05-21
#> 10 1 2022-05-22
#> # ... with 89 more rows
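Note that seq() over Date objects includes both endpoints, so each id yields days + 1 rows (52 + 6 + 41 = 99 here); a quick check:
length(seq(as.Date("2022-05-13"), as.Date("2022-07-03"), by = "1 day"))
#> [1] 52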
Here is a similar approach to AndS.'s, but using summarize:
library(tidyverse)
library(lubridate)
# data
df <- read.csv(textConnection("id, date_in, date_out, days,
1, 13May2022 0:00:00, 03Jul2022 0:00:00, 51,
3, 10Nov2020 0:00:00, 15Nov2020 0:00:00, 5,
4, 25Feb2020 0:00:00, 05Apr2020 0:00:00, 40")) |>
select(-X)
# answer
df |>
  mutate(across(c(date_in, date_out), ~ date(dmy_hms(.x)))) |>
  group_by(id) |>
  summarize(date = seq(date_in, date_out, by = "1 day"))
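One caveat: summarise() returning more than one row per group still works but is deprecated as of dplyr 1.1.0, which introduced reframe() for exactly this case; a sketch of the equivalent call:
# dplyr >= 1.1.0 prefers reframe() when a group expands to multiple rows
df |>
  mutate(across(c(date_in, date_out), ~ date(dmy_hms(.x)))) |>
  group_by(id) |>
  reframe(date = seq(date_in, date_out, by = "1 day"))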

Classify table based on value 'moving window' range and proportions?

I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even- or uneven-aged, combining volume and age data. A forest is considered even-aged if more than 80% of the volume is allocated to age classes within 20 years of each other. I wonder how to implement the 'within 20 years apart' condition? I can easily calculate the sum of volume and its share for the individual tree layers (strat). But how do I check how many years apart they are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes
library(dplyr)

df <- data.frame(stand = c("id1", "id1", "id1", "id1",
                           "id2", "id2", "id2"),
                 strat = c(1, 2, 3, 4,
                           1, 2, 3),
                 v = c(4, 10, 15, 20,
                       11, 15, 18),
                 age = c(5, 10, 65, 80,
                         10, 15, 20))

# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v)) %>%
  mutate(V_share = v / V_tot * 100)
Expected outcome:
  stand strat     v   age V_tot V_share quality
  <fct> <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>
1 id1       1     4     5    49    8.16 uneven-aged
2 id1       2    10    10    49   20.4  uneven-aged
3 id1       3    15    65    49   30.6  uneven-aged
4 id1       4    20    80    49   40.8  uneven-aged  # age classes 65 and 80, though less than 20 years apart, hold only ~71% of the total volume
5 id2       1    11    10    44   25    even-aged
6 id2       2    15    15    44   34.1  even-aged
7 id2       3    18    20    44   40.9  even-aged
Another tidyverse solution, implementing a moving-window sum of volume shares:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
  group_by(stand) %>%
  mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
         quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
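A slightly leaner variant of the same idea; since there are no NAs in age here, logical subsetting works without the which() call. A sketch:
df %>%
  group_by(stand) %>%
  mutate(range20 = map_dbl(age, ~ sum(V_share[abs(age - .x) <= 20])),
         quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))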
Interesting issue. I think I have a solution using the runner package:
library(dplyr)
library(runner)

df %>%
  group_by(stand) %>%
  mutate(
    V_tot = sum(v),
    V_share = v / V_tot * 100,
    test = sum_run(
      V_share,
      k = 20L,
      idx = age,
      na_rm = TRUE,
      na_pad = FALSE
    ),
    quality = if_else(any(test >= 80), 'even-aged', 'uneven-aged')
  ) %>%
  select(-test)

R: Expand rows according to start and end date and calculate hours between days

My question extends this one: Generate rows between two dates into a data frame in R
I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:
ID Admission Discharge Stay_in_days
1 2020-08-20 15:25:03 2020-08-21 21:09:34 1.239
2 2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
...
Dates are in POSIXct format so far.
I aim for this:
ID Date Stay_in_days
1 2020-08-20 15:25:03 0.357
1 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 0.087
2 2020-10-05 00:00:00 1.087
2 2020-10-06 00:00:00 2.087
2 2020-10-07 00:00:00 3.087
2 2020-10-08 00:00:00 4.087
2 2020-10-09 11:02:57 4.548
...
What I have done so far:
M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
  ID = rep.int(patients$ID, vapply(M, length, 1L)),
  Date = do.call(c, M)
)
patients2 <- patients2 %>%
  mutate(
    Date2 = as.Date(Date, format = "%Y-%m-%d"),
    Dat2 = Date2 + 1,
    Diff = difftime(Date2, Date, units = "days")
  )
but this gives me:
ID Date Date2 Diff
1 2020-08-20 17:25:03 2020-08-21 0.375
1 2020-08-21 17:25:03 2020-08-22 0.357
2 2020-10-04 23:53:43 2020-10-05 0.087
2 2020-10-05 23:53:43 2020-10-06 0.087
2 2020-10-06 23:53:43 2020-10-07 0.087
2 2020-10-07 23:53:43 2020-10-08 0.087
2 2020-10-08 23:53:43 2020-10-09 0.087
...
Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?
Here is some data:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872,
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704,
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96,
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104,
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44,
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208,
1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424,
1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2,
1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16,
1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416,
1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513,
30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426,
9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance for your help!
Though it is a bit crude, it'll work:
library(tidyverse)
library(lubridate)
df %>%
  pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
  group_by(ID) %>%
  mutate(dummy = case_when(Event == "Admission" ~ 0,
                           Event == "Discharge" ~ max(floor(Stay_in_days), 1))) %>%
  complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
  mutate(Event = ifelse(is.na(Event), "Dummy", Event),
         DATE = if_else(is.na(DATE), first(DATE) + dummy * 24 * 60 * 60, DATE),
         Stay_in_days = case_when(
           Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
           Event == "Discharge" ~ Stay_in_days,
           TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days"))
         )) %>%
  select(ID, DATE, Stay_in_days)
# A tibble: 199 x 3
# Groups: ID [20]
ID DATE Stay_in_days
<dbl> <dttm> <dbl>
1 1 2020-08-20 15:25:03 0.358
2 1 2020-08-21 21:09:49 1.24
3 2 2020-08-29 16:16:08 0.322
4 2 2020-08-30 16:16:08 1.32
5 2 2020-08-31 18:08:57 2.08
6 3 2020-09-02 15:28:41 0.355
7 3 2020-09-03 14:51:10 0.974
8 4 2020-09-10 17:14:47 0.281
9 4 2020-09-11 17:25:18 1.01
10 5 2020-09-11 09:11:36 0.617
# ... with 189 more rows
Explanation of logic: For the first date in every ID, Stay_in_days gives the duration from the admission date-time to the following midnight. For intermediate dates, it just adds 1 to the previous value. For the discharge date it retains the stay value calculated prior to pivoting. Hope this was what you were after.
Explanation of code: After pivoting longer, I used a dummy column to create the intermediate date-time objects. After that I just mutate the columns to generate the output described above.
You can achieve this with pivot_longer from tidyr.
Edit: with comments:
df1 <- df %>%
  select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
  pivot_longer(
    cols = starts_with("date"),
    names_to = "Date1",
    values_to = "Date"
  ) %>% # pivot to long format
  select(-Date1) %>% # remove temporary Date1
  relocate(Stay_in_days, .after = Date) %>% # change column order
  group_by(ID) %>%
  mutate(idgroup = rep(row_number(), length.out = n())) %>% # id for admission = 1 and for discharge id = 2
  mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>% # set Admission to zero
  ungroup()

Find if a time exists between two different times, type issue

I have a dataframe as so
df <- structure(list(TIME = c("11:15:00", NA, "15:15:00", "12:00:00",
"18:40:00", "18:15:00", "7:10:00", "15:58:00", "10:00:00", "10:00:00"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
And I basically want to create a new variable which tells me if the time is in a certain group.
I wrote the following but it's not correct; I tried changing to as.POSIXct but no dice.
df <- df %>%
mutate(time_groups = ifelse(between(as.POSIXct(TIME),00:00, 5:59), 1,
ifelse(between(as.POSIXct(TIME),06:00, 8:59), 2,
ifelse(between(as.POSIXct(TIME),09:00,11:59), 3,
ifelse(between(as.POSIXct(TIME),12:00,14:59), 4,
ifelse(between(as.POSIXct(TIME),15:00,17:59), 5,
ifelse(between(as.POSIXct(TIME),18:00,23:59), 6,
), NA)
You could use the findInterval function:
library(tidyverse)
library(lubridate)
a <- c("00:00","5:59", "8:59", "11:59", "14:59", "17:59", "23:59")
b <- ymd_hm(paste(Sys.Date(), a))
df %>%
mutate(Interval = findInterval(ymd_hms(paste(Sys.Date(), TIME)), b))
TIME Interval
<chr> <int>
1 11:15:00 3
2 NA NA
3 15:15:00 5
4 12:00:00 4
5 18:40:00 6
6 18:15:00 6
7 7:10:00 2
8 15:58:00 5
9 10:00:00 3
10 10:00:00 3
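If you prefer to stay closer to the original 1-6 grouping, a case_when() on the parsed hour is another option; a minimal sketch, assuming lubridate is loaded (NA times stay NA):
df %>%
  mutate(hour = hour(hms(TIME)),
         time_groups = case_when(hour <= 5  ~ 1,
                                 hour <= 8  ~ 2,
                                 hour <= 11 ~ 3,
                                 hour <= 14 ~ 4,
                                 hour <= 17 ~ 5,
                                 hour <= 23 ~ 6))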

compute an average of the last two columns which differ for all subjects

I'm an R beginner and this is my first post here. I'm struggling with a problem and would love your advice. Basically, I have a dataset with 3 sets of columns that I need to manipulate together in order to obtain the desired outcome: an average of the 2 most recent high-quality observations, where these observations must occur after a cutoff date (say, 3/15/2018). What makes it complex is that the relevant columns that go into the average differ for each case.
The first set of columns has to do with the number of observations each case has, so subject one has 2 observations, subject two has 3, etc.
The second set of columns describes the data quality of each of these observations. So for example, subject 1 has two good observations, whereas subject 2 has bad data quality for the first observation and good data quality for the two later ones, and subject 3 has three observations of good quality and one observation (obs_3) of bad data quality.
The third set of columns specifies the dates of the observations.
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date obs_3_date obs_4_date desired.average
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16 <NA> <NA> NA
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16 2018-04-10 <NA> 9.5
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18 2018-04-02 2018-04-10 12.0
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08 2018-03-10 2018-03-15 NA
In order to compute an average of the TWO latest observations that are of good data quality:
I must first decide which observations are of good quality,
then compute an average of exactly 2 observations, which must occur after 3/15 and be the two most recent observations.
Below is my sample dataset. I've tried to do this manually in Excel and it was really painstaking. I'm hoping to do this in R and would very much appreciate your feedback. Thank you!
Here is my sample dataset:
> dput(head(df,5))
structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA,
NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA,
4L), class = "data.frame")
This should also work, and though a bit verbose it doesn't rely on column indices, so should be robust:
library(dplyr)
library(tidyr)

num_date <- as.numeric(as.Date("2018-03-15"))
df <- df[, -ncol(df)]

df_join <- df %>%
  gather(Obs, value, 2:ncol(df)) %>%
  mutate(
    nr = as.numeric(gsub("[^\\d]", "", Obs, perl = TRUE))
  ) %>%
  group_by(subject_id, nr) %>%
  filter(!(is.na(value) | (grepl("_dq", Obs) & value == 0) | any(value[grepl("_date", Obs)] <= num_date))) %>%
  ungroup() %>%
  group_by(subject_id, Obs) %>%
  filter(!row_number() < (max(row_number() - 1))) %>%
  ungroup() %>%
  group_by(subject_id) %>%
  mutate(
    desired.average = mean(value[grepl("_date|_dq", Obs) == FALSE], na.rm = TRUE)
  ) %>%
  filter(!max(row_number()) == 3) %>%
  distinct(subject_id, desired.average)

df <- left_join(df, df_join)
Result:
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08
obs_3_date obs_4_date desired.average
1 <NA> <NA> NA
2 2018-04-10 <NA> 9.5
3 2018-04-02 2018-04-10 12.0
4 2018-03-10 2018-03-15 NA
See if this works for you. Code is annotated briefly.
df=structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA, NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA, 4L), class = "data.frame")
# separate each section
obs=df[,2:5]
dq=df[, 6:9]
dt=sapply(df[, 10:13], as.numeric) # for easier calculations
# remove bad quality
obs[dq==F]=NA
# remove dates before 2018-3-15
obs[dt - as.numeric(as.Date("2018-03-15")) <= 0] = NA
# only leave two most recent dates
dt[is.na(obs)]=NA
dt=t(apply(dt,1,function(x){x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA;x}))
obs[is.na(dt)]=NA
# average
df$avg=apply(obs,1,function(x)ifelse(sum(!is.na(x))>=2, mean(x,na.rm=T), NA))
df
Edits: Explanation of
dt = t(apply(dt, 1, function(x){x[x < max(x[x != max(x, na.rm=T)], na.rm=T)] = NA; x}))
The subsetting x[x < max(x[x != max(x, na.rm=T)], na.rm=T)] = NA might be a little confusing. na.rm=T means NA values are ignored. max(x[x != max(x)]) is the second-largest number, so x[x < 2nd_largest_num] = NA removes every number except the largest and the second-largest. This function is applied to the data frame row-wise, so dt ends up containing only the two largest numbers in each row (the two most recent dates, in numeric form). All "discarded" values (NA in dt) are then removed from obs in the next line, obs[is.na(dt)] = NA. After all this, obs only contains the two most recent values in each line.
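A tiny worked example of that masking step (hypothetical numbers), showing that only the two largest values survive:
x <- c(17563, 17606, 17631, NA)
x[x < max(x[x != max(x, na.rm = TRUE)], na.rm = TRUE)] <- NA
x
#> [1]    NA 17606 17631    NA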
