Calculate the mean of values that fall between 2 dates - r

I have 2 dataframes. One is a list of occasional events. It has a date column and a column of values.
df1 = data.frame(date = c(as.Date('2020-01-01'), as.Date('2020-02-02'), as.Date('2020-03-01')),
value = c(1,5,9))
I have another data frame that is a daily record. It too has a date column and a column of values.
set.seed(1)
df2 = data.frame(date = seq.Date(from = as.Date('2020-01-01'), to = as.Date('2020-04-01'), by = 1),
value = rnorm(92))
I want to create a new column in df1 that is the mean of df2$value from the current row date to the subsequent date value (non inclusive of the second value, so in this example, the first new value would be the mean of values from df2 of row 1 through row 32, where row 33 is the row that matches df1$date[2]). The resultant data frame would look like the following:
date value value_new
1 2020-01-01 1 0.1165512
2 2020-02-02 5 0.0974052
3 2020-03-01 9 0.1241778
But I have no idea how to specify that. Also I would prefer the last value to be the mean of whatever data is beyond the last value of df1$date, but I would also accept an NA.

We can join df2 with df1, fill the NA values with the previous values, and take the mean of the value_new column within each group. Note that grouping by value assumes the values in df1 are unique.
library(dplyr)
df2 %>%
rename(value_new = value) %>%
left_join(df1, by = 'date') %>%
tidyr::fill(value) %>%
group_by(value) %>%
summarise(date = first(date),
value_new = mean(value_new))
# A tibble: 3 x 3
# value date value_new
# <dbl> <date> <dbl>
#1 1 2020-01-01 0.117
#2 5 2020-02-02 0.0974
#3 9 2020-03-01 0.124
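As an alternative sketch in base R (not part of the original answer), findInterval() can assign each daily row to the most recent event date, which avoids relying on unique values in df1$value. The data is the same as in the question; the names idx and means are only illustrative. Daily rows that fall before the first event date would get index 0 and are ignored here.

```r
# Same example data as in the question
df1 <- data.frame(date = as.Date(c('2020-01-01', '2020-02-02', '2020-03-01')),
                  value = c(1, 5, 9))
set.seed(1)
df2 <- data.frame(date = seq.Date(from = as.Date('2020-01-01'),
                                  to = as.Date('2020-04-01'), by = 1),
                  value = rnorm(92))

# For each daily row, the index of the latest df1 date that is <= that day
# (df1$date must be sorted in increasing order)
idx <- findInterval(as.numeric(df2$date), as.numeric(df1$date))

# Mean of the daily values per interval; the last interval runs to the end of df2
means <- tapply(df2$value, idx, mean)

# Look up by interval number so intervals without daily data become NA
df1$value_new <- as.numeric(means[as.character(seq_len(nrow(df1)))])
df1
```

Because each interval starts at an event date and stops just before the next one, the second date is excluded, as requested in the question.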

Related

Populate one data frame with data from the another data frame

I have two dfs (dfA and dfB) that contain dates, and I want to populate some columns in dfA with data from dfB based on some simple operations.
Say dfA has the following structure:
Location Mass Date
A 0.18 10/05/2001
B 0.25 15/08/2006
C 0.50 17/12/2019
dfB contains
Date Event Time
Where Date covers a wide range of dates. I would like to look in dfB for the dates in dfA and retrieve the "Event" and "Time" data from dfB based on simple date operations, such as getting the data from one, two or three days before the "Date" shown in dfA, giving me something like:
Location Mass Date Event 1 Event 2 Event 3
A 0.18 10/05/2001 (w) (x) (y)
B 0.25 15/08/2006 (z) (z1) (z2)
Where (w) would be the data extracted from "Event" in dfB one day before the "Date" specified in dfA (09/05/2001), (x) would be the data from "Event" in dfB two days before that date in dfA (08/05/2001), and so on.
I believe using dplyr and lubridate could sort this out.
You can add dummy variables with lagged dates (day - 1, day - 2 etc.) then use a series of left_join to achieve intended results. Please see the code below:
library(lubridate)
library(tidyverse)
# Simulation
dfa <- tibble(location = LETTERS[1:4],
mass = c(0.18, 0.25, 0.5, 1),
date = dmy(c("10/05/2001", "15/08/2006", "15/07/2006", "17/12/2019")))
dfb <- tibble(date = dmy(c("9/05/2001", "13/08/2006", "13/07/2006", "14/12/2019")),
event = c("day-1a", "day-2a", "day-2b", "day-3"))
# Dplyr-ing, series of left_joins
dfc <- dfa %>%
mutate(date_1 = date - 1,
date_2 = date - 2,
date_3 = date - 3) %>%
left_join(dfb, by = c("date_1" = "date")) %>%
rename(event1 = event) %>%
left_join(dfb, by = c("date_2" = "date")) %>%
rename(event2 = event) %>%
left_join(dfb, by = c("date_3" = "date")) %>%
rename(event3 = event) %>%
select(-starts_with("date_"))
dfc
Output:
# A tibble: 4 x 6
location mass date event1 event2 event3
<chr> <dbl> <date> <chr> <chr> <chr>
1 A 0.18 2001-05-10 day-1a NA NA
2 B 0.25 2006-08-15 NA day-2a NA
3 C 0.5 2006-07-15 NA day-2b NA
4 D 1 2019-12-17 NA NA day-3
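The same lookup can also be sketched in base R with match(), which avoids repeating the join once per offset. This is an illustrative alternative rather than the answer's method; it uses the same simulated data and assumes each date appears at most once in dfb (match() returns only the first hit).

```r
# Same simulated data as above, built with base R only
dfa <- data.frame(location = LETTERS[1:4],
                  mass = c(0.18, 0.25, 0.5, 1),
                  date = as.Date(c("10/05/2001", "15/08/2006", "15/07/2006", "17/12/2019"),
                                 format = "%d/%m/%Y"))
dfb <- data.frame(date = as.Date(c("9/05/2001", "13/08/2006", "13/07/2006", "14/12/2019"),
                                 format = "%d/%m/%Y"),
                  event = c("day-1a", "day-2a", "day-2b", "day-3"))

# For each offset k, find the dfb row dated k days before dfa$date.
# Dates with no match give NA, just like the left joins.
for (k in 1:3) {
  dfa[[paste0("event", k)]] <- dfb$event[match(dfa$date - k, dfb$date)]
}
dfa
```

Changing 1:3 to another range of offsets is enough to look further back.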

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct type, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
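To see what the run-length grouping does, here is a small base R sketch. The data frame obs and its values are made up for illustration (the original data was only shown as an image), and cumsum() over state changes stands in for data.table::rleid.

```r
# Hypothetical observations: state 1 occurs in two separate runs
obs <- data.frame(
  MRN = 1,
  OBS_DTM = as.POSIXct(c("2020-07-27 08:44", "2020-07-28 08:44",
                         "2020-08-08 20:12", "2020-08-09 08:12",
                         "2020-08-15 13:32"), tz = "UTC"),
  STATE = c(1, 1, 2, 2, 1)
)

# A new run starts on the first row and whenever STATE changes
obs$run_id <- cumsum(c(TRUE, diff(obs$STATE) != 0))

# Days since the first observation of each run, per patient
# (as.numeric() on POSIXct gives seconds; 86400 seconds per day)
obs$Answer <- ave(as.numeric(obs$OBS_DTM), obs$MRN, obs$run_id,
                  FUN = function(t) (t - min(t)) / 86400)
obs
```

The last row gets its own run_id, so its Answer restarts at 0 instead of being measured from the first block of 1's.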
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

using mutate with row and column indexing and group by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
I want to create a variable 'baseline' that takes the content of variable 'value' where time = 1 and by group. As such the desired output would be
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1),
baseline = c(1,1,3,3)))
Tried to run the following code with indexing but am clearly going wrong somewhere
x <- data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
and if it is already ordered by 'time', then simply use first
df1 %>%
group_by(group) %>%
mutate(baseline = first(value))
data
df1 <- data.frame(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
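If the baseline should be tied to time == 1 explicitly rather than to the smallest time, a logical index inside the grouped mutate is a closer match to the attempt in the question. This is a sketch that assumes every group has exactly one row with time == 1; the name out is only illustrative.

```r
library(dplyr)

df1 <- data.frame(group = c('a', 'a', 'b', 'b'),
                  time = c(1, 2, 1, 2),
                  value = seq(1, 4, 1))

# Inside a grouped mutate, `value` and `time` refer to the current group only,
# so value[time == 1] is a single number that is recycled across the group
out <- df1 %>%
  group_by(group) %>%
  mutate(baseline = value[time == 1]) %>%
  ungroup()
out
```

A group with zero or several time == 1 rows would make this fail with a size error, which can be a useful check on the data.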

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join because the first (left) file needs to stay as it is: I don't want it to return all combinations and add rows. I also don't want it to link data like VLOOKUP in Excel, which finds the first match and returns it, so that with duplicate ID numbers it only ever returns the first match. I need it to return the first match, then the second, then the third, and so on (the dates are sorted so that the newest date is always first for every ID number), BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-IDs that follow the row order within each ID. Then you can use an inner_join on both columns to join everything together.
Since you didn't provide example data sets, I created two to show the principle.
library(dplyr)
df1 <- df1 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
df2 <- df2 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
outcome <- df1 %>% inner_join(df2, by = c("ID", "sub_id"))
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-incrementing id within each group, then join as usual on both columns.
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05

Slicing the first row of a dataframe and an additional row containing a date

I would like to take the first row of a dataframe (which will change weekly) and also a row containing a reference date (which is a constant) in order to perform a mathematical operation on them.
I can use dplyr::slice() to get the first row but any ideas on how to also return the additional row in the same call?
library(dplyr)
df <- tibble(x = c(10, 45, 65, 10),
dt = as.POSIXct("2018-01-01", tz = "GMT"))
slice(df, 1)
Ideally, I will get two rows back as a dataframe. The first row and the row specified by date.
I'd use dplyr::filter because it lets you provide multiple conditions using an OR statement. We can then filter based on the desired criteria or on a specific row number (generated by the dplyr::row_number() function):
df %>%
filter(x == 65 | row_number() == 1)
# A tibble: 2 x 2
x dt
<dbl> <dttm>
1 10 2018-01-01 00:00:00
2 65 2018-01-01 00:00:00
We can use slice by concatenating the row index for the first row ('1') with the row index obtained by matching the value '65' in the 'x' column
df %>%
slice(c(1, match(65, x)))
# A tibble: 2 x 2
# x dt
# <dbl> <dttm>
#1 10 2018-01-01 00:00:00
#2 65 2018-01-01 00:00:00
