Add rows based on missing dates within a group - r

I am trying to add rows to a data frame based on the minimum and maximum dates within each group. Suppose this is my original data frame:
df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3),
Value = c(100, 200, 150, 125, 200, 150, 175))
Notice that Group 1 has 2 consecutive dates, group 2 has 3 consecutive dates, and group 3 is missing the date in the middle (2018-01-01). I'd like to be able to complete the data frame by adding rows for missing dates. But the thing is I only want to add additional dates based on dates that are missing between the minimum and maximum date within each group. So if I were to complete this data frame it would look like this:
df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01","2018-01-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3,3),
Value = c(100, 200, 150, 125, 200, 150,NA, 175))
Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).

You can use tidyr::complete with dplyr to find a solution. The dates appear to be spaced one month apart, so the sequence for each group can be built with by = "month". The approach is as below:
library(dplyr)
library(tidyr)
df %>% group_by(Group) %>%
complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))
# A tibble: 8 x 3
# Groups: Group [3]
# Group Date Value
# <dbl> <date> <dbl>
# 1 1.00 2017-12-01 100
# 2 1.00 2018-01-01 200
# 3 2.00 2017-12-01 150
# 4 2.00 2018-01-01 125
# 5 2.00 2018-02-01 200
# 6 3.00 2017-12-01 150
# 7 3.00 2018-01-01 NA
# 8 3.00 2018-02-01 175
Data
df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
"2018-02-01","2017-12-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3),
Value = c(100, 200, 150, 125, 200, 150, 175))

@MKR's approach of using tidyr::complete with dplyr is good, but it will fail if the group column is not numeric. The column is then typecast to a factor, and the complete() operation results in a tibble with a row for every factor/date combination within each group.
complete() does not need the group variable as its first argument when the data is already grouped, so the solution is
library(dplyr)
library(tidyr)
df %>% group_by(Group) %>%
complete(Date = seq.Date(min(Date), max(Date), by = "month"))
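A minimal sketch for checking the grouped complete() call on a non-numeric group column. The character labels "a", "b", "c" and the names df_chr and df_filled are assumptions made for illustration; the dates and values are the ones from the question.

```r
library(dplyr)
library(tidyr)

# Question's data, but with a character group column (hypothetical labels)
df_chr <- data.frame(
  Date  = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                    "2018-02-01", "2017-12-01", "2018-02-01")),
  Group = c("a", "a", "b", "b", "b", "c", "c"),
  Value = c(100, 200, 150, 125, 200, 150, 175),
  stringsAsFactors = FALSE
)

df_filled <- df_chr %>%
  group_by(Group) %>%
  # min(Date) and max(Date) are evaluated separately for each group
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  ungroup()

df_filled
```

Only group "c" should gain a row (2018-01-01, with Value set to NA), since it is the only group with a gap between its own first and last date.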

Related

Populate one data frame with data from the another data frame

I have two dfs (dfA and dfB) that contain dates, and I want to populate some columns in dfA with data from dfB based on some simple operations.
Say df A has the following structure:
Location Mass Date
A 0.18 10/05/2001
B 0.25 15/08/2006
C 0.50 17/12/2019
Df B contains
Date Event Time
Where Date covers a wide range of dates. I would like to look in dfB for the dates in dfA and retrieve "Event" and "Time" data from dfB based on simple date operations, such as getting data from one, two or three days before the date shown in "Date" in dfA, giving me something like:
Location Mass Date Event 1 Event 2 Event 3
A 0.18 10/05/2001 (w) (x) (y)
B 0.25 15/08/2006 (z) (z1) (z2)
Where (w) would be the data extracted from "Event" in dfB on "Date" (-1) day from "Date" specified in dfA (09/05/2001), then (x) would retrieve the data from "Event" in dfB on "Date" (-2) days from that in df A (08/05/2001) and so on.
I believe using dplyr and lubridate could sort this out.
You can add dummy variables with lagged dates (day - 1, day - 2 etc.) then use a series of left_join to achieve intended results. Please see the code below:
library(lubridate)
library(tidyverse)
# Simulation
dfa <- tibble(location = LETTERS[1:4],
mass = c(0.18, 0.25, 0.5, 1),
date = dmy(c("10/05/2001", "15/08/2006", "15/07/2006", "17/12/2019")))
dfb <- tibble(date = dmy(c("9/05/2001", "13/08/2006", "13/07/2006", "14/12/2019")),
event = c("day-1a", "day-2a", "day-2b", "day-3"))
# Dplyr-ing, series of left_joins
dfc <- dfa %>%
mutate(date_1 = date - 1,
date_2 = date - 2,
date_3 = date - 3) %>%
left_join(dfb, by = c("date_1" = "date")) %>%
rename(event1 = event) %>%
left_join(dfb, by = c("date_2" = "date")) %>%
rename(event2 = event) %>%
left_join(dfb, by = c("date_3" = "date")) %>%
rename(event3 = event) %>%
select(-starts_with("date_"))
dfc
Output:
# A tibble: 4 x 6
location mass date event1 event2 event3
<chr> <dbl> <date> <chr> <chr> <chr>
1 A 0.18 2001-05-10 day-1a NA NA
2 B 0.25 2006-08-15 NA day-2a NA
3 C 0.5 2006-07-15 NA day-2b NA
4 D 1 2019-12-17 NA NA day-3
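If the number of offsets grows, the chain of joins can be replaced by a single join on a long table of lookup dates. This is only a sketch under the same simulated data; the names offset, lookup, dfc_long and dfc_wide are hypothetical, and as.Date is used in place of dmy to avoid the lubridate dependency.

```r
library(dplyr)
library(tidyr)

dfa <- tibble(
  location = LETTERS[1:4],
  mass = c(0.18, 0.25, 0.5, 1),
  date = as.Date(c("2001-05-10", "2006-08-15", "2006-07-15", "2019-12-17"))
)
dfb <- tibble(
  date = as.Date(c("2001-05-09", "2006-08-13", "2006-07-13", "2019-12-14")),
  event = c("day-1a", "day-2a", "day-2b", "day-3")
)

# One row per (dfa row, offset), then a single join on the shifted date
dfc_long <- dfa %>%
  crossing(offset = 1:3) %>%
  mutate(lookup = date - offset) %>%
  left_join(dfb, by = c("lookup" = "date")) %>%
  select(-lookup)

# Spread the offsets back into event1, event2, event3
dfc_wide <- dfc_long %>%
  pivot_wider(names_from = offset, values_from = event, names_prefix = "event")

dfc_wide
```

Changing 1:3 to another range of offsets is then the only edit needed to look further back.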

Addition of missing data after floor_date / detect and fill in missing data gaps

I would like to sum up a larger set of data per month. floor_date offers the right functionality to sum up the data from the individual days on a monthly level. But unfortunately I need to make sure that all months are included in the final table. The initial data therefore does not always cover all months, but after floor_date there must be 0 in the corresponding months; the rows / months must not simply be missing. How can I ensure this automatically?
The following exemplary code clarifies my problem:
df <- data.frame(
time = c(as.Date("01-01-2020", format = "%d-%m-%Y"), as.Date("02-01-2020", format = "%d-%m-%Y"), as.Date("01-03-2020", format = "%d-%m-%Y")),
text = c("A", "A", "B")
)
df2 <- df %>%
mutate(month = floor_date(time, unit = "month")) %>%
select(text, month) %>%
group_by(month, text) %>%
summarise(n = n())
df2
# A tibble: 2 x 3
# Groups: month [2]
month text n
<date> <fct> <int>
1 2020-01-01 A 2
2 2020-03-01 B 1
It should be recognized that there is no data for B in month 2020-01, no data for A and B in month 2020-02, and no data for A in month 2020-03: these rows should be added with value 0.
Unfortunately, so far I have not found a solution to solve the problem in an automated way.
Thanks in advance!
I cannot understand the need of using format while mutating the variable for a given month (floor_date). This formatting turns the variable into character type and hence no further calculations can be performed.
Remove that step and use tidyr::complete to fill in the missing months, as shown below:
df <- data.frame(
time = c(as.Date("01-01-2020", format = "%d-%m-%Y"), as.Date("02-01-2020", format = "%d-%m-%Y"), as.Date("01-03-2020", format = "%d-%m-%Y")),
text = c("A", "A", "B")
)
library(lubridate, warn.conflicts = F)
library(tidyverse, warn.conflicts = F)
df %>%
mutate(month = floor_date(time, unit = "month")) %>%
group_by(text, month) %>%
summarise(n = n(), .groups = 'drop') %>%
complete(nesting(text), month = seq.Date(from = min(month), to = max(month), by = '1 month'), fill = list(n = 0))
# A tibble: 6 x 3
text month n
<chr> <date> <dbl>
1 A 2020-01-01 2
2 A 2020-02-01 0
3 A 2020-03-01 0
4 B 2020-01-01 0
5 B 2020-02-01 0
6 B 2020-03-01 1
Created on 2021-07-06 by the reprex package (v2.0.0)
Base R option using cut -
stack(table(cut(df$time,'month')))[2:1]
# ind values
#1 2020-01-01 2
#2 2020-02-01 0
#3 2020-03-01 1
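The cut() one-liner above counts rows per month only and drops the text column. A possible base R variant that keeps both dimensions is a two-way table; this is a sketch, and the name counts is an assumption.

```r
df <- data.frame(
  time = as.Date(c("01-01-2020", "02-01-2020", "01-03-2020"), format = "%d-%m-%Y"),
  text = c("A", "A", "B")
)

# cut() creates a level for every month in the range, including empty ones,
# so the cross-tabulation contains explicit zero counts
counts <- as.data.frame(table(text = df$text, month = cut(df$time, "month")))
counts
```

The result has one row per text/month combination, with the count in the Freq column.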

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct type, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between OBS_DTM and the minimum value in the group.
If your data is called data:
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
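To see why grouping on run ids separates the two sets of 1's, here is a small base R sketch that builds the same kind of id with rle() instead of data.table::rleid(). The state and obs vectors are made-up illustration data, not the data from the question.

```r
# Hypothetical observations for one patient: state 1, then 2, then 1 again
state <- c(1, 1, 2, 2, 1, 1, 1)
obs <- as.POSIXct(
  c("2020-07-27 00:00", "2020-07-28 00:00", "2020-07-30 00:00",
    "2020-07-31 12:00", "2020-08-02 00:00", "2020-08-03 00:00",
    "2020-08-05 00:00"),
  tz = "UTC"
)

# Number each consecutive run of equal values: 1 1 2 2 3 3 3
runs <- rle(state)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Days elapsed since the first observation of each run
# (as.numeric() on POSIXct gives seconds, 86400 seconds per day)
days_in_state <- ave(as.numeric(obs), run_id,
                     FUN = function(t) (t - min(t)) / 86400)

data.frame(state, run_id, days_in_state)
```

The second block of 1's gets run id 3 rather than 1, so its elapsed days restart from zero.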
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

Calculate the mean of values that fall between 2 dates

I have 2 dataframes. One is a list of occasional events. It has a date column and a column of values.
df1 = data.frame(date = c(as.Date('2020-01-01'), as.Date('2020-02-02'), as.Date('2020-03-01')),
value = c(1,5,9))
I have another data frame that is a daily record. It too has a date column and a column of values.
set.seed(1)
df2 = data.frame(date = seq.Date(from = as.Date('2020-01-01'), to = as.Date('2020-04-01'), by = 1),
value = rnorm(92))
I want to create a new column in df1 that is the mean of df2$value from the current row date to the subsequent date value (non inclusive of the second value, so in this example, the first new value would be the mean of values from df2 of row 1 through row 32, where row 33 is the row that matches df1$date[2]). The resultant data frame would look like the following:
date value value_new
1 2020-01-01 1 0.1165512
2 2020-02-02 5 0.0974052
3 2020-03-01 9 0.1241778
But I have no idea how to specify that. Also I would prefer the last value to be the mean of whatever data is beyond the last value of df1$date, but I would also accept an NA.
We can join df2 with df1, fill the NA values with the previous values, and get the mean of the value_new column for each group.
library(dplyr)
df2 %>%
rename(value_new = value) %>%
left_join(df1, by = 'date') %>%
tidyr::fill(value) %>%
group_by(value) %>%
summarise(date = first(date),
value_new = mean(value_new))
# A tibble: 3 x 3
# value date value_new
# <dbl> <date> <dbl>
#1 1 2020-01-01 0.117
#2 5 2020-02-02 0.0974
#3 9 2020-03-01 0.124
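A base R sketch of the same idea using findInterval(), which assigns each daily row to the most recent event date. It assumes df1$date is sorted and that every interval contains at least one row of df2; the name interval_id is hypothetical.

```r
df1 <- data.frame(date = as.Date(c("2020-01-01", "2020-02-02", "2020-03-01")),
                  value = c(1, 5, 9))
set.seed(1)
df2 <- data.frame(date = seq.Date(from = as.Date("2020-01-01"),
                                  to = as.Date("2020-04-01"), by = 1),
                  value = rnorm(92))

# For each daily date, the index of the last df1$date that is <= that date
interval_id <- findInterval(as.numeric(df2$date), as.numeric(df1$date))

# Mean of the daily values within each interval, in df1 row order
df1$value_new <- as.numeric(tapply(df2$value, interval_id, mean))
df1
```

The last interval runs from the final event date to the end of df2, which matches the preferred behaviour described in the question.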

map/dplyr way for dynamically populating two columns in dataframe

I have the dataframe 'test' as shown at very bottom below.
I have 2 different operations I'd like to complete on two different columns and would like to use an efficient dplyr or purrr method to resolve, if possible.
Operation #1:
I'd like to populate the 'amt_needed' NA values with the two values from 'remaining' above them (this is a test dataframe, but in the actual version I'll have more rows, and each time I'd like the two 'amt_needed' values to equal the two values from 'remaining' in the two rows above).
Operation #2:
The two NA values for 'remaining' should be the new 'amt_needed' values - sum(contrib) for both a and b.
Any thoughts/suggestions appreciated!
test <- data.frame(date = c("2018-01-01", "2018-01-01", "2018-01-15", "2018-01-15"),
name = c("a","b","a","b"),
contrib = c(4,2,4,2),
amt_needed = c(100,100, NA,NA),
remaining = c(94,94, NA,NA))
Based on new data provided in OP, one solution using dplyr could be :
library(dplyr)
# Data
test <- data.frame(date = c("2018-01-01", "2018-01-01", "2018-01-15", "2018-01-15", "2018-01-30", "2018-01-30"),
name = c("a","b","a","b", "a","b"),
contrib = c(4,2,4,2,4,2),
amt_needed = c(100,100, NA,NA, NA,NA),
remaining = c(94,94, NA,NA, NA,NA))
# Change column to date
test$date <- as.Date(test$date, "%Y-%m-%d")
test$amt_needed <- test$amt_needed[1]
test %>%
arrange(date, name) %>%
group_by(date) %>%
mutate(group_contrib = sum(contrib)) %>%
ungroup() %>%
select(date, group_contrib) %>%
unique() %>%
arrange(date) %>%
mutate(cumm_group_sum = cumsum(group_contrib)) %>%
inner_join(test, by = "date") %>%
mutate(remaining = amt_needed - cumm_group_sum) %>%
mutate(amt_needed_act = remaining + group_contrib) %>%
select(date, name, contrib, amt_needed_act, remaining)
# A tibble: 6 x 5
date name contrib amt_needed_act remaining
<date> <fctr> <dbl> <dbl> <dbl>
1 2018-01-01 a 4.00 100 94.0
2 2018-01-01 b 2.00 100 94.0
3 2018-01-15 a 4.00 94.0 88.0
4 2018-01-15 b 2.00 94.0 88.0
5 2018-01-30 a 4.00 88.0 82.0
6 2018-01-30 b 2.00 88.0 82.0
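The arithmetic behind that pipeline is a running balance: remaining is the starting amount minus the cumulative contributions up to and including each date, and amt_needed is the balance just before that date's contributions. A base R sketch, assuming the dates are ISO-formatted strings (so they sort chronologically) and that the starting amount is the first amt_needed value; start_amt, per_date and cum_contrib are hypothetical names.

```r
test <- data.frame(
  date = c("2018-01-01", "2018-01-01", "2018-01-15", "2018-01-15",
           "2018-01-30", "2018-01-30"),
  name = c("a", "b", "a", "b", "a", "b"),
  contrib = c(4, 2, 4, 2, 4, 2),
  stringsAsFactors = FALSE
)
start_amt <- 100  # assumed starting amount, taken from the first amt_needed

# Total contribution per date, then the running total across dates
per_date <- sapply(split(test$contrib, test$date), sum)
cum_contrib <- cumsum(per_date)

# Look up each row's date in the named vectors
test$remaining <- start_amt - unname(cum_contrib[test$date])
test$amt_needed <- test$remaining + unname(per_date[test$date])
test
```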
