Padding around dates in R to add missing/blank months? - r

The padr R package vignette describes several functions for padding dates and times.
I often tally events in data frames (i.e. with dplyr::count()) and need to plot the occurrences over a period of, say, one year. When I count the events in a low-volume data frame I often get single-row results, like this:
library(tidyverse)
library(lubridate)
library(padr)
df <- tibble(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)
#> # A tibble: 1 x 3
#> col1 col2 col3
#> <date> <chr> <dbl>
#> 1 2018-10-01 g 5
To plot this with ggplot, over a period of a year, on a monthly basis, requires a data frame of 12 rows. It basically needs to look like this:
#> # A tibble: 12 x 3
#> col1 col2 col3
#> <date> <chr> <dbl>
#> 1 2018-01-01 NA 0
#> 2 2018-02-01 NA 0
#> 3 2018-03-01 NA 0
#> 4 2018-04-01 NA 0
#> 5 2018-05-01 NA 0
#> 6 2018-06-01 NA 0
#> 7 2018-07-01 NA 0
#> 8 2018-08-01 NA 0
#> 9 2018-09-01 NA 0
#> 10 2018-10-01 g 5
#> 11 2018-11-01 NA 0
#> 12 2018-12-01 NA 0
Perhaps padr can do this with some combination of the thicken() and pad() functions. My attempts are shown below; neither line 3 nor line 4 constructs the data frame shown directly above.
How do I construct the data frame directly above, using padr, lubridate, the tidyverse, data.table, base R, or any way you please? Manually entering each month is not an acceptable solution, if that needs to be said. Thank you.
df %>%
thicken("year") %>%
# pad(by = "col1") %>% # line 3
# pad(by = "col1_year") %>% # line 4
print()

library(lubridate)
library(tidyverse)
df <- tibble(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)
my_year <- year(df$col1[1])
df2 <- tibble(col1 = seq(ymd(paste0(my_year,'-01-01')),ymd(paste0(my_year,'-12-01')), by = '1 month'))
df3 <- merge(df,df2, by ="col1",all.y=TRUE) %>% mutate(col3 = replace_na(col3,0))
df3
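The same padding can also be sketched with tidyr::complete() instead of a merge. This is only an illustration under the assumption that the target range is the calendar year of the earliest date in col1; the names year_start and padded are made up for the example.

```r
library(dplyr)
library(tidyr)
library(lubridate)

df <- tibble(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)

# Assumption: pad to the calendar year of the earliest observed date.
year_start <- floor_date(min(df$col1), unit = "year")

padded <- df %>%
  # Expand col1 to the first day of each of the 12 months in that year;
  # rows that did not exist get col3 = 0 and col2 = NA.
  complete(col1 = seq(year_start, by = "month", length.out = 12),
           fill = list(col3 = 0))

padded
```

Because the month sequence is derived from the data, nothing has to be typed in by hand when the year changes.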

Related

How to count exact matches across two data frames within IDs in R

I have two datasets similar to the one below (but with 4m observations) and I want to count the number of matching sample days between the two data frames (see example below).
DF1
ID date
1 1992-10-15
1 2010-02-17
2 2019-09-17
2 2015-08-18
3 2020-10-27
3 2020-12-23
DF2
ID date
1 1992-10-15
1 2001-04-25
1 2010-02-17
3 1990-06-22
3 2014-08-18
3 2020-10-27
Expected output
ID Count
1 2
2 0
3 1
I have tried the aggregate function (though I am unsure what to put in "which"):
test <- aggregate(date~ID, rbind(DF1, DF2), length(which(exact?)))
and the table function:
Y<-table(DF1$ID)
X <- table(DF2$ID)
Y2 <- DF1[Y %in% X,]
I am having trouble finding an example to help my situation.
Your help is appreciated!
In base R:
data.frame(table(factor(merge(df1,df2)$ID, unique(df1$ID))))
Var1 Freq
1 1 2
2 2 0
3 3 1
Using tidyverse
library(dplyr)
library(tidyr)
inner_join(df1, df2) %>%
complete(ID = unique(df1$ID)) %>%
reframe(Freq = sum(!is.na(date)), .by = "ID")
-output
# A tibble: 3 × 2
ID Freq
<int> <int>
1 1 2
2 2 0
3 3 1
Here is one way to do it with 'dplyr' and 'tidyr':
library(dplyr)
library(tidyr)
DF1 %>%
semi_join(DF2) %>%
count(ID) %>%
complete(ID = DF1$ID,
fill = list(n = 0))
#> Joining with `by = join_by(ID, date)`
#> # A tibble: 3 × 2
#> ID n
#> <dbl> <int>
#> 1 1 2
#> 2 2 0
#> 3 3 1
data
DF1 <- tibble(ID = c(1,1,2,2,3,3),
date = c("1992-10-15", "2010-02-17", "2019-09-17",
"2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- tibble(ID = c(1,1,1,3,3,3),
date = c("1992-10-15", "2001-04-25", "2010-02-17",
"1990-06-22", "2014-08-18", "2020-10-27"))
Created on 2023-02-16 with reprex v2.0.2
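For readers who want the merge-based idea as a self-contained script, here is a minimal base R sketch using the sample data from the question. The names matched, counts and result are illustrative only, and the sketch assumes that each (ID, date) pair occurs at most once per data frame; duplicated pairs would be multiplied by merge().

```r
DF1 <- data.frame(ID = c(1, 1, 2, 2, 3, 3),
                  date = c("1992-10-15", "2010-02-17", "2019-09-17",
                           "2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- data.frame(ID = c(1, 1, 1, 3, 3, 3),
                  date = c("1992-10-15", "2001-04-25", "2010-02-17",
                           "1990-06-22", "2014-08-18", "2020-10-27"))

# Keep only the rows whose ID and date appear in both data frames.
matched <- merge(DF1, DF2, by = c("ID", "date"))

# Tabulate over every ID in DF1 so that IDs without a match count as 0.
counts <- table(factor(matched$ID, levels = unique(DF1$ID)))

result <- data.frame(ID = as.numeric(names(counts)),
                     Count = as.vector(counts))
result
```

Setting the factor levels from DF1 is what keeps ID 2 in the result even though it has no matching rows.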

Including missing values in summarise output

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9))
I then need to summarize by id, with one row per id and one column each for the first and second seq_num. Even if the second one doesn't exist, I'd still like that row to be kept, with an NA in that slot. I've tried the approaches in this answer, but they are not working.
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
outputs
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.
How can I do this?
If there is at most one observation per 'seq_num' for each 'id', then indexing with [1] coerces the empty result to NA where there are no cases:
library(dplyr)
dat %>%
group_by(id) %>%
summarise(seq_0_time = time[seq_num ==0][1],
seq_1_time = time[seq_num == 1][1], .groups = 'drop')
-output
# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA
An empty (length-0) vector becomes a length-1 vector holding NA when it is indexed with [1]. Similarly, indexing with 1:2, 1:3, etc. returns that many NA values for the positions that do not exist:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA
This can actually be pretty easily solved using reshape.
> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
id time.0 time.1 time.2
1 1 4 5 NA
3 2 6 7 8
6 3 9 NA NA
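A tidyr counterpart to reshape() is pivot_wider(). The following is a sketch rather than part of the original answer; the seq_ column prefix and the name wide are choices made for this example.

```r
library(tidyr)

dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9))

# One row per id, one column per seq_num; combinations that never
# occurred are filled with NA automatically.
wide <- pivot_wider(dat,
                    names_from = seq_num,
                    values_from = time,
                    names_prefix = "seq_")
wide
```

As with reshape(), every id keeps its row, so id 3 appears with NA in the columns for seq_num 1 and 2.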
My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9)) %>%
complete(seq_num = seq_num,
id = id)
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#> id seq_0_time seq_1_time
#> <dbl> <dbl> <dbl>
#> 1 1 4 5
#> 2 2 6 7
#> 3 3 9 NA
Created on 2022-04-20 by the reprex package (v2.0.1)

if_else with sequence of conditions

I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare my dates in df to those comparison dates and do something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
mutate(new_var = case_when(date < comparison[1] ~ 1,
date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require manually writing out all the sequential conditions, and I'm wondering whether I could automate it. I thought about creating a match lookup first (i.e. take the comparison vector, then add the respective new_var number (i.e. 1, 2, and so on)) and then matching it against my data, but I only know how to do that for exact matches and don't know how to add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
You can use findInterval as follows:
df %>% mutate(new_var = findInterval(date, comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
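To see why findInterval() reproduces the chained "smaller than" conditions, here is a small self-contained sketch with a fixed set of cutoffs, so the result does not depend on today(). The names dates, cutoffs and new_var are chosen for illustration.

```r
dates <- as.Date(c("2019-11-20", "2020-03-31", "2020-04-01", "2020-04-15"))

# Month ends from 2019-11-30 to 2020-05-31, built like the
# comparison vector in the question.
cutoffs <- seq(as.Date("2019-12-01"), as.Date("2020-06-01"), by = "month") - 1

# findInterval() counts how many cutoffs are <= each date, so adding 1
# gives the index of the first cutoff that is strictly greater.
new_var <- findInterval(as.numeric(dates), as.numeric(cutoffs)) + 1
new_var
```

A date that falls exactly on a cutoff, such as 2020-03-31 here, is not "smaller than" that cutoff and therefore moves on to the next index, which matches what the case_when() chain would do.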

Create multiple new dataframes based on rows in another dataframe with a for loop in r

I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Create a new data.frame for every row in df
2. Names the new dataframe with a combination of "ID" and "Type" (A_1, A_2, ... , C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
I have some things that somehow complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about list, and that it may help to work with a list in a for loop and then transform the list back into a dataframe. Sorry for my limited approach, I hope anyone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
gather(Year, Values, 3:5) %>%
mutate(Year = str_sub(Year, 2)) %>%
select(ID, Year, Values) %>%
group_split(ID) # split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
gather(Year, Values, 3:5) %>%
select(ID, Year, Values) %>%
group_split(ID)
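Neither snippet above names the pieces A_1, A_2, ..., C_6 as the question asked. One possible sketch, assuming the year columns are exactly those starting with "X", builds a named list instead of separate objects; the names long, key and frames are hypothetical.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Type = c("A", "A", "B", "B", "C", "C"),
                 X2019 = c(1, 2, 3, 4, 5, 6),
                 X2020 = c(2, 3, 4, 5, 6, 7),
                 X2021 = c(3, 4, 5, 6, 7, 8))

long <- df %>%
  # Select the year columns by prefix so the code keeps working when
  # the years in the data change.
  pivot_longer(cols = starts_with("X"), names_to = "Year",
               names_prefix = "X", values_to = "Values") %>%
  mutate(Year = as.integer(Year),
         key = paste(Type, ID, sep = "_"))

# One data frame per original row, named Type_ID.
frames <- split(long[c("Year", "Values")], long$key)

frames$A_1
```

Keeping the pieces in a list means they can be looped over with lapply(); if separate objects are really needed, list2env(frames, envir = globalenv()) would create them.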

Filling missing dates in a grouped time series - a tidyverse-way?

I have a data.frame that contains a time series and one or more grouping fields, so we have several time series, one for each combination of grouping values.
But some dates are missing.
So, what's the easiest (i.e. the most "tidyverse") way of adding these dates with the right grouping values?
Normally I would say I generate a data.frame with all dates and do a full_join with my time series. But now we have to do it for each combination of grouping values -- and fill in the grouping values.
Let's look at an example:
First I create a data.frame with missing values:
library(dplyr)
library(lubridate)
set.seed(1234)
# Time series should run from 2017-01-01 to 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1 <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2 <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)
# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
select(date, d1, d2)
# and add two value columns
df$v1 <- runif(200)
df$v2 <- runif(200)
# group by the dimension columns
df <- df %>%
group_by(d1, d2)
# create missing dates
df.missing <- df %>%
filter(v1 <= 0.8)
# So 2017-01-01 and 2017-01-10 are now missing for A, 5
df.missing %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 8 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
So to add the missing dates I generate a data.frame with all dates:
start <- min(df.missing$date)
end <- max(df.missing$date)
all.dates <- data.frame(date=seq.Date(start, end, by="day"))
Now I want to do something like this (remember: df.missing is group_by(d1, d2)):
df.missing %>%
do(my_join())
So let's define my_join():
my_join <- function(data) {
# get value of both dimensions
d1.set <- data$d1[[1]]
d2.set <- data$d2[[1]]
tmp <- full_join(data, all.dates) %>%
# First we need to ungroup. Otherwise we can't change d1 and d2 because they are grouping variables
ungroup() %>%
mutate(
d1 = d1.set,
d2 = d2.set
) %>%
group_by(d1, d2)
return(tmp)
}
Now we can call my_join() for each combination and have a look at "A/5"
df.missing %>%
do(my_join(.)) %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 10 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
9 2017-01-01 A 5 NA NA
10 2017-01-10 A 5 NA NA
Great! That's what we were looking for.
But we need to define d1 and d2 in my_join and it feels a little bit clumsy.
So, is there any tidyverse-way of this solution?
P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e
tidyr has some great tools for these sorts of problems. Take a look at complete.
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
ungroup() %>%
complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
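Because df.missing depends on random numbers, a tiny deterministic sketch may make the role of nesting() easier to see. The data frame obs and its columns g1, g2 and v are invented for this illustration.

```r
library(dplyr)
library(tidyr)

obs <- tibble(g1 = c("A", "A", "B"),
              g2 = c(1, 1, 2),
              date = as.Date(c("2017-01-01", "2017-01-03", "2017-01-02")),
              v = c(10, 20, 30))

filled <- obs %>%
  # nesting() keeps only the (g1, g2) pairs that occur in the data,
  # here (A, 1) and (B, 2), and crosses them with the full date range.
  complete(nesting(g1, g2),
           date = seq(min(date), max(date), by = "day"))

filled
```

Writing complete(g1, g2, date = ...) without nesting() would also create the pairs (A, 2) and (B, 1), which never appear in the data.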
The fill_gaps() function from the tsibble package should do the job easily.
library(tsibble)
df.missing %>%
# tsibble format
as_tsibble(key = c(d1, d2), index = date) %>%
# fill gaps
fill_gaps(.full = TRUE)
Here's a tidyverse way starting with df.missing
library(tidyverse)
ans <- df.missing %>%
nest(date) %>%
mutate(data = map(data, ~seq.Date(start, end, by="day"))) %>%
unnest(data) %>%
rename(date = data) %>%
left_join(., df.missing, by=c("date","d1","d2"))
ans %>% filter(d1 == "A" & d2 == 5)
Output
d1 d2 date v1 v2
<fctr> <dbl> <date> <dbl> <dbl>
1 A 5 2017-01-01 NA NA
2 A 5 2017-01-02 0.21879954 0.1335497
3 A 5 2017-01-03 0.32977018 0.9802127
4 A 5 2017-01-04 0.23902573 0.1206089
5 A 5 2017-01-05 0.19617465 0.7378315
6 A 5 2017-01-06 0.13373890 0.9493668
7 A 5 2017-01-07 0.48613541 0.3392834
8 A 5 2017-01-08 0.35698708 0.3696965
9 A 5 2017-01-09 0.08498474 0.8354756
10 A 5 2017-01-10 NA NA
-------------------------------------------------------------------------------------------------
Here's an alternative approach that uses expand.grid and dplyr verbs
with(df.missing, expand.grid(unique(date), unique(d1), unique(d2))) %>%
setNames(c("date", "d1", "d2")) %>%
left_join(., df.missing, by=c("date","d1","d2"))
output (head)
date d1 d2 v1 v2
1 2017-01-01 A 1 0.113703411 0.660754634
2 2017-01-02 A 1 0.316612455 0.422330675
3 2017-01-03 A 1 0.553333591 0.424109178
4 2017-01-04 A 1 NA NA
5 2017-01-05 A 1 NA NA
6 2017-01-06 A 1 0.035456727 0.352998502
Here read.zoo creates a wide-form zoo object, and to that we merge the dates. Then we convert that back to a long data frame using fortify.zoo and spread out v1 and v2 using spread.
Note that:
if we can assume that each date appears in at least one combination of the split variables, i.e. sort(unique(df.missing$date)) contains all the dates, then we could omit the merge line and no joins would have to be done at all. The test data df.missing shown in the question does have this property:
all(all.dates$date %in% df.missing$date)
## [1] TRUE
we could stop after the merge (or after read.zoo if each date is present at least once as in prior point) if a wide form zoo object can be used as that already has all the dates.
In the code below the line marked ### can be omitted with the development version of zoo (1.8.1):
library(dplyr)
library(tidyr)
library(zoo)
split.vars <- c("d1", "d2")
df.missing %>%
as.data.frame %>% ###
read.zoo(split = split.vars) %>%
merge(zoo(, seq(start(.), end(.), "day"))) %>%
fortify.zoo(melt = TRUE) %>%
separate(Series, c("v", split.vars)) %>%
spread(v, Value)
Update: Note the simplification in zoo 1.8.1.
