How to generate a sequence of monthly dates from a data frame in R?

Consider the following data frame (df):
"id" "date_start" "date_end"
a 2012-03-11 2012-03-27
a 2012-05-17 2012-07-21
a 2012-06-09 2012-08-18
b 2015-06-21 2015-07-12
b 2015-06-27 2015-08-04
b 2015-07-02 2015-08-01
c 2017-10-11 2017-11-08
c 2017-11-27 2017-12-15
c 2018-01-02 2018-02-03
I am trying to create a new data frame with sequences of monthly dates, starting one month prior to the minimum value of "date_start" for each group in "id". Each sequence should only include the first day of each month and should end at the maximum value of "date_end" for each group in "id".
This is a reproducible example for my data frame:
library(lubridate)
id <- c("a","a","a","b","b","b","c","c","c")
df <- data.frame(id)
df$date_start <- as.Date(c("2012-03-11", "2012-05-17","2012-06-09", "2015-06-21", "2015-06-27","2015-07-02", "2017-10-11", "2017-11-27","2018-01-02"))
df$date_end <- as.Date(c("2012-03-27", "2012-07-21","2012-08-18", "2015-07-12", "2015-08-04","2015-08-01", "2017-11-08", "2017-12-15","2018-02-03"))
What I have tried to do:
library(dplyr)
library(DescTools)
library(timeDate)
df2 <- df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
do(data.frame(id=.$id, date=seq(.$start,.$end,by="1 month")))
The code works perfectly fine for an ungrouped data frame. Somehow, with the grouping by "id" it throws an error message:
Error in seq.default(.$date_start, .$date_end, by = "1 month") :
'from' must be of length 1
This is what the desired output looks like for the data frame given above:
"id" "date"
a 2012-02-01
a 2012-03-01
a 2012-04-01
a 2012-05-01
a 2012-06-01
a 2012-07-01
a 2012-08-01
b 2015-05-01
b 2015-06-01
b 2015-07-01
b 2015-08-01
c 2017-09-01
c 2017-10-01
c 2017-11-01
c 2017-12-01
c 2018-01-01
c 2018-02-01
Is there a way to alter the code to function with a grouped data frame? Is there an altogether different approach for this operation?

Another option using dplyr and lubridate is to first summarise a list of Date objects for each id and then unnest them to expand them into different rows.
library(dplyr)
library(lubridate)
df %>%
group_by(id) %>%
summarise(date = list(seq(floor_date(min(date_start),unit = "month") - months(1),
floor_date(max(date_end), unit = "month"), by = "month"))) %>%
tidyr::unnest(date)
# id date
# <fct> <date>
# 1 a 2012-02-01
# 2 a 2012-03-01
# 3 a 2012-04-01
# 4 a 2012-05-01
# 5 a 2012-06-01
# 6 a 2012-07-01
# 7 a 2012-08-01
# 8 b 2015-05-01
# 9 b 2015-06-01
#10 b 2015-07-01
#11 b 2015-08-01
#12 c 2017-09-01
#13 c 2017-10-01
#14 c 2017-11-01
#15 c 2017-12-01
#16 c 2018-01-01
#17 c 2018-02-01
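The same result can be sketched in base R, without dplyr, lubridate or tidyr. This is only an illustration: first_of_month and month_seq are hypothetical helper names, and the data frame is rebuilt here with the corrected date values so that the block runs on its own.

```r
# Rebuild the example data from the question.
df <- data.frame(
  id = rep(c("a", "b", "c"), each = 3),
  date_start = as.Date(c("2012-03-11", "2012-05-17", "2012-06-09",
                         "2015-06-21", "2015-06-27", "2015-07-02",
                         "2017-10-11", "2017-11-27", "2018-01-02")),
  date_end = as.Date(c("2012-03-27", "2012-07-21", "2012-08-18",
                       "2015-07-12", "2015-08-04", "2015-08-01",
                       "2017-11-08", "2017-12-15", "2018-02-03"))
)

# Hypothetical helper: first day of the month that contains x.
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# Hypothetical helper: monthly sequence for one group, starting one month
# before the earliest start and ending in the month of the latest end.
month_seq <- function(g) {
  first <- first_of_month(min(g$date_start))
  from <- seq(first, by = "-1 month", length.out = 2)[2]
  to <- first_of_month(max(g$date_end))
  data.frame(id = g$id[1], date = seq(from, to, by = "month"))
}

# Apply the helper to each id and stack the pieces.
res <- do.call(rbind, lapply(split(df, df$id), month_seq))
rownames(res) <- NULL
res
```

Stepping back from the first of the month (rather than from the raw date) avoids surprises with month lengths, for example when the earliest date falls on the 31st.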

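In your code, summarize() returns one row per id but the result is no longer grouped, so do() is called once on the whole data frame and .$start and .$end are vectors of length 3, which seq() rejects. Grouping by row number makes do() run once per row and gives the desired result: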
df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
group_by(rn=row_number()) %>%
do(data.frame(id=.$id, date=seq(.$start, .$end, by="1 month"))) %>%
ungroup() %>%
select(-rn)
# A tibble: 17 x 2
id date
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
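The length-1 requirement behind the error message can be reproduced in isolation. The sketch below uses base R only; starts and ends are illustrative vectors holding the per-id values that summarize() would produce for the example data.

```r
# One start and one end per id, as produced by the summarize() step.
starts <- as.Date(c("2012-02-01", "2015-05-01", "2017-09-01"))
ends   <- as.Date(c("2012-08-18", "2015-08-04", "2018-02-03"))

# Passing whole vectors fails: seq() is not vectorised over 'from'.
msg <- tryCatch(
  {
    seq(starts, ends, by = "1 month")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Calling seq() once per start/end pair works.
seqs <- lapply(seq_along(starts),
               function(i) seq(starts[i], ends[i], by = "1 month"))
lengths(seqs)
```

Grouping by row number before do() achieves the same thing as the lapply() call here: each call to seq() sees exactly one start and one end.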

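Use as.yearmon from zoo to convert to year/month. yearmon objects are represented internally as year + fraction, where the fraction is 0 for January, 1/12 for February, 2/12 for March and so on, so subtracting 1/12 moves back one month and a step of 1/12 advances one month. Then use as.Date to convert the result to Date class (the first day of each month). do allows the number of rows in each group to change.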
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
do( data.frame(month = as.Date(seq(as.yearmon(min(.$date_start)) - 1/12,
as.yearmon(max(.$date_end)),
1/12) ))) %>%
ungroup
giving:
# A tibble: 17 x 2
id month
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
This could also be written like this using the same library statements as above:
Seq <- function(st, en) as.Date(seq(as.yearmon(st) - 1/12, as.yearmon(en), 1/12))
df %>%
group_by(id) %>%
do( data.frame(month = Seq(min(.$date_start), max(.$date_end))) ) %>%
ungroup
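The month arithmetic that yearmon provides can also be imitated in base R with a whole-number month index (year * 12 + month) instead of year + fraction. This is a sketch of the idea only, not how zoo is implemented; month_index and index_to_date are hypothetical helpers.

```r
# Hypothetical helper: map a Date to a whole-number month index.
month_index <- function(x) {
  lt <- as.POSIXlt(x)
  (lt$year + 1900L) * 12L + lt$mon   # lt$mon is 0 for January
}

# Hypothetical helper: map a month index back to the first day of that month.
index_to_date <- function(i) {
  as.Date(sprintf("%04d-%02d-01", i %/% 12L, i %% 12L + 1L))
}

st <- as.Date("2017-10-11")
en <- as.Date("2018-02-03")

# Start one month before st and step one month at a time up to en's month.
months <- index_to_date(seq(month_index(st) - 1L, month_index(en)))
months
```

Working with integers avoids any floating-point concerns when stepping by 1/12, at the cost of writing the two conversions by hand.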

Related

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, and remove the temporary grouping column:
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
group_by(id,date_cat) %>%
top_n(1, date1)
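Ranking should use the Date column date1 rather than the text column date, because strings such as "2017-10-1" and "2017-9-1" do not sort chronologically. A base R sketch of the same selection, under the assumption that "last" means the latest date1 within each id/date_cat pair:

```r
myData <- data.frame(
  id = c(3, 3, 6, 6, 4, 4, 3, 3),
  date = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
           "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17)
)
myData$date1 <- as.Date(myData$date)

# Sort so the latest date comes last within each id/date_cat pair,
# then keep only the last row of each pair.
sorted <- myData[order(myData$date_cat, myData$id, myData$date1), ]
last_rows <- sorted[!duplicated(sorted[c("id", "date_cat")], fromLast = TRUE), ]
last_rows
```

Unlike the rleid-based answers, this does not depend on the rows of each group being adjacent in the input.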

Using the r dplyr library to generate aggregate numbers in a new column

I am trying to use dplyr to generate a new column in a data frame, based on the aggregation of values in existing columns. Given my dataframe:
group1 <- c("2019","2019","2019","2018","2018","2017","2017","2017")
group2 <- c("2019-01-01", "2019-01-01","2019-01-01","2018-05-01","2018-06-01","2017-01-01","2017-01-01","2017-02-01")
group3 <- c("A","A","B","A","A","C","C","B")
df <- data.frame("Year" = group1,"Date" = group2,"Sample" = group3)
Gives:
Year Date Sample
1 2019 2019-01-01 A
2 2019 2019-01-01 A
3 2019 2019-01-01 B
4 2018 2018-05-01 A
5 2018 2018-06-01 A
6 2017 2017-01-01 C
7 2017 2017-01-01 C
8 2017 2017-02-01 B
So I'd like to generate new column "Count", that for each row gives the total number of unique dates per sample. So for the above data, I would expect the results to be:
Year Date Sample Count
1 2019 2019-01-01 A 1
2 2019 2019-01-01 A 1
3 2019 2019-02-01 B 1
4 2018 2018-05-01 A 2
5 2018 2018-06-01 C 2
6 2017 2017-01-01 C 1
7 2017 2017-01-01 C 1
8 2017 2017-02-01 B 1
I've tried using the following code in r:
df %>%
group_by(Year) %>%
group_by(Sample) %>%
group_by(Date) %>%
mutate(Count = n_distinct(Date))
But I'm not getting the correct answer!
You could try:
library(dplyr)
df %>%
group_by(Year, Sample) %>%
mutate(Count = n_distinct(Date))
If you want to group by several variables, pass them to a single group_by() call. By default each new group_by() call replaces the existing grouping rather than adding to it, so in your pipeline only group_by(Date) was in effect.
Moreover, if you'd like to count unique dates, you shouldn't group by them.
The above code would give:
# A tibble: 8 x 4
# Groups: Year, Sample [5]
Year Date Sample Count
<fct> <fct> <fct> <int>
1 2019 2019-01-01 A 1
2 2019 2019-01-01 A 1
3 2019 2019-01-01 B 1
4 2018 2018-05-01 A 2
5 2018 2018-06-01 A 2
6 2017 2017-01-01 C 1
7 2017 2017-01-01 C 1
8 2017 2017-02-01 B 1
Note that there is a mismatch between your generated data frame and the one you show us. The data frame generated by your code is:
Year Date Sample
1 2019 2019-01-01 A
2 2019 2019-01-01 A
3 2019 2019-01-01 B
4 2018 2018-05-01 A
5 2018 2018-06-01 A
6 2017 2017-01-01 C
7 2017 2017-01-01 C
8 2017 2017-02-01 B
Where indeed the only Sample with 2 distinct Dates in a given Year is A (in 2018).
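For comparison, the per-row count of distinct dates within each Year/Sample group can be sketched in base R with ave(). The data frame is rebuilt from the question's code so the block is self-contained; converting Date to integer codes is just a convenience so that ave() returns numbers.

```r
df <- data.frame(
  Year = c("2019", "2019", "2019", "2018", "2018", "2017", "2017", "2017"),
  Date = c("2019-01-01", "2019-01-01", "2019-01-01", "2018-05-01",
           "2018-06-01", "2017-01-01", "2017-01-01", "2017-02-01"),
  Sample = c("A", "A", "B", "A", "A", "C", "C", "B"),
  stringsAsFactors = FALSE
)

# Integer code per distinct date, so ave() works on a numeric vector.
date_code <- as.integer(factor(df$Date))

# For every row, count the distinct dates within its Year/Sample group.
df$Count <- ave(date_code, df$Year, df$Sample,
                FUN = function(x) length(unique(x)))
df
```

As in the dplyr version, the dates themselves are not part of the grouping; they are the values being counted.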

Fill in missing cases till specific condition per group

I'm attempting to create a data frame that shows all of the in between months for my data set, by subject. Here is an example of what the data looks like:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
dat$Out.Of.Study <- c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
dat
id value date Out.Of.Study
1 1 30 2017-01-01
2 1 30 2017-02-01
3 1 25 2017-04-01 Out
4 2 25 2017-02-01 Out
5 3 25 2017-01-01
6 3 25 2017-02-01
7 3 25 2017-03-01 Out
8 4 20 2017-01-01
9 4 20 2017-02-01
10 4 20 2017-04-01 Out
If I want to show the in between months where no data was collected (but the subject was still enrolled in the study) I can use the complete() function. However, the issue is that I get all missing months for each subject id based on the min and max month identified in the data set:
## Add Dates by Group
library(tidyr)
complete(dat, id, date)
id date value Out.Of.Study
1 1 2017-01-01 30
2 1 2017-02-01 30
3 1 2017-03-01 NA <NA>
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA <NA>
6 2 2017-02-01 25 Out
7 2 2017-03-01 NA <NA>
8 2 2017-04-01 NA <NA>
9 3 2017-01-01 25
10 3 2017-02-01 25
11 3 2017-03-01 25 Out
12 3 2017-04-01 NA <NA>
13 4 2017-01-01 20
14 4 2017-02-01 20
15 4 2017-03-01 NA <NA>
16 4 2017-04-01 20 Out
The issue with this is that I don't want the missing months to exceed the subject's final observed month (essentially, I have subjects who are censored and would need to be removed from the study) or show up prior to the month a subject started the study. For example, subject 2 was only a participant in the month '2017-02-01'. Therefore, I'd like the data to show that this was the only month they were in the study, without the extra months after and the extra month before, as shown above. The same is the case with subject 3, who has an extra month, even though they are out of the study.
Perhaps the complete() isn't the best way to go about this?
This can be solved by creating a sequence of months individually for each id and by joining the sequences with dat to complete the missing months.
1. data.table
(The question is tagged with tidyr. But as I am more acquainted with data.table I have tried this first.)
library(data.table)
# coerce date strings to class Date
setDT(dat)[, date := as.Date(date)]
# create sequence of months for each id
sdt <- dat[, .(date = seq(min(date), max(date), "month")), by = id]
# join
dat[sdt, on = .(id, date)]
id value date Out.Of.Study
1: 1 30 2017-01-01
2: 1 30 2017-02-01
3: 1 NA 2017-03-01 <NA>
4: 1 25 2017-04-01 Out
5: 2 25 2017-02-01 Out
6: 3 25 2017-01-01
7: 3 25 2017-02-01
8: 3 25 2017-03-01 Out
9: 4 20 2017-01-01
10: 4 20 2017-02-01
11: 4 NA 2017-03-01 <NA>
12: 4 20 2017-04-01 Out
Note that there is only one row for id == 2 as requested by the OP.
This approach requires coercing date from factor to class Date so that all missing months are completed.
This is also safer than relying on the date values that happen to be available in the dataset. For illustration, let's assume that id == 4 is Out in month 2017-06-01 (June) instead of 2017-04-01 (April). Then, there would be no month 2017-05-01 (May) in the whole dataset and the final result would be incomplete.
Without creating the temporary variable sdt the code becomes
library(data.table)
setDT(dat)[, date := as.Date(date)][
dat[, .(date = seq(min(date), max(date), "month")), by = id], on = .(id, date)]
2. tidyr / dplyr
library(dplyr)
library(tidyr)
# coerce date strings to class Date
dat <- dat %>%
mutate(date = as.Date(date))
dat %>%
# create sequence of months for each id
group_by(id) %>%
expand(date = seq(min(date), max(date), "month")) %>%
# join to complete the missing month for each id
left_join(dat, by = c("id", "date"))
# A tibble: 12 x 4
# Groups: id [?]
id date value Out.Of.Study
<dbl> <date> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-02-01 25 Out
6 3 2017-01-01 25 ""
7 3 2017-02-01 25 ""
8 3 2017-03-01 25 Out
9 4 2017-01-01 20 ""
10 4 2017-02-01 20 ""
11 4 2017-03-01 NA NA
12 4 2017-04-01 20 Out
There is a variant which does not update dat:
library(dplyr)
library(tidyr)
dat %>%
mutate(date = as.Date(date)) %>%
right_join(group_by(., id) %>%
expand(date = seq(min(date), max(date), "month")),
by = c("id", "date"))
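The sequence-then-join idea also works in base R. The following is a sketch rather than a drop-in replacement: dat is rebuilt here with date already of class Date, and grid and filled are illustrative names.

```r
dat <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01", "2017-02-01",
                   "2017-01-01", "2017-02-01", "2017-03-01", "2017-01-01",
                   "2017-02-01", "2017-04-01")),
  Out.Of.Study = c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
)

# One monthly sequence per id, bounded by that id's first and last month.
grid <- do.call(rbind, lapply(split(dat, dat$id), function(g) {
  data.frame(id = g$id[1],
             date = seq(min(g$date), max(g$date), by = "month"))
}))

# Left join: months without an observation get NA in the other columns.
filled <- merge(grid, dat, by = c("id", "date"), all.x = TRUE)
filled
```

Because each sequence is bounded by the subject's own first and last date, no rows appear before enrolment or after the subject leaves the study.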
I would still use complete (probably the right method here), but afterwards drop the rows that come after the row marked "Out". You can do this with dplyr::between.
dat %>%
group_by(id) %>%
complete(date) %>%
# Filter rows that are between 1 and the one that has "Out"
filter(between(row_number(), 1, which(Out.Of.Study == "Out")))
id date value Out.Of.Study
<dbl> <fct> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA NA
6 2 2017-02-01 25 Out
7 3 2017-01-01 25 ""
8 3 2017-02-01 25 ""
9 3 2017-03-01 25 Out
10 4 2017-01-01 20 ""
11 4 2017-02-01 20 ""
12 4 2017-03-01 NA NA
13 4 2017-04-01 20 Out

How can I fill missing data points in R for a given dataframe

I have a dataframe which contains dates, products and amounts. However product b is not on every date, I would like it to be with an NA or 0 balance. Is this possible?
Summary_Date <-
as.Date(c("2017-01-31",
"2017-02-28",
"2017-03-31",
"2017-03-31",
"2017-04-30",
"2017-05-31",
"2017-05-31",
"2017-06-30"))
Product <-
as.character(c("a","a","a","b","a","a","b","a"))
Amounts <-
as.numeric(c(10,10,10,20,10,10,20,10))
df <- data.frame(Summary_Date,Product,Amounts)
Regards,
Aksel
You can use tidyr:
> library(tidyr)
> complete(data = df,Summary_Date,Product)
# A tibble: 12 x 3
Summary_Date Product Amounts
<date> <fctr> <dbl>
1 2017-01-31 a 10
2 2017-01-31 b NA
3 2017-02-28 a 10
4 2017-02-28 b NA
5 2017-03-31 a 10
6 2017-03-31 b 20
7 2017-04-30 a 10
8 2017-04-30 b NA
9 2017-05-31 a 10
10 2017-05-31 b 20
11 2017-06-30 a 10
12 2017-06-30 b NA
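If tidyr is not available, a similar result can be sketched in base R by building every date/product combination with expand.grid() and left-joining the data onto it. The names grid, full and full_zero are illustrative; the last step shows the "0 balance" variant mentioned in the question.

```r
df <- data.frame(
  Summary_Date = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31",
                           "2017-03-31", "2017-04-30", "2017-05-31",
                           "2017-05-31", "2017-06-30")),
  Product = c("a", "a", "a", "b", "a", "a", "b", "a"),
  Amounts = c(10, 10, 10, 20, 10, 10, 20, 10),
  stringsAsFactors = FALSE
)

# Every date paired with every product.
grid <- expand.grid(Summary_Date = unique(df$Summary_Date),
                    Product = unique(df$Product),
                    stringsAsFactors = FALSE)

# Left join on the shared columns; missing combinations get NA.
full <- merge(grid, df, all.x = TRUE)

# Optional: treat a missing balance as zero instead of NA.
full_zero <- full
full_zero$Amounts[is.na(full_zero$Amounts)] <- 0
full_zero
```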

"Unnesting" a dataframe in R

I have the following data.frame:
df <- data.frame(id=c(1,2,3),
first.date=as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
second.date=as.Date(c("2015-01-01", "2015-03-01", "2015-06-1")),
third.date=as.Date(c("2016-01-01", "2017-03-01", "2018-06-1")),
fourth.date=as.Date(c("2017-01-01", "2018-03-01", "2019-06-1")))
> df
id first.date second.date third.date fourth.date
1 1 2014-01-01 2015-01-01 2016-01-01 2017-01-01
2 2 2014-03-01 2015-03-01 2017-03-01 2018-03-01
3 3 2014-06-01 2015-06-01 2018-06-01 2019-06-01
Each row represents three timespans; i.e. the time spans between first.date and second.date, second.date and third.date, and third.date and fourth.date respectively.
I would like to, for lack of a better word, unnest the dataframe to obtain this instead:
id StartDate EndDate
1 1 2014-01-01 2015-01-01
2 1 2015-01-01 2016-01-01
3 1 2016-01-01 2017-01-01
4 2 2014-03-01 2015-03-01
5 2 2015-03-01 2017-03-01
6 2 2017-03-01 2018-03-01
7 3 2014-06-01 2015-06-01
8 3 2015-06-01 2018-06-01
9 3 2018-06-01 2019-06-01
I have been playing around with the unnest function from the tidyr package, but I came to the conclusion that I don't think it's what I'm really looking for.
Any suggestions?
You can try tidyr/dplyr as follows:
library(tidyr)
library(dplyr)
df %>% gather(DateType, StartDate, -id) %>% select(-DateType) %>% arrange(id) %>% group_by(id) %>% mutate(EndDate = lead(StartDate))
You can eliminate the last row in each id group by adding:
%>% slice(-4)
To the above pipeline.
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), then melt the dataset to long format, use shift with type='lead' grouped by 'id' and then remove the NA elements.
library(data.table)
na.omit(melt(setDT(df), id.var='id')[, shift(value,0:1, type='lead') , id])
# id V1 V2
#1: 1 2014-01-01 2015-01-01
#2: 1 2015-01-01 2016-01-01
#3: 1 2016-01-01 2017-01-01
#4: 2 2014-03-01 2015-03-01
#5: 2 2015-03-01 2017-03-01
#6: 2 2017-03-01 2018-03-01
#7: 3 2014-06-01 2015-06-01
#8: 3 2015-06-01 2018-06-01
#9: 3 2018-06-01 2019-06-01
The column names can be changed by using either setnames or earlier in the shift step.
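A base R sketch of the same reshaping pairs each date column with the one that follows it. It assumes the date columns are listed in chronological order; date_cols and spans are illustrative names.

```r
df <- data.frame(
  id = c(1, 2, 3),
  first.date  = as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
  second.date = as.Date(c("2015-01-01", "2015-03-01", "2015-06-01")),
  third.date  = as.Date(c("2016-01-01", "2017-03-01", "2018-06-01")),
  fourth.date = as.Date(c("2017-01-01", "2018-03-01", "2019-06-01"))
)

date_cols <- c("first.date", "second.date", "third.date", "fourth.date")
n <- length(date_cols)

# Pair column k with column k + 1 for k = 1..n-1, one block of rows per pair.
spans <- do.call(rbind, lapply(seq_len(n - 1), function(k) {
  data.frame(id = df$id,
             StartDate = df[[date_cols[k]]],
             EndDate = df[[date_cols[k + 1]]])
}))

# Put the spans of each id together, in chronological order.
spans <- spans[order(spans$id, spans$StartDate), ]
rownames(spans) <- NULL
spans
```

Since only adjacent columns are paired, no rows with a missing EndDate are created and nothing has to be dropped afterwards.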
