I have the following data:
df <- data.frame(dt1 = c("2019-05-02", "2019-01-02", "2019-06-02"),
dt2 = c("2019-08-30", "2019-04-24", "2019-12-06") )
> df
dt1 dt2
1 2019-05-02 2019-08-30
2 2019-01-02 2019-04-24
3 2019-06-02 2019-12-06
Here is what I want to do:
i) I want to create factors by binning. For example, for the first row the dates would be binned as 2019-07-31, 2019-06-30, 2019-05-31, so essentially binning by dt2.
ii) I want to count the total number of dates in each bin.
The expected output is:
dt1 dt2 val_count
1 2019-05-02 2019-08-30 3
2 2019-01-02 2019-04-24 3
3 2019-06-02 2019-12-06 6
I found this post relevant.
Note: I do not want to take the difference between the months of the two dates.
Thank you for suggestions.
It's pretty messy, but if you want to count how many month-end dates fall between dt1 and dt2, you may try:
library(lubridate)
library(dplyr)
fd <- paste0(lubridate::year(min(df$dt1, df$dt2)), "-02-01") %>% as.Date()
ld <- paste0(lubridate::year(max(df$dt1, df$dt2))+1, "-01-01") %>% as.Date()
x <- seq.Date(fd, ld, by = "month") - 1
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2]))
dt1 dt2 val_count
<chr> <chr> <int>
1 2019-05-02 2019-08-30 3
2 2019-01-02 2019-04-24 3
3 2019-06-02 2019-12-06 6
Choice of < or <= depends on your purpose.
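For reference, the same count can be sketched in base R without the shared vector x. The helper name count_month_ends and its inclusive argument are made up for this illustration, and the technique (list the days in between, then count those whose next day is the 1st) is a substitute for the code above, not a restatement of it.

```r
# Hypothetical helper: count month-end dates between start and end.
# inclusive = FALSE mirrors the strict `<` comparisons used above.
count_month_ends <- function(start, end, inclusive = FALSE) {
  start <- as.Date(start)
  end <- as.Date(end)
  vapply(seq_along(start), function(i) {
    if (end[i] < start[i]) return(0L)
    days <- seq(start[i], end[i], by = "day")
    if (!inclusive) days <- days[days > start[i] & days < end[i]]
    # A day is a month end when the following day is the 1st.
    sum(format(days + 1, "%d") == "01")
  }, integer(1))
}

dt1 <- c("2019-05-02", "2019-01-02", "2019-06-02")
dt2 <- c("2019-08-30", "2019-04-24", "2019-12-06")
count_month_ends(dt1, dt2)
# [1] 3 3 6
```

Setting inclusive = TRUE corresponds to using <= on both sides, so a dt1 or dt2 that is itself a month end gets counted.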
To also get the total number of days between dt1 and dt2:
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2])) %>%
mutate(dd = as.Date(dt2) - as.Date(dt1))
dt1 dt2 val_count dd
<chr> <chr> <int> <drtn>
1 2019-05-02 2019-08-30 3 120 days
2 2019-01-02 2019-04-24 3 112 days
3 2019-06-02 2019-12-06 6 187 days
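Addendum: if a row has no month end between dt1 and dt2 and you want it to count as 1 instead of 0 (df here includes a fourth row, 2019-06-01 to 2019-06-02):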
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2]),
val_count = ifelse(val_count == 0, 1, val_count)) %>%
mutate(dd = as.Date(dt2) - as.Date(dt1))
dt1 dt2 val_count dd
<chr> <chr> <dbl> <drtn>
1 2019-05-02 2019-08-30 3 120 days
2 2019-01-02 2019-04-24 3 112 days
3 2019-06-02 2019-12-06 6 187 days
4 2019-06-01 2019-06-02 1 1 days
The above solution is indeed kinda messy; a simple one-liner with lubridate will do. Note that it counts whole elapsed months rather than month ends, so the last row gives 0:
df <- data.frame(dt1 = c("2019-05-02", "2019-01-02", "2019-06-02", "2019-06-01"), dt2 = c("2019-08-30", "2019-04-24", "2019-12-06", "2019-06-02") )
df %>%
mutate(val_count = as.period(ymd(dt2) - ymd(dt1)) %/% months(1))
# dt1 dt2 val_count
# 1 2019-05-02 2019-08-30 3
# 2 2019-01-02 2019-04-24 3
# 3 2019-06-02 2019-12-06 6
# 4 2019-06-01 2019-06-02 0
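The two answers measure different things, which a small base R sketch can make visible. The divisor 30.4375 (365.25 / 12 days) is an assumed average month length chosen only for illustration; it is not claimed to be what lubridate does internally. The third date pair is an extra example that is not in the question.

```r
dt1 <- as.Date(c("2019-05-02", "2019-06-01", "2019-01-30"))
dt2 <- as.Date(c("2019-08-30", "2019-06-02", "2019-02-02"))

# Elapsed days divided by an assumed average month length.
elapsed_days <- as.numeric(dt2 - dt1)
approx_months <- elapsed_days %/% 30.4375

# Month ends strictly between the two dates, for comparison.
month_ends <- vapply(seq_along(dt1), function(i) {
  days <- seq(dt1[i], dt2[i], by = "day")
  days <- days[days > dt1[i] & days < dt2[i]]
  sum(format(days + 1, "%d") == "01")
}, integer(1))

data.frame(dt1, dt2, approx_months, month_ends)
```

For the third pair only three days elapse, so the elapsed-month figure is 0, yet one month end (2019-01-31) lies between the dates.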
For instance, suppose I have the following dataframe:
ID<-c("A", "A", "B", "B", "B", "C")
StartDate<-as.Date(c("2018-01-01", "2019-02-05", "2016-04-18", "2020-03-03", "2021-12-13", "2014-03-03"), "%Y-%m-%d")
TermDate<-as.Date(c("2018-02-01", NA, "2016-05-18", "2020-04-03", "2021-12-15", "2014-04-03"), "%Y-%m-%d")
df<-data.frame(ID=ID, StartDate=StartDate, TermDate=TermDate)
ID StartDate TermDate
1 A 2018-01-01 2018-02-01
2 A 2019-02-05 <NA>
3 B 2016-04-18 2016-05-18
4 B 2020-03-03 2020-04-03
5 B 2021-12-13 2021-12-15
6 C 2014-03-03 2014-04-03
What I'm ultimately trying to get is the following:
ID StartDate TermDate
1 A 2018-01-01 <NA>
2 B 2016-04-18 2021-12-15
3 C 2014-03-03 2014-04-03
There are functions first and last in dplyr and data.table that could help here.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(StartDate = first(StartDate),
TermDate = last(TermDate))
# ID StartDate TermDate
#* <chr> <date> <date>
#1 A 2018-01-01 NA
#2 B 2016-04-18 2021-12-15
#3 C 2014-03-03 2014-04-03
With data.table:
library(data.table)
setDT(df)[, .(StartDate = first(StartDate), TermDate = last(TermDate)), ID]
Using min and max instead of first and last removes the need to sort the data first, if it is not already sorted:
df %>% group_by(ID) %>%
summarise(StartDate = min(StartDate),
TermDate = max(TermDate))
# A tibble: 3 x 3
ID StartDate TermDate
* <chr> <date> <date>
1 A 2018-01-01 NA
2 B 2016-04-18 2021-12-15
3 C 2014-03-03 2014-04-03
To see why, suppose your df is ordered like this; first and last then pick the wrong dates for ID A:
> df
ID StartDate TermDate
1 A 2019-02-05 <NA>
2 A 2018-01-01 2018-02-01
3 B 2016-04-18 2016-05-18
4 B 2020-03-03 2020-04-03
5 B 2021-12-13 2021-12-15
6 C 2014-03-03 2014-04-03
df %>% group_by(ID) %>%
summarise(StartDate = first(StartDate),
TermDate = last(TermDate))
# A tibble: 3 x 3
ID StartDate TermDate
* <chr> <date> <date>
1 A 2019-02-05 2018-02-01
2 B 2016-04-18 2021-12-15
3 C 2014-03-03 2014-04-03
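If the intent is "earliest start, and the term date of the latest spell", one option is to sort before taking the first and last rows. The following is a minimal base R sketch under that assumption, using the unsorted data shown above; the object names sorted, first_rows, last_rows and result are only illustrative.

```r
df <- data.frame(
  ID = c("A", "A", "B", "B", "B", "C"),
  StartDate = as.Date(c("2019-02-05", "2018-01-01", "2016-04-18",
                        "2020-03-03", "2021-12-13", "2014-03-03")),
  TermDate = as.Date(c(NA, "2018-02-01", "2016-05-18",
                       "2020-04-03", "2021-12-15", "2014-04-03"))
)

# Sort by ID and start date so that "first" and "last" are well defined.
sorted <- df[order(df$ID, df$StartDate), ]

# First row per ID gives the earliest start; last row gives the final term date.
first_rows <- sorted[!duplicated(sorted$ID), ]
last_rows <- sorted[!duplicated(sorted$ID, fromLast = TRUE), ]

result <- data.frame(
  ID = first_rows$ID,
  StartDate = first_rows$StartDate,
  TermDate = last_rows$TermDate
)
result
```

Unlike max(TermDate), this keeps the NA for ID A because the most recent spell has no term date, not merely because an NA happens to be present in the group.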
We can also do
library(dplyr)
df %>%
group_by(ID) %>%
summarise(StartDate = StartDate[1],
TermDate = TermDate[n()])
Another data.table option
setDT(df)[
,
as.list(
setNames(
data.frame(.SD)[cbind(c(1, .N), c(1, 2))],
names(.SD)
)
), ID
]
gives
ID StartDate TermDate
1: A 2018-01-01 <NA>
2: B 2016-04-18 2021-12-15
3: C 2014-03-03 2014-04-03
Issue
I have a grouped dataframe with overlapping intervals (date as ymd). I want to retain only the largest non-overlapping intervals in each group.
Example data
# Packages
library(tidyverse)
library(lubridate)
# Example data
df <- tibble(
group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
start = as_date(
c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
end = as_date(
c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
mutate(intval = interval(start, end),
intval_length = intval / days(1))
df
#> # A tibble: 9 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC 28
#> 2 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 3 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 4 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 5 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 6 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 7 3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC 6
#> 8 3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC 10
#> 9 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]
df1
#> # A tibble: 6 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 2 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 3 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 4 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 5 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 6 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
Current approach
I found a related question in another thread (see: Find dates within a period interval by group). However, the respective solution identifies all overlapping rows by group. In this way, I can't identify the largest non-overlapping intervals.
df$overlap <- unlist(tapply(df$intval, #loop through intervals
df$group, #grouped by id
function(x) rowSums(outer(x,x,int_overlaps)) > 1))
As an example, consider group 3 in my example data. Here rows 6/7 and 8/9 overlap. With rows 6 and 9 being the largest non-overlapping periods, I would like to remove rows 7 and 8.
I would greatly appreciate it if someone could point me to a solution.
Having searched for related problems on Stack Overflow, I found that the following approaches (here: Collapse and merge overlapping time intervals) and (here: How to flatten / merge overlapping time periods) could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
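The same idea (sort by start, open a new cluster whenever a start lies after the running maximum end, keep the longest interval per cluster) can be sketched in base R. The function name keep_longest is hypothetical, and the sketch keeps exactly one interval per chain of overlaps, which is an assumption that suits this example but may discard too much when long chains overlap only pairwise.

```r
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  start = as.Date(c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01",
                    "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30",
                    "2019-11-20")),
  end = as.Date(c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31",
                  "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10",
                  "2019-12-31"))
)

# Hypothetical helper: keep the longest interval of each overlap cluster.
keep_longest <- function(d) {
  d <- d[order(d$group, d$start), ]
  len <- as.numeric(d$end - d$start)  # interval length in days
  keep <- unlist(lapply(split(seq_len(nrow(d)), d$group), function(idx) {
    s <- as.numeric(d$start[idx])
    e <- as.numeric(d$end[idx])
    # A new cluster starts when a start lies after every earlier end.
    prev_max_end <- cummax(e)[-length(e)]
    cluster <- cumsum(c(TRUE, s[-1] > prev_max_end))
    # Row position of the longest interval within each cluster.
    as.vector(tapply(idx, cluster, function(j) j[which.max(len[j])]))
  }))
  d[sort(unname(keep)), ]
}

result <- keep_longest(df)
result
```

On the example data this keeps six rows, dropping the 28-day, 6-day and 10-day intervals.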
I have a large dataset with thousands of dates in ymd format. I want to split this column into three separate columns for year, month, and day. There are literally thousands of dates, so I am trying to do this with a single piece of code for the entire dataset.
You can use the year(), month(), and day() extractors in lubridate for this. Here's an example:
library('dplyr')
library('tibble')
library('lubridate')
## create some data
df <- tibble(date = seq(ymd(20190101), ymd(20191231), by = '7 days'))
which yields
> df
# A tibble: 53 x 1
date
<date>
1 2019-01-01
2 2019-01-08
3 2019-01-15
4 2019-01-22
5 2019-01-29
6 2019-02-05
7 2019-02-12
8 2019-02-19
9 2019-02-26
10 2019-03-05
# … with 43 more rows
Then mutate df using the relevant extractor function:
df <- mutate(df,
year = year(date),
month = month(date),
day = day(date))
This results in:
> df
# A tibble: 53 x 4
date year month day
<date> <dbl> <dbl> <int>
1 2019-01-01 2019 1 1
2 2019-01-08 2019 1 8
3 2019-01-15 2019 1 15
4 2019-01-22 2019 1 22
5 2019-01-29 2019 1 29
6 2019-02-05 2019 2 5
7 2019-02-12 2019 2 12
8 2019-02-19 2019 2 19
9 2019-02-26 2019 2 26
10 2019-03-05 2019 3 5
# … with 43 more rows
If you only want the new three columns, use transmute() instead of mutate().
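If lubridate is not available, a comparable result can be sketched in base R with format(). This is an alternative technique rather than what the extractors above do internally, and the object names dates and parts are only illustrative.

```r
dates <- as.Date(c("2019-03-18", "2018-10-29"))

# format() returns zero-padded strings; as.integer() turns them into numbers.
parts <- data.frame(
  date = dates,
  year = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day = as.integer(format(dates, "%d"))
)
parts
```

Here all three new columns are integers, whereas the tibble printed above shows year and month as doubles.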
Using lubridate but without having to specify a separator:
library(tidyverse)
df <- tibble(d = c('2019/3/18','2018/10/29'))
df %>%
mutate(
date = lubridate::ymd(d),
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date)
)
Note that you can change the first entry from ymd to fit other formats.
A slightly different tidyverse solution that requires less code could be (with dplyr and lubridate loaded):
Code
tibble(date = "2018-05-01") %>%
mutate_at(vars(date), lst(year, month, day))
Result
# A tibble: 1 x 4
date year month day
<chr> <dbl> <dbl> <int>
1 2018-05-01 2018 5 1
#Data
d = data.frame(date = c("2019-01-01", "2019-02-01", "2012/03/04"))
library(lubridate)
cbind(d,
read.table(header = FALSE,
sep = "-",
text = as.character(ymd(d$date))))
# date V1 V2 V3
#1 2019-01-01 2019 1 1
#2 2019-02-01 2019 2 1
#3 2012/03/04 2012 3 4
OR
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
mutate(date2 = as.character(ymd(date))) %>%
separate(date2, c("year", "month", "day"), "-")
# date year month day
#1 2019-01-01 2019 01 01
#2 2019-02-01 2019 02 01
#3 2012/03/04 2012 03 04
I have a data frame with a specific variable (Var1) and a time variable (Var2).
I would like to calculate the frequency of occurrence (Frequency) of Var1 within a specific time step (let's say 1 min) during a year.
sample dataset:
Var1 <- c(rep("A", 4), rep("B", 3), rep("C", 2))
Var2 <- c("2018-09-01 10:00:00", "2018-09-01 10:00:30", "2018-09-01 10:00:45",
"2018-09-10 22:10:00", "2017-09-05 10:54:30", "2018-12-15 10:00:30",
"2018-12-15 10:01:00", "2017-02-20 17:16:30", "2017-12-20 20:08:56")
df <- data.frame(Var1, Var2)
df$Var2 <- as.POSIXct(df$Var2)
desired output:
Frequency <- c(rep(3, 3), rep(1, 2), rep(2,2), rep(1,2))
dfOut <- data.frame(Var1, Var2, Frequency)
# Var1 Var2 Frequency
#1 A 2018-09-01 10:00:00 3
#2 A 2018-09-01 10:00:30 3
#3 A 2018-09-01 10:00:45 3
#4 A 2018-09-10 22:10:00 1
#5 B 2017-09-05 10:54:30 1
#6 B 2018-12-15 10:00:30 2
#7 B 2018-12-15 10:01:00 2
#8 C 2017-02-20 17:16:30 1
#9 C 2017-12-20 20:08:56 1
You can use lubridate::floor_date to get the minute grouping column that accounts for the date as you are describing. Note that your displayed desired output does not seem to match your comment.
Var1 <- c(rep("A", 4), rep("B", 3), rep("C", 2))
Var2 <- c("2018-09-01 10:00:00", "2018-09-01 10:00:30", "2018-09-01 10:00:45",
"2018-09-10 22:10:00", "2017-09-05 10:54:30", "2018-12-15 10:00:30",
"2018-12-15 10:01:00", "2017-02-20 17:16:30", "2017-12-20 20:08:56")
df <- data.frame(Var1, Var2)
df$Var2 <- as.POSIXct(df$Var2)
library(tidyverse)
library(lubridate)
df %>%
mutate(minute = floor_date(Var2, unit = "minute")) %>%
add_count(Var1, minute)
#> # A tibble: 9 x 4
#> Var1 Var2 minute n
#> <fct> <dttm> <dttm> <int>
#> 1 A 2018-09-01 10:00:00 2018-09-01 10:00:00 3
#> 2 A 2018-09-01 10:00:30 2018-09-01 10:00:00 3
#> 3 A 2018-09-01 10:00:45 2018-09-01 10:00:00 3
#> 4 A 2018-09-10 22:10:00 2018-09-10 22:10:00 1
#> 5 B 2017-09-05 10:54:30 2017-09-05 10:54:00 1
#> 6 B 2018-12-15 10:00:30 2018-12-15 10:00:00 1
#> 7 B 2018-12-15 10:01:00 2018-12-15 10:01:00 1
#> 8 C 2017-02-20 17:16:30 2017-02-20 17:16:00 1
#> 9 C 2017-12-20 20:08:56 2017-12-20 20:08:00 1
Created on 2018-09-11 by the reprex package (v0.2.0).
You can do something like this. Create a new character vector to define the groups, then group by Var1 and the new variable. This doesn't give exactly your desired output because the minutes are defined differently.
library(dplyr)
df %>%
mutate(minute = substring(as.character(Var2), 1, 16)) %>%
group_by(Var1, minute) %>%
mutate(frequency = n())
Here is a data.table approach. You can first create an index that increments whenever the datetime of the next row is more than 1 min after the datetime of the current row. Then, use this as one of the grouping criteria to calculate the frequency.
library(data.table)
setDT(df)[, idx := cumsum(c(0L, Var2[-1L] > Var2[-.N] + 60L)), by=.(Var1)][,
Freq := .N, by=.(Var1, idx)]
output:
Var1 Var2 idx Freq
1: A 2018-09-01 10:00:00 0 3
2: A 2018-09-01 10:00:30 0 3
3: A 2018-09-01 10:00:45 0 3
4: A 2018-09-10 22:10:00 1 1
5: B 2017-09-05 10:54:30 0 1
6: B 2018-12-15 10:00:30 1 2
7: B 2018-12-15 10:01:00 1 2
8: C 2017-02-20 17:16:30 0 1
9: C 2017-12-20 20:08:56 1 1
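The answers differ because they define "within 1 min" differently: fixed calendar minutes versus gaps of at most 60 seconds between consecutive rows. A minimal base R sketch of both definitions follows. It assumes the rows are already sorted by time within each Var1 and fixes the time zone to UTC for reproducibility; the names minute_bin, gap_id, freq_fixed and freq_gap are only illustrative.

```r
Var1 <- c(rep("A", 4), rep("B", 3), rep("C", 2))
Var2 <- as.POSIXct(c("2018-09-01 10:00:00", "2018-09-01 10:00:30",
                     "2018-09-01 10:00:45", "2018-09-10 22:10:00",
                     "2017-09-05 10:54:30", "2018-12-15 10:00:30",
                     "2018-12-15 10:01:00", "2017-02-20 17:16:30",
                     "2017-12-20 20:08:56"), tz = "UTC")

# Definition 1: fixed calendar minutes (like floor_date or substring).
minute_bin <- format(Var2, "%Y-%m-%d %H:%M")
freq_fixed <- ave(seq_along(Var1), Var1, minute_bin, FUN = length)

# Definition 2: a new run starts when the gap to the previous row exceeds 60 s.
gap_id <- ave(as.numeric(Var2), Var1,
              FUN = function(t) cumsum(c(0, diff(t) > 60)))
freq_gap <- ave(seq_along(Var1), Var1, gap_id, FUN = length)

data.frame(Var1, Var2, freq_fixed, freq_gap)
```

Rows 6 and 7 (10:00:30 and 10:01:00) fall in different calendar minutes but are only 30 seconds apart, so only the gap-based definition gives them a frequency of 2.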
I have the following situation.
library(lubridate) # needed for months(1) below
df <- rbind(
data.frame(thisDate = rep(seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")) ),
data.frame(thisDate = rep(seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")) ))
df <- cbind(df,lastMonth = as.Date(format(as.Date(df$thisDate - months(1)),"%Y-%m-01")))
df <- cbind(df, prod1Quantity= seq(1:4) )
I have quantities for different days of a month for an unknown number of products. I want one column per product holding that product's total quantity for the whole of the previous month. The output would look like the table below, i.e. grouped by lastMonth and prod1Quantity. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across "data.table generate multiple columns and summarize them". It appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise production quantity and then joining the month of lastMonth to the month of thisDate.
The lubridate::month function is used to extract the month from each date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
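Joining on the month number alone would mix up the same month from different years once the data spans more than one year. A minimal base R sketch that keys on year and month instead is shown below; the names month_key, prev_key and totals are only illustrative, and the sketch covers a single product column.

```r
df <- data.frame(
  thisDate = as.Date(c("2018-01-01", "2018-01-02", "2018-02-01", "2018-02-02")),
  prod1Quantity = 1:4
)

# Year-month key of each row, e.g. "2018-01".
month_key <- format(df$thisDate, "%Y-%m")

# Key of the previous month: step back one day from the 1st of this month.
first_of_month <- as.Date(format(df$thisDate, "%Y-%m-01"))
prev_key <- format(first_of_month - 1, "%Y-%m")

# Monthly totals as a named vector; months without data look up as NA.
totals <- sapply(split(df$prod1Quantity, month_key), sum)
df$prod1prevMonth <- unname(totals[prev_key])
df
```

For several product columns the same lookup could be repeated per column, for example inside a loop over the column names.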