I am trying to convert a column in my dataset that contains week numbers into weekly Dates. I was trying to use the lubridate package but could not find a solution. The dataset looks like the one below:
df <- tibble(week = c("202009", "202010", "202011","202012", "202013", "202014"),
Revenue = c(4543, 6764, 2324, 5674, 2232, 2323))
So I would like to create a Date column in a weekly format, e.g. (2020-03-07, 2020-03-14).
Would anyone know how to convert these week numbers into weekly dates?
Maybe there is a more automated way, but try something like this. I think this gets the right days; I checked against a 2020 calendar and counted. If something is off, it's a matter of adjusting the (week - 1) * 7 - 1 component to return what you want.
This grabs the first day of the year, adds x weeks' worth of days, and then uses ceiling_date() with week_start = 7 to find the next Sunday.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
separate(week, c("year", "week"), sep = 4, convert = TRUE) %>%
mutate(date = ceiling_date(ymd(paste(year, "01", "01", sep = "-")) +
(week - 1) * 7 - 1, "week", week_start = 7))
# # A tibble: 6 x 4
# year week Revenue date
# <int> <int> <dbl> <date>
# 1 2020 9 4543 2020-03-01
# 2 2020 10 6764 2020-03-08
# 3 2020 11 2324 2020-03-15
# 4 2020 12 5674 2020-03-22
# 5 2020 13 2232 2020-03-29
# 6 2020 14 2323 2020-04-05
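The same arithmetic can be sketched in base R, without lubridate. This is only an illustration under the same assumptions as the code above (the first four characters are the year, the rest is the week number). week_to_sunday is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: convert "YYYYWW" strings to the Sunday that the
# lubridate-based code above returns.
week_to_sunday <- function(yyyyww) {
  year <- as.integer(substr(yyyyww, 1, 4))
  week <- as.integer(substr(yyyyww, 5, 6))
  # First day of the year plus (week - 1) weeks, minus one day
  base <- as.Date(paste0(year, "-01-01")) + (week - 1) * 7 - 1
  # "%w" gives the weekday as 0 (Sunday) to 6 (Saturday)
  wday <- as.integer(format(base, "%w"))
  # Move forward to the next Sunday; a Sunday itself moves a full week
  # (assumed to match how ceiling_date() rounds Date input up)
  base + (7 - wday)
}

week_to_sunday(c("202009", "202010", "202011"))
```

If you need a different weekday, shifting the result by a fixed number of days is probably the simplest adjustment.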
Related
I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (each determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones), using substr to extract the year and month only (no day). These are the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
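To see what a single map2 iteration produces, here is a base R sketch for one event, using the dates from the second row printed in the question (started 2011-01-21, completed 2011-03-03). first_of_month is a hypothetical stand-in for floor_date(x, "month").

```r
# Hypothetical stand-in for lubridate::floor_date(x, "month")
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

start_date     <- as.Date("2011-01-21")
completed_date <- as.Date("2011-03-03")

# One element of the list column: every month the event touches
months <- seq(first_of_month(start_date), first_of_month(completed_date), by = "month")

# Year-month labels, equivalent to substr(month, 1, 7) in the pipeline
month_year <- format(months, "%Y-%m")
month_year
```

After unnest(), this one event contributes one row to each of the three months, which is why counting rows per month_year gives the number of pending events.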
If you want to exclude the completed month (except when the start month and completed month are the same, if that can happen), you can end the sequence one month earlier by subtracting 1 (one day) from the floored completion month, which moves the end point into the previous month. You can use pmax so that if the start and end months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
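A small base R sketch of what the pmax step does, using two made-up events (the dates are illustrative, not taken from the data). Subtracting 1 from a Date subtracts one day, which is enough to stop the monthly sequence one month earlier.

```r
# Floored start and completion months for two illustrative events
start_month     <- as.Date(c("2011-01-01", "2011-05-01"))
completed_month <- as.Date(c("2011-03-01", "2011-05-01"))

# One day before the completion month, but never earlier than the start month
last_pending <- pmax(completed_month - 1, start_month)

# Months counted for the first event: January and February, not March
first_event <- seq(start_month[1], last_pending[1], by = "month")
# Months counted for the second event: May only (same start and end month)
second_event <- seq(start_month[2], last_pending[2], by = "month")
```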
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
This post was edited on Aug 17, 2020 to make the example look more like my actual data.
The days always come first either with 1 or 2 digits. The months always come second either in full or in part and in French. The years always come third either with 2 or 4 digits.
I'm learning to code with the tidyverse packages. I'm trying to replace every element in a variable with another string when it matches a specific condition. The problem is that I can only handle one condition at a time. I would like to know how to handle several conditions at once.
Here's a reproducible example:
library(tidyverse)
library(magrittr)
tib <- tibble(
ID = 1:6,
Date = c("1-JAN-20", "15-JUILL-20", "30 DEC 2020",
"1-JAN-20", "15-JUILL-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 1-JAN-20 Should be 2020-01-01
2 2 15-JUILL-20 Should be 2020-06-15
3 3 30 DEC 2020 Should be 2020-12-30
4 4 1-JAN-20 Should be 2020-01-01
5 5 15-JUILL-20 Should be 2020-06-15
6 6 30 DEC 2020 Should be 2020-12-30
# Returns the unique values of the character variables except the "Comm" one. Here it
# returns only one variable, but my original data has several.
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
$Date
[1] "1-JAN-20" "15-JUILL-20" "30 DEC 2020"
Here we are! The following code works, but I wonder if there's a better way to achieve it than copy/pasting the same line of code and changing it each time.
tib <- tib %>% mutate(Date = case_when(Date == "1-JAN-20" ~ "2020-01-01",
Date == "15-JUILL-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
Since I will have to do this manipulation on other variables, how could I build a function that would accomplish this?
Also, do you know of any good documentation or tutorials for learning the purrr package?
Thank you and have a good day!
When handling dates and times, you should use standard date-time functions. Don't replace dates one by one with string matching: with thousands of dates across different years, it is not practical to list each one. In this case, you can use lubridate::dmy to convert them to Date objects; for more complicated cases, lubridate::parse_date_time can convert strings in several different formats to dates.
tib %>% dplyr::mutate(new_date = lubridate::dmy(Date))
# ID Date Comm new_date
# <int> <chr> <chr> <date>
#1 1 01-JAN-20 Should be 2020-01-01 2020-01-01
#2 2 15-JUN-20 Should be 2020-06-15 2020-06-15
#3 3 30 DEC 2020 Should be 2020-12-30 2020-12-30
#4 4 01-JAN-20 Should be 2020-01-01 2020-01-01
#5 5 15-JUN-20 Should be 2020-06-15 2020-06-15
#6 6 30 DEC 2020 Should be 2020-12-30 2020-12-30
If you want dates in specific format, you can use the format function on new_date.
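Since the question says the months are written in French, dmy may not recognise abbreviations such as JUILL when the session locale is English. Below is a minimal base R sketch that maps month prefixes by hand. parse_fr_date and the fr_prefix lookup are hypothetical names, and the sketch assumes day-month-year order, unaccented month text, and that two-digit years mean 20xx. Note that JUILL abbreviates juillet (July), so it yields 2020-07-15 rather than the 2020-06-15 written in the question's Comm column.

```r
parse_fr_date <- function(x) {
  # Hypothetical lookup: unaccented prefixes of French month names, January to December
  fr_prefix <- c("JAN", "FEV", "MAR", "AVR", "MAI", "JUIN",
                 "JUIL", "AOU", "SEP", "OCT", "NOV", "DEC")
  # Split "15-JUILL-20" or "30 DEC 2020" into day, month and year pieces
  parts <- strsplit(toupper(x), "[^A-Z0-9]+")
  iso <- vapply(parts, function(p) {
    day   <- as.integer(p[1])
    month <- which(startsWith(p[2], fr_prefix))[1]
    year  <- as.integer(p[3])
    if (year < 100L) year <- year + 2000L  # assumption: two-digit years are 20xx
    sprintf("%04d-%02d-%02d", year, month, day)
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

dates <- parse_fr_date(c("1-JAN-20", "15-JUILL-20", "30 DEC 2020"))
# A Date can then be printed in any layout with format()
formatted <- format(dates, "%d/%m/%Y")
formatted
```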
Maybe you could try dplyr::case_when:
library(dplyr)  # tibble(), mutate(), select(), case_when()
library(magrittr)
library(purrr)
# A tibble that looks like my data.
tib <- tibble(
ID = 1:6,
Date = c("01-JAN-20", "15-JUN-20", "30 DEC 2020",
"01-JAN-20", "15-JUN-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
tib <- tib %>% mutate(Date = dplyr::case_when(Date == "01-JAN-20" ~ "2020-01-01",
Date == "15-JUN-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
> tib
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
The best approach here is to convert your Date column to the Date class using the "anytime" package, although you would first have to fix your Date column so that all years have 4 digits. If the year is always the last part of the date, that is easy to do.
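Expanding two-digit years when the year is always the last component can be sketched with a regular expression in base R. expand_year is a hypothetical helper, and prefixing 20 assumes every two-digit year belongs to the 2000s.

```r
# Hypothetical helper: turn a trailing two-digit year into a four-digit one
expand_year <- function(x) {
  # Match exactly two digits at the end that are not preceded by another digit,
  # so "2020" is left alone while "-20" becomes "-2020"
  sub("(?<![0-9])([0-9]{2})$", "20\\1", x, perl = TRUE)
}

fixed <- expand_year(c("01-JAN-20", "15-JUN-20", "30 DEC 2020"))
fixed
```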
I have some data in a format like the reproducible example below (code for data input below the question, at the end). Two things:
Not all dates have a value (i.e. many dates are missing).
Some dates have multiple values, eg 16 June 2020.
#> date value
#> 1 30-Jun-20 20
#> 2 29-Jun-20 -100
#> 3 26-Jun-20 -4
#> 4 16-Jun-20 -13
#> 5 16-Jun-20 40
#> 6 9-Jun-20 -6
For two week periods, ending on Tuesdays, I would like to take a sum of the value column.
So in the example data above, I want to sum ending on:
two weeks ending on Tuesday 16 June 2020 (i.e. from 3 June 2020 - 16 June 2020, inclusive)
two weeks ending on Tuesday 30 June 2020 (17 June 2020 - 30 June 2020 inclusive)
I'd ultimately like the code to continue summing all two week periods ending on every second Tuesday for when there's more data.
So my desired output is:
#2_weeks_end total
#30-Jun-20 -84
#16-Jun-20 21
Tidyverse and lubridate solutions would be my first preference.
Code for data input below:
df <- data.frame(
stringsAsFactors = FALSE,
date = c("30-Jun-20","29-Jun-20",
"26-Jun-20","16-Jun-20","16-Jun-20","9-Jun-20"),
value = c(20L, -100L, -4L, -13L, 40L, -6L)
)
df
Solution using findInterval().
library(dplyr)
library(lubridate)
df$date <- dmy(df$date)
df_intervals <- seq(as.Date("2020-06-03"), as.Date("2020-06-03")+14*3, 14)
df %>%
mutate(interval = findInterval(date, df_intervals)) %>%
mutate(`2_weeks_end` = df_intervals[interval+1]-1) %>%
group_by(`2_weeks_end`) %>%
summarise(total= sum(value))
Returns:
# A tibble: 2 x 2
2_weeks_end total
<date> <int>
1 2020-06-16 21
2 2020-06-30 -84
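findInterval() returns, for each date, the index of the last break that is less than or equal to it. A base R sketch with a few dates from the question shows how that index leads to the period end:

```r
# Period start dates, 14 days apart (same starting point as the answer above)
breaks <- seq(as.Date("2020-06-03"), by = 14, length.out = 4)

dates <- as.Date(c("2020-06-09", "2020-06-16", "2020-06-26", "2020-06-30"))

# Index of the period each date falls into
idx <- findInterval(dates, breaks)

# The day before the next period starts is the Tuesday that closes the period
period_end <- breaks[idx + 1] - 1
data.frame(date = dates, period = idx, period_end = period_end)
```

The breaks need to extend past the latest date in the data, otherwise breaks[idx + 1] runs off the end and returns NA.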
Here is an option if a weekly grouping (or any other unit that lubridate supports by default) is enough:
library(dplyr)
library(lubridate)
df%>%
mutate(date = as.Date(date, format = "%d-%b-%y"))%>%
group_by(week_ceil = ceiling_date(date - 1L, unit = "week", week_start = 2L))%>%
summarize(sums = sum(value))
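Note that this groups by single weeks ending on Tuesday rather than by two-week periods. What the ceiling_date() call computes can be sketched in base R; next_tuesday is a hypothetical helper name.

```r
# Hypothetical helper: the Tuesday on or after each date
next_tuesday <- function(d) {
  # "%u" gives the weekday as 1 (Monday) to 7 (Sunday), so Tuesday is 2
  wday <- as.integer(format(d, "%u"))
  d + (2 - wday) %% 7
}

d <- as.Date(c("2020-06-26", "2020-06-16", "2020-06-17"))
week_end <- next_tuesday(d)
week_end
```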
Here is a data.table approach that creates a reference table followed by a non-equi join:
library(data.table)
library(lubridate)  # for floor_date()
setDT(df)
df[, date := as.Date(date, format = "%d-%b-%y")]
ref_dt = df[, .(beg_date = seq.Date(from = floor_date(min(date), unit = "week", week_start = 3L),
to = max(date),
by = "2 weeks"))]
ref_dt[, end_date := beg_date +13L]
df[ref_dt,
on = .(date >= beg_date,
date <= end_date),
sum(value),
by = .EACHI]
## date date V1
##1: 2020-06-03 2020-06-16 21
##2: 2020-06-17 2020-06-30 -84
I need to get the row of the first and last day of each month in a big data frame where I need to apply operations that cover accurately each month, using a for loop. Unfortunately, the data frame is not very homogeneous. Here a reproducible example to work upon:
dataframe <- data.frame(Date=c(seq.Date(as.Date("2020-01-01"),as.Date("2020-01-31"),by="day"),
seq.Date(as.Date("2020-02-01"),as.Date("2020-02-28"),by="day"),seq.Date(as.Date("2020-03-02"),
as.Date("2020-03-31"),by="day")))
We can create a grouping column by converting to yearmon and then get the first and last dates in each group:
library(zoo)
library(dplyr)
dataframe %>%
group_by(yearMon = as.yearmon(Date)) %>%
summarise(First = first(Date), Last = last(Date))
# A tibble: 3 x 3
# yearMon First Last
#* <yearmon> <date> <date>
#1 Jan 2020 2020-01-01 2020-01-31
#2 Feb 2020 2020-02-01 2020-02-28
#3 Mar 2020 2020-03-02 2020-03-31
If you want the first and last calendar day of each month irrespective of the data:
library(lubridate)
dataframe %>%
group_by(yearMon = as.yearmon(Date)) %>%
summarise(First = floor_date(first(Date), 'month'),
Last = ceiling_date(last(Date), 'month')-1)
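The same calendar boundaries can be computed in base R for a single date. month_bounds is a hypothetical helper, shown only to illustrate the idea of flooring to the first of the month and stepping back one day from the first of the next month.

```r
# Hypothetical helper: first and last calendar day of the month containing d
month_bounds <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  # First day of the following month
  next_first <- seq(first, by = "month", length.out = 2)[2]
  data.frame(First = first, Last = next_first - 1)
}

# The example data starts March on the 2nd; the calendar month starts on the 1st
b <- month_bounds(as.Date("2020-03-02"))
b
```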
In R, how can I produce a list of dates of all 2nd to last Wednesdays of the month in a specified date range? I've tried a few things but have gotten inconsistent results for months with five Wednesdays.
To generate a regular sequence of dates you can use seq with dates for the from and to parameters. See the seq.Date documentation for more options.
Create a data frame with the date, the month and the weekday, then obtain the second-to-last Wednesday of each month with the help of aggregate.
day_sequence = seq(as.Date("2020/1/1"), as.Date("2020/12/31"), "day")
df = data.frame(day = day_sequence,
month = months(day_sequence),
weekday = weekdays(day_sequence))
# Keep only Wednesdays (weekdays() returns locale-dependent names)
df = df[df$weekday == "Wednesday",]
result = aggregate(day ~ month, df, function(x){head(tail(x,2),1)})
tail(x, 2) returns the last two elements, then head(..., 1) gives the first of those two.
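For example, January 2020 has five Wednesdays, and the same head/tail combination picks the fourth one:

```r
# The five Wednesdays of January 2020 (2020-01-01 was a Wednesday)
wednesdays <- as.Date("2020-01-01") + seq(0, 28, by = 7)

# Last two Wednesdays, then the first of those two
second_to_last <- head(tail(wednesdays, 2), 1)
second_to_last
```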
Result:
month day
1 April 2020-04-22
2 August 2020-08-19
3 December 2020-12-23
4 February 2020-02-19
5 January 2020-01-22
6 July 2020-07-22
7 June 2020-06-17
8 March 2020-03-18
9 May 2020-05-20
10 November 2020-11-18
11 October 2020-10-21
12 September 2020-09-23
There are probably simpler ways of doing this, but the function below does what the question asks for. It returns a named vector of days such that:
They are between from and to.
They fall on weekday day, where 1 is Monday.
They are the n-th to last such weekday of the month.
By n to last I mean the nth counting from the end of the month.
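The weekday numbers follow format(x, "%u"), which the function uses internally. A quick check over one week (2020-01-20 was a Monday) shows that Wednesday corresponds to 3:

```r
# One full week, Monday to Sunday
week <- as.Date("2020-01-20") + 0:6

# "%u" numbers weekdays from 1 (Monday) to 7 (Sunday)
codes <- format(week, "%u")
setNames(codes, format(week, "%Y-%m-%d"))
```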
whichWeekday <- function(from, to, day, n, format = "%Y-%m-%d"){
from <- as.Date(from, format = format)
to <- as.Date(to, format = format)
day <- as.character(day)
d <- seq(from, to, by = "days")
m <- format(d, "%Y-%m")
f <- c(TRUE, m[-1] != m[-length(m)])
f <- cumsum(f)
wed <- tapply(d, f, function(x){
i <- which(format(x, "%u") == day)
x[ tail(i, n)[1] ]
})
y <- as.Date(wed, origin = "1970-01-01")
setNames(y, format(y, "%Y-%m"))
}
whichWeekday("2019-01-01", "2020-03-31", 3, 2)
# 2019-01 2019-02 2019-03 2019-04 2019-05
#"2019-01-23" "2019-02-20" "2019-03-20" "2019-04-17" "2019-05-22"
# 2019-06 2019-07 2019-08 2019-09 2019-10
#"2019-06-19" "2019-07-24" "2019-08-21" "2019-09-18" "2019-10-23"
# 2019-11 2019-12 2020-01 2020-02 2020-03
#"2019-11-20" "2019-12-18" "2020-01-22" "2020-02-19" "2020-03-18"