How to replace values under several conditions using purrr? - r

This post was edited on Aug 17, 2020 to make the example look more like my actual data.
The day always comes first, with 1 or 2 digits. The month always comes second, written in full or abbreviated, in French. The year always comes third, with 2 or 4 digits.
I'm learning to code with the tidyverse packages. I'm trying to replace every element of a variable with another string when it matches a specific condition. The problem is that I can only handle one condition at a time. I would like to know how to handle several conditions at once.
Here's a reproducible example:
library(tidyverse)
library(magrittr)
tib <- tibble(
ID = 1:6,
Date = c("1-JAN-20", "15-JUILL-20", "30 DEC 2020",
"1-JAN-20", "15-JUILL-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 1-JAN-20 Should be 2020-01-01
2 2 15-JUILL-20 Should be 2020-06-15
3 3 30 DEC 2020 Should be 2020-12-30
4 4 1-JAN-20 Should be 2020-01-01
5 5 15-JUILL-20 Should be 2020-06-15
6 6 30 DEC 2020 Should be 2020-12-30
# Returns the unique values of the character variables except the "Comm" one. So it
# returns only one in this case, but my original data has several.
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
$Date
[1] "1-JAN-20" "15-JUILL-20" "30 DEC 2020"
Here we are! The following code works, but I wonder if there's a better way to achieve it than copying and pasting the same line of code and changing it each time.
tib <- tib %>% mutate(Date = case_when(Date == "1-JAN-20" ~ "2020-01-01",
Date == "15-JUILL-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
Since I will have to do this on other variables, how could I build a function that would accomplish this?
Also, do you know of any good documentation or tutorials for learning the purrr package?
Thank you and have a good day!

When handling dates and times, you should use standard date-time functions. Don't replace dates one by one using str_replace: with thousands of dates across different years, it is not practical to list each one. In this case you can use lubridate::dmy to convert them to Date objects; for more complicated cases there is lubridate::parse_date_time, which can convert values in several different formats to dates.
tib %>% dplyr::mutate(new_date = lubridate::dmy(Date))
# ID Date Comm new_date
# <int> <chr> <chr> <date>
#1 1 01-JAN-20 Should be 2020-01-01 2020-01-01
#2 2 15-JUN-20 Should be 2020-06-15 2020-06-15
#3 3 30 DEC 2020 Should be 2020-12-30 2020-12-30
#4 4 01-JAN-20 Should be 2020-01-01 2020-01-01
#5 5 15-JUN-20 Should be 2020-06-15 2020-06-15
#6 6 30 DEC 2020 Should be 2020-12-30 2020-12-30
If you want dates in specific format, you can use the format function on new_date.
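Since the months in the question are in French, lubridate::dmy may not recognise abbreviations such as JUILL unless a French locale is available. As a rough, locale-independent sketch in base R, you could match the month by prefix against a lookup vector. Here parse_french_date is a hypothetical helper name, and it assumes unaccented month names, day-month-year order, and that two-digit years mean 20xx:

```r
# Hypothetical helper, not part of any package: parses "day month year"
# strings whose month is a French name or an unambiguous prefix of one.
parse_french_date <- function(x) {
  months_fr <- c("JANVIER", "FEVRIER", "MARS", "AVRIL", "MAI", "JUIN",
                 "JUILLET", "AOUT", "SEPTEMBRE", "OCTOBRE", "NOVEMBRE",
                 "DECEMBRE")
  # Split each string on runs of hyphens or spaces: day, month, year
  parts <- strsplit(toupper(trimws(x)), "[- ]+")
  day  <- as.integer(vapply(parts, function(p) p[1], character(1)))
  mon  <- vapply(parts, function(p) p[2], character(1))
  year <- as.integer(vapply(parts, function(p) p[3], character(1)))
  # Assumption: two-digit years belong to the 2000s
  year <- ifelse(year < 100L, 2000L + year, year)
  # pmatch() returns the index of the unique month starting with `mon`
  # (NA if the prefix is ambiguous or unknown)
  month <- pmatch(mon, months_fr, duplicates.ok = TRUE)
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

parse_french_date(c("1-JAN-20", "15-JUILL-20", "30 DEC 2020"))
```

Note that JUILL is matched to JUILLET, so that value is read as July. To apply the helper to several columns at once, something like mutate(across(c(Date), parse_french_date)) should work with a recent dplyr.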

Maybe you could try dplyr::case_when:
library(dplyr)    # tibble(), select(), mutate(), case_when()
library(magrittr) # %>%
library(purrr)    # map()
# A tibble that looks like my data.
tib <- tibble(
ID = 1:6,
Date = c("01-JAN-20", "15-JUN-20", "30 DEC 2020",
"01-JAN-20", "15-JUN-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
tib <- tib %>% mutate(Date = dplyr::case_when(Date == "01-JAN-20" ~ "2020-01-01",
Date == "15-JUN-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
> tib
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
The best approach here is probably to convert your Date column to the Date class using the "anytime" package, although you would first have to fix the Date column manually so that all years have 4 digits. If the year is always the last part of the date, that is easy to do.
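A small base-R sketch of that year fix, assuming the year is always the last component and that two-digit years mean 20xx (the sample vector is only for illustration):

```r
dates <- c("1-JAN-20", "15-JUN-20", "30 DEC 2020")
# Insert "20" before a two-digit year at the end of the string;
# four-digit years do not match the pattern and are left alone.
dates4 <- sub("([- ])([0-9]{2})$", "\\120\\2", dates)
dates4
```

The replacement string is the first captured group (the separator), then the literal 20, then the two captured year digits.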

Related

R create week numbers with specified start date

This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins ~ August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, baseR, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
Solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always occur on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
If someone has a better answer, please post it, but this works.
Take the data frame from the example, eg_data:
eg_data %>%
group_by(cycle) %>%
mutate(
cycle_start = as.Date(min(dates)),
days_diff = as.Date(dates) - cycle_start,
week_n = days_diff / 7,
week_n_whole = ceiling(days_diff / 7) ) -> eg_data_check
(First time I've answered my own question)
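One caveat with ceiling(): the first day of a cycle has a difference of 0 days and therefore lands in week 0, and day 7 lands in week 1 rather than week 2. A base-R sketch that uses integer division plus one instead (the data frame is rebuilt here with Date values so the snippet runs on its own):

```r
eg_data <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018",
            "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.Date(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16",
                    "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29"))
)
# Earliest date of each cycle, repeated for every row of that cycle
cycle_start <- ave(as.numeric(eg_data$dates), eg_data$cycle, FUN = min)
# Days 0-6 -> week 1, days 7-13 -> week 2, and so on
days_diff <- as.numeric(eg_data$dates) - cycle_start
eg_data$week_n <- as.integer(days_diff %/% 7 + 1)
eg_data$week_n
```

On the example data this reproduces the week numbers asked for in the question (1, 2, 1, 3, 1, 2, 1, 4).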
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(dplyr)   # mutate(), %>%
library(stringr) # str_sub()
library(lubridate)
eg_data %>%
mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4

How to convert week numbers into date format using R

I am trying to convert a column in my dataset that contains week numbers into weekly Dates. I was trying to use the lubridate package but could not find a solution. The dataset looks like the one below:
df <- tibble(week = c("202009", "202010", "202011","202012", "202013", "202014"),
Revenue = c(4543, 6764, 2324, 5674, 2232, 2323))
So I would like to create a Date column in a weekly format, e.g. (2020-03-07, 2020-03-14).
Would anyone know how to convert these week numbers into weekly dates?
Maybe there is a more automated way, but try something like this. I think this gets the right days; I looked at a 2020 calendar and counted. But if something is off, it's a matter of playing with the (week - 1) * 7 - 1 component to return what you want.
This just grabs the first day of the year, adds x weeks worth of days, and then uses ceiling_date() to find the next Sunday.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
separate(week, c("year", "week"), sep = 4, convert = TRUE) %>%
mutate(date = ceiling_date(ymd(paste(year, "01", "01", sep = "-")) +
(week - 1) * 7 - 1, "week", week_start = 7))
# # A tibble: 6 x 4
# year week Revenue date
# <int> <int> <dbl> <date>
# 1 2020 9 4543 2020-03-01
# 2 2020 10 6764 2020-03-08
# 3 2020 11 2324 2020-03-15
# 4 2020 12 5674 2020-03-22
# 5 2020 13 2232 2020-03-29
# 6 2020 14 2323 2020-04-05
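If the week numbers follow ISO 8601 (an assumption, since the question does not say which convention is used), the Monday of each week can also be computed with base R alone. In this sketch iso_week_monday is a hypothetical helper name:

```r
# Hypothetical helper: Monday of the ISO week encoded in strings such as
# "202009" (four-digit year followed by two-digit week number).
iso_week_monday <- function(yearweek) {
  year <- as.integer(substr(yearweek, 1, 4))
  week <- as.integer(substr(yearweek, 5, 6))
  # ISO week 1 is the week that contains January 4th
  jan4 <- as.Date(sprintf("%04d-01-04", year))
  weekday <- as.integer(format(jan4, "%u"))  # 1 = Monday, ..., 7 = Sunday
  week1_monday <- jan4 - (weekday - 1)
  week1_monday + (week - 1) * 7
}

iso_week_monday(c("202009", "202010", "202011"))
```

Adding 6 to the result gives the Sunday that closes each week, which for these inputs matches the dates in the output above.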

Split a row into two when a date range spans a change in calendar year

I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#> from to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits into two rows when to and from span a calendar year.
want <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#> from to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R, tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can identify date intervals of intersections between partial years and a full calendar year. This will keep the partial years starting Jan 1 or ending Dec 31 for a given year.
library(tidyverse)
library(lubridate)
have %>%
mutate(date_int = interval(from, to),
year = map2(year(from), year(to), seq)) %>%
unnest(year) %>%
mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
year_sect = intersect(date_int, year_int),
from_new = as.Date(int_start(year_sect)),
to_new = as.Date(int_end(year_sect))) %>%
select(from_new, to_new)
Output
# A tibble: 5 x 2
from_new to_new
<date> <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
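For a base R alternative, here is a sketch that clips each range to every calendar year it touches. split_by_year is a hypothetical helper name, and the day count assumes both end points should be included:

```r
have <- data.frame(
  from = as.Date(c("2018-12-15", "2019-12-20", "2019-05-13")),
  to   = as.Date(c("2019-06-20", "2020-01-25", "2019-09-10"))
)

# Hypothetical helper: one output row per calendar year touched by each range
split_by_year <- function(from, to) {
  pieces <- lapply(seq_along(from), function(i) {
    years <- seq(as.integer(format(from[i], "%Y")),
                 as.integer(format(to[i], "%Y")))
    data.frame(
      # Clip each piece to the calendar year it falls in
      from = pmax(from[i], as.Date(paste0(years, "-01-01"))),
      to   = pmin(to[i],   as.Date(paste0(years, "-12-31")))
    )
  })
  do.call(rbind, pieces)
}

want <- split_by_year(have$from, have$to)
# Days covered in each year, counting both end points
want$days <- as.integer(want$to - want$from) + 1L
want
```

Ranges that stay within one year pass through unchanged, because the year sequence then has a single element.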

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried to use a loop to partition the time series month by month then use min() to get the earliest date in the month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
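If you would rather avoid extra packages, a base R sketch of the same idea is to compare each date with the minimum date of its year-month group. The data frame below is a shortened copy of the question's data, used only for illustration:

```r
df <- data.frame(
  statistic_date = as.Date(c("2013-01-01", "2013-01-04", "2013-02-01",
                             "2013-02-08", "2013-03-01", "2013-04-03",
                             "2013-04-12")),
  index_value = c(1000.000, 996.096, 1150.479, 1150.288,
                  1148.826, 1101.897, 1093.344)
)
# Year-month key, e.g. "2013-01"
ym <- format(df$statistic_date, "%Y-%m")
# Earliest date within each year-month, repeated for every row of the group
first_in_month <- ave(as.numeric(df$statistic_date), ym, FUN = min)
first_rows <- df[as.numeric(df$statistic_date) == first_in_month, ]
first_rows
```

This is vectorised, so it avoids the month-by-month loop described in the question.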

How to divide monthly totals by the seasonal monthly ratio in R

I am trying to de-seasonalize my data by dividing my monthly totals by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows holding the average seasonality ratio per month, and df1, which holds the order totals. The problem is that avgseasonality has only 12 rows, while df1 has 157 rows.
deseasonlize <- transform(avgseasonalityratio, deseasonlizedtotal =
df1$OrderTotal / avgseasonality$seasonalityratio)
This runs, but it does not pair the months correctly: it applies the first ratio (April) to the first OrderTotal (December).
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need each OrderTotal divided by the ratio for its month. For example, for December the calculation would be 512758 / 0.8316988 = 616518.864762. The results should go in a new column that lines up with the month and OrderTotal. Any help is greatly appreciated!
The easiest way is to merge your data first, then do the operation. You can use base R's merge() function, though here I use the tidyverse left_join() function. I see that one of your columns has a strange name, d$Month; renaming it to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1,2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"),each=2), OrderTotal = 1:4)
df_1 %>%
left_join(df_2, by = "Month") %>%
mutate(eseasonlizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal eseasonlizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
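The same lookup can be sketched in base R with match(), which finds the row of the ratio table for each month in the order data. The values below are the first three rows of the question's df1 and the matching ratios, and the month column is assumed to have already been renamed to Month:

```r
avgseasonality <- data.frame(
  Month = c("April", "December", "February", "January"),
  seasonalityratio = c(1.0132557, 0.8316988, 0.9813396, 0.8357475)
)
df1 <- data.frame(
  Month = c("December", "January", "February"),
  OrderTotal = c(512758, 227449, 155652)
)
# For each row of df1, find the position of its month in avgseasonality
idx <- match(df1$Month, avgseasonality$Month)
df1$seasonalityratio <- avgseasonality$seasonalityratio[idx]
df1$deseasonalizedtotal <- df1$OrderTotal / df1$seasonalityratio
df1
```

Because match() works by month name rather than by row position, the December total is divided by the December ratio regardless of how either table is sorted.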
