Extract information for date from other columns - r

I am working with monthly data. The data is stored in a specific format in two columns, Month and Year. Below you can see a sample of the data:
df<-data.frame(
Month=c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"),
Year=c("2020","2020","2020","2020","2020","2020","2020","2020","2020","2020","2020","2020"))
Now I want to convert this data from that format into a proper date, stored in a new column Date (for example, m1 and 2020 become 2020-01-01).
Can anybody help me solve this problem?

In base R you can do:
df$Date <- as.Date(paste0(df$Year, gsub("m", "-", df$Month, fixed = TRUE), "-01"))

Here is an option using readr's parse_number and lubridate's my function -
library(dplyr)
library(readr)
library(lubridate)
df %>%
mutate(Month = parse_number(Month),
Date = my(paste(Month, Year)))
# Month Year Date
#1 1 2020 2020-01-01
#2 2 2020 2020-02-01
#3 3 2020 2020-03-01
#4 4 2020 2020-04-01
#5 5 2020 2020-05-01
#6 6 2020 2020-06-01
#7 7 2020 2020-07-01
#8 8 2020 2020-08-01
#9 9 2020 2020-09-01
#10 10 2020 2020-10-01
#11 11 2020 2020-11-01
#12 12 2020 2020-12-01
Base R option -
transform(df, Date = as.Date(paste(1, sub('m', '', Month), Year), '%d %m %Y'))

Try
as.Date(paste0(df$Year, '-', gsub('\\D+', '', df$Month), '-01'))
#[1] "2020-01-01" "2020-02-01" "2020-03-01" "2020-04-01" "2020-05-01" "2020-06-01" "2020-07-01" "2020-08-01" "2020-09-01" "2020-10-01" "2020-11-01" "2020-12-01"
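The answers above all follow the same idea: strip the "m" prefix, assemble a year-month-day string and parse it. A minimal base R sketch of that idea as a reusable function follows; the name month_year_to_date is hypothetical, and it assumes every Month value has the form "m" followed by a month number.

```r
# Hypothetical helper: build first-of-month dates from "m<number>" labels and years.
month_year_to_date <- function(month, year) {
  # Drop the leading "m" and keep the month number
  month_num <- as.integer(sub("^m", "", month))
  # Zero-pad the month and fix the day to the 1st
  as.Date(sprintf("%s-%02d-01", year, month_num), format = "%Y-%m-%d")
}

df <- data.frame(Month = paste0("m", 1:12), Year = "2020")
df$Date <- month_year_to_date(df$Month, df$Year)
```

Passing an explicit format to as.Date() means malformed inputs come back as NA rather than raising an error.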

Related

How to replace values under severals conditions using purrr?

This post was edited on Aug 17, 2020 to make the example look more like my actual data.
The days always come first either with 1 or 2 digits. The months always come second either in full or in part and in French. The years always come third either with 2 or 4 digits.
I'm learning to code with the tidyverse packages. I'm trying to replace every element of a variable with another string when it matches a specific condition. The problem is that I can only handle one condition at a time. I would like to know how to handle several conditions at once.
Here's a reproducible example:
library(tidyverse)
library(magrittr)
tib <- tibble(
ID = 1:6,
Date = c("1-JAN-20", "15-JUILL-20", "30 DEC 2020",
"1-JAN-20", "15-JUILL-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 1-JAN-20 Should be 2020-01-01
2 2 15-JUILL-20 Should be 2020-06-15
3 3 30 DEC 2020 Should be 2020-12-30
4 4 1-JAN-20 Should be 2020-01-01
5 5 15-JUILL-20 Should be 2020-06-15
6 6 30 DEC 2020 Should be 2020-12-30
# Returns the unique values of the character variables except the "Comm" one. So, it
# returns only one in this case, but my original data has several.
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
$Date
[1] "1-JAN-20" "15-JUILL-20" "30 DEC 2020"
Here we are! The following code works, but I wonder if there's a better way to achieve it than copying and pasting the same line of code each time and changing it.
tib <- tib %>% mutate(Date = case_when(Date == "1-JAN-20" ~ "2020-01-01",
Date == "15-JUILL-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-30"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-30 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-30 Should be 2020-12-30
Since I will have to do this manipulation on other variables, how could I build a function that would accomplish this?
Also, do you know of any good documentation or tutorials for learning the purrr package?
Thank you and have a good day!
When handling dates and times, you should use standard date-time functions for manipulation. Don't replace dates one by one using str_replace. Imagine you have thousands of dates with different years: it is not practical to list each one of them. In this case, you can use lubridate::dmy to convert them to Date objects; for more complicated cases there is lubridate::parse_date_time, which can convert values in several different formats to dates.
tib %>% dplyr::mutate(new_date = lubridate::dmy(Date))
# ID Date Comm new_date
# <int> <chr> <chr> <date>
#1 1 01-JAN-20 Should be 2020-01-01 2020-01-01
#2 2 15-JUN-20 Should be 2020-06-15 2020-06-15
#3 3 30 DEC 2020 Should be 2020-12-30 2020-12-30
#4 4 01-JAN-20 Should be 2020-01-01 2020-01-01
#5 5 15-JUN-20 Should be 2020-06-15 2020-06-15
#6 6 30 DEC 2020 Should be 2020-12-30 2020-12-30
If you want dates in specific format, you can use the format function on new_date.
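Because the edited question uses French month tokens such as JUILL, which date parsers may not recognise outside a French locale, a locale-independent lookup table is one possible alternative. The sketch below is only an illustration: the function name parse_fr_date, the spellings in fr_months, and the rule that two-digit years belong to the 2000s are all assumptions. It also reads JUILL as juillet (July), which differs from the "Should be 2020-06-15" note in the question.

```r
# Hypothetical lookup table: French month tokens -> month numbers.
# Extend it with whatever spellings occur in the real data.
fr_months <- c(JAN = 1, FEV = 2, MARS = 3, AVR = 4, MAI = 5, JUIN = 6,
               JUILL = 7, AOUT = 8, SEPT = 9, OCT = 10, NOV = 11, DEC = 12)

parse_fr_date <- function(x) {
  # Split "15-JUILL-20" or "30 DEC 2020" into day, month token and year
  parts <- strsplit(x, "[- ]+")
  piece <- function(i) vapply(parts, `[`, character(1), i)
  day   <- as.integer(piece(1))
  month <- unname(fr_months[toupper(piece(2))])
  year  <- as.integer(piece(3))
  # Assumption: two-digit years belong to the 2000s
  year  <- ifelse(year < 100, year + 2000L, year)
  # Unknown month tokens produce NA instead of an error
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

dates <- parse_fr_date(c("1-JAN-20", "15-JUILL-20", "30 DEC 2020"))
```

Since the function takes and returns a plain vector, it can be applied to any column inside mutate(), which addresses the request for something reusable on other variables.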
Maybe you could try dplyr::case_when:
library(dplyr) # provides tibble(), select(), mutate(), case_when() and %>%
library(purrr)
# A tibble that looks like my data.
tib <- tibble(
ID = 1:6,
Date = c("01-JAN-20", "15-JUN-20", "30 DEC 2020",
"01-JAN-20", "15-JUN-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
tib <- tib %>% mutate(Date = dplyr::case_when(Date == "01-JAN-20" ~ "2020-01-01",
Date == "15-JUN-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-30"))
> tib
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-30 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-30 Should be 2020-12-30
The best approach here is to convert your Date column to the Date class using the "anytime" package, although you would first have to fix the column manually so that all years have 4 digits. If the year always comes last in the date, that is easy to do.
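The "fix the years" step mentioned above could be sketched with a single regular expression. This assumes the year is always the last field and that every two-digit year belongs to the 2000s; both are assumptions, not something the data guarantees.

```r
x <- c("01-JAN-20", "15-JUN-20", "30 DEC 2020")

# Insert "20" in front of a trailing two-digit year.
# "\\1" is the separator, "\\2" the two digits; four-digit years do not match.
x4 <- sub("([- ])(\\d{2})$", "\\120\\2", x)
```

The resulting strings all carry a four-digit year and can then be handed to whichever date parser is preferred.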

Create dataframe with month start and end in R

I want to create a dataframe from a given start and end date:
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2020-06-23")
For each row in this dataframe, I should have the start day and end day of the month, so the expected output is:
start end month year
2020-05-17 2020-05-31 May 2020
2020-06-01 2020-06-23 June 2020
I have tried to create a sequence, but I'm stuck on what to do next:
day_seq <- seq(start_date, end_date, 1)
Please, a base R or tidyverse solution will be greatly appreciated.
1) yearmon: Using start_date and end_date from the question, create a yearmon sequence; each of the desired columns is then a simple one-line computation. The stringsAsFactors line can be omitted from R 4.0 onwards, as FALSE is the default there.
library(zoo)
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)
giving:
start end month year
1 2020-05-17 2020-05-31 May 2020
2 2020-06-01 2020-06-23 June 2020
2) Base R This follows similar logic and gives the same answer. We first define a function month1 which given a Date class vector x returns a Date vector the same length but for the first of the month.
month1 <- function(x) as.Date(cut(x, "month"))
months <- seq(month1(start_date), month1(end_date), "month")
data.frame(start = pmax(start_date, months),
end = pmin(end_date, month1(months + 31) - 1),
month = format(months, "%B"),
year = as.numeric(format(months, "%Y")),
stringsAsFactors = FALSE)
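The base R logic above can be wrapped in a function so it is easy to check on a range that crosses a year boundary. This is a sketch only: month_ranges and first_of_month are hypothetical names, and it uses format() rather than cut() to find the first of each month.

```r
# Hypothetical wrapper around the base R approach: one row per calendar month
# between start_date and end_date, clipped to the requested range.
month_ranges <- function(start_date, end_date) {
  first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))
  starts <- seq(first_of_month(start_date), first_of_month(end_date), by = "month")
  # Adding 31 days to the 1st always lands in the next month;
  # its first day minus one is the last day of the current month.
  ends <- first_of_month(starts + 31) - 1
  data.frame(start = pmax(start_date, starts),
             end   = pmin(end_date, ends),
             month = month.name[as.integer(format(starts, "%m"))],
             year  = as.integer(format(starts, "%Y")),
             stringsAsFactors = FALSE)
}

res <- month_ranges(as.Date("2020-11-17"), as.Date("2021-02-10"))
```

Using month.name instead of format(x, "%B") keeps the month labels in English regardless of the session locale.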
It has been a while since I used the tidyverse, but here is my go at it.
sample data
different sample data, to tackle some problems where the year changes..
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2021-06-23")
code
library( tidyverse )
library( lubridate )
#create a sequence of days from start to end
tibble( date = seq( start_date, end_date, by = "1 day" ) ) %>%
mutate( month = lubridate::month( date ),
year = lubridate::year( date ),
end = as.Date( paste( year, month, lubridate::days_in_month(date), sep = "-" ) ) ) %>%
#the end of the last group is now always larger than the maximum date... repair!
mutate( end = if_else( end > max(date), max(date), end ) ) %>%
group_by( year, month ) %>%
summarise( start = min( date ),
end = max( end ) ) %>%
select( start, end, month, year )
output
# # A tibble: 14 x 4
# # Groups: year [2]
# start end month year
# <date> <date> <dbl> <dbl>
# 1 2020-05-17 2020-05-31 5 2020
# 2 2020-06-01 2020-06-30 6 2020
# 3 2020-07-01 2020-07-31 7 2020
# 4 2020-08-01 2020-08-31 8 2020
# 5 2020-09-01 2020-09-30 9 2020
# 6 2020-10-01 2020-10-31 10 2020
# 7 2020-11-01 2020-11-30 11 2020
# 8 2020-12-01 2020-12-31 12 2020
# 9 2021-01-01 2021-01-31 1 2021
# 10 2021-02-01 2021-02-28 2 2021
# 11 2021-03-01 2021-03-31 3 2021
# 12 2021-04-01 2021-04-30 4 2021
# 13 2021-05-01 2021-05-31 5 2021
# 14 2021-06-01 2021-06-23 6 2021
For the specific period in your question, you may use:
library(lubridate)
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2020-06-23")
start <- c(start_date, floor_date(end_date, unit = 'months'))
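# ceiling_date() returns the first day of the next month, so subtract one day
end <- c(ceiling_date(start_date, unit = 'months') - 1, end_date)
# abbr = FALSE gives full month names ("June" rather than "Jun")
month <- c(as.character(month(start[1], label = TRUE, abbr = FALSE)),
as.character(month(start[2], label = TRUE, abbr = FALSE)))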
year <- c(year(start[1]), year(start[2]))
data.frame(start, end, month, year, stringsAsFactors = FALSE)
Here is one approach using intervals with lubridate. You would create a full interval between the 2 dates of interest, and then intersect with monthly ranges for each month (first to last day each month).
library(tidyverse)
library(lubridate)
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2021-08-23")
full_int <- interval(start_date, end_date)
month_seq = seq(start_date, end_date, by = "month")
month_int = interval(floor_date(month_seq, "month"), ceiling_date(month_seq, "month") - days(1))
data.frame(interval = intersect(full_int, month_int)) %>%
mutate(start = int_start(interval),
end = int_end(interval),
month = month.abb[month(start)],
year = year(start)) %>%
select(-interval)
Output
start end month year
1 2020-05-17 2020-05-31 May 2020
2 2020-06-01 2020-06-30 Jun 2020
3 2020-07-01 2020-07-31 Jul 2020
4 2020-08-01 2020-08-31 Aug 2020
5 2020-09-01 2020-09-30 Sep 2020
6 2020-10-01 2020-10-31 Oct 2020
7 2020-11-01 2020-11-30 Nov 2020
8 2020-12-01 2020-12-31 Dec 2020
9 2021-01-01 2021-01-31 Jan 2021
10 2021-02-01 2021-02-28 Feb 2021
11 2021-03-01 2021-03-31 Mar 2021
12 2021-04-01 2021-04-30 Apr 2021
13 2021-05-01 2021-05-31 May 2021
14 2021-06-01 2021-06-30 Jun 2021
15 2021-07-01 2021-07-31 Jul 2021
16 2021-08-01 2021-08-23 Aug 2021

Can you round a date to either the 1st or 15th based on the day of the month using r?

In the piece of code below I generate a date by:
1) grabbing the first date from a character string that has one or more dates
2) adding a numeric value (lead time) to that date to get a new date
What I am trying to do now is to make the resulting date either the 15th or the 1st of the month: if the day of the month is the 15th or later, make the date the 15th of that month; if it is earlier than the 15th, make the date the 1st of that month.
I am much more familiar with SQL and not quite sure how to do this in the simplest way in R.
Code:
AAR_Combined_w_LL$newvalue <-mdy(substr(AAR_Combined_w_LL$order_cutoffs,0,10)) + AAR_Combined_w_LL$lead_time
sample data:
season article_number order_cutoffs amount_of_limited_order_cut_offs retail_intro_date lead_time
1 adidas Fall/Winter 2020 GI7954 12/17/2019,01/14/2020,02/25/2020 3 2020-07-01 105
2 adidas Fall/Winter 2020 GI7955 12/17/2019,01/14/2020,02/25/2020 3 2020-07-01 105
3 adidas Fall/Winter 2020 P82146 12/17/2019 1 2020-06-01 75
4 adidas Fall/Winter 2020 S86676 12/17/2019 1 2020-06-01 75
5 adidas Fall/Winter 2020 P82145 12/17/2019 1 2020-06-01 75
6 adidas Fall/Winter 2020 S86673 12/17/2019 1 2020-06-01 75
1st_Booking_Deadline 1st_BW_Retail_Windows 2nd_Booking_Deadline 2nd_BW_Retail_Windows 3rd_Booking_Deadline 3rd_BW_Retail_Windows
1 2019-12-06 2020-07-01 - 2020-08-01 2020-01-24 2020-09-01 - 2020-11-01
2 2019-12-06 2020-07-01 - 2020-08-01 2020-01-24 2020-09-01 - 2020-11-01
3 2019-12-06 2020-06-01 - 2020-11-01
4 2019-12-06 2020-06-01 - 2020-11-01
5 2019-12-06 2020-06-01 - 2020-11-01
6 2019-12-06 2020-06-01 - 2020-11-01
4th_Booking_Deadline 4th_BW_Retail_Windows 5th_Booking_Deadline 5th_BW_Retail_Windows Booking_Deadlines Booking_Deadline_Intervals
1 2019-12-06, 2020-01-24, , , 2
2 2019-12-06, 2020-01-24, , , 2
3 2019-12-06, , , , NULL
4 2019-12-06, , , , NULL
5 2019-12-06, , , , NULL
6 2019-12-06, , , , NULL
newvalue
1 2020-03-31
2 2020-03-31
3 2020-03-01
4 2020-03-01
5 2020-03-01
6 2020-03-01
We can use ifelse
library(lubridate)
# the question asks for "15th or greater", hence >= rather than >
day(x) <- ifelse(day(x) >= 15, 15, 1)
x
#[1] "2020-03-15" "2020-03-15" "2020-03-01" "2020-03-01" "2020-03-01"
Or with case_when
day(x) <- case_when(day(x) >= 15 ~ 15, TRUE ~ 1)
Or using base R to extract the day
day(x) <- ifelse(as.integer(format(x, "%d")) >= 15, 15, 1)
Or another option is gsubfn to change from a character string
library(gsubfn)
as.Date(gsubfn("(\\d+)$", ~ ifelse(as.numeric(x) >= 15, 15, 1), as.character(x)))
#[1] "2020-03-15" "2020-03-15" "2020-03-01" "2020-03-01" "2020-03-01"
data
x <- as.Date(c('2020-03-31','2020-03-31','2020-03-01','2020-03-01','2020-03-01'))
You could get the day, compare and assign value based on it.
library(lubridate)
x <- as.Date(c('2020-03-31','2020-03-31','2020-03-01','2020-03-01','2020-03-01'))
day(x) <- c(1, 15)[(day(x) >= 15) + 1]
This is another way of writing
day(x) <- ifelse(day(x) >= 15, 15, 1)
x
#[1] "2020-03-15" "2020-03-15" "2020-03-01" "2020-03-01" "2020-03-01"
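For completeness, the same rule can be written without lubridate. The sketch below uses only base R; the function name round_to_1_or_15 is hypothetical, and it follows the question's wording that the 15th itself maps to the 15th.

```r
# Hypothetical base R helper: snap each date to the 1st or the 15th of its month.
round_to_1_or_15 <- function(x) {
  day_of_month <- as.integer(format(x, "%d"))
  new_day <- ifelse(day_of_month >= 15, "15", "01")
  # Rebuild the date from the year-month prefix plus the chosen day
  as.Date(paste0(format(x, "%Y-%m-"), new_day), format = "%Y-%m-%d")
}

x <- as.Date(c("2020-03-31", "2020-03-15", "2020-03-14", "2020-03-01"))
rounded <- round_to_1_or_15(x)
```

Applied to the question's data, this would be called on the computed newvalue column in the same way.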

Add Month and Year column from complete date column

I have a column with date formatted as MM-DD-YYYY, in the Date format.
I want to add 2 columns one which only contains YYYY and the other only contains MM.
How do I do this?
Once again base R gives you all you need, and you should not do this with sub-strings.
Here we first create a data.frame with a proper Date column. If your date is in text format, parse it first with as.Date() or my anytime::anydate() (which does not need formats).
Then given the date creating year and month is simple:
R> df <- data.frame(date=Sys.Date()+seq(1,by=30,len=10))
R> df[, "year"] <- format(df[,"date"], "%Y")
R> df[, "month"] <- format(df[,"date"], "%m")
R> df
date year month
1 2017-12-29 2017 12
2 2018-01-28 2018 01
3 2018-02-27 2018 02
4 2018-03-29 2018 03
5 2018-04-28 2018 04
6 2018-05-28 2018 05
7 2018-06-27 2018 06
8 2018-07-27 2018 07
9 2018-08-26 2018 08
10 2018-09-25 2018 09
R>
If you want year or month as integers, you can wrap as.integer() around the format() call.
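As a small illustration of both points, parsing text first and then wrapping format() in as.integer(), here is a sketch that starts from MM-DD-YYYY strings. The column name date_txt is made up for the example.

```r
# Text dates in MM-DD-YYYY form (hypothetical column name date_txt)
df <- data.frame(date_txt = c("05-21-2017", "06-25-2015"), stringsAsFactors = FALSE)

# Parse once into a proper Date, then derive the parts from it
df$date  <- as.Date(df$date_txt, format = "%m-%d-%Y")
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))
```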
A base R option would be to remove the substring with sub and then read with read.table
df1[c('month', 'year')] <- read.table(text=sub("-\\d{2}-", ",", df1$date), sep=",")
Or using tidyverse
library(tidyverse)
separate(df1, date, into = c('month', 'day', 'year')) %>%
select(-day)
Note: it may be better to convert to datetime class instead of using the string formatting.
library(lubridate) # for mdy(), month() and year()
df1 %>%
mutate(date = mdy(date), month = month(date), year = year(date))
data
df1 <- data.frame(date = c("05-21-2017", "06-25-2015"))

R: extract hour from variable format timestamp

My dataframe has timestamp with and without seconds, and a random use of 0 in front of months and hours, i.e. 01 or 1
library(tidyverse)
df <- tibble(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25'))
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I prefer the answer with tidyverse and mutate, but my attempt fails to extract hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(dplyr) # for mutate() and %>%
library(lubridate)
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
Here is a solution that appends :00 for the seconds when they are missing, then converts to a date-time using lubridate and extracts the hours using format. Note that if you don't want the 00:00 at the end of the hours, you can just remove it from the output format passed to format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string, so you need to parse it into a date-time (with strptime or as.POSIXct and a full format string, for example) before you can extract components such as the hour.
You may have to go through some string manipulation to get consistently formatted data before you can convert it. Single-digit months and hours are accepted by the %m and %H specifiers, so the main fix is to append :00 where the seconds are missing, using paste0() with grepl() or other regex functions. Afterwards do as.POSIXct(df$timestamp, format = '%m/%d/%Y %H:%M:%S'); then you will be able to use format(..., '%H') to extract the hours.
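The approaches discussed in this thread can also be sketched in base R alone. The example below shows two options under stated assumptions: the timestamps always follow month/day/year hour:minute with optional seconds, and the time zone is fixed to UTC purely so that the result does not depend on the session settings.

```r
timestamp <- c("5/31/2016 1:03:12", "05/25/2016 01:06",
               "6/16/2016 09:03", "12/30/2015 23:04:25")

# Option 1: parse up to the minutes; trailing ":ss" is ignored by the format
parsed <- as.POSIXct(timestamp, format = "%m/%d/%Y %H:%M", tz = "UTC")
hours_parsed <- as.integer(format(parsed, "%H"))

# Option 2: pull the digits between the space and the first colon with a regex
hours_regex <- as.integer(sub("^\\S+\\s+(\\d{1,2}):.*$", "\\1", timestamp))
```

Option 1 validates that each string is a real date-time and yields NA otherwise, while option 2 is a plain text extraction with no such check.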
