Parse dates from multiple columns with NAs and dates hidden in text - R

I have a data.frame with date information spread across columns in a messy format: the year column contains years and NAs, the date_old column contains the format Month DD or DD (or a date range) or NAs, and the hidden_date column contains text with dates embedded either in the format .... YYYY .... or in the format .... DD Month YYYY .... (with .... representing surrounding text of variable length).
An example data.frame looks like this:
df <- data.frame(year = c("1992", "1993", "1995", NA),
                 date_old = c("February 15", "October 02-24", "15", NA),
                 hidden_date = c(NA, NA, "The hidden date is 15 July 1995", "The hidden date is 2005"))
I want to get the dates in the format YYYY-MM-DD (taking the first day of any date range) and fill unknown components with zeroes.
Using parse_date_time hasn't helped me so far. The expected output would be:
year date_old hidden_date date
1 1992 February 15 <NA> 1992-02-15
2 1993 October 02-24 <NA> 1993-10-02
3 1995 15 The hidden date is 15 July 1995 1995-07-15
4 <NA> <NA> The hidden date is 2005 2005-00-00
How do I best go about this?

It's a little complicated because you have a jumble of date information spread across columns that you need to extract and combine. I'm not sure whether you only have these three columns or whether there could be more, so I've tried to solve the general case of an arbitrary number of columns. If you only have three columns, each of which always has the same format, things could be a little simpler, but not much.
I would start by creating a regex pattern for month names:
# We'll use dplyr, stringr, tidyr, readr, and purrr
library(tidyverse)
# We'll use month names and abbreviations just in case.
ms <- paste(c(month.name, month.abb), collapse = "|")
# [1] "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
We can then iterate over each column, extracting the year, month, and day from each row as a data frame, which we then combine into a single data frame. The digit suffixes correspond to the original columns:
df_split_ymd <- map_dfc(df,
                        ~ map_dfr(
                          .,
                          ~ tibble(
                            year = str_extract(., "\\b\\d{4}\\b"),
                            month = str_extract(., str_glue("\\b({ms})\\b")),
                            day = str_extract(., "\\b\\d{2}\\b")
                          )
                        )
)
#### OUTPUT ####
# A tibble: 4 x 9
  year  month day   year1 month1   day1  year2 month2 day2 
  <chr> <chr> <chr> <chr> <chr>    <chr> <chr> <chr>  <chr>
1 1992  NA    NA    NA    February 15    NA    NA     NA   
2 1993  NA    NA    NA    October  02    NA    NA     NA   
3 1995  NA    NA    NA    NA       15    1995  July   15   
4 NA    NA    NA    NA    NA       NA    2005  NA     NA   
Finally, the year*, month*, and day* columns should be coalesced and then united to make parsing easier. Note that I've replaced NA values in day with "01" and those in month with "January", because a Date can't contain "00" components:
df_ymd <- df_split_ymd %>%
  mutate(year = coalesce(!!!as.list(select(., starts_with("year")))),
         month = coalesce(!!!as.list(select(., starts_with("month")))) %>%
           replace_na("January"),
         day = coalesce(!!!as.list(select(., starts_with("day")))) %>%
           replace_na("01")
  ) %>%
  unite(ymd, year, month, day, sep = " ") %>%
  select(ymd) %>%
  mutate(ymd = parse_date(ymd, "%Y %B %d"))
#### OUTPUT ####
# A tibble: 4 x 1
  ymd       
  <date>    
1 1992-02-15
2 1993-10-02
3 1995-07-15
4 2005-01-01
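Because a Date can't hold zero components, row 4 comes out as 2005-01-01 rather than the literal 2005-00-00 from the question. If you really want the zero-filled form, keep the result as a character column instead. A minimal sketch, assuming the df_split_ymd intermediate from above and full month names as in the example data:
df_str <- df_split_ymd %>%
  mutate(year  = coalesce(!!!as.list(select(., starts_with("year")))),
         month = coalesce(!!!as.list(select(., starts_with("month")))),
         day   = coalesce(!!!as.list(select(., starts_with("day"))))) %>%
  mutate(month = ifelse(is.na(month), "00",
                        sprintf("%02d", match(month, month.name))),
         day   = replace_na(day, "00")) %>%
  transmute(date = paste(year, month, day, sep = "-"))
# row 4 is now "2005-00-00"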

Related

Aggregate week and date in R by some specific rules

I'm not very experienced with R. I already asked a question on Stack Overflow and got a great answer.
I'm sorry to post a similar question, but I've tried many times and keep getting output I don't expect.
This time, I want to do something slightly different from my previous question:
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1 <- data.frame(id = c(1,1,1,2,2,2,2),
                  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                  points = c(65,58,47,21,25,27,43))
df2 <- data.frame(id = c(1,1,1,2,2,2),
                  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
                  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May, 2022. Likewise, 2022052 means the 2nd week of May, 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May, 2022. If more than one date falls in the same year_month_week, I just want to keep the first of them. Now, here's the different part: even if there is no date inside a year_month_week, leave it as NA, so that my expected output has the same number of rows as df1 (including the year_month_week column). So my expected output is as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2),
                 year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                 points = c(65,58,47,21,25,27,43),
                 temperature = c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into the year_month_week format, then join the two tables:
library(dplyr)
library(lubridate)

df2$dt <- ymd(df2$date)
df2$wk <- day(df2$dt) %/% 7 + 1
df2$year_month_week <- as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))

df1 %>%
  left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
              select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
  id year_month_week points temperature
1  1         2022051     65        36.1
2  1         2022052     58        36.6
3  1         2022053     47          NA
4  2         2022041     21        34.3
5  2         2022042     25        34.9
6  2         2022043     27          NA
7  2         2022044     43          NA
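One caveat when comparing with the answer below: day(dt) %/% 7 + 1 here and ceiling(day / 7) there split a month into weeks differently, so the two join keys won't always agree (both happen to match on the sample dates). For example:
day <- 7
day %/% 7 + 1     # 2: day 7 falls in the 2nd week under this rule
ceiling(day / 7)  # 1: day 7 falls in the 1st week under the ceiling rule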
You can build off of a previous answer by taking the function there that counts the week of the month, then generating a join key in df2.
df1 <- data.frame(
  id = c(1,1,1,2,2,2,2),
  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
  points = c(65,58,47,21,25,27,43))
df2 <- data.frame(
  id = c(1,1,1,2,2,2),
  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the week-of-month function from the previous StackOverflow question
monthweeks.Date <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
  df2 %>%
  mutate(
    date = lubridate::parse_date_time(
      x = date,
      orders = "%Y%m%d"),
    year_month_week = paste0(
      lubridate::year(date),
      # zero-pad the month so two-digit months (October-December) also build valid keys
      sprintf("%02d", lubridate::month(date)),
      monthweeks.Date(date)),
    year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks, keeping the first row in each
df2 <-
  df2 %>%
  arrange(year_month_week) %>%
  distinct(year_month_week, .keep_all = TRUE)

# Join dataframes
df1 <-
  left_join(
    df1,
    df2,
    by = "year_month_week")
Produces this result:
  id.x year_month_week points id.y       date temperature
1    1         2022051     65    1 2022-05-03        36.1
2    1         2022052     58    1 2022-05-12        36.6
3    1         2022053     47   NA       <NA>          NA
4    2         2022041     21    2 2022-04-01        34.3
5    2         2022042     25    2 2022-04-08        34.9
6    2         2022043     27   NA       <NA>          NA
7    2         2022044     43   NA       <NA>          NA
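To match the expected output exactly, a final cleanup of the join's duplicated columns could look like this (a sketch):
df1 %>%
  select(id = id.x, year_month_week, points, temperature)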
Edit: I forgot to mention that you need the tidyverse loaded:
library(tidyverse)

How to wrangle the data using Lubridate package and Regex instead of using the separate function?

https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis/data contains the data set.
This is an exploratory data analysis performed on the shows in the Netflix data set. There are two main objectives in the data wrangling process. The first is to extract only the year from the date_added column. The second is to create a new column containing the number of seasons for a show, taken from the duration column. I have relied on the separate function from the tidyr package to achieve both objectives.
The code is as follows:
# Netflix EDA ----
# https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis
library(tidyverse)
library(lubridate)

net_flix <- read.csv("netflix_titles_nov_2019.csv")

net_flix_wrangled_tbl <- net_flix %>%
  separate(date_added,
           into = c("date", "month", "year"),
           sep = "-",
           remove = FALSE) %>%
  separate(duration,
           into = c("count", "show_type"),
           sep = " ",
           remove = FALSE) %>%
  glimpse()
Those who do not wish to download the data can use the following data frame instead:
sf <- data.frame(date_added = c("30-11-19", "29-11-19", "", "12-07-19", "", "16-09-19"),
                 duration = c("1 Season", "67 min", "135 min", "2 Seasons", "107 min", "3 Seasons"))
The separate() approach works for splitting the date and for pulling the count out of the duration column. But can this be done in a better, more robust way: using the lubridate package to get the year, and ifelse()/filter() or a regex to get only the number of seasons (and not the minutes of movies)?
Here is one alternative:
library(dplyr)
library(lubridate)

sf %>%
  mutate(date_added = dmy(date_added),
         date = day(date_added), month = month(date_added),
         year = year(date_added),
         count = readr::parse_number(as.character(duration)),
         show_type = stringr::str_remove(duration, as.character(count)))
#  date_added  duration date month year count show_type
#1 2019-11-30  1 Season   30    11 2019     1    Season
#2 2019-11-29    67 min   29    11 2019    67       min
#3       <NA>   135 min   NA    NA   NA   135       min
#4 2019-07-12 2 Seasons   12     7 2019     2   Seasons
#5       <NA>   107 min   NA    NA   NA   107       min
#6 2019-09-16 3 Seasons   16     9 2019     3   Seasons
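If you also want the season count only for shows (NA for movie runtimes), one possible variant keeps the number only when duration mentions "Season"; the n_seasons column name here is just illustrative:
library(dplyr)
library(stringr)

sf %>%
  mutate(date_added = lubridate::dmy(date_added),
         year = lubridate::year(date_added),
         # keep the number only when the duration is measured in seasons
         n_seasons = ifelse(str_detect(duration, "Season"),
                            readr::parse_number(as.character(duration)),
                            NA_real_))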

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values. It spans from Dec 1, 2018 to April 1, 2020. The columns are "date" and "value", as shown here:
date <- c("2018-12-01","2000-12-02", "2000-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate the week-over-week change from the current to the previous year.
I know that I can sum by week using the following:
Data_week <- df %>% group_by(week = cut(as.Date(date), "week")) %>% mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the dataframe so that I can calculate the week-over-week change (e.g. the week of Dec 1, 2019 vs. the week of Dec 1, 2018)?
2) How can I do the above with a "customized" week? Let's say I want to define a week by moving 7 days back from the latest date I have data for, e.g. the latest week would start on March 26th (April 1st minus 7 days).
We can use lag() from dplyr, along with some convenience functions from lubridate.
library(dplyr)
library(lubridate)

df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup %>%
  mutate(change = summed - lag(summed))
#     week  year summed  change
#    <dbl> <dbl>  <dbl>   <dbl>
# 1    48  2018  3638.     NA 
# 2    49  2018 15316.  11678.
# 3    50  2018 13283.  -2033.
# 4    51  2018 15166.   1883.
# 5    52  2018 12885.  -2281.
# 6    53  2018  1982. -10903.
# 7     1  2019 14177.  12195.
# 8     2  2019 14969.    791.
# 9     3  2019 14554.   -415.
#10     4  2019 12850.  -1704.
#11     5  2019  1907. -10943.
If you would like to define "weeks" in different ways, there are also isoweek and epiweek. See this answer for a great explanation of your options.
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"),
                 value = runif(60, 1500, 2500))
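For question 2, one way to build the "customized" week is to count back in 7-day blocks from the latest date in the data. A sketch, assuming df$date is of class Date as in the Data block above:
df %>%
  mutate(block = as.numeric(max(date) - date) %/% 7,  # 0 = the most recent 7 days
         week_start = max(date) - 6 - 7 * block) %>%
  group_by(week_start) %>%
  summarize(summed = sum(value)) %>%
  arrange(week_start) %>%
  mutate(change = summed - lag(summed))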

How do I coerce a tibble dataframe column from double to time?

I'm working in the tidyverse. I read in several CSV files (all with the same columns) using read_csv:
df <- read_csv("data.csv")
This gives me a series of dataframes. After a bunch of data cleaning and calculations, I want to merge all of them.
There are a dozen dataframes of several hundred rows each, and a few dozen columns. A minimal example is
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <lgl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu NA 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
Based on my cleaning requirements (if start is NA and stop is not NA), some of the NAs in start must become 00:00. I can enter a zero in those cells:
df <- within(df, start[is.na(start) & !is.na(stop)] <- 0)
This results in
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <dbl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu 0 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
I run into issues when merging, because start is sometimes a double (where I've made replacements), sometimes logical (where there were only NAs and no replacements), and sometimes a time (where the original data contained times):
merged_df <- bind_rows(DF1, DF2, ...)
gives me the error Error: Column `start` can't be converted from hms, difftime to numeric.
How do I coerce the start column to be of the type time so that I may merge my data?
I think the important point is that the columns start and stop, which appear to be of type time, are based on the hms package. I wondered why time is displayed, because I had not heard of this class before.
As I see it, these columns are actually of class hms and difftime. Such objects are stored not in minutes (as the printed tibble suggests) but in seconds. We can see this if we look at the data via View(df). Interestingly, if we print the data, the variable type is displayed as time.
To solve your problem, you have to convert all your start and stop columns consistently into hms difftime columns as in the example below.
Minimal reproducible example:
library(dplyr)
library(hms)

df1 <- tibble(id = 1:3,
              start = as_hms(as.difftime(c(1*60, NA, 8*60), units = "mins")),
              stop = as_hms(as.difftime(c(3*60, NA, 11*60), units = "mins")))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop = as_hms(as.difftime(c(NA, 3*60, NA), units = "mins")))
Or even easier (but with slightly different printing than in the question):
df1 <- tibble(id = 1:3,
              start = as_hms(c(1*60, NA, 8*60)),
              stop = as_hms(c(3*60, NA, 11*60)))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop = as_hms(c(NA, 3*60, NA)))
Solving the problem:
class(df1$start)  # In df1, start has class hms and difftime
class(df2$start)  # In df2, start has class logical

# Set start = 0 where stop is not missing, and turn the whole column into an hms object
df2 <- df2 %>% mutate(start = new_hms(ifelse(!is.na(stop), 0, NA)))

# Now that column types are consistent across tibbles, we can easily bind them together
df <- bind_rows(df1, df2)
df
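If you have more than two frames, you can normalize the start and stop columns in one pass before binding. A sketch, assuming the dataframes are collected in a list: as.numeric() turns hms columns (seconds) and all-NA logical columns alike into numerics, which new_hms() then rebuilds consistently.
dfs <- list(df1, df2)  # however many frames you have
dfs <- lapply(dfs, function(d)
  mutate(d, across(c(start, stop), ~ new_hms(as.numeric(.x)))))
df <- bind_rows(dfs)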

Weekends in a Month in R

I am trying to prepare an xreg series for my ARIMA model, and I will use the number of weekends in each month for it. I can get results for a single year, but when the span is longer than a year, which it usually is, I couldn't find a way. Here is what I do so far:
dates <- seq(from = as.Date("2001-01-01"), to = as.Date("2010-12-31"), by = "day")
wd <- weekdays(dates)
# note: "Saturday" was misspelled as "Satuday" in the original, so Saturdays were never matched
aylar <- months(dates[which(wd == "Sunday" | wd == "Saturday")])
table(aylar)
What I want is to count the weekends of every month, grouped not only by month but also by year, so that I get a series of the same length as my original forecast series.
Here is my solution:
library(chron)
library(dplyr)
library(lubridate)

month <- months(dates[chron::is.weekend(dates)])
day <- dates[chron::is.weekend(dates)]

# create a data.frame with one row per weekend day
df <- data.frame(date = day, month = month, year = chron::years(day))

df %>% group_by(year, month) %>% summarize(weekends = floor(n() / 2))
#    year    month weekends
#   <dbl>   <fctr>    <dbl>
#1   2001    April        4
#2   2001   August        4
#3   2001 December        5
#4   2001 February        4
#5   2001  January        4
#6   2001     July        4
#7   2001     June        4
#8   2001      May        4
#9   2001    March        4
#10  2001 November        4
# ... with 110 more rows
I hope this is a starting point for your work.
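Note that months() and weekdays() return locale-dependent names (the output above was produced in a non-English locale). An alternative sketch without chron, using lubridate's numeric day-of-week codes (wday() returns 1 for Sunday and 7 for Saturday by default) and a locale-independent year-month key:
library(dplyr)
library(lubridate)

data.frame(date = dates) %>%
  filter(wday(date) %in% c(1, 7)) %>%                # keep Sundays and Saturdays
  group_by(year_month = format(date, "%Y-%m")) %>%   # one group per month per year
  summarize(weekend_days = n())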
