Process date regex capturing group outputs in R

I'm trying to coerce dates from two formats into a single one that I can easily feed into as.Date. Here's a sample:
library(dplyr)
df <- data_frame(date = c("Mar 29 2017 9:30AM", "5/4/2016"))
I've tried this:
df %>%
  mutate(date = gsub("([A-z]{3}) (\\d{2}) (\\d{4}).*",
                     paste0(which(month.abb == "\\1"), "/\\2", "/\\3"), date))
But it gave me this:
date
1 /29/2017
2 5/4/2016
but I want this!
date
1 3/29/2017
2 5/4/2016
It looks like when I use month.abb == "\\1", it doesn't use the capturing group output ("Mar"); it just uses the literal text ("\\1"). I want to do this with regex if possible. I know you can do it another way, but I want to be slick.
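As a sanity check, evaluating the replacement expression on its own seems to confirm this; the inner which()/paste0() call runs before gsub() ever sees the data, so gsub() only receives a literal string:
which(month.abb == "\\1")
# integer(0)
paste0(which(month.abb == "\\1"), "/\\2", "/\\3")
# [1] "/\\2/\\3"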
Any ideas?

Here is one way with gsubfn
library(gsubfn)
df$date <- gsubfn("^([A-Za-z]{3})\\s+(\\d{2})\\s+(\\d{4}).*",
                  function(x, y, z) paste(match(x, month.abb), y, z, sep = "/"),
                  df$date)
df$date
#[1] "3/29/2017" "5/4/2016"
Or sub in combination with gsubfn
sub("(\\S+)\\s+(\\S+)\\s+(\\S+).*", "\\1/\\2/\\3",
gsubfn("^([A-z]{3})", setNames(as.list(1:12), month.abb), df$date))
#[1] "3/29/2017" "5/4/2016"

Related

How to fix date typos in R?

I have a data set with a good hundred thousand lines in it.
Somehow the data provider sent it to me with all the dates formatted like 1/1/20202021 08:07:43 AM (mdy_hms). The correct year is the last four digits of that eight-digit year for every row.
lubridate::mdy_hms() obviously can't recognize this. So I am trying to figure out how I could use grep or something similar to pull out the correct date-time. Any ideas?
Thanks everyone (:
You can handle this with functions in the stringr package. First, get the correct year by extracting it from the date variable. For example,
library(stringr)
date_value <- "1/1/20202021 08:07:43 AM"
correct_year <- str_sub(
  str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)
This returns "2021 ". You can now use str_replace() to replace the 8-digit bad year with correct_year:
str_replace(date_value, pattern = "\\d{8}\\s", replacement = correct_year)
[1] "1/1/2021 08:07:43 AM"
To perform this operation across the whole data frame you can do something like this:
library(tidyverse)
df %>%
  mutate(
    date_value = str_replace(
      date_value,
      pattern = "\\d{8}\\s",
      replacement = str_sub(
        str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
      )
    )
  )
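If the end goal is the parsed date-times (as in the question), one more mutate() step can hand the corrected strings to lubridate::mdy_hms(); a sketch of the full pipeline, untested against the real data:
df %>%
  mutate(
    date_value = str_replace(
      date_value,
      pattern = "\\d{8}\\s",
      replacement = str_sub(
        str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
      )
    ),
    date_value = lubridate::mdy_hms(date_value)
  )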
You can keep only the second 4-digit year with sub.
x <- "1/1/20202021 08:07:43 AM"
lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', x))
#[1] "2021-01-01 08:07:43 UTC"
To apply this to the entire column, replace x with df$column_name.
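For instance, a minimal sketch assuming a column named date_value (the name is just illustrative):
df$date_value <- lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', df$date_value))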

Add "\\-" every "x" letters of a string vector

I have a date vector like this:
date <- c("01jan2020", "04mar2020", "20dec2020")
and I want to separate it with - following this pattern (after the first 2 characters and after the first 5 characters):
date_transform1 <- c("01-jan-2020", "04-mar-2020", "20-dec-2020")
Next I want to convert the first letter of the month into a capital letter:
date_transform2 <- c("01-Jan-2020", "04-Mar-2020", "20-Dec-2020")
Any clue?
Regards
An option with lubridate and format
library(lubridate)
format(dmy(date), "%d-%b-%Y")
#[1] "01-Jan-2020" "04-Mar-2020" "20-Dec-2020"
You can try this approach, which splits your strings into multiple components:
#Data
date <- c("01jan2020", "04mar2020", "20dec2020")
#Extract first element
x1 <- substr(gsub("[^0-9.-]", "", date),1,2)
#Extract second element
x2 <- substr(gsub("[^0-9.-]", "", date), nchar(gsub("[^0-9.-]", "", date)) - 3,
             nchar(gsub("[^0-9.-]", "", date)))
#Format month
x3 <- gsub('[[:digit:]]+', '', date)
x3 <- paste(toupper(substr(x3, 1, 1)), substr(x3, 2, nchar(x3)), sep="")
#Now concatenate
xf <- paste0(x1,'-',x3,'-',x2)
Output:
[1] "01-Jan-2020" "04-Mar-2020" "20-Dec-2020"
You can convert the character object to Date and change its format.
format(as.Date(date, "%d%b%Y"), "%d-%b-%Y")
# [1] "01-Jan-2020" "04-Mar-2020" "20-Dec-2020"
The first letter of each month abbreviation is capitalized automatically. You can also use dmy() from lubridate or anydate() from anytime to parse the Date objects.
format(lubridate::dmy(date), "%d-%b-%Y")
format(anytime::anydate(date), "%d-%b-%Y")
Another option with stringr package:
library(stringr)
str_replace(date, "[a-z]+", function(x) sprintf("-%s-", str_to_title(x)))
# [1] "01-Jan-2020" "04-Mar-2020" "20-Dec-2020"
or
str_replace(date, "[a-z]+", function(x) str_pad(str_to_title(x), 5, "both", "-"))
# [1] "01-Jan-2020" "04-Mar-2020" "20-Dec-2020"

Extracting String from Column

I am working with the following dataset called results and am trying to add in a column that only contains the date (ideally just the year) of the row.
I am trying to extract just the date (for example: 2012-02-10) from the column_label column.
This is the code that I use:
pattern <- "- (.*?) .RData"
subsetpk <- results %>%
  filter(team == "Pakistan") %>%
  mutate(year = str_extract(column_label, pattern))
This, however, only gives me NA values.
You can use a regular expression. Here '\\d{4}' just matches the first 4 consecutive digits that are found in the string. This works if your data always looks the same as your example. If not, you may need something more sophisticated. If this doesn't work, post some more example data.
library(tidyverse)
library(stringr)
df <- data.frame(column_label = c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
                                  "Afghanistan-Pakistan-2019-02-10.RDATA.overs"))
df %>%
  mutate(my_year = str_extract(column_label, '\\d{4}'))
                                 column_label my_year
#1 Afghanistan-Pakistan-2012-02-10.RDATA.overs    2012
#2 Afghanistan-Pakistan-2019-02-10.RDATA.overs    2019
The ymd() function from the lubridate package "transforms dates stored in character and numeric vectors to Date or POSIXct objects" (from its documentation). So we can pass the complete string conveniently, without having to deal with regular expressions:
x <- c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs")
lubridate::ymd(x)
[1] "2012-02-10" "2019-02-10"
The year can be derived from the extracted dates by
library(lubridate)
year(ymd(x))
[1] 2012 2019
Use str_extract from the package stringr:
DATA:
results <- data.frame(
  column_label = "Afghanistan-Pakistan-2012-02-10.RData.overs")
SOLUTION:
results$date <- str_extract(results$column_label, "\\d{4}-\\d{2}-\\d{2}")
RESULT:
results
column_label date
1 Afghanistan-Pakistan-2012-02-10.RData.overs 2012-02-10
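And if, as the question mentions, only the year is ultimately needed, it can be taken from the extracted date; a small follow-up sketch:
results$year <- str_sub(results$date, 1, 4)
results$year
# [1] "2012"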

Find year in random data in R

I have 71 columns in a data frame, 10 of which contain data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
  filter(grepl("1990", id_1) | grepl("1990", id_2) | grepl("1991", id_1) | grepl("1991", id_2))
However, it takes a really long time to write that for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new column.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = T)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and activate it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of every id_* column and then coalesce those columns together. Again, without your data it's tough to test this.
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate_at(vars(starts_with("id_")), list(year = ~ str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
  mutate(year = coalesce(ends_with("_year"))) %>%
  select(-ends_with("_year"))
Using tidyverse methods:
undated_data %>%
  mutate_at(vars(1:71),
            funs(str_extract(., "(1|2)[0-9]{3}")))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
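Such a custom function might, for example, restrict matches to the 1990-2019 range mentioned in the question; a rough sketch only (the function name is just illustrative):
# extract the first match from an explicit whitelist of years (1990-2019)
extract_year <- function(x) str_extract(x, paste0(1990:2019, collapse = "|"))
extract_year(c("regkfg_2013", "abc2999def"))
# [1] "2013" NA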
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # for pivot_longer()
df <- data.frame("X1" = id_1, "X2" = id_2)
# Set in cols the column names from which years are going to be extracted
df %>%
  pivot_longer(cols = c("X1", "X2"), names_to = "id") %>%
  arrange(id) %>%
  mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only the vectors containing date information: dates_subset => data.frame
dates_subset <- df[, sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: years => character vector
df$years <- NA_character_
# Remove punctuation and letters, keep the valid dates and combine them into a comma-separated string:
# Store the dates found in each string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
  apply(sapply(dates_subset, function(x){
    grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
    1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as a single group, so we replace the original text with that first captured group :)
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"

How to modify date values?

How can I modify raw date values? For example:
> DF2
      Date
1 11012018
2  7312014
3  6102015
4 10202017
Into modified date values like these, with "/":
> DF2
        Date
1 11/01/2018
2  7/31/2014
3  6/10/2015
4 10/20/2017
Use lubridate for all date- and time-related tasks:
> lubridate::mdy(c("11012018", "7/31/2014"))
[1] "2018-11-01" "2014-07-31"
You can also format it if needed:
format(lubridate::mdy(c("11012018", "7/31/2014")), "%m/%d/%Y")
[1] "11/01/2018" "07/31/2014"
This assumes your dates are in month-day-year format; otherwise, you can use other lubridate functions.
We could also use the following (it is assumed that you just need to add a new separator; in any case, you could convert back to a date type afterwards):
new<-gsub("([0-9]{,2})([0-9]{2})([0-9]{4})","\\1 \\2 \\3",df$Date)
gsub(" ","/",new)
#[1] "11/01/2018" "7/31/2014" "6/10/2015" "10/20/2017"
Edit:
More generally, as suggested by @jay.sf,
test4<-gsub("(^[0-1]?\\d)([0-3]?\\d)(\\d{4}$)","\\1 \\2 \\3",df$Date)
gsub(" ","/",test4)
#[1] "11/01/2018" "7/31/2014" "6/10/2015" "10/20/2017"
This is to account for such date formats as:
test3<-c("11012018", "1112015", "7312014", "7312014", "10202017", "772007", "772007",
"7072007")
One possible solution could be:
df <- transform(df, Date = as.Date(as.character(Date), "%m%d%Y"))
And another, which applies the same conversion to every column of the data frame:
df <- data.frame(lapply(df, function(x) as.Date(as.character(x), "%m%d%Y")))
Both solutions use only base R.
