Extracting Date from text using R - r

My dataframe looks like
df <- setNames(data.frame(c("2 June 2004, 5 words, ()(","profit, Insight, 2 May 2004, 188 words, reports, by ()("), stringsAsFactors = F), "split")
What I want is to split the column into a date and a word count. So far I found
"Extract date text from string"
lapply(df2, function(x) gsub(".*(\\d{2} \\w{3} \\d{4}).*", "\\1", x))
But it's not working with my example. Thanks for the help, as always.

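As there is only a single column, we can use sub/gsub directly on the extracted column. In the pattern, the day can have one or more digits (the original pattern required exactly two), and the month name can have 3 ('May') or 4 characters ('June'), so we need to relax those quantifiers. Note that \\w{3,4} still misses longer month names such as 'September'; use \\w{3,} if those can occur.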
sub(".*\\b(\\d{1,} \\w{3,4} \\d{4}).*", "\\1", df$split)
#[1] "2 June 2004" "2 May 2004"
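Since the question asks for both the date and the word count, the same approach can be extended to fill two new columns. This is a minimal base R sketch, assuming the count always appears as a number directly in front of " words"; the column names date and words are made up for illustration.

```r
# Same sample data as in the question
df <- data.frame(
  split = c("2 June 2004, 5 words, ()(",
            "profit, Insight, 2 May 2004, 188 words, reports, by ()("),
  stringsAsFactors = FALSE
)

# Date: 1-2 digit day, a month name of 3 or more letters, 4-digit year
df$date <- sub(".*\\b(\\d{1,2} [A-Za-z]{3,} \\d{4}).*", "\\1", df$split)

# Word count: the whole number in front of " words".
# The \\b stops the greedy .* from eating the leading digits of "188".
df$words <- as.integer(sub(".*\\b(\\d+) words.*", "\\1", df$split))

df[c("date", "words")]
```

Rows that do not match either pattern come back unchanged from sub(), so it may be worth checking with grepl() first if the input is less regular than in this example.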

Related

Formatting and Replacing Multiple Dates within a Single String in R

I have a question very similar to this one. The difference with mine is that I can have text with multiple dates within one string. All the dates are in the same format, as demonstrated below
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
All my sentences are lower case and all dates follow the %B %d %Y format. I'm able to extract all the dates using the following code:
> pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
> str_extract_all(rep, pattern)
[[1]]
[1] "june 11 2022" "august 4 2022" "august 25 2022"
what I want to do is replace every instance of a date formatted %B %d %Y with the format %Y-%m-%d. I've tried something like this:
str_replace_all(rep, pattern, as.character(as.Date(str_extract_all(rep, pattern),format = "%B %d %Y")))
Which throws the error do not know how to convert 'str_extract_all' to class "Date". This makes sense to me since I'm trying to replace multiple different dates and R doesn't know which one to replace it with.
If I change the str_extract_all to just str_extract I get this:
"on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-06-11. on 2022-06-11 there will be a test "
Which again, makes sense since the str_extract is taking the first instance of a date, converting the format, and applying that same date across all instances of a date.
I would prefer if the solution used the stringr package just because most of my string tidying thus far has been using that package, BUT I am 100% open to any solution that gets the job done.
We may capture the pattern, i.e. one or more word characters (\\w+) followed by a space, then one or two digits (\\d{1,2}), followed by a space and then four digits (\\d{4}), as a group ((...)), and in the replacement pass a function that converts each match to Date class
library(stringr)
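# as.character() is added because recent stringr versions expect the replacement function to return a character vector
str_replace_all(rep, "(\\w+ \\d{1,2} \\d{4})", function(x) as.character(as.Date(x, "%B %d %Y")))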
-output
[1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
NOTE: It is better to name objects with different names as rep is a base R function name
You can pass a named vector with multiple replacements to str_replace_all():
library(stringr)
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
extracted <- str_extract_all(rep, pattern)[[1]]
replacements <- setNames(as.character(as.Date(extracted, format = "%B %d %Y")),
extracted)
str_replace_all(rep, replacements)
#> [1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
Created on 2022-05-26 by the reprex package (v2.0.1)
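Both answers rely on as.Date() with %B, which parses month names according to the current locale. As a hedged alternative, here is a base R sketch that looks the month up in month.name instead, so it should not depend on the locale. The helper to_iso and the variable txt are hypothetical names introduced for this example.

```r
# Renamed from `rep` to avoid shadowing base::rep
txt <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "

# Alternation of lower-case month names: january|february|...
months_lc <- tolower(month.name)
pattern <- paste0("\\b(", paste(months_lc, collapse = "|"), ") \\d{1,2} \\d{4}\\b")

# Hypothetical helper: "june 11 2022" -> "2022-06-11", no locale involved
to_iso <- function(dates) {
  vapply(strsplit(dates, " ", fixed = TRUE), function(p) {
    sprintf("%s-%02d-%02d", p[3], match(p[1], months_lc), as.integer(p[2]))
  }, character(1))
}

# Find every date, then write the converted values back in place
m <- gregexpr(pattern, txt)
regmatches(txt, m) <- lapply(regmatches(txt, m), to_iso)
txt
```

The replacement form of regmatches() puts each converted date back at the position it was found, which is what the single str_extract() attempt in the question could not do.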

How to change year format

I have a year column in my dataframe, which is formatted as financial year (e.g. 2015-16, 2016-17, etc). I want to change them to just 4-digit year in such a way that 2015-16 becomes 2016; 2016-17 becomes 2017, etc. How can I do it?
You can use parse_number from readr :
x <- c('2015-16', '2016-17')
readr::parse_number(x) + 1
#[1] 2016 2017
parse_number drops any non-numeric characters before or after the first number. So in this example, everything after the first number is dropped and the result is turned to numeric. We then add 1 to it to get the next year.
A possible solution can be,
as.numeric(sub('-.*', '', '2015-16')) + 1
#[1] 2016
We can use sub to capture the first two digits while dropping the next two digits and the -, and in the replacement, specify the backreference (\\1) of the captured group
as.numeric(sub("^(\\d{2})\\d{2}-", "\\1", v1))
#[1] 2016 2017
Or more compactly match the two digits followed by the -, and replace with blank ('')
sub("\\d{2}-", "", v1)
#[1] "2016" "2017"
Or using substr
paste0(substr(v1,1, 2), substr(v1, 6, 7))
#[1] "2016" "2017"
NOTE: None of the solutions require any external packages. Also, they don't implicitly assume there is always an increment of 1 year. The range can span any number of years within the same century, as below, and it still works
v2 <- c("2015-18", "2014-15", "2012-19")
sub("\\d{2}-", "", v2)
#[1] "2018" "2015" "2019"
data
v1 <- c("2015-16", "2016-17")
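The string-based solutions above reuse the century digits of the start year, so a range such as "1999-00" would come out as 1900. If such ranges can occur, a numeric version is one way around it. This is a minimal sketch, assuming every value has the fixed layout YYYY-YY; fy_end_year is a hypothetical helper name.

```r
# Hypothetical helper: end year of a financial year written as "YYYY-YY"
fy_end_year <- function(x) {
  start <- as.integer(substr(x, 1, 4))    # e.g. 1999
  yy    <- as.integer(substr(x, 6, 7))    # e.g. 0
  end   <- (start %/% 100L) * 100L + yy   # same century as the start year
  # If that lands before the start year, the range crossed a century
  ifelse(end < start, end + 100L, end)
}

fy_end_year(c("2015-16", "2016-17", "2012-19", "1999-00"))
#[1] 2016 2017 2019 2000
```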

remove "The" at the beginning of a character variable, and move it to the end

I have some data that looks like this (code to input data at the end):
Year Movie
2012 The Avengers
2015 Furious 7
2017 The Fate of the Furious
And my desired output is:
Year Movie
2012 Avengers, The
2015 Furious 7
2017 Fate of the Furious, The
Should I be using stringr and regex formats? Is there a link you can recommend that explains regex a little more simply than most sites or help documentation?
This is pretty poor, but it was all I could do for now:
str_replace(df$Movie, pattern = "The", replacement = "")
Even just some hints of what commands to look for in the help documentation, or where to find explanations of what I should be looking for would be helpful.
df <- data.frame(stringsAsFactors=FALSE,
Year = c(2012L, 2015L, 2017L),
Movie = c("The Avengers", "Furious 7", "The Fate of the Furious")
)
df
str_replace(df$Movie, pattern = "The", replacement = "")
Try
sub("^([Tt]he?) (.*)", "\\2, \\1", df$Movie)
#[1] "Avengers, The"
#[2] "Furious 7"
#[3] "Fate of the Furious, The"
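^ - anchors the match at the start of the string, so a "The" in the middle of a title is left alone.
[Tt] - matches either "T" or "t", so titles starting with "the" are handled too. Thanks to @rawr!
? - makes the preceding token optional (matched at most once). Here it applies only to the "e", not to the whole word, so the pattern would also accept "Th"; it can safely be dropped.
.* - . matches any character and * repeats it zero or more times, so (.*) captures the rest of the title.
() - capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference, i.e. \\1 and \\2. See regular-expressions.info.
I hope this makes some sense to you.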
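To see what the two capture groups actually hold, regexec() with regmatches() can list them before any replacement is done. This is a small sketch using a simplified pattern without the optional "e"; the title "Theodore Rex" is an extra test case added here, not part of the question's data.

```r
movies <- c("The Avengers", "Furious 7", "The Fate of the Furious", "Theodore Rex")

# Group 1 is the article, group 2 is the rest of the title
pattern <- "^(The) (.*)$"

# One element per title: full match, group 1, group 2 (empty if no match)
groups <- regmatches(movies, regexec(pattern, movies))
groups[[1]]
#[1] "The Avengers" "The"          "Avengers"

# Swap the groups; titles that do not match are returned unchanged
sub(pattern, "\\2, \\1", movies)
#[1] "Avengers, The"            "Furious 7"
#[3] "Fate of the Furious, The" "Theodore Rex"
```

The literal space after (The) is what keeps "Theodore Rex" from being rewritten.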
Not pretty, but this should work
#Get the index of the movies starting with "The "
inds <- grepl("^The ", df$Movie)
#Remove "The " from the beginning of the title and append ", The" at the end.
df$Movie[inds] <- paste0(sub("^The ", "", df$Movie[inds]), ", The")
df
# Year Movie
#1 2012 Avengers, The
#2 2015 Furious 7
#3 2017 Fate of the Furious, The

Regex to extract a number and its unit of measure that are separated by a string from a word of interest

I'm learning R and I'm trying to use regex to extract specific text. I would like to capture a number and the unit of measure from a recipe for a specific ingredient.
For example for the following text:
text <- c("0.5 Tb of butter","3 grams (0.75 sticks) of chilled butter","2 tbs softened butter", "0.3 Tb of milk")
I would like to extract the numbers and units relating only to butter, i.e:
0.5 Tb
3 grams
2 tbs
I think this would be best done using regex, but I'm quite new to this so I'm struggling somewhat.
Using str_match I can get the number in front of specific unit like this:
str_match(text, "\\s*(\\d+)\\s*Tb")
[,1] [,2]
[1,] "5 Tb" "5"
[2,] NA NA
[3,] NA NA
[4,] "3 Tb" "3"
But how could I get only the values that relate to butter and for a range of units. Is it possible to make a list of possible units (i.e. grams, tbs, Tb etc.) and ask to match any of them (so that in this example grams would match but not sticks)?
Or perhaps this would be done better with some loop? I could put each sentence into a dataframe, loop through each row asking if there is 'butter' in the row, search for a number in it, and extract the number and the word that follows, which should be the unit of measure.
Thanks for the help.
A base R solution would be to grep out the butter lines and then use read.table to parse them given that the matched items are always the first two fields. No packages are used and the only regular expression used is the simple expression butter.
butter <- grep("butter", text, value = TRUE)
read.table(text = butter, fill = TRUE, as.is = TRUE)[1:2]
giving:
V1 V2
1 0.5 Tb
2 3.0 grams
3 2.0 tbs
An option would be to detect the 'butter' in the strings and then use str_extract
library(stringr)
str_extract(grep("butter", text, value = TRUE), "[0-9.]+\\s+\\w+")
#[1] "0.5 Tb" "3 grams" "2 tbs"
Or using str_detect with str_extract
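library(tidyverse)
# tidyr::extract masks magrittr::extract once the tidyverse is attached,
# so call the magrittr version (an alias for `[`) explicitly
str_detect(text, "butter") %>%
magrittr::extract(text, .) %>%
str_extract("[0-9.]+\\s+\\w+")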
#[1] "0.5 Tb" "3 grams" "2 tbs"
You may want to take a look at something like this ([\d.]+)\s([a-zA-Z]+).*butter
sub("^(\\S+\\s+\\S+).*", "\\1", text[grepl("butter", text)])
[1] "0.5 Tb" "3 grams" "2 tbs"
\\s+ to match any number of spaces and \\S+ to match any number of non-spaces. ^ to start at the beginning.
text[grepl("butter", text)] returns only the text elements which contain the word butter. Perhaps add the argument ignore.case = TRUE to grepl() for it to also match Butter...
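The question also asks whether a list of allowed units can be matched. One way, sketched here in base R, is to paste the units into a regex alternation, so that "grams" matches but "sticks" does not. The units vector is an assumption for illustration and would need to be extended for real recipes.

```r
text <- c("0.5 Tb of butter", "3 grams (0.75 sticks) of chilled butter",
          "2 tbs softened butter", "0.3 Tb of milk")

# Assumed list of accepted units; extend as needed
units <- c("grams", "tbs", "Tb")

# A number (optionally with decimals), optional whitespace, then one of the units
pattern <- paste0("\\d+(\\.\\d+)?\\s*(", paste(units, collapse = "|"), ")\\b")

# Keep only the butter lines, then pull the first number-unit pair from each
butter <- grep("butter", text, value = TRUE, ignore.case = TRUE)
regmatches(butter, regexpr(pattern, butter))
#[1] "0.5 Tb"  "3 grams" "2 tbs"
```

regmatches() silently drops lines with no match, so a butter line that uses an unlisted unit would simply be missing from the result.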

Capitalizing letters. R equivalent of excel "PROPER" function [duplicate]

Colleagues,
I'm looking at a data frame resembling the extract below:
Month Provider Items
January CofCom 25
july CofCom 331
march vobix 12
May vobix 0
I would like to capitalise first letter of each word and lower the remaining letters for each word. This would result in the data frame resembling the one below:
Month Provider Items
January Cofcom 25
July Cofcom 331
March Vobix 12
May Vobix 0
In a word, I'm looking for R's equivalent of the PROPER function available in MS Excel.
With regular expressions:
x <- c('woRd Word', 'Word', 'word words')
gsub("(?<=\\b)([a-z])", "\\U\\1", tolower(x), perl=TRUE)
# [1] "Word Word" "Word" "Word Words"
(?<=\\b)([a-z]) says look for a lowercase letter preceded by a word boundary (e.g., after a space or at the beginning of the string). (?<=...) is called a "look-behind" assertion. \\U\\1 says replace that character with its uppercase version. \\1 is a back reference to the first group surrounded by () in the pattern. See ?regex for more details.
If you only want to capitalize the first letter of the first word, use the pattern "^([a-z])" instead.
The question is about an equivalent of Excel PROPER and the (former) accepted answer is based on:
proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
It might be worth noting that:
proper("hello world")
## [1] "Hello world"
Excel PROPER would give, instead, "Hello World". For 1:1 mapping with Excel see the answer by @Matthew Plourde.
If what you actually need is to set only the first character of a string to upper-case, you might also consider the shorter and slightly faster version:
proper <- function(s) sub("(.)", "\\U\\1", tolower(s), perl = TRUE)
Another method uses the stringi package. The stri_trans_general function appears to lower case all letters other than the initial letter.
require(stringi)
x <- c('woRd Word', 'Word', 'word words')
stri_trans_general(x, id = "Title")
[1] "Word Word" "Word" "Word Words"
I don't think there is one, but you can easily write it yourself
(dat <- data.frame(x = c('hello', 'frIENds'),
y = c('rawr','rulZ'),
z = c(16, 18)))
# x y z
# 1 hello rawr 16
# 2 frIENds rulZ 18
proper <- function(x)
paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
(dat <- data.frame(lapply(dat, function(x)
if (is.numeric(x)) x else proper(x)),
stringsAsFactors = FALSE))
# x y z
# 1 Hello Rawr 16
# 2 Friends Rulz 18
str(dat)
# 'data.frame': 2 obs. of 3 variables:
# $ x: chr "Hello" "Friends"
# $ y: chr "Rawr" "Rulz"
# $ z: num 16 18
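Putting the pieces together for the data in the question: the sketch below applies a word-boundary version of proper() to every character column and leaves numeric columns alone. It assumes the columns are character vectors rather than factors, hence stringsAsFactors = FALSE.

```r
# Capitalise the first letter of every word, lower-case the rest
proper <- function(x) gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)

dat <- data.frame(
  Month    = c("January", "july", "march", "May"),
  Provider = c("CofCom", "CofCom", "vobix", "vobix"),
  Items    = c(25, 331, 12, 0),
  stringsAsFactors = FALSE
)

# Only touch the character columns
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], proper)

dat$Month
#[1] "January" "July"    "March"   "May"
dat$Provider
#[1] "Cofcom" "Cofcom" "Vobix"  "Vobix"
```

Because \\b also matches after an apostrophe, a value such as "don't" would become "Don'T" with this pattern, which may or may not be what you want.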
