Extracting Sub-expressions from a Dataframe of Strings Using Regular Expressions - r

I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows data frame (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.

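It seems that regmatches combined with regexpr won't do what I want, because it returns only the full match. Instead, I used str_match from the stringr package, as suggested by @kent-johnson. Note that two of the quantifiers in the pattern are now lazy (+?).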
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"
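For reference, base R can also return capture groups when regexec is used in place of regexpr. The following is a minimal sketch rather than the approach accepted above: simplePattern is a shortened stand-in for entryPattern that covers only the first four fields, and the sample rows and the year column name are assumptions made for illustration.

```r
# Hypothetical sample data; `rows` and `text` follow the naming in the question.
rows <- data.frame(
  text = c("101 POULET Laure FRA 1992", "not an entry", "7 DOE Jane USA 1988"),
  stringsAsFactors = FALSE
)

# Simplified stand-in pattern (assumption): rank, name, country, birth year.
simplePattern <- "^(\\d+)\\s+(.+?)\\s+([A-Z]{3})\\s+(\\d{4})$"

# regexec() records the position of every capture group, so regmatches()
# returns the full match followed by the four groups for each string.
m <- regmatches(rows$text, regexec(simplePattern, rows$text, perl = TRUE))

# Strings that do not match give character(0); pad those rows with NA.
parts <- t(vapply(
  m,
  function(p) if (length(p) == 5) p else rep(NA_character_, 5),
  character(5)
))

# Column 1 is the full match; columns 2 to 5 are the sub-expressions.
rows$rank    <- parts[, 2]
rows$name    <- parts[, 3]
rows$country <- parts[, 4]
rows$year    <- parts[, 5]
```

Rows that fail to match keep NA in the new columns, so the separate grepl check becomes optional in this sketch.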

Related

How to extract a fixed number of characters before a string in R

I have a text that contains somewhere in the document a citation to a court case, such as
x <- "2009 U.S. LEXIS"
I know the citation is always a four-digit year followed by a space and then the pattern "U.S. LEXIS". How can I extract the four-digit year?
Thanks
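I think the sample data you gave was too limited for people here to solve your problem.
UPDATE: try this, with x defined as in the old answer below. The dots are escaped so that they match literal periods rather than any character.
library(stringr)
str_extract_all(x, "\\d{4}(?=\\sU\\.S\\.\\sLEXIS)")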
[[1]]
[1] "2009" "2015" "1990"
Or, to extract these as numbers:
lapply(str_extract_all(x, "\\d{4}(?=\\sU\\.S\\.\\sLEXIS)"), as.numeric)
[[1]]
[1] 2009 2015 1990
OLD ANSWER: I am also new to regex, so my solution may not be very clean. Your case amounts to extracting a nested group from a regex match. Still, you can try this method:
x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
> x
[1] "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"
Now follow these steps
library(stringr)
lapply(str_extract_all(x, "(\\d{4})\\sU\\.S\\.\\sLEXIS"), str_extract, pattern = "(\\d{4})")
[[1]]
[1] "2009" "2015" "1990"
Typically "((\\d{4})\\sU\\.S\\.\\sLEXIS)" would have worked as a regex pattern, but I am not sure about nested groups in R, so I used lapply here. Basically, str_extract_all(x, "(\\d{4})\\sU\\.S\\.\\sLEXIS") returns all the citations, and the inner str_extract then pulls the year out of each one.
You can try:
x <- "2009 U.S. LEXIS"
as.numeric(sub('.*?(\\d{4}) U\\.S\\. LEXIS', '\\1', x))
#[1] 2009
Using stringr::str_extract:
as.numeric(stringr::str_extract(x, '\\d{4}(?= U\\.S\\. LEXIS)'))
We can use parse_number from readr
library(readr)
parse_number(x)
#[1] 2009
data
x <- "2009 U.S. LEXIS"
The base R substr function solves it, provided the year occupies the first four characters of the string:
substr(x,1,4)
If you need a number, wrap the result in as.numeric:
as.numeric(substr(x,1,4))
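The lookahead idea from the answers above also works without extra packages. This is a minimal base R sketch, assuming the longer sample string from the old answer; the variable name years is introduced here for illustration.

```r
# Sample text with three citations (taken from the old answer above).
x <- "aabv 2009 U.S. LEXIS abcs aa 2015 U.S. LEXIS 45 fghhg ds fdavfd 1990 U.S. LEXIS bye bye!"

# perl = TRUE enables the lookahead (?= ...); the dots are escaped so that
# only a literal "U.S. LEXIS" qualifies.
hits <- gregexpr("\\d{4}(?= U\\.S\\. LEXIS)", x, perl = TRUE)

# regmatches() returns one character vector per input string.
years <- as.numeric(regmatches(x, hits)[[1]])
years
#> [1] 2009 2015 1990
```

Unlike substr(x, 1, 4), this does not depend on where in the text the citation appears.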

Partial Match String and full replacement over multiple vectors

I would like to efficiently replace all partially matching strings in a single column by supplying a vector of strings that serve both as the search patterns and as the replacements. That is, each element of df below is checked for a partial match against each element of vec_string. Where a match is found, the entire string is replaced with the matching element of vec_string, e.g. turning 'subscriber manager' into 'manager'. Adding more strings to vec_string should extend the search over the whole of df.
I have started the function, but can't seem to finish it off by replacing the elements of df with those of vec_string. I'd appreciate your help.
df <- c(
'solicitor'
,'subscriber manager'
,'licensed conveyancer'
,'paralegal'
,'property assistant'
,'secretary'
,'conveyancing paralegal'
,'licensee'
,'conveyancer'
,'principal'
,'assistant'
,'senior conveyancer'
,'law clerk'
,'lawyer'
,'legal practice director'
,'legal secretary'
,'personal assistant'
,'legal assistant'
,'conveyancing clerk')
vec_string <- c('manager','law')
# function to find which elements of vec match each search string
replace_func <- function(vec, str_vec) {
  repl_str <- list()
  for (i in seq_along(str_vec)) {
    repl_str[[i]] <- grep(str_vec[i], unique(tolower(vec)))
  }
  # name the results after the argument rather than the global vec_string
  names(repl_str) <- str_vec
  return(repl_str)
}
replace_func(df,vec_string)
$`manager`
[1] 2
$law
[1] 13 14
As you can see, the function returns a named list holding the indices of the elements to which each replacement will be applied.
This should do the trick
res = sapply(df,function(x){
match = which(sapply(vec_string,function(y) grepl(y,x)))
if (length(match)){x=vec_string[match[1]]}else{x}
})
res
[1] "solicitor" "manager" "licensed conveyancer"
[4] "paralegal" "property assistant" "secretary"
[7] "conveyancing paralegal" "licensee" "conveyancer"
[10] "principal" "assistant" "senior conveyancer"
[13] "law" "law" "legal practice director"
[16] "legal secretary" "personal assistant" "legal assistant"
[19] "conveyancing clerk"
We compare each element of df with each element of vec_string. If there is a match, the matching element of vec_string is returned; otherwise the original string is kept. Note that if more than one pattern matches, only the first one is used.
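The same result can be reached with one vectorised grepl call per search string instead of a nested sapply. This is a sketch under stated assumptions: jobs is a shortened, hypothetical version of df, and the patterns are treated as fixed strings rather than regular expressions.

```r
# Shortened, hypothetical version of the question's vector.
jobs <- c("solicitor", "subscriber manager", "law clerk", "lawyer", "legal secretary")
vec_string <- c("manager", "law")

res <- jobs
# Loop in reverse so that, when several patterns match the same element,
# the earliest entry of vec_string is written last and therefore wins.
for (s in rev(vec_string)) {
  res[grepl(s, jobs, fixed = TRUE)] <- s
}
res
#> [1] "solicitor"       "manager"         "law"             "law"
#> [5] "legal secretary"
```

Matching is done against the original jobs vector, so an earlier replacement never influences a later match.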

Turn Street Address Into Components

I have address data that I extracted from SQL and have now loaded into R. I am trying to extract the individual components, namely the ZIP code at the end of each address (the state would also be nice). I would like the ZIP code and state to be in new individual columns.
The primary issue is the ZIP-CODE is sometimes 5 digits, and sometimes 9.
Two example rows would be:
Address_FULL
1234 NOWHERE ST WASHINGTON DC 20005
567 EVERYWHERE LN CHARLOTTE NC 22011-1203
I suspect I'll need some kind of regex \\d{5} notation, or some kind of fancy manipulation in dplyr that I'm not aware exists.
If the zip code is always at the end you could use
str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$")
To add a "zip" column via dplyr you could use
df %>% mutate(zip = str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$"))
Where df is your data frame containing Address_FULL and str_extract() is from stringr.
State could be extracted as follows:
str_extract(Address_FULL,"(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})")
However, this makes the following assumptions:
The state abbreviation is 2 characters long
The state abbreviation is followed immediately by a space
The zip code follows immediately after the space that follows the state
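Under the same three assumptions, state and zip code can be captured in one pass with base R. This is only a sketch: the data frame mirrors the two example rows from the question, and the column names State and Zip are chosen here for illustration.

```r
df <- data.frame(
  Address_FULL = c("1234 NOWHERE ST WASHINGTON DC 20005",
                   "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"),
  stringsAsFactors = FALSE
)

# Group 1: two-letter state; group 2: five-digit zip with optional "-dddd".
# (?: ... ) is non-capturing, so each match has exactly three elements.
pattern <- "\\s([A-Z]{2})\\s(\\d{5}(?:-\\d{4})?)$"
parts <- regmatches(df$Address_FULL, regexec(pattern, df$Address_FULL, perl = TRUE))

# Helper (hypothetical name) that returns NA when an address does not match.
pick <- function(p, i) if (length(p) == 3) p[i] else NA_character_

df$State <- vapply(parts, pick, character(1), i = 2)
df$Zip   <- vapply(parts, pick, character(1), i = 3)
```

Addresses that break any of the assumptions end up with NA in both new columns instead of a wrong value.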
Assuming that the zip is always at the end, you can try:
tail(unlist(strsplit(STRING, split=" ")), 1)
For example
ex1 = "1234 NOWHERE ST WASHINGTON DC 20005"
ex2 = "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"
> tail(unlist(strsplit(ex1, split=" ")), 1)
[1] "20005"
> tail(unlist(strsplit(ex2, split=" ")), 1)
[1] "22011-1203"
Use my package tfwstring
Works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
parseaddress("1234 NOWHERE ST WASHINGTON DC 20005", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"1234" "NOWHERE" "ST" "WASHINGTON" "DC" "20005"
parseaddress("567 EVERYWHERE LN CHARLOTTE NC 22011-1203", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"567" "EVERYWHERE" "LN" "CHARLOTTE" "NC" "22011-1203"

Is there an easy way to obtain a vector with NATO phonetic alphabet?

Consider the following vector
x <- paste0(LETTERS,1:26)
I want to replace the letters with the NATO phonetic alphabet (alpha, bravo, charlie etc.) whilst keeping the numbers. Is there a vector within R, similar to LETTERS, that has the full NATO phonetic alphabet?
I'm not aware of a built-in list. It's just a vector of words, so you can build it yourself.
NATO <- strsplit("Alfa, Bravo, Charlie, Delta, Echo, Foxtrot, Golf, Hotel, India, Juliett, Kilo, Lima, Mike, November, Oscar, Papa, Quebec, Romeo, Sierra, Tango, Uniform, Victor, Whiskey, X-ray, Yankee, Zulu", ", ")
z <- paste0(unlist(NATO),1:26)
z
#> [1] "Alfa1" "Bravo2" "Charlie3" "Delta4" "Echo5"
#> [6] "Foxtrot6" "Golf7" "Hotel8" "India9" "Juliett10"
#> [11] "Kilo11" "Lima12" "Mike13" "November14" "Oscar15"
#> [16] "Papa16" "Quebec17" "Romeo18" "Sierra19" "Tango20"
#> [21] "Uniform21" "Victor22" "Whiskey23" "X-ray24" "Yankee25"
#> [26] "Zulu26"
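The paste0 call above relies on the letters and numbers being in alphabetical order. When they are not, a named lookup vector handles the replacement. This is a minimal sketch, assuming each element starts with exactly one capital letter; the names nato and key and the sample x are introduced here for illustration.

```r
nato <- c("Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
          "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
          "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
          "Victor", "Whiskey", "X-ray", "Yankee", "Zulu")
names(nato) <- LETTERS   # lookup table: "A" -> "Alfa", "B" -> "Bravo", ...

x <- c("C3", "A10", "Z26")   # hypothetical input in arbitrary order

key <- substr(x, 1, 1)       # the leading letter of each element
z <- paste0(unname(nato[key]), substring(x, 2))   # code word + remaining digits
z
#> [1] "Charlie3" "Alfa10"   "Zulu26"
```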

delete characters after first numeral /number in string in R

I'm going through and cleaning a dataset that has location entries like : "Sarasota Florida6h" I'm not sure why but all the strings have either 3 or 2 characters at the end starting with a number:
[413] "Los Angeles11h" "Pittsburgh PA1h"
[415] "London UK18h" "Mumbai India19h"
[417] "Orange County CA1h" "Columbus OH2d"
[419] "4d" "Sarasota Florida6h"
[421] "Toronto9m" "Adelaide Australia7h"
[423] "Wayland MA4h" "Scottsdale AZ USA1h"
[425] "Sydney Australia6d" "Connecticut USA31m"
[427] "United States5m" "Boulder Colorado12h"
[429] "Berlin Germany7h" " India Chaibasa1h"
I need a script to remove all letters after a numeral to clean these out:
I've tried the below, but clearly, there's something wrong here.
follow_dat$loc <- sapply(strsplit(follow_dat$Location, "\\[0-9]"), `[[`, 2)
Your kind assistance is appreciated.
Mari
Use regular expressions. For example, you can clean them this way:
gsub("[0-9]..*","",follow_dat$Location)
This expression matches the first digit together with everything after it, and replaces the match with the empty string "" in every element of follow_dat$Location.
If there are no other numbers in your strings (as your example suggests), then we can use gsub:
gsub('[0-9]+[a-z]', '',follow_dat$Location)
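If a location could contain digits elsewhere, anchoring the pattern to the end of the string limits the removal to the trailing suffix. This is a small sketch on a hypothetical sample of the data; loc and cleaned are names introduced here.

```r
# Hypothetical sample taken from the listing in the question.
loc <- c("Los Angeles11h", "Columbus OH2d", "4d", " India Chaibasa1h",
         "Connecticut USA31m")

# "$" anchors the match to the end, so only the trailing digits plus one
# lowercase letter are removed; trimws() drops the stray leading space.
cleaned <- trimws(sub("[0-9]+[a-z]$", "", loc))
cleaned
#> [1] "Los Angeles"     "Columbus OH"     ""                "India Chaibasa"
#> [5] "Connecticut USA"
```

Entries that consist only of the suffix, such as "4d", become empty strings and may need to be set to NA afterwards.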
