A regex to split a text string in R

I have a very long string like the sample below, and I'm struggling to find a regex to split it into parts according to the pattern, for example: '1. OAS / AC' and '2. OAS / AD'.
This slice of text has:
1) a varying number in the beginning
2) two capital letters varying from A to Z
I tried this:
x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
but it does not work.
Thanks in advance, for any help!
Example
require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
want <- list(
"1. OAS / AC " = "12345/this is a test string to regex,",
"2. OAS / AD " = "79856/this is another test string to regex,",
"3. OAS / AE " = "87987/this is a new test string to regex.",
"4. OAS / AZ " = "78798456/this is one mode test string to regex."
)

We could do this with a positive lookahead, looking for the pattern of a number followed by a period:
str_split(have, "(?=\\d+\\.)")
[1] "" "1. OAS / AC 12345/this is a test string to regex, "
[3] "2. OAS / AD 79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."
And we can further clean it up:
str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]
[1] "1. OAS / AC 12345/this is a test string to regex, " "2. OAS / AD 79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. " "4. OAS / AZ 78798456/this is one mode test string to regex."

You may use
library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]
Result:
dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,",
# "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
# ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
# ))
See the regex demo.
Pattern details
(\d+\. OAS / [A-Z]{2}) - Capturing group 1:
\d+ - 1+ digits
\. - a .
OAS / - a literal OAS / substring
[A-Z]{2} - two uppercase letters
\s* - 0+ whitespaces
(.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
(?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current location, there must be either
\s*\d+\. OAS / [A-Z]{2} - 0+ whitespaces, 1+ digits, ., space, OAS, space, /, space, two uppercase letters
| - or
\z - end of string.
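As a rough base R sketch of the same key/value idea (an alternative I'm assuming is acceptable here, not the method used in the answer above): regmatches() with gregexpr() pulls out the keys, and invert = TRUE returns the text between them. The names key_pattern, m, keys, and values are only illustrative.

```r
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."

# Same key pattern as capturing group 1 above
key_pattern <- "\\d+\\. OAS / [A-Z]{2}"
m <- gregexpr(key_pattern, have, perl = TRUE)

# The matched keys, e.g. "1. OAS / AC"
keys <- regmatches(have, m)[[1]]

# invert = TRUE returns the pieces between the matches,
# including the empty piece before the first key
values <- regmatches(have, m, invert = TRUE)[[1]]
values <- trimws(values[-1])  # drop the leading empty piece, trim spaces

res <- setNames(values, keys)
print(res)
```

This avoids the lookahead entirely, at the cost of running the match twice (once for the keys, once inverted for the values).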

The way you described the issue is somewhat unclear, but if you simply want to extract the text up to "OAS / AC",
library(qdap)
beg2char(have, " ", 4) # looks for the fourth occurrence of \\s and extracts everything before it
For the above function to work, the sentences should be individual strings in a character vector.
If your aim is to actually insert an "=" sign between the two letter sub-string and the number occurring after "OAS",
gsub("([A-Z])\\s*([0-9])","\\1 = \\2",have,perl=T)

Related

gsub extracting string

My sample data is:
c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
For each line, I'm looking to extract (they are variable names):
Line 1: "PEMJNUM"
Line 2: "PRFAMTYP"
Line 3: "PUBUS1"
Line 4: "PEIO1COW"
My initial goal was to use gsub to remove the characters to the left and right of each variable name, leaving just the variable names, but I was only able to grab everything to the right of the variable name and had trouble grabbing the characters to the left (as shown here https://regexr.com/67r6j).
Not sure if there's a better way to do this!
You can use sub in the following way:
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl=TRUE)
# => [1] "PEMJNUM" "PRFAMTYP" "PUBUS1" "PEIO1COW"
See the online regex demo and the R demo.
Details:
^ - start of string
(?:.*\b)? - an optional non-capturing group that matches any zero or more chars (other than line break chars since I use perl=TRUE, if you need to match line breaks, too, add (?s) at the pattern start) as many as possible, and then a word boundary position
(\w+) - Group 1 (\1): one or more word chars
\s* - zero or more whitespaces
\b - a word boundary
2 - a 2 digit
\b - a word boundary
.* - the rest of the line/string.
If there are always whitespaces before 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".
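To see why the greedy .* in the optional prefix matters, here is a small sketch on a made-up input (the string toy is hypothetical, not from the question). With a greedy prefix the last standalone 2 decides the match; with a lazy prefix the first one does.

```r
# Hypothetical input with two standalone 2s; VAR precedes the last one
toy <- "1\tA 2\tB VAR\t2\tLABEL\t10 - 11"

# Greedy prefix: .* runs as far right as it can,
# so the word before the LAST standalone 2 is captured
greedy <- sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", toy, perl = TRUE)

# Lazy prefix: .*? stops as early as it can,
# so the word before the FIRST standalone 2 is captured
lazy <- sub("^(?:.*?\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", toy, perl = TRUE)

print(greedy)  # "VAR"
print(lazy)    # "A"
```

For the sample data in the question, the variable name always sits before the last standalone 2, which is why the greedy form is the one that works there.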

Stringr str_replace_all misses repeated terms

I'm having an issue with the stringr::str_replace_all function. I'm trying to replace all instances of iv with insuredvehicle, but the function only seems to catch the first term.
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The outcome looks like the following, which missed the 2nd iv term:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
I believe the issue is that the 2 instances share a space, which is part of the search pattern. I did that because I want to replace the iv term, and not iv within driver.
I DON'T want to simply consolidate the repeated terms to 1. I'd like the result to look like:
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
I'd appreciate any help getting this to work!
Maybe you could include a word boundary in your regex, then remove the white spaces from the replacement? It is ideal when you want a full word matching the pattern, but not parts of words, while staying away from these blank space issues.
\\b seems to do the trick
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]
new_text
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
Output:
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use space boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
iv 'iv'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
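The answers in this thread differ in what they count as a boundary. A quick base R sketch on a made-up string (s is my own test input, and "X" is a placeholder replacement) shows where \\b, the single-space lookarounds, and the whitespace boundaries disagree:

```r
s <- "iv iv, iv-line iv"  # made-up test string

# Word boundary: also matches next to punctuation such as "," and "-"
word_boundary <- gsub("\\biv\\b", "X", s, perl = TRUE)

# Literal-space lookarounds: require a space on BOTH sides,
# so matches at the start or end of the string are missed
space_lookaround <- gsub("(?<= )iv(?= )", "X", s, perl = TRUE)

# Whitespace boundaries: no non-whitespace char on either side,
# so start/end of string count, but "," and "-" do not
whitespace_boundary <- gsub("(?<!\\S)iv(?!\\S)", "X", s, perl = TRUE)

print(word_boundary)        # "X X, X-line X"
print(space_lookaround)     # "iv iv, iv-line iv"
print(whitespace_boundary)  # "X iv, iv-line X"
```

Which one is right depends on whether "iv," or "iv-line" should count as the standalone term in your data.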

Searching and replacing characters with classes in R

I am trying to replace text in R. I want to find spaces between letters and numbers only and delete them, but when I search using [:alpha:] and [:alnum:] it replaces with that class operator.
> string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
> str_replace_all(string,
+ "[:alpha:] & [:alnum:]",
+ "[:alpha:]&[:alnum:]")
[1] "WORD = 500 * WORD + ((WOR[:alpha:]&[:alnum:]00) - (WOR[:alpha:]&[:alnum:]00))"
How can I use the function so that it returns-
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
str_replace_all(string, "([:alpha:]) & ([:alnum:])", "\\1&\\2")
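The replacement string is inserted as-is apart from backreferences such as \\1, which is why the class names appeared literally in the output. A minimal base R sketch of the same capture-group idea, assuming the POSIX bracket form [[:alpha:]] that base R expects:

```r
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"

# Group 1 keeps the letter before " & ", group 2 keeps the
# letter or digit after it; only the two spaces are dropped
output <- gsub("([[:alpha:]]) & ([[:alnum:]])", "\\1&\\2", string)

print(output)
# "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
```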
Your requirement is easy enough to handle using gsub with lookarounds:
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
output <- gsub("(?<=\\w) & (?=\\w)", "&", string, perl=TRUE)
output
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
Here is a brief explanation of the regex:
(?<=\\w) assert that what precedes is a word character
[ ]&[ ] then match a space, followed by `&`, followed by another space
(?=\\w) assert that what follows is also a word character
Then, we replace with just a single &, with no spaces on either side.
Here is another option: use regex lookarounds to match one or more whitespace characters (\\s+) either following or preceding a & and replace them with an empty string ("")
gsub("(?<=&)\\s+|\\s+(?=&)", "", string, perl = TRUE)
#[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"

remove pattern before digits and keep those digits

I have a string
text = "Math\n \n \n 600 rubles / 45 min."
text2 = "Math\n \n \n in a group"
And I want to replace \n \n \n with " " only if digits are following.
As a result, I want to have:
"Math 600 rubles / 45 min."
"Math\n \n \n in a group"
I tried gsub("\n \n \n [\\d]", " ", text), but it replaces the first digit too.
You may use a pattern that will match 3 occurrences of \n followed with 6+ spaces and then capture the digit and replace with a backreference to the Group 1:
gsub("(?:\n {6,}){3}(\\d)", " \\1", text)
See the R demo
Details
(?:\n {6,}){3} - 3 consecutive occurrences of:
\n - a newline
" {6,}" - 6 or more spaces (a literal space followed by the quantifier)
(\\d) - Group 1 (referred to with \1 from the replacement pattern): any digit.
I came up with the following pattern:
gsub("\\n[[:blank:]]*\\n[[:blank:]]*\\n[[:blank:]]*(\\d+)", " \\1", text)
This pattern matches three newlines, in succession, ending with a number. It allows for an arbitrary and unfixed amount of whitespace between each newline. This makes the match flexible, and helps to avoid a misfire from counting spaces incorrectly (or new incoming data not behaving as you expect).
The main problem I see with your current call to gsub is that you are using fixed-width spaces between the newlines. Also, [\\d] is matched but never used in the replacement, so the digit is consumed and does not show up in the output.
Demo
text =c("Math\n \n \n 600 rubles / 45 min.","Math\n \n \n in a group")
gsub('((\n\\s+){1,})(?=\\d)',' ',text,perl=T)
#[1] "Math 600 rubles / 45 min." "Math\n \n \n in a group"
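A variation on the answers above, sketched under my own assumptions: it requires exactly three newlines (each followed by any number of spaces or tabs) and uses a lookahead so the digit is checked but not consumed. The sample strings below are my reconstruction, and the exact whitespace in the original data may differ.

```r
text <- c("Math\n \n \n 600 rubles / 45 min.",
          "Math\n \n \n in a group")

# (?:\n[[:blank:]]*){3}  three newlines, each followed by optional blanks
# (?=\d)                 only if a digit comes next (not consumed)
cleaned <- gsub("(?:\\n[[:blank:]]*){3}(?=\\d)", " ", text, perl = TRUE)

print(cleaned[1])  # "Math 600 rubles / 45 min."
print(cleaned[2])  # unchanged, still contains the newlines
```

Because the digit is only asserted, no capture group or backreference is needed in the replacement.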

extract string in R using regex

I have this vector:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
I need to extract text until "[" or if there is no "[", then until the "#" character.
result should be
PROD_DB_APP_185b
PROD_DB_APP_SYS
I've tried something like this:
str_match(jvm, ".*\\-([^\\.]*)([.*)|(#.*)")
It's not working. Any ideas?
A sub solution with base R:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
sub("^.*?\\s+-\\s+([^#[]+).*", "\\1", jvm)
See the online R demo
Details:
^ - start of string
.*? - any 0+ chars as few as possible
\\s+-\\s+ - a hyphen enclosed with 1 or more whitespaces
([^#[]+) - capturing group 1 matching any 1 or more chars other than # and [
.* - any 0+ chars, up to the end of string.
Or a stringr solution with str_extract:
str_extract(jvm, "(?<=-\\s)[^#\\[]+")
See the regex demo
Details:
(?<=-\\s) - a positive lookbehind that matches an empty string that is preceded with a - and a whitespace immediately to the left of the current location
[^#\\[]+ - 1 or more chars other than # and [.
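If you would rather not load stringr, the same lookbehind pattern can be sketched in base R with regexpr() and regmatches(); perl = TRUE is needed for the lookbehind. One caveat: regmatches() silently drops elements with no match, so the result can be shorter than the input.

```r
jvm <- c("test - PROD_DB_APP_185b#SERVER01", "uat - PROD_DB_APP_SYS[1]#SERVER2")

# First run of chars other than # and [ that follows "- "
m <- regexpr("(?<=-\\s)[^#\\[]+", jvm, perl = TRUE)
result <- regmatches(jvm, m)

print(result)
# "PROD_DB_APP_185b" "PROD_DB_APP_SYS"
```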

Resources