gsub specific pattern and position in character string - r

This is probably a fairly easy fix, but I'm not as good with regular expressions as I'd like to be, so help is appreciated. I have looked elsewhere and nothing is working for me.
I am trying to standardize some names of university degrees. I need the following format:
Degree Code - Major Name
E.g. "BA - Computer Stuff"
I.e. a word, single space, dash, single space, word.
My attempt does not recognize multiple spaces on one or both sides of the dash, and if it sees no spaces, it replaces the letters on either side of the dash with a lowercase s, whereas I thought that \s meant whitespace and that a space would be substituted.
This one bit of format fixing is part of a larger mutate statement, i.e. a standalone single line with brackets, as in the examples elsewhere, will not work for me.
I have example data:
data <- data.frame(var = c("BA-English", "BA - English", "BA - Chemistry", "BS - Rubber Chickens"))
data %>%
  mutate(var = gsub("\\w\\S-\\S\\w", "\\w\\s-\\s\\w", var)) -> var_fix
Any help is very much appreciated. Thank you

You can use
gsub("\\s*-\\s*", " - ", var)
## Or, if the hyphen is in between word chars
gsub("\\b\\s*-\\s*\\b", " - ", var)
Details:
\b - a word boundary
\s* - zero or more whitespaces
- - a hyphen
Note: in case you want to normalize hyphens, you can also consider using gsub("(*UCP)\\s*[\\p{Pd}\\x{00AD}\\x{2212}]\\s*", " - ", var, perl=TRUE) / gsub("(*UCP)\\b\\s*[\\p{Pd}\\x{00AD}\\x{2212}]\\s*\\b", " - ", var, perl=TRUE), where (*UCP) makes the word boundary and whitespace patterns Unicode-aware, \p{Pd} matches any Unicode dash, \x{00AD} matches a soft hyphen and \x{2212} matches a minus symbol.
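As a quick check against the sample data from the question, here is a minimal sketch in base R. The two extra rows with irregular spacing and the var_fix column name are assumptions added for illustration; inside a dplyr pipeline the same gsub() call can be placed directly in mutate().

```r
# Sample data from the question, plus two rows with irregular spacing
# (the extra rows are illustrative assumptions, not from the original post).
data <- data.frame(
  var = c("BA-English", "BA - English", "BA - Chemistry",
          "BS - Rubber Chickens", "MS   -Physics", "BA-  History"),
  stringsAsFactors = FALSE
)

# Replace a hyphen plus any surrounding whitespace with " - ".
data$var_fix <- gsub("\\s*-\\s*", " - ", data$var)

print(data$var_fix)
```

Note that the replacement argument is treated as literal text apart from backreferences such as \\1, so class escapes like \\s are not interpreted there. That is why the replacement has to contain real spaces.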

Related

Apostrophes and regular expressions; Cleaning text in R

I am working on cleaning a large collection of text. My process thus far is:
Remove any non-ASCII characters
Remove URLs
Remove email addresses
Correct kerning (i.e., "B A D" becomes "BAD")
Correct elongated words (i.e., "baaaaaad" becomes "bad")
Ensure there is a space after every comma
Replace all numerals and punctuation with a space - except apostrophes
Remove any term 22 characters or longer (anything this size is likely garbage)
Remove any single letters that are leftover
Remove any blank lines
My issue is in the next-to-last step. Originally, my code was:
gsub(pattern = "\\b\\S\\b", replacement = "", perl = TRUE)
but this wrecked any contractions that were left (that I left in on purpose). Then I tried
gsub(pattern = "\\b(\\S^'\\s)\\b", replacement = "", perl = TRUE)
but this left a lot of single characters.
Then I realized that I needed to keep three single-letter words: "A", "I", and "O" (either case).
Any suggestions?
You can use
gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl=TRUE)
Details:
(?i) - case insensitive matching on
\b - a word boundary
(?<!') - no ' is allowed immediately on the left
(?![AOI]) - the next char cannot be A, I, or O
\p{L} - any Unicode letter
\b - a word boundary
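A minimal sketch of how this could be applied follows. The sample strings and the variable names are made up for illustration. Because removing a letter leaves its surrounding spaces behind, the sketch also collapses repeated whitespace afterwards; that extra step is an assumption, not part of the answer above.

```r
# Hypothetical sample input; not from the original post.
x <- c("I can't see a b c", "O say can u see", "it's q time")

# Drop single letters except A, I, O (either case) and except letters
# that directly follow an apostrophe (the tail of a contraction).
no_singles <- gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl = TRUE)

# Collapse the leftover runs of whitespace and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", no_singles))

print(cleaned)
```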

Extract n words after a pattern word

This is my first time attempting to extract a string using gsub and regular expressions in R. I would like to extract three words after the first occurrence of the word "at" or "around" in each cell of a text column (col in example) and place the extraction into a new column (new_extract).
What I have thus far is the following:
df$new_extract <- gsub(".*at(\\w{1,}){3}).*", "\\1", df$col, perl = TRUE)
Any advice on changes / different approaches welcomed!
Your regex attempts to match words only after the last at, because the leading .* is greedy. Also, since there is nothing in the pattern to match the gap between at or around and the words that follow (you are not trying to match around at all, by the way), your pattern will not extract any words in the end.
I suggest this approach with sub:
sub(".*?\\ba(?:t|round)\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", df$col, perl=TRUE)
Here,
.*? - matches from the start, any zero or more chars other than line break chars as few as possible
\ba - a word boundary and then a
(?:t|round) - t or round
\W+ - one or more non-word chars
(\w+(?:\W+\w+){0,2}) - Group 1: one or more word chars and then zero, one or two occurrences of one or more non-word chars followed with one or more word chars
.* - any zero or more chars other than line break chars as many as possible.
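The pattern can be tried on a small data frame as in the sketch below. The sample sentences are invented for illustration; the col and new_extract names follow the question. Keep in mind that sub() returns the input unchanged when the pattern does not match, so rows without "at" or "around" need separate handling if that matters.

```r
# Hypothetical sample data; only the column names come from the question.
df <- data.frame(
  col = c("Meet me at the old oak tree tonight",
          "We arrived around five in the evening",
          "Look at me",
          "No keyword here"),
  stringsAsFactors = FALSE
)

# Capture up to three words after the first "at" or "around".
df$new_extract <- sub(
  ".*?\\ba(?:t|round)\\W+(\\w+(?:\\W+\\w+){0,2}).*",
  "\\1", df$col, perl = TRUE
)

print(df$new_extract)
```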

How to do a find/replace with a regular expression in stringr

Given a string like 'run- ning' I would like to replace 'n- n' by 'nn' in order to obtain 'running'.
Using the stringr package I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "[:alpha:][:alpha:]")
but it seems not to work that way. I guess variables need to be used, but I could not figure out how exactly.
I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "\\0\\1")
but that does not give the desired result either.
Any ideas?
You may use
stringr::str_replace_all(s, "(?<=\\p{L})- +(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})- +(\\p{L})", "\\1\\2")
Or, to match any horizontal whitespace chars
stringr::str_replace_all(s, "(?<=\\p{L})-\\h+(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})-\\h+(\\p{L})", "\\1\\2")
Base R equivalent:
gsub("(?<=\\p{L})-\\h+(?=\\p{L})", "", s, perl=TRUE)
gsub("(\\p{L})-\\h+(\\p{L})", "\\1\\2", s, perl=TRUE)
gsub("([[:alpha:]])-\\s+([[:alpha:]])", "\\1\\2", s)
Details
(?<=\p{L}) - a positive lookbehind that matches a location immediately preceded with any Unicode letter
- + - a hyphen followed with 1 or more spaces (\h matches any horizontal whitespace)
(?=\p{L}) - a positive lookahead that matches a location immediately followed with any Unicode letter.
(\p{L}) - a capturing group that matches any letter.
The \1\2 in the replacement patterns in the examples using capturing groups are backreferences to the corresponding capturing group values.
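To compare the two styles side by side, here is a small base R sketch. The extra test strings and the variable names are assumptions for illustration; they show that a hyphen without following whitespace, or one between digits, is left alone.

```r
# "run- ning" is from the question; the other two strings are illustrative.
s <- c("run- ning", "well-known", "9 - 3")

# Capturing groups: keep both letters, drop the hyphen and whitespace.
joined_groups <- gsub("(\\p{L})-\\h+(\\p{L})", "\\1\\2", s, perl = TRUE)

# Lookarounds: the letters are only asserted, so the replacement is empty.
joined_look <- gsub("(?<=\\p{L})-\\h+(?=\\p{L})", "", s, perl = TRUE)

print(joined_groups)
print(joined_look)
```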

Replace only space between two words, not between words and symbol or words and numbers

I'm trying to use the stringr package in R to identify space(s) between words but not space(s) between words and symbols (or vice versa) or words and numbers (or vice versa), or symbols and numbers.
Based on what I could find it seems like [A-Za-z][:space:][a-zA-Z] should work. I'm obviously missing something but not sure what.
I've tried the stringr syntax with [A-Za-z][:space:][a-zA-Z], as well as regex(?) syntax for the spaces such as [A-Za-z]\s+[a-zA-Z]
str_replace_all(x, [A-Za-z][:space:][a-zA-Z], "_")
Sometimes an error I would get is "Error in rep(value, length.out = nrows) : attempt to replicate an object of type 'closure'"
You may use
str_replace_all(x, "(?<=\\p{L})\\s(?=\\p{L})", "_")
gsub("(?<=\\p{L})\\s(?=\\p{L})", "_", x, perl=TRUE)
Or, if there are 1 or more spaces to be replaced with 1 _,
str_replace_all(x, "(?<=\\p{L})\\s+(?=\\p{L})", "_")
gsub("(?<=\\p{L})\\s+(?=\\p{L})", "_", x, perl=TRUE)
Details
(?<=\p{L}) - a positive lookbehind that matches a location that is immediately preceded with any letter
\s - a whitespace (\s+ matches 1+ whitespaces)
(?=\p{L}) - a positive lookahead that matches a location that is immediately followed with any letter.
NOTE:
You should wrap the regex pattern with quotes to form a string literal
If you want to only support ASCII letters, you may replace \\p{L} with [A-Za-z].
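The difference between \s and \s+ can be seen in a short base R sketch. The sample vector and the variable names are made up for illustration.

```r
# Hypothetical sample input covering letters, symbols and digits.
x <- c("hello world", "a + b", "item 42", "foo   bar")

# Exactly one whitespace char between two letters.
single <- gsub("(?<=\\p{L})\\s(?=\\p{L})", "_", x, perl = TRUE)

# One or more whitespace chars between two letters, replaced with one "_".
runs <- gsub("(?<=\\p{L})\\s+(?=\\p{L})", "_", x, perl = TRUE)

print(single)
print(runs)
```

With the single \s, "foo   bar" stays unchanged because none of its three spaces has a letter on both sides.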

Regular expression to find the last hyphen, then move two spaces to the right and delete everything from there rightward

I have a series of strings that I would like to use regular expressions to compress.
1 617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx
2 517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx
3 47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx
I would like the result of the regular expression to be the following compressed strings:
1 Nomination-DC2019-08-08
2 Debate2019-08-08
3 House2019-08-08
Basically, the logic I'm looking for is to find the last hyphen, then move two spaces to the right and delete everything from there rightward. I'm undertaking this in R.
Update: I tried the following workflow, which addressed my issue. h/t to #brittenb for identifying the very useful tools::file_path_sans_ext()
x<-tools::file_path_sans_ext(x)
x<-str_replace(x, " .*", "")
x<-str_replace(x,".*\\_", "")
However, if anyone has a one line regex solution to this that would be great.
Update 2: h/t #WiktorStribiżew for identifying two one-liner solutions:
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
sub(".*_([^.[:space:]]+).*", "\\1", x)
You may simplify the task if you use tools::file_path_sans_ext() to extract the file name without extensions first and then grab all non-whitespace chars from the last _:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
library(stringr)
str_extract(tools::file_path_sans_ext(x), "(?<=_)[^_\\s]+(?=[^_]*$)")
See the R demo. The (?<=_)[^_\\s]+(?=[^_]*$) regex matches a location after _, then matches 1+ chars other than _ and whitespaces and then asserts there are 0+ chars other than _ up to the end of string.
You may achieve what you need without extra libraries:
sub(".*_([^.[:space:]]+).*", "\\1", x)
With stringr:
str_replace(x, ".*_([^.\\s]+).*", "\\1")
Details
.*_ - any 0+ chars as many as possible to the last occurrence of the subsequent patterns starting with _
([^.[:space:]]+) - Capturing group 1 (its value is referenced with the \1 placeholder, or replacement backreference, in the replacement pattern): 1+ chars other than a dot and whitespace (note that \s does not denote whitespace inside [...] in a TRE regex, but it does in an ICU regex in stringr regex functions)
.* - any 0+ chars as many as possible.
Full code snippet:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
sub(".*_([^.[:space:]]+).*", "\\1", x)
library(stringr)
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
Both yield
[1] "Nomination-DC2019-08-08" "Debate2019-08-08"
[3] "House2019-08-08"
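For completeness, the lookaround pattern used with str_extract() above can also be run without stringr. The sketch below uses regexpr() and regmatches() with perl = TRUE; the stem, m and result names are just illustrative. One caveat: regmatches() drops elements that do not match, so the result can be shorter than the input.

```r
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx",
       "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx",
       "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")

# Strip the extension first, then extract the non-whitespace run
# that follows the last underscore.
stem <- tools::file_path_sans_ext(x)
m <- regexpr("(?<=_)[^_\\s]+(?=[^_]*$)", stem, perl = TRUE)
result <- regmatches(stem, m)

print(result)
```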

Resources