Removing square brackets and their contents [duplicate] - r

Suppose I have some text like this,
text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")
and I would like to remove (edit: get rid of) all of the text between the [ and ] (and the brackets themselves). What's the best way to do this? Here is my feeble attempt using regex and the stringr package:
str_extract(text, "\\[[a-z]*\\]")
Thanks for any help!

With this:
gsub("\\[[^\\]]*\\]", "", subject, perl=TRUE);
What the regex means:
\[ # '['
[^\]]* # any character except ']' (0 or more times, matching as many as possible)
\] # ']'
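The same breakdown can be kept inside the pattern itself. This is only a sketch, assuming PCRE's extended mode ((?x), available with perl = TRUE), in which unescaped whitespace and # comments in the pattern are ignored. The string s is a shortened, hypothetical version of the question's text:

```r
# Shortened sample input (hypothetical, modelled on the question's text).
s <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# (?x) turns on extended mode, so the pattern can carry its own comments.
pattern <- "(?x)
  \\[        # a literal opening bracket
  [^\\]]*    # zero or more characters other than a closing bracket
  \\]        # a literal closing bracket
"

cleaned <- gsub(pattern, "", s, perl = TRUE)
print(cleaned)
```

The speaker tags are removed but the colons stay, which matches what the plain one-line pattern does.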

The following should do the trick. The ? forces a lazy match, which matches as few . as possible before the subsequent ].
gsub('\\[.*?\\]', '', text)
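To see why the ? matters, here is a small comparison of the greedy and lazy versions. The input s is a made-up two-tag string, not the question's text:

```r
# Hypothetical sample input with two bracketed tags.
s <- "[A]: one [B]: two"

# Greedy: .* runs to the last ']' in the string, so everything from the
# first '[' to the last ']' is removed as a single match.
greedy <- gsub("\\[.*\\]", "", s)

# Lazy: .*? stops at the first ']' it reaches, so each tag is removed
# on its own and the text between the tags survives.
lazy <- gsub("\\[.*?\\]", "", s)

print(greedy)  # ": two"
print(lazy)    # ": one : two"
```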

Here's another approach:
library(qdap)
bracketX(text, "square")

I think this technically answers what you've asked, but you probably want to add a \\: to the end of the regex for prettier text (removing the colon and space).
library(stringr)
str_replace_all(text, "\\[.+?\\]", "")
#> [1] ": We need tax policies that respect the wage earners..."
vs...
str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..."
Created on 2018-08-16 by the reprex package (v0.2.0).
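If you would rather stay in base R, something along these lines should give the same "prettier" result. It is a sketch under my own assumptions: the colon is treated as optional so that tags without one are also removed, and s is a shortened, hypothetical version of the question's text:

```r
# Shortened sample input (hypothetical, modelled on the question's text).
s <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# \\[[^][]*\\]  a bracketed tag
# :?            an optional colon after the tag
# \\s*          any whitespace that follows
cleaned <- trimws(gsub("\\[[^][]*\\]:?\\s*", "", s))

print(cleaned)
```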

No need to use a PCRE regex with a negated character class / bracket expression, a "classic" TRE regex will work, too:
subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "
See the online R demo
Details:
\\[ - a literal [ (must be escaped or used inside a bracket expression like [[] to be parsed as a literal [)
[^][]* - a negated bracket expression that matches 0+ chars other than [ and ] (note that the ] at the start of the bracket expression is treated as a literal ])
] - a literal ] (outside a bracket expression this character is not special in either PCRE or TRE regexps and does not have to be escaped).
If you want to only replace the square brackets with some other delimiters, use a capturing group with a backreference in the replacement pattern:
gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"
See another demo
The (...) parenthetical construct forms a capturing group, and its contents can be accessed with a backreference \1 (as the group is the first one in the pattern, its ID is set to 1).
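The same bracket expression can also be used to pull the bracketed parts out rather than delete them. A minimal base R sketch, assuming you want the contents as a character vector; the variable names tags and inner are mine:

```r
subject <- "Some [string] here and [there]"

# gregexpr() finds every match; regmatches() returns the matched text.
tags <- regmatches(subject, gregexpr("\\[[^][]*\\]", subject))[[1]]

# Strip the enclosing brackets from each match.
inner <- gsub("^\\[|\\]$", "", tags)

print(tags)   # "[string]" "[there]"
print(inner)  # "string"   "there"
```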

Related

Regex issue in R when escaping regex special characters with str_extract

I'm trying to extract the status -- in this case the word "Active" from this pattern:
Status\nActive\nHometown\
Using this regex: https://regex101.com/r/xegX00/1, but I cannot get it to work in R using str_extract. It does seem weird to have dual escapes, but I've tried every possible combination here and cannot get this to work. Any help appreciated!
mutate(status=str_extract(df, "(?<=Status\\\\n)(.*?)(?=\\\\)"))
You can use sub in base R -
x <- "Status\nActive\nHometown\n"
sub('.*Status\n(.*?)\n.*', '\\1', x)
#[1] "Active"
If you want to use stringr, here is a suggestion with str_match which avoids using lookahead regex
stringr::str_match(x, 'Status\n(.*)\n')[, 2]
#[1] "Active"
Your regex fails because you tested it against a wrong text.
"Status\nActive\nHometown" is a string literal that denotes (defines, represents) the following plain text:
Status
Active
Hometown
In regular expression testers, you need to test against plain text!
To match a newline, you can use "\n" (i.e. a line feed char, an LF char), or "\\n", a regex escape that also matches a line feed char.
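The difference between the string literal and the text it denotes is easy to check in R itself. A small sketch; y is a hypothetical string that really does contain a backslash followed by n, which is what the pattern with four backslashes was looking for:

```r
x <- "Status\nActive\nHometown"    # \n is an escape for one LF character
y <- "Status\\nActive\\nHometown"  # \\n is a backslash followed by the letter n

cat(x, "\n")  # prints three lines
cat(y, "\n")  # prints one line with visible \n sequences

len_x <- nchar(x)  # 22: each \n counts as one character
len_y <- nchar(y)  # 24: each \\n counts as two characters

# Only x contains an actual line feed.
has_lf_x <- grepl("\n", x, fixed = TRUE)
has_lf_y <- grepl("\n", y, fixed = TRUE)
```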
You can use
library(stringr)
x <- "Status\nActive\nHometown\n"
stringr::str_extract(x, "(?<=Status\\n).*") ## => [1] "Active"
## or
stringr::str_extract(x, "(?<=Status\n).*") ## => [1] "Active"
See the R demo online and a correct regex test.
Note you do not need an \n at the end of the pattern, as in an ICU regex flavor (used in R stringr regex methods), the . pattern matches any chars other than line break chars, so it is OK to just use .* to match the whole line.

Add a white-space between number and special character condition R

I'm trying to use stringr or base R calls to conditionally add a white-space in a large vector wherever a numeric value is immediately followed by a special character (in this case a $ sign) with no space in between. str_pad doesn't appear to allow for a reference vector.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, then use sub to capture digit and the $ separately and in the replacement add the space between the backreferences of the captured group
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
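With more than one instance per string, sub only fixes the first one, so gsub is the safer choice on a large vector. A quick comparison on a made-up vector v:

```r
# Hypothetical vector: one gap, two gaps, and an already spaced entry.
v <- c("$6.88$7.34", "$1.00$2.00$3.00", "$5.00 $6.00")

# sub() replaces only the first digit-then-$ pair in each element.
first_only <- sub("([0-9])([$])", "\\1 \\2", v)

# gsub() replaces every such pair.
every_one <- gsub("([0-9])([$])", "\\1 \\2", v)

print(first_only)
print(every_one)
```

The third element is left alone by both calls because its $ is preceded by a space, not a digit.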
This will work if you are working with a vectored string:
mystring<-as.vector('$6.88$7.34 $8.34$4.31')
gsub("(?<=\\d)\\$", " $", mystring, perl=TRUE)
[1] "$6.88 $7.34 $8.34 $4.31"
This also leaves alone the cases where a space is already present.
Regarding the question asked in the comments:
mystring2<-as.vector('Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly')
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl=T)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ appears due to nested quotes in the vector, should not make a difference. Also <U+2020> appears due to encoding the special character.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive look-behind, introduced by ?<=. It checks that the text immediately before the current position matches what we define inside it, without making that text part of the match. In this case we are looking for [[:alpha:]], which matches an alphabetic character.
We then check for a blank space with \s, in R we have to use a double escape so \\s, this is what we are trying to match.
Finally we use (?=[[:alpha:]]+), which is a positive look-ahead defined by ?= that checks to make sure our match is followed by another letter as explained above.
The logic is to find a blank space between letters, and match the space, which then is replaced by gsub, with a _
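The lookarounds matter because they do not consume the letters on either side of the space. A small sketch comparing them with plain capturing groups on a made-up string s:

```r
# Hypothetical input with single-letter words and a digit.
s <- "a b c 1 d"

# Lookarounds only inspect the neighbours, so every space between two
# letters is replaced.
with_lookaround <- gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]])", "_", s, perl = TRUE)

# Capturing groups consume the letters, so the "b" used by the first
# match is no longer available as the left neighbour of the next space.
with_groups <- gsub("([[:alpha:]])\\s([[:alpha:]])", "\\1_\\2", s, perl = TRUE)

print(with_lookaround)  # "a_b_c 1 d"
print(with_groups)      # "a_b c 1 d"
```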
See all the regex here

Regex in R (Invalid use of repetition operators) expression doesn't work after escaping

I have a document with text like this that I'm trying to select a certain piece with regex:
Section I - Live Animals; Animal Products (Chapter 1-5) Chapter 1 Live Animals
I'm using this expression, which works outside of R:
Section\s[A-Z]+\s?-\s[^(]*+\(Chapter\s[0-9]+-[0-9]+\)
This is how I've written the expression in R (have escaped the + after getting the Invalid use of repetition operators error), but the expression doesn't work - nothing happens. If anyone can see anything I'm missing here it would be much appreciated.
Section\\s[A-Z]\\+\\s?-\\s[^(]*\\+\\(Chapter\\s[0-9]+-[0-9]\\+\\)
I'm trying to select and remove the text like this:
df=data.frame(x="Section I - Live Animals; Animal Products (Chapter 1-5) Chapter 1 Live Animals ")
df=gsub("Section\\s[A-Z]\\+\\s?-\\s[^(]*\\+\\(Chapter\\s[0-9]+-[0-9]\\+\\)", "", df$x)
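TRE regex does not support possessive quantifiers, so the *+ quantifier is not valid. You want the plain * quantifier: do not escape the +, just remove it. (Escaping it, as in \\+, makes the pattern look for a literal + character, which is why nothing matched.)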
Also, it makes sense to trim the output, so I suggest using
df <- trimws(gsub("Section\\s[A-Z]+\\s?-\\s[^(]*\\(Chapter\\s[0-9]+-[0-9]+\\)", "", df$x))
## => [1] "Chapter 1 Live Animals"
See the R demo online.
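If you would rather keep the expression exactly as it worked outside of R, switching the engine should also do it, since PCRE accepts possessive quantifiers. A sketch assuming perl = TRUE is acceptable in your code:

```r
x <- "Section I - Live Animals; Animal Products (Chapter 1-5) Chapter 1 Live Animals "

# Original pattern with the possessive *+ kept; only the backslashes are
# doubled for the R string literal.
pattern <- "Section\\s[A-Z]+\\s?-\\s[^(]*+\\(Chapter\\s[0-9]+-[0-9]+\\)"

# perl = TRUE selects the PCRE engine, which understands *+.
result <- trimws(gsub(pattern, "", x, perl = TRUE))

print(result)
```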

R Capturing String inside Brackets

I'm trying to parse some of my chess pgn data but I'm having some trouble capturing characters just inside one bracket.
testString <- '[Event \"?\"]\n[Site \"http://www.chessmaniac.com play free chess\"]\n[Date \"2018.08.25\"]\n[Round \"-\"]\n[White \"NothingFancy 1497\"]\n[Black \"JR Smith 1985\"]\n[Result \"1-0\"]\n\n1.'
#Attempt to just get who white is, which is inside a bracket [White xxx]
findWhite <- regexpr('\\[White.*\\]', testString)
regmatches(testString, findWhite)
The stringr package seems to do what I want, but I'm curious what is different about the use of the same regular expression. I'm fine using stringr, but I like to also know how to do this in base R.
library(stringr)
str_extract(testString, '\\[White.*\\]')
If you need the whole match starting with [White and ending with ] you may use
regmatches(testString, regexpr("\\[White\\s*[^][]*]", testString))
[1] "[White \"NothingFancy 1497\"]"
If you only need the substring inside double quotes:
regmatches(testString, regexpr("\\[White\\s*\\K[^][]*", testString, perl=TRUE))
[1] "\"NothingFancy 1497\""
See the regex demo.
To strip the double quotes, you may use something like
regmatches(testString, regexpr('\\[White\\s*"\\K.*(?="])', testString, perl=TRUE))
[1] "NothingFancy 1497"
See another regex demo and an online R demo.
Details
\\[ - a [ char
White - a literal substring
\\s* - 0+ whitespaces
\\K - match reset operator discarding the text matched so far
[^][]* - 0+ chars other than [ and ]
.* (in the other version) - matches any 0+ chars other than line break chars, as many as possible
(?="]) - a positive lookahead that matches a position inside a string that is immediately followed with "].
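As for why the same expression behaves differently in base R and stringr: as far as I know, the default TRE engine lets . match line break chars, while PCRE and ICU (used by stringr) do not by default. A base R sketch on a shortened, hypothetical two-line string s:

```r
# Two header lines, shortened from the question's PGN data (hypothetical).
s <- "[White \"A\"]\n[Black \"B\"]"

# Default TRE engine: . also matches the newline, so the greedy .* runs
# on to the last ']' in the whole string.
tre <- regmatches(s, regexpr("\\[White.*\\]", s))

# PCRE engine: . stops at the newline, so the match stays on one line.
pcre <- regmatches(s, regexpr("\\[White.*\\]", s, perl = TRUE))

print(tre)
print(pcre)
```

That is why the [^][]* version above is safer: it never relies on what . matches.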
At least one way to do it in base R is to use sub and only keep the part that you want.
sub(".*\\[White\\s(.*?)\\].*", "\\1", testString)
[1] "\"NothingFancy 1497\""

Negation in R, how can I replace words following a negation in R?

I'm following up on a question that has been asked here about how to add the prefix "not_" to a word following a negation.
In the comments, MrFlick proposed a solution using a regular expression gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).
I would like to edit this regular expression in order to add the not_ prefix to all the words following "not" or "n't" until there is some punctuation.
If I'm editing cptn's example, I'd like:
x <- "They didn't sell the company, and it went bankrupt"
To be transformed into:
"They didn't not_sell not_the not_company, and it went bankrupt"
Can the use of backreference still do the trick here? If so, any example would be much appreciated. Thanks!
You may use
(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b
and replace with not_\1. See the regex demo.
Details
(?:\bnot|n't|\G(?!\A)) - any one of three alternatives:
\bnot - whole word not
n't - n't
\G(?!\A) - the end of the previous successful match position
\s+ - 1+ whitespaces
\K - match reset operator that discards the text matched so far
(\w+) - Group 1 (referenced to with \1 from the replacement pattern): 1+ word chars (digits, letters or _)
\b - a word boundary.
R demo:
x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
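To see what the \G(?!\A) alternative contributes, compare the pattern with and without it. This is just an illustration; the names single and chained are mine:

```r
x <- "They didn't sell the company, and it went bankrupt"

# Without \G: only the word directly after the negation is prefixed.
single <- gsub("(?:\\bnot|n't)\\s+\\K(\\w+)\\b", "not_\\1", x, perl = TRUE)

# With \G(?!\A): each match may also start where the previous one ended,
# so the prefixing continues word by word until something other than
# whitespace (here the comma) follows.
chained <- gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl = TRUE)

print(single)
print(chained)
```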
First you should split the string on the punctuation you want. For example:
x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
[[1]]
[1] "They didn't sell the company" " and it went bankrupt" " Then something else"
and then apply the regex to every element of the list x_split. Finally merge all the pieces (if needed).
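A rough sketch of the split, apply, merge idea. Note two assumptions of mine: the pieces are collected with gregexpr instead of strsplit so that the punctuation stays attached and merging is a plain paste, and negate_piece is a hypothetical helper, not an existing function:

```r
# Hypothetical helper: prefix every word after the first negation in a piece.
negate_piece <- function(piece) {
  m <- regexpr("\\bnot\\b|n't", piece, perl = TRUE)
  if (m == -1) return(piece)                        # no negation here
  after  <- as.integer(m) + attr(m, "match.length") # first char after negation
  before <- substr(piece, 1, after - 1)
  rest   <- substr(piece, after, nchar(piece))
  paste0(before, gsub("(\\w+)", "not_\\1", rest, perl = TRUE))
}

x <- "They didn't sell the company, and it went bankrupt. Then something else"

# Each piece runs up to and including a comma or full stop; the last
# alternative picks up trailing text without punctuation.
pieces <- regmatches(x, gregexpr("[^,.]*[,.]|[^,.]+$", x))[[1]]

# Apply the helper to every piece, then merge the pieces back together.
result <- paste(vapply(pieces, negate_piece, character(1)), collapse = "")

print(result)
```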
This is not ideal, but gets the job done:
x <- "They didn't sell the company, and it did not go bankrupt. That's it"
gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s",
" not_", x,
perl = TRUE)
# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
Notes:
This uses the (*SKIP)(*FAIL) trick to skip any text you don't want the regex to match. It basically replaces every space with not_, except for spaces that fall between:
Start of string or punctuation and "not" or "n't" or
Punctuation and Punctuation (not followed by space) or end of string
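If the trick is new to you, a smaller case may help. In this sketch (my own example, not part of the answer above) every space is replaced except the ones inside square brackets:

```r
# Hypothetical input: one bracketed chunk that should be left untouched.
s <- "a b [c d] e"

# Left of |  : match a whole bracketed chunk, then (*SKIP)(*FAIL) discards
#              it and resumes the search after it.
# Right of | : what we actually want to replace, a whitespace character.
result <- gsub("\\[[^][]*\\](*SKIP)(*FAIL)|\\s", "_", s, perl = TRUE)

print(result)  # "a_b_[c d]_e"
```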

Resources