how to remove all characters except for those enclosed by parentheses - r

Suppose you have a text file like the one below:
I have apples, bananas, ( some pineapples over 4 ), and cherries ( coconuts with happy face :D ) and so on. You may help yourself except for cherries ( they are for my parents sorry ;C ) . I feel like I can run a fruit business.
What I aim to do is erase all characters except those enclosed in parentheses. Keep in mind that the text inside a pair of parentheses can vary, from English to other characters, but only parentheses count as enclosing characters; no other punctuation does.
I think I should use gsub, but I am not sure.
This is what I want to have as a result.
( some pineapples over 4 ) ( coconuts with happy face :D ) ( they are for my parents sorry ;C )
Either removing or extracting is fine, as long as I get the result above.

We can do this by extracting the substrings within the parentheses and pasting them together:
library(stringr)
paste(str_extract_all(str1, "\\([^)]*\\)")[[1]], collapse=' ')
#[1] "( some pineapples over 4 ) ( coconuts with happy face :D ) ( they are for my parents sorry ;C )"
Or we can use a gsub based solution
trimws(gsub("\\s+\\([^)]*\\)(*SKIP)(*FAIL)|.", "", str1, perl = TRUE))
#[1] "( some pineapples over 4 ) ( coconuts with happy face :D ) ( they are for my parents sorry ;C )"
data
str1 <- "I have apples, bananas, ( some pineapples over 4 ), and cherries ( coconuts with happy face :D ) and so on. You may help yourself except for cherries ( they are for my parents sorry ;C ) . I feel like I can run a fruit business."
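If loading stringr is not an option, the same extract-and-paste idea can be sketched in base R. This is a minimal sketch that assumes the parentheses are never nested; `matches` and `result` are names introduced here for illustration.

```r
# Sample string from the question
str1 <- "I have apples, bananas, ( some pineapples over 4 ), and cherries ( coconuts with happy face :D ) and so on. You may help yourself except for cherries ( they are for my parents sorry ;C ) . I feel like I can run a fruit business."

# gregexpr() locates every "( ... )" span; regmatches() pulls out the matched text
matches <- regmatches(str1, gregexpr("\\([^)]*\\)", str1))[[1]]

# Join the extracted pieces with single spaces
result <- paste(matches, collapse = " ")
print(result)
```

Because `regmatches()` returns a list with one element per input string, the `[[1]]` picks out the matches for the single string used here.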

Related

Find closing parenthesis with regex in r

I have several strings with opening and closing parentheses that are not always matched. I managed to remove an opening parenthesis when there is no closing one, but I cannot manage to remove a closing parenthesis when there is no opening one. I want to leave strings with matching parentheses alone.
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is how I was able to handle the first problem case (an opening parenthesis but no closing one):
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parentheses, not only the one without an opening parenthesis.
"(?!.*[(])[)]"
Any ideas are appreciated!
If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern that is recursed
\) - a ) char
|[()] - or a ( / ) char.
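To see how the two patterns differ, the sketch below runs both on a few made-up strings (the vector `x` and the result names are illustrative, not from the question). The expected behaviour is that the non-recursive pattern strips the outer pair of a nested group, while the recursive one keeps it.

```r
# Hypothetical test strings: one balanced nested group, one stray "(", one stray ")"
x <- c("keep (a (b) c) whole", "drop (a (b) c", "drop a (b) c)")

flat_pattern   <- "(\\([^()]*\\))|[()]"             # no nesting support
nested_pattern <- "(\\((?:[^()]++|(?1))*\\))|[()]"  # recurses into Group 1

# Balanced groups are put back via \\1; any other parenthesis is dropped
flat_result   <- gsub(flat_pattern, "\\1", x)
nested_result <- gsub(nested_pattern, "\\1", x, perl = TRUE)

print(flat_result)
print(nested_result)
```

Only the first string differs between the two results: the flat pattern treats the outer parentheses as unmatched because it cannot see past the inner pair.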

Regex to match only semicolons not in parenthesis [duplicate]

I have the following string:
Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews
I want to replace the semicolons that are not inside parentheses with commas. There can be any number of nested parentheses and any number of semicolons within them, and the result should look like this:
Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews
This is my current code:
x <- "Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews"
gsub(";(?![^(]*\\))", ",", x, perl = TRUE)
[1] "Almonds , Roasted Peanuts (Peanuts, Canola Oil (Antioxidants (319; 320)); Salt), Cashews"
The problem I am facing is that if there is a nested () inside a larger pair of parentheses, my regex still replaces the semicolon before it with a comma.
Can I please get some help on regex that will solve the problem? Thank you in advance.
The pattern ;(?![^(]*\)) matches a semicolon and asserts that what follows is not a ) with no ( in between.
That assertion is also true when a nested opening parenthesis comes before the next ), so the ; inside the outer parentheses is still matched.
You could use a recursive pattern to match the nested parentheses, which is the part you don't want to change, and then discard that match with a SKIP FAIL approach.
The remaining alternative then matches the semicolons outside the parentheses, which you can replace with a comma.
[^;]*(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|;
In parts, the pattern matches
[^;]* Match 0+ times any char except ;
( Capture group 1
\( Match the opening (
(?> Atomic group
[^()]+ Match 1+ times any char except ( and )
| Or
(?1) Recurse the whole first sub pattern (group 1)
)* Close the atomic group and optionally repeat
\) Match the closing )
) Close group 1
(*SKIP)(*F) Skip what is matched
| Or
; Match a semicolon
See a regex demo and an R demo.
x <- c("Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")
gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;",",",x,perl=TRUE)
Output
[1] "Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews"
[2] "Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)"
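For comparison, here is a rough sketch of a different technique that avoids recursive regex entirely: walk the string one character at a time and track the parenthesis depth. This is not the approach from the answer above, and the function name `replace_top_level_semicolons` is hypothetical. It assumes the parentheses are balanced; a stray ) is simply ignored.

```r
# Replace ";" with "," only when it sits outside every pair of parentheses
replace_top_level_semicolons <- function(s) {
  chars <- strsplit(s, "")[[1]]   # split into single characters
  depth <- 0L                     # current nesting depth
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      depth <- depth + 1L
    } else if (ch == ")") {
      depth <- max(depth - 1L, 0L)  # never go below zero on a stray ")"
    } else if (ch == ";" && depth == 0L) {
      chars[i] <- ","
    }
  }
  paste(chars, collapse = "")
}

x <- c("Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
       "Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")
out <- vapply(x, replace_top_level_semicolons, character(1), USE.NAMES = FALSE)
print(out)
```

The loop is slower than a single gsub call on long vectors, but it may be easier to adapt, for example to several bracket types.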

Finding a word with condition in a vector with regex on R (perl)

I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )
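(?<=[no] )RT matches any RT that is immediately preceded by "n " or "o ", because [no] is a character class that matches a single n or o. Likewise, [^no] matches any one character other than n or o, so (?=[^no].*)RT only checks that the character at the current position (the R itself) is not n or o, which is always true.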
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no no immediately to the left of the current location, and only then RT is consumed.
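A short sketch of both lookbehinds applied with grep. The vector `aaa` is the one from the question; the string "piano RT" is a made-up extra case, added only to show where the \\b version behaves differently.

```r
aaa <- c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+", "npo RT")

# Indices of elements containing RT that is not directly preceded by "no "
idx <- grep("(?<!no )RT", aaa, perl = TRUE)
print(aaa[idx])

# Hypothetical extra cases: "piano RT" ends in "no " but "no" is not a whole word
extra <- c("no RT", "piano RT")
plain_hit <- grepl("(?<!no )RT", extra, perl = TRUE)     # rejects both strings
word_hit  <- grepl("(?<!\\bno )RT", extra, perl = TRUE)  # rejects only the whole word "no"
print(plain_hit)
print(word_hit)
```

Note that "npo RT" is kept by both patterns, since the three characters before RT are "po " rather than "no ".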

Python operators of unequality

I have some code that was not written by me. I am learning Python and am at the very basics.
The code is below:
ten_things = "Apples Oranges Crows Telephone Light Sugar"
print("Wait there are not 10 things in that list. Let's fix that.")
stuff = ten_things.split(' ')
more_stuff = ["Day", "Night", "Song", "Frisbee",
              "Corn", "Banana", "Girl", "Boy"]
while len(stuff) != 10:
    next_one = more_stuff.pop()
    print("Adding: ", next_one)
    stuff.append(next_one)
    print(f"There are {len(stuff)} items now.")
print("There we go: ", stuff)
print("Let's do some things with stuff.")
print(stuff[1])
print(stuff[-1]) # whoa! fancy
print(stuff.pop())
print(' '.join(stuff)) # what? cool!
print('#'.join(stuff[3:5])) # super stellar!
My question: we use while len(stuff) != 10, but the loop only runs until the list has 10 items. I expected it to skip 10 and keep going until it pops the whole list. Why does it stop at 10? The lengths 11, 12, 13 and 14 are also not equal to 10, so it should include them too.
Ideally, it should be while len(stuff) => 10: but we used !=, which also means not equal.
Can anyone help me?

Lower Case Certain Words R

I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
What I need is this:
movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')
Is there an elegant way to do this without using a long series of gsub(), as in:
movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies_updated)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies_updated)
movies_updated = gsub(' The', ' the', movies_updated)
And so on.
In effect, it appears that you are interested in converting your text to title case. This can be easily achieved with use of the stringi package, as shown below:
>> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words" "Out Of The Furnace"
An alternative approach would make use of the toTitleCase function available in the tools package:
>> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words" "Out of the Furnace"
Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
'Me And Earl And The Dying Girl')
gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me And Earl And the Dying Girl"
The tricks of the regular expression:
(?<!^) ensures we don't match a word at the beginning of a string. Without this, the first The of movies 1 and 2 will be down-cased.
\\b sets up word-boundaries, such that in in the middle of Dying will not match. This is slightly more robust than your use of space, since hyphens, commas, etc, will not be spaces but do indicate the beginning/end of a word.
(of|in|the) matches any one of of, in, or the. More patterns can be added with separating pipes |.
Once identified, it's as simple as replacing them with down-cased versions.
Another example of how to turn certain words to lower case with gsub (with a PCRE regex):
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)
See the R demo
Details:
(?!^) - not at the start of the string (it does not matter if we use a lookahead or lookbehind here since the pattern inside is a zero-width assertion)
\\b - find leading word boundary
(Of|In|The) - capture Of or In or The into Group 1
\\b - assure there is a trailing word boundary.
The replacement contains the lowercasing operator \L that turns all the chars in the first backreference value (the text captured into Group 1) to lower case.
Note that this can be a more flexible approach than using tools::toTitleCase. The part of that function's code that keeps specific words in lower case is:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
If you only need to apply lowercasing and do not care about the other logic in the function, it might be enough to add these alternatives (do not use ^ and $ anchors) to the regex at the top of the post.
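The suggestion above can be sketched by building the alternation from a word vector instead of typing it by hand. The names `small_words` and `movies_updated`, and the choice of words, are illustrative assumptions; extend the vector as needed. The inline (?i) flag makes the match case-insensitive.

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# Words to keep in lower case when they are not the first word (assumed list)
small_words <- c("of", "in", "the", "and")

# (?!^) skips the start of the string; \\b keeps "in" inside "Dying" from matching
pattern <- paste0("(?i)(?!^)\\b(", paste(small_words, collapse = "|"), ")\\b")

# \\L lower-cases the text captured in Group 1 (perl = TRUE only)
movies_updated <- gsub(pattern, "\\L\\1", movies, perl = TRUE)
print(movies_updated)
```

Keeping the word list in a vector means new words can be added without editing the regex itself.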

Resources