Pattern to match only characters within parentheses - r

I have looked at lots of posts here on SO with suggestions on REGEX patterns to grab texts from parentheses. However, from what I have looked into I cannot find a solution that works.
For example, I have had a look at the following:
R - Regular Expression to Extract Text Between Parentheses That Contain Keyword, Extract text in parentheses in R, regex to pickout some text between parenthesis [duplicate]
Here are the top answers' solutions, in the same order (with some amendments):
pattern1= '\\([^()]*[^()]*\\)'
pattern2= '(?<=\\()[^()]*(?=\\))'
pattern3= '.*\\((.*)\\).*'
all_patterns = c(pattern1, pattern2, pattern3)
I have used the following:
sapply(all_patterns , function(x)stringr::str_extract('I(data^2)', x))
\\([^()]*[^()]*\\) (?<=\\()[^()]*(?=\\)) .*\\((.*)\\).*
"(data^2)" "data^2" "I(data^2)"
None of these seem to only grab the characters within the brackets, so how can I just grab the characters inside brackets?
Expected output:
data

str_extract returns everything the pattern matches, so the parentheses must not be part of the match itself. Instead, use a lookbehind: match one or more characters that are neither ^ nor a closing parenthesis ([^\\^\\)]+) and that follow an opening parenthesis ((?<=\\()). The parentheses are escaped (\\) because they are metacharacters.
library(stringr)
str_extract('I(data^2)', '(?<=\\()[^\\^\\)]+')
# [1] "data"
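For comparison, the same extraction can be sketched in base R. This is a minimal sketch, assuming perl = TRUE so that the lookbehind is available; the variable names x, inside and inside2 are illustrative only and do not come from the original answer.

```r
x <- "I(data^2)"

# Lookbehind version: regexpr() locates the match, regmatches() extracts it.
inside <- regmatches(x, regexpr("(?<=\\()[^\\^)]+", x, perl = TRUE))

# Capture-group version: match the whole string, keep only group 1.
inside2 <- sub(".*\\(([^\\^)]+).*", "\\1", x, perl = TRUE)

print(inside)
print(inside2)
```

Both calls should return "data". The capture-group form avoids lookarounds entirely, which can be easier to read when the surrounding text is simple.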

Here is a combination of str_extract and str_remove:
library(stringr)
str_extract(str_remove('I(data^2)', '.\\('), '\\w*')
[1] "data"

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
[:punct:] pattern in an ICU regex flavor does not match math symbols (\p{S}), it only matches punctuation proper (\p{P}), if you still want to match all of them, combine the two classes, [\p{P}\p{S}]
"\\1\\s\\2" replacement contains a \s regex escape sequence, and these are not supported in the replacement patterns, you need to use a literal space
str_replace only replaces one, first occurrence, use str_replace_all to handle all matches
Even if you use all the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, in order not to consume the second punctuation.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online, all three lines yield [1] "- + ? =" output.
Note that in the PCRE regex flavor (used with gsub and perl=TRUE) the POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.
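The effect of consuming versus not consuming the second character can be seen side by side. This is a small base R sketch, assuming perl = TRUE; the names consumed and lookahead are illustrative only.

```r
s <- "-+?="

# Both characters are consumed, so "+" cannot start the next match
# and no space is inserted between "+" and "?".
consumed <- gsub("([[:punct:]])([[:punct:]])", "\\1 \\2", s, perl = TRUE)

# The lookahead only checks the next character without consuming it,
# so every adjacent pair is handled.
lookahead <- gsub("([[:punct:]])(?=[[:punct:]])", "\\1 ", s, perl = TRUE)

print(consumed)   # "- +? ="
print(lookahead)  # "- + ? ="
```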

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, a literal period has to be escaped as \.. The backslash is itself an escape character in R strings, so it must be doubled: \\.. In the pattern \\...*, only the first period is escaped: \\. matches a literal period, the next . matches any single character, and .* matches zero or more of any character up to the end of the string. The pattern therefore removes everything from the first period that is followed by at least one more character, which covers the ... suffix here. To require three literal periods, use \\.{3}.* instead.
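The difference between escaping one period and requiring three only shows up when a name contains a single period. A minimal sketch follows; the name "Q1.a...5" is made up for illustration and is not in the original data.

```r
col <- c("Q21...20", "Q1.a...5")  # "Q1.a...5" is a hypothetical name

# Only the first period is literal; the rest match any character.
loose <- gsub("\\...*", "", col)

# Exactly three literal periods, then the rest of the string.
strict <- gsub("\\.{3}.*", "", col)

print(loose)   # "Q21" "Q1"
print(strict)  # "Q21" "Q1.a"
```

For the column names in the question both patterns give the same result; the stricter one is only needed if single periods can appear in the part you want to keep.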
You could use sub and capture the first word in each column name:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.
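Since the question is about a data frame, the cleaned names still have to be assigned back. A minimal sketch, assuming a small made-up data frame df whose values are placeholders:

```r
# Hypothetical data frame; only the column names matter here.
df <- data.frame(a = 1, b = 2, c = 3)
names(df) <- c("Q48", "Q21...20", "Q22_4_TEXT...202")

# Keep only the leading run of word characters in each name.
names(df) <- sub("^(\\w+).*$", "\\1", names(df))

print(names(df))  # "Q48" "Q21" "Q22_4_TEXT"
```

Note that \w includes the underscore, which is why "Q22_4_TEXT" survives intact.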

How can I edit my regex so that it captures only the substring between (and not including) quotation marks?

I'm a total novice to regex, and have a hard time wrapping my head around it. Right now I have a column filled with strings, but the only relevant text to my analysis is between quotation marks. I've tried this:
response$text <- stri_extract_all_regex(response$text, '"\\S+"')
but when I view response$text, the output comes out like this:
"\"caring\""
How do I change my regex so that the output instead reads:
caring
You can use
library(stringi)
response$text <- stri_extract_all_regex(response$text, '(?<=")[^\\s"]+(?=")')
Or, with stringr:
library(stringr)
response$text <- str_extract_all(response$text, '(?<=")[^\\s"]+(?=")')
However, with several words inside quotes, I'd rather use stringr::str_match_all:
library(stringr)
matches <- str_match_all(response$text, '"([^\\s"]+)"')
response$text <- lapply(matches, function(x) x[,2])
See this regex demo.
With the capturing group approach used in "([^\\s"]+)" it becomes possible to avoid overlapping matches between quoted substrings, and str_match_all becomes handy since the matches it returns contain the captured substrings as well (unlike *extract* functions).
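The overlap problem can be reproduced in base R. This is a sketch under the assumption that perl = TRUE is acceptable; the input string is made up, and stripping the quotes with gsub stands in for the capture group that str_match_all would return.

```r
x <- 'a "x"y"z" b'  # hypothetical input: y sits between two quoted items

# Lookarounds do not consume the quotes, so the closing quote of "x"
# is reused as the opening quote of y.
look <- regmatches(x, gregexpr('(?<=")[^\\s"]+(?=")', x, perl = TRUE))[[1]]

# Consuming the quotes prevents that; the quotes are then stripped.
quoted <- regmatches(x, gregexpr('"[^\\s"]+"', x, perl = TRUE))[[1]]
capt <- gsub('"', "", quoted, fixed = TRUE)

print(look)  # "x" "y" "z"
print(capt)  # "x" "z"
```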

Why is my regex backreference in R being reversed when I use one backslash with gsub?

I do not understand why I am required to use two backslashes to prevent a reversal of my backreference. Below, I detail how I discovered my problem:
I wanted to transform a character that looks like this:
x <- "53/100 000"
And transform it to look like this:
53/100000
Here are a few ideas I had before I came to ask this question:
I thought that I could use the function gsub to remove all spaces that occur after the / character. However, I thought that a regex solution might be more elegant/efficient.
At first, I didn't know how to backreference in regex, so I tried this:
> gsub("/.+\\s",".+",x)
[1] "53.+000"
Then I read that you can backreference captured patterns using \1 from this website. So I began to use this:
> gsub("/.+\\s","\1",x)
[1] "53\001000"
Then I realized that the backreference only considers the wildcard match. But I wanted to keep the / character. So I added it back in:
> gsub("/.+\\s","/\1",x)
[1] "53/\001000"
I then tried a bunch of other things, but I fixed it by adding an extra backslash and enclosing my wildcard in parentheses:
> gsub("/(.+)\\s","/\\1",x)
[1] "53/100000"
Moreover, I was able to remove the / character from my replacement by inserting the left parenthesis at the beginning of the pattern:
> gsub("(/.+)\\s","\\1",x)
[1] "53/100000"
Hm, so it seemed two things were required: parentheses and an extra backslash. The parentheses I understand I think, because I believe the parentheses indicate what is the part of text that you are backreferencing.
What I do not understand is why two backslashes are required. The reference website says that only \1 is required. What's going on here? Why is my backreference being reversed?
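The extra backslash is required because R parses "\1" in a string literal as an escape sequence (the control character with code 1, printed as \001) before the string ever reaches gsub. Nothing is being reversed: the replacement is simply that single control character. "\\1" is a backslash followed by 1, which gsub reads as the backreference \1.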
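The difference between the two strings can be checked directly. A short sketch follows; the raw-string line assumes R 4.0.0 or later, and the names one_char, two_chars, fixed and raw_same are illustrative only.

```r
x <- "53/100 000"

one_char  <- nchar("\1")      # 1: a single control character
two_chars <- nchar("\\1")     # 2: a backslash and the digit 1
code      <- utf8ToInt("\1")  # 1: the code of that control character

# With the doubled backslash, gsub sees the backreference \1.
fixed <- gsub("/(.+)\\s", "/\\1", x)

# Raw strings (R >= 4.0.0) avoid the doubling altogether.
raw_same <- identical(r"(\1)", "\\1")

print(fixed)     # "53/100000"
print(raw_same)  # TRUE
```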

Is there a simple way to get substring in R?

I get the substring of a word in the following way:
word="xyz9874"
pattern="[0-9]+"
x=gregexpr(pattern,word)
substr(word,start=x[[1]],stop=x[[1]]+attr(x[[1]],"match.length")-1)
[1] "9874"
Is there a more simple way to get the result in R?
Sure, use gsub and backreferencing:
gsub( ".*?([0-9]+).*", "\\1", word )
Explanation: in most regex implementations, \1 is the back reference to the first subpattern matched. The subpattern is enclosed in parentheses. In R, you need to escape the backslash irrespective of the type of quotation marks you are using.
The question mark after .* makes the quantifier non-greedy: it matches as little of the string as possible. Otherwise, the .* in the pattern .*([0-9]+) would match xyz987 and ([0-9]+) would match only 4. Alternatively, we can write
gsub( ".*[^0-9]+([0-9]+).*", "\\1", word )
but then we have a problem with strings that start with a number.
By the way, note that instead of [0-9] you can write \d, or, actually, \\d:
gsub( ".*?(\\d+).*", "\\1", word )
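The greedy and non-greedy behaviour described above can be compared directly, alongside a shorter alternative that extracts the match rather than replacing around it. This is a sketch, assuming perl = TRUE for the two sub calls; the variable names are illustrative only.

```r
word <- "xyz9874"

# Extract the first run of digits directly, no backreference needed.
digits <- regmatches(word, regexpr("[0-9]+", word))

# Greedy .* leaves only the last digit for the capture group.
greedy <- sub(".*([0-9]+).*", "\\1", word, perl = TRUE)

# Lazy .*? stops as early as possible, so the group gets all digits.
lazy <- sub(".*?([0-9]+).*", "\\1", word, perl = TRUE)

print(digits)  # "9874"
print(greedy)  # "4"
print(lazy)    # "9874"
```

The regmatches form also behaves sensibly for strings that start with a digit, since it does not depend on what precedes the match.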
