Extract text in parentheses in R - r

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.

Text outside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches the characters inside the brackets, and the (*SKIP)(*F) that follows forces that match to fail and tells the engine to skip past it. The engine then tries the alternative after the | symbol on the rest of the string, so the dot . matches every character that was not skipped, including the brackets themselves. Replacing all of those matches with an empty string leaves only the text inside the brackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
[1] "bp" "" "ijkl"
This regex captures the characters inside the brackets in group 1 and, through the |. alternative, matches every other character one at a time. Replacing each match with the contents of group 1 (\\1) keeps the captured text and drops everything else, because group 1 is empty whenever the . alternative matched.
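If the PCRE verbs feel too clever, the same two vectors can be built in base R with gregexpr and regmatches. This is only a sketch under the same assumption as the answers above (parentheses are not nested); the names groups, inside and outside are made up for the illustration.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Find every parenthesised group, e.g. "(b)" and "(p)" in the first element.
groups <- regmatches(x, gregexpr("\\([^()]*\\)", x))

# Drop the enclosing brackets from each group and join the pieces per element.
# Elements without any group yield character(0), which paste() collapses to "".
inside <- vapply(
  groups,
  function(g) paste(substr(g, 2, nchar(g) - 1), collapse = ""),
  character(1)
)

# Text outside the brackets: delete every group.
outside <- gsub("\\([^()]*\\)", "", x)

print(inside)
print(outside)
```

Because the groups are kept as a list first, you can also inspect them per element before collapsing.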

The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
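To condense b and p use the following (note that the NA for the element without parentheses becomes the string "NA", not ""):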
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"
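The paste call above turns the missing value into the literal string "NA", whereas the question asked for "". One possible way around that is to drop the NA entries before collapsing. In this sketch the list extracted is a hand-written stand-in for the result of rm_round(x, extract=TRUE) shown above, so that it runs without the package.

```r
# Stand-in for rm_round(x, extract = TRUE); copied by hand from the output above.
extracted <- list(c("b", "p"), NA_character_, "ijkl")

# Remove NA entries first, then collapse; an empty vector collapses to "".
inside <- vapply(
  extracted,
  function(v) paste(v[!is.na(v)], collapse = ""),
  character(1)
)

print(inside)
```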

Related

Extract a string that spans across multiple lines - stringr

I need to extract a string that spans across multiple lines on an object.
The object:
> text <- paste("abc \nd \ne")
> cat(text)
abc
d
e
With str_extract_all I can extract all the text between ‘a’ and ‘c’, for example.
> str_extract_all(text, "a.*c")
[[1]]
[1] "abc"
Using the function ‘regex’ and the argument ‘multiline’ set to TRUE, I can extract a string across multiple lines. In this case, I can extract the first character of multiple lines.
> str_extract_all(text, regex("^."))
[[1]]
[1] "a"
> str_extract_all(text, regex("^.", multiline = TRUE))
[[1]]
[1] "a" "d" "e"
But when I try to extract "every character between a and d" (a regex that spans multiple lines), the output is "character(0)".
> str_extract_all(text, regex("a.*d", multiline = TRUE))
[[1]]
character(0)
The desired output is:
“abcd”
How to get it with stringr?
dplyr:
library(dplyr)
library(stringr)
data.frame(text) %>%
mutate(new = lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
text new
1 abc \nd \ne abcd
Here we use the character class \\w, which does not include the new line metacharacter \n. The negative lookahead (?!e) makes sure the e is not matched.
Without dplyr (still using stringr):
unlist(lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
[1] "abcd"
str_remove_all(text,"\\s\\ne?")
[1] "abcd"
OR
paste0(trimws(strsplit(text, "\\ne?")[[1]]), collapse="")
[1] "abcd"
The answers above remove the line breaks. So a two-step approach can produce the desired output 'abcd':
1 - Use str_remove_all or gsub to remove the line breaks (in this case, also removing blank spaces).
2 - Use str_extract_all to get the desired output ('abcd' in this case).
> text %>%
+ str_remove_all("\\s\\n") %>%
+ str_extract_all("a.*d")
[[1]]
[1] "abcd"
Short regex reference:
\n - new line (line feed)
\s - any whitespace
\r - carriage return
Update:
In base R to get the desired output abcd:
text <- gsub("[\r\n]|[[:blank:]]", "", text)
substr(text,1, nchar(text)-1)
[1] "abcd"
First answer:
We can use gsub:
gsub("[\r\n]|[[:blank:]]", "", text)
[1] "abcde"
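For what it's worth, the original attempt returns character(0) because . does not match a line break by default; the multiline option only changes where ^ and $ match. Letting the dot cross lines is a separate switch: the dotall argument of regex() in stringr, or the inline (?s) flag with perl = TRUE in base R. A small base R sketch (the names spanning and result are just for illustration):

```r
text <- "abc \nd \ne"

# (?s) lets "." match line breaks too, so the match can span lines.
spanning <- regmatches(text, regexpr("(?s)a.*d", text, perl = TRUE))

# The match still contains the spaces and line break; strip all whitespace.
result <- gsub("[[:space:]]", "", spanning)

print(result)
```

This extracts first and cleans up afterwards, so the e on the last line never has to be excluded by hand.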

Removing unwanted parts of strings in a list, and combining the pieces into a single string in R

I am trying to take a list of strings, remove everything except capital letters, and output a list of strings without any spaces or breaks.
Unfortunately, I have been trying to use str_extract_all(), but whenever the original string contains anything other than capital letters, it returns the relevant pieces split into separate elements of a character vector.
Can anyone please suggest a way to get the desired output?
# Some example data:
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
"SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
"n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
"n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")
# The desired output:
list("MVNNGHSFNVEYDDSQDKAVLKD",
"SLGKVGTRCCTKPESER",
"AVVQDPALKPLALVY",
"TCVADESHAGCEK")
# What I've tried so far:
a %>% str_extract_all("[A-Z]+")
[[1]]
[1] "MVNNGHSFNVEYDDSQDK" "AVLK" "D"
[[2]]
[1] "SLGKVGTRC" "CTK" "PESER"
[[3]]
[1] "AVVQDPALK" "PLALVY"
[[4]]
[1] "TCVADESHAGC" "EK"
# Not what I want.
I need to find a way to isolate the strings and combine them, but I'm at the limit of my R knowledge.
As it is a list of multiple elements, we can just paste it together by looping over the list
library(dplyr)
library(stringr)
library(purrr)
a %>%
str_extract_all("[A-Z]+") %>%
map_chr(str_c, collapse="")
-output
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"
[3] "AVVQDPALKPLALVY" "TCVADESHAGCEK"
Or just use gsub to match all characters other than the upper case and replace with blank
gsub("[^A-Z]+", "", a)
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
or with str_remove_all
str_remove_all(a, "[^A-Z]+")
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
The output is a character vector, which we can convert to a list with one string per element
as.list(str_remove_all(a, "[^A-Z]+"))
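If the result has to stay a list with one string per element, as in the desired output, a base R sketch is to apply gsub to each element with lapply. The name cleaned is only for illustration.

```r
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
          "SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
          "n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

# lapply keeps the list structure; gsub deletes everything that is not A-Z.
cleaned <- lapply(a, function(s) gsub("[^A-Z]+", "", s))

str(cleaned)
```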

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
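Yet another option, sketched here in base R, is not to split at all but to capture the two parts with groups. It assumes every element starts with digits followed by a single space; parts and split_text are illustrative names.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# regexec returns the full match plus one entry per capture group.
parts <- regmatches(sample.text, regexec("^([0-9]+) (.*)$", sample.text))

# Drop the full match (first entry) and keep the two captured pieces.
split_text <- lapply(parts, function(p) p[-1])

print(split_text)
```

Elements that do not fit the pattern come back as character(0), which makes malformed rows easy to spot.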

Multiline text extraction in R with stringr

I have a column in my dataframe which has free text in it.
I would like to extract the text after INDICATIONS FOR EXAMINATION and before the next capitalized line. In the example below the result would be 'Anaemia'
INDICATIONS FOR EXAMINATION
Anaemia
PROCEDURE PERFORMED
Gastroscopy (OGD)
I am having some trouble as I'm using stringr and I can't seem to get multiline matches.
I have been using:
EoE$IndicationsFroExamination<-str_extract(EoE$Endo_ResultText, '(?<=INDICATIONS FOR EXAMINATION).*?[A-Z]+')
It requires a little digging. You can use the regex() modifier function.
Use the multiline argument to make ^ and $ match at the start and end of every line:
str_extract_all("a\nb\nc", "^.")
# [[1]]
# [1] "a"
str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))
# [[1]]
# [1] "a" "b" "c"
Also be aware of the dotall argument, which makes . match line terminators as well, so that a pattern such as .* can span several lines:
str_extract_all("a\nb\nc", "a.")
# [[1]]
# character(0)
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
# [[1]]
# [1] "a\n"
These are documented in stringi::stri_opts_regex(), which stringr::regex() passes arguments to.
I made the regular expression a bit more generic so it will match all occurrences and used the str_extract_all function from stringr:
matches <- str_extract_all(str, "(?<=[A-Z]\n)([^\n]*)")
Which, given the string you provided, should return:
[[1]]
[1] "Anaemia" "Gastroscopy (OGD)"
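To get only the line after INDICATIONS FOR EXAMINATION, as the question asks, the lookbehind can name that heading explicitly. A base R sketch, assuming the heading is followed directly by a line break as in the example; report, pattern and indication are made-up names.

```r
report <- "INDICATIONS FOR EXAMINATION\nAnaemia\nPROCEDURE PERFORMED\nGastroscopy (OGD)"

# Lookbehind for the heading plus its line break, then take the rest of that line.
pattern <- "(?<=INDICATIONS FOR EXAMINATION\\n)[^\\n]*"
indication <- regmatches(report, regexpr(pattern, report, perl = TRUE))

print(indication)
```

Note that regmatches drops elements without a match, so on a whole column the result can be shorter than the input.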

Get indices of all character elements matches in string in R

I want to get the indices of all occurrences of certain characters in a word. Assume the characters I am looking for are: l, e, a, z.
I tried the following regex in the grep function, and dozens of variations of it, but I never get the result I want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work on the individual characters of a string, you first need to split the string into a vector of single characters. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next you need to simplify your regular expression (R patterns are plain strings, so the / delimiters must go, and the group and {1} are not needed), and try the grep again:
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
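If you only want the bare positions without the attributes, a small sketch is to convert the gregexpr result to integers. gregexpr returns -1 when nothing matches, so that value is filtered out here. The character set is limited to the four letters named in the question, and idx and idx2 are illustrative names.

```r
word <- "ylaf"

# Positions of every match; as.integer() drops the attributes.
idx <- as.integer(gregexpr("[leaz]", word)[[1]])

# gregexpr signals "no match" with -1, so keep positive positions only.
idx <- idx[idx > 0]

# Equivalent without a regex: compare the single characters to the wanted set.
idx2 <- which(strsplit(word, split = "")[[1]] %in% c("l", "e", "a", "z"))

print(idx)
print(idx2)
```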
