Multiline text extraction in R with stringr - r

I have a column in my dataframe which has free text in it
I would like to extract the text after INDICATIONS FOR EXAMINATION and before the next capitalized line. In the example below the result would be 'Anaemia'
INDICATIONS FOR EXAMINATION
Anaemia
PROCEDURE PERFORMED
Gastroscopy (OGD)
I am having some trouble as I'm using stringr and I can't seem to get multiline matches.
I have been using:
EoE$IndicationsFroExamination<-str_extract(EoE$Endo_ResultText, '(?<=INDICATIONS FOR EXAMINATION).*?[A-Z]+')

It requires a little digging. You can use the regex() modifier function.
Use the multiline argument to switch on multiline fitting:
str_extract_all("a\nb\nc", "^.")
# [[1]]
# [1] "a"
str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))
# [[1]]
# [1] "a" "b" "c"
Please be aware of the dotall argument, that will switch on multiline behaviour of ".*":
str_extract_all("a\nb\nc", "a.")
# [[1]]
# character(0)
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
# [[1]]
# [1] "a\n"
These are documented in stringi::stri_opts_regex(), which stringr::regex() passes arguments to.

I made the regular expression a bit more generic so it will match all occurrences and used the str_extract_all package from stringr:
matches <- str_extract_all(str, "(?<=[A-Z]\n)([^\n]*)")
Which, given the string you provided, should return:
[[1]]
[1] "Anaemia" "Gastroscopy (OGD)"

Related

Removing unwanted parts of strings in a list, and combining the pieces into a single string in R

I am trying to take a list of strings, remove everything except capital letters, and output a list of strings without any spaces or breaks.
Unfortunately, I have been trying to use str_extract_all() but it outputs the relevent pieces of the string separated as a list of character vectors, when there was non-capital letter string elements contained in the original string.
Can anyone please suggest a way to get the desired output?
# Some example data:
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
"SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
"n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
"n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")
# The desired output:
list("MVNNGHSFNVEYDDSQDKAVLKD",
"SLGKVGTRCCTKPESER",
"AVVQDPALKPLALVY",
"TCVADESHAGCEK")
# What I've tried so far:
a %>% str_extract_all("[A-Z]+")
[[1]]
[1] "MVNNGHSFNVEYDDSQDK" "AVLK" "D"
[[2]]
[1] "SLGKVGTRC" "CTK" "PESER"
[[3]]
[1] "AVVQDPALK" "PLALVY"
[[4]]
[1] "TCVADESHAGC" "EK"
# Not what I want.
I need to find a way to isolate the strings and combine them, but I'm at the limit of my R knowledge.
As it is a list of multiple elements, we can just paste it together by looping over the list
library(dplyr)
library(stringr)
library(purrr)
a %>%
str_extract_all("[A-Z]+") %>%
map_chr(str_c, collapse="")
-output
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"
[3] "AVVQDPALKPLALVY" "TCVADESHAGCEK"
Or just use gsub to match all characters other than the upper case and replace with blank
gsub("[^A-Z]+", "", a)
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
or with str_remove_all
str_remove_all(a, "[^A-Z]+")
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER" "AVVQDPALKPLALVY" "TCVADESHAGCEK"
The output is a vector, which we can wrap it in a list
list(str_remove_all(a, "[^A-Z]+"))

only remove punctuation for words not numbers

I am using the tm package in R to remove punctuation.
TextDoc <- tm_map(TextDoc, removePunctuation)
Is there a way I can only remove puncutation if it has to do with a letter/word instead of a number?
E.g.
I want performance. --> performance
But I want 3.14 --> 3.14
Example of how i want function to work:
wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok
91.8.10 --> 91.8.10
EDIT:
TextDoc is of the form:
You may also try this gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T) where text is your text vector. Explanation of regex
(?<!\\d) negative lookbehind for any digit character
[[:punct:]] searches for punctuation marks
(?=\\D) followed by positive lookahead for any non-digit character
? 0 or once
check this for regex demo
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T)
#> [1] "wall 88.1" "expression" "ef" "ok" "A"
#> [6] "3.14" "91.8.10"
long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = T)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"
Created on 2021-06-13 by the reprex package (v2.0.0)
I've completely revamped my answer based on your specification and Anil's answer below, which is much more widely applicable than what I originally had.
library(tm)
# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
"w.a.ll, 6513.645+1646-5")
# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))
# you create a function with any of the solutions that we've provided here
# I'm taking AnilGoyal's because it's better than my rushed purrr one.
my_remove_punct <- function(x) {
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = T)
}
# pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)
# Applying the function will give you a warning about dropping documents; but it's a bug of the TM package.
# We use this to confirm that the contents are indeed correct. The last line is a print-out of all the individual documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall" "expression" "ef"
#> [4] "ok" "A" "3.14"
#> [7] "91.8.10" "wall 6513.645+1646-5"
The warning you receive about "dropping documents" is not real as you can see by printing. An explanation is in this other SO question.
In the future, note that you can quickly get better answers by providing raw data with the function dput to your object. Something like dput(TextDoc). If it is too much, you can subset it.
Tried to make it less ugly but here is my best shot:
library(data.table)
TextDoc <- data.table(text = c("wall",
"expression.",
"ef.",
"91.8.10",
"A.",
"name:",
":ok"))
TextDoc[grepl("[a-zA-Z]", text),
text := unlist(tm_map(Corpus(VectorSource(as.vector(text))), removePunctuation))[1:length(grepl("[a-zA-Z]", text))]]
Which gives us:
> TextDoc
text
1: wall
2: expression
3: ef
4: 91.8.10
5: A
6: name
7: ok

Extract string using `rm_between` function

I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I cant figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
Regex demo
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the left most HUMAN right after.
Note the default value for the FIXED argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern, by setting fixed=FALSE. However, the ^(?:[^|]*\|){2} as the left argument regex will not work because the qdap package creates an ICU regex with lookarounds (since you use extract=TRUE that sets include.markers to FALSE), which is (?<=^(?:[^|]*\|){2}).*?(?=HUMAN).
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than a 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - startof strig
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
this is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution is an application of regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 matches whatever was matched inside the first parenthesis.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The part (.*?) is replaced by itself as the replacement contains the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a mamixmum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = T),
function(item) strsplit(item[3], "_HUMAN", fixed = T)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.

Extracting Headers from a list [duplicate]

I have a character string and what to extract the information inside of multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so it extracts multiple parentheses and returns as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"
Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the results includes parenthesis... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks #MartinMorgan for the comment.
Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
#kohske uses regmatches but I'm currently using 2.13 so don't have access to that function at the moment. This adds the dependency on stringr but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.
I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like #kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
Using rex may make this type of task a little simpler.
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.
Text outside the parenthesis
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parenthesis
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) matches all the characters which are present inside the brackets and then the following (*SKIP)(*F) makes the match to fail. Now it tries to execute the pattern which was just after to | symbol against the remaining string. So the dot . matches all the characters which are not already skipped. Replacing all the matched characters with an empty string will give only the text present inside the rackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex would capture all the characters which are present inside the brackets and matches all the other characters. |. or part helps to match all the remaining characters other than the captured ones. So by replacing all the characters with the chars present inside the group index 1 will give you the desired output.
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense b and p use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"

Resources