Find match in multiple line breaks - r

I need KPRMILL from the text below. The pattern is: find the first ":" followed by a single space, then take the text that follows, up to the first line break (\n).
x <- "\n \n NSE: KPRMILL\n \n \n | \n BSE: 532889\n \n \n | INDUSTRY : TEXTILES\n | SECTOR : TEXTILES, APPARELS & ACCESSORIES\n "
I am able to solve this with a combination of str_extract( ) and str_replace( ), but I am looking for a more efficient solution.
x %>% str_extract("[.*?:]\\s+(.*?\\n)") %>% str_replace("(:\\s+)(.*)\\n","\\2")

You can use regex lookarounds to match text that comes before and/or after your pattern without including it in the returned text. (?<=abc)qu+x means "find and return qu+x when it is preceded by abc"; similarly, qu+x(?=abc) means "find and return qu+x when it is followed by abc".
str_extract(x, "(?<=: )(.*)(?=\n)")
# [1] "KPRMILL"
I'm inferring that you only want the first of the patterns in your x, since there are four. If you want the others, use str_extract_all:
str_extract_all(x, "(?<=: )(.*)(?=\n)")
# [[1]]
# [1] "KPRMILL" "532889"
# [3] "TEXTILES" "TEXTILES, APPARELS & ACCESSORIES"
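The same lookaround pattern also works without stringr. A minimal base R sketch, assuming the x from the question: regexpr() returns only the first match and gregexpr() all of them, and perl = TRUE is needed because base R's default regex engine does not support lookarounds. The variable names first_match and all_matches are only illustrative.

```r
# x is copied from the question
x <- "\n \n NSE: KPRMILL\n \n \n | \n BSE: 532889\n \n \n | INDUSTRY : TEXTILES\n | SECTOR : TEXTILES, APPARELS & ACCESSORIES\n "

# Text preceded by ": " and followed by a line break
pattern <- "(?<=: ).*(?=\n)"

# First match only (like str_extract)
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# All matches (like str_extract_all); [[1]] because x has length one
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

first_match
all_matches
```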

Related

Remove spaces between line breaks only

I have the following example string with line breaks "\n" and spaces " ":
a <- "\n \n \n \nTEST TEST\n"
I would like to remove spaces (" ") directly following after line breaks ("\n"), but not the spaces after other strings (like "TEST" in my toy example). My desired output is therefore:
"\n\n\n\nTEST TEST\n"
I tried stringr's str_remove_all and str_replace_all but didn't succeed, as those seem to have problems in this case with the adjacent occurrences of the line breaks. This is the closest I got:
str_replace_all(a, "\n[ ]*\n", "\n\n")
I spent hours on this (probably ridiculously easy) problem, any help is thus highly appreciated!
gsub("\n *", "\n", a)
or
str_replace_all(a, "\n *", "\n") # with stringr package
will get you the desired output "\n\n\n\nTEST TEST\n"
EDIT: For space(s) only between blank lines
Note that the above will also remove spaces that appear at the start of non-blank lines, e.g., if the string were "\n TEST TEST \n".
@bobble bubble's suggestion of adding (?=\n) to the search pattern (i.e., "\n *(?=\n)") restricts the match to spaces between line breaks. (Thank you, bobble bubble)
gsub("\n *(?=\n)", "\n", a, perl=TRUE)
or
str_replace_all(a, "\n *(?=\n)", "\n") # with stringr package
(?=regex) is a positive lookahead assertion. In "\n *(?=\n)", it means that the asserted regex \n must appear directly after \n * (a newline followed by zero or more spaces), but it is not consumed as part of the match. Because the asserted text is not part of the match, it is not replaced by gsub or stringr::str_replace_all.
To illustrate this more clearly, only the "b" that appears before "bu" is replaced in the following example:
str_replace_all("bobblebbubble", "b(?=bu)", "_")
#[1] "bobble_bubble"
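To see the difference between the two patterns on one input, here is a small base R sketch. The string b is a made-up example (not from the question) in which a non-blank line starts with a space:

```r
# Hypothetical input: the line "TEST TEST" is indented by one space
b <- "\n \n TEST TEST\n"

# Plain pattern: strips spaces after every line break, including the indent
consumed <- gsub("\n *", "\n", b)

# Lookahead pattern: strips spaces only when another line break follows
lookahead <- gsub("\n *(?=\n)", "\n", b, perl = TRUE)

consumed   # "\n\nTEST TEST\n"
lookahead  # "\n\n TEST TEST\n"
```

With the lookahead, the space before TEST survives because it is followed by a letter rather than a line break.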
I believe you can remove any line that consists of horizontal whitespace. With stringr, you can use
library(stringr)
a <- "\n \n \n \nTEST TEST\n"
stringr::str_replace_all(a, "(?m)^\\h+$", "")
See the R demo and the regex demo. Details:
(?m) - a multiline modifier making ^ match start of any line and $ match any end of line positions
^ - line start
\h+ - one or more horizontal whitespace chars
$ - line end.

How do I extract text between two characters in R

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\s*\K.* regex matches
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
See the regex demo online.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
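If you need several fields, the \K approach can be wrapped in a small helper. This is only a sketch: extract_after is a hypothetical name, and it assumes the label contains no regex metacharacters. Because \s* absorbs any amount of whitespace, it should not matter how many spaces follow the colon.

```r
# Hypothetical helper: return the text after "<label>:" on each matching line
extract_after <- function(text, label) {
  # \K drops "<label>:" and the whitespace from the reported match
  pattern <- paste0(label, ":\\s*\\K.*")
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
extract_after(x, "CITY")  # "ATLANTA" "LAS VEGAS"
extract_after(x, "TYPE")  # "School"

# Made-up variant with three spaces after the first colon
x3 <- "CITY:   ATLANTA\nCITY: LAS VEGAS\n"
extract_after(x3, "CITY")
```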
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
reads as: extract anything preceded by "CITY:" and one whitespace character (as in x above; use \\s{3} if your data has three spaces there) and followed by "\n"
Another option, which keeps the space after the colon in each match:
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"

Split long string by a vector of words

I'm looking to split some television scripts into a data frame with two variables: (1) spoken dialogue and (2) speaker.
Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html
Loaded to R via:
require(rvest)
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
url <- read_html(url)
all <- url %>% html_text()
[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n Transcript\nWritten by Drew Goddard\n Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n \n NB: The content of this transcript, including the characters \n and the story, belongs to Mutant Enemy. This transcript was created \n based on the broadcast episode.\n \n \n \n \n BUFFYWORLD.COM \n prefers that you direct link to this transcript rather than post \n it on your site, but you can post it on your site if you really \n want, as long as you keep everything intact, this includes the link \n to buffyworld.com and this writing. Please also keep the disclaimers \n intact.\n \n Originally transcribed for: http://www.buffyworld.com/.\n\t \n TEASER (RECAP SEGMENT):\n GILES (V.O.)\n\n Previousl... <truncated>
What I'm trying now is to split at each character's name (I have a full list). For example, 'GILES' above. This works fine except I can't retain character name if I split there. Here's a simplified example.
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)
This gives me the splits I want, but doesn't retain the character name.
Finite question: Any approach to retain that character name w/ what I'm doing?
Infinite question: Any other approaches I should be trying?
Thanks in advance!
I think you can use perl compatible regular expressions with strsplit. For explanatory purposes, I used a shorter sample string, but it should work the same:
string <- "text BUFFY more text WILLOW other text"
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)
#[[1]]
#[1] "text BUFFY" " more text WILLOW" " other text"
As suggested by @Lamia, if you instead had the name before the text, you could use a positive look-ahead. I edited the suggestion slightly so that each piece starts with the delimiter.
strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)
#[[1]]
#[1] "text " "BUFFY more text " "WILLOW other text"
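From there, one possible way to get the two-column data frame asked for in the question is to peel the name off the front of each piece. This is a rough sketch on a made-up string; it assumes the text starts with a speaker name and that the names never occur inside the dialogue itself. The names parts and script are illustrative.

```r
# Made-up sample; the real transcript would come from html_text()
string <- "BUFFY Hello there. WILLOW Hi! BUFFY Bye."
to_parse <- paste(c("BUFFY", "WILLOW"), collapse = "|")

# Split just before each name, so every piece starts with its speaker
parts <- strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)[[1]]

# Keep only the leading name, then drop it to get the dialogue
speaker  <- sub(paste0("^(", to_parse, ").*$"), "\\1", parts)
dialogue <- trimws(sub(paste0("^(", to_parse, ")"), "", parts))

script <- data.frame(speaker, dialogue, stringsAsFactors = FALSE)
script
```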

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
See the regex demo
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
See the regex demo.
It matches
\p{Lu} - an uppercase letter
.*? - any chars other than line break chars, as few as possible, up to (but not including) the text required by the lookahead that follows, since (?=...) is a non-consuming construct
(?=\s*\() - a positive lookahead that, immediately to the right of the current location, requires the presence of:
\s* - 0+ whitespace chars
\( - a literal (.
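Since \p{Lu} accepts any uppercase letter as a starting point, a capitalized word between two names would be pulled into the match. If every name starts with an initial and a dot, as in the sample data, a slightly stricter variant may help. This is a sketch under that assumption; the third test string is made up, and loose and strict are illustrative names.

```r
x <- c("D.Dan Fun (23-3)",
       "T.Collins (51-82) J.Maddon (12-31)",
       "T.Hillman (12-34) And N.Yost (23-45)")  # made-up: capitalized "And"

# Pattern from the answer: starts at any uppercase letter
loose <- regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl = TRUE))

# Stricter variant: the uppercase letter must be followed by a literal dot
strict <- regmatches(x, gregexpr("\\p{Lu}\\..*?(?=\\s*\\()", x, perl = TRUE))

loose[[3]]      # "T.Hillman" "And N.Yost"
unlist(strict)  # "D.Dan Fun" "T.Collins" "J.Maddon" "T.Hillman" "N.Yost"
```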

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!). Hence I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string as shown:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
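A related approach, sketched here as an alternative rather than as part of the answer above, is to split each string into sentences at the whitespace that follows a punctuation mark and keep the last piece. The lookbehind keeps the punctuation attached to its sentence. Like the sub() version, it assumes sentences are separated by spaces; the names sentences and last_sentence are illustrative.

```r
u <- c("This is a sentence. And another sentence!",
       "Single sentences are not a problem.",
       "You know what? Multiple marks don't break my solution!!")

# Split at whitespace preceded by ., ! or ? (the mark itself is not removed)
sentences <- strsplit(trimws(u), "(?<=[.!?])\\s+", perl = TRUE)

# Last element of each split
last_sentence <- vapply(sentences, function(s) s[length(s)], character(1))
last_sentence
```

This also leaves the earlier sentences available in sentences if you need them later.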
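This regex anchors to the end of the string with $ and allows an optional "." or "!" at the end. At the front, the positive lookbehind (?<=...) requires the match to start right after ". " or "! " (the end of the prior sentence) without including those characters in the result; the ^ alternative provides for a single-sentence string. The body [^.!]+ cannot cross a "." or "!", so in a multi-sentence string the match cannot start at the beginning and has to start at the last sentence; the trade-off is that the last sentence must not contain "." or "!" before its final mark.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))[^.!]+(\\.|\\!)?$")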
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."
