I would like to extract a string matching dog|cat (0-5 words, \r, \n or spaces between) 1. and some more text until 2.appears.
myStrings <- c(
"the dog says: 1. hello cat 2. I do not care",
"the dog barks ba ba ba ba ba ba ba and says: 1. no 2. no",
"the doggie says: 1. hello 2. you",
"the cat is angry and asks: 1. hello dog 2. go away",
"the dog says: 2. nothing 3. nothing")
My approach is:
str_extract(string=myStrings,pattern=regex("(dog|cat(?:\\w+\\W+){1,5}?1.).*(?=2.)"))
I tried to implement this (https://www.regular-expressions.info/near.html) , however, my regex matches
> [1] "dog says: 1. hello cat " "dog barks ba ba ba ba ba
> ba ba: 1. no " "doggie says: 1. hello " "dog " "dog says: "
What I would need is
> [1] "dog says: 1. hello cat " "NA" "NA" "the cat is angry and asks: 1. hello dog " "NA"
Your lookbehind assertion is unbounded, meaning, it can match any amount of tokens. The engine needs to statically be able to determine the length of the lookbehind.
Btw, it seems you have uneven parenthesis in your regex, which means I don't know which tokens are supposed to be included in the lookbehind. If you include anything like \w+, it will be unbounded.
I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )
(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no no immediately to the left of the current location, and only then RT is consumed.
I have what seems like a simple regular expression I want to match on and replace. I'm processing a bunch of free form text and respondents have a variety of ways of denoting a line break. One such is at least 4 sequential bits of whitespace. Another is a heavy dot. However, in R (perl=FALSE) I get some very strange behavior. The regex \\s{4,}|• replaces the whole string with one <br>, if I change the repetition to be just 4 (\\s{4}|•) then it returns 19 <br>. If I remove the |• then it works fine. If I explicitly call out 4 whitespace characters or the heavy dot, \\s\\s\\s\\s+|•, it works fine.
What is it about repeating \\s or checking for a heavy dot • that causes such erratic behavior?
x = "Call Narrative <br>11/15/2017 19:53:00 J574511 <br> <br>"
replacement = "<br>"
orig_pattern = "\\s{4,}|•"
alt1 = "\\s\\s\\s\\s+|•"
alt2 = "\\s{4,}"
alt3 = "\\s{4,}|<p>"
alt4 = "\\s{4}|•"
gsub(orig_pattern,replacement,x)
#> [1] "<br>"
gsub(alt1,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt2,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt3,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt4,replacement,x)
#> [1] "<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>"
UPDATE
It seems to be associated with the OS. The problem originated on Amazon Linux 2 but works fine on Windows.
My data is in this format. It's a text file and the class is "character". I have posted few lines from the file. There are about 14000 lines.
"KEY: Aback"
"SYN: Backwards, rearwards, aft, abaft, astern, behind, back."
"ANT: Onwards, forwards, ahead, before, afront, beyond, afore."
"KEY: Abandon"
"SYN: Leave, forsake, desert, renounce, cease, relinquish,"
"discontinue, castoff, resign, retire, quit, forego, forswear,"
"depart_from, vacate, surrender, abjure, repudiate."
"ANT: Pursue, prosecute, undertake, seek, court, cherish, favor,"
"protect, claim, maintain, defend, advocate, retain, support, uphold,"
"occupy, haunt, hold, assert, vindicate, keep."
Line 6 and 7 is the continuation of line 5. Line 9 and 10 is the continuation of line 8. My struggle is how can I bring up line 6 and 7 to line 5 and similarly line 9 and 10 to line 8.
Any hints gratefully received.
First thing that comes to mind (your text is stored as x):
#prefix each line starter (identifies as pattern: `CAPS:`) with a newline (\n)
strsplit(gsub("([A-Z]+:)", "\n\\1", paste(x, collapse = " ")),
split = "\n")[[1L]][-1L]
# [1] "KEY: Aback "
# [2] "SYN: Backwards, rearwards, aft, abaft, astern, behind, back. "
# [3] "ANT: Onwards, forwards, ahead, before, afront, beyond, afore. "
# [4] "KEY: Abandon "
# [5] "SYN: Leave, forsake, desert, renounce, cease, relinquish, discontinue, castoff, resign, retire, quit, forego, forswear, depart_from, vacate, surrender, abjure, repudiate. "
# [6] "ANT: Pursue, prosecute, undertake, seek, court, cherish, favor, protect, claim, maintain, defend, advocate, retain, support, uphold, occupy, haunt, hold, assert, vindicate, keep."
I would like to split a string of character at pattern "|"
but
unlist(strsplit("I am | very smart", " | "))
[1] "I" "am" "|" "very" "smart"
or
gsub(pattern="|", replacement="*", x="I am | very smart")
[1] "*I* *a*m* *|* *v*e*r*y* *s*m*a*r*t*"
The problem is that by default strsplit interprets " | " as a regular expression, in which | has special meaning (as "or").
Use fixed argument:
unlist(strsplit("I am | very smart", " | ", fixed=TRUE))
# [1] "I am" "very smart"
Side effect is faster computation.
stringr alternative:
unlist(stringr::str_split("I am | very smart", fixed(" | ")))
| is a metacharacter. You need to escape it (using \\ before it).
> unlist(strsplit("I am | very smart", " \\| "))
[1] "I am" "very smart"
> sub(pattern="\\|", replacement="*", x="I am | very smart")
[1] "I am * very smart"
Edit: The reason you need two backslashes is that the single backslash prefix is reserved for special symbols such as \n (newline) and \t (tab). For more information look in the help page ?regex. The other metacharacters are . \ | ( ) [ { ^ $ * + ?
If you are parsing a table than calling read.table might be a better option. Tiny example:
> txt <- textConnection("I am | very smart")
> read.table(txt, sep='|')
V1 V2
1 I am very smart
So I would suggest to fetch the wiki page with Rcurl, grab the interesting part of the page with XML (which has a really neat function to parse HTML tables also) and if HTML format is not available call read.table with specified sep. Good luck!
Pipe '|' is a metacharacter, used as an 'OR' operator in regular expression.
try
unlist(strsplit("I am | very smart", "\s+\|\s+"))