How to strsplit using '|' character, it behaves unexpectedly? - r

I would like to split a character string at the pattern "|",
but
unlist(strsplit("I am | very smart", " | "))
[1] "I" "am" "|" "very" "smart"
or
gsub(pattern="|", replacement="*", x="I am | very smart")
[1] "*I* *a*m* *|* *v*e*r*y* *s*m*a*r*t*"

The problem is that by default strsplit interprets " | " as a regular expression, in which | has special meaning (as "or").
Use fixed argument:
unlist(strsplit("I am | very smart", " | ", fixed=TRUE))
# [1] "I am" "very smart"
A side effect is faster computation.
stringr alternative:
unlist(stringr::str_split("I am | very smart", stringr::fixed(" | ")))
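To make the difference concrete, here is a small side-by-side sketch in base R. The variable names are only illustrative.

```r
x <- "I am | very smart"

# Default: the split pattern is a regular expression, so " | " reads as
# "a space OR a space" and every single space becomes a split point.
regex_split <- strsplit(x, " | ")[[1]]

# fixed = TRUE: the pattern is matched literally, character by character.
fixed_split <- strsplit(x, " | ", fixed = TRUE)[[1]]

print(regex_split)
print(fixed_split)
```

The first result reproduces the unexpected output from the question; the second gives the two intended pieces.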

| is a metacharacter. You need to escape it (using \\ before it).
> unlist(strsplit("I am | very smart", " \\| "))
[1] "I am" "very smart"
> sub(pattern="\\|", replacement="*", x="I am | very smart")
[1] "I am * very smart"
Edit: You need two backslashes because, inside an R string literal, a single backslash starts an escape sequence such as \n (newline) or \t (tab). Writing \\ produces one literal backslash, so the regex engine receives \|. For more information look in the help page ?regex. The other metacharacters are . \ | ( ) [ { ^ $ * + ?
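A quick way to convince yourself of what the regex engine actually sees is to inspect the string. This is only a sketch; the character class [|] is shown as an alternative spelling that avoids backslashes altogether.

```r
# "\\|" in R source code is a two-character string: a backslash and a pipe.
pattern <- "\\|"
n_chars <- nchar(pattern)
cat(pattern, "\n")  # shows the string as the regex engine receives it

# Inside a character class the pipe is literal, so no escaping is needed.
parts_class   <- strsplit("a|b|c", "[|]")[[1]]
parts_escaped <- strsplit("a|b|c", pattern)[[1]]
print(parts_class)
```

Both spellings split on the literal pipe; which one to use is mostly a matter of readability.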

If you are parsing a table then calling read.table might be a better option. Tiny example:
> txt <- textConnection("I am | very smart")
> read.table(txt, sep='|')
     V1          V2
1 I am   very smart
So I would suggest fetching the wiki page with RCurl, grabbing the interesting part of the page with XML (which also has a really neat function to parse HTML tables), and, if HTML format is not available, calling read.table with the appropriate sep. Good luck!
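If the stray spaces around the separator are unwanted, read.table can trim them while parsing. A minimal sketch, assuming the input is available as a string (passed through the text argument) rather than as a file or connection:

```r
# Parse a pipe-delimited line; strip.white = TRUE trims leading and trailing
# whitespace from unquoted character fields.
df <- read.table(text = "I am | very smart", sep = "|",
                 strip.white = TRUE, stringsAsFactors = FALSE)
print(df)
```

Setting stringsAsFactors = FALSE keeps the columns as plain character vectors regardless of the R version.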

The pipe '|' is a metacharacter, used as an 'OR' operator in regular expressions.
Try
unlist(strsplit("I am | very smart", "\\s+\\|\\s+"))

Related

Using quantifiers in look-arounds (R/stringr)

I'd like to extract the name John Doe from the following string:
str <- 'Name: | |John Doe |'
I can do:
library(stringr)
str_extract(str,'(?<=Name: \\| \\|).*(?= \\|)')
[1] "John Doe"
But that involves typing a lot of spaces, and it doesn't work well when the number of spaces is not fixed. But when I try to use a quantifier (+), I get an error:
str_extract(str,'(?<=Name: \\| +\\|).*(?= +\\|)')
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT, context=`(?<=Name: \| +\|).*(?= +\|)`)
The same goes for other variants:
str_extract(str,'(?<=Name: \\|\\s+\\|).*(?=\\s+\\|)')
str_extract(str,'(?<=Name: \\|\\s{1,}\\|).*(?=\\s{1,}\\|)')
Is there a solution to this?
How about:
First we remove Name
Then we replace all special characters with space
and finally str_squish it
library(stringr)
str_squish(str_replace_all( str_remove(str, "Name"), "[^[:alnum:]]", " "))
[1] "John Doe"
Another solution using base R:
sub("Name: \\|\\s+\\|(.*\\S)\\s+\\|", "\\1", str)
# [1] "John Doe"
You might also use \K to drop everything matched so far from the reported match.
Name: \|\h+\|\K.*?(?=\h+\|)
Explanation
Name: \| match Name: |
\h+\| Match 1+ spaces and |
\K Forget what is matched so far
.*? Match as few characters as possible
(?=\h+\|) Positive lookahead, assert 1+ more spaces to the right followed by |
See a regex demo and a R demo.
Example
str <- 'Name: | |John Doe |'
regmatches(str, regexpr("Name: \\|\\h+\\|\\K.*?(?=\\h+\\|)", str, perl=TRUE))
Output
[1] "John Doe"
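Another way to sidestep the bounded look-behind restriction is to drop the look-arounds entirely and use a capture group. A minimal base R sketch; the sample string with several spaces is an assumption made for illustration.

```r
# Hypothetical input with a variable number of spaces around the pipes.
str <- "Name: |    |John Doe   |"

# Match the whole field but capture only the part between the pipes.
# The lazy (.*?) stops before the run of whitespace preceding the last pipe.
m <- regmatches(str, regexec("Name: \\|\\s+\\|(.*?)\\s+\\|", str, perl = TRUE))
name <- m[[1]][2]  # element 1 is the full match, element 2 the first group
print(name)
```

Because quantifiers are only restricted inside look-behinds, \\s+ can be used freely here.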

Regex for matching between a colon and last newline prior to next colon

I am trying to parse a string with regex to pull out information between a colon and the last newline prior to the next colon. How can I do this?
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
stringr::str_extract_all(string, "(?<=:)(.*)(?:\\n)")
but I get:
[[1]]
[1] " Al's\n" " \n" " RI\n"
when I want:
[[1]]
[1] " Al's\nPlace\n" " \n" " RI\n"
I'm not sure if this is what you're after as your wanted output looks a bit different.
:((?:.*\\n?)+?)(?=.*:|$)
: match a colon
((?:.*\n?)+?) match and capture lazily any lines (to optional \n)
(?=.*:|$) until there is a line with colon ahead
See this demo at regex101
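For reference, the pattern above could be applied from R roughly as follows. This is a sketch using base R with perl = TRUE rather than stringr, so details may differ slightly from the ICU engine that stringr uses.

```r
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"

# perl = TRUE selects PCRE, which supports the lookahead used here.
pattern <- ":((?:.*\\n?)+?)(?=.*:|$)"
full <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

# Each full match starts with the colon; drop it to keep only the captured text.
values <- substring(full, 2)
print(values)
```

regmatches returns full matches only, which is why the leading colon is removed with substring afterwards.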

gsub with "|" character in R

I have a data frame with strings under a variable with the | character. What I want is to remove anything downstream of the | character.
For example, considering the string
heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding
I wish to have only:
heat-shock protein hsp70, putative
Do I need any escape character for the | character?
If I do:
a <- c("foo_5", "bar_7")
gsub("*_.", "", a)
I get:
[1] "foo" "bar"
i.e. I am removing anything downstream of the _ character.
However, If I repeat the same task with a | instead of the _:
b <- c("foo|5", "bar|7")
gsub("*|.", "", b)
I get:
[1] "" ""
You have to escape | by writing it as \\|. Try this
> gsub("\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative "
where string is
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
This alternative also removes the trailing space from the output
gsub("\\s+\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative"
Maybe a better job for strsplit than for a gsub
And yes, it looks like the pipe does need to be escaped.
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
strsplit(string, ' \\| ')[[1]][1]
That outputs
"heat-shock protein hsp70, putative"
Note that I'm assuming you only want the text from before the first pipe, and that you want to drop the space that separates the pipe from the piece of the string you care about.
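Since the question mentions a data frame column, note that the same idea works on a whole character vector at once. A small sketch; the vector b and the extra element without a pipe are illustrative assumptions.

```r
b <- c("foo|5", "bar|7", "no pipe here")

# Remove optional whitespace, the first pipe, and everything after it.
# sub() is enough because .*$ already consumes the rest of the string.
cleaned <- sub("\\s*\\|.*$", "", b)

# Equivalent spelling with a character class instead of a backslash escape.
cleaned_class <- sub("\\s*[|].*$", "", b)
print(cleaned)
```

Strings without a pipe are returned unchanged, so the call can be applied to a column directly, e.g. by assigning the result back to it.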

How to completely remove head and tail white spaces or punctuation characters?

I have string_a, such that
string_a <- " ,A thing, something, . ."
Using regex, how can I just retain "A thing, something"?
I have tried the following and got such output:
sub("[[:punct:]]$|^[[:punct:]]","", trimws(string_a))
[1] "A thing, something, . ."
We can use gsub to match one or more punctuation characters or spaces ([[:punct:] ]+) at the start (^) of the string, or (|) the same characters at the end ($), and replace them with an empty string ("")
gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)
#[1] "A thing, something"
Note: sub replaces only the first match, so it cannot strip both ends in one call
Or, as @Cath mentioned, [[:punct:] ] can be replaced with \\W
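The \\W variant can be sketched as follows. One caveat worth knowing: \\W means "not a word character", and the underscore counts as a word character, so the two patterns are not fully interchangeable. The underscore input is an illustrative assumption.

```r
string_a <- " ,A thing, something, . ."

# Strip runs of non-word characters from both ends of the string.
trimmed <- gsub("^\\W+|\\W+$", "", string_a)
print(trimmed)

# Difference between the two patterns: "_" is punctuation but also a word character.
keep_underscore <- gsub("^\\W+|\\W+$", "", "_a_")
drop_underscore <- gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", "_a_")
```

If leading or trailing underscores should be removed too, the [[:punct:] ] version is the safer choice.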

How to replace text sequences ending in a fixed pattern within a long text string in R?

I have a column within a data frame containing long text sequences (often in the thousands of characters) of the format:
abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)
i.e. a string, a suffix in brackets, then a delimiter
I'm trying to work out the syntax in R to delete the items that end in (VR), including the trailing pipe if present, so that I'm left with:
ddee(NR) | def(NR) | oqq | pqq
I cannot work out the regular expression (or gsub) that will remove these entries and would like to request if anyone could help me please.
If you want to use gsub, you can remove the pattern in two stages:
s <- "abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)"
gsub(" \\| $", "", gsub("\\w+\\(VR\\)( \\| )?", "", s))
# firstly remove all words ending with (VR) and optional | following the pattern and
# then remove the possible | at the end of the string
# [1] "ddee(NR) | def(NR) | oqq | pqq"
regular expression \\w+\\(VR\\) will match words ending with (VR), parentheses are escaped by \\;
( \\| )? matches optional delimiter |, this makes sure it will match the pattern both in the middle and at the end of the string;
possible | left out at the end of the string can be removed by a second gsub;
Here is a method using strsplit and paste with the collapse argument:
temp <- "abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)"
paste(sapply(strsplit(temp, split=" +\\| +"),
             function(i) { i[setdiff(seq_along(i), grep("\\(VR\\)$", i))] }),
      collapse=" | ")
[1] "ddee(NR) | def(NR) | oqq | pqq"
We split on the pipe and spaces, then feed the resulting list to sapply which uses the grep function to drop any elements of the vector that end with "(VR)". Finally, the result is pasted together.
I added a subsetting method with setdiff so that vectors without any "(VR)" will return without any modification.
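Because the data live in a data frame column, a version that handles many strings at once may be handy. The following is a sketch under that assumption; drop_vr is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: remove "(VR)" items from each pipe-delimited string.
drop_vr <- function(x) {
  parts <- strsplit(x, " *\\| *")  # one character vector of items per input string
  vapply(parts, function(p) {
    # Keep the items that do NOT end in "(VR)" and glue them back together.
    paste(p[!grepl("\\(VR\\)$", p)], collapse = " | ")
  }, character(1))
}

x <- c("abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)",
       "oqq | pqq")
result <- drop_vr(x)
print(result)
```

Using a logical index from grepl avoids the empty-result problem that negative indexing with grep has when nothing matches, so no setdiff workaround is needed here.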
