gsub with "|" character in R

I have a data frame with strings under a variable with the | character. What I want is to remove anything downstream of the | character.
For example, considering the string
heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding
I wish to have only:
heat-shock protein hsp70, putative
Do I need any escape character for the | character?
If I do:
a <- c("foo_5", "bar_7")
gsub("*_.", "", a)
I get:
[1] "foo" "bar"
i.e. I am removing anything downstream of the _ character.
However, if I repeat the same task with a | instead of the _:
b <- c("foo|5", "bar|7")
gsub("*|.", "", b)
I get:
[1] "" ""

You have to escape | by writing it as \\|. Try this:
> gsub("\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative "
where string is
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
This alternative also removes the trailing space from the output:
gsub("\\s+\\|.*$", "", string)
[1] "heat-shock protein hsp70, putative"
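Since the question is about a data frame column, here is a minimal sketch of applying the same escaped pattern to a whole column. The data frame and the column names `desc` and `clean` are assumptions for illustration; `sub` is used instead of `gsub` because the pattern runs to the end of the string, so one replacement per element is enough.

```r
# Hypothetical data frame; the column names are illustrative only.
df <- data.frame(
  desc = c("heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654",
           "no pipe here"),
  stringsAsFactors = FALSE
)

# Drop optional whitespace, the first literal pipe, and everything after it.
df$clean <- sub("\\s*\\|.*$", "", df$desc)
print(df$clean)
```

Strings without a pipe are returned unchanged, because the pattern simply does not match.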

This may be a better job for strsplit than for gsub.
And yes, the pipe does need to be escaped.
string <- "heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654 | sequence_SO=chromosome | SO=protein_coding"
strsplit(string, ' \\| ')[[1]][1]
That outputs
"heat-shock protein hsp70, putative"
Note that I'm assuming you only want the text from before the first pipe, and that you want to drop the space that separates the pipe from the piece of the string you care about.
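For more than one string, the same idea can be sketched as below. This variant uses `fixed = TRUE` so the delimiter is taken literally and needs no escaping; the input vector is made up for illustration.

```r
# Illustrative input; only the first element comes from the question.
x <- c("heat-shock protein hsp70, putative | location=Ld28_v01s1:1091329-1093293(-) | length=654",
       "foo | 5",
       "no delimiter")

# Split on the literal " | " and keep the first piece of each element.
pieces <- strsplit(x, " | ", fixed = TRUE)
first_field <- vapply(pieces, function(p) p[1], character(1))
print(first_field)
```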

Related

How can I remove certain part of row names in data frame

I have a data set with the following format:
ID | Value
-------------------------- | -------------------------------
AAA1|404744 | 1.7554
ANKHD1-EIF4EBP3|404734 | 0.5174
HLA-B|3106 | 11.7659
HLA-A|3105 | 18.0851
What I want is removing certain part of the row names like this:
ID | Value
--------------------- | -------------------------------
AAA1 | 1.7554
ANKHD1-EIF4EBP3 | 0.5174
HLA-B | 11.7659
HLA-A | 18.0851
Thanks a lot!
We can do this with sub. The | is a regex metacharacter meaning "or", so either escape it (\\|) or place it in brackets ([|]) to get the literal character. Match it followed by any characters (.*) and replace the match with an empty string ("").
df$ID <- sub("[|].*", "", df$ID)
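A small reproducible sketch, assuming the data from the question sits in a data frame called `df` with columns `ID` and `Value`:

```r
# Rebuild the example data from the question (the name `df` is an assumption).
df <- data.frame(
  ID = c("AAA1|404744", "ANKHD1-EIF4EBP3|404734", "HLA-B|3106", "HLA-A|3105"),
  Value = c(1.7554, 0.5174, 11.7659, 18.0851),
  stringsAsFactors = FALSE
)

# The bracketed and the escaped form of the pattern give the same result.
id_bracket <- sub("[|].*", "", df$ID)
id_escaped <- sub("\\|.*", "", df$ID)

df$ID <- id_bracket
print(df)
```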

How to replace all escape sequence with blank in robot framework

I have tried multiple things to convert my variable containing escape sequence characters into a blank string. How do I replace an escape sequence character with blank?
${stg}    Set Variable    \r\n
Replace String    ${stg}    \r\n    ${EMPTY}
Log    ${stg}
Should Not Be Equal    ${stg}    \r\n
In line 4, ${stg} == '\r\n'. How do I make this blank?
You were very close. The docs for Replace String give you the answer:
A modified version of the string is returned and the original
string is not altered.
Examples:
| ${str} = | Replace String | Hello, world! | world | tellus |
| Should Be Equal | ${str} | Hello, tellus! | | |
In your case, assign the return value of line 2 to ${stg}:
${stg}    Replace String    ${stg}    \r\n    ${EMPTY}

How to replace text sequences ending in a fixed pattern within a long text string in R?

I have a column within a data frame containing long text sequences (often in the thousands of characters) of the format:
abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)
i.e. a string, a suffix in brackets, then a delimiter
I'm trying to work out the syntax in R to delete the items that end in (VR), including the trailing pipe if present, so that I'm left with:
ddee(NR) | def(NR) | oqq | pqq
I cannot work out the regular expression (or gsub) that will remove these entries and would like to request if anyone could help me please.
If you want to use gsub, you can remove the pattern in two stages:
gsub(" \\| $", "", gsub("\\w+\\(VR\\)( \\| )?", "", s))
# firstly remove all words ending with (VR) and optional | following the pattern and
# then remove the possible | at the end of the string
# [1] "ddee(NR) | def(NR) | oqq | pqq"
The regular expression \\w+\\(VR\\) matches words ending with (VR); the parentheses are escaped with \\.
( \\| )? matches the optional delimiter |, which makes sure the pattern matches both in the middle and at the end of the string.
A possible | left over at the end of the string is removed by the second gsub.
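To check the edge cases, the two-stage replacement can be wrapped in a small helper. The function name `drop_vr` and the extra test strings are assumptions added for illustration.

```r
# Hypothetical helper wrapping the two gsub calls from the answer.
drop_vr <- function(s) {
  # Stage 1: remove every word ending in (VR), plus a following " | " if present.
  without_vr <- gsub("\\w+\\(VR\\)( \\| )?", "", s)
  # Stage 2: remove a delimiter left dangling at the end of the string.
  gsub(" \\| $", "", without_vr)
}

s <- "abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)"
print(drop_vr(s))
print(drop_vr(c("oqq | x(VR)", "a | b")))
```

Because `gsub` is vectorised, the helper can be applied directly to a data frame column.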
Here is a method using strsplit and paste with the collapse argument:
paste(sapply(strsplit(temp, split=" +\\| +"),
             function(i) { i[setdiff(seq_along(i), grep("\\(VR\\)$", i))] }),
      collapse=" | ")
[1] "ddee(NR) | def(NR) | oqq | pqq"
We split on the pipe and spaces, then feed the resulting list to sapply which uses the grep function to drop any elements of the vector that end with "(VR)". Finally, the result is pasted together.
I added a subsetting method with setdiff so that vectors without any "(VR)" will return without any modification.
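The paste call above collapses everything into one string, so it fits a single input. For a whole column, a per-element variant might look like the sketch below; the helper name `drop_vr_split` is an assumption, and logical indexing with `grepl` stands in for the `setdiff` step, since it also leaves vectors without any "(VR)" untouched.

```r
# Hypothetical per-element variant of the split-and-paste approach.
drop_vr_split <- function(x) {
  parts <- strsplit(x, " +\\| +")
  vapply(parts, function(p) {
    kept <- p[!grepl("\\(VR\\)$", p)]   # keep entries that do not end in (VR)
    paste(kept, collapse = " | ")       # re-join each element separately
  }, character(1))
}

temp <- c("abab(VR) | ddee(NR) | def(NR) | fff(VR) | oqq | pqq | ppf(VR)",
          "oqq | pqq")
print(drop_vr_split(temp))
```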

EBNF Definition of Identifier

The EBNF definition of an identifier is (a-zA-Z, _ ){a-zA-Z0-9, _ }. Can someone explain this definition and give me a valid identifier by this definition.
The syntax of EBNF-like languages differs a lot.
Normally I would define something like this:
letter = "a" | "b" | ... | "z" | "A" | ... | "Z";
digit = "0" | "1" | "2" | ... | "9";
identifier = letter , { letter | digit | "_" } ;
Your form looks like a mixture of EBNF and regex.
It is hard to tell what this means if I don't know which language we are talking about.
But by pure guessing, I would say it describes a C-like identifier (e.g. variable name) like "myVar_0123ab".
Under your form, the identifier has to start with a letter or an underscore '_', followed by any number of letters, underscores and digits.
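Under that guessed reading (one letter or underscore, then zero or more letters, digits or underscores), the definition corresponds to a simple regular expression. The sketch below tests a few made-up candidates in R; it illustrates the guess and is not an authoritative translation of the notation.

```r
# Regex equivalent of the guessed reading of the definition.
identifier_pattern <- "^[A-Za-z_][A-Za-z0-9_]*$"

candidates <- c("myVar_0123ab", "_tmp", "9lives", "a-b", "")
is_identifier <- grepl(identifier_pattern, candidates)
print(setNames(is_identifier, candidates))
```

The first two candidates are accepted; the others fail because they start with a digit, contain a hyphen, or are empty.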

How to strsplit using '|' character, it behaves unexpectedly?

I would like to split a character string at the pattern "|",
but
unlist(strsplit("I am | very smart", " | "))
[1] "I" "am" "|" "very" "smart"
or
gsub(pattern="|", replacement="*", x="I am | very smart")
[1] "*I* *a*m* *|* *v*e*r*y* *s*m*a*r*t*"
The problem is that by default strsplit interprets " | " as a regular expression, in which | has special meaning (as "or").
Use the fixed argument:
unlist(strsplit("I am | very smart", " | ", fixed=TRUE))
# [1] "I am" "very smart"
A side effect is faster computation.
stringr alternative:
unlist(stringr::str_split("I am | very smart", stringr::fixed(" | ")))
| is a metacharacter. You need to escape it (using \\ before it).
> unlist(strsplit("I am | very smart", " \\| "))
[1] "I am" "very smart"
> sub(pattern="\\|", replacement="*", x="I am | very smart")
[1] "I am * very smart"
Edit: The reason you need two backslashes is that, inside an R string literal, a single backslash is itself the escape prefix for special symbols such as \n (newline) and \t (tab). Writing \\| therefore produces the two-character string \|, which is what the regex engine needs to see. For more information look in the help page ?regex. The other metacharacters are . \ | ( ) [ { ^ $ * + ?
If you are parsing a table then calling read.table might be a better option. Tiny example:
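A short sketch of the double-backslash point, and of the three common ways to match a literal pipe (escaping, a character class, and fixed = TRUE). The variable names are illustrative.

```r
x <- "I am | very smart"

# "\\|" in R source code is a two-character string: a backslash and a pipe.
pattern <- "\\|"
print(nchar(pattern))
writeLines(pattern)   # shows the string as the regex engine receives it

by_escape <- strsplit(x, " \\| ")[[1]]              # escaped metacharacter
by_class  <- strsplit(x, " [|] ")[[1]]              # pipe inside a character class
by_fixed  <- strsplit(x, " | ", fixed = TRUE)[[1]]  # no regex interpretation at all
print(by_escape)
```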
> txt <- textConnection("I am | very smart")
> read.table(txt, sep='|')
V1 V2
1 I am very smart
So I would suggest fetching the wiki page with RCurl, grabbing the interesting part of the page with XML (which also has a really neat function to parse HTML tables), and, if HTML format is not available, calling read.table with the sep specified. Good luck!
Pipe '|' is a metacharacter, used as an 'OR' operator in regular expressions.
Try:
unlist(strsplit("I am | very smart", "\\s+\\|\\s+"))
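When going the read.table route, the spaces around the pipe end up inside the fields unless they are stripped. A minimal sketch, assuming pipe-delimited text is passed through the text argument (the second input line is made up):

```r
# Two illustrative pipe-delimited lines.
txt <- "I am | very smart\nYou are | quite right"

# strip.white = TRUE trims the spaces around each field when sep is given.
tbl <- read.table(text = txt, sep = "|", strip.white = TRUE,
                  stringsAsFactors = FALSE)
print(tbl)
```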
