I am trying to use a grep formula to search for at least one of the following terms in quotations in the code below in df$AllPrograms.
grep("Service & Product Provider (Partner;ACT)" | "Buildings (Prospect;INA)", df$AllPrograms)
This isn't working and I suspect it is because grep is not interpret ting the & ; and () as operators rather than characters.
Use a double backslash "\" to escape these characters. This is because the backslash is an escape character in extended regex, but we need to "escape" the first backslash as well.
Also, in your example code you have incorrectly specified the OR statement. Try:
grep("Service \\& Product Provider \\(Partner\\;ACT\\)|Buildings \\(Prospect\\;INA\\)", df$AllPrograms)
If there are many other patterns that you'd like to check for, take a look at this link here:
grep using a character vector with multiple patterns
Related
I have a document, and I need to find all the words(no spaces) borded with '. (e.g. 'apple', 'hello') What would be the regular expression?
I've tried ^''$ but it didn't work.
If there isn't any solution, it could not be "any word" but also it can be a word from an order(e.g. apple, banana, lemon) but it still must have the (')s.
Thank you so much
Andrew
If you want to capture single-quoted strings, literally any character run except single-quotes but between the single-quotes, use
/'[^']+'/
If you need single words, i.e. alphabetic characters but no spaces, try
/'[a-zA-Z]+'/
I'm asssuming a couple things here:
You're using a language that delimits regexes with slashes. This includes Javascript and Perl to my knowledge, and probably a bunch of others. In some other languages, like C#, you should use double quotes to delimit, e.g. "'[a-zA-Z]+'"
You're using a flavor of regex that does not need to escape the plus sign.
You're trying to capture all such words within a long string. I.e., if the input string is "Here is a 'long' string with 'some' 'words' single-quoted" then you will capture three words: 'long','some', and 'words'.
I need to index all the rows that have a string beginning with either "B-" or "B^" in one of the columns. I tried a bunch of combinations, but I am suspecting it might not be working due to "-" and "^" signs being part of grep command as well.
dataset[grep('^(B-|B^)[^B-|B^]*$', dataset$Col1),]
With the above script, rows beginning with "B^" are not being extracted. Please suggest a smart way to handle this.
You can use the escape \\ command in grep:
dataset[grep('^(B\\-|B\\^)[^B\\-|B\\^]*$', dataset$Col1),]
For further explanation, the ^ matches the beginning of a string as an anchor therefore you have to escape it in the middle of string. The [] are a character class so [^B-|B^]* matches any character that's not a B,-,B, or ^. They are unnecessary here.
The simplified regex is:
dataset[grep('^(B-|B\\^)', dataset$Col1),]
I want to split this string in several substrings:
BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965,
0:591}
The separator is | (ascii 124).
It works with all other separators but not with this one.
?regex
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde. Note that alternation does not work inside character classes, where | has its literal meaning.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
Thus:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}', "\\|")
As #Frank noted, you can do this in base::strsplit() by adding the fixed=TRUE:
strsplit('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}',"|", fixed=TRUE)
However, you can also do this with stringr::str_split() by decorating the regular expression for the separator:
stringr::str_split('BAA33520.2|/gene="vpf402",/product="Vpf402"|GI:8272373|AB012574|join{7347:7965, 0:591}',
regex("|", literal=TRUE))
Incidentally, stringr is pretty much just a slightly friendlier wrapper to stringi functions at this point and I highly recommend studying the stringi package as it contains some wonderful gems outside of string spiltting.
Windows copies path with backslash \, which R does not accept. So, I wanted to write a function which would convert \ to /. For example:
chartr0 <- function(foo) chartr('\','\\/',foo)
Then use chartr0 as...
source(chartr0('E:\RStuff\test.r'))
But chartr0 is not working. I guess, I am unable to escape /. I guess escaping / may be important in many other occasions.
Also, is it possible to avoid the use chartr0 every time, but convert all path automatically by creating an environment in R which calls chartr0 or use some kind of temporary use like using options
From R 4.0.0 you can use r"(...)" to write a path as raw string constant, which avoids the need for escaping:
r"(E:\RStuff\test.r)"
# [1] "E:\\RStuff\\test.r"
There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
Your fundamental problem is that R will signal an error condition as soon as it sees a single back-slash before any character other than a few lower-case letters, backslashes themselves, quotes or some conventions for entering octal, hex or Unicode sequences. That is because the interpreter sees the back-slash as a message to "escape" the usual translation of characters and do something else. If you want a single back-slash in your character element you need to type 2 backslashes. That will create one backslash:
nchar("\\")
#[1] 1
The "Character vectors" section of _Intro_to_R_ says:
"Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \ is entered and printed as \, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace—see ?Quotes for a full list."
?Quotes
chartr0 <- function(foo) chartr('\\','/',foo)
chartr0('E:\\RStuff\\test.r')
You cannot write E:\Rxxxx, because R believes R is escaped.
The problem is that every single forward slash and backslash in your code is escaped incorrectly, resulting in either an invalid string or the wrong string being used. You need to read up on which characters need to be escaped and how. Take a look at the list of escape sequences in the link below. Anything not listed there (such as the forward slash) is treated literally and does not require any escaping.
http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants
I'd like to gain a better understanding of escape character sequences in R. I've tried searching for things like ?'\' but, that escapes itself and ?'\\'
I'd like to avoid this kind of behaviour with cat(). For example:
cat("\")
+
Versus:
cat("\\")
\
The help page you are looking for is ?Quotes (with the capital Q). String literal syntax is also described (less clearly IMHO) at http://cran.r-project.org/doc/manuals/R-lang.html#Literal-constants.
The backslash escape works very nearly the same as it does in C and all the other languages that borrowed backslash escapes from C -- \n inserts a newline, \\ inserts a single backslash, \" in a double quoted string prevents the " from ending the string, etc.