I'm trying to remove all fields that have special characters (#?.* etc) in their text.
I think I should be using
Filter(function(x) {grepl('|[^[:punct:]]).*?', x)} == FALSE, data$V1)
where data$V1 contains my data. However, it seems like
grepl('|[^[:punct:]]).*?', x)
fails with trivial examples like
grepl('|[^[:punct:]]).*?', 'M')
which outputs TRUE even though M has no special characters. How should I be using grepl to remove fields with special characters from a column of data?
To search for "special characters", you can search for the negation of alphanumeric characters as such:
grepl('[^[:alnum:]_]+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
or use the symbol \W
grepl('\\W+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
\W is explained in the regular expression help:
"The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]̀)."
Starting a regular expression with a | make it literally useless since it will match anything.
See this JS example:
console.log('With the starting pipe => ' + /|([\W]).*?/.test('M'));
console.log('Without the starting pipe => ' + /([\W]).*?/.test('M'));
Simply put those inside [...] and provide this to the pattern argument to grepl, then negate.
data$V1[!grepl("[#?.*]", data$V1)]
For example,
> x <- c("M", "3#3", "8.*x")
> x[!grepl("[#?.*]", x)]
[1] "M"
Related
I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we use [A-Z] also to match it will match the upper case letter and replace with blank ("") as in the OP's str_replace_all
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I cant figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
Regex demo
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the left most HUMAN right after.
Note the default value for the FIXED argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern, by setting fixed=FALSE. However, the ^(?:[^|]*\|){2} as the left argument regex will not work because the qdap package creates an ICU regex with lookarounds (since you use extract=TRUE that sets include.markers to FALSE), which is (?<=^(?:[^|]*\|){2}).*?(?=HUMAN).
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than a 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - startof strig
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
this is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution is an application of regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 matches whatever was matched inside the first parenthesis.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The part (.*?) is replaced by itself as the replacement contains the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a mamixmum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = T),
function(item) strsplit(item[3], "_HUMAN", fixed = T)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.
Suppose I have the following two strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
See the R demo
The metric(?!.*(?:_dk|_none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
See the regex demo.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
more concisely:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
I'm trying to extract certain records from a dataframe with grepl.
This is based on the comparison between two columns Result and Names. This variable is build like this "WordNumber" but for the same word I have multiple numbers (more than 30), so when I use the grepl expression to get for instance Word1 I get also results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- c("Word1")
colnames(Names) <- name
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work, if I use fixed = TRUE than it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to #Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is an anchor to identify word before/after pattern
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" is an escape sequence for the beginning of a word, and ">" is used for end
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
Despite reading the help page of R regex
Finally, to include a literal -, place it first or last (or, for perl
= TRUE only, precede it by a backslash).
I can't understand the difference between
grepl(pattern=paste("^thing1\\-",sep=""),x="thing1-thing2")
and
grepl(pattern=paste("^thing1-",sep=""),x="thing1-thing2")
Both return TRUE. Should I escape or not here? What is the best practice?
The hyphen is mostly a normal character in regular expressions.
You do not need to escape the hyphen outside of a character class; it has no special meaning.
Within a character class [ ] you can place a hyphen as the first or last character in the range. If you place the hyphen anywhere else you need to escape it in order to add it to your class.
Examples:
grepl('^thing1-', x='thing1-thing2')
[1] TRUE
grepl('[-a-z]+', 'foo-bar')
[1] TRUE
grepl('[a-z-]+', 'foo-bar')
[1] TRUE
grepl('[a-z\\-\\d]+', 'foo-bar')
[1] TRUE
Note: It is more common to find a hyphen placed first or last within a character class.
To see what it means for - to have a special meaning inside of a character class (and how putting it last gives it its literal meaning), try the following:
grepl("[w-y]", "x")
# [1] TRUE
grepl("[w-y]", "-")
# [1] FALSE
grepl("[wy-]", "-")
# [1] TRUE
grepl("[wy-]", "x")
# [1] FALSE
They are both matching the exact same text in these instances. I.e.:
x <- "thing1-thing2"
regmatches(x,regexpr("^thing1\\-",x))
#[1] "thing1-"
regmatches(x,regexpr("^thing1-",x))
#[1] "thing1-"
Using a - is a special character in certain situations though, for specifying ranges of values, such as characters between a and z when specifed inside [], e.g.:
regmatches(x,regexpr("[a-z]+",x))
#[1] "thing"