Special character matching in r using grep

If I have sentences whose words are separated by spaces
s <- c("C java", "C++ java")
grep("C", s)
gives output as
[1] 1 2
while I only require
[1] 1
How to do that? (I have used c\++ to identify C++ separately, but matching with C gives both 1 and 2 as the output.)

If we want to match only element 1, we can use the start (^) and end ($) anchors to require that there are no characters before or after 'C'
grep("^C$",s)
#[1] 1
data
s<- c("C","C++","java")

s<-c("C","C++","java")
which(s %in% "C")
grep() gives a positive result for any match within a string, whereas %in% compares whole elements.
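
For the space-separated sentences in the question, anchoring the whole string is not enough, because C is only one word of each element. A minimal sketch, assuming words are separated by single spaces as in the question's data, is to require a space or a string boundary on each side of C. The variable names are illustrative.

```r
# Sentences from the question: words are separated by spaces.
s <- c("C java", "C++ java")

# Require start-of-string or a space before C, and a space or end-of-string after it.
c_only <- grep("(^| )C( |$)", s)

# fixed = TRUE treats the pattern as a literal string, so "+" needs no escaping.
cpp_only <- grep("C++", s, fixed = TRUE)

# Alternative without regex: split each sentence into words and compare whole words.
has_c_word <- vapply(strsplit(s, " ", fixed = TRUE),
                     function(w) "C" %in% w, logical(1))

print(c_only)      # 1
print(cpp_only)    # 2
print(has_c_word)  # TRUE FALSE
```

A plain word boundary (\\b) would not help here, since the position between C and + is itself a word boundary.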

Related

replacing word with same word + added characters R

I have a regex "^[0-9]\\.[0-9]|^§"
Now I want to replace the occurrences and add something around them.
Example:
"foo" becomes "[[foo]]"
grep("^[0-9]\\.[0-9]|^§", Vector)
gives me all occurrences, but I am unsure how to continue.

You can use sub. If you put parentheses around your pattern, you can refer to the captured text in the replacement string with \1 (written "\\1" inside an R string).
For example, if your vector is like this:
Vector <- c("2.9", "7.4", "A", "2.2")
And your regex is like this:
grep("^[0-9]\\.[0-9]|^§", Vector)
#> [1] 1 2 4
You can do
sub("(^[0-9]\\.[0-9]|^§)", "[[\\1]]", Vector)
#> [1] "[[2.9]]" "[[7.4]]" "A" "[[2.2]]"
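
Note that sub wraps only the matched part of each element, not the whole element. If the goal is to wrap every element that contains a match, one option is to select with grepl and build the new strings with paste0. The sketch below uses a made-up input vector to show the difference.

```r
# Hypothetical input: the second element has text after the matched prefix.
Vector <- c("2.9", "2.95 apples", "A")

# \u00a7 is the section sign used in the question's pattern.
pattern <- "^[0-9]\\.[0-9]|^\u00a7"

# sub() wraps only the matched text (capture group 1).
wrapped_match <- sub(paste0("(", pattern, ")"), "[[\\1]]", Vector)

# grepl() + paste0() wraps the entire element whenever it matches.
hit <- grepl(pattern, Vector)
wrapped_whole <- Vector
wrapped_whole[hit] <- paste0("[[", Vector[hit], "]]")

print(wrapped_match)  # "[[2.9]]" "[[2.9]]5 apples" "A"
print(wrapped_whole)  # "[[2.9]]" "[[2.95 apples]]" "A"
```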

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"

Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")

You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction approach with stringr:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
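
The second attempt in the question fails partly because POSIX classes such as [:digit:] are only valid inside a bracket expression, so the class has to be written [[:digit:]]. As a sketch in base R (no extra packages), the ID can also be pulled out with regexpr and regmatches, or by stripping the directory and the known suffix. The vector name files is illustrative.

```r
files <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt")

# regexpr() locates the first match in each element; regmatches() extracts it.
ids_regmatches <- regmatches(files, regexpr("SRR[[:digit:]]{8}", files))

# Alternative: drop the directory with basename(), then remove the known suffix.
ids_basename <- sub("_stats\\.txt$", "", basename(files))

print(ids_regmatches)
print(ids_basename)
```

Keep in mind that regmatches drops elements without a match, so its result can be shorter than the input if some file names do not contain an ID.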

Inverting a regex in R

I have this character vector:
[1] "19980213" "19980214" "19980215" "19980216" "19980217" "iffi" "geometry"
[8] "date_consid"
and I want to match all the elements that are not dates and not "date_consid". I tried
res = grep("(?!\\d{8})|(?!date_consid)", vec, value=T)
But I just can't make it work...

You can use
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(\\d{8}|date_consid)$", vec, value=TRUE, invert=TRUE)
## => [1] "iffi" "geometry"
The ^(\d{8}|date_consid)$ regex matches a string that only consists of any eight digits or that is equal to date_consid.
The value=TRUE argument makes grep return values rather than indices, and invert=TRUE inverts the regex match result (returns the elements that do not match).
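
The same filter can be written with grepl and logical negation, which can be easier to combine with other conditions. A small sketch using the vector from the answer (the names is_excluded and kept are illustrative):

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# TRUE for elements that are exactly eight digits or exactly "date_consid".
is_excluded <- grepl("^(\\d{8}|date_consid)$", vec)

# Keep everything that is not excluded.
kept <- vec[!is_excluded]

print(kept)  # "iffi" "geometry"
```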

The pattern that you tried matches every element because the lookaheads are unanchored.
Combining separate lookaheads with | (or) will still match all strings, because each string satisfies at least one of the alternatives.
You can change the logic to assert, from the start of the string, that what is directly to the right is neither 8 digits nor date_consid, in a single check.
Using a negative lookahead, you have to add perl=TRUE, add an anchor ^ to assert the start of the string, and add an anchor $ after each alternative inside the lookahead to assert the end of the string.
^(?!\\d{8}$|date_consid$)
^ Start of string
(?! Negative lookahead
\\d{8}$ Match 8 digits until end of string
| Or
date_consid$ Match date_consid until end of string
) Close lookahead
For example
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(?!\\d{8}$|date_consid$)", vec, value=T, perl=T)
Output
[1] "iffi" "geometry"
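
To see why the original attempt returns everything, it can help to run it next to the anchored version. The sketch below assumes perl = TRUE for both patterns, since lookaheads are a PCRE feature.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# Unanchored alternation of lookaheads: every string satisfies at least one
# branch at some position, so every element matches.
unanchored <- grep("(?!\\d{8})|(?!date_consid)", vec, value = TRUE, perl = TRUE)

# Anchored single lookahead: rejects elements that are exactly eight digits
# or exactly "date_consid".
anchored <- grep("^(?!\\d{8}$|date_consid$)", vec, value = TRUE, perl = TRUE)

print(identical(unanchored, vec))  # TRUE
print(anchored)                    # "iffi" "geometry"
```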

How to count " in the string? [duplicate]

I am trying to get the number of open brackets in a character string in R. I am using the str_count function from the stringr package
s<- "(hi),(bye),(hi)"
str_count(s,"(")
Error in stri_count_regex(string, pattern, opts_regex = attr(pattern,
: ` Incorrectly nested parentheses in regexp pattern.
(U_REGEX_MISMATCHED_PAREN)
I am hoping to get 3 for this example

( is a special character. You need to escape it:
str_count(s,"\\(")
# [1] 3
Alternatively, given that you're using stringr, you can use the coll function:
str_count(s,coll("("))
# [1] 3

You could also use gregexpr along with length in base R:
sum(gregexpr("(", s, fixed=TRUE)[[1]] > 0)
[1] 3
gregexpr takes in a character vector and returns a list with the starting positions of each match. I added fixed=TRUE in order to match literals. Using length alone will not work, because gregexpr returns -1 when no match is found.
If you have a character vector of length greater than one, you would need to feed the result to sapply:
# new example
s<- c("(hi),(bye),(hi)", "this (that) other", "what")
sapply((gregexpr("(", s, fixed=TRUE)), function(i) sum(i > 0))
[1] 3 1 0
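
A slightly more compact base R variant, shown here as a sketch, extracts the matches with regmatches and counts them with lengths. Elements without a match yield an empty vector, so the -1 special case needs no separate handling.

```r
s <- c("(hi),(bye),(hi)", "this (that) other", "what")

# gregexpr() finds all literal "(" positions; regmatches() turns them into
# the matched substrings (character(0) when there is no match).
matches <- regmatches(s, gregexpr("(", s, fixed = TRUE))

# lengths() counts the matches per element.
counts <- lengths(matches)

print(counts)  # 3 1 0
```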

If you want to do it in base R you can split into a vector of individual characters and count the "(" directly (without representing it as a regular expression):
> s<- "(hi),(bye),(hi)"
> chars <- unlist(strsplit(s,""))
> length(chars[chars == "("])
[1] 3
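
Another regex-free way to count, sketched below, is to delete the character and compare string lengths. Unlike the strsplit version, it is vectorised over s.

```r
s <- c("(hi),(bye),(hi)", "this (that) other", "what")

# Remove every literal "(" and measure how much shorter each string became.
without_paren <- gsub("(", "", s, fixed = TRUE)
counts <- nchar(s) - nchar(without_paren)

print(counts)  # 3 1 0
```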

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?

You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed by any 0+ chars other than line break chars and then _ and either dk or none.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
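
The attempt in the question fails because [^_dk|^_none] is a single negated character class, not an alternation: it matches one character that is none of _, d, k, |, ^, n, o, e. Every string in the example ends in one of those letters, so nothing matches. The sketch below illustrates this, together with a lookaround-free filter based on endsWith (base R 3.3 or later).

```r
x <- c("business_metric_one", "business_metric_one_dk", "business_metric_one_none",
       "business_metric_two", "business_metric_two_dk", "business_metric_two_none")

# The bracket expression excludes single characters, so strings ending in
# "e", "k" or "o" can never match.
attempt <- grep(".*metric.*[^_dk|^_none]$", x, value = TRUE)

# Two plain conditions: contains "metric" and does not end in _dk or _none.
has_suffix <- endsWith(x, "_dk") | endsWith(x, "_none")
kept <- x[grepl("metric", x, fixed = TRUE) & !has_suffix]

print(attempt)  # character(0)
print(kept)     # "business_metric_one" "business_metric_two"
```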

You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, with a more specific pattern:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
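
One caveat with the unanchored "(dk|none)" pattern is that it also drops names that merely contain those letters somewhere in the middle. A sketch with a made-up extra name (business_metric_nonelastic is hypothetical, not from the question) shows how anchoring the suffix avoids this:

```r
# Hypothetical data: the third name contains "none" but does not end in _none.
string <- c("business_metric_one", "business_metric_one_dk",
            "business_metric_nonelastic", "business_metric_two_none")

# Unanchored: also drops "business_metric_nonelastic".
loose <- string[!grepl("(dk|none)", string)]

# Anchored to the suffix: only names ending in _dk or _none are dropped.
strict <- string[!grepl("_(dk|none)$", string)]

print(loose)   # "business_metric_one"
print(strict)  # "business_metric_one" "business_metric_nonelastic"
```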
