How to perl regex match in R in the grepl function?


I have a function in R which uses the grepl command as follows:
function(x) grepl('\bx\b',res$label, perl=T)
This doesn't seem to work. The 'x' input is a character string (a sentence), and I'd like to put word boundaries around 'x' when matching, because I don't want the term to also pull out similar terms from the table I am searching through.
Any suggestions?

You just need to properly escape the backslash in your regex:
ff<-function(x) grepl('\\bx\\b',x, perl=T)
ff(c("axa","a x a", "xa", "ax","x"))
# [1] FALSE TRUE FALSE FALSE TRUE
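Note that the snippet above matches the literal letter x, not the value passed in as x. If the goal is to search for whatever string the argument holds, the pattern has to be assembled from it. A minimal sketch, assuming the term contains no regex metacharacters (match_whole_word and labels are names made up for this illustration, not from the question):

```r
# Hypothetical helper: TRUE where `term` occurs as a whole word in `labels`.
# Assumes `term` has no regex metacharacters; they would be read as regex syntax.
match_whole_word <- function(term, labels) {
  pattern <- paste0("\\b", term, "\\b")  # wrap the value of `term` in word boundaries
  grepl(pattern, labels, perl = TRUE)
}

labels <- c("axa", "a x a", "xa", "ax", "x")
match_whole_word("x", labels)
# [1] FALSE  TRUE FALSE FALSE  TRUE
```

In the question's setting, the second argument would be res$label.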

If you just want to know whether a string is a sentence rather than a single word, you could use: function(x) grepl('\\s',x)

Related

Replace Exact String in R using Variables [duplicate]

I'm trying to extract certain records from a dataframe with grepl.
This is based on a comparison between two columns, Results and Names. The variable is built like "WordNumber", but the same word has multiple numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- data.frame(name = c("Word1"))
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?

In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is an anchor that matches a word boundary, i.e. the position just before or just after a word
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the beginning of a word, and "\>" matches the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE

Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
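When several names are collapsed with "|", the anchors can also be placed once around a group instead of around every name. A rough sketch using the data from the question; the vector wanted and its second entry are made up for illustration:

```r
# `wanted` is a hypothetical vector of names; "Word3" is an assumption for illustration.
wanted <- c("Word1", "Word3")

Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)

# Build ^(Word1|Word3)$ so every alternative must match the whole string.
pattern <- paste0("^(", paste(wanted, collapse = "|"), ")$")
kept <- subset(Relationships, grepl(pattern, Results))
kept
```

Without the parentheses, "^Word1|Word3$" would anchor only the first alternative at the start and only the last at the end, so the grouping matters.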

I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]

Partial Match word from sentence in R

I am trying to partially match a string using the %in% operator in R, but when I run the code below I get FALSE
'I just want to partial match string' %in% 'partial'
FALSE
The expected output is TRUE in the above case (because it is a partial match)

Since you want a partial match within a sentence, you should try %like% from data.table:
library(data.table)
'I just want to partial match string' %like% 'partial'
TRUE
The output is TRUE

`%in_str%` <- function(pattern, s) {
  grepl(pattern, s)
}
Usage:
> 'a' %in_str% 'abc'
[1] TRUE

You need to strsplit the string so each word in it is its own element in a vector:
"partial" %in% unlist(strsplit('I just want to partial match string'," "))
[1] TRUE
strsplit takes a string and breaks it into a vector of shorter strings. In this case, it breaks on the space (that's the " " at the end), so that you get a vector of individual words. Unfortunately, strsplit returns its results as a list, which is why I wrapped it in an unlist, so we get a single vector.
Then we do the %in%, which works in the opposite direction from the one you used: you're trying to find out if string partial is %in% the sentence, not the other way around.
Of course, this is an annoying way of doing it, so it's probably better to go with a grep-based solution if you want to stay within base-R, or Priyanka's data.table solution above -- both of which will also be better at stuff like matching multiple-word strings.
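For reference, the grep-based route in base R could look roughly like this. It is only a sketch; fixed = TRUE is my own choice here so that the search text is taken literally rather than as a regex:

```r
sentence <- "I just want to partial match string"

# Substring search: TRUE if the text occurs anywhere in the sentence.
has_partial <- grepl("partial", sentence, fixed = TRUE)

# A fragment of a word also counts as a match with grepl ...
has_part <- grepl("part", sentence, fixed = TRUE)

# ... whereas the strsplit/%in% approach only matches whole words.
word_part <- "part" %in% strsplit(sentence, " ", fixed = TRUE)[[1]]

c(has_partial, has_part, word_part)
# [1]  TRUE  TRUE FALSE
```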

Match pattern so long as it doesn't contain a specific string

Let's say I have the following strings:
quiz.1.player.chat_results
and
partner_quiz.1.player.chat_results
I have hundreds of strings like this where the only difference is that one is prefixed with "partner" and the other is not. I'm trying to match one but not the other.
The specific pattern I'd like to match looks like so:
index <- grep('^(quiz.)[1-5]{1}.player.chat_results', names(data))
But this will match both strings. I'm guessing I have to use some negative lookahead like so:
^((?!partner).)
But I'm not sure where to use it.

I'll answer your title question, as it will be the most useful to other people finding this question.
How to match strings that do not contain a given pattern? Easy, match the pattern and invert it.
index <- grep('^partner', names(data), invert = TRUE)
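One thing to keep in mind with invert = TRUE is that it keeps everything that does not match, including unrelated names. A small sketch with made-up column names (col_names is an assumption, standing in for names(data)):

```r
# Hypothetical column names standing in for names(data).
col_names <- c("quiz.1.player.chat_results",
               "partner_quiz.1.player.chat_results",
               "quiz.2.player.chat_results",
               "id")

# Everything that does not start with "partner" (this also keeps "id").
not_partner <- grep("^partner", col_names, invert = TRUE)

# Anchored pattern with escaped dots: only the un-prefixed quiz columns.
quiz_only <- grep("^quiz\\.[1-5]\\.player\\.chat_results$", col_names)

not_partner
# [1] 1 3 4
quiz_only
# [1] 1 3
```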

Another approach: use str_detect from stringr
> library(stringr)
> string <- c("quiz.1.player.chat_results", "partner_quiz.1.player.chat_results")
> str_detect(string, "partner", negate=TRUE)
[1] TRUE FALSE
You can even use one grepl and negate the result
> !grepl("partner", string)
[1] TRUE FALSE
Just for fun: you can split the string using \\. or _ as the separator, compare each element of the resulting list to partner, and finally invert the result
> sapply(strsplit(string, "\\.|_"), function(x) !"partner" %in% x)
[1] TRUE FALSE

We can use two grepl to avoid any confusion
grepl('quiz', names(data)) & !grepl('partner', names(data))
#[1] TRUE FALSE

For someone who is a bit regex-blind like myself, sub can help,
sub('_.*', '', x) != 'partner'
#[1] TRUE FALSE

If you want to match the pattern including the digits, you could use a word boundary \b followed by a negative lookahead (?!partner) to assert that what is directly to the right is not partner.
Note that the dot must be escaped to match it literally, and you can omit {1}. If you do not need the value of the capture group around quiz, you can omit the group as well.
To match the rest of the string, you can use \S* to match zero or more non-whitespace characters.
\b(?!partner)quiz\.[1-5]\.player\S*
For example
regmatches(txt, regexpr("\\b(?!partner)quiz\\.[1-5]\\.player\\S*", txt, perl=TRUE))
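A self-contained version of that call, run on the two strings from the question, might look like this. It is only a sketch; the variable names txt, pattern, hits and matched are mine:

```r
txt <- c("quiz.1.player.chat_results", "partner_quiz.1.player.chat_results")

# Word boundary, negative lookahead, then the escaped quiz pattern.
pattern <- "\\b(?!partner)quiz\\.[1-5]\\.player\\S*"

# Lookaheads are PCRE syntax, so perl = TRUE is required.
hits <- grepl(pattern, txt, perl = TRUE)
matched <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))

hits
# [1]  TRUE FALSE
matched
# [1] "quiz.1.player.chat_results"
```

In the second string, "quiz" follows an underscore, which counts as a word character, so there is no word boundary in front of it and the pattern cannot start there.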

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How to achieve this in a single call of str_subset ? What is the appropriate syntax of the pattern "not age". As you can see I am not very expert with regex. Thanks for any comments.

In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
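As for doing it in a single call of str_subset: recent stringr versions (1.4.0 and later, as far as I know) accept a negate argument. A sketch that falls back to base R when stringr is not installed; the requireNamespace guard is just my own precaution:

```r
x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

# Base R: keep the values that do NOT match "age".
not_age_base <- grep("age", x, invert = TRUE, value = TRUE)

# stringr equivalent in one call; assumes a version that supports `negate`.
if (requireNamespace("stringr", quietly = TRUE)) {
  not_age_stringr <- stringr::str_subset(x, "age", negate = TRUE)
  stopifnot(identical(not_age_stringr, not_age_base))
}

not_age_base
# [1] "n_aitd" "n_sle"  "n_poly" "n_sero"
```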

Negating a string while matching others [duplicate]

I would like to match some strings using regex while negating others in R. In the example below, I would like to exclude subsections of strings that I would otherwise like to match. The example uses the answer from "Regular expression to match a line that doesn't contain a word?".
My confusion is that when I try this, grepl throws an error:
Error in grepl(mypattern, mystring) :
invalid regular expression 'boardgames|(^((?!games).)*$)', reason 'Invalid regexp'
mypattern <- "boardgames|(^((?!games).)*$)"
mystring <- c("boardgames", "boardgames", "games")
grepl(mypattern, mystring)
Note running using str_detect returns desired results (i.e. T, T, F), but I would like to use grepl.

We need perl = TRUE as the default option is perl = FALSE
grepl(mypattern, mystring, perl = TRUE)
#[1] TRUE TRUE FALSE
This is needed whenever the pattern uses Perl-compatible syntax such as the (?!...) lookahead, which the default regex engine does not support.
According to ?regexp
The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and
strsplit switches to the PCRE library that implements regular
expression pattern matching using the same syntax and semantics as
Perl 5.x, with just a few differences.
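To see both behaviours side by side, here is a small sketch that traps the error from the default engine and then reruns with perl = TRUE (default_ok and perl_result are names I made up):

```r
mypattern <- "boardgames|(^((?!games).)*$)"
mystring <- c("boardgames", "boardgames", "games")

# Default engine: the lookahead is not valid syntax, so grepl() signals an error.
default_ok <- suppressWarnings(
  tryCatch({ grepl(mypattern, mystring); TRUE }, error = function(e) FALSE)
)

# PCRE engine: the same pattern compiles and matches as intended.
perl_result <- grepl(mypattern, mystring, perl = TRUE)

default_ok
# [1] FALSE
perl_result
# [1]  TRUE  TRUE FALSE
```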
