Match pattern so long as it doesn't contain a specific string - r

Let's say I have the following strings:
quiz.1.player.chat_results
and
partner_quiz.1.player.chat_results
I have hundreds of strings like this where the only difference is that one is prefixed with "partner" and the other is not. I'm trying to match one but not the other.
The specific pattern I'd like to match looks like so:
index <- grep('^(quiz.)[1-5]{1}.player.chat_results', names(data))
But this will match both strings. I'm guessing I have to use some negative lookahead like so:
^((?!partner).)
But I'm not sure where to use it.

I'll answer your title question, as it will be the most useful to other people finding this question.
How to match strings that do not contain a given pattern? Easy, match the pattern and invert it.
index <- grep('^partner', names(data), invert = TRUE)

Another approach: use str_detect from stringr
> library(stringr)
> str_detect(string, "partner", negate=TRUE)
[1] TRUE FALSE
You can even use one grepl and negate the result
> !grepl("partner", string)
[1] TRUE FALSE
Just for fun: you can split the string using as separator \\. or _ and then iterate over each element of the resulting list comparing each element to partner and finally invert the result
> sapply(strsplit(string, "\\.|_"), function(x) !"partner" %in% x)
[1] TRUE FALSE

We can use two grepl to avoid any confusion
grepl('quiz', names(data)) & !grepl('partner', names(data))
#[1] TRUE FALSE

For someone who is a bit regex-blind like myself, sub can help,
sub('_.*', '', x) != 'partner'
#[1] TRUE FALSE

If you want to match the pattern including the digits, you could use a word boundary \b followed by a negative lookahead (?!partner) to assert what is directly on the right is not partner.
Note that you need to escape the dot to match it literally, and you can omit {1}. If you are not using the value of the capture group around quiz, you can omit that as well.
To match the rest of the string, you can use \S* to match zero or more non-whitespace characters.
\b(?!partner)quiz\.[1-5]\.player\S*
For example
regmatches(txt, regexpr("\\b(?!partner)quiz\\.[1-5]\\.player\\S*", txt, perl = TRUE))
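To get the index the question asks for, the same pattern can be passed to grep. This is only a sketch: nms is a made-up vector standing in for names(data), and the third element is added just to show that the [1-5] class still applies.

```r
# Hypothetical column names standing in for names(data)
nms <- c("quiz.1.player.chat_results",
         "partner_quiz.1.player.chat_results",
         "quiz.7.player.chat_results")

pattern <- "\\b(?!partner)quiz\\.[1-5]\\.player\\S*"

# Lookaheads are only supported by the PCRE engine, hence perl = TRUE
index <- grep(pattern, nms, perl = TRUE)
matched <- nms[index]
print(index)
print(matched)
```

Without perl = TRUE the default regex engine rejects the (?!...) syntax, so the flag is not optional here.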

Related

Replace Exact String in R using Variables [duplicate]

I'm trying to extract certain records from a dataframe with grepl.
This is based on the comparison between two columns, Results and Names. This variable is built like "WordNumber", but for the same word I have multiple numbers (more than 30), so when I use the grepl expression to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- data.frame(name = c("Word1"))
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is a word-boundary anchor: it matches at the start or the end of a word
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" is an escape sequence for the beginning of a word, and "\>" is used for the end
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
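If you do want to keep the regex route with several names, one way is to wrap the alternation in a group before anchoring it. A minimal sketch, assuming Names is a plain character vector; the second name and the Records values are made up for illustration:

```r
# Hypothetical data mirroring the question, with a second name added
Names <- c("Word1", "Word15")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("wanted", "notThis", "notThis", "alsoWanted")
Relationships <- data.frame(Results, Records)

# Group the alternatives so ^ and $ apply to each of them,
# not just to the first and last one
pattern <- paste0("^(", paste(Names, collapse = "|"), ")$")
by_regex <- Relationships[grepl(pattern, Relationships$Results), ]

# The %in% version gives the same rows without any regex
by_in <- Relationships[Relationships$Results %in% Names, ]
print(by_regex)
```

Without the parentheses, "^Word1|Word15$" would anchor only the first alternative at the start and only the last one at the end.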

Partial Match word from sentence in R

I am looking to partially match a string using the %in% operator in R. When I run the code below I get FALSE:
'I just want to partial match string' %in% 'partial'
FALSE
Expected Output is TRUE in above case (because it is matched partially)
Since you want to match partially from a sentence you should try using %like% from data.table, check below
library(data.table)
'I just want to partial match string' %like% 'partial'
TRUE
The output is TRUE
`%in_str%` <- function(pattern,s){
grepl(pattern, s)
}
Usage:
> 'a' %in_str% 'abc'
[1] TRUE
You need to strsplit the string so each word in it is its own element in a vector:
"partial" %in% unlist(strsplit('I just want to partial match string'," "))
[1] TRUE
strsplit takes a string and breaks it into a vector of shorter strings. In this case, it breaks on the space (that's the " " at the end), so that you get a vector of individual words. Unfortunately, strsplit returns its results as a list, which is why I wrapped it in an unlist, so we get a single vector.
Then we do the %in%, which works in the opposite direction from the one you used: you're trying to find out if string partial is %in% the sentence, not the other way around.
Of course, this is an annoying way of doing it, so it's probably better to go with a grep-based solution if you want to stay within base-R, or Priyanka's data.table solution above -- both of which will also be better at stuff like matching multiple-word strings.
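For reference, the grep-based base-R version mentioned above could look like the sketch below. The variable names are only illustrative.

```r
sentence <- "I just want to partial match string"

# fixed = TRUE treats "partial" as a literal substring, not as a regex
has_partial <- grepl("partial", sentence, fixed = TRUE)

# Word boundaries restrict the match to whole words:
# "part" occurs inside "partial" but is not a word on its own here
has_part_word <- grepl("\\bpart\\b", sentence)

print(has_partial)
print(has_part_word)
```

The substring search also handles multi-word strings such as "partial match", which the strsplit approach cannot.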

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How to achieve this in a single call of str_subset ? What is the appropriate syntax of the pattern "not age". As you can see I am not very expert with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
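As for doing it in a single str_subset call: assuming stringr 1.4.0 or later is installed, str_subset takes a negate argument, so something like the sketch below should work. The base R fallback is only there so the snippet also runs without stringr.

```r
x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

if (requireNamespace("stringr", quietly = TRUE)) {
  # negate = TRUE keeps the elements that do NOT match the pattern
  # (assumes stringr >= 1.4.0)
  x_notage <- stringr::str_subset(x, "age", negate = TRUE)
} else {
  # Base R fallback with the same result
  x_notage <- x[!grepl("age", x)]
}
print(x_notage)
```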

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding the pattern. It returns the indices of the vector elements that match a pattern. If value=TRUE is specified, it returns the values instead. From the description, it seems that you want to remove a substring instead of returning a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we match a pattern, i.e. '.*\\.'. It matches zero or more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . so that it is treated as that symbol instead of any character. This will match until the last . character in the string. We replace that matched pattern with '' as the replacement argument and thereby remove the substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regexs ?regex. The wikipedia page is a decent read as well but note that R's regexs are a little different than some others. https://en.wikipedia.org/wiki/Regular_expression
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This matches the run of non-dot characters at the end of each string.
Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep will give you the complete element that matches your regular expression and not only the matching part of the element. For example, in the example below, grep will return the first element because it matches "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"
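If you would rather stay in base R, regmatches together with regexpr returns only the matched part, much like str_extract. A small sketch using the vector from the question and a simplified character class:

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# regexpr locates the first match in each element;
# regmatches then pulls out just the matched text
m <- regexpr("[A-Za-z]+$", text)
last_part <- regmatches(text, m)
print(last_part)
```

Keep in mind that regmatches silently drops elements with no match, so the result can be shorter than the input.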

How to perl regex match in R in the grepl function?

I have a function in R which uses the grepl command as follows:
function(x) grepl('\bx\b',res$label, perl=T)
This doesn't seem to work - the 'x' input is a character type string (a sentence), and i'd like to create word boundaries around the 'x' as I match, as I don't want the term to pull out other terms in the table I am searching through which contains some similar terms.
Any suggestions?
You just need to properly escape the backslash in your regex
ff<-function(x) grepl('\\bx\\b',x, perl=T)
ff(c("axa","a x a", "xa", "ax","x"))
# [1] FALSE TRUE FALSE FALSE TRUE
If you just want to know whether string is a sentence, not single word, you could use: function(x) grepl('\\s',x)
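Note that the x inside the quotes is a literal letter, not the function argument. If the goal is to search for the value passed in, the pattern has to be built with paste0. A sketch under that assumption; match_word and the labels vector are made up and stand in for the question's function and res$label:

```r
# Hypothetical helper: wrap the search term in word boundaries
match_word <- function(term, labels) {
  # Assumes term contains no regex metacharacters; those would need escaping
  pattern <- paste0("\\b", term, "\\b")
  grepl(pattern, labels, perl = TRUE)
}

labels <- c("heart", "heart rate", "heartburn")
hits_word <- match_word("heart", labels)
hits_phrase <- match_word("heart rate", labels)
print(hits_word)
print(hits_phrase)
```

"heartburn" is not matched for "heart" because there is no word boundary between the t and the b.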

Resources