Check if string contains characters other than alphabetical character - r

I want a function to return TRUE if a string contains only letters, and FALSE otherwise.
I had a hard time finding a solution for this problem using R even though there are many answer pages for other languages.

We can use grepl. The pattern matches one or more letters ([A-Za-z]+) from the start (^) to the end ($) of the string.
grepl('^[A-Za-z]+$', str1)
#[1] TRUE FALSE
data
str1 <- c('Azda', 'A123Zda')
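Since the question asks for a function, the same pattern can be wrapped in a small helper. This is only a sketch: the name is_alpha_only is made up for illustration, and it assumes "letters" means the ASCII letters A-Z and a-z.

```r
# is_alpha_only is a hypothetical helper name, not part of base R.
is_alpha_only <- function(x) {
  # TRUE only when every character is an ASCII letter;
  # the + requires at least one letter, so "" gives FALSE
  grepl("^[A-Za-z]+$", x)
}

res <- is_alpha_only(c("Azda", "A123Zda", ""))
res
# [1]  TRUE FALSE FALSE
```

If accented letters should also count, [[:alpha:]] may be a better fit than [A-Za-z], although its behaviour depends on the locale.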

Related

R regex / grep / grepl for letters followed by a dash and numbers

I'm trying to find the right grep notation to identify strings that have this pattern: Any number of letters followed by a dash (-) followed by any number of numbers. For example ABC-123 would be a fit while 123-ABC or A1-B2 would not.
I've tried grepl('[[A:Za:z]]\\-[[0:9]]','ABC-123') but am not getting the correct results.
Any suggestions?
We can change the range separator from : to - and use a single [ instead of [[. In the pattern, we also specify ^ and $ for the start and end of the string, respectively. The + after the letters and the digits specifies one or more ...
grepl("^[A-Za-z]+-[0-9]+$", str1)
#[1] TRUE FALSE FALSE
Or if we want to use [[,
grepl("^[[:alpha:]]+-\\d+$", str1)
#[1] TRUE FALSE FALSE
data
str1 <- c("ABC-123", "123-ABC", "A1-B2")
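To illustrate why the ^ and $ anchors matter, here is a small comparison on made-up inputs (the vector x is an assumption for this sketch, not data from the question). Without anchors, grepl is satisfied by any matching substring.

```r
x <- c("ABC-123", "12ABC-123", "ABC-123x")

# Unanchored: a matching substring anywhere in the string is enough
unanchored <- grepl("[A-Za-z]+-[0-9]+", x)
unanchored
# [1] TRUE TRUE TRUE

# Anchored: the whole string must be letters, a dash, then digits
anchored <- grepl("^[A-Za-z]+-[0-9]+$", x)
anchored
# [1]  TRUE FALSE FALSE
```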

Match pattern so long as it doesn't contain a specific string

Let's say I have the following strings:
quiz.1.player.chat_results
and
partner_quiz.1.player.chat_results
I have hundreds of strings like this where the only difference is that one is prefixed with "partner" and the other is not. I'm trying to match one but not the other.
The specific pattern I'd like to match looks like so:
index <- grep('^(quiz.)[1-5]{1}.player.chat_results', names(data))
But this will match both strings. I'm guessing I have to use some negative lookahead like so:
^((?!partner).)
But I'm not sure where to use it.
I'll answer your title question, as it will be the most useful to other people finding this question.
How to match strings that do not contain a given pattern? Easy, match the pattern and invert it.
index <- grep('^partner', names(data), invert = TRUE)
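As a minimal sketch of what invert = TRUE returns, assume the column names are the two strings from the question (col_names is a hypothetical stand-in for names(data)):

```r
col_names <- c("quiz.1.player.chat_results",
               "partner_quiz.1.player.chat_results")

# Indices of the elements that do NOT start with "partner"
idx <- grep("^partner", col_names, invert = TRUE)
idx
# [1] 1

# value = TRUE returns the strings themselves instead of the indices
kept <- grep("^partner", col_names, invert = TRUE, value = TRUE)
kept
# [1] "quiz.1.player.chat_results"
```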
Another approach: use str_detect from stringr
> library(stringr)
> str_detect(string, "partner", negate=TRUE)
[1] TRUE FALSE
You can even use one grepl and negate the result
> !grepl("partner", string)
[1] TRUE FALSE
Just for fun: you can split each string on \\. or _, check whether partner is among the resulting pieces, and invert the result
> sapply(strsplit(string, "\\.|_"), function(x) !"partner" %in% x)
[1] TRUE FALSE
We can use two grepl to avoid any confusion
grepl('quiz', names(data)) & !grepl('partner', names(data))
#[1] TRUE FALSE
For someone who is a bit regex-blind like myself, sub can help,
sub('_.*', '', x) != 'partner'
#[1] TRUE FALSE
If you want to match the pattern including the digits, you could use a word boundary \b followed by a negative lookahead (?!partner) to assert that what is directly to the right is not partner.
Note that the dot must be escaped to match it literally, and you can omit {1}. If you are not using the value of the capturing group around quiz, you can omit the group as well.
To match the rest of the string, you can use \S* to match zero or more non-whitespace characters.
\b(?!partner)quiz\.[1-5]\.player\S*
For example
regmatches(txt, regexpr("\\b(?!partner)quiz\\.[1-5]\\.player\\S*", txt, perl = TRUE))
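A self-contained version of this approach might look as follows, assuming txt holds the two strings from the question. In this particular data the word boundary already does the excluding, because _ counts as a word character and so there is no boundary between partner_ and quiz.

```r
txt <- c("quiz.1.player.chat_results",
         "partner_quiz.1.player.chat_results")

# Lookaheads need the PCRE engine, hence perl = TRUE
pat <- "\\b(?!partner)quiz\\.[1-5]\\.player\\S*"

hits <- grepl(pat, txt, perl = TRUE)
hits
# [1]  TRUE FALSE

# regmatches() returns only the matched text of the elements that matched
matched <- regmatches(txt, regexpr(pat, txt, perl = TRUE))
matched
# [1] "quiz.1.player.chat_results"
```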

Negative Lookahead Invalidated by extra numbers in string

I am trying to write a regular expression in R that matches a certain string up to the point where a . occurs. I thought a negative lookahead might be the answer, but I am getting some false positives.
So in the following 9-item vector
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
The grep
grep("mcq_q[0-9]+(?!\\.)", vec, perl = T)
does its job for the first six elements in the vector, matching "mcq_q11" but not "mcq_q2.factor". Unfortunately though it does match the last 3 elements, when there are two numbers following the second q. Why does that second number kill off my negative lookahead?
I think you want your negative lookahead to scan the entire string first, ensuring it sees no "dot":
(?!.*\.)mcq_q[0-9]+
https://regex101.com/r/f5XxR2/2/
If you want to match up to a dot, you can use this:
mcq_q[0-9]+(?![\d\.])
Sample source:
vec <- c("mcq_q9", "mcq_q10", "mcq_q11", "mcq_q12", "mcq_q1.factor", "mcq_q2.factor", "mcq_q10.factor", "mcq_q11.factor", "mcq_q12.factor")
grep("mcq_q[0-9]+(?![\\d\\.])", vec, perl = T)
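To see why the original pattern gives false positives, it can help to extract what it actually matched. The sketch below uses a shortened version of the vector from the question. When the lookahead fails after "10", the engine backtracks so that [0-9]+ matches only "1"; the next character is then "0", not a dot, so the lookahead succeeds.

```r
vec <- c("mcq_q10", "mcq_q2.factor", "mcq_q10.factor")

# Original pattern: inspect the substring it matched
m <- regexpr("mcq_q[0-9]+(?!\\.)", vec, perl = TRUE)
original_hits <- regmatches(vec, m)
original_hits
# [1] "mcq_q10" "mcq_q1"

# Also forbidding a digit after the match prevents giving back a digit
fixed <- grepl("mcq_q[0-9]+(?![\\d.])", vec, perl = TRUE)
fixed
# [1]  TRUE FALSE FALSE
```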
We can also do this without any lookaround: after the digits ([0-9]+), match zero or more characters that are not a . up to the end of the string ($)
grep("mcq_q[0-9]+[^.]*$", vec, value = TRUE)
#[1] "mcq_q9" "mcq_q10" "mcq_q11" "mcq_q12"
A negative lookahead is tricky here, as explained in a comment. But you don't need it
/mcq_q[0-9]+(?:$|[^.0-9])/
This requires that the run of digits is followed by either the end of the string or a character that is neither a dot nor a digit. So it will allow mcq_q12a etc. If your permissible strings may only end in numbers, remove |[^...]; the non-capturing group (?:...) is then not needed either, which leaves /mcq_q[0-9]+$/
Tested only in Perl as the question was tagged with it. It should be the same for your example in R.

Check if a string contain digits and dash

I'm trying to find the right regex to grepl whether a string contains only digits [0-9] and the special character "-".
For example:
str1="00-25" #TRUE
str2="0a-2" #FALSE
I have tried
grepl("[^[:digit:]|-]",str2)
#[1] TRUE
thoughts?
You want to check whether the string contains only digits and -.
To define the set of allowed characters, use a bracket expression "[]":
[0-9-]
Now you want every character of the string to belong to this set. In other words, the set has to cover the string from the start (^) to the end ($):
^[0-9-]$
Finally, the string contains one or more characters, so add a "+":
grepl("^[0-9-]+$", str1)
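As a quick check on the two strings from the question, here is a minimal sketch. It also shows a variant close to the original attempt: a negated class [^0-9-] finds a disallowed character, so its result has to be inverted. Note that the inverted form also accepts an empty string, which the anchored form does not.

```r
x <- c("00-25", "0a-2")

# Every character must be a digit or a dash, and there must be at least one
only_digits_dash <- grepl("^[0-9-]+$", x)
only_digits_dash
# [1]  TRUE FALSE

# Alternative: look for any character outside the set, then negate
no_other_chars <- !grepl("[^0-9-]", x)
no_other_chars
# [1]  TRUE FALSE
```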

How to perl regex match in R in the grepl function?

I have a function in R which uses the grepl command as follows:
function(x) grepl('\bx\b',res$label, perl=T)
This doesn't seem to work. The 'x' input is a character string (a sentence), and I'd like to put word boundaries around 'x' when matching, because I don't want the term to pull out similar terms from the table I am searching.
Any suggestions?
You just need to properly escape the backslash in your regex
ff<-function(x) grepl('\\bx\\b',x, perl=T)
ff(c("axa","a x a", "xa", "ax","x"))
# [1] FALSE TRUE FALSE FALSE TRUE
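Note that '\\bx\\b' matches the literal letter x. If the intent is to match the value passed in as x, one possible sketch is to build the pattern with paste0. The names match_whole and labels are hypothetical (labels stands in for res$label), and the sketch assumes x contains no regex metacharacters that would need escaping.

```r
# match_whole is a hypothetical helper; labels stands in for res$label
match_whole <- function(x, labels) {
  # Wrap the search term in word boundaries so it only matches whole words
  grepl(paste0("\\b", x, "\\b"), labels, perl = TRUE)
}

labels <- c("heart rate", "heart rate variability",
            "resting heart rate", "heartrate")
found <- match_whole("heart rate", labels)
found
# [1]  TRUE  TRUE  TRUE FALSE
```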
If you just want to know whether a string is a sentence rather than a single word, you could use: function(x) grepl('\\s',x)
