R grepl - Matching Pattern to String

I am using grepl() in R to match patterns to a string.
I need to match multiple strings to a common string and return TRUE if they all match.
For example:
a <- 'DEARBORN TRUCK INCDBA'
b <- 'DEARBORN TRUCK INC DBA'
I want to see if all words in variable b are also in variable a.
I can't just use grepl(b, a) because the patterns (spaces) aren't the same.
It seems like it should be something like this:
grepl('DEARBORN&TRUCK&INC&DBA', a)
or
grepl('DEARBORN+TRUCK+INC+DBA', a)
but neither works. I need to compare each individual word in b to a. In this case, since all the words exist in a, it should return TRUE.
Thanks!

Use strsplit to split b into words, then use sapply to run grepl on each word. The result is a logical vector; if all of its elements are TRUE, return TRUE:
all(sapply(strsplit(b, " ")[[1]], grepl, a))
giving:
[1] TRUE
Note: If you are only looking to determine if a and b are the same aside from spaces then remove the spaces from both and compare what is left:
gsub(" ", "", a) == gsub(" ", "", b)
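One caveat with the approach above: each word is passed to grepl as a regular expression, so a word containing characters such as . or ( could match unexpectedly. A minimal sketch, assuming literal matching is what is wanted, passes fixed = TRUE and uses vapply so the result is always a logical vector (the variable names words and all_found are only for illustration):

```r
a <- "DEARBORN TRUCK INCDBA"
b <- "DEARBORN TRUCK INC DBA"

# split b on single spaces into its individual words
words <- strsplit(b, " ", fixed = TRUE)[[1]]

# check each word as a literal substring of a (no regex interpretation)
found <- vapply(words, function(w) grepl(w, a, fixed = TRUE), logical(1))

# TRUE only if every word of b occurs somewhere in a
all_found <- all(found)
all_found
# [1] TRUE
```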

Related

extract substring in R

Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contains only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?

Using base R, try the following (s is the input string):
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"

We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

R: Drop all not matching letters of string vector

I have a string vector
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
I want to remove all characters from this vector that do not
match "a-z", "A-Z" and "'"
I tried to use
gsub("![a-zA-z']", "", d)
but that doesn't work.

We can make the pattern tighter still by doing a case-insensitive substitution:
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
gsub("[^a-z]", "", d, ignore.case=TRUE)
[1] "sladfjrn" "assadfn" "Hlbasdflk"

We can use ^ inside the square brackets to match any character except those listed within the brackets
gsub("[^a-zA-Z]", "", d)
#[1] "sladfjrn" "assadfn" "Hlbasdflk"
data
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
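The question also asked to keep the apostrophe, which the patterns above remove. A small sketch of the same negated character class with ' added to it; the input vector here is made up for illustration and uses ASCII only, so the result does not depend on the locale:

```r
# hypothetical input containing apostrophes, digits and a space
d2 <- c("sladfj0923rn2", "it's 42", "O'Neil99")

# drop everything that is not an ASCII letter or an apostrophe
cleaned <- gsub("[^a-zA-Z']", "", d2)
cleaned
# [1] "sladfjrn" "it's"     "O'Neil"
```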

Search every n characters in a string for a pattern

Let's say I have the string ABCC321BB321A. I want to search for a pattern that consists of ABC...321, where ... can be any character(s). However, I want to only return results in which characters in the substring can be grouped into sets of 3.
E.g., I don't want ABCC321 (ABC - C32 - 1), but I do want ABCC321BB321 (ABC - C32 - 1BB - 321).
How would I do this in R? Is it possible to achieve using regular expressions? I guess I could possibly split the string up into a list containing groups of 3 or use conditionals to only return matches that are divisible by 3 to get the answer I want, but I'm assuming there's a more efficient method.

Try this:
x <- "ABCC321BB321A"
threes <- regmatches(x, gregexpr(".{3}", x))[[1]]
threes
paste(threes, collapse = "-")
which produces:
[1] "ABC" "C32" "1BB" "321"
and
[1] "ABC-C32-1BB-321"
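The splitting approach can also be expressed as a single regular expression. A possible sketch, assuming the match must start at ABC, end at 321, and have only whole groups of three characters in between: the middle is written as a repeated .{3} group. The lazy quantifier *? returns the shortest such match; a greedy * would return the longest one instead. perl = TRUE is used here so that the non-capturing group (?: ) is available.

```r
x <- "ABCC321BB321A"

# ABC, then any number of 3-character groups (as few as possible), then 321
m <- regmatches(x, regexpr("ABC(?:.{3})*?321", x, perl = TRUE))
m
# [1] "ABCC321BB321"

# the matched length is a multiple of 3 by construction
nchar(m) %% 3 == 0
# [1] TRUE
```

Note that ABCC321 is skipped because 321 does not start on a group boundary there.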

Exact match with grepl R

I'm trying to extract certain records from a dataframe with grepl.
This is based on the comparison between two columns, Results and Names. This variable is built like "WordNumber", but for the same word I have multiple numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- data.frame(name = "Word1")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE, then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?

In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is an anchor that matches a word boundary, i.e. the position just before or just after a word
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the beginning of a word, and "\>" matches the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
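Since the question builds the pattern by concatenation, here is a sketch of how the word-boundary anchors could be wrapped around several names joined with |. The vector names_wanted is hypothetical; the Results values come from the question.

```r
names_wanted <- c("Word1", "Word15")   # hypothetical names to match exactly
Results <- c("Word1", "Word11", "Word12", "Word15")

# group the alternatives so that \\b applies to every name, not just the first and last
pattern <- paste0("\\b(", paste(names_wanted, collapse = "|"), ")\\b")

kept <- Results[grepl(pattern, Results)]
kept
# [1] "Word1"  "Word15"
```

Without the parentheses, the pattern would read as "\\bWord1" or "Word15\\b", which lets Word11 and Word12 through again.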

Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')

I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]

Extract first X Numbers from Text Field using Regex

I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!

You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"

You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only get the first match. .? matches any single character and makes it optional (a bare . would not match the case without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference them too, e.g. with '\\2'. Inside the group (you had this part of the syntax correct) I want only digits, and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
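As an alternative to substitution, the match can be extracted directly. A minimal base R sketch, assuming every element contains at least one run of four digits (regmatches silently drops elements with no match, so the result can be shorter than the input otherwise):

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() locates the first run of exactly four digits in each element;
# regmatches() pulls out the matched text
first4 <- regmatches(x, regexpr("[0-9]{4}", x))
first4
# [1] "2134" "0983" "8723"
```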

This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
# this will get you the positions of all digit matches in each x
matches <- gregexpr("\\d", x)
# this keeps only the positions of the first four digits
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)

Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
