Wildcard to match string in R - r

This might sound quite silly but it's driving me nuts.
I have a matrix of alphanumeric values, and I'm struggling to test whether some of its elements match on only their first and last letters. Since I don't care about the middle character, I'm trying (without success) to use a wildcard.
As an example, consider this matrix:
m <- matrix(nrow=3,ncol=3)
m[1,]=c("NCF","NBB","FGF")
m[2,]=c("MCF","N2B","CCD")
m[3,]=c("A3B","N4F","MCP")
I want to evaluate if m[2,2] starts with "N" and ends with "B", regardless of the 2nd letter in the string. I've tried something like
grep("N.B",m)
and it works, but still I want to know if there is a more compact way of doing it, like:
m[2,2] == "N.B"
which obviously didn't work!
Thanks

You can use grepl on the subsetted element of m, anchoring the pattern with ^ and $ so that it must match the whole string:
grepl("^N.B$", m[2,2])
#[1] TRUE
or use startsWith and endsWith:
startsWith(m[2,2], "N") & endsWith(m[2,2], "B")
#[1] TRUE
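If the goal is to test every element at once, or to write the pattern in wildcard style, the following sketch may help. It rebuilds the matrix from the question; `pattern` and `hits` are names introduced here for illustration, and `glob2rx()` (from the utils package) converts a shell-style wildcard such as `N?B` into the anchored regex used above.

```r
# Matrix from the question, filled row by row
m <- matrix(c("NCF", "NBB", "FGF",
              "MCF", "N2B", "CCD",
              "A3B", "N4F", "MCP"), nrow = 3, byrow = TRUE)

# Translate the glob "N?B" (? = exactly one character) into a regex
pattern <- glob2rx("N?B")   # "^N.B$"

# grepl() returns a plain logical vector, so restore the matrix shape
hits <- matrix(grepl(pattern, m), nrow = nrow(m))

hits[2, 2]                    # TRUE
which(hits, arr.ind = TRUE)   # row/column positions of all matching elements
```

Keeping the result as a logical matrix means it can be used directly for indexing, e.g. `m[hits]`.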

Related

Replace characters only if it is not repeating

Is there a way to replace a character only if it is not repeating, or repeating a certain number of times?
str = c("ddaabb", "daabb", "aaddbb", "aadbb")
gsub("d{1}", "c", str)
[1] "ccaabb" "caabb" "aaccbb" "aacbb"
#Expected output
[1] "ddaabb" "caabb" "aaddbb" "aacbb"
You can use negative lookarounds in your regex to exclude cases where d is preceded or followed by another d:
gsub("(?<!d)d(?!d)", "c", str, perl=TRUE)
Edit: adding perl=TRUE as suggested by OP. For more info about regex engine in R see this question
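For the "repeating a certain number of times" variant, one possible extension of the same lookaround idea is to put a quantifier on the middle d. This is only a sketch: the run length of 2 and the replacement "cc" are assumptions chosen for illustration, and `one` and `two` are names introduced here.

```r
str <- c("ddaabb", "daabb", "aaddbb", "aadbb")

# Single, non-repeated "d" only (the pattern from the answer above)
one <- gsub("(?<!d)d(?!d)", "c", str, perl = TRUE)
one
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"

# Runs of exactly two "d"s; shorter or longer runs are left alone
two <- gsub("(?<!d)d{2}(?!d)", "cc", str, perl = TRUE)
two
# [1] "ccaabb" "daabb"  "aaccbb" "aadbb"
```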
Now that you've added "or repeating a specified number of times," the regex-based approaches may get messy. Thus I submit my wacky code from a previous comment.
foo <- unlist(strsplit(str, ''))
bar <- rle(foo)
and then look for instances of bar$lengths == desired_length and use the returned indices to locate (by summing all bar$lengths[1:k] ) the position in the original sequence. If you only want to replace a specific character, check the corresponding value of bar$values[k] and selectively replace as desired.
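The rle idea can be sketched per string, which avoids runs spilling over from one element of the vector into the next. The helper name `replace_exact_runs` and its arguments are hypothetical, introduced only for this example; it swaps every character of a run of `ch` whose length is exactly `n`.

```r
# Hypothetical helper: replace runs of `ch` that are exactly `n` long
replace_exact_runs <- function(x, ch, replacement, n = 1) {
  vapply(x, function(s) {
    runs <- rle(strsplit(s, "")[[1]])            # run-length encode the characters
    hit  <- runs$values == ch & runs$lengths == n
    runs$values[hit] <- replacement              # run length is kept, character is swapped
    paste(inverse.rle(runs), collapse = "")      # decode back into a single string
  }, character(1), USE.NAMES = FALSE)
}

str <- c("ddaabb", "daabb", "aaddbb", "aadbb")
replace_exact_runs(str, "d", "c", n = 1)
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"
```

Changing `n` handles the "certain number of times" case without touching the regex.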

Finding the rhyming words in a corpus with R, regex

I have the following corpus:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
Out of all of these words I only need words like "helter-skelter", "lovey-dovey" and "chock-a-block", which are rhyming reduplicatives with a change of consonants. They are usually spelled with a hyphen, and may have a medial component in between the elements, such as "a" in "chock-a-block". I only need to find rhyming reduplicative expressions that have the same number of syllables. For example, although "phoney-baloney" is a rhyming reduplicative, I do not need it.
I was using the following code to find rhyming reduplicatives:
rhyme<- grep("\\b(\\w*)(\\w{2,}?)-(\\w{1,}?-)?\\w*\\2\\b", corpus_rhyme,
ignore.case = TRUE, perl = TRUE, value = TRUE)
This code produces too many false positives. The output is as follows:
rhyme
[1] "helter-skelter" "lovey-dovey" "riff-raff" "hunter-gatherer" "day-to-day" "second-hand"
[7] "chock-a-block"
I was sifting out these false positives manually, which takes too much time. Can anyone advise a better version of this line of code to reduce the number of false positives? For example, "riff-raff" comes up because I want at least 2 last letters to be the same at the end of a reduplicative expression, otherwise I will miss expressions like "rat-tat". But can we specify in this code that these two letters have to be different from each other, so that "rat-tat" is found ("a" is different from "t"), but "riff-raff" ("f" and "f" are the same) does not come up?
Another possible improvement: how can I get rid of words like "day-to-day" where the two elements are exactly the same? I only need rhyming reduplicatives that have a difference in the initial consonants.
Finally, I am not sure if anything can be done about "hunter-gatherer", unless there is a way to calculate the number of syllables and make sure that both elements of the expression have the same number of syllables.
Base R solution, there is most probably an easier way:
# Count the number of hyphens in each element: hyphen_count => integer vector
hyphen_count <- lengths(regmatches(corpus_rhyme, gregexpr("\\-", corpus_rhyme)))
# Check if the expression rhymes: named logical vector => stdout(console)
vapply(ifelse(hyphen_count == 1,
gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme),
gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-\\w+\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme)
), function(x){
split_string <- unlist(strsplit(x, "\\s"))
identical(split_string[1], split_string[2])
}, logical(1)
)
Data:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
If it is a general phenomenon that the last three or more letters of the first word are the same as the last three or more letters of the last word, then you could try this:
grep(".*(\\w{3,})-([aeiou]-)?(\\w+)(\\1)", corpus_rhyme,ignore.case = TRUE,value=TRUE)
#[1] "helter-skelter" "lovey-dovey" "chock-a-block"
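For the two remaining concerns in the question (identical elements such as "day-to-day" and unequal syllable counts such as "hunter-gatherer"), a rough post-filter might look like the sketch below. It assumes that the number of vowel groups is an acceptable stand-in for the number of syllables, which English spelling does not guarantee; `vowel_groups`, `first`, `last` and `keep` are names introduced here. It does not test for rhyme by itself, so it is meant to be combined with one of the patterns above.

```r
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
                  "day-to-day", "second-hand", "chock-a-block")

# Rough syllable proxy (assumption): count groups of consecutive vowels, treating y as a vowel
vowel_groups <- function(w) {
  lengths(regmatches(w, gregexpr("[aeiouy]+", w, ignore.case = TRUE)))
}

# First and last hyphen-separated elements of each expression
parts <- strsplit(corpus_rhyme, "-", fixed = TRUE)
first <- vapply(parts, function(p) p[1], character(1))
last  <- vapply(parts, function(p) p[length(p)], character(1))

# Keep expressions whose elements differ but have the same vowel-group count
keep <- tolower(first) != tolower(last) & vowel_groups(first) == vowel_groups(last)
corpus_rhyme[keep]
# [1] "helter-skelter" "lovey-dovey"    "riff-raff"      "chock-a-block"
```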

Extract shortest matching string regex

Minimal Reprex
Suppose I have the string as1das2das3D. I want to extract everything from the letter a to the letter D. There are three different substrings that match this - I want the shortest / right-most match, i.e. as3D.
One solution I know to make this work is stringr::str_extract("as1das2das3D", "a[^a]+D")
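A base R sketch of why the negated class works where a lazy quantifier does not: the regex engine always prefers the leftmost starting position, so laziness only shortens the right-hand end of the match. Only base functions are used; `x`, `lazy` and `shortest` are names introduced here.

```r
x <- "as1das2das3D"

# Lazy quantifier: the match still starts at the first "a", so the whole string comes back
lazy <- regmatches(x, regexpr("a.+?D", x, perl = TRUE))
lazy
# [1] "as1das2das3D"

# Negated class: the match cannot cross another "a", so only the last candidate fits
shortest <- regmatches(x, regexpr("a[^a]+D", x))
shortest
# [1] "as3D"
```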
Real Example
Unfortunately, I can't get this to work on my real data. In my real data I have a string with (potentially) two URLs and I'm trying to extract the one that's immediately followed by rel=\"next\". So, in the below example string, I'd like to extract the URL https://abc.myshopify.com/ZifQ.
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""
# what I've tried
stringr::str_extract(foo, '(?<=\\<)https://.*(?=\\>; rel\\="next)') # wrong output
stringr::str_extract(foo, '(?<=\\<)https://(?!https)+(?=\\>; rel\\="next)') # error
You could do:
stringr::str_extract(foo,"https:[^;]+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
or even
stringr::str_extract(foo,"https(?:(?!https).)+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
Would this be an option?
Split the string on ; or , then compare each piece with the target string and take the URL at the preceding index.
urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
urls[which(urls == "rel=\"next\"") - 1]
#[1] "<https://abc.myshopify.com/ZifQ>"
Here is another option.
gsub(".+\\, <(.+)>; rel=\"next\"", "\\1", foo, perl = TRUE)
#[1] "https://abc.myshopify.com/ZifQ"

R: Why can't for loop or c() work out for grep function?

Thanks to the question "grep using a character vector with multiple patterns", I figured out my own problem as well.
The question there was how to find multiple values with the grep function,
and the solution was either this:
grep("A1| A9 | A6", text)
or
toMatch <- c("A1", "A9", "A6")
matches <- unique(grep(paste(toMatch, collapse="|"), text))
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or for loop doesn't work out instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I've written in R was something like this:
find.explore.l<-lapply(text.words.bl ,function(m) grep("^explor",m))
But then I had to 'grep' many words, so I tried out this
find.explore.l<-lapply(text.words.bl ,function(m) grep(c("A1","A2","A3"),m))
It didn't work, so I tried another one (XXX is the list of words that I'm supposed to find in the text):
for (i in XXX){
find.explore.l<-lapply(text.words.bl ,function(m) grep("XXX[i]",m))
.......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() and for loop for grep return right results?
Someone please let me know! I'm so curious :P
From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as @nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop; you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over values to match; print() is needed because results are not auto-printed inside a for loop
for (i in toMatch) {
print(grep(i, text))
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can also use lapply() for this, but in a slightly different way than you had done: apply it to each regex in the list, rather than to each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
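To see the two approaches side by side, here is a small sketch. The contents of `text` are made-up sample data (the original post does not show any), and `pattern`, `any_hit` and `per_pattern` are names introduced here. The collapsed pattern answers "does any pattern match?", while running grepl once per pattern also records which pattern matched.

```r
toMatch <- c("A1", "A9", "A6")
text <- c("A1 first", "nothing", "A6 and A9", "B2")   # hypothetical sample data

# One combined regex: indices of elements matching any of the patterns
pattern <- paste(toMatch, collapse = "|")   # "A1|A9|A6"
any_hit <- grep(pattern, text)
any_hit
# [1] 1 3

# One grepl() call per pattern: a logical matrix, rows = text, columns = patterns
per_pattern <- sapply(toMatch, grepl, x = text)
which(rowSums(per_pattern) > 0)
# [1] 1 3
```

Note that a plain pattern such as "A1" also matches inside longer tokens like "A10"; adding word boundaries (`\\b`) may be appropriate depending on the data.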
Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? The first argument of grep is interpreted as a regular expression, in which the brackets [ and ] are special characters: [i] is a character class that matches the single letter i. I put two backslashes before each bracket to tell R to treat them as literal brackets. And imagine what would happen if I put that last expression into a for loop: it wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"), m) violates the syntax grep requires: the pattern must be a single character string.

R - Using grep and gsub to return more than one match in the same (character) vector element

Imagine we want to find all of the FOOs and subsequent numbers in the string below and return them as a vector (apologies for unreadability, I wanted to make the point there is no regular pattern before and after the FOOs):
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
We can use this to find one of them (taken from 1)
gsub(".*(FOO[0-9]+).*", "\\1", xx)
[1] "FOO1643"
However, I want to return all of them, as a vector.
I've thought of a complicated way to do it using strsplit() and gregexpr() - but I feel there is a better (and easier) way.
You may be interested in regmatches:
regmatches(xx, gregexpr("FOO[0-9]+", xx))[[1]]
[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
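The `[[1]]` is there because regmatches with gregexpr returns a list with one element per input string. A short sketch with a made-up input vector (`xs` and `m` are names introduced here) shows what happens with several strings, including one that has no match:

```r
xs <- c("aFOO12bFOO7", "none here", "FOO345")   # hypothetical sample data

# One list element per input string; strings without a match give character(0)
m <- regmatches(xs, gregexpr("FOO[0-9]+", xs))

lengths(m)   # number of matches per string
# [1] 2 0 1

unlist(m)    # all matches flattened into a single character vector
# [1] "FOO12"  "FOO7"   "FOO345"
```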
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
library(stringr)
str_extract_all(xx, "(FOO[0-9]+)")[[1]]
#[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
This can take vectors of strings as well, and the matches will be returned as list elements.
Slightly shorter version.
library(gsubfn)
strapplyc(xx,"FOO[0-9]*")[[1]]

Resources