Replace characters only if it is not repeating - r

Is there a way to replace a character only if it is not repeating, or repeating a certain number of times?
str = c("ddaabb", "daabb", "aaddbb", "aadbb")
gsub("d{1}", "c", str)
[1] "ccaabb" "caabb" "aaccbb" "aacbb"
#Expected output
[1] "ddaabb" "caabb" "aaddbb" "aacbb"

You can use negative lookarounds in your regex to exclude cases where d is preceded or followed by another d:
gsub("(?<!d)d(?!d)", "c", str, perl=TRUE)
Edit: adding perl=TRUE as suggested by OP. For more info about regex engine in R see this question
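The same lookaround idea seems to extend to "repeating a certain number of times" by putting a quantifier between the two lookarounds. This is only a sketch: replace_exact_run is a hypothetical helper name, and it assumes char is a single literal character with no special meaning in a regex.

```r
# Hypothetical helper (not from the original answer): replace runs of `char`
# that are exactly `n` long, leaving shorter and longer runs untouched.
replace_exact_run <- function(x, char, n, replacement) {
  # Builds e.g. "(?<!d)d{2}(?!d)": n copies of char, not preceded or
  # followed by another copy of char.
  pattern <- sprintf("(?<!%s)%s{%d}(?!%s)", char, char, n, char)
  gsub(pattern, replacement, x, perl = TRUE)
}

str <- c("ddaabb", "daabb", "aaddbb", "aadbb")
replace_exact_run(str, "d", 1, "c")
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"
replace_exact_run(str, "d", 2, "cc")
# [1] "ccaabb" "daabb"  "aaccbb" "aadbb"
```

Lookarounds are only available with perl=TRUE, so the helper sets it unconditionally.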

Now that you've added "or repeating a specified number of times," the regex-based approaches may get messy. Thus I submit my wacky code from a previous comment.
foo <- unlist(strsplit(str, ''))
bar <- rle(foo)
and then look for instances of bar$lengths == desired_length and use the returned indices to locate (by summing all bar$lengths[1:k] ) the position in the original sequence. If you only want to replace a specific character, check the corresponding value of bar$values[k] and selectively replace as desired.
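One way the rle() idea might be written out is below. It is a sketch rather than the answerer's exact code: replace_runs is a hypothetical name, each string is processed separately so runs never span two strings, and inverse.rle() is used to rebuild the string instead of summing bar$lengths by hand.

```r
# Hypothetical helper: in each string, replace every character of a run of
# `char` whose length is exactly `n` with `replacement`.
replace_runs <- function(x, char, n, replacement) {
  vapply(x, function(s) {
    r <- rle(strsplit(s, "")[[1]])            # run-length encode the characters
    hit <- r$values == char & r$lengths == n  # runs of the right char and length
    r$values[hit] <- replacement              # swap the character, keep the length
    paste(inverse.rle(r), collapse = "")      # expand the runs back into a string
  }, character(1), USE.NAMES = FALSE)
}

str <- c("ddaabb", "daabb", "aaddbb", "aadbb")
replace_runs(str, "d", 1, "c")
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"
```

Because the run length is compared as a number, conditions such as r$lengths >= n would need only a one-line change.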

Related

Finding the rhyming words in a corpus with R, regex

I have the following corpus:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
Out of all of these words I only need words like "helter-skelter", "lovey-dovey" and "chock-a-block", which are rhyming reduplicatives with a change of consonants. They are usually spelled with a hyphen, and may have a medial component in between the elements, such as "a" in "chock-a-block". I only need to find rhyming reduplicative expressions that have the same number of syllables. For example, although "phoney-baloney" is a rhyming reduplicative, I do not need it.
I was using the following code to find rhyming reduplicatives:
rhyme<- grep("\\b(\\w*)(\\w{2,}?)-(\\w{1,}?-)?\\w*\\2\\b", corpus_rhyme,
ignore.case = TRUE, perl = TRUE, value = TRUE)
This code produces too many false positives. The output is as follows:
rhyme
[1] "helter-skelter" "lovey-dovey" "riff-raff" "hunter-gatherer" "day-to-day" "second-hand"
[7] "chock-a-block"
I was sifting out these false positives manually, which takes too much time. Can anyone advise a better version of this line of code to reduce the number of false positives? For example, "riff-raff" comes up because I want at least 2 last letters to be the same at the end of a reduplicative expression, otherwise I will miss expressions like "rat-tat". But can we specify in this code that these two letters have to be different from each other, so that "rat-tat" is found ("a" is different from "t"), but "riff-raff" ("f" and "f" are the same) does not come up?
Another possible improvement: how can I get rid of words like "day-to-day" where the two elements are exactly the same? I only need rhyming reduplicatives that have a difference in the initial consonants.
Finally, I am not sure if anything can be done about "hunter-gatherer", unless there is a way to calculate the number of syllables and make sure that both elements of the expression have the same number of syllables.
Base R solution, there is most probably an easier way:
# Count the number of hyphens in each element: hyphen_count => integer vector
hyphen_count <- lengths(regmatches(corpus_rhyme, gregexpr("\\-", corpus_rhyme)))
# Check if the expression rhymes: named logical vector => stdout(console)
vapply(ifelse(hyphen_count == 1,
              gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme),
              gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-\\w+\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme)
       ), function(x){
         split_string <- unlist(strsplit(x, "\\s"))
         identical(split_string[1], split_string[2])
       }, logical(1)
)
Data:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
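The comparison in the base R answer above (strip the leading consonants, then compare what is left) can also be sketched without the two long gsub patterns. This is an illustration under stated assumptions: is_rhyming_redup is a hypothetical name, it works on spelling rather than pronunciation, and it treats only a, e, i, o, u as vowels.

```r
# Hypothetical helper: TRUE when the first and last hyphen-separated parts
# share everything from the first vowel onward but are not the same word.
is_rhyming_redup <- function(x) {
  parts <- strsplit(tolower(x), "-", fixed = TRUE)
  vapply(parts, function(p) {
    first <- p[1]
    last <- p[length(p)]
    # Drop the leading consonants; the remainder is compared as the "rime".
    rime_first <- sub("^[^aeiou]*", "", first)
    rime_last <- sub("^[^aeiou]*", "", last)
    nzchar(rime_first) && rime_first == rime_last && first != last
  }, logical(1))
}

corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
                  "day-to-day", "second-hand", "chock-a-block")
corpus_rhyme[is_rhyming_redup(corpus_rhyme)]
# [1] "helter-skelter" "lovey-dovey"    "chock-a-block"
```

Requiring first != last drops "day-to-day", and comparing the whole remainder drops "phoney-baloney" and "hunter-gatherer" without counting syllables, while "rat-tat" is still found.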
If it is a general phenomenon that the last three or more letters of the first word are the same as the last three or more letters of the last word, then you could try this:
grep(".*(\\w{3,})-([aeiou]-)?(\\w+)(\\1)", corpus_rhyme,ignore.case = TRUE,value=TRUE)
#[1] "helter-skelter" "lovey-dovey" "chock-a-block"

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried @Onyambu's first method and ran into a variant which I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". The part to be replaced can also contain numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL", and I get it using sub("\\.\\w+.\\B.",".",x) per the second part of his answer!
See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl=TRUE)
^ : Assert position at the start of the line
(?:[^.]*\.){2} : Match the following exactly twice
    [^.]*\. : Match any character except . any number of times, followed by .
\K : Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* : Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
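The same "skip the first two fields" logic can be sketched with a capture group instead of \K, which also makes the field number a parameter. replace_field is a hypothetical helper name, and the replacement is assumed to contain no backslashes, since sub() would interpret them.

```r
# Hypothetical helper: replace the k-th dot-separated field of each string.
replace_field <- function(x, k, replacement = "") {
  # Group 1 captures the first k - 1 fields with their trailing dots;
  # [^.]* then matches the k-th field, which is what gets replaced.
  pattern <- sprintf("^((?:[^.]*\\.){%d})[^.]*", k - 1L)
  sub(pattern, paste0("\\1", replacement), x, perl = TRUE)
}

x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
replace_field(x, 3)
# [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
replace_field("ABC.A.AB11.1.12.112.1123.UDGGL", 3)
# [1] "ABC.A..1.12.112.1123.UDGGL"
```

Because the pattern only looks at dots, it does not matter whether the third field holds letters, digits, or a mix.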
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Captures a run of one or more characters in the range A-Z.
(\\.\\d+) Requires that run to be followed by a dot (\\.) and then one or more digits (\\d+), and captures the dot and digits as the second group.
In the string "ABC.A.SVN.10.10.390.10.UDGGL", the first place this matches is SVN.10, captured as SVN (group 1) and .10 (group 2). The replacement \\2 is a backreference, so the whole of SVN.10 is replaced with just the second part, .10.
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
       FUN = function(str){
           ind = unlist(gregexpr("\\.", str))[2:3]
           paste(c(substring(str, 1, ind[1]),
                   "SUBSTITUTION",
                   substring(str, ind[2], )), collapse = "")
       },
       USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"

Regex: seeing/finding AB as distinct from AB-C

In R I need to search a character vector as shown below. I need to return "AB" separately from "ABC" so I am using word-boundaries. However, I also need to find "AB-C" as something distinct from "AB"; there are some questions along these lines here but I can't get the proper invocation. Put another way, I need each of these strings to be found uniquely as I loop over them, and my grep expression needs to always return a single answer.
vec <- c("AB", "ABC", "AB-C")
grep("\\bAB\\b", vec) # 1 + 3, but only want 1
We just specify the start (^) and end ($) of the string
grep("^AB$", vec)
#[1] 1
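For the looping case in the question, the anchors can be pasted around each string in turn. The sketch below assumes the strings contain no regex metacharacters (the hyphen is safe outside a character class); if they might, a plain equality test such as match() avoids regex altogether.

```r
vec <- c("AB", "ABC", "AB-C")

# Anchor each string at both ends so it can only match itself.
anchored <- vapply(vec, function(s) grep(paste0("^", s, "$"), vec),
                   integer(1), USE.NAMES = FALSE)
anchored
# [1] 1 2 3

# Regex-free alternative: exact comparison, safe for any characters.
match(vec, vec)
# [1] 1 2 3
```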

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append any letter within any element of char_vec with
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
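Two further small illustrations of backreferences in the replacement, given as a sketch: the words vector and the "left-right" string are made-up inputs, not from the question.

```r
words <- c("ab", "ab cd", "123?")

# Capture a whole run of letters, so the suffix is added once per word
# rather than once per letter.
per_word <- gsub("([[:alpha:]]+)", "\\1_APPEND", words)
per_word
# [1] "ab_APPEND"           "ab_APPEND cd_APPEND" "123?"

# Several groups can be reinserted in any order, here swapping two parts.
swapped <- sub("(\\w+)-(\\w+)", "\\2-\\1", "left-right")
swapped
# [1] "right-left"
```

Elements with no match, such as "123?", are returned unchanged, which is what gives the grep-like filtering for free.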
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
# Caveat: sub() replaces the first occurrence of each found letter, so an
# element with a repeated letter (e.g. 'a a') is not handled correctly.
for(i in seq_along(matches)) {
    for(found in matches[[i]]){
        char_vars[i] = sub(pattern = found,
                           replacement = paste(found, "_append", sep=""),
                           x = char_vars[i])
    }
}
