How to match distinct repeated characters - r

I'm trying to come up with a regex in R to match strings in which there is repetition of two distinct characters.
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
This regex matches all of the above, including strings such as "mmmm" and "ohhhh" where the repeated letter is the same in the first and the second repetition:
grep(".*([a-z])\\1.*([a-z])\\2", x, value = T)
What I'd like to match in x are these strings where the repeated letters are distinct:
"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"
How can the regex be tweaked to make sure the second repeated character is not the same as the first?

You may restrict the second char pattern with a negative lookahead:
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
#                   ^^^^^^^
See the regex demo.
(?!\\1)([a-z]) means match and capture into Group 2 any lowercase ASCII letter if it is not the same as the value in Group 1.
R demo:
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
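To see which strings the lookahead filters out, the two patterns can be run side by side. This is only a sketch: the names `any_two_pairs`, `distinct_pairs`, and `dropped` are illustrative, and `perl = TRUE` is passed to both calls simply so they run on the same engine.

```r
x <- c("aaaaaaah", "aaaah", "ahhhh", "cooee", "helloee", "mmmm", "noooo", "ohhhh",
       "oooaaah", "ooooh", "sshh", "ummmmm", "vroomm", "whoopee", "yippee")

# Original pattern: any two doubled letters, possibly the same letter twice
any_two_pairs <- grep(".*([a-z])\\1.*([a-z])\\2", x, value = TRUE, perl = TRUE)

# With the lookahead: the second doubled letter must differ from the first
distinct_pairs <- grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value = TRUE, perl = TRUE)

# Strings in which only one letter is ever doubled
dropped <- setdiff(any_two_pairs, distinct_pairs)
dropped
```

`dropped` should contain exactly the strings such as "mmmm" and "ohhhh" that the question wants excluded.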

If you can avoid regex altogether, then I think that's the way to go. A rough example:
nrep <- sapply(
  strsplit(x, ""),
  function(y) {
    run_lengths <- rle(y)
    length(unique(run_lengths$values[run_lengths$lengths >= 2]))
  }
)
x[nrep > 1]
# [1] "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
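It may help to look at what `rle()` returns for a single string. The sketch below walks through "whoopee"; the variable names `chars`, `runs`, and `repeated` are made up for illustration.

```r
chars <- strsplit("whoopee", "")[[1]]
runs <- rle(chars)

runs$values   # one entry per run of identical characters
runs$lengths  # the length of each run

# Letters that occur in a run of length two or more
repeated <- unique(runs$values[runs$lengths >= 2])
repeated

# The string qualifies when more than one distinct letter is repeated
length(repeated) > 1
```

Because `unique()` is applied to the repeated letters, a string like "mmmm" or "ohhhh" yields a single repeated letter and is not selected.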


Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
  id = 1:3,
  codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works, but I believe it could be done more easily with a regex; I couldn't figure out how.
library(stringr)
remove_6_digit <- function(x){
  idxs <- str_locate_all(x,"/")[[1]][,1]
  for (idx in c(rev(idxs+7), 6)){
    str_sub(x, idx, idx) <- ""
  }
  return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
See the regex demo. This will remove the 6th digit from the start of a digit sequence.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
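Applied to the codes from the question, the pattern can be checked as follows. This is a small sketch; `codes` and `stripped` are illustrative names, and the values are copied from the question's data frame.

```r
codes <- c("08001301001",
           "08002401002 / 08002601003 / 17134604034",
           "08004701005 / 08005101001")

# Keep the first five digits of each number (group 1) and drop the sixth
stripped <- gsub("\\b(\\d{5})\\d", "\\1", codes)
stripped
```

Each number shrinks from 11 to 10 digits, and the " / " separators are left untouched.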
While a word boundary looks sufficient for the current scenario, a digit boundary is also an option in case the numbers are glued to word chars:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, so you need to replace \\d{4} with \\d{5}, see this regex demo.
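The difference between the two boundaries shows up when a number is glued to letters. The sketch below uses a made-up string, "ID08001301001", plus a made-up 12-digit number to illustrate the exact-length variant adapted to 11 digits; all variable names are hypothetical.

```r
glued <- "ID08001301001"

# \b needs a word/non-word transition, but "D" and "0" are both word characters
with_word_boundary <- gsub("\\b(\\d{5})\\d", "\\1", glued, perl = TRUE)

# (?<!\d) only requires that the preceding character is not a digit
with_digit_boundary <- gsub("(?<!\\d)(\\d{5})\\d", "\\1", glued, perl = TRUE)

# Exact-length variant for 11-digit numbers: \d{5} on both sides of the removed digit
lengths_mixed <- c("08001301001", "080013010012")
exact_only <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", lengths_mixed, perl = TRUE)

with_word_boundary
with_digit_boundary
exact_only
```

The word-boundary version leaves the glued string unchanged, while the exact-length version leaves the 12-digit number alone.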
Another possible solution, using stringr::str_replace_all and lookarounds:
library(tidyverse)
data %>%
  mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#> id codes
#> 1 1 0800101001
#> 2 2 0800201002 / 0800201003 / 1713404034
#> 3 3 0800401005 / 0800501001

R padding 0's inside a string after the hyphen

I have the following data
GT_BUC-01_BUCST-19
ADT_BURC-1_BUCST-09
BT_BUDDC-1_BUDSCST-29
CAST_BUC-31_BUCST-9
CAST_BUC-1_BUCST-9
How do I use R to pad the numbers after both hyphens with leading zeros so that each has two digits? The resulting format should look like this:
GT_BUC-01_BUCST-19
ADT_BURC-01_BUCST-09
BT_BUDDC-01_BUDSCST-29
CAST_BUC-31_BUCST-09
CAST_BUC-01_BUCST-09
One option would be to use stringr::str_replace_all
x <- c('GT_BUC-01_BUCST-19', 'ADT_BURC-1_BUCST-09',
'BT_BUDDC-1_BUDSCST-29', 'CAST_BUC-31_BUCST-9', 'CAST_BUC-1_BUCST-9')
# '%02d' zero-pads integers; '%02s' pads strings with spaces on some platforms
stringr::str_replace_all(x, '\\d+', function(m) sprintf('%02d', as.integer(m)))
#[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09"
#[3] "BT_BUDDC-01_BUDSCST-29" "CAST_BUC-31_BUCST-09"
#[5] "CAST_BUC-01_BUCST-09"
You could try using gsub as follows:
x <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl=TRUE)
x
[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09" "BT_BUDDC-01_BUDSCST-29"
[4] "CAST_BUC-31_BUCST-09" "CAST_BUC-01_BUCST-09"
Data:
x <- c("GT_BUC-01_BUCST-19",
"ADT_BURC-1_BUCST-09",
"BT_BUDDC-1_BUDSCST-29",
"CAST_BUC-31_BUCST-9",
"CAST_BUC-1_BUCST-9")
The regex pattern used here matches a dash followed by exactly one digit; the negative lookahead (?!\d) rules out a second digit. We then replace the match by prepending a zero to that single digit.
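A quick way to see why the lookahead matters is to compare the pattern with and without it. The sketch below uses two made-up strings, not the question's data, and the names `ids`, `padded`, and `no_lookahead` are illustrative.

```r
ids <- c("A-1_B-9", "A-31_B-09")

# With the lookahead: only single digits after a dash are padded
padded <- gsub("-(\\d)(?!\\d)", "-0\\1", ids, perl = TRUE)

# Without it: the first digit of every number gets a zero in front
no_lookahead <- gsub("-(\\d)", "-0\\1", ids, perl = TRUE)

padded
no_lookahead
```

Without the lookahead, "-31" becomes "-031" and "-09" becomes "-009", which is why the check for a following digit is needed.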

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
  is_char <- grepl("[[:alpha:]]", x)
  has_number <- grepl("\\d", x)
  mixed <- is_char & has_number
  x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=TRUE)
  grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
  unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we define groups for collapsing, looking for character elements that come after other character elements to put them in the same group. Finally, we collapse all the values in the same group.
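The grouping step is the least obvious part, so here it is unrolled on the question's vector. This is a sketch with intermediate names (`prev_is_char`, `starts_group`) that do not appear in the function above.

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')

is_char <- grepl("[[:alpha:]]", myVec)

# Was the previous element a letter element? (FALSE for the first one)
prev_is_char <- c(FALSE, head(is_char, -1))

# A new group starts at every number, and at every letter element
# that does not directly follow another letter element
starts_group <- !is_char | (is_char & !prev_is_char)

# Cumulative count of group starts gives each element its group id
grp <- cumsum(starts_group)
grp
```

Consecutive letter elements such as 'asd' and 'gkd' share a group id, so `tapply(..., paste, collapse = "")` glues them together.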
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
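The last line packs three steps into one call. Unrolled, and starting from the vector as it looks after the non-letter characters have been removed, it might read like this; the names `cleaned`, `joined`, `merged`, and `result` are illustrative.

```r
# State of myVec after removing non-letters from the letter elements
cleaned <- c("1.2", "asd", "gkd", "232", "4343", "zyz", "fva", "3213", "1232", "dasd")

# 1. Join everything with a delimiter
joined <- paste(cleaned, collapse = "|")

# 2. Drop the delimiter wherever it sits between two letters
merged <- gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", joined, perl = TRUE)

# 3. Split on the remaining delimiters
result <- strsplit(merged, "|", fixed = TRUE)[[1]]
result
```

Note that this approach assumes no element contains the delimiter "|" itself.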

How to find substrings flanked by a specific character and replace with text of the same length in R?

In R, what is the best way of finding dots flanked by asterisks and replacing them with asterisks?
input:
"AG**...**GG*.*.G.*C.C"
desired output:
"AG*******GG***.G.*C.C"
I tried the following function, but it is not elegant to say the least.
library(stringr)
replac <- function(my_string) {
  m <- str_locate_all(my_string, "\\*\\.+\\*")[[1]]
  if (nrow(m) == 0) return(my_string)
  split_s <- unlist(str_split(my_string, ""))
  for (i in 1:nrow(m)) {
    st <- m[i, 1]
    en <- m[i, 2]
    split_s[st:en] <- rep("*", length(st:en))
  }
  paste(split_s, collapse = "")
}
I have edited the input string and expected output after #TheForthBird's answer below to make clear that dots not flanked by asterisks should not be changed, and that letters other than "A" and "G" may occur.
You might use gsub with perl = TRUE and make use of the \G anchor to assert the position at the end of the previous match.
You could match AG or GG using a character class [AG]G or [A-Z]+ to match 1+ uppercase characters.
In the replacement use *
(?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)
That will match
(?: Non-capturing group
[A-Z]+\*+ Match 1+ uppercase chars A-Z, then 1+ times *
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start of the string
) Close the non-capturing group
\K Forget what has been matched so far
\. Match a literal dot
(?= Positive lookahead, assert that what is on the right is
[^*]*\* Match 0+ times any char except *, then match *
) Close the lookahead
For example:
gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", "AG**...**GG*.*.G.*C.C", perl = TRUE)
Result
[1] "AG*******GG***.G.*C.C"
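To see what the \G branch contributes, compare the pattern with a variant that omits it. This is only a sketch; `without_G` and `with_G` are illustrative names.

```r
s <- "AG**...**GG*.*.G.*C.C"

# Without \G: only the first dot after each "letters + asterisks" prefix is replaced
without_G <- gsub("[A-Z]+\\*+\\K\\.(?=[^*]*\\*)", "*", s, perl = TRUE)

# With \G: each further dot directly after the previous match is replaced too
with_G <- gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", s, perl = TRUE)

without_G
with_G
```

The variant without \G leaves ".." in the first run, because the remaining dots are no longer preceded by an uppercase letter and asterisks.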
Try this code; it's still not wrapped, but at least it is a bit shorter than yours and works for all cases, not only those without other occurrences of dots in the string:
replac_v2 <- function(my_string) {
  b <- my_string  # just a shorter name
  while (TRUE) {
    # locate the next run of dots flanked by asterisks (NA when none is left)
    df <- as.data.frame(str_locate(b, "\\*\\.+\\*"))
    add <- as.numeric(df[2] - df[1]) + 1
    if (is.na(add)) return(b)
    # replace the run, including its flanking asterisks, with asterisks of the same length
    b <- str_replace(b, "\\*\\.+\\*", paste(rep("*", add), collapse = ""))
  }
}
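A loop-free base R alternative, not taken from either answer, is to match only the dot runs with lookarounds and overwrite them through `regmatches<-`. It is a sketch under the assumption that a run counts as flanked when an asterisk sits directly on both sides.

```r
s <- "AG**...**GG*.*.G.*C.C"

# Match runs of dots with * directly before and after; the lookarounds
# do not consume the asterisks, so adjacent runs can share one
m <- gregexpr("(?<=\\*)\\.+(?=\\*)", s, perl = TRUE)

# Replace each run by the same number of asterisks
regmatches(s, m) <- lapply(regmatches(s, m), function(runs) strrep("*", nchar(runs)))
s
```

Since `strrep()` is vectorised over the run lengths, every match is replaced in one pass without locating positions by hand.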

Extract first X Numbers from Text Field using Regex

I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with gsub too.
sub('.?([0-9]{4}).*', '\\1', x)
# [1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and makes it optional (a plain . would not match the case without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference them too, e.g. with '\\2'. Inside the group (and that part of your syntax was correct) I want only digits, and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
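If the goal is extraction rather than replacement, base R's `regmatches()` with `regexpr()` returns the first run of four digits directly. The sketch below also includes a made-up string, "AB1234.x", to show a limit of the `.?` approach; the names `first_four`, `two_letters`, `via_sub`, and `via_regmatches` are illustrative.

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() finds the first match per element; regmatches() extracts it
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
first_four

# With two leading letters, .? can absorb only one of them
two_letters <- "AB1234.x"
via_sub <- sub(".?([0-9]{4}).*", "\\1", two_letters)
via_regmatches <- regmatches(two_letters, regexpr("[0-9]{4}", two_letters))
via_sub
via_regmatches
```

One caveat: `regmatches()` silently drops elements that have no match, so the result can be shorter than the input.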
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the call above into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps only the positions of the first four digits
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group-capturing approach that doesn't assume 4 digits.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
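One thing to be aware of: this pattern requires at least one non-digit after the digits. The sketch below compares it with a slightly relaxed variant on a made-up string, "P2134", that ends right after the number; the relaxed pattern and the variable names are illustrative, not part of the answer above.

```r
x <- c("P2134.asfsafasfs", "P2134")

# Pattern from the answer: needs a non-digit after the run of digits
strict <- gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)

# Relaxed variant: anything (including nothing) may follow the digits
relaxed <- sub("^[^0-9]*([0-9]+).*", "\\1", x)

strict
relaxed
```

With the strict pattern the second element does not match at all and is returned unchanged.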
