Remove duplicates within consecutive runs of characters - r

I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but am somewhat stuck here. My hunch is that the function which, which returns indices, might help, but I don't know how to apply it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
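If a character vector of collapsed strings is wanted rather than a list of single characters, one possible sketch is to delete every character that is followed by a copy of itself, reusing the positive lookahead in base R. This assumes perl = TRUE for lookahead support and that ">" only ever appears as a separator; the names flat and collapsed are just illustrative.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the ">" separators first (fixed = TRUE treats ">" literally)
flat <- gsub(">", "", tst, fixed = TRUE)

# Delete each character that is immediately followed by the same character,
# so only the last character of every run survives
collapsed <- gsub("(.)(?=\\1)", "", flat, perl = TRUE)
collapsed
# "CBC*C" "A" "*B" "A*A*A" "*C" "A"
```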

We can use gsub
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
In order to get the second result, remove the >
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)
#[1] "CBC*C"
Based on the OP's comments below, maybe:
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"

Another way to get CBC*C is to use two capture groups and put group 2 in the replacement.
((.)>)\1+
Example
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
gsub("((.)>)\\1+", "\\2", tst)
Output
[1] "CBC*C"

For those of us allergic to regex:
paste(rle(strsplit(tst, ">")[[1]])$values, collapse = ">") # or collapse = ""
[1] "C>B>C>*>C"
...which of course also collapses runs of lowercase letters, so it fails for strings like "A>A>a>a>A>A".
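A possible way around the lowercase limitation, still built on rle, is to shrink only those runs whose value is an upper-case letter or "*" and then rebuild the sequence with inverse.rle. This is a sketch under the assumption that every token between separators is a single character; collapse_upper_runs is a hypothetical helper name.

```r
# Hypothetical helper: collapse runs of upper-case letters and "*" only
collapse_upper_runs <- function(s, sep = ">") {
  r <- rle(strsplit(s, sep, fixed = TRUE)[[1]])
  # Runs whose value is a single upper-case letter or "*" shrink to length 1
  collapsible <- grepl("^[[:upper:]*]$", r$values)
  r$lengths[collapsible] <- 1L
  paste(inverse.rle(r), collapse = sep)
}

collapse_upper_runs("A>A>a>a>A>A")
# "A>a>a>A"
collapse_upper_runs("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C")
# "C>B>C>*>C"
```

For a whole vector, the helper can be applied element by element, for example with vapply(tst, collapse_upper_runs, character(1), USE.NAMES = FALSE).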

A somewhat universal base R approach without regexps.
The idea here is to melt down the string to groups and then remove the repeating patterns successively (which makes it distinct from unique):
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
st <- paste(unlist(strsplit(tst,">")),collapse="")
#[1] "CCCBBBBCC*****C"
paste(unlist(sapply(seq_len(nchar(st)), function(x) {
  # keep a character only if the next one differs (the last one is always kept)
  if (substr(st, x, x) != substr(st, x + 1, x + 1)) substr(st, x, x)
})), collapse = "")
#[1] "CBC*C"
Oh, and if you want lowercase functionality (excluding lowercase letters from removal), use this instead:
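paste(unlist(sapply(seq_len(nchar(st)), function(x) {
  a <- substr(st, x, x); b <- substr(st, x + 1, x + 1)
  # upper-case letters and symbols are kept only at the end of a run;
  # lowercase letters are always kept
  if (a != b && toupper(a) == a) a else if (toupper(a) != a) a
})), collapse = "")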
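The same keep-the-last-of-each-run idea can also be written without an explicit loop over positions, by comparing the character vector with a copy of itself shifted by one. This is only a sketch; drop_repeats is a hypothetical name, and it treats every character alike (no special handling of lowercase).

```r
# Hypothetical helper: keep a character only where its successor differs
drop_repeats <- function(st) {
  chars <- strsplit(st, "")[[1]]
  # TRUE where a character differs from the next one; the last is always kept
  keep <- c(chars[-1] != chars[-length(chars)], TRUE)
  paste(chars[keep], collapse = "")
}

drop_repeats("CCCBBBBCC*****C")
# "CBC*C"
```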

Related

Split a string on alternating index

I have a string similar to "HLeelmloon" which is two words interweaved together. How can I separate this into two separate words, splitting on alternating letters?
I can use strsplit() and a for loop to allocate alternating letters to two new vectors and then join the list but this seems very long winded:
string <- "HLeelmloon"
split<-el(strsplit(string,''))
> split
[1] "H" "L" "e" "e" "l" "m" "l" "o" "o" "n"
word1<-c()
word2<-c()
for(i in 1:length(split)){
if(i %% 2 == 1){
word1<-append(word1, split[i])
} else {
word2<-append(word2, split[i])
}
}
word1 = paste0(word1, collapse = '')
word2 = paste0(word2, collapse = '')
> word1
[1] "Hello"
> word2
[1] "Lemon"
My issue is that it's not very elegant, and it doesn't scale well if I want to split the string into N different words. Is there a better way to do this?
You could use gsub to capture alternating characters into the same group:
gsub("(.)(.)?", "\\1", string)
#[1] "Hello"
gsub("(.)(.)?", "\\2", string)
#[1] "Lemon"
You can do it by using TRUE and FALSE for indexing, i.e.
v1 = strsplit(string, '')[[1]]
paste(v1[c(TRUE, FALSE)], collapse = '')
#[1] "Hello"
paste(v1[c(FALSE, TRUE)], collapse = '')
#[1] "Lemon"
Considering your question is how to split into more than two words, you should use the split function. Using your example data can be a bit confusing because you chose to name one variable 'split'. In the following block, the first 'split' is the function, the second one your split variable.
number_of_words <- 2
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
number_of_words <- 3
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Heln"
$`2`
[1] "Llo"
$`3`
[1] "emo"
To avoid confusion, here's the same code without the variable named split:
number_of_words <- 2
lapply(split(el(strsplit(string,'')),1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
Try this code:
paste0(split[seq(1,nchar(string),by = 2)],collapse="")
[1] "Hello"
paste0(split[seq(2,nchar(string),by = 2)],collapse="")
[1] "Lemon"
This pastes together the characters at the odd and at the even positions of string, respectively (split is the character vector from the question).
Another way using your split variable, will work with any number of words:
N <- 2
apply(matrix(split,N),1,paste,collapse="")
# [1] "Hello" "Lemon"
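For the general N-word case, a grouping vector built from the character positions avoids relying on recycling (and therefore also works when the string length is not a multiple of N). A minimal sketch, where deinterleave is a hypothetical helper name:

```r
# Hypothetical helper: send character i to word ((i - 1) mod n) + 1
deinterleave <- function(string, n) {
  chars <- strsplit(string, "")[[1]]
  groups <- (seq_along(chars) - 1L) %% n
  unname(vapply(split(chars, groups), paste, character(1), collapse = ""))
}

deinterleave("HLeelmloon", 2)
# "Hello" "Lemon"
deinterleave("HLeelmloon", 3)
# "Heln" "Llo" "emo"
```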

Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
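With sub, the pattern has to match everything you want to replace. In a regular expression, * is a quantifier rather than a wildcard: "_*" means "zero or more underscores", which already matches the empty string at the start of each element, so nothing is removed. Here I've used a capture group to keep the letter of interest.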
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using .* (any character, repeated any number of times) will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
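To see what "_*" actually matches, regexpr can report the position and length of the first match. The sketch below only uses base R; m is an illustrative variable name.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# First match of "_*" in each element: position and length
m <- regexpr("_*", x)
as.integer(m)             # match starts at position 1 in every element
attr(m, "match.length")   # and has length 0, i.e. it is the empty string

# An underscore followed by any characters is what was intended
sub("_.*", "", x)
# "a" "a" "b" "b"
```

Because sub replaces only the first match, the zero-length match at position 1 is replaced by "" and the string comes back unchanged.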

Split character string by forward slash or nothing

I want to split this vector
c("CC", "C/C")
to
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C"
My final data should look like:
c("C_C", "C_C")
So I need some regex, but I haven't found how to handle the "nothing" (no separator) part:
strsplit(c("CC", "C/C"),"|/")
You can use sub (or gsub if it occurs more than once in your string) to directly replace either nothing or a forward slash with an underscore, capturing one word character on each side:
sub("(\\w)(|/)(\\w)", "\\1_\\3", c("CC", "C/C"))
#[1] "C_C" "C_C"
We can split the string at every character, omit the "/" and paste them together.
sapply(strsplit(x, ""), function(v) paste0(v[v!= "/"], collapse = "_"))
#[1] "C_C" "C_C"
data
x <- c("CC", "C/C")
We can use (with v1 being the vector from the question, i.e. v1 <- c("CC", "C/C"))
lapply(strsplit(v1, "/|"), function(x) x[nzchar(x)])
Or use a regex lookaround
strsplit(v1, "(?<=[^/])(/|)", perl = TRUE)
#[[1]]
#[1] "C" "C"
#[[2]]
#[1] "C" "C"
If the final output should be a vector, then
gsub("(?<=[^/])(/|)(?=[^/])", "_", v1, perl = TRUE)
#[1] "C_C" "C_C"

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, which, however became the first character after the operation, so it should be in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
#   X1 X2 X3
# 3  a  *  c
# 4  a  b  c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
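The same cumsum test can also be combined across strings without going through NA and a data.frame: flag the leading/trailing "*" positions per string, then drop every position that is flagged in at least one string. This is a sketch that assumes all strings have the same length, as stated in the question; is_edge_star, edge, drop and trimmed are illustrative names.

```r
a <- c("**aaa", "bb*bb", "cccc*")
chars <- strsplit(a, "")

# TRUE for positions that belong to a leading or trailing run of "*"
is_edge_star <- function(v) {
  star <- v == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# Matrix with one row per position and one column per string
edge <- vapply(chars, is_edge_star, logical(nchar(a[1])))

# A position is dropped everywhere if any string flags it
drop <- rowSums(edge) > 0
trimmed <- lapply(chars, function(v) v[!drop])
trimmed
```

Here positions 1, 2 and 5 are dropped, which leaves "a" "a", "*" "b" and "c" "c".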
First off, as Richard Scriven pointed out in his comment on your question, your output is not the same as what you asked for. You ask for removal of leading and trailing characters, but your desired output is just the 3rd and 4th elements of the character vectors.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
c <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> c
[1] "aaa" "bb*bb" "cccc"
This is doable in one regex, but then you need a backreference for the stuff in between, I think, and I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
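For stripping stars per string in a single call, one option that needs no backreference is an alternation of the two anchored patterns. A minimal sketch; stripped is an illustrative name, and unlist is used only to turn the list from the question into a character vector.

```r
a <- list('**aaa', 'bb*bb', 'cccc*')

# "^\\*+" matches leading stars, "\\*+$" matches trailing stars;
# gsub removes whichever of the two it finds
stripped <- gsub("^\\*+|\\*+$", "", unlist(a))
stripped
# "aaa" "bb*bb" "cccc"
```

Like the two-step version, this trims each string independently, so the results no longer have equal lengths.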

Finding midpoint of string in R (mid character of a word)

I'd like to find the midpoint of any word after the following is done to the word:
>x = 'hello'
>y = strsplit(x, '')
>y
[[1]]
[1] "h" "e" "l" "l" "o"
>z = unlist(y)
>z
[1] "h" "e" "l" "l" "o"
Doing this then allows for :
> z[1]
[1] "h"
> z[4]
[1] "l"
The difference is that without the split and unlist(y), indexing the original string x by position does not return single letters, and any index beyond 1 gives NA, for example:
> x = 'hello'
> strsplit(x, '')
[[1]]
[1] "h" "e" "l" "l" "o"
> x[1]
[1] "hello"
> x[2]
[1] NA
Anyways, what I want to do is find the mid point of words that are in this format so that the output would be something like:
"l"
in the case of the word "hello". This example has 5 letters, so a single character can easily be designated as the midpoint, but for a word like "bake" I would like to designate both "a" and "k" together as the midpoint.
Try
f1 <- function(str1){
N <- nchar(str1)
if(!N%%2){
res <- substr(str1, N/2, (N/2)+1)
}
else{
N1 <- median(sequence(N))
res <- substr(str1, N1, N1)
}
res
}
f1('bake')
#[1] "ak"
f1('hello')
#[1] "l"
Another option. get_middle assumes the word has already been split into characters, as per your description:
get_middle <- function(x) {
  mid <- (length(x) + 1) / 2
  # floor first, so that two middle letters come out in reading order
  x[unique(c(floor(mid), ceiling(mid)))]
}
Then:
words <- c("bake", "hello")
lapply(strsplit(words, ""), get_middle)
Produces:
[[1]]
[1] "a" "k"
[[2]]
[1] "l"
You could try this:
midpoint <- function(word) {
# Split the word into a vector of letters
split <- strsplit(word, "")[[1]]
# Get the number of letters in the word
n <- nchar(word)
# Get the two middle letters for words of even length,
# otherwise get the single middle letter
if (n %% 2 == 0) {
c(split[n/2], split[n/2+1])
} else {
split[ceiling(n/2)]
}
}
In the case of a word of even length, the middle two characters are returned as a vector.
midpoint("hello")
#[1] "l"
midpoint("bake")
#[1] "a" "k"
How about:
mid<-function(str)substr(str,(nchar(str)+1)%/%2,(nchar(str)+2)%/%2)
Or slightly more legibly:
mid2<-function(str){
n1<-nchar(str)+1
substr(str,n1%/%2,(n1+1)%/%2)
}
> mid("bake")
[1] "ak"
> mid("hello")
[1] "l"
This has the advantage that it immediately vectorizes:
> mid(c("bake","hello"))
[1] "ak" "l"
It is slower than @akrun's solution for long words, but my second version is faster; apparently counting characters can be costly for longer strings.
If you want the final product in a list, you can just strsplit the result:
mid3<-function(str)strsplit(mid2(str),"")
word = c("bake","hello")
print(nchar(word))
[1] 4 5
q = ifelse(nchar(word) %% 2 == 0,
           substr(word, nchar(word)/2, nchar(word)/2 + 1),
           substr(word, nchar(word)/2 + 1, nchar(word)/2 + 1))
print(q)
[1] "ak" "l"

Resources