Split a string on alternating index - r

I have a string similar to "HLeelmloon", which is two words interleaved. How can I separate it into the two original words by splitting on alternating letters?
I can use strsplit() and a for loop to allocate alternating letters to two new vectors and then paste each vector back together, but this seems very long-winded:
string <- "HLeelmloon"
split<-el(strsplit(string,''))
> split
[1] "H" "L" "e" "e" "l" "m" "l" "o" "o" "n"
word1<-c()
word2<-c()
for(i in 1:length(split)){
if(i %% 2 == 1){
word1<-append(word1, split[i])
} else {
word2<-append(word2, split[i])
}
}
word1 = paste0(word1, collapse = '')
word2 = paste0(word2, collapse = '')
> word1
[1] "Hello"
> word2
[1] "Lemon"
My issue is that it's not very elegant, and it doesn't scale well if I want to split the string into N different words. Is there a better way to do this?

You could use gsub to capture alternating characters into the same group:
gsub("(.)(.)?", "\\1", string)
#[1] "Hello"
gsub("(.)(.)?", "\\2", string)
#[1] "Lemon"

You can do it by using TRUE and FALSE for indexing, i.e.
v1 = strsplit(string, '')[[1]]
paste(v1[c(TRUE, FALSE)], collapse = '')
#[1] "Hello"
paste(v1[c(FALSE, TRUE)], collapse = '')
#[1] "Lemon"

Considering your question is how to split into more than two words, you should use the split function. Using your example data can be a bit confusing because you chose to name one variable 'split'. In the following block, the first 'split' is the function, the second one your split variable.
number_of_words <- 2
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
number_of_words <- 3
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Heln"
$`2`
[1] "Llo"
$`3`
[1] "emo"
To avoid confusion, here's the same code without the variable named split:
number_of_words <- 2
lapply(split(el(strsplit(string,'')),1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
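The calls above rely on R recycling 1:number_of_words across the characters, which produces a warning when the string length is not a multiple of the number of words (as in the three-word case). A minimal sketch that builds the grouping vector explicitly with rep_len() instead; the function name deinterleave is made up for illustration:

```r
# Hypothetical helper: split a string into n interleaved words.
deinterleave <- function(string, n) {
  chars <- strsplit(string, "")[[1]]
  # Group labels 1, 2, ..., n, 1, 2, ... cut to the number of characters
  groups <- rep_len(seq_len(n), length(chars))
  unname(vapply(split(chars, groups), paste0, character(1), collapse = ""))
}

deinterleave("HLeelmloon", 2)
# [1] "Hello" "Lemon"
deinterleave("HLeelmloon", 3)
# [1] "Heln" "Llo"  "emo"
```

Returning a plain character vector rather than a named list is a design choice here; drop the unname() and vapply() parts if you prefer the list form shown above.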

Try this code:
> paste0(split[seq(1,nchar(string),by = 2)],collapse="")
[1] "Hello"
> paste0(split[seq(2,nchar(string),by = 2)],collapse="")
[1] "Lemon"
It pastes together the characters at the odd positions of string, then the characters at the even positions.

Another way using your split variable, will work with any number of words:
N <- 2
apply(matrix(split,N),1,paste,collapse="")
# [1] "Hello" "Lemon"
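One caveat with the matrix approach: if the number of characters is not a multiple of N, matrix() fills the leftover cells by recycling (with a warning), so stray characters end up in the result. A small sketch that pads with empty strings first; the variable names chars and padded are only for illustration:

```r
string <- "HLeelmloon"
chars <- strsplit(string, "")[[1]]
N <- 3

# Pad with empty strings so the length is a multiple of N;
# (-length(chars)) %% N is the number of missing cells (0 if none).
padded <- c(chars, rep("", (-length(chars)) %% N))

# Each row of the N-row matrix holds every N-th character
words <- apply(matrix(padded, nrow = N), 1, paste, collapse = "")
words
# [1] "Heln" "Llo"  "emo"
```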

Related

Remove duplicates within consecutive runs of characters

I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but am somewhat stuck here. My hunch is that the function which, which returns indices, might be of help but don't know how to implement it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
We can use gsub
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
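To get the second result, remove the > afterwards (note that > must not be escaped as \\> here, because in R's default regex engine \> is an end-of-word anchor):
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)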
#[1] "CBC*C"
Based on the OP's comments below, maybe:
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
Another way to get CBC*C could be using 2 groups and using group 2 in the replacement.
((.)>)\1+
Example
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
gsub("((.)>)\\1+", "\\2", tst)
Output
[1] "CBC*C"
For those of us allergic to regex:
paste(rle(strsplit(tst, ">")[[1]])$values, collapse = ">") # or collapse = ""
[1] "C>B>C>*>C"
...which of course also collapses runs of lowercase letters, as in "A>A>a>a>A>A", so it fails if those should be kept.
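The rle() one-liner above only looks at the first element of tst. A sketch of the same idea applied to every element; collapse_runs is a hypothetical helper name, and like the one-liner it treats lowercase runs the same as uppercase ones:

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Hypothetical helper: collapse each run of identical tokens to one token.
collapse_runs <- function(x, sep = ">", collapse = "") {
  vapply(strsplit(x, sep, fixed = TRUE), function(tokens) {
    paste(rle(tokens)$values, collapse = collapse)
  }, character(1))
}

collapse_runs(tst)
# [1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```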
A somewhat universal base R approach without regexps.
The idea here is to melt down the string to groups and then remove the repeating patterns successively (which makes it distinct from unique):
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
st <- paste(unlist(strsplit(tst,">")),collapse="")
#[1] "CCCBBBBCC*****C"
paste( unlist( sapply( 1:nchar(st), function(x){
if( substr(st,x,x) != substr(st,(x+1),(x+1)) ){ substr(st,x,x) } } ) ), collapse="" )
#[1] "CBC*C"
Oh, and if you want lowercase functionality (excluding lowercase letters from removal), use this instead:
paste( unlist( sapply( 1:nchar(st), function(x){
a=substr(st,x,x); b=substr(st,(x+1),(x+1));
if( a != b & toupper(a) == a ){ a } else if( toupper(a) != a ){ a } } ) ), collapse="" )

Iteratively extract repeated word forms across speaking turns

I'm working on speaking turns in conversation. My interest is in the words that get repeated from a prior turn to a next turn:
turnsX <- data.frame(
speaker = c("A","B","A","B"),
speech = c("let's have a look",
"yeah let's take a look",
"yeah okay so where to start",
"let's start here"), stringsAsFactors = FALSE
)
I want to extract the repeated word forms. To this end I've run a for loop, iteratively defining each speech turn as a regex pattern for the next speech turn and str_extracting the words that get repeated from turn to turn:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 1:nrow(turnsX)){
pattern[i] <- paste0(unlist(str_split(turnsX$speech[i], " ")), collapse = "|")
extracted[i+1] <- str_extract_all(turnsX$speech[i+1], pattern[i])
}
The result however is partly incorrect:
extracted
[[1]]
NULL
[[2]]
[1] "a" "let's" "a" "a" "look"
[[3]]
[1] "yeah" "a" "a"
[[4]]
[1] "start"
[[5]]
[1] NA
The correct result should be:
extracted
[[1]]
NULL
[[2]]
[1] "let's" "a" "look"
[[3]]
[1] "yeah"
[[4]]
[1] "start"
Where's the mistake? How can the code be mended, or what other approach is there, to get the correct result?
Maybe you can use Map and %in%.
x <- strsplit(turnsX$speech, " ")
Map(function(y,z) y[y %in% z], x[-length(x)], x[-1])
#[[1]]
#[1] "let's" "a" "look"
#
#[[2]]
#[1] "yeah"
#
#[[3]]
#[1] "start"
Here's a base R approach using Map :
tmp <- strsplit(turnsX$speech, ' ')
c(NA, Map(intersect, tmp[-1], tmp[-length(tmp)]))
#[[1]]
#[1] NA
#[[2]]
#[1] "let's" "a" "look"
#[[3]]
#[1] "yeah"
#[[4]]
#[1] "start"
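If the repeated words should live next to the turns they belong to, the result can be written back into the data frame. A minimal sketch, assuming a new column called repeated (a made-up name) and an empty string for the first turn; note that intersect() also drops duplicate words within a turn:

```r
turnsX <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("let's have a look",
             "yeah let's take a look",
             "yeah okay so where to start",
             "let's start here"),
  stringsAsFactors = FALSE
)

words <- strsplit(turnsX$speech, " ", fixed = TRUE)

# Words of each turn that also occur in the previous turn;
# the first turn has no predecessor, so it gets an empty vector.
repeated <- c(list(character(0)),
              Map(intersect, words[-1], words[-length(words)]))

# Hypothetical column name: one space-separated string per turn
turnsX$repeated <- vapply(repeated, paste, character(1), collapse = " ")
turnsX$repeated
# [1] ""             "let's a look" "yeah"         "start"
```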
You need word boundaries ("\\b") in the pattern, so that a short word such as "a" does not match inside "yeah" or "take":
library(stringr)
pattern <- c()
extracted <- c()
for(i in 2:nrow(turnsX)){
pattern[i - 1] <- paste0(unlist(str_split(turnsX$speech[i - 1], " ")), collapse = "|\\b")
extracted[i] <- str_extract_all(turnsX$speech[i], pattern[i - 1])
}
# [[1]]
# NULL
#
# [[2]]
# [1] "let's" "a" "look"
#
# [[3]]
# [1] "yeah"
#
# [[4]]
# [1] "start"
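The same idea can be sketched in base R with a boundary on both sides of the alternation, which also prevents a word from matching the start of a longer one. This assumes the words contain no regex metacharacters; the variable names are only illustrative:

```r
speech <- c("let's have a look",
            "yeah let's take a look",
            "yeah okay so where to start",
            "let's start here")

words <- strsplit(speech, " ", fixed = TRUE)

# Pattern for each turn: any of its words as a whole word,
# e.g. "\\b(let's|have|a|look)\\b" for the first turn.
patterns <- vapply(words, function(w) {
  paste0("\\b(", paste(w, collapse = "|"), ")\\b")
}, character(1))

# Match each turn against the pattern built from the previous turn
n <- length(speech)
extracted <- unname(Map(function(p, s) {
  regmatches(s, gregexpr(p, s, perl = TRUE))[[1]]
}, patterns[-n], speech[-1]))

extracted[[1]]
# [1] "let's" "a"     "look"
```

Here extracted has one element per turn from the second turn onwards.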

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, which, however became the first character after the operation, so it should be in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the lead and lag stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
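If you would rather skip the data.frame round trip, the same leading/trailing test can be combined across the vectors into one logical index. A sketch under the question's assumption that all vectors have equal length; is_edge_star is a hypothetical helper name:

```r
a <- lapply(list("**aaa", "bb*bb", "cccc*"),
            function(s) strsplit(s, "")[[1]])

# TRUE for positions that belong to a leading or trailing run of "*"
is_edge_star <- function(x) {
  star <- x == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# A position is dropped if it is an edge star in ANY of the vectors
drop <- Reduce(`|`, lapply(a, is_edge_star))
result <- lapply(a, function(x) x[!drop])
result[[2]]
# [1] "*" "b"
```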
First off, as Richard Scriven pointed out in his comment on your question, your output is not the same as what you asked for. You ask for removal of leading and trailing characters, but your desired output is just the 3rd and 4th elements of each character vector.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
c <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> c
[1] "aaa" "bb*bb" "cccc"
This is doable in one regex, but then you need a backreference for the stuff in between, I think, and I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
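For what it's worth, a single pattern without any backreference seems to be enough: an alternation of the two anchored runs. A short sketch on the same example strings (the variable name trimmed is arbitrary):

```r
a <- c("**aaa", "bb*bb", "cccc*")

# "^\\*+" matches a run of stars at the start, "\\*+$" a run at the end;
# stars in the middle of a string are left alone.
trimmed <- gsub("^\\*+|\\*+$", "", a)
trimmed
# [1] "aaa"   "bb*bb" "cccc"
```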

Finding midpoint of string in R (mid character of a word)

I'd like to find the midpoint of any word after the following is done to the word:
>x = 'hello'
>y = strsplit(x, '')
>y
[[1]]
[1] "h" "e" "l" "l" "o"
>z = unlist(y)
>z
[1] "h" "e" "l" "l" "o"
Doing this then allows for :
> z[1]
[1] "h"
> z[4]
[1] "l"
The difference is that without the strsplit() and unlist() steps you cannot index individual characters; indexing x directly returns the whole string or NA, for example:
> x = 'hello'
> strsplit(x, '')
[[1]]
[1] "h" "e" "l" "l" "o"
> x[1]
[1] "hello"
> x[2]
[1] NA
Anyways, what I want to do is find the mid point of words that are in this format so that the output would be something like:
"l"
in the case of the word "hello". Also, in this example we have a word with 5 letters allowing to easily designate a single character as the midpoint but for a word like "bake" I would like to designate both "a" and "k" together as the midpoint.
Try
f1 <- function(str1){
N <- nchar(str1)
if(!N%%2){
res <- substr(str1, N/2, (N/2)+1)
}
else{
N1 <- median(sequence(N))
res <- substr(str1, N1, N1)
}
res
}
f1('bake')
#[1] "ak"
f1('hello')
#[1] "l"
Another option. get_middle assumes the word has already been split into characters, as per your description:
get_middle <- function(x) {
mid <- (length(x) + 1) / 2
x[unique(c(ceiling(mid), floor(mid)))]
}
Then:
words <- c("bake", "hello")
lapply(strsplit(words, ""), get_middle)
Produces:
[[1]]
[1] "k" "a"
[[2]]
[1] "l"
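Note that for even-length words get_middle returns the two middle letters in reverse order ("k" before "a"). If word order matters, a small variation that puts floor() first should do; get_middle_ordered is just an illustrative name:

```r
get_middle_ordered <- function(x) {
  mid <- (length(x) + 1) / 2
  # floor() first, so the left middle letter comes before the right one
  x[unique(c(floor(mid), ceiling(mid)))]
}

words <- c("bake", "hello")
middles <- lapply(strsplit(words, ""), get_middle_ordered)
middles[[1]]
# [1] "a" "k"
```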
You could try this:
midpoint <- function(word) {
# Split the word into a vector of letters
split <- strsplit(word, "")[[1]]
# Get the number of letters in the word
n <- nchar(word)
# Get the two middle letters for words of even length,
# otherwise get the single middle letter
if (n %% 2 == 0) {
c(split[n/2], split[n/2+1])
} else {
split[ceiling(n/2)]
}
}
In the case of a word of even length, the middle two characters are returned as a vector.
midpoint("hello")
#[1] "l"
midpoint("bake")
#[1] "a" "k"
How about:
mid<-function(str)substr(str,(nchar(str)+1)%/%2,(nchar(str)+2)%/%2)
Or slightly more legibly:
mid2<-function(str){
n1<-nchar(str)+1
substr(str,n1%/%2,(n1+1)%/%2)
}
> mid("bake")
[1] "ak"
> mid("hello")
[1] "l"
This has the advantage that it immediately vectorizes:
> mid(c("bake","hello"))
[1] "ak" "l"
mid is slower than @akrun's solution for long words, but the second version (mid2) is faster; apparently counting characters can be costly for longer strings.
If you want the final product in a list, you can just strsplit the result:
mid3<-function(str)strsplit(mid2(str),"")
word = c("bake","hello")
print(nchar(word))
[1] 4 5
q = ifelse(nchar(word) %% 2 == 0,
substr(word, nchar(word)/2, nchar(word)/2 + 1),
substr(word, nchar(word)/2 + 1, nchar(word)/2 + 1))
print(q)
[1] "ak" "l"

In R, how can a string be split without using a separator

I am trying to split a string and get its second element. The string always contains exactly 2 characters.
Example:
string= "AC"
The result should be a split after the first letter ("A"), so that I get:
res= [,1] [,2]
[1,] "A" "C"
I tried it with split, but I have no idea how to split after the first element.
strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its individual characters. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
Next, split each of the three strings:
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select its second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
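As a rough illustration of the matrix step, here is one way it could look, both for a single string and for several two-character strings at once. The example strings in strs are made up:

```r
string <- "AC"
# Single string: one row, one column per character
res <- matrix(strsplit(string, "", fixed = TRUE)[[1]], nrow = 1)
res
#      [,1] [,2]
# [1,] "A"  "C"

strs <- c("AC", "GT", "TA")  # hypothetical two-character strings
# Several strings: bind the split results row by row
m <- do.call(rbind, strsplit(strs, "", fixed = TRUE))
m[, 2]
# [1] "C" "T" "A"
```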
If you are sure that every string has exactly two characters (check it with all(nchar(x)==2)) and you want only the second one, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"
