I am trying to find the longest consecutive run of vowels in a string.
I have a very long string.
random_letters <- readr::read_lines("https://byuistats.github.io/M335/data/randomletters.txt")
I have stripped the string of all spaces and periods.
random_letters %>% str_replace_all(., fixed(" "), "") %>% str_replace_all('\\.', '')
I am now trying to find every single time a vowel, or combination of vowels occurs, and then identify the longest one.
So if the string looked like
string <- c("abcnduakngoaibhui")
the output would be
"a" "ua" "oai" "ui"
In base R you could use strsplit:
strsplit(string, '[^aeiou]+')
[[1]]
[1] "a" "ua" "oai" "ui"
You could also use str_extract_all from the stringr package:
stringr::str_extract_all(string, "[aeiou]+")
[[1]]
[1] "a" "ua" "oai" "ui"
Both return a list because they are vectorized, meaning that the input can be a vector of strings:
string <- c("abcnduakngoaibhui", "aeityuioaiii")
strsplit(string, '[^aeiou]+')
[[1]]
[1] "a" "ua" "oai" "ui"
[[2]]
[1] "aei" "uioaiii"
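Neither call above picks out the longest run, which is what the question ultimately asks for. Here is a minimal sketch of that last step. It uses base R's regmatches() and gregexpr() in place of str_extract_all() so that it needs no packages. The name longest_run is made up for illustration, and ties are resolved by taking the first run of maximal length.

```r
string <- c("abcnduakngoaibhui", "aeityuioaiii")

# Hypothetical helper: longest run of vowels in each element of x.
longest_run <- function(x) {
  # One character vector of vowel runs per input string.
  runs <- regmatches(x, gregexpr("[aeiou]+", x))
  vapply(runs, function(r) {
    if (length(r) == 0) return(NA_character_)  # string has no vowels
    r[which.max(nchar(r))]                      # first run of maximal length
  }, character(1))
}

longest_run(string)
# [1] "oai"     "uioaiii"
```

Extracting the matches is also slightly more robust than splitting: strsplit("xyzab", "[^aeiou]+") returns an empty string as its first piece, because the input starts with a consonant.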
For example, there is character x = "AAATTTGGAA".
What I want to achieve is, from x, split x by consecutive letters, "AAA", "TTT", "GG", "AA".
Then, unique letters of each chunk is "A", "T", "G", "A" , so expected output is ATGA.
How should I get this?
Here is a useful regex trick approach:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
(?<=(.)) is a lookbehind that also captures the preceding character as group \\1
(?!\\1) is a lookahead that asserts the following character is different from it
You can split the string into single characters, use rle to find the consecutive runs, and keep one value per run.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
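The same rle() result also carries the run lengths, so the chunks themselves can be rebuilt from it. A small sketch, assuming R 3.3.0 or later for strrep(); the variable names are only illustrative:

```r
x <- "AAATTTGGAA"

# rle() returns the value of each run and how long it is.
runs <- rle(strsplit(x, "")[[1]])

# Repeat each value by its run length to recover the chunks.
chunks <- strrep(runs$values, runs$lengths)
chunks
# [1] "AAA" "TTT" "GG"  "AA"

# One letter per run, pasted back together.
collapsed <- paste(runs$values, collapse = "")
collapsed
# [1] "ATGA"
```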
We can use regmatches with the pattern (.)\\1+ as below (note that this pattern only matches runs of two or more identical characters)
> regmatches(x,gregexpr("(.)\\1+",x))[[1]]
[1] "AAA" "TTT" "GG" "AA"
or if you need the unique letters only
> gsub("(.)\\1+", "\\1", x)
[1] "ATGA"
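If the input may contain runs of length one, the + quantifier silently skips them when extracting. A hedged sketch of the difference, using a made-up input x2 and perl = TRUE so that the backreference is handled by PCRE:

```r
x2 <- "AAATGGA"  # made-up input that contains single-letter runs

# (.)\\1+ needs at least one repeat, so "T" and the final "A" are skipped.
repeated_only <- regmatches(x2, gregexpr("(.)\\1+", x2, perl = TRUE))[[1]]
repeated_only
# [1] "AAA" "GG"

# (.)\\1* also accepts zero repeats, so every run is returned.
all_runs <- regmatches(x2, gregexpr("(.)\\1*", x2, perl = TRUE))[[1]]
all_runs
# [1] "AAA" "T"   "GG"  "A"
```

The gsub() call above is not affected, since a single letter is simply left in place.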
In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.
If x is at the start of the string, remove x and an optional ; after it; otherwise remove x and an optional ; before it. This covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
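Both patterns above match the letter x wherever it occurs, including inside a longer name. If that matters for real gene names, a different technique is to split on the semicolon and drop the entries that equal the gene exactly. This is only a sketch in base R; remove_gene is a made-up name, not a function from stringr or stringi.

```r
# Hypothetical helper: drop every entry equal to `gene` from
# semicolon-separated strings, then re-join what is left.
remove_gene <- function(s, gene) {
  parts <- strsplit(s, ";", fixed = TRUE)
  vapply(parts,
         function(p) paste(p[p != gene], collapse = ";"),
         character(1))
}

s <- c("a;b;x", "a;x;b", "x;b", "x")
remove_gene(s, "x")
# [1] "a;b" "a;b" "b"   ""

# A name that merely contains "x" is left alone.
remove_gene("ax;x", "x")
# [1] "ax"
```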
I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result still has a leading *. It only became the first character after the removal, so it should be kept.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
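If you would rather see the indices themselves and skip the data.frame, the same cumsum() test can be turned into a logical mask and combined across the vectors. This is a sketch under the question's assumption that all vectors have equal length; edge_star and the other names are made up here.

```r
a <- strsplit(c("**aaa", "bb*bb", "cccc*"), "")

# TRUE where a position belongs to a leading or trailing run of "*".
# cumsum(!star) stays 0 until the first non-star is seen; the reversed
# version does the same from the right-hand end.
edge_star <- function(x) {
  star <- x == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# A position is dropped if it is an edge star in any of the vectors.
drop <- Reduce(`|`, lapply(a, edge_star))
which(drop)
# [1] 1 2 5

res <- lapply(a, function(x) x[!drop])
```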
First off, as Richard Scriven noted in his comment on your question, your output is not the same as what you asked for. You ask for removal of leading and trailing characters, but your desired output is just the 3rd and 4th elements of each character vector.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function from the grep family to operate directly on your strings. These functions all share one help page in R, opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
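a <- list('**aaa', 'bb*bb', 'cccc*')
# strip any run of leading stars, then any run of trailing stars
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
# named res rather than c, so that base::c() is not masked
res <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> res
[1] "aaa" "bb*bb" "cccc"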
This is doable in one regex, but I think you then need a backreference for the part in between, and I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
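As far as I can tell, the two gsub() calls can also be merged without a backreference, by joining the two anchored patterns with alternation. A minimal sketch; unlist() is only there to make the conversion from list to character vector explicit:

```r
a <- list('**aaa', 'bb*bb', 'cccc*')

# "^\\*+" matches a run of stars at the start, "\\*+$" a run at the end;
# gsub() removes whichever alternatives match and leaves inner stars alone.
trimmed <- gsub("^\\*+|\\*+$", "", unlist(a))
trimmed
# [1] "aaa"   "bb*bb" "cccc"
```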
The strsplit function in R matches a given regular expression, deletes the matches, and returns the remaining pieces of the string as a vector.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "def"
But how should I split the string the same way using regular expression, but also retain the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the rest of the string other than the matches cannot be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, nothing marks that transition. To insert a marker, you could pre-process with gsub and a back-reference (this assumes ~ does not otherwise occur in the string):
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=TRUE)
[[1]]
[1] "abc" "123" "def"
You can use strapply from gsubfn package.
test <- "abc123def"
strapply(X=test,
pattern="([^[:digit:]]*)(\\d+)(.+)",
FUN=c,
simplify=FALSE)
[[1]]
[1] "abc" "123" "def"
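For the case the question worries about, where the text between the matches cannot itself be described by a pattern, another option is to cut the string at the match positions that gregexpr() reports. A sketch in base R for a single string and a pattern that never matches the empty string; split_keep is a made-up name, not an existing function.

```r
# Hypothetical helper: split x at the boundaries of every match of
# `pattern`, keeping both the matches and the text between them.
split_keep <- function(x, pattern) {
  m <- gregexpr(pattern, x)[[1]]
  if (m[1] == -1) return(x)                      # no match: nothing to cut
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  # A piece starts at position 1, at each match, and right after each match.
  piece_starts <- sort(unique(c(1L, starts, ends + 1L)))
  piece_starts <- piece_starts[piece_starts <= nchar(x)]
  piece_ends <- c(piece_starts[-1] - 1L, nchar(x))
  substring(x, piece_starts, piece_ends)
}

split_keep("abc123def", "[0-9]+")
# [1] "abc" "123" "def"
```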
I am trying to use a split method to get the second element of a string that contains only 2 elements (the length of the string is 2).
Example:
string= "AC"
The result should be a split after the first letter ("A"), so that I get:
res=
     [,1] [,2]
[1,] "A"  "C"
I tried it with split, but I have no idea how to split after the first element.
strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
Next, split each of the three strings:
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
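A short sketch of that last step, for one string and for a vector of equal-length strings. The second input, strs, is made up for illustration.

```r
string <- "AC"

# One string: split it into characters and lay them out as a single row.
chars <- strsplit(string, "", fixed = TRUE)[[1]]
res <- matrix(chars, nrow = 1)
res
#      [,1] [,2]
# [1,] "A"  "C"

# Several strings of equal length: bind the split vectors row by row.
strs <- c("AC", "GT")
mat <- do.call(rbind, strsplit(strs, "", fixed = TRUE))
mat[, 2]  # second character of every string
# [1] "C" "T"
```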
If you are sure that every string has exactly two characters (check with all(nchar(x)==2)) and you only want the second one, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"