Extract fixed-length substrings from a string in R

I have an attribute consisting of DNA sequences and would like to translate each one to its amino acid names.
To do that, I need to split each sequence into fixed-length chunks of 3 characters.
Here is a sample of the data:
data=c("AATAGACGT","TGACCC","AAATCACTCTTT")
How can I extract it into:
[1] "AAT" "AGA" "CGT"
[2] "TGA" "CCC"
[3] "AAA" "TCA" "CTC" "TTT"
So far I can only find how to split a string given a certain regex as the separator

Try
strsplit(data, '(?<=.{3})', perl=TRUE)
Or
library(stringi)
stri_extract_all_regex(data, '.{1,3}')

Another solution, still a one-liner but less elegant than the others (using lapply):
lapply(data, function(u) substring(u, seq(1, nchar(u), 3), seq(3, nchar(u),3)))
#[[1]]
#[1] "AAT" "AGA" "CGT"
#[[2]]
#[1] "TGA" "CCC"
#[[3]]
#[1] "AAA" "TCA" "CTC" "TTT"
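
The substring() one-liner above relies on every sequence length being a multiple of 3. A minimal sketch of a reusable variant follows; the helper name split_fixed and its width argument are made up for illustration, and it assumes non-empty input strings. A trailing chunk shorter than the width is kept as-is.

```r
# Hypothetical helper: split each string into chunks of `width` characters.
# Assumes every element of x has at least one character.
split_fixed <- function(x, width = 3) {
  lapply(x, function(s) {
    starts <- seq(1, nchar(s), by = width)
    # substring() clips end positions that run past the end of the string
    substring(s, starts, starts + width - 1)
  })
}

data <- c("AATAGACGT", "TGACCC", "AAATCACTCTTT")
codons <- split_fixed(data)
```

Here codons[[2]] holds the two chunks of "TGACCC", and split_fixed("AATAG") would keep the incomplete last chunk "AG" rather than dropping it.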

as.list(gsub("(.{3})", "\\1 ", data))
[[1]]
[1] "AAT AGA CGT "
[[2]]
[1] "TGA CCC "
[[3]]
[1] "AAA TCA CTC TTT "
or
regmatches(data, gregexpr(".{3}", data))
[[1]]
[1] "AAT" "AGA" "CGT"
[[2]]
[1] "TGA" "CCC"
[[3]]
[1] "AAA" "TCA" "CTC" "TTT"

Another:
library(gsubfn)
strapply(data, "...")

Related

How to flag missing left-hand collocates with NA

I want to compute collocates of the lemma GO, including all its forms such as go, goes, gone, etc.:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
The lemma forms are stored in this vector:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
and this vector turns them into an alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
However, when using the pattern with str_extract_all to extract the immediately left-hand collocate of GO, the extraction misses out on those strings where GO is the first word in the string and reoccurs later in the string:
library(stringr)
str_extract_all(go, paste0("'?\\b[a-z']+\\b(?=\\s?", pattern_GO, ")"))
[[1]]
character(0)
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
The expected result is this:
[[1]]
[1] NA
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] NA "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
How can the extraction be mended to also return NA in the absence of a left-hand collocate?
You can add an alternation that matches either your consuming pattern or the start of the string:
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
Complete example:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
library(stringr)
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
Output:
[[1]]
[1] ""
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "" "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
If you want, you can turn all empty items into NA using
res <- str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
res <- lapply(res, function(x) ifelse(x=="", NA, x))
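
One caveat with the ifelse() call above: when every item in an element is empty, ifelse() may hand back a logical NA instead of a character one. A small sketch of an alternative using replace() with NA_character_ is below. The res list is typed in by hand as a stand-in for the str_extract_all() output shown earlier, so the snippet runs without stringr.

```r
# Stand-in for the str_extract_all() result shown above
res <- list("", "we", "he", c("", "it"), c("'m", "na"), "'s")

# Swap empty strings for NA while keeping every element a character vector
res_na <- lapply(res, function(x) replace(x, x == "", NA_character_))
```

Every element of res_na stays of type character, which keeps later steps such as unlist() or is.na() checks predictable.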

ft_tokenizer tokenizes words to lower, I want it to be as they are

I am using ft_tokenizer on a Spark data frame in R.
It tokenizes each word and converts it to lowercase, but I want the words to keep their original case.
text_data <- data_frame(
x = c("This IS a sentence", "So is this")
)
tokenized <- text_data_tbl %>%
ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
I guess it is not possible with ft_tokenizer. From ?ft_tokenizer
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
So its basic behaviour is to convert the string to lowercase and split on whitespace, which I guess cannot be changed. Consider doing
text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)
which gives the expected output, and you can continue your process from there.
text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"
#[[1]][[2]]
#[1] "IS"
#[[1]][[3]]
#[1] "a"
#[[1]][[4]]
#[1] "sentence"
#[[2]]
#[[2]][[1]]
#[1] "So"
#[[2]][[2]]
#[1] "is"
#[[2]][[3]]
#[1] "this"

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
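One option would be str_split with a lookahead, so that only the comma and space are consumed and the character after them is kept; n = 2 limits the result to two pieces: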
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
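
For comparison, here is a rough base R sketch of the same idea without stringr. It replaces the first matching ", " with a marker and then splits on that marker. The newline marker is an assumption: it only works if the input strings never contain a newline themselves.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# sub() replaces only the first ", " that is followed by a lowercase letter or digit
marked <- sub(", (?=[a-z0-9])", "\n", x, perl = TRUE)

# Split on the marker; strings without a marker come back unchanged
parts <- strsplit(marked, "\n", fixed = TRUE)
```

Because sub() touches only the first match, later commas such as the one in "other comments, and stuff" stay in the second piece, which mirrors the effect of n = 2 in str_split.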

Split a list whose elements are multiple element lists

Say I have a list a which is defined as:
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
I want to split this list by semicolon ;, get only unique values, and return another list. So far I have split the list using str_split():
a <- str_split(a, ";")
which gives me
> a
[[1]]
[1] "aaa" "bbb"
[[2]]
[1] "aaa"
[[3]]
[1] "bbb"
[[4]]
[1] "aaa" "ccc"
How can I manipulate this list (using unique()?) to give me something like
[[1]]
[1] "aaa"
[[2]]
[1] "bbb"
[[3]]
[1] "ccc"
or more simply,
[[1]]
[1] "aaa" "bbb" "ccc"
One option is to use list() with unique() and unlist() inside your list.
# So first you use your code
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
# Load required library
library(stringr) # load str_split
a <- str_split(a, ";")
# Finally use list() with unique() and unlist()
list(unique(unlist(a)))
# And the output
[[1]]
[1] "aaa" "bbb" "ccc"
One alternative in base R is rapply, which applies a function to each of the innermost elements of a nested list and, by default, returns the most simplified object possible. Here, applied to the original (unsplit) list a, it returns a character vector.
unique(rapply(a, strsplit, split=";"))
[1] "aaa" "bbb" "ccc"
To return a list, wrap the output in list():
list(unique(rapply(a, strsplit, split=";")))
[[1]]
[1] "aaa" "bbb" "ccc"
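
Both result shapes asked for in the question can also be reached with unlist() and strsplit() alone. The sketch below starts from the original list a; the names values, one_element and one_per_element are made up for illustration.

```r
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")

# Flatten the list, split on the literal semicolon, flatten again, deduplicate
values <- unique(unlist(strsplit(unlist(a), ";", fixed = TRUE)))

one_element     <- list(values)    # a single list element holding all values
one_per_element <- as.list(values) # one list element per unique value
```

unique() keeps values in order of first appearance, so no sorting step is involved.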

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter the list so that only the words with nchar > 1 are kept, like this?
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the words that have nchar greater than 1, and use Filter to remove the elements that have 0 words left:
Filter(length,lapply(name, function(x) x[nchar(x) >1 ]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"
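
If the cleaned names are needed as single strings again, the list route and the string route can be joined up. A small sketch, assuming the same x as in the question; the names kept and rejoined are made up for illustration.

```r
x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
name <- strsplit(x, " ")

# Keep words longer than one character, then drop elements left empty
kept <- Filter(length, lapply(name, function(w) w[nchar(w) > 1]))

# Paste the surviving words of each element back into one string
rejoined <- vapply(kept, paste, character(1), collapse = " ")
```

This gives the same three strings as the gsub() approach above, while going through the filtered list.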
