Suppose I have a string in R such as "aa1122ddccdsadsa".
I want to convert any string into a vector of its characters. How can I do that?
That is, given a string, I want the result to be
"a" "a" "1" "1" "2" etc.
There are actually lots of ways to do this. Here's one using strsplit and regex:
x <- c("aa1122ddccdsadsa")
strsplit(gsub("([[:alnum:]]{1})", "\\1 ", x), " ")[[1]]
[1] "a" "a" "1" "1" "2" "2" "d" "d" "c" "c" "d" "s" "a" "d" "s" "a"
You could also use substring or plyr.
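For reference, a minimal sketch of two base R alternatives, assuming plain single-code-point characters as in the example (the variable names are only for illustration):

```r
x <- "aa1122ddccdsadsa"

# Splitting on the empty string yields one element per character.
chars_split <- strsplit(x, "")[[1]]

# substring() is vectorised over start and stop, so asking for
# positions 1..n with start == stop also returns single characters.
idx <- seq_len(nchar(x))
chars_substring <- substring(x, idx, idx)

identical(chars_split, chars_substring)
```

Both give the same character vector as the gsub version, without building an intermediate string of spaces.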
Using the stri_sub function from the stringi package:
take substrings of length 1 starting at positions 1, 2, 3, ... up to the length of the string.
require(stringi)
x <- "alamakota"
stri_sub(x,from=1:stri_length(x),length = 1)
## [1] "a" "l" "a" "m" "a" "k" "o" "t" "a"
I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation (str_split comes from the stringr package):
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
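Why not extract the phonemes instead? The diphthongs and long vowels come first in the alternation, so iː is matched as a unit before the shorter i is tried.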
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R :
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
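The extraction depends on longer phonemes winning over their one-character prefixes. As a hedged sketch, the helper below (split_segments is a hypothetical name, not part of any package) sorts the inventory longest first, so the result does not depend on the order in which the vector was typed. It assumes the inventory contains no regex metacharacters.

```r
# Hypothetical helper: split text into segments drawn from a fixed inventory.
split_segments <- function(text, inventory) {
  # Longest entries first, so "iː" is tried before "i".
  ordered <- inventory[order(nchar(inventory), decreasing = TRUE)]
  pattern <- paste0(ordered, collapse = "|")
  # Characters that are not in the inventory (e.g. spaces) are skipped.
  regmatches(text, gregexpr(pattern, text))[[1]]
}

# Small inventory deliberately written with the short phonemes first.
segments <- split_segments("liːv hɪə", c("i", "ɪ", "ə", "l", "v", "h", "iː", "ɪə"))
table(segments)  # counts per phoneme
```

Passing the full phonemes vector and test1 from the question should work the same way; table() then gives the phoneme counts the question was after.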
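Changing the lookbehind to a lookahead keeps iː together, because no phoneme starts with ː. Be aware that a lookahead splits before every character that can start a phoneme, so diphthongs whose second character is itself a phoneme (such as ɪə or eə) may still be broken apart; extracting the phonemes is the more robust option.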
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))
I want to randomly disrupt the order of the letters that make up words in sentences. I can do the shuffling for single words, e.g.:
a <- "bach"
sample(unlist(str_split(a, "")), nchar(a))
[1] "h" "a" "b" "c"
but I fail to do it for sentences, e.g.:
b <- "bach composed fugues and cantatas"
What I've tried so far:
split into words:
b1 <- str_split(b, " ")
[[1]]
[1] "bach" "composed" "fugues" "and" "cantatas"
calculate the number of characters per word:
n <- lapply(b1, function(x) nchar(x))
n
[[1]]
[1] 4 8 6 3 8
split words in b1 into single letters:
b2 <- str_split(unlist(str_split(b, " ")), "")
b2
[[1]]
[1] "b" "a" "c" "h"
[[2]]
[1] "c" "o" "m" "p" "o" "s" "e" "d"
[[3]]
[1] "f" "u" "g" "u" "e" "s"
[[4]]
[1] "a" "n" "d"
[[5]]
[1] "c" "a" "n" "t" "a" "t" "a" "s"
Jumble the letters in each word based on the above:
lapply(b2, function(x) sample(unlist(x), unlist(n), replace = T))
[[1]]
[1] "h" "a" "c" "b"
[[2]]
[1] "o" "p" "o" "s"
[[3]]
[1] "g" "s" "s" "u"
[[4]]
[1] "d" "d" "a" "d"
[[5]]
[1] "c" "n" "s" "a"
That's obviously not the right result. How can I randomly jumble the sequence of letters in each word in the sentence?
After creating b2, you can randomly shuffle the characters of each word with sample and paste the words back together.
paste0(sapply(b2, function(x) paste0(sample(x), collapse = "")), collapse = " ")
#[1] "bhac moodscpe uefusg and tsatnaac"
Note that you don't need to pass a size to sample here: by default it returns a permutation of the whole input, that is, the same length as the input with replace = FALSE.
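Putting the steps together, here is a minimal base R sketch. The function name jumble_words is made up for illustration, and it assumes words are separated by single spaces.

```r
# Hypothetical helper: shuffle the letters inside each word of a sentence.
jumble_words <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  shuffled <- vapply(words, function(w) {
    letters_in_word <- strsplit(w, "")[[1]]
    # sample() without a size returns a random permutation of its input.
    paste(sample(letters_in_word), collapse = "")
  }, character(1), USE.NAMES = FALSE)
  paste(shuffled, collapse = " ")
}

set.seed(1)  # only to make the shuffle reproducible
jumble_words("bach composed fugues and cantatas")
```

Each word keeps its length and its letters; only their order changes, and the word order in the sentence is preserved.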
I am trying to remove everything up to and including a character ("-") from every element of a list.
Example:
x = list(c("a-b","b-c","c-d"),c("a-b","e-f"))
desired output:
"b" "c" "d"
"b" "f"
I have tried various combinations of lapply and gsub, such as
lapply(x,gsub,'.*-','',x)
but this just returns a list of empty strings:
[[1]]
[1] ""
[[2]]
[1] ""
And only using
gsub(".*-","",x)
returns
"d\")" "f\")"
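You are close. lapply passes each list element as the first argument of gsub, which is pattern, so the element is treated as the regex rather than as the text to modify. If you name pattern and replacement explicitly, each element falls into the remaining argument, x.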
x <- list(c("a-b","b-c","c-d"),c("a-b","e-f"))
lapply(x, gsub, pattern = "^.*-", replacement = "")
[[1]]
[1] "b" "c" "d"
[[2]]
[1] "b" "f"
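An equivalent way to avoid the argument mix-up is an anonymous function, which makes it explicit where each element goes. This is only a sketch; it uses sub instead of gsub, which gives the same result here because the greedy pattern can match at most once per string.

```r
x <- list(c("a-b", "b-c", "c-d"), c("a-b", "e-f"))

# Named arguments: each element of x fills gsub's remaining argument, x.
by_name <- lapply(x, gsub, pattern = "^.*-", replacement = "")

# Anonymous function: the element is passed as the text explicitly.
by_lambda <- lapply(x, function(el) sub("^.*-", "", el))

identical(by_name, by_lambda)
```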
This can also be done with a for loop.
val <- list()
for (i in seq_along(x)) {
  val[[i]] <- gsub(".*-", "", x[[i]])
}
val
[[1]]
[1] "b" "c" "d"
[[2]]
[1] "b" "f"
How to split this string?
sp|O00602|FCN1_HUMAN
to
[[1]]
[1]"sp","O00602","FCN1_HUMAN"
I used the following code
strsplit("sp|O00602|FCN1_HUMAN",split ="|")
However, the result I got is
[[1]]
[1] "s" "p" "|" "O" "0" "0" "6" "0" "2" "|" "F" "C" "N" "1" "_" "H" "U" "M" "A" "N"
What should I do?
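A bare | is a regex alternation between two empty patterns, which matches the empty string, so the input was split at every character. Use fixed = TRUE so that | is interpreted as a literal string, not as a regular expression:
strsplit("sp|O00602|FCN1_HUMAN", split = "|", fixed = TRUE)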
Alternatively, since "|" is a regex metacharacter, you can escape it.
strsplit("sp|O00602|FCN1_HUMAN", split = "\\|")
#[[1]]
#[1] "sp" "O00602" "FCN1_HUMAN"
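As a quick check, the sketch below compares three ways of treating | literally. The character-class form "[|]" is an extra variant not mentioned above; the variable names are only for illustration.

```r
id <- "sp|O00602|FCN1_HUMAN"

parts_fixed   <- strsplit(id, "|", fixed = TRUE)[[1]]  # no regex at all
parts_escaped <- strsplit(id, "\\|")[[1]]              # escaped metacharacter
parts_class   <- strsplit(id, "[|]")[[1]]              # literal inside a character class

identical(parts_fixed, parts_escaped) && identical(parts_fixed, parts_class)
```

fixed = TRUE is usually the simplest choice when the separator is a plain string, since nothing needs escaping.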
After defining
> Seq.genes <- as.list(c("ATGCCCAAATTTGATTT","AGAGTTCCCACCAACG"))
I have a list of strings :
> Seq.genes[1:2]
[[1]]
[1] "ATGCCCAAATTTGATTT"
[[2]]
[1] "AGAGTTCCCACCAACG"
I would like to convert it into a list of vectors:
>Seq.genes[1:2]
[[1]]
[1]"A" "T" "G" "C" "C" "C" "A" "A" "A" "T" "T" "T" "G" "A" "T" "T" "T"
[[2]]
[1] "A" "G" "A" "G" "T" "T" "C" "C" "C" "A" "C" "C" "A" "A" "C" "G"
I tried something like :
for (i in length(Seq.genes)){
x <- Seq.genes[i]
Seq.genes[i] <- substring(x, seq(1,nchar(x),2), seq(1,nchar(x),2))
}
It may be better to keep the strings in a vector rather than in a list. So we could unlist and then call strsplit:
strsplit(unlist(Seq.genes), "")
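Alternatively, apply strsplit to each list element. With sapply the result simplifies to a list of character vectors:
sapply(Seq.genes, strsplit, split = '')
With lapply each element is itself a one-element list, so the result is nested one level deeper:
lapply(Seq.genes, strsplit, split = '')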
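If you prefer to stay close to the loop from the question, here is a hedged sketch of how it could be repaired. It assumes every list element is a single string.

```r
Seq.genes <- as.list(c("ATGCCCAAATTTGATTT", "AGAGTTCCCACCAACG"))

# seq_along() visits every index; length(Seq.genes) alone is the single
# number 2, so the original loop only touched the last element.
for (i in seq_along(Seq.genes)) {
  s <- Seq.genes[[i]]        # [[ ]] extracts the string, not a one-element list
  pos <- seq_len(nchar(s))   # every position (step 1, not 2)
  Seq.genes[[i]] <- substring(s, pos, pos)
}

lengths(Seq.genes)  # number of bases per sequence
```

The result has the same shape as strsplit(unlist(Seq.genes), ""), which remains the shorter way to write it.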