How to split sp|O00602|FCN1_HUMAN with (|) - r

How to split this string?
sp|O00602|FCN1_HUMAN
to
[[1]]
[1] "sp" "O00602" "FCN1_HUMAN"
I used the following code
strsplit("sp|O00602|FCN1_HUMAN",split ="|")
However, the result I got is
[[1]]
[1] "s" "p" "|" "O" "0" "0" "6" "0" "2" "|" "F" "C" "N" "1" "_" "H" "U" "M" "A" "N"
What should I do?

You should use fixed= TRUE so that | is interpreted as a literal string, not as a regular expression:
strsplit("sp|O00602|FCN1_HUMAN",split ="|", fixed= TRUE)

The character "|" is a regex metacharacter (it means alternation, "or"), so you need to escape it:
strsplit("sp|O00602|FCN1_HUMAN", split = "\\|")
#[[1]]
#[1] "sp" "O00602" "FCN1_HUMAN"
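
As a quick check, here is a minimal sketch comparing the two approaches from the answers above with a third common option, a one-character class. The variable names are only illustrative.

```r
id <- "sp|O00602|FCN1_HUMAN"

# 1. Treat the separator as a literal string.
by_fixed <- strsplit(id, "|", fixed = TRUE)[[1]]

# 2. Escape the metacharacter (two backslashes in an R string literal).
by_escape <- strsplit(id, "\\|")[[1]]

# 3. Put it in a character class, where | has no special meaning.
by_class <- strsplit(id, "[|]")[[1]]

# All three give the same three pieces.
stopifnot(identical(by_fixed, by_escape), identical(by_escape, by_class))

by_fixed[2]  # the accession field, "O00602"
```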

Related

separating string text by "|" doesn't work

I have a variable, whose value is in string form and looks like this:
test_intro|test_wm02|test_wf06|test_lf10|t ....
When I use this command:
strsplit(df$var,"|")
I get the following output:
"t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "1" "|" "t" "e ....
which makes me think that there's something wrong with the syntax. Would appreciate if someone could point to where the problem might be?
Adding a more robust answer here because fixed = TRUE may fix this problem, but can cause other problems. The problem here is that the | character means "or" in regex, so an unescaped "|" tells strsplit to split on the empty string or the empty string. Splitting on the empty string is a special feature of strsplit that intentionally divides a string into its individual characters (which is REALLY useful sometimes).
Instead of using the fixed = TRUE argument, you can write your splitting character in regex format. In R that means you need a double backslash, because the R string parser consumes one backslash before the regex engine sees the pattern.
test <- "test_intro|test_wm02|test_wf06|test_lf10|t ...."
# The following doesn't work as expected because | is an or character in regex.
strsplit(test,"|")
# [1] "t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e" "s" "t" "_" "w" "f" "0"
# [30] "6" "|" "t" "e" "s" "t" "_" "l" "f" "1" "0" "|" "t" " " "." "." "." "."
# Escaping the | character (see regex manual) will make the code work as expected
strsplit(test,"\\|")
# [1] "test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
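
To illustrate the point about the empty pattern, and one situation where the regex form is more flexible than fixed = TRUE, here is a small sketch. The strings are made up for illustration, and the second delimiter (;) is an assumption that does not appear in the question.

```r
s <- "ab|cd"

# An unescaped "|" behaves like the empty pattern: one element per character.
same_as_empty <- identical(strsplit(s, "|"), strsplit(s, ""))

# With a regex you can split on several delimiters at once,
# which a single fixed string cannot express.
mixed <- "ab|cd;ef"
pieces <- strsplit(mixed, "[|;]")[[1]]
pieces
```

Here same_as_empty should be TRUE, and pieces holds the three fields "ab", "cd" and "ef".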
You need to set the fixed argument to TRUE:
strsplit(df$var, "|", fixed = TRUE)
Output:
"test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
If fixed is default (FALSE) then the split expression will be treated as a regular expression. Instead, you want to split by the exact character |, so fixed must be TRUE.
If you want to remove the pipe in JavaScript, you can do this:
let str = "test_intro|test_wm02|test_wf06|test_lf10|t ....";
str.split("|");
// returns an array of your string broken up, without the pipe

Split string into phonemic segments

I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
Why not extract the phonemes instead ?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R :
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
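
The extraction works because the multi-character phonemes come before their one-character prefixes in the alternation; with PCRE (perl = TRUE) and ICU (stringr), alternatives are tried from left to right. If the vector were in a different order, sorting it by length first is one way to restore that property. A minimal sketch, using a small hypothetical subset of the phonemes and a made-up word:

```r
# A deliberately badly ordered subset of the phoneme vector ("i" before "iː").
phonemes_subset <- c("i", "iː", "l", "v", "d")
word <- "liːvdi"

# Put longer phonemes first so they win over their one-character prefixes.
by_length <- phonemes_subset[order(nchar(phonemes_subset), decreasing = TRUE)]
pattern <- paste0(by_length, collapse = "|")

# perl = TRUE makes the left-to-right alternation order explicit.
segments <- regmatches(word, gregexpr(pattern, word, perl = TRUE))[[1]]
segments
```

With this ordering the long vowel is kept as the single segment "iː", while the final short "i" is still extracted on its own.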
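Changing the lookbehind to a lookahead keeps iː together, because the string is then split only where a phoneme starts, and no phoneme starts with ː. Note that diphthongs such as eə and ɪə will likely still be split, since their second character is itself a phoneme in the vector.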
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))

Splitting character vector in my data frame by "|" not working

Working on Tidy Tuesday's data set horror_movies.csv and I cannot see how to split the genres column. I tried:
fieldList <- strsplit(df$genres, "|")
Here is a sample of the output:
[1] "D" "r" "a" "m" "a" "|" " " "H" "o" "r" "r" "o" "r" "|" " " "S" "c" "i" "-" "F" "i"
[22] "|" " " "T" "h" "r" "i" "l" "l" "e" "r"
For some reason this splits my elements into individual characters. Here is a glimpse of this column so you can see how it is structured in the data frame:
$ genres <chr> "Drama| Horror| Thriller", "Horror", "Horror", "Comedy| Horror…
Is the | character special in R? What am I missing?
In a regular expression, '|' is the alternation operator meaning 'OR', and strsplit treats its split argument as a regular expression by default.
To fix this, set fixed=TRUE (it is FALSE by default) so that | is matched literally:
fieldList <- strsplit(df$genres, "|", fixed=TRUE)
Below is the documentation of the above function strsplit:
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strsplit
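
Because the genres in this column are separated by a pipe followed by a space, splitting on "|" alone leaves a leading space on every genre after the first. A minimal sketch of one way to handle that, using a made-up vector that only imitates the column shown in the question:

```r
# Hypothetical stand-in for df$genres.
genres <- c("Drama| Horror| Thriller", "Horror", "Comedy| Horror")

# Split literally on the pipe, then strip surrounding whitespace.
field_list <- lapply(strsplit(genres, "|", fixed = TRUE), trimws)

# Alternative: let the regex absorb the optional whitespace after the pipe.
field_list_regex <- strsplit(genres, "\\|\\s*")

field_list[[1]]
```

Both versions return a list with one character vector per row, so field_list[[1]] holds the three genres of the first movie without stray spaces.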

R: Create a vector with all Keyboard constants

I want to create a vector that contains every constant in a QWERTY keyboard. For now I have:
keyboard <- c(LETTERS, letters, 0:9)
I need to add an element to the vector containing all the symbols (e.g. #, !, ?, etc...) Is there an R constant (such as LETTERS for all upper case letters in the alphabet) that contains all the symbols? If not, is there a fast way of getting them without typing them one by one?
The printable ASCII characters have codes 32 through 126. We can generate that vector of codes, convert it to 'raw', then use rawToChar to turn it into a single string of the actual characters. After that we just break the string into single characters.
strsplit(rawToChar(as.raw(32:126)), "")[[1]]
which gives
> strsplit(rawToChar(as.raw(32:126)), "")[[1]]
[1] " " "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
[16] "/" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ":" ";" "<" "="
[31] ">" "?" "@" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
[46] "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "["
[61] "\\" "]" "^" "_" "`" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[76] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y"
[91] "z" "{" "|" "}" "~"
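
Since the question asks specifically for the symbols, one way to get them from this vector is to drop the letters and digits. A short sketch; the variable names are only illustrative, and note that the space character counts as a "symbol" here:

```r
# All printable ASCII characters, one per element.
printable <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]

# Remove letters and digits to keep only punctuation and the space.
symbols <- setdiff(printable, c(LETTERS, letters, as.character(0:9)))

# The full keyboard vector from the question, now including symbols.
keyboard <- c(LETTERS, letters, 0:9, symbols)
length(symbols)
```

Of the 95 printable characters, 62 are letters or digits, so symbols should have 33 elements.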

Convert string into a vector of letters in R

Suppose I have a string in R as "aa1122ddccdsadsa"
I want to convert any string into a vector of its characters. How can I do that?
That is, given a string, I want it to become
"a" "a" "1" "1" "2" etc
There are actually lots of ways to do this. Here's one using strsplit and regex:
x <- c("aa1122ddccdsadsa")
strsplit(gsub("([[:alnum:]]{1})", "\\1 ", x), " ")[[1]]
> strsplit(gsub("([[:alnum:]]{1})", "\\1 ", x), " ")[[1]]
[1] "a" "a" "1" "1" "2" "2" "d" "d" "c" "c" "d" "s" "a" "d" "s" "a"
You could also use substring or plyr.
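
For reference, two base R variants can be sketched as follows: splitting on the empty string, which is the strsplit behaviour described earlier on this page, and the substring approach mentioned above. Unlike the gsub version, neither is limited to alphanumeric characters.

```r
x <- "aa1122ddccdsadsa"

# Splitting on the empty string yields one element per character.
chars_split <- strsplit(x, "")[[1]]

# substring() is vectorised over start and stop positions.
idx <- seq_len(nchar(x))
chars_sub <- substring(x, idx, idx)

identical(chars_split, chars_sub)
```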
Using the stri_sub function from the stringi package:
Take a substring of length 1 starting at each position 1, 2, 3, ... of the string.
require(stringi)
x <- "alamakota"
stri_sub(x,from=1:stri_length(x),length = 1)
## [1] "a" "l" "a" "m" "a" "k" "o" "t" "a"
