R: Create a vector with all Keyboard constants

I want to create a vector that contains every constant in a QWERTY keyboard. For now I have:
keyboard <- c(LETTERS, letters, 0:9)
I need to add an element to the vector containing all the symbols (e.g. #, !, ?, etc...) Is there an R constant (such as LETTERS for all upper case letters in the alphabet) that contains all the symbols? If not, is there a fast way of getting them without typing them one by one?

The printable ASCII characters have codes 32 through 126. We can generate those codes, convert them to raw bytes with as.raw, turn the bytes into a single string with rawToChar, and then split that string into individual characters.
strsplit(rawToChar(as.raw(32:126)), "")[[1]]
which gives
> strsplit(rawToChar(as.raw(32:126)), "")[[1]]
[1] " " "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
[16] "/" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ":" ";" "<" "="
[31] ">" "?" "@" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
[46] "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "["
[61] "\\" "]" "^" "_" "`" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[76] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y"
[91] "z" "{" "|" "}" "~"
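To get only the symbols the question asks about, one option is to drop the letters, digits and the space from that vector. A minimal sketch in base R (the names ascii and symbols are illustrative, not from the answer):

```r
# All printable ASCII characters, built the same way as in the answer
ascii <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]

# Drop letters, digits and the space; what remains are the symbols
symbols <- setdiff(ascii, c(LETTERS, letters, 0:9, " "))
length(symbols)
# [1] 32

# The full vector the question was building
keyboard <- c(LETTERS, letters, 0:9, symbols)
```

Using setdiff avoids any dependence on how a regex character class such as [[:punct:]] is interpreted by a given engine or locale.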

Related

How to read a text file without showing the other language characters in R?

I have a .txt file looks similar to this...
*** Header Start ***
਍嘀攀爀猀椀漀渀倀攀爀猀椀猀琀㨀 ㄀ഀഀ
LevelName: Session
਍䰀攀瘀攀氀一愀洀攀㨀 䈀氀漀挀欀ഀഀ
LevelName: Trial
਍䰀攀瘀攀氀一愀洀攀㨀 匀甀戀吀爀椀愀氀ഀഀ
LevelName: LogLevel5
਍䰀攀瘀攀氀一愀洀攀㨀 䰀漀最䰀攀瘀攀氀㘀ഀഀ
LevelName: LogLevel7
਍䰀攀瘀攀氀一愀洀攀㨀 䰀漀最䰀攀瘀攀氀㠀ഀഀ
I want to read the file and remove all of the random character lines to just get the English lines as a column in a data frame. When I just read the file using read.table, readLines, scan or read.delim, I get weird output due to the characters...
[1] "ÿþ*" "H" "S" "*" "\n" "V" "1" "\n" "L" "S" "\n" "L" "B" "\n" "L" "T" "\n"
[18] "L" "S" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L"
[35] "\n" "L" "L" "\n" "E" "G" "\n" "S" "0" "\n" "S" "1" "\n" "S" "8" "4" "P"
[52] "\n" "S" "9" "\n" "S" "1" "\n" "D" "G" "\n" "R" "-" "\n" "G" "1" "\n" "D"
What can I do to fix this without manually removing the lines in all of the text files?
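No answer is recorded here, but the ÿþ at the start of the output looks like the byte order mark of a UTF-16 little-endian file, which would also explain why every other line shows up as unrelated characters. Assuming that is the case, declaring the encoding when opening the file may be all that is needed. The sketch below first writes a small UTF-16LE file (made-up content, for demonstration only) and then reads it back:

```r
# Build a tiny UTF-16LE file with a byte order mark (demo content, not the real file)
path  <- tempfile(fileext = ".txt")
txt   <- "*** Header Start ***\r\nLevelName: Session\r\n"
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
writeBin(bytes, path)

# Tell R the file's encoding when opening the connection
con   <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# One line per row, as the question asks
df <- data.frame(line = lines, stringsAsFactors = FALSE)
```

read.table and read.delim accept the same information through their fileEncoding argument. If the files turn out to use a different encoding, the value passed here would need to change accordingly.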

separating string text by "|" doesn't work

I have a variable, whose value is in string form and looks like this:
test_intro|test_wm02|test_wf06|test_lf10|t ....
When I use this command:
strsplit(df$var,"|")
I get the following output:
"t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e ....
which makes me think that there's something wrong with the syntax. Would appreciate if someone could point to where the problem might be?
Adding a more robust answer here because fixed = TRUE may fix this problem, but it also disables regular expressions for the split, which can cause other problems. The problem here is that the | character means "or" in regex, so the pattern "|" says to split the string on the empty string or the empty string. Splitting on the empty string is a special feature of strsplit that intentionally divides a string into its individual characters (which is REALLY useful sometimes).
Instead of using the fixed = TRUE argument, you can write your splitting character in regex format. In R that means you will need a double escape.
test <- "test_intro|test_wm02|test_wf06|test_lf10|t ...."
# The following doesn't work as expected because | is an or character in regex.
strsplit(test,"|")
# [1] "t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e" "s" "t" "_" "w" "f" "0"
# [30] "6" "|" "t" "e" "s" "t" "_" "l" "f" "1" "0" "|" "t" " " "." "." "." "."
# Escaping the | character (see regex manual) will make the code work as expected
strsplit(test,"\\|")
# [1] "test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
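The double escape is needed because the backslash is processed twice: once by R's string parser and once by the regex engine. A short sketch of what the regex engine actually receives (the variable name pattern is illustrative):

```r
pattern <- "\\|"

# The R string literal "\\|" holds just two characters: a backslash and a pipe
nchar(pattern)
# [1] 2
cat(pattern, "\n")
# \|

# A bracket expression is another way to make | literal, without backslashes
identical(strsplit("a|b|c", pattern), strsplit("a|b|c", "[|]"))
# [1] TRUE
```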
You need to specify that fixed is TRUE:
strsplit(df$var, "|", fixed = TRUE)
Output:
"test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
If fixed is left at its default (FALSE), the split expression is treated as a regular expression. Here you want to split on the exact character |, so fixed must be TRUE.
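The two answers give the same result for a plain | separator. Where they differ is that a regular expression can describe a more flexible separator, which fixed = TRUE cannot. A small sketch of that trade-off, where the spaced string is a made-up variant of the question's data:

```r
test <- "test_intro|test_wm02|test_wf06"

# Same result either way for a bare pipe
escaped <- strsplit(test, "\\|")[[1]]
literal <- strsplit(test, "|", fixed = TRUE)[[1]]
identical(escaped, literal)
# [1] TRUE

# Hypothetical input with stray spaces around some separators
spaced <- "test_intro | test_wm02 |test_wf06"

# A regex can absorb the optional whitespace
trimmed <- strsplit(spaced, "\\s*\\|\\s*")[[1]]
trimmed
# [1] "test_intro" "test_wm02"  "test_wf06"

# A fixed split leaves the spaces attached to the pieces
untrimmed <- strsplit(spaced, "|", fixed = TRUE)[[1]]
untrimmed
# [1] "test_intro " " test_wm02 " "test_wf06"
```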
If you want to split on the pipe in JavaScript, you can do this:
let str = "test_intro|test_wm02|test_wf06|test_lf10|t ....";
str.split("|");
// returns an array of the pieces of your string, without the pipes

Split string into phonemic segments

I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
stringr::str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
Why not extract the phonemes instead?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R:
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
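The extraction relies on the order of the alternatives: the phonemes vector lists diphthongs and long vowels before the single characters they contain. With an engine that tries alternatives from left to right (ICU in stringr, or PCRE via perl = TRUE in base R), putting i before iː would bring the original problem back. Sorting by length makes the pattern independent of how the vector was written. The sketch below uses SAMPA-style ASCII stand-ins (i: for iː, I@ for ɪə), an assumption made so that it runs in any locale:

```r
# ASCII stand-ins for a few phonemes, deliberately listed shortest first
phonemes_ascii <- c("i", "I", "@", "l", "v", "h", "i:", "I@")
word <- "li:vhI@"

# Left-to-right alternation: "i" wins before "i:" is ever tried
naive <- paste0(phonemes_ascii, collapse = "|")
res_naive <- regmatches(word, gregexpr(naive, word, perl = TRUE))[[1]]
res_naive
# [1] "l" "i" "v" "h" "I" "@"

# Longest alternatives first, so multi-character phonemes are preferred
by_length <- phonemes_ascii[order(nchar(phonemes_ascii), decreasing = TRUE)]
sorted <- paste0(by_length, collapse = "|")
res_sorted <- regmatches(word, gregexpr(sorted, word, perl = TRUE))[[1]]
res_sorted
# [1] "l"  "i:" "v"  "h"  "I@"
```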
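Changing the lookbehind to a lookahead keeps iː together, because ː never starts a phoneme. Note that a diphthong whose second character is also a phoneme on its own (for example ɪə) is still split before that character.
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))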

How to split sp|O00602|FCN1_HUMAN with (|)

How to split this string?
sp|O00602|FCN1_HUMAN
to
[[1]]
[1] "sp" "O00602" "FCN1_HUMAN"
I used the following code
strsplit("sp|O00602|FCN1_HUMAN",split ="|")
However, the result I got is
[[1]]
[1] "s" "p" "|" "O" "0" "0" "6" "0" "2" "|" "F" "C" "N" "1" "_" "H" "U" "M" "A" "N"
What should I do?
You should use fixed = TRUE so that | is interpreted as a literal string, not as a regular expression:
strsplit("sp|O00602|FCN1_HUMAN", split = "|", fixed = TRUE)
The character "|" is a meta-character, so you need to escape it.
strsplit("sp|O00602|FCN1_HUMAN", split = "\\|")
#[[1]]
#[1] "sp" "O00602" "FCN1_HUMAN"
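strsplit is vectorised, so the same call works on a whole vector of identifiers. A sketch that turns several IDs into a matrix; the second ID is a made-up placeholder in the same format, and the column names are illustrative labels rather than anything from the question:

```r
ids <- c("sp|O00602|FCN1_HUMAN",
         "sp|X00000|DEMO_HUMAN")  # second entry is a hypothetical placeholder

# One character vector of pieces per input string
parts <- strsplit(ids, "|", fixed = TRUE)

# Every ID has three fields here, so the pieces stack into a 2 x 3 matrix
fields <- do.call(rbind, parts)
colnames(fields) <- c("db", "accession", "entry")  # illustrative labels
fields[, "accession"]
# [1] "O00602" "X00000"
```

Stacking with rbind assumes every identifier splits into the same number of fields; ragged input would need to be handled before this step.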

Using strsplit with pipe separator in R [duplicate]

I have a string called line as below.
"2015-07-22|06:43:44+0000|37e86ffa-dd28-450d-aa9a-3d6776a31337|dummy|t1|USA-4DTV-DEFAULT|USA|MV000375100000|Striking Distance|MOVIE|TMS|VIEWED_MOVIE|NA|NA|NA|NA|**"
I am trying to split it on the separator "|", as:
strsplit(line, "|")
But the output is as below:
[1] "2" "0" "1" "5" "-" "0" "7" "-" "2" "2" "|" "0" "6" ":" "4" "3" ":"
[18] "4" "4" "+" "0" "0" "0" "0" "|" "3" "7" "e" "8" "6" "f" "f" "a" "-"
[35] "d" "d" "2" "8" "-" "4" "5" "0" "d" "-" "a" "a" "9" "a" "-" "3" "d"
[52] "6" "7" "7" "6" "a" "3" "1" "3" "3" "7" "|" "d" "u" "m" "m" "y" "|"
[69] "t" "1" "|" "U" "S" "A" "-" "4" "D" "T" "V" "-" "D" "E" "F" "A" "U"
[86] "L" "T" "|" "U" "S" "A" "|" "M" "V" "0" "0" "0" "3" "7" "5" "1" "0"
[103] "0" "0" "0" "0" "|" "S" "t" "r" "i" "k" "i" "n" "g" " " "D" "i" "s"
[120] "t" "a" "n" "c" "e" "|" "M" "O" "V" "I" "E" "|" "T" "M" "S" "|" "V"
[137] "I" "E" "W" "E" "D" "_" "M" "O" "V" "I" "E" "|" "N" "A" "|" "N" "A"
[154] "|" "N" "A" "|" "N" "A" "|" "*" "*"
It is not even recognizing the pipe separators.
You just need to add two backslashes before the bar:
strsplit(x, "\\|")
For example:
> x <- "Hello | Could you help me please?"
> strsplit(x, "\\|")
[[1]]
[1] "Hello " " Could you help me please?"
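Applied to the line from the question, the escaped pattern yields 17 fields. A quick check in base R (the index used below is only for illustration):

```r
line <- "2015-07-22|06:43:44+0000|37e86ffa-dd28-450d-aa9a-3d6776a31337|dummy|t1|USA-4DTV-DEFAULT|USA|MV000375100000|Striking Distance|MOVIE|TMS|VIEWED_MOVIE|NA|NA|NA|NA|**"

# Split on the literal pipe and take the single list element
fields <- strsplit(line, "\\|")[[1]]
length(fields)
# [1] 17
fields[9]
# [1] "Striking Distance"

# The text "NA" is still an ordinary string; convert it if real missing values are wanted
fields[fields == "NA"] <- NA_character_
sum(is.na(fields))
# [1] 4
```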
