I have a .txt file that looks similar to this:
*** Header Start ***
嘀攀爀猀椀漀渀倀攀爀猀椀猀琀㨀 ഀഀ
LevelName: Session
䰀攀瘀攀氀一愀洀攀㨀 䈀氀漀挀欀ഀഀ
LevelName: Trial
䰀攀瘀攀氀一愀洀攀㨀 匀甀戀吀爀椀愀氀ഀഀ
LevelName: LogLevel5
䰀攀瘀攀氀一愀洀攀㨀 䰀漀最䰀攀瘀攀氀㘀ഀഀ
LevelName: LogLevel7
䰀攀瘀攀氀一愀洀攀㨀 䰀漀最䰀攀瘀攀氀㠀ഀഀ
I want to read the file and remove all of the random-character lines so that only the English lines remain, as a column in a data frame. When I read the file using read.table, readLines, scan, or read.delim, I get odd output due to those characters:
[1] "ÿþ*" "H" "S" "*" "\n" "V" "1" "\n" "L" "S" "\n" "L" "B" "\n" "L" "T" "\n"
[18] "L" "S" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L" "\n" "L" "L"
[35] "\n" "L" "L" "\n" "E" "G" "\n" "S" "0" "\n" "S" "1" "\n" "S" "8" "4" "P"
[52] "\n" "S" "9" "\n" "S" "1" "\n" "D" "G" "\n" "R" "-" "\n" "G" "1" "\n" "D"
What can I do to fix this without manually removing the lines in all of the text files?
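One plausible cause, judging from the "ÿþ" at the start of the output, is that the file is UTF-16 little-endian (those two bytes are what a UTF-16LE byte-order mark looks like when read as a single-byte encoding). The following is only a minimal sketch under that assumption; the sample file contents and the names path, sample_text, and df are made up for illustration.

```r
# Assumption: the real file is UTF-16LE with a byte-order mark (BOM).
# Build a small sample file in that encoding so the sketch runs on its own.
path <- tempfile(fileext = ".txt")
sample_text <- "*** Header Start ***\nLevelName: Session\nLevelName: Block\n"
utf16_bytes <- iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), utf16_bytes), path)

# Declare the encoding on the connection so R decodes the bytes correctly.
con <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Drop a leading BOM character if one survived decoding.
lines <- sub("^\ufeff", "", lines)

# One row per line of the file.
df <- data.frame(line = lines, stringsAsFactors = FALSE)
print(df)
```

If the assumption holds, the "random character" lines should not appear at all once the encoding is declared, so nothing has to be filtered out afterwards.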
I have a variable, whose value is in string form and looks like this:
test_intro|test_wm02|test_wf06|test_lf10|t ....
When I use this command:
strsplit(df$var,"|")
I get the following output:
"t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "1" "|" "t" "e ....
which makes me think that there's something wrong with the syntax. I would appreciate it if someone could point out where the problem might be.
Adding a more robust answer here because fixed = TRUE may fix this problem but can cause other problems. The problem here is that the | character means "or" in regex, so the pattern "|" says: split the string on the empty string or the empty string. Splitting on the empty string is a special feature of strsplit that intentionally divides a string into its individual characters (which is really useful sometimes).
Instead of using the fixed = TRUE argument, you can write your splitting character in regex form. In an R string literal that means a double backslash: the regex \| is written "\\|".
test <- "test_intro|test_wm02|test_wf06|test_lf10|t ...."
# The following doesn't work as expected because | is an or character in regex.
strsplit(test,"|")
# [1] "t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e" "s" "t" "_" "w" "f" "0"
# [30] "6" "|" "t" "e" "s" "t" "_" "l" "f" "1" "0" "|" "t" " " "." "." "." "."
# Escaping the | character (see regex manual) will make the code work as expected
strsplit(test,"\\|")
# [1] "test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
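To make the "empty string or empty string" reading concrete, here is a small sketch (the shortened test string and the variable names are illustrative only) that compares splitting on "|" with splitting on "" directly:

```r
# Shortened, hypothetical input in the same shape as the question's string.
test <- "test_intro|test_wm02|test_wf06"

# An unescaped "|" is an alternation of two empty patterns ...
chars_via_pipe <- strsplit(test, "|")[[1]]

# ... which behaves like splitting on the empty string.
chars_via_empty <- strsplit(test, "")[[1]]

# Both calls return one element per character, pipes included.
same_result <- identical(chars_via_pipe, chars_via_empty)
print(same_result)

# Escaping the pipe gives the pieces between the separators instead.
pieces <- strsplit(test, "\\|")[[1]]
print(pieces)
```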
You need to specify that fixed is TRUE:
strsplit(df$var, "|", fixed = TRUE)
Output:
"test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
By default, fixed is FALSE, so the split expression is treated as a regular expression. Here you want to split on the literal character |, so fixed must be TRUE.
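As a sketch of how this looks on a whole column, the example below uses a made-up data frame (df, var, and the values in it are hypothetical stand-ins for the asker's data). strsplit returns a list with one character vector per row, which can then be reshaped as needed:

```r
# Hypothetical data frame standing in for the asker's df.
df <- data.frame(
  var = c("test_intro|test_wm02|test_wf06", "test_a|test_b"),
  stringsAsFactors = FALSE
)

# strsplit needs a character vector; as.character guards against a factor column.
parts <- strsplit(as.character(df$var), "|", fixed = TRUE)

# Number of pieces found in each row.
print(lengths(parts))

# Optional: one row per piece, remembering which original row it came from.
long <- data.frame(
  row   = rep(seq_along(parts), lengths(parts)),
  piece = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```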
If you want to remove the pipe in JavaScript, you can do this:
let str = "test_intro|test_wm02|test_wf06|test_lf10|t ....";
str.split("|");
// returns an array of your string broken up, without the pipe
I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
Why not extract the phonemes instead?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R:
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
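Since the stated goal is to count the phonemes, the extracted segments can be tabulated directly. The sketch below repeats the data from the question so it runs on its own; phonemes_sorted, segments, and phoneme_counts are names introduced here. Sorting the alternatives longest-first is a defensive step that is not part of the answer above: engines such as ICU (used by stringr) and PCRE try alternatives from left to right, so "iː" has to come before "i". The vector in the question already has that order.

```r
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"

phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ",       # diphthongs
              "iː","uː","ɜː","ɔː","ɑː",                      # long vowels
              "ə","ɪ","ɛ","ɒ","ʌ","æ","i","ʊ",               # short vowels
              "j","w",                                       # semi-vowels
              "r","l",                                       # approximants
              "n","m","ŋ",                                   # nasals
              "f","v","θ","ð","s","z","ʃ","ʒ","h",           # fricatives
              "ʧ","ʤ",                                       # affricates
              "p","b","t","d","k","g")                       # plosives

# Put multi-character phonemes first so they are tried before their prefixes.
phonemes_sorted <- phonemes[order(nchar(phonemes), decreasing = TRUE)]
phonemes_pattern <- paste0(phonemes_sorted, collapse = "|")

# Extract every phoneme; spaces match nothing and are skipped.
segments <- regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]

# Frequency of each phoneme, most frequent first.
phoneme_counts <- sort(table(segments), decreasing = TRUE)
print(phoneme_counts)
```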
Just changing the lookbehind to a lookahead makes it work:
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))
How to split this string?
sp|O00602|FCN1_HUMAN
to
[[1]]
[1]"sp","O00602","FCN1_HUMAN"
I used the following code
strsplit("sp|O00602|FCN1_HUMAN",split ="|")
However, the result I got is
[[1]]
[1] "s" "p" "|" "O" "0" "0" "6" "0" "2" "|" "F" "C" "N" "1" "_" "H" "U" "M" "A" "N"
What should I do?
You should use fixed = TRUE so that | is interpreted as a literal string, not as a regular expression:
strsplit("sp|O00602|FCN1_HUMAN", split = "|", fixed = TRUE)
The character "|" is a meta-character, so you need to escape it.
strsplit("sp|O00602|FCN1_HUMAN", split = "\\|")
#[[1]]
#[1] "sp" "O00602" "FCN1_HUMAN"
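The two answers above should give the same result for this input. A short sketch comparing them, together with a third spelling that puts the pipe in a character class ("[|]", not mentioned above); the variable names are illustrative only:

```r
x <- "sp|O00602|FCN1_HUMAN"

# Literal matching: the pipe is not treated as a regex at all.
by_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

# Regex matching with the pipe escaped.
by_escape <- strsplit(x, "\\|")[[1]]

# Regex matching with the pipe inside a character class, where it is literal.
by_class <- strsplit(x, "[|]")[[1]]

print(by_fixed)
print(identical(by_fixed, by_escape) && identical(by_fixed, by_class))
```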
I have a string called line as below.
"2015-07-22|06:43:44+0000|37e86ffa-dd28-450d-aa9a-3d6776a31337|dummy|t1|USA-4DTV-DEFAULT|USA|MV000375100000|Striking Distance|MOVIE|TMS|VIEWED_MOVIE|NA|NA|NA|NA|**"
I am trying to split it on the separator "|" as follows:
strsplit(line, "|")
But the output is as below:
[1] "2" "0" "1" "5" "-" "0" "7" "-" "2" "2" "|" "0" "6" ":" "4" "3" ":"
[18] "4" "4" "+" "0" "0" "0" "0" "|" "3" "7" "e" "8" "6" "f" "f" "a" "-"
[35] "d" "d" "2" "8" "-" "4" "5" "0" "d" "-" "a" "a" "9" "a" "-" "3" "d"
[52] "6" "7" "7" "6" "a" "3" "1" "3" "3" "7" "|" "d" "u" "m" "m" "y" "|"
[69] "t" "1" "|" "U" "S" "A" "-" "4" "D" "T" "V" "-" "D" "E" "F" "A" "U"
[86] "L" "T" "|" "U" "S" "A" "|" "M" "V" "0" "0" "0" "3" "7" "5" "1" "0"
[103] "0" "0" "0" "0" "|" "S" "t" "r" "i" "k" "i" "n" "g" " " "D" "i" "s"
[120] "t" "a" "n" "c" "e" "|" "M" "O" "V" "I" "E" "|" "T" "M" "S" "|" "V"
[137] "I" "E" "W" "E" "D" "_" "M" "O" "V" "I" "E" "|" "N" "A" "|" "N" "A"
[154] "|" "N" "A" "|" "N" "A" "|" "*" "*"
It is not even recognizing the pipe separators.
You just need to add two backslashes before the bar:
strsplit(x, "\\|")
For example:
> x <- "Hello | Could you help me please?"
> strsplit(x, "\\|")
[[1]]
[1] "Hello " " Could you help me please?"
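Note that the pieces in this output keep the spaces that surrounded the bar. If that is not wanted, one possible variation (a sketch, not part of the answer above; parts and parts_trimmed are names introduced here) is to include optional whitespace in the separator pattern, or to trim afterwards:

```r
x <- "Hello | Could you help me please?"

# Treat the bar plus any whitespace around it as the separator.
parts <- strsplit(x, "\\s*\\|\\s*")[[1]]
print(parts)

# Alternative: split on the literal bar, then trim each piece.
parts_trimmed <- trimws(strsplit(x, "|", fixed = TRUE)[[1]])
print(parts_trimmed)
```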