R splitting string based on double-character delimiter [duplicate]

I have a string "Test||Test1||test2" that I want to tokenize by ||. However, what I got is always the individual characters (with 2 empty chars at both ends):
"" "T" "e" "s" "t" "1" "|" "|" "T" "e" "s" "t" "2" "|" "|" "T" "e" "s" "t" "3" ""
I have tried both strsplit(myString, "||") and str_split(myString, "||") from the tidyverse (judging by this tutorial, it seems like it should work), but both give the same incorrect result.
How do I tokenize a string based on a double- or multiple-character delimiter?

We can wrap the pattern in fixed(), because | is a regex metacharacter meaning OR, so an unescaped "||" is not matched literally:
library(stringr)
str_split(myString, fixed("||"))[[1]]
#[1] "Test" "Test1" "test2"
Another option is to escape each pipe (\\|, as @joran mentioned in the comments) or to place it inside square brackets ([|]).
data
myString <- "Test||Test1||test2"
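The escaping and square-bracket options mentioned above can be sketched in base R as follows. This is a minimal illustration, not the answerer's own code; the variable names escaped, bracketed and literal are made up for the example.

```r
myString <- "Test||Test1||test2"

# Escape each pipe so the regex engine treats it as a literal character.
escaped <- strsplit(myString, "\\|\\|")[[1]]

# A character class also makes the pipe literal; {2} requires two in a row.
bracketed <- strsplit(myString, "[|]{2}")[[1]]

# Base R counterpart of stringr's fixed(): no regex interpretation at all.
literal <- strsplit(myString, "||", fixed = TRUE)[[1]]

print(escaped)
#[1] "Test"  "Test1" "test2"
```

All three calls should return the same three tokens; which one to use is mostly a matter of readability.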

Related

separating string text by "|" doesn't work

I have a variable, whose value is in string form and looks like this:
test_intro|test_wm02|test_wf06|test_lf10|t ....
When I use this command:
strsplit(df$var,"|")
I get the following output:
"t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "1" "|" "t" "e ....
which makes me think that there's something wrong with the syntax. Would appreciate if someone could point to where the problem might be?
Adding a more robust answer here because fixed = TRUE may fix this problem, but can cause other problems. The problem here is that the | character means "or" in Regex. So you are saying to split the string on blank or blank. Splitting on blank is a special feature in strsplit that intentionally divides a string into its character components (which is REALLY useful sometimes).
Instead of using the fixed = TRUE argument, you can write your splitting character in regex format. In R that means you need a double backslash to escape it.
test <- "test_intro|test_wm02|test_wf06|test_lf10|t ...."
# The following doesn't work as expected because | is an or character in regex.
strsplit(test,"|")
# [1] "t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e" "s" "t" "_" "w" "f" "0"
# [30] "6" "|" "t" "e" "s" "t" "_" "l" "f" "1" "0" "|" "t" " " "." "." "." "."
# Escaping the | character (see regex manual) will make the code work as expected
strsplit(test,"\\|")
# [1] "test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
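To make the "blank or blank" explanation concrete, here is a small sketch comparing an unescaped "|" with an empty split pattern. The sample string x and the variable names are hypothetical, chosen only for illustration.

```r
x <- "a|b|c"

# An unescaped "|" is an alternation between two empty patterns,
# so it behaves like splitting on the empty string.
as_regex <- strsplit(x, "|")[[1]]
as_empty <- strsplit(x, "")[[1]]
same_result <- identical(as_regex, as_empty)

# Escaping the pipe splits on the literal character instead.
escaped <- strsplit(x, "\\|")[[1]]

print(same_result)
print(escaped)
```

If the explanation above holds, same_result is TRUE and escaped contains the three letters without the pipes.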
You need to specify that fixed is TRUE:
strsplit(df$var, "|", fixed = TRUE)
Output:
"test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
If fixed is left at its default (FALSE), the split expression is treated as a regular expression. Here you want to split on the exact character |, so fixed must be TRUE.
If you want to remove the pipe in JavaScript, you can do this:
let str = "test_intro|test_wm02|test_wf06|test_lf10|t ....";
str.split("|");
// returns an array of your string broken up, without the pipe

Splitting character vector in my data frame by "|" not working

Working on Tidy Tuesday's data set horror_movies.csv and I cannot see how to split the genres column. I tried:
fieldList <- strsplit(df$genres, "|")
Here is a sample of the output:
[1] "D" "r" "a" "m" "a" "|" " " "H" "o" "r" "r" "o" "r" "|" " " "S" "c" "i" "-" "F" "i"
[22] "|" " " "T" "h" "r" "i" "l" "l" "e" "r"
For some reason this splits my elements into individual characters. Here is a glimpse of this column so you can see how it is structured in the data frame:
$ genres <chr> "Drama| Horror| Thriller", "Horror", "Horror", "Comedy| Horror…
Is the | character special in R? What am I missing?
In a regular expression, | is the alternation operator, meaning OR, and strsplit treats its split argument as a regular expression by default.
To fix this, set fixed = TRUE (it is FALSE by default) so that | is matched literally:
fieldList <- strsplit(df$genres, "|", fixed = TRUE)
Below is the documentation of the above function strsplit:
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strsplit
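In the glimpse of the genres column, each pipe is followed by a space, so a literal split on "|" alone would leave a leading space on most genres. A minimal sketch of one way to handle that, assuming a small hypothetical sample vector named genres rather than the real data frame:

```r
# Hypothetical sample modelled on the glimpse of the genres column.
genres <- c("Drama| Horror| Thriller", "Horror")

# Split on a literal pipe followed by any amount of whitespace,
# so the resulting genre names carry no leading space.
fieldList <- strsplit(genres, "\\|\\s*")

print(fieldList[[1]])
#[1] "Drama"    "Horror"   "Thriller"
```

An alternative under the same assumption is to keep fixed = TRUE and call trimws() on each element of the result.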

How does zero-width negative lookahead assertions work in R? [duplicate]

the output of
strsplit('abc dcf', split = '(?=c)', perl = T)
is as expected.
However, the output of
strsplit('abc dcf', split = '(?!c)', perl = T)
is
[[1]]
[1] "a" "b" "c" " " "d" "c" "f"
while my expectation is
[[1]]
[1] "a" "b" "c " "d" "cf"
because I thought it wouldn't be split if the last character of the previous chunk matches the character c. Is my understanding of negative lookahead wrong?
We can try
strsplit('abc dcf', "(?![c ])\\s*\\b", perl=TRUE)
#[[1]]
#[1] "a" "b" "c " "d" "cf"

R: how to display the first n characters from a string of words

I have the following string:
Getty <- "Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
I want to display the first 10 characters. So I began by splitting the string into individual characters:
split <- strsplit(Getty, split="")
split
I get all the individual characters at this point. Then I make a substring of the first 10 characters.
first.10 <- substr(split, start=1, stop=10)
first.10
And here is the output:
"c(\"F\", \"o\""
I don't understand why this is printed. I thought it would just print out something like:
"F" "o" "u" "r" "s"
Is there a way I can alter my code to print what I have above?
Thank you everyone!
Turn your code around and you get what you want.
Getty <- "Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
first.10 <- substr(Getty, start=1, stop=10)
first.10
#[1] "Four score"
split <- strsplit(first.10, split="")
split
#[[1]]
#[1] "F" "o" "u" "r" " " "s" "c" "o" "r" "e"
The other answers didn't eliminate the spaces as you did in your example, so I'll add this:
strsplit(substr(gsub("\\s+", "", Getty), 1, 10), '')[[1]]
#[1] "F" "o" "u" "r" "s" "c" "o" "r" "e" "a"
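You got "c(\"F\", \"o\"" because strsplit returns a list, not a character vector. substr coerces that list to a single string, the deparsed text c("F", "o", ...), and then returns its first 10 characters. We can convert the list to a vector by extracting its first element with [[1]], then use head to get the first 10 characters.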
head(strsplit(Getty, '')[[1]], 10)
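The coercion that produces the odd "c(\"F\", \"o\"" output can be reproduced on a shorter string. This is only an illustrative sketch; chars and coerced are names made up for the example.

```r
chars <- strsplit("Four score", "")   # a list of length 1, not a vector

# Coercing the list to character deparses its element into one string
# that begins with c("F", "o", ...
coerced <- as.character(chars)

# substr() then takes the first 10 characters of that deparsed text.
first_ten_of_text <- substr(coerced, 1, 10)
print(first_ten_of_text)

# Extracting the list element first gives the individual characters.
print(head(chars[[1]], 4))
#[1] "F" "o" "u" "r"
```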
Update
If you just want to extract characters without the spaces,
library(stringr)
head(str_extract_all(Getty, '[^ ]')[[1]],10)
#[1] "F" "o" "u" "r" "s" "c" "o" "r" "e" "a"

Scan without spaces in R?

How do I scan for individual chars in a .txt for R? From my understanding, scan uses whitespace as separators, but if i want to use white space as something to scan for in R how do i do this?
ie (I want to scan the string "Hello World") how do i get H,e,l,l,o, ,W,o,r,l,d ?
strsplit would also be your friend here:
test <- readLines(textConnection("Hello world
Line two"))
> strsplit(test,"")
[[1]]
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d"
[[2]]
[1] "L" "i" "n" "e" " " "t" "w" "o"
And unlisted as suggested by @Thilo...
> unlist(strsplit(test,""))
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d" "L" "i" "n" "e" " " "t" "w" "o"
I would take a two-step approach: first read the file as plain text with readLines, and then split each line into a vector of characters:
lines <- readLines("test.txt")
characterlist <- lapply(lines, function(x) substring(x, 1:nchar(x), 1:nchar(x)))
Note that this approach does not return a well formed matrix or data.frame, but a list.
Depending on what you want to do, there might be a few different modifications:
unlist(characterlist)
gives you a single vector of all characters. If your text file is so well behaved that every line has exactly the same number of characters, you can use sapply instead of lapply (lapply has no simplify argument, while sapply simplifies by default) and will get a matrix of your characters, one column per line.
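A minimal sketch of the matrix case, assuming equal-length lines and using sapply, since lapply itself has no simplify argument. The sample vector lines and the helper split_chars are hypothetical stand-ins for the contents of test.txt.

```r
# Hypothetical stand-in for readLines("test.txt") with equal-length lines.
lines <- c("abc", "def", "ghi")

# Split one string into its individual characters.
split_chars <- function(x) substring(x, seq_len(nchar(x)), seq_len(nchar(x)))

# sapply simplifies equal-length results into a matrix, one column per line.
char_matrix <- sapply(lines, split_chars, USE.NAMES = FALSE)

print(dim(char_matrix))
#[1] 3 3
```

If the lines differ in length, sapply falls back to returning a list, so the list-based approach remains the safer default.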
