Scan without spaces in R?

How do I scan for individual chars in a .txt file in R? From my understanding, scan uses whitespace as a separator, but what if I want whitespace to be one of the things I scan for?
I.e., if I want to scan the string "Hello World", how do I get H,e,l,l,o, ,W,o,r,l,d ?

strsplit would also be your friend here:
test <- readLines(textConnection("Hello world
Line two"))
strsplit(test,"")
> strsplit(test,"")
[[1]]
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d"
[[2]]
[1] "L" "i" "n" "e" " " "t" "w" "o"
And unlisted as suggested by @Thilo...
> unlist(strsplit(test,""))
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d" "L" "i" "n" "e" " " "t" "w" "o"

I would take a two-step approach: first read the file as plain text with readLines, then split each line into a vector of single characters:
lines <- readLines("test.txt")
# seq_len() keeps empty lines from producing the index vector c(1, 0)
characterlist <- lapply(lines, function(x) substring(x, seq_len(nchar(x)), seq_len(nchar(x))))
Note that this approach does not return a well formed matrix or data.frame, but a list.
Depending on what you want to do, there might be a few different modifications:
unlist(characterlist)
gives you a single vector of all characters in the file. If your text file is so well behaved that every line has exactly the same number of characters, you can swap lapply for sapply (which simplifies its result by default) and you will get a matrix with one column per line.
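As a minimal sketch of the equal-length case, assuming a throwaway file with made-up contents (the file and the names path and char_matrix are illustrative only), the per-line character vectors can be bound into a matrix with one row per line:

```r
# Write a small temporary file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("ab cd", "ef gh"), path)

lines <- readLines(path)

# An empty split pattern breaks each line into single characters; spaces are kept.
chars_by_line <- strsplit(lines, "")

# Only bind into a matrix when every line has the same number of characters.
stopifnot(length(unique(lengths(chars_by_line))) == 1)
char_matrix <- do.call(rbind, chars_by_line)

print(char_matrix)
unlink(path)
```

Here each row of char_matrix is one line of the file and each column is one character position.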

Related

Opposite of "bitwAnd" Function in R?

I found this function in R that can generate the "power set" for a set of elements:
f <- function(set) {
n <- length(set)
masks <- 2^(1:n-1)
lapply( 1:2^n-1, function(u) set[ bitwAnd(u, masks) != 0 ] )
}
results = f(LETTERS[1:5])
results = sapply(results, paste, collapse = " ")
In a previous question (Subsetting Elements in a "Hypothetical" List), I learned how to interact with very large "power sets" that the computer can not load into memory. For example - suppose I wanted to make the "power set" for all 26 letters in the English alphabet (this set would contain 2^26 = 67108864 elements). I could find out the "13626980"th element in this list without actually generating the list (since it would be impossible to generate/store such a big list):
LETTERS[bitwAnd(13626980, 2^(1:26-1)) != 0]
[1] "C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"
I had the following question : Is it possible to do the "opposite" of this task?
For example, given the sequence of letters ("C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"), can some function determine which number ("13626980") it corresponds to? Is there some hypothetical function like:
#input
> hypothetical_function(c("C", "F", "G", "J", "K", "L", "N", "O", "P", "Q", "R", "S", "T", "W", "X"))
#output
13626980
Is this possible?
Thank you!
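One possible sketch of such an inverse, not taken from the thread: since the mask for the i-th element of the set is 2^(i-1), the index is the sum of the masks of the chosen elements. The function name subset_to_index and its universe argument are hypothetical.

```r
# Map a subset back to its index in the power-set ordering used by f() above.
# Assumes every element of `subset` appears exactly once in `universe`.
subset_to_index <- function(subset, universe = LETTERS) {
  pos <- match(subset, universe)          # 1-based positions in the universe
  stopifnot(!anyNA(pos), !anyDuplicated(pos))
  sum(2^(pos - 1))                        # add up the bit mask of each element
}

chosen <- c("C", "F", "G", "J", "K", "L", "N", "O", "P", "Q", "R", "S", "T", "W", "X")
index <- subset_to_index(chosen)
print(index)

# Round trip: decoding the index with bitwAnd should give the letters back.
decoded <- LETTERS[bitwAnd(index, 2^(0:25)) != 0]
print(decoded)
```

Because the sum is held as a double, this stays exact for any universe of up to 53 elements, although bitwAnd itself only works on values that fit in a 32-bit integer.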

separating string text by "|" doesn't work

I have a variable, whose value is in string form and looks like this:
test_intro|test_wm02|test_wf06|test_lf10|t ....
When I use this command:
strsplit(df$var,"|")
I get the following output:
"t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "1" "|" "t" "e ....
which makes me think that there's something wrong with the syntax. Would appreciate if someone could point to where the problem might be?
Adding a more robust answer here because fixed = TRUE may fix this problem, but can cause other problems. The problem here is that the | character means "or" in regex, so you are telling strsplit to split the string on empty or empty. Splitting on the empty string is a special feature in strsplit that intentionally divides a string into its individual characters (which is REALLY useful sometimes).
Instead of using the fixed = TRUE argument, you can write your splitting character in regex format. In R that means you will need a double backslash escape.
test <- "test_intro|test_wm02|test_wf06|test_lf10|t ...."
# The following doesn't work as expected because | is an or character in regex.
strsplit(test,"|")
# [1] "t" "e" "s" "t" "_" "i" "n" "t" "r" "o" "|" "t" "e" "s" "t" "_" "w" "m" "0" "2" "|" "t" "e" "s" "t" "_" "w" "f" "0"
# [30] "6" "|" "t" "e" "s" "t" "_" "l" "f" "1" "0" "|" "t" " " "." "." "." "."
# Escaping the | character (see regex manual) will make the code work as expected
strsplit(test,"\\|")
# [1] "test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
You need to specify that fixed is TRUE:
strsplit(df$var, "|", TRUE)
Output:
"test_intro" "test_wm02" "test_wf06" "test_lf10" "t ...."
If fixed is default (FALSE) then the split expression will be treated as a regular expression. Instead, you want to split by the exact character |, so fixed must be TRUE.
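As a quick check, here is a small sketch (the sample string is shortened from the question) showing that a literal match, an escaped pipe, and a one-character class all split on the pipe in the same way:

```r
x <- "test_intro|test_wm02|test_wf06"

# 1. Treat the split string literally.
by_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

# 2. Escape the pipe so the regex engine sees a literal "|".
by_escape <- strsplit(x, "\\|")[[1]]

# 3. Put the pipe in a character class, where it has no special meaning.
by_class <- strsplit(x, "[|]")[[1]]

print(by_fixed)
```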
If you want to remove the pipe in JavaScript, you can do this:
let str = "test_intro|test_wm02|test_wf06|test_lf10|t ....";
str.split("|");
// returns an array of your string broken up, without the pipe

Split string into phonemic segments

I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
Why not extract the phonemes instead ?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R :
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
Just changing the lookbehind to a lookahead makes it work
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))
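Since the stated goal is to count phonemes, the extracted segments can be tabulated directly. The sketch below uses base R only, with a shortened phoneme vector and a made-up sample string (both are assumptions for illustration, not the full inventory from the question). Multi-character phonemes are listed first so that they are tried before their single-character prefixes.

```r
# Hypothetical subset of the phoneme inventory, longest items first.
phonemes_subset <- c("iː", "ɪ", "i", "l", "v", "d", "w", "k")
sample_text <- "liːv dɪd wiːk"

pattern <- paste0(phonemes_subset, collapse = "|")

# Extract every phoneme match; spaces match nothing and are simply skipped.
segments <- regmatches(sample_text, gregexpr(pattern, sample_text))[[1]]

# Frequency of each phonemic segment.
counts <- table(segments)
print(counts)
```

A split-based approach may still break a diphthong such as eə when its second character is also a phoneme on its own, which is one reason extraction may be the more robust route for counting.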

Splitting character vector in my data frame by "|" not working

Working on Tidy Tuesday's data set horror_movies.csv and I cannot see how to split the genres column. I tried:
fieldList <- strsplit(df$genres, "|")
Here is a sample of the output:
[1] "D" "r" "a" "m" "a" "|" " " "H" "o" "r" "r" "o" "r" "|" " " "S" "c" "i" "-" "F" "i"
[22] "|" " " "T" "h" "r" "i" "l" "l" "e" "r"
For some reason this splits my elements into individual characters. Here is a glimpse of this column so you can see how it is structured in the data frame:
$ genres <chr> "Drama| Horror| Thriller", "Horror", "Horror", "Comedy| Horror…
Is the | character special in R? What am I missing?
In a regular expression, '|' is the alternation operator meaning 'OR', and strsplit treats its split argument as a regular expression by default.
To fix this, set fixed=TRUE (it is FALSE by default) so that the pipe is matched literally:
fieldList <- strsplit(df$genres, "|", fixed=TRUE)
Below is the documentation of the above function strsplit:
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strsplit
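Because the genres in this column are separated by a pipe followed by a space, a literal split leaves a leading space on every genre after the first. A small sketch, assuming a made-up genres vector shaped like the glimpse in the question, strips that whitespace with trimws:

```r
# Made-up sample shaped like the genres column in the question.
genres <- c("Drama| Horror| Thriller", "Horror", "Comedy| Horror")

# Split on the literal pipe, then trim the whitespace around each piece.
parts <- strsplit(genres, "|", fixed = TRUE)
parts <- lapply(parts, trimws)

# Alternative: let the regex consume the pipe and any spaces after it.
parts_regex <- strsplit(genres, "\\|\\s*")

print(parts[[1]])
```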

R: how to display the first n characters from a string of words

I have the following string:
Getty <- "Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
I want to display the first 10 characters. So I began by splitting the string into individual characters:
split <- strsplit(Getty, split="")
split
I get all the individual characters as this point. Then I make a substring of the first 10 characters.
first.10 <- substr(split, start=1, stop=10)
first.10
And here is the output:
"c(\"F\", \"o\""
I am not understanding why this prints out? I thought it would just print out something like:
"F" "o" "u" "r" "s"
Is there a way I can alter my code to print what I have above?
Thank you everyone!
Turn your code around and you get what you want.
Getty <- "Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal."
first.10 <- substr(Getty, start=1, stop=10)
first.10
"Four score"
split <- strsplit(first.10, split="")
split
"F" "o" "u" "r" " " "s" "c" "o" "r" "e"
The other answers didn't eliminate the spaces as you did in your example, so I'll add this:
strsplit(substr(gsub("\\s+", "", Getty), 1, 10), '')[[1]]
#[1] "F" "o" "u" "r" "s" "c" "o" "r" "e" "a"
The reason you got "c(\"F\", \"o\"" is that the strsplit output is a list. We can convert the list to a vector by extracting the first list element, i.e. [[1]]. Use head to get the first 10 characters.
head(strsplit(Getty, '')[[1]], 10)
Update
If you just want to extract characters without the spaces,
library(stringr)
head(str_extract_all(Getty, '[^ ]')[[1]],10)
#[1] "F" "o" "u" "r" "s" "c" "o" "r" "e" "a"
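To see where the odd output comes from, the following sketch (with a shortened copy of the string) makes the coercion explicit: substr expects a character vector, so the list returned by strsplit is first converted to text, and the first 10 characters of that text are what get returned.

```r
Getty <- "Four score and seven years ago"
chars <- strsplit(Getty, split = "")

# strsplit always returns a list, one element per input string.
print(is.list(chars))

# substr() coerces the list to character, which yields the deparsed text
# of the vector; its first 10 characters are what the question shows.
deparsed_prefix <- substr(chars, 1, 10)
print(deparsed_prefix)

# Indexing into the list first gives the intended result.
first_10 <- chars[[1]][1:10]
print(first_10)
```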

Resources