Split string into phonemic segments - r

I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound where the two characters iand ː are also separated. Can anybody help with this?

Why not extract the phonemes instead ?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R :
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]

Just changing the lookbehind to a lookahead makes it work
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))

Related

Sampling alternating blocks of character strings (words) from two fixed lists in R

I am attempting to create a 5000 word vector composed of 500 blocks of 10 words. One block is drawn from sampling with replacement from a fixed list of animals, and this block is to alternate with a fixed list of foods. The following code yields one iteration of what I need:
anim<- data.frame(cbind(stim=list.sample(animals$WORD, 10, replace=T), cond="animal"))
food <- data.frame(cbind(stim=list.sample(foods$WORD, 10, replace=T), cond="food"))
both <- data.frame(rbind(anim, food))
This yields output as follows:
I just cannot figure out how to repeat this procedure 499 more times to create the total vector I need -- I will be running semantic distances between clusters to determine whether I can autosegment the boundaries between foods and animals. I attempted a repeat loop to no avail
Thanks for any ideas!
Since you did not provide any reproducible data, we will assume that LETTERS are food and letters are animals. This line of code generates the vector you specified. Here we are only using batches of 5 to illustrate the process:
result <- as.vector(replicate(5, c(sample(LETTERS, 5, replace=TRUE), sample(letters, 5, replace=TRUE))))
result
# [1] "H" "O" "T" "K" "J" "m" "c" "s" "u" "c" "P" "Y" "V" "U" "Y" "p" "u" "q" "k" "l" "B" "H" "U" "F" "K" "h" "v" "g"
# [29] "c" "d" "X" "F" "R" "N" "U" "v" "t" "u" "q" "x" "N" "E" "G" "Q" "L" "d" "a" "v" "e" "a"

Opposite of "bitwAnd" Function in R?

I found this function in R that can generate the "power set" for a set of elements:
f <- function(set) {
n <- length(set)
masks <- 2^(1:n-1)
lapply( 1:2^n-1, function(u) set[ bitwAnd(u, masks) != 0 ] )
}
results = f((LETTERS[1:5])
results = sapply(results, paste, collapse = " ")
In a previous question (Subsetting Elements in a "Hypothetical" List), I learned how to interact with very large "power sets" that the computer can not load into memory. For example - suppose I wanted to make the "power set" for all 26 letters in the English alphabet (this set would contain 2^26 = 67108864 elements). I could find out the "13626980"th element in this list without actually generating the list (since it would be impossible to generate/store such a big list):
LETTERS[bitwAnd(13626980, 2^(1:26-1)) != 0]
[1] "C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"
I had the following question : Is it possible to do the "opposite" of this task?
For example, given the number "13626980" - can some function determine which sequence of letters ("C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X") corresponds to? Is there some hypothetical function like:
#input
> hypothetical_function(c("C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"))
#output
13626980
Is this possible?
Thank you!

Matching between a vector and multiple vectors in a list in R

I have a list of vectors such as:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "a" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s" "i" "d"
I want to find the matches between each of them and the remaining (i.e. between the 1st and the other 2, the 2nd and the other 2, and so on) and keep the couple with the highest number of matches. I could do it with a "for" loop and intersect by couples. For example
for (i in 2:3) { intersect(list[[1]],list[[i]]) }
and then save the output into a vector or some other structure. However, this seems so inefficient to me (given than rather than 3 I have thousands) and I am wondering if R has some built-in function to do that in a clever way.
So the question would be:
Is there a way to look for matches of one vector to a list of vectors without the explicit use of a "for" loop?
I don't believe there is a built-in function for this. The best you could try is something like:
lsts <- lapply(1:5, function(x) sample(letters, 10)) # make some data (see below)
maxcomb <- which.max(apply(combs <- combn(length(lsts), 2), 2,
function(ix) length(intersect(lsts[[ix[1]]], lsts[[ix[2]]]))))
lsts <- lsts[combs[, maxcomb]]
# [[1]]
# [1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
# [[2]]
# [1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
A dump of the original:
[[1]]
[1] "z" "r" "j" "h" "e" "m" "w" "u" "q" "f"
[[2]]
[1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
[[3]]
[1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
[[4]]
[1] "c" "o" "t" "j" "d" "g" "u" "k" "w" "h"
[[5]]
[1] "f" "g" "q" "y" "d" "e" "n" "s" "w" "i"
datal <- list (a=c(2,2,1,2),
b=c(2,2,2,4,3),
c=c(1,2,3,4))
# all possible combinations
combs <- combn(length(datal), 2)
# split into list
combs <- split(combs, rep(1:ncol(combs), each = nrow(combs)))
# calculate length of intersection for every combination
intersections_length <- sapply(combs, function(y) {
length(intersect(datal[[y[1]]],datal[[y[2]]]))
}
)
# What lists have biggest intersection
combs[which(intersections_length == max(intersections_length))]

Scan without spaces in R?

How do I scan for individual chars in a .txt for R? From my understanding, scan uses whitespace as separators, but if i want to use white space as something to scan for in R how do i do this?
ie (I want to scan the string "Hello World") how do i get H,e,l,l,o, ,W,o,r,l,d ?
strsplit would also be your friend here:
test <- readLines(textConnection("Hello world
Line two"))
strsplit(test,"")
> strsplit(test,"")
[[1]]
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d"
[[2]]
[1] "L" "i" "n" "e" " " "t" "w" "o"
And unlisted as suggested by #Thilo...
> unlist(strsplit(test,""))
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d" "L" "i" "n" "e" " " "t" "w" "o"
I would go a two-step approach: First read the file as plain text with readLines and then split the single lines to vectors of characters:
lines <- readLines("test.txt")
characterlist <- lapply(a, function(x) substring(x, 1:nchar(x), 1:nchar(x)))
Note that this approach does not return a well formed matrix or data.frame, but a list.
Depending on what you want to do, there might be a few different modifications:
unlist(characterlist)
gives you a vector of all characters in a row. If your textfile is so well behaved that you have exactly the same number of characters in each line, you may just add simplify=T to lapply and hopfully will get a matrix of your characters.

Generate a sequence of characters from 'A'-'Z'

I can make a sequence of numbers like this:
s = seq(from=1, to=10, by=1)
How do I make a sequence of characters from A-Z? This doesn't work:
seq(from=1, to=10)
Use LETTERS and letters (for uppercase and lowercase respectively).
Use the code you have with letters and/or LETTERS:
> LETTERS[seq( from = 1, to = 10 )]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
> letters[seq( from = 1, to = 10 )]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Just use the predefined variables letters and LETTERS.
And for completeness, here it something using seq:
R> rawToChar(as.raw(seq(as.numeric(charToRaw('a')), as.numeric(charToRaw('z')))))
[1] "abcdefghijklmnopqrstuvwxyz"
R>
R.oo package has an intToChar function, that uses ASCII values, if LETTERS and letters aren't any good. A is 65 in ASCII:
> require(R.oo)
> intToChar(65:79)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
or you can use the fact that the lowest unicode numbers are ascii and hence intToUtf8 in R-base like this:
> intToUtf8(65:78,multiple=TRUE)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
or faff around with rawToChar:
> rawToChar(as.raw(65:78))
[1] "ABCDEFGHIJKLMN"
LETTERS returns A-Z
To generate A-E for instance
Uppercase:
> LETTERS[1:5]
Lowercase
letters[1:5]

Resources