Can I use a vector as a regex pattern parameter in R? - r

I want to search a phonetic dictionary (tsv with two columns, one for words, another for phonetic transcription: IPA) for certain consonant clusters according to the type combination (e.g. fricative+plosive, plosive+fricative, plosive+liquid, etc.). I created a vector concatenating the corresponding phonemes:
plosives <- c("p", "b", "t", "d", "k", "g")
fricatives <- c("f", "v", "s", "z", "ʂ", "ʐ", "x")
The point of writing these vectors in the first place I to shorthand and quickly reference each consonant type when writing different regexes. I want to search all two-consonant combinations from these two types (FP, PF, PP, FF). How can I write a regex in R using these vectors as pattern parameters?
I know crossing (fricatives, plosives) gives me all combinations as a string, but I get an error when using it in: CC.all <- str_extract_all(ruphondict$IPA, crossing (fricatives, plosives), simplify = T)

A base R way to form a regex.
paste(
apply(expand.grid(plosives, fricatives), 1, paste0, collapse = ""),
collapse = "|"
)
Note that this is in fact a one-liner.
paste(apply(expand.grid(plosives, fricatives), 1, paste0, collapse = ""),collapse = "|")

You need to make a |-delimited string to use as a regular expression:
plosives <- c("p", "b", "t", "d", "k", "g")
fricatives <- c("f", "v", "s", "z", "ʂ", "ʐ", "x")
my_regex <- (crossing(plosives, fricatives)
|> mutate(comb = paste0(plosives, fricatives))
|> pull(comb)
|> paste(collapse = "|")
)
[1] "bf|bs|bʂ|bv|bx|bz|bʐ|df|ds|dʂ|dv|dx|dz|dʐ|gf|gs|gʂ|gv|gx|gz|gʐ|kf|ks|kʂ|kv|kx|kz|kʐ|pf|ps|pʂ|pv|px|pz|pʐ|tf|ts|tʂ|tv|tx|tz|tʐ"

Related

R How to remap letters in a string

I’d be grateful for suggestions as to how to remap letters in strings in a map-specified way.
Suppose, for instance, I want to change all As to Bs, all Bs to Ds, and all Ds to Fs. If I do it like this, it doesn’t do what I want since it applies the transformations successively:
"abc" %>% str_replace_all(c(a = "b", b = "d", d = "f"))
Here’s a way I can do what I want, but it feels a bit clunky.
f <- function (str) str_c( c(a = "b", b = "d", c = "c", d = "f") %>% .[ strsplit(str, "")[[1]] ], collapse = "" )
"abc" %>% map_chr(f)
Better ideas would be much appreciated.
James.
P.S. Forgot to specify. Sometimes I want to replace a letter with multiple letters, e.g., replace all As with the string ZZZ.
P.P.S. Ideally, this would be able to handle vectors of strings too, e.g., c("abc", "gersgaesg", etc.)
We could use chartr in base R
chartr("abc", "bdf", "abbbce")
#[1] "bdddfe"
Or a package solution would be mgsub which would also match and replace strings with number of characters greater than 1
library(mgsub)
mgsub("abbbce", c("a", "b", "c"), c("b", "d", "f"))
#[1] "bdddfe"
mgsub("abbbce", c("a", "b", "c"), c("ba", "ZZZ", "f"))
#[1] "baZZZZZZZZZfe"
Maybe this is more elegant? It will also return warnings when values aren't found.
library(plyr)
library(tidyverse)
mappings <- c(a = "b", b = "d", d = "f")
str_split("abc", pattern = "") %>%
unlist() %>%
mapvalues(from = names(mappings), to = mappings) %>%
str_c(collapse = "")
# The following `from` values were not present in `x`: d
# [1] "bdc"

Combining mutate and filter functions

I am a beginner when it comes to R language so sorry if I am duplicating a question btw I use tidyverse packages.
My problem is at follows:
I have a dataframe in which one column looks like that
pre_schwa
IY0
SH
Z
+1500 rows
Now I need to create a column(variable) which corresposnds to this specific column. I created four vectors:
vowels <- c("AY1", "ER0", "IY0", "IY1", "UW2")
sonorants <- c("M","N", "R", "Y", "ZH", "W")
fricatives <- c("F", "S", "SH", "TH", "V", "Z")
stops <- c("B", "CH", "D", "G", "JH", "K", "P", "T")
Having this I want to create a column called sonority_grouped which would consist of four names(vowels, sonorants, fricatives, stops) depending what character is in the pre_schwa column so I want it to look like this
pre_schwa sonority_grouped
SH fricatives
ER0 vowels
B stops
Z fricative
+1500 rows
I tried combining mutate() and filter() functions by %>% but I suck at programming.
Thank you for any reponse.
You can also use case_when.
df %>%
mutate(sonority_grouped = case_when(
pre_schwa %in% vowels ~ "vowels",
pre_schwa %in% sonorants ~ "sonorants",
pre_schwa %in% fricatives ~ "fricatives",
pre_schwa %in% stops ~ "stops",
))
Data
df <- read.table(text="pre_schwa
IY0
SH
Z", header=TRUE, stringsAsFactors=FALSE)
I recommend converting your individual vectors into a data.frame via
vowels <- c("AY1", "ER0", "IY0", "IY1", "UW2")
sonorants <- c("M", "N", "R", "Y", "ZH", "W")
fricatives <- c("F", "S", "SH", "TH", "V", "Z")
stops <- c("B", "CH", "D", "G", "JH", "K", "P", "T")
patterns <- c("vowels", "sonorants", "fricatives", "stops")
df2 <- stack(mget(patterns))
Alternatively, as pointed by MrFlick, you can use lattice::make.groups(...)
df2 <- lattice::make.groups(vowels, sonorants, fricatives, stops) %>%
dplyr::rename(pre_schwa=data, sonority_grouped=which)
Then you can use dplyr::left_join to obtain your result
ans <- dplyr::left_join(df, df2, by=c("pre_schwa" = "values"))
# pre_schwa ind
# 1 IY0 vowels
# 2 SH fricatives
# 3 Z fricatives
With MrFlick's answer use
ans <- dplyr::left_join(df, df2)

search for next closest element not in a list

I am trying to replace 2 alphabets (repeats ) from vector of 26 alphabets.
I already have 13 of 26 alphabets in my table (keys), so replacement alphabets should not be among those 13 'keys'.
I am trying to write code to replace C & S by next present alphabet which should not be part of 'keys'.
The following code is replacing repeat C by D and S by T, but those both letters are in my 'keys'. Could someone know how I can implement condition so that code will re-run loop if letter to be replace is already present in 'key'?
# alphabets <- toupper(letters)
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
index_of_repeat_in_26 <- which(repeats %in% alphabets)
# index_of_repeat_in_26 is 3 , 19
# available_keys <- setdiff(alphabets,keys)
available <- alphabets[available_keys]
# available <- c("B", "F", "G", "K", "O", "Q", "U", "V", "W", "Y", "Z")
index_available_keys <- which(alphabets %in% available_keys)
# 2 6 7 11 15 17 21 22 23 25 26
for (i in 1:length(repeat)){
for(j in 1:(26-sort(index_of_repeat_in_26)[1])){
if(index_of_repeat_in_26[i]+j %in% index_available_keys){
char_to_replace_in_key[i] <- alphabets[index_of_capital_repeat_in_26[i]+1]
}
else{
cat("\n keys not available to replace \n")
}
}
}
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
y = sort(setdiff(LETTERS, keys)) # get the letters not present in 'keys'
y = factor(y, levels = LETTERS) # make them factor so that we can do numeric comparisons with the levels
y1 = as.numeric(y) # keep them numeric to compare
z = factor(repeats, levels = LETTERS)
z1 = as.numeric(z)
func <- function(x) { # so here, in each iteration, the index(in this case 1:4 gets passed)
xx = y1 - z1[x] # taking the difference between each 'repeat' element from all 'non-keys'
xx = which(xx>0)[1]# choose the one with smallest difference(because 'y1' is already sorted. So the first nearest non-key gets selected
r = y[xx] # extract the corresponding 'non-key' element
y <<- y[-xx] # after i get the closest letter, I remove that from global list so that it doesn't get captured the next time
y1 <<- y1[-xx] # similarily removed from the equivalent numeric list
r # return the extracted 'closest non-key' chracter
}
# sapply is also a for-loop by itself, in which a single element get passed ro func at a time.
# Here 'seq_along' is used to pass the index. i.e. for 'C' - 1, for 'S' - 2 , etc gets passed.
ans = sapply(seq_along(repeats), func)
if (any(is.na(ans))){
cat("\n",paste0("keys not available to replace for ",
paste0(repeats[which(is.na(ans))], collapse = ",")) ,
"\n")
ans <- ans[!is.na(ans)]
}
# example 2 with :
repeats <- c("Y", "Z")
# output :
# keys not available to replace for Z
# ans
# [1] Z
Note : to understand how each ieration of sapply() works : you should run debug(func) and then run the sapply() call. You can then check on console how each variable xx, r is getting evaluated. Hope this helps!

R , Replicating the rownames in data.frame

I have a data.frame with dimension [6587 37] and the rownames must repeat after every 18 rows. How i can do this in Rstudio.
If your 18 column names are:
mynames <- c("a", "b", "c", "d", "e", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s")
You can get what you want with:
paste0(rep(mynames,length.out=6587),rep(1:366,each=18,length.out=6587))
Or you can modify the names pasting different things.
Row names in data.frames have to be unique.
> df <- data.frame(x = 1:2)
> rownames(df) <- c("a", "a")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘a’
You could use make.names to make the names unique, but still carry some repeating information.
> make.names(c("a","a"), unique = TRUE)
[1] "a" "a.1"
These could be identified with help from grep
Or you could make a column in df or a second data.frame that holds the information

Collapse vector to string of characters with respective numbers of consequtive occurences

I would like to collapse a CIGAR vector to a CIGAR string. By CIGAR vector to String I mean the following:
I want a function that converts:
cigar.vector = c("M", "M", "I", "I", "M", "I", "", "M", "D", "D", "M", "I", "D", "M", "I")
to this:
cigar.string = "2M2I1M1I1M2D1M1I1D1M1I"
and viceversa.
Note that there is a "" (empty character), that does not count. thanks!
rle seems the obvious choice here:
rcv <- rle(cigar.vector[cigar.vector!=""])
paste0(rcv$lengths,rcv$values,collapse="")
#[1] "2M2I1M1I1M2D1M1I1D1M1I"
If you want to get fancy, you could also exploit the fact that rle gives a list of length 2:
paste(do.call(rbind,rle(cigar.vector[cigar.vector!=""])),collapse="")
#[1] "2M2I1M1I1M2D1M1I1D1M1I"
Going backwards will be impossible if only given the result (assign above to result), as it has lost information for the "" cases. Excluding those cases, you can get close enough with something like:
backwards <- rep(
unlist(strsplit(result,"\\d+"))[-1],
as.numeric(unlist(strsplit(result,"[^0-9]")))
)
identical(cigar.vector[cigar.vector!=""],backwards)
#[1] TRUE

Resources