Combining mutate and filter functions - r

I am a beginner in R, so sorry if I am duplicating a question. By the way, I use the tidyverse packages.
My problem is as follows:
I have a data frame in which one column looks like this:
pre_schwa
IY0
SH
Z
+1500 rows
Now I need to create a column (variable) that corresponds to this specific column. I created four vectors:
vowels <- c("AY1", "ER0", "IY0", "IY1", "UW2")
sonorants <- c("M","N", "R", "Y", "ZH", "W")
fricatives <- c("F", "S", "SH", "TH", "V", "Z")
stops <- c("B", "CH", "D", "G", "JH", "K", "P", "T")
Given these, I want to create a column called sonority_grouped holding one of four names (vowels, sonorants, fricatives, stops), depending on which value is in the pre_schwa column, so that it looks like this:
pre_schwa sonority_grouped
SH fricatives
ER0 vowels
B stops
Z fricatives
+1500 rows
I tried combining the mutate() and filter() functions with %>%, but I suck at programming.
Thank you for any response.

You can also use case_when.
library(dplyr)
df %>%
  mutate(sonority_grouped = case_when(
    pre_schwa %in% vowels ~ "vowels",
    pre_schwa %in% sonorants ~ "sonorants",
    pre_schwa %in% fricatives ~ "fricatives",
    pre_schwa %in% stops ~ "stops"
  ))

Data
df <- read.table(text="pre_schwa
IY0
SH
Z", header=TRUE, stringsAsFactors=FALSE)
I recommend converting your individual vectors into a single lookup data.frame with stack():
vowels <- c("AY1", "ER0", "IY0", "IY1", "UW2")
sonorants <- c("M", "N", "R", "Y", "ZH", "W")
fricatives <- c("F", "S", "SH", "TH", "V", "Z")
stops <- c("B", "CH", "D", "G", "JH", "K", "P", "T")
patterns <- c("vowels", "sonorants", "fricatives", "stops")
df2 <- stack(mget(patterns))
Alternatively, as pointed out by MrFlick, you can use lattice::make.groups(...):
df2 <- lattice::make.groups(vowels, sonorants, fricatives, stops) %>%
  dplyr::rename(pre_schwa = data, sonority_grouped = which)
With the stack() version, use dplyr::left_join to obtain your result:
ans <- dplyr::left_join(df, df2, by=c("pre_schwa" = "values"))
# pre_schwa ind
# 1 IY0 vowels
# 2 SH fricatives
# 3 Z fricatives
With MrFlick's make.groups() version, the key column is already named pre_schwa, so no by argument is needed:
ans <- dplyr::left_join(df, df2)
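The same lookup idea can be sketched in base R without a join. This is a minimal sketch, not part of the answers above: the names groups and lookup are made up for illustration, and the extra "XX" row is only there to show that unmatched values come back as NA.

```r
# Named list of sound classes (same vectors as in the question)
groups <- list(
  vowels     = c("AY1", "ER0", "IY0", "IY1", "UW2"),
  sonorants  = c("M", "N", "R", "Y", "ZH", "W"),
  fricatives = c("F", "S", "SH", "TH", "V", "Z"),
  stops      = c("B", "CH", "D", "G", "JH", "K", "P", "T")
)

# Named vector: names are the phones, values are the class labels
lookup <- setNames(
  rep(names(groups), lengths(groups)),
  unlist(groups, use.names = FALSE)
)

# Small example data; "XX" is a hypothetical value that is in no class
df <- data.frame(pre_schwa = c("IY0", "SH", "Z", "XX"),
                 stringsAsFactors = FALSE)

# Index the lookup by name; values without a class become NA
df$sonority_grouped <- unname(lookup[df$pre_schwa])
df
```

Indexing a named vector scales to the full 1500+ rows in one vectorised step, and adding a new class only means adding one entry to the list.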

Related

Can I use a vector as a regex pattern parameter in R?

I want to search a phonetic dictionary (tsv with two columns, one for words, another for phonetic transcription: IPA) for certain consonant clusters according to the type combination (e.g. fricative+plosive, plosive+fricative, plosive+liquid, etc.). I created a vector concatenating the corresponding phonemes:
plosives <- c("p", "b", "t", "d", "k", "g")
fricatives <- c("f", "v", "s", "z", "ʂ", "ʐ", "x")
The point of writing these vectors in the first place is to have a shorthand for quickly referencing each consonant type when writing different regexes. I want to search for all two-consonant combinations of these two types (FP, PF, PP, FF). How can I write a regex in R using these vectors as pattern parameters?
I know crossing(fricatives, plosives) gives me all combinations, but I get an error when using it in: CC.all <- str_extract_all(ruphondict$IPA, crossing(fricatives, plosives), simplify = T)
A base R way to form the regex:
paste(
  apply(expand.grid(plosives, fricatives), 1, paste0, collapse = ""),
  collapse = "|"
)
Note that this is in fact a one-liner:
paste(apply(expand.grid(plosives, fricatives), 1, paste0, collapse = ""), collapse = "|")
You need to make a |-delimited string to use as a regular expression:
library(dplyr)
library(tidyr)
plosives <- c("p", "b", "t", "d", "k", "g")
fricatives <- c("f", "v", "s", "z", "ʂ", "ʐ", "x")
my_regex <- (crossing(plosives, fricatives)
  |> mutate(comb = paste0(plosives, fricatives))
  |> pull(comb)
  |> paste(collapse = "|")
)
my_regex
[1] "bf|bs|bʂ|bv|bx|bz|bʐ|df|ds|dʂ|dv|dx|dz|dʐ|gf|gs|gʂ|gv|gx|gz|gʐ|kf|ks|kʂ|kv|kx|kz|kʐ|pf|ps|pʂ|pv|px|pz|pʐ|tf|ts|tʂ|tv|tx|tz|tʐ"
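Both answers above build only the plosive+fricative pairs. A minimal base R sketch of all four type combinations the question asks about (FP, PF, PP, FF) could look like the following. The sample vector ipa is invented for illustration and stands in for ruphondict$IPA.

```r
plosives   <- c("p", "b", "t", "d", "k", "g")
fricatives <- c("f", "v", "s", "z", "ʂ", "ʐ", "x")

# Pool both classes, then pair every consonant with every consonant;
# this covers PP, PF, FP and FF in one step
consonants <- c(plosives, fricatives)
pairs <- as.vector(outer(consonants, consonants, paste0))

# Join the pairs into a single alternation pattern
pattern <- paste(pairs, collapse = "|")

# Hypothetical transcriptions, used only to show the pattern in action
ipa <- c("stop", "kfa", "ama")
hits <- regmatches(ipa, gregexpr(pattern, ipa))
hits
```

To restrict the search to a single combination, pass only the relevant vectors to outer(), for example outer(fricatives, plosives, paste0) for fricative+plosive.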

Loops: How can I loop case_when function in R?

Here's the code, where I am trying to create a variable by detecting words and matching them. I use the dplyr package and its mutate function in combination with case_when. The problem is that I am adding each value manually, as you can see. How can I automate this with some kind of loop to match the two?
city <- LETTERS #26 cities
district <- letters[10:20] #11 districts
streets <- paste0(district, district)
streets <- streets[-c(5:26)] #4 streets
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
library(dplyr)
library(stringi)
df2 <- df %>%
mutate(districts = case_when(
stri_detect_fixed(address, "b") ~ "b", #address[1]
#address[2]
stri_detect_fixed(address, "a") ~ "a", #address[3]
#address[4]
stri_detect_fixed(address, "cc") ~ "cc" #address[5]
))
The code scans through address for the values in the district vector. I would love to do the same for the city and street variables. So I used a modified version of code from another Stack Overflow question. It produces an error.
for (j in town_village2) {
trn_house3[,93] <- case_when(
stri_detect_fixed(trn_house3[1:6469, 4], j) ~ j)
}
I seek to produce this result:
x address city district street
1 A, b, cc, A b cc
2 B, dd B NA dd
3 a, dd NA a dd
4 C C NA NA
5 D, a, cc D a cc
If you are going to add a loop, it makes no sense to use case_when(); you don't have to add all options into it if you can loop over them.
You can solve it with a for-loop:
library(stringi)
df2 <- df
df2[c("city", "district", "street")] <- NA_character_ # start with empty columns
for (ci in city) df2$city[stri_detect_fixed(df2$address, ci)] <- ci
for (d in district) df2$district[stri_detect_fixed(df2$address, d)] <- d
for (s in streets) df2$street[stri_detect_fixed(df2$address, s)] <- s
Note that your example code didn't work; the district names are 'a' and 'b' in your example dataset, but you generate names 'j' through 't'. I fixed that in my code above.
It will also give wrong results if the names of cities, districts and/or streets overlap. For instance, if a row is in district 'b' and on street 'cc', stri_detect_fixed will also find the 'c' inside 'cc' and conclude that the row is in district 'c'. I propose a completely different method to avoid this.
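The overlap problem can be seen in a small base R sketch. It uses grepl(fixed = TRUE) as a stand-in for stri_detect_fixed, which is an assumption made only to keep the example free of package dependencies.

```r
address <- "A, b, cc,"

# Substring search: "c" is found inside "cc", although the district is "b"
substring_hit <- grepl("c", address, fixed = TRUE)

# Exact matching on the comma-separated parts avoids the false positive
parts <- trimws(strsplit(address, ",")[[1]])
exact_hit_c  <- "c" %in% parts
exact_hit_cc <- "cc" %in% parts

c(substring = substring_hit, exact_c = exact_hit_c, exact_cc = exact_hit_cc)
```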
Alternative method
Given your example data, it makes most sense to first split the given address on commas, then look for exact matches with your reference city/district/street names. We can look for those exact matches with intersect().
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# vectorize address into elements
address_elems <- strsplit(df$address, ",") # split on commas
address_elems <- lapply(address_elems, trimws) # trim whitespace; lapply() always returns a list
Compare df$address and the newly created address_elems:
> df$address
[1] "A, b, cc," "B, dd" "a, dd" "C" "D, a, cc"
> address_elems
[[1]]
[1] "A" "b" "cc"
[[2]]
[1] "B" "dd"
[[3]]
[1] "a" "dd"
[[4]]
[1] "C"
[[5]]
[1] "D" "a" "cc"
We could find matching cities for just the first vector in address_elems with intersect(cities, address_elems[[1]]).
Because we might get multiple matches, we only take the first element, with intersect(cities, address_elems[[1]])[1]. Using [1] rather than [[1]] returns NA instead of an error when there is no match.
To apply this to every vector in address_elems, we can use sapply() or lapply():
# intersect the respective reference lists with each list of
# address items, taking only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
Putting it all together, we get:
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# create list of address elements
address_elems <- strsplit(df$address, ",") # split on commas
address_elems <- lapply(address_elems, trimws) # trim whitespace
# intersect the respective reference lists with each list of
# address items, take only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
# cleanup
rm(address_elems)
This will separate the elements into vectors:
library(tidyverse)
df <- data.frame(
x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)
df3 <-
df %>%
separate_rows(address, sep = "[, ]+") %>%
filter(nchar(address) > 0) %>%
nest(data = address) %>%
transmute(x, districts = data %>% map(~ .x[[1]]))
df3
#> # A tibble: 5 × 2
#> x districts
#> <int> <list>
#> 1 1 <chr [3]>
#> 2 2 <chr [2]>
#> 3 3 <chr [2]>
#> 4 4 <chr [1]>
#> 5 5 <chr [3]>
df3$districts[[1]]
#> [1] "A" "b" "cc"
Created on 2022-04-14 by the reprex package (v2.0.0)
A data.table approach:
library(data.table)
DT <- data.table(city, streets, district)
# create a lookup table with all elements
lookup <- melt(DT, measure.vars = names(DT))
# set df to data.table format
setDT(df)
final <- df[, .(address = unlist(tstrsplit(address, ",[ ]*", perl = TRUE))), by = .(x)]
# now add elements
final[lookup, type := i.variable, on = .(address = value)]
# and dcast to wide
dcast(final, x ~ type, value.var = "address")
# x city streets district
# 1: 1 A cc b
# 2: 2 B dd <NA>
# 3: 3 <NA> dd a
# 4: 4 C <NA> <NA>
# 5: 5 D cc a

search for next closest element not in a list

I am trying to replace 2 letters (repeats) from a vector of the 26 letters of the alphabet.
I already have 13 of the 26 letters in my table (keys), so the replacement letters should not be among those 13 keys.
I am trying to write code that replaces C and S with the next letter in the alphabet that is not part of keys.
The following code replaces the repeated C with D and S with T, but both of those letters are in my keys. Does anyone know how I can add a condition so that the loop keeps searching if the replacement letter is already present in keys?
# alphabets <- toupper(letters)
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
index_of_repeat_in_26 <- which(repeats %in% alphabets)
# index_of_repeat_in_26 is 3 , 19
# available_keys <- setdiff(alphabets,keys)
available <- alphabets[available_keys]
# available <- c("B", "F", "G", "K", "O", "Q", "U", "V", "W", "Y", "Z")
index_available_keys <- which(alphabets %in% available_keys)
# 2 6 7 11 15 17 21 22 23 25 26
for (i in 1:length(repeat)){
for(j in 1:(26-sort(index_of_repeat_in_26)[1])){
if(index_of_repeat_in_26[i]+j %in% index_available_keys){
char_to_replace_in_key[i] <- alphabets[index_of_capital_repeat_in_26[i]+1]
}
else{
cat("\n keys not available to replace \n")
}
}
}
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
y = sort(setdiff(LETTERS, keys)) # get the letters not present in 'keys'
y = factor(y, levels = LETTERS) # make them factor so that we can do numeric comparisons with the levels
y1 = as.numeric(y) # keep them numeric to compare
z = factor(repeats, levels = LETTERS)
z1 = as.numeric(z)
func <- function(x) { # x is an index into 'repeats' (here 1:2), one per iteration
  xx = y1 - z1[x] # difference between every non-key and the current 'repeats' element
  xx = which(xx>0)[1] # first positive difference; 'y1' is sorted, so this is the nearest non-key after it
  r = y[xx] # extract the corresponding non-key element
  y <<- y[-xx] # remove the chosen letter from the global list so it cannot be picked again
  y1 <<- y1[-xx] # likewise remove it from the equivalent numeric vector
  r # return the extracted closest non-key character
}
# sapply is itself a loop in which a single element gets passed to func at a time.
# Here 'seq_along' is used to pass the index, i.e. 1 for 'C', 2 for 'S', etc.
ans = sapply(seq_along(repeats), func)
if (any(is.na(ans))){
cat("\n",paste0("keys not available to replace for ",
paste0(repeats[which(is.na(ans))], collapse = ",")) ,
"\n")
ans <- ans[!is.na(ans)]
}
# example 2 with :
repeats <- c("Y", "Z")
# output :
# keys not available to replace for Z
# ans
# [1] Z
Note: to understand how each iteration of sapply() works, run debug(func) and then run the sapply() call. You can then check in the console how each variable (xx, r) is evaluated. Hope this helps!
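For comparison, the same idea can be sketched without factors or global assignment. This is an alternative sketch under the same assumptions as the answer above (replacements must come later in the alphabet and each one can be used only once); the function name next_free is made up for illustration.

```r
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")

# For each repeated letter, pick the next letter of the alphabet that is
# not a key and has not already been handed out
next_free <- function(repeats, keys) {
  free <- setdiff(LETTERS, keys)          # non-keys, in alphabetical order
  out <- character(length(repeats))
  for (i in seq_along(repeats)) {
    later <- free[match(free, LETTERS) > match(repeats[i], LETTERS)]
    out[i] <- if (length(later) > 0) later[1] else NA_character_
    free <- setdiff(free, out[i])         # do not reuse this replacement
  }
  out
}

res_cs <- next_free(c("C", "S"), keys)    # replacements for C and S
res_yz <- next_free(c("Y", "Z"), keys)    # Z has no later letter, so NA
res_cs
res_yz
```

With the keys above, C is replaced by F (D and E are keys) and S by U (T is a key). An NA in the result marks a letter for which no replacement is available.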

R , Replicating the rownames in data.frame

I have a data.frame with dimensions [6587, 37], and the row names must repeat after every 18 rows. How can I do this in RStudio?
If your 18 row names are:
mynames <- c("a", "b", "c", "d", "e", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s")
You can get what you want with:
paste0(rep(mynames,length.out=6587),rep(1:366,each=18,length.out=6587))
Or you can vary the names by pasting something different onto them.
Row names in data.frames have to be unique.
> df <- data.frame(x = 1:2)
> rownames(df) <- c("a", "a")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘a’
You could use make.names to make the names unique, but still carry some repeating information.
> make.names(c("a","a"), unique = TRUE)
[1] "a" "a.1"
The repeated names could then be identified with grep.
Or you could add a column to df, or a second data.frame, that holds the information.
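Both suggestions can be sketched on a small data frame. This is a minimal sketch: the 7-row df and the 3 base names are made up to keep the output short, standing in for the 6587 rows and 18 names of the question.

```r
df <- data.frame(value = 1:7)
base_names <- c("a", "b", "c")   # hypothetical names that should repeat
n <- nrow(df)

# Option 1: unique row names made of the repeating name plus a block number
block <- rep(seq_len(ceiling(n / length(base_names))),
             each = length(base_names), length.out = n)
rownames(df) <- paste0(rep(base_names, length.out = n), block)

# Option 2: keep the repeating label in an ordinary column instead
df$label <- rep(base_names, length.out = n)

df
```

A column is usually easier to work with than encoded row names, because it can be used directly for filtering and grouping.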

Filtering only unique value from multiple column in R

I have data like this:
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"), fac_2 = c("B", "X", "P", "Q", "C"), fac_3 = c("C", "P", "Q", "T", "U"))
fac_1 fac_2 fac_3
A B C
B X P
C P Q
X Q T
Y C U
I want only those letters which are common
(1) between fac_1 and fac_2 (B, C, X) and
(2) among all of fac_1, fac_2 and fac_3 (only C)
You can use intersect:
intersect(intersect(X$fac_1, X$fac_2), X$fac_3)
#[1] "C"
intersect(X$fac_1, X$fac_2)
#[1] "B" "C" "X"
Alternatively, the function Reduce can be used, as described by @docendo discimus in the comments.
Reduce(intersect, X)
#[1] "C"
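The Reduce approach also works for any subset of columns, which covers both parts of the question with one helper. A minimal sketch follows; the function name common_values is made up for illustration, and stringsAsFactors = FALSE is set explicitly so the columns are character vectors on older R versions too.

```r
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"),
                fac_2 = c("B", "X", "P", "Q", "C"),
                fac_3 = c("C", "P", "Q", "T", "U"),
                stringsAsFactors = FALSE)

# Values shared by all of the named columns
common_values <- function(df, cols = names(df)) {
  Reduce(intersect, df[cols])
}

common_12  <- common_values(X, c("fac_1", "fac_2"))  # fac_1 and fac_2 only
common_all <- common_values(X)                       # all three columns
common_12
common_all
```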
