Dynamic variables from dataframe values in R with value names?

Given a dataframe of types and values like so:
topic    keyword
cheese   cheddar
meat     beef
meat     chicken
cheese   swiss
bread    focaccia
bread    sourdough
cheese   gouda
My aim is to make a set of dynamic regexes based on the topic, but I don't know how to build the variable names from the topics. I can do this individually like so:
fn_get_topic_regex <- function(targettopic, df)
{
  filter_df <- df |>
    filter(topic == targettopic)
  regex <- paste(filter_df$keyword, collapse = "|")
}
and do things like:
cheese_regex <- fn_get_topic_regex("cheese",df)
But what I'd like to be able to do is build all these regexes automatically without having to define each one.
The intended output would be something like:
cheese_regex: "cheddar|swiss|gouda"
bread_regex: "focaccia|sourdough"
meat_regex: "beef|chicken"
Where the start of the variable name is the distinct topic.
What's the best way to do that without defining each regex individually by hand?

You can use dplyr's group_by() and summarise():
df %>%
  group_by(topic) %>%
  summarise(regex = paste(keyword, collapse = "|"))
# A tibble: 3 × 2
  topic  regex
  <chr>  <chr>
1 bread  focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat   beef|chicken
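If you want the topic baked into a name, as in the intended cheese_regex / bread_regex output, one option (a sketch, assuming the tibble package for deframe()) is to turn that summary into a named character vector:
library(dplyr)
library(tibble)
regex_lookup <- df %>%
  group_by(topic) %>%
  summarise(regex = paste(keyword, collapse = "|")) %>%
  deframe() %>%                          # two-column tibble -> named character vector
  setNames(paste0(names(.), "_regex"))   # append the "_regex" suffix
regex_lookup["cheese_regex"]
       cheese_regex
"cheddar|swiss|gouda"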
Or you can apply your function to every unique value in df$topic:
map_chr(unique(df$topic) %>% setNames(paste0(., "_regex")),
        fn_get_topic_regex, df = df)
cheese_regex meat_regex bread_regex
"cheddar|swiss|gouda" "beef|chicken" "focaccia|sourdough"
Just remember to add return(regex) to the end of your function, or not to assign the last line to a variable at all. I would even put everything in a single pipe chain:
fn_get_topic_regex <- function(targettopic, df)
{
  df |>
    filter(topic == targettopic) |>
    pull(keyword) |>
    paste(collapse = "|")
}
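If you really do want free-standing variables such as cheese_regex in the global environment (a named vector or list is usually easier to work with), a possible sketch is to push the named results in with list2env():
library(dplyr)
library(purrr)
regexes <- map_chr(unique(df$topic) %>% setNames(paste0(., "_regex")),
                   fn_get_topic_regex, df = df)
# creates cheese_regex, meat_regex and bread_regex as separate objects
list2env(as.list(regexes), envir = globalenv())
cheese_regex
[1] "cheddar|swiss|gouda"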

Here is a base R solution with your intended output in a named list.
df <- structure(list(topic = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
                     keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda")),
                class = "data.frame", row.names = c(NA, -7L))
#split into a list per topic
topics <- split(df, df$topic)
#collapse the keyword column
topics <- lapply(topics, function(t) {
  paste(t$keyword, collapse = "|")
})
#rename
names(topics) <- paste0(names(topics), "_regex")
topics
$bread_regex
[1] "focaccia|sourdough"
$cheese_regex
[1] "cheddar|swiss|gouda"
$meat_regex
[1] "beef|chicken"

We could do something like this: after grouping, we use summarise() together with paste() and collapse to get our regexes.
Then, when a regex is needed, we can refer to it by indexing, as in the example below:
library(dplyr)
library(stringr) #str_detect
my_regex <- df %>%
  group_by(topic) %>%
  summarise(regex = paste(keyword, collapse = "|"))

df %>%
  mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
The my_regex lookup table:
# A tibble: 3 × 2
  topic  regex
  <chr>  <chr>
1 bread  focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat   beef|chicken
And the result of the mutate() call:
   topic   keyword         new_col
1 cheese   cheddar it is not bread
2   meat      beef it is not bread
3   meat   chicken it is not bread
4 cheese     swiss it is not bread
5  bread  focaccia     it is bread
6  bread sourdough     it is bread
7 cheese     gouda it is not bread
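Indexing by position (my_regex$regex[1]) is fragile if the topic order ever changes; a safer variant of the same idea (sketch) is to look the pattern up by topic name:
bread_pattern <- my_regex$regex[my_regex$topic == "bread"]
df %>%
  mutate(new_col = ifelse(str_detect(keyword, bread_pattern), "it is bread", "it is not bread"))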

Related

What is the easiest way to clean up messy row data with dplyr?

I have a messy dataframe:
df <- data.frame(name = c('Chicken','Chicken1','ChiCKen','Chicke_N',
'Eg_g','EGG','egg'))
levels(as.factor(df$name))
[1] "Chicke_N" "Chicken" "ChiCKen" "Chicken1" "Eg_g" "egg" "EGG"
What would be the most efficient way of cleaning up these names, making them either 'chicken' or 'egg'?
I thought something like this would work:
df <- df %>% mutate(name = ifelse(name %in% c('Chicken','Chicken1','ChiCKen','Chicke_N'), 'chicken', 'egg'))
I was wrong.
This converts to lower case and removes punctuation and (contiguous) digits, so it works on the given examples:
data.frame(name = c('Chicken','Chicken1','ChiCKen','Chicke_N',
                    'Eg_g','EGG','egg')) %>%
  mutate(name = tolower(name) %>% stringr::str_remove("[[:punct:]]|\\d+"))
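Note that str_remove() only drops the first match; if a value could contain several stray characters (not the case in this example), a sketch with str_remove_all() is more robust:
library(dplyr)
data.frame(name = c('Chicken','Chicken1','ChiCKen','Chicke_N',
                    'Eg_g','EGG','egg')) %>%
  mutate(name = tolower(name) %>% stringr::str_remove_all("[[:punct:]]|\\d+"))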
For a versatile approach, you might consider joining by string distance.
Make sure to read the help files on the different methods for computing string distance (i.e. osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw and soundex).
df.valid <- data.frame(name = c("chicken", "egg"))
library(tidyverse)
library(fuzzyjoin)
df %>% stringdist_left_join(df.valid, ignore_case = TRUE, max_dist = 5)
# name.x name.y
# 1 Chicken chicken
# 2 Chicken1 chicken
# 3 ChiCKen chicken
# 4 Chicke_N chicken
# 5 Eg_g egg
# 6 EGG egg
# 7 egg egg
Another alternative is to use approximate matching by using agrepl.
df |>
mutate(cleaned = ifelse(agrepl("chicken", df$name, ignore.case = TRUE), "chicken", "egg"))
name cleaned
1 Chicken chicken
2 Chicken1 chicken
3 ChiCKen chicken
4 Chicke_N chicken
5 Eg_g egg
6 EGG egg
7 egg egg
An alternative is to use lookup tables:
df <- data.frame(name = c('Chicken','Chicken1','ChiCKen','Chicke_N',
'Eg_g','EGG','egg'))
lkup <- data.frame(name = levels(as.factor(df$name)),
clean = c('chicken','chicken','chicken','chicken',
'egg','egg','egg'))
inner_join(df,lkup,by='name') %>% mutate(name = clean) %>% select(name)
This method is more verbose, but it is useful when the name can't be matched programmatically, for example when 'bird' needs to map to 'chicken'.

Return string pattern match plus text before and after pattern

Suppose I have diary entries from 5 people and I want to determine if they mention any food-related key words. I want an output of the key word with a window of one word before and after to provide context before determining if they are food-related.
The search should be case-insensitive, and it's ok if the key word is embedded in another word. E.g., If a key word is "rice", I want to output to include "price".
Assume I have the following data:
foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:5,
                 diary = c('I ate rice and corn today',
                           'Sue ate my corn.',
                           'He just hammed it up',
                           'Corny jokes are my fave',
                           'What is the price of milk'))
The output I'm looking for is:
|ID|Output |
|--|--------------------------------|
|1 |"ate rice and", "and corn today"|
|2 |"my corn" |
|3 |"just hammed it" |
|4 |"Corny jokes" |
|5 |"the price of" |
I've used stringi::stri_detect, but the output includes the entire diary entry.
I've used stringi::stri_extract, but I can't find a way to include one word before and after the key word.
The following solution works when the same food appears multiple times in the same phrase. It is based on the splitting of each phrase into its individual words.
library(tidyverse)
extract3 <- function(txt, word)
{
str_split(txt, "\\W") %>%
unlist() %>%
{. ->> w} %>%
map(~ str_extract(.x,regex(paste0("(.)*",word,"(.)*"),ignore_case=T))) %>%
unlist() %>%
is.na() %>%
`!` %>%
which() %>%
map_chr(~ paste(
w[unique(c(max(c(.x-1,1)),.x,min(c(.x+1,length(w)))))], collapse = " ")) %>%
paste(collapse = ", ")
}
df_out <- tibble()
for (i in 1:nrow(df))
  for (j in 1:length(foods))
    df_out <- rbind(df_out,
                    tibble(id = df$id[i],
                           diary = df$diary[i],
                           output = extract3(df$diary[i], foods[j])))
df_out %>%
filter(output != "") %>%
group_by(id) %>%
mutate(output=paste(output,collapse = ", ")) %>%
ungroup() %>%
distinct()
EDITED (without for loops)
library(tidyverse)
extract3 <- function(txt, word)
{
str_split(txt, "\\W") %>%
unlist() %>%
{. ->> w} %>%
map(~ str_extract(.x,regex(paste0("(.)*",word,"(.)*"),ignore_case=T))) %>%
unlist() %>%
is.na() %>%
`!` %>%
which() %>%
map_chr(~ paste(
w[unique(c(max(c(.x-1,1)),.x,min(c(.x+1,length(w)))))], collapse = " ")) %>%
paste(collapse = ", ") %>%
str_trim()
}
map_dfr(
1:nrow(df),
function(id) map_dfr(1:length(foods), ~ tibble(
id = df$id[id],
diary = df$diary[id],
output = extract3(df$diary[id], foods[.])))) %>%
filter(output != "") %>%
group_by(id) %>%
mutate(output = paste(output,collapse = ", ")) %>%
ungroup() %>%
distinct()
We can collapse the regex and extract the words ("\w+") that precede or follow the collapsed pattern. The regex() function allows the argument ignore_case = TRUE, which is very useful for case-insensitive matching. We may have to include optional word boundaries around the collapsed pattern, so that both rice and price, and ham and hammed, are included.
I made some small changes to the data to make it more illustrative.
I posted two answers.
One will exclude matches inside larger words, such as "hammed" or "price", so non-food matches will return empty strings.
The other is more inclusive.
library(dplyr)
library(stringr)
df %>% mutate(Output = str_extract_all(diary,
                                       regex(paste0("\\w+\\s+(",
                                                    paste("\\b", foods, "\\b", collapse = "|", sep = ''),
                                                    ")\\s+\\w+"),
                                             ignore_case = TRUE)))
Solution 1 output:
id diary Output
1 1 I ate rice and corn today ate rice and
2 2 Sue ate my corn.
3 3 He just hammed it up
4 4 Corny jokes are my fave
5 5 What is the price of milk
6 6 I like to eat ham sandwiches eat ham sandwiches
Solution 2:
df %>% mutate(Output = str_extract_all(diary,
                                       regex(paste0("\\w+\\s+(",
                                                    paste("\\b\\w*", foods, "\\w*\\b", collapse = "|", sep = ''),
                                                    ")\\s+\\w+"),
                                             ignore_case = TRUE)))
id diary Output
1 1 I ate rice and corn today ate rice and
2 2 Sue ate my corn.
3 3 He just hammed it up just hammed it
4 4 Corny jokes are my fave
5 5 What is the price of milk the price of
6 6 I like to eat ham sandwiches eat ham sandwiches
data
foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:6,
diary = c('I ate rice and corn today',
'Sue ate my corn.',
'He just hammed it up',
'Corny jokes are my fave',
'What is the price of milk',
'I like to eat ham sandwiches'))
FINAL EDIT
I figured out the problem with "corn" and handled the multiple-matches issue.
We have to do a nested loop: first loop through all entries in "diary" (outer loop); then, in the inner loop, loop through all "foods" and call str_extract_all with the appropriate regex. The initial regex required a food word to be preceded or followed by another word, so foods at sentence boundaries were not matched. I included a ? quantifier (0 or 1 matches) around the surrounding words (\\w+\\s+) so it all works smoothly. The only issue left is the order of the matches when there are multiple matches; it is still a bit odd, but I think the solution is fine now.
df %>% mutate(output = map(df$diary,
                           ~ map(foods, \(x) str_extract_all(.x,
                                   regex(paste0("(\\w+\\s+)?(",
                                                paste("\\b\\w*", x, "\\w*\\b", collapse = "|", sep = ''),
                                                ")(\\s+\\w+)?"),
                                         ignore_case = TRUE)))) %>%
                             map(unlist))
id diary output
1 1 I ate rice and corn today and corn today, ate rice and
2 2 Sue ate my corn. my corn
3 3 He just hammed it up just hammed it
4 4 Corny jokes are my fave Corny jokes
5 5 What is the price of milk the price of
6 6 I like to eat ham sandwiches eat ham sandwiches
Not entirely sure whether that's 100% helpful but worth a try:
First, define your keywords as a case-insensitive alternation pattern:
patt <- paste0("(?i)(", paste0(foods, collapse = "|"), ")")
Then extract the word on the left, the keyword itself (called node), and the word on the right, using stringr's str_extract_all():
library(stringr)
df1 <- data.frame(
left = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?i)(\\S+|^)(?=\\s?", patt, ")"))),
node = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), patt)),
right = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?<=\\s?", patt, "\\s?)(\\S+|$)")))
)
Result:
df1
left node right
1 ate rice and
2 and corn today
3 my corn
4 just ham med
5 Corn y
6 p rice of
While this is not exactly the expected output, it may still serve your purpose iff that purpose is to check whether a match is indeed a keyword. In rows 5 and 6, for example, the view provided by df1 immediately makes it clear that these are not keyword matches.
EDIT:
This solution preserves the id values:
library(tidyverse)
library(purrr)
extract_ <- function(df_row){
df1 <- data.frame(
id = df_row$id,
left = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), paste0("(?i)(\\S+|^)(?=\\s?", patt, ")"))),
node = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), patt)),
right = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), paste0("(?<=\\s?", patt, "\\s?)(\\S+|$)")))
)
}
df %>%
  group_split(id) %>%        # splits data frame into a list of bins, i.e. by id
  map_dfr(~ extract_(.x))    # now we iterate over the bins with our function
id left node right
1 1 ate rice and
2 1 and corn today
3 2 my corn
4 3 just ham med
5 4 Corn y
6 5 p rice of

R / tidyverse - find intersect between multiple character columns

I have the following problem: I have a tibble with multiple character columns.
I tried to provide an MRE below:
library(tidyverse)
df <- tibble(food = c("pizza, bread, apple","joghurt, cereal, banana"),
food2 = c("bread, sausage, strawberry", "joghurt, oat, bacon"),
food3 = c("ice cream, bread, milkshake", "melon, cake, joghurt")
)
df %>%
  # rowwise() %>%
  mutate(allcolumns = map2(
    str_split(food, ", "),
    str_split(food2, ", "),
    # str_split(food3, ", "),
    intersect
  ) %>% unlist()
  ) -> df_new
My goal would be to get the common words for all columns. Words are separated by ", " in the columns. In the MRE I am able to find the intersect between two columns, but I couldn't get a solution for all of them. I experimented with Reduce but was not able to get it to work.
As an EDIT: I would also like to append the result as a new row to the existing tibble.
We may use map() to loop over the columns and do the str_split(), then transpose() and reduce() to get the element-wise intersect:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  purrr::map(str_split, ", ") %>%
  transpose() %>%
  purrr::map_chr(reduce, intersect) %>%
  mutate(df, Intersect = .)
Output:
# A tibble: 2 x 4
food food2 food3 Intersect
<chr> <chr> <chr> <chr>
1 pizza, bread, apple bread, sausage, strawberry ice cream, bread, milkshake bread
2 joghurt, cereal, banana joghurt, oat, bacon melon, cake, joghurt joghurt
Or we may also use pmap():
df %>%
  mutate(Intersect = pmap(across(everything(), str_split, ", "),
                          ~ list(...) %>%
                            reduce(intersect)))
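Note that the pmap() variant stores the result in a list-column; if a plain character column is preferred, a sketch of the same idea with pmap_chr() and a final collapse:
library(dplyr)
library(purrr)
library(stringr)
df %>%
  mutate(Intersect = pmap_chr(across(everything(), ~ str_split(.x, ", ")),
                              ~ paste(reduce(list(...), intersect), collapse = ", ")))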

R: Custom Function - Mutate Existing Column

I'd like to create a custom function to try and standardise strings in multiple different columns, in multiple different data frames, with the ultimate intention of joining data from them together.
In order to do this, I'd like to be able to pass a column name into a custom function and have the function carry out operations on that column. With the example beneath, I'd like to clean columns a and c before joining them together to look like this:
library(tidyverse)
df1 <- tibble(a = c("apple & pear", "kiwi", "plum"), b = c("cat", "dog", "cow"))
df2 <- tibble(c = c("apple and pear", "kiwi.", "plum"), d = c("car", "bike", "truck"))
full_join(df1, df2, by = c("a" = "c") )
a b d
1 apple & pear cat car
2 kiwi dog bike
3 plum cow truck
Instead of how it currently turns out like, which is this:
# A tibble: 5 x 3
a b d
1 apple & pear cat NA
2 kiwi dog NA
3 plum cow truck
4 apple and pear NA car
5 kiwi. NA bike
To do this, I know I need to build custom functions, which I'm relatively inexperienced at, especially with curly-curly. The two functions beneath should change the symbols and remove the trailing punctuation; ideally these would be combined into one function, with the flexibility to add more operations if necessary, like this:
add_symbol <- function(col.name){
mutate({{col.name}} = gsub(" & ", " and ", {{col.name}}))
}
rm_trail_punc <- function(col.name){
mutate({{col.name}} = gsub("[[:punct:]]$", "", {{col.name}}))
}
standardise_col <- function(df, col.name){
df %>%
add_symbol({{col.name}}) %>%
rm_trail_punc({{col.name}})
}
df1 <- standardise_col(df1)
standardise_col(df2) %>%
full_join(., df1, by = c("a" = "c"))
However, these functions can't be created, and return an error unexpected '=' because the column name can't be passed to the left-hand side of the equal sign. Is there any way of passing these values to the mutate without hard-coding them?
I think you can achieve this more simply with the following:
library(dplyr)
clean_func <- function(df){
df %>% mutate(across(everything(), ~gsub(" & ", " and ", .) %>%
gsub("[[:punct:]]$", "", .)))
}
df1 <- clean_func(df1)
df2 <- clean_func(df2)
You can make updates to the function by adding additional gsub, str_replace, or other calls as needed.
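For instance, a sketch that also squashes stray internal whitespace (str_squish() is from stringr; whether you need it depends on your real data):
library(dplyr)
library(stringr)
clean_func <- function(df){
  df %>% mutate(across(everything(), ~ gsub(" & ", " and ", .) %>%
                         gsub("[[:punct:]]$", "", .) %>%
                         str_squish()))
}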
Edit:
Based on the update, you can do something like this to target your variables specifically:
add_symbol <- function(col.name){
  gsub(" & ", " and ", col.name)
}
rm_trail_punc <- function(col.name){
  gsub("[[:punct:]]$", "", col.name)
}
standardise_col <- function(df, col.name){
  col.name <- enquo(col.name)
  df %>%
    mutate(!!col.name := add_symbol(!!col.name),
           !!col.name := rm_trail_punc(!!col.name))
}
Your code won't ever work as written, but you could do something like this:
new_df <- standardise_col(df1, a) %>%
left_join(., standardise_col(df2, c), by = c("a"="c"))
Which gives us:
# A tibble: 3 x 3
a b d
<chr> <chr> <chr>
1 apple and pear cat car
2 kiwi dog bike
3 plum cow truck
You can read up on tidy evaluation here: https://tidyeval.tidyverse.org/dplyr.html
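As an aside, with current dplyr the curly-curly attempt from the question can be made to work by using rlang's := operator instead of = on the left-hand side of the mutate. A sketch:
library(dplyr)
standardise_col <- function(df, col.name){
  df %>%
    mutate({{ col.name }} := gsub(" & ", " and ", {{ col.name }})) %>%
    mutate({{ col.name }} := gsub("[[:punct:]]$", "", {{ col.name }}))
}
full_join(standardise_col(df1, a), standardise_col(df2, c), by = c("a" = "c"))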
As noted in the comment by #1k monkeys and a single PC, your example data differ from what you show, so the results could differ too, but let's assume you have some data like this:
df1 <- tibble(a = c("apple & pear", "kiwi", "plum"),
b = c("cat","dog","cow"))
df2 <- tibble(c = c("apple and pear", "kiwi.", "orange"),
d = c("truck","bike","car"))
You can manage to use the package fuzzyjoin to merge them:
library(fuzzyjoin)
library(dplyr)
df1 %>%
stringdist_full_join(df2, by = c(a = "c") ,
max_dist = 3,
distance_col = "DIST")
# A tibble: 4 x 5
a b c d DIST
<chr> <chr> <chr> <chr> <dbl>
1 apple & pear cat apple and pear truck 3
2 kiwi dog kiwi. bike 1
3 plum cow <NA> <NA> NA
4 <NA> <NA> orange car NA
The result is different because I've based the data on your example, and "plum" and "orange" don't match (so cow and car are not aligned). With a select() you can keep just the columns you need, or with mutate() you can rename them.

replace symbols, in factors, in a data frame, with dplyr mutate

I have a data frame, and for various reasons I need to keep one of the elements as a factor and, maintaining the order of the levels, replace periods in the levels with spaces. Here's an example
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich)) %>%
  mutate(sandwich2 = factor(sandwich_factor,
                            levels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
  mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
# A tibble: 5 x 4
   sandwich_str sandwich_factor sandwich2     sandwich3
          <chr>          <fctr>    <fctr>         <chr>
1         bread           bread     bread         bread
2 mustard.sauce   mustard.sauce      <NA> mustard sauce
3     tuna.fish       tuna.fish      <NA>     tuna fish
4       lettuce         lettuce   lettuce       lettuce
5         bread           bread     bread         bread
So in this data frame:
sandwich_str is an element of characters
sandwich_factor is an element of factors
in sandwich2 I tried replacing all of the periods in the levels of sandwich_factor. For whatever reason, this returns NA whenever there are periods.
in sandwich3 I take the more simple approach of just replacing all of the periods in strings with spaces. This works substantially better.
So I'm wondering what isn't working in my attempt at sandwich2. I'd like it to look more like sandwich3. Any advice?
Does this suit?
library(tidyverse)
library(stringr)
# Data --------------------------------------------------------------------
sandwich <-
c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
df <-
data_frame(sandwich_str = sandwich)
# Convert periods to spaces -----------------------------------------------
df$sandwich_str <-
df$sandwich_str %>%
as.character() %>%
str_replace("\\."," ") %>%
as.factor()
# Print output ------------------------------------------------------------
df %>%
print()
Credit to #aosmith for posting this answer as a comment. I'll post it here as an answer so I can accept and close this.
The problem was that the new level names need to be supplied via the labels argument rather than levels. So the correct way for me to have written this previously would be:
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich)) %>%
  mutate(sandwich2 = factor(sandwich_factor,
                            labels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
  mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
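As an aside, forcats (installed with the tidyverse) has fct_relabel(), which applies a function directly to the levels and sidesteps the levels/labels confusion. A sketch using the same data:
library(dplyr)
library(forcats)
library(stringr)
tibble(sandwich_str = sandwich) %>%
  mutate(sandwich_factor = factor(sandwich),
         sandwich2 = fct_relabel(sandwich_factor, str_replace_all, "\\.", " "))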
