step_mutate with textrecipes tokenlists - r

I'm doing NLP with the tidymodels framework, taking advantage of the textrecipes package, which has recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now, I want to perform spell checking on the new tokenized variable with a custom function for correct spelling, using functions from the hunspell package, but I get the following error (link to the spell check blog post):
Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE
Apparently, tokenlists don't parse easily to character vectors. I've noticed the existence of step_untokenize, but it simply dissolves the tokenlist by pasting and collapsing, and that's not what I need.
REPREX
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
desc = c("goood product", "not sou good", "vad produkt"),
price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
output <- case_when(
# check and (if required) correct spelling
!hunspell_check(input, dictionary('en_US')) ~
hunspell_suggest(input, dictionary('en_US')) %>%
# get first suggestion, or NA if suggestions list is empty
map(1, .default = NA) %>%
unlist(),
TRUE ~ input # if word is correct
)
# if input incorrectly spelled but no suggestions, return input word
ifelse(is.na(output), input, output)
}
product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
step_tokenize(desc) %>%
step_mutate(desc = correct_spelling(desc))
product_recipe %>% prep()
WHAT I WANT, BUT WITHOUT RECIPES
product_descriptions %>%
unnest_tokens(word, desc) %>%
mutate(word = correct_spelling(word))

There isn't a canonical way to do this using {textrecipes} yet. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (you provided that), and a way to apply that function to each element of the tokenlist. For now, there isn't a general step that lets you do that, but you can work around it by passing the function to custom_stemmer in step_stem(), which gives you the results you want:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
desc = c("goood product", "not sou good", "vad produkt"),
price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
output <- case_when(
# check and (if required) correct spelling
!hunspell_check(input, dictionary('en_US')) ~
hunspell_suggest(input, dictionary('en_US')) %>%
# get first suggestion, or NA if suggestions list is empty
map(1, .default = NA) %>%
unlist(),
TRUE ~ input # if word is correct
)
# if input incorrectly spelled but no suggestions, return input word
ifelse(is.na(output), input, output)
}
product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
step_tokenize(desc) %>%
step_stem(desc, custom_stemmer = correct_spelling) %>%
step_tf(desc)
product_recipe %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 3 × 6
#> price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1000 0 1 0 1 0
#> 2 700 0 1 1 0 1
#> 3 250 1 0 0 1 0
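The "apply the function to each element" idea can be sketched on a plain list of character vectors. This is only an illustration of the concept, not how textrecipes stores or processes tokenlists internally, and the `corrections` lookup table and `correct_tokens()` helper are hypothetical stand-ins for a real spell checker such as the hunspell-based function above.

```r
# Hypothetical lookup table standing in for a real spell checker.
corrections <- c(goood = "good", produkt = "product")

# Takes a character vector of tokens and returns corrected tokens.
correct_tokens <- function(tokens) {
  fixed <- corrections[tokens]              # NA where no correction is known
  unname(ifelse(is.na(fixed), tokens, fixed))
}

# One character vector of tokens per document, as a tokenizer would produce.
token_list <- list(
  c("goood", "product"),
  c("not", "sou", "good"),
  c("vad", "produkt")
)

# Apply the token-level function to each document separately.
corrected <- lapply(token_list, correct_tokens)
corrected
```

Any function with this shape (character vector in, character vector of the same length out) is the kind of function that can be handed to custom_stemmer in the recipe above.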

Not nearly as short, but this should work:
library(tidyverse)
library(hunspell)
product_descriptions <- tibble(
desc = c("goood product", "not sou good", "vad produkt"),
price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
output <- case_when(
# check and (if required) correct spelling
!hunspell_check(input, dictionary('en_US')) ~
hunspell_suggest(input, dictionary('en_US')) %>%
# get first suggestion, or NA if suggestions list is empty
map(1, .default = NA) %>%
unlist(),
TRUE ~ input # if word is correct
)
# if input incorrectly spelled but no suggestions, return input word
ifelse(is.na(output), input, output)
}
my_stopwords <- c("sou")
product_descriptions %>%
#create a row identifier
mutate(id = row_number()) %>%
#separate all `desc` into separate words (by space) into separate rows
separate_rows(desc, sep = " ") %>%
#helper for naming later on
mutate(word_id = "word") %>%
#word identifier
group_by(id) %>%
mutate(word = row_number()) %>%
ungroup() %>%
#exclude stopwords as defined above
filter(!desc %in% my_stopwords) %>%
#add spellchecker
mutate(desc = correct_spelling(desc)) %>%
#make tibble wide again
pivot_wider(names_from = c(word_id, word), values_from = desc) %>%
#unite all strings that were put into separate columns
unite(desc, starts_with("word_"), remove = FALSE, sep = " ", na.rm = TRUE) %>%
#omit all helper columns
select(-c(id, starts_with("word_"))) %>%
#clean up column ordering
relocate(desc, price)
In this case "sou" is deleted as a stopword and "produkt" is corrected to "product". The spellcheck function changes "vad" to "cad" instead of "bad", though.

Related

Rowwise comparison of the length of a string against a list of string lengths

Consider the following data frame with two columns of strings of variable length:
library("tidyverse")
df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
# # A tibble: 4 × 2
# REF ALT
# <chr> <chr>
# 1 TTG T
# 2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
# 3 T TTG
# 4 TTGTGTGTGTGTGTGTGTGTGT TTGTGTGTGTGTGTGTGTGTGTGT
Unlike column REF, column ALT sometimes includes several strings separated by commas (e.g. row 2).
I want to compare the length of strings in REF (REF_LEN) and ALT (ALT_LEN), and generate a TYPE column with values:
"SNM" when REF_LEN = ALT_LEN
"INS" when REF_LEN < ALT_LEN
"DEL" when REF_LEN > ALT_LEN
But I want to do it in a way that, when several strings are present in ALT, the output of this new TYPE column contains these types as well separated by a comma. i.e., the expected output here would be:
"DEL" "INS,DEL" "INS" "INS"
So far, I know how to get the length of values in ALT, but I fail at collapsing these values, as the output will contain lengths from all ALTs in the table, not just pairwise (i.e. 1,35,31,3,24):
df %>%
dplyr::mutate(REF_LEN = str_length(REF),
ALT_LEN = str_split(ALT, ","),
ALT_LEN = purrr::map(ALT_LEN, str_length) %>% unlist() %>% paste(collapse = ","))
Code above is incomplete as you can see, but I am also unable to work in a different direction using a helper function that gets the TYPE column above done. This will return many errors, but not sure why, since it seems to work nicely with values from ALT_LEN individually:
name <- function(alt_lens, ref_len) {
alt_lens <- unlist(alt_lens)
ifelse(alt_lens < ref_len, "DEL", ifelse(alt_lens > ref_len, "INS", "SNM"))
}
df %>%
dplyr::mutate(REF_LEN = str_length(REF),
ALT_LEN = str_split(ALT, ","),
TYPE = purrr::map(ALT_LEN, str_length) %>% name(REF_LEN))
Any ideas? thanks!
Here's a codegolf-ish base R solution :
df <- data.frame(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
df$TYPE <- mapply(
function(x, y) paste(c("INS", "SNM", "DEL")[2 + sign(nchar(x)- nchar(y))], collapse = ","),
df$REF, strsplit(df$ALT, ","), USE.NAMES = FALSE)
df$TYPE
#> [1] "DEL" "INS,DEL" "INS" "INS"
Created on 2022-04-20 by the reprex package (v2.0.1)
Update: Removed first answer. Thanks to akrun for pointing me there! The concept is the same: use nchar with case_when. The difference is that this version uses separate_rows from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
mutate(id = row_number()) %>%
separate_rows(ALT, sep = ",") %>%
mutate(TYPE = case_when(nchar(REF)==nchar(ALT) ~ "SNM",
nchar(REF)< nchar(ALT) ~ "INS",
nchar(REF)> nchar(ALT) ~ "DEL",
TRUE ~ NA_character_)) %>%
group_by(id) %>%
mutate(TYPE = toString(TYPE)) %>%
slice(1)
REF ALT id TYPE
<chr> <chr> <int> <chr>
1 TTG T 1 DEL
2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT 2 INS, DEL
3 T TTG 3 INS
4 TTGTGTGTGTGTGTGTGTGTGT TTGTGTGTGTGTGTGTGTGTGTGT 4 INS
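The attempts in the question collapse across rows because unlist() flattens the lengths of every ALT in the table into one vector. A minimal sketch of a row-wise alternative, assuming purrr and stringr are available, keeps ALT_LEN as a list column and pairs each element with its own REF_LEN via map2_chr(). The helper name `classify()` is made up for this example.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(
  REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
  ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT",
          "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT")
)

# Hypothetical helper: one REF length against all ALT lengths of the same row.
classify <- function(ref_len, alt_lens) {
  type <- case_when(
    alt_lens > ref_len ~ "INS",
    alt_lens < ref_len ~ "DEL",
    TRUE               ~ "SNM"
  )
  paste(type, collapse = ",")
}

res <- df %>%
  mutate(
    REF_LEN = str_length(REF),
    ALT_LEN = map(str_split(ALT, ","), str_length),    # list column, one vector per row
    TYPE    = map2_chr(REF_LEN, ALT_LEN, classify)     # pairwise, row by row
  )

res$TYPE
```

Because map2_chr() walks REF_LEN and ALT_LEN in parallel, each row only ever sees its own ALT lengths.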

Filter specific values of optional argument with dplyr

I have a dataframe that looks like this :
df <- data.frame(ID = rep(1:10, each = 6),
Site = rep(c("A","B","C","D"), each = 6, times = 10),
Department = rep(c("E","F","G","H"), each = 6, times = 10),
Occupation = rep(c("I","J","K","L"), each = 6, times = 10),
Construct = rep(paste0("X",1:6), times = 10),
Score = sample(c("Green","Orange","Red"), size = 60, replace = TRUE))
head(df)
Basically, each ID belongs to a site, a department and has an occupation, and is evaluated on six constructs.
I have adapted a previous function of mine to compute the N and the rate of each Score category for a given Construct, by any combination of Site, Department and Occupation :
my_function <- function(..., dimension = NULL){
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
This works perfectly, as I simply have to indicate which Construct, and add any of the three factors (Site, Department, Occupation) as optional arguments to obtain a summary. For example, a summary of X1 by Site and Department would be :
my_function(dimension = "X1", Site, Department)
However, I would like to filter out some of the values of the Occupation variable, but only when looking at a summary including this variable. I tried to do so by checking whether Occupation was passed as an optional argument, and exclude the specific values when it was the case. Something like :
my_function <- function(..., dimension = NULL){
if(hasArg(Occupation)){
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
But it does not seem to work, as it consistently includes the values I'd like to filter out, even when I specify Occupation as an optional argument. I tried to fiddle with things like curly-curly {{}} but I can't seem to get this function to filter the specific values.
hasArg seems to expect all of the arguments to be named, whereas in
my_function(dimension="X1", Site, Department, Occupation)
this is not the case.
Perhaps:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
if (hasOcc) {
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
my_function(Site, Department, Occupation, dimension = "X1")
# # A tibble: 7 x 6
# # Groups: Site, Department, Occupation [3]
# Site Department Occupation Score n rate
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 B F J Green 6 0.6
# 2 B F J Orange 4 0.4
# 3 C G K Green 2 0.2
# 4 C G K Orange 2 0.2
# 5 C G K Red 6 0.6
# 6 D H L Green 6 0.6
# 7 D H L Orange 4 0.4
Some other thoughts on the function:
reaching out of its scope to get df is not a good practice: it is not really reproducible, and it can be difficult to troubleshoot. For instance, if you forget to assign your data to df, you'll see
my_function(Site, Department, Occupation, dimension = "X1")
# Error in UseMethod("filter") :
# no applicable method for 'filter' applied to an object of class "function"
(This error is because it is finding stats::df.)
Further, if you want to use it against a different non-df-named dataset, you're out of luck.
Recommendation: explicitly pass the data. A tidyverse commonality is to pass it as the first argument. One side-benefit of this is that you can (generally) use this in the middle of a %>%-pipe directly.
my_function <- function(.data, ..., dimension = NULL) { .data %>% ... }
You can reduce the number of pipelines in there by including the Occupation conditional directly in the filter(..). This is not just code-golf: in more complex code examples, it's not hard to imagine updating one of the %>%-pipes and either forgetting the other or updating it differently. Since the only difference here is a component of filter, we can add it there:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
df %>%
filter(Construct == dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
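If dimension is required, don't default it to NULL: with that default, omitting the argument makes filter(Construct == NULL) fail with a confusing error, whereas without a default R reports the missing argument directly.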
my_function <- function(.data, ..., dimension) { ... }
If it is instead optional and you don't want to filter on it if not provided, then you need to check for that in your filter:
filter(if (is.null(dimension)) TRUE else Construct == dimension, ...)
If you can imagine wanting dimension to be either NA (matching an explicit NA value in the data) or you might want "one or more", then you may want to use %in% instead of ==:
NA == NA
# [1] NA
NA %in% NA
# [1] TRUE
So your function could use
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, ...)
These points would result in your function being either
my_function <- function(.data, ..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
if dimension is optional, or
my_function <- function(.data, ..., dimension) {
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
otherwise.
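One caveat with as.character(match.call()) is that it also contains the value of dimension, so a call such as dimension = "Occupation" would be detected as if Occupation had been passed in the dots. A minimal sketch of a stricter check, assuming rlang is installed (it comes with dplyr), inspects only the expressions captured from `...`. The small `df` below is made-up example data, and `has_occ` is just a local name.

```r
library(dplyr)

my_function <- function(.data, ..., dimension) {
  dots <- rlang::enquos(...)
  # Labels of the grouping expressions only; dimension is not part of the dots.
  has_occ <- "Occupation" %in% vapply(dots, rlang::as_label, character(1))
  .data %>%
    filter(Construct %in% dimension, !has_occ | Occupation != "I") %>%
    group_by(..., Score) %>%
    summarise(n = n(), .groups = "drop_last") %>%
    mutate(rate = round(n / sum(n), 2))
}

# Made-up data with the same column names as in the question.
df <- data.frame(
  Site       = rep(c("A", "B"), each = 4),
  Occupation = rep(c("I", "J"), each = 4),
  Construct  = rep(c("X1", "X2"), times = 4),
  Score      = rep(c("Green", "Red"), each = 2, times = 2)
)

by_site <- my_function(df, Site, dimension = "X1")              # keeps Occupation "I"
by_occ  <- my_function(df, Site, Occupation, dimension = "X1")  # drops Occupation "I"
```

This only recognises a bare `Occupation` in the dots; an expression such as `toupper(Occupation)` would get a different label and would not trigger the filter.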

How to pass variable to filter function within a R function

I am fairly new to R. I wrote the below function which tries to summarise a dataframe, based on a feature variable (passed to the function as 'variable') and a target variable (passed to the function as target_var). I also pass it a value (target_val) on which to filter.
The function below falls over on the filter line (filter(target_var == target_val)). I think it has something to do with quo, quosure etc, but can't figure out how to fix it. The following code should be ready to run - if you exclude the filter line it should work, if you included the filter line it will fall over.
library(dplyr)
target <- c('good', 'good', 'bad', 'good', 'good', 'bad')
var_1 <- c('debit_order', 'other', 'other', 'debit_order','debit_order','debit_order')
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
variable_string <- deparse(substitute(variable))
target_string <- deparse(substitute(target_var))
temp_df1 <- dataframe %>%
group_by_(variable_string, target_string) %>%
summarise(cnt = n()) %>%
group_by_(variable_string) %>%
mutate(total = sum(cnt)) %>%
mutate(rate = cnt / total) %>%
filter(target_var == target_val)
assign(df_name, temp_df1, envir=.GlobalEnv)
}
odds_by_var(dset, var_1, target, 'bad')
So I assume you want to filter by target "good" or "bad".
In my understanding, you should always filter() before you group_by(), as you may otherwise omit your filter variables. I restructured your function a little:
dset <- data.frame(target, var_1)
odds_by_var <- function(dataframe, variable, target_var, target_val){
df_name <- paste('odds', deparse(substitute(variable)), sep = "_")
variable_string <- deparse(substitute(variable))
target_string <- deparse(substitute(target_var))
temp_df1 <- dataframe %>%
group_by_(variable_string, target_string) %>%
summarise(cnt = n()) %>%
mutate(total = sum(cnt),
rate = cnt / total)
names(temp_df1) <- c(variable_string,"target","cnt","total","rate" )
temp_df1 <- temp_df1[temp_df1$target == target_val,]
assign( df_name,temp_df1, envir=.GlobalEnv)
}
odds_by_var(dset, var_1, target, "bad")
result:
> odds_var_1
# A tibble: 2 x 5
# Groups: var_1 [2]
var_1 target cnt total rate
<chr> <chr> <int> <int> <dbl>
1 debit_order bad 1 4 0.25
2 other bad 1 2 0.5
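For comparison, here is a hedged sketch of the same summary written with the embrace operator {{ }}, assuming dplyr 1.0 or later. In the original function, target_var inside filter() is not looked up as a column: it evaluates to whatever `target` is in the calling environment, which is why the filter line fails. Embracing the argument makes dplyr treat it as a column of the data. This version returns the result instead of assigning into the global environment, which is a design choice of the sketch rather than something the question asked for.

```r
library(dplyr)

dset <- data.frame(
  target = c("good", "good", "bad", "good", "good", "bad"),
  var_1  = c("debit_order", "other", "other", "debit_order", "debit_order", "debit_order")
)

odds_by_var <- function(dataframe, variable, target_var, target_val) {
  dataframe %>%
    count({{ variable }}, {{ target_var }}, name = "cnt") %>%  # counts per combination
    group_by({{ variable }}) %>%
    mutate(total = sum(cnt), rate = cnt / total) %>%
    ungroup() %>%
    filter({{ target_var }} == target_val)                     # column vs. plain value
}

odds_var_1 <- odds_by_var(dset, var_1, target, "bad")
odds_var_1
```

The caller decides the name of the result, so no deparse(substitute()) or assign() is needed.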

Standardize column names in excel sheets before combining with purrr and readxl

I would like to compile an Excel file with multiple tabs labeled by year (2016, 2015, 2014, etc). Each tab has identical data, but column names may be spelled differently from year-to-year.
I would like to standardize columns in each sheet before combining.
This is the generic way of combining using purrr and readxl for such tasks:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map_dfr(read_excel, path = my.file, .id = "sheet")
...however as noted, this creates separate columns for "COLUMN ONE", and "Column One", which have the same data.
Inserting make.names into the pipeline would probably be the best solution.
Keeping it all together would be ideal...something like:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>%
map(~(names(.) %>% #<---WRONG
make.names() %>%
str_to_upper() %>%
str_trim() %>%
set_names()) )
..but the syntax is all wrong.
Rather than defining your own function, the clean_names function from the janitor package may be able to help you. It takes a dataframe/tibble as an input and returns a dataframe/tibble with clean names as an output.
Here's an example:
library(tidyverse)
tibble(" a col name" = 1,
"another-col-NAME" = 2,
"yet another name " = 3) %>%
janitor::clean_names()
#> # A tibble: 1 x 3
#> a_col_name another_col_name yet_another_name
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
You can then plop it right into the code you gave:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>% #<Import as list, not dfr
map(janitor::clean_names) %>% #<janitor::clean_names
bind_rows(.id = "sheet")
Creating a new function is doable but is verbose and uses two maps:
# User defined function: col_rename
col_rename <- function(df){
names(df) <- names(df) %>%
str_trim() %>% # trim first: make.names() would turn stray spaces into dots
str_to_upper() %>%
make.names()
return(df)
}
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>% #<Import as list, not dfr
map(col_rename) %>% #<Fix colnames (user defined function)
bind_rows(.id = "sheet")
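The renaming step can be tried without an Excel file. The sketch below uses two small made-up tibbles in place of the sheets returned by read_excel, and dplyr::rename_with() to apply a name-cleaning function to every column. The helper name `standardize()` and the sample data are assumptions for illustration.

```r
library(dplyr)
library(purrr)
library(stringr)

# Stand-ins for two sheets whose headers differ only in case and whitespace.
sheets <- list(
  `2016` = tibble(`COLUMN ONE` = 1:2),
  `2015` = tibble(`Column One ` = 3:4)
)

# Hypothetical helper: trim, upper-case, then make syntactically valid names.
standardize <- function(nm) make.names(str_to_upper(str_trim(nm)))

combined <- sheets %>%
  map(rename_with, standardize) %>%  # rename every column of every sheet
  bind_rows(.id = "sheet")

combined
```

With the real workbook, the `sheets` list would be replaced by the output of the map(read_excel, path = my.file) step shown above.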

Why do quosures work in group_by() but not filter()?

I'm working on building a function that will manipulate a data frame based on a string. Within the function, I'll build a column name from the string and use it to manipulate the data frame, something like this:
library(dplyr)
orig_df <- data_frame(
id = 1:3
, amt = c(100, 200, 300)
, anyA = c(T,F,T)
, othercol = c(F,F,T)
)
summarize_my_df_broken <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
filter(!!my_column) %>%
group_by(othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
# I need the original string as new column which is why I can't
# pass in just the column name
mutate(stringid = my_string)
}
summarize_my_df_works <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
group_by(!!my_column, othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = my_string)
}
# throws an error:
# Argument 2 filter condition does not evaluate to a logical vector
summarize_my_df_broken(orig_df, "A")
# works just fine
summarize_my_df_works(orig_df, "A")
I understand what the problem is: unquoting the quosure as an argument to filter() in the broken version is not referencing the actual column anyA.
What I don't understand is why it works in group_by(), but not in filter()--why is there a difference?
Right now you are making quosures of strings, not symbol names. That's not how those are supposed to be used. There's a big difference between quo("hello") and quo(hello). If you want to make a proper symbol name from a string, you need to use rlang::sym. So a quick fix would be
summarize_my_df_broken <- function(df, my_string) {
my_column <- rlang::sym(paste0("any", my_string))
...
}
If you look more closely I think you'll see the group_by/summarize isn't actually working the way you expect either (though you just don't get the same error message). These two do not produce the same results
summarize_my_df_works(orig_df, "A")
# `paste0("any", my_string)` othercol n total
# <chr> <lgl> <int> <dbl>
# 1 anyA FALSE 2 300
# 2 anyA TRUE 1 300
orig_df %>%
group_by(anyA, othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = "A")
# anyA othercol n total stringid
# <lgl> <lgl> <int> <dbl> <chr>
# 1 FALSE FALSE 1 200 A
# 2 TRUE FALSE 1 100 A
# 3 TRUE TRUE 1 300 A
Again the problem is using a string instead of a symbol.
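The difference between a quoted string expression and a symbol can be inspected directly. This is a small sketch assuming rlang is installed; the variable names are made up for the example.

```r
library(rlang)

my_string <- "A"

# quo() captures the expression as written: here, a call to paste0().
as_call <- quo_get_expr(quo(paste0("any", my_string)))

# quo("hello") captures a string constant, not a column reference.
as_string <- quo_get_expr(quo("hello"))

# sym() builds a symbol from the computed string, which dplyr treats as a column name.
as_symbol <- sym(paste0("any", my_string))

is.call(as_call)        # the quosure wraps an unevaluated paste0() call
is.character(as_string) # the quosure wraps a plain string
is.symbol(as_symbol)    # a real symbol
as_string(as_symbol)    # the name the symbol carries
```

Unquoting the first with !! splices in the paste0() call, which evaluates to a string for every group; unquoting the symbol splices in a reference to the column.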
You don't have any conditions for filter() in your 'broken' function, you just specify the column name.
Beyond that, I'm not sure if you can insert quosures into larger expressions. For example, here you might try something like:
df %>% filter((!!my_column) == TRUE)
But I don't think that would work.
Instead, I would suggest using the conditional function filter_at() to target the appropriate column. In that case, you separate the quosure from the filter condition:
summarize_my_df_broken <- function(df, my_string) {
my_column <- quo(paste0("any", my_string))
df %>%
filter_at(vars(!!my_column), all_vars(. == TRUE)) %>%
group_by(othercol) %>%
summarize(
n = n()
, total = sum(amt)
) %>%
mutate(stringid = my_string)
}
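Another option when the column name is only available as a string is the `.data` pronoun, which looks a column up by name inside dplyr verbs. A minimal sketch, assuming a reasonably recent dplyr; the function name `summarize_my_df()` is made up here to avoid confusion with the versions above.

```r
library(dplyr)

orig_df <- tibble(
  id = 1:3,
  amt = c(100, 200, 300),
  anyA = c(TRUE, FALSE, TRUE),
  othercol = c(FALSE, FALSE, TRUE)
)

summarize_my_df <- function(df, my_string) {
  col_name <- paste0("any", my_string)   # e.g. "anyA"
  df %>%
    filter(.data[[col_name]]) %>%        # keep rows where that logical column is TRUE
    group_by(othercol) %>%
    summarize(n = n(), total = sum(amt), .groups = "drop") %>%
    mutate(stringid = my_string)
}

res <- summarize_my_df(orig_df, "A")
res
```

If the string does not match an existing column, `.data[[col_name]]` raises an error rather than silently filtering on something else.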
