do.call() and tidy evaluation - r

I am trying to make do.call() work in the context of tidy evaluation:
library(rlang)
library(dplyr)
data <- tibble(item_name = c("apple", "bmw", "bmw"))
mutate(data, category = case_when(item_name == "apple" ~ "fruit",
item_name == "bmw" ~ "car"))
# # A tibble: 3 x 2
# item_name category
# <chr> <chr>
# 1 apple fruit
# 2 bmw car
# 3 bmw car
What differs between:
category_fn <- function(df, ...){
# browser()
cat1 <- quos(...)
mutate(df, category = case_when(!!! cat1))
}
category_fn(df = data, item_name == "apple" ~ "fruit",
item_name == "bmw" ~ "car")
# # A tibble: 3 x 2
# item_name category
# <chr> <chr>
# 1 apple fruit
# 2 bmw car
# 3 bmw car
and:
cat <- list(item_name == "apple" ~ "fruit", item_name == "bmw" ~ "car")
do.call(category_fn, c(list(df = data), cat), quote = FALSE)
# Or:
do.call(category_fn, c(list(df = data), cat), quote = TRUE)
# Or:
rlang::invoke(category_fn, c(list(df = data), cat))
which all give the same error:
# Error in mutate_impl(.data, dots) :
# Evaluation error: object 'item_name' not found.
I stepped into the function with browser(), examined the arguments, ran expr(mutate(df, category = case_when(!!! cat1))) there (as suggested as a universal debugging strategy in http://rpubs.com/lionel-/programming-draft), with the same output in both cases: mutate(df, category = case_when(~(item_name == "apple" ~ "fruit"), ~(item_name == "bmw" ~ "car"))).
I've also tried to tweak the envir or .env arguments to no avail.
My understanding is that it likely has something to do with different quosure environments, but environment(cat1[[1]]) is identical in both cases (<environment: R_GlobalEnv>).
Note:
This is something of a follow-up to "Tidy evaluation programming with dplyr::case_when", which I was trying to answer.
> sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────
setting value
version R version 3.4.3 (2017-11-30)
os Linux Mint 18
system x86_64, linux-gnu
[...]
─ Packages ────────────────────────────────────────────────────────────
package * version date source
[...]
dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.3)
[...]
rlang * 0.1.6 2017-12-21 CRAN (R 3.4.3)
[...]

We could create 'cat' as a list of quosures with quos() and then splice it into the call with !!!:
cat <- quos(item_name == "apple" ~ "fruit", item_name == "bmw" ~ "car")
category_fn(data, !!!(cat))
# A tibble: 3 x 2
# item_name category
# <chr> <chr>
#1 apple fruit
#2 bmw car
#3 bmw car

I think it's a similar issue to the other post; quoting the list itself is not the same as quoting the elements of the list individually.
I have modified the cat definition to quote the elements individually, and changed the function slightly to remove the quos() call and name the argument explicitly. In the do.call statements, cat is now passed as a single element of the argument list (the second argument to do.call) rather than being concatenated into it.
With these modifications the two do.call statements and the invoke then return the same result as the direct execution in your post:
data <- tibble(item_name = c("apple", "bmw", "bmw"))
cat <- list(quo(item_name == "apple" ~ "fruit"),
quo(item_name == "bmw" ~ "car"))
category_fn <- function(df, category){
mutate(df, category = case_when(!!! category))
}
> do.call(category_fn, list(data, cat), quote = FALSE)
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
> # Or:
> do.call(category_fn, list(data, cat), quote = TRUE)
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
> # Or:
> rlang::invoke(category_fn, list(df = data, cat))
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
The value of the quote argument makes no difference in the two do.call examples.
I find quosures conceptually difficult, and the current "Programming with dplyr" vignette on CRAN does not make them a great deal easier.
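A small base R sketch may help show why quoting each element matters. It rests on an assumption that is not confirmed anywhere above: that the original failure comes from each formula in the plain list being a finished object that carries the environment where it was created, so its left-hand side gets looked up there instead of in the data. The names f, lhs, outside and inside are illustrative only.

```r
# Base R only; `f`, `lhs`, `outside` and `inside` are illustrative names.
df <- data.frame(item_name = c("apple", "bmw", "bmw"),
                 stringsAsFactors = FALSE)

# A formula built up front is already an object; it remembers where it was made.
f <- item_name == "apple" ~ "fruit"
class(f)
environment(f)

# Its left-hand side, evaluated in that environment, cannot see the column.
lhs <- f[[2]]
outside <- tryCatch(eval(lhs, environment(f)), error = function(e) e)
inherits(outside, "error")

# With the data frame as the evaluation context, the same expression works.
inside <- eval(lhs, df, environment(f))
inside
```

If that assumption holds, wrapping each element in quo() helps because the expression is then only evaluated once the data columns are in scope.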

The answer in part (1a) of my response to Tidy evaluation programming with dplyr::case_when works here too.
If cat, data and category_fn are as in the present question then this works. The first line transforms cat to cat_ which is of a form that will work here.
cat_ <- lapply(cat, function(x) do.call("substitute", list(x)))
do.call("category_fn", c(list(df = data), cat_))
giving:
# A tibble: 3 x 2
item_name category
<chr> <chr>
1 apple fruit
2 bmw car
3 bmw car
Regarding the question at the end, which seems to ask for alternatives to quosures: my answer to the original problem, linked above, includes solutions using the wrapr package and base R. The seplyr package, by the author of wrapr, may also be an alternative.

Related

Matrix to Dataframe conversion fails in R

I have created a function that calls an api and parses for the id, label, description, and score of each annotation. But I can't seem to get the dataframe to display properly.
Here's code:
get_wikidata_links <- function(input_text, minimum_score) {
#
# Function which takes a character vector of length 1 as input (i.e. all text
# needs to be combined into a single character) as well as a minimum certainty
# score, and returns a tibble with key information and links to Wikidata
#
# Input
# - input_text: Text input (character)
# - minimum_score: Minimum score that every returned entity needs to have
# (numeric)
#
# Output
# - top_wikidata_links: Table with the first four columns being 'id', 'label',
# 'description', 'score' (tibble)
#
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
data = content(r)$annotations
framed = list()
vec = list()
dummy = 0
for (i in 1:length(data)) {
data1 = data[[i]]$tags
for (j in 1:length(data1)) {
data2 = data1[[j]]
if (data2$score>minimum_score) {
vec[1] <- data2$id
vec[2] <- data2$label
vec[3] <- data2$desc
vec[4] <- data2$score
dummy <- dummy + 1
framed[[dummy]] <- vec
}
}
}
data_matrix <- do.call("rbind", framed)
top_wikidata_links <- as.data.frame(data_matrix, stringsAsFactors = FALSE)
colnames(top_wikidata_links) <- c("ID", "Label", "Description", "Score")
return(top_wikidata_links)
}
Now I test this function with a couple phrases:
# Test 1
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#
# Hint: The output should be a tibble similar to the one outlined below
#
# | id | label | description | score |
# | "Q81244" | "Karl Popper" | "Austrian-British philosopher of science" | 2.4568285 |
# | "Q174570" | "London School of Economics and Political Science" | "university in Westminster, UK" | "1.4685043" |
# | "Q171240" | "London Stock Exchange" | "stock exchange in the City of London" | "-0.4124461" |
# Test 2
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
Now for some reason the matrix data_matrix works fine:
[Output: screenshot not preserved]
But the data frame conversion fails as such:
[Output: screenshot not preserved]
I guess it's a bit easier to manage through some hoisting and unnesting. Inspired by https://tidyr.tidyverse.org/articles/rectangle.html :
library(httr)
library(tidyr)
library(dplyr)
get_wikidata_links <- function(input_text, minimum_score) {
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
tibble(link = content(r)$annotations) %>%
hoist(link, tags = "tags") %>%
unnest_longer(tags) %>%
hoist(tags, ID = "id", Label = "label", Description = "desc", Score = "score") %>%
select(ID:Score) %>%
filter(Score >= minimum_score)
}
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q81244 Karl Popper Austrian-Brit… 2.46
#> 2 Q174570 London School of Economics and Political Science university in… 1.47
#> 3 Q171240 London Stock Exchange stock exchang… -0.412
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q92760 Claude Shannon American mathematician an… 1.96
#> 2 Q230492 University of Michigan public research universit… 1.29
#> 3 Q49108 Massachusetts Institute of Technology research university in Ca… 0.902
Created on 2023-01-19 with reprex v2.0.2
The service itself seems a bit unstable or overloaded.
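As for why the original conversion misbehaves, my guess (not verified against the live API) is that vec is a list, so do.call("rbind", framed) builds a matrix of mode list and as.data.frame() then produces list columns. A minimal base R sketch with made-up rows; the values in rows are hypothetical stand-ins for the parsed tags:

```r
# Hypothetical stand-ins for the parsed API tags (not real API output).
rows <- list(
  list(id = "Q1", label = "A", desc = "first",  score = 1.5),
  list(id = "Q2", label = "B", desc = "second", score = -0.2)
)

# rbind on lists gives a matrix whose cells are list elements ...
m <- do.call(rbind, rows)
typeof(m)

# ... so every column of the resulting data frame is itself a list.
df_bad <- as.data.frame(m, stringsAsFactors = FALSE)
sapply(df_bad, class)

# Turning each row into a one-row data frame first keeps atomic columns.
df_ok <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
sapply(df_ok, class)
```

The second form keeps score numeric, which also makes filtering on it straightforward.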

regex_left_join (fuzzyjoin) not working as expected

I am trying to perform a join in R based on a regex pattern from one table. From what I understand, the fuzzyjoin package should be exactly what I need, but I can't get it to work. Here is an example of what I'm trying to do:
library(tidyverse)
library(fuzzyjoin)
(typing_table <- tribble(
~typing,
"DPB02:01",
"DPB04:02"
)
)
(P_group_table <- tribble(
~P_group, ~Alleles,
"DP1", "DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
"DP2", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
"DP3", "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
"DP4", "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"
)
)
I am trying to join the P_group_table to the typing_table by searching for the "typing" value in the "Alleles" string. I have used the following expression:
(typing_table %>% regex_left_join(P_group_table, by = c("typing" = "Alleles")))
Which results in a join, but the values from the right table are empty. I assume I must be misunderstanding the syntax of the regex_left_join expression, but I can't figure it out. I have verified that the "typing" value can be used as a regex pattern with the following code:
(typing_table_2 <- typing_table %>% slice_head)
(P_group_table %>% filter(str_detect(Alleles, typing_table_2$typing)))
We could make use of str_detect as match_fun in fuzzy_.*_join
library(fuzzyjoin)
library(stringr)
fuzzy_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"),
match_fun = str_detect)
Or we may use regex_right_join, which treats the column from the second table as the pattern:
regex_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"))
# A tibble: 2 × 3
P_group Alleles typing
<chr> <chr> <chr>
1 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03 DPB02:01
2 <NA> <NA> DPB04:02
Note the difference when we switch the pattern
> str_detect("DPB02:01", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03")
[1] FALSE
> str_detect("DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03", "DPB02:01")
[1] TRUE
One option to do the left_join is by getting the substring from the 'P_group_table' before doing the join
left_join(typing_table, P_group_table %>%
mutate(typing = substr(Alleles, 1, 8)), by = "typing")
# A tibble: 2 × 3
typing P_group Alleles
<chr> <chr> <chr>
1 DPB02:01 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03
2 DPB04:02 <NA> <NA>
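For comparison, here is a base R sketch of the same "pattern on the left, string on the right" join, without fuzzyjoin. It assumes the typing values should be matched as literal substrings (fixed = TRUE), which is my choice and not something the question states; pairs, hits and result are illustrative names.

```r
typing_table <- data.frame(typing = c("DPB02:01", "DPB04:02"),
                           stringsAsFactors = FALSE)
P_group_table <- data.frame(
  P_group = c("DP1", "DP2", "DP3", "DP4"),
  Alleles = c("DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
              "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
              "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
              "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"),
  stringsAsFactors = FALSE
)

# Cross join every typing with every allele string.
pairs <- merge(typing_table, P_group_table, by = NULL)

# Keep pairs where the typing occurs inside the allele string
# (pattern = typing, x = Alleles; the direction matters).
keep <- unname(mapply(grepl, pairs$typing, pairs$Alleles,
                      MoreArgs = list(fixed = TRUE)))
hits <- pairs[keep, ]

# Left join back so unmatched typings are kept with NA.
result <- merge(typing_table, hits, by = "typing", all.x = TRUE)
result
```

The cross join grows with the product of the two table sizes, so this is only reasonable for small lookup tables.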

replacing list element of character(0) with NA by using is_empty()

I have a dataframe (my_df) with a list column called 'house' that contains character(0) in one row.
I would like to replace this value with NA_character_ by using rlang's is_empty(). I am aware that there are workarounds using e.g. length(x)==0, but I am trying to understand why my approach with is_empty() does not work. Grateful for any hint.
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
type = c("house", "tent", "treehouse, mainhouse")
) %>%
mutate(house=str_extract_all(type, regex("house")))
my_df
#> type house
#> 1 house house
#> 2 tent
#> 3 treehouse, mainhouse house, house
class(my_df$house)
#> [1] "list"
my_df %>%
mutate(house_mod=modify_if(house,
.p = is_empty(house),
.f = NA_character_))
#> Error in `mutate()`:
#> ! Problem while computing `house_mod = modify_if(house, .p =
#> is_empty(house), .f = NA_character_)`.
#> Caused by error in `probe()`:
#> ! length(.p) == length(.x) is not TRUE
my_df <- my_df %>%
mutate(house_empty=map(house, ~is_empty(.)))
my_df
#> type house house_empty
#> 1 house house FALSE
#> 2 tent TRUE
#> 3 treehouse, mainhouse house, house FALSE
my_df %>%
mutate(house_mod=modify_if(house,
.p = house_empty==T,
.f = NA_character_))
#> Error in `mutate()`:
#> ! Problem while computing `house_mod = modify_if(house, .p = house_empty
#> == T, .f = NA_character_)`.
#> ✖ `house_mod` must be size 3 or 1, not 2.
Created on 2022-05-23 by the reprex package (v2.0.1)
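The .f argument should take a function, but you have passed NA_character_ directly; try ~NA_character_ instead. Likewise, .p should be the predicate function itself (is_empty), not the result of is_empty(house), which is a single FALSE for the whole column rather than one value per element: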
my_df %>%
mutate(house_mod=modify_if(house,
.p = is_empty,
.f = ~NA_character_))
#> type house house_mod
#> 1 house house house
#> 2 tent NA
#> 3 treehouse, mainhouse house, house house, house
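The length-based workaround mentioned in the question can be sketched in base R as follows. The house list here is a hand-built stand-in for the list column, not the output of str_extract_all().

```r
# Hand-built stand-in for the list column from the question.
house <- list("house", character(0), c("house", "house"))

# Checking the list as a whole gives one answer for the entire column ...
length(house) == 0

# ... whereas lengths() checks each element separately.
empty <- lengths(house) == 0
empty

# Replace only the empty elements, keeping the list structure.
house[empty] <- list(NA_character_)
house
```

This mirrors what modify_if() does with a predicate function: the test is applied per element, not to the column as a whole.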

R convert character string to a dataframe

Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Each individual is separated by a "." (the dot that follows the final code). There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Hélène),female10/06/2011#Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013#Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input -
result = data.frame(first_names = c('Hélène Andrew','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Joëli Pau Hajil'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('male','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
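Here's a tidy solution with tidyr's functions separate_rows and extract. (The printed output below shows different names from the txt in the question, so it presumably comes from another version of the sample string.)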
library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted:
regex = "([A-Z]+)\\(([\\w,]+)\\),([a-z]+)([\\d/]+)#(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist.(\\d+),Dt\\.([\\d/]+)")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
txt %>%
# Replace "." sep with newline
stringr::str_replace_all(
"\\.[A-Z]",
function(x) stringr::str_replace(x, "\\.", "\n")
) %>%
# Replace all commas in (First[,Middle1,Middle2,...]) with space
stringr::str_replace_all(
# Match anything inside brackets, but as few times as possible, so we don't
# match multiple brackets
"\\(.*?\\)",
# Inside the regex that was matched, replace comma with space
function(x) stringr::str_replace_all(x, ",", " ")
) %>%
# Replace ( with ,
stringr::str_replace_all("\\(", ",") %>%
# Remove )
stringr::str_remove_all("\\)") %>%
# Replace # with ,
stringr::str_replace_all("#", ",") %>%
# Replace the final "." with a newline
stringr::str_replace_all("\\.$", "\n") %>%
# Add , after female/male
stringr::str_replace_all("male", "male,") %>%
# Read as comma delimited file (works since string contains \n)
readr::read_delim(
file = .,
delim = ",",
col_names = FALSE,
show_col_types = FALSE
)
# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Hélène female 10/06/2… Geneva Switzerland PPF
#> 2 BOUKAR Mohamed El-Hadi male 04/12/1… London England PPF
#> 3 HARIMA Olak N’nassik G… female 25/06/2… Paris France PPF
#> 4 THOMAS Hajil Pau Joëli female 03/03/1… Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Created on 2022-03-20 by the reprex package (v2.0.1)
Note that more efficient regexes probably exist, but I believe this approach is simpler and easier to change later.
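A base R variant of the capture-group idea is sketched below. It uses a shortened, ASCII-only sample of two records (my simplification, so the accented names from the question are not covered), and the pattern is my own guess at the record layout, not a tested parser for the full data.

```r
# Shortened ASCII-only sample with two records (a simplification).
txt <- paste0(
  "EREKSON(Andrew,Helene),female10/06/2011#Geneva(Switzerland),",
  "PPF,2000X007707,dist.093,Dt.043/996.",
  "BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),",
  "PPF,2001X005729,dist.097,Dt.043/997."
)

# One capture group per attribute; the trailing \\. ends a record.
pattern <- paste0(
  "([A-Z]+)\\(([^)]*)\\),(female|male)([0-9/]+)#([^(]+)\\(([^)]+)\\),",
  "([^,]+),([^,]+),(dist\\.[0-9]+),(Dt\\.[0-9]+/[0-9]+)\\."
)

# First split into records, then pull the groups out of each record.
records <- regmatches(txt, gregexpr(pattern, txt))[[1]]
fields <- regmatches(records, regexec(pattern, records))

# Drop the full match (first element) and bind the groups row-wise.
mat <- do.call(rbind, lapply(fields, function(x) x[-1]))
result <- as.data.frame(mat, stringsAsFactors = FALSE)
names(result) <- c("family_name", "first_names", "gender", "birthday",
                   "birth_city", "birth_country", "acc_type", "acc_num",
                   "district", "code")

# First names are comma-separated inside the brackets; use spaces instead.
result$first_names <- gsub(",", " ", result$first_names, fixed = TRUE)
result
```

Using [^)]* for the first names, instead of a word-character class, lets hyphenated names such as El-Hadi through.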

How to convert misspelt strings to the correct version in R

I have a lot of dataframes with many inconsistencies in people's names. I want to convert all misspelt names to the correct name; some names have 2 or 3 common misspellings.
Currently I am attempting this via the code below, but the output is far from what I want because all the changes are applied to all strings.
Help fixing my code or suggesting a much better method would be much appreciated.
Data <- data.frame(Name = c("Tom", "H", "Terry", "Barry"))
NameChangeR <- list('Tom'='Thomas T', 'H' = 'Harry P', 'Terry' = 'Terry T', 'Barry' = 'Barry W')
Data$Name <- chartr(paste(names(NameChangeR), collapse=''),
paste(NameChangeR, collapse=''),
Data$Name )
This sounds like it should be simple, but looping (or *applying) through options seems like an inefficient (from a reading perspective) way to go about it.
Starting with your 'misspelt' data:
Data <- data.frame(Name = c("Tom", "H", "Terry", "Barry"))
And randomising it a bit so that the answer doesn't depend on searching set 1 for name 1
DataMixed <- Data[sample(4), , drop=FALSE]
I turned the problem around into defining a list of possible misspellings of each correct name
BadSpellings <- list('Thomas T' = c('Tom', 'Thom', 'Tommy'),
'Harry P' = c('Herry', 'H', 'Wizard'),
'Terry T' = c('Terry', 'Trry'),
'Barry W' = c('Other Barry', 'Barry'))
BadSpellings
#> $`Thomas T`
#> [1] "Tom" "Thom" "Tommy"
#>
#> $`Harry P`
#> [1] "Herry" "H" "Wizard"
#>
#> $`Terry T`
#> [1] "Terry" "Trry"
#>
#> $`Barry W`
#> [1] "Other Barry" "Barry"
I know it's not the most 'base' way to do it, but there are some great packages in the 'tidyverse' that help here: tibble allows list columns, and tidyr can unpack them.
library(tibble)
library(tidyr)
tbl <- tibble(Good = names(BadSpellings), Bad = BadSpellings)
tbl
#> # A tibble: 4 x 2
#> Good Bad
#> <chr> <list>
#> 1 Thomas T <chr [3]>
#> 2 Harry P <chr [3]>
#> 3 Terry T <chr [2]>
#> 4 Barry W <chr [2]>
merge(DataMixed, unnest(tbl, Bad), by.x = "Name", by.y = "Bad")
#> Name Good
#> 1 Barry Barry W
#> 2 H Harry P
#> 3 Terry Terry T
#> 4 Tom Thomas T
In the last step I've created a tibble of every combination of 'good' name and 'bad name', then I merged the 'bad' names with the 'misspelled' names.
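For completeness, the same lookup can be sketched in base R with a named character vector. chartr() translates single characters one by one, which is why it applies every change to every string; matching whole names avoids that. The Fixed column and the lookup vector are names I made up for this sketch.

```r
BadSpellings <- list("Thomas T" = c("Tom", "Thom", "Tommy"),
                     "Harry P" = c("Herry", "H", "Wizard"),
                     "Terry T" = c("Terry", "Trry"),
                     "Barry W" = c("Other Barry", "Barry"))

# Named vector: names are the misspellings, values are the correct names.
lookup <- setNames(rep(names(BadSpellings), lengths(BadSpellings)),
                   unlist(BadSpellings, use.names = FALSE))

Data <- data.frame(Name = c("Tom", "H", "Terry", "Barry"),
                   stringsAsFactors = FALSE)

# Index by whole name; anything not in the lookup would come back as NA.
Data$Fixed <- unname(lookup[Data$Name])
Data
```

Unlike merge(), this keeps the original row order of Data.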
