I'm noticing some odd behavior with R regex quantifiers written as either {min, max} (as recommended in the stringr cheatsheet) vs. as {min - max}, when using the pointblank package. I expect the regexes to work with {min, max} and fail with {min - max}. However, in the two examples below, one works with {min, max} and one works with {min - max}.
Example 1 works as expected: pattern_comma works and pattern_dash does not. But example 2 works unexpectedly: doi_pattern_comma does not work and doi_pattern_dash does work.
Any suggestions about this regex? Or might this be a bug in pointblank (in which case I can open an issue there)?
Thank you, SO community!
library(dplyr)
library(stringr)
library(pointblank)
# EXAMPLE 1
df1 <- tibble(x = c("123", "68"))
pattern_comma <- "^\\d{1,3}$"
pattern_dash <- "^\\d{1-3}$"
stringr::str_detect(df1$x, pattern_comma) #pass
#> [1] TRUE TRUE
stringr::str_detect(df1$x, pattern_dash) #fail
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^\d{1-3}$`)
#pass
df1 %>%
pointblank::col_vals_regex(
vars(x),
pattern_comma
)
#> # A tibble: 2 x 1
#> x
#> <chr>
#> 1 123
#> 2 68
#fail
df1 %>%
pointblank::col_vals_regex(
vars(x),
pattern_dash
)
#> Error: Exceedance of failed test units where values in `x` should have matched the regular expression: `^\d{1-3}$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
# EXAMPLE 2
df2 <- tibble(doi = c("10.1186/s12872-020-01551-9", "10.1002/cpp.1968"))
doi_pattern_comma <- "^10\\.\\d{4,9}/[-.;()/:\\w\\d]+$"
doi_pattern_dash <- "^10\\.\\d{4-9}/[-.;()/:\\w\\d]+$"
stringr::str_detect(df2$doi, doi_pattern_comma) #pass
#> [1] TRUE TRUE
stringr::str_detect(df2$doi, doi_pattern_dash) #fail
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^10\.\d{4-9}/[-.;()/:\w\d]+$`)
#fail
df2 %>%
col_vals_regex(
vars(doi),
doi_pattern_comma
)
#> Error: Exceedance of failed test units where values in `doi` should have matched the regular expression: `^10\.\d{4,9}/[-.;()/:\w\d]+$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
#pass
df2 %>%
col_vals_regex(
vars(doi),
doi_pattern_dash
)
#> # A tibble: 2 x 1
#> doi
#> <chr>
#> 1 10.1186/s12872-020-01551-9
#> 2 10.1002/cpp.1968
Created on 2021-05-09 by the reprex package (v0.3.0)
There is no room for doubt: a {min-max} quantifier does not exist; you need to use {min,max}. \d{4-9} throws an error (try it with sub and you will get invalid regular expression '\d{4-9}', reason 'Invalid contents of {}').
Next, the second issue is that the regex is parsed with the default TRE regex engine, and you can't use shorthand character classes like \w or \W inside bracket expressions there, so you need to use [:alnum:]_ instead of \w inside square brackets.
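A minimal sketch of that difference, using base R's grepl() on one of the DOIs from the question. It assumes that the pointblank check behaves like grepl() with the default (TRE) engine, as described above; the variable names are made up for illustration.

```r
doi <- "10.1002/cpp.1968"

# TRE (default engine): \w inside [...] is not treated as a shorthand class,
# so the letters in "cpp" are not covered by the bracket expression.
tre_shorthand <- grepl("^10\\.\\d{4,9}/[-.;()/:\\w]+$", doi)

# TRE with a POSIX class inside the bracket expression instead of \w.
tre_posix <- grepl("^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$", doi)

# PCRE (perl = TRUE) does understand \w inside [...].
pcre_shorthand <- grepl("^10\\.\\d{4,9}/[-.;()/:\\w]+$", doi, perl = TRUE)

print(c(
  tre_shorthand = tre_shorthand,
  tre_posix = tre_posix,
  pcre_shorthand = pcre_shorthand
))
```

Only the first call should come back FALSE, which matches the FALSE that the \\w pattern produces in pointblank.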
Now that you know the right regex:
"^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$"
you can dive deeper.
You can see what results you get if you use test_col_vals_regex:
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$")
[1] TRUE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:[:alnum:]_]+$")
[1] NA
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:\\w]+$")
[1] FALSE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:\\w]+$")
[1] NA
So, in every case where the regex is malformed, the test returns NA; the validation for those items is skipped, and they end up passing.
CONCLUSION: Always test your regex patterns for validity before using them in col_vals_regex.
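One way to follow that advice is a small validity check run before the pattern is handed to col_vals_regex. This is only a sketch: is_valid_regex is a hypothetical helper (not part of pointblank or stringr), and it assumes the pattern will be compiled by the default TRE engine, so it compiles it with base grepl() and catches any error or warning.

```r
# Hypothetical helper: TRUE if `pattern` compiles with the default (TRE) engine.
is_valid_regex <- function(pattern) {
  tryCatch(
    {
      grepl(pattern, "")  # the result is irrelevant; we only care that it compiles
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
}

is_valid_regex("^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$")  # well-formed quantifier
is_valid_regex("^10\\.\\d{4-9}/[-.;()/:[:alnum:]_]+$")  # malformed {4-9}
```

Note that this only catches syntax errors. A pattern such as the \\w-in-brackets one is syntactically valid and still has to be checked against known-good inputs.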
Let's imagine that I have a class "my" and I want to trigger certain behaviour when it is added to an object that has units (i.e. from units package):
library(units)
my1 = structure(2, class="my")
Ops.my <- function(e1, e2=NULL) {
ok <-
switch(
.Generic,
`-` = ,
`*` = ,
`+` = ,
'<=' = TRUE,
FALSE
)
if (!ok) {
stop(gettextf("%s not meaningful", sQuote(.Generic)))
}
get(.Generic)(as.integer(e1), as.integer(e2))
}
my1+set_units(5,nm)
Currently, it gives me the following warning:
Warning message:
Incompatible methods ("Ops.my", "Ops.units") for "+"
But I actually want to handle the addition of "my" and "units" objects in a specific way. How do I do it?
I tried with something like Ops.my.units <- but it doesn't seem to work.
There doesn't seem to be a way to do this with Ops. From the docs:
The classes of both arguments are considered in dispatching any member of this group. For each argument its vector of classes is examined to see if there is a matching specific (preferred) or Ops method. If a method is found for just one argument or the same method is found for both, it is used. If different methods are found, there is a warning about ‘incompatible methods’
This is probably a good thing. Part of the benefit of an object-oriented system in a non-compiled language like R is that it helps preserve type safety. This stops you from accidentally adding apples to oranges, as we can see in the following example:
apples <- structure(2, class = "apples")
oranges <- structure(2, class = "oranges")
Ops.apples <- function(e1, e2) {
value <- do.call(.Generic, list(as.integer(e1), as.integer(e2)))
class(value) <- "apples"
value
}
Ops.oranges <- function(e1, e2) {
value <- do.call(.Generic, list(as.integer(e1), as.integer(e2)))
class(value) <- "oranges"
value
}
apples + apples
#> [1] 4
#> attr(,"class")
#> [1] "apples"
oranges + oranges
#> [1] 4
#> attr(,"class")
#> [1] "oranges"
apples + oranges
#> [1] 4
#> attr(,"class")
#> [1] "apples"
#> Warning message:
#> Incompatible methods ("Ops.apples", "Ops.oranges") for "+"
You can see that even here, we could just ignore the warning.
suppressWarnings(apples + oranges)
#> [1] 4
#> attr(,"class")
#> [1] "apples"
But hopefully you can see why this may not be good - we have added 2 apples and 2 oranges, and have returned 4 apples.
Throughout R and its extension packages, there are numerous type-conversion functions such as as.integer, as.numeric, as.logical, as.character, as.difftime etc. These allow for some element of control when converting between types and performing operations on different types.
The "right" way to do this kind of thing is to explicitly convert one of the object types to the other in order to perform the operation:
as.my <- function(x) UseMethod("as.my")
as.my.default <- function(x) {
value <- as.integer(x)
class(value) <- 'my'
value
}
my1 + as.my(set_units(5,nm))
#> [1] 7
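The same conversion pattern can be sketched with the apples and oranges from above, using only base R. The as.apples generic and its methods are hypothetical names, and the one-to-one conversion is an assumption made purely for illustration; the point is that the conversion is written down explicitly instead of happening behind a suppressed warning.

```r
apples <- structure(2, class = "apples")
oranges <- structure(2, class = "oranges")

Ops.apples <- function(e1, e2) {
  value <- do.call(.Generic, list(unclass(e1), unclass(e2)))
  class(value) <- "apples"
  value
}

# Hypothetical conversion generic, analogous to as.my() above.
as.apples <- function(x, ...) UseMethod("as.apples")
as.apples.apples <- function(x, ...) x
# Assumption for illustration: one orange converts to exactly one apple.
as.apples.oranges <- function(x, ...) structure(unclass(x), class = "apples")

# Both operands are now "apples", so Ops.apples is the only method involved.
result <- apples + as.apples(oranges)
print(result)
```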
I am new to {testthat} and am building tests for a function that modifies strings and is expected to produce a specific output for certain input patterns.
As an example (reprex below), add_excitement adds an exclamation point to its input string. When given an input of "hello" it should return "hello!"; when given any other input it should not return "hello!". I would like to test this behavior with {testthat} on a series of patterns and return informative errors which specify which pattern caused the error.
Based on {testthat} package documentation, I believe I should use expect_match. However, this throws an "invalid argument type" error, whereas expect_identical works. I don't understand why this is happening. My questions are:
Why does expect_identical accept the quasi_label argument while expect_match does not?
Could I use expect_identical rather than expect_match for my purposes, or does this risk other errors?
Here is a reprex:
library(testthat)
library(purrr)
patterns = c("hello", "goodbye", "cheers")
add_excitement <- function(pattern) paste0(pattern, "!")
# For a single pattern
show_failure(expect_identical(add_excitement(!!patterns[2]), "hello!"))
#> Failed expectation:
#> add_excitement("goodbye") not identical to "hello!".
#> 1/1 mismatches
#> x[1]: "goodbye!"
#> y[1]: "hello!"
try(
show_failure(expect_match(add_excitement(!!patterns[2]), "hello!", fixed = TRUE,all = TRUE))
)
#> Error in !patterns[2] : invalid argument type
# For multiple patterns
purrr::map(
patterns,
~ show_failure(expect_identical(add_excitement(!!.), "hello!"))
)
#> Failed expectation:
#> add_excitement("goodbye") not identical to "hello!".
#> 1/1 mismatches
#> x[1]: "goodbye!"
#> y[1]: "hello!"
#> Failed expectation:
#> add_excitement("cheers") not identical to "hello!".
#> 1/1 mismatches
#> x[1]: "cheers!"
#> y[1]: "hello!"
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
try(
purrr::map(
patterns,
~ show_failure(expect_match(add_excitement(!!.), "hello!",
fixed = TRUE, all = TRUE)
)
)
)
#> Error in !. : invalid argument type
Created on 2021-02-04 by the reprex package (v0.3.0)
Thank you for your help!
I was able to solve this problem following https://r-pkgs.org/tests.html#building-your-own-testing-tools, which uses non-standard evaluation bquote() and eval() (instead of quasi_label()) to produce the more informative errors.
library(testthat)
library(purrr)
patterns = c("hello", "goodbye", "cheers")
add_excitement <- function(pattern) paste0(pattern, "!")
show_failure(eval(bquote(expect_match(add_excitement(.(patterns[2])), "hello!", fixed = TRUE, all = TRUE))))
#> Failed expectation:
#> add_excitement("goodbye") does not match "hello!".
#> Actual value: "goodbye!"
purrr::walk(
patterns,
~ show_failure(eval(bquote(expect_match(add_excitement(.(.)), "hello!", fixed = TRUE, all = TRUE))))
)
#> Failed expectation:
#> add_excitement("goodbye") does not match "hello!".
#> Actual value: "goodbye!"
#> Failed expectation:
#> add_excitement("cheers") does not match "hello!".
#> Actual value: "cheers!"
Created on 2021-02-04 by the reprex package (v0.3.0)
Or for a tidy version:
library(testthat)
library(purrr)
patterns = c("hello", "goodbye", "cheers")
add_excitement <- function(pattern) paste0(pattern, "!")
expect_hello <- function(pattern) {
show_failure(eval(bquote(expect_match(add_excitement(.(pattern)), "hello!", fixed = TRUE, all = TRUE))))
}
expect_hello(patterns[2])
#> Failed expectation:
#> add_excitement("goodbye") does not match "hello!".
#> Actual value: "goodbye!"
walk(patterns, expect_hello)
#> Failed expectation:
#> add_excitement("goodbye") does not match "hello!".
#> Actual value: "goodbye!"
#> Failed expectation:
#> add_excitement("cheers") does not match "hello!".
#> Actual value: "cheers!"
Created on 2021-02-04 by the reprex package (v0.3.0)
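To see why the bquote() approach above puts the actual value into the failure message, it may help to look at the calls it builds, without testthat involved. A small base R sketch (the variable names are made up for illustration):

```r
add_excitement <- function(pattern) paste0(pattern, "!")
pattern <- "goodbye"

# quote() keeps the symbol, so a label built from this call reads `pattern`.
unevaluated <- quote(add_excitement(pattern))

# bquote() replaces .(pattern) with its current value before anything is run.
inlined <- bquote(add_excitement(.(pattern)))

print(deparse(unevaluated))
print(deparse(inlined))

# eval() then runs the call that already contains the literal value.
print(eval(inlined))
```

The deparsed form of the second call contains the literal "goodbye", which is the text that ends up in the expectation's label.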
The error is linked to the use of the tidyeval bang-bang (!!) operator.
It is superseded, and the example you provided works without it:
show_failure(expect_identical(add_excitement(patterns[2]), "hello!"))
# Failed expectation:
# add_excitement(patterns[2]) not identical to "hello!".
# 1/1 mismatches
# x[1]: "goodbye!"
# y[1]: "hello!"
show_failure(expect_match(add_excitement(patterns[2]), "hello!", fixed = TRUE,all = TRUE))
# Failed expectation:
# add_excitement(patterns[2]) does not match "hello!".
# Actual value: "goodbye!"
purrr::map(
patterns,
~ show_failure(expect_match(add_excitement(.x), "hello!",
fixed = TRUE, all = TRUE)
)
)
# Failed expectation:
# add_excitement(.x) does not match "hello!".
# Actual value: "goodbye!"
# Failed expectation:
# add_excitement(.x) does not match "hello!".
# Actual value: "cheers!"
# [[1]]
# NULL
# [[2]]
# NULL
# [[3]]
# NULL
I know that, in general, \uxxxx sequences are not supported inside backticks. Do you have any workaround to include them (\uxxxx sequences) in column names?
To be specific, imagine calculating Body Mass Index and adding units to column names.
Start with
dt<-data.frame(
`Weight [kg]` = runif(5,50,100),
`Height [m]` = runif(5,1.5,2),
check.names=F
)
and mutate:
> dt2<-dt %>%
mutate(
`BMI [kg/m\u00b2]`= `Weight [kg]`/`Height [m]`^2
)
This produces an error: Error: \uxxxx sequences not supported inside backticks (line 3).
My workaround is like this:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
set_colnames(colnames(.) %>% str_replace('2\\]', '\u00b2\\]'))
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m²]"
It gives me exactly what I want but is not very elegant.
Surprisingly, a slightly clearer approach fails:
> dt2<-dt %>%
mutate(
`BMI [kg/m2]`= `Weight [kg]`/`Height [m]`^2
) %>%
rename_all(str_replace, '2\\]', '\u00b2\\]')
> colnames(dt2)
[1] "Weight [kg]" "Height [m]" "BMI [kg/m2]"
So, my question is: can it be done in a not-so-hacky way?
And:
yes, I'm sure I need \uxxxx sequences in column names;
yes, I use them later on graphs;
no, I don't want to replace them with expressions.
How about just using single quotes instead of backticks?
dt %>% mutate('BMI [kg/m\u00b2]' = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
Or double quotes?
dt %>% mutate("BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2)
#> Weight [kg] Height [m] BMI [kg/m²]
#> 1 67.68154 1.757490 21.91211
#> 2 72.32362 1.817616 21.89151
#> 3 89.28197 1.854459 25.96146
#> 4 52.14819 1.709520 17.84395
#> 5 83.48281 1.969367 21.52502
You can also use them to access items in your new data frame:
dt2$'BMI [kg/m\u00b2]'
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
dt2$"BMI [kg/m\u00b2]"
#> [1] 21.91211 21.89151 25.96146 17.84395 21.52502
Or did you specifically need to use backticks for some reason?
Argument names don't have to be in backticks, they can be regular quoted strings. So this works fine:
dt2<-dt %>%
mutate(
"BMI [kg/m\u00b2]" = `Weight [kg]`/`Height [m]`^2
)
It will be hard to refer to that column name in expressions in later code; you'll need to specify the column by number, or use an expression like dt2["BMI [kg/m\u00b2]"] (or dt2$"BMI [kg/m\u00b2]" as used by @AllanCameron in his answer). But it will print fine:
> dt2
Weight [kg] Height [m] BMI [kg/m²]
1 51.89918 1.825124 15.58029
2 80.74140 1.602126 31.45595
3 71.35380 1.974187 18.30799
4 64.44167 1.989202 16.28580
5 76.13564 1.886232 21.39922
Edited to add: It's also fine to use
`BMI [kg/m²]`
anywhere a column name can be used, you just can't encode the special char with \uxxxx.
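Since the \uxxxx escape is allowed in ordinary strings, one option is to keep the name in a variable and use that variable wherever the column is created or accessed. A minimal base R sketch with made-up data (bmi_col is a hypothetical variable name):

```r
dt <- data.frame(
  `Weight [kg]` = c(70, 85),
  `Height [m]` = c(1.75, 1.80),
  check.names = FALSE
)

# The escape is written once, inside a regular string.
bmi_col <- "BMI [kg/m\u00b2]"

# [[ ]] accepts the string for both creating and reading the column.
dt[[bmi_col]] <- dt[["Weight [kg]"]] / dt[["Height [m]"]]^2

print(names(dt))
print(dt[[bmi_col]])
```

With dplyr, the same variable should also be usable on the left-hand side of mutate() via `!!bmi_col := ...`, which avoids repeating the escaped string.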
I'm making a function that should be able to handle multiple classes for its first argument: formulas, characters, tidy-selection, var names... The goal is then to use tidyselection with tidyselect::vars_select, except with bare formulas.
The problem is that when I test the class of this argument, it will throw an error if the value is a name to be tidy-selected, since it will be considered as a not found object.
I found a workaround with tryCatch, which enquotes the first argument if its evaluation fails (and thus if it doesn't exist in this scope).
library(rlang)
foo=function(.vars){
.vars2=tryCatch(.vars, error=function(e) enquo(.vars))
print(class(.vars2))
print(class(.vars))
}
foo(Species)
# [1] "quosure" "formula"
# Error in print(class(.vars)) : object 'Species' not found
# In addition: Warning message:
# In print(class(.vars)) : restarting interrupted promise evaluation
foo(~Species)
# [1] "formula"
# [1] "formula"
foo(1)
# [1] "numeric"
# [1] "numeric"
foo("Species")
# [1] "character"
# [1] "character"
This doesn't seem clean to me, as I'm catching all errors without filtering on my specific case.
Is there a built-in function to test this, or a cleaner solution than this workaround?
I think the following is what you are trying to do (using here only base R).
foo=function(.vars) {
.vars2 = substitute(.vars)
if (is.symbol(.vars2)) class(.vars2) else class(.vars)
}
foo(Species)
#[1] "name"
foo(~Species)
#[1] "formula"
foo(1)
#[1] "numeric"
foo("Species")
#[1] "character"
I don't think that there is a function which lets you avoid a structured control flow along the different input types.
library(rlang)
library(tidyselect)
library(dplyr)
foo <- function(df, .vars){
en_vars <- enquo(.vars)
var_expr <- quo_get_expr(en_vars)
if (is.name(var_expr)){
vars_select(names(df), !! en_vars)
} else if (is_formula(var_expr)) {
vars_select(names(df), all.vars(.vars))
} else {
vars_select(names(df), .vars)
}
}
iris_tbl <- as_tibble(iris)
foo(iris_tbl, Species)
#> Species
#> "Species"
foo(iris_tbl, ~Species)
#> Species
#> "Species"
foo(iris_tbl, 1)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(.vars)` instead of `.vars` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> Sepal.Length
#> "Sepal.Length"
foo(iris_tbl, "Species")
#> Species
#> "Species"
Created on 2020-06-21 by the reprex package (v0.3.0)
I am having the weirdest bug with map_int from the purrr package.
# Works as expected
purrr::map_int(1:10, function(x) x)
#> [1] 1 2 3 4 5 6 7 8 9 10
# Why on earth is that not working?
purrr::map_int(1:10, function(x) 2*x)
#> Error: Can't coerce element 1 from a double to a integer
# or that?
purrr::map_int(1:10, round)
#> Error: Can't coerce element 1 from a double to a integer
Created on 2019-03-28 by the reprex package (v0.2.1)
I run R 3.5.2 in a rocker container (Debian) with the latest GitHub version of everything:
sessioninfo::package_info("purrr")
#> package * version date lib source
#> magrittr 1.5.0.9000 2019-03-28 [1] Github (tidyverse/magrittr@4104d6b)
#> purrr 0.3.2.9000 2019-03-28 [1] Github (tidyverse/purrr@25d84f7)
#> rlang 0.3.2.9000 2019-03-28 [1] Github (r-lib/rlang@9376215)
#>
#> [1] /usr/local/lib/R/site-library
#> [2] /usr/local/lib/R/library
2*x is not an integer because the literal 2 is a double, so the product is a double as well. Use this instead:
purrr::map_int(1:10, function(x) 2L*x)
The documentation from help(map) says
The output of .f will be automatically typed upwards, e.g. logical -> integer -> double -> character
It appears to be following the larger ordering given in help(c):
NULL < raw < logical < integer < double < complex < character < list < expression
For example, map_dbl(1:10, ~complex(real = .x, imaginary = 1)) produces an error.
As you can see in that ordering, double-to-integer is a downward conversion. So, the function is designed to not do this kind of conversion.
The solution is to either write a function .f which outputs integer (or lower) classed objects (as in @Stéphane Laurent's answer), or just use as.integer(map(.x, .f)).
This is a kind of type-checking, which can be a useful feature for preventing programming mistakes.
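The same rule can be reproduced without purrr. The sketch below uses base R's vapply() as a stand-in for map_int(); it is an analogy rather than purrr's actual implementation, but vapply() likewise refuses to turn a double result into an integer one.

```r
# Multiplying by the double literal 2 promotes the result to double.
print(typeof(2 * 1L))
# Multiplying by the integer literal 2L keeps it an integer.
print(typeof(2L * 1L))

# Integer results fit the declared integer(1) template.
doubled <- vapply(1:3, function(x) 2L * x, integer(1))
print(doubled)

# Double results do not; vapply() raises an error instead of coercing downwards.
failed <- tryCatch(
  vapply(1:3, function(x) 2 * x, integer(1)),
  error = function(e) conditionMessage(e)
)
print(failed)
```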