R: How to make parse() accept regular expressions with escaped characters?

I am trying to use the validate package to check whether certain columns in my data table match a regular expression.
I make a vector (fields) of the columns I want to test and then paste together the commands for the validator rules as a string.
To be able to use the rules in the confront() function, I use parse() and eval() to turn the character string into an expression.
The following example is working as expected:
library(validate)
library(purrr)  # map_chr() comes from purrr

data <- data.frame("Protocol_Number" = c("123", "A122"), "Numeric_Result" = c("-0.5", "1.44"))
fields <- c("Protocol_Number", "Numeric_Result")

# build validator commands for each field
cmds <- paste(
  "validator(",
  paste(
    map_chr(fields, function(x) paste0("grepl('^-?[0-9]', as.character(`", x, "`))")),
    collapse = ","
  ),
  ")"
)

# convert to rule and do the tests
rule <- eval(parse(text = cmds))
out <- confront(data, rule)
summary(out)
However, I want to use a regex that recognizes any sort of number as opposed to text, as in this working example:
grepl('^-?[0-9]\\d*(\\.\\d+)?$', c(1, -1, 0.5, "Not Done"))
When I try to use this regex in the above example, the parse() function will throw an error:
Error: '\d' is an unrecognized escape in character string starting "'^-?[0-9]\d"
This is not working:
# build validator commands for each field
cmds <- paste(
  "validator(",
  paste(
    map_chr(fields, function(x) paste0("grepl('^-?[0-9]\\d*(\\.\\d+)?$', as.character(`", x, "`))")),
    collapse = ","
  ),
  ")"
)
# convert to rule and do the tests
rule <- eval(parse(text = cmds))
out <- confront(data, rule)
summary(out)
How do I make parse() accept the escaped characters? Or is there a better way to do this?

We may escape each backslash with another backslash, so that the parsed string still contains \\d:
cmds <- paste(
  "validator(",
  paste(
    map_chr(fields, function(x)
      paste0("grepl('^-?[0-9]\\\\d*(\\\\.\\\\d+)?$', as.character(`", x, "`))")),
    collapse = ","
  ),
  ")"
)
Testing:
> rule <- eval(parse(text = cmds))
>
> out <- confront(data, rule)
> out
Object of class 'validation'
Call:
confront(dat = data, x = rule)
Rules confronted: 2
With fails : 1
With missings: 0
Threw warning: 0
Threw error : 0
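An alternative is a minimal sketch using raw strings (assuming R >= 4.0), which avoids counting escape levels entirely: the raw literal passes \\d through to the generated code untouched, so parse() sees a valid string literal.
library(validate)
library(purrr)

data <- data.frame("Protocol_Number" = c("123", "A122"), "Numeric_Result" = c("-0.5", "1.44"))
fields <- c("Protocol_Number", "Numeric_Result")

cmds <- paste(
  "validator(",
  paste(
    map_chr(fields, function(x)
      # r"{...}" keeps \\d literal, so no quadruple backslashes are needed
      paste0(r"{grepl('^-?[0-9]\\d*(\\.\\d+)?$', as.character(`}", x, "`))")),
    collapse = ","
  ),
  ")"
)
rule <- eval(parse(text = cmds))
summary(confront(data, rule))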

Related

In R, use string just as if I had typed it in

I am dealing with one aspect of R that really confuses me. What I have built is a line of code invoking str_detect, saved as a string. If I copy-paste that string into where I want to use this line of code, it works perfectly as intended.
However, I cannot get R to interpret this code correctly. I have tried using e.g. parse(), but the escape characters intended for the stringr regular expressions throw up errors.
Is there not a simple way to just treat a string as if it were a line of typed code?
Here is my reproducible example:
Make toy data:
maf_list_context <- list(
  as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
  as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either")))
)
maf_list_context <- lapply(maf_list_context, function(x) {
  colnames(x) <- c("CONTEXT", "want_it")
  return(x)
})
The idea is that context will be an argument to a function and that it can be flexible, so the user can supply any number of contexts of interest separated by commas. These will be stringr regular expressions designed to look for particular contexts in DNA within a string of 11 bases. Here for example we can use two contexts of interest. The code that follows combines these to make an expression for use later in selecting the appropriate rows from the dataframes in the list.
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for (i in 1:length(contextvec)) {
  contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
'contextexpression' is now:
[1] "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')"
If I were to paste this expression directly into lapply, it works precisely as I would want it.
> lapply(maf_list_context, function(x){
+
+ x[str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
+
+ })
[[1]]
CONTEXT want_it
1 ATTATCGAATT this one
[[2]]
CONTEXT want_it
1 ATTACGTAATT this one too
But of course if I use the string there, it does not.
> lapply(maf_list_context, function(x){
+
+ x[contextexpression, ]
+
+ })
[[1]]
CONTEXT want_it
NA <NA> <NA>
[[2]]
CONTEXT want_it
NA <NA> <NA>
I have tried many different functions but none of them makes this work. Is there a way of having R interpret this string as if I had typed it in directly?
The whole reprex:
if (!require("stringr")) {
  install.packages("stringr", dependencies = TRUE)
  library("stringr")
}
maf_list_context <- list(
  as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
  as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either")))
)
maf_list_context <- lapply(maf_list_context, function(x) {
  colnames(x) <- c("CONTEXT", "want_it")
  return(x)
})
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for (i in 1:length(contextvec)) {
  contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
maf_list_select <- lapply(maf_list_context, function(x) {
  x[contextexpression, ]
})
I'm not sure I completely follow what you want your input to be and how to apply it, but your problem seems to be with what you're passing to the subset operator, i.e. x[<codehere>]
The subset operator expects a logical vector. When you "paste the expression" you are actually pasting an expression that gets evaluated to a logical vector, hence it properly subsets. When you pass the variable contextexpression, you are actually passing a string. As R sees it:
x[ "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')", ]
Instead of (notice the syntax highlighting difference):
x[ str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
You want to apply each context to each member of the list to get a logical vector and then subset.
purrr::map2(maf_list_context, contextvec, ~.x[str_detect(.x$CONTEXT, .y), ])
If you want to compare every item in contextvec to every item in maf_list_context, then it's a little more complicated but doable.
purrr::map2(
  maf_list_context,
  purrr::map(
    maf_list_context,
    function(data) {
      purrr::reduce(
        contextvec,
        function(prev, cond) str_detect(data$CONTEXT, cond) | prev,
        .init = logical(nrow(data))  # all-FALSE accumulator, one element per row
      )
    }
  ),
  ~.x[.y, ]
)
There's probably a more efficient way to short-circuit the matching against the items in maf_list_context, but the general approach applies. The str_detect handles the comparison of a single condition against a single maf_list item. The reduce call combines the results of all the comparisons of contextvec against a single item of maf_list_context into a single logical vector. The inner map iterates through maf_list_context. The outer map2 iterates through the list of logical vectors created by the inner map, together with maf_list_context, to subset for matches.
If maf_list_context has n items and contextvec has m items:
reduce makes m comparisons, resulting in 1 logical vector
map makes n calls to reduce, resulting in n logical vectors
map2 makes n iterations to subset maf_list_context
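For completeness, here is a minimal sketch of the literal "evaluate the string" route the question asks about, reusing contextvec and maf_list_context from the reprex. The trick is to keep the patterns out of the parsed text (inlining them is exactly what makes parse() choke on the backslashes) and reference them by index instead:
library(stringr)
# no backslashes ever reach parse(), because the patterns stay inside contextvec
contextexpression <- paste0(
  "str_detect(x$CONTEXT, pattern = contextvec[", seq_along(contextvec), "])",
  collapse = " | "
)
maf_list_select <- lapply(maf_list_context, function(x) {
  x[eval(parse(text = contextexpression)), ]  # x is bound inside the function
})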

How to turn the Web of Science advanced query into regular expression in R?

To do advanced search in Web of Science, we could use query like:
TI = ("ecology" AND ("climate change" OR "biodiversity"))
This means we want to extract papers with titles containing "ecology" and ("climate change" or "biodiversity"). The corresponding regular expression would be (here TI is a character vector of titles):
library(stringr)
str_detect(TI,"ecology") & str_detect(TI,"climate change|biodiversity")
Is there any way to get the regular expression from the WoS query?
1) First we need to define the question more precisely. We assume that a WoS query is a character string containing AND, OR, NOT, parentheses, and fixed character strings in lower or mixed case, possibly surrounded by double quotes (this excludes upper-case AND or OR appearing within double quotes unless part of a longer string). We assume that we wish to generate a character string holding an R statement containing str_detect instances such as the one shown in the question; it need not be identical to that example as long as it satisfies the above.
For AND, OR and NOT we just replace them with the operators &, | and & !. We then replace each instance of a word character followed by spaces followed by word character with the same except the spaces are replaced with an underscore. We then replace any string of word characters that is not quoted with that string surrounded by quotes and finally we revert the underscores to spaces.
If s is the resulting string then eval(parse(text = s)[[1]]) could be used to evaluate it against target.
wos2stmt does not use any packages but the generated statement depends on stringr due to the use of str_detect for consistency with the question.
wos2stmt <- function(TI, target = "target") {
  TI |>
    gsub(pattern = "\\bNOT\\b", replacement = "& !") |>
    gsub(pattern = "\\bAND\\b", replacement = "&") |>
    gsub(pattern = "\\bOR\\b", replacement = "|") |>
    gsub(pattern = "(\\w) +(\\w)", replacement = "\\1_\\2") |>
    gsub(pattern = '(?<!")\\b(\\w+)\\b(?!")', replacement = '"\\1"', perl = TRUE) |>
    gsub(pattern = "_", replacement = " ") |>
    gsub(pattern = '("[^"]+")', replacement = sprintf("str_detect(%s, \\1)", target)) |>
    gsub(pattern = '"(\\w)', replacement = r"{"\\\\b\1}") |>
    gsub(pattern = '(\\w)"', replacement = r"{\1\\\\b"}")
}
# test
TI <- '"ecology" AND ("climate change" OR "biodiversity")'
stmt <- wos2stmt(TI)
giving:
cat(stmt, "\n")
## str_detect(target, "\\becology\\b") & (str_detect(target, "\\bclimate change\\b") | str_detect(target, "\\bbiodiversity\\b"))
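A brief usage sketch: bind a character vector of titles to the name target (the default in wos2stmt) and evaluate the generated statement as suggested above; the expected result is shown as a comment.
library(stringr)
target <- c("ecology and climate change", "marine biodiversity", "the ecology of biodiversity")
eval(parse(text = stmt)[[1]])
## [1]  TRUE FALSE  TRUE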
2) The question body refers to generating R statements with str_detect, but the subject line refers to generating regular expressions. In the latter case we accept a WoS query and output a single regular expression for use with str_detect, like this. I haven't tested this out much, so you will need to explore its limitations.
Note that unlike (1), this addresses only the original question as posed, which we take to exclude NOT handling and automatic quoting (neither is mentioned in the question as a requirement).
wos2rx <- function(TI) {
  TI |>
    gsub(pattern = ' *\\bOR\\b *', replacement = '|') |>
    gsub(pattern = ' *\\bAND\\b *', replacement = '') |>
    gsub(pattern = ' *"([^"]+)" *', replacement = '(?=.*\\1)')
}
# test
library(stringr)
TI <- '("ecology" AND ("climate change" OR "biodiversity"))'
rx <- wos2rx(TI)
str_detect("biodiversity ecology", rx)
## [1] TRUE
str_detect("climate change biodiversity", rx)
## [1] FALSE

Replace multiple strings comprising of a different number of characters with one gsubfn()

Here Replace multiple strings in one gsub() or chartr() statement in R? it is explained how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
library(gsubfn)
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters, while the metacharacter . that I pass to gsubfn matches exactly one character. I must confess I find the use of metacharacters sometimes a bit difficult to understand. So, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which searches for either "doremi" or "-"; "|" is the alternation (or) operator.
A more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"

Avoiding backtick characters with dplyr

How can I write the argument of select without backtick characters? I would like to do this so that I can pass in this argument from a variable as a character string.
df <- dat[["__Table"]] %>% select(`__ID` ) %>% mutate(fk_table = "__Table", val = 1)
Changing the argument of select to "__ID" gives this error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* "__ID"
Unfortunately, the _ characters in column names cannot be avoided since the data is downloaded from a relational database (FileMaker) via ODBC and needs to be written back to the database while preserving the column names.
Ideally, I would like to be able to do the following:
colName <- "__ID"
df <- dat[["__Table"]] %>% select(colName) %>% mutate(fk_table = "__Table", val = 1)
I've also tried eval(parse()):
df <- dat[["__Table"]] %>% select( eval(parse(text="__ID")) ) %>% mutate(fk_table = "__Table", val = 1)
It throws this error:
Error in parse(text = "__ID") : <text>:1:1: unexpected input
1: _
^
By the way, the following does work, but then I'm back to square one (still with backtick symbol).
eval(parse(text = "`__ID`"))
References about backtick characters in R:
Removing backticks in R output
What do backticks do in R?
R encoding ASCII backtick
You can use as.name() with select_():
library(dplyr)
colName <- "__ID"
df <- data.frame(`__ID` = c(1,2,3), `123` = c(4,5,6), check.names = FALSE)
select_(df, as.name(colName))
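Note that select_() has since been deprecated; with current dplyr the same thing can be written with the tidyselect helper all_of(), a minimal sketch:
library(dplyr)
colName <- "__ID"
df <- data.frame(`__ID` = c(1, 2, 3), `123` = c(4, 5, 6), check.names = FALSE)
select(df, all_of(colName))  # takes a character vector of names, no backticks needed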

Generalizing a function to return a list of data.frame columns with invalid UTF-8 bytes/code points

I recently wrote a function that uses grep and regex to find invalid UTF-8 code points (since I work on a Mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as it is looking for invalid UTF-8 bytes. I wrote the function for work, and was wondering if anyone could provide tips for generalizing it or catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
  library(dplyr)
  library(magrittr)
  # Create vector of all possible 2-digit hexadecimal numbers (2 digits is the length of a byte)
  allBytes <- list(x_esc = '\\x',
                   hex1 = as.character(c(seq(0, 9), c('a', 'b', 'c', 'd', 'e', 'f'))),
                   hex2 = as.character(c(seq(0, 9), c('a', 'b', 'c', 'd', 'e', 'f')))) %$%
    expand.grid(x_esc, hex1, hex2) %>%
    apply(1, paste, collapse = '')
  # Valid mixed alphanumeric bytes
  validBytes1 <- list(x_esc = '\\x',
                      hexNum = as.character(seq(2, 7)),
                      hexAlpha = c('a', 'b', 'c', 'd', 'e', 'f')) %$%
    expand.grid(x_esc, hexNum, hexAlpha) %>%
    apply(1, paste, collapse = '') %>%
    extract(. != '\\x7f')
  # Valid purely numeric bytes
  validBytes2 <- list(x_esc = '\\x',
                      hexNum2 = as.character(seq(20, 79))) %$%
    expand.grid(x_esc, hexNum2) %>%
    apply(1, paste, collapse = '')
  # New-line byte
  validBytes3 <- '\\x0a'
  # charToRaw('\n')
  # [1] 0a
  # Filter all possible combinations down to only invalid bytes
  validBytes <- c(validBytes1, validBytes2, validBytes3)
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  # Create list of data.frame columns with invalid bytes
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(data)[2]) {
    a_vector <- data[, i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  }
  # Get rid of empty list elements
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  # matches <- matches[lapply(matches, length) > 0]
  return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
  library(dplyr)
  library(magrittr)
  rASCII <- c('\n', '\r', '\t', '\b',
              '\a', '\f', '\v', '\\', '\'', '\"', '\`')
  allBytes <- paste0("\\x", as.character(as.hexmode(0:255)))  # all 256 one-byte patterns
  validBytes <- paste0("\\x",
                       c(as.character(as.hexmode(32:126)),
                         sapply(rASCII, charToRaw))) %>%
    extract(not(duplicated(.)))
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(dataset)[2]) {
    a_vector <- dataset[, i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  } # sapply() is preferable to lapply due to USE.NAMES = TRUE
  names(matches) <- names(dataset)
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  return(matches)
}
2nd Edit: A better strategy was to use iconv. Let's say you have a file or object with some invalid bytes but that is generally UTF-8. This is often the case with Mac computers, whose default locale setting seems to be UTF-8. Moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding. Anyway, you can use iconv to substitute all invalid bytes, normally displayed as hexadecimal bytes (e.g. "\x8f"), with the Unicode replacement symbol. Then you can search for that symbol, return a list of unique observations within a data.frame column containing it, and use sub() to replace those characters with the desired ones.

One thing to note is that converting the file to another encoding, say latin-1, can have unexpected results if invalid bytes are present. When I did this, I noticed that some invalid bytes were converted to Unicode control characters, while other invalid bytes apparently matched valid latin-1 bytes and were displayed as nonsensical characters.

In either case, I wrote a package to search data.frames for these characters and return a list, then do some replacement. The package isn't nearly as official as something off of CRAN, but if anyone's interested, here's a link to the repository: https://github.com/jkroes/FixEncoding. It's important to note that the "stable" version of the package isn't on the "master" branch; it's actually on the branch "iconv". The documentation can be searched in R via "?FixEncoding" after installation of the correct branch, then finding the functions listed there and searching help for those.
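A minimal sketch of that iconv strategy (assuming a UTF-8 locale): sub = "byte" marks each invalid byte in angle-bracket hex notation so it can then be searched for and replaced.
x <- c("ok", "bad\xfftext")
validUTF8(x)  # base R check for invalid UTF-8
## [1]  TRUE FALSE
iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")  # invalid bytes become "<ff>"
## [1] "ok"          "bad<ff>text"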
This will construct all the two-character hex versions of the numbers up to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or as greppish patterns as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
The "special" characters that R recognizes does include the "\n" that you lissted but also a few more listed on the ?Quotes help page:
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for the printable characters from space up to tilde ("~") just with this:
validBytes1 <- c(rASCII, paste0("\\x", as.character(as.hexmode(32:126))))
I have concerns about using this strategy, since my R throws errors when it tries to do greppish matches on what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
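A related note (untested beyond this sketch): passing useBytes = TRUE, as the question's own loop already does, should also sidestep the locale check for plain grep:
grep("\xff", txt, useBytes = TRUE)  # byte-wise matching, no locale validation of the pattern
## [1] 1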
Or you may use ?tools::showNonASCII, as suggested by Duncan Murdoch when this came up on R-help 4 years ago:
?tools::showNonASCII
# and the help page has a reproducible example of its use:
out <- c(
"fa\xE7ile test of showNonASCII():",
"\\details{",
" This is a good line",
" This has an \xfcmlaut in it.",
" OK again.",
"}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.
