Please consider the body of read.table as a text file, created with the following code:
sink("readTable.txt")
body(read.table)
sink()
Using regular expressions, I'd like to find all function calls of the form foo(a, b, c) (but with any number of arguments) in "readTable.txt". That is, I'd like the result to contain the names of all called functions in the body of read.table. This includes nested calls of the form foo(a, bar(b, c)). Reserved words (return, for, etc.) and functions called via back-ticks (`==`(), `+`(), etc.) can be included, since I can remove them later.
So in general, I'm looking for the pattern text( or text ( then possible nested functions like text1(text2(, but skipping over the text if it's an argument, and not a function. Here's where I'm at so far. It's close, but not quite there.
x <- readLines("readTable.txt")
regx <- "^(([[:print:]]*)\\(+.*\\))"
mat <- regexpr(regx, x)
lines <- regmatches(x, mat)
fns <- gsub(".*( |(=|(<-)))", "", lines)
head(fns, 10)
# [1] "default.stringsAsFactors()" "!missing(text))"
# [3] "\"UTF-8\")" "on.exit(close(file))" "(is.character(file))"
# [6] "(nzchar(fileEncoding))" "fileEncoding)" "\"rt\")"
# [9] "on.exit(close(file))" "\"connection\"))"
For example, in [9] above, the calls are there, but I do not want file in the result. Ideally it would be on.exit(close(.
How can I go about improving this regular expression?
If you've ever tried to parse HTML with a regular expression, you know what a nightmare it can be. It's always better to use a proper HTML parser and extract the information that way. I feel the same way about R code. The beauty of R is that it's functional and you can inspect any function via code.
Something like
call.ignore <- c("[[", "[", "&", "&&", "|", "||", "==", "!=",
                 "-", "+", "*", "/", "!", ">", "<", ":")
find.funcs <- function(f, descend = FALSE) {
  if (is.function(f)) {
    # Start from the function's body
    return(find.funcs(body(f), descend = descend))
  } else if (is(f, "name") || is.atomic(f)) {
    # Bare symbols and constants contain no calls
    return(character(0))
  }
  v <- list()
  if (is(f, "call") && !(deparse(f[[1]]) %in% call.ignore)) {
    v[[1]] <- deparse(f)
    if (!descend) return(v[[1]])
  }
  # Recurse into the components of the call/expression
  v <- append(v, lapply(as.list(f), find.funcs, descend = descend))
  unname(do.call(c, v))
}
could work. Here we iterate over each object in the function looking for calls, ignoring those you don't care about. You would run it on a function like
find.funcs(read.table)
# [1] "default.stringsAsFactors()"
# [2] "missing(file)"
# [3] "missing(text)"
# [4] "textConnection(text, encoding = \"UTF-8\")"
# [5] "on.exit(close(file))"
# [6] "is.character(file)"
# ...
You can set the descend= parameter to TRUE if you want to look inside calls for calls to other functions.
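For instance, with descend = TRUE the result also includes calls nested inside the arguments of other calls, e.g. the inner close(file) within on.exit(close(file)) (a quick sketch of the call, output omitted):
find.funcs(read.table, descend = TRUE)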
I'm sure there are plenty of packages that make this easier, but I just wanted to show how simple it really is.
Recursive Regex in Perl Mode
In the general case, I am sure you're aware of the hazards of trying to match such constructions: what if your file contains things like if() that you don't want to match?
That being said, I believe this recursive regex fits the requirements as I understand them:
[a-z]+(\((?:`[()]|[^()]|(?1))*\))
See demo.
I'm not completely up to speed on R syntax, but something like this should work, and you can tweak the function name and arguments to suit your needs:
grepl("[a-z]+(\\((?:`[()]|[^()]|(?1))*\\))", subject, perl=TRUE);
Explanation
[a-z]+ matches the letters before the opening parenthesis
( starts Group 1
\( matches an opening parenthesis
(?: starts a non-capture group that will be repeated. The group matches several possibilities:
`[()] matches a backtick followed by ( or )
|[^()] OR match one character that is not a parenthesis
|(?1) OR match the pattern defined by the Group 1 parentheses (recurse)
)* close non-capture group, repeat zero or more times
\) matches a closing parenthesis
) ends Group 1
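To actually extract the matched calls rather than just test for a match, the same pattern can be fed to gregexpr() and regmatches() (a sketch, assuming x holds the lines of readTable.txt as in the question):
x <- readLines("readTable.txt")
# You may want to widen [a-z]+ to something like [a-zA-Z0-9._]+ to catch
# names such as default.stringsAsFactors
m <- gregexpr("[a-z]+(\\((?:`[()]|[^()]|(?1))*\\))", x, perl = TRUE)
head(unlist(regmatches(x, m)))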
Related
I have a sentence that may contain keywords. I search for them, and if one matches, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputting the keywords?
Final note: I do not completely understand the syntax used to get the surrounding words. Basic R books have not covered this. Can someone explain, please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
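To form the expression programmatically from the cont vector in the question, something like this sketch should work (the leading verbs are recovered by stripping the trailing " not" from each phrase):
verbs <- sub(" not$", "", cont)
pattern <- sprintf("\\b([a-z]+) +(?:%s) +not +([a-z]+)\\b",
                   paste(verbs, collapse = "|"))
regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
# [1] "production did not improve"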
I have read about prefix functions and infix functions on Hadley Wickham's Advanced R website. I would like to know if there is any way to define functions that are called by placing a prefix and suffix around a single argument, so that the prefix and suffix operate like brackets. Is there any way to create a function like this, and if so, how do you do it?
An example for formulation: suppose you have an object char that is a character string. You want to create a function that is called on a character string using the prefix _# and suffix #_, and the function adds five dashes to the front of the character string. If programmed successfully, it would operate as shown below.
char
[1] "Hello"
_#char#_
[1] "-----Hello"
There is a way to do this, as long as your special operator takes a particular form, namely . %_% char %_% . (with a dot at each end). This is because the parser will interpret the dot as a variable name. If we use non-standard evaluation, we don't need the dot to actually exist; we only need it as a marker for opening and closing our special operator. So we can do something like this:
`%_%` <- function(a, b) {
  # Exactly one side must be the dot marker
  if ((deparse(match.call()$a) != ".") +
      (deparse(match.call()$b) != ".") != 1)
    stop("Unrecognised SPECIAL")
  # Opening marker: tag the string and pass it along
  if (deparse(match.call()$a) == ".")
    return(`attr<-`(b, "prepped", TRUE))
  # Closing marker: a tagged string gets its dashes
  if (isTRUE(attr(a, "prepped")))
    return(paste0("-----", a))
  stop("Unrecognised SPECIAL")
}
.%_% "hello" %_%.
#> [1] "-----hello"
However, this is a weird thing to do in R. It's not idiomatic, and it uses more keystrokes than a simple function call would. It would also very likely cause unpredictable problems wherever non-standard evaluation is used. This is really just a demo to show that it can be done, not that it should be done.
Writing a simple function seems like a more R-like solution. If terseness is a priority, then maybe something like
._ <- function(x) paste0("-----", x)
._("hello")
# [1] "-----hello"
Or if you wanted something more bracket-like
.. <- structure(list(NULL), class="dasher")
`[.dasher` <- function(a, x) paste0("-----", x)
..["hello"]
# [1] "-----hello"
Another way to use a custom class would be to redefine the unary - operator to paste a dash onto the front of the string. For example
literal <- function(x) { class(x) <- "literal"; x }
`-.literal` <- function(e1, e2) literal(paste0("-", unclass(e1)))
print.literal <- function(x, ...) print(unclass(x))
Then you can do
val <- literal("hello")
-----val
# [1] "-----hello"
---val
# [1] "---hello"
So here the number of - you type is the number you get in the output.
You can get creative/weird with syntax, but you need to make sure whatever symbols you come up with can be parsed by the parser otherwise you are out-of-luck.
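For instance, the syntax from the question cannot even be parsed: # starts a comment in R, so _#char#_ reduces to a bare _, which the parser rejects. You can verify this yourself with:
parse(text = "_#char#_")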
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to any letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly gsub can replace the letters with append, but that loses the original letter (while grep would find the letters but not append anything!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences, which you may use in the replacement pattern in (g)sub. Once you wrap part of the pattern in a capturing group, you can later put the captured value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)       # which elements contain letters
gregexpr('[A-Za-z]', char_vars)   # where those letters sit
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
# Replace each matched letter with itself plus the suffix
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    char_vars[i] <- sub(pattern = found,
                        replacement = paste(found, "_append", sep = ""),
                        x = char_vars[i])
  }
}
I want to build a regex, substituting in some strings to search for, and so these strings need to be escaped before I can put them in the regex, so that if a searched-for string contains regex characters it still works.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
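For example, a quick check:
x <- "foo[bar]"
quotemeta(x)
# [1] "foo\\[bar\\]"
grepl(quotemeta(x), "this contains foo[bar]", perl = TRUE)
# [1] TRUE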
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built-in function, but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what to replace them with, and then loops through those making the necessary replacements.
re.escape <- function(strings) {
  # Regex-escaped metacharacters to protect; the backslash itself
  # must come first so later escapes are not double-escaped
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$", "\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for (i in seq_along(vals)) {
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
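For example (a sketch; \\Q...\\E quoting is documented for the perl = TRUE engine, and note it will misbehave if the string itself contains \\E):
x <- "foo[bar]"
grepl(paste0("\\Q", x, "\\E"), "this contains foo[bar]", perl = TRUE)
# [1] TRUE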
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using capture groups, (\\W), we can detect the occurrences of non-word characters and escape it with the \\1-syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or, similarly, use "([^[:alnum:]_])" in place of "(\\W)".
Thanks to grep using a character vector with multiple patterns, I figured out my own problem as well.
The question there was how to find multiple values using the grep function, and the solution was either of these:
grep("A1| A9 | A6")
or
toMatch <- c("A1", "A9", "A6")
matches <- unique(grep(paste(toMatch, collapse = "|"), text, value = TRUE))
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or a for loop doesn't work instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I've written in R was something like this:
find.explore.l <- lapply(text.words.bl, function(m) grep("^explor", m))
But then I had to 'grep' many words, so I tried out this
find.explore.l <- lapply(text.words.bl, function(m) grep(c("A1", "A2", "A3"), m))
It didn't work, so I tried another one(XXX is the list of words that I'm supposed to find in the text)
for (i in XXX) {
  find.explore.l <- lapply(text.words.bl, function(m) grep("XXX[i]", m))
  .......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() or a for loop make grep return the right results?
Someone please let me know! I'm so curious :P
From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as #nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop; you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over values to match
for (i in toMatch) {
grep(i, text)
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can apply() this, but in a slightly different way than you had done. You apply it to each regex in the list, rather than each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? It is using special rules for the first argument of grep. The brackets are considered special characters ([ and ]). I put in two backslashes to tell R to treat them as literal brackets. And imagine what would happen if I put that last expression into a for loop? It wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"), m) passes a vector as the pattern argument, which grep does not support: only the first element is used, with a warning.
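A quick demonstration of that behavior, matching the documentation quoted above:
grep(c("A1", "A2", "A3"), c("A1 line", "A3 line"))
# [1] 1
# Warning message:
# In grep(c("A1", "A2", "A3"), c("A1 line", "A3 line")) :
#   argument 'pattern' has length > 1 and only the first element will be used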