Regex to extract string from a file path in R

I have the following file path with a UUID embedded in it:
"~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
I want to extract it using a regular expression.
I know I could just unlist(strsplit(string, "/")) and take the 7th element, but that seems slow and inelegant to me.
Here is what I have tried so far:
\w{16}
I keep playing with this; please advise.
I want to extract the uuid: 0b27ea5fad61c99d
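For reference, the split-and-index baseline the question describes can be written as a one-liner (shown only to make the comparison concrete):

```r
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
# Split on "/" and take the 7th field; fixed = TRUE skips regex interpretation
unlist(strsplit(x, "/", fixed = TRUE))[7]
# [1] "0b27ea5fad61c99d"
```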

Here's a slightly hacky but compact and regex-free solution:
basename(dirname(x))
#[1] "0b27ea5fad61c99d"
Where
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"

EDIT: As per Onyambu's comment, adding the following solution too:
sub(".*/(.*)/[^/]+$","\\1",val)
Could you please try the following gsub function of base R and let me know if this helps you.
gsub("([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/][0-9a-z]+)/(.*)","\\6",val)
Explanation: Here is a brief explanation of the above snippet.
([^/]*): Select everything from the start up to the next / and keep it in the first capture group.
/: Then match a literal /.
Repeat the two steps above five times in total to reach the 6th field, which is matched by ([^/][0-9a-z]+); /(.*) then captures the rest of the string in the 7th group.
"\\6": Substitute the whole value of val with only the 6th capture group, which is the part the OP wants.

Have you conducted any benchmarks regarding timing? I think your own solution already performs quite well, especially with the minor improvement of setting fixed = TRUE. See the timings below. Why start a complicated regex search when you already know the exact character on which to split your string?
Update with respect to the comments: the vectorized comparison shows that f2 is not the best, but still has acceptable performance. As indicated in the comments, vectorized regex approaches will usually perform better as the length of the vector increases, and of course they are more flexible if you know less about the structure of your directory names.
Update 2: If anyone is still interested, I have updated the function f2 by using a better way for accessing sub-elements of the lists. This now makes it the fastest approach for the specific example, at least, for a benchmark of 500 items.
library(microbenchmark)
library(stringi)
string = "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
string = rep(string, 500)
f1 = function(x) sapply(strsplit(x, "/"), `[[`, 7)
f2 = function(x) sapply(strsplit(x, "/", fixed = TRUE), `[[`, 7)
f2b = function(x) sapply(stri_split_fixed(x, "/"), `[[`, 7)
f3 = function(x) stri_extract_first_regex(x, "(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])([a-f0-9]{16})")
f4 = function(x) sapply(x, function(y) tail(unlist(strsplit(dirname(x), "/")),1), USE.NAMES = FALSE)
f5 = function(x) basename(dirname(x))
f6 = function(x) gsub("([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/][0-9a-z]+)/(.*)","\\6",x)
f7 = function(x) sub("^.*/(.*)$", "\\1", dirname(x))
f8 = function(x) sub(".*/(.*)/[^/]+$","\\1",x)
bm = microbenchmark(
(a = f1(string))
,(b = f2(string))
, (b2 = f2b(string))
,(c = f3(string))
, (d = f4(string))
, (e = f5(string))
, (f = f6(string))
, (g = f7(string))
, (h = f8(string))
, times = 25)
bm
# Unit: microseconds
# expr min lq mean median uq max neval
# (a = f1(string)) 1894.017 1947.307 2083.6390 2072.444 2142.709 2896.684 25
# (b = f2(string)) 532.520 575.153 605.7698 592.917 630.813 823.451 25
# (b2 = f2b(string)) 545.152 569.232 617.1387 606.733 637.129 778.450 25
# (c = f3(string)) 855.426 894.112 953.5931 946.614 999.511 1286.890 25
# (d = f4(string)) 2497889.661 2538700.607 2604673.5850 2602081.839 2654385.172 2820226.019 25
# (e = f5(string)) 4686.881 4935.573 5087.7735 5155.450 5201.240 5544.674 25
# (f = f6(string)) 5991.532 6357.861 6750.8284 6584.054 6886.039 9232.438 25
# (g = f7(string)) 4313.840 4462.661 4770.6780 4696.749 4900.046 6442.733 25
# (h = f8(string)) 2328.637 2422.193 2620.5163 2606.542 2660.229 3697.239 25
all(all.equal(a, b)
,all.equal(a, c)
,all.equal(a, d)
,all.equal(a, e)
,all.equal(a, f)
,all.equal(a, g)
,all.equal(a, b2)
,all.equal(a, h)
)
# TRUE

You might match a forward slash, then use two positive lookaheads (?=...) to assert that what follows contains at least one [a-f] and at least one [0-9]. Then capture ([a-f0-9]{16}) in a group.
/(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])([a-f0-9]{16})
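Applied in base R, the lookaheads require perl = TRUE; a minimal sketch with regexpr/regmatches (the leading slash is part of the match and is stripped afterwards):

```r
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
pat <- "/(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])[a-f0-9]{16}"
m <- regmatches(x, regexpr(pat, x, perl = TRUE))  # first match incl. leading "/"
sub("/", "", m, fixed = TRUE)
# [1] "0b27ea5fad61c99d"
```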

You may use this regular expression:
(?:[^\/]+\/){5}(\w+)
(?: Start of non capturing group.
[^\/]+\/ One or more characters other than a forward slash, followed by a forward slash /.
) Close non capturing group.
{5} Match exactly five occurrences of the preceding pattern.
(\w+) Capturing group. Capture alphanumeric characters greedily.
The substring of interest is contained in the capture group.
You can test the regex live here.
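Applied in R, this could look as follows (note that / needs no escaping in R's regex flavours; perl = TRUE is used for predictable group handling):

```r
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
# Skip five "field/" chunks, capture the word characters of the 6th field
sub("^(?:[^/]+/){5}(\\w+).*", "\\1", x, perl = TRUE)
# [1] "0b27ea5fad61c99d"
```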

Related

Regex to convert time equations to R date-time (POSIXct)

I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:
* = current time
t = current day (00:00)
mo = month
d = days
h = hours
m = minutes
For example, *-3d is the current time minus 3 days, and t-3h is three hours before the start of today (i.e., 21:00 yesterday).
I'd like to be able to ingest these equations into R and get the corresponding POSIXct value. I'm trying using regex in the below function but lose the numeric multiplier for each string:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('^\\*',
'^t',
'([[:digit:]]{1,})mo',
'([[:digit:]]{1,})d',
'([[:digit:]]{1,})h',
'([[:digit:]]{1,})m'),
replacement = c('Sys.time()',
'Sys.Date()',
'*lubridate::months(1)',
'*lubridate::days(1)',
'*lubridate::hours(1)',
'*lubridate::minutes(1)'),
vectorize_all = F
)
return(ret)
# return(eval(expr = parse(text = ret)))
}
> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
Expected output:
# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
I assumed that wrapping the [[:digit:]]{1,} in parentheses () would preserve the digits, but clearly that's not working. I defined the patterns in this order because otherwise the code replaces already-replaced characters, e.g. * gets converted to Sys.time(), but then the m in Sys.time() gets replaced with *lubridate::minutes(1).
I plan on converting the (expected) output to R date-time using eval(parse(text = ...)) - currently commented out in the function.
I'm open to using other packages or approach.
Update
After tinkering around for a bit, I found the below version works - I'm replacing strings in the order such that newly replaced characters are not replaced again:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
replacement = c('*years(1)',
'*days(1)',
'*hours(1)',
'*days(30)',
'*minutes(1)',
'Sys.Date()',
'Sys.time()'),
vectorize_all = F
)
ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
rdate <- (eval(expr = parse(text = ret)))
attr(rdate, 'tzone') <- 'UTC'
return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)
This works but is not very elegant and will likely fail as I'm forced to incorporate other expressions (e.g. yd for day of the year e.g. 124).
You can use backreferences in the replacements like this:
library(stringr)
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)', '(\\d+)h' = '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
stringr::str_replace_all(x, repl)
## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
## [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
See the R demo online.
See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'. Here, (\d+)mo matches and captures into Group 1 one or more digits, and mo is just matched. Then, when the match is found, \1 in \1*lubridate::months(1) inserts the contents of Group 1 into the resulting string.
Note that it might make the replacements safer if you cap the time period match with a word boundary (\b) on the right:
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)', '(\\d+)h\\b' = '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')
It won't work if the time spans are glued one to another without any non-word delimiters, but you have + in your example strings, so it is safe here.
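The boundary caveat can be seen on a glued string (a small sketch using only the hours pattern):

```r
library(stringr)
repl <- c('(\\d+)h\\b' = '\\1*lubridate::hours(1)')
str_replace_all("4h+2m", repl)  # "h" is followed by "+", so \b matches
# [1] "4*lubridate::hours(1)+2m"
str_replace_all("4h2m", repl)   # "h" is followed by a digit, no \b: left as is
# [1] "4h2m"
```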
Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n syntax:
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
pattern = c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
replacement = c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)
Output:
[1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
[2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
Another option to produce the time directly, would be the following:
library(lubridate)
strTimeConvert <- function(base = Sys.time(), delta = "-5mo+3d+4h+2m"){
mo <- gsub(".*([+-]\\d+)mo.*", "\\1", delta)
ds <- gsub(".*([+-]\\d+)d.*", "\\1", delta)
hs <- gsub(".*([+-]\\d+)h.*", "\\1", delta)
ms <- gsub(".*([+-]\\d+)m.*", "\\1", delta)
out <- base + months(as.numeric(mo)) + days(as.numeric(ds)) +
hours(as.numeric(hs)) + minutes(as.numeric(ms))
out
}
strTimeConvert()
# [1] "2020-07-21 20:32:19 EDT"

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Given the following string of parentheses, I am trying to remove/replace one specific pair of brackets,
where the position of one of them is marked with a 1 below.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I tried using the base strsplit function to split the string and get the index of the bracket marked with 1, but I am not sure how to get the index of its matching bracket.
Could anyone give me some input on this?
What I did..
a = "((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
Here, I removed one bracket of the pair I wanted to remove, but I do not know how to remove its corresponding opening bracket, which is at the 11th position in this example.
Locate the 1 in b giving pos2 and also calculate the length of a giving n. Then replace positions pos2 and pos1 = n - pos2 + 1 with minus characters. See ?regexpr, ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"
Since the parentheses are paired, the index of the matching open parenthesis is the length of the string minus the index of the close parenthesis, plus one (they're equidistant from the string's two ends).
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len - location + 1, len - location + 1) <- "-"
string
"((((((((((-(((((((********)))))))-))))))))))"
You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*){8}", "\\*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
sub(paste0("(!?\\))(\\)){", n, "}"),
paste0(paste(rep(")", (n-1)), collapse=""), "-"),
string, perl = TRUE)}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-)))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)
You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
regmatches(x,regexpr(1,y)) <- val
sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
strreplace(a, b)
[1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested parentheses
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
[1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"

What is the difference among eval(parse()), eval(str2lang()), eval(str2expression()), and eval(call()[[1]])?

I'm writing a function to extract a variable provided as a string from either a given data.frame df or the environment env. Initially, I had been using the eval(parse(text=s), df, env) construction to do this, but I learned that there are more efficient alternatives. Other options include:
eval(str2lang(s), df, env)
eval(str2expression(s), df, env)
eval(call(s)[[1]], df, env)
There may be a get solution as well, but I don't know if it can check to see if the variable is in df first before turning to env if it isn't.
Using microbenchmark, it seems that call is the fastest:
library(microbenchmark)
x1 = 1
df = data.frame(x2 = 2)
microbenchmark(call = eval(call('x1')[[1]], df),
parse = eval(parse(text='x1'), df),
str2lang = eval(str2lang('x1'), df),
str2exp = eval(str2expression('x1'), df),
check = "identical")
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> call 1.128 1.2115 1.60815 1.4585 1.6360 4.659 100 a
#> parse 39.183 39.8705 46.60755 40.2405 42.0415 135.462 100 b
#> str2lang 2.235 2.3570 3.26144 2.5995 2.8925 24.641 100 a
#> str2exp 2.230 2.3200 2.81387 2.4780 2.6970 10.312 100 a
microbenchmark(call = eval(call('x2')[[1]], df),
parse = eval(parse(text='x2'), df),
str2lang = eval(str2lang('x2'), df),
str2exp = eval(str2expression('x2'), df),
check = "identical")
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> call 1.124 1.194 1.47770 1.3675 1.5795 9.031 100 a
#> parse 38.254 38.762 40.21497 38.9630 39.3120 116.510 100 b
#> str2lang 2.214 2.304 2.55036 2.3960 2.6530 10.639 100 a
#> str2exp 2.238 2.331 2.50011 2.4210 2.6515 3.619 100 a
Created on 2020-04-23 by the reprex package (v0.3.0)
I'm therefore inclined to use call but I want to make sure there wouldn't be any unintended consequences of doing so rather than using the other solutions. In other words, in what situations (within the context I'm using them in) would the four methods not give the same answer, leading one to favor one over the others?
Both call and as.symbol are fine; I think as.symbol is preferable. as.symbol is unambiguous R terminology (as.name was the S terminology) and is as fast as call. A call can contain any R object, including symbols and other calls, and its documentation is spread all over the place. Because the strings could contain punctuation, we cannot be sure at which point str2lang or str2expression would fail or what error message might pop up.
Checking if variables exist in an environment is safer than doing so for data.frames.
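That lookup order can be sketched with a small helper (get_var is a hypothetical name, not part of the question):

```r
# Hypothetical helper: resolve a variable name in df first, then in env
get_var <- function(s, df, env) {
  if (s %in% names(df)) df[[s]] else get(s, envir = env)
}
e <- new.env(); e$x1 <- 1
df <- data.frame(x2 = 2)
get_var("x2", df, e)  # 2 (found in df)
get_var("x1", df, e)  # 1 (falls back to the environment)
```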
Differences
A look under the hood reveals that the main differences are in their classes:
l = list(w = as.symbol("x"), x = str2lang("x"), y = str2expression("x"), z = call("x"))
rbind(sapply(l, class), sapply(l, typeof), sapply(l, is.language))
w x y z
[1,] "name" "name" "expression" "call"
[2,] "symbol" "symbol" "expression" "language"
[3,] "TRUE" "TRUE" "TRUE" "TRUE"
A good starting point is the R Language Definition, and for examples the Expressions chapter in Hadley Wickham's Advanced R (source), followed by reading ?call, ?parse, rlang::call2.
What are they?
In Evaluating the Design of the R Language (source) the following definition is given:
Arguments of calls are expressions which may be named by a symbol name.
Expression - an action or actions.
Call - represents the action of calling a function.
Symbols (Name) - refer to R objects (2.1.3)
Which one to use when?
From the ?call documentation:
Instead of as.call(<string>), consider using str2lang(*) which is an efficient version of parse(text=*). call() and as.call(), are much preferable to parse() based approaches.
Using pryr::show_c_source(.Internal(str2lang(s))) we can see that str2lang and str2expression both call
the C function do_str2lang but differ in their argument. This difference can be found in ?parse:
str2expression, equal to parse(text = "x", keep.source = F) is always an expression.
str2lang("x"), equal to parse(text = "x", keep.source = F)[[1]] can evaluate to a call OR a symbol, NULL, numeric, integer or logical, i.e. a call or simpler.
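Both parse() equivalences quoted above can be verified directly:

```r
# str2lang returns the single language object, str2expression an expression vector
identical(str2lang("x + 1"), parse(text = "x + 1", keep.source = FALSE)[[1]])
# [1] TRUE
identical(str2expression("x + 1"), parse(text = "x + 1", keep.source = FALSE))
# [1] TRUE
class(str2lang("x"))        # "name": a bare symbol, i.e. "a call or simpler"
class(str2expression("x"))  # "expression": always wrapped
```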
In the R Language Definition it is mentioned that expressions are only evaluated when passed to eval, other language objects may get evaluated in some unexpected cases. What these cases are, I could not find.
We can throw some edge cases at all approaches to see when which might fail:
"\"" wrongly parsed quote when reading in data?
" " empty variable? space in variable name?
"_" illegal token (legacy assignment operator) issues?
backtick
NA
# if it breaks, show how it breaks
do <- function(x, ...) tryCatch(eval(x, ..1, ..2), error = function(t) t$message)
check <- function(x, ...){
list(
do(call(x)[[1]], ...),
do(as.symbol(x), ...),
do(str2lang(x), ...),
do(str2expression(x), ...)
)
}
# test 1: some variables do not exist
e <- new.env() ; e$x1 <- 5
df = data.frame(x2 = 3)
no_var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)
# test 2: some variables exist
e <- new.env() ; e$x1 = 5; e$`_` = 2 ; e$"\"" = 5; e$`NA` <- 5
df = data.frame(x2 = 3, " " = 7, "_" = 2)
var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)
# difference in outcomes in
var[3:7]
no_var[3:7]
When variables do not exist:
as.symbol and call always reach eval, str2lang and str2expression fail early in all cases. str2expression differs from str2lang when an empty string is given.
When variables exist:
as.symbol and call succeed in cases 3, 4, 7 while str2lang and str2expression throw errors.
On a sidenote, be careful with storing variables in data.frames, they are easy to corrupt.
names(data.frame(" " = 1, "_" = 2, "\"" = 3, "`" = 4))
[1] "X." "X_" "X..1" "X..2"
I think it depends on what strings you are trying to evaluate. If it's always a single name, then as.name('x1') is probably a contender as well; my tests show it's just a little slower than your solution with call, but I'd say it's preferable simply because it's less obscure and less likely to cause future headaches.
The headaches might arise because of this: currently call("x1") produces the language object x1(), and then you extract the name of the function being called. But what if in some future version of R it remembers that the x1 in that expression is supposed to be a function? eval(x1()) already knows to ignore non-function objects named x1, so it's not going to break much code to make that change, and it might be useful from an efficiency point of view.
as.name also allows non-character arguments, e.g. as.name(123); this could be positive or negative for you. (Both solutions would allow "123".)
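A minimal sketch of the as.name() variant, using the same setup as the question:

```r
x1 <- 1
df <- data.frame(x2 = 2)
eval(as.name("x1"), df)  # 1: not in df, found in the enclosing environment
eval(as.name("x2"), df)  # 2: found in df
```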

Find unused character(s) in string

For a library call I have to provide a separator, which must not occur in the text, because otherwise the library call gets confused.
Now I was wondering how I can adapt my code to assure that the separator I use is guaranteed not to occur in the input text.
I am solving this issue with a while loop: I make a (hardcoded) assumption about the most unlikely string in the input, check whether it is present, and if so, simply enlarge the string. This works but feels very hackish, so I was wondering whether there is a more elegant version (e.g. an existing base R function, or a loop-free solution) that does the same for me? Ideally the found separator is also of minimal length.
I could simply hardcode a large enough set of potential separators and look for the first one not occurring in the text, but this may also break at some point if all of these separators happen to occur in my input.
The reasoning is that even if it will never happen (well, never say never), I am afraid that in some distant future there will be this one input string which requires thousands of while-loop iterations before finding an unused string.
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")
orig_sep <- sep <- "/" ## first guess as a separator
while(any(grepl(sep, input_string, fixed = TRUE))) {
sep <- paste0(sep, orig_sep)
}
print(sep)
# "///"
In case a single ASCII character suffices, you can use table.
tt <- table(factor(strsplit(paste(input_string, collapse = ""), "")[[1]]
, rawToChar(as.raw(32:126), TRUE)))
names(tt)[tt==0]
rawToChar(as.raw(32:126), TRUE) gives you all printable ASCII characters, which are used as factor levels, and table counts the occurrences of each. Any character with a count of 0 can be used.
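The same single-character search can also be phrased with setdiff() instead of table() (an equivalent sketch, not from the answer):

```r
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")
used <- unique(strsplit(paste(input_string, collapse = ""), "")[[1]])
# All printable ASCII characters that never occur in the input
free <- setdiff(rawToChar(as.raw(32:126), multiple = TRUE), used)
free[1]
# [1] " "
```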
In case you need two characters, you can try the following, which returns all possible delimiters:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
Or for n characters:
orig_sep <- x <- rawToChar(as.raw(32:126), TRUE)
sep <- x[0]
repeat {
sep <- x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
if(length(sep) > 0) break;
x <- c(outer(x, orig_sep, paste0))
}
sep
Search for 1-2 character separators with only a single sapply loop, taking the separator of minimal length.
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(x, outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})][1]
#[1] " "
In case you want to know how many times a character needs to be repeated to work as a separator, as you do it in the question, you can use gregexpr.
strrep("/", max(sapply(gregexpr("/*", input_string)
, function(x) max(attributes(x)$match.length)))+1)
#[1] "///"
strrep("/", max(c(0, sapply(gregexpr("/+", input_string)
, function(x) max(attributes(x)$match.length))))+1)
#[1] "///"
I ran some benchmarks, and the sad news is that the regex solution only pays off if the separator occurs many times in the input string. I don't expect long repetitions of the separator, so from that perspective the while solution should be preferable, but it would be the first time in my R life that I actually had to rely on a while construct.
Code
library(microbenchmark)
sep <- "/"
make_input <- function(max_occ, vec_len = 1000) {
paste0("A", strrep(sep, sample(0:max_occ, vec_len, TRUE)))
}
set.seed(1)
no_occ <- make_input(0)
typ_occ <- make_input(1)
mid_occ <- make_input(10)
high_occ <- make_input(100)
while_fun <- function(in_str) {
my_sep <- sep
while(any(grepl(my_sep, in_str, fixed = TRUE))) {
my_sep <- paste0(my_sep, sep)
}
my_sep
}
greg_fun <- function(in_str) {
strrep(sep,
max(sapply(gregexpr(paste0(sep, "+"), in_str),
purrr::attr_getter("match.length")), 0) + 1)
}
microbenchmark(no_occ_w = while_fun(no_occ),
no_occ_r = greg_fun(no_occ),
typ_occ_w = while_fun(typ_occ),
typ_occ_r = greg_fun(typ_occ),
mid_occ_w = while_fun(mid_occ),
mid_occ_r = greg_fun(mid_occ),
high_occ_w = while_fun(high_occ),
high_occ_r = greg_fun(high_occ))
Results
Unit: microseconds
expr min lq mean median uq max neval cld
no_occ_w 12.3 13.30 15.947 14.60 16.55 51.1 100 a
no_occ_r 1074.8 1184.90 1981.637 1253.45 1546.20 7037.9 100 b
typ_occ_w 33.8 36.00 42.842 38.55 41.45 229.2 100 a
typ_occ_r 1090.4 1192.15 2090.526 1283.80 1547.10 8490.7 100 b
mid_occ_w 277.9 283.35 336.466 288.30 309.45 3452.2 100 a
mid_occ_r 1161.6 1269.50 2204.213 1368.45 1789.20 7664.7 100 b
high_occ_w 3736.4 3852.95 4082.844 3962.30 4097.60 6658.3 100 d
high_occ_r 1685.5 1776.15 2819.703 1868.10 4065.00 7960.9 100 c

Substitute the ^ (power) symbol with C's pow syntax in mathematical expression

I have a math expression, for example:
((2-x+3)^2+(x-5+7)^10)^0.5
I need to replace the ^ symbol with the pow function of the C language. I think a regex is what I need, but I don't know regex like a pro. So I ended up with this regex:
(\([^()]*)*(\s*\([^()]*\)\s*)+([^()]*\))*
I don't know how to improve this. Can you advise me on how to solve this problem?
The expected output:
pow(pow(2-x+3,2)+pow(x-5+7,10),0.5)
One of the most fantastic things about R is that you can easily manipulate R expressions with R. Here, we recursively traverse your expression and replace all instances of ^ with pow:
f <- function(x) {
if(is.call(x)) {
if(identical(x[[1L]], as.name("^"))) x[[1L]] <- as.name("pow")
if(length(x) > 1L) x[2L:length(x)] <- lapply(x[2L:length(x)], f)
}
x
}
f(quote(((2-x+3)^2+(x-5+7)^10)^0.5))
## pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)
This should be more robust than the regex since you are relying on the natural interpretation of the R language rather than on text patterns that may or may not be comprehensive.
Details: Calls in R are stored in list like structures with the function / operator at the head of the list, and the arguments in following elements. For example, consider:
exp <- quote(x ^ 2)
exp
## x^2
is.call(exp)
## [1] TRUE
We can examine the underlying structure of the call with as.list:
str(as.list(exp))
## List of 3
## $ : symbol ^
## $ : symbol x
## $ : num 2
As you can see, the first element is the function/operator, and subsequent elements are the arguments to the function.
So, in our recursive function, we:
Check if an object is a call
If yes: check if it is a call to the ^ function/operator by looking at the first element in the call with identical(x[[1L]], as.name("^"))
If yes: replace the first element with as.name("pow")
Then, irrespective of whether this was a call to ^ or anything else:
if the call has additional elements, cycle through them and apply this function (i.e. recurse) to each element, replacing the result back into the original call (x[2L:length(x)] <- lapply(x[2L:length(x)], f))
If no: just return the object unchanged
Note that calls often contain the names of functions as the first element. You can create those names with as.name. Names are also referenced as "symbols" in R (hence the output of str).
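The name/symbol synonymy mentioned here is easy to check:

```r
s <- as.name("pow")
identical(s, as.symbol("pow"))  # TRUE: as.name and as.symbol are synonyms
is.symbol(s)                    # TRUE
```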
Here is a solution that follows the parse tree recursively and replaces ^:
#parse the expression
#alternatively you could create it with
#expression(((2-x+3)^2+(x-5+7)^10)^0.5)
e <- parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")
#a recursive function
fun <- function(e) {
#check if you are at the end of the tree's branch
if (is.name(e) || is.atomic(e)) {
#replace ^
if (e == quote(`^`)) return(quote(pow))
return(e)
}
#follow the tree with recursion
for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
return(e)
}
#deparse to get a character string
deparse(fun(e)[[1]])
#[1] "pow((pow((2 - x + 3), 2) + pow((x - 5 + 7), 10)), 0.5)"
This would be much easier if rapply worked with expressions/calls.
Edit:
The OP asked about performance. It is very unlikely that performance is an issue for this task, but the regex solution is not faster.
library(microbenchmark)
microbenchmark(regex = {
v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
while(x) {
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
}
},
BrodieG = {
deparse(f(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5")[[1]]))
},
Roland = {
deparse(fun(parse(text = "((2-x+3)^2+(x-5+7)^10)^0.5"))[[1]])
})
#Unit: microseconds
# expr min lq mean median uq max neval cld
# regex 321.629 323.934 335.6261 335.329 337.634 384.623 100 c
# BrodieG 238.405 246.087 255.5927 252.105 257.227 355.943 100 b
# Roland 211.518 225.089 231.7061 228.802 235.204 385.904 100 a
I haven't included the solution provided by @digEmAll, because it seems obvious that a solution with that many data.frame operations will be relatively slow.
Edit2:
Here is a version that also handles sqrt.
fun <- function(e) {
#check if you are at the end of the tree's branch
if (is.name(e) || is.atomic(e)) {
#replace ^
if (e == quote(`^`)) return(quote(pow))
return(e)
}
if (e[[1]] == quote(sqrt)) {
#replace sqrt
e[[1]] <- quote(pow)
#add the second argument
e[[3]] <- quote(0.5)
}
#follow the tree with recursion
for (i in seq_along(e)) e[[i]] <- fun(e[[i]])
return(e)
}
e <- parse(text = "sqrt((2-x+3)^2+(x-5+7)^10)")
deparse(fun(e)[[1]])
#[1] "pow(pow((2 - x + 3), 2) + pow((x - 5 + 7), 10), 0.5)"
DISCLAIMER: The answer was written with the OP original regex in mind, when the question sounded as "process the ^ preceded with balanced (nested) parentheses". Please do not use this solution for generic math expression parsing, only for educational purposes and only when you really need to process some text in the balanced parentheses context.
Since a PCRE regex can match nested parentheses, it is possible to achieve in R with a mere regex in a while loop checking the presence of ^ in the modified string with x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE). Once there is no ^, there is nothing else to substitute.
The regex pattern is
(\(((?:[^()]++|(?1))*)\))\^(\d*\.?\d+)
See the regex demo
Details:
(\(((?:[^()]++|(?1))*)\)) - Group 1: a (...) substring with balanced parentheses, capturing what is inside the outer parentheses into Group 2 (with the ((?:[^()]++|(?1))*) subpattern); an explanation can be found at How can I match nested brackets using regex? In short, \( matches an outer (, then (?:[^()]++|(?1))* matches zero or more sequences of one or more chars other than ( and ), or the whole Group 1 subpattern ((?1) is a subroutine call), and then \) matches the closing )
\^ - a ^ caret
(\d*\.?\d+) - Group 3: an int/float number (.5, 1.5, 345)
The replacement pattern contains a literal pow() and the \\2 and \\3 are backreferences to the substrings captured with Group 2 and 3.
R code:
v <- "((2-x+3)^2+(x-5+7)^10)^0.5"
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
while(x) {
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", "pow(\\2, \\3)", v, perl=TRUE);
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(\\d*\\.?\\d+)", v, perl=TRUE)
}
v
## => [1] "pow(pow(2-x+3, 2)+pow(x-5+7, 10), 0.5)"
And to support ^(x-3) pows, you may use
v <- sub("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\(((?:[^()]++|(?3))*)\\)))", "pow(\\2, \\4)", v, perl=TRUE);
and to check if there are any more values to replace:
x <- grepl("(\\(((?:[^()]++|(?1))*)\\))\\^(?|()(\\d*\\.?\\d+)|(\\(((?:[^()]++|(?3))*)\\)))", v, perl=TRUE)
Here's an example exploiting R parser (using getParseData function) :
# helper function which turns getParseData result back to a text expression
recreateExpr <- function(DF,parent=0){
elements <- DF[DF$parent == parent,]
s <- ""
for(i in 1:nrow(elements)){
element <- elements[i,]
if(element$terminal)
s <- paste0(s,element$text)
else
s <- paste0(s,recreateExpr(DF,element$id))
}
return(s)
}
expr <- "((2-x+3)^2+(x-5+7)^10)^0.5"
DF <- getParseData(parse(text=expr))[,c('id','parent','token','terminal','text')]
# let's find the parents of all '^' expressions
parentsOfPow <- unique(DF[DF$token == "'^'",'parent'])
# replace all the the 'x^y' expressions with 'pow(x,y)'
for(p in parentsOfPow){
idxs <- which(DF$parent == p)
if(length(idxs) != 3){ stop("expression with '^' is not correct") }
idxtok1 <- idxs[1]
idxtok2 <- idxs[2]
idxtok3 <- idxs[3]
# replace '^' token with 'pow'
DF[idxtok2,c('token','text')] <- c('pow','pow')
# move 'pow' token as first token in the expression
tmp <- DF[idxtok1,]
DF[idxtok1,] <- DF[idxtok2,]
DF[idxtok2,] <- tmp
# insert new terminals '(' ')' and ','
DF <- rbind(
DF[1:(idxtok2-1),],
data.frame(id=max(DF$id)+1,parent=p,token=',',terminal=TRUE,text='(',
stringsAsFactors=FALSE),
DF[idxtok2,],
data.frame(id=max(DF$id)+2,parent=p,token=',',terminal=TRUE,text=',',
stringsAsFactors=FALSE),
DF[(idxtok2+1):idxtok3,],
data.frame(id=max(DF$id)+3,parent=p,token=')',terminal=TRUE,text=')',
stringsAsFactors=FALSE),
if(idxtok3<nrow(DF)) DF[(idxtok3+1):nrow(DF),] else NULL
)
}
# print the new expression
recreateExpr(DF)
> [1] "pow((pow((2-x+3),2)+pow((x-5+7),10)),0.5)"
