Extracting variables from a formula when there are subscripts - r

There are several posts related to obtaining a list of variables in a regression formula in R - the basic answer being to use all.vars. For example,
> all.vars(log(resp) ~ treat + factor(dose))
[1] "resp" "treat" "dose"
This is nice because it strips out all of the functions and operators (as well as repeats, not shown). However, this is problematic when the formula contains $ operators or subscripts, such as in
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> all.vars(form)
[1] "cows" "weight" "bulls" "herd" "breed"
Here, the data frame names cows, bulls, and herd are identified as variables, and the names of the actual variables are decoupled or lost. Instead, what I really want is this result:
> mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
What is the most elegant way to do this? I have one proposal that I'll post as an answer, but maybe someone has a more elegant solution and will earn more votes!

One approach that works, though a bit tedious, is to replace the operators $, etc. with legal characters for variable names, turn the string back into a formula, apply all.vars, and un-mangle the results:
All.vars = function(expr, retain = c("\\$", "\\[\\[", "\\]\\]"), ...) {
    # replace operators with unlikely patterns _Av1_, _Av2_, ...
    repl = paste("_Av", seq_along(retain), "_", sep = "")
    for (i in seq_along(retain))
        expr = gsub(retain[i], repl[i], expr)
    # piece things back together in the right order, and call all.vars
    subs = switch(length(expr), 1, c(1,2), c(2,1,3))
    vars = all.vars(as.formula(paste(expr[subs], collapse = "")), ...)
    # reverse the mangling of names
    retain = gsub("\\\\", "", retain) # un-escape the patterns
    for (i in seq_along(retain))
        vars = gsub(repl[i], retain[i], vars)
    vars
}
Use the retain argument to specify the patterns that we wish to retain rather than treat as operators. The defaults are $, [[, and ]] (all duly escaped). Here are some results:
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> All.vars(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
Change retain to also include ( and ):
> All.vars(form, retain = c("\\$", "\\(", "\\)", "\\[\\[", "\\]\\]"))
[1] "log(cows$weight)" "factor(bulls[[3]])" "herd$breed"
The dots are passed to all.vars, which is really the same as all.names but with different defaults. So we can also obtain the functions and operators not in retain:
> All.vars(form, functions = TRUE)
[1] "~" "log" "cows$weight" "*"
[5] "factor" "bulls[[3]]" "herd$breed"
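As an alternative to mangling strings, here is a minimal sketch that walks the formula's parse tree and treats any `$`, `[[` or `[` call as a single term. The name formula_terms and its keep argument are made up for illustration (they are not part of base R), and sub-expressions long enough to deparse onto several lines are not handled.

```r
# Hypothetical helper (not from the original posts): collect variables,
# treating calls to the operators in `keep` as atomic terms.
formula_terms <- function(e, keep = c("$", "[[", "[")) {
  if (is.call(e)) {
    fn <- e[[1]]
    if (is.name(fn) && as.character(fn) %in% keep) {
      return(deparse(e))          # e.g. cows$weight stays in one piece
    }
    # skip the function position and recurse into the arguments
    args <- as.list(e)[-1]
    return(unique(unlist(lapply(args, formula_terms, keep = keep))))
  }
  if (is.name(e)) return(as.character(e))
  character(0)                    # constants such as 3 or "a"
}

form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
formula_terms(form)
# [1] "cows$weight" "bulls[[3]]"  "herd$breed"
```

Because nothing is converted to a string until the very end, function names such as log and factor never need to be stripped out with regular expressions.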

This isn't sufficient for a general use case, but just for fun I thought I'd take a crack at it:
mystery.fcn = function(string) {
string = gsub(":", " ", string)
string = unlist(strsplit(gsub("\\b.*\\b\\(|\\(|\\)|[*~+-]", "", string), split=" "))
string = string[nchar(string) > 0]
return(string)
}
form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
form1 = ~x[[y]]
mystery.fcn(form1)
[1] "x[[y]]"
form2 = z$three ~ z$one + z$two - z$x_y
mystery.fcn(form2)
[1] "z$three" "z$one" "z$two" "z$x_y"
form3 = z$three ~ z$one:z$two
mystery.fcn(form3)
[1] "z$three" "z$one" "z$two"

Related

Converting unit abbreviations to numbers

I have a dataset that abbreviates numerical values in a column. For example, 12M means 12 million, 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?
I've thought about using gsub to convert M to 000,000 etc., but that does not take the decimals into account (1.5M would then be 1.5000000).
So you want to translate SI unit abbreviations ('K','M',...) into exponents, and thus numerical powers-of-ten.
Given that all units are single-letter, and the exponents are uniformly spaced powers of 10**3, here's working code that handles 'Kilo'...'Yotta', and any future exponents:
> 10 ** (3*as.integer(regexpr('T', 'KMGTPEZY')))
[1] 1e+12
Then just multiply that power-of-ten by the decimal value you have.
Also, you probably want to detect and handle the 'no-match' case for unknown letter prefixes, otherwise you'd get a nonsensical exponent of -1*3:
> unit_to_power <- function(u) {
    pos <- as.integer(regexpr(u, 'KMGTPEZY', fixed = TRUE))  # -1 when u is not found
    if (pos > 0) 10**(3*pos) else 1
}
Now if you want to match both 'k' and 'K' to Kilo (strictly, SI writes kilo as lower-case 'k', but computer people often write 'K'), then you'll need to special-case it, e.g. with an if-else ladder/expression. SI prefixes are case-sensitive in general: 'M' means 'Mega' but 'm' strictly means 'milli', even if disk-drive users say otherwise; upper-case is conventionally used for the larger positive exponents. So for a few prefixes, @DanielV's case-specific code is better.
If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum#KMGTPEZY')-8) where # is just some throwaway character to keep uniform spacing; it shouldn't actually get matched. Again, if you need to handle non-power-of-10**3 units like 'deci' or 'centi', that will require special-casing, or the general dict-based approach WeNYoBen uses.
base::regexpr is not vectorized over its pattern argument, and it can be slow on large inputs, so if you need to vectorize or want higher performance, use stringr::str_locate.
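If you would rather stay in base R, a vectorized lookup can be sketched with match() instead of regexpr(). The function name si_power is hypothetical, and the prefix string assumes the full upper-case SI sequence including 'Z' (zetta); unknown letters fall back to a multiplier of 1.

```r
# Hypothetical helper: position of each letter in the prefix string,
# turned into a power of 1000; vectorized over u.
si_power <- function(u, prefixes = "KMGTPEZY") {
  pos <- match(u, strsplit(prefixes, "")[[1]])  # NA when the letter is unknown
  ifelse(is.na(pos), 1, 10^(3 * pos))
}

si_power(c("K", "T", "?"))   # K -> 1e3, T -> 1e12, unknown -> 1
1.5 * si_power("M")          # 1.5 million
```

match() compares whole strings rather than regular expressions, so characters such as "." or "*" in the input cannot be misread as patterns.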
Give this a shot:
Text_Num <- function(x){
    if (grepl("M", x, ignore.case = TRUE)) {
        as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
    } else if (grepl("k", x, ignore.case = TRUE)) {
        as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
    } else {
        as.numeric(x)
    }
}
In your case you can use gsubfn:
a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03
Here is another answer.
Define the function:
res = function (x) {
    result = suppressWarnings(as.numeric(x))  # NA when x carries a suffix
    if (is.na(result)) {
        text = gsub("k", "*1e3", x, ignore.case = TRUE)
        text = gsub("m", "*1e6", text, ignore.case = TRUE)
        result = eval(parse(text = text))
    }
    return(result)
}
Result:
> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
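Since the original question was about sorting a whole column, here is a minimal vectorized sketch that ties the pieces together. The function name abbr_to_num and the mult lookup vector are assumptions made for illustration; only the suffixes listed in mult are recognised, and anything else is passed to as.numeric() unchanged.

```r
# Hypothetical helper: convert strings like "12M" or "1.2k" to numbers.
abbr_to_num <- function(x, mult = c(k = 1e3, K = 1e3, M = 1e6)) {
  x <- trimws(x)
  last <- substr(x, nchar(x), nchar(x))          # final character of each value
  has_suffix <- last %in% names(mult)
  # strip the suffix where there is one, then convert the numeric part
  base <- as.numeric(ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x))
  scale <- ifelse(has_suffix, mult[last], 1)
  base * scale
}

vals <- c("12M", "1.2k", "300", "1.5M")
abbr_to_num(vals)                # numeric values: 12e6, 1200, 300, 1.5e6
vals[order(abbr_to_num(vals))]   # original strings, lowest to highest
# [1] "300"  "1.2k" "1.5M" "12M"
```

Sorting through order() keeps the original abbreviated strings in the column while using the numeric values only as the sort key.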

How can I handle whitespace when using `deparse(substitute())` (or an alternative)?

I'm writing some code that translates mathematical function definitions to valid R code. To do so I use deparse(substitute()) to access those function definitions so that I can alter them into valid R code.
For example I have the function LN(x)^y that should become log(x)^y. I can do this using the first version of my to_r function:
to_r <- function(x) {
parse(text = gsub("LN", "log", deparse(substitute(x))))
}
to_r(LN(x)^y)
This returns expression(log(x)^y) which is what I expect.
I also get function definitions looking like LN("x a")^y. To handle those I can expand my function:
to_r_2 <- function(x) {
parse(text = gsub(" ", "_", gsub("\"", "", gsub("LN", "log", deparse(substitute(x))))))
}
to_r_2(LN("x a")^y)
This returns expression(log(x_a)^y) which is fine.
However, when my input becomes something like LN("x a")*2^y this fails:
parse(text = gsub(" ", "_", gsub("\"", "", gsub("LN", "log", deparse(substitute(LN("x a")*2^y))))))
Error in parse(text = gsub(" ", "_", gsub("\"", "", gsub("LN", "log",  :
  <text>:1:9: unexpected input
1: log(x_a)_
           ^
The reason is that deparse(substitute(LN("x a")*2^y)) introduces whitespaces around * and afterwards I gsub those whitespaces with underscores which is a problem for parse.
Is there a way to solve this?
Maybe an alternative to deparse(substitute())?
(To state the obvious: Replacing gsub(" ", "_", x) with gsub(" ", "", x) is not really an option because variable names get unreadable. For example Reason one of Something would become ReasononeofSomething which is far less readable than the attempted Reason_one_of_Something.)
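To make the cause concrete, and to show one way of staying with strings, here is a small sketch. It first prints what deparse() produces, then rewrites spaces only inside the quoted substrings. The helper name unquote_names is hypothetical, and the sketch assumes the quoted names contain no escaped quotes.

```r
# deparse() adds the spaces around `*`; they are not in the typed input
s <- deparse(quote(LN("x a")*2^y))
s
# [1] "LN(\"x a\") * 2^y"

# Hypothetical helper: touch only the quoted substrings, dropping the
# quotes and turning the spaces inside them into underscores
unquote_names <- function(s) {
  m <- gregexpr("\"[^\"]*\"", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(q) {
    gsub(" ", "_", gsub("\"", "", q))
  })
  s
}

fixed <- gsub("LN", "log", unquote_names(s))
fixed
# [1] "log(x_a) * 2^y"
parse(text = fixed)
```

The spaces that deparse() inserts around operators are left alone, so parse() no longer sees a stray underscore next to `*`.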
Here's a helper function to replace any character values in an expression with symbols (with spaces replaced with underscores):
chr_to_sym <- function(x) {
    if (is(x, "call")) {
        as.call(do.call("c", lapply(as.list(x), chr_to_sym), quote=TRUE))
    } else if (is(x, "character")) {
        as.symbol(gsub(" ", "_", x))
    } else {
        x
    }
}
We can then use that in your translation function:
to_r <- function(x) {
    expr <- substitute(x)
    expr <- do.call("substitute", list(expr, list(LN=quote(log))))
    as.expression(chr_to_sym(expr))
}
Note that this version works with the expressions directly; it doesn't do any deparsing or string manipulation, which is generally safer. It works for the examples you provide:
to_r(LN(x)^y)
# expression(log(x)^y)
to_r(LN("x a")^y)
# expression(log(x_a)^y)
to_r(LN("x a")*2^y)
# expression(log(x_a) * 2^y)
If the input is an R call object then it must, of course, conform to R syntax. In that case we can process it using a recursive function that walks through the input and replaces names containing one or more spaces with the same name with underscores in place of the spaces. At the end it also replaces LN with log. A call object is returned.
rmSpace <- function(e) {
if (length(e) == 1) e <- as.name(gsub(" ", "_", as.character(e)))
else for (i in 1:length(e)) e[[i]] <- Recall(e[[i]])
do.call("substitute", list(e, list(LN = as.name("log"))))
}
rmSpace(quote(LN("x a")*2^y))
## log(x_a) * `2`^y
# to input an expression add [[1]] after it to make it a call object
rmSpace(expression(LN("x a")*2^y)[[1]])
## log(x_a) * `2`^y
Apply as.expression to the result if you want an expression instead of call object.
If the input is a character string then we can simply replace LN with log and, for any space or run of spaces that has a digit or letter on both sides, replace the space(s) with an underscore. We return a string or a call object depending on the second argument.
rmSpace2 <- function(s, retclass = c("character", "call")) {
s1 <- gsub("\\bLN\\b", "log", s)
s2 <- gsub("([[:alnum:]]) +([[:alnum:]])", "\\1_\\2", s1, perl = TRUE)
retclass <- match.arg(retclass)
if (retclass == "character") s2 else parse(text = s2)[[1]]
}
rmSpace2("LN(x a)*2^y")
## [1] "log(x_a)*2^y"
rmSpace2("LN(x a)*2^y", "call")
## log(x_a) * 2^y
If you want an expression instead of a call object use as.expression:
as.expression(rmSpace2("LN(x a)*2^y", "call"))
## expression(log(x_a) * 2^y)

Use capture group within gsub() as index/name for another object (R) [duplicate]

I'm new to R and am stuck with backreferencing that doesn't seem to work. In:
gsub("\\((\\d+)\\)", f("\\1"), string)
It correctly grabs the number in between parentheses but doesn't apply the (correctly defined, working otherwise) function f to replace the number --> it's actually the string "\1" that passes through to f.
Am I missing something or is it just that R does not handle this? If so, any idea how I could do something similar, i.e. applying a function "on the fly" to the (actually many) numbers that occur in between parentheses in the text I'm parsing?
Thanks a lot for your help.
R does not have the option of applying a function directly to a match via gsub. You'll actually have to extract the match, transform the value, then replace the value. This is relatively easy with the regmatches function. For example:
x<-"(990283)M (31)O (29)M (6360)M"
f<-function(x) {
v<-as.numeric(substr(x,2,nchar(x)-1))
paste0(v+5,".1")
}
m <- gregexpr("\\(\\d+\\)", x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
# [1] "990288.1M 36.1O 34.1M 6365.1M"
You can make f do whatever you like; just make sure it's vector-friendly. You could also wrap this in your own function:
gsubf <- function(pattern, x, f) {
m <- gregexpr(pattern, x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
}
gsubf("\\(\\d+\\)", x, f)
Note that in these examples we're not using a capture group, we're just grabbing the entire match. There are ways to extract the capture groups but they are a bit messier. If you wanted to provide an example where such an extraction is required, I might be able to come up with something fancier.
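For completeness, here is one base-R sketch of the "messier" capture-group variant: each full match is re-matched with regexec() so that its groups can be handed to the callback. The name gsub_groups is hypothetical, and the sketch assumes f returns a single value per match and takes one argument per capture group.

```r
# Hypothetical helper: replace every match by f(group1, group2, ...).
gsub_groups <- function(pattern, x, f) {
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    if (length(hits) == 0) return(hits)          # nothing matched in this string
    # each element is c(full match, group 1, group 2, ...)
    parts <- regmatches(hits, regexec(pattern, hits, perl = TRUE))
    vapply(parts,
           function(p) as.character(do.call(f, as.list(p[-1]))),
           character(1))
  })
  x
}

x <- "(990283)M (31)O (29)M (6360)M"
gsub_groups("\\((\\d+)\\)", x,
            function(n) paste0("(", as.integer(n) + 1L, ")"))
# [1] "(990284)M (32)O (30)M (6361)M"
```

The callback receives only the captured digits, so it has to put the parentheses back itself if they should survive the replacement.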
To use a callback within a regex-capable replacement function, you may use either gsubfn or stringr functions.
When choosing between them, note that stringr is based on the ICU regex engine, while with gsubfn you may use either the default TCL engine (if the R installation has tcltk capability; otherwise the default is TRE) or PCRE (if you pass the perl=TRUE argument).
Also, note that gsubfn allows access to all capturing groups in the match object, while str_replace_all only lets you manipulate the whole match. Thus, for str_replace_all, the regex should look like (?<=\()\d+(?=\)), where 1+ digits are matched only when they are enclosed in ( and ), which are themselves excluded from the match.
With stringr, you may use str_replace_all:
library(stringr)
string <- "(990283)M (31)O (29)M (6360)M"
## Callback function to increment found number:
## (returned as character, since the replacement must be a string)
f <- function(x) { as.character(as.integer(x) + 1) }
str_replace_all(string, "(?<=\\()\\d+(?=\\))", function(m) f(m))
## => [1] "(990284)M (32)O (30)M (6361)M"
With gsubfn, pass perl=TRUE and backref=0 to be able to use lookarounds and just modify the whole match:
gsubfn("(?<=\\()\\d+(?=\\))", ~ f(m), string, perl=TRUE, backref=0)
## => [1] "(990284)M (32)O (30)M (6361)M"
If you have multiple groups in the pattern, remove backref=0 and enumerate the group value arguments in the callback function declaration:
gsubfn("(\\()(\\d+)(\\))", function(m,n,o) paste0(m,f(n),o), string, perl=TRUE)
# groups 1, 2 and 3 are passed to the callback as m, n and o
This is for multiple different replacements.
text="foo(200) (300)bar (400)foo (500)bar (600)foo (700)bar"
f=function(x)
{
    return(as.numeric(x[[1]])+5)
}
a=strsplit(text,"\\(\\K\\d+",perl=TRUE)[[1]]
# base R with PCRE, so \K works and stringr is not needed
b=f(regmatches(text,gregexpr("\\(\\K\\d+",text,perl=TRUE)))
paste0(paste0(a[-length(a)],b,collapse=""),a[length(a)]) #final output
#[1] "foo(205) (305)bar (405)foo (505)bar (605)foo (705)bar"
Here's a way that tweaks stringr::str_replace() slightly: pass a lambda formula as the replacement argument, and reference the captured group not by "\\1" but by ..1, so your gsub("\\((\\d+)\\)", f("\\1"), string) becomes str_replace2(string, "\\((\\d+)\\)", ~f(..1)), or just str_replace2(string, "\\((\\d+)\\)", f) in this simple case:
str_replace2 <- function(string, pattern, replacement, type.convert = TRUE){
    if(inherits(replacement, "formula"))
        replacement <- rlang::as_function(replacement)
    if(is.function(replacement)){
        # one list element per capture group
        grps_mat <- stringr::str_match(string, pattern)[,-1, drop = FALSE]
        grps_list <- lapply(seq_len(ncol(grps_mat)), function(i) grps_mat[,i])
        if(type.convert) {
            grps_list <- type.convert(grps_list, as.is = TRUE)
            replacement <- rlang::exec(replacement, !!! grps_list)
            replacement <- as.character(replacement)
        } else {
            replacement <- rlang::exec(replacement, !!! grps_list)
        }
    }
    stringr::str_replace(string, pattern, replacement)
}
str_replace2(
"foo (4)",
"\\((\\d+)\\)",
sqrt)
#> [1] "foo 2"
str_replace2(
"foo (4) (5)",
"\\((\\d+)\\) \\((\\d+)\\)",
~ sprintf("(%s)", ..1 * ..2))
#> [1] "foo (20)"
Created on 2020-01-24 by the reprex package (v0.3.0)

How to get the list of in-built functions used within a function

Let's say I have a function named Fun1 in which I use many different built-in R functions for various processes. How can I get a list of the built-in functions used inside Fun1?
Fun1 <- function(x,y){
sum(x,y)
mean(x,y)
c(x,y)
print(x)
print(y)
}
So my output should be a character vector, i.e. sum, mean, c, print, because these are the built-in functions I have used inside Fun1.
I have tried using grep function
grep("\\(",body(Fun1),value=TRUE)
# [1] "sum(x, y)" "mean(x, y)" "c(x, y)" "print(x)" "print(y)"
It looks OK, but the arguments (x and y) should not appear. I want just the names of the functions used inside the body of Fun1.
So my overall goal is to print the unique list of built-in functions, or any user-created functions, used inside a particular function, here Fun1.
Any help on this is highly appreciated. Thanks.
You could use all.vars() to get all the variable names (including functions) that appear inside the body of Fun1, then compare that with some prepared list of functions. You mention in-built functions, so I will compare it with the base package object names.
## full list of variable names inside the function body
(vars <- all.vars(body(Fun1)[-1], functions = TRUE))
# [1] "sum" "x" "y" "mean" "c" "print"
## compare it with the base package object names
intersect(vars, ls(baseenv()))
# [1] "sum" "mean" "c" "print"
I removed the first element of the function body because presumably you don't care about {, which would have been matched against the base package list.
Another possibility, albeit a bit less reliable, would be to compare the formal arguments of Fun1 to all the variable names in the function. Like I said, likely less reliable though because if you make assignments inside the function you will end up with incorrect results.
setdiff(vars, names(formals(Fun1)))
# [1] "sum" "mean" "c" "print"
These are fun though, and you can fiddle around with them.
Access to the parser tokens is available with functions from utils.
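# keep.source = TRUE is needed outside interactive sessions, otherwise there is no parse data
tokens <- utils::getParseData(parse(text=deparse(body(Fun1)), keep.source=TRUE))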
unique(tokens[tokens[["token"]] == "SYMBOL_FUNCTION_CALL", "text"])
[1] "sum" "mean" "c" "print"
This should be somewhat helpful - this will return all functions however.
library(magrittr) # provides %>%
func_list = Fun1 %>%
    body() %>% # extracts function body
    toString() %>% # converts to single string
    gsub("[{}]", "", .) %>% # removes curly braces
    gsub("\\s*\\([^\\)]+\\)", "", .) %>% # removes all contents between brackets
    strsplit(",") %>% # splits strings at commas
    unlist() %>% # converts to vector
    trimws(., "both") # removes all white spaces before and after
> func_list
[1] "" "sum" "mean" "c" "print" "print"
> table(func_list)
func_list
          c  mean print   sum
    1     1     1     2     1
This is extremely limited to your example... you could modify this to be more robust. It will fall over where a function has brackets nesting other functions etc.
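A sketch that avoids string handling altogether, and therefore copes with nested calls, is to walk the function body recursively and record only what sits in the function position of each call. The helper name call_heads is made up for this example, and the sketch does not handle calls with empty arguments such as x[, 1].

```r
Fun1 <- function(x, y) {
  sum(x, y)
  mean(x, y)
  c(x, y)
  print(x)
  print(y)
}

# Hypothetical helper: names found in the function position of any call
call_heads <- function(e) {
  if (!is.call(e)) return(character(0))   # symbols and constants call nothing
  head <- if (is.name(e[[1]])) as.character(e[[1]]) else call_heads(e[[1]])
  args <- as.list(e)[-1]
  unique(c(head, unlist(lapply(args, call_heads))))
}

setdiff(call_heads(body(Fun1)), "{")
# [1] "sum"   "mean"  "c"     "print"

# nested calls are followed too
call_heads(quote(sqrt(abs(x)) + 1))
# [1] "+"    "sqrt" "abs"
```

Arguments such as x and y never appear in the result because they are symbols, not calls; intersect the result with ls(baseenv()) if only base functions are wanted.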
This is not pretty, but it works:
Fun1 <- function(x,y){
sum(x,y)
mean(x,y)
c(x,y)
print(x)
print(y)
}
getFNamesInFunction <- function(f.name){
f <- deparse(body(get(f.name)))
f <- f[grepl(pattern = "\\(", x = f)]
f <- sapply(X = strsplit(split = "\\(", x = f), FUN = function(x) x[1])
unique(trimws(f[f != ""]))
}
getFNamesInFunction("Fun1")
[1] "sum" "mean" "c" "print"
as.list(Fun1)[3]
gives you the part of the function between the curly braces.
{
sum(x, y)
mean(x, y)
c(x, y)
print(x)
print(y)
}
Hence
gsub( ").*$", "", as.list(Fun1)[3])
gives you everything before the first ")", which presumably contains the name of the first function.
Taking this as a starting point, you should be able to add a loop that returns the other functions and not only the first one.
