Is there a parser in R to process natural language strings and convert them into R instructions? Something like LEX and BISON for C++. For example it would turn this string:
Dataset: Cars - Column: Speed: 15 - Range: [20-40]
into
filter_cars <- cars[cars$speed <= 15,][20:40,]
What I've seen works only for integrating R expressions in C++, that's why I'm asking.
This can be done using the functions eval() and parse(). parse() converts a string into an expression, and eval() evaluates that expression:
data(cars) # built-in example data with columns speed and dist
expr <- "filter_cars <- cars[cars$speed <= 15,][20:40,]" # text string
eval(parse(text = expr)) # convert string to expression and evaluate
EDIT: If you want to create the expressions automatically you will have to write your own function. paste() or cat() are useful for creating text strings from multiple inputs.
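As a rough illustration of such a function, the sketch below handles only strings of the exact shape shown in the question. The helper name build_filter and the regular expression are assumptions made for this example, not an existing R facility.

```r
# Hypothetical helper: turns one fixed "natural language" format into R code.
build_filter <- function(spec) {
  pattern <- paste0("^Dataset: ([A-Za-z_]+) - Column: ([A-Za-z_]+): ([0-9]+)",
                    " - Range: \\[([0-9]+)-([0-9]+)\\]$")
  m <- regmatches(spec, regexec(pattern, spec))[[1]]
  if (length(m) == 0) stop("Unrecognised specification: ", spec)
  dataset <- tolower(m[2])  # "Cars"  -> "cars"
  column  <- tolower(m[3])  # "Speed" -> "speed"
  # Assemble the instruction as text; m[4] is the threshold, m[5]:m[6] the range
  sprintf("filter_%s <- %s[%s$%s <= %s,][%s:%s,]",
          dataset, dataset, dataset, column, m[4], m[5], m[6])
}

code <- build_filter("Dataset: Cars - Column: Speed: 15 - Range: [20-40]")
code
eval(parse(text = code))  # creates filter_cars from the built-in cars data
```

Anything beyond a fixed format like this would need a real grammar, so this only shows the string-building step.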
I am new to R, please have mercy. I imported a table from an Access database via odbc:
df <- select(dbReadTable(accdb_path, name ="accdb_table"),"Col_1","Col_2","Col_3")
For
> typeof(df$Col_3)
I get
[1] "list"
Using library(dplyr.teradata), I converted the blob to a string (maybe I am already on the wrong path here):
df$Hex <- blob_to_string(df$Col_3)
and now end up with a column (typeof = character) full of Hex:
df[1,4]
[1] 49206765742061206c6f74206f662048657820616e642068617665207468652069737375652077697468207370656369616c2063687261637465727320696e204765726d616e206c616e6775616765206c696b65206e2b4150592d7
My question is how to convert each value in Col_3 into proper text (if possible, preserving German special characters like ü, ö, ä and ß).
I am aware of this solution How to convert a hex string to text in R?, but can't apply it properly:
df$Text <- rawToChar(as.raw(strtoi(df$Hex, 16L)))
Error in rawToChar(as.raw(strtoi(BinData$Hex, 16L))) :
character string '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
Thx!
If I understand this correctly, what you want to do is apply a function to each element of a list so that it returns a character vector (which you can add to a data frame, if you wish).
This can be easily accomplished with the purrr family of functions. The following takes each element df$Col_3 and runs the function (with each element being the x in the given function)
purrr::map_chr(.x = df$Col_3,
.f = function(x) {rawToChar(as.raw(strtoi(x,16L)))})
You can probably achieve the same with base R functions such as lapply() followed by unlist(), or sapply(), but with purrr it is often easier to spot inconsistent results.
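One possible reason the one-liner from the question fails is that strtoi() receives the whole hex string at once; a value that long does not fit into an integer, so the result is NA rather than one number per byte. A minimal base R sketch, assuming each value is an even-length hex string that encodes UTF-8 text (the helper name hex_to_text is made up for this example):

```r
# Hypothetical helper: decode one hex string into text, two hex digits per byte.
hex_to_text <- function(hex, encoding = "UTF-8") {
  starts <- seq(1, nchar(hex) - 1, by = 2)       # start of each 2-digit pair
  pairs  <- substring(hex, starts, starts + 1)   # "48" "61" "6c" ...
  txt    <- rawToChar(as.raw(strtoi(pairs, 16L)))
  Encoding(txt) <- encoding                      # assumption: bytes are UTF-8
  txt
}

hex_values <- c("48616c6c6f", "57656c74")        # stand-in for df$Hex
texts <- vapply(hex_values, hex_to_text, character(1), USE.NAMES = FALSE)
texts
```

If the database stores the text in another encoding, the encoding argument (or a call to iconv()) would have to be adjusted accordingly.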
I have read about prefix functions and infix functions on Hadley Wickham's Advanced R website. I would like to know if there is any way to define functions that are called by placing a prefix and suffix around a single argument, so that the prefix and suffix operate like brackets. Is there any way to create a function like this, and if so, how do you do it?
A specific example: suppose you have an object char that is a character string. You want to create a function that is called on a character string using the prefix _# and suffix #_, and that adds five dashes to the front of the character string. If programmed successfully, it would operate as shown below.
char
[1] "Hello"
_#char#_
[1] "-----Hello"
There is a way to do this as long as your special operator takes a particular form, that is, . %_% char %_% . (with a dot on each side). This works because the parser will interpret the dot as a variable name. If we use non-standard evaluation, we don't need the dot to actually exist, and we only need to use it as a marker for opening and closing our special operator. So we can do something like this:
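`%_%` <- function(a, b)
{
  # Exactly one of the two unevaluated arguments must be the marker `.`
  if((deparse(match.call()$a) != ".") +
     (deparse(match.call()$b) != ".") != 1)
    stop("Unrecognised SPECIAL")
  # Opening marker: tag the value so the closing call can recognise it
  if(deparse(match.call()$a) == ".")
    return(`attr<-`(b, "prepped", TRUE))
  # Closing marker: `a` is the tagged value returned by the opening call
  if(isTRUE(attr(a, "prepped")))
    return(paste0("-----", a))
  stop("Unrecognised SPECIAL")
}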
.%_% "hello" %_%.
#> [1] "-----hello"
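To see why the two dots can act as opening and closing markers, it may help to look at how R parses the call. The following sketch only inspects the parse tree with quote(), so it does not depend on the operator being defined:

```r
# Custom %op% operators are left-associative, so the expression is parsed as
# (. %_% "hello") %_% .
e <- quote(. %_% "hello" %_% .)

e[[1]]  # the function being called: `%_%`
e[[2]]  # first argument: the inner call . %_% "hello" (opening marker)
e[[3]]  # second argument: the symbol . (closing marker)
```

The inner call therefore runs with the dot as its first argument, and the outer call with the dot as its second argument, which is what the two branches in the function distinguish.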
However, this is a weird thing to do in R. It's not idiomatic and uses more keystrokes than a simple function call would. It would also very likely cause unpredictable problems in places where non-standard evaluation is used. This is really just a demo to show that it can be done, not that it should be done.
Writing a simple function seems like a more R-like solution. If terseness is a priority, then maybe something like
._ <- function(x) paste0("-----", x)
._("hello")
# [1] "-----hello"
Or if you wanted something more bracket-like
.. <- structure(list(NULL), class="dasher")
`[.dasher` <- function(a, x) paste0("-----", x)
..["hello"]
# [1] "-----hello"
Another way to use a custom class would be to redefine the - operator to paste that value in front of the string. For example
literal <- function(x) {class(x)<-"literal"; x}
`-.literal` <- function(e1, e2) {literal(paste0("-", unclass(e1)))}
print.literal <- function(x) print(unclass(x))
Then you can do
val <- literal("hello")
-----val
# [1] "-----hello"
---val
# [1] "---hello"
So here the number of - you type is the number you get in the output.
You can get creative/weird with syntax, but you need to make sure whatever symbols you come up with can be parsed by the parser otherwise you are out-of-luck.
I want to build a regular expression by substituting in some strings to search for, so these strings need to be escaped before I can put them in the regex; that way, if a searched-for string contains regex metacharacters, it still works.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
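As a quick check of that definition without installing Hmisc, the same gsub() call can be wrapped in a local function. The name escape_regex is an assumption for this sketch; it is not the Hmisc function itself:

```r
# Local wrapper around the gsub() call quoted above (hypothetical name).
escape_regex <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

x <- "foo[bar]"
y <- escape_regex(x)
y                                   # the value foo\[bar\], printed with doubled backslashes
grepl(y, "some foo[bar] text")      # the escaped pattern matches the literal text
grepl(escape_regex("a.c"), "abc")   # the dot no longer matches any character
```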
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
              "\\{", "\\}", "\\^", "\\$","\\*",
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
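A minimal sketch of the quoting approach, using perl = TRUE so that the PCRE engine handles the \Q...\E sequence. One caveat worth assuming: a search string that itself contains \E would end the quoted section early.

```r
x <- "foo[bar]"
pattern <- paste0("\\Q", x, "\\E")   # everything between \Q and \E is literal

grepl(pattern, "a foo[bar] b", perl = TRUE)  # literal match on "foo[bar]"
grepl(pattern, "foobar", perl = TRUE)        # no match: [bar] is not a character class here
grepl(x, "foobar", perl = TRUE)              # unquoted, foo[bar] matches "foob"
```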
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect the occurrences of non-word characters and escape them with the \\1 backreference:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, substituting "([^[:alnum:]_])" for "(\\W)".
I have expressions in character that are supposed to be evaluated in a data.table (not important just context).
To make sure all the required columns are present I would like to extract the said columns within the R expression.
What I want:
library(data.table)
DT <- data.table(p001=rnorm(10),p002=rnorm(10),p003=rnorm(10))
expr <- 'p001+mean(p001,na.rm=TRUE)-weighted.mean(p002,w=p003)+someRandomOtherColumn'
# DT[,test:=p001+mean(p001,na.rm=TRUE)-weighted.mean(p002,w=p003)+someRandomOtherColumn]
# would fail as someRandomOtherColumn is not in the columns
Basically I am looking for a way (probably a regex) that would extract from expr p001,p002,p003,someRandomOtherColumn.
My view on it:
The way I see it I should be able to capture p001,p001,TRUE,p002,p003,someRandomOtherColumn with some regex that would capture things within f(,) and then filter for 'allowed' column names (TRUE is not in that case).
Nested f(,,) are not an issue as I can call the same function recursively and nested f(,(),)are also fine.
What I have:
For now, this is what I have. It can be made to work, but it feels bad:
expr <- 'p001+mean(p001,na.rm=TRUE)-weighted.mean(p002,w=p003)+someRandomOtherColumn'
clean <- function(string) gsub(string, pattern='[_|\\.|a-zA-Z]+\\(([^)]*)\\)', replacement='\\1', perl=TRUE)
clean(expr)
[1] "p001+p001,na.rm=TRUE-p002,w=p003+someRandomOtherColumn"
# Then I can remove =* then split on ,|+|-|*
When you add a ~ to your expression, you can create a valid R formula expression:
expr <- '~ p001+mean(p001,na.rm=TRUE)-weighted.mean(p002,w=p003)+someRandomOtherColumn'
This string can be converted to a formula with as.formula. Afterwards, the variable names can be extracted with all.vars:
all.vars(as.formula(expr))
# [1] "p001" "p002" "p003" "someRandomOtherColumn"
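Building on this, the required names can be compared with the columns that actually exist. The sketch below uses a plain data.frame as a stand-in for the data.table from the question, which is an assumption made to keep the example dependency-free:

```r
# Stand-in for the data.table in the question
DT <- data.frame(p001 = rnorm(10), p002 = rnorm(10), p003 = rnorm(10))
expr <- 'p001+mean(p001,na.rm=TRUE)-weighted.mean(p002,w=p003)+someRandomOtherColumn'

# Prepend "~" so the string becomes a one-sided formula, then list its variables
needed  <- all.vars(as.formula(paste("~", expr)))
missing <- setdiff(needed, names(DT))
missing
# [1] "someRandomOtherColumn"
```

An empty result from setdiff() would mean every column used in the expression is present.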
How can I create a data frame from the following string:
my_str <- "a=1, b=2"
In other words, how can I feed my_str into the data.frame or data.table functions so that it gives me the same thing as
data.frame(a=1, b=2)
Think about how you can easily pass a string of form my_str <- "y~x1+x2+x3" into a statistical model in R by simply using as.formula(my_str) and effectively remove the quotes. So I am looking for a similar solution.
Thank you!
I would strongly discourage you from storing code as a string in R. There are almost always better ways to write R code that don't require parsing strings.
But let's assume you have no other options. Then you can write your own parser, or use R's built-in parser. The expression "a=1, b=2" on its own doesn't make any sense in R (you can't have two "assignments" separated by a comma), so it would only make sense as parameters to a function.
If you want to wrap it in data.frame(), then you can use paste() to make the string you want and then parse() and finally eval() it to run it
my_str <- "a=1, b=2"
my_code <- paste0("data.frame(", my_str, ")")
my_expr <- parse(text=my_code)
eval(my_expr)
# a b
# 1 1 2
But like I already mentioned eval/parse should generally be avoided.
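For comparison, here is a sketch of the "write your own parser" route that avoids eval/parse entirely. It assumes the string only ever contains simple name=number pairs separated by commas; anything more complex would need a different approach:

```r
my_str <- "a=1, b=2"

# Split into "name=value" pieces, then split each piece at "="
pieces <- strsplit(strsplit(my_str, ",")[[1]], "=")
col_names <- vapply(pieces, function(p) trimws(p[1]), character(1))
values    <- lapply(pieces, function(p) as.numeric(trimws(p[2])))  # assumes numeric values

df <- as.data.frame(setNames(values, col_names))
df
#   a b
# 1 1 2
```

Because no code is evaluated, a malformed or malicious string can at worst produce NA values or an error, not arbitrary side effects.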