Extract decimal numbers from string in Sparklyr - r

I've been trying to extract decimal numbers from strings in sparklyr, but it does not work with the regular syntax you would normally use outside of Spark.
I have tried using regexp_extract but it returns empty strings.
regexp_extract($170.5M, "[[:digit:]]+\\.*[[:digit:]]*")
I'm trying to get 170.5 as a result.

You could use regexpr from base R
v <- "$170.5M"
regmatches(v, regexpr("\\d*\\.\\d", v))
# [1] "170.5"

You may use
regexp_extract(col_value, "[0-9]+(?:[.][0-9]+)?", 0)
Or
regexp_extract(col_value, "\\p{Digit}+(?:\\.\\p{Digit}+)?", 0)
The third argument, 0, asks for the whole match; Spark's default group index is 1, which refers to the first capturing group, and these patterns have none.
Your [[:digit:]]+\.*[[:digit:]]* regex does not work because regexp_extract expects a Java-compatible regex pattern, and that engine does not support POSIX character classes in the [:classname:] syntax. Java spells the POSIX digit class \p{Digit}; see the Java regex documentation.
See regexp_extract documentation:
Extract a specific(idx) group identified by a java regex, from the specified string column.
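Before sending a pattern to Spark, it can help to sanity-check it locally. The sketch below runs the bracket-only pattern through base R with perl = TRUE. PCRE is not the Java engine that Spark uses, so this is only a rough check, but the pattern sticks to constructs both engines share. The sample vector v is made up for illustration.

```r
# Pattern without POSIX [:digit:] classes and without backslashes.
pattern <- "[0-9]+(?:[.][0-9]+)?"

# Made-up sample values; the last one contains no number at all.
v <- c("$170.5M", "$12M", "n/a")

# regexpr() returns -1 for elements without a match.
m <- regexpr(pattern, v, perl = TRUE)
hit <- m != -1

# Keep the result aligned with the input by filling non-matches with NA.
out <- rep(NA_character_, length(v))
out[hit] <- regmatches(v, m)

nums <- as.numeric(out)
nums
# 170.5, 12, NA
```

If this returns what you expect locally, the same pattern string is a reasonable candidate to pass to regexp_extract in Spark.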

Related

R identifier naming rules can be broken by using quotes? [duplicate]

I'm trying to understand what backticks do in R.
From what I can tell, this is not explained in the ?Quotes documentation page for R.
For example, at the R console:
"[["
# [1] "[["
`[[`
# .Primitive("[[")
It seems to return the equivalent of:
get("[[")
A pair of backticks is a way to refer to names or combinations of symbols that are otherwise reserved or illegal. Reserved words, like if, are part of the language, while illegal names include non-syntactic combinations like c a t. Both categories, reserved and illegal, are referred to in the R documentation as non-syntactic names.
Thus,
`c a t` <- 1 # is valid R
and
> `+` # is equivalent to typing in a syntactic function name
function (e1, e2) .Primitive("+")
As a commenter mentioned, ?Quotes does contain some information on the backtick, under Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula
This prose is a little hard to parse. What it means is that for R to parse a token as a name, it must 1) be a sequence of letters, digits, periods, and underscores that does not start with a digit or underscore (or a period followed by a digit), and 2) not be a reserved word in the language. Otherwise, backticks must be used for it to be parsed as a name.
Also check out ?Reserved:
Reserved words outside quotes are always parsed to be references to the objects linked to in the 'Description', and hence they are not allowed as syntactic names (see make.names). They are allowed as non-syntactic names, e.g. inside backtick quotes.
In addition, Advanced R has some examples of how backticks are used in expressions, environments, and functions.
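To make the syntactic versus non-syntactic distinction concrete, here is a minimal sketch. The helper is_syntactic is hypothetical (it is not part of base R); it relies on the fact that make.names() rewrites any name that is not syntactic.

```r
# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged.
is_syntactic <- function(x) make.names(x) == x

candidates <- c("ok_name", ".x", "20a", "c a t", "if")
is_syntactic(candidates)
# TRUE TRUE FALSE FALSE FALSE

# Non-syntactic names are still usable once wrapped in backticks.
`c a t` <- 1
`20a` <- 2
total <- `c a t` + `20a`
total
# 3
```

The reserved word if fails the check for the same reason 20a does: R cannot parse it as a plain name, so it needs backticks.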
Backticks make R take the name verbatim. For example, try this:
df <- data.frame(20a=c(1,2),b=c(3,4))
gives an error, while
df <- data.frame(`20a`=c(1,2),b=c(3,4))
does not.
Here is an incomplete answer using improper vocabulary: backticks can indicate to R that you are using a function in a non-standard way. For instance, here is a use of [[, the list subsetting function:
temp <- list("a"=1:10, "b"=rnorm(5))
# extract element one, the usual way
temp[[1]]
# extract element one using the [[ function
`[[`(temp,1)
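A short sketch of where the function-call form is actually handy: passing an operator to a higher-order function such as sapply. The list nested and its fields are made up for illustration.

```r
temp <- list(a = 1:10, b = c(2.5, 3.5))

# The operator form and the prefix (function-call) form return the same thing.
first <- `[[`(temp, 1)
same <- identical(first, temp[[1]])
same
# TRUE

# Arithmetic operators are ordinary functions too.
three <- `+`(1, 2)

# Passing `[[` as a function pulls one field out of every element.
nested <- list(list(id = 1, tag = "x"), list(id = 2, tag = "y"))
ids <- sapply(nested, `[[`, "id")
ids
# 1 2
```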

R NAMESPACE export pattern using Perl style (non-consuming regular expression)

I want to export all functions from an R package called myPackageId that do not start with a period and do not start with the string "myPackageId_".
Functions with the second pattern are automatically generated by Rcpp in C code as "RcppExport SEXP myPackageId_cFunctionname" and should not be exported by the package.
I found a solution using a non-consuming (lookahead) regular expression:
exportPattern("(?=^[^\\.])(?=^(?!myPackageId_))")
This works with R grep with option perl=TRUE. However, default R grep with extended RE and R CMD INSTALL complain about an invalid pattern.
words <- c(".test","test","myPackageId_test")
grep("(?=^[^\\.])(?=^(?!myPackageId_))",words, perl=TRUE)
grep("(?=^[^\\.])(?=^(?!myPackageId_))",words)
In the above example, the word "test" would be accepted, whereas the other words ".test" and "myPackageId_test" would not be accepted.
Expected inputs are all valid R names. These are the usual words composed of ASCII characters without whitespace. In R also the period, ".", can start a name.
Is there a pattern that I can use with grep option perl=FALSE to achieve the same goal?
Or can I tell R somehow in the NAMESPACE file to use the perl variant with grep?
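One possible workaround, sketched here under assumptions, is to skip the pattern altogether and generate explicit export() lines from the list of names. The filter below uses plain prefix tests instead of lookaheads, so no Perl-style regex is needed. How the list of function names is obtained, and how the output is written into NAMESPACE, is left open; words just reuses the example names from above.

```r
words <- c(".test", "test", "myPackageId_test")

# Drop names starting with a period or with the Rcpp-generated prefix.
keep <- !startsWith(words, ".") & !startsWith(words, "myPackageId_")
exports <- words[keep]

# One export() directive per remaining name, ready to paste into NAMESPACE.
directives <- sprintf("export(%s)", exports)
cat(directives, sep = "\n")
# export(test)
```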

Using non-ASCII characters inside functions for packages

I'm trying to write a function equivalent to scales::dollar that adds a pound (£) symbol to the beginning of a figure. Since the scales code is so robust, I've used it as a framework and simply replaced the $ for the £.
A stripped-down function example:
pounds<-function(x) paste0("£",x)
When I run a CHECK I get the following:
Found the following file with non-ASCII characters:
pounds.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.
Looking through the Writing R extensions guide it doesn't give a lot of help (IMO) on how to resolve this issue. It mentions the \uxxxx and says it refers to Unicode characters.
Looking up unicode characters yields me the code &#163 but the guidance I can find for \uxxxx is minimal and relates to Java on W3schools.
My question is thus:
How do you use non-ASCII characters in R functions via the \uxxxx escapes, and how does this affect the display of such characters after the function has been used?
For the \uxxxx escapes, you need to know the hexadecimal code point of your character. You can determine it using utf8ToInt (charToRaw returns the encoded bytes instead, which match the code point only in a single-byte encoding such as Latin-1):
sprintf("%04X", utf8ToInt("£"))
# [1] "00A3"
Now you can use this to specify your non-ASCII character. Both "\u00A3" and "£" represent the same character.
Another option is to use stringi::stri_escape_unicode:
library(stringi)
stringi::stri_escape_unicode("➛")
# "\\u279b"
This informs you that "\u279b" represents the character "➛".
Try this:
pounds<-function(x) paste0("\u00A3",x)
The stringi package can be useful in these situations:
library(stringi)
stri_escape_unicode("£")
#> [1] "\\u00a3"
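On the second half of the question (how the escape affects display): the escape only changes how the source file is written, not the resulting string. A small sketch, kept ASCII-only on purpose so it would itself pass the check:

```r
# The source stays pure ASCII; the escape is resolved when the code is parsed.
pounds <- function(x) paste0("\u00A3", x)

out <- pounds(c(5, 12.5))

# The first code point of the result is 163 (hex A3), the pound sign.
first_code_point <- utf8ToInt(out[1])[1]
sprintf("%04X", first_code_point)
# "00A3"

# Printing shows the actual symbol wherever the console can display UTF-8.
print(out)
```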

stargazer and omit regular expressions

I am trying to use regular expressions to omit some variables in stargazer. I finally found a working regex, but it's using the Perl standard. This doesn't work for the base regex in R, though regexpr in R can take a perl=T option. Given that you wrap the regex for variable sets to omit in "", you can't really pass it this option. Any ideas on how to use perl regex with stargazer?
An example of the regex I would like to use is
placed.ind2*(?:(?!:switchind).)*$
applied to these 4 strings:
placed.ind2PROF SERVICES
placed.ind2TRANSPORT
placed.ind2PROF SERVICES:switchind2TRUE
placed.ind2TRANSPORT:switchind2TRUE
I would like the first two to be selected, but not the last two.
Starting from version 4.0 (on CRAN now), you can run stargazer with the argument perl=TRUE to allow for Perl-compatible regular expressions in your other arguments.
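Before handing the pattern to stargazer, it can be checked against the four example names in base R. This sketch only tests the regex itself with grepl and perl = TRUE; it does not call stargazer.

```r
pattern <- "placed.ind2*(?:(?!:switchind).)*$"

vars <- c(
  "placed.ind2PROF SERVICES",
  "placed.ind2TRANSPORT",
  "placed.ind2PROF SERVICES:switchind2TRUE",
  "placed.ind2TRANSPORT:switchind2TRUE"
)

# The negative lookahead rejects any name containing ":switchind".
selected <- grepl(pattern, vars, perl = TRUE)
selected
# TRUE TRUE FALSE FALSE

vars[selected]
```

With stargazer 4.0 or later, the same string would then go into a call along the lines of stargazer(model, omit = pattern, perl = TRUE), where model stands for whatever fitted model you are tabulating.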
