R identifier naming rules can be broken by using quotes? [duplicate]

I'm trying to understand what backticks do in R.
From what I can tell, this is not explained in the ?Quotes documentation page for R.
For example, at the R console:
"[["
# [1] "[["
`[[`
# .Primitive("[[")
It seems to return the equivalent of:
get("[[")

A pair of backticks is a way to refer to names or combinations of symbols that are otherwise reserved or illegal. Reserved words, like if, are part of the language, while illegal names include non-syntactic combinations like c a t. These two categories, reserved and illegal, are referred to in R documentation as non-syntactic names.
Thus,
`c a t` <- 1 # is valid R
and
> `+` # is equivalent to typing in a syntactic function name
function (e1, e2) .Primitive("+")
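As a minimal sketch of the two cases above (the variable names value, three and same are only illustrative), backticks let a non-syntactic name be assigned and read back, and let an operator be called like any other function:

```r
# A non-syntactic name must be wrapped in backticks to be used as a symbol.
`c a t` <- 1
value <- `c a t` + 1

# Operators are ordinary functions once they are referred to by name.
three <- `+`(1, 2)

# The backticked symbol and get() on the string look up the same object.
same <- identical(`[[`, get("[["))

print(value)  # 2
print(three)  # 3
print(same)   # TRUE
```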
As a commenter mentioned, ?Quotes does contain some information on the backtick, under Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula
This prose is a little hard to parse. What it means is that for R to parse a token as a name, it must be 1) a sequence of letters, digits, periods and underscores that does not start with a digit, an underscore, or a period followed by a digit, and 2) not a reserved word in the language. Otherwise, to be parsed as a name, it must be wrapped in backticks.
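One rough way to test these rules is to compare a string with what make.names() turns it into, since make.names() leaves syntactic names unchanged. The helper is_syntactic below is a hypothetical name introduced for illustration, not a base R function:

```r
# Hypothetical helper: a name is treated as syntactic when make.names()
# does not need to alter it.
is_syntactic <- function(x) make.names(x) == x

candidates <- c("cat", "c a t", "20a", "if", ".x", "_x")
result <- setNames(is_syntactic(candidates), candidates)
print(result)
# Only "cat" and ".x" are syntactic; the others need backticks.
```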
Also check out ?Reserved:
Reserved words outside quotes are always parsed to be references to the objects linked to in the 'Description', and hence they are not allowed as syntactic names (see make.names). They are allowed as non-syntactic names, e.g. inside backtick quotes.
In addition, Advanced R has some examples of how backticks are used in expressions, environments, and functions.

Backticks make R take the enclosed name verbatim. For example, try this:
df <- data.frame(20a=c(1,2),b=c(3,4))
# gives an error
df <- data.frame(`20a`=c(1,2),b=c(3,4))
# doesn't give an error
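One caveat worth checking, sketched here with illustrative object names: the backticked call parses, but data.frame() still repairs the column name by default. Passing check.names = FALSE keeps the name as written, and the column then has to be accessed with backticks:

```r
# Default: check.names = TRUE runs the names through make.names().
df_default <- data.frame(`20a` = c(1, 2), b = c(3, 4))
print(names(df_default))   # "X20a" "b"

# check.names = FALSE keeps the non-syntactic name verbatim.
df_verbatim <- data.frame(`20a` = c(1, 2), b = c(3, 4), check.names = FALSE)
print(names(df_verbatim))  # "20a" "b"

# The column is then reached with backticks after $.
first_col <- df_verbatim$`20a`
```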

Here is an incomplete answer using improper vocabulary: backticks can indicate to R that you are using a function in a non-standard way. For instance, here is a use of [[, the list subsetting function:
temp <- list("a"=1:10, "b"=rnorm(5))
# extract element one, the usual way
temp[[1]]
# extract element one by calling the [[ function directly
`[[`(temp,1)
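Calling [[ by name is mostly useful when a function has to be passed as an argument. A small sketch, assuming a made-up list called records:

```r
# Illustrative data: a list of records, each with elements a and b.
records <- list(list(a = 1, b = "x"), list(a = 2, b = "y"))

# Pass the extraction function itself to sapply(); "a" becomes its second argument.
a_values <- sapply(records, `[[`, "a")

# The same idea works for arithmetic operators.
total <- Reduce(`+`, 1:4)

print(a_values)  # 1 2
print(total)     # 10
```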

Related

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \? Such that, the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did do a search on SO to see if there were any Q's that asked "what is an escape character in R". But I didn't review all the 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double quotes. The backslashes are part of the printed representation only, not of the string itself. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single quotes nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
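To confirm that the backslashes in the question exist only in the printed representation, a quick check along these lines may help (n_chars and has_backslash are illustrative names):

```r
Y_Columns <- c("Y.1.1")
out <- paste0('{"ImportId":"', Y_Columns, '"}')

# writeLines() shows the raw characters, without quoting or escaping.
writeLines(out)

# The string has 20 characters and contains no backslash at all.
n_chars <- nchar(out)
has_backslash <- grepl("\\", out, fixed = TRUE)
print(n_chars)        # 20
print(has_backslash)  # FALSE
```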
Here's a good Q&A to review: How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use the fixed argument to get around that requirement.
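The two approaches look roughly like this, using a made-up input string rather than the code from that Q&A:

```r
x <- "a+b"

# Regex approach: "+" is a quantifier, so it needs a regex escape, and the
# backslash itself must be doubled inside an R string literal.
escaped <- gsub("\\+", "-", x)

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed.
literal <- gsub("+", "-", x, fixed = TRUE)

print(escaped)  # "a-b"
print(literal)  # "a-b"
```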
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: look at the series of help pages whose names start with capital letters. Since I can never remember which one has which nugget of essential information, I tried ?Syntax first, and it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved. I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequences should be listed.

R seems to ignore part of variable name after underscore

I encountered a strange problem with R. I have a dataframe with several variables. I add a variable to this dataframe that contains an underscore, for example:
allres$tmp_weighted <- allres$day * allres$area
Before I do this, R tells me that the variable allres$tmp does not exist (which is right). However, after I add allres$tmp_weighted to the dataframe and call allres$tmp, I get the data for allres$tmp_weighted. It seems as if the part after the underscore does not matter at all to R. I tried it with several other variables and names, and it always works that way.
I don't think it should work like this. Am I overlooking something here? Below I have pasted some code together with output from the console.
# first check whether variable exists
allres_sw$Ndpsw
> NULL
#define new variable with underscore in variable name
allres_sw$Ndpsw_weighted <- allres_sw$Ndepswcrit * allres_sw$Area
#check again whether variable exists
allres_sw$Ndpsw
> [1] 17.96480 217.50240 44.84415 42.14560 0.00000 43.14444 53.98650 9.81939 0.00000 110.67720
# this is the output that I would expect from "Ndpsw_weighted" - and indeed do get
allres_sw$Ndpsw_weighted
> [1] 17.96480 217.50240 44.84415 42.14560 0.00000 43.14444 53.98650 9.81939 0.00000 110.67720
Have a look at ?`[` or ?`$` in your R console. If you look at the name argument of the extract functions, it states that names are partially matched when using the $ operator (as opposed to the `[[` operator, which matches exactly by default because its exact argument defaults to TRUE).
From ?`$`
A literal character string or a name (possibly backtick quoted). For extraction, this is normally (see under ‘Environments’) partially matched to the names of the object.
Just to expand somewhat on Wil's answer... From help('$'):
x$name
name: A literal character string or a name (possibly backtick quoted). For extraction, this is normally (see under ‘Environments’) partially matched to the names of the object.
x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of [[ can be controlled using the exact argument.
exact: Controls possible partial matching of [[ when extracting by a character vector (for most objects, but see under ‘Environments’). The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.
The key phrase here is partial match (see pmatch). You'll understand now that the underscore is nothing special: you can abbreviate allres_sw$Ndpsw_weighted to allres_sw$Ndp, provided the abbreviation is a prefix of exactly one column name. Here Ndp is not a prefix of Ndepswcrit, so there is no ambiguity.
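The behaviour can be reproduced on a small made-up data frame (the values and the tmp_other column are assumptions for illustration, and default options are assumed):

```r
allres <- data.frame(day = c(1, 2, 3), area = c(10, 20, 30))
allres$tmp_weighted <- allres$day * allres$area

# $ falls back to a partial match when there is no exact match.
partial <- allres$tmp

# [[ matches exactly by default, so the abbreviation finds nothing.
exact <- allres[["tmp"]]

# exact = FALSE makes [[ behave like $.
relaxed <- allres[["tmp", exact = FALSE]]

# Once two columns share the prefix, the abbreviation is ambiguous.
allres$tmp_other <- 0
ambiguous <- allres$tmp

print(partial)    # 10 40 90
print(exact)      # NULL
print(ambiguous)  # NULL
```

Using [[ with full names, or setting options(warnPartialMatchDollar = TRUE), is one way to keep such abbreviations from going unnoticed.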

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrixes with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a logical vector marking the matches. For i=2, the only TRUE is at position 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems not to be dealing with Unicode characters properly. When I import the sets and view them, they show the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, but none of them can be searched. (i.e., α and ß didn't turn into their unicode keys, but they're also not searchable by themselves. So I'd have to turn them into their keys, then alter their keys to a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format--and even if I could, I have a lot of entries including characters that'd need to be escaped (e.g., () or ), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (coordinate to that second question: is there a method to convert a unicode character to its key? I can't seem to find anything that does that specific function.)
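On the coordinate question, base R can convert a character to its code point with utf8ToInt() and back with intToUtf8(). The sketch below does not diagnose the import problem described above. It writes the Greek letters as \u escapes so that the script does not depend on the file's encoding, and to_key is a hypothetical helper name; results for literal non-ASCII text may still depend on the platform locale.

```r
# Sample data written with \u escapes (alpha, beta, gamma).
bigData <- c("foo", "\u03b1\u03b2\u03b3", "bar", "\u03b1\u03b2\u03b3\u03b3 (abgg)")

# fixed = TRUE matches literally, so "(" and ")" need no escaping.
hits <- grep("\u03b1\u03b2\u03b3", bigData, fixed = TRUE)
print(hits)  # 2 4

# Hypothetical helper: format the code point(s) of a string as U+XXXX keys.
to_key <- function(ch) sprintf("U+%04X", utf8ToInt(ch))
key <- to_key("\u03b3")
print(key)   # "U+03B3"

# And back from a code point to the character.
back <- intToUtf8(0x03B3)
```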

Removing grave accent (`) from the names of a list in R

I have a list l which has a grave accent "`" in its output. Why am I getting this for some names and not for others?
l
$`AMLM12PAH037A-B`
       Left.Gene.Symbols Right.Gene.Symbols
PCMTD1                 0                  1
STK31                  3                  0

$AMLOT120AT
        Left.Gene.Symbols Right.Gene.Symbols
ARHGEF3                 2                  0
CD96                    2                  0
RALYL                  12                  0
TRIO                    0                  1
You can't use invalid (non-syntactic) names unquoted; in this case the problem is the - inside the name. If you create such names, they will be shown in backticks (like yours), converted to valid names, or rejected with an error, depending on how you made them.
Among other restrictions, a name also cannot start with a number.
See the check.names argument (of data.frame, for example) and the function make.names.
From the R FAQ:
A syntactic name is a string the parser interprets as this type of expression. It consists of letters, numbers, and the dot and (for versions of R at least 1.9.0) underscore characters, and starts with either a letter or a dot not followed by a number. Reserved words are not syntactic names.
An object name is a string associated with an object that is assigned in an expression either by having the object name on the left of an assignment operation or as an argument to the assign() function. It is usually a syntactic name as well, but can be any non-empty string if it is quoted (and it is always quoted in the call to assign()).
An argument name is what appears to the left of the equals sign when supplying an argument in a function call (for example, f(trim=.5)). Argument names are also usually syntactic names, but again can be anything if they are quoted.
An element name is a string that identifies a piece of an object (a component of a list, for example.) When it is used on the right of the ‘$’ operator, it must be a syntactic name, or quoted. Otherwise, element names can be any strings. (When an object is used as a database, as in a call to eval() or attach(), the element names become object names.)
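Applied to the list in the question, the rules above suggest two options: keep the names and quote them on access, or convert them with make.names(). A minimal sketch, with the list contents replaced by placeholder numbers:

```r
# Placeholder list with the same names as in the question.
l <- list(`AMLM12PAH037A-B` = 1, AMLOT120AT = 2)

# Option 1: keep the name and quote it when extracting.
by_backtick <- l$`AMLM12PAH037A-B`
by_string <- l[["AMLM12PAH037A-B"]]

# Option 2: convert the names to syntactic ones; the "-" becomes ".".
cleaned <- l
names(cleaned) <- make.names(names(cleaned))
print(names(cleaned))  # "AMLM12PAH037A.B" "AMLOT120AT"
```

If several names collapse to the same string after conversion, make.names(names(cleaned), unique = TRUE) keeps them distinct.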
