Trying to understand R structure: what does a dot in function names signify? - r

I am trying to learn how to use R. I can use it to do basic things like reading in data and running a t-test. However, I am struggling to understand the way R is structured (I am have a very mediocre java background).
What I don't understand is the way the functions are classified.
For example in is.na(someVector), is is a class? Or for read.csv, is csv a method of the read class?
I need an easier way to learn the functions than simply memorizing them randomly. I like the idea of things belonging to other things. To me it seems like this gives a language a tree structure which makes learning more efficient.
Thank you
Sorry if this is an obvious question I am genuinely confused and have been reading/watching quite a few tutorials.

Your confusion is entirely understandable, since R mixes two conventions of using (1) . as a general-purpose word separator (as in is.na(), which.min(), update.formula(), data.frame() ...) and (2) . as an indicator of an S3 method, method.class (i.e. foo.bar() would be the "foo" method for objects with class attribute "bar"). This makes functions like summary.data.frame() (i.e., the summary method for objects with class data.frame) especially confusing.
As #thelatemail points out above, there are some other sets of functions that repeat the same prefix for a variety of different options (as in read.table(), read.delim(), read.fwf() ...), but these are entirely conventional, not specified anywhere in the formal language definition.
dotfuns <- apropos("[a-z]\\.[a-z]")
dotstart <- gsub("\\.[a-zA-Z]+","",dotfuns)
head(dotstart)
tt <- table(dotstart)
head(rev(sort(tt)),10)
## as is print Sys file summary dev format all sys
## 118 51 32 18 17 16 16 15 14 13
(Some of these are actually S3 generics, some are not. For example, Sys.*(), dev.*(), and file.*() are not.)
Historically _ was used as a shortcut for the assignment operator <- (before = was available as a synonym), so it wasn't available as a word separator. I don't know offhand why camelCase wasn't adopted instead.
Confusingly, methods("is") returns is.na() among many others, but it is effectively just searching for functions whose names start with "is."; it warns that "function 'is' appears not to be generic"
Rasmus Bååth's presentation on naming conventions is informative and entertaining (if a little bit depressing).
extra credit: are there any dot-separated S3 method names, i.e. cases where a function name of the form x.y.z represents the x.y method for objects with class attribute z ?
answer (from Hadley Wickham in comments): as.data.frame.data.frame() wins. as.data.frame is an S3 generic (unlike, say, as.numeric), and as.data.frame.data.frame is its method for data.frame objects. Its purpose (from ?as.data.frame):
If a data frame is supplied, all classes preceding ‘"data.frame"’
are stripped, and the row names are changed if that argument is
supplied.

Related

use of $ and () in same syntax?

I'm certain there is a really basic answer to this, which is possibly why I'm finding it hard to actually search for and find an answer. But... can somebody please explain exactly what it means to combine $ and () in the same syntax in R?
For example from this vignette:
https://cran.r-project.org/web/packages/pivottabler/vignettes/v00-vignettes.html
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$renderPivot()
I never encountered this while learning R until now years later. I'm seeing it more and more lately but it is not intuitive to me?
$ is usually used when accessing sub-structures of objects in R like columns of a data frame e.g dataframe$column1, while () is usually used to enclose all arguments of a named function e.g rnorm(10,0,1)
What does it mean when they are used together? e.g. x$y(z)
The dollar is a generic operator used to extract or replace parts of recursive objects, such as lists and data frames.
A list is an object consisting of an ordered collection of objects (including other lists), perhaps of different types, said components.
Consider the following list:
L <- list(a = 1, f = function() message("hello"))
This is a list with two components: a and f.
The first is a number and the second is a function. By applying the $-operator, you extract the value of the component, which can also be reassigned:
L$a
# 1
L$a <- 2
L$a
# 2
In the case of the f component, because it is a function, you get its body:
L$f
# function() message("hello")
This is in line with each function identifier: its value is the function's body. It is not surprising that, applying the parentheses to the function's identifier, you execute the function, that is:
L$f()
# hello
This opens the doors to very powerful structures, where you can store both data and the functions to manipulate them.
This logic resembles the classes used in the OOP world. Of course, you need much more features, such instantiations, inheritance. Such mechanisms are provided, for example, by the R6 package, which you mention in your tag.
library(R6)
A <- R6Class("A", list(f=function() message("hello") ))
a <- A$new()
a$f()
# hello
A is an R6 class, so A$new() creates a new instance of the class, a, by means of the class function new. As you can see, this function is called using a syntax (and a logic) similar to L$f() above. The instance a inherits the class function f, said method here, and a$f() executes it.

How to make an R object immutable? [duplicate]

I'm working in R, and I'd like to define some variables that I (or one of my collaborators) cannot change. In C++ I'd do this:
const std::string path( "/projects/current" );
How do I do this in the R programming language?
Edit for clarity: I know that I can define strings like this in R:
path = "/projects/current"
What I really want is a language construct that guarantees that nobody can ever change the value associated with the variable named "path."
Edit to respond to comments:
It's technically true that const is a compile-time guarantee, but it would be valid in my mind that the R interpreter would throw stop execution with an error message. For example, look what happens when you try to assign values to a numeric constant:
> 7 = 3
Error in 7 = 3 : invalid (do_set) left-hand side to assignment
So what I really want is a language feature that allows you to assign values once and only once, and there should be some kind of error when you try to assign a new value to a variabled declared as const. I don't care if the error occurs at run-time, especially if there's no compilation phase. This might not technically be const by the Wikipedia definition, but it's very close. It also looks like this is not possible in the R programming language.
See lockBinding:
a <- 1
lockBinding("a", globalenv())
a <- 2
Error: cannot change value of locked binding for 'a'
Since you are planning to distribute your code to others, you could (should?) consider to create a package. Create within that package a NAMESPACE. There you can define variables that will have a constant value. At least to the functions that your package uses. Have a look at Tierney (2003) Name Space Management for R
I'm pretty sure that this isn't possible in R. If you're worried about accidentally re-writing the value then the easiest thing to do would be to put all of your constants into a list structure then you know when you're using those values. Something like:
my.consts<-list(pi=3.14159,e=2.718,c=3e8)
Then when you need to access them you have an aide memoir to know what not to do and also it pushes them out of your normal namespace.
Another place to ask would be R development mailing list. Hope this helps.
(Edited for new idea:) The bindenv functions provide an
experimental interface for adjustments to environments and bindings within environments. They allow for locking environments as well as individual bindings, and for linking a variable to a function.
This seems like the sort of thing that could give a false sense of security (like a const pointer to a non-const variable) but it might help.
(Edited for focus:) const is a compile-time guarantee, not a lock-down on bits in memory. Since R doesn't have a compile phase where it looks at all the code at once (it is built for interactive use), there's no way to check that future instructions won't violate any guarantee. If there's a right way to do this, the folks at the R-help list will know. My suggested workaround: fake your own compilation. Write a script to preprocess your R code that will manually substitute the corresponding literal for each appearance of your "constant" variables.
(Original:) What benefit are you hoping to get from having a variable that acts like a C "const"?
Since R has exclusively call-by-value semantics (unless you do some munging with environments), there isn't any reason to worry about clobbering your variables by calling functions on them. Adopting some sort of naming conventions or using some OOP structure is probably the right solution if you're worried about you and your collaborators accidentally using variables with the same names.
The feature you're looking for may exist, but I doubt it given the origin of R as a interactive environment where you'd want to be able to undo your actions.
R doesn't have a language constant feature. The list idea above is good; I personally use a naming convention like ALL_CAPS.
I took the answer below from this website
The simplest sort of R expression is just a constant value, typically a numeric value (a number) or a character value (a piece of text). For example, if we need to specify a number of seconds corresponding to 10 minutes, we specify a number.
> 600
[1] 600
If we need to specify the name of a file that we want to read data from, we specify the name as a character value. Character values must be surrounded by either double-quotes or single-quotes.
> "http://www.census.gov/ipc/www/popclockworld.html"
[1] "http://www.census.gov/ipc/www/popclockworld.html"

What do . (dot) and % (percentage) mean in R?

My question might sound stupid but I have noticed that . and % is often used in R and to be frank I don't really know why it is used.
I have seen it in dplyr (go here for an example) and data.table (i.e. .SD) but I am sure it must be used in other place as well.
Therefore, my question is:
What does . mean? Is it some kind of R coding best practice nomenclature? (i.e. _functionName is often used in javascript to indicate it is a private function). If yes, what's the rule?
Same question for %, which is also often used in R (i.e. %in%,%>%,...).
My guess always has been that . and % are a convenient way to quickly call function but the way data.table uses . does not follow this logic, which confuses me.
. has no inherent/magical meaning in R. It's just another character that you can use in symbol names. But because it is so convenient to type, it has been given special meaning by certain functions and conventions in R. Here are just a few
. is used look up S3 generic method implementations. For example, if you call a generic function like plot with an object of class lm as the first parameter, then it will look for a function named plot.lm and, if found, call that.
often . in formulas means "all other variables", for example lm(y~., data=dd) will regress y on all the other variables in the data.frame dd.
libraries like dplyr use it as a special variable name to indicate the current data.frame for methods like do(). They could just as easily have chosen to use the variable name X instead
functions like bquote use .() as a special function to escape variables in expressions
variables that start with a period are considered "hidden" and will not show up with ls() unless you call ls(all.names=TRUE) (similar to the UNIX file system behavior)
However, you can also just define a variable named my.awesome.variable<-42 and it will work just like any other variable.
A % by itself doesn't mean anything special, but R allows you to define your own infix operators in the form %<something>% using two percent signs. If you define
`%myfun%` <- function(a,b) {
a*3-b*2
}
you can call it like
5 %myfun% 2
# [1] 11
MrFlick's answer doesn't cover the usage of . in data.table;
In data.table, . is (essentially) an alias for list, so any* call to [.data.table that accepts a list can also be passed an object wrapped in .().
So the following are equivalent:
DT[ , .(x, y)]
DT[ , list(x, y)]
*well, not quite. any use in the j argument, yes; elsewhere is a work in progress, see here.

Why is using assign bad?

This post (Lazy evaluation in R – is assign affected?) covers some common ground but I am not sure it answers my question.
I stopped using assign when I discovered the apply family quite a while back, albeit, purely for reasons of elegance in situations such as this:
names.foo <- letters
values.foo <- LETTERS
for (i in 1:length(names.foo))
assign(names.foo[i], paste("This is: ", values.foo[i]))
which can be replaced by:
foo <- lapply(X=values.foo, FUN=function (k) paste("This is :", k))
names(foo) <- names.foo
This is also the reason this (http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f) R-faq says this should be avoided.
Now, I know that assign is generally frowned upon. But are there other reasons I don't know? I suspect it may mess with the scoping or lazy evaluation but I am not sure? Example code that demonstrates such problems will be great.
Actually those two operations are quite different. The first gives you 26 different objects while the second gives you only one. The second object will be a lot easier to use in analyses. So I guess I would say you have already demonstrated the major downside of assign, namely the necessity of then needing always to use get for corralling or gathering up all the similarly named individual objects that are now "loose" in the global environment. Try imagining how you would serially do anything with those 26 separate objects. A simple lapply(foo, func) will suffice for the second strategy.
That FAQ citation really only says that using assignment and then assigning names is easier, but did not imply it was "bad". I happen to read it as "less functional" since you are not actually returning a value that gets assigned. The effect looks to be a side-effect (and in this case the assign strategy results in 26 separate side-effects). The use of assign seems to be adopted by people that are coming from languages that have global variables as a way of avoiding picking up the "True R Way", i.e. functional programming with data-objects. They really should be learning to use lists rather than littering their workspace with individually-named items.
There is another assignment paradigm that can be used:
foo <- setNames( paste0(letters,1:26), LETTERS)
That creates a named atomic vector rather than a named list, but the access to values in the vector is still done with names given to [.
As the source of fortune(236) I thought I would add a couple examples (also see fortune(174)).
First, a quiz. Consider the following code:
x <- 1
y <- some.function.that.uses.assign(rnorm(100))
After running the above 2 lines of code, what is the value of x?
The assign function is used to commit "Action at a distance" (see http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming) or google for it). This is often the source of hard to find bugs.
I think the biggest problem with assign is that it tends to lead people down paths of thinking that take them away from better options. A simple example is the 2 sets of code in the question. The lapply solution is more elegant and should be promoted, but the mere fact that people learn about the assign function leads people to the loop option. Then they decide that they need to do the same operation on each object created in the loop (which would be just another simple lapply or sapply if the elegant solution were used) and resort to an even more complicated loop involving both get and apply along with ugly calls to paste. Then those enamored with assign try to do something like:
curname <- paste('myvector[', i, ']')
assign(curname, i)
And that does not do quite what they expected which leads to either complaining about R (which is as fair as complaining that my next door neighbor's house is too far away because I chose to walk the long way around the block) or even worse, delve into using eval and parse to get their constructed string to "work" (which then leads to fortune(106) and fortune(181)).
I'd like to point out that assign is meant to be used with environments.
From that point of view, the "bad" thing in the example above is using a not quite appropriate data structure (the base environment instead of a list or data.frame, vector, ...).
Side note: also for environments, the $ and $<- operators work, so in many cases the explicit assign and get isn't necessary there, neither.

Declaring a Const Variable in R

I'm working in R, and I'd like to define some variables that I (or one of my collaborators) cannot change. In C++ I'd do this:
const std::string path( "/projects/current" );
How do I do this in the R programming language?
Edit for clarity: I know that I can define strings like this in R:
path = "/projects/current"
What I really want is a language construct that guarantees that nobody can ever change the value associated with the variable named "path."
Edit to respond to comments:
It's technically true that const is a compile-time guarantee, but it would be valid in my mind that the R interpreter would throw stop execution with an error message. For example, look what happens when you try to assign values to a numeric constant:
> 7 = 3
Error in 7 = 3 : invalid (do_set) left-hand side to assignment
So what I really want is a language feature that allows you to assign values once and only once, and there should be some kind of error when you try to assign a new value to a variabled declared as const. I don't care if the error occurs at run-time, especially if there's no compilation phase. This might not technically be const by the Wikipedia definition, but it's very close. It also looks like this is not possible in the R programming language.
See lockBinding:
a <- 1
lockBinding("a", globalenv())
a <- 2
Error: cannot change value of locked binding for 'a'
Since you are planning to distribute your code to others, you could (should?) consider to create a package. Create within that package a NAMESPACE. There you can define variables that will have a constant value. At least to the functions that your package uses. Have a look at Tierney (2003) Name Space Management for R
I'm pretty sure that this isn't possible in R. If you're worried about accidentally re-writing the value then the easiest thing to do would be to put all of your constants into a list structure then you know when you're using those values. Something like:
my.consts<-list(pi=3.14159,e=2.718,c=3e8)
Then when you need to access them you have an aide memoir to know what not to do and also it pushes them out of your normal namespace.
Another place to ask would be R development mailing list. Hope this helps.
(Edited for new idea:) The bindenv functions provide an
experimental interface for adjustments to environments and bindings within environments. They allow for locking environments as well as individual bindings, and for linking a variable to a function.
This seems like the sort of thing that could give a false sense of security (like a const pointer to a non-const variable) but it might help.
(Edited for focus:) const is a compile-time guarantee, not a lock-down on bits in memory. Since R doesn't have a compile phase where it looks at all the code at once (it is built for interactive use), there's no way to check that future instructions won't violate any guarantee. If there's a right way to do this, the folks at the R-help list will know. My suggested workaround: fake your own compilation. Write a script to preprocess your R code that will manually substitute the corresponding literal for each appearance of your "constant" variables.
(Original:) What benefit are you hoping to get from having a variable that acts like a C "const"?
Since R has exclusively call-by-value semantics (unless you do some munging with environments), there isn't any reason to worry about clobbering your variables by calling functions on them. Adopting some sort of naming conventions or using some OOP structure is probably the right solution if you're worried about you and your collaborators accidentally using variables with the same names.
The feature you're looking for may exist, but I doubt it given the origin of R as a interactive environment where you'd want to be able to undo your actions.
R doesn't have a language constant feature. The list idea above is good; I personally use a naming convention like ALL_CAPS.
I took the answer below from this website
The simplest sort of R expression is just a constant value, typically a numeric value (a number) or a character value (a piece of text). For example, if we need to specify a number of seconds corresponding to 10 minutes, we specify a number.
> 600
[1] 600
If we need to specify the name of a file that we want to read data from, we specify the name as a character value. Character values must be surrounded by either double-quotes or single-quotes.
> "http://www.census.gov/ipc/www/popclockworld.html"
[1] "http://www.census.gov/ipc/www/popclockworld.html"

Resources