I was reading the book 'Data Mining with R' and came across this code:
library(DMwR)
clean.algae <- knnImputation(algae, k = 10)
x <- sapply(names(clean.algae)[12:18],
            function(x, names.attrs) {
              f <- as.formula(paste(x, "~ ."))
              dataset(f, clean.algae[, c(names.attrs, x)], x)
            },
            names(clean.algae)[1:11])
I thought x could be rewritten as:
y <- sapply(names(clean.algae)[12:18],
            function(x) {
              f <- as.formula(paste(x, "~ ."))
              dataset(f, clean.algae[, c(names(clean.algae)[1:11], x)], x)
            }
)
However, identical(x,y) returns FALSE.
I decided to investigate why by restricting my attention to just the first element of these lists.
I found that:
identical(attributes(x[[1]])$data,
attributes(y[[1]])$data)
[1] FALSE
However:
which(!(attributes(x[[1]])$data == attributes(y[[1]])$data))
integer(0)
To me this means that all elements in the two data frames are equal, so the data frames should be identical. Why is this not the case?
I also have a similar question about the object's formula attribute:
identical(attributes(x[[1]])$formula,
attributes(y[[1]])$formula)
[1] FALSE
attributes(x[[1]])$formula == attributes(y[[1]])$formula
[1] TRUE
tl;dr the source of the non-identicality is indeed in differences in associated environments, both of the @formula slots of the components of the objects, and in the terms attributes of the @data slots. As @ThomasK points out in comments above, for most comparison purposes all.equal() is good enough/preferred ...
Formulas are equal but not identical:
identical(x$a1@formula,y$a1@formula)
## [1] FALSE
all.equal(x$a1@formula,y$a1@formula)
## TRUE
Environments differ:
environment(x$a1@formula)
## <environment: 0x9a408dc>
environment(y$a1@formula)
## <environment: 0x9564aa4>
Setting the environments to be identical makes the formulae identical:
environment(x$a1@formula) <- .GlobalEnv
environment(y$a1@formula) <- .GlobalEnv
identical(x$a1@formula,y$a1@formula)
## TRUE
However, there's more stuff that's different: identical(x$a1,y$a1) is still FALSE.
Digging some more:
for (i in slotNames(x$a1)) {
  print(i)
  print(identical(slot(x$a1,i),slot(y$a1,i)))
}
## [1] "data"
## [1] FALSE
## [1] "name"
## [1] TRUE
## [1] "formula"
## [1] TRUE
Digging deeper into the data slot (also with judicious use of str()) finds more environments -- associated with terms (closely related to formulae) this time:
dx <- x$a1@data
dy <- y$a1@data
environment(attr(dx,"terms"))
## <environment: 0x9a408dc>
environment(attr(dy,"terms"))
## <environment: 0x9564aa4>
Setting these equal to each other should lead to identicality between x$a1 and y$a1, but I haven't tested.
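The same effect can be reproduced in base R without DMwR. The sketch below is only an illustration, and make_formula is a hypothetical helper: it builds the same formula in two separate function calls, so each copy captures a different environment.

```r
# Hypothetical helper: every call creates the formula inside a fresh
# function environment, which the formula then carries with it.
make_formula <- function() y ~ x

f1 <- make_formula()
f2 <- make_formula()

# Same expression, but identical() also compares the attached environments.
same_before <- identical(f1, f2)

# Point both formulas at one shared environment and compare again.
environment(f1) <- globalenv()
environment(f2) <- globalenv()
same_after <- identical(f1, f2)

print(c(before = same_before, after = same_after))
```

The comparison is expected to fail before the environments are aligned and to succeed afterwards, which mirrors the @formula result above.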
Let's imagine that I have a class "my" and I want to trigger certain behaviour when it is added to an object that has units (i.e. from units package):
library(units)
my1 = structure(2, class="my")
Ops.my <- function(e1, e2=NULL) {
  ok <-
    switch(
      .Generic,
      `-` = ,
      `*` = ,
      `+` = ,
      '<=' = TRUE,
      FALSE
    )
  if (!ok) {
    stop(gettextf("%s not meaningful", sQuote(.Generic)))
  }
  get(.Generic)(as.integer(e1), as.integer(e2))
}
my1+set_units(5,nm)
Currently, it gives me the following warning:
Warning message:
Incompatible methods ("Ops.my", "Ops.units") for "+"
But I actually want to handle "my" and "units" addition in a certain way, how do I do it?
I tried with something like Ops.my.units <- but it doesn't seem to work.
There doesn't seem to be a way to do this with Ops. From the docs:
The classes of both arguments are considered in dispatching any member of this group. For each argument its vector of classes is examined to see if there is a matching specific (preferred) or Ops method. If a method is found for just one argument or the same method is found for both, it is used. If different methods are found, there is a warning about ‘incompatible methods’
This is probably a good thing. Part of the benefit of an object-oriented system in a non-compiled language like R is that it helps preserve type safety. This stops you from accidentally adding apples to oranges, as we can see in the following example:
apples <- structure(2, class = "apples")
oranges <- structure(2, class = "oranges")
Ops.apples <- function(e1, e2) {
  value <- do.call(.Generic, list(as.integer(e1), as.integer(e2)))
  class(value) <- "apples"
  value
}
Ops.oranges <- function(e1, e2) {
  value <- do.call(.Generic, list(as.integer(e1), as.integer(e2)))
  class(value) <- "oranges"
  value
}
apples + apples
#> [1] 4
#> attr(,"class")
#> [1] "apples"
oranges + oranges
#> [1] 4
#> attr(,"class")
#> [1] "oranges"
apples + oranges
#> [1] 4
#> attr(,"class")
#> [1] "apples"
#> Warning message:
#> Incompatible methods ("Ops.apples", "Ops.oranges") for "+"
You can see that even here, we could just ignore the warning.
suppressWarnings(apples + oranges)
#> [1] 4
#> attr(,"class")
#> [1] "apples"
But hopefully you can see why this may not be good - we have added 2 apples and 2 oranges, and have returned 4 apples.
Throughout R and its extension packages, there are numerous type-conversion functions such as as.integer, as.numeric, as.logical, as.character, as.difftime etc. These allow for some element of control when converting between types and performing operations on different types.
The "right" way to do this kind of thing is to explicitly convert one of the object types to the other before performing the operation:
as.my <- function(x) UseMethod("as.my")
as.my.default <- function(x) {
  value <- as.integer(x)
  class(value) <- 'my'
  value
}
my1 + as.my(set_units(5,nm))
#> [1] 7
I am looking for a way to check if a given instance of R6 class is present in a vector of R6 class instances.
library(R6)
# define a class
Person <- R6Class("Person", list(
  name = NULL,
  initialize = function(name) self$name <- name
))
# create two instances of a class
Jack <- Person$new(name = "Jack")
Jill <- Person$new(name = "Jill")
I naively used %in% to check this, and it seems to have worked:
# yes
c(Jack) %in% c(Jack, Jill)
#> [1] TRUE
But it actually returns TRUE no matter the instance:
# also yes
c(Jack) %in% c(Jill)
#> [1] TRUE
So I had two questions:
What is %in% actually matching that it always returns TRUE?
How can I correctly check if an instance of R6 class is present in a vector of class instances?
R6 objects are environments, so c(Jack) is a list containing an environment, and this is how %in% behaves on lists of environments:
e1 <- new.env()
e2 <- new.env()
list(e1) %in% list(e2)
## [1] TRUE
Use identical() instead:
sapply(c(Jack , Jill), identical, Jack)
## [1] TRUE FALSE
R6 objects have "R6" in their class vector so
sapply(c(Jack, Jill, sin, 37), inherits, "R6")
## [1] TRUE TRUE FALSE FALSE
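As far as I understand it, the reason is that %in% is a wrapper around match(), and ?match says that lists are converted to character vectors before matching, so every environment ends up as the same string. Below is a minimal sketch using plain environments as stand-ins for R6 instances; has_instance is a hypothetical helper name, not part of R6.

```r
e1 <- new.env()
e2 <- new.env()

# Both environments collapse to the same string when the list is coerced
# to character, which is why match() and %in% cannot tell them apart.
same_text <- identical(as.character(list(e1)), as.character(list(e2)))
print(same_text)

# Hypothetical helper: TRUE if `x` is the very same object as any element
# of `objects`, compared with identical() instead of match().
has_instance <- function(x, objects) {
  any(vapply(objects, identical, logical(1), x))
}

has_instance(e1, list(e1, e2))  # e1 is in the list
has_instance(e1, list(e2))      # e1 is not in the list
```

With the R6 objects from the question, the equivalent call would be has_instance(Jack, c(Jack, Jill)).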
I am trying to dive into the internals of static code analysis packages like codetools and CodeDepends, and my immediate goal is to understand how to detect function calls written as package_name::function_name() or package_name:::function_name(). I would have liked to just use findGlobals() from codetools, but this is not so simple.
Example function to analyze:
f <- function(n){
  tmp <- digest::digest(n)
  stats::rnorm(n)
}
Desired functionality:
analyze_function(f)
## [1] "digest::digest" "stats::rnorm"
Attempt with codetools:
library(codetools)
f = function(n) stats::rnorm(n)
findGlobals(f, merge = FALSE)
## $functions
## [1] "::"
##
## $variables
## character(0)
CodeDepends comes closer, but I am not sure I can always use the output to match functions to packages. I am looking for an automatic rule that connects rnorm() to stats and digest() to digest.
library(CodeDepends)
getInputs(body(f))
## An object of class "ScriptNodeInfo"
## Slot "files":
## character(0)
##
## Slot "strings":
## character(0)
##
## Slot "libraries":
## [1] "digest" "stats"
##
## Slot "inputs":
## [1] "n"
##
## Slot "outputs":
## [1] "tmp"
##
## Slot "updates":
## character(0)
##
## Slot "functions":
## { :: digest rnorm
## NA NA NA NA
##
## Slot "removes":
## character(0)
##
## Slot "nsevalVars":
## character(0)
##
## Slot "sideEffects":
## character(0)
##
## Slot "code":
## {
## tmp <- digest::digest(n)
## stats::rnorm(n)
## }
EDIT To be fair to CodeDepends, there is so much customizability and power for those who understand the internals. At the moment, I am just trying to wrap my head around collectors, handlers, walkers, etc. Apparently, it is possible to modify the standard :: collector to make special note of each namespaced call. For now, here is a naive attempt at something similar.
col <- inputCollector(`::` = function(e, collector, ...){
  collector$call(paste0(e[[2]], "::", e[[3]]))
})
getInputs(quote(stats::rnorm(x)), collector = col)@functions
stats::rnorm rnorm
NA NA
If you want to extract namespaced functions from a function, try something like this
find_ns_functions <- function(f, found=c()) {
  if( is.function(f) ) {
    # function, begin search on body
    return(find_ns_functions(body(f), found))
  } else if (is.call(f) && is.name(f[[1]]) &&
             as.character(f[[1]]) %in% c("::", ":::")) {
    # a call to `::` or `:::`; test the symbol directly so the condition
    # is always a single TRUE/FALSE, then record e.g. "stats::rnorm"
    found <- c(found, deparse(f))
  } else if (is.recursive(f)) {
    # compound object, iterate through sub-parts
    v <- lapply(as.list(f), find_ns_functions, found)
    found <- unique( c(found, unlist(v) ))
  }
  found
}
And we can test with
f <- function(n){
  tmp <- digest::digest(n)
  stats::rnorm(n)
}
find_ns_functions(f)
# [1] "digest::digest" "stats::rnorm"
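To see why the check on f[[1]] works, and why findGlobals() reports only "::", it may help to look at how R stores a namespaced call. This is a small base-R illustration and does not use codetools or CodeDepends:

```r
e <- quote(stats::rnorm(n))

# The function position of the outer call is itself a call to `::`.
fn <- e[[1]]
is.call(fn)

# That inner call has three parts: the `::` symbol, the package name,
# and the function name.
parts <- vapply(as.list(fn), as.character, character(1))
print(parts)

# deparse() turns the inner call back into its text form.
deparse(fn)
```

So the function being called is not a plain symbol but a nested call whose own function is `::`, which is what the recursive search looks for.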
Ok, so this was possible with CodeDepends previously, but a bit harder than it should have been. I've just committed version 0.5-4 to github, which now makes this really "easy". Essentially you just need to modify the default colonshandlers ("::" and/or ":::") as follows:
library(CodeDepends) # version >= 0.5-4
handler = function(e, collector, ..., iscall = FALSE) {
  collector$library(asVarName(e[[2]]))
  ## :: or ::: name, remove if you don't want to count those as functions called
  collector$call(asVarName(e[[1]]))
  if(iscall)
    collector$call(deparse((e))) # whole expr, i.e. stats::rnorm
  else
    collector$vars(deparse((e)), input=TRUE) # whole expr, i.e. stats::rnorm
}
getInputs(quote(stats::rnorm(x,y,z)), collector = inputCollector("::" = handler))
getInputs(quote(lapply( 1:10, stats::rnorm)), collector = inputCollector("::" = handler))
The first getInputs call above gives the result:
An object of class "ScriptNodeInfo"
Slot "files":
character(0)
Slot "strings":
character(0)
Slot "libraries":
[1] "stats"
Slot "inputs":
[1] "x" "y" "z"
Slot "outputs":
character(0)
Slot "updates":
character(0)
Slot "functions":
:: stats::rnorm
NA NA
Slot "removes":
character(0)
Slot "nsevalVars":
character(0)
Slot "sideEffects":
character(0)
Slot "code":
stats::rnorm(x, y, z)
As, I believe, desired.
One thing to note here is the iscall argument I've added to the colons handler. The default handler and applyhandlerfactory now have special logic so that when they invoke one of the colons handlers in a situation where it is a function being called, that is set to TRUE.
I haven't done extensive testing yet of what will happen when "stats::rnorm" appears in lieu of symbols, particularly in the inputs slot when calculating dependencies, but I'm hopeful that should all continue to work as well. If it doesn't, let me know.
~G
I have a data.table myDT, and I'm making "copies" of this table in 3 different ways:
myDT <- data.table(colA = 1:3)
myDT[colA == 3]
copy1 <- copy(myDT)
copy2 <- myDT # yes I know that it's a reference, not real copy
copy3 <- myDT[,.(colA)] # I list all columns from the original table
Then I'm comparing those copies with the original table:
identical(myDT, copy1)
# TRUE
identical(myDT, copy2)
# TRUE
identical(myDT, copy3)
# FALSE
I was trying to figure out what was the difference between myDT and copy3
identical(names(myDT), names(copy3))
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE)
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# TRUE
attr.all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# NULL
all.equal(myDT, copy3)
# [1] "Attributes: < Length mismatch: comparison on first 1 components >"
attr.all.equal(myDT, copy3)
# [1] "Attributes: < Names: 1 string mismatch >"
# [2] "Attributes: < Length mismatch: comparison on first 3 components >"
# [3] "Attributes: < Component 3: Attributes: < Modes: list, NULL > >"
# [4] "Attributes: < Component 3: Attributes: < names for target but not for current > >"
# [5] "Attributes: < Component 3: Attributes: < current is not list-like > >"
# [6] "Attributes: < Component 3: Numeric: lengths (0, 3) differ >"
My original question was how to understand the last output. Finally I came to using the attributes() function:
attr0 <- attributes(myDT)
attr3 <- attributes(copy3)
str(attr0)
str(attr3)
This showed that the original data.table had an index attribute, which was not copied when I created copy3.
To make this question a bit clearer (and maybe useful for future readers): what really happened here is either that you set a secondary key explicitly by calling set2key (probably not), or that data.table set a secondary key automatically while you were performing an ordinary operation such as filtering. This is a (not so) new feature added in v1.9.4:
DT[column==value] and DT[column %in% values] are now optimized to use
DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a.
index) is automatically added so the next DT[column==value] is much
faster. No code changes are needed; existing code should automatically
benefit. Secondary keys can be added manually using set2key() and
existence checked using key2(). These optimizations and function
names/arguments are experimental and may be turned off with
options(datatable.auto.index=FALSE).
Let's reproduce this:
myDT <- data.table(A = 1:3)
options(datatable.verbose = TRUE)
myDT[A == 3]
# Creating new index 'A' <~~~~ Here it is
# forder took 0 sec
# Coercing double column i.'V1' to integer to match type of x.'A'. Please avoid coercion for efficiency.
# Starting bmerge ...done in 0 secs
# A
# 1: 3
attr(myDT, "index") # or using `key2(myDT)`
# integer(0)
# attr(,"__A")
# integer(0)
So, contrary to what you were assuming, you actually did create a copy, and thus the secondary key wasn't transferred with it. Compare:
copy1 <- myDT
attr(copy1, "index")
# integer(0)
# attr(,"__A")
# integer(0)
copy2 <- myDT[,.(A)]
# Detected that j uses these columns: A <~~~ This is where the copy occurs
attr(copy2, "index")
# NULL
identical(myDT, copy1)
# [1] TRUE
identical(myDT, copy2)
# [1] FALSE
And for some further validation
tracemem(myDT)
# [1] "<00000000159CBBB0>"
tracemem(copy1)
# [1] "<00000000159CBBB0>"
tracemem(copy2)
# [1] "<000000001A5A46D8>"
The most interesting conclusion here, one could claim, is that [.data.table does create a copy, even if the object remains unchanged.
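When identical() is FALSE but all.equal() with check.attributes = FALSE is TRUE, one way to locate the culprit is to compare the attribute names directly. Here is a base-R sketch that uses a plain data.frame with a hand-made "index" attribute as a stand-in for the data.table case:

```r
a <- data.frame(colA = 1:3)
b <- a
# Stand-in for the automatically created secondary key. This is an
# assumption for illustration; the real index shown above also carries
# a "__A" attribute.
attr(b, "index") <- integer(0)

same_object   <- identical(a, b)   # attributes count here
same_contents <- isTRUE(all.equal(a, b, check.attributes = FALSE))

# Attributes present on one object but not on the other.
extra <- setdiff(names(attributes(b)), names(attributes(a)))
print(extra)
```

The same setdiff() on names(attributes(...)) can be applied to myDT and copy3 to list the attributes that exist on only one of them.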
Here is the code:
mf = function(..., expr) {
  expr = substitute(expr)
  print(class(expr))
  print(str(expr))
  expr
}
mf(a = 1, b = 2, expr = {matrix(NA, 4, 4)})
Output:
[1] "{"
length 2 { matrix(NA, 4, 4) }
- attr(*, "srcref")=List of 2
..$ :Class 'srcref' atomic [1:8] 1 25 1 25 25 25 1 1
.. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fbcdbce3860>
..$ :Class 'srcref' atomic [1:8] 1 26 1 41 26 41 1 1
.. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fbcdbce3860>
- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fbcdbce3860>
- attr(*, "wholeSrcref")=Class 'srcref' atomic [1:8] 1 0 1 42 0 42 1 1
.. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fbcdbce3860>
NULL
{
matrix(NA, 4, 4)
}
Apparently the result of substitute(expr) produces something of the class "{". What is this class exactly? Why is {matrix(NA, 4, 4)} of length 2? What do these strange attrs mean?
The { is the class for a block of code. Just looking at the classes, note the difference between these
mf(a = 1, b = 2, expr = {matrix(NA, 4, 4)})
# [1] "{"
mf(a = 1, b = 2, expr = matrix(NA, 4, 4))
# [1] "call"
An object of class { can hold multiple statements. Its length() indicates how many statements are in the block, plus one for the start of the block. For example
length(quote({matrix(NA, 4, 4)}))
# [1] 2
length(quote({matrix(NA, 4, 4); matrix(NA,3,3)}))
# [1] 3
length(quote({}))
# [1] 1
The attributes "srcref" and "srcfile" are how R tracks where functions are defined for trying to give informative error messages. You can see the ?srcfile help page for more information about that.
What you're seeing is a reflection of the way R exposes its internal language structure through its own data structures.
The substitute() function returns the parse tree of an R expression. The parse tree is a tree of language elements. These can include literal values, symbols (basically variable names), function calls, and braced blocks. Here's a demonstration of all the R language elements as returned by substitute(), showing their types in all of R's type classification schemes:
tmc <- function(x) c(typeof(x),mode(x),class(x));
tmc(substitute(TRUE));
## [1] "logical" "logical" "logical"
tmc(substitute(4e5L));
## [1] "integer" "numeric" "integer"
tmc(substitute(4e5));
## [1] "double" "numeric" "numeric"
tmc(substitute(4e5i));
## [1] "complex" "complex" "complex"
tmc(substitute('a'));
## [1] "character" "character" "character"
tmc(substitute(somevar));
## [1] "symbol" "name" "name"
tmc(substitute(T));
## [1] "symbol" "name" "name"
tmc(substitute(sum(somevar)));
## [1] "language" "call" "call"
tmc(substitute(somevec[1]));
## [1] "language" "call" "call"
tmc(substitute(somelist[[1]]));
## [1] "language" "call" "call"
tmc(substitute(somelist$x));
## [1] "language" "call" "call"
tmc(substitute({blah}));
## [1] "language" "call" "{"
Notes:
Note how all three type classification schemes are very similar, but subtly different. This can be a source of confusion. typeof() gives the storage type of the object, sometimes called the "internal" type (to be honest, it probably shouldn't be called "internal" because it is frequently exposed very directly to the user at the R level, but it is often described that way; I would call it the "fundamental" or "underlying" type), mode() gives a similar classification scheme that everyone should probably ignore, and class() gives the implicit (if there's no class attribute) or explicit (if there is) class of the object, which is used for S3 method lookup (and, it should be said, is sometimes examined directly by R code, independent of the S3 lookup process).
Note how TRUE is a logical literal, but T is a symbol, just like any other variable name, and just happens to be assigned to TRUE by default (and ditto for F and FALSE). This is why sometimes people recommend against using T and F in favor of using TRUE and FALSE, because T and F can be reassigned (but personally I prefer to use T and F for the concision; no one should ever reassign those!).
The astute reader will notice that in my demonstration of literals, I've omitted the raw type. This is because there's no such thing as a raw literal in R. In fact, there are very few ways to get a hold of raw vectors in R; raw(), as.raw(), charToRaw(), and rawConnectionValue() are the only ways that I'm aware of, and if I used those functions in a substitute() call, they would be returned as "call" objects, just like in the sum(somevar) example, not literal raw values. The same can be said for the list type; there's no such thing as a list literal (although there are many ways to acquire a list via function calls). Plain raw vectors return 'raw' for all three type classifications, and plain lists return 'list' for all three type classifications.
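The second note, about T being a symbol rather than a literal, can be checked directly. A small sketch, with the reassignment wrapped in local() so the global T is left alone:

```r
# TRUE is parsed as a logical constant; T is parsed as an ordinary symbol.
is.logical(quote(TRUE))
is.symbol(quote(T))

# Because T is only a variable, it can be shadowed by an assignment.
shadowed <- local({
  T <- FALSE
  T
})
print(shadowed)
```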
Now, when you have a parse tree that is more complicated than a simple literal value or symbol (meaning it must be a function call or braced expression), you can generally examine the contents of that parse tree by coercing to list. This is how R exposes its internal language structure through its own data structures.
Diving into your example:
pt <- as.list(substitute({matrix(NA,4,4)}));
pt;
## [[1]]
## `{`
##
## [[2]]
## matrix(NA, 4, 4)
This makes it clear why length() returns 2: that's the length of the list that represents the parse tree. In general, the bracing of the expression is translated into the first list component, and the remaining list components are built from the semicolon-separated statements within the braces:
as.list(substitute({}));
## [[1]]
## `{`
##
as.list(substitute({a}));
## [[1]]
## `{`
##
## [[2]]
## a
##
as.list(substitute({a;b}));
## [[1]]
## `{`
##
## [[2]]
## a
##
## [[3]]
## b
##
as.list(substitute({a;b;c}));
## [[1]]
## `{`
##
## [[2]]
## a
##
## [[3]]
## b
##
## [[4]]
## c
Note that this is identical to how function calls work, except with the difference that, for function calls, the list components are formed from the comma-separated arguments to the function call:
as.list(substitute(sum()));
## [[1]]
## sum
##
as.list(substitute(sum(1)));
## [[1]]
## sum
##
## [[2]]
## [1] 1
##
as.list(substitute(sum(1,3)));
## [[1]]
## sum
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
as.list(substitute(sum(1,3,5)));
## [[1]]
## sum
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 5
From the above it becomes clear that the first list component is actually a symbol representing the name of a function, for both braced expressions and function calls. In other words, the open brace is a function call, one which simply returns its final argument. Just as square brackets are normal function calls with a convenient syntax built on top of them, the open brace is a normal function call with a convenient syntax built on top of it:
a <- 4:6;
a[2];
## [1] 5
`[`(a,2);
## [1] 5
{1;2};
## [1] 2
`{`(1,2);
## [1] 2
Returning to your example, we can fully explore the parse tree by traversing the list structure that represents the parse tree. I just wrote a nice little recursive function that can do this very easily:
unwrap <- function(x) if (typeof(x) == 'language') lapply(as.list(x),unwrap) else x;
unwrap(substitute(3));
## [1] 3
unwrap(substitute(a));
## a
unwrap(substitute(a+3));
## [[1]]
## `+`
##
## [[2]]
## a
##
## [[3]]
## [1] 3
##
unwrap(substitute({matrix(NA,4,4)}));
## [[1]]
## `{`
##
## [[2]]
## [[2]][[1]]
## matrix
##
## [[2]][[2]]
## [1] NA
##
## [[2]][[3]]
## [1] 4
##
## [[2]][[4]]
## [1] 4
As you can see, the braced expression turns into a normal function call of the function `{`(), taking one argument, which is the single statement you coded into it. That statement consists of a single function call to matrix(), taking three arguments, each of which is a literal value: NA, 4, and 4. And that's the entire parse tree.
So now we can understand the meaning of the "{" class on a deep level: it represents an element of a parse tree that is a function call to the `{`() function. It happens to be classed differently from other function calls ("{" instead of "call"), but as far as I can tell, that has no significance anywhere. Also observe that the typeof() and mode() are identical ("language" and "call", respectively) between all parse tree elements representing function calls, for both `{`() and others alike.
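The same point can be checked from the other direction by assembling a braced block by hand from a list, without typing any braces. This is a minimal sketch; blk and result are names made up for the example:

```r
# Assemble the call `{`(x <- 2, x * 3) from its parts.
blk <- as.call(list(as.name("{"), quote(x <- 2), quote(x * 3)))

class(blk)    # classed by its first element
length(blk)   # the `{` symbol plus two statements

# Evaluate in a scratch environment so that no global x is created.
result <- eval(blk, new.env())
print(result)
```

Evaluating the hand-built call runs both statements in order and returns the value of the last one, just as a literal braced block would.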