hash - Identical R Dataframes, different hashes (not an attribute problem)

I have two dataframes of ~150 rows of X and Y where identical(X, Y) is TRUE but identical(digest(X), digest(Y)) is FALSE. I'm looking into why this is the case.
I did look at this answer and re-ran what they tested, with similar results, but unlike their problem, the attributes for my dataframes are the same. Testing results:
> names(attributes(X))
[1] "names" "row.names" "class"
> names(attributes(Y))
[1] "names" "row.names" "class"
> digest(X)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> digest(X[1:nrow(X),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> digest(Y[1:nrow(Y),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> identical(X, Y, attrib.as.set=FALSE)
[1] TRUE
I also saved the dataframes as .RDS files, and re-read them in.
> X_rds <- read_rds("cache_vars/X.rds")
> Y_rds <- read_rds("cache_vars/Y.rds")
> identical(X_rds , Y_rds )
[1] TRUE
> digest(X_rds)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y_rds )
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> identical(X_rds , Y_rds , attrib.as.set=FALSE)
[1] TRUE
And like the other poster, converting to matrices and back to dataframe yielded identical digests, so it's probably some structural problem.
> X_Mat <- as.matrix(X_rds)
> Y_Mat <- as.matrix(Y_rds)
> identical(digest(X_Mat), digest(Y_Mat))
[1] TRUE
> X_DF <- as.data.frame(X_Mat)
> Y_DF <- as.data.frame(Y_Mat)
> identical(digest(X_DF ), digest(Y_DF))
[1] TRUE
Dataframe X was produced from a parallel-designed loop (but with the %do% flag so no actual parallelism was done) and Y was produced from a sequential loop.
The .RDS files for X and Y can be found at this link.
Update:
MrFlick has it right. As it turns out, the serialization during parallel's rbind function was also adding the gp=0x20 flag, similar to what they described occurs when writing to RDS.

When you write to RDS, the objects are serialized. The serialization contains some information in addition to the values the vectors contain. Note that if we compare the columns one by one, they produce different digests:
sapply(seq_along(X_rds), function(i)
digest::digest(X_rds[[i]])==digest::digest(Y_rds[[i]])
)
So the vectors that are being stored in the data.frame are different. We can use the internal inspect function to get some of the meta-data for the vectors
.Internal(inspect(X_rds[[1]]))
# #135305a00 14 REALSXP g0c7 [REF(4),gp=0x20] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
.Internal(inspect(Y_rds[[1]]))
# #115dbfc00 14 REALSXP g0c7 [REF(29)] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
So we see they differ in the [] parts. I believe the REF() number represents the reference count of that object for memory clearing purposes. I do not believe that this number is used in the serialization. But X_rds also has gp=0x20 set. The "gp" stands for "general purpose" bits/flags. I believe in this case it means the GROWABLE_MASK was set on that object. These flags are preserved when the object is serialized, and serializing is the default behavior of digest. Thus these vectors do not have exactly the same serialization, because of this flag difference.
Another way to see the difference is to look at the serialization itself:
substring(rawToChar(serialize(X_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n131086\n150\n1009002\n"
substring(rawToChar(serialize(Y_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n14\n150\n1009002\n1009"
We have a bit of a header, then we start to see the values being output. There is one value that differs: X has 131086 (0x2000E) where Y has 14 (0xE). That difference comes from the flags, which are written here in the R source code.
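As a rough sketch of how that header value breaks down: assuming the layout used by R's serialization code, where the low 8 bits of the flags word hold the type and the gp field is stored from bit 12 upward, the two numbers can be decoded as follows. The helper name decode_flags is made up for this illustration.

```r
# Hypothetical helper: split a serialized flags word into its type and gp parts.
# Assumes type = low 8 bits and gp = bits 12 and up, as in R's serialize code.
decode_flags <- function(flags) {
  list(
    type = bitwAnd(flags, 0xFFL),   # 14 is REALSXP (double vector)
    gp   = bitwShiftR(flags, 12L)   # general purpose bits
  )
}

x_flags <- decode_flags(131086L)  # value seen for X_rds[[1]]
y_flags <- decode_flags(14L)      # value seen for Y_rds[[1]]

x_flags$type                 # 14
sprintf("0x%X", x_flags$gp)  # "0x20", matching gp=0x20 from inspect()
y_flags$gp                   # 0
```

Both vectors have the same type; only the gp part differs, which is consistent with the inspect() output above.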
When you use identical, only the values in the data.frame are compared, not the additional metadata.
If you wanted to get around this, you could write your own wrapper around digest that avoids the serialization. For example
dfdigest <- function(x) {
  charsToRaw <- function(x) unlist(lapply(x, charToRaw))
  bytes <- unlist(c(list(charsToRaw(names(x))),
                    lapply(x, function(col) {
                      if (typeof(col)=="double") writeBin(col, raw())
                      else if (typeof(col)=="character") charsToRaw(col)
                      else stop(paste("unconfigured data type:", typeof(col)))
                    })))
  digest::digest(bytes, serialize = FALSE)
}
dfdigest(X_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
dfdigest(Y_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"

Related

Why do two references to the same vector return different memory addresses for each element of the vector?

I'm learning R and currently I'm reading this book. To make sure I understand the concept, I ran the following test which turned out to be quite confusing for me and I'd appreciate if you could clarify it. Here is the test, which I ran directly in the R shell from the terminal (not using RStudio or Emacs ESS).
> library(lobstr)
>
> x <- c(1500,2400,8800)
> y <- x
> ### So the following two lines must return the same memory address
> obj_addr(x)
[1] "0xb23bc50"
> obj_addr(y)
[1] "0xb23bc50"
> ### So as I expected, indeed both x and y point to the same memory
> ### location: 0xb23bc50
>
>
>
> ### Now let's check that each element can be referenced by the same
> ### memory address either by using x or y
> x[1]
[1] 1500
> y[1]
[1] 1500
> obj_addr(x[1])
[1] "0xc194858"
> obj_addr(y[1])
[1] "0xc17db88"
> ### And here is exactly what I don't understand: x and y point
> ### to the same memory address, so the same must be true for
> ### x[1] and y[1]. So how come I obtain two different memory
> ### addresses for the same element of the same vector?
>
>
>
> x[2]
[1] 2400
> y[2]
[1] 2400
> obj_addr(x[2])
[1] "0xc15eca0"
> obj_addr(y[2])
[1] "0xc145d30"
> ### Same problem!
>
>
>
> x[3]
[1] 8800
> y[3]
[1] 8800
> obj_addr(x[3])
[1] "0xc10e9b0"
> obj_addr(y[3])
[1] "0xc0f78e8"
> ### Again the same problem: different memory addresses
Could you tell me where my mistake is and what I've misunderstood in this problem?

Any R object is a C (pointer -called SEXP- to a) "multi-object" (struct). This includes information (that R needs to operate, e.g. length, number of references -to know when to copy an object- and more) about the R object and, also, the actual data of the R object that we have access to.
lobstr::obj_addr, presumably, returns the memory address that a SEXP points to. That part of the memory contains both the information about and the data of the R object. From within the R environment we can't/don't need to access the (pointer to the) memory of the actual data in each R object.
As Adam notes in his answer, the function [ copies the nth element of the data contained in the C object to a new C object and returns its SEXP pointer to R. Each time [ is called, a new C object is created and returned to R.
We can't access the memory address of each element of the actual data of our object through R. But playing a bit around, we can trace the respective addresses using the C api:
A function to get the addresses:
ff = inline::cfunction(sig = c(x = "integer"), body = '
Rprintf("SEXP # %p\\n", x);
Rprintf("first element of SEXP actual data # %p\\n", INTEGER(x));
for(int i = 0; i < LENGTH(x); i++)
Rprintf("<%d> # %p\\n", INTEGER(x)[i], INTEGER(x) + i);
return(R_NilValue);
')
And applying to our data:
x = c(1500L, 2400L, 8800L) #converted to "integer" for convenience
y = x
lobstr::obj_addr(x)
#[1] "0x1d1c0598"
lobstr::obj_addr(y)
#[1] "0x1d1c0598"
ff(x)
#SEXP # 0x1d1c0598
#first element of SEXP actual data # 0x1d1c05c8
#<1500> # 0x1d1c05c8
#<2400> # 0x1d1c05cc
#<8800> # 0x1d1c05d0
#NULL
ff(y)
#SEXP # 0x1d1c0598
#first element of SEXP actual data # 0x1d1c05c8
#<1500> # 0x1d1c05c8
#<2400> # 0x1d1c05cc
#<8800> # 0x1d1c05d0
#NULL
The successive memory difference between our object's data elements equals the size of int type:
diff(c(strtoi("0x1d1c05c8", 16),
strtoi("0x1d1c05cc", 16),
strtoi("0x1d1c05d0", 16)))
#[1] 4 4
Using the [ function:
ff(x[1])
#SEXP # 0x22998358
#first element of SEXP actual data # 0x22998388
#<1500> # 0x22998388
#NULL
ff(x[1])
#SEXP # 0x22998438
#first element of SEXP actual data # 0x22998468
#<1500> # 0x22998468
#NULL
This answer may be more extensive than needed and it simplifies the actual technicalities, but hopefully it offers a clearer "big" picture.

This is one way to look at it. I am sure there is a more technical view. Remember that in R, nearly everything is a function. This includes the extract function, [. Here is an equivalent statement to x[1]:
> `[`(x, 1)
[1] 1500
So what you are doing is running a function which returns a value (check out ?Extract). That value is a new length-one vector. When you run obj_addr(x[1]), R evaluates the call x[1] and then gives you the obj_addr() of the object that call returned, not the address of the first element of the vector that you bound to both x and y.
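A minimal sketch of that point, assuming the lobstr package from the question is installed. The names a, b, same_vector and same_element are only for illustration.

```r
library(lobstr)

x <- c(1500, 2400, 8800)
y <- x

# x and y are two names bound to the same object, so the addresses match.
same_vector <- obj_addr(x) == obj_addr(y)

# Each call to `[` returns a new length-one vector. Keeping both results
# alive at the same time means they must live at different addresses.
a <- x[1]
b <- y[1]
same_element <- obj_addr(a) == obj_addr(b)

same_vector   # TRUE
same_element  # FALSE
```

The values of a and b are equal, but they are two separate objects created by two separate calls.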

Assign value to indices of nested lists stored as strings in R

I have a dataframe of nested list indices, which have been stored as strings. To give a simplified example:
df1 <- data.frame(x = c("lst$x$y$a", "lst$x$y$b"), stringsAsFactors = F)
These are then coordinates for the following list:
lst <- list(x=list(y=list(a="foo",b="bar",c="")))
I'd like to replace values or assign new values to these elements using the indices in df1.
One attempt was
do.call(`<-`, list(eval(parse(text = df1[1,1])), "somethingelse"))
but this doesn't seem to work. Instead it assigns "somethingelse" to a new variable named foo.
I'm not too happy with using eval(parse(text=)) (maintaining code will become a nightmare), but recognise I may have little choice.
Any tips welcome.

Let's consider 3 situations:
Case 1
do.call(`<-`, list("lst$x$y$a", "somethingelse"))
This will create a new variable named lst$x$y$a in your workspace, so the following two commands will call different objects. (The former is the object you store in lst, and the latter is the new variable. You need to call it with backticks because its name will confuse R.)
> lst$x$y$a # [1] "foo"
> `lst$x$y$a` # [1] "somethingelse"
Case 2
do.call(`<-`, list(parse(text = "lst$x$y$a"), "somethingelse"))
You mostly get what you expect with this one but an error still occurs:
invalid (do_set) left-hand side to assignment
Let's check:
> parse(text = "lst$x$y$a") # expression(lst$x$y$a)
It belongs to the class expression, and the operator <- does not seem to accept this class on the left-hand side.
Case 3
This one will achieve what you want:
do.call(`<-`, list(parse(text = "lst$x$y$a")[[1]], "somethingelse"))
If you put [[1]] after an expression object, the call object inside it is extracted, and the operator <- accepts that on its left-hand side.
> lst
# $x
# $x$y
# $x$y$a
# [1] "somethingelse"
#
# $x$y$b
# [1] "bar"
#
# $x$y$c
# [1] ""
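Since the question mentions wanting to avoid eval(parse(text=)), here is one possible alternative sketch. It assumes every stored index has the form name$key$key..., and it relies on `[[<-` accepting a character vector for recursive indexing into nested lists. The helper assign_path is hypothetical.

```r
lst <- list(x = list(y = list(a = "foo", b = "bar", c = "")))
paths <- c("lst$x$y$a", "lst$x$y$b")

# Hypothetical helper: turn "lst$x$y$a" into c("x", "y", "a") by splitting
# on "$" and dropping the leading list name, then assign recursively.
assign_path <- function(lst, path, value) {
  keys <- strsplit(path, "$", fixed = TRUE)[[1]][-1]
  lst[[keys]] <- value
  lst
}

lst <- assign_path(lst, paths[1], "somethingelse")
lst$x$y$a  # "somethingelse"
lst$x$y$b  # "bar"
```

This only handles plain `$` paths into an existing nested list; indices containing brackets or function calls would still need parsing.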

How can one make visible the difference in the outputs of quote() and substitute()?

As applied to the same R code or objects, quote and substitute typically return different objects. How can one make this difference apparent?
is.identical <- function(X){
out <- identical(quote(X), substitute(X))
out
}
> tmc <- function(X){
out <- list(typ = typeof(X), mod = mode(X), cls = class(X))
out
}
> df1 <- data.frame(a = 1, b = 2)
Here the printed output of quote and substitute are the same.
> quote(df1)
df1
> substitute(df1)
df1
And the structure of the two are the same.
> str(quote(df1))
symbol df1
> str(substitute(df1))
symbol df1
And the type, mode and class are all the same.
> tmc(quote(df1))
$typ
[1] "symbol"
$mod
[1] "name"
$cls
[1] "name"
> tmc(substitute(df1))
$typ
[1] "symbol"
$mod
[1] "name"
$cls
[1] "name"
And yet, the outputs are not the same.
> is.identical(df1)
[1] FALSE
Note that this question shows some inputs that cause the two functions to display different outputs. However, the outputs are different even when they appear the same, and are the same by most of the usual tests, as shown by the output of is.identical() above. What is this invisible difference, and how can I make it appear?
note on the tags: I am guessing that the Common LISP quote and the R quote are similar

The reason is that the behavior of substitute() is different based on where you call it, or more precisely, what you are calling it on.
Understanding what will happen requires a very careful parsing of the (subtle) documentation for substitute(), specifically:
Substitution takes place by examining each component of the parse tree
as follows: If it is not a bound symbol in env, it is unchanged. If it
is a promise object, i.e., a formal argument to a function or
explicitly created using delayedAssign(), the expression slot of the
promise replaces the symbol. If it is an ordinary variable, its value
is substituted, unless env is .GlobalEnv in which case the symbol is
left unchanged.
So there are essentially three options.
In this case:
> df1 <- data.frame(a = 1, b = 2)
> identical(quote(df1),substitute(df1))
[1] TRUE
df1 is an "ordinary variable", but it is called in .GlobalEnv, since the env argument defaults to the current evaluation environment. Hence we're in the very last case, where the symbol, df1, is left unchanged, and so it is identical to the result of quote(df1).
In the context of the function:
is.identical <- function(X){
out <- identical(quote(X), substitute(X))
out
}
The important distinction is that now we're calling these functions on X, not df1. For most R users, this is a silly, trivial distinction, but when playing with subtle tools like substitute it becomes important. X is a formal argument of a function, so that implies we're in a different case of the documented behavior.
Specifically, it says that now "the expression slot of the promise replaces the symbol". We can see what this means if we debug() the function and examine the objects in the context of the function environment:
> debugonce(is.identical)
> is.identical(X = df1)
debugging in: is.identical(X = df1)
debug at #1: {
out <- identical(quote(X), substitute(X))
out
}
Browse[2]>
debug at #2: out <- identical(quote(X), substitute(X))
Browse[2]> str(quote(X))
symbol X
Browse[2]> str(substitute(X))
symbol df1
Browse[2]> Q
Now we can see that what happened is precisely what the documentation said would happen (Ha! So obvious! ;) )
X is a formal argument, or a promise, which according to R is not the same thing as df1. For most people writing functions, they are effectively the same, but the internal implementation disagrees. X is a promise object, and substitute replaces the symbol X with the one that it "points to", namely df1. This is what the docs mean by the "expression slot of the promise"; it is what R sees in the X = df1 part of the function call.
To round things out, try to guess what will happen in this case:
is.identical <- function(X){
out <- identical(quote(A), substitute(A))
out
}
is.identical(X = df1)
(Hint: now A is not a "bound symbol in the environment".)
A final example illustrating more directly the final case in the docs with the confusing exception:
#Ordinary variable, but in .GlobalEnv
> a <- 2
> substitute(a)
a
#Ordinary variable, but NOT in .GlobalEnv
> e <- new.env()
> e$a <- 2
> substitute(a,env = e)
[1] 2
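One way to make the difference from the question visible is to deparse both results inside the function, so the symbols are turned into text before they are compared. This is only a sketch; show_both is a made-up helper name.

```r
df1 <- data.frame(a = 1, b = 2)

# Hypothetical helper: return both results as text so they can be inspected.
show_both <- function(X) {
  c(quote = deparse(quote(X)), substitute = deparse(substitute(X)))
}

res <- show_both(df1)
res[["quote"]]       # "X"
res[["substitute"]]  # "df1"
```

Inside the function, quote(X) is the symbol X while substitute(X) is the symbol df1, which is why identical() returns FALSE there.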

R: serialize base64 encode/decode of text not exactly matching

In my previous question about using serialize() to create a CSV of objects I got a great answer from jmoy where he recommended base64 encoding of my serialized text. That was exactly what I was looking for. Oddly enough, when I try to put this into practice I get results that look right but don't exactly match what I ran through the serialize/encoding process.
The example below takes a list with 3 vectors and serializes each vector. Then each vector is base64 encoded and written to a text file along with a key. The key is simply the index number of the vector. I then reverse the process and read each line back from the csv. At the very end you can see some items don't exactly match. Is this a floating point issue? Something else?
require(caTools)
randList <- NULL
set.seed(2)
randList[[1]] <- rnorm(100)
randList[[2]] <- rnorm(200)
randList[[3]] <- rnorm(300)
#delete file contents
fileName <- "/tmp/tmp.txt"
cat("", file=fileName, append=F)
i <- 1
for (item in randList) {
myLine <- paste(i, ",", base64encode(serialize(item, NULL, ascii=T)), "\n", sep="")
cat(myLine, file=fileName, append=T)
i <- i+1
}
linesIn <- readLines(fileName, n=-1)
parsedThing <- NULL
i <- 1
for (line in linesIn){
parsedThing[[i]] <- unserialize(base64decode(strsplit(linesIn[[i]], split=",")[[1]][[2]], "raw"))
i <- i+1
}
#floating point issue?
identical(randList, parsedThing)
for (i in 1:length(randList[[1]])) {
print(randList[[1]][[i]] == parsedThing[[1]][[i]])
}
i<-3
randList[[1]][[i]] == parsedThing[[1]][[i]]
randList[[1]][[i]]
parsedThing[[1]][[i]]
Here's the abridged output:
> #floating point issue?
> identical(randList, parsedThing)
[1] FALSE
>
> for (i in 1:length(randList[[1]])) {
+ print(randList[[1]][[i]] == parsedThing[[1]][[i]])
+ }
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
...
>
> i<-3
> randList[[1]][[i]] == parsedThing[[1]][[i]]
[1] FALSE
>
> randList[[1]][[i]]
[1] 1.587845
> parsedThing[[1]][[i]]
[1] 1.587845
>

ascii=T in your call to serialize makes R do imprecise binary-decimal-binary conversions when serializing and unserializing, which causes the values to differ. If you remove ascii=T you get exactly the same numbers back, because a binary representation is written out instead.
base64encode can encode raw vectors so it doesn't need ascii=T.
The binary representation used by serialize is architecture independent, so you can happily serialize on one machine and unserialize on another.
Reference: http://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats
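A quick check of the binary round trip, leaving the base64 step out so that it runs in base R alone. The variable names are illustrative.

```r
set.seed(2)
x <- rnorm(100)

# Default serialize() output is a binary raw vector (no ascii = TRUE).
raw_bytes <- serialize(x, NULL)
x_back <- unserialize(raw_bytes)

is.raw(raw_bytes)     # TRUE
identical(x, x_back)  # TRUE: every double comes back bit for bit
```

Because raw_bytes is a raw vector, it can be passed to base64encode directly, as noted above.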

JD: I ran your code snippet on my Linux box, then looked at the differences computed by randList[[1]][[i]] - parsedThing[[1]][[i]].
Yes, the values are different, but only at the level of my machine's floating-point tolerance. A typical difference was -4.440892e-16, which is pretty tiny. Some differences were zero.
It does not surprise me that the save/restore introduced that (tiny) level of change. Any significant data conversion runs the risk of "bobbling" the least significant digit.

Ok, now that you show the output I can explain to you what you're doing (following Paul's lead here).
As that is a known issue (see e.g. this R FAQ entry), you should buckle up and use any one of
identical()
all.equal()
functions from the RUnit package such as checkEquals
In sum, there seems nothing wrong with the base64 encoding you are using. You simply employed the wrong definition of exactly. But hey, we're economists, and anything below a trillion or two is rounding error anyway...
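For illustration, here is how an exact comparison and a tolerance-based comparison differ on a classic floating-point example. The values are not taken from the question's data.

```r
a <- 0.1 + 0.2
b <- 0.3

exact_equal <- a == b                   # exact comparison of the two doubles
close_equal <- isTRUE(all.equal(a, b))  # comparison within a small tolerance

exact_equal  # FALSE
close_equal  # TRUE
```

all.equal() returns either TRUE or a description of the difference, so wrapping it in isTRUE() is the usual way to use it in a condition.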

R-thonic replacement for simple for loops containing a condition

I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:
probe2gene <- list()
for (i in 1:length(probes)){
  strings <- strsplit(descriptions[i], '//')
  if (length(strings[[1]]) > 1){
    probe2gene[probes[i]] = strings[[1]][2]
  }
}
Which works fine, but seems slow, much slower than the roughly equivalent python:
probe2gene = {}
for p, d in zip(probes, descriptions):
    try:
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass
My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?
Edit: a typical good "description" looks like this:
"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"
A bad "description" looks like this:
"-----"
though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probe and description vectors are the same length, and completely correspond to each other, i.e. probe[i] maps to description[i].

It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.
> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"
Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
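A variation on the same idea, sketched with a named character vector instead of a list. The sample data is invented, and the trimws() call is an addition of this sketch, to strip the spaces around "//" seen in the question's example description.

```r
# Invented sample data in the shape described in the question.
descriptions <- c("NM_009826 // Rb1cc1 // RB1-inducible", "-----", "foo // bar")
probes <- c(101, 102, 103)

# Split every description at once and take the second piece (NA if missing).
parts  <- strsplit(descriptions, "//", fixed = TRUE)
second <- vapply(parts, function(p) trimws(p[2]), character(1))

# Keep only the probes that actually had a second piece; names become strings.
keep <- !is.na(second)
probe2gene <- setNames(second[keep], probes[keep])

probe2gene[["101"]]  # "Rb1cc1"
names(probe2gene)    # "101" "103"
```

As with the list version, the keys are strings, so lookups use probe2gene[["101"]] rather than the number 101.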

If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?
Probe and Description are the same length?
This is kind of messy but a quick first pass at it?
a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))
names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only
That's my first attempt. If you have a sample dataset that would be very useful.
Best regards,
Jay

Another way.
probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA NA "hair"
In the sapply, we take advantage of the fact that in R the subsetting operator is also a function named '[' to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives a NA value. On the left hand of the same line, we use the fact that we can pass a vector of indices in any order and with gaps.

Here's another approach that should be fast. Note that this doesn't
remove the empty descriptions. It could be adapted to do that or you
could clean those in a post processing step using lapply. Is it the
case that you'll never have a valid description of length one?
make_desc <- function(n)
{
word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
if (runif(1) < 0.70)
paste(sapply(seq_len(n), word), collapse = "//")
else
"----"
}
description <- sapply(seq_len(10), make_desc)
probes <- seq_len(length(description))
desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)
> description
[1] "fmbec"
[2] "----"
[3] "----"
[4] "frrii//yjxsa//wvkce//xbpkc"
[5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"
[6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"
[7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"
[8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"
[9] "----"
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"
> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
