Identical data frames with different digests in R?

I have two large data frames, a and b for which identical(a,b) is TRUE, as is all.equal(a,b), but identical(digest(a),digest(b)) is FALSE. What could cause this?
What's more, I tried to dig in deeper, by applying digest to bunches of rows. Incredibly, at least to me, there is agreement in the digest values on sub-frames all the way to the last row of the data frames.
Here is a sequence of comparisons:
> identical(a, b)
[1] TRUE
> all.equal(a, b)
[1] TRUE
> digest(a)
[1] "cac56b06078733b6fb520442e5482684"
> digest(b)
[1] "fdd5ab78ca961982d195f800e3cf60af"
> digest(a[1:nrow(a),])
[1] "e44f906723405756509a6b17b5949d1a"
> digest(b[1:nrow(b),])
[1] "e44f906723405756509a6b17b5949d1a"
Every method I can think of indicates these two objects are identical, but their digest values are different. Is there something else about data frames that can produce such discrepancies?
For further details: the objects are about 10M rows x 12 columns. Here's the output of str():
'data.frame': 10056987 obs. of 12 variables:
$ V1 : num 1 11 21 31 41 61 71 81 91 101 ...
$ V2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V3 : num 2 3 2 3 4 5 2 4 2 4 ...
$ V4 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V5 : num 1.8 2.29 1.94 2.81 3.06 ...
$ V6 : num 0.0653 0.0476 0.0324 0.034 0.0257 ...
$ V7 : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
$ V8 : num 0.00653 0.00476 0.00324 0.0034 0.00257 ...
$ V9 : num 1.8 2.3 1.94 2.81 3.06 ...
$ V10: num 0.1957 0.7021 0.0604 0.1866 0.9371 ...
$ V11: num 1704 1554 1409 1059 1003 ...
$ V12: num 23309 23309 23309 23309 23309 ...
> print(object.size(a), units = "Mb")
920.7 Mb
Update 1: On a whim, I converted these to matrices. The digests are the same.
> aM = as.matrix(a)
> bM= as.matrix(b)
> identical(aM,bM)
[1] TRUE
> digest(aM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
> digest(bM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
I then tried converting back to a data frame, and the digest values are equal (and equal to the previous value for a).
> aMF = as.data.frame(aM)
> bMF = as.data.frame(bM)
> digest(aMF)
[1] "cac56b06078733b6fb520442e5482684"
> digest(bMF)
[1] "cac56b06078733b6fb520442e5482684"
So, b looks like the bad boy, and it has a colorful past. b came from a much bigger data frame, say B. I took only the columns of B that appeared in a and checked to see if they were equal. Well, they were equal, but had different digests. I converted the column names (from "InformativeColumnName1" to "V1", etc.), just to avoid any issues that might arise - though all.equal and identical tend to point out when column names differ.
Since I am working on two different programs and don't have simultaneous access to a and b, it is easiest for me to use the digest values to check the calculations. However, something seems to be odd in how I extract columns from a data frame and then apply digest() to it.
ANSWER:
It turns out, to my astonishment (dismay, horror, embarrassment, you name it), identical is very forgiving about the order of attributes. I had assumed that only all.equal was forgiving in that way.
This was discovered via Tommy's suggestion of identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) directly is a bad, bad idea: the deluge of row names took a while before Ctrl-C could interrupt it. Here is the output of names(attributes(.)) instead:
> names(attributes(a))
[1] "names" "row.names" "class"
> names(attributes(b))
[1] "names" "class" "row.names"
They're in different orders! Kudos to digest() for being straight with me.
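For anyone who wants to reproduce this without a 10M-row object, here is a minimal sketch with toy data: copy a small data frame and put the copy's attributes into a different (here, alphabetical) order, the same trick used in the UPDATE below.
library(digest)
a_small <- data.frame(V1 = 1:3, V2 = 4:6)
b_small <- a_small
att <- attributes(b_small)                      # typically: names, row.names, class
attributes(b_small) <- att[sort(names(att))]    # now: class, names, row.names
identical(a_small, b_small)                          # TRUE
all.equal(a_small, b_small)                          # TRUE
identical(a_small, b_small, attrib.as.set = FALSE)   # FALSE
digest(a_small) == digest(b_small)                   # FALSE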
UPDATE
To aid others with this problem: it seems that simply putting the attributes into a consistent order (sorted by name below) is adequate to get identical hash values. Since tinkering with attribute order is new to me, this may break something, but it works in my case. Note that it is a little time consuming if the objects are big; I'm not aware of a faster method for doing this. (I'm also looking to move to matrices or data tables instead of data frames, and this may be another incentive to avoid data frames.)
tmpA0 = attributes(a)
tmpA1 = tmpA0[sort(names(tmpA0))]
a2 = a
attributes(a2) = tmpA1
tmpB0 = attributes(b)
tmpB1 = tmpB0[sort(names(tmpB0))]
b2 = b
attributes(b2) = tmpB1
digest(a2) # e04e624692d82353479efbd713ec03f6
digest(b2) # e04e624692d82353479efbd713ec03f6
identical(b,b2, attrib.as.set = FALSE) # FALSE
identical(b,b2, attrib.as.set = TRUE) # TRUE
identical(a2,b2, attrib.as.set = FALSE) # TRUE

Without having the actual data.frames it is of course hard to know, but one difference could be the order of the attributes. identical ignores that by default, but setting attrib.as.set=FALSE can change that:
d1 <- structure(1, foo=1, bar=2)
d2 <- structure(1, bar=2, foo=1)
identical(d1, d2) # TRUE
identical(d1, d2, attrib.as.set=FALSE) # FALSE

Our digest package uses the internal R function serialize() to get what we feed to the hash-generating functions (md5, sha1, ...).
So I strongly suspect that something like an attribute differs. Until you can construct a reproducible example that does not depend on your 1e7 x 12 data set, there is little we can do.
Also, the digest() function can output intermediate results and (as of the recent 0.5.1 version) even raw vectors.  That may help.
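As a rough sketch, you could also compare the serialize() output of the two objects directly (this is the byte stream digest() hashes) and look for where they diverge:
sa <- serialize(a, connection = NULL)   # raw vector that digest() would hash
sb <- serialize(b, connection = NULL)
length(sa) == length(sb)                # can be TRUE even when the bytes differ
which(sa != sb)[1]                      # position of the first differing byte, if any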
Lastly, you can always contact us (as the package maintainers / authors) off-line, which happens to be the recommended way within R land, the popularity of StackOverflow notwithstanding.

Related

Remove outlier from five-number summary statistics

How can I force the fivenum function not to use outliers as my maximum/minimum values?
I want to be able to see the upper and lower whisker numbers on my boxplot.
My code:
boxplot(data$`Weight(g)`)
text(y=fivenum(data$`Weight(g)`),labels=fivenum(data$`Weight(g)`),x=1.25, title(main = "Weight(g)"))
boxplot returns a named list that includes things you can use to remove outliers in your call to fivenum:
$out contains the literal outliers. It can be tempting to use setdiff() between data$`Weight(g)` and $out, but that may be prone to problems from R FAQ 7.31 (floating-point equality), so I recommend against it; instead,
$stats contains the numbers used to draw the boxplot itself, without the outliers. I suggest we work with this.
(BTW, title(.) does its work via side effect and is not used by text(.); I suggest you move that call out of text().)
Reproducible data/code:
vec <- c(1, 10:20, 30)
bp <- boxplot(vec)
str(bp)
# List of 6
# $ stats: num [1:5, 1] 10 12 15 18 20
# $ n : num 13
# $ conf : num [1:2, 1] 12.4 17.6
# $ out : num [1:2] 1 30
# $ group: num [1:2] 1 1
# $ names: chr "1"
five <- fivenum(vec[ vec >= min(bp$stats) & vec <= max(bp$stats)])
text(x=1.25, y=five, labels=five)
title("Weight(g)")

simulate observations and calculate sample autocorrelation

Simulating r_k (r stands for the sample autocorrelation) for {e_t}, where each e_t is iid N(0,1).
R code: simulate 100 observations of {e_t} and calculate r_1.
Here is my code so far:
x=rnorm(100,0,1)
x
y=ts(x)
trial_r1=acf(y)[1]
trial_r1
Is my code right? How do I get r_1 after running acf()?
(I'll post this as an answer, both to close the question and to help with searches for answers to similarly structured questions.)
When looking for what you believe is but one part of a structured return, it's useful to look at the return value in detail. One common way to do this is with str:
set.seed(42)
x <- rnorm(100, mean = 0, sd = 1)
ret <- acf(ts(x))
str(ret)
## List of 6
## $ acf : num [1:21, 1, 1] 1 0.05592 -0.00452 0.03542 0.00278 ...
## $ type : chr "correlation"
## $ n.used: int 100
## $ lag : num [1:21, 1, 1] 0 1 2 3 4 5 6 7 8 9 ...
## $ series: chr "ts(x)"
## $ snames: NULL
## - attr(*, "class")= chr "acf"
In this instance, you'll see two clusters of numbers, in $acf and $lag. The latter is "clearly" just an array of incrementing integers and so not that interesting here, but the former looks more promising. Seeing that the result is ultimately just a list, you can use dollar-sign subsetting (or [[, as you prefer) to extract what you need:
ret$acf
## , , 1
## [,1]
## [1,] 1.000000e+00
## [2,] 5.592310e-02
## [3,] -4.524017e-03
## [4,] 3.541639e-02
## [5,] 2.784590e-03
## ...snip...
In the case of your question, note that the first element of this 3-dimensional array is the lag-0 autocorrelation (exactly 1 by definition), but your first real autocorrelation of interest is the second element, 0.0559. So r_1 is attainable with ret$acf[2,,] (or, more formally, ret$acf[2,1,1]).
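Putting that together, a compact way to get r_1 for your simulated series (continuing the seeded example above, and suppressing the plot):
r1 <- acf(x, plot = FALSE)$acf[2]   # lag-1 sample autocorrelation
r1                                  # ~0.0559 with set.seed(42) as above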

Confusion about unpacking lists in R

I'm befuddled by how R is dealing with lists and data frames. For example:
agg = function() {
  df1 = data.frame(a = 1:5, b = 1:5)
  df2 = data.frame(a = 11:15, b = 11:15)
  return(list(df1, df2))
}
res = agg()
res[1]$a     # returns NULL
res[[1]]$a   # returns 1:5
I don't understand why the first element of res is not a data frame; rather, I need double-referencing to get at the elements. I read Hadley Wickham's excellent Data Structures chapter on his Advanced R website, but I still can't figure out what's up with this example. Can anyone explain what I'm missing?
Single square brackets [] on a list return a sub-list (still wrapped in a list); double square brackets [[]] extract a single element from it. Since res is a list, res[1] is a one-element list rather than the data frame itself, so res[1]$a is NULL:
is.list(res)
# [1] TRUE
str(res)
# List of 2
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 1 2 3 4 5
# ..$ b: int [1:5] 1 2 3 4 5
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 11 12 13 14 15
# ..$ b: int [1:5] 11 12 13 14 15
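To make the difference concrete on this object (a short sketch using the res defined above):
res[1]$a      # NULL: res[1] is still a list (of length one), which has no $a
res[[1]]$a    # 1 2 3 4 5: res[[1]] is the data frame itself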
See ?"[", ?vector, and ?list for more information. The following SO posts might also help:
What are the differences between R vector and R list data types
Learning R for someone used to MATLAB, and confusion with R data types
how to understand list(list(object)) in r?
Subsetting with single brackets keeps the list wrapper, so res[1] returns a list of length one.
You are looking for the first component of the list, which is what res[[1]] gives you. Thus res[[1]]$a works.
E.g., take a look at the following
res[[1]]$a
res[1][[1]]$a
res[1][1][[1]]$a
res[1][1][1][[1]]$a
They all return column a of the first component of the list: repeatedly taking [1] of a length-one list just gives back the same one-element list, so only the final [[1]] actually extracts the data frame.
Hope that makes sense.

mapply basics? - how to create a matrix from two vectors and a function

I am trying to create a data.frame from which to create a graph. I have a function and two vectors that I want to use as the two inputs. This is a bit simplified, but basically all I have is:
relGPA <- seq(-1.5,1.5,.2)
avgGPA <- c(-2,0,2)
f <- function(relGPA, avgGPA) 1/(1+exp(sum(relGPA*pred.model$coef[1],avgGPA*pred.model$coef[2])))
and all I want is a data.frame with 3 columns for the avgGPA values, and 16 rows for the relGPA values with the resulting values in the cells.
I apologize for how basic this is, but I assure you I have tried to make this happen without your assistance. I have tried following the examples on the sapply and mapply man pages, but I'm just a little too new to R to see how to do what I'm trying to do.
Thanks!
Cannot be tested with the information offered, but this should work:
expGPA <- outer(relGPA, avgGPA, FUN=f) # See below for way to make this "work"
Another useful function when you want to generate combinations is expand.grid and this would get you the "long form":
expGPA2 <- expand.grid(relGPA, avgGPA)
expGPA2$fn <- apply(expGPA2, 1, f)
The long form is what lattice and ggplot will expect as input format for higher level plotting.
EDIT: It may be necessary to construct a more specific method for passing column references to the function, as pointed out by djhurio and solved by Sam Swift with the Vectorize strategy. In the case of apply, the sum-based function would work out of the box as described above, but the division operator would not, so here is a further example that can be generalized to more complex functions with multiple arguments. All the programmer needs is the number of the column for the appropriate argument in the apply()-ed function, because (unfortunately) the column names are not carried through to the x argument:
> expGPA2$fn <- apply(expGPA2, 1, function(x) x[1]/x[2])
> str(expGPA2)
'data.frame': 48 obs. of 3 variables:
$ Var1: num -1.5 -1.3 -1.1 -0.9 -0.7 ...
$ Var2: num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
$ fn : num 0.75 0.65 0.55 0.45 0.35 ...
- attr(*, "out.attrs")=List of 2
..$ dim : int 16 3
..$ dimnames:List of 2
.. ..$ Var1: chr "Var1=-1.5" "Var1=-1.3" "Var1=-1.1" "Var1=-0.9" ...
.. ..$ Var2: chr "Var2=-2" "Var2= 0" "Var2= 2"
Edit2: (2013-01-05) Looking at this a year later, I realized that SamSwift's function could be vectorized by making its body use "+" instead of sum:
1/(1+exp(relGPA*pred.model$coef[1] + avgGPA*pred.model$coef[2]))  # all vectorized fns
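To illustrate (a hedged sketch with made-up coefficients standing in for pred.model, which the question doesn't supply), either the rewritten vectorized body or Vectorize() lets outer() fill the 16 x 3 grid directly:
coefs  <- c(0.8, -0.5)   # hypothetical stand-ins for pred.model$coef
relGPA <- seq(-1.5, 1.5, 0.2)
avgGPA <- c(-2, 0, 2)
# Vectorized body: "+" works elementwise, unlike sum(), so outer() can
# evaluate all 16 x 3 combinations at once.
f_vec <- function(relGPA, avgGPA) 1 / (1 + exp(relGPA * coefs[1] + avgGPA * coefs[2]))
expGPA <- outer(relGPA, avgGPA, FUN = f_vec)
dim(expGPA)   # 16 3
# The original sum()-based function works too, once wrapped in Vectorize():
f_sum <- function(relGPA, avgGPA) 1 / (1 + exp(sum(relGPA * coefs[1], avgGPA * coefs[2])))
all.equal(expGPA, outer(relGPA, avgGPA, FUN = Vectorize(f_sum)))   # TRUE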
