R: serialize base64 encode/decode of text not exactly matching

In my previous question about using serialize() to create a CSV of objects, I got a great answer from jmoy in which he recommended base64 encoding of my serialized text. That was exactly what I was looking for. Oddly enough, when I put this into practice I get results that look right but don't exactly match what I ran through the serialize/encoding process.
The example below takes a list of 3 vectors and serializes each vector. Each vector is then base64 encoded and written to a text file along with a key; the key is simply the index number of the vector. I then reverse the process and read each line back from the CSV. At the very end you can see that some items don't exactly match. Is this a floating point issue? Something else?
require(caTools)

randList <- NULL
set.seed(2)
randList[[1]] <- rnorm(100)
randList[[2]] <- rnorm(200)
randList[[3]] <- rnorm(300)

# delete file contents
fileName <- "/tmp/tmp.txt"
cat("", file=fileName, append=F)

i <- 1
for (item in randList) {
  myLine <- paste(i, ",", base64encode(serialize(item, NULL, ascii=T)), "\n", sep="")
  cat(myLine, file=fileName, append=T)
  i <- i + 1
}

linesIn <- readLines(fileName, n=-1)

parsedThing <- NULL
i <- 1
for (line in linesIn) {
  parsedThing[[i]] <- unserialize(base64decode(strsplit(linesIn[[i]], split=",")[[1]][[2]], "raw"))
  i <- i + 1
}

# floating point issue?
identical(randList, parsedThing)

for (i in 1:length(randList[[1]])) {
  print(randList[[1]][[i]] == parsedThing[[1]][[i]])
}

i <- 3
randList[[1]][[i]] == parsedThing[[1]][[i]]
randList[[1]][[i]]
parsedThing[[1]][[i]]
Here's the abridged output:
> #floating point issue?
> identical(randList, parsedThing)
[1] FALSE
>
> for (i in 1:length(randList[[1]])) {
+ print(randList[[1]][[i]] == parsedThing[[1]][[i]])
+ }
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
...
>
> i<-3
> randList[[1]][[i]] == parsedThing[[1]][[i]]
[1] FALSE
>
> randList[[1]][[i]]
[1] 1.587845
> parsedThing[[1]][[i]]
[1] 1.587845
>

The ascii=T in your call to serialize makes R do imprecise binary-to-decimal-to-binary conversions when serializing and unserializing, which causes the values to differ. If you remove ascii=T you get exactly the same numbers back, since the binary representation is then written out directly.
base64encode can encode raw vectors, so it doesn't need ascii=T.
The binary representation used by serialize is architecture-independent, so you can happily serialize on one machine and unserialize on another.
Reference: http://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats
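A minimal sketch of the fix (dropping ascii=T so the binary serialization is base64-encoded directly; same caTools functions as above):
require(caTools)
set.seed(2)
x <- rnorm(100)
enc <- base64encode(serialize(x, NULL))     # raw (binary) serialization
y <- unserialize(base64decode(enc, "raw"))
identical(x, y)
# [1] TRUE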

JD: I ran your code snippet on my Linux box, then looked at the differences computed by randList[[1]][[i]] - parsedThing[[1]][[i]].
Yes, the values are different, but only at the level of my machine's floating-point tolerance. A typical difference was -4.440892e-16, which is pretty tiny. Some differences were zero.
It does not surprise me that the save/restore introduced that (tiny) level of change. Any significant data conversion runs the risk of "bobbling" the least significant digit.

Ok, now that you show the output, I can explain what you're seeing (following Paul's lead here).
As this is a known floating-point comparison issue (see e.g. this R FAQ entry), you should buckle up and use any one of:
identical()
all.equal()
functions from the RUnit package such as checkEquals
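For example, with the objects above (all.equal() compares numerics with a tolerance, sqrt(.Machine$double.eps), roughly 1.5e-8, by default):
identical(randList, parsedThing)  # FALSE: bit-for-bit comparison
all.equal(randList, parsedThing)  # TRUE: equal within numeric tolerance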
In sum, there seems nothing wrong with the base64 encoding you are using. You simply employed the wrong definition of exactly. But hey, we're economists, and anything below a trillion or two is rounding error anyway...

Related

Identical R Dataframes, different hashes (not an attribute problem)

I have two dataframes X and Y of ~150 rows each, where identical(X, Y) is TRUE but identical(digest(X), digest(Y)) is FALSE. I'm looking into why this is the case.
I did look at this answer and re-ran what they tested, with similar results, but unlike their problem, the attributes for my dataframes are the same. Testing results:
> names(attributes(X))
[1] "names" "row.names" "class"
> names(attributes(Y))
[1] "names" "row.names" "class"
> digest(X)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> digest(X[1:nrow(X),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> digest(Y[1:nrow(Y),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> identical(X, Y, attrib.as.set=FALSE)
[1] TRUE
I also saved the dataframes as .RDS files, and re-read them in.
> X_rds <- read_rds("cache_vars/X.rds")
> Y_rds <- read_rds("cache_vars/Y.rds")
> identical(X_rds, Y_rds)
[1] TRUE
> digest(X_rds)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y_rds)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> identical(X_rds, Y_rds, attrib.as.set=FALSE)
[1] TRUE
And like the other poster, converting to matrices and back to dataframe yielded identical digests, so it's probably some structural problem.
> X_Mat <- as.matrix(X_rds)
> Y_Mat <- as.matrix(Y_rds)
> identical(digest(X_Mat), digest(Y_Mat))
[1] TRUE
> X_DF <- as.data.frame(X_Mat)
> Y_DF <- as.data.frame(Y_Mat)
> identical(digest(X_DF), digest(Y_DF))
[1] TRUE
Dataframe X was produced from a parallel-designed loop (but with the %do% flag so no actual parallelism was done) and Y was produced from a sequential loop.
The .RDS files for X and Y can be found at this link.
Update:
MrFlick has it right. As it turns out, the serialization during parallel's rbind function was also adding the gp=0x20 flag, similar to what they described occurs when writing to RDS.
When you write to RDS, the objects are serialized. The serialization contains some information in addition to just the values the vectors hold. Note that if we just compare all the columns, they produce different digests:
sapply(seq_along(X_rds), function(i)
  digest::digest(X_rds[[i]]) == digest::digest(Y_rds[[i]])
)
So the vectors that are being stored in the data.frame are different. We can use the internal inspect function to get some of the meta-data for the vectors
.Internal(inspect(X_rds[[1]]))
# #135305a00 14 REALSXP g0c7 [REF(4),gp=0x20] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
.Internal(inspect(Y_rds[[1]]))
# #115dbfc00 14 REALSXP g0c7 [REF(29)] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
So we see they differ in the [] parts. I believe the REF() number represents the reference count to that object, used for memory management; I do not believe it is used in the serialization. But X_rds also has gp=0x20 set. The "gp" stands for "general purpose" bits/flags; in this case I believe it means the GROWABLE_MASK was set on that object. These flags are preserved when the object is serialized, and serializing is the default behavior for digest. Thus these vectors do not have exactly the same serialization, due to this flag difference.
Another way to see the difference is to look at the serialized output itself:
substring(rawToChar(serialize(X_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n131086\n150\n1009002\n"
substring(rawToChar(serialize(Y_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n14\n150\n1009002\n1009"
We have a bit of a header, then we start to see the values being output. There is one value where there is a difference: X has 131086 (0x2000E) where Y has 14 (0xE). That difference is due to the flags, which are written here in the R source code.
When you use identical, only the values in the data.frame are compared, not the additional metadata.
If you wanted to get around this, you could write your own wrapper around digest that avoids the serialization. For example
dfdigest <- function(x) {
  charsToRaw <- function(x) unlist(lapply(x, charToRaw))
  bytes <- unlist(c(list(charsToRaw(names(x))),
                    lapply(x, function(col) {
                      if (typeof(col) == "double") writeBin(col, raw())
                      else if (typeof(col) == "character") charsToRaw(col)
                      else stop(paste("unconfigured data type:", typeof(col)))
                    })))
  digest::digest(bytes, serialize = FALSE)
}
dfdigest(X_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
dfdigest(Y_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"

Translation and mapping of emoticons encoded as UTF-8 code in text

I am working with text which includes emoticons. I need to be able to find these and replace them with tags that can be analysed. How can I do this?
> main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
> grepl("\xf0", main$text[[4]])
[1] FALSE
I tried the above. Why did it not work? I also tried iconv into ASCII; the byte encoding I then got could be searched with grepl.
> abc<-iconv(main$text[[4]], "UTF-8", "ASCII", "byte")
> abc
[1] "Spread d wrd<f0><9f><98><8e>"
> grepl("<f0>", abc)
[1] TRUE
I really do not understand what I did here and what happened. I also do not understand how the above conversion introduced \n characters into the text.
I also did not know how to encode these once they were searchable. I found a list here, but it fell short (for example, "U+E00E" - <ee><80><8e> was not in the list). Is there a comprehensive list for such a mapping?
ADDENDUM
After a lot of trial and error, here is what I realised. There are two kinds of encodings for the emojis in the data. One is in the form of bytes, searchable with grepl("\x9f", ..., useBytes=T), like main$text[[4]]; the other (main$text[[6]]) is searchable as the Unicode character without useBytes=T, i.e. grepl("\ue00e", ...). Even the way they are displayed in View() and when called on the console is different. I am absolutely confused as to what is going on here.
main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
main[4,]
timestamp fromMe remoteResource remoteResourceDisplayName type
b 2014-08-30 02:58:58 FALSE 112233#s.whatsapp.net ABC text
text date
b Spread d wrd<f0><U+009F><U+0098><U+008E> 307114
main$text[[6]]
[1] ""
main[6,]
timestamp fromMe remoteResource remoteResourceDisplayName type text
b 2014-08-30 02:59:17 FALSE 12345#s.whatsapp.net XYZ text <U+E00E>
date
b 307114
grepl("\ue00e", main$text[[6]])
[1] TRUE
grepl("<U+E00E>", main$text[[6]])
[1] FALSE
grepl("\u009f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]], fixed=T)
[1] FALSE
grepl("\x9f", main$text[[4]], useBytes=T)
[1] TRUE
The maps I have are also different. The one for the bytes case works well, but the other one does not, since I am unable to create the "\ue00e" required to search. Here is a sample of the other map, corresponding to the Softbank <U+E238>:
emmm[11]
[1] "E238"
Searching for a single byte of a multi-byte UTF-8 encoded character only works if done with useBytes = TRUE. The fact that "\xf0" here is a part of a multi-byte character is obscured by the less than perfect Unicode support of R on Windows (used in the original example, I presume). How to match by bytes:
foo <- "\xf0\x9f\x98\x8e" # U+1F60E SMILING FACE WITH SUNGLASSES
Encoding(foo) <- "UTF-8"
grepl("\xf0", foo, useBytes = TRUE)
I don't see much use for matching one byte, though. Searching for the whole character would then be:
grepl(foo, paste0("Smiley: ", foo, " and more"), useBytes = TRUE)
Valid ASCII codes correspond to integers 0–127. The iconv() conversion to ASCII in the example replaces any invalid byte 0xYZ (corresponding to integers 128–255) with the literal text <yz> where y and z are hexadecimal digits. As far as I can see, it should not introduce any newlines ("\n").
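A small self-contained illustration (using "é", whose UTF-8 encoding is the two bytes 0xc3 0xa9):
x <- "caf\u00e9"
iconv(x, "UTF-8", "ASCII", "byte")
# [1] "caf<c3><a9>"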
Using the character list linked to in the question, here is some example code which performs one kind of "emoji tagging" to input strings, namely replacing the emoji with its (slightly formatted) name.
emoji_table <- read.csv2("https://github.com/today-is-a-good-day/Emoticons/raw/master/emDict.csv",
                         stringsAsFactors = FALSE)
emoji_names <- emoji_table[, 1]
text_bytes_to_raw <- function(x) {
  loc <- gregexpr("\\x", x, fixed = TRUE)[[1]] + 2
  as.raw(paste0("0x", substring(x, loc, loc + 1)))
}
emoji_raw <- lapply(emoji_table[, 3], text_bytes_to_raw)
emoji_utf8 <- vapply(emoji_raw, rawToChar, "")
Encoding(emoji_utf8) <- "UTF-8"
gsub_many <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  x2 <- x
  for (k in seq_along(patterns)) {
    x2 <- gsub(patterns[k], replacements[k], x2, useBytes = TRUE)
  }
  x2
}
tag_emojis <- function(x, codes, names) {
  gsub_many(x, codes, paste0("<", gsub("[[:space:]]+", "_", names), ">"))
}
each_tagged <- tag_emojis(emoji_utf8, emoji_utf8, emoji_names)
all_in_one <- tag_emojis(paste0(emoji_utf8, collapse = ""),
                         emoji_utf8, emoji_names)
stopifnot(identical(paste0(each_tagged, collapse = ""), all_in_one))
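For instance, applied to the question's example string (a sketch only; the exact tag text depends on the name column of emDict.csv):
msg <- paste0("Spread d wrd", foo)  # foo is the U+1F60E character from the earlier example
tag_emojis(msg, emoji_utf8, emoji_names)
# hypothetically: "Spread d wrd<SMILING_FACE_WITH_SUNGLASSES>", if that emoji is in the list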
As to why U+E00E is not on that emoji list, I don't think it should be. This code point is in a Private Use Area, where character mappings are not standardized. For comprehensive Unicode character lists, you cannot find a better authority than the Unicode Consortium, e.g. Unicode Emoji. Additionally, see convert utf8 code point strings like <U+0161> to utf8.
Edit after addendum
When there is a string of exactly four hexadecimal digits representing a Unicode code point (let's say "E238"), the following code will convert the string to the corresponding UTF-8 representation, the occurrence of which can be checked with the grep() family of functions. This answers the question of how to "automatically" generate the character that can be manually created by typing "\uE238".
library(stringi)
hex4_to_utf8 <- function(x) {
  stopifnot(grepl("^[[:xdigit:]]{4}$", x))
  stringi::stri_enc_toutf8(stringi::stri_unescape_unicode(paste0("\\u", x)))
}
foo <- "E238"
foo_utf8 <- hex4_to_utf8(foo)
The value of the useBytes option should not matter in the following grepl() call. In the previous code example, I used useBytes = TRUE as a precaution, as I'm not sure how well R on Windows handles Unicode code points U+10000 and larger (five or six hex digits). Clearly it cannot properly print such code points (as shown by the U+1F60E example), and input with the \U + 8 digits method is not possible.
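For instance (a sketch; a stand-in string is used here in place of the question's main$text[[6]], which contains U+E00E):
target <- "\ue00e"
grepl(hex4_to_utf8("E00E"), target)                   # [1] TRUE
grepl(hex4_to_utf8("E00E"), target, useBytes = TRUE)  # [1] TRUE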
The example in the question shows that R (on Windows) may print Unicode characters with the <U+E238> notation rather than as \ue238. The reason seems to be format(), also used in print.data.frame(). For example (R for Windows running on Wine):
> format("\ue238")
[1] "<U+E238>"
When tested in an 8-bit locale on Linux, the same notation is already used by the default print method. One must note that in this case, this is only a printed representation, which is different from how the character is originally stored.

Weird behavior of double-nested lists in R

This just took me two hours of debugging to identify:
> list1 = list() # empty list
> list1['first'] = list(a=list(a1='goat', a2='horse'), b=42) # double-nested
> print(list1$first$b)
NULL # Should be 42?
> print(list1) # let's check the actual contents of list1
$first
$first$a1 # how did the contents of the innermost a-list end up here?
[1] "goat"
$first$a2
[1] "horse"
In this case, the list assigned to 'first' ends up being just the inner a list, so b disappears without warning. What is happening here, and where did the b value go?
I'm using R 3.0.2. How can I do something like this when R prevents me from doing the above?
As joran pointed out in a comment, the solution is to use double-brackets in the assignment:
list1[['first']] = list(a=list(a1='goat', a2='horse'), b=42)
Apparently you get a warning in newer R versions, but not in older ones, if you use single brackets.
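A minimal sketch contrasting the two assignments (with [<- the right-hand side is treated as a list of replacement elements, so only its first element, the inner a list, is stored):
list1 <- list()
list1['first'] <- list(a=list(a1='goat', a2='horse'), b=42)  # single brackets
names(list1$first)
# [1] "a1" "a2"    # only the inner 'a' list was assigned; b was silently dropped
list1[['first']] <- list(a=list(a1='goat', a2='horse'), b=42)  # double brackets
list1$first$b
# [1] 42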

Numeric matrix is taking far more memory than it should

I am creating a document term matrix (dtm for short) for a Naive Bayes implementation. (I know there is a function for this, but I have to code it myself for homework.) I wrote a function that successfully creates the dtm; the problem is that the resulting matrix takes up too much memory. For example, a 100 x 32000 matrix (of 0's and 1's) is 24MB in size! This results in crashy behavior in R when trying to work with the full 10k documents. The functions follow, and a toy example is in the last 3 lines. Can anyone spot why the "sparser" function in particular is returning such memory-intensive results?
stopWords <- character(0)  # not shown in the question; assumed to be a character vector of words to drop

listAllWords <- function(docs) {
  str1 <- strsplit(x=docs, split="\\s", fixed=FALSE)
  dictDupl <- unlist(str1)[!(unlist(str1) %in% stopWords)]
  dictionary <- unique(dictDupl)
}

# function to create the sparse matrix of words as they appear in each article segment
sparser <- function(docs, dictionary) {
  num.docs <- length(docs)         # dtm rows
  num.words <- length(dictionary)  # dtm columns
  dtm <- mat.or.vec(num.docs, num.words)  # instantiate dtm of zeroes
  for (i in 1:num.docs) {
    doc.temp <- unlist(strsplit(x=docs[i], split="\\s", fixed=FALSE))  # vectorize words
    num.words.doc <- length(doc.temp)
    for (j in 1:num.words.doc) {
      ind <- which(dictionary == doc.temp[j])  # loop over words and find index in dict.
      dtm[i, ind] <- 1  # indicate this word is in this document
    }
  }
  return(dtm)
}

docs <- c("the first document contains words",
          "the second document is also made of words",
          "the third document is words and a number 4")
dictionary <- listAllWords(docs)
dtm <- sparser(docs, dictionary)
If it makes any difference, I am running this in RStudio on Mac OS X, 64-bit.
Surely part of your problem is that you are not actually storing integers, but doubles. Note:
m <- mat.or.vec(100,32000)
m1 <- matrix(0L,100,32000)
> object.size(m)
25600200 bytes
> object.size(m1)
12800200 bytes
And note the lack of the "L" in the code for mat.or.vec:
> mat.or.vec
function (nr, nc)
if (nc == 1L) numeric(nr) else matrix(0, nr, nc)
<bytecode: 0x1089984d8>
<environment: namespace:base>
You will also want to explicitly assign 1L, otherwise R will convert everything to doubles upon the first assignment, I think. You can verify that by assigning the value 1 to a single element of m1 above and rechecking the object size.
I should probably also mention the function storage.mode which can help you to verify that you're using integers.
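A quick sketch of that check (integer storage is kept only if you assign integer literals):
m1 <- matrix(0L, 100, 32000)
storage.mode(m1)  # [1] "integer"
m1[1, 1] <- 1     # a double literal coerces the whole matrix
storage.mode(m1)  # [1] "double"
m1 <- matrix(0L, 100, 32000)
m1[1, 1] <- 1L    # an integer literal preserves integer storage
storage.mode(m1)  # [1] "integer"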
If you want to store 0/1 values economically, I would suggest raw type.
m8 <- matrix(0,100,32000)
m4 <- matrix(0L,100,32000)
m1 <- matrix(raw(1),100,32000)
The raw type takes just 1 byte per value:
> object.size(m8)
25600200 bytes
> object.size(m4)
12800200 bytes
> object.size(m1)
3200200 bytes
Here is how to operate with it:
> m1[2,2] = as.raw(1)
> m1[2,2]
[1] 01
> as.integer(m1[2,2])
[1] 1
If you really want to be economical look at the ff and bit packages.
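A rough sketch with the bit package (assuming it is installed): a bit vector stores one bit per value, so 100 * 32000 cells come to roughly 400 KB, though you give up the matrix dimensions.
library(bit)
b <- bit(100 * 32000)
object.size(b)
# roughly 400,000 bytes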

R-thonic replacement for simple for loops containing a condition

I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:
probe2gene <- list()
for (i in 1:length(probes)) {
  strings <- strsplit(descriptions[i], '//')
  if (length(strings[[1]]) > 1) {
    probe2gene[probes[i]] <- strings[[1]][2]
  }
}
Which works fine, but seems slow, much slower than the roughly equivalent python:
probe2gene = {}
for p, d in zip(probes, descriptions):
    try:
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass
My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?
Edit: a typical good "description" looks like this:
"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"
a bad "description: looks like this
"-----"
though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probe and description vectors are the same length, and completely correspond to each other, i.e. probe[i] maps to description[i].
It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.
> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"
Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?
Probe and description are the same length?
This is kind of messy, but here's a quick first pass at it:
a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))
names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only
That's my first attempt. If you have a sample dataset that would be very useful.
Best regards,
Jay
Another way.
probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA NA "hair"
In the sapply, we take advantage of the fact that in R the subsetting operator is also a function named '[', to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives an NA value. On the left-hand side of the same line, we use the fact that we can pass a vector of indices in any order and with gaps.
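To see both tricks in isolation (a small sketch):
sapply(list(c("a", "b"), "x"), '[', 2)
# [1] "b" NA     # '[' called as a function; out-of-range index gives NA
v <- character()
v[c(3, 1)] <- c("C", "A")  # indices in any order, with gaps
v
# [1] "A" NA  "C"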
Here's another approach that should be fast. Note that this doesn't remove the empty descriptions. It could be adapted to do that, or you could clean those in a post-processing step using lapply. Is it the case that you'll never have a valid description of length one?
make_desc <- function(n) {
  word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
  if (runif(1) < 0.70)
    paste(sapply(seq_len(n), word), collapse = "//")
  else
    "----"
}

description <- sapply(seq_len(10), make_desc)
probes <- seq_len(length(description))

desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)
> description
[1] "fmbec"
[2] "----"
[3] "----"
[4] "frrii//yjxsa//wvkce//xbpkc"
[5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"
[6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"
[7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"
[8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"
[9] "----"
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"
> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
