How to read a large (~20 GB) XML file in R?

I want to read data from a large XML file (20 GB) and manipulate it. I tried to use "xmlParse()", but it ran into a memory issue before the file finished loading. Is there an efficient way to do this?
My data dump looks like this:
<tags>
<row Id="106929" TagName="moto-360" Count="1"/>
<row Id="106930" TagName="n1ql" Count="1"/>
<row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
<row Id="106932" TagName="deeplearning4j" Count="1"/>
<row Id="106933" TagName="pystache" Count="1"/>
<row Id="106934" TagName="jitter" Count="1"/>
<row Id="106935" TagName="klein-mvc" Count="1"/>
</tags>

In the XML package, the xmlEventParse function implements a SAX parser: it reads the XML as a stream and calls your handler functions as it goes, so the whole document never has to be held in memory. If your XML is simple enough (repeating elements inside one root element), you can use the branches parameter to define a function for each element of interest.
Example:
MedlineCitation = function(x, ...) {
  # This is a "branch" function.
  # x is an XML node: everything inside the <MedlineCitation> element.
  # Find the <ArticleTitle> element inside it and print its value:
  ns <- getNodeSet(x, path = "//ArticleTitle")
  value <- xmlValue(ns[[1]])
  print(value)
}
Call XML parsing:
xmlEventParse(
  file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml",
  handlers = NULL,
  branches = list(MedlineCitation = MedlineCitation)
)
Solution with closure:
As in Martin Morgan's answer to "Storing specific XML node values with R's xmlEventParse":
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//ArticleTitle")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(MedlineCitation = func, getStore = getStore)
}
myfunctions <- branchFunction()
xmlEventParse(
  file = "medsamp2015.xml",
  handlers = NULL,
  branches = myfunctions
)
# to see what is inside
myfunctions$getStore()
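For the attribute-only <row> elements in the question, a plain SAX handler may be simpler than a branch function, since there is no subtree to build. The following is a minimal sketch, not a tested solution for the full dump. It assumes the XML package's startElement handler convention (the handler is called with the element name and a named character vector of attributes). make_row_collector and get_rows are hypothetical names, and the inline string stands in for the real file:

```r
library(XML)

# Sample input copied from the question; a real run would pass a file path.
xml_text <- '<tags>
<row Id="106929" TagName="moto-360" Count="1"/>
<row Id="106930" TagName="n1ql" Count="1"/>
<row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
</tags>'

# Hypothetical helper: returns a SAX handler plus an accessor for the results.
make_row_collector <- function() {
  rows <- list()
  startElement <- function(name, attrs, ...) {
    # Called once per opening tag; only <row> elements are of interest.
    if (name == "row") {
      rows[[length(rows) + 1L]] <<- attrs[c("Id", "TagName", "Count")]
    }
  }
  get_rows <- function() {
    # Stack the per-row attribute vectors into a character data frame.
    as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  }
  list(startElement = startElement, get_rows = get_rows)
}

collector <- make_row_collector()
invisible(xmlEventParse(
  xml_text,
  handlers = list(startElement = collector$startElement),
  asText = TRUE
))
tags <- collector$get_rows()
print(tags)
```

For the real dump, pass the file path and drop asText = TRUE. Collecting every row in memory defeats the purpose on a 20 GB file, so in practice the handler would more likely aggregate values or write rows out in batches.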

Related

How to accumulate the results of readr::read_lines_chunked?

I'm using readr::read_lines_chunked in the following way:
if(!require(readr)) install.packages("readr", repos = "http://cran.us.r-project.org")
mytb <- NULL
read_lines_chunked(file="/tmp/huge.xml", chunk_size=10, callback = function(xml, pos) {
  # extract values from xml into tmp
  if (is.null(mytb)) {
    users <- as_tibble(tmp)
  } else {
    users <- bind_rows(users, as_tibble(tmp))
  }
})
But this doesn't work, as mytb always ends up being NULL. How do you accumulate the results into a tibble?
I found the solution. This package has a group of callback handlers that wrap the custom handler. So this is how it works:
mytb <- read_lines_chunked(file="/tmp/huge.xml", chunk_size=10, callback = DataFrameCallback$new(function(xml, pos) {
  # extract values from xml into tmp
  as_tibble(tmp)
}))
Note the DataFrameCallback$new(...) wrapper: the custom handler returns the tibble for its chunk, and the wrapper stitches the per-chunk results together with rbind.
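To make the pattern concrete, here is a small self-contained sketch. It assumes one <row> element per line, as in the dump shown earlier, and pulls attributes out with regular expressions rather than a real XML parser. extract_rows and the temporary file are illustrative stand-ins, not part of readr:

```r
library(readr)

# Write a tiny stand-in for /tmp/huge.xml to a temporary file.
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<tags>',
  '<row Id="106929" TagName="moto-360" Count="1"/>',
  '<row Id="106930" TagName="n1ql" Count="1"/>',
  '<row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355"/>',
  '</tags>'
), path)

# Hypothetical per-chunk handler: receives a character vector of lines and
# returns a data frame with one row per <row> element found in the chunk.
extract_rows <- function(lines, pos) {
  lines <- grep("<row ", lines, fixed = TRUE, value = TRUE)
  data.frame(
    Id = as.integer(sub('.* Id="([^"]*)".*', "\\1", lines)),
    TagName = sub('.* TagName="([^"]*)".*', "\\1", lines),
    stringsAsFactors = FALSE
  )
}

# DataFrameCallback collects each chunk's result and row-binds them at the end.
tags <- read_lines_chunked(
  path,
  DataFrameCallback$new(extract_rows),
  chunk_size = 2,
  progress = FALSE
)
print(tags)
```

The original attempt could not work because assignments inside the callback create local variables that vanish when the callback returns. Returning the value and letting the callback wrapper accumulate it avoids that.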

How to create a dictionary and insert key with a list of value in R?

I want to loop through content of a file, compute the key of each word and then store the words with same key in a list.
In python, it's like below:
dictionary = {}
for word in file:
    key = dosomecalculation(word)
    dictionary.setdefault(key, [])
    dictionary[key].append(word)
In R, how can I declare a dictionary with key as string and value as a list?
How can I check whether a key exists in the dictionary?
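You could use the hash package for this task:
library(hash)
h <- hash()
for (word in file) {
  key <- dosomecalculation(word)
  if (has.key(key, h)) {
    # Key already present: append the word to its list.
    h[[key]] <- append(h[[key]], word)
  } else {
    # First word for this key: start the list with it, so it is not dropped.
    h[[key]] <- list(word)
  }
}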
Using [[ for indexing (e.g. h[["foo"]]) will then return the corresponding list.
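If adding a package is not an option, a base R environment can play the same role. This is a sketch under assumptions: compute_key is a hypothetical stand-in for the question's key computation (here, the word's letters in sorted order), and the words vector replaces the file:

```r
# Hypothetical key function: words made of the same letters share a key.
compute_key <- function(word) {
  letters_sorted <- sort(strsplit(word, "")[[1]], method = "radix")
  paste(letters_sorted, collapse = "")
}

words <- c("listen", "silent", "enlist", "google")
dict <- new.env(hash = TRUE)

for (word in words) {
  key <- compute_key(word)
  # exists() with inherits = FALSE checks only this environment,
  # which is the "is the key in the dictionary?" test.
  if (!exists(key, envir = dict, inherits = FALSE)) {
    dict[[key]] <- character(0)
  }
  dict[[key]] <- c(dict[[key]], word)
}

print(dict[["eilnst"]])
print(ls(dict))
```

Here the values are character vectors rather than lists, which is usually more convenient when every value is a string. ls(dict) lists the keys.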

Using a closure to generate an R6 binding

I'm using active bindings in an R6 class to check values before assignment to fields. I thought I could use a closure to generate the bindings as below, but this doesn't work.
The binding isn't evaluated in the way I expect (at all?) because the error shows the closure's name argument. What am I missing?
library(R6)
library(pryr)
# pass a field name to create its binding
generate_binding <- function(name) {
  function(value) {
    if (!missing(value) && length(value) > 0) {
      private$name <- value
    }
    private$name
  }
}
bind_x = generate_binding(x_)
# created as intended:
unenclose(bind_x)
# function (value)
# {
# if (!missing(value) && length(value) > 0) {
# private$x_ <- value
# }
# private$x_
# }
MyClass <- R6::R6Class("MyClass",
  private = list(
    x_ = NULL
  ),
  active = list(
    x = bind_x
  )
)
my_class_instance <- MyClass$new()
my_class_instance$x <- "foo"
# Error in private$name <- value :
# cannot add bindings to a locked environment
I think you’re misunderstanding how closures work. unenclose is a red herring here (as it doesn’t actually show you what the closure looks like). The closure contains the statement private$name <- value — it does not contain the statement private$x_ <- value.
The usual solution to this problem would be to rewrite the closure such that the unevaluated name argument is deparsed into its string representation, and then used to subset the private environment (private[[name]] <- value). However, this doesn’t work here since R6 active bindings strip closures of their enclosing environment.
This is where unenclose comes in then:
MyClass <- R6::R6Class("MyClass",
  private = list(
    x_ = NULL
  ),
  active = list(
    x = pryr::unenclose(bind_x)
  )
)
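The point about what the closure really contains can be checked in base R without R6 or pryr. The sketch below reuses generate_binding from the question and inspects the returned function's body. body_text, uses_name and uses_x are illustrative variable names:

```r
generate_binding <- function(name) {
  function(value) {
    if (!missing(value) && length(value) > 0) {
      private$name <- value
    }
    private$name
  }
}

# x_ does not exist, yet this call succeeds: `name` is a lazily evaluated
# argument (a promise) that the inner function never actually evaluates.
bind_x <- generate_binding(x_)

# Inspect the code the closure will really run.
body_text <- paste(deparse(body(bind_x)), collapse = "\n")
cat(body_text, "\n")

uses_name <- grepl("private$name", body_text, fixed = TRUE)
uses_x <- grepl("private$x_", body_text, fixed = TRUE)
print(c(uses_name = uses_name, uses_x = uses_x))
```

The body still refers to private$name. The $ operator treats name as a literal field name, never as a variable to look up, which is why the R6 object reports an attempt to add a new binding called name.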

Order of methods in R reference class and multiple files

There is one thing I really don't like about R reference classes: the order in which you write the methods matters. Suppose your class goes like this:
myclass = setRefClass("myclass",
  fields = list(
    x = "numeric",
    y = "numeric"
  ))
myclass$methods(
  afunc = function(i) {
    message("In afunc, I just call bfunc...")
    bfunc(i)
  }
)
myclass$methods(
  bfunc = function(i) {
    message("In bfunc, I just call cfunc...")
    cfunc(i)
  }
)
myclass$methods(
  cfunc = function(i) {
    message("In cfunc, I print out the sum of i, x and y...")
    message(paste("i + x + y = ", i+x+y))
  }
)
myclass$methods(
  initialize = function(x, y) {
    x <<- x
    y <<- y
  }
)
And then you start an instance, and call a method:
x = myclass(5, 6)
x$afunc(1)
You will get an error:
Error in x$afunc(1) : could not find function "bfunc"
I am interested in two things:
Is there a way to work around this nuisance?
Does this mean I can never split a really long class file into multiple files? (e.g. one file for each method.)
Calling bfunc(i) isn't going to invoke the method, since it doesn't know what object it is operating on!
In your method definitions, .self is the object the method was called on. So change your code to:
myclass$methods(
  afunc = function(i) {
    message("In afunc, I just call bfunc...")
    .self$bfunc(i)
  }
)
(and similarly for bfunc). Are you coming from C++ or some language where functions within methods are automatically invoked within the object's context?
Some languages make this more explicit, for example in Python a method with one argument like yours actually has two arguments when defined, and would be:
def afunc(self, i):
    [code]
but called like:
x.afunc(1)
Then within afunc there is the self variable, which refers to x (although calling it self is a universal convention, it could be called anything).
In R, the .self is a little bit of magic sprinkled over reference classes. I don't think you could change it to .this even if you wanted.
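Putting the suggestion together, here is a minimal runnable sketch of the whole class with every cross-method call routed through .self. The class name Adder is made up for the example, and the custom initialize method is left out in favour of the default initializer, which accepts named field values:

```r
library(methods)

# Hypothetical class mirroring the one in the question.
Adder <- setRefClass("Adder",
  fields = list(x = "numeric", y = "numeric")
)

# Methods are added in separate calls, callers before callees.
Adder$methods(
  afunc = function(i) {
    .self$bfunc(i)
  }
)
Adder$methods(
  bfunc = function(i) {
    .self$cfunc(i)
  }
)
Adder$methods(
  cfunc = function(i) {
    # Fields x and y are visible directly inside a method.
    i + x + y
  }
)

obj <- Adder$new(x = 5, y = 6)
total <- obj$afunc(1)
print(total)
```

Because each call goes through .self, the lookup happens on the object at call time, so the methods should be registrable in any order, including from separate files sourced after the setRefClass call.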

R tcl tk: How do I pass a variable to a button command?

How would I pass the value of num in a command function of a button?
f.frame <- tktoplevel()
numIDs = 50;
bs = list();
OnPress <- function (inum) { print (inum) }
for (num in 1:numIDs) {
  bs[[num]] <- tkbutton (f.frame, command = "OnPress num");
  tkpack (bs[[num]]);
}
Create a factory function that returns a function of no arguments:
makepresser=function(n){force(n);function(){cat("Hit me ",n," times\n")}}
In case you've not seen this before, it lets you do:
> m1 = makepresser(1)
> m1()
Hit me 1 times
> m2 = makepresser(9)
> m2()
Hit me 9 times
Then it's as simple as:
f.frame <- tktoplevel()
bs = list()
for(i in 1:10){
  bs[[i]]=tkbutton(f.frame,command=makepresser(i))
  tkpack(bs[[i]])
}
The factory function creates a function closure of no arguments which keeps the value of n when it was constructed (the force function is needed here or you get bitten by lazy evaluation).
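The effect of force can be seen without tcltk. In this sketch, make_lazy and make_forced are illustrative names. The two factories differ only in the force(n) call, and both are used inside a for loop like the button loop above:

```r
# Factory without force(): n stays an unevaluated promise pointing at `i`.
make_lazy <- function(n) {
  function() n
}

# Factory with force(): n is evaluated when the factory is called.
make_forced <- function(n) {
  force(n)
  function() n
}

lazy <- list()
forced <- list()
for (i in 1:3) {
  lazy[[i]] <- make_lazy(i)
  forced[[i]] <- make_forced(i)
}

# The closures are only called after the loop has finished, when i is 3.
lazy_values <- vapply(lazy, function(f) f(), numeric(1))
forced_values <- vapply(forced, function(f) f(), numeric(1))

print(lazy_values)    # 3 3 3
print(forced_values)  # 1 2 3
```

Without force, every button callback would see the loop variable's final value, because the argument is not evaluated until the callback first runs.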
