R: read in a largish XML file containing multiple tables

I want to efficiently read in an XML file (200 MB in size) consisting of multiple tables.
Sketch of the structure:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:od="urn:schemas-microsoft-com:officedata">
<xsd:schema>
<xsd:element name="dataroot">
<xsd:element name="dataroot">
</xsd:schema>
<dataroot>
<TABLE1>
[here comes the data]
</TABLE1>
<TABLE2>
[here comes the data]
</TABLE2>
...
</dataroot>
</root>
How do I read in all or specific tables into a data.frame? And what is probably the most efficient way to do so?
Perhaps as a starter:
library(XML)
library(data.table)
xmldoc <- xmlParse("data.xml")
d <- getNodeSet(xmldoc, "//dataroot//TABLE1")
size <- xmlSize(d)
dt <- rbindlist(lapply(1:size, function(i) {
as.list(getChildrenStrings(d[[i]]))
}), fill = TRUE)
This works OK but is not particularly fast. How can I do this with xml2? The package docs are not particularly enlightening.
Also, I want to loop over all tables but couldn't figure out the XPath.
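For what it's worth, here is a minimal xml2 sketch of one way this could be approached; it has not been benchmarked against a 200 MB file. It assumes that every child of dataroot is one record named after its table, that fields are plain leaf elements, and that no field name repeats within a record. The inline sample, its column names, and the helper read_table are made up for illustration.

```r
library(xml2)

# Tiny stand-in for the real file; table and column names are hypothetical.
txt <- '<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:od="urn:schemas-microsoft-com:officedata">
  <dataroot>
    <TABLE1><id>1</id><name>a</name></TABLE1>
    <TABLE1><id>2</id></TABLE1>
    <TABLE2><code>x</code><value>10</value></TABLE2>
  </dataroot>
</root>'
doc <- read_xml(txt)

# Every child of <dataroot> is one record; its element name is the table name.
table_names <- unique(xml_name(xml_find_all(doc, "/root/dataroot/*")))

# Hypothetical helper: read one table without looping over records in R.
read_table <- function(doc, table_name) {
  base   <- paste0("/root/dataroot/", table_name)
  recs   <- xml_find_all(doc, base)                # one node per record
  cells  <- xml_find_all(doc, paste0(base, "/*"))  # all fields, document order
  row_id <- rep(seq_along(recs), xml_length(recs)) # record index of each field
  cols   <- unique(xml_name(cells))
  # Fill a character matrix; fields missing from a record stay NA.
  m <- matrix(NA_character_, nrow = length(recs), ncol = length(cols),
              dimnames = list(NULL, cols))
  m[cbind(row_id, match(xml_name(cells), cols))] <- xml_text(cells)
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Named list with one data frame per table
tables <- setNames(lapply(table_names, read_table, doc = doc), table_names)
```

All columns come back as character, so type conversion (for example with type.convert) would be a separate step. A single table can be read with read_table(doc, "TABLE1").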

Related

R XML parsing to data-frame

I have various XML files with information as shown below. I'm having difficulty parsing this variable XML format into a dataframe that can handle both differing numbers of metrics and duplicated Properties tags.
<ProducedFruits>
<FruitType>
<FruitName>Apple</FruitName>
<FruitMetrics>
<Properties Sugars="27.51" Rate="5.03" />
<Properties Sugars="219.39" Rate="12.19" />
<Properties Sugars="266.34" Rate="75.9" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Lime</FruitName>
<FruitMetrics>
<Properties Sugars="1884.2" Rate="5" />
<Properties Sugars="1884.2" Rate="98.3" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Lemon</FruitName>
<FruitMetrics>
<Properties Sugars="1064.77" Rate="5" />
<Properties Sugars="1064.77" Rate="56" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Banana</FruitName>
<FruitMetrics>
<Properties Sugars="113" Rate="12" />
<Properties Sugars="113" Rate="79" />
</FruitMetrics>
</FruitType>
</ProducedFruits>
Each file may be somewhat different, so ideally I would like to create something that handles the inconsistent number of values, preserves the fruit name, and creates a dataframe like the one at the bottom.
[Image: desired output dataframe]
To read your XML into R as a dataframe you can use the XML package (https://cran.r-project.org/web/packages/XML/): parse the file with data <- XML::xmlParse("doc.xml"), convert the parsed document to a nested list with xml_data <- XML::xmlToList(data), then build the dataframe with xml_df <- as.data.frame(xml_data) (per: How to parse XML to R data frame).
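Because the values in this file live in attributes of repeated Properties elements, a generic list-to-dataframe conversion may not give one row per Properties entry. A minimal sketch of an alternative with the same XML package, assuming the FruitType / FruitName / FruitMetrics / Properties layout shown in the question (the inline sample is a shortened copy of it, and the output column names are simply the attribute names):

```r
library(XML)

# Shortened inline copy of the question's sample data
txt <- '<ProducedFruits>
  <FruitType>
    <FruitName>Apple</FruitName>
    <FruitMetrics>
      <Properties Sugars="27.51" Rate="5.03" />
      <Properties Sugars="219.39" Rate="12.19" />
      <Properties Sugars="266.34" Rate="75.9" />
    </FruitMetrics>
  </FruitType>
  <FruitType>
    <FruitName>Lime</FruitName>
    <FruitMetrics>
      <Properties Sugars="1884.2" Rate="5" />
      <Properties Sugars="1884.2" Rate="98.3" />
    </FruitMetrics>
  </FruitType>
</ProducedFruits>'

doc <- xmlParse(txt, asText = TRUE)
fruits <- getNodeSet(doc, "//FruitType")

# One small data frame per fruit, with the name repeated for each Properties row
rows <- lapply(fruits, function(f) {
  name   <- xpathSApply(f, "./FruitName", xmlValue)
  sugars <- xpathSApply(f, "./FruitMetrics/Properties", xmlGetAttr, "Sugars")
  rate   <- xpathSApply(f, "./FruitMetrics/Properties", xmlGetAttr, "Rate")
  data.frame(
    FruitName = rep(name, length(sugars)),
    Sugars    = as.numeric(sugars),
    Rate      = as.numeric(rate),
    stringsAsFactors = FALSE
  )
})

fruit_df <- do.call(rbind, rows)
```

Working per FruitType keeps the fruit name attached to however many Properties entries each fruit happens to have.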

Reading xml data from oracle table column and parsing it in R

I have one scenario in R.
I have connected to an Oracle database from R through the RODBC package, and one column of a table contains XML data. When I use the xmlParse function on it, I get the error "XML content does not seem to be XML", and class(xmldata) is data.frame.
When I copy the XML data into a new XML file and parse that with xmlParse, it is parsed correctly and class(sourcefile) is XMLInternalDocument.
The error is raised because you are running XML::xmlParse on a dataframe object, which is the value returned by RODBC::sqlQuery(), and not on the underlying XML content. Simply index the column and row value to get the specific XML content.
As an example, the code below reads an XML document (top 5 StackOverflow users in the R tag) into a dataframe, runs xmlParse to reproduce the error, and then runs another xmlParse call that resolves it.
Dataframe Build (replicating sqlQuery)
txt <- '<?xml version="1.0"?>
<stackoverflow>
<group lang="r">
<topusers>
<user>akrun</user>
<link>https://stackoverflow.com/users/3732271/akrun</link>
<location>Bengaluru, Karnataka, India</location>
<year_rep>15,900</year_rep>
<total_rep>328,573</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>dplyr</tag3>
</topusers>
<topusers>
<user>Dirk Eddelbuettel</user>
<link>https://stackoverflow.com/users/143305/dirk-eddelbuettel</link>
<location>Chicago, IL, United States </location>
<year_rep>5,588</year_rep>
<total_rep>253,481</total_rep>
<tag1>r</tag1>
<tag2>rcpp</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>42-</user>
<link>https://stackoverflow.com/users/1855677/42</link>
<location>Alameda, CA</location>
<year_rep>4,143</year_rep>
<total_rep>193,407</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>plot</tag3>
</topusers>
<topusers>
<user>A5C1D2H2I1M1N2O1R2T1</user>
<link>https://stackoverflow.com/users/1270695/a5c1d2h2i1m1n2o1r2t1</link>
<location>Chennai, India</location>
<year_rep>3,982</year_rep>
<total_rep>141,425</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>reshape</tag3>
</topusers>
<topusers>
<user>Gavin Simpson</user>
<link>https://stackoverflow.com/users/429846/gavin-simpson</link>
<location>Regina, Canada </location>
<year_rep>2,780</year_rep>
<total_rep>124,779</total_rep>
<tag1>r</tag1>
<tag2>plot</tag2>
<tag3>dataframe</tag3>
</topusers>
</group>
</stackoverflow>'
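# stringsAsFactors = FALSE keeps Col1 as character on R versions before 4.0
res <- data.frame(Col1 = txt, stringsAsFactors = FALSE)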
Error line
library(XML)
result1 <- xmlParse(res, asText=TRUE)
# Error: XML content does not seem to be XML: '1'
Resolved line (which yields no error)
# SINGLE XML
result1 <- xmlParse(res$Col1[[1]], asText=TRUE)
# MULTIPLE XML (ACROSS ALL ROWS)
result_list <- lapply(res$Col1, xmlParse, asText=TRUE)

R: how to load a small subset of a large zipped xml file?

I have a fairly large zipped XML file, such as myfile.xml.gz.
I would like to load a small subsample of the file, but I was not able to find an nrows option in either xml2::read_xml or XML::xmlTreeParse.
Trying to open the whole file directly just crashes my computer (the file is too big).
How can I just load a subset of the xml file into a dataframe?
Use xmlEventParse to read the XML in a SAX (event-driven) way.
Let's take the following xml file:
<items>
<item>
<id>l001</id>
<qty>1</qty>
<price>10</price>
</item>
<item>
<id>l002</id>
<qty>100</qty>
<price>10</price>
</item>
<item>
<id>l003</id>
<qty>5</qty>
<price>12</price>
</item>
[...]
</items>
We will use the event parser in "hybrid mode" to avoid loading everything into memory: each item is loaded as a small tree (using branches instead of handlers). Reusing https://stackoverflow.com/a/31014005/1992669, this gives:
library(XML)
input <- "input.xml"
items <- NULL
maxItems <- 50
# called once per <item> branch; appends the item's fields as one row
parseItem <- function(parser, node, ...) {
  children <- xmlChildren(node)
  items <<- rbind(items, sapply(children, xmlValue))
  # stop parsing once enough items have been collected
  if (nrow(items) == maxItems) {
    xmlStopParser(parser)
  }
}
# with XMLParserContextFunction, we get the parser as first parameter
# so we can call xmlStopParser
class(parseItem) <- c("XMLParserContextFunction", "SAXBranchFunction")
xmlEventParse(input,
  branches = list(item = parseItem),
  ignoreBlanks = TRUE
)
items <- as.data.frame(items)

How to get xml attribute value in R

Hi all,
I have loaded an XML file into R and want to extract an attribute value.
<espa_metadata version="2.0" xsi:schemaLocation="http://espa.cr.usgs.gov/v2 http://espa.cr.usgs.gov/schema/espa_internal_metadata_v2_0.xsd">
<global_metadata></global_metadata>
<bands>
<band product="cfmask" source="toa_refl" name="cfmask" category="qa" data_type="UINT8" nlines="7801" nsamps="7651" fill_value="255">
<percent_coverage>
<cover type="clear">40.35</cover>
<cover type="cloud">39.99</cover>
</percent_coverage>
</band>
</bands>
</espa_metadata>
I want to extract the value 39.99 for cover type="cloud".
I used the following approach but I only get "NULL"
library(XML)
data <- xmlParse("LC82030342015346LGN00.xml")
xpathApply(data, "//percent_coverage/cover[@type='cloud']", xmlValue)
Any ideas? Thank you in advance!
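One possible explanation for the empty result, offered as an assumption rather than a diagnosis: if the real file declares a default namespace on the root element, unprefixed names in an XPath expression will not match any element. The excerpt above does not show such a declaration, so the namespace URI in this sketch is a guess taken from the xsi:schemaLocation value, and the prefix e is arbitrary. The inline document is a trimmed stand-in for the real file.

```r
library(XML)

# Trimmed stand-in for the metadata file, with an ASSUMED default namespace
txt <- '<espa_metadata version="2.0" xmlns="http://espa.cr.usgs.gov/v2">
  <bands>
    <band product="cfmask" name="cfmask">
      <percent_coverage>
        <cover type="clear">40.35</cover>
        <cover type="cloud">39.99</cover>
      </percent_coverage>
    </band>
  </bands>
</espa_metadata>'
doc <- xmlParse(txt, asText = TRUE)

# Bind an arbitrary prefix to the (assumed) namespace URI and use it in the path
ns <- c(e = "http://espa.cr.usgs.gov/v2")
cloud <- xpathSApply(doc,
                     "//e:percent_coverage/e:cover[@type='cloud']",
                     xmlValue,
                     namespaces = ns)

# xmlValue returns text, so convert explicitly if a number is needed
cloud_pct <- as.numeric(cloud)
```

If the real file has no default namespace, the unprefixed path with @type should be enough on its own.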

How to create a sitemap.xml file using R and the {XML} package?

I have a vector of links from which I would like to create a sitemap.xml file (file protocol is available from here: http://www.sitemaps.org/protocol.html)
I understand the sitemap.xml protocol (it is rather simple), but I'm not sure what is the smartest way to use the {XML} package for it.
A simple example:
links <- c("http://r-statistics.com",
"http://www.r-statistics.com/on/r/",
"http://www.r-statistics.com/on/ubuntu/")
How can "links" be used to construct a sitemap.xml file?
Is something like this what you are looking for? (It uses the httr package to get the last-modified date and writes the XML directly with the very useful whisker package.)
require(whisker)
require(httr)
tpl <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{{#links}}
<url>
<loc>{{{loc}}}</loc>
<lastmod>{{{lastmod}}}</lastmod>
<changefreq>{{{changefreq}}}</changefreq>
<priority>{{{priority}}}</priority>
</url>
{{/links}}
</urlset>
'
links <- c("http://r-statistics.com", "http://www.r-statistics.com/on/r/", "http://www.r-statistics.com/on/ubuntu/")
map_links <- function(l) {
tmp <- GET(l)
d <- tmp$headers[['last-modified']]
list(loc=l,
lastmod=format(as.Date(d,format="%a, %d %b %Y %H:%M:%S")),
changefreq="monthly",
priority="0.8")
}
links <- lapply(links, map_links)
cat(whisker.render(tpl))
I could not use @jverzani's solution, because I wasn't able to create a valid XML file from the cat output. Thus I created an alternative.
## Input a data.frame with 4 columns: loc, lastmod, changefreq, and priority
## This data.frame is named sm in the code below
library(XML)
doc <- newXMLDoc()
root <- newXMLNode("urlset", doc = doc)
temp <- newXMLNamespace(root, "http://www.sitemaps.org/schemas/sitemap/0.9")
temp <- newXMLNamespace(root, "http://www.google.com/schemas/sitemap-image/1.1", "image")
for (i in seq_len(nrow(sm))) {
  urlNode <- newXMLNode("url", parent = root)
  newXMLNode("loc", sm$loc[i], parent = urlNode)
  newXMLNode("lastmod", sm$lastmod[i], parent = urlNode)
  newXMLNode("changefreq", sm$changefreq[i], parent = urlNode)
  newXMLNode("priority", sm$priority[i], parent = urlNode)
  rm(i, urlNode)
}
saveXML(doc, file="sitemap.xml")
rm(doc, root, temp)
browseURL("sitemap.xml")

Resources