Reading XML data from an Oracle table column and parsing it in R - r

I have a scenario in R.
I have connected an Oracle database to R through the RODBC package, and one column of a table contains XML data. When I use the xmlParse function on it, it shows the error "XML content does not seem to be XML", and class(xmldata) is "data.frame".
When I copy the XML data into a new XML file and parse that file with xmlParse, it parses correctly and class(sourcefile) is "XMLInternalDocument".

The error is raised because you are running XML::xmlParse on the data frame returned by RODBC::sqlQuery(), not on the underlying XML content. Simply index the column and row to get the specific XML string.
As an example, the code below reads an XML document (top 5 StackOverflow users in the R tag) into a data frame, runs xmlParse on the data frame to reproduce the error, and then runs another xmlParse call that resolves it.
Dataframe Build (replicating sqlQuery)
txt <- '<?xml version="1.0"?>
<stackoverflow>
<group lang="r">
<topusers>
<user>akrun</user>
<link>https://stackoverflow.com/users/3732271/akrun</link>
<location>Bengaluru, Karnataka, India</location>
<year_rep>15,900</year_rep>
<total_rep>328,573</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>dplyr</tag3>
</topusers>
<topusers>
<user>Dirk Eddelbuettel</user>
<link>https://stackoverflow.com/users/143305/dirk-eddelbuettel</link>
<location>Chicago, IL, United States </location>
<year_rep>5,588</year_rep>
<total_rep>253,481</total_rep>
<tag1>r</tag1>
<tag2>rcpp</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>42-</user>
<link>https://stackoverflow.com/users/1855677/42</link>
<location>Alameda, CA</location>
<year_rep>4,143</year_rep>
<total_rep>193,407</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>plot</tag3>
</topusers>
<topusers>
<user>A5C1D2H2I1M1N2O1R2T1</user>
<link>https://stackoverflow.com/users/1270695/a5c1d2h2i1m1n2o1r2t1</link>
<location>Chennai, India</location>
<year_rep>3,982</year_rep>
<total_rep>141,425</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>reshape</tag3>
</topusers>
<topusers>
<user>Gavin Simpson</user>
<link>https://stackoverflow.com/users/429846/gavin-simpson</link>
<location>Regina, Canada </location>
<year_rep>2,780</year_rep>
<total_rep>124,779</total_rep>
<tag1>r</tag1>
<tag2>plot</tag2>
<tag3>dataframe</tag3>
</topusers>
</group>
</stackoverflow>'
res <- data.frame(Col1 = txt)
Error line
result1 <- xmlParse(res, asText=TRUE)
# Error: XML content does not seem to be XML: '1'
Resolved line (which yields no error)
# SINGLE XML
result1 <- xmlParse(res$Col1[[1]], asText=TRUE)
# MULTIPLE XML (ACROSS ALL ROWS)
result_list <- lapply(res$Col1, xmlParse, asText=TRUE)
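Applied back to the Oracle scenario in the question, a minimal sketch; the DSN, credentials, table and column names below are placeholders, and if the column is an Oracle XMLTYPE the query may need a conversion such as getClobVal() so that ODBC hands back plain text:
# Hedged sketch for the Oracle scenario; DSN, credentials, table and column
# names are placeholders, not from the original question.
library(RODBC)
library(XML)

con     <- odbcConnect("my_dsn", uid = "user", pwd = "pass")
xmldata <- sqlQuery(con, "SELECT xml_col FROM my_table", stringsAsFactors = FALSE)
odbcClose(con)

# parse a single row (as.character() guards against factor columns)
doc  <- xmlParse(as.character(xmldata$XML_COL[[1]]), asText = TRUE)

# parse every row
docs <- lapply(xmldata$XML_COL, function(x) xmlParse(as.character(x), asText = TRUE))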

Related

R read in largish XML file containing multiple tables

I want to efficiently read an XML file (200 MB in size) consisting of multiple tables.
Sketch of the structure:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:od="urn:schemas-microsoft-com:officedata">
<xsd:schema>
<xsd:element name="dataroot">
<xsd:element name="dataroot">
</xsd:schema>
<dataroot>
<TABLE1>
[here comes the data]
</TABLE1>
<TABLE2>
[here comes the data]
</TABLE2>
...
</dataroot>
</root>
How do I read all the tables, or specific ones, into a data.frame? And what is the most efficient way to do so?
Perhaps as a starter:
library(XML)
library(data.table)

xmldoc <- xmlParse("data.xml")
# grab every TABLE1 node under dataroot
d <- getNodeSet(xmldoc, "//dataroot//TABLE1")
size <- xmlSize(d)
# one row per TABLE1 node, one column per child element
dt <- rbindlist(lapply(seq_len(size), function(i) {
  as.list(getChildrenStrings(d[[i]]))
}), fill = TRUE)
This works OK but is not particularly fast. How can I do this with xml2? The package docs are not particularly enlightening.
Also, I want to loop over all tables but couldn't figure out the XPath for that.
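For the xml2 part of the question, a rough sketch under the assumption that, as in the structure above, dataroot and the table elements are not in a default namespace; whether it is actually faster than the XML/getNodeSet version would need benchmarking:
# Rough xml2 + data.table sketch; assumes no default namespace on dataroot.
library(xml2)
library(data.table)

doc <- read_xml("data.xml")

# every table node sits directly under <dataroot>
table_nodes <- xml_find_all(doc, "//dataroot/*")
table_names <- unique(xml_name(table_nodes))

# build one data.table per table name
tables <- lapply(setNames(table_names, table_names), function(nm) {
  nodes <- xml_find_all(doc, paste0("//dataroot/", nm))
  rbindlist(lapply(nodes, function(node) {
    kids <- xml_children(node)
    as.list(setNames(xml_text(kids), xml_name(kids)))
  }), fill = TRUE)
})

# tables$TABLE1, tables$TABLE2, ...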

R removing duplicate siblings in xml data

I am working on a bugs XML data set:
</short_desc>
<report id="322231">
<update>
<when>1136281841</when>
<what>When uploading a objectice-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
<update>
<when>1136420901</when>
<what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
</report>
</short_desc>
I am creating a data frame from the above XML data, keeping only the <when> and <what> node data. Because of duplicate content in the <what> nodes, I wish to keep only the last (most recent) node when the content of <what> in both <update> elements is similar; the comparison was supposed to use cosine similarity in R. If the data in the <what> nodes differs, I want to keep both rows in the data frame. Please suggest an approach; there are cases where a single <report> has more than two updates with approximately similar text.
try the following...
library(xml2)
sample data
doc <- read_xml( '<report id="322231">
<update>
<when>1136281841</when>
<what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
<update>
<when>1136420901</when>
<what>When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream</what>
</update>
</report>')
code
# create a nodeset with all 'what' nodes
what.nodes <- xml_find_all( doc, ".//what" )

# now make a data.frame
df <- data.frame(
  # get the report attribute "id" by retracing the ancestor tree from the what-nodes
  report_id = xml_attr( xml_find_first( what.nodes, ".//ancestor::report" ), "id" ),
  # get the preceding sibling 'when' for each what-node
  when = xml_text( xml_find_first( what.nodes, ".//preceding-sibling::when" ) ),
  # get the 'what' text itself
  what = xml_text( what.nodes ),
  # keep strings as character, not factors
  stringsAsFactors = FALSE )

# keep rows with unique 'what' values, counting from the bottom up
df[ !duplicated( df$what, fromLast = TRUE ), ]
output
# report_id when what
# 2 322231 1136420901 When uploading a objective-c++ file (.mm) bugzilla sets the MIME type as application/octet-stream
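duplicated() only removes exact matches. For the "approximately similar" updates the question mentions, a possible follow-up sketch (assuming the stringdist package is installed; the 0.9 threshold and the 2-gram size are arbitrary choices, and on real data the comparison should be restricted to rows from the same report_id):
# Near-duplicate filter sketch: keep a row only if no *later* row has
# highly similar <what> text (cosine similarity on character 2-grams).
library(stringdist)

keep <- rep(TRUE, nrow(df))
for (i in seq_len(nrow(df) - 1)) {
  later <- (i + 1):nrow(df)
  sims  <- stringsim(df$what[i], df$what[later], method = "cosine", q = 2)
  if (any(sims > 0.9)) keep[i] <- FALSE   # 0.9 threshold is arbitrary
}
df[keep, ]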

XQuery HTML formatting

I'm new to XQuery. I need to rewrite an API response into a custom XML format.
Input file format:
<root>
  <_1>
    <dataType>
      <name>XVar(Osmo [mOsmol/kg])</name>
      <term>M185</term>
      <type>XVar</type>
    </dataType>
    <values>305</values>
    <values>335</values>
  </_1>
  <_2>
    <dataType>
      <name>XVar(DO (2) [%])</name>
      <term>M199</term>
      <type>XVar</type>
    </dataType>
    <values>12</values>
    <values>33</values>
  </_2>
  <_3>
    <dataType>
      <name>Maturity</name>
      <type>Maturity</type>
    </dataType>
    <values>0</values>
    <values>0.73600054</values>
  </_3>
</root>
Expected output:
<element>
  <XVar(Osmo [mOsmol/kg])>305</XVar(Osmo [mOsmol/kg])>
  <XVar(Osmo [mOsmol/kg])>335</XVar(Osmo [mOsmol/kg])>
  <XVar(DO (2) [%])>12</XVar(DO (2) [%])>
  <XVar(DO (2) [%])>33</XVar(DO (2) [%])>
  <Maturity>0</Maturity>
  <Maturity>0.73600054</Maturity>
</element>
The number of nodes (dataType -> name) will vary in each input file, and the values will also be dynamic.
I am currently using the code below.
let $input := /root
for $i in $input//values
return
  <element>
    <name>{$i/../dataType/name/text()}</name>
    <values>{$i/text()}</values>
  </element>
but all the data comes back wrapped in <name> and <values> elements. My requirement is to use {$i/../dataType/name/text()} as the element name and {$i/text()} as its value; for the sample input file there should ideally be three different element names with their values.
Can anyone help me with this?

R: how to load a small subset of a large zipped xml file?

I have a fairly large zipped XML file, such as myfile.xml.gz.
I would like to load a small subsample of the file, but I could not find an nrows-like option in either xml2::read_xml or XML::xmlTreeParse.
Trying to open the whole file directly just crashes my computer (the file is too big).
How can I load just a subset of the XML file into a data frame?
Use xmlEventParse to read the XML in a SAX (streaming) way.
Let's take the following XML file:
<items>
<item>
<id>l001</id>
<qty>1</qty>
<price>10</price>
</item>
<item>
<id>l002</id>
<qty>100</qty>
<price>10</price>
</item>
<item>
<id>l003</id>
<qty>5</qty>
<price>12</price>
</item>
[...]
</items>
We will use the event parser in "hybrid mode" to avoid loading everything into memory, reading each item as a small tree (using branches instead of handlers). Reusing https://stackoverflow.com/a/31014005/1992669, this gives:
library(XML)

input <- "input.xml"
items <- NULL
maxItems <- 50

parseItem <- function(parser, node, ...) {
  children <- xmlChildren(node)
  items <<- rbind(items, sapply(children, xmlValue))
  if (nrow(items) == maxItems) {
    xmlStopParser(parser)
  }
}
# with XMLParserContextFunction, we get the parser as first parameter
# so we can call xmlStopParser
class(parseItem) <- c("XMLParserContextFunction", "SAXBranchFunction")

xmlEventParse(input,
              branches = list(item = parseItem),
              ignoreBlanks = TRUE)
items <- as.data.frame(items)
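For the gzipped file in the original question: libxml2, which the XML package uses, can usually read .gz files directly when built with zlib support, so a usage sketch might look like the following (the branch name 'item' stands in for whatever the record element is called in the real file; if direct reading fails, decompress first, e.g. with R.utils::gunzip()):
# Usage sketch for the gzipped file; names are placeholders for the real file.
input <- "myfile.xml.gz"   # libxml2 often reads .gz transparently
items <- NULL              # reset the accumulator before re-running
maxItems <- 1000           # keep only the first 1000 records

xmlEventParse(input,
              branches = list(item = parseItem),   # replace 'item' with the real record element name
              ignoreBlanks = TRUE)
head(as.data.frame(items))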

How to get xml attribute value in R

Hi all,
I have loaded an XML file into R and want to extract an attribute value.
<espa_metadata version="2.0" xsi:schemaLocation="http://espa.cr.usgs.gov/v2 http://espa.cr.usgs.gov/schema/espa_internal_metadata_v2_0.xsd">
<global_metadata></global_metadata>
<bands>
<band product="cfmask" source="toa_refl" name="cfmask" category="qa" data_type="UINT8" nlines="7801" nsamps="7651" fill_va.lue="255">
<percent_coverage>
<cover type="clear">40.35</cover>
<cover type="cloud">39.99</cover>
</percent_coverage>
</band>
</bands>
</espa_metadata>
I want to extract the value 39.99 for cover type="cloud".
I used the following approach, but I only get NULL:
library(XML)
data <- xmlParse("LC82030342015346LGN00.xml")
xpathApply(data,"//percent_coverage/cover[#type='cloud']" , xmlValue)
Any ideas? Thank you in advance!
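One hedged suggestion: XPath selects attributes with @, not #, and if the real file declares a default namespace (the schemaLocation above points at http://espa.cr.usgs.gov/v2), it has to be registered under a prefix:
library(XML)
data <- xmlParse("LC82030342015346LGN00.xml")

# '@' selects an attribute in an XPath predicate
xpathSApply(data, "//percent_coverage/cover[@type='cloud']", xmlValue)

# if the document uses a default namespace, register it and prefix the path
# (the URI is taken from the schemaLocation above and may differ in your file)
# ns <- c(espa = "http://espa.cr.usgs.gov/v2")
# xpathSApply(data, "//espa:percent_coverage/espa:cover[@type='cloud']",
#             xmlValue, namespaces = ns)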
