Hi all,
I have loaded an XML file into R and want to extract the value of an element selected by its attribute.
<espa_metadata version="2.0" xsi:schemaLocation="http://espa.cr.usgs.gov/v2 http://espa.cr.usgs.gov/schema/espa_internal_metadata_v2_0.xsd">
<global_metadata></global_metadata>
<bands>
<band product="cfmask" source="toa_refl" name="cfmask" category="qa" data_type="UINT8" nlines="7801" nsamps="7651" fill_value="255">
<percent_coverage>
<cover type="clear">40.35</cover>
<cover type="cloud">39.99</cover>
</percent_coverage>
</band>
</bands>
</espa_metadata>
I want to extract the value 39.99 for cover type="cloud".
I used the following approach, but I only get "NULL":
library(XML)
data <- xmlParse("LC82030342015346LGN00.xml")
xpathApply(data, "//percent_coverage/cover[@type='cloud']", xmlValue)
Any ideas? Thank you in advance!
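A hedged sketch of one thing to check, not a confirmed diagnosis: an empty result from a query like this is often caused by a default namespace on the root element. The inline document below is a trimmed stand-in for the real file, and its xmlns declaration is an assumption (the excerpt above does not show one). Matching on local-name() sidesteps the namespace either way, and the attribute test uses the XPath @type syntax.

```r
library(XML)

# Trimmed stand-in for the real file; the default namespace is an assumption.
txt <- '<espa_metadata version="2.0" xmlns="http://espa.cr.usgs.gov/v2">
  <bands>
    <band product="cfmask" name="cfmask">
      <percent_coverage>
        <cover type="clear">40.35</cover>
        <cover type="cloud">39.99</cover>
      </percent_coverage>
    </band>
  </bands>
</espa_metadata>'
doc <- xmlParse(txt, asText = TRUE)

# local-name() matches the element whether or not it sits in a namespace;
# [@type='cloud'] filters on the attribute value.
cloud <- xpathSApply(doc, "//*[local-name()='cover'][@type='cloud']", xmlValue)

# xmlValue returns text, so convert explicitly if a number is needed.
cloud_pct <- as.numeric(cloud)
```

If the namespace URI is known, registering a prefix through the namespaces argument of xpathSApply is an alternative to local-name().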
I want to efficiently read in an XML file (200 MB in size) consisting of multiple tables.
Sketch of the Structure:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:od="urn:schemas-microsoft-com:officedata">
<xsd:schema>
<xsd:element name="dataroot">
</xsd:element>
</xsd:schema>
<dataroot>
<TABLE1>
[here comes the data]
</TABLE1>
<TABLE2>
[here comes the data]
</TABLE2>
...
</dataroot>
</root>
How do I read in all or specific tables into a data.frame? And what is probably the most efficient way to do so?
Perhaps as a starter:
library(XML)
library(data.table)
xmldoc <- xmlParse("data.xml")
d <- getNodeSet(xmldoc, "//dataroot//TABLE1")
size <- xmlSize(d)
dt <- rbindlist(lapply(1:size, function(i) {
as.list(getChildrenStrings(d[[i]]))
}), fill = TRUE)
This works OK but is not particularly fast. How can I do this with xml2? The package docs are not particularly enlightening.
Also, I want to loop over all tables but couldn't figure out the XPath for it.
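A minimal xml2 sketch of the same idea, under stated assumptions: every child of dataroot is one record named after its table, and all records of a table carry the same fields. The inline document and its field names (id, name, code) are hypothetical stand-ins for data.xml. No speed claim is made here; it only shows the xml2 calls and how to loop over all tables at once.

```r
library(xml2)

# Hypothetical stand-in for data.xml: each <TABLEn> element is one record.
txt <- '<root><dataroot>
  <TABLE1><id>1</id><name>a</name></TABLE1>
  <TABLE1><id>2</id><name>b</name></TABLE1>
  <TABLE2><code>x</code></TABLE2>
</dataroot></root>'
doc <- read_xml(txt)

# All records of all tables: every element directly under <dataroot>.
rows <- xml_find_all(doc, "//dataroot/*")

# Group record positions by element name, then build one data.frame per table.
tables <- lapply(split(seq_along(rows), xml_name(rows)), function(idx) {
  recs <- lapply(idx, function(i) {
    fields <- xml_children(rows[[i]])
    # One-row data.frame: column names from tag names, values from text.
    as.data.frame(setNames(as.list(xml_text(fields)), xml_name(fields)),
                  stringsAsFactors = FALSE)
  })
  do.call(rbind, recs)
})
```

The result is a named list, so tables$TABLE1 is the data.frame for that table. To read only one table, xml_find_all(doc, "//dataroot/TABLE1") selects just its records.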
I have the following scenario in R.
I have connected to an Oracle database from R through the RODBC package, and one column of a table contains XML data. When I call xmlParse on the query result, it fails with the error "XML content does not seem to be XML", and class(xmldata) is data.frame.
When I copy the XML data into a new XML file and parse that with xmlParse, it is parsed correctly and class(sourcefile) is XMLInternalDocument.
The error is raised because you are running XML::xmlParse on a data frame, which is what RODBC::sqlQuery() returns, and not on the underlying XML content. Simply index the column and row to get the specific XML content.
As an example, the code below reads an XML document (top 5 StackOverflow users in the R tag) into a data frame, runs xmlParse on the data frame to reproduce the error, and then runs xmlParse on the indexed value to resolve it.
Dataframe Build (replicating sqlQuery)
txt <- '<?xml version="1.0"?>
<stackoverflow>
<group lang="r">
<topusers>
<user>akrun</user>
<link>https://stackoverflow.com/users/3732271/akrun</link>
<location>Bengaluru, Karnataka, India</location>
<year_rep>15,900</year_rep>
<total_rep>328,573</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>dplyr</tag3>
</topusers>
<topusers>
<user>Dirk Eddelbuettel</user>
<link>https://stackoverflow.com/users/143305/dirk-eddelbuettel</link>
<location>Chicago, IL, United States </location>
<year_rep>5,588</year_rep>
<total_rep>253,481</total_rep>
<tag1>r</tag1>
<tag2>rcpp</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>42-</user>
<link>https://stackoverflow.com/users/1855677/42</link>
<location>Alameda, CA</location>
<year_rep>4,143</year_rep>
<total_rep>193,407</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>plot</tag3>
</topusers>
<topusers>
<user>A5C1D2H2I1M1N2O1R2T1</user>
<link>https://stackoverflow.com/users/1270695/a5c1d2h2i1m1n2o1r2t1</link>
<location>Chennai, India</location>
<year_rep>3,982</year_rep>
<total_rep>141,425</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>reshape</tag3>
</topusers>
<topusers>
<user>Gavin Simpson</user>
<link>https://stackoverflow.com/users/429846/gavin-simpson</link>
<location>Regina, Canada </location>
<year_rep>2,780</year_rep>
<total_rep>124,779</total_rep>
<tag1>r</tag1>
<tag2>plot</tag2>
<tag3>dataframe</tag3>
</topusers>
</group>
</stackoverflow>'
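# stringsAsFactors = FALSE keeps the column as character (the default before R 4.0 was factor)
res <- data.frame(Col1 = txt, stringsAsFactors = FALSE)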
Error line
result1 <- xmlParse(res, asText=TRUE)
# Error: XML content does not seem to be XML: '1'
Resolved line (which yields no error)
# SINGLE XML
result1 <- xmlParse(res$Col1[[1]], asText=TRUE)
# MULTIPLE XML (ACROSS ALL ROWS)
result_list <- lapply(res$Col1, xmlParse, asText=TRUE)
I am trying to extract the value of the attribute "NAME" from an XML document in R.
<NN ID_NAME="107232" ID_NTYP="6" NAME="dSpace_ECat1Error.STS" KOMMENTAR="dSpace_ECat1Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/><NN ID_NAME="107233" ID_NTYP="6" NAME="dSpace_ECat2Error.STS" KOMMENTAR="dSpace_ECat2Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/>
The result should be like this:
dSpace_ECat1Error.STS
dSpace_ECat2Error.STS
I use this function:
xpathSApply(root, "//NN[@NAME]", xmlValue)
But as a result, I just get empty strings ("").
What have I done wrong?
Thanks in advance!
I just found the solution:
erg <- xpathSApply(root, "//NN", xmlGetAttr, 'NAME')
There should be a better tutorial for this particular XML function in R.
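For future readers, a short sketch of why the first attempt returned empty strings: //NN[@NAME] selects the NN elements that have a NAME attribute, and xmlValue then returns their text content, which is empty for these self-closing elements. The wrapper element root in the inline document is an assumption added so that the two NN fragments form one well-formed document.

```r
library(XML)

# The two NN elements from the question, wrapped in an assumed <root> element.
txt <- '<root>
  <NN ID_NAME="107232" ID_NTYP="6" NAME="dSpace_ECat1Error.STS"/>
  <NN ID_NAME="107233" ID_NTYP="6" NAME="dSpace_ECat2Error.STS"/>
</root>'
root <- xmlRoot(xmlParse(txt, asText = TRUE))

# Text content of the elements: empty, because NN has no child text.
text_content <- xpathSApply(root, "//NN[@NAME]", xmlValue)

# Option 1: select the elements and read the attribute from each.
by_getattr <- xpathSApply(root, "//NN", xmlGetAttr, "NAME")

# Option 2: select the attribute nodes directly in XPath;
# as.character() drops the names attached to the result.
by_xpath <- as.character(xpathSApply(root, "//NN/@NAME"))
```

Both options should give the two NAME values; the second keeps the whole selection inside the XPath expression.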
Given the XML below, what would be the proper SQL XQuery to retrieve the SubscriberStatus where the SubscriberID is empty? The XML is stored in a column with the XML data type.
<ObjectEntry>
<Key>Key1</Key>
<DicValue>
<ObjectEntry>
<Key>SubscriberStatus</Key>
<Value xsi:type="xsd:string">Active</Value>
<DicValue />
</ObjectEntry>
<ObjectEntry>
<Key>SubscriberID</Key>
<Value xsi:type="xsd:string" />
<DicValue />
</ObjectEntry>
</DicValue>
</ObjectEntry>
Try this:
If $node holds your XML fragment, then
$node//ObjectEntry[DicValue/ObjectEntry[Key eq "SubscriberStatus"] and DicValue/ObjectEntry[Key eq "SubscriberID"][Value ne ""]]
will give you back the parent ObjectEntry for the non-empty SubscriberIDs.
This is a simple XPath expression; there is no need for true XQuery, since XPath is a subset of XQuery. Given that you want the <Value/> element of the SubscriberStatus, you can get it like this:
//ObjectEntry/DicValue[ObjectEntry[Key = "SubscriberID"]/Value = ""]/ObjectEntry[Key = "SubscriberStatus"]/Value
This fetches all ObjectEntries that have an empty SubscriberID and then navigates to the SubscriberStatus. If you just want the actual string, you can append /string().
Thanks for the suggestions! Unfortunately they didn't do what I was asking, but did help me get the right syntax. Here's a solution that works.
select Request.query('//ObjectEntry/DicValue/ObjectEntry[Key = "SubscriberStatus"]/Value') as SubscriberStatus
from RequestLog
where
Request.exist('//ObjectEntry/DicValue[ObjectEntry[Key = "SubscriberID" and Value = ""]]') = 1
I want to get a NodeSet, with the getNodeSet function from the XML package, and write it as text in a file.
For example:
> getNodeSet(htmlParse("http://www.google.fr/"), "//div[@id='hplogo']")[[1]]
<div title="Google" align="left" id="hplogo" onload="window.lol&&lol()" style="height:110px;width:276px;background:url(/images/srpr/logo9w.png) no-repeat">
<div nowrap="" style="color:#777;font-size:16px;font-weight:bold;position:relative;top:70px;left:218px">France</div>
</div>
I want to save all this node unchanged in a file.
The problem is that we can't write the object directly with:
writeLines(getNodeSet(...), file)
And as.character(getNodeSet(...)) returns a C pointer.
How can I do this? Thank you.
To save an XML object to a file, use saveXML, e.g.,
url = "http://www.google.fr/"
nodes = getNodeSet(htmlParse(url), "//div[@id='hplogo']")[[1]]
fl <- saveXML(nodes, tempfile())
readLines(fl)
There has to be a better way; until then, you can capture what the print method for an XMLNode outputs:
nodes <- getNodeSet(...)
sapply(nodes, function(x) paste(capture.output(print(x)), collapse = ""))
I know this might be a bit outdated, but I ran into the same problem and wanted to leave this for future reference. After some searching and struggling, the answer is as simple as:
htmlnodes <- toString(nodes)
writeLines(htmlnodes, file)
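Pulling the suggestions above together, here is a self-contained sketch that can be run without fetching a live page. The inline HTML is a simplified, hypothetical stand-in for the Google markup, and the output path comes from tempfile(). It relies on saveXML returning the serialized node as a string when no file is given, which writeLines can then write.

```r
library(XML)

# Simplified, hypothetical stand-in for the downloaded page.
html <- '<html><body>
  <div id="hplogo" title="Google"><div>France</div></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# [[1]] picks the first matching node out of the node set.
node <- getNodeSet(doc, "//div[@id='hplogo']")[[1]]

# Serialize the node to text, then write it with the base R writeLines().
out <- tempfile(fileext = ".html")
writeLines(saveXML(node), out)

# Read the file back to confirm that the markup was written.
saved <- paste(readLines(out), collapse = "")
```

For several nodes, sapply(nodes, saveXML) would give one string per node, which writeLines can write as separate lines.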