How to loop through xml nodes in R - r

I have a requirement to split an xml document into multiple nodes; and then split each node separately into more sub nodes. I am using xpathSApply/getNodeSet functions in XML package. But it seems like once the xml document is split as nodes, each node is now considered as class "internal node" and hence cannot perform spath operations on it unless we save it as an xml using saveXML(). Any ideas on how this can be worked out without having to do a SAVEXML?
For example, consider sample xml below:
<array>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
</ResidentialProperty>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11211</StreetNumber>
<StreetName>11111334th</StreetName>
<StreetSuffix>Av1enue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2017-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pie2rce</County>
</GeographicData>
</ResidentialProperty>
</array>
I am intending to split the above into:
1. Two separate nodes with root ResidentialProperty
2. Then be able to perform XPATH operations on each of these nodes.
P.S: This is sample data and not similar to the actual data set I am working with. Just tried to use this to explain the problem I am trying to solve.

EDIT : I think I've misunderstood the question. New approach.
We use xpathApply, toString.XMLNode and xmlParseString to extract specific nodes in 2 objects.
Parse the XML file and exctract the nodes :
library(XML) :
doc=xmlParse("pathtoyourXML.xml")
result1=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[1]")))
result2=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[2]")))
We have 2 objects, we evaluate them with :
from.result1=xpathApply(result1,"//StreetAddress")
from.result2=xpathApply(result2,"//StreetAddress")
Sidenote : your XML is not valid. Listings elements are not closed.
EDIT 2 : In fact, you can use XPathApply on a previously "extracted" nodeset :
foo=xpathApply(doc,"(//ResidentialProperty)[2]")
xpathApply(foo[[1]],"//StreetAddress")
foo does not contain the result of the previous xpath expression ((//ResidentialProperty)[2]) but the whole XML nodeset.

Related

how to get list of Auto-IVC component output names

I'm switching over to using the Auto-IVC component as opposed to the IndepVar component. I'd like to be able to get a list of the promoted output names of the Auto-IVC component, so I can then use them to go and pull the appropriate value out of a configuration file and set the values that way. This will get rid of some boilerplate.
p.model._auto_ivc.list_outputs()
returns an empty list. It seems that p.model__dict__ has this information encoded in it, but I don't know exactly what is going on there so I am wondering if there is an easier way to do it.
To avoid confusion from future readers, I assume you meant that you wanted the promoted input names for the variables connected to the auto_ivc outputs.
We don't have a built-in function to do this, but you could do it with a bit of code like this:
seen = set()
for n in p.model._inputs:
src = p.model.get_source(n)
if src.startswith('_auto_ivc.') and src not in seen:
print(src, p.model._var_allprocs_abs2prom['input'][n])
seen.add(src)
assuming 'p' is the name of your Problem instance.
The code above just prints each auto_ivc output name followed by the promoted input it's connected to.
Here's an example of the output when run on one of our simple test cases:
_auto_ivc.v0 par.x

Why does my query report "Steps within a path expression must yield nodes"?

I'm relatively new to XQuery and I'm using a XML with the following format (MODSXML):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3
http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
<mods ID="ISI:000330282600027" version="3.0">
<titleInfo>
<title>{Minimum Relative Entropy for Quantum Estimation: Feasibility and General
Solution}</title>
</titleInfo>
I'm trying to retrieva all titles of the articles contained on the XML file. The expression I'm using is the following:
for $x in collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection/mods/titleInfo/title
return <title>$x/text()</title>
When I try to run this expression on Base, I get the following error:
"[XPTY0019] Steps within a path expression must yield nodes; xs:string
found."
Can anybody tell me what's wrong? The result I was expecting was a list with all the titles in the document.
Okay, problem solved in the BaseX Mailing List :D
I needed to declare the namespace. So now I'm using:
declare namespace v3 ="http://www.loc.gov/mods/v3";
for $doc in collection('ExemploBibtex')
where matches(document-uri($doc), 'quantuminformation.xml')
return $doc/v3:modsCollection/v3:mods/v3:titleInfo/v3:title/text()
And it works.
The problem is here:
collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection
This returns a string with content quantuminformation.xml for each file/root node in the ExemploBibtex collection, and then tries to perform an axis step on each of these strings -- which is not allowed.
It seems you want to access to document quantuminformation.xml within the collection ExemploBibtex. To open a specific file of a collection, use following syntax instead:
collection("ExemploBibtex/quantuminformation.xml")/modsCollection
I cut of the last axis steps for readability and keeping the code lines short; simply add them again, they're fine.

Sequencefiles which map a single key to multiple values

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of Sequencefile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.
Since I can't find anything about a builtin SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you need to).
You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.
DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}
B = FOREACH (GROUP xml BY filename)
GENERATE group AS filename, xml.xml_data AS all_xml_data ;
Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:
#outputSchema('xml_complete: chararray')
def stringify(bag):
delim = ''
return delim.join(bag)
NOTE: It is important to realize that this way the order of the xml data will become jumbled. If possible based on your data, stringify can maybe be expanded upon the reorgize it.

How to read a Gallio Report

I'm trying to read or redirect the test console output. I'm doing an AssertAreEqualIgnoringOrder and I need to parse some of the values within the failures. The failures output looks like this:
Expected elements to be equal but possibly in a different order.
Ensure the list of idols matches with the db
Equal Elements : ["98932", "670945", "6747749", "6770804", "7110604", "13280109", "13280121", "13280149", "14448042", "14448336", "15726213", "17009409", "17245584", "93123", "2212314", "10129661", "13280123", "13280125", "13280135", "13280144", "17245263", "18784003", "1112597", "2885514", "8505390", "13279857", "15032800", "17009391", "17009396", "17009398", "17880635", "18340462", "3606775", "13280116", "13280133", "13280137", "14448341", "15050039", "16711731", "17008920", "17009377", "17009381", "17009402", "17245606", "17901335", "865871", "6029748", "17009372", "17009386", "17009406", "17245604", "19113286", "19865372"]
Excess Elements : ["14419207"]
Missing Elements : ["17241620"]
How can I read this directly or redirect it to a file? If I can just parse the output, I can work it afterwards with some RegEx, that will not be a problem.
Another solution would be to parse the Gallio report file, but I'm sure there is a faster way.
Can you help me, please?
Thanks,
Andrei

How do I find/use op:except with the multiple xml files?

How do I find/use op:except with the multiple xml files?
I've gotten the nodes from file 1 and file 2, and in the xquery expression I'm tring to find the op:except of those two. When I use op:except, I end up getting an empty set.
XML File 1:
<a>txt</a>
<a>txt2</a>
<a>txt3</a>
XML File 2:
<a>txt2</a>
<a>txt4</a>
<a>txt3</a>
I want output from op:($nodesfromfile1, $nodesfromfile2) to be:
<a>txt</a>
It effectively comes down to the single line behind the return in the following code. You could put that in a function if you like, but it is already very dense, so maybe not worth it..
let $file1 := (
<a>txt</a>,
<a>txt2</a>,
<a>txt3</a>
)
let $file2 := (
<a>txt2</a>,
<a>txt4</a>,
<a>txt3</a>
)
return
$file1[not(. = $file2)]
Note, you also have the 'except' keyword ($file1 except $file2), but that works on node identity which won't work if the nodes comes from different files.
By the way, above code uses string-equality for comparison. If you would prefer to do a comparison on full node-structure, you could also use the deep-equal() function.
HTH!

Resources