How to create XML from .csv properly? - r

I would like to create a XML file from a .csv file. I have some difficulties to get the desired structure:
<?xml version="1.0" encoding="UTF-8"?>
<document>
<employee ID="1">
<Name>Steve</Name>
<City>Boston</City>
<Age>33</Age>
</employee>
<employee ID="2">
<Name>Michael</Name>
<City>Dallas</City>
<Age>45</Age>
</employee>
<employee ID="3">
<Name>John</Name>
<City>New York</City>
<Age>89</Age>
</employee>
<employee ID="4">
<Name>Thomas</Name>
<City>LA</City>
<Age>62</Age>
</employee>
<employee ID="5">
<Name>Clint</Name>
<City>Paris</City>
<Age>30</Age>
</employee>
</document>
What I have tried:
library(XML)
# Some data
df <-
read.csv(textConnection('"ID","Name","City","Age"
"1","Steve","Boston",33
"2","Michael","Dallas",45
"3","John","New York",89
"4","Thomas","LA",62
"5","Clint","Paris",30'),
as.is=TRUE)
xml <- xmlTree()
xml$addTag("document", close=FALSE)
for (i in 1:nrow(df)) {
xml$addTag("employee", close=FALSE)
for (j in names(df)) {
xml$addTag(j, df[i, j])
}
xml$closeTag()
}
xml$closeTag()
This looks almost as desired, but ID is a child element of employee rather than an attribute on the employee tag, and the encoding is missing from the XML declaration:
<?xml version="1.0"?>
<document>
<employee>
<ID>1</ID>
<Name>Steve</Name>
<City>Boston</City>
<Age>33</Age>
</employee>
<employee>
<ID>2</ID>
<Name>Michael</Name>
<City>Dallas</City>
<Age>45</Age>
</employee>
<employee>
<ID>3</ID>
<Name>John</Name>
<City>New York</City>
<Age>89</Age>
</employee>
<employee>
<ID>4</ID>
<Name>Thomas</Name>
<City>LA</City>
<Age>62</Age>
</employee>
<employee>
<ID>5</ID>
<Name>Clint</Name>
<City>Paris</City>
<Age>30</Age>
</employee>
</document>

You can use addNode instead of addTag. They are identical:
> identical(xml$addTag, xml$addNode)
[1] TRUE
so it's a matter of preference. You can pass an attrs argument to add the ID attribute, and you can set the encoding when you save the file:
library(XML)
df <-
read.csv(textConnection('"ID","Name","City","Age"
"1","Steve","Boston",33
"2","Michael","Dallas",45
"3","John","New York",89
"4","Thomas","LA",62
"5","Clint","Paris",30'),
as.is=TRUE)
xml <- xmlTree("document")
for (i in 1:nrow(df)) {
xml$addNode("employee", attrs = c(ID = df[i,"ID"]), close = FALSE)
appNames <- names(df)[names(df) != "ID"]
for (j in appNames) {
xml$addNode(j, df[i, j])
}
xml$closeNode()
}
xml$closeNode()
saveXML(xml$doc(), "text.xml", encoding = "UTF-8")
xmlParse("text.xml")
<?xml version="1.0" encoding="UTF-8"?>
<document>
<employee ID="1">
<Name>Steve</Name>
<City>Boston</City>
<Age>33</Age>
</employee>
<employee ID="2">
<Name>Michael</Name>
<City>Dallas</City>
<Age>45</Age>
</employee>
<employee ID="3">
<Name>John</Name>
<City>New York</City>
<Age>89</Age>
</employee>
<employee ID="4">
<Name>Thomas</Name>
<City>LA</City>
<Age>62</Age>
</employee>
<employee ID="5">
<Name>Clint</Name>
<City>Paris</City>
<Age>30</Age>
</employee>
</document>
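For comparison, the same structure can be sketched with the xml2 package instead of XML. This is a minimal sketch rather than the method used above: it assumes xml2 is installed and uses a shortened two-row data frame. In xml_add_child(), named arguments become attributes and unnamed arguments become the node text.

```r
library(xml2)

# Shortened sample data (two rows of the original five)
df <- data.frame(ID = c(1, 2),
                 Name = c("Steve", "Michael"),
                 City = c("Boston", "Dallas"),
                 Age = c(33, 45),
                 stringsAsFactors = FALSE)

doc <- xml_new_root("document")
for (i in seq_len(nrow(df))) {
  # The named argument ID becomes the attribute ID="..."
  emp <- xml_add_child(doc, "employee", ID = as.character(df$ID[i]))
  for (col in setdiff(names(df), "ID")) {
    # The unnamed argument becomes the text of the new child element
    xml_add_child(emp, col, as.character(df[i, col]))
  }
}

# Write to a temporary file with the encoding in the XML declaration
out_file <- tempfile(fileext = ".xml")
write_xml(doc, out_file, encoding = "UTF-8")
```

Reading out_file back with read_xml() is a quick way to check the result.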

Related

Remove or filter XML nodes by Xpaths from file in R

I have very large, complex XML files (like this one: https://github.com/HL7/C-CDA-Examples/blob/master/General/Parent%20Document%20Replace%20Relationship/CCD%20Parent%20Document%20Replace%20(C-CDAR2.1).xml ) to process, but I only need the attributes and values at particular XPaths (nodes). Removing the unneeded nodes before detailed processing should cut processing time.
So far I have tried using xml_remove:
xmlfile <- paste0(dir,"xmlFiles/",filelist[k])
file<-read_xml(xmlfile)
file<-xml_ns_strip(file)
for(counx in 1:nrow(xpathTable)){
xr <- xml_find_all(file, xpath =paste0('/',toString(xpathTable$xpaths[counx])) )
xml_remove(xr, free = TRUE)
file<-file
}
This works well for removing a few nodes but crashes as the number goes up (>100).
Below is an example of what I want to get:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
<ISBN>
<Random>12354</Random>
</ISBN>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<ISBN>
<Random>12345</Random>
</ISBN>
<price>39.95</price>
</book>
</bookstore>
Filter by XPaths
/bookstore/book/title
/bookstore/book/year
/bookstore/book/ISBN/Random
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<year>2005</year>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<year>2005</year>
<ISBN>
<Random>12354</Random>
</ISBN>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<year>2003</year>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<year>2003</year>
<ISBN>
<Random>12345</Random>
</ISBN>
</book>
</bookstore>
Looks like an XQuery job, e.g. you could recreate your document like this
<bookstore>{
for $book in /bookstore/*
return <book category="{$book/@category}">
{$book/title}
{$book/year}
{$book/ISBN}
</book>
}</bookstore>
This uses the book example and produces the result shown below it. You can test it online, with XQuery selected as the option, at https://www.videlibri.de/cgi-bin/xidelcgi
There might be ways to run XQuery from R but I would rather do it in a pre-processing step from the command line using a tool like xidel.
All elements could be looked up in a single XPath 1.0 expression valid for many languages:
/bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
Equivalent/similar expressions:
/bookstore/book/title | /bookstore/book/year | /bookstore/book/ISBN/Random
//book/@category | //book/year | //ISBN/Random
To filter out elements:
//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]
For XMLs with namespaces, local-name() can be used instead of name() if namespace handling is not used.
For the given example and elements and testing on command line:
echo 'cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]' | xmllint --shell test.xml
Result:
/ > cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
-------
<title lang="en">Everyday Italian</title>
-------
<year>2005</year>
-------
<title lang="en">Harry Potter</title>
-------
<year>2005</year>
-------
<Random>12354</Random>
-------
<title lang="en">XQuery Kick Start</title>
-------
<year>2003</year>
-------
<title lang="en">Learning XML</title>
-------
<year>2003</year>
-------
<Random>12345</Random>
/ >
For the mentioned R crash, worth looking here.
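The "filter out" expression above can also be applied from R in a single pass instead of one xml_remove call per XPath. The following is a minimal sketch, assuming the xml2 package and a shortened version of the bookstore example; it has not been tested on large C-CDA files.

```r
library(xml2)

# Shortened version of the bookstore example
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# Select every child of book that is NOT on the keep list ...
drop <- xml_find_all(
  doc,
  '//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]'
)

# ... and remove the whole node set with one call
xml_remove(drop)

cat(as.character(doc))
```

Selecting all unwanted nodes with one expression keeps the number of xml_remove calls at one, regardless of how many XPaths are on the keep list.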

XML to list and back to XML

Situation: "Software" to R and back to "Software". The only interface for "Software" is xml.
In R, I need to make a few changes to the file, so I convert it to a list and modify it.
library(XML)
myFile = xmlParse("myXML")
xml_data <- xmlToList(myFile)
xml_data$timetable$train$.attrs[6] = "HelloNewWorld"
Now I need to convert this list, "xml_data", back to XML.
I found some functions like this:
function(item, tag) {
# just a textnode, or empty node with attributes
if(typeof(item) != 'list') {
if (length(item) > 1) {
xml <- xmlNode(tag)
for (name in names(item)) {
xmlAttrs(xml)[[name]] <- item[[name]]
}
return(xml)
} else {
return(xmlNode(tag, item))
}
}
# create the node
if (identical(names(item), c("text", ".attrs"))) {
# special case a node with text and attributes
xml <- xmlNode(tag, item[['text']])
} else {
# node with child nodes
xml <- xmlNode(tag)
for(i in 1:length(item)) {
if (names(item)[i] != ".attrs") {
xml <- append.xmlNode(xml, listToXml(item[[i]], names(item)[i]))
}
}
}
# add attributes to node
attrs <- item[['.attrs']]
for (name in names(attrs)) {
xmlAttrs(xml)[[name]] <- attrs[[name]]
}
return(xml)
}
But this doesn't work...
Any help or hints appreciated!
Thanks!
In the linked picture you can see the current xml-file. Highlighted in yellow the values that I need to change.
Link:
https://i.stack.imgur.com/remzj.png
Consider XSLT, the special-purpose language designed to transform XML files. There is no need to rewrite the entire tree in R. Using R's xslt package (available on CRAN), an extension of xml2, you can transform an input source and write the output to screen or file.
Using the identity transform to copy the document as is, the XSLT below rewrites one of the attributes of the <train> tag, @source, similar to your code attempt above with the sixth attribute.
XML (sample input from the railML wiki page)
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="opentrack">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
XSLT (save as .xsl file, rewrites the @source attribute)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@source">
<xsl:attribute name="source">HelloNewWorld</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
R
library(xml2)
library(xslt)
doc <- read_xml("/path/to/Input.xml")
style <- read_xml("/path/to/XSLTScript.xsl")
new_xml <- xml_xslt(doc, style)
# OUTPUT TO SCREEN
cat(as.character(new_xml))
# OUTPUT TO FILE
write_xml(new_xml, "/path/to/Output.xml")
Output
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="HelloNewWorld">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
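If only a single attribute needs to change, editing it in place with xml2 may be enough, without a stylesheet or a round trip through a list. This is a minimal sketch under that assumption, using a shortened version of the sample input; it addresses the attribute by name rather than by position.

```r
library(xml2)

# Shortened version of the railML sample input
doc <- read_xml('<railml>
  <timetable version="1.1">
    <train trainID="RX 100.2" type="planned" source="opentrack">
      <timetableentries>
        <entry posID="ZU" departure="06:08:00" type="begin"/>
      </timetableentries>
    </train>
  </timetable>
</railml>')

# Locate the <train> element and overwrite its source attribute in place
train <- xml_find_first(doc, "//train")
xml_set_attr(train, "source", "HelloNewWorld")

cat(as.character(doc))
```

The modified document can then be saved with write_xml(), as in the R code above.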
I found it hard to apply many of the answers listed here, so I wonder if this set of simple Java XML XPathHelper utilities may help others. You can find the source code here. I didn't write it all myself but adapted code I found, so I can't take all the credit, but it works, it is compact, and I hope it helps others.
String xmlPayLoad = readFileAsString(payLoadPath + "/payLoad.xml");
TreeMap<String, String> header = new TreeMap<String, String>();
XPathHelperCommon xph = new XPathHelperCommon();
header = xph.findMultipleXMLItems(xmlPayLoad, "//header/*");
header.put("type", "newProcess");
xmlPayLoad = xph.modifyMultipleXMLItems(xmlPayLoad, "//header/*", header);
The primitive XML header could be something like this:
<header>
<type>process</type>
<ruleBaseVersion>0</ruleBaseVersion>
<ruleBaseCommitment>0</ruleBaseCommitment>
<sequenceId>0</sequenceId>
<priortiseSID>0</priortiseSID>
<monitorIncomingEvents>0</monitorIncomingEvents>
<activityCount>0</activityCount>
<taskElapsedTime>0</taskElapsedTime>
<processStartTime>0</processStartTime>
<processElapsedTime>0</processElapsedTime>
<eventElapsedTime>0</eventElapsedTime>
<status>0</status>
</header>

Add children to existing node using R XML

I have the following XML file test.graphml that I am trying to manipulate using the XML package in R.
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
<node id="n0"/>
<node id="n1"/>
<node id="n2"/>
<node id="n3"/>
<node id="n4"/>
<edge source="n0" target="n1"/>
<edge source="n0" target="n2"/>
<edge source="n2" target="n3"/>
<edge source="n1" target="n3"/>
<edge source="n3" target="n4"/>
</graph>
</graphml>
I would like to nest nodes n0, n1, n2, and n3 into a new graph node as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
<graph id="g1">
<node id="n0"/>
<node id="n1"/>
<node id="n2"/>
<node id="n3"/>
</graph>
<node id="n4"/>
<edge source="n0" target="n1"/>
<edge source="n0" target="n2"/>
<edge source="n2" target="n3"/>
<edge source="n1" target="n3"/>
<edge source="n3" target="n4"/>
</graph>
</graphml>
The code I have written has unknowns and errors that I am unable to resolve due to my lack of experience with XML processing. I would greatly appreciate some pointers that will help me proceed.
library(XML)
# Read file
x <- xmlParse("test.graphml")
ns <- c(graphml ="http://graphml.graphdrawing.org/xmlns")
# Create new graph node
ng <- xmlNode("graph", attrs = c("id" = "g1"))
# Add n0-n3 as children of new graph node
n0_n1_n2_n3 <- getNodeSet(x,"//graphml:node[@id = 'n0' or @id='n1' or @id='n2' or @id='n3']", namespaces = ns)
ng <- append.xmlNode(ng, n0_n1_n2_n3)
# Get only graph node
g <- getNodeSet(x,"//graphml:graph", namespaces = ns)
# Remove nodes n0-n3 from the only graph node
# How do I do this?
# This did not work: removeNodes(g, n0_n1_n2_n3)
# Add new graph node as child of only graph node
g <- append.xmlNode(g, ng)
#! Error message:
Error in UseMethod("append") :
no applicable method for 'append' applied to an object of class "XMLNodeSet"
Consider XSLT, the special-purpose language to transform XML files. Since you require modification of the XML (adding parent node in a select group of children) and have to navigate through an undeclared namespace prefix (xmlns="http://graphml.graphdrawing.org/xmlns"), XSLT is an optimal solution.
However, to date R does not have a fully compliant XSL module to run XSLT 1.0 scripts like other general purpose languages (Java, PHP, Python). Nonetheless, R can call external programs (including aforementioned languages), or dedicated XSLT processors (Xalan, Saxon), or call command line interpreters including PowerShell and terminal's xsltproc using system(). Below are latter solutions.
XSLT (save as .xsl, to be referenced in R script)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="doc:graphml">
<xsl:copy>
<xsl:copy-of select="document('')/*/@xsi:schemaLocation"/>
<xsl:apply-templates select="doc:graph"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:graph">
<xsl:element name="{local-name()}" namespace="http://graphml.graphdrawing.org/xmlns">
<xsl:apply-templates select="@*"/>
<xsl:element name="graph" namespace="http://graphml.graphdrawing.org/xmlns">
<xsl:attribute name="id">g1</xsl:attribute>
<xsl:apply-templates select="doc:node[position() &lt; 5]"/>
</xsl:element>
<xsl:apply-templates select="doc:node[@id='n4']|doc:edge"/>
</xsl:element>
</xsl:template>
<xsl:template match="doc:graph/@*">
<xsl:attribute name="{local-name()}"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>
<xsl:template match="doc:node|doc:edge">
<xsl:element name="{local-name()}" namespace="http://graphml.graphdrawing.org/xmlns">
<xsl:attribute name="{local-name(@*)}"><xsl:value-of select="@*"/></xsl:attribute>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
PowerShell script (for Windows PC users, save as XMLTransform.ps1)
param ($xml, $xsl, $output)
if (-not $xml -or -not $xsl -or -not $output) {
Write-Host "& .\xslt.ps1 [-xml] xml-input [-xsl] xsl-input [-output] transform-output"
exit;
}
trap [Exception]{
Write-Host $_.Exception;
}
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
$xslt.Load($xsl);
$xslt.Transform($xml, $output);
Write-Host "generated" $output;
R Script (calling command line operations)
library(XML)
# WINDOWS USERS
ps <- '"C:\\Path\\To\\XMLTransform.ps1"' # POWER SHELL SCRIPT
input <- '"C:\\Path\\To\\Input.xml"' # XML SOURCE
xsl <- '"C:\\Path\\To\\XSLTScript.xsl"' # XSLT SCRIPT
output <- '"C:\\Path\\To\\Output.xml"' # BLANK, EMPTY FILE PATH TO BE CREATED
system(paste('Powershell.exe -executionpolicy remotesigned -File',
ps, input, xsl, output)) # NOTE SECURITY BYPASS ARGS
doc <- xmlParse("C:\\Path\\To\\Output.xml")
# UNIX (MAC/LINUX) USERS
system("xsltproc /path/to/XSLTScript.xsl /path/to/input.xml -o /path/to/output.xml")
doc <- xmlParse("/path/to/output.xml")
print(doc)
# <?xml version="1.0" encoding="utf-8"?>
# <graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
# <graph id="G" edgedefault="directed">
# <graph id="g1">
# <node id="n0"/>
# <node id="n1"/>
# <node id="n2"/>
# <node id="n3"/>
# </graph>
# <node id="n4"/>
# <edge source="n0"/>
# <edge source="n0"/>
# <edge source="n2"/>
# <edge source="n1"/>
# <edge source="n3"/>
# </graph>
# </graphml>
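As an alternative that stays inside R, the regrouping can be sketched with the xml2 package. This is a sketch under several assumptions, not a tested replacement for the XSLT: it assumes xml2 is installed, uses a shortened graph, and binds the hypothetical prefix g to the default GraphML namespace for XPath lookups. The new inner graph element is created without an explicit namespace in memory and only picks up the default namespace once the document is serialized and read again, and the moved nodes may carry redundant xmlns declarations in the serialized output.

```r
library(xml2)

# Shortened version of test.graphml
doc <- read_xml('<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <node id="n3"/>
    <node id="n4"/>
    <edge source="n0" target="n1"/>
    <edge source="n3" target="n4"/>
  </graph>
</graphml>')

# Prefix "g" is an arbitrary choice bound to the default namespace
ns <- c(g = "http://graphml.graphdrawing.org/xmlns")

outer <- xml_find_first(doc, "//g:graph", ns)

# Insert the new <graph id="g1"> as the first child of the outer graph
inner <- xml_add_child(outer, "graph", id = "g1", .where = 0)

for (id in c("n0", "n1", "n2", "n3")) {
  node <- xml_find_first(outer, sprintf("g:node[@id='%s']", id), ns)
  xml_add_child(inner, node)  # appends a copy of the node to the new graph
  xml_remove(node)            # then drops the original from the outer graph
}

# Serialize and re-read so that namespaces are resolved consistently
result <- read_xml(as.character(doc))
```

Unlike the XSLT output shown above, this approach copies whole nodes, so the edge elements keep both their source and target attributes.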

xquery help to query products and users from a different element node

Please help me understand how to write this kind of query with XQuery.
I have this .xml:
<auctions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<products>
<product id="1">
<name>Name1</name>
</product>
<product id="2">
<name>Name2</name>
</product>
<product id="3">
<name>Name3</name>
</product>
<product id="4">
<name>Name4</name>
</product>
</products>
<users>
<user username="Kukuk1">
</user>
<user username="Kukuk2">
</user>
<user username="Kukuk3">
</user>
</users>
<bids>
<product id="1">
<bid user="Kukuk1">400</bid>
<bid user="Kukuk2">410</bid>
<bid user="Kukuk1">450</bid>
</product>
<product id="2">
<bid user="Kukuk3">200</bid>
<bid user="Kukuk2">300</bid>
</product>
<product id="3">
<bid user="Kukuk1">150</bid>
</product>
</bids>
</auctions>
I need to get the following output: the user "Kukuk1" won the products "Name1" (with value "450") and "Name3" (with value "150"). The user "Kukuk3" has not won any products. The user "Kukuk2" won the product "Name2".
The user elements should be ordered by user name ascending and the product elements by value descending. The result should look like this:
<got>
<user name="Kukuk1">
<product value="450">Name1</product>
<product value="150">Name3</product>
</user>
<user name="Kukuk3"/>
<user name="Kukuk2">
<product value="300">Name2</product>
</user>
</got>
This is what I got so far:
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:item-separator "
";
<got>
{
for $u in (//auctions/users/user)
let $p:= //auctions/products
let $v := //auctions/bids/product
let $max1 := max($v/bid)
let $max2 := max($v/bid[4])
let $got := //auctions/bids/product
let $won-product := $got[@id=$p/product/@id]
order by $u
return
if ($u/@username="Kukuk1") then
(<user name="{fn:string($u/@username)}">
<product value="{$max1}">{fn:string($p/product[1]/name)}</product>
</user>,'
')
else
if
($u/@username="Kukuk3") then
(<user name="{fn:string($u/@username)}">
<product value="{$max2}">{fn:string($p/product[2]/name)}</product>
</user>,'
')
else
if
($u/@username="Kukuk2") then
(<user name="{fn:string($u/@username)}">
<product value="{$max2}">{fn:string($p/product[3]/name)}</product>
</user>,'
')
else ()
}
</got>
And I'm getting this output:
<got>
<user name="Kukuk1">
<product value="450">Name1</product>
</user>
<user name="Kukuk3">
<product value="">Name2</product>
</user>
<user name="Kukuk2">
<product value="">Name3</product>
</user>
</got>
You will need two for loops to achieve the result you describe. In the outer loop you will order the users based on their respective username and in the inner loop you will get the bids and order them by their highest value.
Hence, it should look something like this:
<got>{
for $u in //auctions/users/user
let $username := $u/@username
order by $username
return element user {
$username,
for $product in //auctions/bids/product[bid/@user = $username]
let $highest-bid := max($product/bid)
order by $highest-bid descending
return
if ($product/bid[. = $highest-bid and @user = $username])
then element product {
attribute {"value"} {$highest-bid},
//auctions/products/product[@id = $product/@id]/name/string()
} else ()
}
}</got>
Please note that your example output does not match your description: Kukuk3 > Kukuk2, so Kukuk3 should come after Kukuk2 when ordering ascending. I assumed that your description is correct.

Parse XML based on attributes and text values of related nodes

I have used the XML package to parse both HTML and XML before, and have a rudimentary grasp of XPath. However, I've been asked to consider XML data where the important bits are determined by a combination of text and attributes of the elements themselves, as well as those in related nodes. I've never done that. For example:
[updated example, slightly more expansive]
<Catalogue>
<Bookstore id="ID910705541">
<location>foo bar</location>
<books>
<book category="A" id="1">
<title>Alpha</title>
<author ref="1">Matthew</author>
<author>Mark</author>
<author>Luke</author>
<author ref="2">John</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Beta</title>
<author ref="1">Huey</author>
<author>Duey</author>
<author>Louie</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Gamma</title>
<author ref="1">Tweedle Dee</author>
<author ref="2">Tweedle Dum</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910700051">
<location>foo</location>
<books>
<book category="A" id="1">
<title>Happy</title>
<author>Dopey</author>
<author>Bashful</author>
<author>Doc</author>
<author ref="1">Grumpy</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Ni</title>
<author ref="1">John</author>
<author ref="2">Paul</author>
<author ref="3">George</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>San</title>
<author ref="1">Ringo</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910715717">
<location>bar</location>
<books>
<book category="A" id="1">
<title>Un</title>
<author ref="1">Winkin</author>
<author>Blinkin</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Deux</title>
<author>Nod</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Trois</title>
<author>Manny</author>
<author>Moe</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
</Catalogue>
I would like to extract all author names where:
1) the location element has a text value that contains "NY"
2) the author element does NOT contain a "ref" attribute; that is where ref is not present in the author tag
I will ultimately need to concatenate the extracted authors together within a given bookstore, so that my resulting data frame has one row per store. I'd like to preserve the bookstore id as an additional field in my data frame so that I can uniquely reference each store.
Since only the first bookstore is in NY, results from this simple example would look something like:
1 Jane Smith John Doe Karl Pearson William Gosset
If another bookstore contained "NY" in its location, it would comprise the second row, and so forth.
Am I asking too much of R to parse under these convoluted conditions?
require(XML)
xdata <- xmlParse(apptext)
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
#[[1]]
#<author>Jane Smith</author>
#[[2]]
#<author>John Doe</author>
#[[3]]
#<author>Karl Pearson</author>
#[[4]]
#<author>William Gosset</author>
Breakdown:
Get all locations containing 'NY'
//*/location[text()[contains(.,"NY")]]
Get the books sibling of these nodes
/following-sibling::books
From these nodes, get all authors without a ref attribute
/.//author[not(@ref)]
Use xmlValue if you want the text:
> xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
[1] "Jane Smith" "John Doe" "Karl Pearson" "William Gosset"
UPDATE:
child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
ans.func<-function(x){
xpathSApply(x,'.//ancestor::bookstore[@id]/@id')
}
sapply(child.nodes,ans.func)
# id id id id
#"1" "1" "1" "1"
UPDATE 2:
With your changed data
xdata <- '<Catalogue>
<Bookstore id="ID910705541">
<location>foo bar</location>
<books>
<book category="A" id="1">
<title>Alpha</title>
<author ref="1">Matthew</author>
<author>Mark</author>
<author>Luke</author>
<author ref="2">John</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Beta</title>
<author ref="1">Huey</author>
<author>Duey</author>
<author>Louie</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Gamma</title>
<author ref="1">Tweedle Dee</author>
<author ref="2">Tweedle Dum</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910700051">
<location>foo</location>
<books>
<book category="A" id="1">
<title>Happy</title>
<author>Dopey</author>
<author>Bashful</author>
<author>Doc</author>
<author ref="1">Grumpy</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Ni</title>
<author ref="1">John</author>
<author ref="2">Paul</author>
<author ref="3">George</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>San</title>
<author ref="1">Ringo</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910715717">
<location>bar</location>
<books>
<book category="A" id="1">
<title>Un</title>
<author ref="1">Winkin</author>
<author>Blinkin</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Deux</title>
<author>Nod</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Trois</title>
<author>Manny</author>
<author>Moe</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
</Catalogue>'
Note that you previously had bookstore and now have Bookstore. NY is gone, so I have used foo instead.
require(XML)
xdata <- xmlParse(xdata)
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]')
ans.func<-function(x){
xpathSApply(x,'.//ancestor::Bookstore[@id]/@id')
}
sapply(child.nodes,ans.func)
# id id id id id
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051"
# id id
#"ID910700051" "ID910700051"
xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
# [1] "Mark" "Luke" "Duey" "Louie" "Dopey" "Bashful" "Doc"
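The question also asks for one row per store with the authors concatenated. A minimal sketch of that last step, assuming the xml2 package rather than XML and a shortened catalogue with made-up ids ID1 and ID2, could look like this:

```r
library(xml2)

# Shortened catalogue; the ids ID1 and ID2 are placeholders
doc <- read_xml('<Catalogue>
  <Bookstore id="ID1">
    <location>foo bar</location>
    <books>
      <book><author ref="1">Matthew</author><author>Mark</author><author>Luke</author></book>
      <book><author>Duey</author></book>
    </books>
  </Bookstore>
  <Bookstore id="ID2">
    <location>bar</location>
    <books>
      <book><author>Nod</author></book>
    </books>
  </Bookstore>
</Catalogue>')

# Stores whose location text contains "foo"
stores <- xml_find_all(doc, '//Bookstore[contains(location, "foo")]')

# For each matching store, paste together the authors that have no ref attribute
authors <- vapply(seq_along(stores), function(i) {
  found <- xml_find_all(stores[[i]], ".//author[not(@ref)]")
  paste(xml_text(found), collapse = " ")
}, character(1))

result <- data.frame(id = xml_attr(stores, "id"),
                     authors = authors,
                     stringsAsFactors = FALSE)
print(result)
```

Selecting the stores first and then querying each one with a relative XPath keeps the store id and its authors together, so no ancestor lookup is needed.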
