Extract child nodes without losing connection to the parent using xml2 - r

Suppose I have the following XML:
library(xml2)
x = xml_children(read_xml('<?xml version="1.0" encoding="UTF-8"?>
<items>
<item type="greeting" id="9273938">
<link type="1" id="139" value="Hi"/>
<link type="1" id="142" value="Hello"/>
<link type="1" id="130" value="Ahoy"/>
</item>
<item type="greeting" id="9225694">
<link type="1" id="138" value="Bye"/>
<link type="1" id="131" value="Adios"/>
</item>
</items>'))
I can loop over it to access the <link> nodes in the individual <item> nodes.
lapply(x, xml_find_all, xpath = "link")
This produces a list of separate nodesets, which allows me to know which collection of "links" belongs to which "item". But looping over a longish nodeset (say thousands of <item> nodes) can be slow.
In contrast the below is almost instant (and I think closer to the spirit of how xml2 should be used) but I no longer know which item the links came from. They appear to all be siblings:
xml_find_all(x, xpath = "link")
Question: How to extract the <link> nodes without losing information about the <item> they came from, avoiding the lapply solution above?

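Starting from each link node, you can step back up to its enclosing item with the XPath ./parent::item, so every link stays tied to the item it came from:
library(xml2)
library(magrittr) # provides the %>% pipe used below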
x <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<items>
<item type="greeting" id="9273938">
<link type="1" id="139" value="Hi"/>
<link type="1" id="142" value="Hello"/>
<link type="1" id="130" value="Ahoy"/>
</item>
<item type="greeting" id="9225694">
<link type="1" id="138" value="Bye"/>
<link type="1" id="131" value="Adios"/>
</item>
</items>')
links <- x %>% xml_find_all("//link")
data.frame(
item_id = links %>% xml_find_first("./parent::item") %>% xml_attr("id"), # notice the dot refers to the current link node
link_id = links %>% xml_attr("id"),
value = links %>% xml_attr("value")
)

Related

Loop over XML, retrieve node values and construct XML output using XQuery

Team, I need your help and expertise to retrieve a node value by traversing an XML response. I would like to use this in an integration middleware.
Input file sample:
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xml:base="https://api12preview.sapsf.eu:443/odata/v2/">
<title type="text">PerEmail</title>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail</id>
<updated>2022-11-09T13:58:27Z</updated>
<link href="PerEmail" rel="self" title="PerEmail"/>
<entry>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail(emailType='54139',personIdExternal='GI00152188')</id>
<title type="text"/>
<updated>2022-11-09T13:58:27Z</updated>
<author>
<name/>
</author>
<link href="PerEmail(emailType='54139',personIdExternal='GI00152188')"
rel="edit"
title="PerEmail"/>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"
term="SFOData.PerEmail"/>
<content type="application/xml">
<m:properties>
<d:personIdExternal>GI00152188</d:personIdExternal>
<d:emailAddress>someone@test_boehringer.com</d:emailAddress>
</m:properties>
</content>
</entry>
<entry>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail(emailType='54139',personIdExternal='GI00453224')</id>
<title type="text"/>
<updated>2022-11-09T13:58:27Z</updated>
<author>
<name/>
</author>
<link href="PerEmail(emailType='54139',personIdExternal='GI00453224')"
rel="edit"
title="PerEmail"/>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"
term="SFOData.PerEmail"/>
<content type="application/xml">
<m:properties>
<d:personIdExternal>GI00453224</d:personIdExternal>
<d:emailAddress>someone@test_boehringer.com</d:emailAddress>
</m:properties>
</content>
</entry>
<link href="https://api12preview.sapsf.eu:443/odata/v2/PerEmail?$select=emailAddress,personIdExternal&amp;$filter=emailType%20eq%2054139&amp;$skiptoken=eyJzdGFydFJvdyI6MTAwMCwiZW5kUm93IjoyMDAwfQ=="
rel="next"/>
</feed>
From this XML response, the XQuery should run through all 'entry' nodes and pick the value of each 'personIdExternal' node. I'm expecting a result like this:
<element>
<personIdExternal>GI00152188</personIdExternal>
<personIdExternal>GI00453224</personIdExternal>
</element>
I tried the code below earlier, but it's not working here, and I suspect this is due to the namespaces in the source XML. My knowledge of XQuery is limited. Please help.
{let $input:= /entry
for $i in $input/properties
return
<element>
<personIdExternal>{i/personIdExternal/text()}</personIdExternal>
</element>}
/entry doesn't select anything because the entry elements aren't at the top level, and they're in a namespace.
$input/properties is wrong because the properties element isn't a child of entry and it's in a namespace.
i doesn't select anything, it should be $i
personIdExternal doesn't select anything because it's in a namespace.
You just need
<element>{//*:personIdExternal}</element>
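For comparison, here is a minimal Python sketch of the same namespace-aware selection, using only the standard library. The trimmed-down FEED string and the helper name collect_person_ids are illustrative assumptions, not part of the original question. Unlike the one-line XQuery above, which copies the d:personIdExternal elements as they are, this sketch rebuilds them without a namespace to match the expected output.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the OData feed in the question (assumption).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content type="application/xml"><m:properties>
    <d:personIdExternal>GI00152188</d:personIdExternal>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:personIdExternal>GI00453224</d:personIdExternal>
  </m:properties></content></entry>
</feed>"""

# The prefix used here only has to map to the right namespace URI.
NS = {"d": "http://schemas.microsoft.com/ado/2007/08/dataservices"}


def collect_person_ids(feed_xml):
    """Return an <element> wrapper holding one personIdExternal per match."""
    root = ET.fromstring(feed_xml)
    out = ET.Element("element")
    # './/' searches at any depth, so the nesting under entry/content is irrelevant.
    for node in root.iterfind(".//d:personIdExternal", NS):
        child = ET.SubElement(out, "personIdExternal")
        child.text = node.text
    return ET.tostring(out, encoding="unicode")


result = collect_person_ids(FEED)
print(result)
```

The key point is the same as in the answer: the element names only match once the namespace is taken into account, either with a wildcard prefix (*:name in XQuery) or with an explicit prefix-to-URI mapping as here.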

How to upload and save a picture with eXist-db?

I am trying to upload a picture and store it in eXist-db, but I get the following error when opening the stored picture:
Cannot open specified file: Could not recognize image encoding.
I have tried the following code with a small adjustment for plain .txt files and it works fine, but not with pictures.
picture.xhtml
<?xml-model href="http://www.oxygenxml.com/1999/xhtml/xhtml-xforms.nvdl" schematypens="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xf="http://www.w3.org/2002/xforms"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<head>
<title/>
<xf:model>
<xf:instance xmlns="">
<data>
<image xsi:type="xs:base64Binary"/>
</data>
</xf:instance>
<xf:submission id="save" action="save.xquery" method="post"/>
</xf:model>
</head>
<body>
<xf:upload ref="image">
<xf:label>Upload Photo:</xf:label>
</xf:upload>
<br/>
<xf:submit submission="save">
<xf:label>Save</xf:label>
</xf:submit>
</body>
</html>
save.xquery
xquery version "3.1";
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";
let $login:=xmldb:login('xmldb:exist:///db/apps/places','admin','admin')
(: The small adjustment I refer to is just changing the file extension from .jpeg to .txt :)
return xmldb:store("/db/apps/places/",concat("pic",".jpeg"), util:base64-decode(request:get-data()//image))
If you want to store images to the eXist-db you should probably replace xmldb:store() with xmldb:store-as-binary().

XSLT to format Wordpress WXR XML for importing in to Drupal via Feeds

I'm trying to format a Wordpress WXR file using XSLT so I can import it into Drupal.
I'm aware of modules for Drupal that will import WXR files but I need the flexibility that the Feeds module can give as the imported data will be imported against different content types and I'll be pulling images and other attachments into the newly created Drupal pages. With this in mind the standard WordPress Migrate just won't cut it.
So, the WXR format has WordPress posts and attachments as separate items within the feed and links the posts and attachments using an id. Attachments can be images or files (pdf, doc, etc.) and are found at the XPath wp:postmeta/wp:meta_key, with values of _thumbnail_id and _wp_attached_file.
What I'd like to do is take various nodes from items of type attachment and put them within the corresponding post item, where the id links them together.
A fragment of the XML to be transformed follows. The first item is a post, the second is an attachment.
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<wp:postmeta>
<wp:meta_key>_thumbnail_id</wp:meta_key>
<wp:meta_value>566</wp:meta_value>
</wp:postmeta>
</item>
...
...
...
<item>
<title>My fantastic attachment</title>
<link>http://www.example.com/fantastic-attachment</link>
<wp:post_id>566</wp:post_id>
<wp:post_type>attachment</wp:post_type>
...
...
...
<wp:attachment_url>http://www.example.com/wp-content/uploads/2012/12/fantastic.jpg</wp:attachment_url>
<wp:postmeta>
<wp:meta_key>_wp_attached_file</wp:meta_key>
<wp:meta_value>2012/12/fantastic.jpg</wp:meta_value>
</wp:postmeta>
</item>
After the transform I would like
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<wp:postmeta>
<wp:meta_key>_thumbnail_id</wp:meta_key>
<wp:meta_value>566</wp:meta_value>
<wp:meta_url>http://www.example.com/wp-content/uploads/2012/12/fantastic.jpg</wp:meta_url>
</wp:postmeta>
</item>
Maybe there is a better approach? Perhaps merging the post and the attachment wherever the id creates a link between the nodes?
I'm new to XSLT and have read a few posts on identity transforms. I think that's the correct direction, but I just don't have the experience to pull off what I need. Assistance would be appreciated.
It looks like I've managed to sort out a solution.
I used a number of indexes to organise the attachments. My requirements changed a little on further inspection of the XML, as there was
I changed my resulting output to be in the format of...
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<thumbnail>
<title>Spaner</title>
<url>http://www.example.com/wp-content/uploads/2012/03/spanner.jpg</url>
</thumbnail>
<attachments>
<attachment>
<title>Fixing your widgets: An idiots guide</title>
<url>http://www.example.com/wp-content/uploads/2012/12/fixiing-widgets.pdf</url>
</attachment>
<attachment>
<title>Do It Yourself Trepanning</title>
<url>http://www.example.com/wp-content/uploads/2013/04/trepanning.pdf</url>
</attachment>
</attachments>
</item>
So using the following xsl gave me the desired result. The conditions on the indexes ensured I was selecting the correct files.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:wp="http://wordpress.org/export/1.2/">
<xsl:output indent="yes" cdata-section-elements="content"/>
<!-- Setup indexes -->
<!-- Index all main posts -->
<xsl:key
name="mainposts"
match="*/item[wp:post_type[text()='post']]"
use="wp:post_id" />
<!-- Index all sub posts (posts within posts)-->
<xsl:key
name="subposts"
match="*/item[wp:post_type[text()='post'] and category[@nicename = 'documents']]"
use="category[@domain = 'post_tag']" />
<!-- Index all image thumbs -->
<xsl:key
name="images"
match="*/item[wp:post_type[text()='attachment'] and wp:postmeta/wp:meta_key[text()='_wp_attachment_metadata']]"
use="wp:post_parent" />
<!-- Index all files (unable to sort members file at the moment)-->
<xsl:key
name="attachments"
match="*/item[wp:post_type[text()='attachment'] and not(wp:postmeta/wp:meta_key = '_wp_attachment_metadata')]"
use="wp:post_parent" />
<xsl:key
name="thumbnails"
match="*/item[wp:post_type[text()='attachment']]"
use="wp:post_id" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*/item/wp:post_parent[text()= 0]">
<wp:post_parent>
<xsl:value-of select="." />
</wp:post_parent>
<xsl:for-each select="key('thumbnails', ../wp:postmeta[wp:meta_key[text()='_thumbnail_id']]/wp:meta_value)">
<thumbnail>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</thumbnail>
</xsl:for-each>
<xsl:for-each select="key('subposts', ../category[@domain = 'post_tag'])">
<attachments>
<xsl:for-each select="key('images', wp:post_id)">
<file>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</file>
</xsl:for-each>
<xsl:for-each select="key('attachments', wp:post_id)">
<file>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</file>
</xsl:for-each>
</attachments>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

Xquery delete type query using lookup list

Need help constructing xquery to delete data from XML File that contains data for Google Products data feed.
I have a long list of ID numbers in a spreadsheet column that need to be removed - these are numeric.
Below is the xml schema. I need to select all the records where <g:id> = "my string of numbers" , then delete them from the file.
Thanks for the help!
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title></title>
<link></link>
<description></description>
<item>
<title></title>
<link></link>
<description></description>
<g:id></g:id>
<g:condition></g:condition>
<g:price></g:price>
<g:availability></g:availability>
<g:image_link></g:image_link>
<g:brand></g:brand>
<g:mpn></g:mpn>
<g:product_type></g:product_type>
</item>
</channel>
</rss>
This XQuery:
declare namespace g = "http://base.google.com/ns/1.0";
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
{
let $vIds := (1,3)
return
/*/*/*[not(g:id = $vIds)]
}
</channel>
</rss>
when applied on the provided XML document:
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title></title>
<link></link>
<description></description>
<item>
<title>Item 1</title>
<link></link>
<description></description>
<g:id>1</g:id>
<g:condition></g:condition>
<g:price></g:price>
<g:availability></g:availability>
<g:image_link></g:image_link>
<g:brand></g:brand>
<g:mpn></g:mpn>
<g:product_type></g:product_type>
</item>
<item>
<title>Item 2</title>
<link></link>
<description></description>
<g:id>2</g:id>
<g:condition></g:condition>
<g:price></g:price>
<g:availability></g:availability>
<g:image_link></g:image_link>
<g:brand></g:brand>
<g:mpn></g:mpn>
<g:product_type></g:product_type>
</item>
<item>
<title>Item 3</title>
<link></link>
<description></description>
<g:id>3</g:id>
<g:condition></g:condition>
<g:price></g:price>
<g:availability></g:availability>
<g:image_link></g:image_link>
<g:brand></g:brand>
<g:mpn></g:mpn>
<g:product_type></g:product_type>
</item>
</channel>
</rss>
produces the wanted, correct result:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<title/>
<link/>
<description/>
<item>
<title>Item 2</title>
<link/>
<description/>
<g:id>2</g:id>
<g:condition/>
<g:price/>
<g:availability/>
<g:image_link/>
<g:brand/>
<g:mpn/>
<g:product_type/>
</item>
</channel>
</rss>
Use the XQuery Update delete expression:
delete node doc("mydata.xml")//item[g:id = (1, 3, 5, 6, 8, 9, 12)]
If you don't want to touch your original data, use a copy/modify expression, which applies the deletion to a copy:
copy $c := doc("mydata.xml")
modify delete node $c//item[g:id = (1, 3, 5, 6, 8, 9, 12)]
return $c
The = operator compares with set semantics: it returns true if any of the items on the left side equals one of the items on the right.
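The lookup-list idea can also be sketched outside XQuery. Below is a minimal Python version using only the standard library; the shortened RSS string and the helper name remove_items are illustrative assumptions. It mimics the "any item on the left equals any item on the right" behaviour with a set intersection. One difference to be aware of: this sketch compares the ids as strings, whereas the XQuery comparison against (1, 3) is numeric.

```python
import xml.etree.ElementTree as ET

G = "http://base.google.com/ns/1.0"

# Shortened stand-in for the product feed (assumption).
RSS = """<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title></title>
<item><title>Item 1</title><g:id>1</g:id></item>
<item><title>Item 2</title><g:id>2</g:id></item>
<item><title>Item 3</title><g:id>3</g:id></item>
</channel>
</rss>"""


def remove_items(rss_xml, ids_to_remove):
    """Parse the feed and drop every <item> whose g:id is in ids_to_remove."""
    ids = {str(i) for i in ids_to_remove}
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    # Iterate over a copy of the list because items are removed while looping.
    for item in list(channel.findall("item")):
        item_ids = {e.text for e in item.findall(f"{{{G}}}id")}
        # Non-empty intersection: at least one g:id is on the removal list.
        if item_ids & ids:
            channel.remove(item)
    return root


root = remove_items(RSS, [1, 3])
remaining = [item.findtext("title") for item in root.iter("item")]
print(remaining)
```

The ids from the spreadsheet column can be passed in as any iterable; converting them to a set first keeps each membership check cheap even for a long list.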

Syndication format for describing threaded comments?

How to describe comments tree with Atom/RSS?
There is an extension to Atom for threaded discussions (the Atom Threading Extensions, which started as a draft and were published as RFC 4685). This is a feed with comments:
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:thr="http://purl.org/syndication/thread/1.0">
<id>http://www.example.org/myfeed</id>
<title>My Example Feed</title>
<updated>2005-07-28T12:00:00Z</updated>
<link href="http://www.example.org/myfeed" />
<author><name>James</name></author>
<entry>
<id>tag:example.org,2005:1</id>
<title>My original entry</title>
<updated>2006-03-01T12:12:12Z</updated>
<link
type="application/xhtml+xml"
href="http://www.example.org/entries/1" />
<summary>This is my original entry</summary>
</entry>
<entry>
<id>tag:example.org,2005:1,1</id>
<title>A response to the original</title>
<updated>2006-03-01T12:12:12Z</updated>
<link href="http://www.example.org/entries/1/1" />
<thr:in-reply-to
ref="tag:example.org,2005:1"
type="application/xhtml+xml"
href="http://www.example.org/entries/1"/>
<summary>This is a response to the original entry</summary>
</entry>
</feed>
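To show how a consumer could rebuild the comment tree from such a feed, here is a minimal Python sketch using only the standard library. The condensed FEED string and the helper name replies_by_parent are illustrative assumptions. It groups entry ids by the ref attribute of their thr:in-reply-to element, with top-level entries grouped under None.

```python
import xml.etree.ElementTree as ET

NS = {
    "a": "http://www.w3.org/2005/Atom",
    "thr": "http://purl.org/syndication/thread/1.0",
}

# Condensed version of the feed above (assumption: only id, title and reply link kept).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <entry>
    <id>tag:example.org,2005:1</id>
    <title>My original entry</title>
  </entry>
  <entry>
    <id>tag:example.org,2005:1,1</id>
    <title>A response to the original</title>
    <thr:in-reply-to ref="tag:example.org,2005:1"/>
  </entry>
</feed>"""


def replies_by_parent(feed_xml):
    """Map each parent entry id (or None for top-level entries) to its child ids."""
    root = ET.fromstring(feed_xml)
    tree = {}
    for entry in root.findall("a:entry", NS):
        entry_id = entry.findtext("a:id", namespaces=NS)
        reply = entry.find("thr:in-reply-to", NS)
        parent = reply.get("ref") if reply is not None else None
        tree.setdefault(parent, []).append(entry_id)
    return tree


tree = replies_by_parent(FEED)
print(tree)
```

Walking the mapping from None downwards yields the thread in nested order, so the feed itself can stay flat.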
You can use HTML in RSS, but < and > must be written as &lt; and &gt;:
<description>
...
<!-- comments -->
&lt;ul&gt;
&lt;li&gt;comment1&lt;/li&gt;
&lt;li&gt;comment2&lt;/li&gt;
&lt;li&gt;comment3&lt;/li&gt;
&lt;li&gt;comment4&lt;/li&gt;
&lt;/ul&gt;
</description>
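If the feed is generated by code, the escaping usually does not need to be done by hand. A minimal Python sketch, assuming the comments are plain strings and using only the standard library (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET
from html import escape

comments = ["comment1", "comment2"]

# Build the HTML fragment; escape() protects the comment text itself.
html_list = "<ul>" + "".join(f"<li>{escape(c)}</li>" for c in comments) + "</ul>"

# Assigning the fragment as element text makes the serializer escape the
# markup characters, so the HTML travels as text inside <description>.
desc = ET.Element("description")
desc.text = html_list
serialized = ET.tostring(desc, encoding="unicode")
print(serialized)

# A feed reader that parses the XML gets the original HTML string back.
round_trip = ET.fromstring(serialized).text
```

Note that this carries a flat list only; nesting of replies would have to be expressed inside the HTML itself, for example with nested ul elements.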
