How to describe a comment tree with Atom/RSS?
The Atom Threading Extensions (RFC 4685) extend Atom with threaded discussions, although work on the specification is no longer active. This is a feed with comments:
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:thr="http://purl.org/syndication/thread/1.0">
<id>http://www.example.org/myfeed</id>
<title>My Example Feed</title>
<updated>2005-07-28T12:00:00Z</updated>
<link href="http://www.example.org/myfeed" />
<author><name>James</name></author>
<entry>
<id>tag:example.org,2005:1</id>
<title>My original entry</title>
<updated>2006-03-01T12:12:12Z</updated>
<link
type="application/xhtml+xml"
href="http://www.example.org/entries/1" />
<summary>This is my original entry</summary>
</entry>
<entry>
<id>tag:example.org,2005:1,1</id>
<title>A response to the original</title>
<updated>2006-03-01T12:12:12Z</updated>
<link href="http://www.example.org/entries/1/1" />
<thr:in-reply-to
ref="tag:example.org,2005:1"
type="application/xhtml+xml"
href="http://www.example.org/entries/1"/>
<summary>This is a response to the original entry</summary>
</entry>
</feed>
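As a rough illustration of how a consumer could turn thr:in-reply-to references into a comment tree, here is a minimal Python sketch using only the standard library. The feed is a trimmed copy of the one above; the function names build_thread and render are made up for this example and are not part of any Atom library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "http://www.w3.org/2005/Atom"
THR = "http://purl.org/syndication/thread/1.0"

# Trimmed copy of the feed above: one entry and one reply to it.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <entry>
    <id>tag:example.org,2005:1</id>
    <title>My original entry</title>
  </entry>
  <entry>
    <id>tag:example.org,2005:1,1</id>
    <title>A response to the original</title>
    <thr:in-reply-to ref="tag:example.org,2005:1"/>
  </entry>
</feed>"""


def build_thread(feed_xml):
    """Return (titles, children): id -> title, and parent id -> reply ids."""
    root = ET.fromstring(feed_xml)
    titles = {}
    children = defaultdict(list)  # the key None holds top-level entries
    for entry in root.findall(f"{{{ATOM}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM}}}id")
        titles[entry_id] = entry.findtext(f"{{{ATOM}}}title")
        reply = entry.find(f"{{{THR}}}in-reply-to")
        parent_id = reply.get("ref") if reply is not None else None
        children[parent_id].append(entry_id)
    return titles, children


def render(titles, children, parent_id=None, depth=0):
    """Flatten the tree into indented lines, depth-first."""
    lines = []
    for entry_id in children.get(parent_id, []):
        lines.append("  " * depth + titles[entry_id])
        lines.extend(render(titles, children, entry_id, depth + 1))
    return lines


titles, children = build_thread(FEED)
print("\n".join(render(titles, children)))
```

The ref attribute carries the id of the parent entry, so grouping entries by that value is enough to rebuild the tree, whatever order the entries appear in.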
You can use HTML in RSS, but < and > must be escaped as &lt; and &gt;:
<description>
...
<!-- comments -->
&lt;ul&gt;
&lt;li&gt;comment1&lt;/li&gt;
&lt;li&gt;comment2&lt;/li&gt;
&lt;li&gt;comment3&lt;/li&gt;
&lt;li&gt;comment4&lt;/li&gt;
&lt;/ul&gt;
</description>
Team, I need your help and expertise to retrieve node values by traversing an XML response. I would like to use this in an integration middleware.
Input file sample:
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xml:base="https://api12preview.sapsf.eu:443/odata/v2/">
<title type="text">PerEmail</title>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail</id>
<updated>2022-11-09T13:58:27Z</updated>
<link href="PerEmail" rel="self" title="PerEmail"/>
<entry>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail(emailType='54139',personIdExternal='GI00152188')</id>
<title type="text"/>
<updated>2022-11-09T13:58:27Z</updated>
<author>
<name/>
</author>
<link href="PerEmail(emailType='54139',personIdExternal='GI00152188')"
rel="edit"
title="PerEmail"/>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"
term="SFOData.PerEmail"/>
<content type="application/xml">
<m:properties>
<d:personIdExternal>GI00152188</d:personIdExternal>
<d:emailAddress>someone@test_boehringer.com</d:emailAddress>
</m:properties>
</content>
</entry>
<entry>
<id>https://api12preview.sapsf.eu:443/odata/v2/PerEmail(emailType='54139',personIdExternal='GI00453224')</id>
<title type="text"/>
<updated>2022-11-09T13:58:27Z</updated>
<author>
<name/>
</author>
<link href="PerEmail(emailType='54139',personIdExternal='GI00453224')"
rel="edit"
title="PerEmail"/>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"
term="SFOData.PerEmail"/>
<content type="application/xml">
<m:properties>
<d:personIdExternal>GI00453224</d:personIdExternal>
<d:emailAddress>someone@test_boehringer.com</d:emailAddress>
</m:properties>
</content>
</entry>
<link href="https://api12preview.sapsf.eu:443/odata/v2/PerEmail?$select=emailAddress,personIdExternal&amp;$filter=emailType%20eq%2054139&amp;$skiptoken=eyJzdGFydFJvdyI6MTAwMCwiZW5kUm93IjoyMDAwfQ=="
rel="next"/>
</feed>
From this XML response, the XQuery should run through all 'entry' nodes and pick the value of each 'personIdExternal' node. I'm expecting a result like this:
<element>
<personIdExternal>GI00152188</personIdExternal>
<personIdExternal>GI00453224</personIdExternal>
</element>
I tried something like the code below earlier, but it isn't working here, and I suspect this is due to the namespaces in the source XML. My knowledge of XQuery is limited. Please help.
{let $input:= /entry
for $i in $input/properties
return
<element>
<personIdExternal>{i/personIdExternal/text()}</personIdExternal>
</element>}
/entry doesn't select anything because the entry elements aren't at the top level, and they're in a namespace.
$input/properties is wrong because the properties element isn't a child of entry and it's in a namespace.
i doesn't select anything, it should be $i
personIdExternal doesn't select anything because it's in a namespace.
You just need
<element>{//*:personIdExternal}</element>
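For comparison, the same namespace-aware lookup can be sketched outside XQuery. The Python below is only an illustration using the standard library; the sample is a trimmed copy of the response above, and the helper names person_ids and to_element are invented for this example. Unlike the wildcard query, it spells out the full path and writes the result elements without a namespace, as in the expected output.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; only the namespace URIs have to match.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Trimmed copy of the response above.
RESPONSE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:personIdExternal>GI00152188</d:personIdExternal>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:personIdExternal>GI00453224</d:personIdExternal>
      </m:properties>
    </content>
  </entry>
</feed>"""


def person_ids(xml_text):
    """Collect the text of every d:personIdExternal under each Atom entry."""
    root = ET.fromstring(xml_text)
    path = "atom:entry/atom:content/m:properties/d:personIdExternal"
    return [node.text for node in root.findall(path, NS)]


def to_element(ids):
    """Wrap the ids in <element>, using un-namespaced child elements."""
    element = ET.Element("element")
    for person_id in ids:
        ET.SubElement(element, "personIdExternal").text = person_id
    return ET.tostring(element, encoding="unicode")


print(to_element(person_ids(RESPONSE)))
```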
Suppose I have the following XML:
library(xml2)
x = xml_children(read_xml('<?xml version="1.0" encoding="UTF-8"?>
<items>
<item type="greeting" id="9273938">
<link type="1" id="139" value="Hi"/>
<link type="1" id="142" value="Hello"/>
<link type="1" id="130" value="Ahoy"/>
</item>
<item type="greeting" id="9225694">
<link type="1" id="138" value="Bye"/>
<link type="1" id="131" value="Adios"/>
</item>
</items>'))
I can loop over it to access the <link> nodes in the individual <item> nodes.
lapply(x, xml_find_all, xpath = "link")
This produces a list of separate nodesets, which allows me to know which collection of "links" belongs to which "item". But looping over a longish nodeset (say thousands of <item> nodes) can be slow.
In contrast the below is almost instant (and I think closer to the spirit of how xml2 should be used) but I no longer know which item the links came from. They appear to all be siblings:
xml_find_all(x, xpath = "link")
Question: How to extract the <link> nodes without losing information about the <item> they came from, avoiding the lapply solution above?
With each link node, you can get the information about the parent item by /parent::item:
library(xml2)
library(magrittr) # provides the %>% pipe used below
x <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<items>
<item type="greeting" id="9273938">
<link type="1" id="139" value="Hi"/>
<link type="1" id="142" value="Hello"/>
<link type="1" id="130" value="Ahoy"/>
</item>
<item type="greeting" id="9225694">
<link type="1" id="138" value="Bye"/>
<link type="1" id="131" value="Adios"/>
</item>
</items>')
links <- x %>% xml_find_all("//link")
data.frame(
item_id = links %>% xml_find_first("./parent::item") %>% xml_attr("id"), # notice the dot refers to the current link node
link_id = links %>% xml_attr("id"),
value = links %>% xml_attr("value")
)
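The same "select all links once, then look up each link's parent" idea can be sketched in Python for readers outside R. This is an illustration only: the standard-library ElementTree has no parent axis, so the sketch assumes a child-to-parent map built in a single pass, and link_rows is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Same data as the question, without the XML declaration.
DOC = """<items>
  <item type="greeting" id="9273938">
    <link type="1" id="139" value="Hi"/>
    <link type="1" id="142" value="Hello"/>
    <link type="1" id="130" value="Ahoy"/>
  </item>
  <item type="greeting" id="9225694">
    <link type="1" id="138" value="Bye"/>
    <link type="1" id="131" value="Adios"/>
  </item>
</items>"""


def link_rows(xml_text):
    """Return (item_id, link_id, value) tuples, one per <link>."""
    root = ET.fromstring(xml_text)
    # ElementTree elements do not know their parent, so map child -> parent once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        (parent_of[link].get("id"), link.get("id"), link.get("value"))
        for link in root.iter("link")
    ]


for row in link_rows(DOC):
    print(row)
```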
I'm using HtmlAgilityPack v1.11.21 and since upgrading to .NET Core 3.1, I started to receive the following error while trying to load up a web page via URL: 'UTF-8, text/html' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
I found this post 'UTF8' is not a supported encoding name, but I'm not sure where or how I'm supposed to implement:
System.Text.EncodingProvider provider = System.Text.CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(provider);
I tried placing it before calling
var web = new HtmlWeb();
var doc = web.Load(urlToSearch);
But that didn't solve the issue.
This was working fine before upgrading to .NET Core 3.1, so I'm not sure where exactly I need to implement a fix.
Any ideas would be appreciated!
Thanks!
For those asking for the url, I'd rather not share that, but here's the heading:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Bootstrap -->
<!-- Latest compiled and minified CSS -->
<link rel="stylesheet" href="http://www.somesite.com/graphics/cdn/bootstrap-3.3.4-base-and-theme-min.2.css">
<!-- Optional theme -->
<link rel='stylesheet' type="text/css" media="screen" href="http://fonts.googleapis.com/css?family=Droid+Sans:400,700">
<link rel="stylesheet" href="http://www.somesite.com/graphics/cdn/somesite-responsive.css">
<link rel="apple-touch-icon-precomposed" sizes="57x57" href="/apple-touch-icon-57x57.png" />
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="/apple-touch-icon-114x114.png" />
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="/apple-touch-icon-72x72.png" />
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="/apple-touch-icon-144x144.png" />
<link rel="apple-touch-icon-precomposed" sizes="60x60" href="/apple-touch-icon-60x60.png" />
<link rel="apple-touch-icon-precomposed" sizes="120x120" href="/apple-touch-icon-120x120.png" />
<link rel="apple-touch-icon-precomposed" sizes="76x76" href="/apple-touch-icon-76x76.png" />
<link rel="apple-touch-icon-precomposed" sizes="152x152" href="/apple-touch-icon-152x152.png" />
<link rel="icon" type="image/png" href="/favicon-196x196.png" sizes="196x196" />
<link rel="icon" type="image/png" href="/favicon-96x96.png" sizes="96x96" />
<link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" />
<link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16" />
<link rel="icon" type="image/png" href="/favicon-128.png" sizes="128x128" />
<meta name="application-name" content=" " />
<meta name="msapplication-TileColor" content="#FFFFFF" />
<meta name="msapplication-TileImage" content="/mstile-144x144.png" />
<meta name="msapplication-square70x70logo" content="/mstile-70x70.png" />
<meta name="msapplication-square150x150logo" content="/mstile-150x150.png" />
<meta name="msapplication-wide310x150logo" content="/mstile-310x150.png" />
<meta name="msapplication-square310x310logo" content="/mstile-310x310.png" />
<meta property="og:url" content="http://www.somesite.com/">
<meta property="og:type" content="website">
<meta property="og:title" content="site title">
<meta property="og:image" content="http://www.somesite.com/graphics/somesite_square_logo.png">
<meta property="og:description" content="description">
<title>site title</title>
</head>
<body>
</body>
</html>
There doesn't appear to be anything special there. I was hoping it was a .NET Core 3.1 thing...
As another measure, I've tried implementing the below but the response.Content.ReadAsStringAsync() comes back as empty.
using var httpClient = new HttpClient();
{
var response = await httpClient.GetAsync(urlToSearch);
if (response.IsSuccessStatusCode)
{
var html = await response.Content.ReadAsStringAsync();
var doc = new HtmlDocument();
doc.LoadHtml(html);
var photoUrl = doc.QuerySelector("div #headshot").ChildNodes[0].Attributes["src"].Value;
return new OkObjectResult(photoUrl);
}
}
It looks like it's not the issue with .NET Core 3.1, but with the URL you are trying to load.
.NET Core 3.1 has UTF-8 among its default encodings:
.NET Core, on the other hand, supports only the following encodings:
[...]
UTF-8 (code page 65001), which is returned by the Encoding.UTF8 property.
[...]
I don't recall any place in HTTP Headers or in HTML where a string similar to
UTF-8, text/html
is expected.
In headers it looks like:
Content-Type: text/html;charset=utf-8
In HTML, like:
<meta charset="utf-8"/>
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
The page itself may show no signs of a problem in browsers, because they are quite forgiving. Your code before the upgrade could also have ignored the ", text/html" part for any number of reasons, and the issue could have started to appear after the upgrade for just as many.
If you do not control the server, then you should probably load the page manually, remove this error (", text/html") from the string, and feed the result to HtmlAgilityPack.
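To make the "clean it up yourself" idea concrete, here is a small Python sketch of the sanitizing step. It is not the .NET fix and does not use HtmlAgilityPack; it only illustrates picking the first usable encoding name out of a malformed value such as "UTF-8, text/html". It assumes the bad value arrives in the charset part of the response headers, and pick_charset and decode_body are hypothetical helper names.

```python
import codecs


def pick_charset(raw_value, default="utf-8"):
    """Return the first token in raw_value that is a known encoding name."""
    for token in raw_value.replace(";", ",").split(","):
        candidate = token.strip().strip("\"'")
        if candidate.lower().startswith("charset="):
            candidate = candidate[len("charset="):].strip().strip("\"'")
        if not candidate:
            continue
        try:
            # codecs.lookup raises LookupError for names like "text/html".
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


def decode_body(body_bytes, raw_charset):
    """Decode the raw response body with the sanitized charset."""
    return body_bytes.decode(pick_charset(raw_charset), errors="replace")


print(pick_charset("UTF-8, text/html"))
print(decode_body("caf\u00e9".encode("utf-8"), "UTF-8, text/html"))
```

The decoded string can then be handed to the HTML parser directly, so the parser never sees the invalid encoding name.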
Update
Considering your update:
HTTP headers matter too, even more so: they take precedence over <meta>. Try
curl -v yourURL
I'm not sure why ReadAsStringAsync returns an empty string: it may be the same issue (wrong headers), or it may be an error in your code (as far as I know, ReadAsStringAsync doesn't return a string directly but a Task<string>). You can try passing the HTML as a static string
html = "<!DOCTYPE html>...";
doc.LoadHtml(html);
To isolate the initial issue.
As for ReadAsStringAsync, you should first check whether it succeeds in reading other sites. I looked around on the Internet and there are a lot of possibilities; I don't know which will work for you.
If the issue is with the headers, you can try "Is it possible to make HttpClient ignore invalid ETag header in response?", https://stackoverflow.com/a/29426046/12610347, "How to make HttpClient ignore Content-Length header", or something else to your liking.
I'm trying to format a Wordpress WXR file using XSLT so I can import it into Drupal.
I'm aware of modules for Drupal that will import WXR files but I need the flexibility that the Feeds module can give as the imported data will be imported against different content types and I'll be pulling images and other attachments into the newly created Drupal pages. With this in mind the standard WordPress Migrate just won't cut it.
So, the WXR format has WordPress posts and attachments as separate items within the feed and links the posts and attachments using an id. Attachments can be images or files (pdf, doc, etc.); they are referenced at the XPath wp:postmeta/wp:meta_key, with values of _thumbnail_id and _wp_attached_file.
What I'd like to do is take various nodes from items of type attachment and put them within the corresponding post item, where the id links them together.
A fragment of the XML to be transformed follows. The first item is a post; the second is an attachment.
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<wp:postmeta>
<wp:meta_key>_thumbnail_id</wp:meta_key>
<wp:meta_value>566</wp:meta_value>
</wp:postmeta>
</item>
...
...
...
<item>
<title>My fantastic attachment</title>
<link>http://www.example.com/fantastic-attachment</link>
<wp:post_id>566</wp:post_id>
<wp:post_type>attachment</wp:post_type>
...
...
...
<wp:attachment_url>http://www.example.com/wp-content/uploads/2012/12/fantastic.jpg</wp:attachment_url>
<wp:postmeta>
<wp:meta_key>_wp_attached_file</wp:meta_key>
<wp:meta_value>2012/12/fantastic.jpg</wp:meta_value>
</wp:postmeta>
</item>
After the transform I would like
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<wp:postmeta>
<wp:meta_key>_thumbnail_id</wp:meta_key>
<wp:meta_value>566</wp:meta_value>
<wp:meta_url>http://www.example.com/wp-content/uploads/2012/12/fantastic.jpg</wp:meta_url>
</wp:postmeta>
</item>
Maybe there is a better approach? Perhaps merging the post and attachment items wherever the id creates a link between the nodes?
I'm new to XSLT and have read a few posts on identity transforms. I think that's the correct direction, but I just don't have the experience to pull off what I need. Assistance would be appreciated.
It looks like I've managed to sort out a solution.
I used a number of indexes to organise the attachments. My requirements changed a little on further inspection of the XML, as there was
I changed my resulting output to be in the format of...
<item>
<title>Some groovy title</title>
<link>http://example.com/groovy-example</link>
<wp:post_id>2050</wp:post_id>
<wp:post_type>page</wp:post_type>
...
...
...
<thumbnail>
<title>Spaner</title>
<url>http://www.example.com/wp-content/uploads/2012/03/spanner.jpg</url>
</thumbnail>
<attachments>
<attachment>
<title>Fixing your widgets: An idiots guide</title>
<url>http://www.example.com/wp-content/uploads/2012/12/fixiing-widgets.pdf</url>
</attachment>
<attachment>
<title>Do It Yourself Trepanning</title>
<url>http://www.example.com/wp-content/uploads/2013/04/trepanning.pdf</url>
</attachment>
</attachments>
</item>
So using the following xsl gave me the desired result. The conditions on the indexes ensured I was selecting the correct files.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:wp="http://wordpress.org/export/1.2/">
<xsl:output indent="yes" cdata-section-elements="content"/>
<!-- Setup indexes -->
<!-- Index all main posts -->
<xsl:key
name="mainposts"
match="*/item[wp:post_type[text()='post']]"
use="wp:post_id" />
<!-- Index all sub posts (posts within posts)-->
<xsl:key
name="subposts"
match="*/item[wp:post_type[text()='post'] and category[@nicename = 'documents']]"
use="category[@domain = 'post_tag']" />
<!-- Index all image thumbs -->
<xsl:key
name="images"
match="*/item[wp:post_type[text()='attachment'] and wp:postmeta/wp:meta_key[text()='_wp_attachment_metadata']]"
use="wp:post_parent" />
<!-- Index all files (unable to sort members file at the moment)-->
<xsl:key
name="attachments"
match="*/item[wp:post_type[text()='attachment'] and not(wp:postmeta/wp:meta_key = '_wp_attachment_metadata')]"
use="wp:post_parent" />
<xsl:key
name="thumbnails"
match="*/item[wp:post_type[text()='attachment']]"
use="wp:post_id" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*/item/wp:post_parent[text()= 0]">
<wp:post_parent>
<xsl:value-of select="." />
</wp:post_parent>
<xsl:for-each select="key('thumbnails', ../wp:postmeta[wp:meta_key[text()='_thumbnail_id']]/wp:meta_value)">
<thumbnail>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</thumbnail>
</xsl:for-each>
<xsl:for-each select="key('subposts', ../category[@domain = 'post_tag'])">
<attachments>
<xsl:for-each select="key('images', wp:post_id)">
<file>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</file>
</xsl:for-each>
<xsl:for-each select="key('attachments', wp:post_id)">
<file>
<title><xsl:value-of select="title" /></title>
<url><xsl:value-of select="wp:attachment_url" /></url>
</file>
</xsl:for-each>
</attachments>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
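The xsl:key declarations act as lookup tables: each one indexes items by some value so that key('name', value) can fetch matching items directly. A minimal Python sketch of the same index-then-join idea follows. It assumes the WXR items have already been parsed into plain dictionaries; the ITEMS structure and the helper names index_by and attach_thumbnails are hypothetical and exist only for this illustration.

```python
# Hypothetical pre-parsed items, mirroring the post/attachment pair in the question.
ITEMS = [
    {
        "title": "Some groovy title",
        "post_id": "2050",
        "post_type": "page",
        "meta": {"_thumbnail_id": "566"},
    },
    {
        "title": "My fantastic attachment",
        "post_id": "566",
        "post_type": "attachment",
        "attachment_url": "http://www.example.com/wp-content/uploads/2012/12/fantastic.jpg",
        "meta": {"_wp_attached_file": "2012/12/fantastic.jpg"},
    },
]


def index_by(items, field):
    """Group items by the value of one field (the role xsl:key plays)."""
    index = {}
    for item in items:
        index.setdefault(item[field], []).append(item)
    return index


def attach_thumbnails(items):
    """Copy each non-attachment item, adding the thumbnail its meta points to."""
    attachments = [i for i in items if i["post_type"] == "attachment"]
    by_id = index_by(attachments, "post_id")
    merged = []
    for item in items:
        if item["post_type"] == "attachment":
            continue
        post = dict(item)
        thumb_id = item.get("meta", {}).get("_thumbnail_id")
        matches = by_id.get(thumb_id, [])  # analogous to key('thumbnails', id)
        if matches:
            post["thumbnail"] = {
                "title": matches[0]["title"],
                "url": matches[0]["attachment_url"],
            }
        merged.append(post)
    return merged


for post in attach_thumbnails(ITEMS):
    print(post["title"], "->", post.get("thumbnail"))
```

Building the index once keeps the join linear in the number of items, which is the same reason keys are preferable to repeated //item[...] scans in the stylesheet.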
Input document:
<entry xmlns="http://www.w3.org/2005/Atom">
<id>urn:uuid:1234</id>
<updated>2012-01-20T11:30:11-05:00</updated>
<published>2011-12-29T15:44:11-05:00</published>
<link href="?id=urn:uuid:1234" rel="edit" type="application/atom+xml"/>
<title>Title</title>
<category scheme="http://uri/categories" term="category"/>
<fake:fake xmlns:fake="http://fake/" attr="val"/>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Blah</p>
</div>
</content>
</entry>
<!-- more entries -->
I want the output to be exactly the same, but with non-Atom elements like <fake:fake xmlns:fake="http://fake/" attr="val"/> stripped out. This is what I have, which doesn't work at all, just giving me the same input back:
declare namespace atom = "http://www.w3.org/2005/Atom";
<feed>
<title>All Posts</title>
{
for $e in collection('/db/entries')/atom:entry
return
if
(namespace-uri($e) = "http://www.w3.org/2005/Atom")
then
$e
else
''
}
</feed>
What am I doing wrong?
You can try the following query on try.zorba-xquery.com:
let $entry := <entry xmlns="http://www.w3.org/2005/Atom">
<id>urn:uuid:1234</id>
<updated>2012-01-20T11:30:11-05:00</updated>
<published>2011-12-29T15:44:11-05:00</published>
<link href="?id=urn:uuid:1234" rel="edit" type="application/atom+xml"/>
<title>Title</title>
<category scheme="http://uri/categories" term="category"/>
<fake:fake xmlns:fake="http://fake/" attr="val"/>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Blah</p>
</div>
</content>
</entry>
return {
delete nodes $entry//*[not(namespace-uri(.) = "http://www.w3.org/2005/Atom")];
$entry
}
The following version is more portable:
let $entry := <entry xmlns="http://www.w3.org/2005/Atom">
<id>urn:uuid:1234</id>
<updated>2012-01-20T11:30:11-05:00</updated>
<published>2011-12-29T15:44:11-05:00</published>
<link href="?id=urn:uuid:1234" rel="edit" type="application/atom+xml"/>
<title>Title</title>
<category scheme="http://uri/categories" term="category"/>
<fake:fake xmlns:fake="http://fake/" attr="val"/>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Blah</p>
</div>
</content>
</entry>
return
copy $new-entry := $entry
modify (delete nodes $new-entry//*[not(namespace-uri(.) = "http://www.w3.org/2005/Atom")])
return $new-entry
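For readers who want to see the same filtering outside XQuery, here is a minimal Python sketch with the standard library. It is an illustration, not a translation of either query; strip_foreign is a made-up helper name and the entry is a shortened copy of the input document. Like the queries above, it removes every element outside the Atom namespace, which includes the XHTML div inside content.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Shortened copy of the input document above.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:1234</id>
  <title>Title</title>
  <fake:fake xmlns:fake="http://fake/" attr="val"/>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml"><p>Blah</p></div>
  </content>
</entry>"""


def strip_foreign(element, namespace=ATOM):
    """Remove, in place, every descendant element outside the given namespace."""
    prefix = "{" + namespace + "}"
    for child in list(element):  # iterate over a copy because we mutate
        if isinstance(child.tag, str) and not child.tag.startswith(prefix):
            element.remove(child)
        else:
            strip_foreign(child, namespace)
    return element


entry = strip_foreign(ET.fromstring(ENTRY))
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace
print(ET.tostring(entry, encoding="unicode"))
```

To keep XHTML content while still dropping extension elements, the check could be extended with a set of allowed namespaces instead of a single one.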
Sort of a round-about way of doing it but this ended up working:
declare default element namespace "http://www.w3.org/2005/Atom";
<feed>
<title>All Posts</title>
{
for $entry in collection('/db/entries')/entry
return
element{node-name($entry)}{
$entry/@*,
for $child in $entry//*[namespace-uri(.) = "http://www.w3.org/2005/Atom"]
return $child
}
}
</feed>
Waiting for the time limit to expire and then I'll accept it as an answer.