What are the allowed ways of creating hyperlinks in docx files? - docx

I'm currently working on a library that takes docx files as input and builds HTML pages from them. Because the docx documentation is vague and incomplete, I have to rely heavily on example output to decide how to handle certain things. One of these things is hyperlinks.
As far as I have seen, docx has at least two ways of doing hyperlinks:
Anchor - <w:hyperlink w:anchor="_Toc000000000" w:history="1"></w:hyperlink>
This seems to be the preferred way of doing things like TOC links.
Id - <w:hyperlink r:id="rId7" w:history="1"></w:hyperlink>
This seems to be the only way to specify a URL for the hyperlink (with the id being defined in the .xml.rels file).
So far so good. My problem is that I have encountered files that simply specify an rStyle value of "Hyperlink" (on the text run) and then seem to expect this to make the text act as a hyperlink to the heading named in the text run.
For example a document can contain the following:
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>Introduction</w:t>
</w:r>
</w:p>
And then further down the following:
<w:p>
<w:r>
<w:t>This is a hyperlink to </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
</w:rPr>
<w:t>Introduction</w:t>
</w:r>
<w:r>
<w:t>.</w:t>
</w:r>
</w:p>
So my question is: are these kinds of "hyperlinks" (a plain w:p instead of a w:hyperlink) actually valid, or just something that Word or the authors of the files I have are doing wrong?

When you say that the docx format is vague and lacks documentation, have you looked at the specs? http://www.ecma-international.org/publications/standards/Ecma-376.htm (Though I do find them vague at key points.)
There are at least two ways I know of to create links. w:hyperlink is one of them.
The w:hyperlink element either links internally or externally, and works more or less how you have discovered.
In the case of an external hyperlink, it will have a relationship id, and an entry in the relationships for this document marked as external that has a uri. The spec says that if the hyperlink is external, the anchor attribute should be ignored, but in practice, I found that Word will stick the anchor part of an external url here. E.g. http://example.com/page#myAnchor will store the uri without #myAnchor in the relationships and the anchor attribute of hyperlink will have "myAnchor" without the '#'. You will probably want to check for both.
For internal hyperlinks, the anchor should either match the name attribute of a w:bookmarkStart element, or be a special value like "_GoBack" or "_top".
The second case is images that are linked, which is, unfortunately, far more complicated. There will be a w:drawing for the image which will have a docPr element with a hlinkClick element, which will have a relationship id with the destination. The spec seems a bit unclear at this point, but looking at what Word does, it looks like if the relationship is internal, it will be a bookmark name (with '#' prepended), and if external, a uri.
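The w:hyperlink handling described above can be sketched in Python as follows. This is a minimal sketch, not a complete docx reader: it assumes the document part and its .rels part have already been read from the package as strings, the function names and sample XML are made up for illustration, and re-attaching the anchor to an external target with '#' follows the Word behaviour described above rather than the letter of the spec.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and the package relationships part.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def load_relationships(rels_xml):
    """Map relationship Id -> (Target, is_external) from a .rels part."""
    rels = {}
    for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL}}}Relationship"):
        external = rel.get("TargetMode") == "External"
        rels[rel.get("Id")] = (rel.get("Target"), external)
    return rels


def resolve_hyperlinks(document_xml, rels_xml):
    """Return (link text, href) pairs for every w:hyperlink element."""
    rels = load_relationships(rels_xml)
    links = []
    for link in ET.fromstring(document_xml).iter(f"{{{W}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        rel_id = link.get(f"{{{R}}}id")
        anchor = link.get(f"{{{W}}}anchor")
        if rel_id in rels:
            # Relationship-based link: check the anchor attribute as well.
            href = rels[rel_id][0]
            if anchor:
                href += "#" + anchor
        elif anchor:
            # Internal link to a bookmark (or a special value such as _top).
            href = "#" + anchor
        else:
            href = None
        links.append((text, href))
    return links


# Hypothetical sample parts, for illustration only.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body><w:p>
<w:hyperlink w:anchor="_Toc000000000" w:history="1"><w:r><w:t>Introduction</w:t></w:r></w:hyperlink>
<w:hyperlink r:id="rId7" w:anchor="myAnchor" w:history="1"><w:r><w:t>Example</w:t></w:r></w:hyperlink>
</w:p></w:body></w:document>"""

RELS_XML = f"""<Relationships xmlns="{PKG_REL}">
<Relationship Id="rId7" Type="{R}/hyperlink" Target="http://example.com/page" TargetMode="External"/>
</Relationships>"""

for text, href in resolve_hyperlinks(DOCUMENT_XML, RELS_XML):
    print(text, "->", href)
```

Linked images (the hlinkClick case) are not covered by this sketch; they would need a separate pass over the w:drawing elements.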


How to select the right <category> in RSS? is there a list?

I noticed that in an RSS feed you can add the <category> tag.
Source: https://www.w3schools.com/xml/rss_tag_category_item.asp
But I don't understand one thing: is there a list of all the categories, or can I write anything? I need a category about video games.
Or can I write anything?
You can write anything.
Unless you're submitting your feed to a directory, with a documented set of categories, it's essentially free text.
However, in RSS:
It has one optional attribute, domain, a string that identifies a categorization taxonomy.
The value of the element is a forward-slash-separated string that identifies a hierarchic location in the indicated taxonomy.
and in Atom:
The "scheme" attribute is an IRI that identifies a categorization
scheme.
you can indicate that your term is from a specific scheme.
In practice, some schema extensions like iTunes introduce a separate element:
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
...
<itunes:category text="Sports">
<itunes:category text="Wilderness"/>
</itunes:category>
rather than suggesting use of the scheme attribute. The iTunes guide currently includes:
<itunes:category text="Leisure">
<itunes:category text="Video Games" />
</itunes:category>
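To make the two options concrete, here is a minimal Python sketch that emits both a free-text <category> with a domain attribute and the nested itunes:category elements. The function name, the feed title, and the example.com taxonomy URL are assumptions made up for illustration, and the channel omits other elements a real feed needs (such as link and description).

```python
import xml.etree.ElementTree as ET

ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd"
# Serialize the iTunes namespace with the conventional "itunes" prefix.
ET.register_namespace("itunes", ITUNES)


def build_channel(title, category, domain=None, itunes_path=()):
    """Build a skeletal RSS 2.0 tree carrying category information."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title

    # Plain RSS category: free text, optionally tied to a taxonomy via domain.
    cat = ET.SubElement(channel, "category")
    cat.text = category
    if domain:
        cat.set("domain", domain)

    # iTunes categories nest: each entry becomes a child of the previous one.
    parent = channel
    for name in itunes_path:
        parent = ET.SubElement(parent, f"{{{ITUNES}}}category", {"text": name})
    return rss


feed = build_channel(
    "My gaming feed",                      # hypothetical title
    "Games/Video Games",                   # forward-slash-separated location
    domain="http://example.com/taxonomy",  # hypothetical taxonomy identifier
    itunes_path=("Leisure", "Video Games"),
)
xml_text = ET.tostring(feed, encoding="unicode")
print(xml_text)
```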

scrapy link extractor by value of html tag

I'm using scrapy to scrape privacy policies by crawling a website from its homepage. As such, I want to intelligently crawl only those links within pages that contain specific keywords (privacy, data, protection, etc.).
I saw that scrapy's CrawlSpider and the LinkExtractor object allow for just that. However, I would like the LinkExtractor to apply a regex not only to the discovered links but also to the text within the <a></a> tags,
in order to better identify cases like this one:
Check out our privacy policy
Here the URL might not be a perfect match, but the text within the HTML tags is more helpful.
I saw that scrapy's LinkExtractor object already has an argument called process_value which can launch an operation on the text within the HTML tag, but I'm unsure how I could "return a Positive link match" (like the regex expression given in the allow parameter would) and thus "add this link to the list of things to parse by the CrawlSpider object"
You’ll be able to do this in Scrapy 1.7.0 or later. See #3635.
The changes add a restrict_text parameter to LinkExtractor. From the master branch of the Scrapy documentation on LinkExtractor:
restrict_text (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
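For readers who want to see the matching rule itself, the following is a plain-Python sketch of filtering links by their anchor text, using only the standard library. It is not Scrapy's implementation; the class and function names and the sample HTML are made up for illustration. It only mirrors the rule quoted above: a link is kept if its text matches at least one of the given regular expressions, and all links are kept if none are given.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_matching_text(html, patterns):
    """Return hrefs whose link text matches at least one regex in patterns."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    regexes = [re.compile(p) for p in patterns]
    if not regexes:
        return [href for href, _ in parser.links]
    return [
        href
        for href, text in parser.links
        if any(r.search(text) for r in regexes)
    ]


# Hypothetical page fragment: the URL gives no hint, but the link text does.
sample_html = (
    '<a href="/legal/doc-17">Check out our privacy policy</a> '
    '<a href="/blog">Blog</a>'
)
print(links_matching_text(sample_html, [r"(?i)privacy", r"(?i)data protection"]))
```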

TinyButStrong magnet tag shows in output docx

Hi, my colleague and I have been trying to get the TinyButStrong plugin openTBS to create some docx files.
We have a live system which creates RTF files with data from MySQL. We want to change this to docx, using openTBS. A couple of super users then manage the templates in Word.
We have a problem with creating the files, as we need to remove a line if data isn't present.
If, in the Word template, we do
<w:p>[*fieldname*;magnet=w:p]*some kind of text*</w:p>
it hides the line if fieldname contains no data, and if it contains data, it shows the line. GREAT :-)
The problem is, that it also shows <w:p> and </w:p> when it contains data, and we don't like that.
How do we get it to stop showing these tags?
The TBS parameter ope=minv is made for this purpose: it performs the magnet behavior but keeps the field invisible (minv stands for magnet invisible).
So the solution is:
<w:p>[*fieldname*;magnet=tbs:p;ope=minv]*some kind of text*</w:p>
By the way, magnet=tbs:p is better than magnet=w:p because your template stays compatible when converted to another format (LibreOffice).

What's the correct format for TCDL linkAttributes?

I can see the technology-independent Tridion Content Delivery Language (TCDL) link has the following parameters, which are pretty well described on SDL Live Content.
type
origin
destination
templateURI
linkAttributes
textOnFail
addAnchor
VariantId
How do we add multiple attribute-value pairs for the linkAttributes? Specifically, what do we use to escape the double quotes as well as separate pairs (e.g. if we need class="someclass" and onclick="someevent").
The separate pairs are just space delimited, like a normal series of attributes. Try XML encoding the value of linkAttributes, however. So " becomes &quot;, etc.
If you are using some JavaScript, you may need to escape the JavaScript quotes too, as in \".
Edit: after I figured out your real question, the answer is a lot simpler:
You should wrap the values inside your linkAttributes in single quotes. Spaces inside linkAttributes are typically handled fine; but if not, escape them with %20.
If you need something more or want something that isn't handled by the standard tcdl:ComponentLink, remember that you can always create your own TCDL tag and use a TagHandler or TagRenderer (look them up in the docs for examples or search for Jaime's article on TagRenderer) to do precisely what you want.
My original answer was to a question you didn't ask: what is the format for TCDL tags (in general). But the explanation might still be useful to some, so remains below.
I'd suggest having a look at what format the default building blocks (e.g. the Link Resolver TBB in the Default Finish Actions) output and using that as a guideline.
This is what I could quickly get from the transport package of a published page:
<tcdl:Link type="Page" origin="tcm:5-199-64" destination="tcm:5-206-64"
templateURI="tcm:0-0-0" linkAttributes="" textOnFail="true"
addAnchor="" variantId="">Home</tcdl:Link>
<tcdl:ComponentPresentation type="Embedded" componentURI="tcm:5-69"
templateURI="tcm:5-133-32">
<span>
...
One of the things that I know from experience: your entire TCDL tag will have to be on a single line (I wrapped the lines above for readability only). Or at least that is the case if it is used to invoke a REL TagRenderer. Clearly the tcdl:ComponentPresentation tag above will span multiple lines, so that "single line rule" doesn't apply everywhere.
And that is probably the best advice: given that TCDL tags are processed at multiple points in the Tridion Publishing, Deployment and Delivery pipeline, I'd stick to the format that the default TBBs output. And from my sample that seems to be: put everything on a single line and wrap the values in (double) quotes.

How to extract element id attribute values from HTML

I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.
What I would ideally like is something that would extract every HTML attribute value that begins with "ctl00" into a list. The regex Find function in Notepad++ would be perfect, if only I knew what the regex should be.
As an example, if the HTML is:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />
I would like the output to be something like:
name="ctl00$Header$Search$Keywords"
A more advanced search might include the element name as well (e.g. control type):
input|name="ctl00$Header$Search$Keywords"
In order to cope with both Id and Name attributes I will simply rerun the search looking for Id instead of Name (i.e. I don't need something that will search for both at the same time).
The final output will be an Excel report that lists the number of server controls on the page and the length of the name of each, possibly sorted by control type.
Quick and dirty:
Search for
\w+\s*=\s*"ctl00[^"]*"
This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It will not check whether this really occurs within an HTML tag; that's a little more difficult to do and perhaps unnecessary. It will also not check for escaped quotes within the attribute's value. As usual with regexes, the complexity required depends on what exactly you want to match and what your input looks like...
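Outside a text editor, the same pattern can feed the length report directly. Below is a minimal Python sketch that applies it with capture groups added for the attribute name and value; the function name is made up for illustration, and the sample line is the one from the question.

```python
import re

# Same pattern as above, with groups for the attribute name and its value.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"(ctl00[^"]*)"')


def ctl00_attributes(html):
    """Return (attribute name, value, value length) for each ctl00 value."""
    return [
        (m.group(1), m.group(2), len(m.group(2)))
        for m in ATTR_RE.finditer(html)
    ]


sample = (
    '<input name="ctl00$Header$Search$Keywords" type="text" '
    'maxlength="50" class="search" />'
)
for attr, value, length in ctl00_attributes(sample):
    print(f'{attr}="{value}" ({length} characters)')
```

Like the pattern itself, this does not verify that a match sits inside a tag.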
"7000"? "Hundreds"? Dear god.
Since you're just looking at source in a text editor, try this... /(id|name)="ct[^"]*"/
Answering my own question, the easiest way to do this is to use BeautifulSoup, the 'dirty HTML' Python parser whose tagline is:
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
It works, and it's available from here - http://crummy.com/software/BeautifulSoup
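If installing a third-party parser is not an option, a similar extraction can be sketched with Python's standard html.parser module, which also reports the element name for each match. This is a stand-in for illustration rather than BeautifulSoup code, and the class and function names are made up.

```python
from html.parser import HTMLParser


class Ctl00Collector(HTMLParser):
    """Record (tag, attribute, value, length) for id/name values with a prefix."""

    def __init__(self, prefix="ctl00"):
        super().__init__()
        self.prefix = prefix
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <input ... />.
        for name, value in attrs:
            if name in ("id", "name") and value and value.startswith(self.prefix):
                self.rows.append((tag, name, value, len(value)))


def collect(html, prefix="ctl00"):
    """Parse html and return the collected rows in document order."""
    parser = Ctl00Collector(prefix)
    parser.feed(html)
    parser.close()
    return parser.rows


sample_page = (
    '<input name="ctl00$Header$Search$Keywords" type="text" '
    'maxlength="50" class="search" />'
    '<div id="ctl00_Main" class="x"></div>'
)
for tag, attr, value, length in collect(sample_page):
    print(f'{tag}|{attr}="{value}"|{length}')
```

The rows can then be written out with the csv module and opened in Excel for sorting by control type.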
I suggest xpath, as in this question
