adding images to openxml doc created from altchunk - xhtml

I need an automated process for creating docx files from xhtml source. The xhtml files contain images (<img> elements) whose "src" attributes point to an external reference. But the docx files need to be readable without a network connection, so I need to find a way to embed the images directly into the docx package (namely, in the /media folder).
So far I've used the altChunk method (as described by Eric White) to create the .docx file. I had hoped to use the OpenXML SDK to insert the image parts into the package. But to do that I need to insert paragraphs (<p> nodes) into the document. Unfortunately the document part contains nothing but a reference to the altChunk (stored separately in the docx package). Of course, once the docx is opened, edited and saved, the altChunk part is removed and its contents are embedded properly in the document.xml. But I don't know of any way to do that programmatically, so that doesn't help.
Other options I’ve considered:
Partitioning the xhtml into segments, separated between each image, then adding each altChunk one at a time, with the appropriate image reference between each one. (Tedious but seems possible)
Inserting the images into the media folder, and then finding a way to embed WordProcessingML directly into the xhtml so that the <img> references the packaged image file. (Questionable at best)
Can anyone think of a better approach?

Well, I sorta solved my own problem: I decided to convert the document to mHtml (which can contain images embedded directly in the file) and then use the altchunk to create the final docx file. However, I still wanted to do some post-processing on the file (to insert endnotes in the Word document), but as mentioned above, this is not possible until after the altchunk has been transformed into docx, which cannot be done programmatically.
So it dawned on me that I could bypass the altchunk path altogether and simply use mHtml as the "gateway" from xHtml to docx. I just transformed the xHtml into mHtml, complete with embedded images and endnotes, then renamed the file with a .doc extension. The resulting document can be opened directly by Word (and will be converted more properly on subsequent save). So far it works great (albeit with some bugs in Mac's version of Word, as well as Word2003).
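For anyone wanting to try the same route, here is a minimal sketch of the packaging step, assuming the xHtml has already been converted to HTML that Word can read. MHTML is a multipart/related MIME message, so Python's standard email package can build it. The function name build_mhtml, the image path and the requirement that each <img> src matches a part's Content-Location are my own illustration, and I have not tested the output against every Word version.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html, images):
    """Pack an HTML string and its images into one MHTML (multipart/related) payload.

    `images` maps the location used in each <img src="..."> to a
    (raw_bytes, image_subtype) pair, e.g. {"images/figure1.png": (data, "png")}.
    This helper is hypothetical, not part of any Word or OpenXML API.
    """
    root = MIMEMultipart("related", type="text/html")
    # The HTML body goes first so readers treat it as the root document.
    root.attach(MIMEText(html, "html", "utf-8"))
    for location, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        # Content-Location is what ties this part to the matching <img src>.
        part.add_header("Content-Location", location)
        root.attach(part)
    return root.as_bytes()
```

Writing the returned bytes to a file with a .doc extension corresponds to the renaming step described above.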

Related

How to see tags in a Powerpoint presentation

I am working on an Office Add-in for PowerPoint. I need to assign a unique identifier to my files so that they can be identified in any .NET application. I did similar work for Word using custom properties, but for PowerPoint there is no way to read or write custom properties using office.js.
The only way I found is to use tags:
https://learn.microsoft.com/en-us/office/dev/add-ins/powerpoint/tagging-presentations-slides-shapes
But when I add tags to the presentation, I cannot see them in the presentation directly; I can only read and write them through code. I also have not found a way to read these tags from a .NET application.
Any help will be great.
I am storing my files in Azure Blob Storage and reading them in my .NET Core application to identify whether they were saved from an Office Add-in or not. I am using the Syncfusion library in the .NET Core application to work with the files.
Try this:
Open a PPT file and add a few tags "wwwww", "yyyyy", "zzzzz".
Close and save the file.
Add ".zip" onto the end of the filename.
Use any unzipper program to unzip the file.
Search the folder of unzipped files for "wwwww", "yyyyy", "zzzzz".
This should tell you where/how tags are stored in the OOXML.
Your .NET app should be able to use the Open XML SDK to read the tags of a PPT file.
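The manual search above can also be scripted. This is only a sketch in Python (not the .NET code you would eventually write), and the function name find_tag_parts is made up for illustration. A .pptx file is already a zip package, so no renaming is needed. The comparison is case-insensitive in case the stored casing differs from what was typed.

```python
import zipfile


def find_tag_parts(package, tag_names):
    """Return {tag_name: [part names]} for every package part that mentions the tag.

    `package` is a path or file-like object for a .pptx file.
    Hypothetical helper for exploring where tags end up in the package.
    """
    hits = {}
    with zipfile.ZipFile(package) as archive:
        for part_name in archive.namelist():
            data = archive.read(part_name).lower()
            for tag in tag_names:
                # Plain byte search, the scripted version of searching the unzipped folder.
                if tag.lower().encode("utf-8") in data:
                    hits.setdefault(tag, []).append(part_name)
    return hits
```

Calling find_tag_parts("deck.pptx", ["wwwww", "yyyyy", "zzzzz"]) on your own file (deck.pptx is a placeholder name) lists the parts to inspect.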
At present, the Syncfusion Presentation library does not support reading or editing the tags of PowerPoint elements. You can track the status of this feature at the link below:
https://www.syncfusion.com/feedback/1800/create-and-edit-tags-for-powerpoint-elements
In the meantime, you could try one of the following workarounds:
Using Shape.Name property:
Add a shape with a unique name while generating the document from the Office Add-in. You can use the Shape.Name property for this.
In the .NET Core application, identify the corresponding shape by using the Shape.ShapeName property of the Syncfusion Presentation library and decide whether the file was generated by the Office Add-in or not. Refer to the UG documentation below for more details:
https://help.syncfusion.com/file-formats/presentation/working-with-shapes#specifying-shape-properties
Using custom meta-data on Presentation:
The PowerPoint JavaScript API documentation states that when tags are applied to the Presentation object, they are maintained as custom properties of the PowerPoint document.
If so, you can iterate over the custom properties of the PowerPoint document using the Presentation.CustomDocumentProperties API of the Syncfusion Presentation library. Refer to the UG documentation below for more details:
https://help.syncfusion.com/file-formats/presentation/working-with-powerpoint-presentation#adding-custom-document-properties
Note: This update is from the Syncfusion team.

Can I split a single reStructuredText file into multiple HTML pages by section?

I have a long reStructuredText file that I render into HTML. I'd like to split each section into a different HTML page, for greater readability. Is it possible, without splitting the source file?
No. From http://www.sphinx-doc.org/en/stable/markup/toctree.html
reST does not have facilities to…split documents into multiple output files.

Embedding image in ipython notebook for distribution

I have an ipython notebook with an embedded image from my local drive. I was expecting it to be embedded in the JSON along with the output of code cells, but when I distributed the notebook, the image did not appear to users. What is the recommended way (or ways) to embed an image in a Notebook, so that it doesn't disappear if users rerun code cells, clear cell output, etc.?
The notebook system caches images included with ![label](image.png), but they last only until the python "kernel" serving the notebook is restarted. If I rename the image file on disk, I can close and reopen the notebook and it still shows the image; but it disappears when I restart the kernel.
Edit: If I generate an image as code cell output and then export the notebook to html, the image is embedded in the html as encoded data. Surely there must be a way to hook into this functionality and load the output into a markdown (or better yet "raw nbconvert") cell?
from IPython.display import Image
Image(filename='imagename.png')
will be exported (with ipython nbconvert) to html that contains the following:
<div class="output_png output_subarea output_execute_result">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAnAAAAFgCAYAAAA...
</div>
However, even when I manually embedded this snippet into a markdown cell, I couldn't get the image to display. What am I doing wrong?
Update (2020)
Apparently, the problem has (finally!) been addressed in the newer notebook / Jupyter versions: as of 2018 (thanks for the link @Wayne), the html sanitizer will accept an embedded html image, as in <img src="data:image/png;base64,iV...> . Markdown image syntax also accepts images as embedded data, so there are two ways to do this. Details in these helpful answers:
markdown image syntax (answer by @id01)
html element syntax (in answer by @tel -- note that it works now!)
Are you happy to use an extra code cell to display the image? If so, use this:
from IPython.display import Image
Image(filename="example.png")
The output cell will have the raw image data embedded in the .ipynb file so you can share it and the image will be retained.
Note that the Image class also has a url keyword, but this will only link to the image unless you also specify embed=True (see the documentation for details). So it's safer to use the filename keyword unless you are referring to an image on a remote server.
I'm not sure if there is an easy solution if you require the image to be included in a Markdown cell, i.e. without a separate code cell to generate the embedded image data. You may be able to use the python markdown extension which allows dynamically displaying the contents of Python variables in markdown cells. However, the extension generates the markdown cells dynamically, so in order to retain the output when sharing the notebook you will need to run ipython nbconvert --to notebook original_notebook.ipynb --output preprocessed_notebook using the preprocessor pymdpreprocessor.py as mentioned in the section "Installation". The generated notebook then has the data embedded in the markdown cell as an HTML tag of the form <img src="data:image/png;base64,..."> so you can delete the corresponding code cell from preprocessed_notebook.ipynb. Unfortunately, when I tried this the contents of the <img> tag weren't actually displayed in the browser, so not sure if this is a viable solution. :-/
A different option would be to use the Image class in a code cell to generate the image as above, and then use nbconvert with a custom template to remove code input cells from the notebook. See this thread for details. However, this will strip all code cells from the converted notebook, so it may not be what you want.
The reason why the
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAnAAAAFgCAYAAAA...
tag doesn't do anything when you put it in a markdown cell is because IPython uses an HTML sanitizer (something called Google Caja) that screens out this type of tag (and many others) before it can be rendered.
The HTML sanitizer in IPython can be completely disabled by adding the following line to your custom.js file (usually located at ~/.ipython/profile_default/static/custom/custom.js):
IPython.security.sanitize_html = function (html) { return html; };
It's not a great solution though, as it does create a security risk, and it doesn't really help that much with distribution.
Postscript:
The ability to render base64 encoded strings as images != obvious security concern, so there should be a way for the Caja people to eventually allow this sort of thing through (although the related feature request ticket was first opened back in 2012, so don't hold your breath).
I figured out that replacing the image URL in the ![name](image) with a base64 URL, similar to the ones found above, can embed an image in a markdown container.
Example markdown:
![smile](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAABHNCSVQICAgIfAhkiAAAAD9JREFUGJW1jzEOADAIAqHx/1+mE4ltNXEpI3eJQknCIGsiHSLJB+aO/06PxOo/x2wBgKR2jCeEy0rOO6MDdzYQJRcVkl1NggAAAABJRU5ErkJggg==)
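If you don't want to build the base64 string by hand, something like the following should produce the markdown line for you. It is a minimal sketch: markdown_image is a name I made up, and the MIME type is guessed from the file extension with the standard mimetypes module.

```python
import base64
import mimetypes


def markdown_image(label, filename, data):
    """Build a markdown image whose target is a base64 data URI.

    `data` is the raw image content, e.g. open(filename, "rb").read().
    Hypothetical helper; paste the returned string into a markdown cell.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot guess a MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{label}](data:{mime};base64,{encoded})"
```

For example, print(markdown_image("smile", "smile.png", open("smile.png", "rb").read())) gives a line you can copy into the cell.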
If using the IPython HTML() function to output raw HTML, you can embed a linked image in base64 inside an <img> tag using the following method:
import base64
import requests
from IPython.display import HTML

def embedded_image(url):
    # Download the image and return it as a base64 data URI.
    response = requests.get(url)
    response.raise_for_status()
    uri = ("data:" +
           response.headers['Content-Type'] + ";" +
           "base64," + base64.b64encode(response.content).decode('utf-8'))
    return uri

# Here is a small example. When you export the notebook as HTML,
# the image will be embedded in the HTML file
html = f'<img src="{embedded_image("https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg")}" />'
HTML(html)
UPDATE: As pointed out by @alexis, this doesn't actually answer the question correctly: it will not allow users to re-run cells and have images persist (this solution only allows one to embed the images into exports).
As of Jupyter Notebook 5, you can attach image data to cells, and refer to them from the cell via attachment:<image-file-name>. See the menu Edit > Insert Image, or use drag and drop.
Unfortunately, when converting notebooks with attached (embedded) images to HTML, those images will not show up.
To get them into the HTML code, you can use (for instance) nbtoolbelt.
It will replace those attachment: references by data: with the image data embedded in the img tag.
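To illustrate the idea (this is a rough sketch, not nbtoolbelt's actual implementation), the replacement can be done on the notebook JSON directly. It assumes the nbformat 4 layout, where a markdown cell stores its attachments as {"name": {"mime/type": "base64 data"}}. The function name inline_attachments is hypothetical, and file names that are URL-encoded in the reference are not handled.

```python
def inline_attachments(notebook):
    """Rewrite attachment:<name> references in markdown cells as data: URIs.

    `notebook` is the dict obtained from json.load() on an .ipynb file.
    The dict is modified in place and also returned.
    """
    for cell in notebook.get("cells", []):
        attachments = cell.get("attachments")
        if cell.get("cell_type") != "markdown" or not attachments:
            continue
        source = cell["source"]
        # nbformat allows the source to be a string or a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        for name, bundle in attachments.items():
            # Use the first MIME representation stored for this attachment.
            mime, payload = next(iter(bundle.items()))
            text = text.replace(f"attachment:{name}", f"data:{mime};base64,{payload}")
        cell["source"] = text
    return notebook
```

Load the notebook with json.load, pass it through this function, write it back with json.dump, and then run nbconvert on the result.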

Alternative to HTML to PDF converter?

I've been using the Winnovative HTML to PDF converter for a few years, but I've noticed the quality can be impaired because the images etc. first have to be rendered in HTML before being converted into PDF format.
Winnovative have another option where you can add objects to the PDF Converter before outputting the result, but as this allows you to add HTML elements, I imagine this works in a similar way to the HTML to PDF converter (in terms of rendering).
Is there an alternative to this so that I can generate a PDF in my ASP.NET Web Application without it first having to be rendered as HTML?
I'm looking for the most high quality option
You can use iTextSharp library. It has an object representation of whole PDF document so it will allow you to add any elements you need without translating it from html elements. It also allows you to convert html to pdf, but of course you can do it manually instead by building PDF document from basic blocks...
If you use version 4.x, it's free to use in commercial projects (LGPL license). Version 5.x is available under the Affero General Public License, so I believe you have to buy it to use it in commercial projects, but the features I've described are available in the 4.x version.
Try http://wkhtmltopdf.org/
It's lightning fast in comparison to iTextSharp.
For step by step installation check out these articles:
http://www.megustaulises.com/2012/12/mvcnet-convert-html-to-pdf-with-pechkin.html
http://w3facility.org/question/how-to-pass-html-as-a-string-using-wkhtmltopdf/
And this manual:
http://madalgo.au.dk/~jakobt/wkhtmltoxdoc/wkhtmltopdf-0.9.9-doc.html

Display Word Document inside ASP.Net page

I want to display a Word document, which is sitting on my IIS server. I want to display the whole document as is, inside an iframe on my aspx page.
I know I can use the MS Word libraries, but I cannot install Word on the server where the application will be hosted. (Correct me if I am wrong: I cannot use just the DLLs without installing MS Word on the server.)
How can I display the word document in my iFrame?
Probably the easiest way would be to include the Google Docs Viewer.
Other ways could be to use Aspose.Words (commercial) to convert Word to PDF and then use Aspose.Pdf.Kit to convert PDF to images and then display the images online.
PowerTools for Open XML contains an open source, free implementation of a conversion from DOCX to HTML formatted with CSS. The module HtmlConverter.cs supports all paragraph, character, and table styles, fonts and text formatting, numbered and bulleted lists, images, and more. See http://bit.ly/1bclyg9

Resources