Using CSS when converting Markdown to PDF with Pandoc - css

I'm trying out Pandoc on OS X, and results thus far are impressive. One blocking problem, however, is getting CSS styles to work on inline code samples. I'm converting from Markdown to PDF.
I have this string in my source:
* Create a simple HTML document (<span class="filename">simple.html</span>) and load it into the browser via the file system
I've also tried this:
* Create a simple HTML document (`simple.html`{.filename}) and load it into the browser via the file system
I'd like to apply the class "filename" to the enclosed text in each case, but it doesn't seem to do anything to the output. However the manual says:
Some output formats can use this information to do syntax highlighting. Currently, the only output formats that uses this information are HTML and LaTeX.
Here's my command:
pandoc \
--output ./output.pdf \
--css source/styles.css \
source/en/docs/input.md
I'm converting to PDF, which Pandoc produces internally via LaTeX. Can I get this to work? Or can I use a style defined using a LaTeX command? It doesn't have to be CSS. However, it must be a style system: it's not workable to change italic/font/colour attributes on each occasion.
I've tried sending output temporarily to HTML, and in that situation the styles are imported directly from the specific style asset. So, my stylesheet specification and span markup is correct, at least for one output format.
Addenda
A couple of afterthoughts:
The solution does not have to be Pandoc or Markdown. However, it does need to be a trivial text-based markup language that can convert reliably to PDF, as I want to store the document files on Git for easy forking and merging. I'm not keen on HTML as it is verbose, and engines to convert it aren't that great (though, admittedly, my formatting requirements are modest).
The HTML output from Pandoc is fine, so if I can find something that converts the (simple) HTML/CSS to PDF reliably, I'll be fine. Of course, Pandoc should be able to do this, but inline styles (for the background colour on code fragments) aren't rendered. This might be a faff, as I'll have to reintroduce things like page-breaks, which can be non-trivial in HTML-to-PDF converters.

"I'd like to apply the class "filename" to the enclosed text in each case, but it doesn't seem to do anything to the output."
It works for HTML. Running Pandoc interactively, ^D to see the resulting code:
$> pandoc -f markdown -t html
* Create a simple HTML document (`simple.html`{.filename}) and load it.
^D
<ul>
<li>Create a simple HTML document (<code class="filename">simple.html</code>) and load it.</li>
</ul>
It doesn't work for LaTeX if you use the .filename class. You need to use one of the known classnames:
$> pandoc -f markdown -t latex
* Create a simple HTML document (`simple.html`{.filename}) and load it.
^D
\begin{itemize}
\tightlist
\item
Create a simple HTML document (\texttt{simple.html}) and load it.
\end{itemize}
Now using one of the known classnames, like .bash, .postscript, .php, ...:
$> pandoc -f markdown -t latex
* Create a simple HTML document (`simple.html`{.bash}) and load it.
^D
\begin{itemize}
\tightlist
\item
Create a simple HTML document (\VERB|\KeywordTok{simple.html}| and
load it.
\end{itemize}
To convert HTML + CSS into PDF, you can also look into PrinceXML, which is free for non-commercial use.

I don't know LaTeX at all, but have hacked this solution using this helpful manual. First, create a style:
\definecolor{silver}{RGB}{230,230,230}
\newcommand{\inlinecodeblock}[1]{
\colorbox{silver}{
\texttt{#1}
}
}
And here's how to use it:
Some \inlinecodeblock{inline code}, and some widgets, go here
This creates a style with a background colour and a monospaced font. The margin and padding are a bit large for my preferences, but it's a very usable start.
The disadvantage is that if I wish to output to a format that supports styles proper (such as HTML) then these are lost. Also, my solution only works with LaTeX/PDF. Thus, if you can fix these issues, please add a better answer!
Addendum: I have a better approach, which is thus:
\newcommand{\inlinecodeblock}[1]{
\fboxsep 1pt
\fboxrule 0pt
\colorbox{silver}{\strut{\texttt{#1}}}
}
This avoids the problem of excess horizontal padding - I think it was the line break in the colorbox parameter that did it. I've added in strut, which keeps highlights the same height regardless of whether the text has descenders.
It's not perfect though - there's still too much horizontal margin outside the box, and a comma after a box will still orphan onto the next line. I may give up with LaTeX, and render to HTML from Pandoc, and then use wkhtmltopdf to render the final document.
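One way to keep the `.filename` class in the Markdown source and still get the `\inlinecodeblock` command in LaTeX output is a Pandoc JSON filter. The following is a minimal sketch, not a tested production filter. It assumes Pandoc's JSON AST represents inline code as `{"t": "Code", "c": [[id, classes, attrs], text]}` and raw LaTeX as `{"t": "RawInline", "c": ["latex", ...]}`, that Pandoc passes the target format as the first argument to the filter, and that `\inlinecodeblock` is defined in the LaTeX header. The file name `filename_filter.py` and the function names are hypothetical.

```python
import json
import sys

# Characters that need escaping before being placed inside a LaTeX argument.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{", "}": r"\}", "_": r"\_", "%": r"\%",
    "#": r"\#", "&": r"\&", "$": r"\$",
    "^": r"\textasciicircum{}", "~": r"\textasciitilde{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one by one."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def transform(node, target_class="filename", command="inlinecodeblock"):
    """Walk the AST and replace Code inlines carrying target_class
    with a raw LaTeX call to \\command{...}. Other nodes are copied as-is."""
    if isinstance(node, list):
        return [transform(item, target_class, command) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "Code":
            (_ident, classes, _attrs), text = node["c"]
            if target_class in classes:
                latex = "\\%s{%s}" % (command, escape_latex(text))
                return {"t": "RawInline", "c": ["latex", latex]}
        return {key: transform(value, target_class, command)
                for key, value in node.items()}
    return node


def main():
    """Filter entry point: only rewrite the document for LaTeX output,
    so HTML output keeps <code class="filename"> and the CSS still applies."""
    target_format = sys.argv[1] if len(sys.argv) > 1 else ""
    document = json.load(sys.stdin)
    if target_format == "latex":
        document = transform(document)
    json.dump(document, sys.stdout)

# When saving this as a filter script, call main() at the bottom of the file.
```

With a `main()` call added and the script made executable, it could be used along the lines of `pandoc --filter ./filename_filter.py -H header.tex -o output.pdf input.md`, where `header.tex` (also a hypothetical name) holds the `\definecolor` and `\newcommand` definitions above.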

Related

VScode markdown preview image size control - pandoc compatible?

Is there some VScode extension that allows image size control in this form?
![caption](image.png){ width=whatever }
That is the form that is used to get Pandoc to control image size in its output. In this case I'd like to use VSCode - one of my favorite tools - to compose markdown destined for docx output. I've got the Pandoc part working, I'd just like to be able to get the previewing to work better in VSCode. There might be an extension that does this, but there are zillions of markdown extensions for VSCode, so who knows.
An alternative would be custom CSS that does this, but my CSS knowledge is not sufficient to know whether this is even possible.

How can I convert markdown to PDF using CSS with a table of contents?

I tried using Pandoc and wkhtmltopdf, and wkhtmltopdf is working fine. But I am unable to get a table of contents and page numbering. Is there a way to get those using Pandoc?
Edit:
I would like to add another potential solution that has helped me: PagedMedia/Paged.js. It allows the use of Paged Media CSS, which you can use to make a table of contents.
Pipe the output to wkhtmltopdf and use its table of contents producer (pandoc --toc doesn't do page numbers):
pandoc input.md -s -t html | wkhtmltopdf toc - output.pdf
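If the conversion needs to run from a script rather than a shell, the same pipe can be sketched in Python. This is only an illustration of the command above: it assumes `pandoc` and `wkhtmltopdf` are on the PATH, and the function names `build_pipeline` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_pipeline(markdown_path, pdf_path):
    """Return the two argument lists equivalent to:
    pandoc INPUT -s -t html | wkhtmltopdf toc - OUTPUT"""
    pandoc_cmd = ["pandoc", markdown_path, "-s", "-t", "html"]
    wkhtmltopdf_cmd = ["wkhtmltopdf", "toc", "-", pdf_path]
    return pandoc_cmd, wkhtmltopdf_cmd


def run_pipeline(markdown_path, pdf_path):
    """Run pandoc and feed its standard output to wkhtmltopdf."""
    pandoc_cmd, wkhtmltopdf_cmd = build_pipeline(markdown_path, pdf_path)
    pandoc = subprocess.Popen(pandoc_cmd, stdout=subprocess.PIPE)
    # wkhtmltopdf reads the HTML from stdin because of the "-" argument.
    subprocess.run(wkhtmltopdf_cmd, stdin=pandoc.stdout, check=True)
    pandoc.stdout.close()
    if pandoc.wait() != 0:
        raise subprocess.CalledProcessError(pandoc.returncode, pandoc_cmd)
```

Calling `run_pipeline("input.md", "output.pdf")` would then mirror the one-line shell command.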

html tags in an R markdown document compiled to pdf

I'm trying to use R Markdown to create a pdf document, and I'm having problems using certain html tags. For example, the R markdown document
---
output: pdf_document
---
<pre>
code1
</pre>
<code>
code2
</code>
<pre><code>
code3
</code></pre>
compiles to give
code2
when the desired output is
code1
code2
code3
with some nice formatting for code3. But if I compile to html (output: html_document instead of output: pdf_document in the metadata), the problem is solved.
I'm compiling with TexShop on a Mac using the engine below.
#!/bin/bash
/Library/Frameworks/R.framework/Versions/Current/Resources/bin/Rscript -e "rmarkdown::render(\"$1\", encoding='UTF-8')"
I suspect that I'm not allowed to use certain html tags when I compile to a pdf, but I haven't been able to find any guidelines on this.
It is important to remember that the PDF format is not HTML and knows nothing of HTML tags. When a document is converted to PDF, each piece of the document needs to be converted to its corresponding PDF entity. Therefore, when you introduce non-standard raw HTML into your document, the converter can easily be confused.
Of course, how the converter works under the hood could have some effect on the output as well. For example, if the tool you are using converts the Markdown to HTML and then converts that HTML to PDF, then the raw HTML may have a better chance of being mapped properly. However, if the tool goes straight from a parse tree (list of tokens) to the output format, then it may not know anything about the raw HTML (unless it is also an HTML parser). The point is that using raw HTML adds another potential layer of failure when converting to PDF. My suggestion would be to avoid it if at all possible when you intend to convert to PDF (remember Markdown was originally intended to output HTML only).
As it turns out, Markdown already offers a way (or two; depending on which implementation you are using) to mark up code blocks: indented code blocks (and possibly fenced code blocks). Interestingly, the HTML they output is the same as the raw HTML that you have found to work. Perhaps that should provide a clue that the other two possibilities you tried are not valid.
In fact, the HTML Spec is pretty clear that code blocks must be wrapped in <pre><code> tags. The <pre> tag is a block level tag, so it does not need to be wrapped in any parent tags. However, the <pre> tag does not identify its contents as being "code". Therefore, it should never be assumed that it contains "code" itself. On the other hand, the <code> tag is not a block level tag. It must be wrapped by a block level tag (like <pre> or <p>...). And the <code> tag is the only tag which marks content as being "code". Therefore, the only valid way to mark up a code block in HTML is to wrap it in <pre><code> tags. As it turns out, when you do that, it works. Therefore, my conclusion is that the converter is being confused by invalid HTML and failing (as it should).
So, in conclusion, either use native Markdown methods for marking up code or, if you must use raw HTML, stick to valid HTML.

Changing the Pandoc monospace font size or style in DOCX output

When using markdown code blocks the resulting monospace font size is too large in DOCX documents.
I can adjust the font size of paragraphs by specifying a custom template.docx file, but for some reason the generated code blocks do not use a paragraph style, as opposed to most other generated output.
Is there any way to:
Make code blocks use a specific style so that I can override the style in the template.docx
Override the monospace font used in the DOCX representation of code blocks?
Updated to clarify:
I am using an external reference.docx based on a previously generated docx as described in the comments. By modifying the styles for heading1 etc I have reasonable control over the output. The problem is that generated monospace text does not use a named style, it is just "normal" with some changes. So I have no way to change it in the template unless I also change the size of all "normal" text.
Using Pandoc 1.17.2 and Word 2013, I have finally found a solution. It seems later versions of Pandoc use a linked style that is hidden by default in Word.
Step 1: Generate a custom template file using
pandoc -o template_1.17.2.docx test.md
Where test.md includes source code and all other styles you may want to modify. For example:
~~~~
this is preformatted source using style "Source Code"
~~~~
~~~ xml
<this> is preformatted source using "KeywordTok" and "NormalTok"</this>
~~~
Open template_1.17.2.docx in Word. The preformatted source is now formatted using the hidden linked style "Source Code". This style is NOT displayed in the styles preview pane by default, you can add it by configuring the styles preview pane by clicking the tiny square-with-arrow in the bottom right of the styles preview panel.
Modify this style as you wish and save the template. Then generate your document based on this template:
pandoc --reference-docx=template_1.17.2.docx -o mydoc.docx mydoc.md
You should now see the source properly formatted in mydoc.
@LinusR suggests that different source languages use different layout styles. I have added XML as an example. The formatted XML will use "KeywordTok" and "NormalTok".
Pandoc, when creating DOCX (MS Word) documents uses a reference.docx file. This has to be given on the Pandoc command line. Pandoc will then extract all default styles and formatting settings (unless they use custom names) from this reference DOCX and apply them on the generated DOCX:
pandoc -t docx -o out.docx in-markdown.txt --reference-docx=my.docx
The best way to arrive at a Pandoc-usable reference DOCX is to generate a first simple DOCX with the help of Pandoc, then take it to a Word installation, open it and change the styles to be used by you to your liking. Then save it, take it back to Pandoc and use it as a reference.
For ODT (LibreOffice/OpenOffice/OpenDocument) in addition to the reference.odt (which you can use with the --reference-odt flag), there's also a template. You can print the default template with pandoc -D odt, then modify it and use it with pandoc -o out.odt --template=modifiedTemplate.odt
Last advice: use the latest Pandoc version! (Current is 1.13.2.1. For end of this month a 1.14 is expected.) Its DOCX support improved considerably in recent releases.

Embedding image in ipython notebook for distribution

I have an ipython notebook with an embedded image from my local drive. I was expecting it to be embedded in the JSON along with the output of code cells, but when I distributed the notebook, the image did not appear to users. What is the recommended way (or ways) to embed an image in a Notebook, so that it doesn't disappear if users rerun code cells, clear cell output, etc.?
The notebook system caches images included with ![label](image.png), but they last only until the python "kernel" serving the notebook is restarted. If I rename the image file on disk, I can close and reopen the notebook and it still shows the image; but it disappears when I restart the kernel.
Edit: If I generate an image as code cell output and then export the notebook to html, the image is embedded in the html as encoded data. Surely there must be a way to hook into this functionality and load the output into a markdown (or better yet "raw nbconvert") cell?
from IPython.display import Image
Image(filename='imagename.png')
will be exported (with ipython nbconvert) to html that contains the following:
<div class="output_png output_subarea output_execute_result">
<img src="...
</div>
However, even when I manually embedded this snippet into a markdown cell, I couldn't get the image to display. What am I doing wrong?
Update (2020)
Apparently, the problem has (finally!) been addressed in the newer notebook / Jupyter versions: as of 2018 (thanks for the link @Wayne), the html sanitizer will accept an embedded html image, as in <img src="...> . Markdown image syntax also accepts images as embedded data, so there are two ways to do this. Details in these helpful answers:
markdown image syntax (answer by @id01)
html element syntax (in answer by @tel -- note that it works now!)
Are you happy to use an extra code cell to display the image? If so, use this:
from IPython.display import Image
Image(filename="example.png")
The output cell will have the raw image data embedded in the .ipynb file so you can share it and the image will be retained.
Note that the Image class also has a url keyword, but this will only link to the image unless you also specify embed=True (see the documentation for details). So it's safer to use the filename keyword unless you are referring to an image on a remote server.
I'm not sure if there is an easy solution if you require the image to be included in a Markdown cell, i.e. without a separate code cell to generate the embedded image data. You may be able to use the python markdown extension which allows dynamically displaying the contents of Python variables in markdown cells. However, the extension generates the markdown cells dynamically, so in order to retain the output when sharing the notebook you will need to run ipython nbconvert --to notebook original_notebook.ipynb --output preprocessed_notebook using the preprocessor pymdpreprocessor.py as mentioned in the section "Installation". The generated notebook then has the data embedded in the markdown cell as an HTML tag of the form <img src="data:image/png;base64,..."> so you can delete the corresponding code cell from preprocessed_notebook.ipynb. Unfortunately, when I tried this the contents of the <img> tag weren't actually displayed in the browser, so not sure if this is a viable solution. :-/
A different option would be to use the Image class in a code cell to generate the image as above, and then use nbconvert with a custom template to remove code input cells from the notebook. See this thread for details. However, this will strip all code cells from the converted notebook, so it may not be what you want.
The reason why the
<img src="...
tag doesn't do anything when you put it in a markdown cell is because IPython uses an HTML sanitizer (something called Google Caja) that screens out this type of tag (and many others) before it can be rendered.
The HTML sanitizer in IPython can be completely disabled by adding the following line to your custom.js file (usually located at ~/.ipython/profile_default/static/custom/custom.js):
IPython.security.sanitize_html = function (html) { return html; };
It's not a great solution though, as it does create a security risk, and it doesn't really help that much with distribution.
Postscript:
The ability to render base64 encoded strings as images != obvious security concern, so there should be a way for the Caja people to eventually allow this sort of thing through (although the related feature request ticket was first opened back in 2012, so don't hold your breath).
I figured out that replacing the image URL in the ![name](image) with a base64 URL, similar to the ones found above, can embed an image in a markdown container.
Example markdown:
![smile]()
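The base64 URL can be generated rather than pasted by hand. A minimal sketch using only the standard library follows; the helper name `markdown_image` and the default MIME type are assumptions made for illustration.

```python
import base64


def markdown_image(label, data, mime="image/png"):
    """Build a Markdown image whose target is a base64 data URI
    made from the raw image bytes in `data`."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{label}](data:{mime};base64,{encoded})"
```

For a real file, the bytes would come from something like `open("smile.png", "rb").read()`, and the returned string can be pasted into a markdown cell.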
If using the IPython HTML() function to output raw HTML, you can embed a linked image in base64 inside an <img> tag using the following method:
import base64
import requests
from IPython.display import HTML

def embedded_image(url):
    """Download the image at `url` and return it as a base64 data URI."""
    response = requests.get(url)
    encoded = base64.b64encode(response.content).decode('utf-8')
    return "data:" + response.headers['Content-Type'] + ";base64," + encoded

# Here is a small example. When you export the notebook as HTML,
# the image will be embedded in the HTML file
html = f'<img src="{embedded_image("https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg")}" />'
HTML(html)
UPDATE: As pointed out by @alexis, this doesn't actually answer the question: it will not allow users to re-run cells and have images persist (this solution only embeds the images into exports).
As of Jupyter Notebook 5, you can attach image data to cells, and refer to them from the cell via attachment:<image-file-name>. See the menu Edit > Insert Image, or use drag and drop.
Unfortunately, when converting notebooks with attached (embedded) images to HTML, those images will not show up.
To get them into the HTML code, you can use (for instance) nbtoolbelt.
It will replace those attachment: references by data: with the image data embedded in the img tag.
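The replacement described above can be sketched in a few lines of Python. This is not nbtoolbelt's implementation, only an illustration under the assumption that a notebook cell stores attachments as `{"name.png": {"image/png": "<base64 data>"}}` and its source as a string or a list of strings. The function name `inline_attachments` is hypothetical.

```python
import re


def inline_attachments(cell):
    """Return a copy of a notebook cell in which every
    attachment:<name> reference is replaced by a data: URI
    built from the cell's own attachments."""
    attachments = cell.get("attachments", {})
    source = cell.get("source", "")
    if isinstance(source, list):
        source = "".join(source)

    def replace(match):
        bundle = attachments.get(match.group(1))
        if not bundle:
            # Unknown attachment name: leave the reference untouched.
            return match.group(0)
        mime, data = next(iter(bundle.items()))
        return f"data:{mime};base64,{data}"

    new_source = re.sub(r"attachment:([^)\s]+)", replace, source)
    return {**cell, "source": new_source}
```

Applied to each markdown cell of a loaded `.ipynb` file (it is plain JSON), this yields sources that no longer depend on the attachment mechanism.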

Resources