Strip MathML (convert to plain text)

I am working on a project that imports technical documents into a tracking system. A small number of the publications contain embedded HTML. This is normal and we strip out the HTML which is typically used to add formatting such as bold or italics to body text.
Now we are receiving documents containing MathML. Are there any libraries (or approaches) out there that will strip the markup and give a reasonable text equivalent? I realize that MathML allows for graphical representations, but even those have text equivalents.

To do this you would have to process the MathML and interpret it. Unlike removing HTML markup, simply stripping the tags would normally strip the meaning from the formula.
So you will need a MathML parser. Two come to mind, both by David Carlisle and both XSLT-based. pmml2tex converts to LaTeX, which is often more or less readable: your example would be rendered as \frac{a+b+c}{2\times 5}
Alternatively, pmathmlascii draws small ASCII-art representations of MathML. Your example would render as
a + b + c
---------
2 * 5
or similar.
Both stylesheets can be found on Google Code and are discussed at https://code.google.com/p/web-xslt/wiki/Overview
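If an XSLT toolchain is not an option, the same "interpret, don't strip" idea can be sketched in a few lines of Python. This is only a rough illustration, not how pmml2tex or pmathmlascii work: the function names (to_text, group, local) are made up for this example, and it handles just a handful of presentation MathML elements (mi, mn, mo, mtext, mrow, mfrac, msup, msub, msqrt). Anything else is flattened by joining its children.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def group(text):
    """Parenthesise anything that is not a single alphanumeric token."""
    return text if text.isalnum() else f"({text})"


def to_text(node):
    """Convert a small subset of presentation MathML to linear plain text."""
    name = local(node.tag)
    kids = [to_text(child) for child in node]
    if name in ("mi", "mn", "mo", "mtext"):
        return (node.text or "").strip()
    if name == "mfrac" and len(kids) == 2:
        return f"{group(kids[0])}/{group(kids[1])}"
    if name == "msup" and len(kids) == 2:
        return f"{group(kids[0])}^{group(kids[1])}"
    if name == "msub" and len(kids) == 2:
        return f"{group(kids[0])}_{group(kids[1])}"
    if name == "msqrt":
        return f"sqrt({' '.join(kids)})"
    # math, mrow, and anything unrecognised: join the children in order
    return " ".join(k for k in kids if k)


# Hypothetical input matching the fraction discussed above
sample = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mfrac>
    <mrow><mi>a</mi><mo>+</mo><mi>b</mi><mo>+</mo><mi>c</mi></mrow>
    <mrow><mn>2</mn><mo>&#xD7;</mo><mn>5</mn></mrow>
  </mfrac>
</math>"""

print(to_text(ET.fromstring(sample)))  # (a + b + c)/(2 × 5)
```

Merely stripping the tags from the same input would give something like "a+b+c2×5", which is why the structure has to be interpreted rather than discarded.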

Related

How to preserve white space at the start of a line in .Rd documentation?

I need to indent some math stuff in the \details section of my .Rd documentation to enhance its readability. I am using mathjaxr. Is there any way to indent without installing roxygen2 or similar?
The math stuff is inline, so simply setting to display using \mjdeqn won't solve this.
I seem to have a reasonable "cheating" workaround for indenting the first line using mathjaxr, at least for the PDF and HTML output.
We need to do two things:
1. Use the MathJax/LaTeX \phantom command. \phantom works by making a box of the size necessary to typeset whatever its argument is, but without actually typesetting anything in the box. For my purposes, if I want to indent, say, about 2 characters wide, I would start the line with \mjeqn{\phantom{22}}{ } and follow it with my actual text, possibly including actual mathy bits. If I want an indent of, say, roughly 4 characters wide, I might use \mjeqn{\phantom{2222}}{ }.
2. Because mathjaxr has a problem with tacking on unsolicited new lines when a line starts with \mjeqn, we need to prefix the use of \phantom in 1 above with an empty bit of something non-mathjaxr-ish like \emph{}.
Putting it all together, I can indent by about 2 characters using something like this:
\emph{}\mjeqn{\phantom{22}}{ }Here beginneth mine indented line…
I need to explore whether the { } business actually indents for ASCII output, or whether I might accomplish that using or some such.

paste0 regular and italicized text in R

I need to concatenate two strings within an R object: one is just regular text; the other is italicized. So, I tried a lot of combinations, e.g.
paste0(" This is Regular", italic( This is Italics))
The desired result should be:
This is Regular This is Italics
Any idea how to do it?
Thanks!
In plot labels, you can use expressions (see mathematical annotation):
plot(1,xlab=expression("This is regular"~italic("this is italic")))
To provide a string that an HTML parser will recognise as needing to be rendered in italics, wrap the text in <i> and </i>. For example: "This is plain text, but <i>this is in Italics</i>.".
However, most HTML processors will assume that you want your text to appear as-is and will escape their input by default. This means that the special meanings of certain characters, including < and >, will be "turned off". You need to tell the processor not to do this. How you do that will depend on context. I can't tell you that because you haven't given me context.
Are you, for example, writing to a raw HTML file? (You need do nothing.) Are you writing to a Markdown file? If so, how? In plain text or in a rendered chunk? Are you writing a caption to a graphic? (Waldi has suggested a solution.) Etc., etc.
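To make the "escaping" point concrete, here is a small sketch of what default escaping does to such a string. It uses Python's standard html module purely as a stand-in for whatever processor sits between your string and the browser; your R toolchain will have its own mechanism and its own switch for turning it off.

```python
import html

label = "This is plain text, but <i>this is in Italics</i>."

# What an escaping processor hands to the browser: the tags become literal text.
escaped = html.escape(label)
print(escaped)
# This is plain text, but &lt;i&gt;this is in Italics&lt;/i&gt;.

# Undoing the escaping recovers the original markup.
print(html.unescape(escaped) == label)  # True
```

If the escaped form is what ends up in your output, you will see the literal characters <i> on the page instead of italic text.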

Create new emphasis command R Markdown

In R Markdown, to make a text bold, we just need to do:
**code**
Then the word code shows in bold.
I was wondering if there is a way to create a new command, let's say:
***code***
That would make the text highlighted?
Thanks!
It is not easily possible to create new markup, but one can change the way existing markup commands are rendered. Text enclosed by three stars is interpreted as emphasized strong emphasis. So one has to change that interpretation and change it to something else. One way to do so is via pandoc Lua filters. We just have to match on pandoc's internal representation of emphasized strong text and convert it to whatever we want:
function Strong (strong)
  -- if this contains only one element, and if that element
  -- is emphasized text, convert it to highlighted text.
  local element = #strong.content == 1 and strong.content[1]
  if element and element.t == 'Emph' then
    table.insert(element.content, 1, pandoc.RawInline('html', '<mark>'))
    table.insert(element.content, pandoc.RawInline('html', '</mark>'))
    return element.content
  end
end
The above works for HTML output. One would have to define what "highlighted text" means for each targeted format.
See this and this question for other approaches to the problem, and for details of how to use the filter with R Markdown.

emphasis and not emphasis in the same word

In ReStructuredText, is it possible to have emphasis and no emphasis in the same word? For example:
*emph*not-emph
leading to "emph no-emph", but with no white space in between? I can't find a way to do it, not even with a substitution.
What you are looking for is Character-Level Inline Markup. The description from the reStructuredText specification is (emphasis mine):
It is possible to mark up individual characters within a word with backslash escapes [...] Backslash escapes can be used to allow arbitrary text to immediately follow inline markup.
The two examples provided in the specification are:
For a single character immediately following inline markup:
Python ``list``\s use square bracket syntax.
For arbitrary text immediately following inline markup:
Possible in *re*\ ``Structured``\ *Text*, though not encouraged.
So to achieve the output you want, you need to use the backslash-escaped whitespace pattern:
*emph*\ not-emph
This is required because the inline markup recognition rules state that:
Inline markup end-strings must end a text block or be immediately followed by
whitespace,
one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } > or
a non-ASCII punctuation character with Unicode category Pd (Dash), Po (Other), Pe (Close), Pf (Final quote), or Pi (Initial quote).
Note that the use of that pattern above is discouraged in the reStructuredText specification:
The use of backslash-escapes for character-level inline markup is not encouraged. Such use is ugly and detrimental to the unprocessed document's readability. Please use this feature sparingly and only where absolutely necessary.
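The end-string rule quoted above can be sketched as a small Python check. This is a simplified reading of that one rule only, not the docutils implementation, and the function name may_follow_end_string is invented for the example. It shows why *emph*not-emph is not recognised as emphasis while *emph*\ not-emph is: the backslash is one of the characters allowed to follow an end-string.

```python
import unicodedata

# ASCII characters allowed directly after an inline-markup end-string
ASCII_FOLLOWERS = set("-.,:;!?\\/'\")]}>")
# Unicode categories allowed for non-ASCII punctuation
PUNCT_CATEGORIES = {"Pd", "Po", "Pe", "Pf", "Pi"}


def may_follow_end_string(next_char):
    """Return True if next_char may directly follow an end-string.

    Pass "" to represent the end of a text block.
    """
    if next_char == "" or next_char.isspace():
        return True
    if next_char in ASCII_FOLLOWERS:
        return True
    return ord(next_char) > 127 and unicodedata.category(next_char) in PUNCT_CATEGORIES


print(may_follow_end_string("n"))   # False: "*emph*not-emph" stays literal
print(may_follow_end_string("\\"))  # True:  "*emph*\ not-emph" is emphasis
```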

IDML : What are Kinsoku/Mojikumi tables?

I am new to the world of Adobe InDesign and IDML file format. I am trying to understand the IDML file format so that I can create IDML files dynamically through code!
I am going through the IDML File format specification and have found references to "Mojikumi Tables" and "Kinsoku Tables" and "Aki". Though the documentation defines various attributes for these elements, there's no clear explanation what these elements actually are.
Any pointers or links to relevant articles would be really helpful.
Thanks.
These are all additional typography settings used in laying out Japanese text.
Kinsoku: A rule set in the Japanese language that is used to determine characters that are not permitted at the beginning or end of a line. Reference.
Mojikumi: Determines spacing between punctuation, symbols, numbers, and other character classes in Japanese type. Reference.
Aki: Means space in Japanese:
"When the glyphs that correspond to characters of different character
classes come together in a run of text, there is spacing behaviour. In
other words, extra space, measured using a fraction of an em, is
introduced depending on which two character classes are in proximity*.
Typical values are one-fourth and one-half of an em"
(Footnote: * 'In Japanese this space is referred to as aki, which simply means
"space"')
Reference and source for this quote.
Here's a link to a book that should provide more information: CJKV Information Processing, 2nd Edition
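As a rough illustration of what a kinsoku rule set does, here is a toy line breaker in Python. Everything in it is an assumption made for the example: the two character sets are tiny illustrative subsets, not the tables InDesign ships, and break_lines is a made-up helper that knows nothing about IDML. It breaks text at a fixed width and pulls the break point back whenever the next line would start with a forbidden character or the current line would end with one.

```python
# Illustrative subsets only; real kinsoku tables are configurable and much longer.
NOT_AT_LINE_START = set("、。，．）」』】！？")
NOT_AT_LINE_END = set("（「『【")


def break_lines(text, width):
    """Greedy fixed-width line breaking with two simple kinsoku checks."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Pull the break back while it would violate either rule. Stop at one
        # character per line so the loop always makes progress.
        while end > start + 1 and (
            text[end] in NOT_AT_LINE_START or text[end - 1] in NOT_AT_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


for line in break_lines("今日は「晴れ」です。明日は雨。", 4):
    print(line)
# 今日は
# 「晴れ」
# です。明
# 日は雨。
```

Without the check, the first line would be 今日は「, ending on an opening bracket, which is exactly what a kinsoku table is there to prevent.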
