pdftotext get font information (font-family, style, size) - text-extraction

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML.
Here's a sample line from the output:
<word xMin="351.852025" yMin="42.548936" xMax="365.689478"
yMax="47.681498">foo</word>
Is there a way to get font information for every word like:
font family, e.g. Verdana
style, i.e. none, bold, italic
size, e.g. font size 9
I'm interested in knowing whether either the Poppler or the Xpdf version of pdftotext can do this.
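The -bbox output above only carries coordinates, so the font attributes would have to come from a different extractor. As a rough sketch of the kind of per-character font data other tools can expose (this uses Python's pdfminer.six rather than pdftotext, and file.pdf is just the sample name from the question):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

# Print the font name and size reported for each extracted character.
for page_layout in extract_pages("file.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for char in line:
                if isinstance(char, LTChar):
                    # e.g. "ABCDEF+Verdana-Bold" plus a size in points
                    print(char.get_text(), char.fontname, round(char.size, 1))

Note that embedded subset fonts usually report names like ABCDEF+Verdana-Bold, so family and style still have to be parsed out of the reported font name.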

Related

How to subset a Google icon font for optimal download size?

Google's documentation contains instructions on how to subset a font, and even specifies:
This feature also works for international fonts, allowing you to specify UTF-8 characters. For example, ¡Hola! is represented as: https://fonts.googleapis.com/css?family=Inconsolata&text=%c2%a1Hola!.
Each symbol in the Google Material Icons/Symbols fonts has a codepoint, as specified in this SO question.
Is there a way to pass the codepoint values as the value of the &text URL parameter, so that I can download a subset of the Material Symbols font containing all the icons that I want to use in my app? For example, the search icon has the codepoint e8b6. I tried several incantations, including &text=e8b6 and &text=%e8%b6, but they didn't download the search icon. Then I found a converter from codepoint to UTF-8, which converted e8b6 to some non-printable characters (^H^F). I then passed &text=^H^F and it downloaded the search icon, but it didn't subset the font, i.e. it downloaded the whole font file rather than a subset of it.
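Judging from the %c2%a1 example in Google's own instructions, the &text parameter takes the percent-encoded UTF-8 bytes of the actual characters, not their hex codepoints, so e8b6 would need to be sent as %EE%A2%B6. A minimal Python sketch of that conversion (whether the API then really serves a subset for private-use-area icons is exactly the open question here):

from urllib.parse import quote

codepoint = 0xE8B6          # codepoint of the 'search' icon, from the question
char = chr(codepoint)       # the private-use-area character itself
print(quote(char))          # -> %EE%A2%B6, i.e. the value to append as &text=%EE%A2%B6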

How to declare math characters inside pandoc code block

Julia programs allow various math characters. But when I tried to convert the following character inside a code block of my markdown file using pandoc
∇
I got errors like
[WARNING] Missing character: There is no ∇ in font [SourceCodePro-Regular.otf]/OT:script=
It seems that this problem could be solved by using a different font. I am wondering which fonts are available and how to introduce them into my markdown file.
If you are generating the PDF with XeLaTeX, which I think is the case, then you have access to all fonts installed on your system. Two popular fonts that have this character are Fira Code and Noto Sans Mono. You can make use of those fonts with pandoc by adding
---
monofont: Fira Code
---
to your document's metadata, but you may have to install those first.

R does not show correctly the "NewCenturySchoolbook" font type for PDF output

Why does ggplot2 fail to render the NewCenturySchoolbook font correctly in PDF output, even though it is one of the default font families given in many R examples online?
It works fine for png and svg output files.
EDIT: related problem and solution ggplot embedded fonts in pdf
You may want to double-check that LaTeX is installed and that it is the default engine used to generate the PDF; you can also add a LaTeX package that includes the font.
http://web.eecs.utk.edu/~mgates3/docs/latex-fonts.pdf

Figuring out encodings of PDFs in R

I am currently scraping some text data from several PDFs using the readPDF() function in the tm package. This all works very well, and in most cases the encoding seems to be "latin1" - in some, however, it is not. Is there a good way in R to check character encodings? I found the functions is.utf8() and is.local() in the tau package, but that obviously only gets me so far.
Thanks.
The PDF specification defines these encodings for simple fonts (each of which can include a maximum of 256 character shapes) used for Latin text; they should be predefined in any conforming reader:
/StandardEncoding
(for Type 1 Latin text fonts, but not for TrueType fonts)
/MacRomanEncoding
(the Mac OS standard encoding, for both TrueType and Type1 fonts)
/PDFDocEncoding
(only used for text strings outside of the document's content streams; normally not used to show text from fonts)
/WinAnsiEncoding
(Windows code page 1252 encoding, for both TrueType and Type1 fonts)
/MacExpertEncoding
(the name is misleading -- the encoding is not platform-specific; however, only a few fonts have an appropriate character set to use this encoding)
Then there are two specific encodings for symbol fonts:
Symbol Encoding
ZapfDingbats Encoding
Also, fonts can have built-in encodings, which may deviate in any way their creator wanted from a standard encoding (e.g. a differences encoding is also used when embedded standard fonts are subsetted).
So in order to correctly interpret a PDF file, you'll have to look up the encoding of each font used, and you must also take into account any /Encoding that uses a /Differences array.
However, the overall task is still quite simple for simple fonts. The PDF viewer program just needs a 1:1 mapping from "each byte of the sequence I see that's meant to represent a text string" to "exactly one glyph for me to draw, which I can look up in the encoding table".
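To make that concrete, here is a toy Python sketch of the simple-font case: a base encoding table, optionally patched by a /Differences-style override, sends each byte of the string straight to one glyph name (all the codes and glyph names below are made-up examples):

# Toy model of a simple font's encoding: byte codes -> glyph names.
base_encoding = {0x41: "A", 0x42: "B", 0x66: "f", 0x6F: "o"}

# A /Differences array remaps individual codes away from the base encoding.
differences = {0x42: "bullet"}
encoding = {**base_encoding, **differences}

def glyphs_for(string_bytes):
    # Simple fonts: exactly one byte selects exactly one glyph.
    return [encoding.get(code, ".notdef") for code in string_bytes]

print(glyphs_for(b"\x66\x6f\x6f"))   # ['f', 'o', 'o']
print(glyphs_for(b"\x41\x42"))       # ['A', 'bullet']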
For composite, CID-keyed fonts (which may contain many thousands of character shapes), the viewer program's lookup/mapping from "this is the sequence of bytes I see that I'm supposed to draw as text" to "this is the sequence of glyph shapes to draw" is no longer 1:1. Here, a sequence of one or more bytes needs to be decoded to select each glyph from the CIDFont.
To support this CIDFont decoding, CMap structures are needed. CMaps define mappings from Unicode encodings to character collections. The PDF specification defines at least five dozen CMaps -- and their standard names -- for Chinese, Japanese and Korean fonts. These predefined CMaps need not be embedded in the PDF (but a conforming PDF reader needs to know how to handle them correctly). There are (of course) also custom CMaps, which may have been generated 'on the fly' when the PDF-creating application wrote out the PDF; in that case the CMap needs to be embedded in the PDF file.
All the details about these complexities are laid down in the official PDF-1.7 specification.
I don't know much about R, but I have now poked a bit at CRAN to see what the mentioned tm and tau packages are.
So tm is for text mining, and for PDF reading it requires and relies on the pdftotext utility from Poppler. At first I had the (obviously wrong) impression that the readPDF() function you mentioned was doing some low-level, library-based access to PDF objects directly in the PDF file... How wrong I was! It turns out it 'only' looks at the text output of the pdftotext command-line tool.
Now this explains why you'll probably not succeed in reading any of the PDFs that use more complex font encodings than 'simple' Latin1.
I'm afraid the reason for your problem is that Poppler and pdftotext are currently simply not able to handle these.
Maybe you're better off filing a feature request with the tm maintainers :-) asking them to add support to the tm package for a more capable third-party PDF text extraction tool, such as PDFlib.com's TET (English version), which is for sure the best text extraction utility on the planet (better than Adobe's own tools, BTW).
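For the narrower question of checking what encoding a chunk of extracted text is in, one crude but common trick (sketched here in Python rather than R) is to attempt strict decodings in order of preference; UTF-8 rejects invalid byte sequences, while Latin-1 accepts anything and so only works as a last resort:

def guess_encoding(raw_bytes, candidates=("utf-8", "latin1")):
    # Return the first candidate encoding that decodes the bytes without error.
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("héllo".encode("utf-8")))    # utf-8
print(guess_encoding("héllo".encode("latin1")))   # latin1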

Programmatically setting .eps files to CMYK (as opposed to RGB)

We have an app where we need to export vector files from Inkscape to .eps for printing. While this works fine, our printers have complained that they are receiving the files in RGB mode instead of CMYK.
Is there a way to programmatically set .eps files to CMYK document color mode?
Color mode isn't a setting like, say, DPI. Converting from RGB to CMYK for printing is a complex process, often involving color spaces, halftoning, and other nontrivial algorithms.
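To see why it isn't a simple flag, here is the naive device-level arithmetic in Python; it ignores ICC profiles, ink limits and halftoning, which is precisely the machinery a real print conversion needs, so treat it as an illustration rather than a usable conversion:

def naive_rgb_to_cmyk(r, g, b):
    # r, g, b in 0..1; device-naive formula, no color management involved.
    k = 1.0 - max(r, g, b)
    if k == 1.0:                         # pure black
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

print(naive_rgb_to_cmyk(1.0, 0.0, 0.0))  # red -> (0.0, 1.0, 1.0, 0.0)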
