I am currently scraping text data from several PDFs using the readPDF() function in the tm package. This all works very well, and in most cases the encoding seems to be "latin1"; in some, however, it is not. Is there a good way in R to check character encodings? I found the functions is.utf8() and is.local() in the tau package, but that obviously only gets me so far.
Thanks.
The PDF specification defines these encodings for simple fonts (each of which can include a maximum of 256 character shapes) for Latin text; they should be predefined in any conforming reader:
/StandardEncoding
(for Type 1 Latin text fonts, but not for TrueType fonts)
/MacRomanEncoding
(the Mac OS standard encoding, for both TrueType and Type1 fonts)
/PDFDocEncoding
(only used for text strings outside of the document's content streams; normally not used to show text from fonts)
/WinAnsiEncoding
(Windows code page 1252 encoding, for both TrueType and Type1 fonts)
/MacExpertEncoding
(name is misleading -- the encoding is not platform-specific; however, only a few fonts have an appropriate character set to use this encoding)
Then there are two specific encodings for symbol fonts:
Symbol Encoding
ZapfDingbats Encoding
Also, fonts can have built-in encodings, which may deviate from a standard encoding in any way their creator wanted (this is also used, for example, for differences encodings when embedded standard fonts are subsetted).
So in order to correctly interpret a PDF file, you'll have to look up the encoding of each font used, and you must also take into account any /Encoding that uses a /Differences array.
However, the overall task is still quite simple for simple fonts. The PDF viewer program just needs to map 1:1 "each one of a sequence of bytes I see that's meant to represent a text string" to "exactly one glyph for me to draw, which I can look up in the encoding table".
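As a rough illustration of that 1:1 lookup, here is a minimal Python sketch. It assumes Python's cp1252 and mac_roman codecs are acceptable stand-ins for /WinAnsiEncoding and /MacRomanEncoding (they are close, but not guaranteed identical to the tables in the PDF specification). The /Differences array is modelled as a plain dict from byte value to character, whereas real PDFs use glyph names. build_table and decode_simple are hypothetical helper names, not part of any PDF library.

```python
def build_table(base_codec, differences=None):
    """Build a 256-entry byte -> character table for a simple font.

    base_codec stands in for the font's base encoding; differences is a
    simplified /Differences overlay (byte value -> replacement character).
    """
    table = {}
    for code in range(256):
        try:
            table[code] = bytes([code]).decode(base_codec)
        except UnicodeDecodeError:
            table[code] = None  # byte has no glyph in this encoding
    table.update(differences or {})
    return table


def decode_simple(data, table):
    """Map every byte to exactly one character (U+FFFD if undefined)."""
    return "".join(table[b] or "\ufffd" for b in data)


win = build_table("cp1252")
mac = build_table("mac_roman")

# The same word needs different bytes under the two encodings.
print(decode_simple(b"caf\xe9", win))
print(decode_simple(b"caf\x8e", mac))

# A made-up font that remaps byte 0x41 to a Greek alpha.
custom = build_table("cp1252", {0x41: "\u03b1"})
print(decode_simple(b"A", custom))
```

The point of the sketch is only that the bytes mean nothing without the font's own table: decoding the first byte string with the second table gives a different, wrong character.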
For composite, CID-keyed fonts (which may contain many thousands of character shapes), the viewer program's mapping from "this is the sequence of bytes I see that I'm supposed to draw as text" to "this is the sequence of glyph shapes to draw" is no longer 1:1. Here, a sequence of one or more bytes needs to be decoded to select each glyph from the CIDFont.
To support this CIDFont decoding, CMap structures are needed. A CMap maps character codes (for example, codes in a Unicode encoding) to the CIDs of a character collection. The PDF specification defines at least five dozen CMaps, with standard names, for Chinese, Japanese and Korean fonts. These predefined CMaps need not be embedded in the PDF (but a conforming PDF reader needs to know how to handle them correctly). There are (of course) also custom CMaps, which may have been generated 'on the fly' when the PDF-creating application wrote out the PDF. In that case the CMap needs to be embedded in the PDF file.
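To make the "one or more bytes per glyph" idea concrete, here is a toy Python sketch. Everything in it is made up for illustration: the codespace rule (bytes below 0x80 are one-byte codes, anything else starts a two-byte code), the CID numbers, and the names TOY_CMAP, split_codes and to_cids. Real CMaps declare their codespace ranges explicitly and are far larger.

```python
# Toy CMap: character code (as bytes) -> CID. The CID values are invented.
TOY_CMAP = {
    b"A": 34,
    b"\x88\x9f": 1125,
}


def split_codes(data):
    """Split a byte string into character codes of one or two bytes.

    Assumed codespace: a byte below 0x80 is a complete one-byte code,
    any other byte is the first half of a two-byte code.
    """
    codes = []
    i = 0
    while i < len(data):
        width = 1 if data[i] < 0x80 else 2
        codes.append(data[i:i + width])
        i += width
    return codes


def to_cids(data, cmap, notdef=0):
    """Look up each code in the CMap; unknown codes get the notdef value."""
    return [cmap.get(code, notdef) for code in split_codes(data)]


print(split_codes(b"A\x88\x9fA"))
print(to_cids(b"A\x88\x9fA", TOY_CMAP))
```

Four input bytes yield three glyph selectors here, which is why byte-oriented assumptions such as "it is all latin1" break down for these fonts.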
All details about these complexities are laid out in the official PDF-1.7 specification.
I don't know much about R. But I have now poked a bit at CRAN, to see what the mentioned tm and tau packages are.
So tm is for text mining, and for PDF reading it requires and relies on the pdftotext utility from Poppler. At first I had the [obviously wrong] impression that the readPDF() function you mention was doing some low-level, library-based access to PDF objects directly in the PDF file. How wrong I was! It turns out it 'only' looks at the text output of the pdftotext command-line tool.
Now this explains why you'll probably not succeed in reading any of the PDFs that use more complex font encodings than the 'simple' Latin1.
I'm afraid the reason for your problem is that Poppler and pdftotext are currently not yet able to handle these.
Maybe you're better off filing a feature request with the tm maintainers :-) asking them to add support in the tm package for a more capable third-party PDF text extraction tool, such as PDFlib.com's TET (English version), which in my opinion is the best text extraction utility on the planet (better than Adobe's own tools, BTW).
Related
I am working with Mitsubishi PLC files that were originally commented in Japanese but then opened on English-only computers which converted the Japanese symbols to incomprehensible latin keyboard symbol combinations such as ‰^“]€”õONŠm”F(‘€ì”Õ1).
Being able to understand these comments would greatly enhance my ability to analyze and modify these files as I am required to do so for my work. If I could translate these back to Japanese symbols (I do have the Japanese language pack installed on my windows laptop), I could then translate these with Google Translate, which I know is not perfect, but is a lot better than ##$$##&^.
Does anyone have any ideas how this could be done? I figure that Windows must have interpreted the original characters somehow, and there may be a way to interpret them back to the original symbols.
I am thinking of trying to do some kind of character translation using a script in Python or Powershell or VBA (maybe I can create a map in Excel...)
Any ideas?
I can export these comments into CSV files, so they are easy to get to and manipulate if I can figure out how.
This is an ongoing problem for me, so I am willing to put some time into a solution.
I tried re-opening the oldest version of the files on my computer with the Japanese language pack installed, with no luck.
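You can run your text through a text-to-hex converter and then feed those hex byte values through a hex-to-text converter set to the encoding the text was originally written in (for Japanese text from Windows, probably Shift-JIS). That changes the interpretation of the bytes without your system settings getting in the way.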
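The same byte-level round trip can be sketched in Python, which the question mentions as an option. This is a minimal sketch under two assumptions that the thread does not confirm: that the English machine decoded the bytes as Windows-1252 ("cp1252"), and that the original comments were written in the Windows Japanese code page ("cp932", Microsoft's Shift-JIS variant). fix_mojibake is a hypothetical helper name.

```python
def fix_mojibake(garbled, wrong="cp1252", right="cp932"):
    """Undo a wrong decoding: recover the raw bytes, then decode them properly.

    wrong: the encoding the text was mistakenly decoded with (assumed).
    right: the encoding the text was originally written in (assumed).
    """
    raw = garbled.encode(wrong)  # back to the original bytes
    # errors="replace" puts U+FFFD where bytes were lost along the way
    return raw.decode(right, errors="replace")


# Simulate the damage on a known Japanese word, then repair it.
original = "\u904b\u8ee2"  # two kanji meaning "operation"
garbled = original.encode("cp932").decode("cp1252")

print(garbled)                        # the mojibake form
print(garbled.encode("cp1252").hex()) # the hex view of the same bytes
print(fix_mojibake(garbled))          # the repaired text
```

One caveat: Windows-1252 leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). If the software dropped those bytes when it first mis-decoded the comments, the affected characters cannot be recovered this way, and the encode step raises an error if the text contains characters outside Windows-1252. For CSV exports, reading the file directly with the Japanese encoding (for example open(path, encoding="cp932")) avoids the round trip entirely, provided the bytes on disk are still intact.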
Is there any way to deal with this letter in R: Å?
In some configurations I'm able to read this letter from SQL via RODBC, but I haven't found any way to save it to CSV or TXT. It always gets converted to a plain A or Ĺ.
Also, how can I read this letter correctly from an Excel file?
I understand from your question that the letter displays properly inside R but you have problems writing it to files.
R's writing functions usually have an encoding parameter (for example, for write.csv and write.table it's called fileEncoding).
When you don't set it explicitly, the function will encode the file using the native encoding of your OS (or of your R installation), which can sometimes cause problems with special characters. What exactly goes wrong and how to fix it depends heavily on your system setup, especially if you're also interacting with databases, as you describe.
But very often, an easy fix is writing files in UTF-8 encoding, i.e.
write.csv(your_df, your_path, fileEncoding='UTF-8')
as most external programs (such as Excel) are able to automatically detect and properly read UTF-8 encoded files.
Set the fileEncoding argument on write.table to fit your needs (e.g., if your text is encoded as UTF-8, try write.table(my_tab, file = "my_tab.txt", fileEncoding = "UTF-8")).
I've scraped Japanese content from the web to conduct content analysis. Now I am preparing the text data, starting with creating a term-document matrix. The package I am using to clean and parse the text is "RMeCab". I've been told that this package requires text data to be in ANSI encoding. But my data is in UTF-8 encoding, as are the settings of RMeCab and the global setting within R itself.
Is it necessary that I change the encoding of my text files in order to run RMeCab? In that case, how do I convert the encoding of tens of thousands of separate text files quickly?
I tried encoding conversion websites, which give me some gibberish as an ANSI output. I do not understand the mechanism behind inputting something that looks like a bunch of question marks into RMeCab. If I successfully converted encoding to ANSI and my text data look like a bunch of symbols, would RMeCab still be able to read it as Japanese text?
Getting this error calling ODFWeave on my doc.
Pre-processing the contents
Sweaving content.Rnw
Error: ‘content.Rnw’ is not ASCII and does not declare an encoding
I've seen some ways you can add an encoding switch in LaTeX docs "(Sweave --encoding=utf-8)", but don't know if this can be done with odfWeave
I've worked around it before by converting the source doc back to ASCII, but ideally it would be nice if the conversion would run with whatever is in my doc (and some names, for example, require a non-ASCII charset).
We made changes to odfWeave so that it (rightly) uses a utf-8 encoding. In fact, we coerce this by using the 'encoding="UTF-8"' option to Sweave.
I guess the question is "why isn't the document utf-8"? Honestly, I don't really have a good answer for you since I don't have the document (or the results of sessionInfo()). You might be creating non-utf8 characters in the course of weaving.
One thread that might help is this:
http://r.789695.n4.nabble.com/Running-odfWeave-on-its-own-examples-odt-td4639889.html
Figuring this out appears pretty complex and I wish I had a clear-cut answer for you.
If this is OT, please tell me where to repost it.
I need to render some math equations, in real time. Where can I find a mapping of the LaTeX "english" names (like \sum ) to symbol XYZ of ABC.ttf ? I can read + render ttf's fine; I just don't know where to get the ttfs that have the math symbols and how they're indexed.
Thanks!
I did not find such a list. But I compiled a similar list by hand.
Look at my Super User question, where the AHK script provides a very partial map.
I've taken a look at the LaTeX mappings from symbol names to fonts and it's scary. LaTeX's fonts are more complicated than TTF to begin with, and the math fonts are the most complicated of the bunch. For starters, there are separate variants for the glyph depending on context: a big "\sum" in a formula is one character, but if you type "\sum" in inline math you get a different character optimized for inline formulas.
You might be better off using a Unicode database. You can download the Unicode database (as text files) from unicode.org, and there are also some libraries (IBM's ICU, I think) which let you look up symbols by name. Most of the symbols you're looking for start at code point U+2200. This gives you just the symbols, not nice-looking formulas.
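As a small illustration of looking symbols up by name, here is a sketch using Python's standard unicodedata module rather than ICU. The Unicode character names are the official ones; the LaTeX-to-name table itself is a hand-written, hypothetical starting point (as is the helper name latex_to_char), not an official mapping.

```python
import unicodedata

# Hand-picked LaTeX command -> official Unicode character name.
LATEX_TO_UNICODE_NAME = {
    r"\sum": "N-ARY SUMMATION",
    r"\int": "INTEGRAL",
    r"\forall": "FOR ALL",
    r"\infty": "INFINITY",
    r"\partial": "PARTIAL DIFFERENTIAL",
}


def latex_to_char(command):
    """Return the Unicode character for a LaTeX command in the table."""
    return unicodedata.lookup(LATEX_TO_UNICODE_NAME[command])


for command in LATEX_TO_UNICODE_NAME:
    char = latex_to_char(command)
    print(f"{command:10} U+{ord(char):04X} {char}")
```

With the code point in hand, the remaining step is to find it in the cmap table of a TTF that covers the Mathematical Operators block, which is where your existing TTF rendering code would take over.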
You might also be looking for a MathML renderer instead. This will give you nice-looking formulas (not as good as LaTeX), and has an XML interface which should be easy enough to work with.
I would look at XeTeX. It allows Unicode input and lets you use TTF fonts, so it must be doing exactly what you want to do. Or maybe you can even replace LaTeX with XeTeX in whatever you're doing: then you wouldn't need to know the hairy details.