PDFtoTEXT not converting UTF-8 encoded text completely, especially the accented characters - unix

I am working on a project that requires converting PDF to text. The PDF contains Hindi fonts (Mangal, to be specific) along with English.
100% of the English is converted to text. The conversion of the Hindi part is around 95%; the remaining 5% of the Hindi text either comes out blank or as stray marks like " ा". I have figured out that the accented characters are not being converted to text properly.
I am using the following command:
pdftotext -enc UTF-8 pdfname.pdf textname.txt
The PDF uses the following fonts:
name, type, emb, sub, uni
ZDPKEY+Mangal, CID TrueType, yes, yes, yes
Mangal, TrueType, no, no, no
Helvetica-Bold, Type 1, no, no, no
CODUBM+Mangal-Bold, CID TrueType, yes, yes, yes
Mangal-Bold, TrueType, no, no, no
Times-Roman, Type 1, no, no, no
Helvetica, Type 1, no, no, no
The following is the result of the conversion. The left side is the original PDF; the right side is the text opened in Notepad:
http://preview.tinyurl.com/qbxud9o
My question is whether the 5% missing/junk characters can be correctly captured in text with open-source packages. I would appreciate your inputs!

Change your command to:
pdftotext -enc "UTF-8" pdfname.pdf textname.txt
It has worked for me, so it should work for you as well.
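If quoting the encoding does not change anything on your side, it is worth checking which of the fonts actually carry a ToUnicode map, because pdftotext depends on that map to turn glyph IDs back into characters; fonts listed with "no" in the uni column (and, for Devanagari, incomplete maps for conjuncts and matras) are the usual reason characters come out blank or as stray marks. A minimal sketch with poppler-utils, reusing the file names from the question:
pdffonts pdfname.pdf                            # the uni column shows which fonts carry a ToUnicode CMap
pdftotext -enc UTF-8 pdfname.pdf textname.txt   # same extraction, with UTF-8 output
If the affected text comes from a font without a usable ToUnicode map, no extractor can recover it from the text layer alone; running OCR over those pages (for example tesseract with the hin language data) is the usual open-source fallback.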

Related

paste0 regular and italicized text in R

I need to concatenate two strings within an R object: one is just regular text; the other is italicized. So, I tried a lot of combinations, e.g.
paste0(" This is Regular", italic( This is Italics))
The desired result should be:
This is Regular This is Italics
Any idea on how to do it?
Thanks!
In plot labels, you can use expressions; see the help page on mathematical annotation (?plotmath):
plot(1,xlab=expression("This is regular"~italic("this is italic")))
To provide a string for which an HTML parser will recognise the need to render the text in italics, wrap the text in <i> and </i>. For example: "This is plain text, but <i>this is in Italics</i>.".
However, most HTML processors will assume that you want your text to appear as-is and will escape their input by default. This means that the special meanings of certain characters, including < and >, will be "turned off". You need to tell the processor not to do this. How you do that depends on context, and I can't tell you that because you haven't given any context.
Are you, for example, writing to a raw HTML file? (You need do nothing.) Are you writing to a Markdown file? If so, how: in plain text or in a rendered chunk? Are you writing a caption for a graphic? (Waldi has suggested a solution.) Etc, etc....

How can I fix Turkish characters with font-face?

I am using the Chunk Five font on my web site via font-face in CSS. When I use the raw font in Photoshop, the Turkish characters are there, but when I convert it for font-face, the Turkish characters are not displayed. I have attached a screenshot showing the problem.
I've tried converting different types of font faces. I tried converting with subsetting support and checked the Turkish field; I also entered ş,Ş,İ,ı,ğ,Ğ,ü,Ü,Ç,ç,Ö,ö in the Single Characters field of the converter. Unfortunately, it has not worked for me. How can I fix this problem?
Thanks and Regards.
You can try converting the fonts using other sources, and that should fix it. I know this is an old post, but maybe it helps you.
Some sources:
http://convertfonts.com/
https://www.web-font-generator.com/
http://www.flaticon.com/font-face
https://fontie.flowyapps.com/home
I have experienced the exact same problem as you when dealing with font conversions for Turkish. First, I had run a conversion using FontSquirrel's tool (available here), but it turns out the conversion was stripping these much-needed characters for the Turkish language.
One of the references from @Karmacoma's answer was very interesting and did the trick for me (Fontie), because it offers advanced options, which give more control over the conversion process.
In order to cover the special characters in Turkish, you must use "Switch to advanced view" and run the conversion with Latin Extended-A included.
I went to Wikipedia for a list of characters covered in Latin Extended-A and you can find them here.
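If you prefer to run the subsetting yourself rather than through a web converter, the same idea works on the command line with pyftsubset from fonttools. A rough sketch, assuming the font file is named chunkfive.ttf; it keeps Basic Latin plus Latin-1 Supplement (for ü, ç, ö) and Latin Extended-A (for ş, ı, ğ, İ):
pip install fonttools brotli   # brotli is only needed for woff2 output
pyftsubset chunkfive.ttf \
  --unicodes="U+0020-007E,U+00A0-00FF,U+0100-017F" \
  --flavor=woff2 \
  --output-file=chunkfive-subset.woff2
Whichever tool you use, the key point is the same: the subset must include the Latin Extended-A block, otherwise ş, ı, ğ and İ are stripped out.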

Handle UTF-8 characters in unix

I was trying to find a solution to my problem, and after looking through the forums I couldn't, so I'll explain my problem here.
We receive a CSV file from a client that contains some special characters and is labelled as encoded in unknown-8bit. We convert this CSV file to XML using an awk script. With the XML file we make an API call to our system using UTF-8 as the default encoding. The response is an error with the following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because when the file is read as UTF-8 the "’" character turns out to be a single 0x92 byte, which is not a valid UTF-8 sequence.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first; the best option would be to ask the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII bytes, or replace them during conversion, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
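A minimal sketch of those steps with standard unix tools, assuming the Latin-1/Windows-1252 guess above turns out to be right and using a placeholder file name:
file --mime-encoding input.csv                                        # step 1: let file guess the encoding
iconv -f WINDOWS-1252 -t UTF-8 input.csv > input.utf8.csv             # step 2: convert to UTF-8 before the awk script runs
iconv -f WINDOWS-1252 -t ASCII//TRANSLIT input.csv > input.ascii.csv  # or, if the loss is acceptable, force plain ASCII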
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know: it is whatever your current set-up is using). I would guess Windows-1252, which uses 0x92 for a curvy right single quote.

DataMatrix barcode with non-Latin characters

I need to create DataMatrix barcodes which may contain non-Latin characters. I have code which creates the barcodes correctly when they only consist of Latin characters; when I run the same code with non-Latin (Hebrew or Russian) characters, however, although the code runs to completion and the barcode is created, the non-Latin characters are not deciphered by the barcode reader.
Any assistance or ideas would be greatly appreciated!
Your issue is related to the character encoding used prior to generating the barcode. The encoding used by the generator to encode must match the encoding used by the reader to decode.
Possible encodings are:
Extended Channel Interpretations (ECI) is supported by DataMatrix and other 2D barcode standards. The generator places an ECI identifying code inside the barcode data, so the reader knows to use ECI to correctly convert the data back to text.
UTF-8 encodes pretty much any language.
Code page is an older encoding, but if your generator uses it, you can use code page 1255 for Hebrew or 1251 for Russian. See this SO answer for more info.
To test your encoding, try Inlite's Online Barcode Reader (OBR) which should read the correct text for ECI and UTF-8 encoded barcodes. If it does, the problem is with your barcode reader which is not decoding correctly.
If OBR returns binary data, either your generator uses code page or does not encode correctly at all. Try another generator that supports ECI or UTF-8.
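As a quick, barcode-independent illustration of why the two sides must agree, compare the raw bytes the same Hebrew word produces under UTF-8 and under code page 1255 (this assumes a UTF-8 terminal):
printf 'שלום' | xxd                              # UTF-8: two bytes per letter (d7 a9 d7 9c ...)
printf 'שלום' | iconv -f UTF-8 -t CP1255 | xxd   # code page 1255: one byte per letter (f9 ec e5 ed)
A reader configured for one of these will decode data generated with the other as gibberish, which is exactly what happens with a mismatched barcode generator and reader.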

Hex code loses its format

Can somebody tell me what the problem is?
I want to print a hex file on the "Epson TM-T88II" printer, but all my umlauts (ä/ö/ü/ß) lose their form and come out as black dots, etc.
My hex code for "ü" is "FC". Is that wrong?
0xFC is "ü" in many encodings, but not all. If your printer does not know what encoding you're using, it may well not understand what you want to print.
Exactly what are you sending to the printer? And what application are you printing from?
The following may also be of interest.
http://www.joelonsoftware.com/articles/Unicode.html
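For what it's worth, 0xFC is "ü" in ISO-8859-1/Windows-1252, but receipt printers such as the TM-T88 series usually default to an IBM code page like 437, where "ü" is 0x81 instead, and the active character table is selected with an ESC/POS command (ESC t n); the exact values are in the printer manual, so treat these numbers as illustrative. A quick way to see the mismatch on a unix box:
printf '\xFC' | iconv -f ISO-8859-1 -t UTF-8      # prints ü: 0xFC is valid Latin-1
printf '\xFC' | iconv -f UTF-8 -t ISO-8859-1      # fails: 0xFC on its own is not valid UTF-8
printf 'ü' | iconv -f UTF-8 -t CP437 | xxd        # shows 81, the byte a code page 437 printer expects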
