Convert ISO-2022-JP format text to Japanese characters using online editors. - decode

I am looking for online encoders or decoders that can convert the text below, which arrives over SCTP, into Japanese characters.
?ISO-2022-JP?B?GyRCJUYlOSVIJWEhPCVrLWItIxsoQg==?=
Let me know if a Java library is available to test with as well.
I have tried a few online editors, but they do not seem to convert ISO-2022-JP format.
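The string above has the shape of a MIME encoded word (charset, then "B" for Base64, then the payload), although the leading "=" appears to be missing. A minimal sketch of decoding it by hand follows, assuming that reading is right; the function name decode_encoded_word is hypothetical. The payload is Base64-decoded first and the resulting bytes are then interpreted as ISO-2022-JP. In Java the same two steps can be taken with java.util.Base64 and new String(bytes, "ISO-2022-JP").

```python
import base64


def decode_encoded_word(word: str, errors: str = "replace") -> str:
    """Decode a single '=?charset?B?payload?=' encoded word (Base64 only)."""
    core = word.strip()
    # Tolerate a missing leading "=" (as in the posted sample).
    if core.startswith("="):
        core = core[1:]
    # Drop the closing "?=" marker, keeping the Base64 padding intact.
    if core.endswith("?="):
        core = core[:-2]
    charset, transfer, payload = core.lstrip("?").split("?", 2)
    if transfer.upper() != "B":
        raise ValueError(f"only Base64 ('B') words are handled, got {transfer!r}")
    raw = base64.b64decode(payload)
    # errors="replace" keeps going if a byte pair has no mapping in the charset.
    return raw.decode(charset, errors=errors)


if __name__ == "__main__":
    posted = "?ISO-2022-JP?B?GyRCJUYlOSVIJWEhPCVrLWItIxsoQg==?="
    print(decode_encoded_word(posted))
```

The decoded text should begin with the katakana テストメール ("test mail"). The last two byte pairs of this payload (0x2D62 and 0x2D23) lie in a row that plain JIS X 0208 leaves unassigned, so a strict ISO-2022-JP decoder may show replacement characters there; a Windows-flavoured variant of the charset is probably needed to recover them.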

Related

How to convert text symbols like ‰^“]€”õONŠ back to Japanese so I can google translate it to English? (Script?)

I am working with Mitsubishi PLC files that were originally commented in Japanese but then opened on English-only computers which converted the Japanese symbols to incomprehensible latin keyboard symbol combinations such as ‰^“]€”õONŠm”F(‘€ì”Õ1).
Being able to understand these comments would greatly enhance my ability to analyze and modify these files, as my work requires. If I could translate these back to Japanese symbols (I do have the Japanese language pack installed on my Windows laptop), I could then translate them with Google Translate, which I know is not perfect, but is a lot better than ##$$##&^.
Does anyone have any ideas how this could be done? I figure that Windows must have interpreted the original characters somehow, and there may be a way to interpret them back to the original symbols.
I am thinking of trying some kind of character translation using a script in Python, PowerShell or VBA (maybe I can create a map in Excel...)
Any ideas?
I can export these comments into CSV files, so they are easy to get to and manipulate if I can figure out how.
This is an ongoing problem for me so I am willing to put some time into a solution.
I tried re-opening the oldest version of the files on my computer with the Japanese language pack installed, with no luck.
You can run your text through an ASCII-to-hex converter and then through a hex-to-ASCII converter in order to change the encoding without your system settings getting in the way.
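The hex round trip amounts to recovering the original bytes and decoding them with a different table. A minimal Python sketch of that idea follows. It assumes the comments were written in Shift-JIS (Python codec cp932) and later displayed as Windows-1252; both are assumptions, and the function names are hypothetical. If the exported CSV still holds the original bytes, simply opening it with encoding="cp932" may be enough.

```python
def to_original_bytes(garbled: str) -> bytes:
    """Turn mis-decoded text back into the bytes it was read from."""
    out = bytearray()
    for ch in garbled:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # A few bytes (e.g. 0x8D, 0x8F) have no Windows-1252 character.
            # Some tools keep them as control characters U+0080..U+009F,
            # which map straight back to the same byte value via latin-1.
            # Characters above U+00FF that cp1252 lacks still raise here.
            out += ch.encode("latin-1")
    return bytes(out)


def repair(garbled: str, original_encoding: str = "cp932") -> str:
    """Re-decode the recovered bytes with the assumed original encoding."""
    return to_original_bytes(garbled).decode(original_encoding, errors="replace")


if __name__ == "__main__":
    # Simulate the damage on a known string, then undo it.
    damaged = "運転".encode("cp932").decode("cp1252")
    print(damaged, "->", repair(damaged))
```

If a viewer has already dropped the undefined bytes, as may have happened when the sample above was pasted, that information is lost and the affected characters cannot be rebuilt from the text alone.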

How to convert .txt files to BDIC format?

I was recently looking for a spell checker add-on for Anki and found one named (Legacy) Spelling Police™ (here is its GitHub page). The problem is that it has to be fed the user's own dictionaries, and those must be in BDIC format. After searching the internet for a long time, I can only find dictionaries in .txt format (here is the link).
So, now I need a way to convert this .txt format to .bdic format.

write.csv with encoding UTF8

I am using Windows 7 Home Premium and RStudio 0.99.896.
I have a CSV file containing a column with text in several different languages, e.g. English, European languages, Korean, Simplified Chinese, Traditional Chinese, Greek, Japanese, etc.
I read it into R using
table <- read.csv("broker.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
so that all the text is readable in its language.
Most of the text is in a column named "content". In the console, when I have a look with
toplines <- head(table$content, 10)
I can see all the languages as they are, but when I write to a CSV file and open it in Excel, I can no longer see the languages. I typed in
write.csv(toplines, file = "toplines.csv", fileEncoding = "UTF-8")
then I opened toplines.csv in Excel 2013 and it looked like this
1 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
2 [<U+4E2D><U+56FD><U+6C11><U+822A><U+51C6.....
3 [<U+5916><U+5A92>:<U+4E2D><U+56FD><U+6C1.....
and so forth
Would anyone be able to tell me how I can write to a CSV or Excel file so that the languages can be read as they are in Excel 2013? Thank you very much.
write_excel_csv() from the readr package, as suggested by @PhiSeu in the comments, solved it for me.
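One thing write_excel_csv() does differently from write.csv() is prepend a UTF-8 byte order mark, which Excel uses as the signal that the file is UTF-8. The sketch below shows the same idea in Python rather than R, purely to illustrate what ends up in the file; the helper name write_csv_for_excel and the file name are hypothetical.

```python
import csv


def write_csv_for_excel(path: str, rows: list) -> None:
    """Write rows as UTF-8 CSV with a leading byte order mark."""
    # The "utf-8-sig" codec emits the bytes EF BB BF before the data.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        csv.writer(fh).writerows(rows)


if __name__ == "__main__":
    write_csv_for_excel("toplines_demo.csv", [["id", "content"], [1, "外媒:中国"]])
    with open("toplines_demo.csv", "rb") as fh:
        print(fh.read(3))  # the three BOM bytes
```

Without those three bytes, Excel tends to assume the system code page when a CSV is opened by double-click, which garbles non-Latin text even though the file itself is valid UTF-8.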

Figuring out encodings of PDFs in R

I am currently scraping some text data from several PDFs using readPDF() function in the tm package. This all works very well and in most cases the encoding seems to be "latin1" - in some, however, it is not. Is there a good way in R to check character encodings? I found the functions is.utf8() and is.local() in the tau package but that obviously only gets me so far.
Thanks.
The PDF specification defines these encodings for simple fonts (each of which can include a maximum of 256 character shapes) for Latin text; they should be predefined in any conforming reader:
/StandardEncoding
(for Type 1 Latin text fonts, but not for TrueType fonts)
/MacRomanEncoding
(the Mac OS standard encoding, for both TrueType and Type1 fonts)
/PDFDocEncoding
(only used for text strings outside of the document's content streams; normally not used to show text from fonts)
/WinAnsiEncoding
(Windows code page 1252 encoding, for both TrueType and Type1 fonts)
/MacExpertEncoding
(name is misleading -- the encoding is not platform-specific; however, only a few fonts have an appropriate character set to use this encoding)
Then there are 2 specific encodings for symbol fonts:
Symbol Encoding
ZapfDingbats Encoding
Also, fonts can have built-in encodings, which may deviate from a standard encoding in any way their creator wanted (e.g. this is also used for differences encoding when embedded standard fonts are subsetted).
So in order to correctly interpret a PDF file, you'll have to look up the encoding of each font used, and you must also take into account any /Encoding that uses a /Differences array.
However, the overall task is still quite simple for simple fonts. The PDF viewer program just needs to map 1:1 "each one of a sequence of bytes I see that's meant to represent a text string" to "exactly one glyph for me to draw, which I can look up in the encoding table".
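To make the 1:1 lookup concrete, here is a minimal Python sketch of how a base encoding and a /Differences array combine into one byte-to-glyph-name table. In a /Differences array, an integer sets the current code and each following name is assigned to consecutive codes. The base table below is only a tiny hand-picked subset of WinAnsiEncoding, and the function names are hypothetical.

```python
def build_encoding(base: dict, differences: list) -> dict:
    """Overlay a /Differences array on a base encoding table."""
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer starts a new run of codes
        else:
            if code is None:
                raise ValueError("glyph name before any starting code")
            table[code] = item  # names fill consecutive codes
            code += 1
    return table


def glyph_names(data: bytes, table: dict) -> list:
    """Map each byte of a text string to exactly one glyph name."""
    return [table.get(b, ".notdef") for b in data]


if __name__ == "__main__":
    win_ansi_subset = {0x41: "A", 0x6E: "n", 0xE1: "aacute", 0xF3: "oacute"}
    table = build_encoding(win_ansi_subset, [0xE1, "adieresis", "aring"])
    print(glyph_names(b"An\xe1", table))
```

A real extractor would then translate glyph names to Unicode, which is a separate step not shown here.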
For composite, CID-keyed fonts (which may contain many thousands of character shapes), the lookup/mapping for the viewer program for "this is the sequence of bytes I see that I'm supposed to draw as text" to "this is the sequence of glyph shapes to draw" is no longer 1:1. Here, a sequence of one or more bytes needs to be decoded to select each one glyph from the CIDFont.
To support this CIDFont decoding, CMap structures are needed. CMaps define mappings from character codes (in Unicode or other encodings) to the character identifiers (CIDs) of a character collection. The PDF specification defines at least 5 dozen CMaps -- and their standard names -- for Chinese, Japanese and Korean language fonts. These predefined CMaps need not be embedded in the PDF (but a conforming PDF reader needs to know how to handle them correctly). But there are (of course) also custom CMaps, which may have been generated 'on the fly' when the PDF-creating application wrote out the PDF. In that case the CMap needs to be embedded in the PDF file.
All details about these complexities are laid out in the official PDF-1.7 specification.
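As a small illustration of the simplest predefined CMap, Identity-H treats every two bytes of a string as one big-endian code and uses that number directly as the CID. The Python sketch below covers only that case; the function name is hypothetical, and other CMaps use variable-length codes and explicit ranges.

```python
def cids_identity_h(data: bytes) -> list:
    """Decode a string shown with the Identity-H CMap into CIDs."""
    if len(data) % 2:
        raise ValueError("Identity-H strings consist of 2-byte codes")
    # Each pair of bytes is one code; the code value is the CID itself.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    print(cids_identity_h(b"\x00\x41\x12\x34"))
```

A CID selects a glyph and is not a Unicode code point, so text extraction additionally depends on the font's /ToUnicode CMap or on knowledge of the character collection.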
I don't know much about R. But I have now poked a bit at CRAN, to see what the mentioned tm and tau packages are.
So tm is for text mining, and for PDF reading it relies on the pdftotext utility from Poppler. At first I had the (obviously wrong) impression that the readPDF() function you mentioned did some low-level, library-based access to the PDF objects directly in the PDF file. How wrong I was! It turns out it 'only' looks at the text output of the pdftotext command-line tool.
This explains why you will probably not succeed in reading any PDFs that use font encodings more complex than 'simple' Latin1.
I'm afraid the reason for your problem is that Poppler and pdftotext are currently not able to handle these.
Maybe you're better off filing a feature request with the tm maintainers :-) asking them to add support in the tm package for a more capable third-party PDF text extraction tool, such as PDFlib.com's TET (English version), which for sure is the best text extraction utility on the planet (better than Adobe's own tools, BTW).

Unix vs. Windows rendering of characters

I have a text file that displays differently when opened in FreeBSD vs. Windows.
On FreeBSD:
An·lisis e InvestigaciÛn
On Windows:
Análisis e Investigación
The Windows representation is obviously right. Any ideas on how to get that result in FreeBSD?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The problem is that it's not ASCII, but UTF-8. You have to use another editor that detects the encoding correctly, or convert the file to something your editor on FreeBSD understands.
This is not pure ASCII. It's UTF-8. Try a FreeBSD editor with UTF-8 support, or change locales.
From the way the characters are being displayed, I would say that the file is UTF-8 encoded Unicode. Windows recognises this and displays the 'á' and 'ó' characters correctly, while FreeBSD assumes it is ISO-8859-1, which results in each of these characters being displayed as 2 separate characters (because UTF-8 encodes each of them in 2 bytes).
You'll have to tell FreeBSD that it is a UTF-8 file, somehow.
How is the file encoded? I would try re-encoding the file as UTF-16.
After a bit more digging, I found that it works if I 1) open the CSV file in Excel on a Mac and export it as a CSV file, and 2) then open it in TextMate, copy the text, and save it again.
The result of: file file.csv is
UTF-8 Unicode English text, with very long lines
The original is:
Non-ISO extended-ASCII English text, with very long lines
This workaround isn't really suitable as this process is supposed to be automated, thanks for the help so far.
It doesn't matter which operating system you're using when you open the file. What matters is the application you use to open it. On Windows you're probably using Notepad, which automatically identifies the encoding as UTF-8.
The app you're using on FreeBSD obviously isn't doing that. Maybe it just can't read UTF-8 and you need to use a different app. Or maybe you just have to tell it which encoding to use. Automatic detection of character encodings is far from universal (and much farther from perfect).
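Since the asker needs an automated step rather than a manual round trip through Excel and TextMate, a re-encoding script is one option. The Python sketch below assumes the original file is in an 8-bit encoding such as Windows-1252; that is a guess based on the "Non-ISO extended-ASCII" report from file and on the garbled sample, which matches what the bytes 0xE1 and 0xF3 look like through a Mac Roman table. The function name and file names are hypothetical, and the source encoding should be adjusted if the guess is wrong.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Re-encode a text file to UTF-8 without touching its line endings."""
    # Work on bytes so that CRLF/LF line endings pass through unchanged.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    Path("demo_in.csv").write_bytes(b"An\xe1lisis e Investigaci\xf3n\n")
    convert_to_utf8("demo_in.csv", "demo_out.csv")
    print(Path("demo_out.csv").read_text(encoding="utf-8"))
```

Decoding raises UnicodeDecodeError if the file contains a byte the assumed encoding does not define, which is a useful early warning that the guess was wrong.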
