In our application we have a utility to generate a file in PDF format, but when I do that I see some Chinese characters scrambled on the screen. What could be the reason?
PDF is a binary file format. If you view such a file directly on the screen (for example, in a text editor), it is normal for gibberish characters to appear.
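If you want to check that the viewer, not the generator, is the problem, you can inspect the raw bytes yourself. A minimal sketch in R (the file name "output.pdf" is hypothetical):

# Read the first bytes of the generated file. A real PDF starts with
# the magic bytes "%PDF-"; what follows is mostly compressed binary
# stream data, which is why a text viewer shows gibberish.
con <- file("output.pdf", "rb")
header <- readBin(con, what = "raw", n = 8)
close(con)
rawToChar(header)   # e.g. "%PDF-1.7"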
The inside of the file has a bunch of these kinds of codes. I have no idea how this file was encoded, so I'm trying to figure out what program it may be from.
{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red0\green0\blue0;}
This is Rich Text Format (RTF), not encryption. The \rtf1 control word at the start is the format signature, \ansicpg1252 declares the Windows-1252 code page, and \fonttbl and \colortbl define the font and color tables. The \cocoartf keyword indicates the file was written by a macOS (Cocoa) application such as TextEdit; any RTF-capable editor (WordPad, TextEdit, Word) will open it.
Hi all. I have to use Cyrillic symbols, but I am having trouble with them. When I print them to the console everything is fine, but in the view function and when writing to .csv the characters are unreadable. See the picture below.
This PNG image is in base64:
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGYAAAAZBAMAAAAxhUZFAAAAG1BMVEXz8/MzMzPDw8NjY2N7e3urq6tLS0vb29uTk5O9wJ50AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABOElEQVQ4je1SsW6DMBQ8Y8CMttoPIEo+ALNkpZWSrCxVM6IKKasroswMlcpn9/nZSEkqlnbNSUaGd/fu/AzwwF+RVFhBmgPttCwBte+dsr0zxvhyDmGMhn12zAkgnsUXJtpp0kO2WZW35wZ482VLK2uLphiZE7BG8oItBmCjUQMC0PAraXzHT3psIJ2qmBOwg1yhowyoNfmxDyXsiOeTUwvqSNWOOQE1BkFdBVSlyRTKGCJnI45cFrz35p4TofEaNGmpfSc5FhSKYpxmTe5+aVQnONuFlfE8Bz9PYfh1jftsJ0mz9OezNNxLKOPskLazzw73M+BRTe7IMall+l40yZa05az5AIrvbAwcxuS/D6Zijb+BfY+crGP4EFXZp8iJ93OFeulvuUXMwEiqRdoNrnmyXKQ98E/8ADIIKtlalug3AAAAAElFTkSuQmCC
Is it possible to decode it to get the text without using OCR technology?
Definitely not.
What you have is a base64-encoded PNG file. PNG files are binary files containing compressed pixel data, not text characters.
OCR would be the only way to try to recognize the characters in the pixel data.
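For completeness, decoding the image data itself is easy; it just gets you pixels, not text. A minimal sketch in R, assuming the base64enc package is installed (the package choice and output file name are assumptions):

library(base64enc)
# The payload is the part of the data URI after "base64,",
# copied verbatim from the question above.
img_b64 <- "iVBORw0KGgoAAAANSUhEUgAAAGYAAAAZBAMAAAAxhUZFAAAAG1BMVEXz8/MzMzPDw8NjY2N7e3urq6tLS0vb29uTk5O9wJ50AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABOElEQVQ4je1SsW6DMBQ8Y8CMttoPIEo+ALNkpZWSrCxVM6IKKasroswMlcpn9/nZSEkqlnbNSUaGd/fu/AzwwF+RVFhBmgPttCwBte+dsr0zxvhyDmGMhn12zAkgnsUXJtpp0kO2WZW35wZ482VLK2uLphiZE7BG8oItBmCjUQMC0PAraXzHT3psIJ2qmBOwg1yhowyoNfmxDyXsiOeTUwvqSNWOOQE1BkFdBVSlyRTKGCJnI45cFrz35p4TofEaNGmpfSc5FhSKYpxmTe5+aVQnONuFlfE8Bz9PYfh1jftsJ0mz9OezNNxLKOPskLazzw73M+BRTe7IMall+l40yZa05az5AIrvbAwcxuS/D6Zijb+BfY+crGP4EFXZp8iJ93OFeulvuUXMwEiqRdoNrnmyXKQ98E/8ADIIKtlalug3AAAAAElFTkSuQmCC"
# Decode to raw bytes and write them out: the result is an ordinary
# PNG file of compressed pixel data, so recovering the characters
# from it would still require OCR.
writeBin(base64decode(img_b64), "decoded.png")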
I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R displays the information sometimes as Chinese characters, sometimes as Unicode escape codes (<U+...>).
For instance:
data <- read.csv("mydata.csv", encoding="UTF-8")
data
will produce Unicode escape codes, while:
data <- read.csv("mydata.csv", encoding="UTF-8")
data[,1]
will actually display Chinese characters.
If I turn it into a matrix, it will also display Chinese characters, but if I try to look at the data (via View(data) or fix(data)) it shows the Unicode escapes again.
I've asked for advice from people who use a Mac (I'm using a PC with Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I also tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or the result was gibberish instead of Unicode escapes.
My current locale is:
"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
Any help to get R to consistently display Chinese characters would be greatly appreciated...
This is not a bug, but rather a misunderstanding of the underlying type conversions (the character type versus the factor type) that happen when constructing a data.frame.
You could start with data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE), which will make your Chinese characters come in as the character type, so printing them out should show what you are expecting.
@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
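Spelled out as a runnable sketch of the suggestion above (the file name is the one from the question):

data <- read.csv("mydata.csv", encoding = "UTF-8",
                 stringsAsFactors = FALSE)
str(data)    # text columns are now plain character vectors, not factors
data[, 1]    # should print the Chinese characters directly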
In my case, the UTF-8 encoding does not work in my R, but the GB* encodings do (UTF-8 works on Ubuntu). First you need to figure out the default encoding of your OS and encode the file accordingly. Excel cannot encode a file as UTF-8 properly, even when it claims to save it as UTF-8.
(1) Download the 'Open Sheet' software.
(2) Open the file in it. You can scroll through the encoding methods until you
see the Chinese characters displayed in the preview window.
(3) Save it as UTF-8 (if you want UTF-8). UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.
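To do the same from inside R, a hedged sketch (GB18030 is an assumption; GBK or GB2312 are the other common encodings on Chinese-locale Windows):

data <- read.csv("mydata.csv", fileEncoding = "GB18030",
                 stringsAsFactors = FALSE)
# read.csv has now converted the text from GB18030 to R's native
# encoding; iconv() can re-encode it as UTF-8 explicitly if needed:
data[, 1] <- iconv(data[, 1], to = "UTF-8")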
I have a text file that displays differently when opening it on FreeBSD vs. Windows.
On FreeBSD:
An·lisis e InvestigaciÛn
On Windows:
Análisis e Investigación
The Windows representation is obviously right. Any ideas on how to get that result on BSD?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The problem is that it's not ASCII, but UTF-8. You have to use another editor which detects the encoding correctly, or convert the file to something your editor on FreeBSD understands.
This is not pure ASCII; it's UTF-8. Try a FreeBSD editor with UTF-8 support, or change your locale.
From the way the characters are being displayed, I would say that file is UTF-8 encoded Unicode. Windows is recognising this and displaying the 'á' and 'ó' characters correctly, while FreeBSD is assuming it's ISO-8859-1, which results in these characters being displayed as two separate characters (due to the UTF-8 encoding using 2 bytes).
You'll have to tell FreeBSD that it is a UTF-8 file, somehow.
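In practice, "telling FreeBSD" usually means running the terminal and editor in a UTF-8 locale. A hedged sketch for the shell (the exact locale name is an assumption; list the available ones with locale -a):

export LC_CTYPE=en_US.UTF-8   # make the terminal/editor interpret bytes as UTF-8
file file.csv                 # file(1) reports what encoding it guesses

If you would rather convert the file to the single-byte encoding your environment already uses, iconv(1) can do it:

iconv -f UTF-8 -t ISO-8859-1 file.csv > file-latin1.csv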
How is the file encoded? I would try re-encoding the file as UTF-16.
So after doing a bit more digging: if I (1) open the csv file in Excel on a Mac and export it as a csv file, and (2) then open it in TextMate, copy the text, and save it again, it works.
The result of file file.csv is:
UTF-8 Unicode English text, with very long lines
The original is:
Non-ISO extended-ASCII English text, with very long lines
This workaround isn't really suitable, as this process is supposed to be automated. Thanks for the help so far.
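For an automated pipeline, the same round-trip can be scripted with iconv(1). This is a hedged sketch: "Non-ISO extended-ASCII" from file(1) frequently means MacRoman when the CSV came out of Excel on a Mac, but that source encoding is an assumption to verify, and the MACINTOSH encoding name depends on your iconv build:

file file.csv                                          # confirm the guessed encoding first
iconv -f MACINTOSH -t UTF-8 file.csv > file-utf8.csv   # MacRoman -> UTF-8 (assumed source)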
It doesn't matter which operating system you're using when you open the file. What matters is the application you use to open it. On Windows you're probably using Notepad, which automatically identifies the encoding as UTF-8.
The app you're using on FreeBSD obviously isn't doing that. Maybe it just can't read UTF-8 and you need to use a different app. Or maybe you just have to tell it which encoding to use. Automatic detection of character encodings is far from universal (and much farther from perfect).