Is there an official document on how to convert Chinese characters into pinyin? [closed]

I need to convert Chinese characters into pinyin and need an official document for that conversion.
There are some libraries around, as mentioned in previous posts such as Convert chinese characters to hanyu pinyin.
However, I need an "official standard" more than an "available library". Where could I find such a document? Is there any standard / document / book released by the Chinese government specifying how Chinese characters should be pronounced/marked in pinyin?
Appreciate your kind help.

Taiwan's Ministry of Education has a site listing all the variants of Chinese characters: http://dict.variants.moe.edu.tw/eng.htm
In it, they also specify the pronunciation of each character. However, the pronunciation is given in Zhuyin (popular in Taiwan) and not Hanyu Pinyin (popular in mainland China).
You could use the list on Wikipedia to map Zhuyin to Hanyu Pinyin http://zh.wikipedia.org/wiki/%E4%B8%AD%E6%96%87%E6%8B%BC%E9%9F%B3%E5%B0%8D%E7%85%A7%E8%A1%A8
For example, the character 井 http://dict.variants.moe.edu.tw/yitia/fra/fra00052.htm has the Zhuyin ㄐㄧㄥˇ; you then look up ㄐㄧㄥ = jing, combine it with the tone (ˇ, the third tone), and get jǐng.
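As a minimal Python sketch of that lookup (the one-entry mapping table and the simplified tone-placement rule are illustrative assumptions, not a complete converter):

# Zhuyin syllable -> toneless pinyin; a real converter would load the
# full correspondence table from the Wikipedia page above.
ZHUYIN_TO_PINYIN = {"ㄐㄧㄥ": "jing"}

# Plain vowel plus its tone-marked forms, indexed by tone number 1-4.
TONE_MARKS = {"a": "aāáǎà", "e": "eēéěè", "i": "iīíǐì",
              "o": "oōóǒò", "u": "uūúǔù"}

def apply_tone(syllable: str, tone: int) -> str:
    # Simplified placement rule: mark the first of a/e/o, else the last vowel.
    for v in "aeo":
        if v in syllable:
            return syllable.replace(v, TONE_MARKS[v][tone], 1)
    for ch in reversed(syllable):
        if ch in TONE_MARKS:
            i = syllable.rindex(ch)
            return syllable[:i] + TONE_MARKS[ch][tone] + syllable[i + 1:]
    return syllable

print(apply_tone(ZHUYIN_TO_PINYIN["ㄐㄧㄥ"], 3))  # -> jǐng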
I don't know of any official standard in mainland China or in any other Chinese-speaking country.

There is no unique way to convert a Chinese character to pinyin, since there is not necessarily a unique way to pronounce a character; pinyin is a system for transcribing Chinese characters into Latin script, from which one can derive how to pronounce the character. It all depends on the context in which the character is used.
Some examples:
The verb 数 meaning "to count" has pinyin shǔ, while the noun 数 meaning "number" has pinyin shù.
长 with the meaning "long" is written as cháng; with the meaning "chief", however, it is written as zhǎng.
The pinyin for 好 with meaning "good" is hǎo while the 好 in 爱好 has pinyin hào.
行 with meaning "to walk" has pinyin xíng, while the measurement word meaning for a row of something has pinyin háng.
Chinese is full of such examples. Sometimes only the tones differ (see the 好 example) and sometimes the pronunciation is completely different (the 行 example).
Besides characters having multiple pronunciations (depending on the context), tones also change when characters are used together with other characters. For example, the pinyin for 不 is normally bù, but becomes bú when the character following 不 has a fourth tone.
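As a hedged illustration of the context problem, the third-party pypinyin library (pip install pypinyin) can return every recorded reading of a character; the outputs in the comments below are indicative, not guaranteed:

from pypinyin import pinyin

# All recorded readings of 长 ("long" / "chief"):
print(pinyin('长', heteronym=True))   # e.g. [['zhǎng', 'cháng']]
# Inside a word, the library's phrase dictionary usually disambiguates:
print(pinyin('长城'))                 # Great Wall, e.g. [['cháng'], ['chéng']]
print(pinyin('校长'))                 # school principal, e.g. [['xiào'], ['zhǎng']]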

Answering my own question just to add my 2 cents, in case others bump into this topic.
In mainland China, there is a dictionary, 新华字典 (Xinhua Zidian, http://en.wikipedia.org/wiki/Xinhua_Zidian), that is quite authoritative. Although it's not formally endorsed by the Chinese government, more than 400 million copies have been published, and it is widely used as a reference book by primary and middle school students and teachers.
Unfortunately there is no official online version of this dictionary, though some scanned versions are available.

For mainland China, pinyin orthography follows the 《汉语拼音正词法基本规则》 (Basic Rules of Hanyu Pinyin Orthography) published in 1996. This is the national standard, which must be used in all official publications (although you will see incorrect pinyin everywhere in China). You can find the full text (including an English translation) here: http://www.pinyin.info/rules/pinyinrules_simp.html
For the correct transcription of individual characters, I agree that Xinhua Zidian is a quasi-authority. You can in fact find some online versions (like http://xh.5156edu.com/), but I don't know if they are reliable.

Related

I have two questions about the Google Translation API

The first problem: the original language is English, and "CORN STARCH PROCESSING LINE" is translated into Russian with the neural machine translation model. The result is "ЛИНИЯ ОБРАБОТКИ КУХНЯ", which differs considerably from the original meaning. The second problem: the original language is English, and "SULLAIR COALESCING FILTER 02250153-324" translated into Arabic with the neural machine translation model comes back as "SULLAIR COALESCING FILTER 02250153-324", i.e. not translated into Arabic at all. How can I solve this problem?
Regarding the Russian translation, the Cloud Translation API is giving as accurate a result as is currently possible. Those results are constantly being improved and updated.
For the Arabic translation part, there seems to be an issue with using the - symbol next to the numbers: if you remove it or use any other symbol, the words are translated to Arabic as expected.
I have created an issue tracker for that; you can follow this link to get updates on the fix. Keep in mind that there is no ETA on when the fix will be ready, so as a workaround for now, just replace the - symbol with the _ symbol and the words will be translated to Arabic.
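A sketch of that workaround with the google-cloud-translate Python client (the client library and the credentials setup are assumptions here, not part of the original answer):

from google.cloud import translate_v2 as translate

client = translate.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

text = "SULLAIR COALESCING FILTER 02250153-324"
# Workaround: replace '-' with '_' so the surrounding words get translated.
safe_text = text.replace("-", "_")

result = client.translate(safe_text, source_language="en", target_language="ar")
# Restore the original hyphen in the translated output.
print(result["translatedText"].replace("_", "-"))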

Google OCR: Special Characters Affecting OCR Number Recognition

I've been playing around with Google's OCR recently, using the default tutorial, and was trying to parse numbers. I've seen previous issues dealing with numbers on license plates, but I was wondering if there is a solution when special characters affect the results of the OCR. Most notably, including the '#' character with a number, such as #1, #2, etc. as shown below, results in the output ##Z#T#, and occasionally even gives me Chinese characters, even after I set the to/from language settings to English.
[Image: numbers with pound sign]
For a similar comparison, the image below is easily read by the OCR:
[Image: numbers without pound sign]
Is there a setting that I'm missing that can improve the results or is this just a constraint by the model?
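For what it's worth, the setting usually involved here is the language hint; a sketch with the google-cloud-vision Python client (the file name is hypothetical, and this may reduce the spurious Chinese output without necessarily fixing the '#' confusion):

from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("numbers_with_pound.png", "rb") as f:  # hypothetical file name
    image = vision.Image(content=f.read())

response = client.text_detection(
    image=image,
    image_context={"language_hints": ["en"]},  # constrain recognition to English
)
if response.text_annotations:
    print(response.text_annotations[0].description)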

Is there an industry standard output format for OCR?

Is there an industry standard output format for OCR? I can't seem to find anything that is defined as an industry standard, but I'm not very experienced with OCR, so I wouldn't know if there was one either.
hOCR is an open standard which defines a data format for representing OCR output.
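To make the format concrete, here is a minimal sketch of reading hOCR with BeautifulSoup (pip install beautifulsoup4); the hOCR snippet is hand-written for illustration:

from bs4 import BeautifulSoup

# hOCR is plain HTML: OCR elements carry classes such as ocr_page,
# ocr_line, and ocrx_word, with bounding boxes in the title attribute.
hocr = """
<div class='ocr_page'>
  <span class='ocrx_word' title='bbox 10 10 60 30'>Hello</span>
  <span class='ocrx_word' title='bbox 70 10 140 30'>world</span>
</div>
"""

soup = BeautifulSoup(hocr, "html.parser")
for word in soup.find_all(class_="ocrx_word"):
    x0, y0, x1, y1 = (int(v) for v in word["title"].split()[1:])
    print(word.get_text(), (x0, y0, x1, y1))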
There is no single such format, but there are commonly used practices and open-standard formats that will satisfy your requirements. This question is like asking "what is the standard result of cooking potatoes": mashed potatoes, french fries, or baked? (Not sure where that example came from, I must be getting hungry...)
Also, an "industry standard" will depend on the specific industry. If you are in a specific vertical, then some formats will be more common (almost standard) than others. For example:
Medical - HL7 formatted text
Libraries - ALTO XML
Legal/eDiscovery - PDF Text Under Image
Integration/Automation - XML
In general, I will not be wrong if I say that the most commonly used and industry-accepted formats are TXT, XML, and PDF (in several flavors). Each has unique properties and specific uses, but each can be widely consumed by other technologies thanks to open standards.
It is better to approach this from the opposite end: thinking through the business requirements (what will happen with the data and where it needs to be absorbed) should exactly define what hand-off format you want from the OCR output.
XIEO (http://xieo.info) uses a proprietary format from Maya Software called CML (Clix Markup Language) that efficiently encodes page, zone, line, text box, and related information. VisualText/NLP++ (available at http://www.textanalysis.com) has a special tokenizer pass to "inhale" that format and produce a ready-made parse tree. NLP++ analyzers can then build on that initial parse tree.
This workflow has been used for more than 5 years at XIEO, primarily for processing Official Records documents (deeds, mortgages, clerk of court, etc.) and extracting information from them.
In this workflow, one can clean up the OCRed text, re-zone to fix OCR errors and mis-zoning, and extract the pertinent information from the text.
Amnon Meyers, CTO, Text Analysis International, Inc. (amnon.meyers@textanalysis.com)

Fixing string variables with varying spellings, etc

I have a dataset with individuals' names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names and/or addresses and/or phone numbers. A snippet of the fake data is shown below:
first last address phone
Jimmy Bamboo P.O. Box 1190 xxx-xx-xx00
Jimmy W. Bamboo P.O. Box 1190 xxx-xx-xx22
James West Bamboo P.O. Box 219 xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Does anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls, saying that if at least two or three fields are the same, treat the rows as a single individual.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) Distance matching, like Damerau-Levenshtein (see the sketch after this list). You can use this for names, addresses, and other things. It handles errors like transpositions, minor misspellings, omitted characters, etc.
2) Phonetic word matching - Soundex is not good; there are other, more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) Nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup from nicknames to regular names. Consider, though, variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
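Here is a minimal sketch of point 1, the Damerau-Levenshtein distance (the optimal-string-alignment variant), for illustration only:

def damerau_levenshtein(a: str, b: str) -> int:
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("Jimmy", "Jmimy"))   # 1 (one transposition)
print(damerau_levenshtein("Bamboo", "Bambo"))  # 1 (one deletion)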
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
Here are some other answers on similar topics I've made here on stackoverflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
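As a rough sketch, assuming a distance function such as the damerau_levenshtein() sketched earlier in this thread (the linked post uses R's stringdist package instead), the pairwise matrix can be built with a nested comprehension:

names = ["Jimmy Bamboo", "Jimmy W. Bamboo", "James West Bamboo"]
matrix = [[damerau_levenshtein(a, b) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, row)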

English dictionary as txt or xml file with support of synonyms [closed]

Can someone point me to where I can download an English dictionary as a txt or xml file? I am building a simple app for myself and am looking for something I could start using immediately, without learning a complex API.
Support for synonyms would be great; that is, it should be easy to retrieve all the synonyms for a particular word.
It would be absolutely fantastic if the dictionary listed both the British and American spellings of words where they differ.
Even a small dictionary (a few thousand words) is OK; I only need it for a small project.
I would even be willing to buy one if the price is reasonable and the dictionary is easy to use; simple XML would be great.
Any directions, please?
WordNet is what you want. It's big, containing over a hundred thousand entries, and it's freely available.
However, it's not stored as XML. To access the data, you'll want to use one of the existing WordNet APIs for your language of choice.
Using the APIs is generally pretty straightforward, so I don't think you have to worry much about "learning (a) complex API". For example, borrowing from the WordNet HOWTO for the Python-based Natural Language Toolkit (NLTK):
>>> from nltk.corpus import wordnet as wn
>>>
>>> # Get all synsets for 'dog'
>>> # This is essentially all senses of the word in the database
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'),
Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'),
Synset('andiron.n.01'), Synset('chase.v.01')]
>>> # Get the definition and usage examples for the first synset
>>> wn.synset('dog.n.01').definition()
'a member of the genus Canis (probably descended from the common
wolf) that has been domesticated by man since prehistoric times;
occurs in many breeds'
>>> wn.synset('dog.n.01').examples()
['the dog barked all night']
>>> # Get antonyms for 'good'
>>> wn.synset('good.a.01').lemmas()[0].antonyms()
[Lemma('bad.a.01.bad')]
>>> # Get synonyms (lemmas) for the first noun sense of 'dog'
>>> wn.synset('dog.n.01').lemmas()
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'),
Lemma('dog.n.01.Canis_familiaris')]
>>> # Get synonyms for all senses of 'dog'
>>> for synset in wn.synsets('dog'): print(synset.lemmas())
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'),
Lemma('dog.n.01.Canis_familiaris')]
...
[Lemma('frank.n.02.frank'), Lemma('frank.n.02.frankfurter'),
...
While there is an American English bias in WordNet, it supports British spellings and usage. For example, you can look up 'colour' and one of the synsets for 'lift' is 'elevator.n.01'.
Notes on XML
If having the data represented as XML is essential, you could easily use one of the APIs to access the WordNet database and convert it into XML, e.g. see Thinking XML: Querying WordNet as XML.
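A minimal sketch of that conversion with NLTK and the standard library (the XML element names here are made up for illustration, not a defined schema):

import xml.etree.ElementTree as ET
from nltk.corpus import wordnet as wn

root = ET.Element("wordnet")
for synset in wn.synsets("dog"):
    el = ET.SubElement(root, "synset", name=synset.name())
    ET.SubElement(el, "definition").text = synset.definition()
    for lemma in synset.lemmas():
        ET.SubElement(el, "lemma").text = lemma.name()

print(ET.tostring(root, encoding="unicode"))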
I know this question is quite old, but I had problems myself finding this as a txt file, so if anyone is looking for a synonyms-and-antonyms database as a simple yet very detailed txt file, try
https://ia801407.us.archive.org/10/items/synonymsantonyms00ordwiala/synonymsantonyms00ordwiala_djvu.txt
I have used Roget's thesaurus in the past. It has the synonymy information in plain text files. There is also some java code to help you parse the text.
These pages provide links to a bunch of thesauri/lexical resources, some of which are freely downloadable.
http://www.w3.org/2001/sw/Europe/reports/thes/thes_links.html
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/lex.html
Try WordNet.
