Customized translation model specific to place names for translation between Japanese and other languages - google-translate

I am looking for a way to properly translate sentences containing place names into a desired target language. The places will all be places in Japan.
For example, if my source language is Japanese and my target language is English, my input is 明日はフォールズシティに行きます, which translated into English might be: I am going to the Waterfalls City tomorrow.
*** Please note that the place name is made up and the translation result is just a sample.
The place name should be rendered as Foruzu City, and not as the literal meaning of the word. So the output I expect is: I am going to Foruzu City tomorrow.
So, my question is: how can I train a translation model that can handle the translation of sentences containing place names like the one above?

I've read about one method where the source/input data is run through a Named Entity Recognition program to identify persons' names, city names, etc., and tag them.
Feeding source language sentences with NE tags/information helps a Neural Machine Translator make the appropriate translation.
Descriptions of work on NMTs that incorporate NE tags can be found abundantly online.
Some similar work I found is described at this Link and this Link.
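As a minimal sketch of how the tagging step can wrap around the translator (assuming spaCy's Japanese model ja_core_news_sm, a hypothetical gazetteer, and a hypothetical translate() function standing in for the NMT call; it also assumes the placeholders survive translation unchanged, which you would have to verify):

    import spacy

    nlp = spacy.load("ja_core_news_sm")

    # Hypothetical gazetteer: source place name -> desired target-language form.
    GAZETTEER = {"フォールズシティ": "Foruzu City"}

    def translate_with_place_names(text, translate):
        doc = nlp(text)
        slots = {}
        # Swap each recognized place name for a neutral placeholder token.
        for i, ent in enumerate(e for e in doc.ents if e.text in GAZETTEER):
            placeholder = f"NE{i}"
            slots[placeholder] = GAZETTEER[ent.text]
            text = text.replace(ent.text, placeholder)
        translated = translate(text)  # the NMT call; out of scope here
        # Put the desired place-name forms back into the output.
        for placeholder, name in slots.items():
            translated = translated.replace(placeholder, name)
        return translated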

Related

Languages word list and dictionary formats

In my current application I am using a TinyMCE plugin called nanospell. This plugin comes with many different dictionaries, but it is missing one that is very important for my application (French Canadian). Do you guys know where I could find a French Canadian dictionary/word list that I could include in the nanospell dictionaries? It would also help to just find any French Canadian file with a list of words and try to reverse engineer the file to make it work with the format nanospell uses.
You can check out http://www.freescrabbledictionary.com/word-finder-francais/; they seem to have a similar word list.

Are there any preset I18n word lists / resource files?

I'm creating a web application that uses I18n. As I don't want to translate very common basic strings like "forgot password?" on my own, I'm asking whether there are already any resource files or word lists containing these strings. One option would be to download an existing framework and somehow extract these strings, but that might be a hassle.
In particular, I'm looking for translations regarding user authentication, from English to Italian, French and German. The file format doesn't matter.
Professional translators use a tool for this; TMX (Translation Memory eXchange) is the generic term, I think. It does what you are talking about by building up standard phrase lists in other languages, so that when translators work they can bring these phrases in to speed up the job and reduce the repetitive tedium. So these lists exist.
There is a free plugin for MS Word that does this and may come with lists (sorry, I cannot remember the name, although Rosetta rings a bell).
There is a FOSS TMX tool called Okapi on SourceForge. It may come with the dictionaries, but if not, it is a good place to start investigating.
You could also approach a site called ProZ, which is a site for translators and might be able to point you in the right direction.
Take care with MT like the Google API, as it can give some weird results, but you could use it to build your list and then double-check. Remember that when you check a language, you need to do it with a native speaker who can pick up on the nuances and colloquialisms.
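If you do get hold of TMX files, they are plain XML and easy to mine; here is a rough sketch using only the standard library (the file name and language codes are just examples):

    import xml.etree.ElementTree as ET

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    def read_tmx(path, source_lang="en", target_lang="it"):
        """Yield (source, target) phrase pairs from a TMX translation memory."""
        root = ET.parse(path).getroot()
        for tu in root.iter("tu"):          # one translation unit per phrase
            segs = {}
            for tuv in tu.iter("tuv"):      # one variant per language
                lang = tuv.get(XML_LANG, "").split("-")[0].lower()
                seg = tuv.find("seg")
                if seg is not None and seg.text:
                    segs[lang] = seg.text
            if source_lang in segs and target_lang in segs:
                yield segs[source_lang], segs[target_lang]

    # e.g. build an English -> Italian phrase list for the auth strings:
    # phrases = dict(read_tmx("auth_phrases.tmx", "en", "it"))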
You can use the Google Translate API and your own custom resource bundle.

Convert nested dictionary/xml to flat file for sqlite

I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a Biopython nested dictionary (or XML) of PubMed citation data into a flat (normalized) structure, e.g. SQLite. The citation data was fetched from PubMed using Biopython and parsed into a dictionary, but it can also be retrieved as XML if needed.
Not all citations will have all fields/keys, and not all fields/keys will have the same number of items (authors, MeSH terms, refs, etc...), and I understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first pull out and normalize all unique fields (those that occur once per paper, e.g. title, abstract, date, citation, etc..., but not, say, affiliation, as that would be linked to the first author). Papers with no abstract could be filled in as null?
Then move on to, say, authors and create a separate table, again using PMID as the foreign key, and then do the same for the various other fields/keys/items in separate tables, e.g. MeSH headings, EC numbers, refs, etc...
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done and what still needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated, and I do understand that you can't fit a nested structure into a flat space. I'm just looking for the least boneheaded way of going about this, and hopefully one that will let me make sure that everything was properly captured.
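To make the above concrete, here is the rough shape I have in mind (a sketch only; the key names PMID, TI, AB, AU and MH are the MEDLINE-style keys Biopython's Medline parser returns, so substitute your actual keys):

    import sqlite3

    def load_citations(records):
        con = sqlite3.connect("pubmed.db")
        con.executescript("""
            CREATE TABLE IF NOT EXISTS papers (
                pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT);
            CREATE TABLE IF NOT EXISTS authors (
                pmid TEXT REFERENCES papers(pmid), position INTEGER, name TEXT);
            CREATE TABLE IF NOT EXISTS mesh_terms (
                pmid TEXT REFERENCES papers(pmid), term TEXT);
        """)
        for rec in records:  # each rec is one parsed citation dictionary
            pmid = rec.pop("PMID")
            # 1-per-paper fields go into the main table; a missing abstract
            # becomes NULL via pop's default.
            con.execute("INSERT INTO papers VALUES (?, ?, ?)",
                        (pmid, rec.pop("TI", None), rec.pop("AB", None)))
            # Repeating fields each get their own table keyed by PMID.
            for i, author in enumerate(rec.pop("AU", [])):
                con.execute("INSERT INTO authors VALUES (?, ?, ?)",
                            (pmid, i, author))
            for term in rec.pop("MH", []):
                con.execute("INSERT INTO mesh_terms VALUES (?, ?)", (pmid, term))
            # Whatever is left in rec is visibly "not yet captured".
        con.commit()
        return con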
Many thanks,
chris
A quick question -- if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint, is transactional, recoverable and highly reliable. It has HA features as well, if that is required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.
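Not Berkeley DB XML itself, but as a rough illustration of querying the XML directly, something like this works even with the standard library (the element names follow PubMed's XML export, but check them against your actual data):

    import xml.etree.ElementTree as ET

    root = ET.parse("citations.xml").getroot()

    # Walk every citation and pull out a few fields via XPath-style lookups.
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        authors = [a.findtext("LastName") for a in article.iter("Author")]
        print(pmid, title, authors)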

What is the spec for formatting data in QR codes? I cannot find it anywhere

I'm specifically asking whether, and what, the specification is for formatting the text inside a QR code - not how to generate a code (which I can do).
I need to put hCard data into a QR code; however, I don't know how to mark the QR code as vCard/VCF data (versus URL, text, etc.) so the decoder knows what to do.
All the info I've seen online has to do with generating the QR code, not formatting the data inside.
There is no official specification for this -- the QR code spec does not say anything about the contents. Everything I know about the commonly-used and de facto formats and conventions is summarized in this wiki:
https://github.com/zxing/zxing/wiki/Barcode-Contents
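In other words, the decoder recognises the payload by its content - e.g. a vCard's BEGIN:VCARD header - not by any flag in the QR symbol itself. A rough sketch (assuming the third-party Python "qrcode" package):

    import qrcode

    # The payload type is signalled only by the text itself: a decoder that
    # sees BEGIN:VCARD treats the code as contact data.
    vcard = "\r\n".join([
        "BEGIN:VCARD",
        "VERSION:3.0",
        "N:Doe;John",
        "FN:John Doe",
        "TEL;TYPE=CELL:+1-555-0100",
        "END:VCARD",
    ])

    qrcode.make(vcard).save("contact.png")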
Who says you have to pay for this info? Just go to this page http://qrcodenet.codeplex.com/ and scroll to the bottom REFERENCES section, and you should find a free download titled: 4. ISO/IEC 18004:2006(E) - QR code specification ISOIEC 18004_2006Cor 1_2009.pdf
See http://www.onbarcode.com/qr_code/ for helpful info - both about encoding and generating, and several libraries to use as well.
See http://www.denso-wave.com/qrcode/qrstandard-e.html
and http://www.denso-wave.com/qrcode/aboutqr-e.html
QR Code is a kind of 2-D (two-dimensional) symbology developed by Denso Wave (a division of Denso Corporation at the time) and released in 1994 with the primary aim of being a symbol that is easily interpreted by scanner equipment.
...
QR Code is open in the sense that the specification of QR Code is disclosed and that the patent right owned by Denso Wave is not exercised.
From http://www.denso-wave.com/qrcode/qrstandard-e.html:
QR Code is established as an ISO (ISO/IEC 18004) standard. The QR Code specification can, therefore, be purchased from this organization. Please search by entering ISO No. 18004 into "Search and ISO Catalogue".
http://www.iso.ch/iso/en/prods-services/ISOstore/store.html
The official spec is available here from iso.org, but you have to pay for it.
In the past I found information at http://www.nttdocomo.co.jp/english/service/imode/make/content/barcode/function/, a page which I cannot (easily?) trace anymore.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a regexp, but rather some clever ideas or algorithms for interpreting the HTML source. What do I do to find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search output.
However, this task can be solved, and is actually solved in many applications, with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of search results, you will notice that output results are typically very structured and marked with specific div sections or class attributes, so it is very easy to find them in the document. You don't even have to use any complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page, your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
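A grep-like sketch of that idea (the start/end markers are per-site configuration; note that nested <div>s inside the body would need a real parser):

    import re

    START = '<div class="post-text">'   # site-specific, found by inspection
    END = '</div>'

    def extract_items(html):
        items = []
        pos = 0
        while (start := html.find(START, pos)) != -1:
            end = html.find(END, start)
            if end == -1:
                break
            body = html[start + len(START):end]
            text = re.sub(r"<[^>]+>", " ", body)              # drop inline tags
            items.append(re.sub(r"\s+", " ", text).strip())   # collapse whitespace
            pos = end + len(END)
        return items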
Once you go to large scale with your retrieval application, you will find out that there is not that big a variety of search engines across different sites, and you will be able to re-use already-created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for some time, you will need to include in your parsers some logic that checks the validity of their results and notifies you whenever the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
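For example, a minimal validity check of that kind (the thresholds are arbitrary and yours to tune):

    def check_results(items, query):
        # Flag empty or obviously malformed output so a site redesign
        # doesn't fail silently.
        if not items:
            raise RuntimeError(f"no results parsed for {query!r}; layout change?")
        if any(len(item) < 3 for item in items):
            raise RuntimeError("suspiciously short result items; layout change?")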
Hope this helps.
