How to reuse reviewed documents in further translations? - azure-cognitive-services

Our instruction manual is in Markdown format. We improve the manual daily or weekly, and then we would like to translate it into other languages. Until now this process has been manual, and we use TRADOS (a computer-assisted translation tool).
We are looking into ways to improve the process. What we want to do is translate the manual using Microsoft Translator and then have a human reviewer do the fine-tuning. The problem is how to reuse the corrected document in Microsoft Translator for the next translation.
Many times we only change 2 or 3 words of a topic, and the translator then creates a completely new translation, which makes the reviewer's work very tedious if the translation is not accurate.
I know that we can train the model, but I think there is no guarantee that the translator will use the reviewed text. Also, it seems very time-consuming to maintain the dictionary after reviewing the document.
I was wondering if somebody has solved this kind of problem.

You can automate the translation of the document using Microsoft Flow. In Microsoft Flow you can use the SharePoint service to store the file that needs to be translated and then create a flow of operations:
Create a folder in SharePoint and specify the site address.
Specify the folder name and the language to translate into when a file becomes available in that SharePoint folder.
Specify the destination folder. Again, it can be a SharePoint folder.
Note: there is no guarantee that the translation will have the same meaning as the original text.

You are looking for a translation memory system (TMS). A TMS will store your human edits to documents and reapply them to future translations of documents where a paragraph is repeated. A TMS will also help your human translator find close matches of a new segment with a previously translated one, highlighting the change the human will have to perform.
Most TMSes integrate machine translation systems like Microsoft Translator and others as a suggestion to the human editor, who can then approve or edit the suggestion.
https://en.wikipedia.org/wiki/Translation_memory
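To make the idea concrete, here is a minimal sketch of what a translation memory does at the segment level. It is not how Trados or any particular TMS is implemented: the `TranslationMemory` class, the 0.75 threshold and the sample sentences are assumptions for illustration, and the similarity score comes from Python's standard `difflib` rather than a real fuzzy-match algorithm.

```python
import difflib


class TranslationMemory:
    """Toy translation memory: reviewed translations keyed by source segment."""

    def __init__(self):
        self._segments = {}

    def add(self, source, target):
        """Store a human-reviewed translation for a source segment."""
        self._segments[source] = target

    def lookup(self, source, threshold=0.75):
        """Return (kind, matched_source, translation, score), or None if nothing is close."""
        if source in self._segments:
            return ("exact", source, self._segments[source], 1.0)
        best, best_score = None, 0.0
        for candidate in self._segments:
            score = difflib.SequenceMatcher(None, source, candidate).ratio()
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            return ("fuzzy", best, self._segments[best], best_score)
        return None


def changed_words(old, new):
    """List the words removed ('- ') and added ('+ ') between two source segments."""
    diff = difflib.ndiff(old.split(), new.split())
    return [entry for entry in diff if entry.startswith(("+ ", "- "))]


tm = TranslationMemory()
tm.add("Press the red button to start the machine.",
       "Pulse el botón rojo para arrancar la máquina.")

# A paragraph where only one word changed comes back as a fuzzy match,
# together with the old translation the reviewer can adapt.
hit = tm.lookup("Press the green button to start the machine.")
if hit is not None:
    kind, matched_source, translation, score = hit
    print(kind, round(score, 2), translation)
    print(changed_words(matched_source, "Press the green button to start the machine."))
```

An unchanged paragraph returns an exact match, so the reviewed translation is reused as is. A slightly edited paragraph returns the closest stored segment plus the words that changed, which is the part the human still has to handle.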
When selecting a TMS, here are some features to consider:
Integration with the content management system (CMS) your business uses for storing, maintaining and publishing documents
Collaboration: Multiple people working on a shared set of documents
Workflow management: Initiate a translation job, track the progress of it, and execute payments to the collaborating humans
Ease of use and translator acceptance of the human-facing CAT component of the TMS
Pretranslate with your favorite machine translation system from inside the TMS
Once you have collected a significant amount of domain-specific human translations, you can train a custom translation system directly from the content of your TMS, tuning future machine translations in the direction of the terminology and style that are exemplified by your human translations.

You said you already use Trados. So why don't you create a Translation Memory in Trados and save your corrected work to that Translation Memory? Then, the next time you create your project in Trados (presumably also using the MS Translator plugin, which is freely available from the RWS AppStore), you will pre-translate using your translation memory AND MS Translator. Anything you translated before that is in your translation memory will be used in preference to the machine translation result.
If you have the professional version of Trados, you can also use Perfect Match. In that case, when you receive your updated document with only a few words changed, you match it against the bilingual file from your previous translation. Everything that remains the same receives a Perfect Match status and is optionally locked, making it simple for the translator to identify what needs to be changed.
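The preference order described above (translation memory first, machine translation only for what is new) can be sketched as follows. This is an illustration only, not Trados or Microsoft Translator code: `pretranslate`, the `locked` flag and `fake_machine_translate` are hypothetical names, and the fake translator simply tags the text so the example runs without any service.

```python
def pretranslate(segments, memory, machine_translate):
    """Fill each segment from the memory if possible, otherwise from machine translation.

    memory: dict mapping previously reviewed source segments to their translations.
    machine_translate: any callable taking source text and returning a draft translation.
    """
    results = []
    for source in segments:
        if source in memory:
            # Unchanged since the last review: reuse the human translation and lock it.
            results.append({"source": source, "target": memory[source],
                            "origin": "memory", "locked": True})
        else:
            # New or edited segment: machine draft, left open for the reviewer.
            results.append({"source": source, "target": machine_translate(source),
                            "origin": "machine", "locked": False})
    return results


def fake_machine_translate(text):
    """Hypothetical stand-in for a real machine translation call."""
    return "[MT] " + text


reviewed = {"Switch off the power.": "Desconecte la alimentación."}
updated_document = ["Switch off the power.", "Wait ten seconds."]

for row in pretranslate(updated_document, reviewed, fake_machine_translate):
    print(row["origin"], row["locked"], row["target"])
```

Only the rows that are not locked need the reviewer's attention, which is the point of matching an updated document against the previous bilingual file.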


Translate website and internationalization

I have an ASP.NET application with an AngularJS front end. Until now it was only needed in Brazil, but the company where I work now needs to make the site available in other countries and in other languages, such as Spanish and English.
I have never worked on international applications, so I have many doubts and expect many difficulties.
I'll start with the difficulties:
- My application has fixed texts in the HTML, the database and the code. How can I translate all of these? Is there a component that translates everything on the client (JavaScript, AngularJS, etc.)?
- Development time for this change (too much code to change).
Questions:
- Decimal, date and time values: how should I work with them in an international application? (My application has many logs and values to display and insert.)
I'm researching a lot, but I really need a hint about where to start.
Thank you, I'll wait!
I would start by searching for "Localization". There is a framework called "AspNetBoilerplate" that has implemented localization in its project template. Here is their link: https://aspnetboilerplate.com/ and here is the document on how to use it within their project: https://aspnetboilerplate.com/Pages/Documents/Localization. You can easily download a free project and see what they did to implement it.
They have a separate localization file for each language. The language in use is a setting in the DB, based on the tenant that logs in. All static text is looked up in the localization file for that language (see their .Core project under "Localization" to see all their files).
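As a rough sketch of that lookup (not AspNetBoilerplate's actual code), the per-language files can be pictured as dictionaries keyed by a text name, with a fallback to a default language when a translation is missing. The keys, the in-memory `LOCALIZATION_FILES` table and the `localize` helper are all assumptions for illustration; a real project would load these from resource files.

```python
# Hypothetical stand-in for one localization file per language.
LOCALIZATION_FILES = {
    "en": {"ForgotPassword": "Forgot password?", "Save": "Save"},
    "pt-BR": {"ForgotPassword": "Esqueceu a senha?", "Save": "Salvar"},
    "es": {"ForgotPassword": "¿Olvidó su contraseña?"},  # "Save" not translated yet
}


def localize(key, language, default_language="en"):
    """Look up a static text by key, falling back to the default language."""
    for candidate in (language, default_language):
        text = LOCALIZATION_FILES.get(candidate, {}).get(key)
        if text is not None:
            return text
    # Make missing keys visible instead of failing at render time.
    return "[" + key + "]"


print(localize("ForgotPassword", "pt-BR"))
print(localize("Save", "es"))       # falls back to English
print(localize("Unknown", "es"))    # shows the bracketed key
```

The point of the design is that the HTML and code only ever mention keys, so adding a language means adding a file rather than touching every page.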
As for DateTimes, Decimals, etc., I would write extension methods that look up the cached localization currently in use and format the value accordingly.
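A minimal sketch of such formatting helpers, written in Python rather than as .NET extension methods. The `FORMATS` table is a hand-written assumption covering only two cultures; a real application would take these conventions from the platform's culture data instead of hard-coding them.

```python
from datetime import date
from decimal import Decimal

# Hypothetical cached culture settings (separators and date pattern).
FORMATS = {
    "en-US": {"decimal": ".", "thousands": ",", "date": "%m/%d/%Y"},
    "pt-BR": {"decimal": ",", "thousands": ".", "date": "%d/%m/%Y"},
}


def format_decimal(value, culture, places=2):
    """Format a number with the culture's decimal and thousands separators."""
    fmt = FORMATS[culture]
    text = f"{value:,.{places}f}"  # always "1,234.56" style at this point
    # Swap separators through a placeholder so they do not overwrite each other.
    return (text.replace(",", "\0")
                .replace(".", fmt["decimal"])
                .replace("\0", fmt["thousands"]))


def format_date(value, culture):
    """Format a date using the culture's day/month order."""
    return value.strftime(FORMATS[culture]["date"])


amount = Decimal("1234567.891")
print(format_decimal(amount, "en-US"))         # 1,234,567.89
print(format_decimal(amount, "pt-BR"))         # 1.234.567,89
print(format_date(date(2024, 3, 5), "pt-BR"))  # 05/03/2024
```

Values are kept in one neutral form in the database and logs and are only converted to the user's conventions when displayed, with the reverse conversion on input.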
By no means am I recommending you use this software but rather see how they implemented it within their project. Also, know that I have only had to do this once and there may be numerous other ways to accomplish this goal.

Are there any preset I18n word lists / resource files?

I'm creating a web application that uses I18n. As I don't want to translate very common basic strings like "forgot password?" on my own, I'm asking whether there are already any resource files or word lists containing these strings. One option is to download an existing framework and somehow extract these strings, but this might be a hassle.
In particular, I'm looking for translations related to user authentication, from English to Italian, French and German. The file format doesn't matter.
Professional translators use translation memory tools. I think the generic term for the exchange format is TMX (Translation Memory eXchange). These tools build up standard phrase lists in other languages, so that translators can pull in those phrases while they work, which speeds up the job and reduces the repetitive tedium. So these lists exist.
There is a free plugin for MS Word that does this and may come with lists (sorry, I cannot remember the name, although Rosetta rings a bell).
There is a FOSS TMX tool called Okapi on SourceForge. It may come with the dictionaries, but if not, it is a place to start investigating.
You could also approach Proz, a site for translators, which might be able to point you in the right direction.
Take care with MT like the Google API, as it can give some weird results, but you could use it to build your list and then double-check it. Remember that when you check a language, you need to do it with a native speaker who can pick up on the nuances and colloquialisms.
You can use the Google Translate API together with your own custom resource bundle.

Lotus Smartsuite to "something newer"

I shall try and keep my scenario as brief as possible and to the point.
The office I’m currently working for uses Lotus Smartsuite on Windows 98 / XP, using lots of Lotus Script to tie together Lotus 123 and Lotus Word Pro documents. They also make heavy use of the Lotus Object Linking functions. I shall describe its behaviour below:
You can fill rows and columns in a 123 Spreadsheet with data galore, style it and format it any way you like and define it as a range (nothing unique here). However, you can then copy that range and paste it as a link in a Lotus Word Pro document. This link is then categorised by its range name, so expanding the range back in the 123 file causes the table in the Word Pro Document to expand. This link also carries with it all the formatting and styling of the cells in the 123 Spreadsheet. As I imagine you are now aware, this link is completely live, you can double click anywhere in the object and it opens up the 123 file for editing, and all changes go backward and forward between the two documents. Most of the data retrieved from testing equipment is stored in these 123 spreadsheets and then parts of that are linked into a final Lotus Word Pro report document sent to the customer.
Note: Just to be clear, this is NOT the same as a DDE link in Open Office, which seems to allow for copying of a non-defined range of cells to be imported into a document where all formatting is lost and editing back and forth is not straight forward. It also behaves differently to an OLE object, which seems to only import the entire Spreadsheet rather than a small subsection of it.
However, in recent years, supporting this older software (Lotus) has become more difficult, especially when sending documents to customers (Lotus Word Pro files are generally unsupported by more modern office tools), and technical support for Lotus Smartsuite seems to be practically non-existent these days. Also, because the scripting language is no longer practised by mainstream IT technicians, ongoing development and support seem futile. Once the guys who wrote it move on to other things, we will be left with spaghetti script in a language nobody can help us with.
So, we have this goal of "modernising" our IT system by the end of the year. Linux is becoming a very viable option too (No doubt Debian or a derivative), but Open Office doesn't seem to have the linking capability mentioned above. The reason this linking is so important is because the veterans of the office are so used to working this way - storing data in the spreadsheet, linking back to it later in their Word Pro documents, etc. I think they are more than keen to keep this practice going and we have found no equivalent of it in modern office tools (as was requested of me). I can see, as a software engineer (fluent in many languages), how this practice is not the safest or best way of using and storing data (databases spring to mind), but I was wondering if someone could give me a few other good reasons as to why this is bad practice in the work place (I was always in the belief that you should keep your data away from your reporting and formatting, the two should never be entwined - this looks like spreadsheet hell to me) ... or why this is a good thing to keep doing!?
So, for those of you still with me, I guess what I am asking is:
Is this practice of storing data, formatting it in spreadsheets and importing it directly back and forth between word-processor documents good or bad, and what can be done about it? I guess I'll need to make my case either way.
Are there ANY modern alternatives to this linking method (regardless of whether it is good or bad practice) out there for Linux or Windows? This link MUST carry formatting as well as dynamic range sizes (DDE links don't seem to be the answer).
What would your solution be if you had to start from scratch? Store everything in databases and use SQL to simply ask for the data you need in your word documents? How would you do this? What software would you use?
Any help with this scenario would be more than helpful, or if you know anywhere I should go to ask for advice, that would be appreciated too.
Thank-you for reading!
My suggestion is to first take a step back. What is the benefit to the way things are done now? Is it just a habit that is tough to break? Is there any reason the documents and spreadsheets need to be maintained and linked the way they are, or is it just a requirement because 'that's how it was done before'?
If you can remove that requirement, you have a lot more options and you're building a system that's easier to understand and maintain.
Regarding question 1, I believe there's nothing wrong with storing data in spreadsheets, especially if the end-users need to create and maintain them and development staff is limited. Some questions are whether that data needs to be secured, is related between spreadsheets, is duplicated across the company, or should be shared in a better way across the company. If any of those are true then a centralized database would make more sense. Personally I'd want any valuable data safely stored in a database where it can be managed, access to it can be controlled, it can be easily backed-up, etc.
Regarding question 2, you can do the same thing in Microsoft Office. You can either link the documents, so that the data stays in the source Excel workbook but appears in the Word document, or you can embed the Excel spreadsheet within the Word document.
You might want to look at Microsoft Access for storing the data and generating reports. Or you could build an application using a relational database back-end and reporting front-end. The possibilities are wide-open. It really depends on where the expertise lies within the company.
If it were me I'd probably use a SQL Express back-end (it's free) and a custom ASP.NET MVC application for generating the reports, but that's just where my expertise lies.
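To illustrate the "data in a database, report generated from a query" idea, here is a small sketch. It swaps in Python's built-in `sqlite3` for the SQL Express back-end and a plain-text report for the ASP.NET MVC front-end, so it only shows the separation of data from presentation. The `measurements` table, its columns and the sample values are made up for the example.

```python
import sqlite3


def build_report(connection, sample_id):
    """Query one sample's measurements and lay them out as a plain-text report."""
    rows = connection.execute(
        "SELECT test_name, value, unit FROM measurements "
        "WHERE sample_id = ? ORDER BY test_name",
        (sample_id,),
    ).fetchall()
    lines = [f"Report for sample {sample_id}"]
    for test_name, value, unit in rows:
        # Formatting lives here, in the report, not with the stored data.
        lines.append(f"{test_name:<12}{value:>8.2f} {unit}")
    return "\n".join(lines)


# In-memory database standing in for the central data store.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE measurements (sample_id TEXT, test_name TEXT, value REAL, unit TEXT)"
)
connection.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        ("S-001", "pressure", 101.3, "kPa"),
        ("S-001", "temperature", 21.5, "C"),
        ("S-002", "pressure", 99.8, "kPa"),
    ],
)

print(build_report(connection, "S-001"))
```

Because the report asks for rows by query, it grows and shrinks with the data, which covers the "dynamic range" part of the old linking workflow without tying the data to any one document.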

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.). Specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs and all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
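A minimal sketch of both steps using only the Python standard library. The feed layout (RSS 2.0 `item` elements with `title` and `link`), the sample XML and the user-agent string are assumptions; check the real feed's URL and structure before relying on this.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-feed-reader/0.1"}  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_rss(xml_data):
    """Extract title and link from every <item> in an RSS document."""
    root = ET.fromstring(xml_data)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


# Stand-in for a downloaded feed, so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Latest songs</title>
  <item><title>Artist A - Song One</title><link>http://example.com/1</link></item>
  <item><title>Artist B - Song Two</title><link>http://example.com/2</link></item>
</channel></rss>"""

for entry in parse_rss(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```

With a real feed you would call `parse_rss(fetch(feed_url))`, where `feed_url` is whatever address the site publishes for its RSS feed.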
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
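One simple way to keep a script from polling too often is to track when it last ran and skip the download until a minimum interval has passed. The sketch below is one possible approach. The `PollThrottle` class and the 15-minute interval are assumptions, not a limit published by any site.

```python
import time


class PollThrottle:
    """Allow an action at most once per minimum interval."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._last_run = None

    def ready(self):
        """Return True (and record the time) if enough time has passed."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self.min_interval:
            self._last_run = now
            return True
        return False


throttle = PollThrottle(min_interval_seconds=15 * 60)

if throttle.ready():
    print("downloading feed")   # the actual download would go here
if not throttle.ready():
    print("too soon, skipping this run")
```

If the script is started by a scheduler instead of running in a loop, the same effect comes from choosing a sensible schedule, for example a few times per hour instead of every few seconds.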
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing to work out is what kind of information you want to extract. If you want to extract entire websites, as Google does, your best option is probably to look at tools like Nutch from Apache.org or the Flaptor solution http://ww.hounder.org. If you need to extract particular areas of unstructured documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs: nutch.apache.org
On the other hand, if you need to extract particular text or clipped areas of a website, with rules based on the DOM of the page, then what you need is closer to tools like mozenda.com. With those tools you can set up extraction rules in order to scrape particular information from a website. Keep in mind that any change to the webpage can break your robot.
Finally, if you are planning to build a website on top of information sources, you could purchase information from companies such as spinn3r.com, where they sell particular niches of information ready to be consumed. You would save a lot of money on infrastructure.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.

Categorized Document Management System

At the company I work for, we have an intranet that provides employees with access to a wide variety of documents. These documents fall into several categories and subcategories, and each of these categories has its own web page. Below is one such page (each of the links shown will link to a similar view for that category):
http://img16.imageshack.us/img16/9800/dmss.jpg
We currently store each document as a file on the web server and hand-code links to these documents whenever we need to add a new document. This is tedious and error-prone, and it also means we lack any sort of security for accessing these documents. I began looking into document management systems (like KnowledgeTree and OpenKM); however, none of these systems seems to provide a categorized view like the one in the preview above.
My question is ... does anyone know of any Document Management System that allows for the type of flexibility we currently have with hand-coding links to our documents into various webpages (major and minor), while also providing security, ease of use, and (less important) version control? Or do you think I'd be better off developing such a system from scratch?
If you are trying to categorize the files or folders in a document management system, that's not a difficult task. You only need access to the admin panel to maintain or categorize the folders.
In Laserfiche, you can easily categorize your folders by department and also subcategorize them.
You should look into Alfresco. It's extremely extensible and provides a lot of ways of accessing the repository.
Note: click the "Developers" tab for the community edition.
"My question is ... does anyone know of any Document Management System that allow for the type of flexibility we currently have with hand-coding links to our documents into various webpages (major and minor), while also providing security, ease of use, and (less important) version control? Or do you think I'd be better off developing such a system from scratch?"
Well, there are companies that make a living selling doc management software. Anything you can get off the shelf is going to be a huge time saver, and it's going to be better than anything you could reasonably develop by hand.
I've used a few systems:
SharePoint: I hear some people don't like it, and I didn't either ;)
HyperOffice worked really well for my company of around 150 employees and has all the features you describe.
My current company uses Confluence, and I like it :) But it's probably one of those tools whose price tag isn't worth it, especially if you're only using a subset of its features, like doc management.
I haven't used it, but one guy I know raves about Alfresco, a free and open source doc management system. I looked at its website, seems simple enough to use.
We also faced a similar problem. However, version control was a higher priority for us, and we looked into many solutions. We found Globodox extremely easy to install and use and, more importantly, the support team was absolutely fantastic.
Try Mayan EDMS. It's Django-based and open source; use it as a base and build the custom features you want on top of it.
Code location: https://gitlab.com/mayan-edms/mayan-edms
Homepage at: http://www.mayan-edms.com
The project is also available via PyPI at: https://pypi.python.org/pypi/mayan-edms/
