We have developed a medium-sized ASP.NET / SQL Server application that uses resource files to provide English and Spanish user interface variants. Unicode data types are used throughout the databases. Now we are being asked to add Mandarin to the mix. I have no experience with localising into Asian languages, so I really cannot imagine what kind of job this would be.
My questions are:
How complex would this job be, as compared to localising into another Western language such as French or German?
What additional issues, other than (obviously) translating strings in resource files, should I deal with for Mandarin localisation? Anything related to the different alphabet, perhaps?
Reports of previous experiences or pointers to best practices are most welcome. Thanks.
On the technical side of things, I don't believe it's significantly more difficult. Adding support for non-Western languages will expose encoding issues if you are not using Unicode throughout, but it's pretty much the norm to use UTF-8 encoding and Unicode SQL types (nvarchar instead of varchar) anyway.
I would say that the added complexity and uncertainty is more about the non-technical aspects. Most of us English speakers are able to make some sense of European languages when we see 1:1 translations and can notice a lot of problems. But Mandarin is utterly meaningless to most of us, so it's more important to get a native speaker to review or at least spot-check the app prior to release.
One thing to be mindful of is the issue of input methods: Chinese speakers use IMEs (input method editors) to input text, so avoid writing custom client-side input code such as capturing and processing keystrokes.
Another issue is which culture identifier to choose. There are several variants for Chinese: zh-Hans (Simplified script, which replaces zh-CHS and is used in mainland China) and zh-Hant (Traditional script, which replaces zh-CHT and is used in Taiwan). See the note on this MSDN page for more details. However, these are neutral culture identifiers (they are not country-specific): they can be used for localization, but not for things such as number and date formatting, so ideally you should use a specific culture identifier such as zh-CN for China or zh-TW for Taiwan. Choosing a specific culture for a web application can be tricky, so this choice is usually based on your market expectations. More info on the different .NET culture identifiers is in this other post.
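The app in question is ASP.NET, but the neutral-versus-specific distinction is easiest to see with plain BCP 47 tags. Here is a minimal Java sketch of the same idea (the .NET CultureInfo API is broadly analogous); the tags are real, everything else is just illustration:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class ChineseLocales {
    public static void main(String[] args) {
        // Script-level tags are enough to pick a translation (Simplified vs. Traditional)...
        Locale simplified = Locale.forLanguageTag("zh-Hans");
        Locale traditional = Locale.forLanguageTag("zh-Hant");
        System.out.println(simplified.toLanguageTag() + " / " + traditional.toLanguageTag());

        // ...but number and date conventions are tied to a region, so formatting
        // should use a region-specific tag such as zh-CN or zh-TW.
        DateTimeFormatter longDate = DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG);
        System.out.println(longDate.withLocale(Locale.forLanguageTag("zh-CN")).format(LocalDate.now()));
        System.out.println(longDate.withLocale(Locale.forLanguageTag("zh-TW")).format(LocalDate.now()));
    }
}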
Hope this helps!
As far as translating text in the user interface goes, the localization effort for Chinese is probably comparable to that of Western languages. Like English and Spanish, Chinese is read left to right, so you won't need to mirror the page layout as you would if you had to support Arabic or Hebrew. Here are a couple more points to consider:
Font size: Chinese characters are more intricate than Latin characters, so you may need to use a larger font size. English and Spanish are readable at 8pt; for Chinese, you'll want a minimum of 10pt.
Font style: In English, bold and italics are often used for emphasis. In Chinese, emphasis is usually achieved with a different typeface, font size, or color. Use bold with caution, and avoid italics.
However, if you're targeting an Asian market, more significant changes may be required. Here are a few examples:
Personal names: A typical Chinese name is 孫中山: the first character (孫) is the family name, and the second and third characters (中山) constitute the given name. This of course is the opposite of the common Western convention of "given name" + space + "family name". If you're storing and displaying names, you may want to use a single "Name" field instead of separate "First Name" and "Last Name" fields.
Colors: In the U.S., it's common to use green for "good" and red for "bad". However, in China and Taiwan, red is "good". For instance, compare the stock prices on Yahoo! versus Yahoo! Taiwan.
Lack of an alphabet: Chinese characters are not based on an alphabet. Thus, for example, it wouldn't make sense to offer the ability to filter a list by the first letter of each entry, as in a directory of names.
Regarding sort order, different methods are used in Chinese: binary (i.e. Unicode code point order), radical + stroke count, Pinyin (romanization), and Bopomofo (a syllabary).
On SQL Server, you can define the sort order using the COLLATE clause in the column definition or at statement level. It's also possible to set a default collation at database level.
Run this statement to see all supported Chinese collations:
select * from fn_helpcollations()
where name like 'chinese%'
In .NET, there's also a list of Chinese cultures that can be used for sorting; see MSDN.
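For comparison only (the question is about .NET), here is a hedged Java sketch of the same culture-driven sorting idea using java.text.Collator; which concrete Chinese ordering you get (Pinyin, stroke, ...) depends on the collation data shipped with the runtime:

import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ChineseSort {
    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("张伟", "王芳", "李娜"));

        // Binary order: plain String comparison sorts by UTF-16 code units.
        names.sort(null);
        System.out.println("binary:   " + names);

        // Culture-aware order: a Collator for a Chinese locale applies linguistic rules.
        Collator zh = Collator.getInstance(Locale.forLanguageTag("zh-CN"));
        names.sort(zh);
        System.out.println("collated: " + names);
    }
}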
We are using R/exams to create tests in Canvas and TestVision.
We have other forms and other software to perform written exams.
I know R/exams has a great NOPS feature and was wondering:
What software is used to autograde the NOPS forms?
Can that software also evaluate string questions?
Right now it looks like the NOPS form doesn't make it easy for software to read certain parts. Ideally the software would be adapted so that adapted NOPS forms (changes marked in blue) could more easily read the student name and the string questions.
NOPS format
The NOPS forms have not been designed by us but they follow the format that our university has been using. We simply mimicked their format because we initially just generated the PDF files ourselves but used the commercial scanning software of our university.
Scanning
However, over the years we have written our own scanner implementation in R in exams::nops_scan(). The basic approach is to convert the PDF pages to PNG images, read these into R, convert them to black-and-white pixel matrices, find the scanner markings in the corners, and then extract just the boxes relative to these markings. The boxes either contain printed digits in a fixed font, for which a simple decision tree yields a reliable classification, or they are empty/filled vs. checked, which can also be classified reasonably reliably. The result is stored in a simple text format that was again not developed by us but chosen to be fully compatible with the commercial system that our university used.
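This is not the exams code (which is in R); just to make the box-classification idea concrete, here is a rough, language-neutral sketch in Java under the simple assumption that a box sits at known coordinates relative to the corner markings and is classified by its fraction of dark pixels:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class BoxClassifier {
    // Fraction of "dark" pixels inside a rectangular box of a scanned page.
    static double darkFraction(BufferedImage img, int x, int y, int w, int h) {
        int dark = 0;
        for (int i = x; i < x + w; i++) {
            for (int j = y; j < y + h; j++) {
                int rgb = img.getRGB(i, j);
                int gray = (((rgb >> 16) & 0xff) + ((rgb >> 8) & 0xff) + (rgb & 0xff)) / 3;
                if (gray < 128) dark++;   // simple global threshold to black/white
            }
        }
        return (double) dark / (w * h);
    }

    public static void main(String[] args) throws Exception {
        BufferedImage page = ImageIO.read(new File("page.png")); // hypothetical scanned page
        double f = darkFraction(page, 100, 200, 30, 30);          // hypothetical box position/size
        // Empty boxes have almost no dark pixels, checked boxes some, filled boxes most.
        String label = f < 0.05 ? "empty" : (f < 0.6 ? "checked" : "filled");
        System.out.println(label + " (" + f + ")");
    }
}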
Grading
Based on the scan results, the function exams::nops_eval() computes points and grades. Various evaluation strategies can be plugged in, and starting from version 2.4-0 the reports generated by the function can be customized.
Extension to OCR
At the moment no OCR (optical character recognition) is used, except for the simple task of recognizing printed numbers in a fixed font. But no hand-written characters or digits are ever evaluated automatically. I had played around with this a little bit using tesseract but the results were not reliable enough for our purposes.
The string questions that are currently supported are intended for open-ended questions. Hence students get a reasonable amount of space to write something down. The teacher can then grade the answer sheet manually, again by ticking boxes only, which can be read rather reliably. The scanned images of the full sheet are included in the report for the students so that they can also see any hand-written feedback/corrections included in the answer form.
Tutorial
A hands-on guide to using the NOPS approach is available at: http://www.R-exams.org/tutorials/exams2nops/
Misc
Unfortunately, the system is not implemented in a very modular fashion. The reasons for this were twofold: (1) We followed very closely the given format our university had been using. (2) The bulk of the implementation was written under a lot of time pressure (see the anecdote below). So while the features you propose would be nice to have, they are unlikely to fit well into the current setup. If you want to have a stab at this, I would recommend writing a new, modular implementation, reusing the bits and pieces from the existing code that are useful enough.
Anecdote: Scanning of about 400-500 exam sheets had failed on the university system due to a mistake by the copy shop that had printed the sheets. It was mid-July and everybody was on vacation already, including myself. So I sat on my parents' porch for two days to write the scanner tool and evaluate the exams that the students were waiting for.
The SOUNDEX() function checks whether words are similar in sound and allows for misspellings to be included in a select.
The way I understand it, though, is that it uses English as the basis for the word sounds it generates; in other languages this can lead to mismatches because words, vowels, etc. are pronounced differently.
Is there a way to have SOUNDEX() behave with for example German or Dutch or any other language? Or am I misunderstanding how it works and will it work just as properly as with English?
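To make the question concrete: the classic Soundex code is built purely from English consonant groups, which is easy to see outside SQL as well. A minimal Java sketch, assuming Apache Commons Codec (commons-codec) is on the classpath; SQL Server's built-in SOUNDEX() applies essentially the same English-oriented rules:

import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // These two German spellings happen to map to the same code (M600)...
        System.out.println(soundex.encode("Meyer"));
        System.out.println(soundex.encode("Maier"));
        // ...but whether two spellings match is decided by the English letter groups,
        // not by how the words are actually pronounced in German or Dutch.
        System.out.println(soundex.encode("Schmidt"));
        System.out.println(soundex.encode("Smit"));
    }
}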
I am writing a package to facilitate importing Brazilian socio-economic microdata sets (Census, PNAD, etc.).
I foresee two distinct groups of users of the package:
Users in Brazil, who may feel more at ease with the documentation in Portuguese. They can probably understand English to some extent, but a foreign language would probably make the package feel less "ergonomic".
The broader international user community, for whom English documentation may be a necessary condition.
Is it possible to write a package in a way that the documentation is "bilingual" (English and Portuguese), and that the language shown to the user will depend on their country/language settings?
Also, is that doable within the roxygen2 documentation framework?
I realise there is a tradeoff between making the package more user-friendly by making it bilingual and the increased complexity and difficulty of maintenance. General comments on this tradeoff from previous experience are also welcome.
EDIT: following the comment's suggestion I cross-posted to the r-package-devel mailing list HERE; follow the answers at the bottom. Duncan Murdoch posted an interesting answer covering some of what @Brandon's answer (below) covers, but also including two additional suggestions that I think are useful:
have the package in one language, but provide vignettes in different languages. I will follow this advice.
have two versions of the package, say 1.1 and 1.2, one in each language.
According to rOpenSci, there is no standard mechanism for translating package documentation into non-English languages. They describe the typical process of internationalization/localization as follows:
To create non-English documentation requires manual creation of supplemental .Rd files or package vignettes. Packages supplying non-English documentation should include a Language field in the DESCRIPTION file.
And some more info on the Language field:
A ‘Language’ field can be used to indicate if the package documentation is not in English: this should be a comma-separated list of standard (not private use or grandfathered) IETF language tags as currently defined by RFC 5646 (https://www.rfc-editor.org/rfc/rfc5646, see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use language subtags which in essence are 2-letter ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3 (https://en.wikipedia.org/wiki/ISO_639-3) language codes.
Care is needed if your package contains non-ASCII text, and in particular if it is intended to be used in more than one locale. It is possible to mark the encoding used in the DESCRIPTION file and in .Rd files.
Regarding encoding...
First, consider carefully if you really need non-ASCII text. Many users of R will only be able to view correctly text in their native language group (e.g. Western European, Eastern European, Simplified Chinese) and ASCII. Other characters may not be rendered at all, rendered incorrectly, or cause your R code to give an error. For .Rd documentation, marking the encoding and including ASCII transliterations is likely to do a reasonable job. The set of characters which is commonly supported is wider than it used to be around 2000, but non-Latin alphabets (Greek, Russian, Georgian, …) are still often problematic and those with double-width characters (Chinese, Japanese, Korean) often need specialist fonts to render correctly.
On a related note, R does, however, provide support for errors and warnings in different languages: "There are mechanisms to translate the R- and C-level error and warning messages. These are only available if R is compiled with NLS support (which is requested by configure option --enable-nls, the default)."
Besides bilingual documentation, please allow me the following comment: given your two target groups, it may be assumed that some of your users will be running a non-English OS (typically, Windows in Portuguese). When importing time series data (or any date entries, for that matter), different date formatting conventions (English vs. non-English) can give you different results (i.e. misinterpreted date entries) on English vs. non-English machines. I have some experience with these issues (I often work with Czech-language-based OSs) and, other than ad-hoc coding, I don't know of a simple solution.
(If you find this off-topic, please feel free to delete)
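To illustrate the date-format point above (the original context is R, but the issue is language-independent), here is a minimal Java sketch; the date and the exact rendered strings are made up for illustration:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LocaleDates {
    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2021, 3, 5);
        DateTimeFormatter pattern = DateTimeFormatter.ofPattern("dd MMMM yyyy");

        // The same date renders differently depending on the locale,
        // e.g. "05 março 2021" vs. "05 March 2021" (capitalization depends on the JDK's locale data).
        String ptText = pattern.withLocale(Locale.forLanguageTag("pt-BR")).format(d);
        String enText = pattern.withLocale(Locale.US).format(d);
        System.out.println(ptText + " | " + enText);

        // Parsing only round-trips if the matching locale is passed explicitly,
        // instead of relying on whatever the user's OS default locale happens to be.
        System.out.println(LocalDate.parse(ptText, pattern.withLocale(Locale.forLanguageTag("pt-BR"))));
    }
}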
In German there is a formal ("Sie") and an informal ("Du") form of address. I would like to translate some software into informal German, notably WooCommerce using Transifex, but new languages can only be added using their locale code.
There only seems to be de_DE as a locale. What's the best way to differentiate between the two forms? Shouldn't there be another locale code just for the informal form, too?
Generally, Gettext uses @ to distinguish language variants. So in this case it could be de_DE@informal. This way the locale will be correctly handled by Gettext, with fallback to de_DE (in contrast to the suggested de_DE-x-informal, which will not fall back this way, as Gettext will see DE-x-informal as a country code).
You can find more on locales naming at https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html
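The same "variant with fallback" idea exists outside Gettext as well; for comparison, here is a small Java sketch using ResourceBundle variants (the bundle names are hypothetical):

import java.util.Locale;
import java.util.ResourceBundle;

public class InformalGerman {
    public static void main(String[] args) {
        // Hypothetical bundles on the classpath:
        //   Messages_de_DE.properties           - formal "Sie" strings (the default)
        //   Messages_de_DE_informal.properties  - informal "Du" overrides
        Locale informal = new Locale("de", "DE", "informal");
        ResourceBundle msgs = ResourceBundle.getBundle("Messages", informal);

        // Keys missing from the informal variant are looked up in Messages_de_DE,
        // roughly mirroring the de_DE@informal -> de_DE fallback described above.
        System.out.println(msgs.getString("greeting"));
    }
}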
Since you asked about WooCommerce, explaining the current state of how WordPress handles it is probably most relevant:
The best approach is to use locale variants as Michal wrote, but WordPress has its own unique twist on it and doesn't use the variant syntax. Instead it adds a third component to the file name (and not -x-informal either): de_DE.mo is the informal variant in WordPress (and also the fallback, because it lacks any further specification), and de_DE_formal.mo contains the formal variant.
Language tags as defined in BCP 47 (currently RFC 5646) do not specify any codes for informal vs. formal variants of German. It would probably be unrealistic to try to have such codes registered, so you would be limited to Private Use subtags, e.g. de-x-formal vs. de-x-informal. Whether your software can handle these in any way is a different issue.
On the practical side, the choice of “Sie” vs. “Du” (or “du”) is hardly a language variant issue. Standard German uses both pronouns to address a person in the singular, depending on the style of presentation and on the relationship with the addressed person. At the extreme, we could say that the choice of “Sie” vs. “Du” when addressing a user generically in instructions or a user interface is a language variant issue. But in practice, just make up your mind.
I try to externalize all strings (and other constants) used in any application I write, for many reasons that are probably second nature to most Stack Overflowers, but one thing I would like to have is the ability to automate spell checking of any user-visible strings. This poses a couple of problems:
Not all strings are user-visible, and it's non-trivial to separate them and keep that separation in place (but it is possible).
Most, if not all, string externalization methods I've used involve significant text that will not pass a spell checker such as aspell/ispell (e.g. theStrName="some string." and comments).
Many spellcheckers (once again, aspell/ispell) don't handle many words out of the box (generally technical terms, proper nouns, or just 'new' terminology, like metadata).
How do you incorporate something like this into your build procedures/test suites? It is not feasible to have someone manually spell check all the strings in an application each time they are changed -- and there is no chance that they will all be spelled correctly the first time.
We do it manually, if errors aren't picked up during testing then they're picked up by the QA team, or during localization by the translators, or during localization QA. Then we lodge a bug.
Most of our developers are not native English speakers, so it's not an uncommon problem for us. The number that slip through the cracks is so small that this is a satisfactory solution for us.
Nothing over a few hundred lines is ever 100% bug-free (well... maybe the odd piece of embedded code), just think of spelling mistakes as bugs and don't waste too much time on it.
As soon as your application matures, over 90% of strings won't change between releases, and it would be a reasonably trivial exercise to compare two versions of your resources, figure out what's new (check them first), what's changed/updated (check next), and what hasn't changed (no need to check these).
So think of it more like I need to check ALL of these manually the first time, and I'm only going to have to check 10% of them next time. Now ask yourself if you still really need to automate spell checking.
I can think of two ways to approach this semi-automatically:
Have the compiler help you differentiate between strings used in the UI and strings used elsewhere. Create different variants of the string datatype depending on its purpose, and overload the output methods to only accept the UI variant - that way you can create a fake UI that just outputs the UI strings, and do the spell checking on that (see the sketch after this answer).
Whether this is doable of course depends on the platform and the overall architecture of the application.
Another approach could be to simply update the spell checker's database with all the strings that appear in the code - comments, XPaths, table names, you name it - and regard them as perfectly cromulent. This will of course reduce the precision of the spell checking.
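A minimal sketch of the first approach (all names hypothetical): a dedicated type for user-visible strings, so that only those strings can reach the output methods and a test harness can enumerate exactly the text that needs spell checking:

import java.util.ArrayList;
import java.util.List;

final class UiString {
    private static final List<UiString> ALL = new ArrayList<>();
    private final String text;

    private UiString(String text) { this.text = text; }

    // Every user-visible literal is wrapped once, so it can be enumerated later.
    static UiString of(String text) {
        UiString s = new UiString(text);
        ALL.add(s);
        return s;
    }

    static List<UiString> all() { return new ArrayList<>(ALL); }

    @Override public String toString() { return text; }
}

final class Ui {
    // Output methods accept only UiString, never a raw String.
    static void show(UiString message) { System.out.println(message); }
}

public class SpellCheckHarness {
    static final UiString GREETING = UiString.of("Welcome to the aplication"); // deliberate typo

    public static void main(String[] args) {
        Ui.show(GREETING);
        // In a build or test step, this list would be piped into aspell/ispell
        // instead of being printed.
        for (UiString s : UiString.all()) {
            System.out.println(s);
        }
    }
}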
First thing, regarding string externalization: GNU gettext (if used properly) creates string files that contain almost no text other than the actual content of the strings (there are some headers, but it's easy to make a spell checker ignore them).
Second thing, what I would do is run the spell checker in a continuous integration environment and have the errors reported externally, probably through a web interface, but email will also work. Developers can then review the errors and either fix them in the code or use some easy interface to let the spell checker know that a misspelling should be ignored (a web interface can integrate both the error view and the spell checker interface).
If you're using java and are storing your localized strings in resource bundles then you could check the Bundle.properties files and validate the bundle strings. You could also add a special comment annotation that your processor could use to determine if an entry should be skipped.
This method will allow you to give a hint as to the locale and provide a way of checking multiple languages within the one build process.
I can't answer how you would perform the actual spell checking itself, though I think what I've presented will guide you as to the method of performing the spell checking.
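As a rough illustration of that idea (the file path and the skip annotation are made up, and the actual spell checking is left out), a bundle file could be scanned like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BundleSpellSource {
    public static void main(String[] args) throws IOException {
        List<String> toCheck = new ArrayList<>();
        boolean skipNext = false;

        // Hypothetical bundle path and marker comment; adjust to your project's conventions.
        for (String line : Files.readAllLines(Path.of("src/main/resources/Bundle.properties"))) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            if (trimmed.startsWith("#")) {
                // A comment annotation marks the next entry as "do not spell check".
                skipNext = trimmed.contains("@nospell");
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq >= 0 && !skipNext) {
                toCheck.add(trimmed.substring(eq + 1).trim()); // keep only the user-visible value
            }
            skipNext = false;
        }

        // These lines would then be fed to a spell checker (per target locale).
        toCheck.forEach(System.out::println);
    }
}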
Use aspell. It's a program, it's available for Unix-like systems and Cygwin, and it can be run over lots of kinds of source code. Use it.
First point, please don't put it into your build process. I would be a vengeful coder if I (meaning my computer) had to spell check all the content on the site every time I tried to debug or build a new feature. I don't even think this kind of operation belongs as a unit test (you're testing a human interface, not a computerised one).
Second point, don't write a script. You're going to have so many false positives that people will stop reading the reports, and you'll be no better off than when you started.
Third point, this is probably most easily solved by having humans do it: QA team, copy writers, beta testers, translators, etc. All the big sites with internationalised content that I've built had the same process: we took the copy from the copy writers, sent it to the translation service/agency, put it into the persistence layer, and deployed it. Testers (QA, developers, PMs, designers, etc.) would find spelling or grammatical mistakes and lodge bug reports. There is just too much red tape and too many pairs of eyes for that many spelling/grammar errors to slip through.
Fourth point, there will always be spelling and grammar mistakes on your page. Even major newspaper web sites haven't gotten around this and they have whole office buildings filled with editors.