What software can I use to create an original (non-Latin) alphabet? I have found FontLab 5, but I cannot find documentation that discusses how to create an original alphabet; it assumes everyone wants to create new fonts for English. I have created a scientific alphabet that is not used by any culture in the world, and I want to know what software I can use to create the characters.
Ultimately, you're stuck with creating fonts on top of existing character sets, because your alphabet, in addition to its glyphs (the visual representations of characters), requires an encoding (the underlying mapping between code points and characters). Creating a genuinely new alphabet would entail designing and proliferating a new character encoding.
Unicode currently provides three Private Use Areas, which are guaranteed never to be assigned characters by the Unicode Consortium. You can create a font that maps your glyphs to those Unicode scalar values.
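To make that concrete, here is a minimal Python sketch of the three Private Use Area ranges and of text built from them. The syllable names are invented placeholders, and the characters only render meaningfully once a font (e.g. one built in FontLab) assigns shapes to those code points.

# The three Unicode Private Use Areas; code points in these ranges will
# never receive official character assignments, so they are free for a custom alphabet.
PRIVATE_USE_AREAS = [
    (0xE000, 0xF8FF),       # Basic Multilingual Plane PUA (6,400 code points)
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B
]

# Map a few invented letter names onto the start of the BMP PUA.
my_alphabet = {name: chr(0xE000 + i)
               for i, name in enumerate(["ka", "ke", "ki", "ko"])}

print(my_alphabet["ka"])                   # shows your font's glyph for U+E000
print(f"U+{ord(my_alphabet['ka']):04X}")   # -> U+E000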
Related
We are using R/exams to create tests in Canvas and TestVision.
We have other forms and other software to perform written exams.
I know R/exams has a great NOPS feature and was wondering:
What software is used to autograde the NOPS forms?
Can that software also evaluate string questions?
Right now it looks like the NOPS form doesn't make it easy for software to read some parts. Ideally the software would be adapted so that adapted NOPS forms (changes in blue) could capture the student name and string questions more easily:
NOPS format
The NOPS forms have not been designed by us but they follow the format that our university has been using. We simply mimicked their format because we initially just generated the PDF files ourselves but used the commercial scanning software of our university.
Scanning
However, over the years we have written our own scanner implementation in R in exams::nops_scan(). The basic approach is to convert PDF pages to PNG images, read these into R, convert them to black-and-white pixel matrices, find the scanner markings in the corners, and then extract just the boxes relative to these markings. The boxes either contain printed digits in a fixed font, for which a simple decision tree yields a reliable classification, or they are empty/filled vs. checked, which can also be classified reasonably reliably. The result is stored in a simple text format that was again not designed by us but kept fully compatible with the commercial system that our university used.
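To give a flavour of that box-classification step, here is a rough Python sketch (illustrative only; the actual implementation is the R code in exams::nops_scan(), and the thresholds below are made up). It assumes Pillow is installed and that the box coordinates have already been located relative to the corner markings.

from PIL import Image

def fill_ratio(page_png, box, threshold=128):
    # Fraction of dark pixels inside a box given as (left, top, right, bottom).
    gray = Image.open(page_png).convert("L").crop(box)
    pixels = list(gray.getdata())
    return sum(p < threshold for p in pixels) / len(pixels)

def classify_box(page_png, box):
    # Very rough empty vs. checked vs. filled decision from the dark-pixel share.
    r = fill_ratio(page_png, box)
    if r < 0.05:
        return "empty"
    if r < 0.60:
        return "checked"   # a cross or tick only covers part of the box
    return "filled"        # a completely blackened (cancelled) box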
Grading
Based on the scan results the function exams::nops_eval() computes points and grades. Various evaluation strategies can be plugged in and starting from version 2.4-0 the reports generated by the function can be customized.
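Just to illustrate the general idea of a pluggable rule (a hypothetical Python sketch, not the actual exams::nops_eval() code): each rule maps the set of checked boxes and the set of correct boxes to a score.

def all_or_nothing(checked, correct):
    # Full credit only for exactly the right set of checked boxes.
    return 1.0 if checked == correct else 0.0

def partial_credit(checked, correct, negative=0.25):
    # Credit per correct check, small deduction per wrong check, floored at 0.
    hits = len(checked & correct)
    misses = len(checked - correct)
    return max(hits / len(correct) - negative * misses, 0.0)

print(partial_credit({"b", "c"}, {"b", "d"}))   # 0.25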
Extension to OCR
At the moment no OCR (optical character recognition) is used, except for the simple task of recognizing printed numbers in a fixed font. But no hand-written characters or digits are ever evaluated automatically. I had played around with this a little bit using tesseract but the results were not reliable enough for our purposes.
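Such an experiment might look roughly like the sketch below, assuming the pytesseract wrapper and the tesseract binary are installed (the file name is made up). Even with a restricted character set and single-character page segmentation, hand-written digits were not recognised dependably.

from PIL import Image
import pytesseract

digit_box = Image.open("registration_box.png").convert("L")
text = pytesseract.image_to_string(
    digit_box,
    # psm 10 = treat the crop as a single character; whitelist digits only
    config="--psm 10 -c tessedit_char_whitelist=0123456789",
)
print(text.strip())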
The string questions that are currently supported are intended for open-ended questions. Hence students get a reasonable amount of space to write something down. The teacher can then grade the answer sheet manually, again by ticking boxes only, which can be read rather reliably. The scanned images of the full sheet are included in the report for the students so that they can also see any hand-written feedback/corrections included in the answer form.
Tutorial
A hands-on guide to using the NOPS approach is available at: http://www.R-exams.org/tutorials/exams2nops/
Misc
Unfortunately, the system is not implemented in a very modular fashion. The reasons for this were twofold: (1) we followed very closely the given format our university had been using, and (2) the bulk of the implementation was written under a lot of time pressure (see the anecdote below). So while the features you propose would be nice to have, they are unlikely to fit well into the current setup. If you want to have a stab at this, I would recommend writing a new, modular implementation, reusing just the bits and pieces from the existing code that are useful enough.
Anecdote: Scanning of about 400-500 exam sheets had failed on the university system due to a mistake by the copy shop that had printed the sheets. It was mid-July and everybody was already on vacation, including myself. So I sat on my parents' porch for two days to write the scanner tool and evaluate the exams that the students were waiting for.
I am writing a package to facilitate importing Brazilian socio-economic microdata sets (Census, PNAD, etc).
I foresee two distinct groups of users of the package:
Users in Brazil, who may feel more at ease with documentation in Portuguese. They can probably understand English to some extent, but a foreign language would likely make the package feel less "ergonomic".
The broader international user community, for whom English documentation may be a necessary condition.
Is it possible to write a package in a way that the documentation is "bilingual" (English and Portuguese), and that the language shown to the user will depend on their country/language settings?
Also,
Is that doable within the roxygen2 documentation framework?
I realise there is a tradeoff between making the package more user-friendly by making it bilingual and the increased complexity and maintenance effort. General comments on this tradeoff from previous experience are also welcome.
EDIT: Following the comment's suggestion, I cross-posted to the r-package-devel mailing list: HERE (then follow the answers at the bottom). Duncan Murdoch posted an interesting answer covering some of what @Brandon's answer (below) covers, but also including two additional suggestions that I think are useful:
Have the package in one language, but provide the vignettes in different languages. I will follow this advice.
Have two versions of the package, let's say 1.1 and 1.2, one in each language.
According to rOpenSci, there is no standard mechanism for translating package documentation into non-English languages. They describe the typical process of internationalization/localization as follows:
To create non-English documentation requires manual creation of supplemental .Rd files or package vignettes.
Packages supplying non-English documentation should include a Language field in the DESCRIPTION file.
And some more info on the Language field:
A ‘Language’ field can be used to indicate if the package documentation is not in English: this should be a comma-separated list of standard (not private use or grandfathered) IETF language tags as currently defined by RFC 5646 (https://www.rfc-editor.org/rfc/rfc5646, see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use language subtags which in essence are 2-letter ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3 (https://en.wikipedia.org/wiki/ISO_639-3) language codes.
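As an illustration (my own example, not part of the quoted text), a bilingual Brazilian/English package could declare something like the following in its DESCRIPTION:

Language: pt-BR, en
Encoding: UTF-8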
Care is needed if your package contains non-ASCII text, and in particular if it is intended to be used in more than one locale. It is possible to mark the encoding used in the DESCRIPTION file and in .Rd files.
Regarding encoding...
First, consider carefully if you really need non-ASCII text. Many users of R will only be able to view correctly text in their native language group (e.g. Western European, Eastern European, Simplified Chinese) and ASCII. Other characters may not be rendered at all, rendered incorrectly, or cause your R code to give an error. For .Rd documentation, marking the encoding and including ASCII transliterations is likely to do a reasonable job. The set of characters which is commonly supported is wider than it used to be around 2000, but non-Latin alphabets (Greek, Russian, Georgian, …) are still often problematic and those with double-width characters (Chinese, Japanese, Korean) often need specialist fonts to render correctly.
On a related note, R does, however, provide support for errors and warnings in different languages: "There are mechanisms to translate the R- and C-level error and warning messages. These are only available if R is compiled with NLS support (which is requested by configure option --enable-nls, the default)."
Besides bilingual documentation, please allow me the following comment: given your two "target" groups, it may be assumed that some of your users will be running a non-English OS (typically, Windows in Portuguese). When importing time series data (or any date entries, for that matter), different "date" formatting (English vs. non-English) can give you different "results" (i.e. misinterpreted date entries) on English vs. non-English machines. I have some experience with these issues (I often work with Czech-language-based OSs) and, other than ad-hoc coding, I don't know of a simple solution.
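To illustrate the kind of ambiguity involved, here is a small Python sketch (the same reasoning applies to R's date parsers): the only robust fix is to state the expected format explicitly rather than relying on the machine's locale.

from datetime import datetime

raw = "05/04/2023"
# The same string means two different dates depending on the assumed convention:
day_first = datetime.strptime(raw, "%d/%m/%Y")    # 2023-04-05 (Brazilian reading)
month_first = datetime.strptime(raw, "%m/%d/%Y")  # 2023-05-04 (US reading)
print(day_first.date(), month_first.date())

# Month names are worse: "%b" is locale-dependent, so "dez" (Portuguese for
# December) fails to parse on an English-locale machine.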
(If you find this off-topic, please feel free to delete)
In German there is a formal ("Sie") and an informal ("Du") form. I would like to translate some software into informal German, notably WooCommerce using Transifex, but new languages can only be added using their locale code.
There only seems to be de_DE as a locale. What is the best way to differentiate between the two forms? Shouldn't there be another locale code just for the informal form, too?
Generally, Gettext uses @ to distinguish language variants, so in this case it could be de_DE@informal. This way the locale will be correctly handled by Gettext, with fallback to de_DE (in contrast to the suggested de_DE-x-informal, which will not fall back this way because Gettext will see DE-x-informal as a country code).
You can find more on locales naming at https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html
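If the rest of the toolchain is GNU-gettext-compatible, the fallback can be verified directly. For example, Python's gettext module performs the same kind of locale expansion; the sketch below assumes compiled catalogues at locale/de_DE@informal/LC_MESSAGES/woocommerce.mo and locale/de_DE/LC_MESSAGES/woocommerce.mo.

import gettext

trans = gettext.translation(
    "woocommerce",                        # looks for woocommerce.mo
    localedir="locale",
    languages=["de_DE@informal", "de_DE"],
    fallback=True,
)
_ = trans.gettext
# Strings missing from the informal catalogue fall back to de_DE,
# and finally to the untranslated msgid.
print(_("Add to cart"))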
Since you asked about WooCommerce, explaining the current state of how WordPress handles it is probably most relevant:
The best approach is to use locale variants as Michal wrote, but WordPress has its own unique twist on it and doesn't use the variant syntax. Instead it adds a third component to the filename, but not -x-informal either: de_DE.mo is the informal (and also fallback, because it lacks any further specification) variant in WordPress, and de_DE_formal.mo contains the formal variant.
Language tags as defined in BCP 47 (currently RFC 5646) do not specify any codes for informal vs. formal variants of German. It would probably be unrealistic to try to have such codes registered, so you would be limited to Private Use subtags, e.g. de-x-formal vs. de-x-informal. Whether your software can handle these in any way is a different issue.
On the practical side, the choice of “Sie” vs. “Du” (or “du”) is hardly a language variant issue. Standard German uses both pronouns to address a person in the singular, depending on the style of presentation and on the relationship with the addressed person. At the extreme, we could say that the choice of “Sie” vs. “Du” in the context of addressing a user generically in instructions or a user interface is a language variant issue. But on the practical side, just make up your mind.
I am trying to develop an application that will use tickets and give the user the ability to validate them. I am wondering why I should choose the Aztec barcode, as many companies have already chosen it instead of QR codes. What are the pros of Aztec barcodes?
A good comparison I have found so far is:
http://www.tec-it.com/en/support/knowbase/barcode-overview/2d-barcodes/Default.aspx
and here: http://en.wikipedia.org/wiki/Aztec_Code
In the Usage section you can see that it is used quite often.
Although Aztec Codes are more compact and tunable, there is poor support for them among open, non-proprietary software. I would still use QR Codes for now, which have very mature software support on a wide variety of platforms.
If space is at a premium for you, and you do not care for users to be able to read or generate your codes with their own software or on a wide variety of devices, then Aztec would be a better choice. Aztec codes do not require a surrounding margin, allow for very finely tunable error correction level, and have a tighter encoding optimized for a wider range of message texts.
For example, the Aztec codec has a mode specialized for encoding lowercase letters, so it could encode most of the text of this answer with only 5 bits per character. The QR codec is only optimized for uppercase URLs, and must store lowercase letters as full 8-bit binary data. A QR code containing this text would have to encode about 160% as much data as an Aztec code -- and then it needs a margin space too.
QR codes require more space than Aztec codes but have freely available software supporting them.
Aztec codes can store more information, but there is poor free support for them. They can be harder to read and generate efficiently, right now.
On an Android phone, Google's "Barcode Scanner" application will scan an Aztec code after a longer delay than a QR code, and the user has to manually enable Aztec code scanning in the application preferences.
Similarly, the free barcode generator package "zint" will produce Aztec codes, but it has a handful of bugs and does not make full use of the codec to keep the symbols as small as possible. Its generation of QR codes, on the other hand, is bulletproof.
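To get a feel for the difference, both symbologies can be generated with open-source Python packages. The sketch below assumes the qrcode and aztec_code_generator packages are installed; exact attribute names may vary between package versions, and the output file name is arbitrary.

import qrcode
from aztec_code_generator import AztecCode

text = "mostly lowercase text like this favours the aztec encoder"

qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_L)
qr.add_data(text)
qr.make(fit=True)                       # choose the smallest version that fits
qr_side = 4 * qr.version + 17           # QR side length in modules
print("QR:   ", qr_side, "x", qr_side, "modules, plus a 4-module quiet zone")

az = AztecCode(text)
print("Aztec:", az.size, "x", az.size, "modules, no quiet zone required")
az.save("ticket_aztec.png", module_size=4)   # rendering needs Pillow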
I share the frustration with the comparative incompleteness of FLOSS support for encoding and decoding Aztec Code, expressed in @fuzzyTew's answer.
There is much more open-source code for encoding and decoding QR, compared to Aztec, and it is more feature-complete and thoroughly tested.
That's a shame, because Aztec Code is in several ways superior to QR. As described in @fuzzyTew's answer:
Aztec doesn't require a “quiet zone” of white space around the symbol, while QR does.
Aztec offers continuously-tunable error correction levels, whereas QR code offers only a few discrete levels.
Aztec offers substantially higher density than QR, for typical applications consisting mostly-to-entirely of standard ASCII text.
QR is particularly inefficient for encoding lowercase Latin letters: 8 bits/char in QR vs. 5 bits/char in Aztec.
The largest QR size, 177×177 blocks, can store 2953 bytes in binary mode with Low error correction (~7%), while the largest Aztec size, 151×151 blocks, can store 2318 bytes in binary mode with equivalent 7% error correction. At that size, Aztec requires ~9.83 blocks/byte, while QR requires ~10.61.
Much of the open-source support for Aztec is based on ZXing. That includes the Android Barcode Scanner app, and many online encoders and decoders.
Until recently, ZXing's Aztec Code implementation did not correctly support the encoding or decoding of non-Latin1 characters. (QR, Aztec, PDF417, and Data Matrix all use ECI to support other character encodings.)
Recently I've started doing something about it:
I added support for correctly decoding non-Latin1 characters from Aztec in ZXing.
I added support for correctly encoding non-Latin1 character sets to Aztec in ZXing.
I also added support for encoding non-Latin1 character sets in the Python aztec_code_generator module.
The point of all this is: There's not actually a ton of work to be done to get open-source Aztec encoders/detectors/decoders of a high quality.
We have developed a medium-sized ASP.NET / SQL Server application that uses resource files to provide English and Spanish user interface variants. Unicode data types are used throughout the databases. Now we are being asked to add Mandarin to the mix. I have no experience with localising into Asian languages, so I really cannot imagine what kind of job this would be.
My questions are:
How complex would this job be, as compared to localising into another Western language such as French or German?
What additional issues, other than (obviously) translating strings in resource files, should I deal with for Mandarin localisation? Anything related to the different alphabet, perhaps?
Reports of previous experiences or pointers to best practices are most welcome. Thanks.
On the technical side of things, I don't believe it's significantly more difficult. Adding support for non-Western languages will expose encoding issues if you are not using Unicode throughout, but it's pretty much the norm to use UTF-8 encoding and Unicode SQL types (nvarchar instead of varchar) anyway.
I would say that the added complexity and uncertainty is more about the non-technical aspects. Most of us English speakers are able to make some sense of European languages when we see 1:1 translations and can notice a lot of problems. But Mandarin is utterly meaningless to most of us, so it's more important to get a native speaker to review or at least spot-check the app prior to release.
One thing to be mindful of is the issue of input methods: Chinese speakers use IMEs (input method editors) to input text, so avoid writing custom client-side input code such as capturing and processing keystrokes.
Another issue is the actual culture identifier to choose. There are several variants for Chinese (zh-Hans which replaces zh-CHS and is used in China, and zh-Hant which replaces zh-CHT used in Taiwan). See the note on this MSDN page for more info on this. But these are neutral culture identifiers (they are not country-specific) and can be used for localization but not for things such as number and date formatting, therefore ideally you should use a specific culture identifier such as zh-CN for China and zh-TW for Taiwan. Choosing a specific culture for a Web application can be tricky, therefore this choice is usually based on your market expectations. More info on the different .NET culture identifiers is at this other post.
Hope this helps!
As far as translating text in the user interface goes, the localization effort for Chinese is probably comparable to that of Western languages. Like English and Spanish, Chinese is read left to right, so you won't need to mirror the page layout as you would if you had to support Arabic or Hebrew. Here are a couple more points to consider:
Font size: Chinese characters are more intricate than Latin characters, so you may need to use a larger font size. English and Spanish are readable at 8pt; for Chinese, you'll want a minimum of 10pt.
Font style: In English, bold and italics are often used for emphasis. In Chinese, emphasis is usually achieved with a different typeface, font size, or color. Use bold with caution, and avoid italics.
However, if you're targeting an Asian market, more significant changes may be required. Here are a few examples:
Personal names: A typical Chinese name is 孫中山: the first character (孫) is the family name, and the second and third characters (中山) constitute the given name. This of course is the opposite of the common Western convention of "given name" + space + "family name". If you're storing and displaying names, you may want to use a single "Name" field instead of separate "First Name" and "Last Name" fields.
Colors: In the U.S., it's common to use green for "good" and red for "bad". However, in China and Taiwan, red is "good". For instance, compare the stock prices on Yahoo! versus Yahoo! Taiwan.
Lack of an alphabet: Chinese characters are not based on an alphabet. Thus, for example, it wouldn't make sense to offer the ability to filter a list by the first letter of each entry, as in a directory of names.
Regarding sort order, different methods are used in Chinese: binary (i.e. Unicode), radical+stroke, Pinyin (=romanization), and Bopomofo (=syllabary).
On SQL Server, you can define the sort order using the COLLATE clause in the column definition, or on statement level. It's also possible to have a default collation on database level.
Run this statement to see all supported Chinese collations:
select * from fn_helpcollations()
where name like 'chinese%'
In .Net, there's also a list of Chinese cultures to be used for sorting, see MSDN.
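Outside SQL Server and .NET, the same Chinese sort orders are exposed by ICU. Here is a sketch using the PyICU package (an assumption on my part, not part of the stack discussed above); CLDR's default Chinese collation is Pinyin, and stroke order can be requested via a collation keyword.

from icu import Collator, Locale

names = ["张伟", "王芳", "李娜", "陈静"]

pinyin = Collator.createInstance(Locale("zh@collation=pinyin"))
stroke = Collator.createInstance(Locale("zh@collation=stroke"))

print(sorted(names, key=pinyin.getSortKey))  # chén, lǐ, wáng, zhāng
print(sorted(names, key=stroke.getSortKey))  # ordered by stroke count instead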