I'm confused by the Data.Map API. I'm looking for a simple way to find a range of keys of the map at O(log n) cost. This is a basic concept known as "binary search", and maybe "bisecting".
I see this strange takeWhileAntitone function, where I need to provide an "antitone" predicate function. This is the first time I've encountered the concept.
After reading Wikipedia on the topic, it seems to simply say that there may be only one place where the function changes from True to False when applied to arguments in key order. This fits the requirement for a binary search.
Since the API is documented in a strange (to me) language, I wanted to ask here:
if my understanding is correct, and
is there a reason these functions aren't called bisect, binarySearch or similar?
if my understanding is correct, and
Yes. takeWhileAntitone (and the other similarly named variants in the library) is the function for doing binary search on keys. It's not named takeWhile because it does not work for arbitrary predicates, so if you're reviewing code, the name serves as a reminder to check that the predicate really is antitone.
is there a reason these functions aren't called bisect, binarySearch or similar?
The name serves to distinguish the variants takeWhileAntitone, dropWhileAntitone, and spanAntitone, which all "do binary search" but produce different final results.
takeWhile is a well-known name from Haskell's standard library (in Data.List).
In FP we like to distinguish the "what" from the "how". "binary search" is an algorithm ("how"). "take while" is also literally a "how", but its meaning is arguably more naturally connected to a "what" (the longest prefix of elements satisfying a predicate). In particular, the interpretation of "take while" as "longest prefix" doesn't rely on any assumption about the predicate.
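As a minimal sketch of the idea (using Data.Map.Strict from the containers package, where takeWhileAntitone and spanAntitone live): the predicate (< 5) is antitone on the ascending key order, flipping from True to False exactly once, so the library can locate the split point by binary search rather than scanning every key:

```haskell
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  let m = Map.fromList [(1 :: Int, "a"), (3, "b"), (5, "c"), (7, "d")]
      -- (< 5) is antitone in key order: True for 1 and 3, False for 5 and 7.
      below = Map.takeWhileAntitone (< 5) m
      -- spanAntitone returns both halves of the split at once.
      (_, above) = Map.spanAntitone (< 5) m
  print (Map.keys below)  -- [1,3]
  print (Map.keys above)  -- [5,7]
```

With a predicate that is not antitone (say, even) the result is unspecified, which is exactly what the name is warning you about.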
What are the syntactical features of current languages that prove problematic for screen readers or braille readers? What symbols and constructs are annoying to hear or feel? To the blind programmers out there, if you were to design a programming language which was easier for other blind people to work with, what would it be like?
Although it is a very interesting question, it is largely a matter of personal preference, so I'll answer as if you had asked me personally.
Note: My working system is Windows, so I'll focus on it. I can write cross-platform apps, of course, but I do that on a Windows machine.
Indentation, White Spaces
All indentation-related things are more or less annoying, especially if the indentation is made with multiple spaces rather than tabs (yes, yes, I know the whole "holy war" around it!). Python did teach me to indent code properly even in languages that don't require it, but it still hurts me to write proper Python because of these spaces. Why so? The answer is simple: my screen reader tells me the number of spaces, and the actual nesting level is four times less. Each such operation (20 spaces, aha, dividing by 4, it is the fifth nesting level) brings some overhead to my mind and makes me spend inner "CPU" resources that I could free up for debugging or other fancy stuff. It is a wee little thing, you'd say, and you'd be right, but this overhead is on each and every line of my (or another person's!) code that I must read or debug! That is quite a lot.
Tabs are much better: 5 tabs, fifth nesting level, nice and well. A Braille display would also be a problem here because, as you probably know, a Braille display (despite the name) is a single line of text, usually 14 to 40 characters long. That is, imagine a tiny monitor with one single line of text that you pan (i.e., scroll), and nothing besides that. If 20 characters are spaces, you are left with only 20 characters for the code. A bit more if you read Grade 2 Braille, but I don't know whether that would be appropriate for coding; I mostly use speech for it, except in some cases.
Yet more painful are some code styling standards where you have to align code within the line. For instance, this was the case with tests in moment.js: the expected values and messages had to match their line positions so that, for example, the opening quote would be in column 55 on every line (this is beautiful to see, I admit). I couldn't get my pull request accepted for more than a week until I finally understood what Iskren (thank him for his patience with me!), one of the main developers, was trying to tell me. As you can guess, this would be completely obvious to a sighted person.
Block Endings
A problem adjacent to the previous one: for me personally it is very nifty to know exactly where a particular code block ends. A right brace (as in C) or the word end (as in Ruby) is good; an indentation level change (as in Python) is not: there is still some overhead in registering that the nesting level has abruptly changed.
IDEs
Sorry to admit it, but there is virtually no comfortable IDE for the blind. The closest to such an IDE is Microsoft Visual Studio, especially the latest version (gods bless Jenny Lay-Flurrie!); Android Studio also seems to be moving towards accessibility starting with version 2. However, these are not as usable, nifty and comfortable as they are for sighted users. So, for instance, I use text editors and command-line tools to write, compile and debug my code, as do many blind people around me.
Ballad of the snake_case, or Another Holy War
Yet another thing to blame Python for: camelCase is far more comfortable to deal with than snake_case or even PascalCase. Usually screen readers separate words written in camelCase as if they were separated with spaces, so I get no pain readingThisPartOfSentence.
When you write code, you have to turn your punctuation on, otherwise you'll miss something really tiny and "unimportant", such as a quote, a semicolon or a parenthesis. Then, if your punctuation is on and you read my_very_cool_descriptive_variable_name, you hear the following: "my underline very underline cool underline... underline underline underline!!!" (bad language and swearing censored). I even tried to replace underlines with sounds (yes, my screen reader offers such an option), but the sounds can't be synchronized well enough because of the high speech rate I use. And it is quite a nightmare when dealing with methods and properties like __proto__ (aha, there are two underscores on each side, not one, not three; well, got it right, I think!), __repr__ and so on, and so forth. Yes, you might say, I could replace the word "underline" with something really short like "un" (this is also possible), but some overhead is still there, as with white spaces and code nesting.
PascalCase is far better, but it demands a bit more concentration, too, since you need to remember to capitalize the first letter (oh, I'm too fastidious now; let it be PascalCase, but not those under... oh well, you got it already). This was the reason I gave up on Rust, btw.
Searching for Functions
As I have already told you, IDEs are not good for us, so a text editor is our best friend. And then you need to search for functions and methods, as well as for classes and code blocks. In certain languages (not Python this time), there is no keyword that starts a function definition (see, for example, C or Java code). Searching for functions in these conditions becomes quite painful if, for example, you know that you have a logical error somewhere in the third or fourth function in the file but don't remember its exact name, or you are skimming through someone else's code... well, you know, there are lots of reasons to do that. In this particular context, Python is good, C is not.
Multiple Repeating and Alike Characters
This is not a problem per se, but a circumstance that complicates debugging, for example, regular expressions or deeply nested operations like ((((a + ((b * c) - d) ** e) / f) + g) - h). This particular example is really synthetic, but you get what I mean: nested ternary operators (which I love, btw!), conditional blocks, and so on. And regular expressions, again.
The Ideal Language
The closest to the ideal blind-friendly language, for me, is the D language. The only drawback it has is the absence of the keyword function except for anonymous functions. PHP and JavaScript are also good, but they have tons of other, blindness-unrelated drawbacks, unfortunately.
Update about Go
In one of his talks Rob Pike, the main developer of the Go language, said that no one likes the code style imposed by the gofmt utility. Probably no one, except me! I like it, I love it so much; every file in Go is so concise and good to read, and I'm really excited about the language because of that. The only slightly annoying thing for a blind coder is when a function has several pairs of parentheses in its definition, for example when it is actually a struct method. The <- channel operator still gives me moments to think about what I'm doing, sending or receiving, but I believe it's a matter of habit.
Update about Visual Studio Code
Believe it or not, when I was writing this answer and the first update to it, I wasn't working as a full-time developer; now I am. So many circumstances went favorably for accessibility and for me in particular! Slack, which virtually every business uses, became accessible, and so did Microsoft Visual Studio Code (again, gods bless Jenny and her team!). Now I use it as my primary code editor. Yes, it's not an IDE per se, but it's sufficient for my needs. And yes, I had to rework my punctuation reading: now I have shorter, often artificial names for many punctuation signs.
So, calling from the end of January 2021, I definitely recommend Visual Studio Code as the code editor for blind people and their colleagues. What is even more amazing is that Live Share, their pair programming service, is also accessible! Yes, it does have some quirks (for now, if you're blind, you can't tell which line of the file your colleague is editing), but it's still an extremely huge step forward.
I am writing a package to facilitate importing Brazilian socio-economic microdata sets (Census, PNAD, etc).
I foresee two distinct groups of users of the package:
Users in Brazil, who may feel more at ease with the documentation in Portuguese. They can probably understand English to some extent, but a foreign language would probably make the package feel less "ergonomic".
The broader international user community, for whom English documentation may be a necessary condition.
Is it possible to write a package in a way that the documentation is "bilingual" (English and Portuguese), and that the language shown to the user will depend on their country/language settings?
Also,
Is that doable within the roxygen2 documentation framework?
I realise there is a tradeoff between making the package more user-friendly by making it bilingual and the increased complexity and difficulty of maintenance. General comments on this tradeoff from previous experience are also welcome.
EDIT: following the comment's suggestion, I cross-posted to the r-package-devel mailing list (HERE, then follow the answers at the bottom). Duncan Murdoch posted an interesting answer covering some of what Brandon's answer (below) covers, but also including two additional suggestions that I think are useful:
have the package in one language, but provide vignettes in different languages. I will follow this advice.
have two versions of the package, say 1.1 and 1.2, one in each language
According to rOpenSci, there is no standard mechanism for translating package documentation into non-English languages. They describe the typical process of internationalization/localization as follows:
To create non-English documentation requires manual creation of supplemental .Rd files or package vignettes.
Packages supplying non-English documentation should include a Language field in the DESCRIPTION file.
And some more info on the Language field:
A ‘Language’ field can be used to indicate if the package documentation is not in English: this should be a comma-separated list of standard (not private use or grandfathered) IETF language tags as currently defined by RFC 5646 (https://www.rfc-editor.org/rfc/rfc5646, see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use language subtags which in essence are 2-letter ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3 (https://en.wikipedia.org/wiki/ISO_639-3) language codes.
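As a concrete illustration (the package name and descriptive values below are made up; the field names are the standard ones from Writing R Extensions), a DESCRIPTION for a package documented in both languages might include:

```
Package: microdadosBR
Title: Import Brazilian Socio-Economic Microdata
Version: 0.1.0
Encoding: UTF-8
Language: en, pt-BR
```

Here pt-BR is the RFC 5646 tag for Brazilian Portuguese; the field only declares the documentation languages, it does not switch them automatically.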
Care is needed if your package contains non-ASCII text, and in particular if it is intended to be used in more than one locale. It is possible to mark the encoding used in the DESCRIPTION file and in .Rd files.
Regarding encoding...
First, consider carefully if you really need non-ASCII text. Many users of R will only be able to view correctly text in their native language group (e.g. Western European, Eastern European, Simplified Chinese) and ASCII. Other characters may not be rendered at all, rendered incorrectly, or cause your R code to give an error. For .Rd documentation, marking the encoding and including ASCII transliterations is likely to do a reasonable job. The set of characters which is commonly supported is wider than it used to be around 2000, but non-Latin alphabets (Greek, Russian, Georgian, …) are still often problematic and those with double-width characters (Chinese, Japanese, Korean) often need specialist fonts to render correctly.
On a related note, R does, however, provide support for errors and warnings in different languages: "There are mechanisms to translate the R- and C-level error and warning messages. These are only available if R is compiled with NLS support (which is requested by configure option --enable-nls, the default)."
Besides bilingual documentation, please allow me the following comment: given your two "target" groups, it may be assumed that some of your users will be running a non-English OS (typically, Windows in Portuguese). When importing time series data (or any date entries, as a matter of fact), due to different date formatting (English vs. non-English), you may get different "results" (i.e. misinterpreted date entries) when importing on English vs. non-English machines. I have some experience with these issues (I often work with Czech-language-based OSs) and, other than ad-hoc coding, I haven't found a simple solution.
(If you find this off-topic, please feel free to delete)
In German there is a formal ("Sie") and an informal ("Du") form of address. I would like to translate some software into informal German, notably WooCommerce using Transifex, but new languages can only be added using their locale code.
There only seems to be de_DE as a locale. What's best to differentiate between the two forms, shouldn't there be another locale code just for the informal form, too?
Generally Gettext uses @ to distinguish language variants. So in this case it could be de_DE@informal. This way the locale will be correctly handled by Gettext, with fallback to de_DE (in contrast with the suggested de_DE-x-informal, which will not fall back this way, as Gettext will see DE-x-informal as a country code).
You can find more on locales naming at https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html
Since you asked about WooCommerce, explaining the current state of how WordPress handles it is probably most relevant:
The best approach is to use locale variants as Michal wrote, but WordPress has its own unique twist on it and doesn't use the variant syntax. Instead, it adds a third component to the file name, though not -x-informal either: de_DE.mo is the informal (and also fallback, because it lacks any further specification) variant in WordPress, and de_DE_formal.mo contains the formal variant.
Language tags as defined in BCP 47 (currently RFC 5646) do not specify any codes for informal vs. formal variants of German. It would probably be unrealistic to try to have such codes registered, so you would be limited to Private Use subtags, e.g. de-x-formal vs. de-x-informal. Whether your software can handle these in any way is a different issue.
On the practical side, the choice of "Sie" vs. "Du" (or "du") is hardly a language variant issue. Standard German uses both pronouns to address a person in the singular, depending on the style of presentation and on the relationship with the addressed person. At the extreme, we could say that the choice of "Sie" vs. "Du" in the context of addressing a user generically in instructions or a user interface is a language variant issue. But on the practical side, just make up your mind.
We have developed a medium-sized ASP.NET / SQL Server application that uses resource files to provide English and Spanish user interface variants. Unicode data types are used throughout the databases. Now we are being asked to add Mandarin to the mix. I have no experience with localising into Asian languages, so I really cannot imagine what kind of job this would be.
My questions are:
How complex would this job be, as compared to localising into another Western language such as French or German?
What additional issues, other than (obviously) translating strings in resource files, should I deal with for Mandarin localisation? Anything related to the different alphabet, perhaps?
Reports of previous experiences or pointers to best practices are most welcome. Thanks.
On the technical side of things, I don't believe it's significantly more difficult. Adding support for non-Western languages will expose encoding issues if you are not using Unicode throughout, but it's pretty much the norm to use UTF-8 encoding and Unicode SQL types (nvarchar instead of varchar) anyway.
I would say that the added complexity and uncertainty is more about the non-technical aspects. Most of us English speakers are able to make some sense of European languages when we see 1:1 translations and can notice a lot of problems. But Mandarin is utterly meaningless to most of us, so it's more important to get a native speaker to review or at least spot-check the app prior to release.
One thing to be mindful of is the issue of input methods: Chinese speakers use IMEs (input method editors) to input text, so avoid writing custom client-side input code such as capturing and processing keystrokes.
Another issue is which culture identifier to choose. There are several variants for Chinese: zh-Hans (which replaces zh-CHS and is used in China) and zh-Hant (which replaces zh-CHT and is used in Taiwan). See the note on this MSDN page for more info. But these are neutral culture identifiers (they are not country-specific): they can be used for localization but not for things such as number and date formatting, so ideally you should use a specific culture identifier such as zh-CN for China and zh-TW for Taiwan. Choosing a specific culture for a Web application can be tricky, so this choice is usually based on your market expectations. More info on the different .NET culture identifiers is in this other post.
Hope this helps!
As far as translating text in the user interface goes, the localization effort for Chinese is probably comparable to that of Western languages. Like English and Spanish, Chinese is read left to right, so you won't need to mirror the page layout as you would if you had to support Arabic or Hebrew. Here are a couple more points to consider:
Font size: Chinese characters are more intricate than Latin characters, so you may need to use a larger font size. English and Spanish are readable at 8pt; for Chinese, you'll want a minimum of 10pt.
Font style: In English, bold and italics are often used for emphasis. In Chinese, emphasis is usually achieved with a different typeface, font size, or color. Use bold with caution, and avoid italics.
However, if you're targeting an Asian market, more significant changes may be required. Here are a few examples:
Personal names: A typical Chinese name is 孫中山: the first character (孫) is the family name, and the second and third characters (中山) constitute the given name. This of course is the opposite of the common Western convention of "given name" + space + "family name". If you're storing and displaying names, you may want to use a single "Name" field instead of separate "First Name" and "Last Name" fields.
Colors: In the U.S., it's common to use green for "good" and red for "bad". However, in China and Taiwan, red is "good". For instance, compare the stock prices on Yahoo! versus Yahoo! Taiwan.
Lack of an alphabet: Chinese characters are not based on an alphabet. Thus, for example, it wouldn't make sense to offer the ability to filter a list by the first letter of each entry, as in a directory of names.
Regarding sort order, different methods are used in Chinese: binary (i.e. Unicode), radical+stroke, Pinyin (=romanization), and Bopomofo (=syllabary).
On SQL Server, you can define the sort order using the COLLATE clause in the column definition, or on statement level. It's also possible to have a default collation on database level.
Run this statement to see all supported Chinese collations:
select * from fn_helpcollations()
where name like 'chinese%'
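To sketch both placements of the COLLATE clause (the table and column names here are made up; Chinese_PRC_CI_AS and Chinese_PRC_Stroke_CI_AS are standard SQL Server collations, ordering by Pinyin and by stroke count respectively):

```
-- Column-level collation: values in this column sort by stroke count by default.
CREATE TABLE Products (
    Name nvarchar(100) COLLATE Chinese_PRC_Stroke_CI_AS
);

-- Statement-level override: sort this one query by Pinyin instead.
SELECT Name
FROM Products
ORDER BY Name COLLATE Chinese_PRC_CI_AS;
```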
In .NET, there is also a list of Chinese cultures that can be used for sorting; see MSDN.