String sort with special characters (ä, ö) in Flex/AS3 - apache-flex

Is there any way to sort strings correctly in languages other than English? In German we have umlauts, and e.g. 'ä' should come right after 'a' in ascending sort order.
I am using ObjectUtil.stringCompare(), but it always puts those special characters at the end. Any ideas how to solve this? I thought the locale (de_DE) would take care of it, but it does not.
Thanks,
Martin

In ECMAScript Third Edition (and hence both ActionScript and current browser JavaScript) there is the string.localeCompare method. This does a comparison that depends on the current client locale. For example if I set my system locale (in Windows terms, “language to match the language version of the non-Unicode programs you want to use”) to “German (Germany)” and put javascript:alert('ä'.localeCompare('b')) I get -1, but with English I get 1.
It's generally questionable to depend on the client-side locale, though. Your application would behave differently depending on the client OS installation, and it is not nearly as easy for the user to change their system locale as it is to choose a different language in the web browser's preferences UI. I'd avoid it if at all possible, and either:
do an ad-hoc string replacement (e.g. ä with ae) before comparison. This may be OK if you are only worried about a few umlauts, but is infeasible for covering the whole of Unicode... even the whole of the Latin diacritical set.
try to do the comparison on the server side, in a scripting language with better character model support than ECMAScript.
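For illustration, here is what the ad-hoc replacement idea could look like on the server side, sketched in Python. The ä→ae folding is the DIN 5007-2 "phone book" convention, which is one of several German sorting conventions, so treat the mapping as an assumption:

```python
# Fold umlauts to their two-letter equivalents before comparing.
FOLD = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
                      'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue'})

def german_key(s):
    # Primary key: folded, case-insensitive; the original string breaks ties.
    return (s.translate(FOLD).lower(), s)

print(sorted(['Bär', 'Bauer', 'Baum'], key=german_key))
# ['Bär', 'Bauer', 'Baum']  ('Bär' folds to 'baer', which sorts first)
```

This covers exactly the handful of characters listed in the table and nothing more, which is the stated limitation of the approach.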

You can write your own compare function and pass it to array.sort(compareFunction). compareFunction(a, b):int should compare two strings and return a negative value (e.g. -1) if a should sort before b, 0 if they are equal, and a positive value (e.g. 1) if a should sort after b.
In that function you'll want to compare your strings symbol by symbol, taking into account the special German characters.
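The negative/zero/positive shape of such a compare function is the same across languages. A Python sketch via functools.cmp_to_key, using a hypothetical weight table that places each umlaut right after its base letter:

```python
from functools import cmp_to_key

# Hypothetical alphabet order: umlauts placed right after their base letters.
WEIGHTS = {ch: i for i, ch in enumerate('aäbcdefghijklmnoöpqrstuüvwxyz')}

def compare(a, b):
    """Return negative, zero, or positive, comparing symbol by symbol."""
    ka = [WEIGHTS.get(c, len(WEIGHTS) + ord(c)) for c in a.lower()]
    kb = [WEIGHTS.get(c, len(WEIGHTS) + ord(c)) for c in b.lower()]
    return (ka > kb) - (ka < kb)

print(sorted(['Bett', 'Ärger', 'Apfel'], key=cmp_to_key(compare)))
# ['Apfel', 'Ärger', 'Bett']
```

The same table-lookup idea ports directly to an AS3 compareFunction.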

I don't know AS3, but almost every language with locale support provides a locale-aware comparison or sorting function for strings.
For example, localeCompare() is one such locale-aware string comparison function.

Related

Encoder.HtmlEncode encodes Farsi characters

I want to use the Microsoft AntiXss library for my project. When I use the Microsoft.Security.Application.Encoder.HtmlEncode(str) function to safely show some value in my web page, it encodes Farsi characters, which I consider to be safe. For instance, it converts لیست into numeric HTML character references. Am I using the wrong function? How can I print the user input in my page safely?
I'm currently using it like this:
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>
I think I messed up! The Razor view encodes values unless you use @Html.Raw, right? Well, I encoded the string and Razor encoded it again. So in the end it just got encoded twice - hence the weird-looking characters (numeric entity values)!
If your encoding (let's assume it's Unicode by default) supports Farsi, it's almost always safe to use Farsi in ASP.NET MVC without any additional effort.
First of all, escape-on-input is just wrong: you've taken some input and applied a transformation that is irrelevant to that data. It's generally wrong to encode data immediately after you receive it from the user. Store the raw data in your database and encode it only when you display it, according to the possible vulnerabilities of the output context. For example, the 'dangerous' HTML characters are not dangerous for SQL or Android etc., and that's one of the main reasons you shouldn't encode data when you store it on the server. Another reason: HTML-encoding a string can make it 6-7 times longer, which can be a problem if the server constrains string lengths. When you store data in SQL Server, you should escape, validate, and sanitize it only for that context, guarding against its specific vulnerabilities (like SQL injection).
In ASP.NET MVC with Razor you don't need to HTML-encode your strings because it's done by default, unless you use Html.Raw() - which you should generally avoid (or HTML-encode the value yourself when you do use it). Also, if you double-encode your data, you'll end up with corrupted output :)
I hope this helps clear things up.
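The double-encoding effect is easy to reproduce in any language; a quick sketch using Python's standard html module:

```python
import html

once = html.escape('<b>x & y</b>')   # '&lt;b&gt;x &amp; y&lt;/b&gt;'
twice = html.escape(once)            # ampersands of the entities get re-encoded

print(once)   # &lt;b&gt;x &amp; y&lt;/b&gt;
print(twice)  # &amp;lt;b&amp;gt;x &amp;amp; y&amp;lt;/b&amp;gt;
```

The second pass mangles the entities themselves, which is exactly the "weird looking chars" symptom described above.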

UCS2 or UTF16 should I convert to UTF?

The website I'm currently working on collects data from various sources (human entered). The data is being stored in Nvarchar fields in the database. Currently the site specifies that the charset is UCS-2 through a meta tag. Until now the site has required answers in English. Soon though we will be allowing/requiring at least some of the fields to be entered in their native language (i.e. Chinese in this case). Based on some research and other posts on the site it seems that UCS-2 and UTF-16 are pretty much the same thing with some minor technical differences. If it matters this is an asp.net website running on a SQL Server database. So my questions are:
Is there a reason for me to change the meta tag to specify UTF-16?
Will I have any issues with the way characters are displayed if I change the encoding? (I think the current data should display the same since it's most/all English but I'd like to confirm that)
UCS-2 is a strict subset of UTF-16 - it can encode only characters in the Basic Multilingual Plane (i.e., from U+0000 to U+FFFF). Characters in the supplementary planes (which include some relatively rare Chinese characters) must be encoded as pairs of 16-bit code units ("surrogates"); if your data contains any, it is not valid UCS-2 and must be declared as UTF-16.
If you can easily switch the encoding specification to UTF-16, there is little reason not to do so immediately, unless your data is being consumed by ancient software that doesn't know what "UTF-16" means.
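The surrogate-pair distinction is easy to see from Python (the example characters are arbitrary): a BMP character occupies one 16-bit code unit in UTF-16, a supplementary-plane character occupies two.

```python
bmp = '\u4e2d'         # 中, U+4E2D: inside the Basic Multilingual Plane
astral = '\U00020000'  # 𠀀, U+20000: supplementary plane, needs a surrogate pair

print(len(bmp.encode('utf-16-le')))     # 2 bytes: one 16-bit code unit
print(len(astral.encode('utf-16-le')))  # 4 bytes: two 16-bit code units
```

Only the first string would also be valid UCS-2.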

SQLite 3: Character Issue While Ordering By Records

In my SQLite 3 Database, I have some records with Turkish characters such as "Ö", "Ü", "İ" etc. When I select my values with SELECT * FROM TABLE ORDER BY COLUMN_NAME query, the records that begin with these characters are coming at the end.
Normally, they should come right after the dotless version of each letter: "Ö" after "O", "Ü" after "U".
Is it something about regional settings? Is there a way to control these settings?
I use SQLite Manager in Firefox to manage my DB.
Thanks in advance.
P.S. I know it's not a solution for SQLite but for those who need to use SQLite DB in Objective-C, they can sort the data array after getting from SQLite DB. Here's a good solution: How to sort an NSMutableArray with custom objects in it?
Unfortunately, it seems there's no direct solution for this. For iOS at least. But there are ways to follow.
After I subscribed to the SQLite mailing list, a user named Jean-Christophe Deschamps posted this reply:
"In my SQLite 3 Database, I have some records with Turkish characters
such as "Ö", "Ü", "İ" etc. When I select my values with 'SELECT * FROM
TABLE ORDER BY COLUMN_NAME' query, the records that begin with these
characters are coming at the end."
Bare-bones SQLite only collates correctly on the lower ASCII charset.
While that's fine for plain English, it doesn't work for most of us.
"Normally, they should've come after the letter that is dot-less
version of each. Like "Ö" is after "O", "Ü" is after "U". Is it
something about regional settings? Is there a way to control these
settings?"
You have the choice among several ways to get it right, or close to right, for your language(s):
o) Use ICU, either as an extension (for third-party managers) or linked to your application.
Advantages: it works 100% correctly for a given language at a time in each operation.
Drawbacks: it's huge and slow, and it requires you to register a collation for every language you deal with. Also it won't work well for columns containing several non-English languages.
o) Write your own collation(s) invoking your OS's ICU routines to collate strings.
Advantages: doesn't bloat your code with huge libraries.
Drawbacks: requires you to write this extension (in C or something); same other drawbacks as ICU.
o) If you use Windows, download and use the functions in the extension I wrote, for a close-to-correct result.
Advantages: it's small, fairly fast, and ready to use; it is language-independent yet works decently well for many languages at the same time; it also offers a number of Unicode-aware string manipulation functions (unaccenting or not), a fuzzy search function and much more. Comes as C source and an x86 DLL, free for any purpose.
Drawback: it probably doesn't work 100% correctly for any language using more than "vanilla English letters": your dotless ı will collate along dotted i, for instance. It's a good compromise between absolute correctness for ONE language and "fair" correctness for most languages (including some Asian languages using diacritics).
Download: http://dl.dropbox.com/u/26433628/unifuzz.zip
"I use SQLite Manager in Firefox to manage my DB."
My little extension will work with this one. You might also want to
try SQLite Expert which has ICU built-in (at least in its Pro version)
and much more.
It could be the regional settings but first I would verify UTF-8 encoding is being used.
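If the database is accessed from code rather than only through SQLite Manager, there is a further option along the same lines as the ICU suggestions above: registering a custom collation with the driver. A sketch using Python's sqlite3 module (the alphabet table and the fallback rule for unknown characters are my own assumptions, not a complete Turkish collation):

```python
import sqlite3

# Turkish alphabet in its dictionary order (an assumption for this sketch).
ALPHABET = 'abcçdefgğhıijklmnoöprsştuüvyz'
ORDER = {ch: i for i, ch in enumerate(ALPHABET)}

def tr_lower(s):
    # Turkish pairs dotted İ/i and dotless I/ı, which plain str.lower() gets wrong.
    return s.replace('İ', 'i').replace('I', 'ı').lower()

def turkish_collate(a, b):
    # Unknown characters fall back to sorting after the alphabet, by code point.
    ka = [ORDER.get(c, len(ALPHABET) + ord(c)) for c in tr_lower(a)]
    kb = [ORDER.get(c, len(ALPHABET) + ord(c)) for c in tr_lower(b)]
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(':memory:')
conn.create_collation('TURKISH', turkish_collate)
conn.execute('CREATE TABLE t (name TEXT)')
conn.executemany('INSERT INTO t VALUES (?)',
                 [('Ümit',), ('Ali',), ('Ömer',), ('Osman',)])
rows = [r[0] for r in
        conn.execute('SELECT name FROM t ORDER BY name COLLATE TURKISH')]
print(rows)  # ['Ali', 'Osman', 'Ömer', 'Ümit']
```

With the collation registered, "Ö" sorts after "O" and "Ü" after "U", as the question asks.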

Semicolon as URL query separator

Although it is strongly recommended (W3C source, via Wikipedia) for web servers to support semicolon as a separator of URL query items (in addition to ampersand), it does not seem to be generally followed.
For example, compare
        http://www.google.com/search?q=nemo&oe=utf-8
        http://www.google.com/search?q=nemo;oe=utf-8
results. (In the latter case, the semicolon is, or was at the time of writing this text, treated as an ordinary string character, as if the URL were: http://www.google.com/search?q=nemo%3Boe=utf-8)
The first URL parsing library I tried behaves well, though:
>>> from urlparse import urlparse, parse_qs
>>> url = 'http://www.google.com/search?q=nemo;oe=utf-8'
>>> parse_qs(urlparse(url).query)
{'q': ['nemo'], 'oe': ['utf-8']}
What is the current status of accepting semicolon as a separator, and what are potential issues or some interesting notes? (from both server and client point of view)
The W3C Recommendation from 1999 is obsolete. The current status, according to the 2014 W3C Recommendation, is that semicolon is now illegal as a parameter separator:
To decode application/x-www-form-urlencoded payloads, the following algorithm should be used. [...] The output of this algorithm is a sorted list of name-value pairs. [...]
Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).
In other words, ?foo=bar;baz means the parameter foo will have the value bar;baz; whereas ?foo=bar;baz=sna should result in foo being bar;baz=sna (although technically illegal since the second = should be escaped to %3D).
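Python's standard library later moved in the same direction: modern urllib.parse.parse_qs splits only on '&' by default (a security-motivated change around Python 3.9.2/3.10) and takes an explicit separator argument if you really want semicolons:

```python
from urllib.parse import urlparse, parse_qs

query = urlparse('http://www.google.com/search?q=nemo;oe=utf-8').query

# Default behavior on modern Python: the semicolon is ordinary data.
print(parse_qs(query))
# Opting in to semicolon separators explicitly:
print(parse_qs(query, separator=';'))  # {'q': ['nemo'], 'oe': ['utf-8']}
```

So the older Python 2 urlparse output shown in the question is no longer what you get by default.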
As long as your HTTP server, and your server-side application, accept semicolons as separators, you should be good to go. I cannot see any drawbacks. As you said, the W3C spec is on your side:
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
I agree with Bob Aman. The W3C spec is designed to make it easier to use anchor hyperlinks with URLs that look like form GET requests (e.g., http://www.host.com/?x=1&y=2). In this context, the ampersand conflicts with the system for character entity references, which all start with an ampersand (e.g., &quot;). So W3C recommends that web servers allow a semicolon to be used as a field separator instead of an ampersand, to make it easier to write these URLs. But this solution requires writers to remember that the ampersand must be replaced by something, and that a ; is an equally valid field delimiter, even though web browsers universally use ampersands in the URL when submitting forms. That is arguably more difficult than remembering to replace the ampersand with an &amp; in these links, just as would be done elsewhere in the document.
To make matters worse, until all web servers allow semicolons as field delimiters, URL writers can only use this shortcut for some hosts, and must use &amp; for others. They will also have to change their code later if a given host stops allowing semicolon delimiters. This is certainly harder than simply using &amp;, which will work for every server forever. This in turn removes any incentive for web servers to allow semicolons as field separators. Why bother, when everyone is already changing the ampersand to &amp; instead of ;?
In short, HTML is a big mess (due to its leniency), and using semicolons helps simplify this a LOT. I estimate that when I factor in the complications I've found, using ampersands as separators makes the whole process about three times as complicated as using semicolons instead!
I'm a .NET programmer, and to my knowledge .NET does not inherently allow ';' separators, so I wrote my own parsing and handling methods because I saw tremendous value in using semicolons rather than the already problematic system of ampersand separators. Unfortunately, very respectable people (like @Bob Aman in another answer) do not see why semicolon usage is far superior and so much simpler than using ampersands. So I now share a few points to perhaps persuade other respectable developers who don't yet recognize the value of using semicolons instead:
Using a querystring like '?a=1&b=2' in an HTML page is improper (without HTML encoding it first), but most of the time it works. That, however, is only because most browsers are tolerant, and that tolerance can lead to hard-to-find bugs when, for instance, the value of a key-value pair gets posted in an HTML page URL without proper encoding (directly as '?a=1&b=2' in the HTML source). A querystring like '?who=me+&+you' is problematic too.
We people can have biases and can disagree about our biases all day long, so recognizing our biases is very important. For instance, I agree that I just think separating with ';' looks 'cleaner'. I agree that my 'cleaner' opinion is purely a bias. Another developer can have an equally opposite and equally valid bias. So my bias on this one point is not any more correct than the opposite bias.
But the unbiased case for the semicolon - that it makes everyone's life easier in the long run - cannot be correctly disputed when the whole picture is taken into account. In short, using semicolons makes life simpler for everyone, with one exception: the small hurdle of getting used to something new. That's all. It's always more difficult to make anything change. But the difficulty of making the change pales in comparison to the continued difficulty of living with &.
Using ; as the querystring separator makes things MUCH simpler. Ampersand separators are more than twice as difficult to code properly as semicolons would be. (I think) most implementations are not coded properly, so most implementations aren't twice as complicated - but then tracking down and fixing the resulting bugs costs productivity. Here are the two distinct encodings (three steps) needed to properly encode a querystring when & is the separator:
Step 1: URL encode both the keys and values of the querystring.
Step 2: Concatenate the keys and values like 'a=1&b=2' after they are URL encoded from step 1.
Step 3: Then HTML encode the whole QueryString in the HTML source of the page.
So special encoding must be done twice for proper (bug-free) handling, and the two encodings are distinct, different types: the first is URL encoding and the second is HTML encoding (for HTML source code). If either is incorrect, I can find you a bug. Step 3 is different for XML, though: for XML, XML character entity encoding is needed instead (which is almost identical). My point is that the last encoding depends on the context of the URL, whether that's an HTML web page or XML documentation.
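The ampersand pipeline described above can be sketched in Python, where steps 1-2 are urllib's urlencode and step 3 is a separate HTML encoding:

```python
from urllib.parse import urlencode
import html

params = {'who': 'me & you', 'q': 'nemo'}

qs = urlencode(params)   # steps 1-2: URL encode keys/values and join with '&'
href = html.escape(qs)   # step 3: HTML encode for use in HTML page source

print(qs)    # who=me+%26+you&q=nemo
print(href)  # who=me+%26+you&amp;q=nemo
```

Note the two different escapes of the same ampersand: %26 inside a value (URL encoding) versus &amp; for the separator (HTML encoding).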
Now with the much simpler semicolon separators, the process is as one would expect:
Step 1: URL encode the keys and values.
Step 2: Concatenate them, like 'a=1;b=2'. (There is no step 3.)
I think most web developers skip step 3 because browsers are so lenient. But this leads to bugs and more complications: hunting those bugs down, users being unable to do things because of them, writing bug reports, etc.
Another complication in real use is when writing XML documentation markup in my source code in both C# and VB.NET. Since & must be encoded, it's a real drag, literally, on my productivity. That extra step 3 makes it harder to read the source code too. So this harder-to-read deficit applies not only to HTML and XML, but also to other applications like C# and VB.NET code because their documentation uses XML documentation. So the step #3 encoding complication proliferates to other applications too.
So in summary, using ; as the separator is simple because the (correct) process with semicolons is how one would normally expect the process to be: only one step of encoding needs to take place.
Perhaps this wasn't too confusing. But all the confusion and difficulty stems from using a separation character that itself must be HTML encoded. Thus '&' is the culprit, and the semicolon relieves all that complication.
(I will point out that my 3-step vs. 2-step comparison above reflects how many steps most applications actually take. For completely robust code, all 3 steps are needed no matter which separator is used. But in my experience most implementations are sloppy and not robust, so adopting the semicolon as the default querystring separator instead of the ampersand would make life easier for more people, with fewer website and interop bugs.)

Translating external api results in Drupal

We're building a multi-language Drupal stack and one of the concerns we have is that our payment processor is going to have to send back some information to us. We've been able to narrow this down so that the strings they're sending back look like
<country code>-<number of months>
so we can easily translate that into any number of languages - except English.
t('FR-12') is all well and good if we want to translate that into a French description, but because English is the site's source language, a similar string like t('EN-12') is not translatable.
Similarly for the generic string: #API_Connection_Error
This sort of generic string approach seemed really compelling to me at first but it seems to not work in Drupal. Do you have any suggestions about how to translate generic strings like this into both English and other languages?
Thank you, I've been looking through Google all morning.
I see two ways to achieve this at the moment:
You could just replace the default English language definition with a custom version. That way, you can 'translate' selected English strings just as with any other language. If you have configured the locale module to fall back to the original string when a translation is absent, you can add your special cases as translations to your custom English version, and everything else will use the original English version.
Take a look at the String Overrides module - it allows you to define custom overrides for any string that gets passed through t(), with separate overrides per language, including the original English.
I'd use the second option in your case, unless the number of 'external' strings is very high. See the first if clause of the t() function for the mechanism used for the overrides (a lookup in language-specific Drupal variable arrays).
Note that the String Overrides module just adds admin UI pages to configure those Drupal variables in the Backend - you could add/adjust them yourself as well (e.g. from a custom module).
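The override mechanism boils down to a per-language lookup consulted before falling back to the literal string. A language-neutral Python sketch (not Drupal's actual API; the table contents and names are hypothetical):

```python
# Per-language override tables, analogous in spirit to the Drupal variables
# that String Overrides manages.
OVERRIDES = {
    'en': {'EN-12': '12-month subscription (English)'},
    'fr': {'FR-12': 'Abonnement de 12 mois'},
}

def t(s, lang='en'):
    # Use an override when one exists; otherwise fall back to the original string.
    return OVERRIDES.get(lang, {}).get(s, s)

print(t('EN-12'))        # '12-month subscription (English)'
print(t('FR-12', 'fr'))  # 'Abonnement de 12 mois'
print(t('XX-12'))        # 'XX-12' (no override: falls back)
```

Because English gets its own table, 'EN-12' becomes "translatable" into English, which is exactly the gap the question describes.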