Alfresco localization encoding

Trying to create custom types, aspects and properties for Alfresco, I followed the Alfresco Developer Series guide. When I reached the localization section I found out that Alfresco does not handle UTF-8 encoding in the .properties files that you create. Greek characters are not displayed correctly in Share.
Checking out other built-in .properties files (/opt/alfresco-4.0.e/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/messages) I noticed that in Japanese, for example, the characters are in this notation: \u3059\u3079\u3066\u306e...
So, the question is: do I have to convert the Greek words into the notation mentioned above for Share to display them correctly, or is there another, more elegant, way to do it?

The \u#### form is the Java form of the Unicode escape sequence, and is used to reference Unicode characters without having to worry about the encoding of the file storing them.
This question has some information on how to create and decode them.
Another way, which is what Alfresco developers tend to use, is the Native2ASCII tool which ships with Java itself. With that, you can initially write your strings in a UTF-8 (for example) file, then use the tool to turn them into their escaped form.
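For example, assuming your Greek strings live in a UTF-8 file named custom_el.properties (a hypothetical name), an invocation along these lines produces the escaped version:

    native2ascii -encoding UTF-8 custom_el.properties custom_el.escaped.properties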

Related

Extract localizable strings from source code, aspx, xaml to resource files

As part of internationalizing our application, which is based on ASP.NET, C#, Silverlight, and XBAP, I'm evaluating approaches to start with. I have to choose between GNU gettext (PO files) and Microsoft's resource-based (resx) approach. So at this juncture, I'm trying to understand the best way to extract localizable strings from .cs, aspx, ascx, and xaml (Silverlight) files into resource files (resx) automatically, if I go the MS way.
I have the following options in mind:
The Resource Refactoring Tool, but it extracts all strings (whether they need translation or not), such as page headers, and we cannot mark or exclude particular strings; otherwise we have to select each string manually and then extract it (right-click and choose Extract).
ReSharper's localization assistance, but here I do not see automatic extraction; I would have to extract strings one by one.
I know there has to be a bit of manual intervention, but any advice would help in choosing the right direction between the gettext approach (GNU gettext for C# or FairlyLocal) and the MS localization approach.
Both approaches have pros and cons; let's discuss.
FairlyLocal (GNU gettext)
First, some initial tweaking is required:
download the library & tools and put them somewhere relative to your project
modify the base page object of your site (manual intervention)
add a post-build step to your web project that runs xgettext and updates your .po files
Second, string extraction is taken care of by FairlyLocal itself.
Third, translation of the strings can be done in-house or outsourced, since PO files are widely known by linguists.
Fourth, rendering of a few UTF-8 characters (if any) depends on webfonts: eot (Trident), svg (WebKit, Gecko, Presto).
Fifth, the locale needs to be maintained (like pa-IN, languageCode-countryCode).
Sixth, several converters are available for PO files.
Seventh, the default logic falls back on the default-locale (en-US) resources for the value.
One issue: the .po files that the build script generates won't be UTF-8 by default. You'll need to open them in Poedit (or similar) and explicitly change the encoding the first time you edit them if you want your translated text to show special characters correctly.
MS localization
First, extraction of strings is pretty easy using the Resource Refactoring Tool.
Second, the resgen.exe command-line tool can be used to make .resx files linguist-friendly:
resgen examplestrings.xx.resx examplestrings.xx.txt
Third, localization within .NET (not specific to ASP.NET proper or ASP.NET MVC) implements a standard fallback mechanism.
Fourth, there is no dependency on the GNU gettext utilities.
Fifth, you can localize everything from strings to dates and currency using CurrentUICulture and CurrentCulture.
Sixth, webfonts are recommended here too.
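To illustrate the fallback mechanism and the CurrentUICulture/CurrentCulture split, here is a minimal C# sketch; the resource base name MyApp.Strings and the key Greeting are hypothetical (Strings.resx would be the neutral fallback set, Strings.el.resx the Greek one):

    using System;
    using System.Globalization;
    using System.Resources;
    using System.Threading;

    class Demo
    {
        static void Main()
        {
            // Hypothetical resources: Strings.resx (neutral) and Strings.el.resx (Greek).
            var rm = new ResourceManager("MyApp.Strings", typeof(Demo).Assembly);

            // Ask for a Greek string; if Strings.el.resx lacks the key,
            // .NET falls back to the neutral Strings.resx automatically.
            Thread.CurrentThread.CurrentUICulture = new CultureInfo("el-GR");
            Console.WriteLine(rm.GetString("Greeting"));

            // CurrentCulture governs the formatting of dates, numbers and currency.
            Thread.CurrentThread.CurrentCulture = new CultureInfo("el-GR");
            Console.WriteLine(DateTime.Now.ToString("D")); // Greek long date
            Console.WriteLine(1234.56m.ToString("C"));     // Greek currency format
        }
    }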
Thanks.

ASP Classic - determining if a file is binary or text (ascii) using FileSystemObject

If I categorize files as binary (e.g. .exe, .mp3, .docx, .pdf) and text (e.g. .rtf, .txt, .html, .xml), then how can we use classic ASP's FileSystemObject (FSO) to determine which kind a particular file is?
I looked it up on the Internet, and everyone is of the same opinion: there is no direct way to tell the difference.
Apparently you have to do it manually. This link gives you a set of rules for telling a text file from a binary file.
According to Eric Lippert, the FSO isn't meant for binary files. But using .Read(n) to get the first few characters and comparing them to known signatures should work.
P.S.
If you do a full scan to classify the data, as reporter proposed, make sure you use more modern rules (e.g. a UTF-16 text file could contain 50% zero bytes); see the sketch below.
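Since there is no direct FSO call for this, the classification has to be a heuristic. The signature-plus-scan idea looks roughly like the following sketch, written in C# for illustration rather than classic ASP; the signature list and the NUL-byte threshold are assumptions to adapt, not a standard:

    using System;
    using System.IO;
    using System.Linq;

    class BinaryOrText
    {
        // Magic numbers of a few common binary formats.
        static readonly byte[][] Signatures =
        {
            new byte[] { 0x4D, 0x5A },             // "MZ": .exe
            new byte[] { 0x25, 0x50, 0x44, 0x46 }, // "%PDF"
            new byte[] { 0x50, 0x4B, 0x03, 0x04 }, // "PK..": zip-based formats such as .docx
        };

        static bool LooksBinary(string path)
        {
            var head = new byte[512];
            int read;
            using (var fs = File.OpenRead(path))
                read = fs.Read(head, 0, head.Length);

            // 1. Signature check on the first few bytes.
            if (Signatures.Any(sig => read >= sig.Length &&
                                      sig.SequenceEqual(head.Take(sig.Length))))
                return true;

            // 2. Fallback scan: lots of NUL bytes suggest binary data, but a
            //    UTF-16 text file is ~50% NULs, so treat a UTF-16 BOM as text first.
            if (read >= 2 && ((head[0] == 0xFF && head[1] == 0xFE) ||
                              (head[0] == 0xFE && head[1] == 0xFF)))
                return false;

            int nuls = head.Take(read).Count(b => b == 0);
            return read > 0 && nuls > read / 4; // threshold is an assumption
        }

        static void Main(string[] args)
        {
            Console.WriteLine(LooksBinary(args[0]) ? "binary" : "text");
        }
    }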

Supporting long unicode filepaths with System.Data.SQLite

I'm developing an application that needs to be able to create and manipulate SQLite databases in user-defined paths. I'm running into a problem I don't really understand. I'm testing my code against some really gross sample data with huge, unwieldy Unicode paths; for most of them there isn't a problem, but for one there is.
An example of a working connection string is:
Data Source="c:\test6\意外な高価で売れるかも? 出品は手順を覚えれば後はかんたん!\11オークションストアの出品は対象外とさせていただきます。\test.db";Version=3;
While one that fails is:
Data Source="c:\test6\意外な高価で売れるかも? 出品は手順を覚えれば後はかんたん!\22今やPCライフに欠かせないのがセキュリティソフト。そのため、現在何種類も発売されているが、それぞれ似\test.db";Version=3;
I'm using System.Data.SQLite v1.0.66.0 due to reasons outside of my control, but I quickly tested with the latest, v1.0.77.0 and had the same problems.
Both when attempting to create the test.db file anew and when I manually put one there and attempt to open it, SQLiteConnection.Open throws an exception saying only "Unable to open the database file", with the stack trace showing that it is actually System.Data.SQLite.SQLite3.Open that throws.
Is there any way I can get System.Data.SQLite to play nicely with these paths? A workaround could be to create and manipulate my databases in a temporary location and then just move them to the actual locations for storage, since I can create and manipulate files normally otherwise. That's kind of a last resort though.
Thank you.
I am guessing you are on a Japanese-locale machine where the default system encoding (ANSI code page) is cp932 Japanese (≈Shift-JIS).
The second path contains:
ソ
which encodes to the byte sequence:
0x83 0x5C
Shift-JIS is a multibyte encoding that has the unfortunate property of sometimes re-using ASCII code units in the trail byte. In this case it has used byte 0x5C which corresponds to the backslash \. (Though this typically displays as a yen sign in Japanese fonts, for historical reasons.)
So if this pathname is passed into a byte-based API, it will get encoded in the ANSI code page, and you won't be able to tell the difference between a backslash meant as a directory separator and one that is a side-effect of multi-byte encoding. Consequently any path with one of the following characters in will fail when accessed with a byte-based IO method:
―ソЫⅨ噂浬欺圭構蚕十申曾箪貼能表暴予禄兔喀媾彌拿杤歃畚秉綵臀藹觸軆鐔饅鷭偆砡纊犾
(Also any pathname that contains a Unicode character not present in cp932 will naturally fail.)
It would appear that behind the scenes SQLite is using a byte-based IO method to open the filename it is given. This is unfortunate, but extremely common in cross-platform code, because the POSIX C standard library is defined to use byte-based filenames for file operations like open().
Consequently, using the C stdlib functions it is impossible to reliably access files with non-ASCII names. This sad situation is inherited by all sorts of cross-platform libraries and languages written using the stdlib; only tools written with specific support for Win32 Unicode filenames (e.g. Python) can reliably access all files under Windows.
Your options, then, are:
avoid using non-ASCII characters in the path name for your db, as per the move/rename suggestion;
continue to rely on the system locale being Japanese (ANSI code page=932), and just rename files to avoid any of the characters listed above;
get the short (8.3) filename of the file in question and use that instead of the real one—something like c:\test6\85D0~1\22PC~1\test.db. You can use dir /x to see the short-filenames. They are always pure ASCII, avoiding the encoding problem;
add some code to get the short filename from the real one, using GetShortPathName. This is a Win32 API, so you need a little help to call it from .NET (see the P/Invoke sketch after this list). Note also that short filenames will still fail if run on a machine with the short-filename generation feature disabled;
persuade SQLite to add support for Windows Unicode filenames;
persuade Microsoft to fix this problem once and for all by making the default encoding for byte interfaces UTF-8, like it is on all other modern operating systems.
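For the GetShortPathName option, a minimal P/Invoke sketch (the ShortPath helper is a name I made up, and the file must already exist for the call to succeed):

    using System;
    using System.ComponentModel;
    using System.Runtime.InteropServices;
    using System.Text;

    static class ShortPath
    {
        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern uint GetShortPathName(string longPath, StringBuilder shortPath, uint bufferSize);

        // Returns the 8.3 form of an existing path, e.g. c:\test6\85D0~1\...\test.db.
        public static string Get(string longPath)
        {
            var buffer = new StringBuilder(260);
            uint length = GetShortPathName(longPath, buffer, (uint)buffer.Capacity);
            if (length == 0)
                throw new Win32Exception(); // surfaces the last Win32 error
            return buffer.ToString();
        }
    }

The short form is pure ASCII, so it can then be dropped into the Data Source= part of the connection string unchanged.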

Warning when validating my website with http://validator.w3.org?

I created a simple test page on my website www.xaisoft.com and it had no errors, but the validator came back with the following warning, and I am not sure what it means.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting):
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
Depending on your editor, you might find an option in the preferences to indicate it should save Unicode documents without a BOM... or change editor ^^
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it is Unicode and which byte order it is in. You don't expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly re-save it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-R, or that sort of thing, don't; use the HTML entities instead.
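If the page is generated from code rather than saved from an editor, the same fix applies at write time. A minimal C# sketch (the file name and markup are placeholders) that writes UTF-8 without the BOM:

    using System.IO;
    using System.Text;

    class NoBom
    {
        static void Main()
        {
            // Passing false tells UTF8Encoding not to emit the EF BB BF
            // signature (the BOM) that the validator is warning about.
            var utf8NoBom = new UTF8Encoding(false);
            File.WriteAllText("test.html", "<!DOCTYPE html><html>...</html>", utf8NoBom);
        }
    }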

File names containing non-ascii international language characters

Has anyone had experience generating files that have filenames containing non-ascii international language characters?
Is doing this an easy thing to achieve, or is it fraught with danger?
Is this functionality expected from Japanese/Chinese speaking web users?
Should file extensions also be international language characters?
Info: We currently support multilanguage on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET Framework. This would be used in a scenario where international users could choose a common format and name for their files.
Is this functionality expected from Japanese/Chinese speaking web users?
Yes.
Is doing this an easy thing to achieve, or is it fraught with danger?
There are issues. If you are serving files directly, or otherwise have the filename in the URL (e.g. http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.
But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:
Content-Disposition: attachment;filename="こんにちは.txt"
How do we encode those characters into the filename parameter? Well it would be nice if we could just dump it in in UTF-8. And that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8 and you can't really guess what it's going to be in advance of sending the header.
Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.
So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
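As an illustration of that trick, here is a hedged ASP.NET sketch (the handler and paths are hypothetical): the download link carries the percent-encoded name as its last URL segment, and the handler serves the bytes without a filename parameter:

    using System;
    using System.Web;

    // Hypothetical handler mapped to /files/download/*; the browser takes the
    // save-as name from the last URL segment, e.g.
    // /files/download/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt
    public class DownloadHandler : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";
            // Deliberately no filename= parameter here:
            context.Response.AppendHeader("Content-Disposition", "attachment");
            context.Response.WriteFile(context.Server.MapPath("~/App_Data/stored-file.txt"));
        }

        public bool IsReusable { get { return true; } }
    }

When building the link, percent-encode the name with Uri.EscapeDataString("こんにちは.txt").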
Should file extensions also be international language characters?
In principle yes, file extensions are just part of the filename and can contain any character.
In practice on Windows I know of no application that has ever used a non-ASCII file extension.
One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to allow Asians to type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.
That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘４２’ integer or ‘.ｔｘｔ’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to ‘Unicode Normal Form NFKC’ before doing anything with them.
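That normalisation step is one line in C#; a minimal sketch with full-width input:

    using System;
    using System.Text;

    class Nfkc
    {
        static void Main()
        {
            string fullWidth = "４２.ｔｘｔ"; // full-width digits and Latin letters
            string plain = fullWidth.Normalize(NormalizationForm.FormKC);
            Console.WriteLine(plain);                            // "42.txt", plain ASCII
            Console.WriteLine(int.Parse(plain.Substring(0, 2))); // parses as 42
        }
    }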
Refer to this overview of file name limitations on Wikipedia.
You will have to consider where your files will travel, and stay within the most restrictive set of rules.
From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.
The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.
I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:
It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors/tools so that your tools understand those characters.
Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitors, that they need a browser which uses the relevant encoding.
Your file extensions need not be i18n-ed.
My two cents:
The key thing with international file names is to make URLs like bobince suggested:
www.example.com/files/%E3%81%93%E3%82%93%E3.txt
I had to make a special routine for IE7, since it crops the filename if it is longer than 30 characters; so instead of "Your very long file name.txt" the file will appear as "%d4y long file name.txt". The interesting thing, however, is that IE7 actually understands the header attachment;filename=%E3%81%93%E3%82%93%E3.txt correctly.
