Wrong result of TIdURI.URLDecode when unit LazUTF8 is used

With Free Pascal 3.0.4, this test program correctly writes ÄÖÜ
program FPCTest;
uses IdURI;
However, if the unit LazUTF8 (as described here) is used, it writes ???
program FPCTest;
uses IdURI, LazUTF8;
How can I fix this decoding error for programs which use LazUTF8?

When the String type is an alias for AnsiString 1, much of Indy's functionality exposes extra parameters/properties to let users control which ANSI encodings are used when AnsiString values are passed around in operations that perform AnsiString<->byte conversions.
1: Delphi pre-2009, and FreePascal/Lazarus when {$ModeSwitch UnicodeStrings} and {$Mode DelphiUnicode} are not used (FYI, Indy 11 will use them!).
In most cases, Indy's default byte encoding is ASCII (because many of the Internet protocols that Indy implements originally supported only ASCII - individual Indy components upgrade themselves to UTF as appropriate per protocol), though some things use the OS default codepage/charset instead.
Indy's default byte encoding can be changed at runtime by setting the global GIdDefaultTextEncoding variable in the IdGlobal unit, eg:
GIdDefaultTextEncoding := encUTF8;
However, in this particular situation, TIdURI.URLDecode() does not use GIdDefaultTextEncoding. It does have an optional ADestEncoding parameter that you can use to specify the byte encoding of the returned AnsiString (in addition to an optional AByteEncoding parameter that specifies the byte encoding of the parsed url octets, UTF-8 by default), eg:
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8{$ENDIF}
The above will parse the url-encoded octets as UTF-8, and then return that data as-is in a UTF-8 encoded AnsiString.
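To see why the destination encoding matters, here is a minimal sketch in Python rather than Pascal. It only illustrates the two steps described above (octets parsed with one encoding, result stored in another) and does not reproduce Indy's internals. The sample string and the helper name decode_url_octets are assumptions made for this example.

```python
from urllib.parse import unquote_to_bytes


def decode_url_octets(url_text, byte_encoding="utf-8"):
    """Turn %XX escapes into raw bytes, then interpret them with byte_encoding."""
    raw = unquote_to_bytes(url_text)
    return raw.decode(byte_encoding)


sample = "%C3%84%C3%96%C3%9C"  # assumed sample: the UTF-8 octets of the three umlauts
text = decode_url_octets(sample)

# Storing the decoded text as UTF-8 keeps all three characters.
as_utf8 = text.encode("utf-8")

# Storing it in an encoding that lacks those characters loses them.
as_ascii = text.encode("ascii", errors="replace")

print(as_utf8)   # b'\xc3\x84\xc3\x96\xc3\x9c'
print(as_ascii)  # b'???'
```

The lossy second conversion produces the same kind of ??? output as in the question, which is why the destination encoding has to match what the rest of the program expects.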
If you do not specify an output encoding for ADestEncoding, URLDecode() defaults to the OS default. If you want it to use GIdDefaultTextEncoding instead, specify IndyTextEncoding_Default in the ADestEncoding parameter:
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding_Default{$ENDIF}
Another option would be to use the IndyTextEncoding(CodePage) function for ADestEncoding, passing it FreePascal's DefaultSystemCodePage variable, which the LazUtils package sets to CP_UTF8 2:
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding(DefaultSystemCodePage){$ENDIF}
2: I have opened a ticket in Indy's issue tracker to add support for DefaultSystemCodePage when compiling for FreePascal/Lazarus.

With the following change in TIdURI.URLDecode (lines 386 ff.), LazUTF8 can be used:
Result := string(AByteEncoding.GetString(LBytes));
EnsureEncoding(ADestEncoding, encOSDefault);
CheckByteEncoding(LBytes, AByteEncoding, ADestEncoding);
SetString(Result, PAnsiChar(LBytes), Length(LBytes));
Result := AByteEncoding.GetString(LBytes);
This change assumes that the LazUTF8 unit is always used, and the Indy source code change has to be reapplied every time a new Indy version is installed.
Also, I found no way to fix TIdURI.URLDecode so that it works both with and without LazUTF8.


Decoding Binary Data in Tcl

I am reading data from a TCP port in Tcl using a socket. The messages do not end with a newline, but they do contain a header giving the number of bytes of data.
I have the following code to read two bytes of data from the socket (16-bit little endian) and convert them into an integer that I can then use in a loop to read the rest of the data:
binary scan [read $Socket 2] s* length
In this case $Socket is my socket and it has been configured to use binary encoding.
This works well except where either the upper or lower byte is 0x0D. It appears Tcl reads 0x0D and 0x0A both as '\n', which then becomes 0x0A, so the code does not work correctly. For example, 13 is read as 10. How do I stop this from happening?
The socket should be placed into binary mode if you're moving binary data across it.
chan configure $Socket -translation binary
# Use [fconfigure] instead of [chan configure] in older Tcl versions
This disables all the automatic processing that Tcl usually does — your description says you're having a problem with end-of-line conversion — and makes it so that read will just deliver a string of the bytes (formally a string of characters between U+000000 and U+0000FF, and internally using an efficient in-memory encoding scheme).
For files, you can include b in the control mode when opening to get this done for you. For sockets, you need to do this yourself.
In addition to configuring binary encoding, you also need to set the translation to 'lf'. As this is a frequently occurring situation, there is a shorthand for making these two settings:
fconfigure $Socket -translation binary
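As a language-neutral illustration of the framing and of what end-of-line translation does to it, here is a minimal sketch in Python (not Tcl). The function name read_message and the sample payload are assumptions for the example, and an in-memory stream stands in for the socket. A real socket read may return fewer bytes than requested, so production code would loop until the full count arrives.

```python
import io
import struct


def read_message(stream):
    """Read one message: a 2-byte little-endian length, then that many payload bytes."""
    header = stream.read(2)
    if len(header) < 2:
        raise EOFError("stream ended inside the length header")
    # "<H" is unsigned 16-bit little endian; Tcl's "s" specifier is the signed variant.
    (length,) = struct.unpack("<H", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended inside the payload")
    return payload


# A 13-byte payload, so the low byte of the length header is 0x0D.
wire = struct.pack("<H", 13) + b"hello, world!"
message = read_message(io.BytesIO(wire))

# What CR-to-LF translation would do to the same two header bytes.
translated = wire[:2].replace(b"\r", b"\n")
(corrupted_length,) = struct.unpack("<H", translated)

print(message)           # b'hello, world!'
print(corrupted_length)  # 10
```

With the channel in binary mode the header bytes arrive untouched, so the length stays 13.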

Qt error is printed on the console; how to see where it originates from?

I'm getting this on the console in a QML app:
QFont::setPointSizeF: Point size <= 0 (0.000000), must be greater than 0
The app is not crashing so I can't use the debugger to get a backtrace for the exception. How do I see where the error originates from?
If you know the function the warning occurs in (in this case, QFont::setPointSizeF()), you can put a breakpoint there. Following the stack trace will lead you to the code that calls that function.
If the warning doesn't include the name of the function and you have the source code available, use git grep with part of the warning to get an idea of where it comes from. This approach can be a bit of trial and error, as the code may span more than one line, etc, and so you might have to try different parts of the string.
If the warning doesn't include the name of the function, you don't have the source code available and/or you don't like the previous approach, use the QT_MESSAGE_PATTERN environment variable:
QT_MESSAGE_PATTERN="%{function}: %{message}"
For the full list of variables at your disposal, see the qSetMessagePattern() docs:
%{appname} - QCoreApplication::applicationName()
%{category} - Logging category
%{file} - Path to source file
%{function} - Function
%{line} - Line in source file
%{message} - The actual message
%{pid} - QCoreApplication::applicationPid()
%{threadid} - The system-wide ID of current thread (if it can be obtained)
%{qthreadptr} - A pointer to the current QThread (result of QThread::currentThread())
%{type} - "debug", "warning", "critical" or "fatal"
%{time process} - time of the message, in seconds since the process started (the token "process" is literal)
%{time boot} - the time of the message, in seconds since the system boot if that can be determined (the token "boot" is literal). If the time since boot could not be obtained, the output is indeterminate (see QElapsedTimer::msecsSinceReference()).
%{time [format]} - system time when the message occurred, formatted by passing the format to QDateTime::toString(). If the format is not specified, the format of Qt::ISODate is used.
%{backtrace [depth=N] [separator="..."]} - A backtrace with the number of frames specified by the optional depth parameter (defaults to 5), and separated by the optional separator parameter (defaults to "|"). This expansion is available only on some platforms (currently only platforms using glibc). Names are only known for exported functions. If you want to see the name of every function in your application, use QMAKE_LFLAGS += -rdynamic. When reading backtraces, take into account that frames might be missing due to inlining or tail call optimization.
On an unrelated note, the %{time [format]} placeholder is quite useful to quickly "profile" code by qDebug()ing before and after it.
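If you launch the application from a script, the variable can be set for just that one process. The following is a minimal sketch in Python. The helper name env_with_message_pattern and the ./myapp path are hypothetical, and the pattern uses only placeholders from the list above.

```python
import os


def env_with_message_pattern(pattern, base_env=None):
    """Return a copy of the environment with QT_MESSAGE_PATTERN set."""
    env = dict(os.environ if base_env is None else base_env)
    env["QT_MESSAGE_PATTERN"] = pattern
    return env


# Show the message type, source location and function in front of each message.
pattern = "[%{type}] %{file}:%{line} %{function}: %{message}"
env = env_with_message_pattern(pattern, base_env={"PATH": "/usr/bin"})

# Hypothetical launch; "./myapp" is a placeholder for your own binary:
# import subprocess
# subprocess.run(["./myapp"], env=env)
```

Copying the environment instead of modifying os.environ keeps the setting local to the launched process.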
I think you can use qInstallMessageHandler (Qt5) or qInstallMsgHandler (Qt4) to specify a callback which will intercept all qDebug() / qInfo() / etc. messages (example code is in the link). Then you can just add a breakpoint in this callback function and get a nice callstack.
Aside from the obvious, searching your code for calls to setPointSize[F], you can try the following depending on your environment (which you didn't disclose):
If you have the debugging symbols of the Qt libs installed and are using a decent debugger, you can set a conditional breakpoint on the first line in QFont::setPointSizeF() with the condition set to pointSize <= 0. Even if conditional breakpoints don't work you should still be able to set one and step through every call until you've found the culprit.
On Linux there's the tool ltrace which displays all calls of a binary into shared libs, and I suppose there's something similar in the M$ VS toolbox. You can grep the output for calls to setPointSize directly, but of course this won't work for calls within the lib itself (which I guess could be the case when it handles the QML internally).

libxml2 XML_PARSE_HUGE option for xmlParseMemory

C++ on Centos 6.4, libxml2.x86_64 2.7.6-12.el6_4.1:
I'm trying to fix an old C++ program that occasionally gets XML parser errors on large xml files, seems to need the XML_PARSE_HUGE option set. But I can't see any place to set it! The code that's failing is using the xmlParseMemory function which only has 2 parameters - the char array to parse and its size.
Is there some way to set the XML_PARSE_HUGE option globally?
You have to switch to xmlReadMemory, which has an options parameter. Simply convert calls like
xmlParseMemory(buffer, size);
to
xmlReadMemory(buffer, size, NULL, NULL, XML_PARSE_HUGE);
(I think xmlParseMemory predates the parser options and is only retained for backward compatibility. Also see this question.)

unicode character cannot be converted to cp1252

I am writing a Qt 5 application (with Qt Creator) which uses special characters like zodiac signs. This code works perfectly fine on Linux Mint 14:
QString s = QString::fromUtf8("\u2648");
But when I compile it on Windows XP SP3, I get a compiler warning which says that the current codepage is cp1252 and the character \u2648 cannot be converted. When I run the program, this character is displayed as a question mark.
According to my system settings UTF8(codepage 65001) is installed on my Windows.
(Note, I have not tried this, and I don't know which compiler you are using, and am completely unfamiliar with QT, so I could be wrong. The following is based on general knowledge about Unicode on Windows.)
On Windows, 8-bit strings are generally assumed to be in the current codepage of the system (also called the "ANSI" codepage). This is never UTF-8. On your system, it's apparently cp1252. So there are actually two things going wrong:
You are specifying a Unicode character, which the compiler tries to convert to the correct codepage. On Windows, this results in a compile-time error, because cp1252 doesn't have a code point to represent U+2648.
But assuming that the code would compile, it would still not work. You pass this string, which would be in cp1252, to fromUtf8, which wants a UTF-8 string. As the string is not valid UTF-8, this would likely result in a runtime error.
On your Linux system, both steps work "by accident", because it uses UTF-8 for 8-bit strings.
To get this right, specify the 8-bit string in UTF-8 right away:
QString s = QString::fromUtf8("\xE2\x99\x88");
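The byte values in that literal can be checked independently of Qt. This is a minimal sketch in Python (not C++) that derives the UTF-8 bytes of U+2648 and shows that cp1252 has no slot for the character. The variable names are illustrative only.

```python
zodiac = "\u2648"  # U+2648 ARIES

# The three bytes that go into the escaped C string literal.
utf8_bytes = zodiac.encode("utf-8")

# cp1252 cannot represent the character at all.
try:
    zodiac.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# A lossy conversion substitutes a question mark, as seen on screen in the question.
lossy = zodiac.encode("cp1252", errors="replace")

print(utf8_bytes)   # b'\xe2\x99\x88'
print(fits_cp1252)  # False
print(lossy)        # b'?'
```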
Here is my advice to get everything working:
There is only one encoding type UTF-8! Use it everywhere if possible. So, in QtCreator settings set default codepage for sources UTF-8.
You can convert your source code in QtCreator: edit -> choose encoding and there reload in codepage. If it can't be done, use linux console application iconv this way:
iconv -f cp1252 -t utf-8 your_source_in_cp1252.cpp > your_source_in_utf8.cpp
I use this code snippet for C-strings in my source codes: in main.cpp add #include <QTextCodec>, and then do:
// For correct encoding
QTextCodec *codec = QTextCodec::codecForName("UTF-8");

InputB vs. Get; code pages; slow reading on unix server

We have been using the usual code to read in a complete file into a string to then parse in VB6. The files are ANSI text but encoded using whatever code page the user was in at the time (we have Chinese and English users for example). This is the code
Open FileName For Binary As nFileUnit
sContents = StrConv(InputB(LOF(nFileUnit), nFileUnit), vbUnicode)
However, we have discovered this is VERY slow reading a file from a server running unix/linux, particularly when the ownership of the file is not the same as the process doing the reading.
I have rewritten the above using Get and discovered it is much faster and does not suffer from any issues with file ownership. I appreciate that the ownership problem might be solved by reconfiguring the server somehow, but since the Get method is still much faster than InputB even without that issue, I'd like to replace my existing code with Get.
I wonder if someone could tell me if this will really do the same thing. In particular, is it correctly doing the ANSI to Unicode conversion and will this always be true. My testing suggests the following replacement code does the same thing but faster:
Open FileName For Binary As nFileUnit
sContents = String(LOF(nFileUnit), " ")
Get #nFileUnit, , sContents
I also realise I could use a byte array, but again my tests suggest the above is simpler and works. So how does the buffer come out correct? The online help for Get talks of characters returned, which would clearly cause problems when reading an ANSI file written on the Chinese code page with 2-byte Chinese characters in it.
The following might be of interest because the InputB approach is commonly given as the method to read a complete file, but it is much slower. For example:
Reading a 380 KB file across the network from the unix server
InputB (file owned) = 0.875 sec
InputB (not owned) = 72.8 sec
Get (either) = 0.0156 sec
Reading a 9 MB file across the network from the unix server
InputB (file owned) = 19.65 sec
Get (either) = 0.42 sec
InputB() is CVar(InputB$()), and is known to be horribly slow. My suspicion is that InputB$() reads the bytes and converts them to Unicode using the current codepage via some stock logic for reading text from disk, then does another conversion back to ANSI using the current codepage.
You might be far better off using ADODB.Stream.LoadFromFile() to load complete ANSI text files. Set .Type = adTypeText and .Charset to the appropriate ANSI encoding, then read Unicode back out via .ReadText(x), where x can be a number of bytes, or adReadAll or adReadLine. For line reading you can set .LineSeparator to adCR, adCRLF, or adLF as required.
Many Charset values are supported: KOI8 for Cyrillic, Big5 for Chinese, etc.
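The byte-versus-character concern from the question is easy to see with an explicit charset. The following is a minimal sketch in Python (not VB6) that reads a whole file as bytes in one call and decodes it with a named code page. The helper name read_ansi_file and the sample text are assumptions for the example, and "gbk" is Python's codec name for a Simplified Chinese double-byte encoding, used here as a stand-in for the user's code page.

```python
import os
import tempfile


def read_ansi_file(path, code_page):
    """Read the whole file as bytes in one call, then decode with an explicit code page."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return raw, raw.decode(code_page)


original = "abc\u4e2d\u6587"  # "abc" followed by two Chinese characters

# Write a sample file in the assumed code page.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(original.encode("gbk"))
    path = tmp.name

try:
    raw, text = read_ansi_file(path, "gbk")
finally:
    os.remove(path)

print(len(raw))   # 7 bytes on disk
print(len(text))  # 5 characters after decoding
```

The file length in bytes is therefore only an upper bound on the number of characters, so a buffer sized from the byte count is always large enough.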
