In my program the user can either provide a filename on the command line or choose one via a QFileDialog. In the first case I have a char* without any encoding information; in the second I have a QString.
To store the filename for later use (Recent Files), I need it as a QString. But to open the file with std::ifstream, I need a std::string.
Now the fun starts. I can do:
filename = QString::fromLocal8Bit(argv[1]);
later on, I can do:
std::string fn = filename.toLocal8Bit().constData();
This works for most characters, but not all. For example, the word Раи́са will look the same after going through this conversion, but will in fact contain different characters.
So while I can have a Раи́са.txt, and it will display as Раи́са.txt, the file will not be found in the filesystem. Most letters work, but и́ doesn't.
(Note that it does work correctly when the file was chosen in the QFileDialog. It does not when it originated from the command line.)
Is there any better way to preserve the filename? Right now I obtain it in whatever the native encoding is and can pass it on in the same encoding without knowing it. At least so I thought.
'и́' is not an ASCII character, that is to say it has no single-byte representation. How it is represented in argv[1] is therefore OS dependent. But it is not represented in just one char.
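An illustration of what that means here (my assumption: since Unicode has no precomposed "и with acute", the name must use CYRILLIC SMALL LETTER I followed by COMBINING ACUTE ACCENT U+0301):

QString accented;
accented += QChar(0x0438); // CYRILLIC SMALL LETTER I
accented += QChar(0x0301); // COMBINING ACUTE ACCENT
// accented renders as "и́" (one glyph), yet accented.size() == 2

That second, combining code point is what makes и́ behave differently from the ordinary letters around it.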
fromLocal8Bit uses the same QTextCodec::codecForLocale as toLocal8Bit, and since, as you say, your std::string will hold "Раи́са.txt", that's not the problem.
Depending on how your OS implements std::ifstream, though, it may expect each char to be a raw byte and not pass the name through the OS's translation. I expect that you are on Windows, since you are seeing this problem. In that case you should use the std::wstring overload of std::fstream's open, which is Microsoft specific: http://msdn.microsoft.com/en-us/library/4dx08bh4.aspx
You can get a std::wstring from a QString by using QString::toStdWString().
See here for more info: fstream::open() Unicode or Non-Ascii characters don't work (with std::ios::out) on Windows
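Put together, a minimal sketch of that route (assuming Windows/MSVC, where the wide-filename overload exists; openForReading is a name I made up):

#include <QString>
#include <fstream>
#include <string>

void openForReading(const QString &filename)
{
    std::wstring wfn = filename.toStdWString(); // UTF-16 on Windows, no lossy 8-bit round trip
    std::ifstream in(wfn.c_str()); // Microsoft-specific wide-character overload
    // ... read from 'in' as usual ...
}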
EDIT:
A good cross-platform option for projects with access to it is Boost.Filesystem. ypnos mentions its file streams as specifically pertinent, as sketched below.
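For instance, a sketch under the assumption that Boost.Filesystem and its fstream replacements are available (openPortably is a name I made up):

#include <QString>
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>

void openPortably(const QString &filename)
{
    // Building the path from a wide string keeps the name intact on Windows;
    // on POSIX systems the path is converted using the locale's narrow encoding.
    boost::filesystem::path p(filename.toStdWString());
    boost::filesystem::ifstream in(p);
    // ... read from 'in' as usual ...
}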
I have a user-provided format string (e.g. "%.2f") and a QVariant that I am attempting to combine into a single formatted output string.
I had gone down the path of using QString::asprintf(const char *cformat, ...) to achieve this, where I would supply the appropriate converted data type, like this:
QString result_str = QString::asprintf(disp_fmt.toUtf8(),variant_type.toUInt());
This works fine for the most part, especially when the input is a floating point value. However, if I supply an integer (the .toUInt() conversion case) while the format string includes decimal formatting (e.g. "%.2f"), then I get a constant result of "0.00". This caught me by surprise, as I expected to instead just get ".00" tacked onto the integer, as I have seen in other languages like Perl.
What am I missing here? Also, I know asprintf() was added fairly recently, and the documentation now advises using QTextStream or arg() instead. I don't believe those are an option for me, however, given this style of format string. Thanks.
The format string is expecting a double, but you're providing an int. It works if you provide an actual double, like this:
QString result_str = QString::asprintf(disp_fmt.toUtf8(),variant_type.toDouble());
Also note that this behavior is identical to how the standard C library functions work (std::sprintf, etc.).
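The underlying reason is a plain C varargs rule, not anything Qt-specific: arguments passed through ... receive only the default promotions and are never converted to the type the format string expects, so reading an int through "%f" is undefined behavior, not a well-defined truncation to "0.00". A small illustration:

#include <cstdio>

int main()
{
    std::printf("%.2f\n", 42.0);                     // OK: prints 42.00
    std::printf("%.2f\n", static_cast<double>(42u)); // OK: convert explicitly
    // std::printf("%.2f\n", 42u); // undefined behavior: int bits read as a double
    return 0;
}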
When I read CSV files into R, the resulting data frame has very different dimensions than what I see when I open the file in Excel or Notepad, and the column heading is labeled "ÿþA". What does this mean?
thanks,
The file you are reading is using a UTF-16 or UTF-32 encoding (with a BOM), and R's read.csv function has not been told about it. The "ÿþ" you see is the byte order mark itself: the first two bytes 0xFF 0xFE, misread as the Latin-1 characters ÿ and þ, followed by your first real column letter "A".
As Karsten suggests, you should use the fileEncoding parameter to specify the correct encoding, which I suspect should be "UTF-16LE", e.g. read.csv("yourfile.csv", fileEncoding = "UTF-16LE").
Here is what the R documentation states about encoding:
Encoding
The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv: see that help page for how to find out what encoding names are recognized on your platform. Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is the internal encoding of the current locale and hence no translation is done.
Re-encoding only works for connections in text mode: reading from a connection with re-encoding specified in binary mode will read the stream of bytes, but mixing text and binary mode reads (e.g. mixing calls to readLines and readChar) is likely to lead to incorrect results.
The encodings "UCS-2LE" and "UTF-16LE" are treated specially, as they are appropriate values for Windows ‘Unicode’ text files. If the first two bytes are the Byte Order Mark 0xFFFE then these are removed as some implementations of iconv do not accept BOMs. Note that whereas most implementations will handle BOMs using encoding "UCS-2" and choose the appropriate byte order, some (including earlier versions of glibc) will not. There is a subtle distinction between "UTF-16" and "UCS-2" (see http://en.wikipedia.org/wiki/UTF-16/UCS-2): the use of surrogate pairs is very rare, so "UCS-2LE" is an appropriate first choice.
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted for reading and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications). If it is required (it is not recommended) when writing, it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con).
Requesting a conversion that is not supported is an error, reported when the connection is opened. Exactly what happens when the requested translation cannot be done for invalid input is in general undocumented. On output the result is likely to be that up to the error, with a warning. On input, it will most likely be all or some of the input up to the error.
It may be possible to deduce the current native encoding from Sys.getlocale("LC_CTYPE"), but not all OSes record it.
And here is what Wikipedia says about the BOM:
Byte order mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
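To make the BOM sniffing concrete, here is a minimal sketch (my own illustration rather than anything R does internally; it covers only the common marks, and sniffBom is a made-up name):

#include <cstddef>

// Inspect the first bytes of a file and report which BOM, if any, they form.
const char *sniffBom(const unsigned char *b, std::size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE"; // reads as "ÿþ" in Latin-1
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    return "no BOM (or an encoding without one)";
}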
Is there a native method in R to test if a file on disk is an ASCII text file, or a binary file? Similar to the file command in Linux, but a method that will work cross platform?
The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.
If all you care about is whether the file is ASCII or binary...
Well, first up definitions. All files are binary at some level:
is.binary <- function(file){
    if(system.type() != "quantum computer"){
        return(TRUE)
    }else{
        return(cat=alive&dead)
    }
}
ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.
So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.
Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.
Nowadays when people ask if a file is binary, what they are normally asking is "Does any byte in this file have its highest bit set?". You can check that in R by reading the file through a raw connection as unsigned one-byte integers and testing the largest value. Something like:
is.binary <- function(filepath, max = 1000){
    f <- file(filepath, "rb", raw = TRUE)
    b <- readBin(f, "int", max, size = 1, signed = FALSE)
    close(f)
    return(max(b) > 127) # any byte of 128 or more has its high bit set
}
By default this tests at most the first 1000 bytes. I think the file command does something similar.
You may want to extend the test to check for printable character codes, whitespace, line feeds, carriage returns, and whatever other codes you consider plausible in your non-binary files; see the sketch below for one way to phrase that...
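For illustration, here is what that richer test could look like (sketched in C++ rather than R, but the R version would follow the same logic; looksLikeText is a made-up name, and "printable ASCII plus common whitespace" is just one plausible definition of "text"):

#include <cctype>
#include <cstddef>
#include <fstream>
#include <vector>

bool looksLikeText(const char *path, std::size_t limit = 1000)
{
    std::ifstream f(path, std::ios::binary);
    if (!f) return false; // unreadable: treat as not text
    std::vector<unsigned char> buf(limit);
    f.read(reinterpret_cast<char *>(buf.data()), buf.size());
    const std::size_t n = static_cast<std::size_t>(f.gcount());
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = buf[i];
        // accept printable ASCII plus tab, newline and carriage return
        if (!(std::isprint(c) || c == '\t' || c == '\n' || c == '\r'))
            return false;
    }
    return true;
}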
Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why file extensions are used to signal content type.
I looked into that years ago, and as I recall, the file(1) app actually reads the first few header bytes of a file and compares them to what is stored in a lookup table. Sounds like a good candidate for an add-on package to me.
The example section of the manual for ?raw uses this:
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
I have an application that is supposed to deal with all kinds of characters and at some point display information about them. I use Qt and its inherent Unicode support in QChar, QString etc.
Now I need the code point of a QChar in order to look up some data in http://unicode.org/Public/UNIDATA/UnicodeData.txt, but QChar's unicode() method only returns a ushort (unsigned short), which usually is a number from 0 to 65535 (or 0xFFFF). There are characters with code points > 0xFFFF, so how do I get these? Is there some trick I am missing or is this currently not supported by Qt/QChar?
Each QChar is a UTF-16 code unit, not necessarily a complete Unicode code point. Non-BMP characters are therefore stored as two QChars forming a surrogate pair.
The solution appears to lie in code that is documented but not seen much on the web. You start from the code point as a plain integer, then check whether it is too large to fit in a single QChar. If it is, you need to create two QChars.
uint32_t cp = 155222; // a supplementary-plane CJK character, U+25E56 (4 bytes in UTF-8)
QString str;
if(QChar::requiresSurrogate(cp))
{
    QChar charArray[2];
    charArray[0] = QChar::highSurrogate(cp);
    charArray[1] = QChar::lowSurrogate(cp);
    str = QString(charArray, 2);
}
else
{
    str = QChar(cp); // fits in a single QChar, no surrogates needed
}
The resulting QString will contain the correct UTF-16 sequence to display your supplementary character.
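Going the other way, which is what the question actually asks, you detect a surrogate pair and combine it back into a code point. A minimal sketch using QChar's static helpers (codePointAt is a name I made up):

uint codePointAt(const QString &str, int i)
{
    QChar c = str.at(i);
    if (c.isHighSurrogate() && i + 1 < str.size() && str.at(i + 1).isLowSurrogate())
        return QChar::surrogateToUcs4(c, str.at(i + 1)); // combine the pair
    return c.unicode(); // a BMP character: the ushort is already the code point
}

Alternatively, QString::toUcs4() converts a whole string into a vector of 32-bit code points in one call.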
Unicode characters beyond U+FFFF in Qt
QChar itself only supports Unicode characters up to U+FFFF.
QString supports Unicode characters beyond U+FFFF by concatenating two QChars (that is, by using UTF-16 encoding). However, the QString API doesn't help you much if you need to process characters beyond U+FFFF. As an example, a QString instance which contains the single Unicode character U+131F6 will return a size of 2, not 1.
I opened QTBUG-18868 about this problem back in 2011, but after more than three years (!) of discussion, it was finally closed as "out of scope" without any resolution.
Solution
You can, however, download and use these Unicode Qt string wrapper classes which have been attached to the Qt bug report. Licensed under the LGPL.
This download contains the wrapper classes QUtfString, QUtfChar, QUtfRegExp and QUtfStringList which supplement the existing Qt classes and allow you to do things like this:
QUtfString str;
str.append(0x1307C); // Some Unicode character beyond U+FFFF
Q_ASSERT(str.size() == 1);
Q_ASSERT(str[0] == 0x1307C);
str += 'a';
Q_ASSERT(str.size() == 2);
Q_ASSERT(str[1] == 'a');
Q_ASSERT(str.indexOf('a') == 1);
For further details about the implementation, usage and runtime complexity please see the API documentation included within the download.
Has anyone ever seen a backspace-delimited flat file? My requirement is to parse such a file, but I am not able to put a backspace character into a file to check whether I am able to detect it.
Splitting shouldn't be any harder than using any other delimiter. It's just another character, after all. In Python, for instance:
>>> x = "apples\bbanana\bcoconut\bthese are delicious!"
>>> x.split('\b')
['apples', 'banana', 'coconut', 'these are delicious!']
Most languages use \b as the escape sequence for a backspace. If yours doesn't, you can instead use the ASCII control code for backspace, which is \x08.
I have never seen one, but some editors allow you to put a backspace character in by pressing e.g. Ctrl-V first.
You could write a script that appends the ASCII character code for backspace (0x08) to a file.
Here is a C program that will generate a backspace-delimited file for testing (with newlines delimiting the rows). Pass in a filename, or it will write to stdout (I chose C because you didn't mention a platform; most people have a C compiler available):
#include <stdio.h>

int main(int argc, char **argv) {
    FILE *outfile;
    if (argc < 2)
        outfile = stdout;
    else
        outfile = fopen(argv[1], "w");
    if (outfile == NULL) {
        perror("fopen");
        return 1;
    }
    fprintf(outfile, "this\bis\nbackspace\bdelimited\n");
    fclose(outfile);
    return 0;
}
The same string literal syntax should work in Java; I'll let you write the rest of the program:
"this\bis\nbackspace\bdelimited\n"
If using Windows, you can insert a backspace into notepad by using Ctrl+Backspace.
I would also recommend getting a hex editor like 0xED (for Mac). It's pretty useful for viewing and editing files containing unusual characters. With it, you can just type "08" to insert a backspace character into a file.