Qt i18n encoding - qt

Is there any difference in the way strings are translated in QT between literal strings in code and strings defined in a .ui file? Besides this, I should able to translate between a UTF-8 string to another UTF-8 string, not just from ASCII, right?
This doubt stems from a bug I found when trying to include UTF-8 characters (formatting characters like '»') in a "source" english string in my .ui file. The result is that the translations for those strings are not picked up by QT.
Note: I didn't forget to update to tags in the .ts file

Related

How can I use QSettings to write UTF-8 characters into [section] and [name] of *.ini file properly?

My code snippet is here:
QSettings setting("xxx.ini", QSettings::Format::IniFormat);
setting.setIniCodec(QTextCodec::codecForName("UTF-8"));
setting.beginGroup(u8"运动控制器");
setting.setValue(u8"运动控制器", u8"运动控制器");
setting.endGroup();
But what is written looks like this:
[%U8FD0%U52A8%U63A7%U5236%U5668]
%U8FD0%U52A8%U63A7%U5236%U5668=运动控制器
So it seems I did set the encoding correctly (partly), but what should I do to change the section and name into text from some per-cent-sign code?
Environment is Qt 5.12.11 and Visual Studio 2019
Unfortunately, this is hard-coded behavior in QSettings that you simply cannot change.
In section and key names, Unicode characters <= U+00FF (other than a..z, A..Z, 0..9, _, -, or .) are encoded in %XX hex format, and higher characters are encoded in %UXXXX format. The codec specified in setIniCodec() has no effect on this behavior.
Key values are written in the specified codec, in this case UTF-8.

Bold math characters into moodle with mathml converter

When I have $$\mathbf{x}$$ in my .Rmd file, and use exams2moodle with the pandoc-mathml converter, the xml file contains an "𝐱" character, which needs to be replaced with an "x" character before moodle will import the quiz question (because moodle will give an error saying the file is not UTF-8 without BOM.)
I wonder what are the most practical workarounds? Is this a bug? Thanks!
Minimal example: Here is minimal_example.Rmd
Question
========
Stare hard at the variable.
$$\mathbf{x}$$
What is its value?
Solution
========
If you think hard enough, you will know it is 12.
Meta-information
================
extype: num
exsolution: 12
exname: minimal_example
extol: 0
Here is the minimal_example.r
library("exams")
exams2moodle("minimal_example.Rmd", converter="pandoc-mathml")
And... here is a snippet of the resulting .xml file.
...
<questiontext format="html">
<text><![CDATA[<p>
<p>Stare hard at the variable. <math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mstyle mathvariant="bold"><mi>𝐱</mi></mstyle><annotation encoding="application/x-tex">\mathbf{x}</annotation></semantics></math> What is its value?</p>
</p>]]></text>
</questiontext>
...
If I try importing the XML to my school's moodle, I get a dmlwriteexeption error. If I replace the "𝐱" with "x" the XML imports fine.
I am fairly certain my moodlequiz.xml file does not contain a BOM.
$ file moodlequiz.xml
moodlequiz.xml: XML 1.0 document, UTF-8 Unicode text, with very long lines
$ hexdump -n 3 -C moodlequiz.xml
00000000 3c 3f 78 |<?x|
00000003
I consider this question resolved. Hopefully nobody else has this issue, and I will use one of the proposed workarounds for my own files. Thanks!
TL;DR
exams2moodle(..., converter = "pandoc-mathml") seems to work correctly and produces an UTF-8 encoded XML file moodlequiz.xml. The problem on your end appears to be caused by a BOM (byte order mark) in your XML file. It is unclear to me whether this is introduced through exams2moodle() or through an editor on your end.
Either you can remove the BOM manually or you can avoid the UTF-8 encoding altogether by using exams2moodle(..., converter = "pandoc-mathml-ascii"). The latter requires at least version 2.4-0 of the package.
Replication
Thanks for providing a reproducible example. I ran your example code - both on a Linux machine running in an UTF-8 locale and a Windows 10 machine - and can confirm that I get exactly the same XML code containing the UTF-8 encoded bold x: 𝐱. However, I have no problem importing that into my Moodle system.
Possible sources of the problem
So I looked up what the Moodle error message is about. Moodle does not accept UTF-8-encoded files with a BOM (byte order mark) at the beginning. Some systems use a BOM at the beginning of a file to declare how the file is encoded. See:
Moodle documentation: https://docs.moodle.org/39/en/UTF-8_and_BOM
Wikipedia with general information: https://en.wikipedia.org/wiki/Byte_order_mark
The moodlequiz.xml I produced on the two systems I mentioned above have no BOM. So I suspect that either your R setup produces a file with a BOM or the BOM is inserted later, e.g., after opening the XML file with an editor. The Moodle documentation above has some information on what you can do to detect the BOM and get rid of it. Hopefully, this lets you debug the problem on your end. If the BOM was produced by exams2moodle() (as opposed to your editor for example) and you find out how to avoid that, please let me know.
Alternative solution
In principle it is possible to replace the UTF-8 encocded characters by the corresponding HTML entities. For example, in this particular case we have a "MATHEMATICAL BOLD SMALL X" with Unicode U+1D431 (see https://www.w3.org/Math/characters/bold.html). Thus, we can also represent it as 𝐱 (hexadecimal) or 𝐱 (decimal). Then the XML file can be in ASCII while still leading to the same output in HTML.
While pandoc is generally designed to work with UTF-8 throughout it also has support for (hexa)decimal escapes in certain conversions, see https://pandoc.org/MANUAL.html#option--ascii. And luckily it is possible to combine the --mathml with the --ascii option. There was only a small bug in how R/exams passed on the option to the rmarkdown::pandoc_convert() function which I just fixed. So you need at least version 2.4-0 of exams and can then do:
exams2moodle(..., converter = "pandoc-mathml-ascii")
which yields a moodlequiz.xml in ASCII instead of UTF-8.

RQDA does not read Imported UTF-8 characters in UTF-8 encoded .txt

Trying to import a database of texts for analysis with RQDA. The database consists of word to text converted files with UTF-8 encoding. RQDA is supposed to read UTF-8, however UTF-8 characters like (ą, č, ę, ė, į, š, ų, ū) are not recognized after import to RQDA.
I'm using the "write.FileList" function for import. Its details state, that
"The file content will be converted to UTF-8 character before being written to *.rqda. The original content can be in any suitable encoding, so you can inspect the content correctly; In other words, the better practices is to used the corresponding encoding (you can get a hint by localeToCharset function) to save the imported files."
write.FileList(FileList, encoding = .rqda$encoding, con = .rqda$qdacon)
addFilesFromDir("C:\\output", pattern = "*.txt$")
write.FileList imports the database of text to RQDA, but UTF-8 characters are not recognized.
Shows this warning:
"In rsqlite_fetch(res#ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries"

How to specify encoding while creating file?

I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding,even though some characters are in Unicode format. How to ensure ANSI encoding?
newfile='\home\user\abc.ttl'
file.create(newfile)
text3 <- readLines('\home\user\init.ttl')
sprintf('readlines %d',length(text3))
for(k in 1:length(text3))
{
cat(text3[[k]],file=newfile,sep="\n",append=TRUE)
}
Encoding can be tricky, since you need to detect your encoding upon input, and then you need to convert it before writing. Here it sounds like your input file input.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you are probably going to lose some non-translatable characters, since there may be no mapping from the UTF-8 characters to ASCII outside of the 128-bit lower range. (Within this range the mappings of UTF-8 to ASCII are the same.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.

CSS character encoding

According to W3C, CSS can set its character encoding by #charset in the first line, is it valid to to say that I should put #charset "UTF-8" in every CSS i made, even it only contains ASCII characters?
Will there be any performance penalty after I declare it using UTF-8 ?
p.s. I can't think of a way to test it out.
No, it is not valid to say so, as an unqualified statement. If your file contains only ASCII characters, it is very likely that its character encoding is ASCII compatible (EBCDIC is not much used there days), so the rule would be harmless, but also pointless as long as the file keeps being ASCII-only.
What matters is what happens when a non-ASCII character gets inserted into the CSS file, for whatever reason. It could be, for example, a innocent-look smart quote (”) inserted when editing the file with a program that produces smart quotes. It is more likely that the smart quote gets inserted in windows-1252 encoding than in UTF-8 encoding. So if the file has the #charset "UTF-8" rule, it probably becomes a bit more difficult to analyze the problem.
If, on the other hand, you know that your CSS file will be edited using software that uses UTF-8 encoding by default, then it is OK to declare it as UTF-8 encoded even if it only contains ASCII characters. For example, if you some day edit the file and add a declaration like content: "“foo”", you might forget to add the #charset rule.
There is no overhead in declaring the encoding as UTF-8. If the data contains ASCII characters only, any decent routine that reads UTF-8 will process the characters as fast as simple reading of ASCII. A routine that reads a UTF-8 bytestream will have to first check whether the byte is in the ASCII range and take it as standing for an ASCII character if it is.

Resources