According to the W3C, a CSS file can declare its character encoding with @charset on the first line. Is it valid to say that I should put @charset "UTF-8" in every CSS file I make, even if it contains only ASCII characters?
Will there be any performance penalty if I declare it as UTF-8?
p.s. I can't think of a way to test it out.
No, it is not valid to say so, as an unqualified statement. If your file contains only ASCII characters, it is very likely that its character encoding is ASCII compatible (EBCDIC is not much used these days), so the rule would be harmless, but also pointless as long as the file stays ASCII-only.
What matters is what happens when a non-ASCII character gets inserted into the CSS file, for whatever reason. It could be, for example, an innocent-looking smart quote (”) inserted when editing the file with a program that produces smart quotes. Such a quote is more likely to be saved in windows-1252 encoding than in UTF-8 encoding. So if the file has the @charset "UTF-8" rule, the declaration no longer matches the bytes, which probably makes the problem a bit more difficult to analyze.
If, on the other hand, you know that your CSS file will be edited using software that uses UTF-8 encoding by default, then it is OK to declare it as UTF-8 encoded even if it only contains ASCII characters. The declaration then acts as a safeguard: if you some day edit the file and add a declaration like content: "“foo”", you might otherwise forget to add the @charset rule.
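To see why such a stray quote is awkward, here is a minimal Python sketch (Python is used only for illustration; the CSS snippet and the helper name is_valid_utf8 are made up for this example). It encodes the right double quotation mark both ways and checks whether a file containing the windows-1252 byte would still be valid UTF-8.

```python
# U+201D RIGHT DOUBLE QUOTATION MARK, as two different editors might save it.
quote = "\u201d"
as_cp1252 = quote.encode("cp1252")  # one byte: 0x94
as_utf8 = quote.encode("utf-8")     # three bytes: 0xE2 0x80 0x9D


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A made-up CSS rule where the quote was saved as windows-1252.
css_bytes = b'q::after { content: "' + as_cp1252 + b'"; }'

print(as_cp1252.hex(), as_utf8.hex(), is_valid_utf8(css_bytes))
```

The lone 0x94 byte is not a valid UTF-8 sequence, so a file declared as UTF-8 but containing it is malformed.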
There is no overhead in declaring the encoding as UTF-8. If the data contains ASCII characters only, any decent routine that reads UTF-8 will process the characters as fast as simple reading of ASCII. A routine that reads a UTF-8 bytestream will have to first check whether the byte is in the ASCII range and take it as standing for an ASCII character if it is.
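The reason there is nothing extra to process can be shown with a small Python sketch (illustrative only; the sample rule is made up): an ASCII-only string produces exactly the same bytes whether it is encoded as ASCII or as UTF-8, and every byte is below 0x80.

```python
# An ASCII-only CSS rule (made-up sample).
css = "body { color: #333; }"

ascii_bytes = css.encode("ascii")
utf8_bytes = css.encode("utf-8")

# ASCII is a strict subset of UTF-8: the byte sequences are identical.
same_bytes = ascii_bytes == utf8_bytes

# Every byte is in the single-byte (ASCII) range of UTF-8.
all_single_byte = all(b < 0x80 for b in utf8_bytes)

print(same_bytes, all_single_byte)
```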
I am using an R script to create and append to a file. But I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How can I ensure ANSI encoding?
newfile='/home/user/abc.ttl'
file.create(newfile)
text3 <- readLines('/home/user/init.ttl')
sprintf('readlines %d',length(text3))
for(k in 1:length(text3))
{
cat(text3[[k]],file=newfile,sep="\n",append=TRUE)
}
Encoding can be tricky, since you need to detect your encoding upon input, and then you need to convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you are probably going to lose some non-translatable characters, since there is no mapping from UTF-8 characters to ASCII outside of the lower 128 code points (the 7-bit range). (Within this range UTF-8 and ASCII are identical.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
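To make the data loss concrete, here is a rough Python analogue of the conversion (this is not R and not iconv itself; the sample strings are made up). The "ignore" error handler plays a role similar to sub = "", silently dropping anything that has no ASCII equivalent, while "replace" leaves a visible marker instead.

```python
# Made-up sample lines; the first contains a smart apostrophe (U+2019).
lines = ["200 Gray\u2019s Inn Road", "plain ASCII line"]

# Drop untranslatable characters, similar in spirit to iconv(..., sub = "").
dropped = [s.encode("ascii", errors="ignore").decode("ascii") for s in lines]

# Alternative: keep a visible "?" where a character was lost.
marked = [s.encode("ascii", errors="replace").decode("ascii") for s in lines]

print(dropped)
print(marked)
```

Whether silently dropping or visibly marking is preferable depends on whether downstream consumers need to know that something was removed.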
I have reported a bug and entered a support request at the KDiff3 site (https://sourceforge.net/p/kdiff3/bugs/198/), but I wonder if anyone has any prompt information for me about a behavior I'm seeing that might lead me to understanding why such a bug might exist -- if there's anything unusual about these unicode characters.
When I merge two identical files containing the character 稍 using KDiff3 version 0.9.98, it reads the character as 稊 and shows that character in all the panes of the merge. The output then contains that character instead of 稍.
I've observed this behavior with UCS-2 Little Endian encoding in version 0.9.98 of KDiff3, but not with UTF-8 encoding, and not with version 0.9.96a, the version of KDiff3 that comes with TortoiseHg. Although I can reproduce the problem in 0.9.96 and 0.9.97, TortoiseHg's KDiff3 reports that it is version 0.9.96a and does not exhibit the problem.
Edit: I vaguely suspect the source of the problem to be somewhere in the Qt library. So any information about what Qt does especially in regard to handling international text might be useful.
Utilities that process text files need to break the text into characters to operate effectively. The simplest possible process is to treat each 8-bit byte as a single character. Unfortunately this doesn't work well with UTF-16 or UCS-2 input, since each byte is only half of the character.
The character you're having problems with is 稍 (U+7a0d) which is being converted to 稊 (U+7a0a). When you break those down into little-endian bytes, you get 0x0d, 0x7a and 0x0a, 0x7a. The 8-bit character 0x0d is the ASCII code for Return, and 0x0a is the code for Linefeed. It seems that KDiff3 is interpreting these bytes as line endings, and substituting a Linefeed when it encounters a Return. This is verified by your report of an error message indicating inconsistent line endings in the file.
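The byte arithmetic can be reproduced with a short Python sketch. The byte-wise replacement below is only a stand-in for whatever line-ending handling KDiff3 actually performs, which is an assumption here, not something taken from its source.

```python
original = "\u7a0d"                  # 稍
raw = original.encode("utf-16-le")   # bytes 0x0D 0x7A

# Hypothetical byte-oriented line-ending normalization: CR becomes LF.
mangled = raw.replace(b"\r", b"\n")  # bytes 0x0A 0x7A

result = mangled.decode("utf-16-le")

print(raw.hex(), mangled.hex(), hex(ord(result)))
```

Decoding the altered bytes yields U+7A0A (稊), the character reported in the question.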
When working with Unicode it is often better to use UTF-8 encoding. The characters above U+007f will still take up more than one byte, but each of those bytes will have a value of 0x80 or greater and cannot accidentally be mistaken for one of the ASCII characters. For example 稍 becomes 0xe7, 0xa8, 0x8d.
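A quick Python check of that claim, again using a byte-wise CR-to-LF replacement as a hypothetical stand-in for naive line-ending handling:

```python
original = "\u7a0d"              # 稍
raw = original.encode("utf-8")   # bytes 0xE7 0xA8 0x8D

# No byte of the multi-byte sequence falls in the ASCII range.
all_high = all(b >= 0x80 for b in raw)

# A byte-oriented CR -> LF replacement therefore leaves the character intact.
survived = raw.replace(b"\r", b"\n").decode("utf-8") == original

print(raw.hex(), all_high, survived)
```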
When I read CSV files into R, the resulting data frame has very different dimensions from what I see when I open the file in Excel or Notepad, and the column heading is labeled "ÿþA". What does this mean?
thanks,
The file you are reading uses a UTF-16 or UTF-32 encoding (with a BOM), and R's read.csv function has not been told which encoding to use.
As Karsten suggests you should use the fileEncoding parameter to specify the correct encoding, which I suspect should be "UTF-16LE".
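Where the "ÿþA" comes from can be seen in a small Python sketch (not R; the CSV content is made up). A UTF-16LE file with a BOM starts with the bytes 0xFF 0xFE, and reading those bytes as a single-byte encoding such as Latin-1 turns them into ÿ and þ.

```python
import codecs

# Made-up CSV content saved as UTF-16LE with a byte order mark.
data = codecs.BOM_UTF16_LE + "A,B\r\n1,2\r\n".encode("utf-16-le")

# Misreading the bytes as a single-byte encoding gives the odd header.
wrong = data.decode("latin-1")

# The generic "utf-16" codec consumes the BOM and picks the byte order from it.
right = data.decode("utf-16")

print(repr(wrong[:3]), repr(right))
```

The misread text also contains a NUL character after every ASCII letter, which is one reason the parsed dimensions come out wrong.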
Here is what the R documentation (the help page for connections) states about encoding:
Encoding
The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv: see that help page for how to find out what encoding names are recognized on your platform. Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is the internal encoding of the current locale and hence no translation is done.
Re-encoding only works for connections in text mode: reading from a connection with re-encoding specified in binary mode will read the stream of bytes, but mixing text and binary mode reads (e.g. mixing calls to readLines and readChar) is likely to lead to incorrect results.
The encodings "UCS-2LE" and "UTF-16LE" are treated specially, as they are appropriate values for Windows ‘Unicode’ text files. If the first two bytes are the Byte Order Mark 0xFFFE then these are removed as some implementations of iconv do not accept BOMs. Note that whereas most implementations will handle BOMs using encoding "UCS-2" and choose the appropriate byte order, some (including earlier versions of glibc) will not. There is a subtle distinction between "UTF-16" and "UCS-2" (see http://en.wikipedia.org/wiki/UTF-16/UCS-2): the use of surrogate pairs is very rare so "UCS-2LE" is an appropriate first choice.
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted for reading and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications). If it is required (it is not recommended) when writing it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con)
Requesting a conversion that is not supported is an error, reported when the connection is opened. Exactly what happens when the requested translation cannot be done for invalid input is in general undocumented. On output the result is likely to be that up to the error, with a warning. On input, it will most likely be all or some of the input up to the error.
It may be possible to deduce the current native encoding from Sys.getlocale("LC_CTYPE"), but not all OSes record it.
And here is what Wiki states on the BOM:
Byte order mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
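As a rough illustration of how a consumer can use the BOM to guess the representation, here is a minimal Python sketch. The function name sniff_bom and the table BOMS are made up for this example; the BOM constants themselves come from Python's standard codecs module.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xff\xfeA\x00"))   # UTF-16LE BOM followed by "A"
print(sniff_bom(b"\xef\xbb\xbfA"))   # UTF-8 BOM followed by "A"
print(sniff_bom(b"A,B"))             # no BOM
```

A missing BOM proves nothing about the encoding, so this is only a first check, not a full detector.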
I was trying to find a solution for my problem and after looking at the forums I couldn't so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because when I use utf-8 encoding the "’" character ends up as the byte value x92.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII or replace them on encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.
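Assuming the guess of Windows-1252 is right, the failure and the fix can be sketched in Python (illustrative only; the field value is taken from the sample line in the question):

```python
# The address field as it would appear in a Windows-1252 file.
raw = b"200 Gray\x92s Inn Road"

# Treating the bytes as UTF-8 fails: 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode with the assumed source encoding, then re-encode as real UTF-8.
text = raw.decode("cp1252")
fixed = text.encode("utf-8")

print(utf8_ok, text, fixed)
```

After conversion the apostrophe is the three-byte sequence 0xE2 0x80 0x99, which an XML parser expecting UTF-8 will accept.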
I'm trying to get the degrees Celsius symbol to show up using the pseudo selector :after but can't seem to get any Unicode to work. Using the symbol I have in place now prints a capital A before the degree symbol.
.temp:after{
content:"°C";
}
I’m pretty sure it actually prints “Â°”, i.e. capital A with circumflex before the degree sign. The reason is that the file containing the CSS code is UTF-8 encoded but being interpreted as windows-1252 encoded. (The degree sign, U+00B0, is 0xC2 0xB0 in UTF-8 encoding; if this is interpreted as windows-1252, or as ISO-8859-1, you get U+00C2 U+00B0, that is Â°.)
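The misinterpretation is easy to reproduce in a few lines of Python (used here only to show the bytes; it is not part of the CSS fix):

```python
content = "\u00b0C"                # the intended text: degree sign followed by C

utf8_bytes = content.encode("utf-8")   # 0xC2 0xB0 0x43

# What a browser shows if it reads those bytes as windows-1252 or ISO-8859-1.
as_cp1252 = utf8_bytes.decode("cp1252")
as_latin1 = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(), as_cp1252, as_latin1)
```

Both misreadings give a capital A with circumflex in front of the degree sign, matching the symptom in the question.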
The solution is to declare the encoding of the file as UTF-8. The details depend on whether the CSS code is inside an HTML document or in a CSS file, and it may also depend on the server software. See the W3C page Character encodings.
If the code is in a CSS file, the simplest fix is to save that file, in your editor, as UTF-8 with BOM. Depending on software, this might be simply flagged as “UTF-8” (as opposed to “UTF-8 without BOM”). Another way is to write the following at the very start of the CSS file:
@charset "UTF-8";
Using an escape instead, content:'\00b0 C';, seems to work for me: http://codepen.io/anon/pen/kvyFh
This could be helpful to you: http://unicode-table.com/en/#00B0 (it gives you the HTML entity code too: °)