I am storing some character strings locally as UTF-8, and they might look something like "bären". I then send them to the AdWords SOAP API to get some search volumes, and what comes back looks something like this: "bÃ¤ren".
In R, is there a way to get either "bären" to look like "bÃ¤ren" or vice versa so that I can "VLOOKUP"?
When I use the Encoding() function on "bären" it returns UTF-8; when I do the same on "bÃ¤ren" it returns unknown. If I then use the enc2utf8() function on the variable where "bÃ¤ren" is stored, it doesn't change what is displayed, but Encoding() does then change to UTF-8.
When I don't use funky umlaut characters the API works just fine.
I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as SeÃ±or for Señor, or a run of accented Latin letters and symbols for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
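The patterns above can be checked outside the database too. The following is a minimal Python sketch, not part of the original answer; the sample characters and the helper name utf8_hex are chosen here purely for illustration. It prints the UTF-8 bytes of one character per category in the same form as MySQL's HEX().

```python
# Hypothetical sample characters, one for each row of the list above.
samples = {
    "space": " ",
    "English": "a",
    "Western Europe": "é",
    "Cyrillic": "Д",
    "Asia": "新",
    "Emoji": "👽",
}


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as upper-case hex, similar to HEX(col)."""
    return text.encode("utf-8").hex().upper()


for label, text in samples.items():
    print(f"{label:15} {text!r:6} {utf8_hex(text)}")
```

If the hex you get from the database for the same text does not match what this prints, the stored bytes are probably not plain UTF-8.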
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
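To see how question marks and black diamonds arise, here is a small Python sketch. It only imitates the effect with Python's own codecs (ASCII stands in for any target character set that lacks ñ); it is not how MySQL does the conversion internally.

```python
original = "Señor"

# Question marks: the target character set has no ñ, so the character
# is replaced by a plain "?" and the information is lost.
question_marks = original.encode("ascii", errors="replace").decode("ascii")

# Black diamond: the text was encoded as latin1 (ñ is the single byte F1),
# but the reader interprets the bytes as UTF-8. F1 on its own is not valid
# UTF-8, so it is shown as the replacement character U+FFFD.
latin1_bytes = original.encode("latin-1")
black_diamond = latin1_bytes.decode("utf-8", errors="replace")

print(question_marks)
print(black_diamond)
```

In the first case the original character cannot be recovered from the result; in the second, the stored bytes may still be intact and only the interpretation is wrong.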
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
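The double-encoding hex values quoted above can be reproduced with a short Python sketch. It assumes that Python's cp1252 codec is a reasonable stand-in for MySQL's latin1; the function names are made up for this illustration.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as upper-case hex."""
    return text.encode("utf-8").hex().upper()


# Mojibake: what a reader sees when UTF-8 bytes are shown as cp1252.
mojibake = misread_as_cp1252("Señor")

# Double encoding: the misread characters are encoded as UTF-8 a second time.
double_e = utf8_hex(misread_as_cp1252("é"))
double_alien = utf8_hex(misread_as_cp1252("👽"))

print(mojibake)
print(double_e)
print(double_alien)
```

Comparing HEX(col) from the database against the output of utf8_hex() for the text you expected is one way to tell correct storage from double encoding.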
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi (the PHP mysqli set_charset() function) when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
While making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it should work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to the header of the webpage where you collect the form data.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already have a large database with the above problem, you can try SIDU to export with the correct charset and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
I am trying to decode filenames in HTTP but the string from browser messages are different.
In my test file I put the name ç.jpg.
What I need is the name %C3%A7.jpg.
But the browser is sending %C3%83%C2%A7.jpg.
It's not UTF8, UTF16 or UTF32.
For another example I test the file name €.jpg.
What I need is the name %E2%82%AC.jpg.
But I am receiving %C3%A2%E2%80%9A%C2%AC.jpg.
How can I convert these names to UTF-8?
Ok I played with this for about 30 minutes and I finally figured it out.
This is how the original string was encoded:
The string was in UTF-8
Some encoding mechanism thought it was CP1252, and based on that wrong assumption re-encoded it to UTF-8 again.
The resulting string was url-encoded.
To get back to a real UTF-8 string, this is what I did. (note, I used PHP, don't know what you are using but it should be doable in other languages just the same).
$input = '%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7';
$str1 = urldecode($input);
echo iconv('UTF-8', 'CP1252', $str1);
// output "€ ç"
So that conversion is counterintuitive. We're converting to CP1252, but still end up with a UTF-8 string. This only works because an existing UTF-8 string was falsely treated as CP1252, and that incorrect interpretation was then converted to UTF-8. So I'm just reversing this double-encoding.
In other languages there might be a few more steps, this works in just 1 line with PHP because strings are bytes, not characters.
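For comparison, the same reversal might look like this in Python, where strings are characters rather than bytes, so the steps are spelled out. This is a sketch that assumes the cause really is the CP1252 double-encoding described above; the function name is made up for illustration.

```python
from urllib.parse import unquote


def undo_double_encoding(percent_encoded: str) -> str:
    """Reverse 'UTF-8 misread as CP1252, re-encoded as UTF-8, then URL-encoded'."""
    # unquote() percent-decodes and interprets the bytes as UTF-8 by default,
    # which yields the mojibake characters.
    mojibake = unquote(percent_encoded)
    # Encoding the mojibake as CP1252 restores the original UTF-8 bytes,
    # which can then be decoded properly.
    return mojibake.encode("cp1252").decode("utf-8")


print(undo_double_encoding("%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7"))
print(undo_double_encoding("%C3%83%C2%A7.jpg"))
```

If the input was not double-encoded in this way, the encode or decode step will usually raise an exception rather than return wrong text silently, which is a useful safety check.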
My object in R contains the following Unicode byte sequences, which were extracted from Twitter.
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
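The same idea (raw bytes only become readable once the right encoding is named) can be sketched in Python instead of R. The byte strings are taken from the examples above; only the first Tamil word is used to keep the sketch short.

```python
# First word of the Tamil example above: these bytes are UTF-8.
tamil_bytes = b"\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d"

# The Russian example above: these bytes are ISO 8859-5, not UTF-8.
russian_bytes = b"\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"

# Decoding with the correct codec yields readable text in both cases.
tamil = tamil_bytes.decode("utf-8")
russian = russian_bytes.decode("iso8859_5")

print(tamil)
print(russian)
```

Decoding russian_bytes as UTF-8 would fail, which mirrors why declaring the wrong encoding in R cannot fix the display.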
I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a csv-file and suddenly I have trouble with the string "total€".
It works for other special character (#,?) but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why that happens and maybe an idea to solve it? I checked the class of "file.column.name" and it returns "character", also tried to convert it into a character again and stuff like that but didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
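One way a mismatch like this can happen is shown in the Python sketch below. It assumes, purely for illustration, that the CSV header was saved as windows-1252 and read with a different single-byte encoding; the actual file in the question may differ.

```python
# "total€" as it would be stored in a windows-1252 file: € is the byte 0x80.
raw = b"total\x80"

# Read with the wrong encoding: 0x80 becomes the control character U+0080,
# which is not the euro sign, so comparisons and replacements fail.
wrong = raw.decode("latin-1")

# Read with the right encoding: 0x80 becomes €.
right = raw.decode("cp1252")

print("€" in wrong)
print(right.replace("€", "[euro]"))
```

This is why specifying the file's real encoding when reading matters more than any later conversion of the string.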
I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding,even though some characters are in Unicode format. How to ensure ANSI encoding?
newfile <- "/home/user/abc.ttl"
file.create(newfile)
# Read the input file line by line
text3 <- readLines("/home/user/init.ttl")
sprintf("readlines %d", length(text3))
# Append each line to the new file
for (k in seq_along(text3))
{
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky, since you need to detect your encoding upon input, and then you need to convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you are probably going to lose some non-translatable characters, since there is no mapping from UTF-8 characters to ASCII outside the lowest 128 code points (the 7-bit range). (Within this range, UTF-8 and ASCII are identical.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
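The effect of dropping untranslatable characters can be previewed with a short Python sketch. It is an analogue of the iconv() call with sub = "", not the R code itself, and the sample lines are invented for illustration.

```python
# Hypothetical input lines containing smart quotes and an accented letter.
lines = ["\u201csmart\u201d quotes", "café", "plain"]

# errors="ignore" silently drops every character that has no ASCII
# equivalent, instead of failing on the whole line.
ascii_lines = [
    line.encode("ascii", errors="ignore").decode("ascii") for line in lines
]

for line in ascii_lines:
    print(line)
```

Note that the dropped characters are gone for good, so it is worth checking the output before overwriting anything.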