I am trying to decode filenames sent over HTTP, but the strings in the browser messages are different from what I expect.
In my test file I put the name ç.jpg.
What I need is the name %C3%A7.jpg.
But the browser is sending %C3%83%C2%A7.jpg.
It's not UTF-8, UTF-16 or UTF-32.
As another example, I tested the file name €.jpg.
What I need is the name %E2%82%AC.jpg.
But I am receiving %C3%A2%E2%80%9A%C2%AC.jpg.
How can I convert these names to UTF-8?
Ok I played with this for about 30 minutes and I finally figured it out.
This is how the original string was encoded:
1. The string was in UTF-8.
2. Some encoding mechanism thought it was CP1252 and, based on that wrong assumption, re-encoded it to UTF-8 again.
3. The resulting string was url-encoded.
To get back to a real UTF-8 string, this is what I did. (Note: I used PHP; I don't know what you are using, but it should be doable in other languages just the same.)
$input = '%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7';
$str1 = urldecode($input);             // undo the url-encoding
echo iconv('UTF-8', 'CP1252', $str1);  // undo the bogus CP1252 -> UTF-8 re-encoding
// output "€ ç"
So that conversion is counter-intuitive. We're converting to CP1252, yet we still end up with a UTF-8 string. This only works because an existing UTF-8 string was falsely treated as CP1252, and that incorrect interpretation was then converted to UTF-8 again. So I'm just reversing the double-encoding.
In other languages there might be a few more steps; this works in a single line in PHP because PHP strings are byte sequences, not character sequences.
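For example, here is a rough equivalent in base R, shown only as a sketch of the same reversal (the variable and value layout is mine); it needs a couple of extra steps because R tracks declared encodings on its strings:
input <- "%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7"
str1  <- URLdecode(input)                 # undo the url-encoding; the bytes are UTF-8 of the mojibake "â‚¬ Ã§"
raw1  <- iconv(str1, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]  # undo the bogus CP1252 -> UTF-8 step
fixed <- rawToChar(raw1)                  # these bytes are the original UTF-8 ...
Encoding(fixed) <- "UTF-8"                # ... so declare them as such
fixed                                     # "€ ç"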
I'm dealing with text in which I find UTF-16 encoded as UTF-8, and I am unable to translate from one to the other in the R language.
For example, looking at this code point (https://codepoints.net/U+D83D), its representation in UTF-8 is the text string "ED A0 BD", and I want to convert it to the text string "D8 3D".
How can I achieve this?
More info on what I want to achieve: stackoverflow.com/questions/35670238/emoji-in-r-utf-8-encoding
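For what it's worth, the relationship between the two strings in the example is just the 3-byte UTF-8 packing of a 16-bit value: ED A0 BD carries the bits of 0xD83D. A small sketch of that arithmetic in base R (byte values taken from the question; this is not a full answer to the emoji problem):
bytes <- as.integer(c(0xED, 0xA0, 0xBD))     # the UTF-8 bytes from the question
unit <- bitwOr(bitwOr(bitwShiftL(bitwAnd(bytes[1], 0x0F), 12),
                      bitwShiftL(bitwAnd(bytes[2], 0x3F), 6)),
               bitwAnd(bytes[3], 0x3F))      # keep each byte's payload bits and reassemble them
sprintf("%04X", unit)                        # "D83D"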
I was wondering if there is a short way to get the accents of a character string in R. For instance, when the accented string is "Université d'Aix-Marseille", with my script I get "Universit%C3A9 d%27Aix-Marseille". Is there any function or algorithm to get the former directly?
I should point out that the file from which I get all my character strings is encoded in UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.
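One more guess based on the example in the question: the string looks percent-encoded (URL-encoded) rather than just wrongly marked, so it may also need URL-decoding first. A small sketch, assuming the escapes are in the usual %C3%A9 form:
s <- URLdecode("Universit%C3%A9 d%27Aix-Marseille")   # undo the percent-escapes
Encoding(s) <- "UTF-8"                                 # the decoded bytes are UTF-8
s                                                      # "Université d'Aix-Marseille"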
I was trying to find a solution to my problem, and after looking through the forums I couldn't find one, so I'll explain it here.
We receive a CSV file from a client with some special characters, encoded as unknown-8bit. We convert this CSV file to XML using an awk script. With the XML file we make an API call to our system using UTF-8 as the default encoding. The response is an error with the following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because when I use UTF-8 encoding the "’" character is automatically converted to a 0x92 value.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first; the best option would be to ask the sender.
If you cannot do so, and also for sanity-checking, the Unix command file is very useful for that.
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII characters, or replace them during conversion, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step: do what you wanted to do, in the comforting knowledge that you now have valid UTF-8.
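As a sketch of the convert-to-UTF-8 step, assuming the file does turn out to be Windows-1252 (the file names here are made up, and R is used only as an illustration; the command-line iconv tool or any language's equivalent does the same job):
lines <- readLines("client.csv")                       # read the lines as-is, bytes untouched
utf8  <- iconv(lines, from = "CP1252", to = "UTF-8")   # reinterpret as Windows-1252, re-encode as UTF-8
writeLines(utf8, "client-utf8.csv", useBytes = TRUE)   # write the UTF-8 bytes out unchanged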
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know: it would be whatever your current set-up is using). I would guess Windows-1252, which uses 0x92 for a curly right single quote.
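A quick way to check that guess, in R purely because it is handy for poking at single bytes: byte 0x92, read as Windows-1252, is indeed the curly right single quote.
iconv(rawToChar(as.raw(0x92)), from = "CP1252", to = "UTF-8")   # "’" (U+2019)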
When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines the character codes up to 127 (0x7F); everything after that is another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use: the locale command will report your current locale settings, and the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character-encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that's probably the character set you want; the locale would be something like en_US.ISO8859-1 for that character set with US/English messages, date formats, currency settings, etc.
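To tie the numbers in the question together: /273 is the octal form of the same byte, 0xBB, i.e. decimal 187, and that byte is » in ISO-8859-1. A quick check, done in R only because it is convenient for looking at single bytes:
strtoi("273", base = 8L)                                            # 187, i.e. 0xBB
iconv(rawToChar(as.raw(0xBB)), from = "ISO-8859-1", to = "UTF-8")   # "»"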