I have a list of character that display fine in WebBrowser in the form of encoded characters such as ...
But when posting these characters onto server to I realized that HttpUtility.HtmlDecode cannot convert them to characters as browser did, they all become space.
text = System.Web.HttpUtility.HtmlDecode("");
I expect it to return € but it return space instead. The same thing happen for some other characters as well.
Does anyone know how to fix this or any workaround?
This is commonly result of using literal values and mixing UTF-8 and ASCII. In UTF-8 euro sign is encoded as 3 bytes so there is no ASCII counterpart for it.
Update
Your code is illegal if you are using UTF-8 since it only supports the first 128 characters and the rest are encoded is multiple bytes. You need to use the Unicode syntax:
// !!! NOT HtmlDecode!!!
text = System.Web.HttpUtility.UrlDecode("%E2%82%AC");
UPDATE
OK, I have left the code as it was but added the comment that it does not work. It does not work because it is not an encoding which is of concern for HTML - it is not an HTML. This is of concern for the URL and as such you need to use UrlDecode instead.
ASCII is 7-Bit; there are no characters 128 through 255. The MSDN article you linked is following the long tradition of pretending ASCII is 8-Bit; the article actually shows code page 437.
I'm not sure why you're not simply writing € (compatibility?), but € or € should do, too.
You typically want to do something like:
string html = ""
string trash = WebUtility.HtmlDecode(html);
//Convert from default encoding to UTF8
byte[] bytes = Encoding.Default.GetBytes(trash);
string proper = Encoding.UTF8.GetString(bytes);
Related
I am trying to decode filenames in HTTP but the string from browser messages are different.
In my test file I put the name ç.jpg.
What I need is the name %C3%A7.jpg.
But the browser is sending %C3%83%C2%A7.jpg.
It's not UTF8, UTF16 or UTF32.
For another example I test the file name €.jpg.
What I need is the name %E2%82%AC.jpg.
But I am receiving %C3%A2%E2%80%9A%C2%AC.jpg.
how can I convert this names to UTF8?
Ok I played with this for about 30 minutes and I finally figured it out.
This is how the original string was encoded:
The string was in UTF-8
Some encoding mechanism thought it was CP1252, and based on that wrong assumption re-encoded it to UTF-8 again.
The resulting string was url-encoded.
To get back to a real UTF-8 string, this is what I did. (note, I used PHP, don't know what you are using but it should be doable in other languages just the same).
$input = '%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7';
$str1 = urldecode($input);
echo iconv('UTF-8', 'CP1252', $str1);
// output "€ ç"
So that conversion is counter intuitive. We're converting to CP1252, but still end up with a UTF-8 string. This only works because an existing UTF-8 was falsely treated as CP1252, and that incorrect interpretation was then converted to UTF-8. So I'm just reversing this double-encoding.
In other languages there might be a few more steps, this works in just 1 line with PHP because strings are bytes, not characters.
I am encrypting the plain text using RSA and converting that value to base64 string.But while decrypting the I altered the base64 string and try to decrypt it...it given me same original text return.
Is there any thing wrong ?
Original Plain Text :007189562312
Output Base64 string : VfZN7WXwVz7Rrxb+W08u9F0N9Yt52DUnfCOrF6eltK3tzUUYw7KgvY3C8c+XER5nk6yfQFI9qChAes/czWOjKzIRMUTgGPjPPBfAwUjCv4Acodg7F0+EwPkdnV7Pu7jmQtp4IMgGaNpZpt33DgV5AJYj3Uze0A3w7wSQ6/tIgL4=
Altered Base64 String : VfZN7WXwVz7Rrxb+W08u9F0N9Yt52DUnfCOrF6eltK3tzUUYw7KgvY3C8c+XER5nk6yfQFI9qChAes/czWOjKzIRMUTgGPjPPBfAwUjCv4Acodg7F0+EwPkdnV7Pu7jmQtp4IMgGaNpZpt33DgV5AJYj3Uze0A3w7wSQ6/tIgL4=55
Please explain. Thank you.
I'm assuming you're asking whether the altered ciphertext should have thrown an error when decrypting. It looks like the altered string only adds two characters to the end and is otherwise the same string.
Your Base 64 library probably makes some reasonable assumptions when parsing Base 64 data. Base 64 works by encoding 3 bytes into 4 characters. If at the end the data length is not a multiple of 3 it must be padded. That is signalized by the = at the end of the encoded string.
This also means that during parsing, the library knows that padding characters are at the end and stops parsing there. If the alteration appeared at the end of the string then the encoded ciphertext didn't effectively change.
I'm having an encoding problem related to cookies on one of my websites.
A user is inputing Usuário, which has an acute accent, and that's being put in a cookie. The raw HEX for the cookie response is (for the Usuário string):
55 73 75 C3 A1 72 69 6F
When I see it in the browser, it looks like this:
...which is really messy. I need to fix this up.
Then I went to this website: http://www.rapidtables.com/convert/number/hex-to-ascii.htm and converted the HEX value to see how it would look like. And I got the same output:
Right. This means the HEX code is wrong. Then I tried to convert Usuário to ASCII to see how it should be. I used this WebSite: http://www.asciitohex.com/ and this is the result:
For my surprise, the HEX is exactly the one that is showing up messy. Why???
And how do I represent Usuário in ASCII so I can put it in a cookie? Should I manually encode it?
PS: I'm using ASP.NET, just in case it matters.
As of 2015 the standard of the web to store character data is UTF-8 and not ASCII. ASCII actually only contains the first 128 characters of the codepage, and does not include any kind of accented characters. To add accented characters to this 128 characters there were many legacy solutions: codepages. They each added 128 different characters to the default ASCII list thereby allowing representing 256 different characters.
The problem was, that this didn't properly solve the issue: ASCII based codepages were more or less incomatible with each other (except for the first 128 characters), and there was usually no way of programatically knowing which codepage was in used.
One of the solutions was UTF-8, which is a way to encode the unocde character set (containing most of the characters used around the world, and more) while trying to remain compatible with ASCII. The first 128 characters are actually the same in both cases, but afterwards UTF-8 characters become multi-byte: one character is encoded using a series of bytes (usually 2-3, depends on which character needs to be encoded)
The problem is if you are using some kind of ASCII based single byte codebase (like ISO-8859-1), which encodes supported characters in single bytes, but your input is actually UTF-8, which will encode accented characters in multiple bytes (you can see this in your HEX example. á is encoded as C3 A1: two bytes). If you try to read these two bytes in an ASCII based codepage, which uses single bytes for every characters (in West-Europe this codepage is usually ISO-8859-1), then each of this two bytes will be reprensented with two different characters.
In the web world the default encoding is UTF-8, so your clients will usually send their requests using UTF-8. ASP.NET is Unicode aware, so it can handle these requests. However somewere in your code this UTF-8 is converted acccidentally into ISO-8859-1, and then back into UTF-8. This might happen on various layers. As you have issues it probably happens at the cookie layer, which is sometimes problematic (here is how it worked in 2009). You should also double check your application that it uses UTF-8 everywhere else though (views, database, etc.), if you want to properly support accented characters.
I was trying to find a solution for my problem and after looking at the forums I couldn't so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as bellow:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road" cause when I use utf-8 encoding it automatically converts "'" character by a x92 value.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII or replace them on encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.
I need to create DataMatrix barcodes which may contain non-Latin characters. I have code which creates the barcodes correctly when they only consist of Latin characters; when I run the same code with non-Latin (Hebrew or Russian) characters, however, although the code runs to completion and the barcode is created, the non-Latin characters are not deciphered by the barcode reader.
Any assistance or ideas would be greatly appreciated!
Your issue is related to the character encoding used prior to generating the barcode. The encoding used by the generator to encode must match the encoding used by the reader to decode.
Possible encodings are:
Extended Channel Interpretations (ECI) is supported by DataMatrix and other 2D barcode standards. The generator places an ECI identifying code inside the barcode data, so the reader knows to use ECI to correctly convert the data back to text.
UTF-8 encodes pretty much any language.
Code page is an older encoding, but if your generator is using it, you can use 1255 for Hebrew code page or 1251 for Russian code page. See this SO answer for more info.
To test your encoding, try Inlite's Online Barcode Reader (OBR) which should read the correct text for ECI and UTF-8 encoded barcodes. If it does, the problem is with your barcode reader which is not decoding correctly.
If OBR returns binary data, either your generator uses code page or does not encode correctly at all. Try another generator that supports ECI or UTF-8.