I am trying to decode filenames in HTTP but the string from browser messages are different.
In my test file I put the name ç.jpg.
What I need is the name %C3%A7.jpg.
But the browser is sending %C3%83%C2%A7.jpg.
It's not UTF8, UTF16 or UTF32.
For another example I test the file name €.jpg.
What I need is the name %E2%82%AC.jpg.
But I am receiving %C3%A2%E2%80%9A%C2%AC.jpg.
how can I convert this names to UTF8?
Ok I played with this for about 30 minutes and I finally figured it out.
This is how the original string was encoded:
The string was in UTF-8
Some encoding mechanism thought it was CP1252, and based on that wrong assumption re-encoded it to UTF-8 again.
The resulting string was url-encoded.
To get back to a real UTF-8 string, this is what I did. (note, I used PHP, don't know what you are using but it should be doable in other languages just the same).
$input = '%C3%A2%E2%80%9A%C2%AC %C3%83%C2%A7';
$str1 = urldecode($input);
echo iconv('UTF-8', 'CP1252', $str1);
// output "€ ç"
So that conversion is counter intuitive. We're converting to CP1252, but still end up with a UTF-8 string. This only works because an existing UTF-8 was falsely treated as CP1252, and that incorrect interpretation was then converted to UTF-8. So I'm just reversing this double-encoding.
In other languages there might be a few more steps, this works in just 1 line with PHP because strings are bytes, not characters.
Related
I am encrypting the plain text using RSA and converting that value to base64 string.But while decrypting the I altered the base64 string and try to decrypt it...it given me same original text return.
Is there any thing wrong ?
Original Plain Text :007189562312
Output Base64 string : VfZN7WXwVz7Rrxb+W08u9F0N9Yt52DUnfCOrF6eltK3tzUUYw7KgvY3C8c+XER5nk6yfQFI9qChAes/czWOjKzIRMUTgGPjPPBfAwUjCv4Acodg7F0+EwPkdnV7Pu7jmQtp4IMgGaNpZpt33DgV5AJYj3Uze0A3w7wSQ6/tIgL4=
Altered Base64 String : VfZN7WXwVz7Rrxb+W08u9F0N9Yt52DUnfCOrF6eltK3tzUUYw7KgvY3C8c+XER5nk6yfQFI9qChAes/czWOjKzIRMUTgGPjPPBfAwUjCv4Acodg7F0+EwPkdnV7Pu7jmQtp4IMgGaNpZpt33DgV5AJYj3Uze0A3w7wSQ6/tIgL4=55
Please explain. Thank you.
I'm assuming you're asking whether the altered ciphertext should have thrown an error when decrypting. It looks like the altered string only adds two characters to the end and is otherwise the same string.
Your Base 64 library probably makes some reasonable assumptions when parsing Base 64 data. Base 64 works by encoding 3 bytes into 4 characters. If at the end the data length is not a multiple of 3 it must be padded. That is signalized by the = at the end of the encoded string.
This also means that during parsing, the library knows that padding characters are at the end and stops parsing there. If the alteration appeared at the end of the string then the encoded ciphertext didn't effectively change.
I was trying to find a solution for my problem and after looking at the forums I couldn't so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as bellow:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road" cause when I use utf-8 encoding it automatically converts "'" character by a x92 value.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII or replace them on encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.
I have a string (comprised of a userID and a date/time stamp), which I then encrypt using ColdFusion's Encrypt(inputString, myKey, "Blowfish/ECB/PKCS5Padding", "Hex").
In order to interface with a 3d party I have to then perform the following:
Convert each character pair within the resultant string into a HEX value.
HEX values are then represented as integers.
Resultant integers are then output as ASCII characters.
All the ASCII characters combine to form a Bytestring.
Bytestring is then converted to Base64.
Base64 is URL encoded and finally sent off (phew!)
It all works seamlessly, APART FROM when the original cfEncrypted string contains a "00".
The HEX value 00 translates as the integer (via function InputBaseN) 0 which then refuses to translate correctly into an ASCII character!
The resultant Bytestring (and therefore url string) is messed up and the 3d party is unable to decipher it.
It's worth mentioning that I do declare: <cfcontent type="text/html; charset=iso-8859-1"> at the top of the page.
Is there any way to correctly output 00 as ASCII? Could I avoid having "00" within the original encrypted string? Any help would be greatly appreciated :)
I'm pretty sure ColdFusion (and the Java underneath) use a null-terminated string type. This means that every string contains one and only one asc(0) char, which is the string terminator. If you try to insert an asc(0) into a string, CF is erroring because you are trying to create a malformed string element.
I'm not sure what the end solution is. I would play around with toBinary() and toString(), and talk to your 3rd party vendor about workarounds like sending the raw hex values or similar.
Actually there is a very easy solution. The credit card company who is processing your request needs you to convert it to lower case letters of hex. The only characters processed are :,-,0-9 do a if else and convert them manually into a string.
I have a list of character that display fine in WebBrowser in the form of encoded characters such as ...
But when posting these characters onto server to I realized that HttpUtility.HtmlDecode cannot convert them to characters as browser did, they all become space.
text = System.Web.HttpUtility.HtmlDecode("");
I expect it to return € but it return space instead. The same thing happen for some other characters as well.
Does anyone know how to fix this or any workaround?
This is commonly result of using literal values and mixing UTF-8 and ASCII. In UTF-8 euro sign is encoded as 3 bytes so there is no ASCII counterpart for it.
Update
Your code is illegal if you are using UTF-8 since it only supports the first 128 characters and the rest are encoded is multiple bytes. You need to use the Unicode syntax:
// !!! NOT HtmlDecode!!!
text = System.Web.HttpUtility.UrlDecode("%E2%82%AC");
UPDATE
OK, I have left the code as it was but added the comment that it does not work. It does not work because it is not an encoding which is of concern for HTML - it is not an HTML. This is of concern for the URL and as such you need to use UrlDecode instead.
ASCII is 7-Bit; there are no characters 128 through 255. The MSDN article you linked is following the long tradition of pretending ASCII is 8-Bit; the article actually shows code page 437.
I'm not sure why you're not simply writing € (compatibility?), but € or € should do, too.
You typically want to do something like:
string html = ""
string trash = WebUtility.HtmlDecode(html);
//Convert from default encoding to UTF8
byte[] bytes = Encoding.Default.GetBytes(trash);
string proper = Encoding.UTF8.GetString(bytes);
I need to pass 2 parameters in a query string but would like them to appear as a single parameter to the user. At a low level, how can I concatinate these two values and then later separate them? Both values are Base64 encoded.
?Name=abcyxz
where both abc and xyz are separate Base64 encoded strings.
why don't you just do something like this
temp = base64_encode("var1=abc&var2=yxz")
and then call
?Name=temp
Later you can decode the whole string and split the vars.
(sry for pseudo code :P)
Edit: a small quote from wikipedia
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
You should either use some separator or store the length of the first item.
First of all, I would be curious as to why you can't just pass two parameters. But with that as a given, just choose any character that's a valid character in a URL query string, but won't show up in your base64 encoding, such as ~