Is a (unicode) String encoding neutral? - asp.net

In .NET, a string is a Unicode character string. My understanding is that the string itself does not carry any particular encoding information, i.e. it is encoding neutral? You can use any encoding to turn a string into a stream of bytes, and then turn that stream of bytes back into a recognizable string, as long as the same encoding is used in both directions?

In .NET a string consists of UTF-16 code units. There is no such thing as a "Unicode string". A string could be a UCS-2 or UCS-4 string, or use one of the transformation formats such as UTF-7, UTF-8 or UTF-16, but you cannot simply call it "Unicode". It is important to understand the difference between them.
I know that somebody on the .NET team named a property of the Encoding class "Unicode", but it was a mistake. The class also contains a "Default" property, which is another mis-named member. This leads to many defects (most people don't read the manuals and simply don't realize that "Unicode" means UTF-16 and "Default" means the default OS code page).
As for the second part of your question, the answer is unfortunately no. It would be "yes", but there is one small problem: the GB18030 encoding, the standard encoding for the People's Republic of China. It has assigned code points which simply don't exist in the Unicode standard (yet). Possibly a new version of the Unicode standard will resolve this issue.
One important point here (going back to UTF-16) is that bytes are not necessarily a safe unit for conversions. The problem is related to surrogate pairs: you have to be careful, because a single character can be encoded as a surrogate pair, i.e. two 16-bit code units, meaning four bytes.
If you don't care about supporting the GB18030 encoding, you can use the method you mention freely. If by chance you want to sell your software in China, you will need to support it, and of course you will have to be very careful (extensive testing will be needed).
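To make the round trip concrete, here is a minimal C# sketch (the sample text is made up): converting with the same Encoding in both directions restores the original string, while an encoding that cannot represent every code point, such as ASCII, silently loses data.

using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        string original = "naïve – 汉字";              // contains non-ASCII characters

        // Matched encode/decode: the string survives unchanged.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes) == original);      // True

        // Same idea with UTF-16, which .NET confusingly calls Encoding.Unicode.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(original);
        Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes) == original);  // True

        // A lossy round trip: ASCII cannot represent these code points,
        // so the unsupported characters are replaced with '?'.
        byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes));                // na?ve ? ??
    }
}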

Yes, with the caveat that many encoding schemes can't represent all Unicode code points, which makes some round trips lossy.

"Unicode" in .NET is UTF-16 (historically UCS-2, 2 bytes per code unit). It is itself an encoding of the full Unicode character set, which needs up to 32 bits (4 bytes, UCS-4) to identify every character. So you can serialize the bytes as-is, and any system that supports UTF-16 will deserialize them properly.

Related

How do you generate a "safe" password for use in web.config?

I need to programmatically generate a password for a SQL Server connection string that will be stored in web.config.
The password contained < which interfered with the build and deployment process. I would like to avoid having this and other "problem" characters produced by a password generator. Is there a list of such characters, or some code snippet that will generate a secure but XML-"safe" password?
Base64 encoded characters are all considered safe.
The particular set of 64 characters chosen to represent the 64 place-values for the base varies between implementations. The general strategy is to choose 64 characters that are both members of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean. For example, MIME's Base64 implementation uses A–Z, a–z, and 0–9 for the first 62 values. Other variations share this property but differ in the symbols chosen for the last two values; an example is UTF-7.
This is the same encoding that is used, for instance, to embed image data inside an HTML document. You can easily encode to Base64 using string libraries or online tools, and they'll handle special characters like spaces, commas and brackets. For example:
hello there, how (are) <you> doing?
Encodes to
aGVsbG8gdGhlcmUsIGhvdyAoYXJlKSA8eW91PiBkb2luZz8=
It's also worth noting that .NET probably has libraries for generating safe hashes and passwords.
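A short C# sketch of the idea (the key length and the use of RandomNumberGenerator are illustrative choices, not a prescription): generate cryptographically random bytes and Base64-encode them; the result can only contain characters from the Base64 alphabet, so no '<', '>' or '&' can ever appear.

using System;
using System.Security.Cryptography;

class PasswordGenerator
{
    // Returns a random password made only of Base64 characters,
    // all of which are safe to paste into an XML attribute or element.
    static string GeneratePassword(int byteLength = 24)
    {
        byte[] randomBytes = new byte[byteLength];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(randomBytes);
        }
        // Base64 output never contains '<', '>', '&' or quotes.
        return Convert.ToBase64String(randomBytes);
    }

    static void Main()
    {
        Console.WriteLine(GeneratePassword());
    }
}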

Does IMAP protocol support binary inside multi-part body?

IMAP RFC:
8-bit textual and binary mail is supported through the use of a
[MIME-IMB] content transfer encoding. IMAP4rev1 implementations MAY
transmit 8-bit or multi-octet characters in literals, but SHOULD do
so only when the [CHARSET] is identified.
Although a BINARY body encoding is defined, unencoded binary
strings are not permitted. A "binary string" is any string with
NUL characters. Implementations MUST encode binary data into a
textual form, such as BASE64, before transmitting the data. A
string with an excessive amount of CTL characters MAY also be
considered to be binary.
If an implementation has to convert to base64, why does the RFC say "a BINARY body encoding is defined"? Since every time we need to send the data as base64 (or some other textual format), effectively binary is not supported. Or am I reading something wrong?
IMAP supports MIME multipart; can the parts inside it contain binary data, that is, Content-Transfer-Encoding: binary?
I am new to IMAP/HTTP. The reason for asking this question is that I have to develop a server which supports both HTTP and IMAP. Over HTTP the server receives the data in binary (HUGE multipart data, with Content-Transfer-Encoding: binary), and a FETCH can then be done over IMAP. The problem is that I need to parse the data and convert each part inside the multipart to base64 if IMAP doesn't support binary, which I think is a severe performance issue.
The answer is unfortunately "maybe".
The MIME RFC supports binary, but the IMAP RFC specifically disallows sending NULL characters. This is likely because they can be confusing for text based parsers, especially those written in C, where NULL has the meaning of End of String.
Some IMAP servers just consider the body to be a "bag of bytes", and I suspect few, if any, actually do re-encoding. So if you ask for the entire message, you will probably get its literal content.
If your clients can handle MIME-Binary, you will probably be fine.
There is RFC 3516 for an IMAP extension to support BINARY properly, but this is not widely deployed.
As a side note: why are you using Multipart MIME? That is an odd implementation choice for HTTP.
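If you do end up having to re-encode binary parts into a textual form before serving them over plain IMAP, the core operation is just a base64 transform of the part body plus a matching Content-Transfer-Encoding header. A rough C# sketch, with a made-up byte array standing in for the part body:

using System;

class PartReencoder
{
    // Turns a raw binary MIME part body into base64 text wrapped at
    // 76 characters per line, the line length MIME expects.
    static string ToBase64Body(byte[] binaryBody)
    {
        return Convert.ToBase64String(binaryBody,
                                      Base64FormattingOptions.InsertLineBreaks);
    }

    static void Main()
    {
        // A made-up part body containing NUL bytes, which the IMAP RFC forbids in literals.
        byte[] body = { 0x00, 0xFF, 0x10, 0x89 };

        Console.WriteLine("Content-Transfer-Encoding: base64");
        Console.WriteLine();
        Console.WriteLine(ToBase64Body(body));
    }
}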

Should I Use Base64 or Unicode for Storing Hashes & Salts?

I have never worked on the security side of web apps, as I am just out of college. Now, I am looking for a job and working on some websites on the side, to keep my skills sharp and gain new ones. One site I am working on is pretty much copied from the original MEAN stack from the guys that created it, but trying to understand it and do things better where I can.
To compute the hash & salt, the creators used PBKDF2. I am not interested in hearing about arguments for or against PBKDF2, as that is not what this question is about. They seem to have used buffers for everything here, which I understand is a common practice in node. What I am interested in are their reasons for using base64 for the buffer encoding, rather than simply using UTF-8, which is an option with the buffer object. Most computers nowadays can handle many of the characters in Unicode, if not all of them, but the creators could have chosen to encode the passwords in a subset of Unicode without restricting themselves to the 65 characters of base64.
By "the choice between encoding as UTF-8 or base64", I mean transforming the binary of the hash, computed from the password, into the given encoding. node.js specifies a couple ways to encode binary data into a Buffer object. From the documentation page for the Buffer class:
Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP
streams or the file system, it's necessary to handle octet streams. Node has several
strategies for manipulating, creating, and consuming octet streams.
Raw data is stored in instances of the Buffer class. A Buffer is similar to an array
of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer
cannot be resized.
What the Buffer class does, as I understand it, is take some binary data and calculate the value of each 8 (usually) bits. It then converts each set of bits into a character corresponding to its value in the encoding you specify. For example, if the binary data is 00101100 (8 bits), and you specify UTF-8 as the encoding, the output would be , (a comma). This is what anyone looking at the output of the buffer would see when looking at it with a text editor such as vim, as well as what a computer would "see" when "reading" them. The Buffer class has several encodings available, such as UTF-8, base64, and binary.
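To make that mapping concrete, a tiny sketch (in C#, since the idea is language-independent; the byte value is the one from the example above): decoding the single byte 0x2C as UTF-8 yields a comma, while base64 maps the same byte into its own 64-character alphabet.

using System;
using System.Text;

class ByteToCharacterDemo
{
    static void Main()
    {
        byte[] data = { 0b0010_1100 };                     // the example byte above (0x2C)

        Console.WriteLine(Encoding.UTF8.GetString(data));  // ","    -- interpreted as a UTF-8 character
        Console.WriteLine(Convert.ToBase64String(data));   // "LA==" -- mapped into the base64 alphabet
    }
}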
I think they felt that, while storing any imaginable UTF-8 character in the hash (as they would have to do) would not faze most modern computers, with their gigabytes of RAM and terabytes of space, actually showing all these characters, as they might want to do in logs, etc., would freak out users, who would have to look at weird Chinese, Greek, Bulgarian, etc. characters, as well as control characters like Ctrl, Backspace or even beeps. They would never really need to make sense of any of them unless they were experienced users testing PBKDF2 itself, but a programmer's first duty is to not give any of his users a heart attack. Using base64 increases the overhead by about a third, which is hardly worth noting these days, and shrinks the character set, which does nothing to decrease the security. After all, computers work entirely in binary. As I said before, they could have chosen a different subset of Unicode, but base64 is already standard, which makes it easier and reduces programmer work.
Am I right about the reasons why the creators of this repository chose to encode its passwords in base64, instead of all of Unicode? Is it better to stick with their example, or should I go with Unicode or a larger subset of it?
A hash value is a sequence of bytes. This is binary information. It is not a sequence of characters.
UTF-8 is an encoding for turning sequences of characters into sequences of bytes. Storing a hash value "as UTF-8" makes no sense, since it is already a sequence of bytes, and not a sequence of characters.
Unfortunately, many people have taken to the habit of considering a byte as some sort of character in disguise; it was at the basis of the C programming language and still infects some rather modern and widespread frameworks such as Python. However, only confusion and sorrow lie down that path. The usual symptoms are people wailing and whining about the dreadful "character zero" -- meaning a byte of value 0 (a perfectly fine value for a byte) that, turned into a character, becomes the special character that serves as the end-of-string indicator in languages from the C family. This confusion can even lead to vulnerabilities (the zero implying, for the comparison function, an earlier-than-expected termination).
Once you have understood that binary is binary, the problem becomes: how are we to handle and store our hash value? In particular in JavaScript, a language that is known to be especially poor at handling binary values. The solution is an encoding that turns the bytes into characters -- not just any characters, but a very small subset of well-behaved characters. This is called Base64. Base64 is a generic scheme for encoding bytes into character strings that don't include problematic characters (no zero, only printable ASCII characters, excluding all the control characters and a few others such as quotes).
Not using Base64 would imply assuming that JavaScript can manage an arbitrary sequence of bytes as if it was just "normal characters", and that is simply not true.
There is a fundamental, security-related reason to store as Base64 rather than Unicode: the hash may contain the byte value "0", used by many programming languages as an end-of-string marker.
If you store your hash as Unicode, you, another programmer, or some library code you use may treat it as a string rather than a collection of bytes, and compare using strcmp() or a similar string-comparison function. If your hash contains the byte value "0", you've effectively truncated your hash to just the portion before the "0", making attacks much easier.
Base64 encoding avoids this problem: the byte value "0" cannot occur in the encoded form of the hash, so it doesn't matter if you compare encoded hashes using memcmp() (the right way) or strcmp() (the wrong way).
This isn't just a theoretical concern, either: there have been multiple cases of code for checking digital signatures using strcmp(), greatly weakening security.
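The repository under discussion is Node code, but the store-as-Base64 pattern described in these answers looks much the same in any stack. A rough C# sketch (the salt size, iteration count and output length below are illustrative choices, not recommendations):

using System;
using System.Security.Cryptography;

class HashStorageDemo
{
    static void Main()
    {
        string password = "correct horse battery staple";

        // The salt is raw bytes, not characters.
        byte[] salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(salt);
        }

        // PBKDF2 output is also raw bytes; any of them may be 0x00.
        byte[] hash;
        using (var pbkdf2 = new Rfc2898DeriveBytes(password, salt, 100_000, HashAlgorithmName.SHA256))
        {
            hash = pbkdf2.GetBytes(32);
        }

        // Base64 turns both byte sequences into well-behaved printable strings
        // that can be stored, logged and compared without NUL/truncation surprises.
        string record = Convert.ToBase64String(salt) + "$" + Convert.ToBase64String(hash);
        Console.WriteLine(record);
    }
}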
This is an easy answer, since there are an abundance of byte sequences which are not well-formed UTF-8 strings. The most common one is a continuation byte (0x80-0xbf) that is not preceded by a leading byte in a multibyte sequence (0xc0-0xf7); bytes 0xf8-0xff aren't valid either.
So these byte sequences are not valid UTF-8 strings:
0x80
0x40 0xa0
0xff
0xfe
0xfa
If you want to encode arbitrary data as a string, use a scheme that allows it. Base64 is one of those schemes.
An additional point: you might think to yourself, well, I don't really care whether they're well-formed UTF-8 strings; I'm never going to use the data as a string, I just want to hand this byte sequence over to be stored for later.
The problem with that, is if you give an arbitrary byte sequence to an application expecting a UTF-8 string, and it is not well-formed, the application is not obligated to make use of this byte sequence. It might reject it with an error, it might truncate the string, it might try to "fix" it.
So don't try to store arbitrary byte sequences as a UTF-8 string.
Base64 is better, but consider a websafe base64 alphabet for transport. Base64 can conflict with querystring syntax.
Another option you might consider is using hex. It's longer, but it seldom conflicts with any syntax.
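A small sketch of those last two suggestions in C# (the '-'/'_' substitution is the RFC 4648 base64url alphabet; the sample hash bytes are made up):

using System;

class EncodingVariants
{
    static void Main()
    {
        // Made-up hash bytes for illustration.
        byte[] hash = { 0x00, 0xFA, 0x3E, 0xBC, 0x7F, 0x10 };

        // Standard base64: compact, but '+' and '/' clash with query-string syntax.
        string b64 = Convert.ToBase64String(hash);

        // RFC 4648 "base64url": swap the two troublesome characters
        // (padding handling is simplified here).
        string b64Url = b64.Replace('+', '-').Replace('/', '_').TrimEnd('=');

        // Hex: two characters per byte, longer than base64, but never conflicts with syntax.
        string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();

        Console.WriteLine(b64);     // APo+vH8Q
        Console.WriteLine(b64Url);  // APo-vH8Q
        Console.WriteLine(hex);     // 00fa3ebc7f10
    }
}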

application/x-www-form-urlencoded or multipart/form-data?

In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence of (unencoded) binary data
the need to transfer additional data (like filename)
I basically found no formal guidance on the web regarding the use of the different content-types so far.
TL;DR: if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.
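From an API client's point of view, the two content types map onto two different request bodies. A hedged C# sketch using HttpClient (the endpoint URLs and field names are made up for illustration):

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class PostExamples
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // application/x-www-form-urlencoded: fine for short, mostly-alphanumeric name/value pairs.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["MyVariableOne"] = "ValueOne",
            ["MyVariableTwo"] = "ValueTwo"
        });
        await client.PostAsync("https://api.example.com/submit", form);

        // multipart/form-data: each pair becomes a MIME part, so binary payloads
        // are not percent-encoded and a filename can travel in the part headers.
        var multipart = new MultipartFormDataContent();
        multipart.Add(new StringContent("ValueOne"), "MyVariableOne");
        multipart.Add(new ByteArrayContent(new byte[] { 0x89, 0x50, 0x4E, 0x47 }),
                      "upload", "picture.png");
        await client.PostAsync("https://api.example.com/upload", multipart);
    }
}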
READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning one arbitrary byte into three 7BIT bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send it as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.
I don't think HTTP POST is limited to multipart or x-www-form-urlencoded. The Content-Type header is orthogonal to the HTTP POST method (you can use whatever MIME type suits you). This is also the case for typical HTML-representation-based web apps (e.g. a JSON payload has become very popular for transmitting the payload of ajax requests).
Regarding Restful API over HTTP the most popular content-types I came in touch with are application/xml and application/json.
application/xml:
data size: XML is very verbose, but usually not an issue when compression is used, and bearing in mind that the write-access case (e.g. through POST or PUT) is much rarer than read access (in many cases it is <3% of all traffic). There were rarely cases where I had to optimize write performance
existence of non-ASCII chars: you can use UTF-8 as the encoding in XML
existence of binary data: you would need to use base64 encoding
filename data: you can encapsulate this inside a field in XML
application/json
data size: more compact than XML, still text, but you can compress
non-ASCII chars: JSON is UTF-8
binary data: base64 (also see json-binary-question)
filename data: encapsulate as its own field inside the JSON
binary data as its own resource
I would try to represent binary data as its own asset/resource. It adds another call but decouples things better. Example with images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply inline the binary resource as a link:
<main-resource>
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>
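A rough C# sketch of that decoupling (URLs, file names and content types are illustrative only): upload the binary as its own resource first, then reference the returned Location from the main resource.

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class BinaryAsResourceDemo
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // 1. Upload the raw image as its own resource.
        var image = new ByteArrayContent(await File.ReadAllBytesAsync("foo.jpg"));
        image.Headers.ContentType = new MediaTypeHeaderValue("image/jpeg");
        var response = await client.PostAsync("https://imageserver.org/images", image);

        // 2. The server answers 201 Created with a Location header.
        Uri imageUri = response.Headers.Location;

        // 3. Later representations simply link to it instead of embedding base64 blobs.
        Console.WriteLine($"<main-resource>\n  <link href=\"{imageUri}\"/>\n</main-resource>");
    }
}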
I agree with much that Manuel has said. In fact, his comments refer to this url...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you expect your API users to be building their apps with?
Do they have frameworks or components they can use that favour one method over the other?
If you get a clear idea of your users, and how they'll make use of your API, then that will help you decide. If you make the upload of files hard for your API users, they'll move away, or you'll spend a lot of time supporting them.
Secondary to this would be the tool support YOU have for writing your API, and how easy it is for you to accommodate one upload mechanism over the other.
Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print shop and had some problems uploading images to the server that came from an HTML5 canvas element. I was struggling for at least an hour and could not get the image to save correctly on my server.
Once I set the contentType option of my jQuery ajax call to application/x-www-form-urlencoded, everything went the right way: the base64-encoded data was interpreted correctly and successfully saved as an image.
Maybe that helps someone!
If you need to use Content-Type: application/x-www-form-urlencoded, then DO NOT use FormDataCollection as the parameter: in ASP.NET Core 2+, FormDataCollection has no default constructor, which is required by the formatters. Use IFormCollection instead:
public IActionResult Search([FromForm] IFormCollection type)
{
    return Ok();
}
In my case the issue was that the request's Content-Type was application/x-www-form-urlencoded, but the body actually contained JSON. When you access request.data in Django, it cannot convert the body properly, so access request.body instead.
Refer to this answer for a better understanding:
Exception: You cannot access body after reading from request's data stream

Please help me trace how charsets are handled every step of the way

We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible but my question is for you folks to correct any mistakes I make and fill in any BLANKs.
When reading this scenario, imagine that this is being done on a Mac by John, and on Windows by Jane, and add comments if one behaves differently than the other in any particular situation.
Our hero (John/Jane) starts by writing a paragraph in Microsoft Word. Word's charset is BLANK1 (CP1252?).
S/he copies the paragraph, including smart quotes (e.g. “ ”). The act of copying is done by the BLANK2 (Operating system...Windows/Mac?) which BLANK3 (detects what charset the application is using and inherits the charset?). S/he then pastes the paragraph in a text box at StackOverflow.
Let's assume StackOverflow is running on Apache/PHP and that their set up in httpd.conf does not specify AddDefaultCharset utf-8 and their php.ini sets the default_charset to ISO-8859-1.
Yet neither charset above matters, because Stack Overflow's header contains this statement META http-equiv="Content-Type" content="text/html; charset=UTF-8", so even though when you clicked on "Ask Question" you might have seen a *RESPONSE header in Firebug of "Content-type text/html;" ... in fact, Firefox/IE/Opera/other browsers BLANK4 (completely 100% ignore the server header and override it with the Meta Content-type declaration in the header? Although the browser must read the file before knowing the Content-type, it doesn't have to do anything with the encoding until it displays the body, so this makes no difference to the browser?).
Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters. BLANK5 (If someone can go into excruciating detail about what the browser does in this step, it would be very helpful...here's my understanding...since the operating system controls the clipboard and display of the character in the form, it inserts the character in whatever charset it was copied from. And displays it in the form as that charset...OVERRIDING the UTF-8 in this example).
Let's assume the form method=GET rather than POST so we can play w/ the URL browser input.... Continuing our story, the form is submitted as UTF-8. The smart quotes, which represent decimal codes 147 & 148, get transformed into BLANK6 characters when the browser converts them to UTF-8.
Let's assume that after submission, Stack Overflow found an error in the form, so rather than displaying the resulting question, it pops back up the input box with your question inside the form. In the PHP, the form variables are escaped with htmlspecialchars($var) in order for the data to be properly displayed, since this time it's the BLANK7 (browser controlling the display, rather than the operating system...therefore the quotes need to be represented as their UTF-8 equivalents or else you'd get the dreaded funny-looking � question mark?)
However, if you take the smart quotes, and insert them directly in the URL bar and hit enter....the htmlspecialchars will do BLANK8, messing up the form display and inserting question marks �� since querying a URL directly will just use the encoding in the url...or even a BLANK9 (mix of encodings?) if you have more than one in there...
When the REQUEST is sent out, the browser lists acceptable charsets to the server. The list of charsets comes from BLANK10.
Now you might think our story ends there, but it doesn't. Because Stack Overflow needs to save this data to a database. Fortunately, the people running this joint are smart. So when their MySQL client connects to the database, it makes sure the client and server are talking to each other in UTF-8 by issuing the SET NAMES utf8 command as soon as the connection is initiated. Additionally, the default character set for MySQL is set to UTF-8 and each field is set the same way.
Therefore, Stack Overflow has completely secured its website from SQL injection, CSRF forgery and XSS issues...or at least those borne of charset game-playing.
*Note, this is an example, not the actual response by that page.
I don't know if this "answers" your "question", but I can at least help you with what I think may be a critical misunderstanding.
You say, "Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters." There is no such thing as a "UTF-8 character", and it isn't true or even meaningful to think of the form "converting" anything into anything when you paste it. Characters are a completely abstract concept, and there's no way of knowing (without reading the source) how a given program, including your web browser, decides to implement them. Since most important applications these days are Unicode-savvy, they probably have some internal abstraction to represent text as Unicode characters--note, that's Unicode and not UTF-8.
A piece of text, in Unicode (or in any other character set), is represented as a series of code points, integers that are uniquely assigned to characters, which are named entities in a large database, each of which has any number of properties (such as whether it's a combining mark, whether it goes right-to-left, etc.). Here's the part where the rubber meets the road: in order to represent text in a real computer, by saving it to a file, or sending it over the wire to some other computer, it has to be encoded as a series of bytes. UTF-8 is an encoding (or a "transformation format" in Unicode-speak), that represents each integer code point as a unique sequence of bytes. There are several interesting and good properties of UTF-8 in particular, but they're not relevant to understanding, in general, what's going on.
In the scenario you describe, the content-type metadata tells the browser how to interpret the bytes being sent as a sequence of characters (which are, remember, completely abstract entities, having no relationship to bytes or anything). It also tells the browser to please encode the textual values entered by the user into a form as UTF-8 on the way back to the server.
All of these remarks apply all the way up and down the chain. When a computer program is processing "text", it is doing operations on a sequence of "characters", which are abstractions representing the smallest components of written language. But when it wants to save text to a file or transmit it somewhere else, it must turn that text into a sequence of bytes.
We use Unicode because its character set is universal, and because the byte sequences it uses in its encodings (UTF-8, the UTF-16s, and UTF-32) are unambiguous.
P.S. When you see �, there are two possible causes.
1) A program was asked to write some characters using some character set (say, ISO-8859-1) that does not contain a particular character that appears in the text. So if text is represented internally as a sequence of Unicode code points, and the text editor is asked to save as ISO-8859-1, and the text contains some Japanese character, it will have to either refuse to do it, or spit out some arbitrary ISO-8859-1 byte sequence to mean "no puedo".
2) A program received a sequence of bytes that perhaps does represent text in some encoding, but it interprets those bytes using a different encoding. Some byte sequences are meaningless in that encoding, so it can either refuse to do it, or just choose some character (such as �) to represent each unintelligible byte sequence.
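A tiny C# illustration of cause (2), with made-up data: the same byte is perfectly meaningful as ISO-8859-1, but as UTF-8 it is an incomplete sequence, so the decoder substitutes the replacement character.

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // "é" encoded as ISO-8859-1 is the single byte 0xE9.
        byte[] latin1Bytes = { 0xE9 };

        // Interpreted with the encoding it was written in: fine.
        Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(latin1Bytes)); // é

        // Interpreted as UTF-8: 0xE9 starts a three-byte sequence that never
        // completes, so the decoder falls back to U+FFFD.
        Console.WriteLine(Encoding.UTF8.GetString(latin1Bytes));                      // �
    }
}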
P.P.S. These encode/decode dances happen between applications and the clipboard in your OS of choice. Imagine the possibilities.
In answer to your comments:
It's not true that "Word uses CP1252 encoding"; it uses Unicode to represent text internally. You can verify this, trivially, by pasting some Katakana character such as サ into Word. Windows-1252 cannot represent such a character.
When you "copy" something, from any application, it's entirely up to the application to decide what to put on the clipboard. For example, when I do a copy operation in Word, I see 17 different pieces of data, each having a different format, placed into the clipboard. One of them has type CF_UNICODETEXT, which happens to be UTF-16.
Now, as for URLs... Details are found here. Before sending an HTTP request, the browser must turn an IRI (which can contain any text at all) into a URI. You convert an IRI to a URI by first encoding it as UTF-8, then representing the UTF-8 bytes outside the ASCII printable range by their percent-escaped forms. So, for example, the correct encoding for http://foo.com/dir1/引き割り.html is http://foo.com/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html . (Host names follow different rules, but it's all in the linked-to resource.)
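The percent-escaping step is easy to reproduce. A small C# sketch (Uri.EscapeDataString is used here only to show the encode-as-UTF-8-then-escape behaviour on the path segment):

using System;

class IriToUriDemo
{
    static void Main()
    {
        string pathSegment = "引き割り.html";

        // Each non-ASCII character is encoded as UTF-8, and each resulting
        // byte is then percent-escaped.
        Console.WriteLine(Uri.EscapeDataString(pathSegment));
        // %E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
    }
}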
Now, in my opinion, the browser ought to show plain old text in the location bar, and do all of the encoding behind the scenes. But some browsers make stupid choices, and they show you the URI form, or some chimera of a URI and an IRI.

Resources