I have encountered a number of implementations of algorithms for converting a byte array to a hexadecimal string in many languages:
How to convert a byte array to a hex string in Java?
How do you convert buffer (byte array) to hex string in C?
How can I convert a hex string to a byte array?
Why is this conversion used so often?
What are the advantages of storing byte arrays as hexadecimal strings?
For example:
To store in a file with a format that doesn't support binary, e.g. CSV.
To store in a database field that doesn't support binary.
To send in a protocol that doesn't support binary.
To embed in other content that doesn't support binary, e.g. XML and JSON.
To display to a user.
Many other reasons...
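For reference, here is a minimal sketch of the conversion itself in Python, using only built-in methods. The sample bytes are made up for illustration.

```python
# Round-trip a byte array through a hexadecimal string using only built-ins.
data = bytes([0x00, 0x1F, 0xAB, 0xFF])  # arbitrary binary data, including a zero byte

hex_text = data.hex()               # two lowercase hex digits per byte
restored = bytes.fromhex(hex_text)  # parse the digits back into bytes

print(hex_text)          # 001fabff
print(restored == data)  # True
```

The hex form contains only the characters 0-9 and a-f, which is why it survives CSV, XML, JSON and similar text-only containers.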
Recently, I have wanted to write an application using Netty. The messages I want to send are object streams serialized with Avro or Protobuf.
My question is: when one side receives the byte stream from the other side, how can I split it into objects, or how can I know that one object has ended and the next serialized object is about to begin?
I have been advised to put special delimiter characters between the byte streams of different objects, but couldn't Avro or Protobuf generate those same characters while serializing an object?
I think you most likely want to "prefix" each serialized object with the number of bytes it is serialized to. This allows you to read exactly the right number of bytes per object and so do the right thing when deserializing it.
Netty itself contains, for example, the LengthFieldBasedFrameDecoder, which allows you to "slice" out the bytes for each object, and the LengthFieldPrepender, which allows you to prefix each object with its length when encoding.
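To illustrate the idea independently of Netty, here is a rough Python sketch of length-prefixed framing. The 4-byte big-endian header and the helper names frame and split_frames are assumptions made for this example; they are not Netty's API.

```python
import struct

HEADER = struct.Struct(">I")  # assumed header: 4-byte big-endian unsigned length


def frame(payload: bytes) -> bytes:
    """Prefix a serialized object with its length in bytes."""
    return HEADER.pack(len(payload)) + payload


def split_frames(buffer: bytes):
    """Return (complete_payloads, leftover_bytes) for the data received so far."""
    payloads = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if len(buffer) - start < length:
            break  # the rest of this object has not arrived yet
        payloads.append(buffer[start:start + length])
        offset = start + length
    return payloads, buffer[offset:]


stream = frame(b"first") + frame(b"\x00\xff second")
# Pretend the last 3 bytes are still in flight.
complete, rest = split_frames(stream[:-3])
```

Because the receiver counts bytes instead of searching for a delimiter, the payload may contain any byte value at all, which answers the concern about Avro or Protobuf producing the delimiter by accident.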
I have never worked on the security side of web apps, as I am just out of college. Now, I am looking for a job and working on some websites on the side, to keep my skills sharp and gain new ones. One site I am working on is pretty much copied from the original MEAN stack from the guys that created it, but trying to understand it and do things better where I can.
To compute the hash & salt, the creators used PBKDF2. I am not interested in hearing about arguments for or against PBKDF2, as that is not what this question is about. They seem to have used buffers for everything here, which I understand is a common practice in node. What I am interested in are their reasons for using base64 for the buffer encoding, rather than simply using UTF-8, which is an option with the buffer object. Most computers nowadays can handle many of the characters in Unicode, if not all of them, but the creators could have chosen to encode the passwords in a subset of Unicode without restricting themselves to the 65 characters of base64.
By "the choice between encoding as UTF-8 or base64", I mean transforming the binary of the hash, computed from the password, into the given encoding. node.js specifies a couple ways to encode binary data into a Buffer object. From the documentation page for the Buffer class:
Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP
streams or the file system, it's necessary to handle octet streams. Node has several
strategies for manipulating, creating, and consuming octet streams.
Raw data is stored in instances of the Buffer class. A Buffer is similar to an array
of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer
cannot be resized.
What the Buffer class does, as I understand it, is take some binary data and calculate the value of each 8 (usually) bits. It then converts each set of bits into a character corresponding to its value in the encoding you specify. For example, if the binary data is 00101100 (8 bits), and you specify UTF-8 as the encoding, the output would be , (a comma). This is what anyone looking at the output of the buffer would see when looking at it with a text editor such as vim, as well as what a computer would "see" when "reading" them. The Buffer class has several encodings available, such as UTF-8, base64, and binary.
I think they felt that, while storing any UTF-8 character imaginable in the hash, as they would have to do, would not faze most modern computers, with their gigabytes of RAM and terabytes of space, actually showing all these characters, as they may want to do in logs, etc., would freak out users, who would have to look at weird Chinese, Greek, Bulgarian, etc. characters, as well as control characters, like the Ctrl button or the Backspace button or even beeps. They would never really need to make sense of any of them, unless they were experienced users testing PBKDF2 itself, but the programmer's first duty is to not give any of his users a heart attack. Using base64 increases the overhead by about a third, which is hardly worth noting these days, and decreases the character set, which does nothing to decrease the security. After all, computers are written completely in binary. As I said before, they could have chosen a different subset of Unicode, but base64 is already standard, which makes it easier and reduces programmer work.
Am I right about the reasons why the creators of this repository chose to encode its passwords in base64, instead of all of Unicode? Is it better to stick with their example, or should I go with Unicode or a larger subset of it?
A hash value is a sequence of bytes. This is binary information. It is not a sequence of characters.
UTF-8 is an encoding for turning sequences of characters into sequences of bytes. Storing a hash value "as UTF-8" makes no sense, since it is already a sequence of bytes, and not a sequence of characters.
Unfortunately, many people have taken to the habit of considering a byte as some sort of character in disguise; it was at the basis of the C programming language and still infects some rather modern and widespread languages such as Python. However, only confusion and sorrow lie down that path. The usual symptoms are people wailing and whining about the dreadful "character zero" -- meaning, a byte of value 0 (a perfectly fine value for a byte) that, turned into a character, becomes the special character that serves as end-of-string indicator in languages from the C family. This confusion can even lead to vulnerabilities (the zero implying, for the comparison function, an earlier-than-expected termination).
Once you have understood that binary is binary, the problem becomes: how are we to handle and store our hash value, in particular in JavaScript, a language that is known to be especially poor at handling binary values? The solution is an encoding that turns the bytes into characters, not just any characters, but a very small subset of well-behaved characters. This is called Base64. Base64 is a generic scheme for encoding bytes into character strings that don't include problematic characters (no zero, only ASCII printable characters, excluding all the control characters and a few others such as quotes).
Not using Base64 would imply assuming that JavaScript can manage an arbitrary sequence of bytes as if it was just "normal characters", and that is simply not true.
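As a small illustration of the bytes-versus-characters distinction, here is a Python sketch (rather than the JavaScript of the question) that derives a PBKDF2 hash and stores it as Base64 text. The password, salt length and iteration count are placeholders chosen for the example, not recommendations.

```python
import base64
import hashlib
import os

password = "correct horse battery staple".encode("utf-8")  # characters -> bytes
salt = os.urandom(16)                                       # random binary salt

# The iteration count here is illustrative only, not a recommendation.
digest = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

stored_hash = base64.b64encode(digest).decode("ascii")  # bytes -> safe text
stored_salt = base64.b64encode(salt).decode("ascii")

# Later: turn the stored text back into the exact original bytes.
recovered = base64.b64decode(stored_hash)
```

UTF-8 is used exactly once, to turn the password characters into bytes. The hash is never characters at any point; Base64 is only its storage form.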
There is a fundamental, security-related reason to store as Base64 rather than Unicode: the hash may contain the byte value "0", used by many programming languages as an end-of-string marker.
If you store your hash as Unicode, you, another programmer, or some library code you use may treat it as a string rather than a collection of bytes, and compare using strcmp() or a similar string-comparison function. If your hash contains the byte value "0", you've effectively truncated your hash to just the portion before the "0", making attacks much easier.
Base64 encoding avoids this problem: the byte value "0" cannot occur in the encoded form of the hash, so it doesn't matter if you compare encoded hashes using memcmp() (the right way) or strcmp() (the wrong way).
This isn't just a theoretical concern, either: there have been multiple cases of code for checking digital signatures using strcmp(), greatly weakening security.
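The truncation effect can be imitated in a few lines of Python. The helper c_string_equal is a hypothetical stand-in for a strcmp()-style comparison, and the two "hashes" are made-up values that agree only up to a zero byte.

```python
import hmac


def c_string_equal(a: bytes, b: bytes) -> bool:
    """Mimic strcmp() == 0: only the bytes before the first zero byte are compared."""
    return a.split(b"\x00", 1)[0] == b.split(b"\x00", 1)[0]


real_hash = bytes.fromhex("a300c4e1f07b9d22")  # made-up 8-byte "hash", zero at index 1
forged = bytes.fromhex("a300ffffffffffff")     # agrees only up to the zero byte

truncated_match = c_string_equal(real_hash, forged)   # the wrong way
full_match = hmac.compare_digest(real_hash, forged)   # compares every byte
```

Here truncated_match is True even though the values differ in six of their eight bytes, while full_match is False. An attacker facing the string-style comparison only has to get the first byte right.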
This is an easy answer, since there is an abundance of byte sequences which are not well-formed UTF-8 strings. The most common one is a continuation byte (0x80-0xbf) that is not preceded by a leading byte in a multibyte sequence (0xc0-0xf7); bytes 0xf8-0xff aren't valid either.
So these byte sequences are not valid UTF-8 strings:
0x80
0x40 0xa0
0xff
0xfe
0xfa
If you want to encode arbitrary data as a string, use a scheme that allows it. Base64 is one of those schemes.
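A quick way to check the list above is to hand each sequence to a strict UTF-8 decoder. The sketch below uses Python's built-in decoder and, for contrast, round-trips the same bytes through Base64.

```python
import base64

candidates = [b"\x80", b"\x40\xa0", b"\xff", b"\xfe", b"\xfa"]

rejected = []
for raw in candidates:
    try:
        raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError:
        rejected.append(raw)

# Base64 accepts the same bytes and returns them unchanged on decode.
round_trips = all(
    base64.b64decode(base64.b64encode(raw)) == raw for raw in candidates
)
```

Every candidate ends up in rejected, and round_trips is True.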
An additional point: you might think to yourself, well, I don't really care whether they're well-formed UTF-8 strings, I'm never going to use the data as a string, I just want to hand over this byte sequence to be stored for later.
The problem with that is that if you give an arbitrary byte sequence to an application expecting a UTF-8 string, and it is not well-formed, the application is not obligated to make use of this byte sequence. It might reject it with an error, it might truncate the string, it might try to "fix" it.
So don't try to store arbitrary byte sequences as a UTF-8 string.
Base64 is better, but consider a websafe base64 alphabet for transport. Base64 can conflict with querystring syntax.
Another option you might consider is using hex. It's longer but seldom conflicts with any syntax.
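For comparison, here is how the three encodings treat the same two bytes in Python. The input was picked deliberately so that standard Base64 produces both of its query-string-unfriendly characters.

```python
import base64

raw = b"\xfb\xff"

standard = base64.b64encode(raw).decode("ascii")          # may contain '+' and '/'
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")  # uses '-' and '_' instead
as_hex = raw.hex()

print(standard, url_safe, as_hex)  # +/8= -_8= fbff
```

Note that the URL-safe variant still emits '=' padding, which may itself need escaping or stripping depending on where the value is placed.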
In encryption methods like RSA, we operate on an integer which represents our message. I've toyed around with converting the string to an array of bytes and working one character at a time, but that seems overly slow and the RSA algorithm is designed to work with the entire message.
How do we convert a string to a representation (integer, big integer, etc.) to which we can apply our cryptographic algorithm?
In typical usage, you don't actually encrypt the entire message using RSA. Instead, you encrypt the encryption key for a symmetric block cipher (like AES) using RSA, then encrypt your stream of data using that block cipher.
Do not attempt to do this on your own! You have to be very careful with how you do the conversion, including setting up a secure padding scheme and using the block cipher correctly and in a secure mode. You might want to look into using language-provided crypto libraries or a standard library like OpenSSL.
Hope this helps!
Think about how integers and strings are represented in memory. A 32-bit integer takes up four 8-bit bytes, and a 64-bit integer takes up eight bytes. A string is stored as bytes too, and in the case of ASCII, each character is represented by one byte. (UTF-8 and UTF-16 are variable-length encodings, but it's still bytes.)
There is nothing to convert, because all datatypes are represented by bytes internally.
There's no reason this can't be extended to, say, 2048-bit integers for use with RSA.
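As a concrete illustration of "it's all bytes", the Python sketch below reads the bytes of a short string as one big-endian integer and converts it back. It only shows the representation; it is not a secure way to prepare a message for RSA, since it applies no padding scheme, and the big-endian byte order is an assumption of the example.

```python
message = "hello"

as_bytes = message.encode("utf-8")
as_int = int.from_bytes(as_bytes, byteorder="big")  # the whole message as one integer

# Reverse: compute how many bytes the integer needs, then decode.
length = (as_int.bit_length() + 7) // 8
back = as_int.to_bytes(length, byteorder="big").decode("utf-8")

print(hex(as_int))  # 0x68656c6c6f
```

One caveat of this naive mapping: leading zero bytes in the input would be lost on the way back, which is one of the details real padding schemes take care of.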
I have read that if you want to encrypt a string using one programming language and decrypt that string using another programming language, then to ensure compatibility it is best to do some conversions prior to doing the encryption. I have read that it's a best practice to encrypt the byte array of a string rather than the string itself. Also, I have read that certain encryption algorithms expect each encrypted packet to be a fixed length in size. If the last packet to be encrypted isn't the required size, then encryption would fail. Therefore it seems like a good idea to encrypt data that has first been converted into a fixed length, such as hex.
I am trying to identify best practices that are generally useful regardless of the encryption algorithm being used. To maximize compatibility when encrypting and decrypting data across different languages and platforms, I would like a critique on the following steps as a process:
Encryption:
start with a plain text string
convert plain text string to byte array
convert byte array to hex
encrypt hex to encrypted string
end with an encrypted string
Decryption:
start with an encrypted string
decrypt encrypted string to hex
convert hex to byte array
convert byte array to plain text string
end with a plain text string
Really, the best practice for encryption is to use a high-level encryption framework; there are a lot of things you can do wrong when working with the primitives. And mfanto does a good job of mentioning important things you need to know if you don't use a high-level encryption framework. And I'm guessing that if you are trying to maximize compatibility across programming languages, it's because you need other developers to interoperate with the encryption, and then they need to learn the low-level details of working with encryption too.
So my suggestion for a high-level framework is Google's Keyczar, as it handles the details of algorithm, key management, padding, IV, authentication tag, and wire format all for you. And it exists for many different programming languages: Java, Python, C++, C# and Go. Check it out.
I wrote the C# version, so I can tell you the primitives it uses behind the scenes are widely available in most other programming languages too, and it uses standards like JSON for key management and storage.
Your premise is correct, but in some ways it's a little easier than that. Modern crypto algorithms are meant to be language agnostic, and provided you have identical inputs with identical keys, you should get identical results.
It's true that for most ciphers and some modes, data needs to be a fixed length. Converting to hex won't do it, because the data needs to end on fixed boundaries. With AES for example, if you want to encrypt 4 bytes, you'll need to pad it out to 16 bytes, which a hex representation wouldn't do. Fortunately that'll most likely happen within the crypto API you end up using, with one of the standard padding schemes. Since you didn't tag a language, here's a list of padding modes that the AesManaged class in .NET supports.
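To make the 4-bytes-to-16-bytes point concrete, here is a rough Python sketch of PKCS#7-style padding, one of the standard schemes. The helper names are made up for the example, and the unpad function skips the validation a real library performs, so treat it as an illustration only.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length is a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes) -> bytes:
    """Strip the padding (no validation of the padding bytes; illustration only)."""
    return padded[:-padded[-1]]


four_bytes = b"\xde\xad\xbe\xef"
hex_text = four_bytes.hex().encode("ascii")  # 8 bytes: still not a multiple of 16
padded = pkcs7_pad(four_bytes)               # 16 bytes, the last 12 all equal to 12
```

The hex form is twice as long as the input, but it still does not end on a block boundary, whereas the padded form always does.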
On the flip side, encrypting data properly requires a lot more than just byte encoding. You need to choose the correct mode of operation (CBC or CTR is preferred), and then provide some type of message integrity. Encryption alone doesn't protect against tampering with data. If you want to simplify things a bit, then look at a mode like GCM, which handles both confidentiality, and integrity.
Your scheme should then look something like:
Convert the plain text string to a byte array. See @rossum's comment for an important note about character encoding.
Generate a random symmetric key or use PBKDF2 to convert a passphrase to a key
Generate a random IV/nonce for use with GCM
Encrypt the byte array and store it, along with the Authentication Tag
You might optionally want to store the byte array as a Base64 string.
For decryption:
If you stored the byte array as a Base64 string, convert back to the byte array.
Decrypt encrypted byte array to plaintext
Verify the resulting Authentication Tag matches the stored Authentication Tag
Convert byte array to plain text string.
I use HashIds for this purpose. It's simple and supports a wide range of programming languages. We use it to pass encrypted data between our PHP, Node.js, and Golang microservices whenever we need to decrypt data at the destination.
I have read that it's a best practice to encrypt the byte array of a string rather than the string itself.
Cryptographic algorithms generally work on byte arrays or byte streams, so yes. You don't encrypt objects (strings) directly, you encrypt their byte representations.
Also, I have read that certain encryption algorithms expect each encrypted packet to be a fixed length in size. If the last packet to be encrypted isn't the required size, then encryption would fail.
This is an implementation detail of the particular encryption algorithm you choose. It really depends on what the API interface is to the algorithm.
Generally speaking, yes, cryptographic algorithms will break input into fixed-size blocks. If the last block isn't full then they may pad the end with arbitrary bytes to get a full chunk. To distinguish between padded data and data which just happens to have what-look-like-padding bytes at the end, they'll prepend or append the length of the plain text to the byte stream.
This is the kind of detail that should not be left up to the user, and a good encryption library will take care of these details for you. Ideally you just want to feed in your plain text bytes and get encrypted bytes out on the other side.
Therefore it seems like a good idea to encrypt data that has first been converted into a fixed length, such as hex.
Converting bytes to hex doesn't make it fixed length. It doubles the size, but that's not fixed. It makes it ASCII-safe so it can be embedded into text files and e-mails easily, but that's not relevant here. (And Base64 is a better binary→ASCII encoding than hex anyways.)
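The sizes are easy to check. This short Python sketch encodes inputs of a few lengths (all zero bytes, chosen arbitrarily) and records the length of the hex and Base64 forms.

```python
import base64

sizes = {}
for n in (1, 4, 16, 17):
    raw = bytes(n)  # n zero bytes
    sizes[n] = (len(raw.hex()), len(base64.b64encode(raw)))

print(sizes)  # {1: (2, 4), 4: (8, 8), 16: (32, 24), 17: (34, 24)}
```

Hex output is always exactly twice the input length, so it varies with the input just as the input does. Base64 grows by roughly a third, rounded up to a multiple of four characters.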
In the interest of identifying best practices for ensuring compatibility with encrypting and decrypting data across different languages and platforms, I would like a critique on the following steps as a process:
Encryption:
plain text string
convert plain text string to byte array
convert byte array to hex
encrypt hex to encrypted string
encrypted string
plain text byte array to encrypted byte array
Decryption:
encrypted string
decrypt encrypted string to hex
convert hex to byte array
encrypted byte array
decrypt encrypted byte array to plain text byte array
convert byte array to plain text string
plain text string
To encrypt, convert the plain text string into its byte representation and then encrypt these bytes. The result will be an encrypted byte array.
Transfer the byte array to the other program in the manner of your choosing.
To decrypt, decrypt the encrypted byte array into a plain text byte array. Construct your string from this byte array. Done.
In .NET, a string is a Unicode character string. My understanding is that the string itself does not contain any particular encoding information, i.e. it is encoding neutral? You can use any encoding to encode a string into a stream of bytes and then decode the stream of bytes back into a recognizable string, as long as the encoding used matches the decoding used?
In .NET, a string consists of UTF-16 code units. There is no such thing as a "Unicode string". It could be a UCS-2 or UCS-4 string, or use one of the various transformation formats like UTF-7, UTF-8, or UTF-16, but you cannot just call it "Unicode". It is important to understand the difference between them.
I know that somebody on the .NET team named a property of the Encoding class "Unicode", but that was a mistake. The class also contains a "Default" property, which is another misnamed property. This leads to many defects (the majority of people don't read manuals, and they simply don't realize that "Unicode" is UTF-16 and "Default" means the default OS code page).
As for the second part of your question, the answer is unfortunately no. It would be "yes", but there is one small problem: the GB18030 encoding, the standard encoding for the PRC (China). It has assigned code points which simply don't exist in the Unicode standard (yet). Possibly a new version of the Unicode standard will resolve this issue.
One important point here (going back to UTF-16) is that bytes are not necessarily a good unit for conversions. The problem is related to surrogate pairs: you have to be careful, as one character can be encoded as a pair of 16-bit code units, meaning four bytes.
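The surrogate-pair issue is easy to see outside .NET as well. Here is a small Python sketch using the little-endian UTF-16 codec; the two sample characters are arbitrary.

```python
bmp_char = "A"              # U+0041, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside it

bmp_units = bmp_char.encode("utf-16-le")        # one 16-bit code unit
astral_units = astral_char.encode("utf-16-le")  # two code units (a surrogate pair)

print(len(bmp_units), len(astral_units))  # 2 4

# Cutting the byte stream in the middle of the pair leaves an undecodable half.
try:
    astral_units[:2].decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

So any code that chunks or truncates UTF-16 data at an arbitrary byte offset risks splitting a character in two.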
If you don't care to support GB18030 encoding, you could use the method you mention freely. If by chance you want to sell your software in China, you will need to support it and of course you will have to be very careful (extensive testing will be needed).
Yes, with the caveat that many encoding schemes can't hold all Unicode code points, which renders some round trips non-idempotent.
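A short Python sketch of that caveat, with a made-up sample string: the round trip is exact through encodings that cover all of Unicode, and lossy through one that does not.

```python
text = "naïve café 你好"

utf8_round_trip = text.encode("utf-8").decode("utf-8")
utf16_round_trip = text.encode("utf-16").decode("utf-16")

# An encoding that cannot hold every character loses information.
ascii_round_trip = text.encode("ascii", errors="replace").decode("ascii")

# Decoding with a different encoding than was used for encoding also fails to round-trip.
mismatched = text.encode("utf-8").decode("latin-1")

print(ascii_round_trip)  # na?ve caf? ??
```

Both utf8_round_trip and utf16_round_trip equal the original text; ascii_round_trip and mismatched do not.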
"Unicode" in .NET is UTF-16 or UCS-2 (2 bytes per code unit). It is itself an encoding of the full Unicode character set, which requires 32 bits (4 bytes, UCS-4) to hold every character in a single unit. So you can serialize the bytes as is, and any system that supports UTF-16 will deserialize them properly.