I would like to receive and send bytes that have special meaning in ASCII code like End of Text, End of Transmission etc. but I am not sure if it is allowed. Can it break my communication? Sending and receiving looks like reading from file that is why I doubt I can directly use this specific values. I use Windows OS.
EDIT: I have tested it and there is no problem with any sign. All of control ASCII characters can be sent via RS 232. Both reading and writing cause no unexpected behaviour.
RS232 is a very binary protocol. It does not even assume 8-bit bytes, let alone ASCII. The fact that on Windows, you use file functions does not matter either. Those too do not assume text data, although those do assume 8-bit byets.
RS-232 nodes do not interpret the data except in software flow control mode (XOn/XOff). You use this mode only if both parties agree and agree on the values of XOn and XOff.
The values are historically based on the ASCII DC1 and DC3 characters but the only thing that matters is their values, 0x11, and 0x13.
If you haven't set up software flow control, all values are passed through as-is.
Related
I'm a bit confused how people represent binary data, and how it is sent over networks. I will explain through Wikipedia's example. Shown here <- https://imgur.com/a/POELH -> So I have my binary data encoded as base 64, and I am sending the text TWFU. So I am sending T then W then F and finally U. But to send T, a char. I will need one byte to send it, like I've always been told. One character sent over a network is one byte.
Because now I've come to think that if I encode 24 bytes, I will be sending over 4 characters, but to send over 4 characters I need the same amount of bytes as characters??
So when sending over the network "Man" (unencoded) (Requiring 3 bytes normally) vs "TWFu" (encoded) (requiring 4 bytes normally) in the example from above, are the same sequence of bits sent over the network the same. Because the last time I've used a socket to send over data, they just ask for a string input, never a text + encoding input.
Synopsis: "How" is an agreement. "Raw" is common.
Data is sent in whichever way the sender and receiver agree. There are many protocols that are standard agreements. Protocols operate at many levels. A very common pair that covers two levels is TCP/IP. Many higher-level protocols are layered on top of them. (A higher-level protocol may or may not depend on specific underlying protocols.) HTTP and SMTP are very common higher-level protocols, often with SSL sandwiched in between.
Sometimes the layers or the software that implements them is called a stack. There is also the reference (or conceptual) OSI Model. The key point about it is that it provides a language to talk about different layers. The layers it defines may or may not map to any specific stack.
Your question is too vague to answer directly. With HTTP, "raw" binary data is transferred all the time. The HTTP headers can give the length of the body in octets and the body follows the header. As part of the agreement between the sender and receiver, the header might give meta-data about the binary data using MIME headers. For example: Your gravatar
is sent with headers including:
content-length:871
content-type:image/png
That's enough for the receiver to know that the sender claims that it is a PNG graphic of 871 bytes. The receiver will read the header and then read 871 bytes for the body and then assume that what follows is another HTTP header.
Some protocols use synchronizations methods other than bodies with pre-declared sizes. They might be entirely text-based and use a syntax that allows only certain characters. They can be extended by a nesting agreement to use something like Base64 to represent binary data as text.
Some layers might provide data compression of sufficient density that expansion by higher layers, such as Base64, is not a great concern. See HTTP Compression, for example.
If you want to see HTTP in action, hit F12 and go the Network tab. If you want to see other protocols active on your computer try WireShark, Microsoft Message Analyzer, Fiddler or similar.
Base64 is a method for encoding arbitrary 8-bit data in a purely 7-bit channel. As much as the internet is based on the principle of 8-bit bytes, for text mode it's presumed to be 7-bit ASCII unless otherwise specified.
If you're sending that data Base64 encoded then you'll literally send TWFU. Many text-based protocols use Base64 out of convenience: It's an established standard and it's efficient enough for most applications.
The foundation of the internet, IP, is a protocol based on 8-bit bytes. When sending binary data you can make full use of all 8 bits, but if you're working with a text-mode protocol, of which there are many, you're generally stuck using 7-bit ASCII unless the protocol has a way of specifying which character set or encoding you're using.
If you have the option to switch to a "binary" transfer then you can side-step the need for Base64. If you're working with a 7-bit ASCII protocol then you're probably going to need Base64.
Note this isn't the only method for encoding arbitrary binary characters. There's also quoted printable as used in email, and URI encoding for URLs. These are more efficient in cases where escaping is exceptional, but far less efficient if it's required for each character.
If you know you're dealing with 7-bit text only there's no need for base-64 encoding.
However, if you'd need to send
Man
Boy
over a purely 7-bit text channel you couldn't send it as literal with the line breaks. Instead, you'd send encoded in base64
TWFuDQpCb3kNCg==
which has encoded line breaks but doesn't use incompatible characters. Of course, the receiver needs to know that you're sending encoded text - either implied by the protocol or explicitly marked in some way.
I'm developing a web front end for a GNU Radio application developed by a colleague.
I have a TCP client connecting to the output of two TCP Sink blocks, and the data encoding is not as I expect it to be.
One TCP Sink is sending complex data and the other is sending float data.
I'm decoding the data at the client by reading each 4-byte chunk as a float32 value. The server and the client are both little-endian systems, but I also tried byte swapping (with the GNU Radio Endian Swap block and also manually at the client), and the data is still not right. Actually it's much worse then, confirming there is no byte order mismatch.
When I execute the flow graph in GNU Radio Companion with appropriate GUI elements, the plots look correct. The data values are shown as expected to between 0 and 10.
However the values decoded at the client are generally around 0.00xxxxx, and the plot looks like noise rather than showing a simple tone as is seen in GNU Radio. If I manually scale the data by multiplying by 1000 it still looks like noise.
I'll describe the pre-D path in GNU Radio since it's shorter, but I see the same problem on the post-D path, where a WBFM Receive and a Rational Resampler are added, followed by a Throttle block and then a TCP Sink block sending float data.
File Source (Output Type: complex, vector length: 1) =>
Throttle (vector length: 1) =>
Low Pass Filter (FIR Type: Complex->Complex (Decimating)) =>
Throttle (vector length: 1) =>
TCP Sink (input type: complex, vector length: 1).
This seems to be the correct way to specify the stream parameters (and indeed Companion shows errors if I make changes which mismatch the stream items), but I can find no way to decode the data correctly on the other end of the stream.
"the historic RFC 1700 (also known as Internet standard STD 2) has defined the network order for protocols in the Internet protocol suite to be big-endian , hence the use of the term 'network byte order' for big-endian byte order."
see https://en.wikipedia.org/wiki/Endianness
having mentioned the network order for protocols being big-endian, this actually says nothing about the byte order of network payload itself.
also note: Sun Microsystems made big-endian native byte order computers (upon which much Internet protocol development was done).
i am surprised the previous answer has gone this long without a lesson on network byte order versus native byte order.
GNURadio appears to assume native byte order from a UDP Source block.
Examining the datatype color codes in Help->Types of GNURadio Companion, the orange colored 'float' connections are float32.
To verify a computer's native byte order, in Python, do:
from sys import byteorder
byteorder
the result will be 'little' or 'big'
It might be possible that no matter what type floats you are sending, when bytes get on network they get ordered in little endian. I had similar problem with udp connection, and I solved it by parsing floats as little endian on client side.
This post to the question "What is base 64 encoding used for?" says:
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
I've used sockets in Java a hundert times to send binary data over networks. And as far as I know it very common to send binary data over networks especially if you have big data. I don't see why some devices could interpret binary data wrong, since it contains TCP header etc.
SOAP MTOM also sends binary data over networks.
Am I misunderstanding something? I'm irritated, because this post has many upvotes and is accepted.
The answer you link to isn't incorrect, it just fails to explicitly mention some examples. The answer is in the quote as well:
because some media are made for streaming text
Sockets deal in bytes, they don't care what they transport. It is the higher-level protocols, or the message formats they transport, that do.
It's when this binary data is wrapped in envelopes of such protocols or formats that they can wreak havoc. A less than (<) character in image bytes is perfectly valid, but when used in an XML message, it will break the XML. Other characters, like control characters, can have an influence on how further data is to be interpreted by a protocol handler.
So base64 is used to wrap binary data in a safe-for-transport way where that would otherwise not be safe.
I have never worked on the security side of web apps, as I am just out of college. Now, I am looking for a job and working on some websites on the side, to keep my skills sharp and gain new ones. One site I am working on is pretty much copied from the original MEAN stack from the guys that created it, but trying to understand it and do things better where I can.
To compute the hash & salt, the creators used PBKDF2. I am not interested in hearing about arguments for or against PBKDF2, as that is not what this question is about. They seem to have used buffers for everything here, which I understand is a common practice in node. What I am interested in are their reasons for using base64 for the buffer encoding, rather than simply using UTF-8, which is an option with the buffer object. Most computers nowadays can handle many of the characters in Unicode, if not all of them, but the creators could have chosen to encode the passwords in a subset of Unicode without restricting themselves to the 65 characters of base64.
By "the choice between encoding as UTF-8 or base64", I mean transforming the binary of the hash, computed from the password, into the given encoding. node.js specifies a couple ways to encode binary data into a Buffer object. From the documentation page for the Buffer class:
Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP
streams or the file system, it's necessary to handle octet streams. Node has several
strategies for manipulating, creating, and consuming octet streams.
Raw data is stored in instances of the Buffer class. A Buffer is similar to an array
of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer
cannot be resized.
What the Buffer class does, as I understand it, is take some binary data and calculate the value of each 8 (usually) bits. It then converts each set of bits into a character corresponding to its value in the encoding you specify. For example, if the binary data is 00101100 (8 bits), and you specify UTF-8 as the encoding, the output would be , (a comma). This is what anyone looking at the output of the buffer would see when looking at it with a text editor such as vim, as well as what a computer would "see" when "reading" them. The Buffer class has several encodings available, such as UTF-8, base64, and binary.
I think they felt that, while storing any UTF-8 character imaginable in the hash, as they would have to do, would not phase most modern computers, with their gigabytes of RAM and terabytes of space, actually showing all these characters, as they may want to do in logs, etc., would freak out users, who would have to look at weird Chinese, Greek, Bulgarian, etc. characters, as well as control characters, like the Ctrl button or the Backspace button or even beeps. They would never really need to make sense of any of them, unless they were experienced users testing PBKDF2 itself, but the programmer's first duty is to not give any of his users a heart attack. Using base64 increases the overhead by about a third, which is hardly worth noting these days, and decreases the character set, which does nothing to decrease the security. After all, computers are written completely in binary. As I said before, they could have chosen a different subset of Unicode, but base64 is already standard, which makes it easier and reduces programmer work.
Am I right about the reasons why the creators of this repository chose to encode its passwords in base64, instead of all of Unicode? Is it better to stick with their example, or should I go with Unicode or a larger subset of it?
A hash value is a sequence of bytes. This is binary information. It is not a sequence of characters.
UTF-8 is an encoding for turning sequences of characters into sequences of bytes. Storing a hash value "as UTF-8" makes no sense, since it is already a sequence of bytes, and not a sequence of characters.
Unfortunately, many people have took to the habit of considering a byte as some sort of character in disguise; it was at the basis of the C programming language and still infects some rather modern and widespread frameworks such as Python. However, only confusion and sorrow lie down that path. The usual symptoms are people wailing and whining about the dreadful "character zero" -- meaning, a byte of value 0 (a perfectly fine value for a byte) that, turned into a character, becomes the special character that serves as end-of-string indicator in languages from the C family. This confusion can even lead to vulnerabilities (the zero implying, for the comparison function, an earlier-than-expected termination).
Once you have understood that binary is binary, the problem becomes: how are we to handle and store our hash value ? In particular in JavaScript, a language that is known to be especially poor at handling binary values. The solution is an encoding that turns the bytes into characters, not just any character, but a very small subset of well-behaved characters. This is called Base64. Base64 is a generic scheme for encoding bytes into character strings that don't include problematic characters (no zero, only ASCII printable characters, excluding all the control characters and a few others such as quotes).
Not using Base64 would imply assuming that JavaScript can manage an arbitrary sequence of bytes as if it was just "normal characters", and that is simply not true.
There is a fundamental, security-related reason to store as Base64 rather than Unicode: the hash may contain the byte value "0", used by many programming languages as an end-of-string marker.
If you store your hash as Unicode, you, another programmer, or some library code you use may treat it as a string rather than a collection of bytes, and compare using strcmp() or a similar string-comparison function. If your hash contains the byte value "0", you've effectively truncated your hash to just the portion before the "0", making attacks much easier.
Base64 encoding avoids this problem: the byte value "0" cannot occur in the encoded form of the hash, so it doesn't matter if you compare encoded hashes using memcmp() (the right way) or strcmp() (the wrong way).
This isn't just a theoretical concern, either: there have been multiple cases of code for checking digital signatures using strcmp(), greatly weakening security.
This is an easy answer, since there are an abundance of byte sequences which are not well-formed UTF-8 strings. The most common one is a continuation byte (0x80-0xbf) that is not preceded by a leading byte in a multibyte sequence (0xc0-0xf7); bytes 0xf8-0xff aren't valid either.
So these byte sequences are not valid UTF-8 strings:
0x80
0x40 0xa0
0xff
0xfe
0xfa
If you want to encode arbitrary data as a string, use a scheme that allows it. Base64 is one of those schemes.
An addtional point: you might think to yourself, well, I don't really care whether they're well-formed UTF-8 strings, I'm never going to use the data as a string, I just want to hand this byte sequence to store for later.
The problem with that, is if you give an arbitrary byte sequence to an application expecting a UTF-8 string, and it is not well-formed, the application is not obligated to make use of this byte sequence. It might reject it with an error, it might truncate the string, it might try to "fix" it.
So don't try to store arbitrary byte sequences as a UTF-8 string.
Base64 is better, but consider a websafe base64 alphabet for transport. Base64 can conflict with querystring syntax.
Another option you might consider is using hex. Its longer but seldom conflicts with any syntax.
Suppose you have a specification for sending information on a TCP or UDP stream and you have a sequence of bytes that you receive delimited with STX and EOT bytes. How do you handle for example the EOT byte occurring in the actual data? This is possible I think: most bytes in the message represent numbers in a defined order (i.e. it's not just ascii text in byte form) so EOT is byte 0x04 and this a number that could occur in the data. The specification is unclear on this: should I always look at the last occurrance of EOT and ignore those in between? Other similar specifications I've seen could even handle multiple messages inside the same TCP/UDP message: for example STX some_data EOT STX more_data EOT inside one TCP/UDP message. In this case you can't just look at the last EOT because it's actually 2 separate messages. Do you do some form of escaping then?
How is this sort of thing handled usually? I couldn't find anything on google, but perhaps I'm not using the best search terms.
"Usually" the protocol should be well designed, so that messages either don't contain the delimiter, use an escape mechanism to include the delimiter, or have a known length so that you know where the message ends without having to depend on the delimiter.
If the messages are fixed size integers for example you'll know that EOT encountered within an integer is not a delimiter.