I have data that is stored on a local machine and periodically replicated to a server using web services. This data is critical to the application and is along the lines of business transactions:
TransactionHeader JOIN TransactionDetail
and so forth.
Should I be using some type of CRC checking when sending the data to the webservice or is this handled by the TCP protocol itself sufficiently?
EDIT: Just to be clear, the data isn't deleted from the client until the server acknowledges receipt, and I use strongly typed parameters in my web service. I am thinking more about "mangled" data (although in all cases except strings it should theoretically fail datatype casting).
Normally TCP does a fine job of transferring data intact, but if that data is business-critical you shouldn't leave integrity checking up to TCP alone; use a good hash function as well.
At the TCP level everything is reduced to byte strings. If you transferred a number and it was changed in transit by an error, but still happened to look like a valid number to the other side, datatype casting wouldn't catch it.
If the main problem you're dealing with is checking the transfer, then CRC32 or similar would work fine. If the hashes are also used to verify the data after it has been received and stored, a much stronger hash such as SHA-1 should be used.
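As a rough illustration of the transfer check described above, here is a minimal Go sketch that appends a CRC32 to a payload before sending and verifies it on receipt. The names seal and open and the 4-byte big-endian trailer layout are assumptions made for this example, not part of any web service API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// seal returns payload followed by a 4-byte big-endian CRC32 (IEEE) trailer.
// The trailer layout is a hypothetical convention for this sketch.
func seal(payload []byte) []byte {
	out := make([]byte, len(payload)+4)
	copy(out, payload)
	binary.BigEndian.PutUint32(out[len(payload):], crc32.ChecksumIEEE(payload))
	return out
}

// open checks the trailer and returns the payload, or an error if the
// message is too short or the checksum does not match.
func open(msg []byte) ([]byte, error) {
	if len(msg) < 4 {
		return nil, errors.New("message shorter than checksum trailer")
	}
	payload, trailer := msg[:len(msg)-4], msg[len(msg)-4:]
	if crc32.ChecksumIEEE(payload) != binary.BigEndian.Uint32(trailer) {
		return nil, errors.New("checksum mismatch")
	}
	return payload, nil
}

func main() {
	msg := seal([]byte("TransactionHeader 1001"))
	_, err := open(msg)
	fmt.Println("intact:", err)

	msg[0] ^= 0x01 // simulate a single flipped bit in transit
	_, err = open(msg)
	fmt.Println("corrupted:", err)
}
```

A CRC only guards against accidental corruption; it offers no protection against deliberate tampering.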
I am asking this question because of a very puzzling experience, described below.
I am instrumenting an HTTP API server to observe its behavior in the presence of latency between the server and the clients. I had a setup consisting of a single server and a dozen clients connected with a 10Gbps Ethernet fabric. I measured the time it took to serve certain API requests in 5 scenarios. In each scenario, I set the latency between the server and the clients to one of the following values using the tc-netem(8) utility: no latency (I call this the baseline), 25ms, 50ms, 250ms or 400ms.
Using histogram buckets to quantify the service time, I observed that all requests were processed in less than 50ms in every scenario. This clearly doesn't make sense: in the 400ms case, for example, the service time should be at least around 400ms (I am only measuring the duration from the moment the request hits the server to the moment the HTTP Write() function returns). Note that the response objects are between 1 KB and 10 KB in size.
Initially, I suspected that http.ResponseWriter's Write() function was asynchronous and returned immediately, before the data is received by the client. So I decided to test this hypothesis by writing a toy HTTP server that serves the content of a file generated using dd(1) and /dev/urandom, so that I could reconfigure the response size. Here is the server:
    package main

    import (
        "io/ioutil"
        "log"
        "net/http"
        "time"
    )

    // response holds the file contents served on every GET request.
    var response []byte

    func httpHandler(w http.ResponseWriter, r *http.Request) {
        switch r.Method {
        case "GET":
            // Time only the call to Write.
            now := time.Now()
            w.Write(response)
            elapsed := time.Since(now)
            mcs := float64(elapsed / time.Microsecond)
            s := elapsed.Seconds()
            log.Printf("Elapsed time in mcs: %v, sec:%v", mcs, s)
        }
    }

    func main() {
        response, _ = ioutil.ReadFile("BigFile")
        http.HandleFunc("/hd", httpHandler)
        http.ListenAndServe(":8089", nil)
    }
Then I start the server like this:
dd if=/dev/urandom of=BigFile bs=$VARIABLE_SIZE count=1 && ./server
From the client side, I issue time curl -X GET $SERVER_IP:8089/hd --output /dev/null
I tried many values of $VARIABLE_SIZE in the range [1 KB, 500 MB], using an emulated latency of 400ms between the server and each of the clients. To make a long story short, I noticed that the Write() method blocks until the data is sent when the response is big enough for the delay to be visible (on the order of tens of megabytes). However, when the response is small, the server reports a servicing time that is not plausible compared to the value reported by the client. For a 10 KB file, the client reports 1.6 seconds while the server reports 67 microseconds (which makes no sense at all: even as a human I noticed a delay on the order of a second, as reported by the client).
To go a little further, I tried to find the response size from which the server reports a plausible time. After many trials using binary search, I discovered that the server always reports a few microseconds [20us, 600us] for responses of 86501 bytes or less, and reports expected (acceptable) times for responses of 86502 bytes or more (usually half of the time reported by the client). As an example, for an 86501-byte response, the client reported 4 seconds while the server reported 365 microseconds. For 86502 bytes, the client reported 4s and the server reported 1.6s. I repeated this experiment many times using different servers, and the behavior is always the same. The number 86502 looks like magic!
This experiment explains the weird observations I initially had, because all the API responses were less than 10 KB in size. However, it raises a serious question: what on earth is happening, and how can this behavior be explained?
I've tried to search for answers but didn't find anything. The only thing I can think of is that it may be related to Linux's socket buffer size and to whether Go makes the system call in a non-blocking fashion. However, AFAIK, the TCP packets transporting the HTTP response should all be acknowledged by the receiver (the client) before the sender (the server) can return! Breaking this assumption (as seems to be the case here) can lead to disasters! Can someone please provide an explanation for this weird behavior?
Technical details:
Go version: 12
OS: Debian Buster
Arch: x86_64
I'd say the question is actually stated the wrong way: you seem to be guessing about how HTTP works instead of looking at the whole stack.
The first thing to consider is that HTTP (1.0 and 1.1, which has been the standard version for a long time) does not specify any means for either party to acknowledge data reception.
There is an implicit acknowledgement of the fact that the server received the client's request: the server is expected to respond to the request, and when it responds, the client can be reasonably sure the server actually received the request.
There is no such thing working in the other direction though: the server does not expect the client to somehow "report back" — on the HTTP level — that it had managed to read the whole server's response.
The second thing to consider is that HTTP is carried over TCP connections (or TLS, which is not really different as it uses TCP as well).
An oft-forgotten fact about TCP is that it has no message framing — that is, TCP performs bi-directional transfer of opaque byte streams.
TCP only guarantees total ordering of bytes in these streams; it does not in any way preserve any occasional "batching" which may naturally result from the way you work with TCP via a typical programming interface — by calling some sort of "write this set of bytes" function.
Another thing which is often forgotten about TCP is that while it indeed uses acknowledgements to track which part of the outgoing stream was actually received by the receiver, this is a protocol detail which is not exposed to the programming interface level (at least not in any common implementation of TCP I'm aware of).
These features mean that if one wants to use TCP for message-oriented data exchange, one needs to implement support for both message boundaries (so-called "framing") and acknowledgement of the reception of individual messages in the protocol above TCP.
HTTP is a protocol which is above TCP but while it implements framing, it does not implement explicit acknowledgement besides the server responding to the client, described above.
Now consider that most if not all TCP implementations employ buffering in various parts of the stack. At least, the data which is submitted by the program gets buffered, and the data which is read from the incoming TCP stream gets buffered, too.
Finally, consider that most commonly used TCP implementations let a program send data into an active TCP connection through a call that submits a chunk of bytes of arbitrary length.
Considering the buffering described above, such a call typically blocks until all the submitted data has been copied to the sending buffer.
If there's no room in the buffer, the call blocks until the TCP stack manages to stream some amount of data from that buffer into the connection, freeing some room to accept more data from the calling program.
What does all of the above mean for net/http.ResponseWriter.Write interacting with a typical contemporary TCP/IP stack?
A call to Write would eventually try to submit the specified data into the TCP/IP stack.
The stack would try to copy that data over into the sending buffer of the corresponding TCP connection — blocking until all the data manages to be copied.
After that you have essentially lost any control over what happens with that data: it may eventually be successfully delivered to the receiver, or it may fail completely, or some part of it might arrive and the rest might not.
What this means for you is that when net/http.ResponseWriter.Write blocks, it blocks on the sending buffer of the TCP socket underlying the HTTP connection you're operating on.
Note, though, that if the TCP/IP stack detects an irreparable problem with the connection underlying your HTTP request/response exchange (such as a segment with the RST flag coming from the remote side, meaning the connection has been unexpectedly torn down), this problem will bubble up through Go's HTTP stack as well, and Write will return a non-nil error.
In this case, you will know that the client was likely not able to receive the complete response.
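The buffering behaviour described above can be observed without HTTP at all. The following is a minimal sketch, assuming a loopback TCP connection and a hypothetical helper named writeToSilentPeer: the client side connects but never reads, yet a small Write on the server side still returns successfully because the data only has to fit into the kernel's buffers.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// writeToSilentPeer opens a loopback TCP connection whose client never
// reads, writes size bytes from the server side, and reports how many
// bytes Write accepted and how long the call took.
func writeToSilentPeer(size int) (int, time.Duration, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer client.Close() // the client never calls Read

	server, err := ln.Accept()
	if err != nil {
		return 0, 0, err
	}
	defer server.Close()

	start := time.Now()
	n, err := server.Write(make([]byte, size))
	return n, time.Since(start), err
}

func main() {
	n, elapsed, err := writeToSilentPeer(10 * 1024)
	fmt.Printf("wrote %d bytes in %v (err=%v); the peer has read nothing\n", n, elapsed, err)
}
```

The 10 KB size is chosen to stay well below typical default socket buffer sizes; with a much larger payload and a peer that never reads, the same call would block once the buffers fill up.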
I have a GUI application that sends/recv over tcp to a server.
Sometimes, we get junk data while doing a tcp recv from the server. While reading these nulls or invalid data, the client application crashes sometimes.
Is there a good way to validate this data, other than catching this exception?
I don't want the GUI application to crash because of bad data sent by the server.
TCP has a checksum that it uses to validate the data received; that is done by the operating system (or sometimes the network hardware, if you have nice hardware). So if the contents you receive are not correct, then with very high probability the data was already incorrect when the server sent it. I state that only because I'm not totally sure you were aware of this fact.
If you need to validate the data, you will have to validate the data. Write a function that parses your data, and returns a meaningful value only if there's meaningful data. Make your GUI aware of this.
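Since the question does not describe the wire format, the following Go sketch assumes a purely hypothetical "key=value" text message with an integer value. It only illustrates the shape of such a parsing function: reject anything unexpected and return an error the caller (here, the GUI) can handle instead of crashing.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// Update is a hypothetical parsed message; the real format is unknown.
type Update struct {
	Key   string
	Value int
}

// parseUpdate returns a meaningful Update only for well-formed input and an
// error for everything else (empty input, NUL bytes, bad UTF-8, bad fields).
func parseUpdate(raw []byte) (Update, error) {
	if len(raw) == 0 {
		return Update{}, errors.New("empty message")
	}
	if bytes.IndexByte(raw, 0) >= 0 || !utf8.Valid(raw) {
		return Update{}, errors.New("message contains NUL bytes or invalid UTF-8")
	}
	parts := strings.SplitN(strings.TrimSpace(string(raw)), "=", 2)
	if len(parts) != 2 || parts[0] == "" {
		return Update{}, fmt.Errorf("expected key=value, got %q", raw)
	}
	value, err := strconv.Atoi(parts[1])
	if err != nil {
		return Update{}, fmt.Errorf("bad value for %q: %v", parts[0], err)
	}
	return Update{Key: parts[0], Value: value}, nil
}

func main() {
	for _, raw := range [][]byte{[]byte("temp=21"), {0, 0, 0, 0}, []byte("temp=abc")} {
		u, err := parseUpdate(raw)
		fmt.Printf("%q -> %+v, err=%v\n", raw, u, err)
	}
}
```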
Your question is kind of self-answering... you can't say "I want to be fault-tolerant, but I don't want to care about faults" ("other than catching this exception"), and based on the lack of description of the data you'd expect, I'd say you don't really care about the form of the data.
I have a requirement to make legal documents available to mobile applications (e.g. Android, iPhone, etc.) via HTTP. Corruption can occur over HTTP (references: 1, 2). In my case it is imperative that the downloaded documents have not been corrupted during transmission.
One mechanism for ensuring integrity is to digitally sign the documents. This approach works well if the documents are xml, however the signing public key will need to be available and trusted by the client.
Another mechanism is to create and store a checksum of the document (e.g. MD5). The client can download the document and the checksum, and then use the checksum to verify the document.
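The checksum approach could look roughly like the Go sketch below. It uses SHA-256 rather than the MD5 mentioned above, and the function names digestHex and verifyDocument are made up for this illustration; how the client obtains the expected digest is left open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// digestHex returns the hex-encoded SHA-256 digest of a document. The server
// would compute and store this alongside the document.
func digestHex(doc []byte) string {
	sum := sha256.Sum256(doc)
	return hex.EncodeToString(sum[:])
}

// verifyDocument recomputes the digest of the downloaded bytes and compares
// it with the digest the client downloaded separately.
func verifyDocument(doc []byte, wantHex string) error {
	if got := digestHex(doc); !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("digest mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	original := []byte("contract v1")
	published := digestHex(original)

	fmt.Println("intact download:", verifyDocument(original, published))
	fmt.Println("damaged download:", verifyDocument([]byte("contract v2"), published))
}
```

A digest fetched over the same channel as the document detects accidental corruption only; detecting deliberate modification is what the digital signature option is for.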
Question 1: Are there any other alternative mechanisms for ensuring the integrity?
Question 2: Does http have any built in mechanisms for ensuring downloaded data has not been corrupted during download?
Question 3: What is the statistical likelihood of document corruption during download over HTTP (I would prefer this answer to be backed up by statistical data)?
As far as I know, HTTP itself does not have any built-in checksum mechanism and your suggestion would work for ensuring the data is valid. The thing is though, HTTP is generally implemented on the Transmission Control Protocol (TCP). TCP provides reliable communication between hosts.
Specifically, TCP itself implements error detection (using a checksum) and uses sequence numbers to ensure the data arrives in the order in which it was sent. If the sending host learns that the receiving host did not get the data, it will resend it.
If, however, the HTTP implementation on the device is actually running on top of the User Datagram Protocol (UDP), delivery isn't reliable. It is unlikely, though, that a device uses UDP for HTTP, or at least the unreliable variant (there is also a Reliable User Datagram Protocol).
Now, I couldn't find statistics or much information at all regarding corruption of an HTTP request. Depending on how mission-critical you deem this to be, treat it as something that will happen. There are mentions of downloaded files that end up being corrupt. While these mostly seem to relate to ZIP files, I wouldn't think it is due to HTTP but rather to other things in between, like the downloading device itself corrupting the information.
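For a sense of what the TCP checksum can and cannot catch, here is a small Go sketch of the 16-bit one's-complement sum from RFC 1071, the algorithm TCP's checksum is built on. It omits the pseudo-header that real TCP also covers, and the function name internetChecksum is just a label for this example.

```go
package main

import "fmt"

// internetChecksum computes the 16-bit one's-complement checksum described
// in RFC 1071 over data (odd-length input is padded with a zero byte).
func internetChecksum(data []byte) uint16 {
	var sum uint64
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint64(data[i])<<8 | uint64(data[i+1])
	}
	if len(data)%2 == 1 {
		sum += uint64(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = sum&0xffff + sum>>16
	}
	return ^uint16(sum)
}

func main() {
	a := []byte("ABCD")
	b := []byte("CDAB") // same 16-bit words, swapped
	c := []byte("ABCE") // one byte changed

	fmt.Printf("%q: %#04x\n", a, internetChecksum(a))
	fmt.Printf("%q: %#04x\n", b, internetChecksum(b))
	fmt.Printf("%q: %#04x\n", c, internetChecksum(c))
}
```

Because the sum is order-independent, swapping two 16-bit words leaves the checksum unchanged, which is one reason an application-level checksum is a reasonable extra safeguard for critical data.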
Perhaps in your scenario, it is best to add your checksum if it is absolutely critically important that your information arrives in one piece.
I have C++ code in which I use recv() from Berkeley sockets to receive data from a remote host. The issue is that I do not know the size of the data (which is variable), so I probably need some kind of timeout option to make this work.
Since I'm new to socket programming, I was wondering how, for example, a web client handles responses from a server (e.g. a server sends HTML data to the client). Does it use some kind of timeout, since it doesn't know how big the page is? Same with an FTP client.
When your data is of variable length, then typically that data is framed within another container. That is to say, there's a header preceding the actual data block that tells the receiver how much data it should accept.
For example, HTTP uses line breaks (CRLF) to delimit the header lines. If there's a variable-length message body, the header will include a "Content-Length:" field that indicates exactly how many bytes to read once the entire header has been received (the header ends when you read two consecutive line breaks, i.e. an empty line).
It is perfectly fine to read 4 bytes from socket, get how much data follows, then do another receive and read the rest. Only be careful, when you ask for 4 bytes, the socket might give you anywhere between 1-4 bytes so anything less than 4 means you need to go back and ask for remaining few bytes. This is a very common mistake. In dev environment you will almost always get 4 bytes when asking for 4, but once you deploy your app, somewhere on some machine you will get random crashes because their network behavior is somehow different.
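The read-the-length-then-the-rest pattern can be sketched in Go as follows. The 4-byte big-endian length prefix, the 1 MiB limit, and the function names are assumptions for this example. io.ReadFull does the "go back and ask for the remaining bytes" loop, and the demo reads through iotest.OneByteReader, which hands back one byte per Read call to mimic a socket returning short reads.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"testing/iotest"
)

// maxFrame caps the accepted payload size so a corrupt or hostile length
// prefix cannot trigger a huge allocation (the limit is arbitrary).
const maxFrame = 1 << 20

// writeFrame sends a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads exactly one frame. io.ReadFull keeps calling Read until
// the buffer is full, so short reads from the socket are handled correctly.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame of %d bytes exceeds limit of %d", n, maxFrame)
	}
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// roundTripOneByteAtATime frames payload and reads it back through a reader
// that returns a single byte per Read call.
func roundTripOneByteAtATime(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, payload); err != nil {
		return nil, err
	}
	return readFrame(iotest.OneByteReader(&buf))
}

func main() {
	msg, err := roundTripOneByteAtATime([]byte("hello"))
	fmt.Printf("%q %v\n", msg, err)
}
```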
Generally, it is a bad approach to rely on timeouts to determine when you have reached the end of the data. With a timeout, you might get things "reliably" working in a well-controlled dev environment, but it is a very flaky solution. Any CPU/disk/network hiccup might cause your app to stop receiving prematurely. You are also limiting your data throughput and responsiveness, since your app is sleeping for some time interval instead of doing work.
I'm attempting to create a simple webserver in C# in asynchronous socket programming style. The purpose is very narrow - a Comet server (http long-polling).
I've got the windows service running, accepting connections, dumping request info to the Console and returning simple fixed content to the client.
Now, I can't figure out a manageable strategy for parsing the request data asynchronously and safely. I've written synchronous LL(1) parsers before. I'm not sure if an LL(1) parser is appropriate or necessary for HTTP. I don't know how to tokenize the input stream asynchronously. All I can think of is having an input buffer per client, reading into that, then copying that to a StringBuilder and periodically checking to see if I have a complete request. But that seems inefficient and might lead to code that is difficult to debug and maintain.
Also, there are two phases to the connection: receiving the request in full and then sending a response (in this case, after some delay). Only once the request is validated and actionable am I planning to enroll the connection in the long-polling manager. However, a misbehaving client could continue to send data and fill up a buffer, so I think I need to continue to monitor and empty the input stream during the response phase, right?
Any guidance on this is appreciated.
I guess the first step is knowing whether it is possible to efficiently tokenize a network stream asynchronously and without a large intermediate buffer. Even without a proper parser, the same challenges of creating a tokenizer apply to reading "lines" of input at a time, or even reading until double blank lines (one big token). I don't want to read one byte at a time from the network, but neither do I want to read too many bytes and have to store them in some intermediate buffer, right?
For HTTP the best way is to read the headers into memory completely (until you receive \r\n\r\n), then simply split by \r\n to get the header lines, and split each header at the first : to separate name and value.
There's no need to use a complex parser for that.
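The split-based approach can be sketched in a few lines. The example below is in Go rather than C#, and parseHead is a hypothetical name; it assumes the caller has already buffered everything up to and including the terminating \r\n\r\n. It ignores repeated headers and obsolete line folding, so treat it as an illustration rather than a complete HTTP parser.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseHead splits a buffered request head into the request line and a map
// of header names (lower-cased, since they are case-insensitive) to values.
func parseHead(head string) (string, map[string]string, error) {
	lines := strings.Split(strings.TrimSuffix(head, "\r\n\r\n"), "\r\n")
	if lines[0] == "" {
		return "", nil, errors.New("missing request line")
	}
	headers := make(map[string]string, len(lines)-1)
	for _, line := range lines[1:] {
		// Split at the first colon only; values may contain colons.
		i := strings.IndexByte(line, ':')
		if i <= 0 {
			return "", nil, fmt.Errorf("malformed header line %q", line)
		}
		headers[strings.ToLower(line[:i])] = strings.TrimSpace(line[i+1:])
	}
	return lines[0], headers, nil
}

func main() {
	head := "GET /poll HTTP/1.1\r\nHost: example.com:8080\r\nContent-Length: 0\r\n\r\n"
	requestLine, headers, err := parseHead(head)
	fmt.Println(requestLine, headers, err)
}
```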