Optimizing file synchronization over HTTP


I'm attempting to synchronize a set of files over HTTP.
For the moment, I'm using HTTP PUT, and sending files that have been altered. However, this is very inefficient when synchronizing large files where the delta is very small.
I'd like to do something closer to what rsync does to transmit the deltas, but I'm wondering what the best approach to do this would be.
I know I could use an rsync library on both ends, and wrap their communication over HTTP, but this sounds more like an antipattern; tunneling a standalone protocol over HTTP. I'd like to do something that's more in line with how HTTP works, and not wrap binary data (except my files, duh) in an HTTP request/response.
I've also failed to find any relevant/useful functionality already implemented in WebDAV.
I have total control over the client and server implementation, since this is a desktop-ish application (meaning "I don't need to worry about browser compatibility").

The HTTP PATCH recommended in a comment requires the client to keep track of local changes. You may not be able to do that due to the size of the file.
Alternatively you could treat "chunks" of the huge file as resources: depending on the nature of the changes and the content of the file it could be by bytes, chapters, whatever.
The client could query the hash of all chunks, calculate the same for the local version, and PUT only the changed ones.
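A minimal sketch of that chunk comparison, assuming fixed-size chunks and SHA-256 hashes. The names `chunk_hashes` and `changed_chunks`, the chunk size, and the idea that the server exposes its list of chunk hashes are illustrative assumptions, not part of any existing API:

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; tune for your files


def chunk_hashes(stream, chunk_size=CHUNK_SIZE):
    """Return the SHA-256 hex digest of each fixed-size chunk of a binary stream."""
    hashes = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes


def changed_chunks(local_hashes, remote_hashes):
    """Return the indices of local chunks that differ from, or are missing on, the server."""
    return [
        i
        for i, digest in enumerate(local_hashes)
        if i >= len(remote_hashes) or digest != remote_hashes[i]
    ]


# Usage: the client would PUT only the chunks whose indices are returned.
remote = chunk_hashes(io.BytesIO(b"a" * 10 + b"b" * 10), chunk_size=10)
local = chunk_hashes(io.BytesIO(b"a" * 10 + b"X" * 10), chunk_size=10)
print(changed_chunks(local, remote))  # [1]
```

Fixed-size chunks only pay off when changes happen in place; an insertion near the start of the file shifts every later chunk, which is the case rsync's rolling checksum is designed for.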

Related

Does HTTP multipart/form-data provide reliability guarantees?

I have a web application with a React front end and a Flask back end. In this web app, I upload large CSV files from the client to the server via HTTP multipart/form-data. To achieve this, I take the file information in a <form encType='multipart/form-data'> element with an <input type='file'>. Then I use axios.post to make a POST request to the server.
On the flask server side, I access the file using request.files['file'] and save the file using file.save. This works as expected. The file is transferred successfully.
I'm thinking of computing an MD5 checksum on both the client and the server to make sure that both sides have files with the same MD5 hash. However, this requires reading the file from disk in chunks to compute the MD5 (since I'm dealing with large files, it is not possible to load the entire file into memory), so I think this is a little inefficient. I want to know whether a transfer via HTTP multipart/form-data provides a reliability guarantee. If so, I can skip the MD5 verification, right?
If reliability is not guaranteed, is there a good approach to make sure that both sides have the exact same copy of the file? Thanks in advance.

HTTP integrity is only as reliable as the underlying transport protocol, be it TCP (HTTP/1 and HTTP/2) or QUIC over UDP (HTTP/3). Bits can flip in transit and still yield a valid transport checksum. This does happen.
If you want to make absolutely sure that you've received the same file as the uploader intended, you need to add a checksum yourself, using for example SHA or MD5.
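Computing such a checksum does not require loading the whole file into memory. A minimal sketch using Python's `hashlib`, where the helper name `file_sha256` and the 1 MiB read size are assumptions chosen for illustration:

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file incrementally, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The server could run the same function on the saved upload and compare the result with a digest the client sends alongside the file, for example in an extra form field.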

Is it possible to transfer a file through CoAP?

I am working on a project in which I am trying to transfer a JSON file to a CoAP server. I put some random values in key:value pairs such as:
{
key1: value1,
key2: [value21, value22, value23]
}
Questions:
CoAP is quite similar to HTTP. So, as with HTTP, is it possible to transfer a JSON file through CoAP using the POST/PUT method? If it is possible, what is the recommended directory on the server (resource directory) to put the uploaded file into?
Update:
The actual file size is about 152.8 kB.

You can transfer arbitrary JSON files using CoAP POST/PUT. Which directory is writable depends entirely on the server.
Note that for a file of that size, transfer times would be considerably longer than with HTTP, as packets are sent in lock-step (put the first 1 kB, wait for the response, put the next 1 kB, and so on), whereas HTTP benefits from a TCP window.
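To get a feel for the lock-step cost, here is a rough back-of-the-envelope estimate. The 1024-byte block size matches the "1 kB" above; the 50 ms round-trip time is purely an assumed value, and retransmissions are ignored:

```python
import math


def lockstep_transfer_ms(payload_bytes, block_bytes, rtt_ms):
    """Estimate a lock-step transfer: one round trip per block, no losses."""
    blocks = math.ceil(payload_bytes / block_bytes)
    return blocks, blocks * rtt_ms


# 152.8 kB payload, 1024-byte blocks, assumed 50 ms round trip
blocks, total_ms = lockstep_transfer_ms(152_800, 1024, 50)
print(blocks, total_ms)  # 150 7500
```

Under these assumptions the transfer needs 150 round trips, so the round-trip time rather than the bandwidth dominates the total.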

For a first shot, you may try out eclipse/californium's "simple-fileserver-example".
cf-simple-fileserver
It supports reads (GET) and uses the Block2 option for that.
If you go deeper and leave the laboratory, RFC 7959 blockwise transfers may face several issues:
CoAP usually assumes that endpoints are identified by their IP address (and port). Because a blockwise transfer may last longer, that assumption may break. If the client faces such an address change, the Block2 option (GET) may still work, but the Block1 option (PUT) would require special preparation.
Because such a blockwise transfer tends to last longer, it may get paused by temporary transmission issues. That then requires a "resume or fail" strategy. Here too, GET is much easier than PUT.
Basic transmission issues on crashes. In my experience, blockwise transfers involve many blocks, so many MIDs (message IDs) are in use within a short period of time. If a client crashes and selects a random MID on startup, the probability of an unnoticed MID clash is rather high. Depending on the CoAP server's deduplication implementation (strict according to RFC 7252, or more advanced and aware of this problem), your client may need a strategy to escape the situation where the server retransmits unrelated messages based on MIDs alone. My experience from that time was: "analyse what you get; if it smells, wait for the 247 s :-)". Your client may also save the last used MID to overcome that, or use a special, separate "blockwise endpoint" with deduplication disabled.
IP (intellectual property). FMPOV, some have seen the issues left to the implementation and started to file patents. That may require attention as well.
All together: if you use blockwise for payloads of occasionally a few kilobytes, my experience is not that bad. But if you regularly transfer more, CoAP may not be the right choice.

Does SPDY/HTTP2 concatenate responses?

I have a question about SPDY/HTTP2:
Normally you concatenate multiple CSS and JS files into one file to save requests and to get a better performance. I heard that SPDY/HTTP2 combines multiple requests into a single response. Would that mean that I don't need to pre-concatenate CSS and JS files anymore, because this is handled by the protocol?
To say it in other words:
Can I use <script src="moduleA.js"></script> and <script src="moduleB.js"></script> with SPDY/HTTP2 in the same way as I would use <script src="allScripts.js"></script> with HTTP1? Is this the same from a response performance point of view, but with the benefit of caching each file on its own, so that I can change moduleB.js and keep moduleA.js cached?

HTTP/2.0 does not (AFAIK) exist yet - it's still a proposed standard. But it seems likely that it will use similar connection handling to SPDY.
SPDY doesn't concatenate them; it multiplexes the requests across the same connection. From the network's point of view the effect is the same.
Yes, you don't need to merge the content files by hand, and yes, they will be cached independently.

SPDY3 and HTTP2 multiplex requests on the same physical connection.
But even multiplexed, requests may be sent sequentially for each resource, causing major slowdowns due to roundtrip time waits.
Both SPDY3 and HTTP2 have a feature called "Resource Push" (also known as "SPDY Push", not to be confused with "Server Push") that allows related resources to be pushed without the client requesting them, and the Jetty project - I am a committer - is the only one to my knowledge that implements that feature.
You can watch Resource Push in action in this video: http://webtide.intalio.com/2012/10/spdy-push-demo-from-javaone-2012/.
With Resource Push, you save the additional roundtrips needed to get all the different JS files and still benefit from the browser caching each file individually.
The whole point of resource concatenation is exactly to reduce the number of roundtrips necessary to get all the resources needed, and Resource Push helps to solve that problem.

HTTP/2.0 allows for multiplexing, where multiple request/response streams exchange data over the same TCP connection.
Because creating and starting TCP connections is expensive, HTTP/2.0's multiplexing will usually be faster than the semi-parallel downloading of HTTP/1.1, where a limited number of TCP connections is (re)used by the browser to perform a given number of requests for resources.
But your mileage may vary. Measure it.
As a sidenote, you might want to reference all your libraries separately when developing and debugging, but bundle and minify the JS/CSS into one file upon a deploy.

Is there anything in the FTP protocol like the HTTP Range header?

Suppose I want to transfer just a portion of a file over FTP - is it possible using a standard FTP protocol?
In HTTP I could use a Range header in the request to specify the data range of the remote resource. If it's a 1 MB file, I could ask for the bytes from 600k to 700k.
Is there anything like that in FTP? I am reading the FTP RFC and don't see anything, but I want to make sure I'm not missing anything.
There's a Restart command in FTP - would that work?
Addendum
After getting Brian Bondy's answer below, I wrote a read-only Stream class that wraps FTP. It supports Seek() and Read() operations on a resource that is read via FTP, based on the REST verb.
Find it at http://cheeso.members.winisp.net/srcview.aspx?dir=streams&file=FtpReadStream.cs
It's pretty slow to Seek(), because setting up the data socket takes a long time. Best results come when you wrap that stream in a BufferedStream.

Yes you can use the REST command.
REST sets the point at which a subsequent file transfer should start. It is usually used for restarting interrupted transfers. The command must come right before a RETR or STOR, and therefore after a PORT or PASV.
From FTP's RFC 959:
RESTART (REST) The argument field represents the server marker at which file transfer is to be restarted. This command does not cause file transfer but skips over the file to the specified data checkpoint. This command shall be immediately followed by the appropriate FTP service command which shall cause file transfer to resume.
Read more:
http://www.faqs.org/rfcs/rfc959.html
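As a rough illustration, Python's standard `ftplib` exposes REST through the `rest` argument of `transfercmd`. The sketch below assumes the server accepts REST with a byte offset in binary stream mode (common, but not guaranteed); `fetch_range` is a hypothetical helper name. FTP has no "end" offset, so the client simply stops reading and closes the data connection once it has enough bytes:

```python
import ftplib


def fetch_range(ftp, path, start, length, bufsize=8192):
    """Return up to `length` bytes of `path`, starting at byte offset `start`."""
    ftp.voidcmd("TYPE I")  # binary mode, so REST counts bytes
    # transfercmd sends "REST <start>" and then "RETR <path>",
    # and returns the data connection socket.
    conn = ftp.transfercmd(f"RETR {path}", rest=start)
    chunks = []
    remaining = length
    try:
        while remaining > 0:
            data = conn.recv(min(bufsize, remaining))
            if not data:  # end of file reached before `length` bytes
                break
            chunks.append(data)
            remaining -= len(data)
    finally:
        conn.close()  # closing early cuts the transfer short
    try:
        ftp.voidresp()  # collect the final reply on the control connection
    except ftplib.error_temp:
        pass  # e.g. "426 Connection closed; transfer aborted" is expected here
    return b"".join(chunks)
```

With a connected and logged-in `ftplib.FTP` instance `ftp`, `fetch_range(ftp, "big.bin", 600_000, 100_000)` would request roughly the 600k to 700k slice from the question; the file name is made up.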

You should check out how GridFTP does parallel transfers. That's using the sort of techniques that you want (and might actually be code that it is better to borrow rather than implementing from scratch yourself).

Understanding REST: REST as a high volume transport?

I'm designing a system that will need to move multi-GB backup images over TCP, and I'm looking at REST as an alternative to ONC RPC.
For example, I might have
POST http://site/backups/image1
where image1 is a 50 GB file whose data is contained in the HTTP body.
My question: is this within the scope of what REST is meant for? Is it inappropriate to move massive files over HTTP? My preliminary testing shows that the performance isn't too bad, and I like the clean, debuggable protocol, as opposed to a custom ONC RPC server. But is this overloading the role of a webserver?
Thanks,
-Steve

HTTP has about the same overheads as FTP.
An HTTP server is often asked to do more work than an FTP server. But otherwise, using HTTP to send a large file is about the same as using FTP.
The only consideration is making sure your web server and web application framework are configured to do this kind of thing without needlessly expanding the entire 50 GB file inside Apache.

Steve,
HTTP has a look-before-you-leap 'feature' that allows the client to ask the server whether it will accept the data submission before it actually sends out the data. I'd look into using this to avoid transferring GBs of data only to find out that the server is currently not willing to handle them. Look at the HTTP Expect header and 100 Continue status codes.
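To make the exchange concrete, here is a minimal raw-socket sketch of the client side. `put_with_expect` and `_read_status` are hypothetical helpers written for illustration; the sketch assumes a plain HTTP/1.1 server, takes the body as bytes for brevity (a real backup client would stream it from disk), and waits for the interim response, whereas a production client would normally also send the body after a short timeout if no interim response arrives:

```python
import socket


def put_with_expect(host, port, path, body, timeout=5.0):
    """Send headers with Expect: 100-continue; send body only after 100 Continue.

    Returns the final HTTP status code. If the server answers the headers
    with anything other than 100, the body is never transmitted.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        head = (
            f"PUT {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Expect: 100-continue\r\n"
            "Connection: close\r\n"
            "\r\n"
        )
        sock.sendall(head.encode("ascii"))
        reader = sock.makefile("rb")
        status = _read_status(reader)
        if status != 100:
            return status  # server declined before any payload was sent
        sock.sendall(body)
        return _read_status(reader)


def _read_status(reader):
    """Parse one status line and skip the header lines that follow it."""
    status = int(reader.readline().split()[1])
    while reader.readline() not in (b"\r\n", b"\n", b""):
        pass
    return status
```

If the server is not willing to take the upload (for example because storage is full), it can reply with a 4xx or 5xx status to the headers alone, and the gigabytes of body data never leave the client.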
Also, you can use FTP within a RESTful approach, IOW, think along the lines of
<backup-store href="ftp://example.org/site/backup/images/"/>
and make your clients understand the ftp URI scheme.
Finally, the T in HTTP means transfer and not transport. This is an important distinction, because the former is an application semantic (HTTP is an application protocol) and the latter is not.
HTH,
Jan

REST has nothing to do with how large your data is or which method you use to transport it.
