Is there anything in the FTP protocol like the HTTP Range header?

Suppose I want to transfer just a portion of a file over FTP - is it possible using a standard FTP protocol?
In HTTP I could use a Range header in the request to specify the data range of the remote resource. If it's a 1mb file, I could ask for the bytes from 600k to 700k.
Is there anything like that in FTP? I am reading the FTP RFC and don't see anything, but I want to make sure I'm not missing something.
There's a Restart command in FTP - would that work?
Addendum
After getting Brian Bondy's answer below, I wrote a read-only Stream class that wraps FTP. It supports Seek() and Read() operations on a resource that is read via FTP, based on the REST verb.
Find it at http://cheeso.members.winisp.net/srcview.aspx?dir=streams&file=FtpReadStream.cs
It's pretty slow to Seek(), because setting up the data socket takes a long time. Best results come when you wrap that stream in a BufferedStream.

Yes, you can use the REST command.
REST sets the point at which a subsequent file transfer should start. It is usually used for restarting interrupted transfers. The command must come immediately before a RETR or STOR, and therefore after a PORT or PASV.
From FTP's RFC 959:
RESTART (REST) The argument field represents the server marker at which file transfer is to be restarted. This command does not cause file transfer but skips over the file to the specified data checkpoint. This command shall be immediately followed by the appropriate FTP service command which shall cause file transfer to resume.
Read more:
http://www.faqs.org/rfcs/rfc959.html#ixzz0jZp8azux
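For example, with Python's ftplib (hypothetical host, credentials, and file name here), passing a rest offset to retrbinary makes the library send REST before the RETR:

from ftplib import FTP

# Hypothetical server/file/credentials, shown only to illustrate how the
# offset is passed; ftplib sends "REST <offset>" before the RETR for us.
ftp = FTP("ftp.example.com")
ftp.login("user", "password")

offset = 600 * 1024                       # start at byte 600k
with open("tail.bin", "wb") as out:
    ftp.retrbinary("RETR bigfile.bin", out.write, rest=offset)

ftp.quit()

Note that REST only sets the start point; standard FTP has no way to say "stop at 700k", so you either read to the end of the file or close the data connection early once you have the bytes you want (which is what the FtpReadStream class in the addendum does).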

You should check out how GridFTP does parallel transfers. It uses the sort of techniques you want (and it might actually be better to borrow that code rather than implement it from scratch yourself).

Related

Does multipart/form-data send the whole file data in one go or as a stream?

I have a requirement of uploading a large file over HTTP to a remote server.
I am researching on how to send the data using multipart/form-data.
I have gone through How does HTTP file upload work? and understood how it separates the file data using boundaries.
I wanted to know whether all the file data is sent in one go or is streamed with several requests to the remote server.
Because if it is sent in one go, it is not possible to read the whole data at the remote server and write it to a file.
But if it is streamed, how does the remote server parse the streamed data, write it to a file, and repeat until all the data has been received?
Sorry if it is a noob question; I am still researching this as well.
Maybe it is outside the scope of multipart/form-data and HTTP itself takes care of it.
Any help is appreciated.
The logistics of the sending are not relevant. What matters is the maximum request size configured on the server side. How it is set depends on the technology used there: IIS, Apache, nginx? If the browser's POST request exceeds that size (because of a too-large file), errors will happen. There is nothing on the browser side you can tweak or change to fix breaking uploads. Unless you are building your own browser :-)
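Either way, the upload is a single HTTP request whose body is written to the socket in pieces; the server reads that body as a stream and scans for the boundary as it goes. As a rough illustration from the client side (hypothetical URL and file name), the requests-toolbelt MultipartEncoder streams the file from disk instead of building the whole multipart body in memory:

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

# Hypothetical endpoint and field name; the encoder reads the file lazily,
# so one request is sent, but its body goes out to the socket in chunks.
encoder = MultipartEncoder(
    fields={"file": ("backup.tar", open("backup.tar", "rb"), "application/octet-stream")}
)
resp = requests.post(
    "https://example.com/upload",
    data=encoder,
    headers={"Content-Type": encoder.content_type},
)
print(resp.status_code)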

Can I negatively close a SFTP file transfer from the client side?

I am interacting with a SFTP server that deletes files once they are downloaded. To prevent data loss, I need to read the file, land it on persistent storage and then close the connection to indicate I have received it. What's not obvious to me is what happens if I can't safely store the file. Is there a way to close the connection in a way that semantically indicates 'I'm closing the connection and it failed'? All I am finding in the RFC is a SSH_FXP_CLOSE message, which seems to only signal successful transfer. All the other error message types appear to only be used when a server returns a response to a client, not the other way around.
You cannot signal an error to the SFTP server, that's not what SFTP is intended for.
For your particular case, simply closing an SFTP connection without closing the file explicitly (not sending the SSH_FXP_CLOSE) could indicate to the server that something went wrong.
Though it really depends on your server implementation what it considers an error. The documentation of the SFTP server should describe what it is that triggers the delete.
In SFTP there's nothing like a "download" operation (contrary to the FTP RETR command). There are only trivial file operations, like opening a file (for reading or writing), reading a piece of a file, or closing a file. So it is not as simple as "the server deletes the file after it is downloaded". The rule can say, for example, "the server deletes the file after it is closed, having previously been opened for reading", or something like that.
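A minimal sketch of that pattern with Python's paramiko (hypothetical host, credentials, and paths): the remote file handle is closed, i.e. SSH_FXP_CLOSE is sent, only after the data is safely on local disk; on failure the session is simply torn down, which the server should see as an aborted transfer rather than a completed download:

import os
import paramiko

# Hypothetical host/credentials/paths. remote_file.close() is what sends
# SSH_FXP_CLOSE; we only call it once the data has been fsync'd locally.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="password")
sftp = paramiko.SFTPClient.from_transport(transport)

remote_file = sftp.open("/outbound/data.csv", "rb")
try:
    with open("/var/incoming/data.csv.part", "wb") as local:
        for block in iter(lambda: remote_file.read(32768), b""):
            local.write(block)
        local.flush()
        os.fsync(local.fileno())
    os.replace("/var/incoming/data.csv.part", "/var/incoming/data.csv")
except Exception:
    transport.close()      # drop the session without sending SSH_FXP_CLOSE
    raise
else:
    remote_file.close()    # explicit close: the download completed
    sftp.close()
    transport.close()

Whether the server treats the dropped session as "do not delete" is still up to its implementation, as noted above.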

Optimizing file synchronization over HTTP

I'm attempting to synchronize a set of files over HTTP.
For the moment, I'm using HTTP PUT, and sending files that have been altered. However, this is very inefficient when synchronizing large files where the delta is very small.
I'd like to do something closer to what rsync does to transmit the deltas, but I'm wondering what the best approach to do this would be.
I know I could use an rsync library on both ends, and wrap their communication over HTTP, but this sounds more like an antipattern; tunneling a standalone protocol over HTTP. I'd like to do something that's more in line with how HTTP works, and not wrap binary data (except my files, duh) in an HTTP request/response.
I've also failed to find any relevant/useful functionality already implemented in WebDAV.
I have total control over the client and server implementation, since this is a desktop-ish application (meaning "I don't need to worry about browser compatibility").
The HTTP PATCH recommended in a comment requires the client to keep track of local changes. You may not be able to do that due to the size of the file.
Alternatively, you could treat "chunks" of the huge file as resources: depending on the nature of the changes and the content of the file, a chunk could be a range of bytes, a chapter, whatever.
The client could query the hash of all chunks, calculate the same for the local version, and PUT only the changed ones.
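A rough sketch of that idea in Python (the manifest and per-chunk endpoints are assumptions, not an existing API): hash fixed-size chunks locally, fetch the server's hash list, and PUT only the chunks that differ:

import hashlib
import requests

CHUNK = 4 * 1024 * 1024   # fixed 4 MiB chunks; the size is arbitrary here

def local_hashes(path):
    # One SHA-256 digest per fixed-size chunk, in order.
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

# Hypothetical endpoints: GET returns the server's list of chunk hashes,
# PUT /<i> replaces chunk number i.
base = "https://example.com/files/backup.img/chunks"
remote = requests.get(base).json()
local = local_hashes("backup.img")

with open("backup.img", "rb") as f:
    for i, digest in enumerate(local):
        if i >= len(remote) or remote[i] != digest:
            f.seek(i * CHUNK)
            requests.put(f"{base}/{i}", data=f.read(CHUNK))

Rolling hashes (as rsync uses) cope better with insertions that shift later bytes, but fixed-size chunks keep the resource model simple.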

gawk to read last bit of binary data over a pipe without timeout?

I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible a run of the program might make between 200-2000 calls to the same API service.
I've just discovered that gawk can do networking and found geturl.
However, the advice at the bottom of that page is well heeded: I can't find an easy way to read the last line and keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected content-length. This might break with any trailing white space, though. I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con="/inet/tcp/0/host/80";
send_http_request(con);
RS="\r\n";
read_headers();
# now read the body - but do not close the connection...
RS="}"; # for JSON
while ( con |& getline bytes ) {
body = body bytes RS;
if (length(body) >= content_length) break;
print length(body);
}
# Do not close con here - keep open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ..
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing them to wget will not be easy.
Re-implementing in Perl/Python etc. is not a quick solution.
I've looked at trying to pipe URLs to a named pipe and into wget -i -, but that doesn't work: data gets buffered and unbuffer is not available - also, I think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.
The problem with connection reuse comes from the HTTP 1.0 standard, not gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solutions for HTTP 1.0. Don't forget to add the Host: header in your HTTP/1.1 request, as it is mandatory.
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script locks up waiting for more data when it shouldn't, the server will, again, close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like which reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
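A minimal sketch of such a retriever in Python (the host name and separator character are assumptions): it reads one URL path per line on stdin, fetches each over a single persistent HTTP/1.1 connection, and prints the body followed by a sentinel that the awk script can use as RS when driving it with |&:

#!/usr/bin/env python3
import sys
import http.client

HOST = "api.example.com"          # hypothetical API host
SENTINEL = "\x1e"                 # ASCII record separator, used as RS in awk

conn = http.client.HTTPConnection(HOST)   # HTTP/1.1; the mandatory Host header is added for us
for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()            # body must be fully drained so the socket can be reused
    sys.stdout.write(body.decode("utf-8", "replace"))
    sys.stdout.write(SENTINEL + "\n")
    sys.stdout.flush()

A real version would also need to reconnect if the server drops the connection, but the record-separator idea is the part that matters for the awk side.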

With NodeJS, what's the best way to parse a file upload that does not necessarily end?

Short summary: How to accept content that may be endless and not uploaded at once (the connection needs to be kept alive), in a scenario where I'm the server and I'd like the clients to make those uploads in a RESTful (or something close) way ?
In the same way that I can make an HTTP server that keeps the connection alive with a client and may continue sending content that the client reads and parses instantly (probably using a browser), I need to keep a connection open with a client that will send me data that may not end or may be uploaded continuously.
One (simple) way to do this would be simply to have a TCP server and then clients would write data to a socket.
But how do I do this with an HTTP PUT request? This answers half of the question: "How will I parse a file upload continuously, without the upload finishing?" But how will clients proceed to upload something that is not even a file, but rather separate blocks of data, as if they were writing those blocks to a socket? Is it even possible?
If your data isn't going to have a discrete end, then you're not really performing an upload; you're doing a streaming scenario. For a streaming scenario, socket handling is much more appropriate.
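At the wire level, the client side of such a never-ending PUT is just chunked transfer encoding. A rough sketch in Python (hypothetical URL; requests switches to chunked encoding when the body is a generator), only to show what "writing blocks to a socket" looks like as an HTTP request:

import time
import requests

# Hypothetical endpoint. Because the body is a generator, requests sends it
# with Transfer-Encoding: chunked, so the request body can go on indefinitely.
def blocks():
    while True:
        yield b"another block of data\n"
        time.sleep(1)

requests.put("http://example.com/stream", data=blocks())   # never returns, by design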
First, I think Sonier is right. But I found this solution by Felix called "Streaming file uploads with node.js", which might be useful.
Furthermore, I think node.js might not be the best fit for this, because everything has to be kept in memory, and with big file sizes you can hit a very hard wall. Some other popular node.js file upload solutions are:
https://github.com/felixge/node-formidable
https://github.com/rootslab/formaline
https://github.com/FooBarWidget/multipart-parser
