Efficiently handling HTTP uploads of many large files in Go - http

There is probably an answer within reach, but most of the search results are for "handling large file uploads", where the user does not know what they're doing, or for "handling many uploads", where the answer is invariably just an explanation of how to work with multipart requests and/or Flash uploader widgets.
I haven't had time to sift through Go's HTTP implementation yet, but when does the application get its first chance to see the incoming body? Not until it has been completely received?
If I were to [poorly] decide to use HTTP to transfer a large amount of data and posted a single request with several 10-gigabyte parts, would I have to wait for the whole thing to be received before processing it, or does the io.Reader for the body let me process it iteratively as it arrives?
This is only tangentially related, but I also haven't been able to get a clear answer about whether I can forcibly close the connection partway through, or whether, even if I close it, the data will just keep arriving on the port.
Thanks so much.

An application's handler is called after the headers are parsed and before the request body is read. The handler can read the request body as soon as the handler is called. The server does not buffer the entire request body.
An application can read file uploads without buffering the entire request by getting a multipart reader and iterating through the parts.
An application can replace the request body with a MaxBytesReader to force close the connection after a specified limit is breached.
The above comments are about the net/http server included in the standard library. The comments may not apply to other servers.
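As a concrete illustration of those points, here is a minimal sketch of a streaming upload handler using only the standard library; the route, size limit, and destination directory are illustrative, not anything prescribed by net/http:

    package main

    import (
        "io"
        "net/http"
        "os"
        "path/filepath"
    )

    func uploadHandler(w http.ResponseWriter, r *http.Request) {
        // Optional: cap the request size; reading past the limit causes the
        // server to stop reading and close the connection.
        r.Body = http.MaxBytesReader(w, r.Body, 10<<30) // 10 GiB, illustrative

        // MultipartReader streams the body part by part instead of buffering
        // the whole upload the way ParseMultipartForm would.
        mr, err := r.MultipartReader()
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        for {
            part, err := mr.NextPart()
            if err == io.EOF {
                break
            }
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            if part.FileName() == "" {
                continue // not a file field
            }
            dst, err := os.Create(filepath.Join(os.TempDir(), filepath.Base(part.FileName())))
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            // Each part is consumed as it arrives on the wire; nothing is
            // held in memory beyond io.Copy's small buffer.
            _, err = io.Copy(dst, part)
            dst.Close()
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/upload", uploadHandler)
        http.ListenAndServe(":8080", nil)
    }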

While I haven't done this with GB-size files, my strategy with file processing (mostly stuff I read from and write to S3) is to use https://golang.org/pkg/os/exec/ with a command-line utility that handles chunking the way you like, then read and process the data by tailing the file as explained here: Reading log files as they're updated in Go
In my situation, network utilities can download the data far faster than my code can process it, so it makes sense to send it to disk and pick it up as fast as I can; that way I'm not holding a connection open while I process.
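A rough sketch of that approach, assuming curl is available on the box and that processing is just a line-by-line read of the growing file (the URL, path, and polling interval are all illustrative):

    package main

    import (
        "bufio"
        "errors"
        "io"
        "log"
        "os"
        "os/exec"
        "time"
    )

    func main() {
        const path = "/tmp/dataset.csv" // illustrative path and URL

        // Let an external utility handle the download and write straight to disk.
        cmd := exec.Command("curl", "-sSf", "-o", path, "https://example.com/dataset.csv")
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }
        done := make(chan error, 1)
        go func() { done <- cmd.Wait() }()

        // Wait for the file to appear, then tail it while it grows.
        var f *os.File
        for {
            var err error
            if f, err = os.Open(path); err == nil {
                break
            }
            time.Sleep(100 * time.Millisecond)
        }
        defer f.Close()

        r := bufio.NewReader(f)
        var partial string // carries an incomplete trailing line between polls
        finished := false
        for {
            chunk, err := r.ReadString('\n')
            if err == nil {
                log.Printf("process: %q", partial+chunk) // placeholder for real work
                partial = ""
                continue
            }
            if !errors.Is(err, io.EOF) {
                log.Fatal(err)
            }
            partial += chunk
            if finished {
                if partial != "" {
                    log.Printf("process: %q", partial)
                }
                break // download complete and file fully read
            }
            select {
            case err := <-done:
                if err != nil {
                    log.Fatal(err)
                }
                finished = true // curl exited; drain whatever is left
            case <-time.After(200 * time.Millisecond):
            }
        }
    }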

Related

upload file api with uploadtask in symfony 2.8

We realized that if we want to produce a multipart request containing a 15 GB video file, it is impossible to allocate in memory the space needed for such a large amount of data; most devices have only 2 or 3 GB of RAM.
It is therefore necessary to switch to the uploadTask method, which pushes the file to the server in blocks no larger than the maximum size allowed for the packets sent to the server.
This is a POST method. However, it does not carry parameters such as the folder id or the file name, so you need another way to transmit them. The best way is to encode them in the URL.
I proposed an encoding format in the form of a path after the API endpoint, but we could also simply encode these two parameters in the classic way in the URL, e.g.:
/api/upload?id=123&filename=video.mp4
From what I read on Stack Overflow, it is trivial with Symfony to retrieve id and filename. Then all the data received in the body of the POST request can be written raw, directly to a file, without passing through a buffer in server-side memory.
The user data absolutely must be streamed, on both the mobile side and the server side, for uploads as well as downloads. Loading user content into memory is also very dangerous from a security standpoint.
In Symfony, how can I do that?
This goes way beyond Symfony and depends on the web server you are using.
By default with Apache/nginx and PHP you will receive an already-buffered request, so you cannot stream it to a file.
However, there are solutions; for example, with Apache you can stream requests, see http://hc.apache.org/httpclient-3.x/performance.html#Request_Response_entity_streaming
Probably nginx also has options for it, but I don't know about those.
Another option might be websockets, see http://en.wikipedia.org/wiki/WebSocket
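The question is about Symfony/PHP, but as a language-neutral illustration of the goal (write the raw POST body straight to a file without buffering it), here is a minimal sketch in Go, the language used elsewhere on this page; the route and destination directory are illustrative:

    package main

    import (
        "io"
        "net/http"
        "os"
        "path/filepath"
    )

    func main() {
        // Matches the URL style from the question: /api/upload?id=123&filename=video.mp4
        http.HandleFunc("/api/upload", func(w http.ResponseWriter, r *http.Request) {
            id := r.URL.Query().Get("id")
            name := filepath.Base(r.URL.Query().Get("filename")) // Base guards against path traversal
            if id == "" || name == "" || name == "." {
                http.Error(w, "missing id or filename", http.StatusBadRequest)
                return
            }
            dst, err := os.Create(filepath.Join(os.TempDir(), id+"_"+name))
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            defer dst.Close()
            // io.Copy moves the body to disk in small chunks; the upload is
            // never held in memory as a whole.
            if _, err := io.Copy(dst, r.Body); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusCreated)
        })
        http.ListenAndServe(":8080", nil)
    }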

Golang HTTP and file caching

I have an application, written in Go, which runs an HTTP server and uses http.ServeFile() to serve a file that is being updated 3 times per second; this is the index file for an HTTP Live Stream of audio which I need to operate at near-zero latency, hence the frequent updates. I can see from the logging in my Go server application that this file really is being updated 3 times per second, and I call Sync() on the file each time it is updated to make sure that it is written to disk.
My problem is that, on the browser side (Chrome), while this file is being requested several times per second, it is only actually being served once a second; on all the other occasions the server is returning 304, indicating that the file is unchanged:
What might be causing this behaviour and how could I make the file be served on each request?
As stated in the comments, it turns out that modification-time checking in HTTP has a minimum resolution of 1 second, so where a file needs to be changed and served more frequently than that, it's best to serve it oneself from RAM. For instance, store it in a slice called content and serve that slice with something like:
http.ServeContent(w, r, filepath.Base(r.URL.Path), time.Time{}, bytes.NewReader(content))
Modification time checking in HTTP only has resolution to the second. However, the alternative is to use entity-tags ('etags'), which can be updated as often as the server needs to change the content.
Therefore your use-case would work better via etags instead of modification times. An etag contains an opaque string that either does or doesn't match.
From https://www.rfc-editor.org/rfc/rfc7232#section-2.3,
An entity-tag is an opaque validator for differentiating between multiple representations of the same resource, regardless of whether those multiple representations are due to resource state changes over time, content negotiation resulting in multiple representations being valid at the same time, or both.
Detection of change by the client is usually done using the If-None-Match header (https://www.rfc-editor.org/rfc/rfc7232#section-3.2):
If-None-Match is primarily used in conditional GET requests to enable efficient updates of cached information with a minimum amount of transaction overhead.
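Putting the two answers together, here is a small sketch of serving the in-memory index with an ETag so clients revalidate correctly even when the content changes several times per second; the hash choice, route, and type names are illustrative (http.ServeContent answers If-None-Match itself once the ETag header is set on the response):

    package main

    import (
        "bytes"
        "crypto/sha1"
        "fmt"
        "net/http"
        "sync"
        "time"
    )

    // playlist holds the latest index file in RAM; update() is called by
    // whatever code regenerates it (3 times per second in the question).
    type playlist struct {
        mu      sync.RWMutex
        content []byte
        etag    string
    }

    func (p *playlist) update(b []byte) {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.content = b
        p.etag = fmt.Sprintf(`"%x"`, sha1.Sum(b)) // changes whenever the bytes change
    }

    func (p *playlist) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        p.mu.RLock()
        content, etag := p.content, p.etag
        p.mu.RUnlock()

        // With an ETag set, ServeContent handles If-None-Match itself, and the
        // zero modtime disables the 1-second Last-Modified check entirely.
        w.Header().Set("ETag", etag)
        w.Header().Set("Cache-Control", "no-cache")
        http.ServeContent(w, r, "index.m3u8", time.Time{}, bytes.NewReader(content))
    }

    func main() {
        p := &playlist{}
        p.update([]byte("#EXTM3U\n"))
        http.Handle("/live/index.m3u8", p)
        http.ListenAndServe(":8080", nil)
    }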

Can a web server begin responding before the client has sent the full request?

I am writing a web application for an academic research group. The researchers need to be able to upload large data sets (100MB - 1GB) in CSV format. I've written the server to process the data as it comes in. This means that if there is an error in the first row of the CSV, we can return an error straight away.
However, when this happens, the browser reports that "The connection was reset" or similar. Clearly, my web server is responding in a way that doesn't make sense.
If I explicitly close the HTTP request stream (this is Kotlin on the JVM by the way) before returning the error to the browser, then the problem goes away. However, it turns out that the close implementation of the request stream first goes and reads the whole stream to its end. So at that point the user still has to wait 30mins+ to find out that there is an error in the first row of their CSV.
Is what I am trying to do possible? Does the HTTP protocol permit a web server, in any circumstances, to begin responding before the full request body has been sent? If not, can you suggest a workaround that would allow me to deliver a user experience where the user doesn't have to wait for the whole file to be uploaded before finding out if there are any problems?
The answer is yes: according to the HTTP spec, servers may send a response early, and the client should then stop sending the request body. Most browsers, however, don't implement this correctly.
In theory, your HTTP server needs to return a 4xx error code with a response body, then reset the connection to prevent the upload from continuing in the background. See the answers below for a more detailed description of the issue. There are a couple of browser versions that do support this, so if you're doing this in lab conditions where you can control the client being used, the links below will help.
https://stackoverflow.com/a/14483857/2274303
https://stackoverflow.com/a/18370751/2274303
[edit]
To answer your question about using a workaround, chunking the uploads using javascript is a good way to mitigate internet connectivity issues, but if you want to parse it in real time it's not as simple as arbitrarily breaking up the file into pieces. You need to make sure you're not splitting the file in the middle of a line, otherwise it will fail even if the data is valid. That brings up the issue of parsing a 1GB file in javascript, which isn't a good idea imo.
If you want to use javascript, continue uploading the entire file at once via an ajax request, so you can get the response outside of the main dom and force a redirect or cancel the upload. Depending on which js libraries you're using there are different ways of doing this.
None of this solves the reverse scenario. What if the file is 95% uploaded before there's an error? The researcher will need to either upload the whole thing again or edit the file to only include the rows from the error going forward. That means your application needs to support partial uploads and know to pick up where it left off. All these things are possible, but you're probably not going to find a simple workaround to get this working well.
Without understanding the dataset and what kind of validation you are doing it's hard to come up with a full solution. If parsing each row doesn't depend on the previous rows being valid, you could always upload the whole file, then display the rows with errors at the end and ask them to upload a second file with just the corrections.
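The question is about Kotlin on the JVM, but for comparison, here is a sketch in Go of the server-side fail-fast behaviour under discussion: validate the CSV as it streams in and respond with 400 on the first bad row without reading the rest. Whether the browser actually shows that early response before it finishes uploading is exactly the problem described above; the route and error format are illustrative.

    package main

    import (
        "encoding/csv"
        "fmt"
        "io"
        "net/http"
    )

    func ingestCSV(w http.ResponseWriter, r *http.Request) {
        cr := csv.NewReader(r.Body) // reads the body as it arrives, row by row
        for row := 1; ; row++ {
            record, err := cr.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                // Report the broken row and return without draining the body.
                // The server then closes the connection once the response is
                // sent, which browsers mid-upload often surface as
                // "connection was reset" instead of showing this response.
                http.Error(w, fmt.Sprintf("row %d: %v", row, err), http.StatusBadRequest)
                return
            }
            _ = record // placeholder for real per-row processing
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/ingest", ingestCSV)
        http.ListenAndServe(":8080", nil)
    }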
The normal process for an HTTP web server goes like this:
Server listens for request
Client creates request
Client sends request to server
Server processes request
Server creates response
Server sends response to client
Client processes response
The client starts the connection for communication and the server is able to respond on that connection; however, if you close the connection, the server would need to send its response on another connection, and the browser may not allow the server to start a new connection that the client didn't request.
You may be able to respond by reading the first line and creating an error quickly, but the client will not read the response until it is done sending the request.
By sending the file in chunks or asynchronously sending lines of the file, you will be able to give feedback more immediately. You will be sending many smaller requests with the ability to respond in between.
The question was about the HTTP protocol. I feel this would be allowed by the protocol if you wrote a custom app and web server, but if you are using browsers then you must use HTTP as the vendors have implemented it. In a custom app you could check for interruptions, whereas most browsers will send the full request before listening for a response, which is also one reason AJAX took off 20 years ago.

File Upload / The connection was reset

I am writing an upload handler (asp.net) to handle image uploads.
The aim is to check the image type and content size before the entire file is uploaded. So I cannot use the Request object directly as doing so loads the entire file input stream. I therefore use the HttpWorkerRequest.
However, I keep getting "The connection to the server was reset while the page was loading".
After quite a bit of investigation it has become apparent that when posting the file the call works only if the entire input stream is read.
This, of course, is exactly what I do not want to do :)
Can someone please tell me how I can close off the request without causing the "connection reset" issue and having the browser process the response?
There is no way to do this, as this is how HTTP functions. The best you can do is slurp the data from the client (i.e. read it in chunks) and immediately forget about it. This should keep your memory requirements from being hammered, though it will hurt your bandwidth.
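For what it's worth, in Go (not the ASP.NET API the question is about) the "read it in chunks and forget about it" drain is essentially a one-liner; a sketch, with the route and error text purely illustrative:

    package main

    import (
        "io"
        "net/http"
    )

    func uploadHandler(w http.ResponseWriter, r *http.Request) {
        // ... inspect headers / the first bytes here and decide to reject ...

        // Drain the rest of the body in chunks, discarding it, so the client
        // can finish its upload and then read the error response cleanly;
        // the cost is the bandwidth spent receiving data that is never used.
        io.Copy(io.Discard, r.Body)
        http.Error(w, "image rejected", http.StatusRequestEntityTooLarge)
    }

    func main() {
        http.HandleFunc("/upload", uploadHandler)
        http.ListenAndServe(":8080", nil)
    }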

asp.net file downloading - track downloaded size

I am trying to design a system for something like this with ASP.net/C#.
The users pay for downloading some content (files: MP3s, PDFs, DOCs, etc.). I should be able to track the number of bytes downloaded by the user. If the number of bytes downloaded matches the number of bytes on the server, I should set a flag in the DB (indicating that the download was successful, and preventing them from downloading the file again or being asked to pay for it again). If the download was incomplete, they should be able to download the file again without paying for it (since the flag will not be set).
Is there any way to keep track of the number of bytes successfully downloaded by the client ?
Also, when I look at a file size on my WinXP machine, I see two sizes (size, and size on disk). Which one should I consider? And will it differ from one OS to another?
You can easily measure data passed to the client in ASP.NET assuming you replace a direct IIS-controlled download with your own, which would go something like this:
while (context.Response.IsClientConnected) {
    // Placeholder helper: read the next chunk of the file into 'buffer',
    // starting at 'offset', and return the number of bytes read.
    bytesRead = ReadFileChunkAsByteArrayWithOffsetOrWhatever(buffer, offset);
    context.Response.OutputStream.Write(buffer, 0, bytesRead);
    context.Response.Flush();
    offset += bytesRead;          // 'offset' now equals the bytes sent so far
    if (bytesRead != bufferSize)
        break;                    // short read means end of file
}
It's complicated to make this 100% reliable from within ASP, but it can be done. You pretty much have to account for every possible failure point and react accordingly.
The problem though is still - as someone mentioned above - that it's impossible to know that the client received the data. If money is involved in this transaction, that can get to be a problem really quickly.
For that reason, the best approach would be to use a custom downloader client, like the one Amazon uses for MP3 file purchases. That way you're not subjecting either yourself or your customers to the vagaries of moving monetized bits over something as unreliable as HTTP.
You can create an ASP.NET handler that serves the file (for ASP.NET MVC you can return a result action instead; this is what I'm using). Make sure it supports resumable downloads.
From there you can track the bytes served.
P.S. This incurs a performance overhead vs. letting IIS serve it.
Update 1: I used something pretty similar to this: http://dotnetslackers.com/articles/aspnet/Range-Specific-Requests-in-ASP-NET.aspx ... and the article has a pretty clear explanation of what's inside it. You can probably use that one as-is; see the example in that post.
You could try looking into HTTP response codes (i.e. 200, 404, etc.) - the client and server will be exchanging HTTP headers so that they know what's going on - you should be able to monitor these to see whether the response was successful (not sure, but you should be able to).
With regard to file size, I would run experiments on files with known sizes and compare what the HTTP logs tell you with what the file explorer tells you.
Also, I've seen tools/widgets that report file upload progress, so you're right, you should be able to do the same in reverse, I guess. You could try looking at file upload code examples and tutorials - you might get some hints. I can't think of any off the top of my head - sorry.
To do custom byte serving like this, you will need to implement your own http handler.
This handler should do the following:
Implement some kind of authentication on the http handler, so you know who you are dealing with.
Then you will need to implement some kind of logging for files requested and files allowed to be downloaded.
Implement etags and expires headers for client side caching.
Server side caching
Deflate, gzip compression
If you want to support resumable downloads, you will need to implement 206 partial responses. This is essential for any kind of streaming and serving pdfs.
So you should be handling the following http headers:
ETag
Expires
Accept-Ranges
Range
If-Range
Last-Modified
If-Match
If-None-Match
If-Modified-Since
If-Unmodified-Since
Unless-Modified-Since
If you are looking for a sample implementations of http handlers check out:
http://code.google.com/p/talifun-web/wiki
It has a static file handler that implements all the above http headers, client side and server side caching and even compression.
There is also a log module and an authorization module that should go a long way into how to implement authentication and logging.
The size you want is the size (not the size on disk). Size on disk includes the extra space taken up by rounding the file up to the partition's 4K block size. The size is the exact number of bytes in the file.
I don't believe there is a good way to tell that a download has been completed. Response.TransmitFile is probably the best method for sending the file securely, but I don't believe it has anything that will tell you whether the user actually received the file.
I don't know about the business this is supporting, but I can't think of a legitimate business where users would tolerate a single-download-per-purchase model, and the ambiguity of the standard HTTP request/response model does not lend itself to building an accurate client-side receiver. Not to mention this model could easily be hacked by sending a failed response on receipt of the last packet.
I think using something like a download window (2 hours after purchase) and then locking it to an IP after the first request would accomplish the same result with a lot fewer user issues and support calls. Also, unless the file has some sort of stringent DRM, allowing the user persistent access based on their login is most likely the appropriate business model, because once they get the file they can copy it as many times as they like.
Look at DVD or Blu-ray: no amount of copy protection or access controls will save your files from pirates, so make things easy for legitimate users.
