I'm using the Gremlin Python library to perform traversals against a JanusGraph deployment of Gremlin Server (the same also happens using just TinkerGraph). Some long traversals (with thousands of instructions) never get a response: no errors, no timeouts, no log entries or errors on the server or client. Nothing.
The conditions for this silent treatment aren't clear. The behaviour doesn't depend linearly on the number of bytes or the number of instructions. For instance, this code hangs forever for me:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 't'))
g = g.inject("")
for i in range(0, 8000):
    g = g.constant("test")
print(f"submitting traversal with length={len(g.bytecode.step_instructions)}")
result = g.next()
print(f"done, got: {result}")  # this is never reached
It doesn't depend on just the number of bytes in the request message, since the number of instructions beyond which I get no response doesn't change even with very large constant values in place of just "test". For instance, injecting 7000 values consisting of many paragraphs of Lorem Ipsum works as expected and returns in a few milliseconds.
While it shouldn't matter (since I should be getting a proper error instead of nothing), I've already increased server-side maxHeaderSize, maxChunkSize, maxContentLength etc. to ridiculously high numbers. Changing the serialization format (e.g. from GraphSONMessageSerializerV3d0 to GraphBinaryMessageSerializerV1) doesn't help either.
Note: I know that very long traversals are an anti-pattern in Gremlin, but sometimes it's not possible or very inefficient to structure traversals such that they can use injected values instead.
I've answered this question on gremlin-users not realizing it was also asked here on StackOverflow. For completeness, I'll duplicate my response here.
The issue is less related to bytes and string lengths and more to the length of the traversal chain (i.e. the number of steps in your traversal). You end up hitting a JVM limit on the stack size on the server. You can increase the stack size on the JVM by raising the -Xss value, which should allow a longer traversal. That will likely come with the need to re-examine other JVM settings like -Xmx and perhaps garbage collection options.
I do find it interesting that you don't get any error messages though - you should see a StackOverflowError somewhere, unless the server is just wholly bogged down by your request. I'd consider throwing more -Xmx at it to see if you can get it to respond with an error at least, or keep an eye on the server logs to see if it surfaces there.
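For cases where the traversal can be restructured (which, as the question notes, isn't always possible), here is a minimal sketch of the injected-values approach that keeps the step chain short no matter how many values are involved; the connection details are simply copied from the question:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 't'))

# a single inject() step carrying one list, unfolded server-side, instead of
# thousands of chained constant() steps; the traversal depth stays constant
values = ["test"] * 8000
result = g.inject(values).unfold().toList()
print(f"done, got {len(result)} values")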
So I'm on very constrained bandwidth where I am right now, and I clicked a link to a PDF tutorial for something and Chrome began to download it. I watched the size spiral upward from 20 KB past 5 MB and decided to stop it. How do I know it's not a 4 GB PDF?? Ridiculous, I know.
But I started thinking, surely there must be a way I can simply request the size of the resource to check before downloading. Perhaps some sort of cURL request?
Does anyone know a way?
You could try using the HTTP HEAD method. This should get you the headers of the document without the body, and those headers may include the Content-Length.
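A minimal sketch of that with Python's standard library (the URL is hypothetical); servers aren't required to send Content-Length on a HEAD response, so treat a missing header as "unknown":

import urllib.request

url = "http://example.com/tutorial.pdf"  # hypothetical URL
req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    size = resp.headers.get("Content-Length")
    print(size if size is not None else "server did not report a length")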
Or you could send a Range header with a GET request. See section 14.35.2 in this document. A Range header looks like:
Range: bytes=0-19999
which requests the first 20,000 bytes (octets) of a document. If the document is smaller than 20,000 bytes, you get the whole document.
The only problem is that the server might not support the Range header, in which case it will send a 200 status instead of 206. In that case you can just reset the connection if you don't want to risk burning bandwidth on a 5 GB document.
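A sketch of that approach in Python (the URL and host are hypothetical): request a tiny range and read the total size out of the Content-Range header, closing the connection if the server answers 200 with the full body:

import http.client
from urllib.parse import urlparse

url = "http://example.com/tutorial.pdf"  # hypothetical URL
parts = urlparse(url)
conn = http.client.HTTPConnection(parts.netloc)
conn.request("GET", parts.path or "/", headers={"Range": "bytes=0-0"})
resp = conn.getresponse()
if resp.status == 206:
    # Content-Range: bytes 0-0/123456 -> total size is after the slash
    total = resp.headers["Content-Range"].split("/")[-1]
    print(f"resource is {total} bytes")
else:
    # server ignored the Range header; close before it sends the whole body
    print(f"got {resp.status}, range not supported")
conn.close()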
I have been playing around with parsing HTTP in user space, and I see with some research that there are several ways to send data following the HTTP headers and \r\n\r\n. Obviously, Content-Length is not always used, so what are the other methods, and how do you determine the size of the data being sent beforehand if not streaming?
I did see content encoding, chunking and so on; I'm just a bit lost with the overall dynamic nature of the protocol in this case. What is the surefire way of determining the amount of data to be sent (when obviously not streaming something never-ending)?
Really appreciate the help.
The new HTTP spec describes this in http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-26.html#message.body. In short: if the message has Transfer-Encoding: chunked, the body is delimited by the chunk framing; otherwise a Content-Length header gives the exact size; and for a response with neither, the body simply runs until the server closes the connection.
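As a rough sketch (not a full implementation of the spec's rules), the decision for a response body boils down to something like this, assuming headers is a dict of already-parsed header fields with lower-cased names:

def body_framing(headers):
    # sketch of the precedence rules for determining response body length
    te = headers.get("transfer-encoding", "")
    if "chunked" in te.lower():
        return "chunked"                        # body delimited by chunk-size lines
    if "content-length" in headers:
        return int(headers["content-length"])   # exact byte count
    return "read-until-close"                   # length known only when the server closes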
I've got a simple custom HTTP server serving clients.
With SNDBUF set to 512000 everything works fine. However, setting it lower or leaving it at the default (whatever that is) results in Chrome and Firefox not receiving all of the response data – Firefox truncates it after around 150000 - 250000 bytes (the offset changes every time, even though the content stays the same), and Chrome gives an error with no details.
The particular response at issue is about 300000 bytes, and sent all in one chunk.
Tools like Rex Swain's HTTP Viewer, curl and wget report no such problem, and show all of the data.
Why does setting the SNDBUF affect Chrome and Firefox's ability to receive the data? I understand how SNDBUF impacts performance, but I don't understand how setting it too low could corrupt a stream?
The amount of data you can write into the underlying socket at a time is limited by the send buffer space available at that moment. As Nikolai said, you will need to check the return value from send(), or its equivalent, to find out whether all of the data you passed in was actually written to the send buffer. If not, you will need to wait for the socket to become writable again and then write the outstanding data.
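A sketch of that loop in Python on a non-blocking socket (the function and variable names are mine, not from the question's server):

import select

def send_fully(sock, data):
    # send() queues only as much as fits in the send buffer; loop until done
    view = memoryview(data)
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]
        except BlockingIOError:
            # send buffer full: wait until the socket is writable again
            select.select([], [sock], [])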
There are a number of reasons why the point of truncation differs between browsers/HTTP clients. One is the size of the client's socket receive buffer, since that size determines the advertised TCP receive window (flow control), which in turn affects the actual transmission rate. Another possible reason is that the HTTP Viewer/curl/wget may simply read the socket faster than Chrome or Firefox do.
I intend on writing a small download manager in C++ that supports resuming (and multiple connections per download).
From the info I gathered so far, when sending the HTTP request I need to add a header field with a key of "Range" and the value "bytes=startoff-endoff". Then the server returns an HTTP response with the data between those offsets.
So roughly what I have in mind is to split the file into as many parts as there are allowed connections and send an HTTP request per part with the appropriate "Range". So if I have a 4 MB file and 4 allowed connections, I'd split the file into 4 parts and have 4 HTTP requests going, each with the appropriate "Range" field. Implementing the resume feature would involve remembering which offsets have already been downloaded and simply not requesting those.
Is this the right way to do this?
What if the web server doesn't support resuming? (my guess is it will ignore the "Range" and just send the entire file)
When sending the HTTP requests, should I specify the entire split size in the range? Or should I ask for smaller pieces, say 1024 KB per request?
When reading the data, should I write it immediately to the file or do some kind of buffering? I guess it could be wasteful to write small chunks.
Should I use a memory-mapped file? If I remember correctly, it's recommended for frequent reads rather than writes (I could be wrong). Is it sensible in terms of memory use? What if I have several downloads running simultaneously?
If I'm not using a memory-mapped file, should I open the file once per allowed connection, or just seek whenever I need to write? (With a memory-mapped file this would be really easy, since I could simply have several pointers.)
Note: I'll probably be using Qt, but this is a general question so I left code out of it.
Regarding the request/response:
For a ranged request, you can get three different responses:
206 Partial Content - resuming supported and possible; check Content-Range header for size/range of response
200 OK - byte ranges ("resuming") not supported, whole resource ("file") follows
416 Requested Range Not Satisfiable - incorrect range (past EOF etc.)
Content-Range usually looks like this: Content-Range: bytes 21010-47000/47022, that is, bytes start-end/total.
Check the HTTP spec for details, especially sections 14.5, 14.16 and 14.35.
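A sketch of handling those three cases when requesting one segment (Python here only for brevity, since the question is language-agnostic; host, path and range values are illustrative, matching the 4 MB / 4-connection example from the question):

import http.client

conn = http.client.HTTPConnection("example.com")  # hypothetical host
conn.request("GET", "/file.bin", headers={"Range": "bytes=1048576-2097151"})
resp = conn.getresponse()
if resp.status == 206:
    # Content-Range: bytes 1048576-2097151/4194304 -> start-end/total
    print("partial content:", resp.headers["Content-Range"])
    segment = resp.read()
elif resp.status == 200:
    # ranges not supported: the whole file follows, fall back to one connection
    whole_file = resp.read()
elif resp.status == 416:
    # requested range is past EOF (e.g. the resource shrank since the last attempt)
    print("range not satisfiable")
conn.close()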
I am not an expert on C++; however, I once wrote a .NET application which needed similar functionality (download scheduling, resume support, prioritizing downloads).
I used the Microsoft BITS (Background Intelligent Transfer Service) component, which is written in C. Windows Update uses BITS too. I went with this solution because I don't think I'm a good enough programmer to write something of this level myself ;-)
Although I am not sure whether you can get the source code of BITS, I do think you should have a look at its documentation, which might help you understand how they implemented it: the architecture, interfaces, etc.
Here it is - http://msdn.microsoft.com/en-us/library/aa362708(VS.85).aspx
I can't answer all your questions, but here is my take on two of them.
Chunk size
There are two things you should consider about chunk size:
The smaller they are, the more overhead you get from sending the HTTP requests.
With larger chunks you run the risk of downloading the same data twice if one download fails.
I'd recommend you go with smaller chunks of data. You'll have to do some tests to see what size is best for your purpose, though.
In memory vs. files
You should write the data chunks to an in-memory buffer and then, when it is full, write it to disk. If you are going to download large files, it can be troublesome for your users if they run out of RAM. If I remember correctly, IIS stores requests smaller than 256 KB in memory and writes anything larger to disk; you may want to consider a similar approach.
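A sketch of that buffering strategy in Python (the threshold and the names are arbitrary, not from any particular library):

def write_with_buffer(out_file, chunks, threshold=256 * 1024):
    # accumulate small network chunks in RAM, flush to disk once the buffer
    # passes the threshold, and flush whatever is left at the end
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) >= threshold:
            out_file.write(buf)
            buf.clear()
    if buf:
        out_file.write(buf)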
Besides keeping track of the offsets marking the beginning of your segments and each segment's length (unless you want to compute that upon resume, which would involve sorting the offset list and calculating the distance between consecutive entries), you will want to check the Accept-Ranges header of the HTTP response sent by the server to make sure it supports the Range header at all. The way to specify a range is "Range: bytes=START_BYTE-END_BYTE"; the range you request includes both START_BYTE and END_BYTE, and thus consists of (END_BYTE - START_BYTE) + 1 bytes.
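A sketch of carving a known total size into inclusive start/end pairs for the Range header (names and the example sizes are illustrative):

def segment_ranges(total_size, connections):
    # returns inclusive (start, end) byte offsets, one pair per connection;
    # each range covers (end - start) + 1 bytes, matching the Range semantics
    base = total_size // connections
    ranges = []
    start = 0
    for i in range(connections):
        end = start + base - 1
        if i == connections - 1:
            end = total_size - 1  # last segment absorbs the remainder
        ranges.append((start, end))
        start = end + 1
    return ranges

# e.g. a 4 MB file over 4 connections -> [(0, 1048575), (1048576, 2097151), ...]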
Requesting micro-chunks is something I'd advise against, as you might be blocked by a firewall rule designed to stop HTTP flooding. In general, I'd suggest you don't make chunks smaller than 1 MB and don't create more than 10 chunks.
Depending on how much control you plan to have over your download, if you've got socket-level control you can consider writing to disk only once at least 32 KB have accumulated, or writing the data asynchronously.
I couldn't comment on the MMF idea, but if the downloaded file is large that's not going to be a good idea as you'll eat up a lot of RAM and eventually even cause the system to swap, which is not efficient.
As for handling the chunks, you could just create several files - one per segment - optionally preallocating the disk space by filling each file with as many \x00 bytes as the size of the chunk (preallocating might save you some time while writing during the download, but will make starting the download slower), and then finally write all of the chunks sequentially into the final file.
One thing you should beware of is that many servers have a limit on the maximum number of concurrent connections, and you don't get to know it in advance, so you should be prepared to handle HTTP errors/timeouts and to change the size of the chunks or to create a queue of chunks in case you created more chunks than the server allows connections.
Not really an answer to the original questions, but another thing worth mentioning is that a resumable downloader should also check the last modified date on a resource before trying to grab the next chunk of something that may have changed.
It seems to me you would want to limit the size per download chunk. Large chunks could force you to repeat downloading data if the connection aborts close to the end of a part. That's especially an issue with slower connections.
For the pause/resume support, look at this simple example:
Simple download manager in Qt with pause/resume support