curl error 18, attempting to solve problem using SO answer 1759956 - unix

I am trying to follow curl error 18 - transfer closed with outstanding read data remaining.
The top answer is to
...let curl set the length by itself.
I don't know how to do this. I have tried the following:
curl --ignore-content-length http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
However, I still get this error:
curl: (18) transfer closed with outstanding read data remaining

The connection is just getting closed by the server after 30 seconds.
You can try to speed up the client side, but if the transfer can't finish within that window you will get this error even on a fast connection.
In the case of the example http://corpus-db.org/api/author/Dickens,%20Charles/fulltext I got a larger amount of content with direct output:
curl http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
while the amount was smaller when writing to a file (still about 47 MB within the 30 seconds):
curl -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
You can try resuming the transfer, but the example server doesn't support it:
curl -C - -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
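To see whether your connection can realistically pull the whole payload within the server's time window, you can ask curl to report the transfer statistics. A small check using curl's documented --write-out variables (the output wording is just illustrative; curl should still print the statistics even when the transfer ends with error 18):
curl -s -o /dev/null -w "downloaded %{size_download} bytes at %{speed_download} bytes/s in %{time_total}s\n" http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
If speed_download multiplied by roughly 30 seconds is still smaller than the full payload, no client-side tuning will get you past the cut-off.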
So there may be ways to optimize the request, increase the connection speed, or enlarge the cache, but if you have hit the server's limit and never receive all the data in the allotted time, there is nothing you can do.
The curl manual can be found here: https://curl.haxx.se/docs/manual.html
The following links won't help you but perhaps are interesting:
The repository for the data-server can be found here: https://github.com/JonathanReeve/corpus-db
The documentation for the used web-server can be found here: https://hackage.haskell.org/package/warp-3.2.13

It's a speed issue. The server at corpus-db.org will DISCONNECT YOU if you take longer than 35 seconds to download something, regardless of how much you've already downloaded.
To make matters worse, the server does not support Content-Range, so you can't download it in chunks and simply resume the download where you left off.
To make matters even worse, not only is Content-Range not supported, it is SILENTLY IGNORED, which means it seems to work until you actually inspect what you've downloaded.
If you need to download that page from a slower connection, I recommend renting a cheap VPS, setting it up as a mirror of whatever you need to download, and downloading from your mirror instead. Your mirror does not need to impose the 35-second limit.
For example, this VPS[1] costs $1.25/month, has a 1 Gbps connection, and would be able to download that page. Rent one of those, install nginx on it, wget the file into nginx's www folder, and download it from your mirror; you'll have 300 seconds (nginx's default timeout) instead of 35 seconds to download it. If 300 seconds is not enough, you can even change the timeout to whatever you want.
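For reference, a rough sketch of the mirror approach on a Debian/Ubuntu VPS (the dickens.json filename, the yourserver hostname, and the Debian-style web root are assumptions, adjust for your distro):
sudo apt-get install nginx
cd /var/www/html    # nginx's default web root on Debian/Ubuntu
sudo wget -O dickens.json 'http://corpus-db.org/api/author/Dickens,%20Charles/fulltext'
# then, from your slow connection, at your leisure:
curl -o dickens.json http://yourserver/dickens.json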
Or you could even get fancy and set up a caching proxy compatible with curl's --proxy parameter, so your command could become
curl --proxy http://yourserver http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
If someone is interested in an example implementation of this, let me know.
You can't download that page with a 4 Mbit connection because the server will kick you off before the download is complete (after 35 seconds), but if you download it over a 1000 Mbit connection, you'll be able to fetch the entire file before the timeout kicks in.
(My home internet connection is 4 Mbit and I can't download it from home, but I tried downloading it from a server with a 1000 Mbit connection, and that works fine.)
[1] PS: I'm not associated with RamNode in any way, except that I'm a (former) happy customer of theirs, and I recommend them to anyone looking for cheap, reliable VPSs.

Related

Calculating Server Processing Time With Curl

I am getting request timing info with curl using the --write-out option as described in this article.
Here is example output from one of my requests:
time_namelookup: 0.031
time_connect: 0.328
time_appconnect: 1.560
time_pretransfer: 1.560
time_redirect: 0.000
time_starttransfer: 1.903
----------
time_total: 2.075
My question is: How can I determine how much time the server took processing the request? Is the answer:
time_starttransfer - time_connect
That is, the time from when the connection was established to when the server started sending its response? That seems correct but I want to be sure.
Details about curl timing variables may be found here: http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html
Yes, (time_starttransfer - time_connect) is the time from when curl noticed that the connection was established until the first byte arrived. Do note that it also includes the network transfer time of that first byte, so for a remote site it will be longer simply because of that.
I'd say that you're right: (time_starttransfer - time_connect) is indeed the amount of time the server took to process the request.
However, I also wondered what the difference is between time_connect and time_pretransfer (intrigued by #Schwartzie's and #Cheeso's comments).
By looking at several curl queries on the web I observed that sometimes they're equal and sometimes they're not.
Then I figured out (at least I believe so) that they differ only for HTTPS requests, because the server needs some time for the SSL handshake, which is not exactly time spent by the target app but time spent by the server hosting the app/service.
The time for the SSL handshake (measured from the start, so it also includes the connect time given in time_connect) is time_appconnect. Only when it is 0 (as for non-HTTPS requests) are time_connect and time_pretransfer EQUAL; for HTTPS requests they differ, and time_pretransfer equals time_appconnect (not time_connect).
Check the following two examples:
curl -kso /dev/null -w "time_connect=%{time_connect}, time_appconnect:%{time_appconnect}, time_pretransfer=%{time_pretransfer}\n" http://www.csh.rit.edu
time_connect=0.128, time_appconnect:0.000, time_pretransfer=0.128
curl -kso /dev/null -w "time_connect=%{time_connect}, time_appconnect:%{time_appconnect}, time_pretransfer=%{time_pretransfer}\n" https://www.csh.rit.edu
time_connect=0.133, time_appconnect:0.577, time_pretransfer=0.577
So I'd say that time_pretransfer is the more accurate baseline to use compared to time_connect, since it accounts for SSL connections too, and maybe some other things I'm not aware of.
With all of the above said, I'd say that a more precise answer to the question:
"How can I determine how much time the server took processing the request?"
would probably be:
time_starttransfer - time_pretransfer
as #Schwartzie already mentioned, I just wanted to understand why.
If you don't want to count the SSL handshake part then it's time_starttransfer - time_pretransfer
Here is a nice diagram from this Cloudflare blog.
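Putting the above together, you can have curl print the relevant timestamps directly so the subtraction is easy; a small sketch reusing the URL from the examples above:
curl -so /dev/null -w "pretransfer=%{time_pretransfer} starttransfer=%{time_starttransfer} total=%{time_total}\n" https://www.csh.rit.edu
# approximate server processing time = time_starttransfer - time_pretransfer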

gawk to read last bit of binary data over a pipe without timeout?

I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible a single run of the program might make between 200 and 2000 calls to the same API service.
I've just discovered that gawk can do networking and found geturl
However, the advice at the bottom of that page is well heeded: I can't find an easy way to read the last line and keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected Content-Length. This might break on any trailing whitespace, though. I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con = "/inet/tcp/0/host/80"
send_http_request(con)
RS = "\r\n"
read_headers()
# now read the body - but do not close the connection...
RS = "}"    # for JSON
while ((con |& getline bytes) > 0) {
    body = body bytes RS
    if (length(body) >= content_length) break
    print length(body)
}
# Do not close con here - keep the connection open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ..
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing to wget will not be easy.
Re-implementing in Perl/Python etc. is not a quick solution.
I've looked at piping URLs through a named pipe into wget -i -, but that doesn't work: the data gets buffered and unbuffer isn't available; also, I think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.
The problem with connection reuse comes from the HTTP 1.0 standard, not gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solutions for HTTP 1.0. Don't forget to add the Host: header to your HTTP/1.1 request, as it is mandatory.
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script blocks waiting for more data when it shouldn't, the server will, again, close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like that reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
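If it helps, here is a minimal gawk sketch of the HTTP/1.1 idea over gawk's |& coprocess: send the mandatory Host: header, ask for keep-alive, parse Content-Length from the response headers, and stop reading the body once that many characters have arrived instead of closing the connection. The host, path, and JSON-style record separator are assumptions for illustration; chunked responses, binary bodies, and trailing whitespace would still need extra handling.
BEGIN {
    con = "/inet/tcp/0/api.example.com/80"    # hypothetical API host

    # HTTP/1.1 keeps the connection open by default; the Host: header is mandatory.
    request = "GET /item/1 HTTP/1.1\r\n" \
              "Host: api.example.com\r\n" \
              "Connection: keep-alive\r\n\r\n"
    print request |& con

    # Read headers (terminated by a blank line) and pick out Content-Length.
    RS = "\r\n"; content_length = 0
    while ((con |& getline line) > 0 && line != "") {
        if (tolower(line) ~ /^content-length:/) {
            sub(/^[^:]*:[[:space:]]*/, "", line)
            content_length = line + 0
        }
    }

    # Read the body until content_length is reached, as in the question,
    # then issue the next request on the same coprocess instead of closing it.
    RS = "}"    # JSON-friendly separator
    body = ""
    while (length(body) < content_length && (con |& getline chunk) > 0)
        body = body chunk RS

    print body
    # close(con) only once the whole batch of requests is done
}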

NGINX Reverse Proxy: Many HTTP status code 400 responses, why?

We have recently implemented an nginx-based reverse proxy.
While debugging our access logs, we are seeing quite a few status code 400 results.
They look something like this:
[07/Sep/2011:05:49:04 -0700] - "400" 0 "-" "-" "-"
We have enabled debug error logging, and they usually correspond to something like this:
2011/09/07 05:09:28 [info] 5937#0: *30904 client closed prematurely connection while reading client request line
We have tried raising a number of the buffers, as mentioned by a few pages we were able to google up.
http://www.ruby-forum.com/topic/173362
or
http://blog.craz8.com/articles/2009/06/17/nginx-400-bad-request-errors-due-to-cookies-and-what-to-do-about-them
To no avail.
Why is this happening?
This is a standard nginx reverse proxy -> Apache backend server setup.
Worth mentioning: the unique content on our site is fairly minimal. We have tested this using many browsers and are not personally receiving any of these 400 results.
Thanks!
Further URLs detailing similar entries in their logs:
http://blog.rayfoo.info/2009/10/weird-web-server-access-log-entries
I found this was caused by using Chrome, which apparently opens extra connections occasionally without sending any data.
Here's some more info: http://www.ruby-forum.com/topic/2953545
Now the question is what to do about them - the answer provided there wasn't very satisfying.
Are you handling SSL connections? Can you add $ssl_cipher $ssl_protocol to your access log format?
First, it's quite possible that your clients send requests with really big HTTP headers or URLs. Maybe an older version of your application set some (probably big) cookies which are unused now, and some clients are still trying to send them back.
I'd set the header buffers to a really big value and, on the application side, log the size of the headers/requests (and the complete request if they are bigger than usual). Or take nginx out of the chain completely and log the headers/requests under the same conditions. If you can, take nginx out only for those IPs/subnets the 400 errors came from. I suppose nginx can log the source IP for these 400 errors.
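If you want to rule out buffer limits while you investigate, the relevant nginx directives (inside the http {} or server {} block) are shown below; the values here are deliberately generous for debugging, not a tuning recommendation:
client_header_buffer_size 8k;       # default is 1k
large_client_header_buffers 8 32k;  # default is 4 8k; covers long URLs and big cookies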

Is there anything in the FTP protocol like the HTTP Range header?

Suppose I want to transfer just a portion of a file over FTP - is it possible using a standard FTP protocol?
In HTTP I could use a Range header in the request to specify the data range of the remote resource. If it's a 1mb file, I could ask for the bytes from 600k to 700k.
Is there anything like that in FTP? I am reading the FTP RFC, don't see anything, but want to make sure I'm not missing anything.
There's a Restart command in FTP - would that work?
Addendum
After getting Brian Bondy's answer below, I wrote a read-only Stream class that wraps FTP. It supports Seek() and Read() operations on a resource that is read via FTP, based on the REST verb.
Find it at http://cheeso.members.winisp.net/srcview.aspx?dir=streams&file=FtpReadStream.cs
It's pretty slow to Seek(), because setting up the data socket takes a long time. Best results come when you wrap that stream in a BufferedStream.
Yes, you can use the REST command.
REST sets the point at which a subsequent file transfer should start. It is usually used for restarting interrupted transfers. The command must come right before a RETR or STOR, and therefore after a PORT or PASV.
From FTP's RFC 959:
RESTART (REST) The argument field represents the server marker at which file transfer is to be restarted. This command does not cause file transfer but skips over the file to the specified data checkpoint. This command shall be immediately followed by the appropriate FTP service command which shall cause file transfer to resume.
Read more:
http://www.faqs.org/rfcs/rfc959.html#ixzz0jZp8azux
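As an illustration of that sequence, fetching roughly the 600k-700k region of a file would look something like this on the control connection (the offset and filename are made up; since REST only sets the start point, the client has to stop reading and abort the transfer itself once it has enough data):
TYPE I
PASV
REST 614400
RETR somefile.bin
(read ~100 KB from the data connection, then ABOR / close the data socket)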
You should check out how GridFTP does parallel transfers. It uses the sort of techniques you want (and might actually be code that is better to borrow than to reimplement from scratch yourself).

IIS file download hangs/timeouts - sc-win32-status = 64

Any thoughts on why I might be getting tons of "hangs" when trying to download a file via HTTP, based on the following?
Server is IIS 6
File being downloaded is a binary file, rather than a web page
Several clients hang, including the TrueUpdate and FlexNet web-updating packages, as well as a custom .NET app that just does basic HttpWebRequest/HttpWebResponse logic and downloads using a response stream
IIS log file signature when success is 200 0 0 (sc-status sc-substatus sc-win32-status)
For failure, error signature is 200 0 64
sc-win32-status of 64 is "the specified network name is no longer available"
I can point Firefox at the URL and download successfully every time (perhaps some retry logic is happening under the hood)
At this point, it seems like either there's something funky with my server that is causing these errors, or this is just normal network behavior and I need to use (or write) a client that is more resilient to the failures.
Any thoughts?
Perhaps your issue was a low-level networking issue with the ISP, as you speculated in your reply comment. I am experiencing a similar problem with IIS and some mysterious 200 0 64 lines appearing in the log file, which is how I found this post. For the record, this is my understanding of sc-win32-status=64; I hope someone can correct me if I'm wrong.
sc-win32-status 64 means "The specified network name is no longer available."
After IIS has sent the final response to the client, it waits for an ACK message from the client.
Sometimes clients will reset the connection instead of sending the final ACK back to the server. This is not a graceful connection close, so IIS logs the "64" code to indicate an interruption.
Many clients will reset the connection when they are done with it, to free up the socket instead of leaving it in TIME_WAIT/CLOSE_WAIT.
Proxies may have a tendency to do this more often than individual clients.
I've spent two weeks investigating this issue. In my case, intermittent random requests were being prematurely terminated. This resulted in IIS log entries with status code 200 but a win32-status of 64.
Our infrastructure includes two Windows IIS servers behind two NetScaler load balancers in HA mode.
In my particular case, the problem was that the NetScaler had a feature called "Integrated Caching" turned on (http://support.citrix.com/proddocs/topic/ns-optimization-10-5-map/ns-IC-gen-wrapper-10-con.html).
After disabling this feature, the request interruptions ceased and the site operated normally. I'm not sure how or why this was causing a problem, but there it is.
If you use a proxy or a load balancer, do some investigation of what features they have turned on. For me the cause was something between the client and the server interrupting the requests.
I hope that this explanation will at least save someone else's time.
Check the headers returned by the server, especially Content-Type and Content-Length. It's possible that your clients don't recognize the format of the binary file and hang while waiting for bytes that never come, or maybe they close the underlying TCP connection, which may cause IIS to log win32 status 64.
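A quick way to check those headers from the command line (the URL here is a placeholder, substitute your real download link):
curl -sI http://yourserver/downloads/update.bin
# look at Content-Type (it should not be text/html for a binary download)
# and Content-Length (it should match the file's size on disk)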
I spent three days on this.
It was a timeout that was set to 4 seconds (a PHP curl request).
The solution was to increase the timeout setting:
//curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // times out after 60s
You will have to use Wireshark or Network Monitor to gather more data on this problem, I think.
I suggest you put Fiddler in between your server and your download client. This should reveal the differences between Firefox and other clients.
Description of all sc-win32-status codes for reference
https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
ERROR_NETNAME_DELETED
64 (0x40)
The specified network name is no longer available.
