HTTP streaming / chunked responses on Heroku with clojure - http

I'm making a clojure web app that streams data to clients using chunked HTTP responses. This works great when I run it locally using foreman, but doesn't work properly when I deploy it to Heroku.
A minimal example exhibiting this behaviour can be found on my github here. The frontend (in resources/index.html) performs an AJAX GET request and prints the response chunks as they arrive. The server uses http-kit to send a new chunk to connected clients every second. By design, the HTTP request never completes.
When the same code is deployed to Heroku, the HTTP connection is closed by the server immediately after the first chunk is sent. It seems to be Heroku's routing mesh that is causing the disconnection.
This can also be seen by performing the GET request using curl:
$ curl -v http://arcane-headland-2284.herokuapp.com/stream
* About to connect() to arcane-headland-2284.herokuapp.com port 80 (#0)
* Trying 54.243.166.168...
* Adding handle: conn: 0x6c3be0
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x6c3be0) send_pipe: 1, recv_pipe: 0
* Connected to arcane-headland-2284.herokuapp.com (54.243.166.168) port 80 (#0)
> GET /stream HTTP/1.1
> User-Agent: curl/7.31.0
> Host: arcane-headland-2284.herokuapp.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Date: Sat, 17 Aug 2013 16:57:24 GMT
* Server http-kit is not blacklisted
< Server: http-kit
< transfer-encoding: chunked
< Connection: keep-alive
<
* transfer closed with outstanding read data remaining
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining
The time is currently Sat Aug 17 16:57:24 UTC 2013 <-- this is the first chunk
Can anybody suggest why this is happening? HTTP streaming is supposed to be supported in Heroku's Cedar stack. The fact that the code runs correctly under foreman suggests that something in Heroku's routing mesh is causing it to break.
Live demo of the failing project: http://arcane-headland-2284.herokuapp.com/
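For reference, the chunk-by-chunk behaviour (and the premature disconnect) can also be watched from a small client script. Below is a minimal sketch using Python's requests library against the demo URL above; the assumption that the early close surfaces as a ChunkedEncodingError is mine, not something from the original report.
import requests

# Stream the /stream endpoint and print each chunk as it arrives.
# The request never completes by design, so this loops until the
# server (or Heroku's router) closes the connection.
url = "http://arcane-headland-2284.herokuapp.com/stream"

try:
    with requests.get(url, stream=True, timeout=(10, 120)) as resp:
        for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
            print(repr(chunk))
except requests.exceptions.ChunkedEncodingError as exc:
    # A premature close of a chunked response usually shows up here,
    # matching curl's "transfer closed with outstanding read data remaining".
    print("stream was cut off:", exc)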

This was due to a bug in http-kit which will be fixed shortly.

https://devcenter.heroku.com/articles/request-timeout may be relevant: "long-polling" requests like yours have to send data every 55 seconds or be terminated.

Related

A strange problem with a size limit in the HTTP header

Context: I maintain a kind of web service server, but with a particular implementation: all data sent by the web services is located in the HTTP header. That means the response contains only HTTP headers (no body). The web service runs as a Windows service. The consumer is my PHP code, which invokes the web service via the cURL library. All this has been in production for 3 years and works fine. I recently had to build a development environment.
I have the web service on a Windows 7 Pro machine, as a Windows service.
I have my PHP consumer on another Windows 7 Pro machine (WAMP + cURL).
My PHP code invokes the web service and displays the raw response.
In this context the problem occurs: if the response contains more than 1215 characters, I get an empty response (but no error message).
I installed my PHP code (exactly the same) on a new Linux Ubuntu machine: I have the same problem.
I installed my PHP code (exactly the same) on a new Linux CentOS machine: I DON'T HAVE THE PROBLEM.
I have read a lot on the internet about size limits on HTTP headers, and I now think that is not the cause of the problem.
I examined all the size-limit parameters in Apache, PHP, and cURL, but I didn't find anything relevant.
If someone has some information, all leads are welcome. Thanks.
Not an answer, but I want to say that using PHP 7.2.5 under mod_php with Apache 2.4.33, I am unable to reproduce your issue, as I have no problem sending anything from 1 byte to 10,000 to even 100,000 bytes in headers.
Here is my producer.php:
<?php
// Echo back a header (X-data) whose value is $_GET['s'] bytes long.
$size = (int)($_GET['s'] ?? 1);
header("X-size: {$size}");
$data = str_repeat("a", $size);
header("X-data: {$data}");
http_response_code(204); // 204 No Content: headers only, no body
And whether I hit http://127.0.0.1/producer.php?s=1 or http://127.0.0.1/producer.php?s=10000 or even http://127.0.0.1/producer.php?s=100000, the data is returned without issue. Can you reproduce the issue using my producer.php code?
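If it helps, here is a small consumer you could run on both of your machines to repeat the same test against producer.php; this is just a sketch using Python's requests library, and it assumes producer.php is served from the base URL shown (adjust the host/port to your setup).
import requests

# Probe producer.php with several header sizes (including the 1215/1216
# boundary from the question) and report what actually comes back.
BASE = "http://127.0.0.1/producer.php"

for size in (1, 1215, 1216, 10000, 100000):
    resp = requests.get(BASE, params={"s": size}, timeout=10)
    data = resp.headers.get("X-data", "")
    print(f"requested {size:>6} -> status {resp.status_code}, "
          f"X-size={resp.headers.get('X-size')}, X-data length={len(data)}")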
By the way, interestingly, when I try 1 million bytes, I get this error from curl:
$ curl -I http://127.0.0.1/producer.php?s=1000000
HTTP/1.1 204 No Content
Date: Wed, 16 Jan 2019 20:11:25 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
X-size: 1000000
curl: (27) Rejected 104960 bytes header (max is 102400)!
Hanshenrik,
I also used CURLOPT_VERBOSE as you said. Here are the two curl logs.
The only difference is the line "* stopped the pause stream!" in the Ubuntu curl log.
cURL log from Ubuntu, which has the problem:
* Trying 192.168.1.205...
* TCP_NODELAY set
* Connected to 192.168.1.205 (192.168.1.205) port 8084 (#0)
> POST /datasnap/rest/TServerMethods/%22W_GetDashboard%22/ HTTP/1.1
Host: 192.168.1.205:8084
Accept-Encoding: gzip,deflate
Accept: application/json
Content-Type: text/xml; charset=utf-8
Pragma: dssession=146326.909376.656191
Content-Length: 15
* upload completely sent off: 15 out of 15 bytes
< HTTP/1.1 200 OK
< Connection: close
< Content-Encoding: deflate
< Content-Type: application/json
< Content-Length: 348
< Date: Thu, 17 Jan 2019 15:27:03 GMT
< Pragma: dssession=146326.909376.656191,dssessionexpires=3600000
<
* stopped the pause stream!
* Closing connection 0
cURL log from CentOS, which does NOT have the problem:
* About to connect() to 192.168.1.205 port 8084 (#1)
* Trying 192.168.1.205...
* Connected to 192.168.1.205 (192.168.1.205) port 8084 (#1)
> POST /datasnap/rest/TServerMethods/%22W_GetDashboard%22/ HTTP/1.1
Host: 192.168.1.205:8084
Accept-Encoding: gzip,deflate
Accept: application/json
Content-Type: text/xml; charset=utf-8
Pragma: dssession=3812.553164.889594
Content-Length: 15
* upload completely sent off: 15 out of 15 bytes
< HTTP/1.1 200 OK
< Connection: close
< Content-Encoding: deflate
< Content-Type: application/json
< Content-Length: 348
< Date: Thu, 17 Jan 2019 15:43:39 GMT
< Pragma: dssession=3812.553164.889594,dssessionexpires=3600000
<
* Closing connection 1

How do I find out what Content Types are on offer (for HTTP Content Negotiation)?

What one gets back when resolving a DOI depends on content negotiation.
I was looking at https://citation.crosscite.org/docs.html#sec-3
and I see different services offer different Content Types.
For a particular URL I want to know all the content types it can give me.
Some of them might be more useful than any that I am aware of (i.e. I don't want to write a list of preferences in advance).
For example:
https://doi.org/10.5061/dryad.1r170
I thought maybe OPTIONS was the way to do it, but that gave back nothing interesting, only information about the allowed request methods.
shell> curl -v -X OPTIONS http://doi.org/10.5061/dryad.1r170
* Hostname was NOT found in DNS cache
* Trying 2600:1f14:6cf:c01::d...
* Trying 54.191.229.235...
* Connected to doi.org (2600:1f14:6cf:c01::d) port 80 (#0)
> OPTIONS /10.5061/dryad.1r170 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: doi.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Allow: GET, HEAD, POST, TRACE, OPTIONS
< Content-Length: 0
< Date: Mon, 29 Jan 2018 07:01:14 GMT
<
* Connection #0 to host doi.org left intact
I guess there is no such standard yet, but the Link header (https://www.w3.org/wiki/LinkHeader) could expose this information.
Personally, though, I wouldn't rely too much on it. For example, a server could start sending a new content type and still NOT expose it via this header.
It might be useful to check the API response headers frequently, via manual or automated means, for any changes.
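In the meantime, one pragmatic (if inelegant) approach is simply to probe the URL with the Accept values you care about and record which Content-Type comes back. Here is a minimal sketch in Python with requests, using a few media types listed in the Crosscite documentation linked above; the candidate list is illustrative, not exhaustive.
import requests

# Try each candidate Accept value against the DOI and report the
# Content-Type the server actually returns. There is no standard way to
# enumerate every type the server supports, so this only tests guesses.
url = "https://doi.org/10.5061/dryad.1r170"
candidates = [
    "application/vnd.citationstyles.csl+json",
    "application/x-bibtex",
    "application/rdf+xml",
    "text/x-bibliography",
]

for accept in candidates:
    resp = requests.get(url, headers={"Accept": accept},
                        allow_redirects=True, timeout=30)
    print(f"{accept:40} -> {resp.status_code} {resp.headers.get('Content-Type')}")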

Log into a website to grab the data using RCurl

I wanted to log in to a website using RCurl and grab data from it (the data cannot be seen without logging in).
I wanted to export this (for example) "http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone" into R after logging in using RCurl. The issue is that I cannot log in using RCurl. I haven't tried this before, so I mostly referred to http://www.omegahat.org/RCurl/philosophy.html.
So here's what I tried (here, 'me#gmail.com' is my user ID and '9999' is my password; I just made them up).
library(RJSONIO)
library(rjson)
library(RCurl)
appannie <- getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/.json?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone, userpwd = me#gmail.com:9999", verbose = TRUE)
But this gave me the message below:
About to connect() to www.appannie.com port 80 (#0)
* Trying 69.167.138.64... * connected
* Connected to www.appannie.com (69.167.138.64) port 80 (#0)
> GET /app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone HTTP/1.1
Host: www.appannie.com
Accept: */*
< HTTP/1.1 403 FORBIDDEN
< Server: nginx/1.1.19
< Date: Fri, 01 Mar 2013 23:41:32 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=10
< Vary: Accept-Encoding
< Vary: Cookie,Accept-Encoding
<
* Connection #0 to host www.appannie.com left intact
So I went back and read http://www.omegahat.org/RCurl/philosophy.html again, still didn't know what to do, and then tried this after seeing a similar question on Stack Overflow.
getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone", .opts=list(userpwd="me#gmail.com:9999"))
But this gives me the output below.
[1] ""
Can anyone give me a hint? (After a bunch of different trials, the website started sending me warnings.)
This is most likely some sort of authentication issue, not anything you did wrong with RCurl.
You got through to the server, but either your login was incorrect, it wasn't valid, or the data is not available via the API.
http://en.wikipedia.org/wiki/HTTP_403
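Note also that userpwd only supplies HTTP Basic authentication; a site like this more likely expects a form login that sets a session cookie, which you would then reuse for the data request. Here is a rough sketch of that flow, shown in Python with requests purely for illustration; the login URL and form field names are hypothetical, so check the site's actual login form.
import requests

session = requests.Session()

# Hypothetical login endpoint and field names -- inspect the real login
# form (or the browser's network tab) to find the actual ones.
login_url = "https://www.appannie.com/account/login/"
session.post(login_url,
             data={"username": "me#gmail.com", "password": "9999"},
             timeout=30)

# The session now carries whatever cookies the login set, so the data URL
# should no longer return 403 Forbidden (assuming the credentials are valid).
data_url = ("http://www.appannie.com/app/ios/instagram/ranking/history/"
            "chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone")
resp = session.get(data_url, timeout=30)
print(resp.status_code)
print(resp.text[:500])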

cURL receives empty body response from Nginx server

I am trying to fetch HTTP content with cURL, but I only get an empty body in the reply:
[root@www ~]# curl -v http://www.existingdomain.com/
* About to connect() to www.existingdomain.com port 80 (#0)
* Trying 95.211.256.257... connected
* Connected to www.existingdomain.com (95.211.256.257) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-redhat-linux-gnu) libcurl/7.21.0 NSS/3.12.8.0 zlib/1.2.5 libidn/1.18 libssh2/1.2.4
> Host: www.existingdomain.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/0.8.53
< Date: Sat, 28 May 2011 15:56:23 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< X-Powered-By: PHP/5.3.3-0.dotdeb.1
<
* Connection #0 to host www.existingdomain.com left intact
* Closing connection #0
If I change the URL to another domain, like www.google.com, I get the content.
How is this possible? And how can I fetch the content?
The server is free to send the client whatever it likes, including nothing. While this is not exactly nice, there's little the client can do about it. You could
check the server logs to see if there is some problem that makes it so quiet (given the server is under your control), or
try another client to see if the server just doesn't like talking to curl. You can then configure curl to mimic a regular web browser, if that helps.
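For the second point, a quick way to mimic a regular browser is to send browser-like request headers and compare what comes back. A minimal sketch with Python's requests; the header values below are just examples of what a typical browser sends.
import requests

# Fetch the page while presenting browser-like headers. If the server is
# choosing what to send based on the User-Agent (or similar), this should
# return the body a browser would see instead of an empty reply.
headers = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/90.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
}

resp = requests.get("http://www.existingdomain.com/", headers=headers, timeout=30)
print(resp.status_code, len(resp.content), "bytes")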

How to test whether HTTP keep-alive is actually working

I know HTTP keep-alive is on by default in HTTP 1.1, but I want to find a way to confirm that it is actually working.
Does anyone know of a simple way to test this from a web browser (e.g. how to make sense of Wireshark)? I know I need to look for multiple HTTP requests over the same TCP connection, but I don't know how to confirm that in Wireshark or any other way.
Thanks!
As Ron Garrity said on ServerFault, you can use Curl like this:
curl -Iv http://www.aptivate.org 2>&1 | grep -i 'connection #0'
And it outputs these two lines if keep-alive is working:
* Connection #0 to host www.aptivate.org left intact
* Closing connection #0
And if keep-alive is not working, then it just outputs this line:
* Closing connection #0
If you're on Windows Vista or later, you can use Resource Monitor. The Network tab will list all open TCP connections and the process they were started by. Open a browser with one tab, browse to your page, and test.
First, try to capture the traffic to the target website in Wireshark and limit it to what you need with a filter like:
tcp port 80 and host targetwebsite.com
Then load the page in a browser or fetch it with any tool you have. If the target web page refreshes itself or one of the values on it, leave it open until you have seen at least one change.
Now you have enough data and you can stop the capture in Wireshark.
You should see dozens of records whose protocol is TCP or HTTP. For the purpose of this quick check you will not need the TCP records, so let's hide them with another filter: at the top of the window there is a "filter" field; type http there, and Wireshark will hide all records except those with the HTTP protocol.
Now select a record and look at the next level of detail, which you can find in the second pane below the record list. Just to be sure you are looking at the right place, the first line there starts with "Frame XYZ". The fourth line starts with "Transmission Control Protocol"; look for the port numbers after "Src Port:" and "Dst Port:". Depending on the record, one of these numbers belongs to the web server (typically 80) and the other one is the port number on your end.
Now check a couple of different GET records (to know whether a request is a GET, check the Info column). If the port numbers on your end are reused across several requests, those requests were made over an HTTP keep-alive connection.
Remember that most browsers open multiple connections even if the web server supports keep-alive, so do NOT base your conclusion on finding just one reused port.
The most accurate way is to curl the same URL multiple times.
curl -v http://weibo.com -o /dev/null http://weibo.com -o /dev/null
If the output contains Re-using existing connection, then the HTTP keep-alive feature is working. For example,
* TCP_NODELAY set
* Connected to weibo.com (180.149.138.251) port 80 (#0)
> GET / HTTP/1.1
> Host: weibo.com
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< ...
< ...
<
{ [236 bytes data]
* Connection #0 to host weibo.com left intact
* Found bundle for host weibo.com: 0x56324121d9a0 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#0) with host weibo.com
* Connected to weibo.com (180.149.138.251) port 80 (#0)
> GET / HTTP/1.1
> Host: weibo.com
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< ...
< ...
<
{ [236 bytes data]
* Connection #0 to host weibo.com left intact
Another quick way is to test with ab (ApacheBench). But some HTTP servers, such as uWSGI, might not return the Connection: keep-alive header even when keep-alive is already enabled. In such cases, ab does NOT send keep-alive requests, which means ab can only do "positive" detection of HTTP keep-alive.
ab -c 5 -n 50 -k https://www.google.com/
If the result shows
...
Complete requests: 50
Failed requests: 0
Keep-Alive requests: 50 # Pay attention to this line
Total transferred:
...
then HTTP keep-alive is enabled.
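If you would rather confirm connection reuse programmatically instead of reading curl or ab output, here is a rough sketch using Python's standard http.client module against the host from the earlier curl example; it leans on the module's internal auto_open flag to stop it from silently reconnecting, so treat it as a quick check rather than a definitive test.
import http.client

# Issue two requests over one TCP connection. http.client is told not to
# reconnect on its own, so if the server closed the connection after the
# first response, the second request fails instead of quietly reopening.
conn = http.client.HTTPConnection("www.aptivate.org", 80, timeout=10)
conn.auto_open = 0  # internal flag: do not transparently reconnect
conn.connect()

conn.request("HEAD", "/")
first = conn.getresponse()
first.read()  # the response must be fully read before reusing the connection
print("first :", first.status, "Connection:", first.getheader("Connection"))

try:
    conn.request("HEAD", "/")
    second = conn.getresponse()
    second.read()
    print("second:", second.status, "-> connection was reused, keep-alive works")
except (http.client.HTTPException, OSError) as exc:
    print("second request failed (%r) -> server closed the connection" % exc)
finally:
    conn.close()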

Resources