What is wrong with my GET request? - http

Sorry to bother with something that should be easy.
I have this HTTP GET request:
GET /ip HTTP/1.1
Host: httpbin.org
Connection: close
Accept: */*
User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)
When I send this request via my ESP8266 it returns a 404 error:
HTTP/1.1 404 Not Found
Date: Fri, 04 Sep 2015 16:34:46 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html
But when I (and you) go to http://httpbin.org/ip it works perfectly!
What is wrong?
DETAILS
I construct my request in Lua:
conn:on("connection", function(conn, payload)
print('\nConnected')
req = "GET /ip"
.." HTTP/1.1\r\n"
.."Host: httpbin.org\r\n"
.."Connection: close\r\n"
.."Accept: */*\r\n"
.."User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)\r\n"
.."\r\n"
print(req)
conn:send(req)
end)
And if I use another host (given is this example) it works:
conn:on("connection", function(conn, payload)
print('\nConnected')
conn:send("GET /esp8266/test.php?"
.."T="..(tmr.now()-Tstart)
.."&heap="..node.heap()
.." HTTP/1.1\r\n"
.."Host: benlo.com\r\n"
.."Connection: close\r\n"
.."Accept: */*\r\n"
.."User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)\r\n"
.."\r\n")
end)

Are you actually connecting to httpbin.org ? Or somewhere else?
I just tried issuing your request by typing it into telnet, and it worked for me. But the server responding was nginx, whereas your example shows apache.
$ telnet httpbin.org 80
Trying 54.175.219.8...
Connected to httpbin.org.
Escape character is '^]'.
GET /ip HTTP/1.1
Host: httpbin.org
Connection: close
Accept: */*
User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 07 Oct 2015 06:08:40 GMT
Content-Type: application/json
Content-Length: 32
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
{
"origin": "124.149.55.34"
}
Connection closed by foreign host.
When I try another request with some other URI to force a 404 response, I see this:
HTTP/1.1 404 NOT FOUND
Server: nginx
Date: Wed, 07 Oct 2015 06:12:21 GMT
Content-Type: text/html
Content-Length: 233
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Which again is nothing like the response you said you got from httpbin.org.

http.get("http://httpbin.org/ip", nil, function(code, data)
if (code < 0) then
print("HTTP request failed")
else
print(code, data)
end
end)
http.post('http://httpbin.org/post',
'Content-Type: application/json\r\n',
'{"hello":"world"}',
function(code, data)
if (code < 0) then
print("HTTP request failed")
else
print(code, data)
end
end)
see the reference here.

It's your request line the server doesn't like.
This will do the job:
GET http://httpbin.org/ip HTTP/1.1
Host: httpbin.org

Related

OkHttp3: Getting an 'unexpected end of stream' exception while reading a large HTTP response

I have a Java client, that is making a POST call to the v1/graphql endpoint of a Hasura server (v1.3.3)
I'm making the HTTP call using the Square okhttp3 library (v4.9.1). The data transfer is happening over HTTP1.1, using chunked transfer-encoding.
The client is failing with the following error:
Caused by: java.net.ProtocolException: unexpected end of stream
at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:415) ~[okhttp-4.9.1.jar:?]
at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:276) ~[okhttp-4.9.1.jar:?]
at okio.RealBufferedSource.read(RealBufferedSource.kt:189) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.exhausted(RealBufferedSource.kt:197) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.refill(InflaterSource.kt:112) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.readOrInflate(InflaterSource.kt:76) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.read(InflaterSource.kt:49) ~[okio-jvm-2.8.0.jar:?]
at okio.GzipSource.read(GzipSource.kt:69) ~[okio-jvm-2.8.0.jar:?]
at okio.Buffer.writeAll(Buffer.kt:1642) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.readString(RealBufferedSource.kt:95) ~[okio-jvm-2.8.0.jar:?]
at okhttp3.ResponseBody.string(ResponseBody.kt:187) ~[okhttp-4.9.1.jar:?]
Request Headers:
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Length: 1928
INFO: Host: localhost:10191
INFO: Connection: Keep-Alive
INFO: Accept-Encoding: gzip
INFO: User-Agent: okhttp/4.9.1
Response headers:
INFO: Transfer-Encoding: chunked
INFO: Date: Tue, 27 Apr 2021 12:06:39 GMT
INFO: Server: Warp/3.3.10
INFO: x-request-id: d019408e-e2e3-4583-bcd6-050d4a496b11
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Encoding: gzip
This is the client code used for the making the POST call:
private static final MediaType MEDIA_TYPE_JSON = MediaType.parse("application/json; charset=utf-8");
private static OkHttpClient okHttpClient = new OkHttpClient.Builder()
.connectTimeout(30, TimeUnit.SECONDS)
.writeTimeout(5, TimeUnit.MINUTES)
.readTimeout(5, TimeUnit.MINUTES)
.addNetworkInterceptor(loggingInterceptor)
.build();
public GenericHttpResponse httpPost(String url, String textBody, GenericHttpMediaType genericMediaType) throws HttpClientException {
RequestBody body = RequestBody.create(okHttpMediaType, textBody);
Request postRequest = new Request.Builder().url(url).post(body).build();
Call postCall = okHttpClient.newCall(okHttpRequest);
Response postResponse = postCall.execute();
return GenericHttpResponse
.builder()
.body(okHttpResponse.body().string())
.headers(okHttpResponse.headers().toMultimap())
.code(okHttpResponse.code())
.build();
}
This failure is only happening for large response sizes. As per the server logs, the response size (after gzip encoding) is around 52MB, but the call is still failing. This same code has been working fine for response sizes around 10-15MB.
I tried replicating the same issue through a simple cURL call, but that ran successfully:
curl -v -s --request POST 'http://<hasura_endpoint>/v1/graphql' \
--header 'Content-Type: application/json' \
--header 'Accept-Encoding: gzip, deflate, br' \
--data-raw '...'
* Trying ::1...
* TCP_NODELAY set
* Connected to <host> (::1) port <port> (#0)
> POST /v1/graphql HTTP/1.1
> Host: <host>:<port>
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Type: application/json
> Accept-Encoding: gzip, deflate, br
> Content-Length: 1840
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [1840 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Date: Tue, 27 Apr 2021 11:59:24 GMT
< Server: Warp/3.3.10
< x-request-id: 27e3ff3f-8b95-4328-a1bc-a5492e68f995
< Content-Type: application/json; charset=utf-8
< Content-Encoding: gzip
<
{ [6 bytes data]
* Connection #0 to host <host> left intact
* Closing connection 0
So I'm assuming that this error is specific to the Java client.
Based on suggestions provided in similar posts, I tried the following other approaches:
Adding a Connection: close header to the request
Sending Transfer-Encoding: gzip header in the request
Setting the retryOnConnectionFailure for the OkHttp client to true
But none of these approaches were able to resolve the issue.
So, my questions are:
What could be the underlying cause for this issue? Since I'm using chunked transfer encoding here, I suppose it's not due to an incorrect content-length header passed in the response.
What are the approaches I can try for debugging this further?
Would really appreciate any insights on this. Thank you.

How to generate Python XHR Request in requests.post()

I am trying to get info from a website using AJAX. The Website showing different size for perfume and basically, the price would change when selecting different size.
I checked chrome Network Tab and found it's a XHR request, but looking at the request head I have no idea how to generate the same headers and data with the Requests package.
This is how my code currently looks like:
import requests
url = "https://www.beautyfresh.com/uc_aac"
session = requests.Session()
data = {"attributes[Size]":"100ml"} # I want to get the price for 100ml
headers = {"Referer": "https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne",}
r = session.post(url,headers=headers,data=data)
print(r.text)
The General information under Chrome Network tab is
Request URL: https://www.beautyfresh.com/uc_aac
Request Method: POST
Status Code: 200 OK
Remote Address: 103.255.250.100:443
Referrer Policy: no-referrer-when-downgrade
The Response Headers is
HTTP/1.1 200 OK
date: Fri, 18 Dec 2020 02:01:05 GMT
expires: Sun, 19 Nov 1978 05:00:00 GMT
x-site: beautyfresh
x-url: /uc_aac
last-modified: Fri, 18 Dec 2020 02:01:05 GMT
x-backend-server: web4
content-type: application/json
x-varnish: 700512226
age: 0
via: 1.1 varnish (Varnish/6.0)
x-cache: MISS
cache-control: Cache-Control: store, no-cache, must-revalidate
accept-ranges: bytes
content-length: 2193
The Request Headers is
POST /uc_aac HTTP/1.1
Host: www.beautyfresh.com
Connection: keep-alive
Content-Length: 164
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Origin: https://www.beautyfresh.com
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Referer: https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: has_js=1; SESSc5f2026dce40de323b60d32130e6ce0b=n7tr5e2fsragf1js6garc6u06n; _ga=GA1.2.959938064.1608216963; _gid=GA1.2.1979056032.1608216963; _v1EmaticSolutionsUTMData=%7B%22utm_source%22%3A%22%22%2C%22utm_medium%22%3A%22%22%2C%22utm_campaign%22%3A%22%22%7D; _fbp=fb.1.1608216963288.1600789170; _v1EmaticSolutionsBye=%7B%2228732%22%3A%7B%2230038%22%3A%7B%22dont_show_till%22%3A%222020-12-20%22%2C%22loop%22%3A1%7D%7D%2C%2228739%22%3A%7B%2230045%22%3A%7B%22dont_show_till%22%3A%222020-12-20%22%2C%22loop%22%3A1%7D%7D%7D; _v1EmaticSolutionsEI=%7B%22c_28739_1%22%3A%5B1%2C1608216997180%2C33181%5D%2C%22c_28732_2%22%3A%5B1%2C1608216969616%2C0%5D%7D; _v1EmaticSolutions=%5B%22fc5d18b3-4077-11eb-970e-0242ac160003%22%2C1608217230870%2C%5B%22IMG%22%2C%22%22%2C1%2C%22glasshouse_fragrances_amalfi_coast_sea_candle_350gr.jpg%22%5D%5D; __atuvc=8%7C51; __atuvs=5fdc0d5961186ea4000; _gat=1; _dc_gtm_UA-63339192-1=1
And Form Data shows:
attributes%5BSize%5D=100ml&nid=2905&qty=1&form_build_id=form-991e88780c30fdf883375a36a986b550&form_id=uc_product_add_to_cart_form_2905&product-nid=2905&aac_nid=2905
I don't actually know how to construct my request so it can be successfully posted to the server the get the proper response. It should return the price for "100ml", but currently my code get nothing.
Thank you so much for any help!
You are missing some information in your data. To receive a response, try adding "aac_nid": "2905" to your data when sending the post request:
import requests
headers = {
"Referer": "https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne",
}
data = {"attributes[Size]": "30ml", "aac_nid": "2905"}
response = requests.post(
"https://www.beautyfresh.com/uc_aac", headers=headers, data=data
)
>>> print(response.content)
b'{"nid":"2905","model":"690251006564","replacements":{"sellprice":"<div class=\\"product-info sellprice\\"><div class=\'retail_price\'>City Price: <div class=\'product-info retail_price\'><span class=\'uc-price\'>S$110.00<\\/span><\\/div><\\/div>\\r\\n <div class=\'promoprice\'>Our Price: <div class=\'product-info sellprice1\'><span class=\'uc-price\'>S$89.00<\\/span><\\/div><\\/div><\\/div>","model":"<div class=\\"model\\"><span class=\\"label\\">Product Code: <\\/span>690251006564<\\/div>"},"form":"<form action=\\"\\/uc_aac\\" accept-charset=\\"UTF-8\\" method=\\"post\\" id=\\"uc-product-add-to-cart-form\\" class=\\" uc-aac-cart\\">\\n<div><div class=\'attributes\'><div class=\\"form-item\\" id=\\"edit-attributes-Size-wrapper\\">\\n <label for=\\"edit-attributes-Size\\">Size <span class=\\"form-required\\" title=\\"This field is required.\\">*<\\/span><\\/label>\\n <select name=\\"attributes[Size]\\" class=\\"form-select required chosen-widget\\" data-name=\\"Size\\" id=\\"edit-attributes-Size\\" ><option value=\\"30ml\\" selected=\\"selected\\">30ml<\\/option><option value=\\"100ml\\">100ml<\\/option><\\/select>\\n<\\/div>\\n<\\/div><input type=\\"hidden\\" name=\\"nid\\" id=\\"edit-nid\\" value=\\"2905\\" \\/>\\n<div class=\\"form-item\\" id=\\"edit-qty-wrapper\\">\\n <label for=\\"edit-qty\\">Qty <\\/label>\\n <input type=\\"text\\" maxlength=\\"3\\" name=\\"qty\\" id=\\"edit-qty\\" size=\\"5\\" value=\\"1\\" class=\\"form-text textfield\\" \\/>\\n<\\/div>\\n<input type=\\"hidden\\" name=\\"form_build_id\\" id=\\"form-b7adf002178f04ed96377894057352a2\\" value=\\"form-b7adf002178f04ed96377894057352a2\\" \\/>\\n<input type=\\"hidden\\" name=\\"form_id\\" id=\\"edit-uc-product-add-to-cart-form\\" value=\\"uc_product_add_to_cart_form\\" \\/>\\n<input type=\\"hidden\\" name=\\"aac_nid\\" id=\\"edit-aac-nid\\" value=\\"2905\\" \\/>\\n<div class=\'leadtime_message\' style=\'margin-bottom:1em;\'><p>Delivers in 1-3 working days<\\/p><\\/div><input type=\\"submit\\" name=\\"op\\" id=\\"edit-submit-2905\\" value=\\"Add to Cart\\" class=\\"notranslate form-submit node-add-to-cart primary\\" \\/>\\n\\n<\\/div><\\/form>\\n"}'
To get the price, try searching for it with the help of the built-in re (regex) module:
import re
prices = re.findall(r"'uc-price\\'>S(\$\d.*?)<", str(response.content))
print("Original Price:", prices[0])
print("Our Price:", prices[1])
Output:
Original Price: $110.00
Our Price: $89.00

Strange firefox bug triggers on reload

I noticed a bug in an old version of firefox, that was shipped with my os.
These were the sympthoms:
Guile web server failed to serve the request when data was reposted.
I came up with a minimal example to show the problem.
Steps to reproduce:
start the server script
load localhost:8080
select the test.csv file for upload, and upload it
hit the refresh button in the browser
answer yes to resend post data dialog.
test.scm:
(use-modules (web server)
(rnrs bytevectors))
(define (handler request body)
(if body
(display (utf8->string body)))
(values '((content-type . (text/html)))
(string-append "<html><body>"
"<form action=\"do\" method=\"POST\" enctype=\"multipart/form-dat\
a\">"
"<input type=\"file\" name=\"x\">"
"<input type=\"submit\">")))
(run-server handler)
test.csv:
a,b
Expected result: no error displayed on the console.
Actual result:
-----------------------------18912432064747206221264673165
Content-Disposition: form-data; name="x"; filename="test.csv"
Content-Type: text/csv
In ice-9/boot-9.scm:
841:4 4 (with-throw-handler _ _ _)
In web/server/http.scm:
127:28 3 (_)
In web/request.scm:
205:31 2 (read-request #<closed: file 5559bbcb82a0> _)
In web/http.scm:
1141:6 1 (read-request-line _)
In ice-9/boot-9.scm:
752:25 0 (dispatch-exception _ _ _)
Bad request: Bad Request-Line: "a,b"
What am I doing wrong here?
Some additional information:
on a whireshark capture it turns out, that the following is sent on resend:
POST /do HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: hu-HU,hu;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://localhost:8080/do
Content-Type: multipart/form-data; boundary=---------------------------121791188820701943592108452984
Content-Length: 150
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
-----------------------------121791188820701943592108452984
Content-Disposition: form-data; name="x"; filename="test.csv"
Content-Type: text/csv
POST /do HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: hu-HU,hu;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://localhost:8080/do
Content-Type: multipart/form-data; boundary=---------------------------121791188820701943592108452984
Content-Length: 150
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
-----------------------------121791188820701943592108452984
Content-Disposition: form-data; name="x"; filename="test.csv"
Content-Type: text/csv
a,b
-----------------------------121791188820701943592108452984--
I will check the http spec if it has anything to say about this. The first http request is partial, followed by a well formed request.
UPDATE:
It turned out that guile webserver threw the error completely legitimately.
Answering my own question:
This is actually this firfox bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=1434553
Fixed in firefox62.

TikaJAXRS PUT from Python client

Apache Tika should be accessible from Python program via HTTP, but I can't get it to work.
I am using this command to run the server (with and without the two options at the end):
java -jar tika-server-1.17.jar --port 5677 -enableUnsecureFeatures -enableFileUrl
And it works fine with curl:
curl -v -T /tmp/tmpsojwBN http://localhost:5677/tika
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5677 (#0)
> PUT /tika HTTP/1.1
> Host: localhost:5677
> User-Agent: curl/7.47.0
> Accept: */*
> Accept-Encoding: gzip, deflate
> Content-Length: 418074
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Sat, 07 Apr 2018 12:28:41 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
But when I try something like (tried different combinations for headers, here I recreated same headers as python-tika client uses):
with tempfile.NamedTemporaryFile() as tmp_file:
download_file(url, tmp_file)
payload = open(tmp_file.name, 'rb')
headers = {
'Accept': 'application/json',
'Content-Disposition': 'attachment; filename={}'.format(
os.path.basename(tmp_file.name))}
response = requests.put(TIKA_ENDPOINT_URL + '/tika', payload,
headers=headers,
verify=False)
I've tried to use payload as well as fileUrl - with the same result of WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable and java stack trace on the server. Full trace:
WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable
at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:173)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:542)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:641)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:748)
I've also tried to compare ( with nc -l localhost 5677 | less) what is so different with two requests (payload abbreviated):
From curl:
PUT /tika HTTP/1.1
Host: localhost:5677
User-Agent: curl/7.47.0
Accept: */*
Content-Length: 418074
Expect: 100-continue
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
From Python requests library:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-type: application/pdf
Content-Length: 246176
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
The question is, what is the correct way to call Tika server from Python?
I've also tried python tika library in client-only mode and using tika-app via jnius. With tika client, as well as using tika-app.jar with pyjnius, I only freezes (call never returns) when I use them in a celery worker. At the same, pyjnius / tika-app and tika-python script both work nicely in a script: I have not figured out what is wrong inside celery worker. I guess, something to do with threading and/or initialization in wrong place. But that is a topic for another question.
And here is what tika-python requests:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-Disposition: attachment; filename=tmpb3YkTq
Content-Length: 183234
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
And now it seems like this is some kind of a problem with tika server:
$ tika-python --verbose --server 'localhost' --port 5677 parse all /tmp/tmpb3YkTq
2018-04-08 09:44:11,555 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:44:13 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
['./tmpb3YkTq_meta.json']
Cf:
$ tika-python --verbose --server 'localhost' --port 5677 parse text /tmp/tmpb3YkTq
2018-04-08 09:43:38,326 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:43:38 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
2018-04-08 09:43:38,409 [MainThread ] [WARNI] Tika server returned status: 406
['./tmpb3YkTq_meta.json']

HTTP GET Request Structure

Consider the following hyperlink:
<a href="http://www.cs.rutgers.edu/∼shklar/">
What HTTP/1.0 request will get submitted by the browser?
What HTTP/1.1 request will get submitted by the browser?
Will these requests change if the browser is configured to
contact an HTTP proxy? If yes, how?
While you could use tcpdump to dump the actual network traffic, curl is surely more handy to test the HTTP conversation from the command line.
An HTTP/1.0 request:
curl -v -0 http://www.cs.rutgers.edu/∼shklar/
* About to connect() to www.cs.rutgers.edu port 80 (#0)
* Trying 128.6.4.24...
* connected
* Connected to www.cs.rutgers.edu (128.6.4.24) port 80 (#0)
> GET /∼shklar/ HTTP/1.0
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: www.cs.rutgers.edu
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Date: Wed, 31 Oct 2012 17:57:31 GMT
< Server: Apache/1.3.26 (Unix)
< Content-Type: text/html; charset=iso-8859-1
< Connection: close
An HTTP/1.1 request:
curl -v http://www.cs.rutgers.edu/∼shklar/
* About to connect() to www.cs.rutgers.edu port 80 (#0)
* Trying 128.6.4.24...
* connected
* Connected to www.cs.rutgers.edu (128.6.4.24) port 80 (#0)
> GET /∼shklar/ HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: www.cs.rutgers.edu
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Date: Wed, 31 Oct 2012 17:59:47 GMT
< Server: Apache/1.3.26 (Unix)
< Content-Type: text/html; charset=iso-8859-1
< Transfer-Encoding: chunked
Use the -x (or --proxy) <[protocol://][user#password]proxyhost[:port]> switch to use a proxy and see the results.
More about curl here: http://curl.haxx.se/docs/manpage.html

Resources