How to enable gzip compression middleware in go-chi

How can I enable gzip compression using the gzip middleware of the go-chi framework?
I tried the example shown here:
https://github.com/go-chi/chi/issues/204
but when I check with curl, I get this:
$ curl -H "Accept-Encoding: gzip" -I http://127.0.0.1:3333
HTTP/1.1 405 Method Not Allowed
Date: Sat, 31 Aug 2019 19:06:39 GMT
I tried the code "hello world":
package main
import (
	"net/http"
	"github.com/go-chi/chi"
	"github.com/go-chi/chi/middleware"
)
func main() {
	r := chi.NewRouter()
	r.Use(middleware.RequestID)
	r.Use(middleware.Logger)
	//r.Use(middleware.DefaultCompress) //using this produces the same result
	r.Use(middleware.Compress(5, "gzip"))
	r.Get("/", Hello)
	http.ListenAndServe(":3333", r)
}
func Hello(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html") //according to the documentation this must be here to enable gzip
	w.Write([]byte("hello world\n"))
}
But when I verify with curl, the result is the same:
$ curl -H "Accept-Encoding: gzip" -I http://127.0.0.1:3333
HTTP/1.1 405 Method Not Allowed
Date: Sat, 31 Aug 2019 19:06:39 GMT
What's going on?

The other answers are outdated now. I had to solve this myself, so here is what I found out.
Your error is here:
	r.Use(middleware.Compress(5, "gzip"))
The second argument ("types") refers to the content types that the compression will be applied to, for example "text/html", "application/json", etc.
Just add a list of the content types you want to compress, or remove the argument altogether:
func main() {
	r := chi.NewRouter()
	r.Use(middleware.RequestID)
	r.Use(middleware.Logger)
	r.Use(middleware.Compress(5))
	r.Get("/", Hello)
	http.ListenAndServe(":3333", r)
}
This will compress all the content-types defined in the default list from middleware.Compress:
var defaultCompressibleContentTypes = []string{
	"text/html",
	"text/css",
	"text/plain",
	"text/javascript",
	"application/javascript",
	"application/x-javascript",
	"application/json",
	"application/atom+xml",
	"application/rss+xml",
	"image/svg+xml",
}
Good luck!
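To make the role of the "types" argument concrete, here is a minimal stdlib-only sketch of a content-type-filtered gzip middleware. It is not chi's implementation: the names compress, gzipWriter and responseFor are made up for this example, and unlike chi it has no default type list. It only illustrates why passing "gzip" as a type leaves a text/html response uncompressed.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gzipWriter compresses the body only when the handler's Content-Type is in
// the allowed set; otherwise bytes pass through unchanged.
type gzipWriter struct {
	http.ResponseWriter
	allowed map[string]bool
	gz      *gzip.Writer
	decided bool
}

// decide inspects the Content-Type once, just before headers are sent.
func (w *gzipWriter) decide() {
	if w.decided {
		return
	}
	w.decided = true
	ct := strings.TrimSpace(strings.Split(w.Header().Get("Content-Type"), ";")[0])
	if w.allowed[ct] {
		w.Header().Set("Content-Encoding", "gzip")
		w.Header().Del("Content-Length")
		w.gz = gzip.NewWriter(w.ResponseWriter)
	}
}

func (w *gzipWriter) WriteHeader(code int) {
	w.decide()
	w.ResponseWriter.WriteHeader(code)
}

func (w *gzipWriter) Write(p []byte) (int, error) {
	w.decide()
	if w.gz != nil {
		return w.gz.Write(p)
	}
	return w.ResponseWriter.Write(p)
}

// compress is a hypothetical stand-in for a compress middleware: types lists
// the content types that may be compressed.
func compress(types ...string) func(http.Handler) http.Handler {
	allowed := make(map[string]bool, len(types))
	for _, t := range types {
		allowed[t] = true
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
				next.ServeHTTP(w, r)
				return
			}
			gw := &gzipWriter{ResponseWriter: w, allowed: allowed}
			defer func() {
				if gw.gz != nil {
					gw.gz.Close() // flush the gzip trailer
				}
			}()
			next.ServeHTTP(gw, r)
		})
	}
}

func hello(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html")
	w.Write([]byte("hello world\n"))
}

// responseFor sends one GET with "Accept-Encoding: gzip" through the
// middleware and returns the Content-Encoding header and the decoded body.
func responseFor(types ...string) (encoding, body string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set("Accept-Encoding", "gzip")
	compress(types...)(http.HandlerFunc(hello)).ServeHTTP(rec, req)
	res := rec.Result()
	encoding = res.Header.Get("Content-Encoding")
	var r io.Reader = res.Body
	if encoding == "gzip" {
		zr, err := gzip.NewReader(res.Body)
		if err != nil {
			return encoding, ""
		}
		r = zr
	}
	b, _ := io.ReadAll(r)
	return encoding, string(b)
}

func main() {
	enc, body := responseFor("text/html", "application/json")
	fmt.Printf("types=text/html,application/json: Content-Encoding=%q body=%q\n", enc, body)
	enc, body = responseFor("gzip") // "gzip" is treated as a content type here
	fmt.Printf("types=gzip: Content-Encoding=%q body=%q\n", enc, body)
}
```

In this sketch the first call reports a gzip-encoded response, while the second reports no Content-Encoding, because no handler ever produces a content type called "gzip".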

r.Use(middleware.DefaultCompress) has now been marked as DEPRECATED.
To enable compression you need to create a compressor, and use its handler.
	r := chi.NewRouter()
	r.Use(middleware.RequestID)
	r.Use(middleware.Logger)
	compressor := middleware.NewCompressor(flate.DefaultCompression)
	r.Use(compressor.Handler())
	r.Get("/", Hello)
	http.ListenAndServe(":3333", r)
The flate package must be imported as compress/flate.

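curl -I sends a HEAD request, but the route is registered only for GET, which is why the server answers 405 Method Not Allowed. Use the commented-out middleware.DefaultCompress and a normal GET request.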
package main
import (
	"net/http"
	"github.com/go-chi/chi"
	"github.com/go-chi/chi/middleware"
)
func main() {
	r := chi.NewRouter()
	r.Use(middleware.RequestID)
	r.Use(middleware.Logger)
	r.Use(middleware.DefaultCompress)
	r.Get("/", Hello)
	http.ListenAndServe(":3333", r)
}
func Hello(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html")
	w.Write([]byte("hello world\n"))
}
Try with curl:
$ curl -v http://localhost:3333 --compressed
* Rebuilt URL to: http://localhost:3333/
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 3333 (#0)
> GET / HTTP/1.1
> Host: localhost:3333
> User-Agent: curl/7.58.0
> Accept: */*
> Accept-Encoding: deflate, gzip
>
< HTTP/1.1 200 OK
< Content-Encoding: gzip
< Content-Type: text/html
< Date: Sat, 31 Aug 2019 23:37:52 GMT
< Content-Length: 36
<
hello world
* Connection #0 to host localhost left intact
Or HTTPie:
$ http :3333
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 36
Content-Type: text/html
Date: Sat, 31 Aug 2019 23:38:31 GMT
hello world
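The difference between the failing and the working curl calls is the request method: curl -I sends HEAD, while plain curl sends GET. A small stdlib-only sketch of that effect follows. The getOnly wrapper is a made-up stand-in for a route registered only with r.Get, not chi's own code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getOnly mimics a route that was registered for GET only: every other
// method is answered with 405 Method Not Allowed.
func getOnly(h http.HandlerFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		h(w, r)
	})
}

// statusFor sends one request with the given method and returns the status.
func statusFor(method string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, "/", nil)
	getOnly(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.Write([]byte("hello world\n"))
	}).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("HEAD:", statusFor(http.MethodHead)) // what `curl -I` sends
	fmt.Println("GET: ", statusFor(http.MethodGet))  // what `curl` sends without -I
}
```

To look only at the response headers of a real GET request, something like curl -s -o /dev/null -D - -H "Accept-Encoding: gzip" http://127.0.0.1:3333 should work, since -D - dumps the headers and -o /dev/null discards the body.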

Related

OkHttp3: Getting an 'unexpected end of stream' exception while reading a large HTTP response

I have a Java client that makes a POST call to the v1/graphql endpoint of a Hasura server (v1.3.3).
I'm making the HTTP call using the Square okhttp3 library (v4.9.1). The data transfer happens over HTTP/1.1, using chunked transfer encoding.
The client is failing with the following error:
Caused by: java.net.ProtocolException: unexpected end of stream
at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:415) ~[okhttp-4.9.1.jar:?]
at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:276) ~[okhttp-4.9.1.jar:?]
at okio.RealBufferedSource.read(RealBufferedSource.kt:189) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.exhausted(RealBufferedSource.kt:197) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.refill(InflaterSource.kt:112) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.readOrInflate(InflaterSource.kt:76) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.read(InflaterSource.kt:49) ~[okio-jvm-2.8.0.jar:?]
at okio.GzipSource.read(GzipSource.kt:69) ~[okio-jvm-2.8.0.jar:?]
at okio.Buffer.writeAll(Buffer.kt:1642) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.readString(RealBufferedSource.kt:95) ~[okio-jvm-2.8.0.jar:?]
at okhttp3.ResponseBody.string(ResponseBody.kt:187) ~[okhttp-4.9.1.jar:?]
Request Headers:
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Length: 1928
INFO: Host: localhost:10191
INFO: Connection: Keep-Alive
INFO: Accept-Encoding: gzip
INFO: User-Agent: okhttp/4.9.1
Response headers:
INFO: Transfer-Encoding: chunked
INFO: Date: Tue, 27 Apr 2021 12:06:39 GMT
INFO: Server: Warp/3.3.10
INFO: x-request-id: d019408e-e2e3-4583-bcd6-050d4a496b11
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Encoding: gzip
This is the client code used for making the POST call:
private static final MediaType MEDIA_TYPE_JSON = MediaType.parse("application/json; charset=utf-8");
private static OkHttpClient okHttpClient = new OkHttpClient.Builder()
        .connectTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(5, TimeUnit.MINUTES)
        .readTimeout(5, TimeUnit.MINUTES)
        .addNetworkInterceptor(loggingInterceptor)
        .build();
public GenericHttpResponse httpPost(String url, String textBody, GenericHttpMediaType genericMediaType) throws HttpClientException {
    // okHttpMediaType is presumably derived from genericMediaType; the conversion is not shown in the post
    RequestBody body = RequestBody.create(okHttpMediaType, textBody);
    Request postRequest = new Request.Builder().url(url).post(body).build();
    Call postCall = okHttpClient.newCall(postRequest);
    Response postResponse = postCall.execute();
    return GenericHttpResponse
            .builder()
            .body(postResponse.body().string())
            .headers(postResponse.headers().toMultimap())
            .code(postResponse.code())
            .build();
}
This failure only happens for large responses. According to the server logs, the failing response is around 52MB after gzip encoding. The same code has been working fine for responses of around 10-15MB.
I tried replicating the same issue through a simple cURL call, but that ran successfully:
curl -v -s --request POST 'http://<hasura_endpoint>/v1/graphql' \
--header 'Content-Type: application/json' \
--header 'Accept-Encoding: gzip, deflate, br' \
--data-raw '...'
* Trying ::1...
* TCP_NODELAY set
* Connected to <host> (::1) port <port> (#0)
> POST /v1/graphql HTTP/1.1
> Host: <host>:<port>
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Type: application/json
> Accept-Encoding: gzip, deflate, br
> Content-Length: 1840
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [1840 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Date: Tue, 27 Apr 2021 11:59:24 GMT
< Server: Warp/3.3.10
< x-request-id: 27e3ff3f-8b95-4328-a1bc-a5492e68f995
< Content-Type: application/json; charset=utf-8
< Content-Encoding: gzip
<
{ [6 bytes data]
* Connection #0 to host <host> left intact
* Closing connection 0
So I'm assuming that this error is specific to the Java client.
Based on suggestions provided in similar posts, I tried the following other approaches:
Adding a Connection: close header to the request
Sending Transfer-Encoding: gzip header in the request
Setting the retryOnConnectionFailure for the OkHttp client to true
But none of these approaches were able to resolve the issue.
So, my questions are:
What could be the underlying cause for this issue? Since I'm using chunked transfer encoding here, I suppose it's not due to an incorrect content-length header passed in the response.
What are the approaches I can try for debugging this further?
Would really appreciate any insights on this. Thank you.

HTTP NTLM authentication

I am trying to consume an API which requires NTLM authentication.
This curl command works fine:
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --ntlm -u user:password -d '{ "key1": "100",     "key2": "1"}' http://some/api/v/12
Now I am trying to do the same from a Go program:
package main
import (
	"bytes"
	"fmt"
	"net/url"
	"net/http"
	"io/ioutil"
	"log"
	"github.com/Azure/go-ntlmssp"
)
func main() {
	url_ := "http://some/api/v/12"
	client := &http.Client{
		Transport: ntlmssp.Negotiator{
			RoundTripper: &http.Transport{},
		},
	}
	data := url.Values{}
	data.Set("key1", "100")
	data.Set("key2", "1")
	b := bytes.NewBufferString(data.Encode())
	req, err := http.NewRequest("POST", url_, b)
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")
	req.SetBasicAuth("user", "password")
	resp, err := client.Do(req)
	if err != nil {
		fmt.Printf("Error : %s", err)
	} else {
		responseData, err := ioutil.ReadAll(resp.Body)
		if err != nil {
			log.Fatal(err)
		}
		responseString := string(responseData)
		fmt.Println(responseString)
		resp.Body.Close()
	}
}
When I execute this program I receive an "invalid credentials" error, which I normally receive when I don't include the "--ntlm" flag in the curl command.
Can you please give me a hint on how I can accomplish this task with Go?
Update
printing the request from the curl command:
* About to connect() to www.xxx.xxx.com port xx (#0)
* Trying xxx.xxx.x.xxx...
* Connected to www.xxx.xxx.com (xxx.xxx.x.xx) port xx (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* Server auth using NTLM with user 'user'
> POST /some/api/v2 HTTP/1.1
> Authorization: NTLM xxxxx (44 characters)
> User-Agent: curl/7.29.0
> Host: www.xxx.xxx.com
> Content-Type: application/json
> Accept: application/json
> Content-Length: 0
>
< HTTP/1.1 401 Unauthorized
< Content-Type: text/html; charset=us-ascii
< Server: Microsoft-HTTPAPI/2.0
< WWW-Authenticate: NTLM xxxxx (312 characters)
< Date: Thu, xx Aug xxxx xx:xx:xx xxx
< Content-Length: 341
<
* Ignoring the response-body
* Connection
* Issue another request to this URL: 'http://some/api/v2'
* Found bundle for host www.xxx.xxx.com: 0x0000
* Re-using existing connection!
* Connected to www.xxx.xxx.com (xxx.xxx.x.xx) port xx (#0)
* Server auth using NTLM with user 'user'
> POST /api/v2 HTTP/1.1
> Authorization: NTLM xxx (176 characters)
> User-Agent: curl/7.29.0
> Host: www.xxx.xxx.com
> Content-Type: application/json
> Accept: application/json
> Content-Length: 39
>
* upload completely sent off: 39 out of 39 bytes
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Pragma: no-cache
< Content-Type: application/json; charset=utf-8
< Expires: -1
< Server: Microsoft-IIS/7.5
< X-AspNet-Version: 4.0.30319
< Persistent-Auth: true
< X-Powered-By: ASP.NET
< Date: Thu, 08 Aug 2019 06:49:41 GMT
< Content-Length: 1235
NTLM needs a fully qualified Domain\Username login. Email or simple username does not work. So for the username part, it has to look like this:
MYDOMAIN\[username]
where [username] is the actual windows user.
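As a rough sketch of how the request could be built with a domain-qualified user name: the helper newNTLMRequest and the domain MYDOMAIN are made up for this example. It assumes, as in the question's code, that the ntlmssp.Negotiator transport picks up the credentials set via SetBasicAuth. It also marshals the payload as real JSON, because url.Values.Encode() in the question produces a form-encoded body (key1=100&key2=1) even though the Content-Type says application/json.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newNTLMRequest builds a JSON POST whose credentials use the
// DOMAIN\username form. The request is only constructed here, not sent.
func newNTLMRequest(url, domain, user, password string, payload interface{}) (*http.Request, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")
	// A raw string literal avoids having to escape the backslash.
	req.SetBasicAuth(domain+`\`+user, password)
	return req, nil
}

func main() {
	req, err := newNTLMRequest("http://some/api/v/12", "MYDOMAIN", "user", "password",
		map[string]string{"key1": "100", "key2": "1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	user, _, _ := req.BasicAuth()
	fmt.Println("user:", user)
	fmt.Println("content length:", req.ContentLength)
}
```

The resulting request would then be passed to client.Do(req) with the same client and transport as in the question.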

Python-based HTTPS server closes the connection after serving the response

I created a simple server in the terminal:
#!/usr/bin/env python3
import sys, os, socket, ssl
import requests
import string
import time
from socketserver import ThreadingMixIn
from http.server import HTTPServer,BaseHTTPRequestHandler
from io import BytesIO
import json
import cgi
class ThreadingServer(ThreadingMixIn, HTTPServer):
    pass
class RequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        #self.send_header('Content-type', 'Application/json')
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        self.allow_reuse_address = True
        self.wfile.write(b"""{"signingResponse": {"compactidentity": "..SdOwnT70ZZDAjgSmQVP-_0keB_pu4FjkBg5DZDyFf_V5k0EUAY0KCHr2g2a6wOSs-JhsehdYUnrYCfkYItzxLg;info=<http://52.23.250.93:8080/certs/shaken.crt>;alg=ES256;ppt=shaken\n", "TEST": "Nitish","identity": "eyJhbGciOiJFUzI1NiIsInBwdCI6InNoYWtlbiIsInR5cCI6InBhc3Nwb3J0IiwieDV1IjoiaHR0cDovLzUyLjIzLjI1MC45Mzo4MDgwL2NlcnRzL3NoYWtlbi5jcnQifQ.eyJhdHRlc3QiOiJBIiwiZGVzdCI6eyJ0biI6WyIxMjM1NTU1MTIxMiJdfSwiaWF0IjoxNDgzMjI4ODAwLCJvcmlnIjp7InRuIjoiMTIzNTU1NTEyMTIifSwib3JpZ2lkIjoiOGE4ZWM2MTgtYzZiOS0zMGFlLWI0MjctYWY0MTA0YjFjMDJjIn0.SdOwnT70ZZDAjgSmQVP-_0keB_pu4FjkBg5DZDyFf_V5k0EUAY0KCHr2g2a6wOSs-JhsehdYUnrYCfkYItzxLg;info=<http://52.23.250.93:8080/certs/shaken.crt>;alg=ES256;ppt=shaken\n", "requestid": "0"}} """)
httpd = ThreadingServer(('192.168.1.2', 8003), RequestHandler)
httpd.socket = ssl.wrap_socket(httpd.socket, keyfile='/home/nakumar/key.pem', certfile='/home/nakumar/certificate.pem', server_side=True)
httpd.serve_forever()
Using the above code I am trying to simulate the server.
When the server receives a request from the client, it sends back the response and closes the connection, as shown below.
Request
> POST /stir/v1/signing HTTP/1.1
Host: 192.168.1.2:8003
Accept: application/json
Content-Type: application/json
Content-Length: 331
Response
upload completely sent off: 331 out of 331 bytes
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: BaseHTTP/0.6 Python/3.5.2
< Date: Tue, 09 Oct 2018 12:43:21 GMT
<
* Closing connection 0
So we can see that the server closes the connection after the response is served.
Is there a way not to close the connection after the response is served?
curl was closing the connection because the response carried no Content-Length header.
After adding it, the connection stays open:
> POST /stir/v1/signing HTTP/1.1
Host: [FD00:10:6B50:4510:0:0:0:53]:8101
Accept: application/json
Content-Type: application/json
Content-Length: 325
* upload completely sent off: 325 out of 325 bytes
< HTTP/1.1 200 OK
< Server: HTTP/1.1 Python/3.5.2
< Date: Thu, 18 Oct 2018 09:13:11 GMT
< Content-type: Application/json
< Content-length: 150
<
* Connection #1 to host FD00:10:6B50:4510:0:0:0:53 left intact
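The two traces differ in protocol version (HTTP/1.0 vs HTTP/1.1) as well as in the Content-Length header, so the fix presumably changed both. The effect on connection reuse can be reproduced with a stdlib-only Go sketch: a raw TCP server sends either response shape, and the client reports whether its second request ran on the first connection. All names and the shortened response body are made up for this example; it is not a port of the Python server.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptrace"
	"strings"
)

const body = `{"requestid": "0"}`

// Raw responses modelled on the two curl traces.
var (
	http10NoLength = "HTTP/1.0 200 OK\r\n\r\n" + body
	http11Length   = fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: %d\r\n\r\n%s", len(body), body)
)

// serve answers each request with the fixed raw response. With closeAfter set
// it closes the connection after one response, as a server must when the
// body has no Content-Length (the close marks the end of the body).
func serve(ln net.Listener, response string, closeAfter bool) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			br := bufio.NewReader(c)
			for {
				req, err := http.ReadRequest(br)
				if err != nil {
					return
				}
				io.Copy(io.Discard, req.Body) // drain the request body
				io.WriteString(c, response)
				if closeAfter {
					return
				}
			}
		}(conn)
	}
}

// secondRequestReused sends two POSTs and reports whether the second one
// ran on the connection opened by the first.
func secondRequestReused(response string, closeAfter bool) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()
	go serve(ln, response, closeAfter)

	client := &http.Client{Transport: &http.Transport{}}
	defer client.CloseIdleConnections()
	reused := false
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	ctx := httptrace.WithClientTrace(context.Background(), trace)
	url := "http://" + ln.Addr().String() + "/stir/v1/signing"
	for i := 0; i < 2; i++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader("{}"))
		if err != nil {
			return false, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return false, err
		}
		io.Copy(io.Discard, resp.Body) // read to EOF so the connection can be reused
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	r, err := secondRequestReused(http10NoLength, true)
	fmt.Println("HTTP/1.0, no Content-Length: reused =", r, err)
	r, err = secondRequestReused(http11Length, false)
	fmt.Println("HTTP/1.1 + Content-Length:   reused =", r, err)
}
```

For the Python server itself, the usual way to get the second behaviour is to set protocol_version = "HTTP/1.1" on the handler class and send a Content-Length header before end_headers(). Whether the author did exactly that is not stated in the post.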

Unused headers in curl request

I'm trying to emulate a curl -X GET with Go, but the server I'm contacting requires authentication. I've followed several sites that recommend using r.Header.Add(), but I can't get the equivalent of my curl call to work.
My curl call that actually returns something:
curl -X GET https://myserver.com/test/anothertest -H 'x-access-token: a1b2c3d4'
My code that doesn't return the expected JSON object:
func get(api string, headers map[string]string, dataStruct interface{}) (data interface{}, err error) {
	req, _ := http.NewRequest("GET", api, nil)
	for k, v := range headers {
		req.Header[k] = []string{v}
	}
	currentHeader, _ := httputil.DumpRequestOut(req, true)
	fmt.Println(string(currentHeader))
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return
	}
	defer resp.Body.Close()
	data = dataStruct
	err = getJson(api, &data)
	if err != nil {
		return
	}
	return
}
func main() {
	// args := parse_args(Options{})
	args := Options{Api: "my_server",
		Headers: map[string]string{
			"x-access-token": "a1b2c3d4"}}
	body, _ := get(args.Api, args.Headers, HistoryDecoder{})
	fmt.Println(body)
}
Returns:
GET /v1pre3/users/current HTTP/1.1
Host: my_server
User-Agent: Go-http-client/1.1
X-Access-Token: a1b2c3d4
Accept-Encoding: gzip
map[ResponseStatus:map[Message:Please ensure that valid credentials are being provided for this API call (This request requires authentication but none were provided)] Notifications:[map[Type:error Item:Please ensure that valid credentials are being provided for this API call (This request requires authentication but none were provided)]]]
Could anyone please tell me what I'm doing wrong?
Edit: adding the curl -v response:
* About to connect() to my_server port 443 (#0)
* Trying 54.210.110.53...
* Connected to my_server (54.210.110.53) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=*.cloud-test.com
* start date: Feb 28 00:00:00 2018 GMT
* expire date: Mar 28 12:00:00 2019 GMT
* common name: *.cloud-test.com
* issuer: CN=Amazon,OU=Server CA 1B,O=Amazon,C=US
> GET /v1pre3/users/current HTTP/1.1
> User-Agent: curl/7.29.0
> Host: my_server
> Accept: */*
> x-access-token: a1b2c3d4
>
< HTTP/1.1 200 OK
< Cache-Control: no-cache, no-store, must-revalidate
< Content-Type: application/json
< Date: Tue, 30 Oct 2018 15:49:00 GMT
< Expires: 0
< Pragma: no-cache
< Server:
< x-capabilities: audit
< X-Content-Type-Options: nosniff
< X-Request-ID: 2018.10.30.15.49.00.9zHjdeQ-ZEmg2IvzTymnxQ
< transfer-encoding: chunked
< Connection: keep-alive
<
* Connection #0 to host my_server left intact
After a lot of rewriting and debugging, I figured out that getJson(api, &data) was also making an http.Get call, but without the headers. Since that second call took place after the request we were discussing, it bypassed all testing and printing, and its unauthenticated response was the one being decoded.
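A minimal sketch of the fix, assuming the goal is simply to decode the body of the request that carried the headers instead of issuing a second one. The names getJSON, demo and the user type are made up for this example, and the test server stands in for the real API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// user is a placeholder for the real response type.
type user struct {
	Name string `json:"name"`
}

// getJSON sends one GET with the given headers and decodes the JSON body of
// that same response into target. No second request is made.
func getJSON(client *http.Client, url string, headers map[string]string, target interface{}) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	for k, v := range headers {
		req.Header.Set(k, v)
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return json.NewDecoder(resp.Body).Decode(target)
}

// demo runs getJSON against a throwaway test server that rejects requests
// without the expected token header.
func demo(token string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("x-access-token") != "a1b2c3d4" {
			http.Error(w, "authentication required", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"name":"current-user"}`)
	}))
	defer srv.Close()

	var out user
	err := getJSON(srv.Client(), srv.URL, map[string]string{"x-access-token": token}, &out)
	return out.Name, err
}

func main() {
	name, err := demo("a1b2c3d4")
	fmt.Println(name, err)
	_, err = demo("wrong-token")
	fmt.Println(err)
}
```

Checking the status code before decoding also makes an authentication failure show up as an error instead of as a decoded error payload.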

TikaJAXRS PUT from Python client

Apache Tika should be accessible from a Python program via HTTP, but I can't get it to work.
I am using this command to run the server (with and without the two options at the end):
java -jar tika-server-1.17.jar --port 5677 -enableUnsecureFeatures -enableFileUrl
And it works fine with curl:
curl -v -T /tmp/tmpsojwBN http://localhost:5677/tika
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5677 (#0)
> PUT /tika HTTP/1.1
> Host: localhost:5677
> User-Agent: curl/7.47.0
> Accept: */*
> Accept-Encoding: gzip, deflate
> Content-Length: 418074
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Sat, 07 Apr 2018 12:28:41 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
But when I try something like (tried different combinations for headers, here I recreated same headers as python-tika client uses):
with tempfile.NamedTemporaryFile() as tmp_file:
    download_file(url, tmp_file)
    payload = open(tmp_file.name, 'rb')
    headers = {
        'Accept': 'application/json',
        'Content-Disposition': 'attachment; filename={}'.format(
            os.path.basename(tmp_file.name))}
    response = requests.put(TIKA_ENDPOINT_URL + '/tika', payload,
                            headers=headers,
                            verify=False)
I've tried to use payload as well as fileUrl - with the same result of WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable and java stack trace on the server. Full trace:
WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable
at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:173)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:542)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:641)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:748)
I've also tried to compare (with nc -l localhost 5677 | less) how the two requests differ (payload abbreviated):
From curl:
PUT /tika HTTP/1.1
Host: localhost:5677
User-Agent: curl/7.47.0
Accept: */*
Content-Length: 418074
Expect: 100-continue
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
From Python requests library:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-type: application/pdf
Content-Length: 246176
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
The question is, what is the correct way to call Tika server from Python?
I've also tried the python tika library in client-only mode and tika-app via jnius. With the tika client, as well as with tika-app.jar via pyjnius, the call freezes (never returns), but only when I use them in a celery worker. At the same time, pyjnius / tika-app and tika-python both work nicely in a standalone script; I have not figured out what is wrong inside the celery worker. I guess it has something to do with threading and/or initialization in the wrong place. But that is a topic for another question.
And here is what tika-python requests:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-Disposition: attachment; filename=tmpb3YkTq
Content-Length: 183234
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
And now it seems like this is some kind of a problem with tika server:
$ tika-python --verbose --server 'localhost' --port 5677 parse all /tmp/tmpb3YkTq
2018-04-08 09:44:11,555 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:44:13 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
['./tmpb3YkTq_meta.json']
Cf:
$ tika-python --verbose --server 'localhost' --port 5677 parse text /tmp/tmpb3YkTq
2018-04-08 09:43:38,326 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:43:38 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
2018-04-08 09:43:38,409 [MainThread ] [WARNI] Tika server returned status: 406
['./tmpb3YkTq_meta.json']
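One visible difference between the working curl request and the failing ones is the Accept header: curl sends */*, while the failing requests ask for application/json, and 406 Not Acceptable is the status a server uses when it cannot produce a representation matching Accept. The stdlib-only Go sketch below illustrates that reading. It is an assumption, not something the traces above confirm for Tika: fakeTika is a made-up stand-in server that only produces text/plain, and putDocument and tikaStatus are hypothetical helper names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// putDocument uploads data with PUT and the given Accept header, and returns
// the status code and response body.
func putDocument(client *http.Client, endpoint, accept string, data []byte) (int, string, error) {
	req, err := http.NewRequest(http.MethodPut, endpoint, bytes.NewReader(data))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Accept", accept)
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// fakeTika is a stand-in server (not Tika) that can only produce text/plain
// and answers 406 for any other Accept value.
func fakeTika() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		accept := r.Header.Get("Accept")
		if accept != "*/*" && accept != "text/plain" {
			w.WriteHeader(http.StatusNotAcceptable)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		fmt.Fprint(w, "extracted text")
	}))
}

// tikaStatus returns the status the stand-in server gives for an Accept value.
func tikaStatus(accept string) int {
	srv := fakeTika()
	defer srv.Close()
	code, _, err := putDocument(srv.Client(), srv.URL+"/tika", accept, []byte("%PDF-1.4"))
	if err != nil {
		return 0
	}
	return code
}

func main() {
	fmt.Println("Accept: application/json ->", tikaStatus("application/json"))
	fmt.Println("Accept: text/plain       ->", tikaStatus("text/plain"))
	fmt.Println("Accept: */*              ->", tikaStatus("*/*"))
}
```

If this reading is right, sending Accept: text/plain (or no Accept header at all) from requests against /tika would be the first thing to try.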

Resources