TikaJAXRS PUT from Python client - python-requests

Apache Tika should be accessible from a Python program via HTTP, but I can't get it to work.
I am using this command to run the server (with and without the two options at the end):
java -jar tika-server-1.17.jar --port 5677 -enableUnsecureFeatures -enableFileUrl
And it works fine with curl:
curl -v -T /tmp/tmpsojwBN http://localhost:5677/tika
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 5677 (#0)
> PUT /tika HTTP/1.1
> Host: localhost:5677
> User-Agent: curl/7.47.0
> Accept: */*
> Accept-Encoding: gzip, deflate
> Content-Length: 418074
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Sat, 07 Apr 2018 12:28:41 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
But when I try something like the following (I tried different combinations of headers; here I recreated the same headers that the tika-python client uses):
with tempfile.NamedTemporaryFile() as tmp_file:
    download_file(url, tmp_file)
    payload = open(tmp_file.name, 'rb')
    headers = {
        'Accept': 'application/json',
        'Content-Disposition': 'attachment; filename={}'.format(
            os.path.basename(tmp_file.name))}
    response = requests.put(TIKA_ENDPOINT_URL + '/tika', payload,
                            headers=headers,
                            verify=False)
I've tried sending the file as the payload as well as passing a fileUrl, with the same result: WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable and a Java stack trace on the server. Full trace:
WARN javax.ws.rs.ClientErrorException: HTTP 406 Not Acceptable
at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:173)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:542)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:641)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:748)
I've also tried to compare (with nc -l localhost 5677 | less) what differs between the two requests (payload abbreviated):
From curl:
PUT /tika HTTP/1.1
Host: localhost:5677
User-Agent: curl/7.47.0
Accept: */*
Content-Length: 418074
Expect: 100-continue
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
From Python requests library:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-type: application/pdf
Content-Length: 246176
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
The question is, what is the correct way to call Tika server from Python?
I've also tried the tika-python library in client-only mode, and tika-app via pyjnius. With both the tika client and tika-app.jar via pyjnius, the call only freezes (it never returns) when I use them in a celery worker. At the same time, pyjnius / tika-app and tika-python both work nicely in a standalone script; I have not figured out what is wrong inside the celery worker. I guess it has something to do with threading and/or initialization in the wrong place. But that is a topic for another question.
And here is what tika-python requests:
PUT /tika HTTP/1.1
Host: localhost:5677
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-requests/2.13.0
Content-Disposition: attachment; filename=tmpb3YkTq
Content-Length: 183234
%PDF-1.4
%<D3><EB><E9><E1>
1 0 obj
<</Creator (Chromium)
And now it seems like this is some kind of a problem with tika server:
$ tika-python --verbose --server 'localhost' --port 5677 parse all /tmp/tmpb3YkTq
2018-04-08 09:44:11,555 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7f0b688eb1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:44:13 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
['./tmpb3YkTq_meta.json']
Cf:
$ tika-python --verbose --server 'localhost' --port 5677 parse text /tmp/tmpb3YkTq
2018-04-08 09:43:38,326 [MainThread ] [INFO ] Writing ./tmpb3YkTq_meta.json
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Request headers: ', {'Accept': 'application/json', 'Content-Disposition': 'attachment; filename=tmpb3YkTq'})
(<open file '<stderr>', mode 'w' at 0x7fc3eee4a1e0>, 'Response headers: ', {'Date': 'Sun, 08 Apr 2018 06:43:38 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.y.z-SNAPSHOT)'})
2018-04-08 09:43:38,409 [MainThread ] [WARNI] Tika server returned status: 406
['./tmpb3YkTq_meta.json']
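The traces above suggest a pattern: curl sends Accept: */* and gets text/plain back, while every failing request to /tika sends Accept: application/json. A 406 raised from findTargetMethod usually means no resource method produces the requested media type, so one thing worth trying is asking /tika for plain text. The following is a minimal sketch under that assumption, using only the standard library; build_tika_request and extract_text are illustrative names, not part of tika-python.

```python
import urllib.request


def build_tika_request(base_url, pdf_bytes):
    """Build a PUT request for Tika's /tika endpoint asking for plain text."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            # Assumption: /tika on this server version only produces text
            # formats, so ask for text/plain rather than application/json.
            "Accept": "text/plain",
            "Content-Type": "application/pdf",
        },
    )


def extract_text(base_url, path):
    """Send the file at `path` to a running Tika server and return its text."""
    with open(path, "rb") as fh:
        request = build_tika_request(base_url, fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server from the question running, extract_text("http://localhost:5677", "/tmp/tmpsojwBN") would be the call to try. The same header change should carry over to python-requests by passing headers={'Accept': 'text/plain'} to requests.put.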

Related

OkHttp3: Getting an 'unexpected end of stream' exception while reading a large HTTP response

I have a Java client that is making a POST call to the v1/graphql endpoint of a Hasura server (v1.3.3).
I'm making the HTTP call using the Square okhttp3 library (v4.9.1). The data transfer is happening over HTTP/1.1, using chunked transfer-encoding.
The client is failing with the following error:
Caused by: java.net.ProtocolException: unexpected end of stream
at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:415) ~[okhttp-4.9.1.jar:?]
at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:276) ~[okhttp-4.9.1.jar:?]
at okio.RealBufferedSource.read(RealBufferedSource.kt:189) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.exhausted(RealBufferedSource.kt:197) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.refill(InflaterSource.kt:112) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.readOrInflate(InflaterSource.kt:76) ~[okio-jvm-2.8.0.jar:?]
at okio.InflaterSource.read(InflaterSource.kt:49) ~[okio-jvm-2.8.0.jar:?]
at okio.GzipSource.read(GzipSource.kt:69) ~[okio-jvm-2.8.0.jar:?]
at okio.Buffer.writeAll(Buffer.kt:1642) ~[okio-jvm-2.8.0.jar:?]
at okio.RealBufferedSource.readString(RealBufferedSource.kt:95) ~[okio-jvm-2.8.0.jar:?]
at okhttp3.ResponseBody.string(ResponseBody.kt:187) ~[okhttp-4.9.1.jar:?]
Request Headers:
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Length: 1928
INFO: Host: localhost:10191
INFO: Connection: Keep-Alive
INFO: Accept-Encoding: gzip
INFO: User-Agent: okhttp/4.9.1
Response headers:
INFO: Transfer-Encoding: chunked
INFO: Date: Tue, 27 Apr 2021 12:06:39 GMT
INFO: Server: Warp/3.3.10
INFO: x-request-id: d019408e-e2e3-4583-bcd6-050d4a496b11
INFO: Content-Type: application/json; charset=utf-8
INFO: Content-Encoding: gzip
This is the client code used for making the POST call:
private static final MediaType MEDIA_TYPE_JSON = MediaType.parse("application/json; charset=utf-8");
private static OkHttpClient okHttpClient = new OkHttpClient.Builder()
        .connectTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(5, TimeUnit.MINUTES)
        .readTimeout(5, TimeUnit.MINUTES)
        .addNetworkInterceptor(loggingInterceptor)
        .build();
public GenericHttpResponse httpPost(String url, String textBody, GenericHttpMediaType genericMediaType) throws HttpClientException {
    // okHttpMediaType is presumably the OkHttp MediaType derived from genericMediaType (conversion not shown)
    RequestBody body = RequestBody.create(okHttpMediaType, textBody);
    Request postRequest = new Request.Builder().url(url).post(body).build();
    Call postCall = okHttpClient.newCall(postRequest);
    Response postResponse = postCall.execute();
    return GenericHttpResponse
            .builder()
            .body(postResponse.body().string())
            .headers(postResponse.headers().toMultimap())
            .code(postResponse.code())
            .build();
}
This failure only happens for large responses. According to the server logs, the response size (after gzip encoding) is around 52MB for the failing call. The same code has been working fine for response sizes around 10-15MB.
I tried replicating the same issue through a simple cURL call, but that ran successfully:
curl -v -s --request POST 'http://<hasura_endpoint>/v1/graphql' \
--header 'Content-Type: application/json' \
--header 'Accept-Encoding: gzip, deflate, br' \
--data-raw '...'
* Trying ::1...
* TCP_NODELAY set
* Connected to <host> (::1) port <port> (#0)
> POST /v1/graphql HTTP/1.1
> Host: <host>:<port>
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Type: application/json
> Accept-Encoding: gzip, deflate, br
> Content-Length: 1840
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [1840 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Date: Tue, 27 Apr 2021 11:59:24 GMT
< Server: Warp/3.3.10
< x-request-id: 27e3ff3f-8b95-4328-a1bc-a5492e68f995
< Content-Type: application/json; charset=utf-8
< Content-Encoding: gzip
<
{ [6 bytes data]
* Connection #0 to host <host> left intact
* Closing connection 0
So I'm assuming that this error is specific to the Java client.
Based on suggestions provided in similar posts, I tried the following other approaches:
Adding a Connection: close header to the request
Sending Transfer-Encoding: gzip header in the request
Setting the retryOnConnectionFailure for the OkHttp client to true
But none of these approaches were able to resolve the issue.
So, my questions are:
What could be the underlying cause for this issue? Since I'm using chunked transfer encoding here, I suppose it's not due to an incorrect content-length header passed in the response.
What are the approaches I can try for debugging this further?
Would really appreciate any insights on this. Thank you.
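As background for the first question: a chunked body has no Content-Length, but it is still framed. Each chunk carries a hex size line, and the body is complete only once a zero-length chunk arrives. If the connection closes before that, a client can only report a truncated stream. The sketch below illustrates that framing in Python; it is not OkHttp's implementation, and read_chunked and UnexpectedEndOfStream are illustrative names.

```python
import io


class UnexpectedEndOfStream(Exception):
    """Raised when the body stops before the terminating zero-length chunk."""


def read_chunked(stream):
    """Decode an HTTP/1.1 chunked body from a buffered binary stream."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise UnexpectedEndOfStream("stream ended inside a chunk-size line")
        # Chunk extensions (anything after ';') are ignored in this sketch.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            # Zero-length chunk: the body is complete (trailers not parsed here).
            return bytes(body)
        chunk = stream.read(size)
        if len(chunk) < size:
            raise UnexpectedEndOfStream("stream ended inside chunk data")
        body += chunk
        if stream.read(2) != b"\r\n":
            raise UnexpectedEndOfStream("missing CRLF after chunk data")


complete = io.BytesIO(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
print(read_chunked(complete))  # b'Wikipedia'

truncated = io.BytesIO(b"4\r\nWiki\r\n5\r\npe")
try:
    read_chunked(truncated)
except UnexpectedEndOfStream as exc:
    print("truncated:", exc)
```

Seen this way, the error points at the connection ending before the final chunk was received. One way to narrow it down would be to capture the raw response bytes and check whether the terminating 0 chunk ever arrives.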

How to generate Python XHR Request in requests.post()

I am trying to get info from a website using AJAX. The website shows different sizes for a perfume, and the price changes when a different size is selected.
I checked the Chrome Network tab and found it's an XHR request, but looking at the request headers I have no idea how to generate the same headers and data with the Requests package.
This is what my code currently looks like:
import requests
url = "https://www.beautyfresh.com/uc_aac"
session = requests.Session()
data = {"attributes[Size]":"100ml"} # I want to get the price for 100ml
headers = {"Referer": "https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne",}
r = session.post(url,headers=headers,data=data)
print(r.text)
The General information under Chrome Network tab is
Request URL: https://www.beautyfresh.com/uc_aac
Request Method: POST
Status Code: 200 OK
Remote Address: 103.255.250.100:443
Referrer Policy: no-referrer-when-downgrade
The Response Headers are
HTTP/1.1 200 OK
date: Fri, 18 Dec 2020 02:01:05 GMT
expires: Sun, 19 Nov 1978 05:00:00 GMT
x-site: beautyfresh
x-url: /uc_aac
last-modified: Fri, 18 Dec 2020 02:01:05 GMT
x-backend-server: web4
content-type: application/json
x-varnish: 700512226
age: 0
via: 1.1 varnish (Varnish/6.0)
x-cache: MISS
cache-control: Cache-Control: store, no-cache, must-revalidate
accept-ranges: bytes
content-length: 2193
The Request Headers are
POST /uc_aac HTTP/1.1
Host: www.beautyfresh.com
Connection: keep-alive
Content-Length: 164
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Origin: https://www.beautyfresh.com
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Referer: https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: has_js=1; SESSc5f2026dce40de323b60d32130e6ce0b=n7tr5e2fsragf1js6garc6u06n; _ga=GA1.2.959938064.1608216963; _gid=GA1.2.1979056032.1608216963; _v1EmaticSolutionsUTMData=%7B%22utm_source%22%3A%22%22%2C%22utm_medium%22%3A%22%22%2C%22utm_campaign%22%3A%22%22%7D; _fbp=fb.1.1608216963288.1600789170; _v1EmaticSolutionsBye=%7B%2228732%22%3A%7B%2230038%22%3A%7B%22dont_show_till%22%3A%222020-12-20%22%2C%22loop%22%3A1%7D%7D%2C%2228739%22%3A%7B%2230045%22%3A%7B%22dont_show_till%22%3A%222020-12-20%22%2C%22loop%22%3A1%7D%7D%7D; _v1EmaticSolutionsEI=%7B%22c_28739_1%22%3A%5B1%2C1608216997180%2C33181%5D%2C%22c_28732_2%22%3A%5B1%2C1608216969616%2C0%5D%7D; _v1EmaticSolutions=%5B%22fc5d18b3-4077-11eb-970e-0242ac160003%22%2C1608217230870%2C%5B%22IMG%22%2C%22%22%2C1%2C%22glasshouse_fragrances_amalfi_coast_sea_candle_350gr.jpg%22%5D%5D; __atuvc=8%7C51; __atuvs=5fdc0d5961186ea4000; _gat=1; _dc_gtm_UA-63339192-1=1
And Form Data shows:
attributes%5BSize%5D=100ml&nid=2905&qty=1&form_build_id=form-991e88780c30fdf883375a36a986b550&form_id=uc_product_add_to_cart_form_2905&product-nid=2905&aac_nid=2905
I don't actually know how to construct my request so that it is successfully posted to the server and gets the proper response. It should return the price for "100ml", but currently my code gets nothing.
Thank you so much for any help!
You are missing some information in your data. To receive a response, try adding "aac_nid": "2905" to your data when sending the post request:
import requests
headers = {
    "Referer": "https://www.beautyfresh.com/product/fragrance/men/perfume-fragrance/women/perfume-men/fragrance/perfume/jo-malone-orange-blossom-cologne",
}
data = {"attributes[Size]": "30ml", "aac_nid": "2905"}
response = requests.post(
    "https://www.beautyfresh.com/uc_aac", headers=headers, data=data
)
>>> print(response.content)
b'{"nid":"2905","model":"690251006564","replacements":{"sellprice":"<div class=\\"product-info sellprice\\"><div class=\'retail_price\'>City Price: <div class=\'product-info retail_price\'><span class=\'uc-price\'>S$110.00<\\/span><\\/div><\\/div>\\r\\n <div class=\'promoprice\'>Our Price: <div class=\'product-info sellprice1\'><span class=\'uc-price\'>S$89.00<\\/span><\\/div><\\/div><\\/div>","model":"<div class=\\"model\\"><span class=\\"label\\">Product Code: <\\/span>690251006564<\\/div>"},"form":"<form action=\\"\\/uc_aac\\" accept-charset=\\"UTF-8\\" method=\\"post\\" id=\\"uc-product-add-to-cart-form\\" class=\\" uc-aac-cart\\">\\n<div><div class=\'attributes\'><div class=\\"form-item\\" id=\\"edit-attributes-Size-wrapper\\">\\n <label for=\\"edit-attributes-Size\\">Size <span class=\\"form-required\\" title=\\"This field is required.\\">*<\\/span><\\/label>\\n <select name=\\"attributes[Size]\\" class=\\"form-select required chosen-widget\\" data-name=\\"Size\\" id=\\"edit-attributes-Size\\" ><option value=\\"30ml\\" selected=\\"selected\\">30ml<\\/option><option value=\\"100ml\\">100ml<\\/option><\\/select>\\n<\\/div>\\n<\\/div><input type=\\"hidden\\" name=\\"nid\\" id=\\"edit-nid\\" value=\\"2905\\" \\/>\\n<div class=\\"form-item\\" id=\\"edit-qty-wrapper\\">\\n <label for=\\"edit-qty\\">Qty <\\/label>\\n <input type=\\"text\\" maxlength=\\"3\\" name=\\"qty\\" id=\\"edit-qty\\" size=\\"5\\" value=\\"1\\" class=\\"form-text textfield\\" \\/>\\n<\\/div>\\n<input type=\\"hidden\\" name=\\"form_build_id\\" id=\\"form-b7adf002178f04ed96377894057352a2\\" value=\\"form-b7adf002178f04ed96377894057352a2\\" \\/>\\n<input type=\\"hidden\\" name=\\"form_id\\" id=\\"edit-uc-product-add-to-cart-form\\" value=\\"uc_product_add_to_cart_form\\" \\/>\\n<input type=\\"hidden\\" name=\\"aac_nid\\" id=\\"edit-aac-nid\\" value=\\"2905\\" \\/>\\n<div class=\'leadtime_message\' style=\'margin-bottom:1em;\'><p>Delivers in 1-3 working days<\\/p><\\/div><input type=\\"submit\\" 
name=\\"op\\" id=\\"edit-submit-2905\\" value=\\"Add to Cart\\" class=\\"notranslate form-submit node-add-to-cart primary\\" \\/>\\n\\n<\\/div><\\/form>\\n"}'
To get the price, try searching for it with the help of the built-in re (regex) module:
import re
prices = re.findall(r"'uc-price\\'>S(\$\d.*?)<", str(response.content))
print("Original Price:", prices[0])
print("Our Price:", prices[1])
Output:
Original Price: $110.00
Our Price: $89.00
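Since the response is JSON, an alternative to running a regex over str(response.content) is to decode the JSON first and search only the HTML fragment stored under replacements.sellprice. The sketch below assumes the response keeps the shape shown above; extract_prices is an illustrative name, and the sample payload is a shortened stand-in for the real response.

```python
import json
import re


def extract_prices(payload):
    """Return the price strings found in the 'sellprice' HTML fragment."""
    data = json.loads(payload)
    html = data["replacements"]["sellprice"]
    # Each price sits in <span class='uc-price'>S$...</span>; drop the leading S.
    return re.findall(r"class='uc-price'>S(\$[\d.,]+)<", html)


# Shortened stand-in for the server response shown above.
sample = json.dumps({
    "nid": "2905",
    "replacements": {
        "sellprice": (
            "<div class='retail_price'>City Price: "
            "<span class='uc-price'>S$110.00</span></div>"
            "<div class='promoprice'>Our Price: "
            "<span class='uc-price'>S$89.00</span></div>"
        )
    },
})

print(extract_prices(sample))  # ['$110.00', '$89.00']
```

With the live request, extract_prices(response.text) would take the place of the sample. Decoding first avoids depending on how Python escapes quotes when it prints a bytes object.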

https server python based is closing connection after serving the response

I created a simple server, which I run from the terminal:
#!/usr/bin/env python3
import sys, os, socket, ssl
import requests
import string
import time
from socketserver import ThreadingMixIn
from http.server import HTTPServer,BaseHTTPRequestHandler
from io import BytesIO
import json
import cgi
class ThreadingServer(ThreadingMixIn, HTTPServer):
    pass
class RequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        #self.send_header('Content-type', 'Application/json')
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        self.allow_reuse_address = True
        self.wfile.write(b"""{"signingResponse": {"compactidentity": "..SdOwnT70ZZDAjgSmQVP-_0keB_pu4FjkBg5DZDyFf_V5k0EUAY0KCHr2g2a6wOSs-JhsehdYUnrYCfkYItzxLg;info=<http://52.23.250.93:8080/certs/shaken.crt>;alg=ES256;ppt=shaken\n", "TEST": "Nitish","identity": "eyJhbGciOiJFUzI1NiIsInBwdCI6InNoYWtlbiIsInR5cCI6InBhc3Nwb3J0IiwieDV1IjoiaHR0cDovLzUyLjIzLjI1MC45Mzo4MDgwL2NlcnRzL3NoYWtlbi5jcnQifQ.eyJhdHRlc3QiOiJBIiwiZGVzdCI6eyJ0biI6WyIxMjM1NTU1MTIxMiJdfSwiaWF0IjoxNDgzMjI4ODAwLCJvcmlnIjp7InRuIjoiMTIzNTU1NTEyMTIifSwib3JpZ2lkIjoiOGE4ZWM2MTgtYzZiOS0zMGFlLWI0MjctYWY0MTA0YjFjMDJjIn0.SdOwnT70ZZDAjgSmQVP-_0keB_pu4FjkBg5DZDyFf_V5k0EUAY0KCHr2g2a6wOSs-JhsehdYUnrYCfkYItzxLg;info=<http://52.23.250.93:8080/certs/shaken.crt>;alg=ES256;ppt=shaken\n", "requestid": "0"}} """)
httpd = ThreadingServer(('192.168.1.2', 8003), RequestHandler)
httpd.socket = ssl.wrap_socket(httpd.socket, keyfile='/home/nakumar/key.pem', certfile='/home/nakumar/certificate.pem', server_side=True)
httpd.serve_forever()
With the above code I am trying to simulate the server.
Now, when the server receives a request from the client, it sends back the response and closes the connection, as shown below.
Request
> POST /stir/v1/signing HTTP/1.1
Host: 192.168.1.2:8003
Accept: application/json
Content-Type: application/json
Content-Length: 331
Response
upload completely sent off: 331 out of 331 bytes
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: BaseHTTP/0.6 Python/3.5.2
< Date: Tue, 09 Oct 2018 12:43:21 GMT
<
* Closing connection 0
So we can see that the connection is closed by the server after the response is served.
Is there a way not to close the connection after the response is served?
curl was closing the connection because no Content-Length was present in the response.
After adding it, it started working:
> POST /stir/v1/signing HTTP/1.1
Host: [FD00:10:6B50:4510:0:0:0:53]:8101
Accept: application/json
Content-Type: application/json
Content-Length: 325
* upload completely sent off: 325 out of 325 bytes
< HTTP/1.1 200 OK
< Server: HTTP/1.1 Python/3.5.2
< Date: Thu, 18 Oct 2018 09:13:11 GMT
< Content-type: Application/json
< Content-length: 150
<
* Connection #1 to host FD00:10:6B50:4510:0:0:0:53 left intact
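The working trace above shows both HTTP/1.1 and a Content-length header. With Python's http.server, that corresponds to setting protocol_version = "HTTP/1.1" on the handler (the default is HTTP/1.0, which closes after each response) and sending an explicit Content-Length. The following is a minimal sketch without TLS, bound to localhost on an ephemeral port; KeepAliveHandler and demo are illustrative names, and the JSON payload is a placeholder.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 connections are persistent by default; the handler's
    # default of HTTP/1.0 closes the connection after each response.
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        payload = json.dumps({"requestid": "0"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Without Content-Length the client cannot tell where the body ends.
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Send two POSTs over one client connection; return (status, version, local port)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = HTTPConnection("127.0.0.1", server.server_address[1])
    results = []
    for _ in range(2):
        conn.request("POST", "/stir/v1/signing", body=b"{}",
                     headers={"Content-Type": "application/json"})
        local_port = conn.sock.getsockname()[1]
        resp = conn.getresponse()
        resp.read()
        results.append((resp.status, resp.version, local_port))
    conn.close()
    server.shutdown()
    server.server_close()
    return results
```

Calling demo() returns one tuple per request. If the local port is the same in both tuples, the second request reused the first TCP connection instead of opening a new one.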

what is curl doing differently to httr::POST that causes 400 bad request?

I'm trying to query data from the Materials Project web API in R.
The documentation provides an example query which is conducted using both curl and python. I've copied the curl command below.
curl -s --header "X-API-KEY: <YOUR-API-KEY>" \
https://materialsproject.org/rest/v2/query \
-F criteria='{"elements": {"$in": ["Li", "Na", "K"], "$all": ["O"]}, "nelements": 2}' \
-F properties='["formula", "formation_energy_per_atom"]'
From reading the httr quickstart guide, it seems to me I should be able to reproduce this query with:
library(httr)
POST(url = "https://www.materialsproject.org/rest/v2/query",
     config = add_headers("X-API-KEY" = "<YOUR-API-KEY>"),
     body = list(criteria = "{'elements': {'$in': ['Li', 'Na', 'K'], '$all': ['O']}, 'nelements': 2}",
                 properties = "['formula', 'formation_energy_per_atom']"),
     encode = "multipart",
     verbose())
But while the curl command returns JSON data from the Materials Project database, my R query returns an HTTP/1.1 400 BAD REQUEST. What is curl doing differently from httr in the code above?
I've tried adding -v to curl and comparing it to the verbose() output above, but curl doesn't show what it's putting in the multipart form.
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------d2ef2f3982185118
>
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< Date: Tue, 27 Dec 2016 21:18:58 GMT
< Server: Apache/2.2.15 (CentOS)
< Vary: Accept-Encoding,User-Agent
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: application/json
Meanwhile httr shows:
-> Content-Type: multipart/form-data; boundary=----------------------------5b4873dbc9cd
->
<- HTTP/1.1 100 Continue
>> ------------------------------5b4873dbc9cd
>> Content-Disposition: form-data; name="criteria"
>>
>> {'elements': {'$in': ['Li', 'Na', 'K'], '$all': ['O']}, 'nelements': 2}
>> ------------------------------5b4873dbc9cd
>> Content-Disposition: form-data; name="properties"
>>
>> ['formula', 'formation_energy_per_atom']
>> ------------------------------5b4873dbc9cd--
It's truly a terrible, poorly thought out & lazily implemented API. They seem to like Python so it's unsurprising this would be the case.
The following works:
library(httr)
library(jsonlite)
list(
criteria=toJSON(list(
elements=list(
`$in`=c("Li", "Na", "K"),
`$all`=c("0")
),
nelements=unbox(2)
)),
properties=toJSON(c("formula", "formation_energy_per_atom"))
) -> params
POST(url="https://www.materialsproject.org/rest/v2/query",
add_headers(`X-API-KEY`=Sys.getenv("MATERIALS_PROJECT_API_KEY")),
body=params,
encode="multipart", verbose()) -> res
and here's the verbose() output to prove it:
-> POST /rest/v2/query HTTP/1.1
-> Host: www.materialsproject.org
-> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
-> Accept-Encoding: gzip, deflate
-> Accept: application/json, text/xml, application/xml, */*
-> X-API-KEY: wouldntyouliketoknow
-> Content-Length: 344
-> Expect: 100-continue
-> Content-Type: multipart/form-data; boundary=------------------------34f08173ce0a7818
->
<- HTTP/1.1 100 Continue
>> --------------------------34f08173ce0a7818
>> Content-Disposition: form-data; name="criteria"
>>
>> {"elements":{"$in":["Li","Na","K"],"$all":["0"]},"nelements":2}
>> --------------------------34f08173ce0a7818
>> Content-Disposition: form-data; name="properties"
>>
>> ["formula","formation_energy_per_atom"]
>> --------------------------34f08173ce0a7818--
<- HTTP/1.1 200 OK
<- Date: Wed, 28 Dec 2016 02:08:08 GMT
<- Server: Apache/2.2.15 (CentOS)
<- Vary: Accept-Encoding,User-Agent
<- Content-Encoding: gzip
<- Content-Length: 258
<- Connection: close
<- Content-Type: application/json
<-
It's super picky about the query string structure. They really should have just accepted a JSON body and have been done with it. But half-REDACTED is the way of python folk.
Oh gosh, I just noticed it's a CentOS server supplying the replies. Yep. Those folks really do like pain.
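The visible difference between the failing and the working request bodies is the quoting: the failing form fields use single quotes, which is not valid JSON, while toJSON emits double quotes. Since the question mentions that the documentation also has a Python example, here is a small Python sketch of the same idea: build each form field by serializing real data structures instead of hand-writing the string. build_form_fields and is_valid_json are illustrative names.

```python
import json

criteria = {"elements": {"$in": ["Li", "Na", "K"], "$all": ["O"]}, "nelements": 2}
properties = ["formula", "formation_energy_per_atom"]


def build_form_fields(criteria, properties):
    """Serialize each multipart form field as a JSON string."""
    return {
        "criteria": json.dumps(criteria),
        "properties": json.dumps(properties),
    }


def is_valid_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


fields = build_form_fields(criteria, properties)
single_quoted = "{'elements': {'$in': ['Li', 'Na', 'K'], '$all': ['O']}, 'nelements': 2}"

print(is_valid_json(fields["criteria"]))  # True
print(is_valid_json(single_quoted))       # False
```

The second check fails because JSON strings must use double quotes. A server that parses the field strictly would reject the single-quoted version.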

What is wrong with my GET request?

Sorry to bother with something that should be easy.
I have this HTTP GET request:
GET /ip HTTP/1.1
Host: httpbin.org
Connection: close
Accept: */*
User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)
When I send this request via my ESP8266 it returns a 404 error:
HTTP/1.1 404 Not Found
Date: Fri, 04 Sep 2015 16:34:46 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html
But when I (and you) go to http://httpbin.org/ip it works perfectly!
What is wrong?
DETAILS
I construct my request in Lua:
conn:on("connection", function(conn, payload)
print('\nConnected')
req = "GET /ip"
.." HTTP/1.1\r\n"
.."Host: httpbin.org\r\n"
.."Connection: close\r\n"
.."Accept: */*\r\n"
.."User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)\r\n"
.."\r\n"
print(req)
conn:send(req)
end)
And if I use another host (given in this example) it works:
conn:on("connection", function(conn, payload)
print('\nConnected')
conn:send("GET /esp8266/test.php?"
.."T="..(tmr.now()-Tstart)
.."&heap="..node.heap()
.." HTTP/1.1\r\n"
.."Host: benlo.com\r\n"
.."Connection: close\r\n"
.."Accept: */*\r\n"
.."User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)\r\n"
.."\r\n")
end)
Are you actually connecting to httpbin.org? Or somewhere else?
I just tried issuing your request by typing it into telnet, and it worked for me. But the server responding was nginx, whereas your example shows Apache.
$ telnet httpbin.org 80
Trying 54.175.219.8...
Connected to httpbin.org.
Escape character is '^]'.
GET /ip HTTP/1.1
Host: httpbin.org
Connection: close
Accept: */*
User-Agent: Mozilla/4.0 (compatible; esp8266 Lua; Windows NT 5.1)
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 07 Oct 2015 06:08:40 GMT
Content-Type: application/json
Content-Length: 32
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
{
"origin": "124.149.55.34"
}
Connection closed by foreign host.
When I try another request with some other URI to force a 404 response, I see this:
HTTP/1.1 404 NOT FOUND
Server: nginx
Date: Wed, 07 Oct 2015 06:12:21 GMT
Content-Type: text/html
Content-Length: 233
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Which again is nothing like the response you said you got from httpbin.org.
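The telnet check above can also be scripted, which makes it easy to compare the Server header across networks or hosts. The sketch below builds the same request bytes, with CRLF line endings and a terminating blank line, and reads back only the response head. build_get_request and fetch_response_head are illustrative names, and the User-Agent header from the question is left out for brevity.

```python
import socket


def build_get_request(host, path):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),
        "Connection: close",
        "Accept: */*",
        "",  # the two empty items produce the CRLF that ends the last header
        "",  # and the blank line that ends the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_response_head(host, path, port=80, timeout=10):
    """Send the request and return the status line and headers as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        data = b""
        # Read until the blank line that separates headers from the body.
        while b"\r\n\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
```

Running print(fetch_response_head("httpbin.org", "/ip")) needs network access. If the Server header it prints differs from the one the ESP8266 sees, the device is probably reaching a different machine.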
http.get("http://httpbin.org/ip", nil, function(code, data)
if (code < 0) then
print("HTTP request failed")
else
print(code, data)
end
end)
http.post('http://httpbin.org/post',
'Content-Type: application/json\r\n',
'{"hello":"world"}',
function(code, data)
if (code < 0) then
print("HTTP request failed")
else
print(code, data)
end
end)
see the reference here.
It's your request line the server doesn't like.
This will do the job:
GET http://httpbin.org/ip HTTP/1.1
Host: httpbin.org
