How to send GET request longer than 65535 symbols from rust? - http

I am rewriting part of my API from python to rust. In particular, I am trying to make an HTTP request to OSRM server to get a big distance matrix. This kind of request can have quite large URLs. In python everything works fine, but in rust I get an error:
thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(TooLong)'
I have tried to use several HTTP client libraries: reqwest, surf, isahc, awc. But it turns out that constraining logic is located at the URL processing library https://github.com/hyperium/http and most HTTP clients depend on this library. So they behave the same. I could not use some libs, for example with awc I got compile-time errors with my async code.
Is there any way to send a large GET request from rust, preferably asynchronously?

As freakish pointed out in the comments already, having such a long URL is a bad idea, anything longer than 2,000 characters won't work in most browsers.
That being said: In the comments, you stated that an external API wants those crazily long URIs, so you don't really have an alternative. Therefore, let's give this problem a shot.
It looks like the limitation to 65.534 bytes is because the http library stores the position of the query string as a u16 (and uses 65,535 if there is no query part). The following patch seems to make the code use u32 instead, thereby raising the number of characters to 4,294,967,294 (if you've got longer URIs than that, you might be able to use u64 instead, but that would be an URI of a length greater than 4 GB – I doubt you need this):
--- a/src/uri/mod.rs
+++ b/src/uri/mod.rs
## -141,7 +141,7 ## enum ErrorKind {
}
// u16::MAX is reserved for None
-const MAX_LEN: usize = (u16::MAX - 1) as usize;
+const MAX_LEN: usize = (u32::MAX - 1) as usize;
// URI_CHARS is a table of valid characters in a URI. An entry in the table is
// 0 for invalid characters. For valid characters the entry is itself (i.e.
diff --git a/src/uri/path.rs b/src/uri/path.rs
index be2cb65..9abec4c 100644
--- a/src/uri/path.rs
+++ b/src/uri/path.rs
## -11,10 +11,10 ## use crate::byte_str::ByteStr;
#[derive(Clone)]
pub struct PathAndQuery {
pub(super) data: ByteStr,
- pub(super) query: u16,
+ pub(super) query: u32,
}
-const NONE: u16 = ::std::u16::MAX;
+const NONE: u32 = ::std::u32::MAX;
impl PathAndQuery {
// Not public while `bytes` is unstable.
## -32,7 +32,7 ## impl PathAndQuery {
match b {
b'?' => {
debug_assert_eq!(query, NONE);
- query = i as u16;
+ query = i as u32;
break;
}
b'#' => {
You could try to get this merged, however the issue covering this problem sounds like a pull request might not be accepted. Depending on your use case, you could fork the repository, commit the fix and then use the Cargo features for overriding dependencies to make Cargo use your patched version instead of the version in the repositories. The following addition to your Cargo.toml might get you started:
[patch.crates-io]
http = { git = 'https://github.com/your/repository' }
Note however that this only overrides the current version of the Uri crate – as soon as a new version of the original crate is published, it will probably be chosen by Cargo until you update your fork.

Related

Why is Rust's std::thread::sleep allowing my HTTP response to return the correct body?

I am working on the beginning of the final chapter of The Rust Programming Language, which is teaching how to write an HTTP response with Rust.
For some reason, the HTML file being sent does not display in the browser unless I have Rust wait before calling TcpResponse::flush().
Here is the code:
use std::io::prelude::*;
use std::net::TcpListener;
use std::net::TcpStream;
use std::fs;
use std::thread::sleep;
use std::time::Duration;
fn main() {
let listener = TcpListener::bind("127.0.0.1:7878").unwrap();
for stream in listener.incoming() {
let stream = stream.unwrap();
handle_connection(stream);
}
}
fn handle_connection(mut stream: TcpStream) {
let mut buffer = [0; 1024];
stream.read(&mut buffer).unwrap();
let contents = fs::read_to_string("hello.html").unwrap();
let response = format!(
"HTTP/1.1 200 OK\r\nContent-Length: {}\r\n{}",
contents.len(),
contents
);
stream.write(response.as_bytes()).unwrap();
// let i = stream.write(response.as_bytes()).unwrap();
// println!("{} bytes written to the stream", i);
// ^^ using this code instead will sometimes make it display properly
sleep(Duration::from_secs(1));
// ^^ uncommenting this will cause a blank page to load.
stream.flush().unwrap();
}
I observe the same behavior in multiple browsers.
According to the Rust book, calling TcpListener::flush should ensure that the bytes finish writing to the stream. So why would I be unable to view the HTML file in the browser unless I sleep the thread before flushing?
I have done hard reloading and restarted the server with cargo run multiple times and the behavior is the same. I have also printed out the file contents to the terminal, and the contents are being read fine under either condition (of course they are).
I wonder if this is a problem with my operating system. I'm on Windows 10.
It isn't really holding the project up as I can continue learning (and I'm not planning on putting an actual web project into production right now), but I would appreciate any insight anyone has on this issue. There must be something about Rust's handling of the stream or the environment that I am not understanding.
Thanks for your time!

Reading JS library from CDN within Mirth

I'm doing some testing around Mirth-Connect. I have a test channel that the datatypes are Raw for the source and one destination. The destination is not doing anything right now. In the source, the connector type is JavaScript Reader, and the code is doing the following...
var url = new java.net.URL('https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.15/lodash.fp.min.js');
var conn = url.openConnection();
conn.setRequestMethod('GET');
if(conn.getResponseCode() === 200) {
var body = org.apache.commons.io.IOUtils.toString(conn.getInputStream(), 'UTF-8');
logger.debug('CONTENT: ' + body);
globalMap.put('_', body);
}
conn.disconnect();
// This code is in source but also tested in destination
logger.debug('FROM GLOBAL: ' + $('_')); // library was found
var arr = [1, 2, 3, 4];
var _ = $('_');
var newArr = _.chunk(arr, 2);
The error I'm getting is: TypeError: Cannot find function chunk in object.
The reason I want to do this is to build custom/internal libraries with unit test and serve them with an internal/company CDN and allow Mirth to consume them.
How can I make the library available to Mirth?
Rhino actually has commonjs support, but mirth doesn't have it enabled by default. Here's how you can use it in your channel.
channel deploy script
with (JavaImporter(
org.mozilla.javascript.Context,
org.mozilla.javascript.commonjs.module.Require,
org.mozilla.javascript.commonjs.module.provider.SoftCachingModuleScriptProvider,
org.mozilla.javascript.commonjs.module.provider.UrlModuleSourceProvider,
java.net.URI
)) {
var require = new Require(
Context.getCurrentContext(),
this,
new SoftCachingModuleScriptProvider(new UrlModuleSourceProvider([
// Search path. You can add multiple URIs to this array
new URI('https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.15/')
],null)),
null,
null,
true
);
} // end JavaImporter
var _ = require('lodash.min');
require('lodash.fp.min')(_); // convert lodash to fp
$gc('_', _);
Note: There's something funky with the cdnjs lodash fp packages that don't detect the environment correctly and force that weird two stage import. If you use https://cdn.jsdelivr.net/npm/lodash#4.17.15/ instead you only need to do var _ = require('fp'); and it loads everything in one step.
transformer
var _ = $gc('_');
logger.info(JSON.stringify(_.chunk(2)([1,2,3,4])));
Note: This is the correct way to use fp/chunk. In your OP you were calling with the standard chunk syntax.
Additional Commentary
I think it's probably ok to do it this way where you download the library once at deploy time and store it in the globalChannelMap, then retrieve it from the map where needed. It would probably also work to store the require object itself in the map if you wanted to call it elsewhere. It will cache and reuse the object created for future calls to the same resource.
I would not create new Require objects anywhere but the deploy script, or you will be redownloading the resource on every message (or every poll in the case of a Javascript Reader.)
Edit: I guess for an internal webhost, this could be desirable in a Javascript Reader if you intend for it to pick up changes immediately on the next poll without a redeploy, assuming you would be upgrading the library in place instead of incrementing a version
The benefit to using Code Templates, as Vibin suggested is that they get compiled directly into your channel at deploy time and there is no additional fetching step at runtime. Making the library available is as simple as assigning it to your channel.
Even though importing third party libraries could be an option, I was actually looking into this for our team to write our own custom functions, write unit-test for them, and lastly be able to pull that code inside Mirth. I was experimenting with lodash but it was not my end goal to use it, it is. My solution was to do a REST GET call with java in the global script. Your URL would be the GitHub raw URL of the code you want to pull in. Same code of my original question but like I said, the URL is the raw GitHub URL for the function I want to pull in.

Evernote IOS SDK fetchResourceByHashWith throws exception

Working with Evernote IOS SDK 3.0
I would like to retrieve a specific resource from note using
fetchResourceByHashWith
This is how I am using it. Just for this example, to be 100% sure about the hash being correct I first download the note with a single resource using fetchNote and then request this resource using its unique hash using fetchResourceByHashWith (hash looks correct when I print it)
ENSession.shared.primaryNoteStore()?.fetchNote(withGuid: guid, includingContent: true, resourceOptions: ENResourceFetchOption.includeData, completion: { note, error in
if error != nil {
print(error)
seal.reject(error!)
} else {
let hash = note?.resources[0].data.bodyHash
ENSession.shared.primaryNoteStore()?.fetchResourceByHashWith(guid: guid, contentHash: hash, options: ENResourceFetchOption.includeData, completion: { res, error in
if error != nil {
print(error)
seal.reject(error!)
} else {
print("works")
seal.fulfill(res!)
}})
}
})
Call to fetchResourceByHashWith fails with
Optional(Error Domain=ENErrorDomain Code=0 "Unknown error" UserInfo={EDAMErrorCode=0, NSLocalizedDescription=Unknown error})
The equivalent setup works on Android SDK.
Everything else works so far in IOS SDK (chunkSync, auth, getting notebooks etc.. so this is not an issue with auth tokens)
would be great to know if this is an sdk bug or I am still doing something wrong.
Thanks
This is a bug in the SDK's "EDAM" Thrift client stub code. First the analysis and then your workarounds.
Evernote's underlying API transport uses a Thrift protocol with a documented schema. The SDK framework includes a layer of autogenerated stub code that is supposed to marshal input and output params correctly for each request and response. You are invoking the underlying getResourceByHash API method on the note store, which is defined per the docs to accept a string type for the contentHash argument. But it turns out the client is sending the hash value as a purely binary field. The service is failing to parse the request, so you're seeing a generic error on the client. This could reflect evolution in the API definition, but more likely this has always been broken in the iOS SDK (getResourceByHash probably doesn't see a lot of usage). If you dig into the more recent Python version of the SDK, or indeed also the Java/Android version, you can see a different pattern for this method: it says it's going to write a string-type field, and then actually emits a binary one. Weirdly, this works. And if you hack up the iOS SDK to do the same thing, it will work, too.
Workarounds:
Best advice is to report the bug and just avoid this method on the note store. You can get resource data in different ways: First of all, you actually got all the data you needed in the response to your fetchNote call, i.e. let resourceData = note?.resources[0].data.body and you're good! You can also pull individual resources by their own guid (not their hash), using fetchResource (use note?.resources[0].guid as the param). Of course, you may really want to use the access-by-hash pattern. In that case...
You can hack in the correct protocol behavior. In the SDK files, which you'll need to build as part of your project, find the ObjC file called ENTProtocol.m. Find the method +sendMessage:toProtocol:withArguments.
It has one line like this:
[outProtocol writeFieldBeginWithName:field.name type:field.type fieldID:field.index];
Replace that line with:
[outProtocol writeFieldBeginWithName:field.name type:(field.type == TType_BINARY ? TType_STRING : field.type) fieldID:field.index];
Rebuild the project and you should find that your code snippet works as expected. This is a massive hack however and although I don't think any other note store methods will be impacted adversely by it, it's possible that other internal user store or other calls will suddenly start acting funny. Also you'd have to maintain the hack through updates. Probably better to report the bug and don't use the method until Evernote publishes a proper fix.

How to send erlang functions source to riak mapreduce via HTTP?

I'm trying to use Riak's mapreduce via http. his is what i'm sending:
{
"inputs":{
"bucket":"test",
"key_filters":[["matches", ".*"]]
},
"query":[
{
"map":{
"language":"erlang",
"source":"value(RiakObject, _KeyData, _Arg) -> Key = riak_object:key(RiakObject), Count = riak_kv_crdt:value(RiakObject, <<\"riak_kv_pncounter\">>), [ {Key, Count} ]."
}
}
]}
Riak fails with "[worker_startup_failed]", which isn't very informative. Could anyone please help me get this to actually execute the function?
WARNING
Allowing arbitrary Erlang functions via map-reduce is a security risk. Any valid Erlang can be executed, including sending your entire data set offsite or formatting the hard drive.
You have been warned.
However, if you implicitly trust any client that may connect to your cluster, you can allow Erlang source to be passed in a map-reduce request by setting {allow_strfun, true} in the riak_kv section of app.config, (or in the advanced.config if you are using riak.conf).
Once you have allowed passing an Erlang function in a map-reduce phase, you need to pass in a function of the form fun(RiakObject,KeyData,Arg) -> [result] end. Note that this must be an anonymous fun, so fun is a keyword, not a name, and it must end with end.
Your function should handle the case where {error,notfound} is passed as the first argument instead of an object. Simply adding a catch-all clause to the function could accomplish that.
Perhaps something like:
{
"inputs":{
"bucket":"test",
"key_filters":[["matches", ".*"]]
},
"query":[
{
"map":{
"language":"erlang",
"source":"fun(RiakObject, _KeyData, _Arg) ->
Key = riak_object:key(RiakObject),
Count = riak_kv_crdt:value(
RiakObject,
<<\"riak_kv_pncounter\">>),
[ {Key, Count} ];
(_,_,_) -> [{error,0}]
end."
}
}
]}
Allowing the source to be passed in the request is very useful while developing and debugging. For production, you really should put the functions in a dedicated pre-compiled module that you copy to the code path of each node so that the phase spec can specify the module and function by name instead of providing arbitrary code.
{"map":{
"language":"erlang",
"module":"yourprecompiledmodule",
"function":"functionname"}}
You need to enable allow_strfun on all nodes in your cluster. To do so in Riak 2, you will need to use the advanced.config file to add this to the riak_kv configuration:
[
{riak_kv, [
{allow_strfun, true}
]}
].
The other option is to create your own Erlang module by using the compiler shipped with Riak and placing the *.beam file in a well-known location for Riak to find. The basho-patches directory is one such place.
Please see the documentation as well:
advanced.config
Installing custom Erlang code
HTTP MapReduce
Using MapReduce
Advanced MapReduce
MapReduce / curl example

"Throttled" async download in F#

I'm trying to download the 3000+ photos referenced from the xml backup of my blog. The problem I came across is that if just one of those photos is no longer available, the whole async gets blocked because AsyncGetResponse doesn't do timeouts.
ildjarn helped me to put together a version of AsyncGetResponse which does fail on timeout, but using that gives a lot more timeouts - as though requests that are just queued timeout. It seems like all the WebRequests are launched 'immediately', the only way to make it work is to set the timeout to the time required to download all of them combined: which isn't great because it means I have adjust the timeout depending on the number of images.
Have I reached the limits of vanilla async? Should I be looking at reactive extensions instead?
This is a bit embarassing, because I've already asked two questions here on this particular bit of code, and I still haven't got it working the way I want!
I think there must be a better way to find out that a file is not available than using a timeout. I'm not exactly sure, but is there some way to make it throw an exception if a file cannot be found? Then you could just wrap your async code inside try .. with and you should avoid most of the problems.
Anyway, if you want to write your own "concurrency manager" that runs certain number of requests in parallel and queues remaining pending requests, then the easiest option in F# is to use agents (the MailboxProcessor type). The following object encapsulates the behavior:
type ThrottlingAgentMessage =
| Completed
| Work of Async<unit>
/// Represents an agent that runs operations in concurrently. When the number
/// of concurrent operations exceeds 'limit', they are queued and processed later
type ThrottlingAgent(limit) =
let agent = MailboxProcessor.Start(fun agent ->
/// Represents a state when the agent is blocked
let rec waiting () =
// Use 'Scan' to wait for completion of some work
agent.Scan(function
| Completed -> Some(working (limit - 1))
| _ -> None)
/// Represents a state when the agent is working
and working count = async {
while true do
// Receive any message
let! msg = agent.Receive()
match msg with
| Completed ->
// Decrement the counter of work items
return! working (count - 1)
| Work work ->
// Start the work item & continue in blocked/working state
async { try do! work
finally agent.Post(Completed) }
|> Async.Start
if count < limit then return! working (count + 1)
else return! waiting () }
working 0)
/// Queue the specified asynchronous workflow for processing
member x.DoWork(work) = agent.Post(Work work)
Nothing is ever easy. :)
I think the issues you're hitting are intrinsic to the problem domain (as opposed to merely being issues with the async programming model, though they do interact somewhat).
Say you want to download 3000 pictures. First, in your .NET process, there is something like System.Net.ConnectionLimit or something I forget the name of, that will e.g. throttle the number of simultaneous HTTP connections your .NET process can run simultaneously (and the default is just '2' I think). So you could find that control and set it to a higher number, and it would help.
But then next, your machine and internet connection have finite bandwidth. So even if you could try to concurrently start 3000 HTTP connections, each individual connection would get slower based on the bandwidth pipe limitations. So this would also interact with timeouts. (And this doesn't even consider what kinds of throttles/limits are on the server. Maybe if you send 3000 requests it will think you are DoS attacking and blacklist your IP.)
So this is really a problem domain where a good solution requires some intelligent throttling and flow-control in order to manage how the underlying system resources are used.
As in the other answer, F# agents (MailboxProcessors) are a good programming model for authoring such throttling/flow-control logic.
(Even with all that, if most picture files are like 1MB but then there is a 1GB file mixed in there, that single file might trip a timeout.)
Anyway, this is not so much an answer to the question, as just pointing out how much intrinsic complexity there is in the problem domain itself. (Perhaps it's also suggestive of why UI 'download managers' are so popular.)
Here's a variation on Tomas's answer, because I needed an agent which could return results.
type ThrottleMessage<'a> =
| AddJob of (Async<'a>*AsyncReplyChannel<'a>)
| DoneJob of ('a*AsyncReplyChannel<'a>)
| Stop
/// This agent accumulates 'jobs' but limits the number which run concurrently.
type ThrottleAgent<'a>(limit) =
let agent = MailboxProcessor<ThrottleMessage<'a>>.Start(fun inbox ->
let rec loop(jobs, count) = async {
let! msg = inbox.Receive() //get next message
match msg with
| AddJob(job) ->
if count < limit then //if not at limit, we work, else loop
return! work(job::jobs, count)
else
return! loop(job::jobs, count)
| DoneJob(result, reply) ->
reply.Reply(result) //send back result to caller
return! work(jobs, count - 1) //no need to check limit here
| Stop -> return () }
and work(jobs, count) = async {
match jobs with
| [] -> return! loop(jobs, count) //if no jobs left, wait for more
| (job, reply)::jobs -> //run job, post Done when finished
async { let! result = job
inbox.Post(DoneJob(result, reply)) }
|> Async.Start
return! loop(jobs, count + 1) //job started, go back to waiting
}
loop([], 0)
)
member m.AddJob(job) = agent.PostAndAsyncReply(fun rep-> AddJob(job, rep))
member m.Stop() = agent.Post(Stop)
In my particular case, I just need to use it as a 'one shot' 'map', so I added a static function:
static member RunJobs limit jobs =
let agent = ThrottleAgent<'a>(limit)
let res = jobs |> Seq.map (fun job -> agent.AddJob(job))
|> Async.Parallel
|> Async.RunSynchronously
agent.Stop()
res
It seems to work ok...
Here's an out of the box solution:
FSharpx.Control offers an Async.ParallelWithThrottle function. I'm not sure if it is the best implementation as it uses SemaphoreSlim. But the ease of use is great and since my application doesn't need top performance it works well enough for me. Although since it is a library if someone knows how to make it better it is always a nice thing to make libraries top performers out of the box so the rest of us can just use the code that works and just get our work done!

Resources