"Throttled" async download in F# - asynchronous

I'm trying to download the 3000+ photos referenced from the xml backup of my blog. The problem I came across is that if just one of those photos is no longer available, the whole async gets blocked because AsyncGetResponse doesn't do timeouts.
ildjarn helped me put together a version of AsyncGetResponse that does fail on timeout, but using it produces many more timeouts, as though requests that are merely queued time out. It seems like all the WebRequests are launched 'immediately'; the only way to make it work is to set the timeout to the time required to download all of them combined, which isn't great because it means I have to adjust the timeout depending on the number of images.
Have I reached the limits of vanilla async? Should I be looking at reactive extensions instead?
This is a bit embarrassing, because I've already asked two questions here on this particular bit of code, and I still haven't got it working the way I want!

I think there must be a better way to find out that a file is not available than using a timeout. I'm not exactly sure, but is there some way to make it throw an exception if a file cannot be found? Then you could just wrap your async code inside try .. with and you should avoid most of the problems.
Anyway, if you want to write your own "concurrency manager" that runs a certain number of requests in parallel and queues the remaining pending requests, then the easiest option in F# is to use agents (the MailboxProcessor type). The following object encapsulates the behavior:
type ThrottlingAgentMessage =
  | Completed
  | Work of Async<unit>

/// Represents an agent that runs operations concurrently. When the number
/// of concurrent operations exceeds 'limit', they are queued and processed later
type ThrottlingAgent(limit) =
  let agent = MailboxProcessor.Start(fun agent ->
    /// Represents a state when the agent is blocked
    let rec waiting () =
      // Use 'Scan' to wait for completion of some work
      agent.Scan(function
        | Completed -> Some(working (limit - 1))
        | _ -> None)

    /// Represents a state when the agent is working
    and working count = async {
      // Receive any message
      let! msg = agent.Receive()
      match msg with
      | Completed ->
          // Decrement the counter of work items
          return! working (count - 1)
      | Work work ->
          // Start the work item & continue in blocked/working state
          async { try do! work
                  finally agent.Post(Completed) }
          |> Async.Start
          // If this fills the last free slot, switch to the waiting state
          if count < limit - 1 then return! working (count + 1)
          else return! waiting () }

    working 0)

  /// Queue the specified asynchronous workflow for processing
  member x.DoWork(work) = agent.Post(Work work)
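For comparison, the same bounded-worker idea can be sketched outside F# as well. Here is a minimal Python/asyncio version of the pattern (the job callables are hypothetical stand-ins for the actual downloads; the queue plays the role of the agent's mailbox):

```python
import asyncio

async def worker(queue: asyncio.Queue) -> None:
    # Each worker repeatedly pulls a queued job and runs it.
    while True:
        job = await queue.get()
        try:
            await job()
        finally:
            queue.task_done()

async def run_throttled(jobs, limit: int) -> None:
    # At most 'limit' jobs run concurrently; the rest wait in the queue.
    queue: asyncio.Queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)
    workers = [asyncio.create_task(worker(queue)) for _ in range(limit)]
    await queue.join()  # block until every queued job has finished
    for w in workers:   # the workers loop forever, so shut them down
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
```

Each queued job runs to completion even if others fail, because `task_done` is called in a `finally` block, mirroring the agent's `try ... finally agent.Post(Completed)`.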

Nothing is ever easy. :)
I think the issues you're hitting are intrinsic to the problem domain (as opposed to merely being issues with the async programming model, though they do interact somewhat).
Say you want to download 3000 pictures. First, in your .NET process there is a setting, System.Net.ServicePointManager.DefaultConnectionLimit, that throttles the number of simultaneous HTTP connections your .NET process can run to a given host (and the default is just 2). So you could set it to a higher number, and it would help.
But then next, your machine and internet connection have finite bandwidth. So even if you could try to concurrently start 3000 HTTP connections, each individual connection would get slower based on the bandwidth pipe limitations. So this would also interact with timeouts. (And this doesn't even consider what kinds of throttles/limits are on the server. Maybe if you send 3000 requests it will think you are DoS attacking and blacklist your IP.)
So this is really a problem domain where a good solution requires some intelligent throttling and flow-control in order to manage how the underlying system resources are used.
As in the other answer, F# agents (MailboxProcessors) are a good programming model for authoring such throttling/flow-control logic.
(Even with all that, if most picture files are like 1MB but then there is a 1GB file mixed in there, that single file might trip a timeout.)
Anyway, this is not so much an answer to the question, as just pointing out how much intrinsic complexity there is in the problem domain itself. (Perhaps it's also suggestive of why UI 'download managers' are so popular.)

Here's a variation on Tomas's answer, because I needed an agent which could return results.
type ThrottleMessage<'a> =
  | AddJob of (Async<'a> * AsyncReplyChannel<'a>)
  | DoneJob of ('a * AsyncReplyChannel<'a>)
  | Stop

/// This agent accumulates 'jobs' but limits the number which run concurrently.
type ThrottleAgent<'a>(limit) =
  let agent = MailboxProcessor<ThrottleMessage<'a>>.Start(fun inbox ->
    let rec loop(jobs, count) = async {
      let! msg = inbox.Receive() //get next message
      match msg with
      | AddJob(job) ->
          if count < limit then //if not at limit, we work, else loop
            return! work(job::jobs, count)
          else
            return! loop(job::jobs, count)
      | DoneJob(result, reply) ->
          reply.Reply(result)           //send back result to caller
          return! work(jobs, count - 1) //no need to check limit here
      | Stop -> return () }
    and work(jobs, count) = async {
      match jobs with
      | [] -> return! loop(jobs, count) //if no jobs left, wait for more
      | (job, reply)::jobs ->           //run job, post Done when finished
          async { let! result = job
                  inbox.Post(DoneJob(result, reply)) }
          |> Async.Start
          return! loop(jobs, count + 1) } //job started, go back to waiting
    loop([], 0))

  member m.AddJob(job) = agent.PostAndAsyncReply(fun rep -> AddJob(job, rep))
  member m.Stop() = agent.Post(Stop)
In my particular case, I just need to use it as a 'one shot' 'map', so I added a static function:
static member RunJobs limit jobs =
  let agent = ThrottleAgent<'a>(limit)
  let res =
    jobs
    |> Seq.map (fun job -> agent.AddJob(job))
    |> Async.Parallel
    |> Async.RunSynchronously
  agent.Stop()
  res
It seems to work ok...

Here's an out of the box solution:
FSharpx.Control offers an Async.ParallelWithThrottle function. I'm not sure whether it is the best implementation, as it uses SemaphoreSlim, but it is very easy to use, and since my application doesn't need top performance it works well enough for me. And since it is a library: if someone knows how to make it better, it's always nice when libraries are top performers out of the box, so the rest of us can just use the code that works and get our work done!
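The semaphore trick is the same in any async runtime. Here is a minimal sketch of the idea in Python's asyncio (the function name mirrors Async.ParallelWithThrottle for readability; it is not taken from any library):

```python
import asyncio

async def parallel_with_throttle(limit: int, tasks):
    # A semaphore with 'limit' slots caps concurrency: each task must
    # acquire a slot before running and releases it when done.
    sem = asyncio.Semaphore(limit)

    async def throttled(task):
        async with sem:
            return await task()

    # gather preserves the order of results, like Async.Parallel
    return await asyncio.gather(*(throttled(t) for t in tasks))
```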

Related

How to get discord bot to handle separate processes/ link to another bot

I am trying to create something of an application bot. I need the bot to be triggered in a generic channel and then continue the application process in a private DM channel with the applicant.
My issue is this: the bot can have only one on_message function defined. I find it extremely complicated (and inefficient) to check every time whether the on_message was triggered by a message from a DM channel or from the generic channel. It also makes it difficult to keep track of an applicant's answers. I want to check whether the following is possible: have the bot respond to messages from the generic channel as usual, and if it receives an application prompt, start a new subprocess (or bot?) that handles the DMs with the applicant separately.
Is the above possible? If not, is there a better way to handle this?
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.channel.type == discord.ChannelType.private:
        await dm_channel.send("Whats your age?")  ## Question 2
    elif message.channel.type == discord.ChannelType.text:
        if message.content.startswith('$h'):
            member = message.author
            if "apply" in message.content:
                await startApply(member)
            else:
                await message.channel.send('Hello!')
                # await message.reply('Hello!', mention_author=True)

async def startApply(member):
    dm_channel = await member.create_dm()
    await dm_channel.send("Whats your name?")  ## Question 1
I have the above code as of now. I want the startApply function to trigger a new bot/subprocess to handle the DMs with an applicant.
Option 1
Comparatively speaking, a single if check like that is not too much overhead, but there are a few different solutions. First, you could try your hand at slash commands, via a library built as an extension of the discord.py library. You could make a slash command that only works in DMs, and then run the application from there with successive slash commands.
Option 2
Use a webhook to start up a new bot. This is most likely more complicated, as you'll have to get a domain or find some sort of free service to catch webhooks. You could use a webhook like this, though, to 'wake up' a bot and have it chat with the user in DMs.
Option 3 (Recommended)
Create functions that handle the text depending on the channel, and keep that if - elif in there. As I said, one if isn't that bad. If you have functions that are called in your code that handle everything, it actually should be fairly easy to deal with:
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.channel.type == discord.ChannelType.private:
        await respondToPrivate(message)
    elif message.channel.type == discord.ChannelType.text:
        await respondToText(message)
In terms of keeping track of the data, if this is a smaller personal project, MySQL is great and easy to learn. You can have each function store whatever data is needed in the database, so the answers are safe in case of a bot crash and are also kept out of memory.
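Before reaching for a database, a plain dict keyed by user id is enough to keep track of each applicant's answers. A sketch of what respondToPrivate could delegate to (the question list and state shape are made up for illustration):

```python
# Hypothetical per-applicant state, keyed by Discord user id.
QUESTIONS = ["Whats your name?", "Whats your age?"]
applications = {}

def start_application(user_id):
    # Begin an application: remember the user and return the first question.
    applications[user_id] = {"answers": [], "question": 0}
    return QUESTIONS[0]

def record_answer(user_id, answer):
    # Store the answer and return the next question, or None when finished
    # (or when no application is in progress for this user).
    state = applications.get(user_id)
    if state is None:
        return None
    state["answers"].append(answer)
    state["question"] += 1
    if state["question"] < len(QUESTIONS):
        return QUESTIONS[state["question"]]
    return None
```

In the DM branch of on_message you would look up message.author.id, call record_answer with message.content, and send the returned question if there is one.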

Can I set thread priority in dotnet? (specifically interacting with Suave webserver, but it's a general question)

I have a tool that does a lot of calculations in a loop. It creates async tasks and then runs them in parallel. About 20% of the time, the CPU is maxed out.
At the same time, I have a REST api, implemented with Suave, used to query the data.
The issue is that when the CPU is busy, Suave will just not reply at all.
Right now, I have about 10 seconds of every minute where the rest calls will not be processed, while the calculations are done, and afterwards requests are processed normally.
So I am trying to investigate if priorities in thread may be the solution for that.
I'm starting Suave like this:
let listening, server = startWebServerAsync configuration webApplication
server |> Async.Start
listening |> Async.RunSynchronously
but I was wondering if there is a way to set the priority of the server so that its code is executed if there is a request.
Alternatively, I start all the calculations like this:
snapshots
|> List.map (fun snapshot ->
    async {
        return dosomestuff...
    })
|> Async.Parallel
|> Async.RunSynchronously
is there a way to lower the priority of this execution to give a chance for the web server to reply?
or, should I insert some Thread.Sleep(1) in the computation to give a chance to the context switch?
What I have tried:
I've tried to sprinkle the calculations with Thread.Sleep(0) and also Thread.Sleep(1) to see if it helps to do a context switch when there is a Suave request. No effect.
I've started the calculations in their own thread and set a low priority, like this:
let thread = Thread(ThreadStart(processLoop))
thread.Priority <- ThreadPriority.BelowNormal
thread.Start()
but this didn't change anything either.
More detail about Suave:
this is an example of an endpoint from Suave.
// get the correlation matrix
let private getCorrelationMatrix () =
    match ReportStore.getReport() with
    | Ok report ->
        {|
            StartTime = report.StartTime
            EndTime = report.EndTime
            Interval = report.Interval
            PublicationTime = report.PublicationTime
            CorrelationMatrix = report.CorrelationMatrix
        |}
        |> Json.serialize |> Successful.OK >=> setMimeType "application/json"
    | Result.Error e ->
        ServerErrors.INTERNAL_ERROR e
with ReportStore.getReport() just reading the latest data, or error, from a mutable variable.
The Suave endpoints are very lightweight; they just grab the last data, or the last error, from an array and return it.
It really looks like when all cores are busy with the parallel execution, no other thread can preempt it. 10 seconds is very long when you are waiting for a reply!

Why is Rust's std::thread::sleep allowing my HTTP response to return the correct body?

I am working on the beginning of the final chapter of The Rust Programming Language, which is teaching how to write an HTTP response with Rust.
For some reason, the HTML file being sent does not display in the browser unless I have Rust wait before calling TcpStream::flush().
Here is the code:
use std::io::prelude::*;
use std::net::TcpListener;
use std::net::TcpStream;
use std::fs;
use std::thread::sleep;
use std::time::Duration;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:7878").unwrap();

    for stream in listener.incoming() {
        let stream = stream.unwrap();
        handle_connection(stream);
    }
}

fn handle_connection(mut stream: TcpStream) {
    let mut buffer = [0; 1024];
    stream.read(&mut buffer).unwrap();

    let contents = fs::read_to_string("hello.html").unwrap();
    let response = format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n{}",
        contents.len(),
        contents
    );

    stream.write(response.as_bytes()).unwrap();
    // let i = stream.write(response.as_bytes()).unwrap();
    // println!("{} bytes written to the stream", i);
    // ^^ using this code instead will sometimes make it display properly
    sleep(Duration::from_secs(1));
    // ^^ uncommenting this will cause a blank page to load.
    stream.flush().unwrap();
}
I observe the same behavior in multiple browsers.
According to the Rust book, calling TcpStream::flush should ensure that the bytes finish writing to the stream. So why would I be unable to view the HTML file in the browser unless I sleep the thread before flushing?
I have done hard reloading and restarted the server with cargo run multiple times and the behavior is the same. I have also printed out the file contents to the terminal, and the contents are being read fine under either condition (of course they are).
I wonder if this is a problem with my operating system. I'm on Windows 10.
It isn't really holding the project up as I can continue learning (and I'm not planning on putting an actual web project into production right now), but I would appreciate any insight anyone has on this issue. There must be something about Rust's handling of the stream or the environment that I am not understanding.
Thanks for your time!

How can you throttle calls server side?

I know that client side underscore.js can be used to throttle click rates, but how do you throttle calls server side? I thought of using the same pattern, but unfortunately _.throttle doesn't seem to allow for differentiating between Meteor.userId()s.
Meteor.methods({
  doSomething: function(arg1, arg2){
    // how can you throttle this without affecting ALL users
  }
});
Here's a package I've roughed up - but not yet submitted to Atmosphere (waiting until I familiarize myself with tinytest and write up unit tests for it).
https://github.com/zeroasterisk/Meteor-Throttle
Feel free to play with it, extend, fix and contribute (pull requests encouraged)
The concept is quite simple, and it only runs (should only be run) on the server.
You would first need to come up with a unique key for what you want to throttle...
eg: Meteor.userId() + 'my-function-name' + 'whatever'
This system uses a new Collection 'throttle' and some helper methods to check, set, and purge records. There is also a helper checkThenSet method, which is actually the most common pattern: check if we can do something, and then set a record that we did.
Usage
(Use Case) If your app is sending emails, you wouldn't want to send the same email over and over again, even if a user triggered it.
// on server
if (!Throttle.checkThenSet(key, allowedCount, expireInSec)) {
throw new Meteor.Error(500, 'You may only send ' + allowedCount + ' emails at a time, wait a while and try again');
}
....
On Throttle Methods
checkThenSet(key, allowedCount, expireInSec) checks a key, if passes it then sets the key for future checks
check(key, allowedCount) checks a key, if less than allowedCount of the (unexpired) records exist, it passes
set(key, expireInSec) sets a record for key, and it will expire after expireInSec seconds, eg: 60 = 1 min in the future
purge() expires all records which are no longer within timeframe (automatically called on every check)
Methods (call-able)
throttle(key, allowedCount, expireInSec) --> Throttle.checkThenSet()
throttle-check(key, allowedCount) --> Throttle.check()
throttle-set(key, expireInSec) --> Throttle.set()
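The semantics of these helpers are easy to prototype. The following Python sketch mimics the described check/set/checkThenSet behavior in memory (illustrative only; the actual package stores records in a Meteor collection):

```python
import time

class Throttle:
    """In-memory sketch of the check / set / checkThenSet pattern."""

    def __init__(self):
        self.records = {}  # key -> list of expiry timestamps

    def purge(self, key):
        # Drop records whose expiry time has passed.
        now = time.time()
        self.records[key] = [t for t in self.records.get(key, []) if t > now]

    def check(self, key, allowed_count):
        # Passes if fewer than allowed_count unexpired records exist.
        self.purge(key)
        return len(self.records[key]) < allowed_count

    def set(self, key, expire_in_sec):
        # Record one use of the key, expiring expire_in_sec from now.
        self.records.setdefault(key, []).append(time.time() + expire_in_sec)

    def check_then_set(self, key, allowed_count, expire_in_sec):
        # The common pattern: check first, record the use only if allowed.
        if not self.check(key, allowed_count):
            return False
        self.set(key, expire_in_sec)
        return True
```

A key like Meteor.userId() + 'send-email' scopes the limit to one user, which is what _.throttle alone cannot do.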
There is no built-in support for this currently in Meteor, but it's on the roadmap: https://trello.com/c/SYcbkS3q/18-dos-hardening-rate-limiting
In theory you could use some of the options from Throttling method calls to M requests in N seconds, but you would have to roll your own solution.

Parallel HTTP web crawler in Erlang

I'm coding a simple web crawler and have generated a bunch of static files that I try to crawl with the code at the bottom. I have two issues/questions I don't have an idea for:
1.) Looping over the sequence 1..200 throws an error exactly after 100 pages have been crawled:
** exception error: no match of right hand side value {error,socket_closed_remotely}
in function erlang_test_01:fetch_page/1 (erlang_test_01.erl, line 11)
in call from lists:foreach/2 (lists.erl, line 1262)
2.) How can I parallelize the requests, e.g. 20 concurrent requests?
-module(erlang_test_01).
-export([start/0]).

-define(BASE_URL, "http://46.4.117.69/").

to_url(Id) ->
  ?BASE_URL ++ io_lib:format("~p", [Id]).

fetch_page(Id) ->
  Uri = to_url(Id),
  {ok, {{_, Status, _}, _, Data}} = httpc:request(get, {Uri, []}, [], [{body_format, binary}]),
  Status,
  Data.

start() ->
  inets:start(),
  lists:foreach(fun(I) -> fetch_page(I) end, lists:seq(1, 200)).
1. Error message
socket_closed_remotely indicates that the server closed the connection, maybe because you made too many requests in a short timespan.
2. Parallelization
Create 20 worker processes and one process holding the URL queue. Let each process ask the queue for a URL (by sending it a message). This way you can control the number of workers.
An even more "Erlangy" way is to spawn one process for each URL! The upside to this is that your code will be very straightforward. The downside is that you cannot control your bandwidth usage or number of connections to the same remote server in a simple way.
