Bulk downloading data with a limiter - julia

So, I am writing a function which basically bulk-downloads data and saves it in a db.
At first I had simply written it as:
function storedata(url_list)
    for url in url_list
        data = downloaddata(url)
        savedataindb(data)
    end
end
But this way the downloading was pretty slow (my guess is that the data server itself limits the speed).
So I asynchronized all the downloads so that I could place several download calls at once:
function storedata(url_list)
    @sync for url in url_list
        @async savedataindb(downloaddata(url))
    end
end
This works and downloads pretty quickly.
But my url_list is pretty big, so this just makes too many calls to the data server, and the data server blocks me.
So I thought I would instead create batches of a certain size and download each batch asynchronously, as follows:
function storedata(url_list)
    batches = divide(url_list)
    @sync for batch in batches
        @async for url in batch
            savedataindb(downloaddata(url))
        end
    end
end
But this also doesn't solve the problem: all the batches still start at once (one task per batch), so the same issue remains.
How do I implement this function so that I can place several download calls at once while also limiting (in some sense) how many run at the same time?

Your code is not fully reproducible, but if length(batches) is small then it should work OK, I think (i.e. when the number of entries in the batches collection is how you want to limit the number of asynchronous requests).
If, on the other hand, you make each entry of batches the same size, i.e. when length(batch) is small and tuned to the limit your server accepts, you could check:
function storedata(url_list)
    batches = divide(url_list)
    for batch in batches
        @sync for url in batch
            @async savedataindb(downloaddata(url))
        end
    end
end
I have not tested it as your code was not runnable. Does it help?
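If a hard cap on the number of in-flight downloads is what you want, regardless of batch boundaries, another common pattern is a fixed pool of worker tasks pulling URLs from a Channel. A minimal sketch, assuming downloaddata and savedataindb from the question (the nworkers default is a placeholder to tune to whatever the server tolerates):
function storedata(url_list; nworkers = 10)
    # Buffered channel holding every pending URL; closing it lets the
    # workers' for loops terminate once the queue is drained.
    jobs = Channel{eltype(url_list)}(length(url_list))
    foreach(url -> put!(jobs, url), url_list)
    close(jobs)
    # At most nworkers downloads are in flight at any moment.
    @sync for _ in 1:nworkers
        @async for url in jobs
            savedataindb(downloaddata(url))
        end
    end
end
Unlike fixed batches, the pool starts a new download as soon as any worker frees up, so the concurrency level stays at the cap instead of draining to zero at the end of each batch.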

Related

Run Vec of futures concurrently but with only n at a time

I have a Vec, specifically Vec<impl Future<Output = ()>>, which contains thousands of futures. These will all run network requests when run, so I don't want them all started at the same time, as join_all would do. These futures return nothing and they don't have to run in any specific order. The only restriction is that I want to set a limit on how many of them are running at the same time.
buffer_unordered seemed to be close to the kind of thing I'm looking for, but I wasn't able to find examples that quite matched what I'm doing.
Something like this, I think, but it isn't quite right:
stream::iter(subscription_futures.into_iter()).buffer_unordered(10);
= note: streams do nothing unless polled
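That note is the crux: buffer_unordered only caps how many futures are polled concurrently; the resulting stream still has to be driven to completion. A sketch of one way to drain it, assuming the futures crate and an async context (run_all is a hypothetical wrapper, not from the question):
use futures::stream::{self, StreamExt};

async fn run_all(subscription_futures: Vec<impl std::future::Future<Output = ()>>) {
    stream::iter(subscription_futures)
        // Poll at most 10 of the futures at a time, in no particular order.
        .buffer_unordered(10)
        // Drain the stream; without this, the futures never make progress.
        .for_each(|()| async {})
        .await;
}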

Why don't my doRedis workers begin processing until all of the jobs are in the redis server

When using foreach and doRedis, the doRedis workers wait until all jobs have reached the Redis server before beginning processing. Is it possible to have them begin before all the preprocessing has finished?
I am using an iterator, which is working great: preprocessing happens 'just in time' and the job data begins to hit the server as the iterator runs. I can't seem to take advantage of this behavior, though, because the workers just wait until all jobs have been uploaded.
Example code:
library(foreach)
library(doRedis)
registerDoRedis("worklist", "0.0.0.0")
foreach (var = complex.iter(1:1E6)) %dopar% {
    process.function(var)
}
In this example complex.iter takes a while and there are many elements to iterate over. As such it would be great if workers started running process.function() before all the preprocessing is finished. Unfortunately they seem to wait until complex.iter has run on all elements.
I have set .inorder=F.
Any suggestions as to how to achieve this desired behavior? Thanks.
You can try a couple of things to make it run more smoothly. One is setting the chunk size, and the other is starting local workers to get tasks going in the background.
The package's PDF documentation explains how these two functions are used properly: startLocalWorkers and setChunkSize.
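A minimal sketch of how those two calls are typically combined with the question's code (the worker count and chunk size are placeholder values to tune):
library(foreach)
library(doRedis)

registerDoRedis("worklist", "0.0.0.0")

# Background R workers on this machine begin consuming tasks from the
# queue as soon as tasks arrive.
startLocalWorkers(n = 2, queue = "worklist", host = "0.0.0.0")

# Hand tasks to workers in chunks of 50 rather than one at a time,
# cutting per-task Redis round trips.
setChunkSize(50)

foreach (var = complex.iter(1:1E6), .inorder = FALSE) %dopar% {
    process.function(var)
}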
Without more information on the data, functions and tasks it is hard to help you any more than that.
In case others have the same question:
The answer is currently no, the iterator completes aggregation of all task data prior to uploading and distributing jobs to workers. Relevant discussion here: https://github.com/bwlewis/doRedis/issues/39
I was also wrong in my question: the iterator was in fact completing before the data was uploaded. Still, the blocking upload causes the workers to wait not only until the iterator is finished but also until the upload has completed.
I'll update the answer if we implement any changes.

is this the result of a partial image transfer?

I have code that generates thumbnails from JPEGs. It pulls an image from S3 and then generates the thumbs.
One in about every 3000 files ends up looking like this. It happens in batches. The high res looks like this and they're all resized down to low res. It does not fail on resize. I can go to my S3 bucket and see that the original file is indeed intact.
I had this code written in Ruby and ported it all over to clojure hoping it would just fix my issue but it's still happening.
What would result in a JPEG that looks like this?
I'm using standard image copying code like so
(with-open [in (clojure.java.io/input-stream uri)
            out (clojure.java.io/output-stream file)]
  (clojure.java.io/copy in out))
Would there be any way to detect the transfer didn't go well in clojure? Imagemagick? Any other command line tool?
My guess is it is one of two possible issues (you know your code, so you can probably rule one out quickly):
1. You are running out of memory. If the whole batch of processing is happening at once, the first few are probably not being released until the whole process is completed.
2. You are running out of time. You may be reaching your maximum execution time for the script.
Implementing some logging as the batches are processed could tell you when the issue happens and what the overall state is at that moment.
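As for detecting a bad transfer: a complete JPEG ends with the End-Of-Image marker 0xFF 0xD9, so a truncated download can often be caught by checking the file's last two bytes after the copy. A sketch (jpeg-complete? is a hypothetical helper, not from the question's code):
(defn jpeg-complete?
  "Returns true if the file ends with the JPEG End-Of-Image marker 0xFF 0xD9."
  [^java.io.File file]
  (with-open [raf (java.io.RandomAccessFile. file "r")]
    (let [len (.length raf)]
      (and (>= len 2)
           (do (.seek raf (- len 2))
               (and (= 0xFF (.read raf))
                    (= 0xD9 (.read raf))))))))
Comparing the number of bytes copied against the S3 object's Content-Length header would likewise catch a connection that dropped mid-stream.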

How can I determine the amount of time a rvest query is for the http response

I have been using the rvest package to perform screen scraping for some data analytics, but there are some queries that are taking a few seconds each to actually collect the data. e.g.
sectorurl = paste("http://finance.yahoo.com/q/pr?s=", ticker, "+Profile", sep = "")
index <- read_html(sectorurl)
The second step is the one that is taking the time, so I was wondering if there are any diagnostics in the background of R, or a clever package, that could determine "network wait time" as opposed to CPU time, or something similar.
I would like to know if I'm stuck with the performance I have, or if actually my R code is performing well and it is http response that is limiting my process speed.
I don't think you will be able to separate the REST call from the client-side code. However, my experience with accessing web services is that the network time generally dominates the total running time, with the "CPU" time being an order of magnitude, or more, smaller.
One option for you to try would be to paste your URL, which appears to be a GET, into a web browser and see how long it takes to complete from the console. You can compare this time against the total time taken in R for the same call. For this, try using system.time, which reports both the CPU time and the elapsed (wall-clock) time used by a given expression.
require(stats)
system.time(read_html(sectorurl))
Check out the documentation for more information.
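To make that split explicit: the gap between the elapsed (wall-clock) time and the CPU time reported by system.time is a rough proxy for network wait. A sketch using the question's sectorurl:
library(rvest)

st <- system.time(index <- read_html(sectorurl))

# "elapsed" is wall-clock time; user.self + sys.self is CPU time spent in
# this R process. The difference is mostly time spent waiting on the network.
st[["elapsed"]] - (st[["user.self"]] + st[["sys.self"]])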

Why is Async version slower than single threaded version?

I am reading a large XML file using XmlReader and am exploring potential performance improvements via Async & pipelining. The following initial foray into the world of Async shows that the Async version (which for all intents and purposes is, at this point, equivalent to the synchronous version) is much slower. Why would this be? All I've done is wrap the "normal" code in an async block and call it with Async.RunSynchronously.
Code
open System
open System.IO.Compression // support assembly required + FileSystem
open System.Xml            // support assembly required

let readerNormal (reader: XmlReader) =
    let temp = ResizeArray<string>()
    while reader.Read() do
        ()
    temp

let readerAsync1 (reader: XmlReader) =
    async {
        let temp = ResizeArray<string>()
        while reader.Read() do
            ()
        return temp
    }

let readerAsync2 (reader: XmlReader) =
    async {
        while reader.Read() do
            ()
    }

[<EntryPoint>]
let main argv =
    let path = @"C:\Temp\LargeTest1000.xlsx"
    use zipArchive = ZipFile.OpenRead path
    let sheetZipEntry = zipArchive.GetEntry(@"xl/worksheets/sheet1.xml")

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerNormal reader
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerAsync1 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    readerAsync2 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    printfn "DONE"
    System.Console.ReadLine() |> ignore
    0 // return an integer exit code
INFO
I am aware that the above async code does not do any actual async work; what I am trying to ascertain here is the overhead of simply making it async.
I don't expect it to go faster just because I've wrapped it in an async. My question is the opposite: why the dramatic (IMHO) slowdown?
TIMINGS
A comment below correctly pointed out that I should provide timings for datasets of various sizes, which is implicitly what led me to ask this question in the first instance.
The following are some times based on small vs. large datasets. While the absolute values are not too meaningful, the relativities are interesting:
30 elements (small dataset)
Normal: 00:00:00.0006994
Async1: 00:00:00.0036529
Async2: 00:00:00.0014863
(A lot slower but presumably indicative of Async setup costs - this is as expected)
1.5 million elements
Normal: 00:00:01.5749734
Async1: 00:00:03.3942754
Async2: 00:00:03.3760785
(~2x slower. I'm surprised that the difference in timing is not amortized as the dataset gets bigger. If this is the case, then pipelining/parallelization can only improve performance here if you have more than two cores - to outweigh the overhead that I can't explain...)
There's no asynchronous work to do. In effect, all you get is the overheads and no benefits. async {} doesn't mean "everything in the braces suddenly becomes asynchronous". It simply means you have a simplified way of using asynchronous code - but you never call a single asynchronous function!
Additionally, "asynchronous" doesn't necessarily mean "parallel", and it doesn't necessarily involve multiple threads. For example, when you make an asynchronous request to read a file (which you're not doing here), it means that the OS is told what you want done, and how you should be notified when it is done. When you run code like this using RunSynchronously, you're simply blocking one thread while posting asynchronous file requests - a scenario pretty much identical to using synchronous file requests in the first place.
The moment you do RunSynchronously, you throw away any reason whatsoever to use asynchronous code in the first place. You're still using a single thread, you just blocked another thread at the same time - instead of saving on threads, you waste one, and add another to do the real work.
EDIT:
Okay, I've investigated with the minimal example, and I've got some observations.
The difference is absolutely brutal with a profiler on - the non-async version is somewhat slower (up to 2x), but the async version is just never ending. It seems as if a huge number of allocations is going on - and yet, when I break the profiler, I can see that the non-async version (running in 4 seconds) makes a hundred thousand allocations (~20 MiB), while the async version (running over 10 minutes) only makes mere thousands. Maybe the memory profiler interacts badly with F# async? The CPU time profiler doesn't have this problem.
The generated IL is very different for the two cases. Most importantly, even though our async code doesn't actually do anything asynchronous, it creates a ton of async builder helpers, sprinkles a ton of (asynchronous) Delay calls through the code, and going into outright absurd territory, each iteration of the loop is an extra method call, including the setup of a helper object.
Apparently, F# automatically translates while into an asynchronous while. Now, given how well compressed xlsx data usually is, very little I/O is involved in those Read operations, so the overhead absolutely dominates - and since every iteration of the "loop" has its own setup cost, the overhead scales with the amount of data.
While this is mostly caused by the while not actually doing anything, it also obviously means that you need to be careful about what you select as async, and you need to avoid using it in a case where CPU time dominates (as in this case - after all, both the async and non-async cases are almost 100% CPU tasks in practice). This is further worsened by the fact that Read reads a single node at a time - something that's relatively trivial even in a big, non-compressed xml file. The overheads absolutely dominate. In effect, this is analogous to using Parallel.For with a body like sum += i - the setup cost of each of the iterations absolutely dwarfs any actual work being done.
The CPU profiling makes this rather obvious - the two most work intensive methods are:
XmlReader.Read (expected)
Thread::intermediateThreadProc - also known as "this code runs on a thread pool thread". The overhead from this in no-op code like this is around 100% - yikes. Apparently, even though there is no real asynchronicity anywhere, the callbacks are never run synchronously. Every iteration of the loop posts work to a new thread pool thread.
The lesson learned? Probably something like "don't use loops in async if the loop body does very little work". The overhead is incurred for each and every iteration of the loop. Ouch.
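In that spirit, if you want to keep an async wrapper around this code, one mitigation is to hoist the hot loop into a plain function, so the builder cannot translate each iteration into async machinery. A sketch, untested against the original benchmark:
let readerAsyncCoarse (reader: XmlReader) =
    async {
        // The while loop lives in an ordinary closure, so it compiles to a
        // plain loop; the async builder wraps it as one coarse step instead
        // of one Delay/helper allocation per iteration.
        let readAll () =
            let temp = ResizeArray<string>()
            while reader.Read() do
                ()
            temp
        return readAll ()
    }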
Asynchronous code doesn't magically make your code faster. As you've discovered, it'll tend to make isolated code slower, because there's overhead involved with managing the asynchrony.
What it can do is to be more efficient, but that's not the same as being inherently faster. The main purpose of Async is to make Input/Output code more efficient.
If you invoke a 'slow', blocking I/O operation directly, you'll block the thread until the operation returns.
If you instead invoke that slow operation asynchronously, it may free up the thread to do other things. It does require that there's an underlying implementation that's not thread-bound, but uses another mechanism for receiving the response. I/O Completion Ports could be such a mechanism.
Now, if you run a lot of asynchronous code in parallel, it may turn out to be faster than attempting to run the blocking implementation in parallel, because the async versions use fewer resources (fewer threads = less memory).
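For contrast, a generic sketch (not tied to the XmlReader example) of the I/O-bound fan-out where async does pay off, with many requests waiting on the network concurrently rather than competing for CPU:
open System.Net.Http

let fetchAll (urls: string list) =
    use client = new HttpClient()
    urls
    |> List.map (fun url -> async {
        // The thread is released while each request is in flight.
        let! body = client.GetStringAsync url |> Async.AwaitTask
        return body.Length })
    |> Async.Parallel
    |> Async.RunSynchronously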
