I'm writing some Go software that is responsible for downloading and parsing a large number of JSON files and writing the parsed data to a SQLite database. My current design has 10 goroutines simultaneously downloading and parsing these JSON files and sending the results to another goroutine whose sole job is to listen on a specific channel and write the channel contents to the DB.
The system does some additional read operations after all writing should have been completed, which leads to an issue where queries return incorrect results because not all of the data has been written to the table. Because the JSON data I'm pulling is dynamic, I have no easy way to know when all the data has been written.
I've considered two possibilities for solving this, though I'm not super happy with either solution:
Listen on the channel and wait for it to be empty. This should work in principle, however, it does not ensure that the data has been written, all it ensures is it's been received on the channel.
Synchronize access to the DB. This again should work in principle, however, I would still need to order the query operation to be after all the write operations.
Are there any other design decisions I should consider to rectify this issue? For reference the libraries I'm using to pull this data are go-colly and the go-sqlite3. Appreciate all the help!
You can use a sync.WaitGroup
e.g.
package main

import "sync"

// JobInfo describes one unit of work; its fields depend on your application.
type JobInfo struct{}

func main() {
	// Some sort of job queue for your workers to process. This job queue should be closed by the
	// process that populates it with items. Once the job channel is closed, any for loops ranging
	// over the channel will read items until there are no more items, and then break.
	jobChan := make(chan JobInfo)

	// Number of concurrent workers.
	workerCount := 10

	// Initialize the WaitGroup.
	var wg sync.WaitGroup
	wg.Add(workerCount)

	// Create the worker goroutines before populating the queue: jobChan is unbuffered, so a send
	// blocks until a worker is ready to receive.
	for i := 0; i < workerCount; i++ {
		go func() {
			// When jobChan is closed and no more jobs are available on the queue, the for loop
			// exits, the anonymous function returns, and the deferred wg.Done() is called.
			defer wg.Done()
			for job := range jobChan {
				_ = job // Process job.
			}
		}()
	}

	// Populate the job queue here...
	// ...
	close(jobChan)

	// Wait for all workers to call wg.Done().
	wg.Wait()

	// Whatever you want to do after all queue items have been processed goes here.
	// ...
}
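The WaitGroup above tells you when the workers are done, but the question also needs to know when the single writer goroutine has finished writing. One possible way to extend the idea is sketched below: once wg.Wait() returns, no worker can send any more, so the results channel can be closed, and the writer signals on a second channel when its range loop ends. The names record, parse, and runPipeline are hypothetical, and the store slice only stands in for the real SQLite insert.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical stand-in for one parsed JSON document.
type record struct {
	ID int
}

// parse is a placeholder for the real download-and-parse step.
func parse(job int) record {
	return record{ID: job}
}

// runPipeline processes jobs with workerCount workers and a single writer.
// It returns only after every record has been written to the store.
func runPipeline(jobs []int, workerCount int) []record {
	jobChan := make(chan int)
	results := make(chan record)
	done := make(chan struct{})

	// store stands in for the database. Only the writer goroutine touches
	// it until done is closed.
	var store []record

	// Single writer: drains results until the channel is closed.
	go func() {
		defer close(done)
		for r := range results {
			store = append(store, r) // the real code would INSERT here
		}
	}()

	var wg sync.WaitGroup
	wg.Add(workerCount)
	for i := 0; i < workerCount; i++ {
		go func() {
			defer wg.Done()
			for job := range jobChan {
				results <- parse(job)
			}
		}()
	}

	for _, j := range jobs {
		jobChan <- j
	}
	close(jobChan)

	wg.Wait()      // every worker has sent everything it will ever send
	close(results) // safe: no sender is left
	<-done         // the writer has handled the last record

	return store
}

func main() {
	store := runPipeline([]int{1, 2, 3, 4, 5}, 3)
	fmt.Println(len(store))
}
```

Any read queries placed after the receive from done run only after the last write has been issued, which is the ordering the question asks for.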
I have a requirement where I iterate through 10,000,000 documents and for each document I do some operation and store some values in '/count.xml'. When I iterate to second document I update '/count.xml' with updated value
Currently this is what I am doing, here $total-records is 10,000,000
let $total-records := xdmp:estimate(cts:search( //some code))
let $batch-size := 5000
let $pagination := 0
let $bs :=
for $records in 1 to fn:ceiling($total-records div $batch-size )
let $start := fn:sum($pagination + 1)
let $end := fn:sum($batch-size + $pagination)
let $_ := xdmp:set($pagination, $end)
return
xdmp:spawn-function
(
function() {
for $each in cts:search( //some code)[$start to $end]
return //some operation and update '/count.xml' with some updated values
},
<options xmlns="xdmp:eval"><commit>auto</commit><update>true</update></options>
)
let $doc := doc("/count.xml")
return ()
So the issue is that I need to read the '/count.xml' file after all documents have been iterated. With the code above, let $doc := doc("/count.xml") will not return the latest version, because the spawned tasks run on different threads.
I need a solution where let $doc := doc("/count.xml") waits until all spawned tasks have completed.
I have come across the <result>{fn:true()}</result> option as well, but I do not know whether it will work, because the variable $bs is not used anywhere and the documentation says 'When the calling request uses the value future in any operation, it will automatically wait for the spawned task to complete and it will use the result.'
Is there any other alternative where the let $doc := doc("/count.xml") line is executed only after all spawned tasks have completed?
To process 10 million documents, you probably need to spawn something like 10,000 batches of 1,000 docs. I don't think that will work well from within MarkLogic.
I'd advise looking into the built-in aggregation features of MarkLogic. See for instance cts:sum-aggregate. You might be able to pre-calculate per-document intermediate results that you could aggregate at run-time using those aggregation features. That would definitely be most performant, and would scale best.
Alternative would be to orchestrate your calculations from outside of MarkLogic. Otherwise you end up either flooding the task queue, or running into timeout limits, or both. Tools like Corb2 and DMSDK could be of help with this.
Note: you can indeed make spawns wait for result by using the <result> option, but either use <result>true</result> or <result>{fn:true()}</result> (note the parentheses behind fn:true, it is a function).
HTH!
From the requirement as given, one cannot tell the difference between writing the final result of a query across 10 million docs once and writing the result after querying one document at a time. Since your example does no writes to the queried documents, it need not be spawned or run in a separate thread or transaction. Rather, as the other answer says, you can make use of aggregate functions to do a single query over the entire set, compute the final result, and store it in one operation. This will likely run very quickly (or can be made to).
If the requirement really is that each single document MUST be queried and then, sequentially, another shared document written to, this can only be observed by using separate transactions, serially. It's going to be horrendously slow, almost certainly longer than the timeout for the calling transaction. This means you must orchestrate it from outside, if the requirement is that the same caller starts the process and finishes it (a highly implementation-specific requirement that, if true, is likely to have other implications beyond those given).
Something close that's achievable, but still horrendously slow, is to have an outside query poll the updated shared document and return 'success' once the job is done.
But again, with this many documents, if you're forcing a write transaction for each one, it's going to take longer (or at least is not easily guaranteed NOT to take longer) than a single transaction timeout, so it must be invoked from 'outside'.
This is where I would recommend revisiting the requirements to determine the core functionality/result that is desired and if it is truly required to implement exactly as stated vs a more performant implementation that achieves the desired result.
If the core functionality needed is that every single query be 'checkpointed' with a document update, then there are other implications such as transaction rollback that need to be considered.
I am using a futures-rs powered version of the Rusoto AWS Kinesis library. I need to spawn a deep pipeline of AWS Kinesis requests to achieve high-throughput because Kinesis has a limit of 500 records per HTTP request. Combined with the 50ms latency of sending a request, I need to start generating many concurrent requests. I am looking to create somewhere on the order of 100 in-flight requests.
The Rusoto put_records function signature looks like this:
fn put_records(
&self,
input: &PutRecordsInput,
) -> RusotoFuture<PutRecordsOutput, PutRecordsError>
The RusotoFuture is a wrapper defined like this:
/// Future that is returned from all rusoto service APIs.
pub struct RusotoFuture<T, E> {
inner: Box<Future<Item = T, Error = E> + 'static>,
}
The inner Future is wrapped, but RusotoFuture still implements Future::poll(), so I believe it is compatible with the futures-rs ecosystem. The RusotoFuture provides a synchronization call:
impl<T, E> RusotoFuture<T, E> {
/// Blocks the current thread until the future has resolved.
///
/// This is meant to provide a simple way for non-async consumers
/// to work with rusoto.
pub fn sync(self) -> Result<T, E> {
self.wait()
}
}
I can issue a request and sync() it, getting the result from AWS. I would like to create many requests, put them in some kind of queue/list, and gather finished requests. If the request errored I need to reissue the request (this is somewhat normal in Kinesis, especially when hitting limits on your shard throughput). If the request is completed successfully I should issue a request with new data. I could spawn a thread for each request and sync it but that seems inefficient when I have the async IO thread running.
I have tried using futures::sync::mpsc::channel from my application thread (not running from inside the Tokio reactor) but whenever I clone the tx it generates its own buffer, eliminating any kind of backpressure on send:
fn kinesis_pipeline(client: DefaultKinesisClient, stream_name: String, num_puts: usize, puts_size: usize) {
use futures::sync::mpsc::{ channel, spawn };
use futures::{ Sink, Future, Stream };
use futures::stream::Sender;
use rusoto_core::reactor::DEFAULT_REACTOR;
let client = Arc::new(KinesisClient::simple(Region::UsWest2));
let data = FauxData::new(); // a data generator for testing
let (mut tx, mut rx) = channel(1);
for rec in data {
tx.clone().send(rec);
}
}
Without the clone, I have the error:
error[E0382]: use of moved value: `tx`
--> src/main.rs:150:9
|
150 | tx.send(rec);
| ^^ value moved here in previous iteration of loop
|
= note: move occurs because `tx` has type `futures::sync::mpsc::Sender<rusoto_kinesis::PutRecordsRequestEntry>`, which does not implement the `Copy` trait
I have also looked at futures::sync::mpsc::spawn based on recommendations, but it takes ownership of the rx (as a Stream) and does not solve my problem with the clone of tx causing unbounded behavior.
I'm hoping if I can get the channel/spawn usage working, I will have a system which takes RusotoFutures, waits for them to complete, and then provides me an easy way to grab completion results from my application thread.
As far as I can tell, your problem with channel is not that a single clone of the Sender increases the capacity by one; it is that you clone the Sender for every item you're trying to send.
The error you're seeing without clone comes from your incorrect usage of the Sink::send interface. With clone you actually should see the warning:
warning: unused `futures::sink::Send` which must be used: futures do nothing unless polled
That is: your current code doesn't actually ever send anything!
In order to apply backpressure you need to chain those send calls; each one should wait until the previous one finished (and you need to wait for the last one too!); on success you'll get the Sender back. The best way to do this is to generate a Stream from your iterator by using iter_ok and to pass it to send_all.
Now you got one future SendAll that you need to "drive". If you ignore the result and panic on error (.then(|r| { r.unwrap(); Ok::<(), ()>(()) })) you could spawn it as a separate task, but maybe you want to integrate it into your main application (i.e. return it in a Box).
// this returns a `Box<Future<Item = (), Error = ()>>`. you may
// want to use a different error type
Box::new(tx.send_all(iter_ok(data)).map(|_| ()).map_err(|_| ()))
RusotoFuture::sync and Future::wait
Don't use Future::wait: it is already deprecated in a branch, and it usually won't do what you actually are looking for. I doubt RusotoFuture is aware of the problems, so I recommend avoiding RusotoFuture::sync.
Cloning Sender increases channel capacity
As you correctly stated cloning Sender increases the capacity by one.
This seems to be done to improve performance: A Sender starts in the unblocked ("unparked") state; if a Sender isn't blocked it can send an item without blocking. But if the number of items in the queue hits the configured limit when a Sender sends an item, the Sender becomes blocked ("parked"). (Removing items from the queue will unblock the Sender at a certain time.)
This means that after the inner queue hits the limit each Sender still can send one item, which leads to the documented effect of increased capacity, but only if actually all the Senders are sending items - unused Senders don't increase the observed capacity.
The performance boost comes from the fact that as long as you don't hit the limit it doesn't need to park and notify tasks (which is quite heavy).
The private documentation at the top of the mpsc module describes more of the details.
I've spent a fair amount of time looking into the Realm database mechanics and I can't figure out if Realm is using row level read locks under the hood for data selected during write transactions.
As a basic example, imagine the following "queue" logic
assume the queue has an arbitrary number of jobs (we'll say 5 jobs)
async getNextJob() {
let nextJob = null;
this.realm.write(() => {
let jobs = this.realm.objects('Job')
.filtered('active == FALSE')
.sorted([['priority', true], ['created', false]]);
if (jobs.length) {
nextJob = jobs[0];
nextJob.active = true;
}
});
return nextJob;
}
If I call getNextJob() 2 times concurrently, if row level read blocking isn't occurring, there's a chance that nextJob will return the same job object when we query for jobs.
Furthermore, if I have outside logic that relies on up-to-date data in read logic (ie job.active == false when it actually is true at current time) I need the read to block until update transactions complete. MVCC reads getting stale data do not work in this situation.
If read locks are being set in write transactions, I could make sure I'm always reading the latest data like so
let active = null;
this.realm.write(() => {
const job = this.realm.pseudoQueryToGetJobByPrimaryKey();
active = job.active;
});
// Assuming the above write transaction blocked the read until
// any concurrent updates touching the same job committed
// the value for active can be trusted at this point in time.
if (active === false) {
// code to start job here
}
So basically, TL;DR does Realm support SELECT FOR UPDATE?
Postgresql
https://www.postgresql.org/docs/9.1/static/explicit-locking.html
MySql
https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
So basically, TL;DR does Realm support SELECT FOR UPDATE?
Well if I understand the question correctly, the answer is slightly trickier than that.
If there is no Realm Object Server involved, then realm.write(() => disallows any other writes at the same time, and updates the Realm to its latest version when the transaction is opened.
If there is Realm Object Server involved, then I think this still stands locally, but the Realm Sync manages the updates from remote, in which case the conflict resolution rules apply for remote data changes.
Realm does not allow concurrent writes. There is at most one ongoing
write transaction at any point in time.
If the async getNextJob() function is called twice concurrently, one of
the invocations will block on realm.write().
SELECT FOR UPDATE then works trivially, since there are no concurrent updates.
I want to run some slow routine in another goroutine, is it safe to do it like this:
func someHandler(w http.ResponseWriter, r *http.Request) {
go someReallySlowFunction() // sending mail or something slow
fmt.Fprintf(w,"Mail will be delivered shortly..")
}
func otherHandler(w http.ResponseWriter, r *http.Request) {
foo := int64(0)
bar := func() {
// do slow things with foo
}
go bar()
fmt.Fprintf(w,"Mail will be delivered shortly..")
}
Are there any gotchas in doing this?
Serving each http request runs in its own goroutine (more details on this). You are allowed to start new goroutines from your handler, and they will run concurrently, independently from the goroutine executing the handler.
Some things to look out for:
The new goroutine runs independently from the handler goroutine. This means it may complete before or after the handler goroutine; you cannot (should not) assume anything about this without explicit synchronization.
The http.ResponseWriter and http.Request arguments of the handler are only valid and safe to use until the handler returns! These values (or "parts" of them) may be reused - this is an implementation detail of which you should also not assume anything. Once the handler returns, you should not touch (not even read) these values.
Once the handler returns, the response is committed (or may be committed at any moment). This means your new goroutine should not attempt to send back any data using the http.ResponseWriter after this. This is true to the extent that even if you don't touch the http.ResponseWriter in your handler, not panicking from the handler is taken as a successful handling of the request and thus HTTP 200 status is sent back (see an example of this).
You are allowed to pass the http.Request and http.ResponseWriter values to other functions and to new goroutines, but care must be taken: you should use explicit synchronization (e.g. locks, channels) if you intend to read / modify these values from multiple goroutines (or you want to send back data from multiple goroutines).
Note that even if both your handler goroutine and your new goroutine seemingly just read / inspect the http.Request, that may still be problematic. Yes, multiple goroutines can read the same variable without synchronization (if nobody modifies it). But calling certain methods of http.Request also modifies the http.Request, and without synchronization there is no guarantee what other goroutines would see from this change. For example, Request.FormValue() returns a form value associated with the given key. But this method calls ParseMultipartForm() and ParseForm() if necessary, which modify the http.Request (e.g. they set the Request.PostForm and Request.Form struct fields).
So unless you synchronize your goroutines, you should not pass Request and ResponseWriter to the new goroutine, but acquire data needed from the Request in the handler goroutine, and pass only e.g. a struct holding the needed data.
Your second example:
foo := int64(0)
bar := func() {
// do slow things with foo
}
go bar()
This is perfectly fine. This is a closure, and local variables referred by it will survive as long as they are accessible.
Note that alternatively you could pass the value of the local variable to the anonymous function call as an argument like this:
foo := int64(0)
bar := func(foo int64) {
// do slow things with param foo (not the local foo var)
}
go bar(foo)
In this example the anonymous function will see and use its parameter foo and not the local variable foo. This may or may not be what you want (depending on whether the handler also uses the foo and whether changes made by any of the goroutines need to be visible to the other - but that would require synchronization anyway, which would be superseded by a channel solution).
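The difference between the two forms can be made visible with a small experiment. This is only a sketch: the snapshotVsShared function and the start channel are hypothetical and exist purely to order the reads after the write, so there is no data race.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshotVsShared starts one goroutine that takes foo as a parameter and
// one that captures foo in a closure, then changes foo before either of
// them reads it. It returns what each goroutine observed.
func snapshotVsShared() (byParam, byClosure int64) {
	foo := int64(0)
	start := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	// Parameter: foo is evaluated and copied when the go statement runs.
	go func(foo int64) {
		defer wg.Done()
		<-start
		byParam = foo
	}(foo)

	// Closure: the goroutine shares the handler's foo variable.
	go func() {
		defer wg.Done()
		<-start
		byClosure = foo
	}()

	foo = 42     // written before close(start), so the reads are ordered after it
	close(start) // release both goroutines
	wg.Wait()
	return byParam, byClosure
}

func main() {
	fmt.Println(snapshotVsShared())
}
```

The goroutine with the parameter still sees 0, while the closure sees 42. Without the start channel (or some other synchronization), the closure's read would race with the write.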
If you care for acknowledgement for the mail, then the posted code won't help. Running the code in separate goroutine makes it independent and the server reply will be success even if the mail is not sent due to some error in the goroutine function.
I was wondering if there is already a library to do that or maybe a suggestion which way to go for the following problem:
Client A makes request for resource A, this is a long running request since resource A is expensive and it results in a cache miss. In the meantime client B makes request for resource A, now it's still a cache miss since client A's request hasn't returned and populated the cache yet. so instead of making a new request to generate resource A, client B should block and be notified when client A's request is complete and has populated the cache.
I think the group cache library has something along those lines, but I haven't been able to browse through the code to figure out how they do it, I also don't wanna tie the implementation to it and use it as a dependency.
The only solution I had so far is a pub-sub type of thing, where we have a global map of the current in-flight requests with the reqID as a key. When req1 comes it sets its ID in the map, req2 comes and checks if its id is in the map, since its requesting the same resource it is, so we block on a notifier channel. When req1 finishes it does 3 things:
evicts its ID from the map
saves the entry in the cache
sends a broadcast with its ID to the notifier channel
req2 receives the notification, unblocks and fetches from the cache.
Since Go doesn't have built-in support for broadcasts, there's probably one goroutine listening on the broadcast channel and keeping a list of subscribers to broadcast to for each request, or maybe we change the map to reqId => list(broadcastChannelSubscribers). Something along those lines.
If you think there is a better way to do it with Go's primitives, any input would be appreciated. The only piece of this solution that bothers me is this global map surrounded by locks; I assume it is quickly going to become a bottleneck. If you have some non-locking ideas, even if they are probabilistic, I'm happy to hear them.
It reminds me of one question where someone was implementing a similar thing:
Coalescing items in channel
I gave an answer with an example of implementing such a middle layer. I think this is in line with your ideas: have a routine keeping track of requests for the same resource and prevent them from being recalculated in parallel.
If you have a separate routine responsible for taking requests and managing access to the cache, you don't need an explicit lock (there is one buried in a channel though). Anyhow, I don't know the specifics of your application, but considering you need to check the cache (probably locked) and (occasionally) perform an expensive calculation of a missing entry, a lock on map lookups doesn't seem like a massive problem to me. You can also always spawn more such middle-layer routines if you think this would help, but you would need a deterministic way of routing the requests (so each cache entry is managed by a single routine).
Sorry for not bringing you a silver bullet solution, but it sounds like you're on a good way of solving your problem anyway.
Caching and performance problems are always tricky, and you should always make a basic solution to benchmark against to ensure that your assumptions are correct. But if we know that the bottleneck is fetching the resource and that caching will give significant returns, you could use Go's channels to implement queuing. Assuming that response is the type of your resource:
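package main

// response is the type of your resource; it and makeLongRunningRequest
// are placeholders here.
type response struct{}

func makeLongRunningRequest() *response { return &response{} }

type request struct {
	back chan *response
}

func main() {
	c := make(chan request, 10) // buffered, so the sends below do not block
	go func(input chan request) {
		var cached *response
		for i := range input {
			if cached == nil { // only make the request once
				cached = makeLongRunningRequest()
			}
			i.back <- cached
		}
	}(c)
	resp := make(chan *response)
	c <- request{resp} // cache miss
	c <- request{resp} // will get queued
	c <- request{resp} // will get queued
	// Receive exactly one response per request; resp is never closed,
	// so ranging over it would block forever.
	for n := 0; n < 3; n++ {
		r := <-resp
		_ = r // do something with response
	}
}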
Here we're only fetching one resource but you could start one goroutine for each resource you want to fetch. Goroutines are cheap so unless you need millions of resources cached at the same time you should be ok. You could of course also kill your goroutines after a while.
To keep track of which resource id belongs to which channel, I'd use a map
map[resourceId]chan request
with a mutex. Again, if fetching the resource is the bottleneck, then the cost of locking the map should be negligible. If locking the map turns out to be a problem, consider using a sharded map.
In general you seem to be well on your way. I'd advise to try to keep your design as simple as possible and use channels instead of locks when possible. They do protect from terrible concurrency bugs.
One solution is a concurrent non-blocking cache as discussed in detail in The Go Programming Language, chapter 9.
The code samples are well worth a look because the authors take you through several versions (memo1, memo2, etc), illustrating problems of race conditions, using mutexes to protect maps, and a version using just channels.
Also consider https://blog.golang.org/context as it has similar concepts and deals with cancellation of in flight requests.
It's impractical to copy the content into this answer, so hopefully the links are of use.
This is already provided for Go by the singleflight package (golang.org/x/sync/singleflight).
For your use case, just add some extra logic on top of singleflight. Consider the code snippet below:
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"golang.org/x/sync/singleflight"
)

// requestGroup must be shared by all requests; a Group created inside the
// handler would never see a duplicate call.
var requestGroup singleflight.Group

func main() {
	http.HandleFunc("/github", func(w http.ResponseWriter, r *http.Request) {
		const key = "facebook"
		// Search the cache; if found, answer from the cache.
		// searchCache and setCache are left for you to implement.
		if res, err := searchCache(key); err == nil {
			fmt.Fprintf(w, "Company Status: %q", res)
			return
		}
		// Cache miss -> make a single flight request, and cache it.
		v, err, shared := requestGroup.Do(key, func() (interface{}, error) {
			// getCompanyStatus() returns (string, error), which satisfies
			// (interface{}, error), so we can return the result directly.
			return getCompanyStatus()
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Set the cache here.
		status := v.(string)
		setCache(key, status)
		log.Printf("/Company handler request: status %q, shared result %t", status, shared)
		fmt.Fprintf(w, "Company Status: %q", status)
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}

// getCompanyStatus retrieves the company's API status.
func getCompanyStatus() (string, error) {
	log.Println("Making request to Some API")
	defer log.Println("Request to Some API Complete")
	time.Sleep(1 * time.Second)
	resp, err := http.Get("Get URL")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return "", fmt.Errorf("Upstream response: %s", resp.Status)
	}
	r := struct{ Status string }{}
	err = json.NewDecoder(resp.Body).Decode(&r)
	return r.Status, err
}
I hope the code snippet is self explanatory and you can refer to Single Flight Official Docs to delve deep into single flight.