I am using firebase-queue in a mobile app to handle some server-side work. In the firebase-queue documentation here, it says that we can specify an optional parameter, numWorkers, which sets the number of workers that can run simultaneously for the Node.js process. I don't fully understand how to use this parameter in my application. For example, one of the things I am doing on the server side with firebase-queue is sending a verification code to the user when he/she first logs into the application. In the future this could be hundreds of users. I have a few questions that I wanted to clarify to understand the use of numWorkers a little better:
When should I have more than one worker for a firebase queue?
What is the optimum number of workers for any firebase queue? Coming from a Java background, I know that having more and more threads running in an application can become an overhead beyond a certain limit. I am not sure whether similar principles apply here.
If I have more than one queue serving different specIds, do I need to think about the total number of workers cumulatively rather than per queue? I have four queues at the moment.
Please let me know if you have any information regarding my questions above. Any inputs are appreciated.
Update - June 5, 2016
After some more playing around with firebase-queue, I have realized that numWorkers controls how many tasks of a given spec can be running simultaneously. Since the queue worker does not process tasks asynchronously, if tasks of a given specId take a long time to finish, you may end up with many tasks in the queue waiting to be picked up. For example, if processing a task involves a network call, it may take longer to finish, and if you expect many of these tasks to be present on the queue, then you should have more than one worker in the firebase queue. So, I know the answer to my first question now.
I am still wondering about questions 2 and 3. I have some tasks in the queue which could number in the hundreds or thousands at a given time, and some of them involve a network call, so they may take a considerable amount of time to finish. I am not sure of the repercussions of having, say, a hundred workers for a queue. I am not able to test it myself, since my app is still in development and I don't have a setup to simulate a large number of such tasks at the moment.
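For concreteness, here is roughly how I am creating the queue (sketched in TypeScript; sendVerificationCode stands in for my real helper, and error handling is trimmed):

// A rough sketch of my setup; numWorkers: 10 lets up to ten tasks of this
// spec be processed at the same time.
import * as admin from 'firebase-admin';
const Queue = require('firebase-queue');

admin.initializeApp(); // assumes credentials are configured in the environment

async function sendVerificationCode(userId: string): Promise<void> {
  // hypothetical network call to an SMS/email provider
}

const queue = new Queue(
  admin.database().ref('queue'),
  { numWorkers: 10 },
  (data: any, progress: any, resolve: () => void, reject: (e: Error) => void) => {
    sendVerificationCode(data.userId)
      .then(() => resolve())
      .catch(reject);
  }
);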
Related
I understand the benefits of async on the frontend (some kind of UI). With async the UI does not block and the user can "click" on other things while waiting for another operation to execute.
But what is the benefit of async programming on the backend?
The main benefit is that there can be various slow operations on the backend which would otherwise prevent other requests from being served at the same time. These operations include database operations, file operations, remote calls to other services (servers), and so on. You don't want to tie up a thread while these operations are in progress.
First of all, there is a benefit to handling more than one request at a time. Frameworks like ASP.NET or Django create new threads (or reuse existing ones) for each request.
If you mean async operations within the thread of a particular request, that's more complicated. In most cases it does not help at all, because of the overhead of spawning a new thread. But we have things like schedulers in C#, for example, which help a lot. When used correctly, they free up a lot of CPU time that would normally be wasted on waiting.
For example, you send a picture to a server. Your request is handled in a new thread. This thread can do everything on its own: unpack the picture and save it to disk, then update the database.
Or, you can write to disk AND update the database at the same time. The thread that finishes first is our focus here. When used without a scheduler, it starts spinning in a loop, checking whether the other thread is done, which takes CPU time. If you use a scheduler, it frees that thread. When the other task is done, it probably uses yet another pre-created thread to finish handling your request.
That scenario may make it seem like it's not worth the fuss, but it is easy to imagine more complicated tasks that can be done at the same time instead of sequentially. On top of that, schedulers are rather smart and will arrange things so that the total time needed is lowest and CPU usage moderate.
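As a minimal sketch of that picture example in a non-blocking style (the same idea expressed with JavaScript promises rather than C# schedulers; saveToDisk and updateDatabase are hypothetical helpers):

// Sketch: start the disk write and the database update together and wait for
// both. While the I/O is in flight, no thread sits spinning in a loop.
async function saveToDisk(picture: Uint8Array): Promise<void> { /* hypothetical file write */ }
async function updateDatabase(userId: string): Promise<void> { /* hypothetical DB update */ }

async function handleUpload(picture: Uint8Array, userId: string): Promise<void> {
  await Promise.all([
    saveToDisk(picture),
    updateDatabase(userId),
  ]);
  // both slow operations are done; finish handling the request
}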
I'm looking at using Celery to execute some tasks for my website asynchronously (yes, I'm super new to this idea and will probably say some stupid things in this question, sorry in advance). I'm wondering: what criteria do people use to determine whether a particular task should be executed asynchronously with a task queue like Celery versus an HTTP request or an AJAX request? After reading a few blogs, people have been suggesting using task queues for:
Tasks that the user doesn't need immediately
Tasks that are periodic
Preventing tons of database requests (or other expensive tasks) from being executed all at once
Aggregating tasks
So I guess my question is: what types of tasks should I not use a task queue for? If a task is not holding up any other part of a request (not keeping a user waiting) and isn't periodic, is there a situation where it would still make sense to use a task queue? Does it make sense to aggregate database modifications? And if so, how exactly does that save resources? Thanks for the help!
I've been looking at this some more, and my conclusion is that a queue should be used for tasks only if:
there is an increase in efficiency
the task is independent of other processes
the task is simple
the task is repeated a lot
This is a pretty weak answer, but if it starts a discussion among people more knowledgeable than myself it will have done its job :)
Adding:
If you want to guarantee execution of a task (task queues typically support retrying)
If you want to stay within a 3rd party rate limit (say, send up to 10 emails per second; see the sketch after this list)
If a task is CPU intensive and would bog down other client requests to your main API server
An incredibly good resource for this is here, both part 1 and 2
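For the rate-limit case, a minimal self-contained sketch (sendEmail is a hypothetical provider call; a real system would also persist the queue):

// Sketch: drain a queue at no more than 10 emails per second so a third-party
// rate limit is respected.
const pending: string[] = []; // recipient addresses waiting to be sent

async function sendEmail(to: string): Promise<void> { /* hypothetical provider call */ }

setInterval(() => {
  const batch = pending.splice(0, 10); // at most 10 per one-second tick
  for (const to of batch) {
    sendEmail(to).catch(() => pending.push(to)); // naive retry: requeue on failure
  }
}, 1000);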
The ASP.NET runtime is meant for short workloads that can be run in parallel. I need to be able to schedule periodic events and background tasks that may or may not run for much longer periods.
Given the above I have the following problems to deal with:
The AppDomain can shutdown due to changes (Web.config, bin, App_Code, etc.)
IIS recycles the AppPool on a regular basis (daily)
IIS itself might restart, or for that matter the server might crash
I'm not convinced that running this code inside ASP.NET is the wrong thing to do, because it would allow for a simpler programming model. But doing so would require an external service to periodically make requests to the app so that the application is kept running, and all background tasks would have to be programmed with the utmost care: they would have to be able to pause and resume their work in the event of an unexpected error.
My current line of thinking goes something like this:
If all jobs are registered in the database, it should be possible to use the database as a bookkeeping mechanism. In the case of an error, the database would contain all the state necessary to resume the operation at the next opportunity.
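To make that concrete, here is a sketch of the bookkeeping loop I have in mind (the jobs table and db client are placeholders; TypeScript just for illustration):

// Sketch: every state transition is recorded in the database first, so after
// a crash the next run resumes from the recorded state. db.query() stands in
// for any database client; the jobs table is hypothetical.
interface Db { query(sql: string, params?: unknown[]): Promise<any[]>; }

async function resumePendingJobs(db: Db): Promise<void> {
  // Jobs still marked 'running' were interrupted mid-flight; pick them up too.
  const jobs = await db.query(
    "SELECT id FROM jobs WHERE status IN ('pending', 'running')");
  for (const job of jobs) {
    await db.query("UPDATE jobs SET status = 'running' WHERE id = ?", [job.id]);
    await runJob(job.id); // placeholder for the actual work
    await db.query("UPDATE jobs SET status = 'done' WHERE id = ?", [job.id]);
  }
}

async function runJob(id: number): Promise<void> { /* hypothetical job body */ }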
I'd really appreciate some feedback/advice on this matter. I've been considering running a Windows service and using some RPC solution as well, but it doesn't have the same appeal to me, and I'd instead have a lot of deployment issues and have to synchronize tasks and code across several applications. Given my business needs, this is less than optimal.
This is a shot in the dark since I don't know what database you use, but I'd recommend you consider dialog timers and activation. Assuming that most of the jobs do some data manipulation, and it is likely that all of them do only data manipulation, leveraging activation and timers gives an extremely reliable job-scheduling solution, entirely embedded in the database (no need for an external process/service, no dependencies outside the database bounds like msdb), and it is a solution that ensures scheduled jobs can survive restarts, failover events, and even disaster-recovery restores. Simply put, once a job is scheduled it will run even if the database is restored one week later on a different machine.
Have a look at Asynchronous procedure execution for a related example.
And if this is too radical, at least have a look at Using Tables as Queues since storing the scheduled items in the database often falls under the 'pending queue' case.
I recommend that you have a look at Quartz.Net. It is open source and it will give you some ideas.
Using the database as a state-keeping mechanism is a completely valid idea. How complex it will be depends on how far you want to take it. In many cases you will end up pairing your database logic with a Windows service to achieve the desired result.
FWIW, it is typically not good practice to manually use the thread pool inside an ASP.NET application, though (contrary to what you may read) it actually works quite nicely, apart from the huge caveat that you can't guarantee the work will complete.
So if you needed a background thread that examined the state of some object every 30 seconds and you didn't care if it fired every 30 seconds or 29 seconds or 2 minutes (such as in a long app pool recycle), an ASP.Net-spawned thread is a quick and very dirty solution.
Asynchronously fired callbacks (such as on the ASP.Net Cache object) can also perform a sort of "behind the scenes" role.
I have faced similar challenges and ultimately opted for a Windows service that uses a combination of building blocks for maximum flexibility. Namely, I use:
1) WCF with implementation-specific types OR
2) Types that are meant to transport and manage objects that wrap a job OR
3) Completely generic, serializable objects contained in a custom wrapper. Since they are just a binary payload, this allows any object to be passed to the service. Once in the service, the wrapper defines what should happen to the object (e.g. invoke a method, gather a result, and optionally make that result available for return); see the sketch below.
Ultimately, the web site is responsible for querying the service about its state. This querying can be as simple as polling or can use asynchronous callbacks with WCF (though I believe this also uses some sort of polling behind the scenes).
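A sketch of what that option-3 wrapper might look like (every name here is hypothetical; TypeScript only for illustration):

// Sketch: a generic job envelope. The wrapper names the handler to invoke and
// carries an opaque serialized payload; the service looks the handler up in a
// registry, invokes it, and gathers the result.
interface JobEnvelope {
  handler: string; // which registered handler should process this job
  payload: string; // serialized argument object (JSON here, binary in reality)
}

const handlers: Record<string, (arg: unknown) => Promise<unknown>> = {
  resizeImage: async arg => { /* hypothetical work */ return 'done'; },
};

async function execute(envelope: JobEnvelope): Promise<unknown> {
  const handler = handlers[envelope.handler];
  if (!handler) throw new Error(`unknown handler: ${envelope.handler}`);
  return handler(JSON.parse(envelope.payload));
}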
I'll tell you what I have done.
I created a class called Atzenta that has a timer (triggering every 1-2 seconds).
I also created a table in my temporary database that keeps the jobs. The table stores the job ID, other parameters, the priority, the job status, and messages.
I can add or delete a job through this class. When there is no work to be done the timer stops; when I add a job, the timer starts again. (The timer runs on a thread of its own and can do work in parallel.) I use System.Timers, and not the other timer classes, for this.
Jobs can have different priorities.
Now let's say I place a job in this table using the Atzenta class. The next time the timer fires, it runs a query against this table, finds the first available job, and runs it. No other job runs until that one ends.
All synchronization and flags are handled through the table. In the table I have a flag for every job that shows whether it is |waiting to run|requested to run|running|paused|finished|killed|
All jobs are already-known functions or classes (e.g. the generation of statistics).
For stop and start, I use Global.asax and Application_Start/Application_End to start and pause the object that keeps the tasks. For example, when a job is running and Application_End fires, I either wait for it to finish and then stop the app, or I stop the action, notify the table, and start it again on Application_Start.
So I call Atzenta.RunTheJob(Jobs.StatisticUpdate, ProductID); this adds the job to the table and starts the timer, and on the next trigger the job runs and the statistics for the given product ID are updated.
I use a table in a database to synchronize the many pools that run the same web app, and in fact it works that way. With a common table, synchronizing the jobs is easy, and you avoid two pools running the same job at the same time.
On my back office I have a simple table view to see the status of all jobs.
I've programmed in a number of languages, but I am not aware of deadlocks in my code.
I took this to mean it doesn't happen.
Does this happen frequently enough (in programming, not in databases) that I should be concerned about it?
Deadlocks can arise if two conditions are true: you have multiple threads, and they contend for more than one resource.
Do you write multi-threaded code? You might do this explicitly by starting your own threads, or you might work in a framework where the threads are created out of your sight, and so you're running in more than one thread without you seeing that in your code.
An example: the Java Servlet API. You write a servlet or JSP. You deploy to the app server. Several users hit your web site, and hence your servlet. The server will likely have a thread per user.
Now consider what happens if, in servicing the requests, you want to acquire some resources:
// Contrived example: resources are acquired in different orders on different
// paths, so two concurrent requests can deadlock.
if (userIsImportant) {
    getResourceA();
}
getResourceB();
if (todayIsThursday) {
    getResourceA();
}
// some more code
releaseResourceA();
releaseResourceB();
In the contrived example above, think about what might happen on a Thursday when an important user's request arrives and, more or less simultaneously, an unimportant user's request arrives.
The important user's thread gets resource A and wants B. The less important user's thread gets resource B and wants A. Neither will let go of the resource it already owns ... deadlock.
This can actually happen quite easily if you are writing code that explicitly uses synchronization. Most commonly I see it happen when using databases, and fortunately databases usually have deadlock detection, so we can find out what error we made.
Defense against deadlock:
Acquire resources in a well-defined order. In the above example, if resource A were always obtained before resource B, no deadlock would occur.
If possible, use timeouts so that you don't wait indefinitely for a resource. This will allow you to detect contention and apply defense 1.
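To make defense 1 concrete, here is a minimal TypeScript sketch, assuming a tiny promise-based Mutex helper (a hypothetical class, not part of any standard library):

// A minimal promise-based mutex: lock() resolves with a release function.
class Mutex {
  private tail: Promise<void> = Promise.resolve();
  lock(): Promise<() => void> {
    let release!: () => void;
    const next = new Promise<void>(resolve => { release = resolve; });
    const acquired = this.tail.then(() => release);
    this.tail = next;
    return acquired;
  }
}

const lockA = new Mutex();
const lockB = new Mutex();

// Every code path acquires A before B, so the circular wait can never form.
async function doWork(): Promise<void> {
  const releaseA = await lockA.lock();
  const releaseB = await lockB.lock();
  try {
    // ... use both resources ...
  } finally {
    releaseB(); // release in the reverse order of acquisition
    releaseA();
  }
}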
It would be very hard to give an idea of how often it happens in reality (in production code? in development?) and that wouldn't really give a good idea of how much code is vulnerable to it anyway. (Quite often a deadlock will only occur in very specific situations.)
I've seen a few occurrences, although the most recent one I saw was in an Oracle driver (not in the database at all) due to a finalizer running at the same time as another thread trying to grab a connection. Fortunately I found another bug which let me avoid the finalizer running in the first place...
Basically deadlock is almost always due to trying to acquire one lock (B) whilst holding another one (A) while another thread does exactly the same thing the other way round. If one thread is waiting for B to be released, and the thread holding B is waiting for A to be released, neither is willing to let the other proceed.
Make sure you always acquire locks in the same order (and release them in the reverse order) and you should be able to avoid deadlock in most cases.
There are some odd cases where you don't directly have two locks, but it's the same basic principle. For example, in .NET you might use Control.Invoke from a worker thread in order to update the UI on the UI thread. Invoke waits until the update has been processed before continuing. Now suppose your background thread holds a lock which the update requires... again, the worker thread is waiting for the UI thread, but the UI thread can't proceed because the worker thread holds the lock. Deadlock again.
This is the sort of pattern to watch out for. If you make sure you only lock where you need to, lock for as short a period as you can get away with, and document the thread safety and locking policies of all your code, you should be able to avoid deadlock. Like all threading topics, however, it's easier said than done.
If you get a chance take a look at first few chapters in Java Concurrency in Practice.
Deadlocks can occur in any concurrent programming situation, so it depends how much concurrency you deal with. Several examples of concurrent programming are: multi-process, multi-threaded, and libraries that introduce multi-threading. UI frameworks and event handling (such as timer events) can be implemented with threads. Web frameworks can spawn threads to handle multiple web requests simultaneously. With multicore CPUs you may see more concurrent situations than before.
If A is waiting for B, and B is waiting for A, the circular wait causes the deadlock. So it also depends on the type of code you write. If you use distributed transactions, you can easily create that type of scenario; without distributed transactions, you risk inconsistencies such as money being lost between bank accounts.
It all depends on what you are coding. For traditional single-threaded applications that do not use locking: not really.
Multi-threaded code with multiple locks is what causes deadlocks.
I just finished refactoring code that used seven different locks without proper exception handling. It had numerous deadlock issues.
A common cause of deadlocks is when you have different threads (or processes) acquire a set of resources in different order.
E.g. if you have some resource A and B, if thread 1 acquires A and then B, and thread 2 acquires B and then A, then this is a deadlock waiting to happen.
There's a simple solution to this problem: have all your threads always acquire resources in the same order. E.g. if all your threads acquire A and B in that order, you will avoid deadlock.
A deadlock is a situation in which two processes are dependent on each other: neither can finish before the other. Therefore, you will likely only have a deadlock in your code if you are running multiple code flows at the same time.
Developing a multi-threaded application means you need to consider deadlocks. A single-threaded application is unlikely to have deadlocks, but it's not impossible; the obvious example is that you may be using a DB which is subject to deadlocking.
I know there's a bunch of APIs out there that do this, but I also know that the hosting environment (being ASP.NET) puts restrictions on what you can reliably do in a separate thread.
I could be completely wrong, so please correct me if I am, this is however what I think I know.
A request typically times out after 120 seconds (this is configurable), and eventually the ASP.NET runtime will kill a request that's taking too long to complete.
The hosting environment, typically IIS, employs process recycling and can at any point decide to recycle your app. When this happens all threads are aborted and the app restarts. I'm not sure how aggressive it is, however; it would be kind of stupid to assume that it would abort a normal ongoing HTTP request, but I would expect it to abort a thread, because it doesn't know anything about that thread's unit of work.
If you had to create a programming model that could easily and reliably run a long-running task (one that might have to run for days) from within an ASP.NET application, how would you accomplish this?
The following are my thoughts on the issue:
I've been thinking along the lines of hosting a WCF service in a Win32 service and talking to the service through WCF. This is however not very practical, because the only reason I would choose to do so is to send tasks (units of work) from several different web apps. I'd then eventually ask the service for status updates and act accordingly. My biggest concern with this is that it would NOT be a particularly great experience if I had to deploy every task to the service for it to be able to execute some instructions. There's also the issue of input: how would I feed this service with data if I had a large data set and needed to chew through it?
What I typically do right now is this
-- Claim up to 10 unfinished work items, skipping rows other workers have locked
SELECT TOP 10 *
FROM WorkItem WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE WorkCompleted IS NULL
It allows me to use a SQL Server database as a work queue and periodically poll it with this query for work. When a work item completes successfully, I mark it as done and proceed until there's nothing more to do. What I don't like is that I could theoretically be interrupted at any point, and if I'm in between succeeding and marking the item as done, I could end up processing the same work item twice. I might be a bit paranoid and this might all be fine, but as I understand it there's no guarantee that it won't happen...
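Sketching that loop (in TypeScript for brevity; db.query and processItem are stand-ins, and note that the claim and the completion update would really need to share one transaction, which is exactly where my worry lies):

// Sketch: poll the WorkItem table, process each claimed row, mark it done.
const CLAIM_SQL = `
  SELECT TOP 10 *
  FROM WorkItem WITH (ROWLOCK, UPDLOCK, READPAST)
  WHERE WorkCompleted IS NULL`;

interface Db { query(sql: string, params?: unknown[]): Promise<any[]>; }

async function pollOnce(db: Db): Promise<void> {
  const items = await db.query(CLAIM_SQL);
  for (const item of items) {
    await processItem(item); // the actual work; a crash here repeats the item
    await db.query(
      'UPDATE WorkItem SET WorkCompleted = GETDATE() WHERE Id = ?', [item.Id]);
  }
}

async function processItem(item: any): Promise<void> { /* hypothetical */ }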
I know there have been similar questions on SO before, but none really ends with a definitive answer. This is a really common need, yet the ASP.NET hosting environment is ill-equipped to handle long-running work.
Please share your thoughts.
Have a look at NServiceBus
NServiceBus is an open source communications framework for .NET with built-in support for publish/subscribe and long-running processes.
It is a technology built upon MSMQ, which means that your messages don't get lost, since they are persisted to disk. Nevertheless, the framework has impressive performance and an intuitive API.
John,
I agree that ASP.NET is not suitable for async tasks as you have described them, nor should it be. It is designed as a web hosting platform, not a back-of-house processor.
We have had similar situations in the past, and we used a solution similar to what you have described. In summary: keep your WCF service under ASP.NET, and use a "Queue" table with a Windows service as the "QueueProcessor". The client should poll to see if the work is done (or use messaging to notify the client).
We used a table that contained the process and its information (e.g. InvoicingRun). On that table was a status (Pending, Running, Completed, Failed). The client would submit a new InvoicingRun with a status of Pending. A Windows service (the processor) would poll the database to get any runs that are in the pending state (you could also use SQL notifications so you don't need to poll). If a pending run was found, it would move it to Running, do the processing, and then move it to Completed/Failed.
In the case where the process failed fatally (e.g. DB down, process killed), the run would be left in a Running state, and human intervention was required. If the process failed in a non-fatal way (exception, error), it would be moved to Failed, and you could choose to retry or have human intervention.
If there were multiple processors, the first one to move a run to the Running state got that job. You can use this method to prevent the job being run twice. An alternative is to do the select and then the update to Running under a transaction. Make sure to do either of these outside of any larger transaction. Sample (rough) SQL:
UPDATE InvoicingRun
SET Status = 2 -- Running
WHERE ID = 1
AND Status = 1 -- Pending

-- @@ROWCOUNT is 0 when another processor has already claimed the run
IF @@ROWCOUNT = 0
    SELECT CAST(0 AS bit)
ELSE
    SELECT CAST(1 AS bit)
Rob
Use a simple background tasks/jobs framework like Hangfire and apply these best-practice principles to the design of the rest of your solution:
Keep all actions as small as possible; to achieve this, you should:
Divide long-running jobs into batches and queue them (in a Hangfire queue or on a bus of another sort); see the sketch after this list
Make sure your small jobs (the batched parts of long jobs) are idempotent (that they have all the context they need to run in any order). This way you don't have to use a queue which maintains a sequence, and then you can
Parallelise the execution of the jobs in your queue depending on how many nodes you have in your web server farm. You can even control how much load this places on your farm (as a trade-off against servicing web requests). This ensures that you complete the whole job (all batches) as quickly and efficiently as possible, while not compromising your cluster's ability to serve web clients.
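A sketch of the batching idea from the list above (enqueue stands in for Hangfire's BackgroundJob.Enqueue or any bus; written in TypeScript just for illustration):

// Sketch: divide one long-running job into small idempotent batches. Each
// batch carries all the context it needs (its ID range), so batches can run
// in any order and on any node. enqueue() is a hypothetical stand-in.
function enqueue(batch: { start: number; end: number }): void { /* hypothetical */ }

function splitIntoBatches(totalRows: number, batchSize: number): void {
  for (let start = 0; start < totalRows; start += batchSize) {
    const end = Math.min(start + batchSize, totalRows);
    enqueue({ start, end }); // self-contained unit: processes rows [start, end)
  }
}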
Have you thought about using Workflow Foundation instead of your custom implementation? It also allows you to persist state; tasks would be defined as workflows in this case.
Just some thoughts...
Michael