Horizontal scaling and cron jobs - symfony

I was recently forced to move my app to Amazon and use auto-scaling, I have stumbled on to a issue with cron jobs and automatic scaling.
I have a cron job running every 15 minutes which checks if subscriptions should be charged, the query selects all subscriptions that are past due, and attempts to charge them. It changes their status once processed, but they are fetched In a batch, and the process takes 1-3 minutes.
If I have multiple instances with the same cron job, it could fire simultaneously and charge the subscriptions multiple times. This has actually happened once.
What is the Best approach here? Somehow locking the table?
I am using Amazon elastic beanstalk and symfony3.

At least you can use dedicated micro instance for subscription charging (not auto-scaled of course), just with cron jobs. Simplest way yet safest (obviously it will safe if you move your subscription handling logic from front-end servers which potentially can be hacked to the server behind VPC subnet that isn't available from global network).
But if you don't want, you still can use another approach. You mentioned you use Beanstalk. Beanstalk allow to use delayed jobs.
So possible approach is:
1) When you create subscription, you can calculate when it should be charged, and then push the job with calculated delay to Beanstalk tube.
2) Then, worker get the job (with subscription) on-time. Only one worker will get the particular job, so it will work if you use autoscaling.
3) In worker, you check the subscription (probably it can be deleted or inactive etc.) and if it ready to charge, just run the code for charging. Then calculate next charging time and push new delayed job (with subscription) to queue.
Beanstalk has Symfony bundle and powerful PHP library

You can make your job run only for one instance i.e make your functionality - charge subscription run only for one of instance.
You can use AWS api for fetching all instances and then matching the instances with current running one.
ec2 = Aws::EC2::Resource.new(region: 'region',
credentials: Aws::Credentials.new(IAM_KEY', 'IAM_SECRET')
)
metadata_endpoint = 'http://169.254.169.254/latest/meta-data/'
current_server_id = Net::HTTP.get( URI.parse( metadata_endpoint + 'instance-id' ) )
instances = []
ec2.instances.each do |i|
if (i.state.name == 'running')
instances << i.id
end
end
if (instances.first == current_server_id )
{
your functionality
}

Related

JobRunr - Trying to run multiple recurring jobs with Spring Boot

I am using JobRunr to run my background jobs and in this I am providing the users to setup recurring jobs using an endpoint like below:
#PostMapping("/schedule-recurring")
public String scheduleRecurring(#RequestBody ExecutionJob executionJob) {
return BackgroundJob.scheduleRecurrently(executionJob.getId(),executionJob.getCronExpression(), ()
-> jobService.executeSomeJob(executionJob, JobContext.Null));
}
These jobs could run in 5 mins, 10 mins or it might sometimes take upto 4 hours. This all depends on how many records to process. Right now, I am in a phase where I have only one background job server since I am building a POC for this. However, in future we plan to scale it to support 1 instance per customer having one or multiple background job server based on the client license.
My issue here is that, I have 2 recurring jobs which run 1 hour apart from each other. The first recurring job executes for more than 1 hour and deems the second job unexecuted, because this job is not triggered as there is no available Background Job Worker to address this request. I am thinking of adding a check to the job to trigger itself only if a Background Job Worker is available. But is there a better idea where-in the schedule-recurring method itself adds a condition to queue the job if a Background Job worker is not available?
Thanks in advance.

Concurrency issues with Symfony and Doctrine

Good morning,
I have an issue when selecting the next record with Doctrine when there is concurrency. I have installed supervisord inside a docker container that starts multiple processes on the same "dispatch" command. The dispatch commands basically gets the next job in queue in the db and sends it to the right executor. Right now I have two docker containers that each run multiple processes through supervisord. These 2 containers are on 2 different servers. I'm also using Doctrine Optimistic locking. So the Doctrine query to find the next job in queue is the following:
$qb = $this->createQueryBuilder('job')
->andWhere('job.status = :todo')
->setMaxResults( 1 )
->orderBy('job.priority', 'DESC')
->addOrderBy('job.createdAt', 'ASC')
->setParameters(array("todo" => Job::STATUS_TO_DO));
return $qb->getQuery()->getOneOrNullResult();
So the issue is that when a worker tries to get the next job with the above query, I notice that they frequently run into the Optimistic Lock Exception which is fine meaning the record is already used by another worker. When there is an Optimistic Lock Exception, it's caught and then worker stops and another one starts. But I lose a lot of time because of this, because it takes multiple tries for workers to finally get the next job instead of the Optimistic Lock exception.
I thought about getting a random job id in the above Doctrine query.
What's your take on this? Is there a better way to handle this?
I finally figured it out. There was a delay between one of the server and the remote mysql so updates were not seen right away and that triggered the Optimistic Lock Exception. I fixed it by moving the mysql DB to Azure which is way faster than the old server and causes no delays.

Batch Jobs Not Running When Set to Waiting on My Dev Server

My level of experience with the product is basic at best, but I'm expected to be a developer; I have a basic understanding of many things.
Right now my job is to investigate canceling lines in Purchase Orders. We have a workflow set up to handle those, and I'm trying to duplicate the scenario in my dev instance. Whenever a user cancels a line, the workflow is supposed to engage, and I've found that a batch job is what triggers that workflow to work (maybe that's the case with all workflows, but I don't know that for sure).
I've set up my personal Dev AX Instance under System Configuration => System => Server Configuration to use my personal Dev AOS server that my client is also running under, but when I go to System Configuration => Batch Jobs => Batch Jobs, then find the Batch Job I've been looking for and set the status to Waiting, the Batch Job never runs.
On our Test instance, the jobs is configured exactly the same way, except they use the AOS Server allotted for it.
I did a SQL script to change the batch job to use my personal Dev AOS Server, then did a restart of the Dynamics AX Servers.
There must be something I'm doing wrong for my personal dev instance. I've been reading some things from here about what may be going on and following down the list, but I'm pretty sure the problem is even stupider => https://www.daxrunbase.com/2017/07/02/troubleshooting-batch-jobs-in-ax/
First of all, do you have all 3 workflow jobs set up?
Workflow message processing
Workflow due date processing
Workflow line-item notifications
They can be set up from System administration > Setup > Workflow > Workflow infrastructure configuration.
Secondly, it is OK for the periodic batch jobs to have status Waiting. They will be in status Executing for a short time and then they will be Waiting for the next run. If the Scheduled start date/time value in this batch job is in the past, that could be a problem. Otherwise everything is OK.
Lastly, if you have already ticked the Is batch server check-box in System administration > Setup > System > Server configuration, please also make sure to move the workflow batch group in the Batch server groups section in the same form from Remaining groups to Selected groups.
The batch jobs should start at Scheduled start date/time - or a bit later, you'd need to wait a minute and refresh the grid.

How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution

I am working on an asp.net mvc-5 web application, and I am facing a problem in using Hangfire tool to run long running background jobs. the problem is that if the job execution exceed 30 minutes, then hangfire will automatically initiate another job, so I will end up having two similar jobs running at the same time.
Now I have the following:-
Asp.net mvc-5
IIS-8
Hangfire 1.4.6
Windows server 2012
Now I have defined a hangfire recurring job to run at 17:00 each day. The background job mainly scan our network for servers and vms and update the DB, and the recurring job will send an email after completing the execution.
The recurring job used to work well when its execution was less than 30 minutes. But today as our system grows, the recurring job completed after 40 minutes instead of 22-25 minutes as it used to be. and I received 2 emails instead of one email (and the time between the emails was around 30 minutes). Now I re-run the job manually and I have noted that that the problem is as follow:-
"when the recurring job reaches 30 minutes of continuous execution, a
new instance of the recurring job will start, so I will have two
instances instead of one running at the same time, so that why I received 2 emails."
Now if the recurring job takes less than 30 minutes (for example 29 minute) I will not face any problem, but if the recurring job execution exceeds 30 minutes then for a reason or another hangfire will initiate a new job.
although when I access the hangfire dashboard during the execution of the job, I can find that there is only one active job, when I monitor our DB I can see from the sql profiler that there are two jobs accessing the DB. this happens after 30 minutes from the beginning of the recurring job (at 17:30 in our case), and that why I received 2 emails which mean 2 recurring jobs were running in the background instead of one.
So can anyone advice on this please, how I can avoid hangfire from automatically initiating a new recurring job if the current recurring job execution exceeds 30 minutes?
Thanks
Did you look at InvisibilityTimeout setting from the Hangfire docs?
Default SQL Server job storage implementation uses a regular table as
a job queue. To be sure that a job will not be lost in case of
unexpected process termination, it is deleted only from a queue only
upon a successful completion.
To make it invisible from other workers, the UPDATE statement with
OUTPUT clause is used to fetch a queued job and update the FetchedAt
value (that signals for other workers that it was fetched) in an
atomic way. Other workers see the fetched timestamp and ignore a job.
But to handle the process termination, they will ignore a job only
during a specified amount of time (defaults to 30 minutes).
Although this mechanism ensures that every job will be processed,
sometimes it may cause either long retry latency or lead to multiple
job execution. Consider the following scenario:
Worker A fetched a job (runs for a hour) and started it at 12:00.
Worker B fetched the same job at 12:30, because the default invisibility timeout was expired.
Worker C (did not fetch) the same job at 13:00, because (it
will be deleted after successful performance.)
If you are using cancellation tokens, it will be set for Worker A at
12:30, and at 13:00 for Worker B. This may lead to the fact that your
long-running job will never be executed. If you aren’t using
cancellation tokens, it will be concurrently executed by WorkerA and
Worker B (since 12:30), but Worker C will not fetch it, because it
will be deleted after successful performance.
So, if you have long-running jobs, it is better to configure the
invisibility timeout interval:
var options = new SqlServerStorageOptions
{
InvisibilityTimeout = TimeSpan.FromMinutes(30) // default value
};
GlobalConfiguration.Configuration.UseSqlServerStorage("<name or connection string>", options);
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected
background job retries after 30 minutes (by default) when using SQL
Server. New Hangfire.SqlServer implementation uses plain old
transactions to fetch background jobs and hide them from other
workers.
Even after ungraceful shutdown, the job will be available for other
workers instantly, without any delays.
I was having trouble finding documentation on how to do this properly for a Postgresql database, every example I was see is using sqlserver, I found how the invisibility timeout was a property inside the PostgreSqlStorageOptions object, I found this here : https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlStorageOptions.cs#L36. Luckily through trial and error I was able to figure out that the UsePostgreSqlStorage has an overload to accept this object. For .Net Core 2.0 when you are setting up the hangfire postgresql DB in the ConfigureServices method in the startup class add this(the default timeout is set to 30 mins):
services.AddHangfire(config =>
config.UsePostgreSqlStorage(Configuration.GetConnectionString("Hangfire1ConnectionString"), new PostgreSqlStorageOptions {
InvisibilityTimeout = TimeSpan.FromMinutes(720)
}));
I had this problem when using Hangfire.MemoryStorage as the storage provider. With memory storage you need to set the FetchNextJobTimeout in the MemoryStorageOptions, otherwise by default jobs will timeout after 30 minutes and a new job will be executed.
var options = new MemoryStorageOptions
{
FetchNextJobTimeout = TimeSpan.FromDays(1)
};
GlobalConfiguration.Configuration.UseMemoryStorage(options);
Just would like to point out that even though, it is stated the thing below:
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected background job retries after 30 minutes (by default) when using SQL Server. New Hangfire.SqlServer implementation uses plain old transactions to fetch background jobs and hide them from other workers.
Even after ungraceful shutdown, the job will be available for other workers instantly, without any delays.
It seems that for many people using MySQL, PostgreSQL, MongoDB, InvisibilityTimeout is still the way to go: https://github.com/HangfireIO/Hangfire/issues/1197

Pagodabox or PHPfog + Gearman

All,
I'm looking for a good way to do some job backgrounding through either of these two services.
I see PHPFog supports IronWorks, but i need something more realtime. Through these cloud based PaaS services, I'm not able to use popen(background.php --token=1234). So I'm thinking the best solution, might be to try to kick off a gearman worker to handle the job. (Actually my preferred method would be to use websockets to keep a connection open and receive feedback from the job, rather than long polling a db table through AJAX, but none of these guys support websockets)
Question 1 is, is there a better solution than using gearman to offload the job?
Question 2 is, http://help.pagodabox.com/customer/portal/articles/430779 I see pagodabox supports 'worker listeners' ... has anybody set this up with gearman? Would it work?
Thanks
I am using PagodaBox with a background worker in an application I am building right now. Basically, PagodaBox daemonizes a PHP process for you (meaning it will continually run in the background), so all you really have to do is create a script that checks a database table for tasks to run, runs them, and then sleeps a bit so it's not running too many queries against your database.
This is a simplified version of what I have running:
// Remove time limit
set_time_limit(0);
// Show ALL errors
error_reporting(-1);
// Run daemon
echo "--- Starting Daemon ---\n";
while(true) {
// Query 'work_queue' table for new tasks
// Loop over items and do whatever tasks are associated with them
// Update row to mark task as completed
// Wait a bit
sleep(30);
}
A benefit to this approach is that it's easy to test via CLI:
php tasks.php
You will see all the echo statements come through in console as it's running, and of course it's much easier to do than a more complicated setup with other dependencies like Gearman.
So whenever you add a new task to the table, the maximum amount of time you'll wait for that task to be started in a batch is 30 seconds (or whatever your sleep time is). This is better and preferable to cron jobs, because if you setup a cron job to run every minute (the lowest possible interval) and the work you have to do takes longer than a minute, another cron process will start working on the same queue and you could end up with quite a lot of duplicated task work that could cause a lot of issues that are hard to debug and troubleshoot. So if you instead have either only one background worker that runs all tasks, or multiple background workers that work on different task types, you will never run into this issue.

Resources