I've used Apache Flink for batch processing for a while, but now we want to convert this batch job to a streaming job. The problem I run into is how to run end-to-end tests.
How it worked in a batch job
When using batch processing we created end-to-end tests using Cucumber:
Fill up the HBase table we read from
Run the batch job
Wait for it to finish
Verify the result
The problem in a streaming job
We would like to do something similar with the streaming job, except that a streaming job never really finishes.
So:
Fill up the message queue we read from
Run the streaming job
Wait for it to finish (how?)
Verify the result
We could just wait five seconds after every test and assume everything has been processed by then, but that would slow the whole test suite down a lot.
Question:
What are some ways or best practices to run end-to-end tests on a streaming Flink job without forcibly terminating the job after x seconds?
Most Flink DataStream sources, if they are reading from a finite input, will inject a watermark with value Long.MAX_VALUE when they reach the end, after which the job terminates on its own.
The Flink training exercises illustrate one approach to end-to-end testing of Flink jobs. I suggest cloning the GitHub repo and looking at how the tests are set up. They use a custom source and sink and redirect the input and output for testing.
This topic is also discussed a bit in the documentation.
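To make the general idea concrete, here is a minimal sketch, assuming PyFlink's DataStream API (a recent release) and a toy map pipeline standing in for the real job; the collection source is bounded, so the job finishes on its own and the test never has to guess how long to wait:

    from pyflink.datastream import StreamExecutionEnvironment

    def run_job(stream):
        # Stand-in for the real streaming pipeline under test.
        return stream.map(lambda s: s.upper())

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Bounded source: once the collection is exhausted, the source emits the
    # final watermark and the job shuts down by itself.
    source = env.from_collection(["a", "b", "a"])

    # execute_and_collect() runs the job and returns the sink output; because
    # the input is finite, there is no need to sleep for an arbitrary time.
    with run_job(source).execute_and_collect() as results:
        assert sorted(results) == ["A", "A", "B"]

In a real test, the from_collection call would be replaced by whatever bounded test source stands in for the message queue, and the assertion would run against the collected output of the actual job.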
Related
I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that the tasks can be executed on different workers. The file would be downloaded on the first worker and transformed on another worker, and an error would occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are executed on one worker?
It's bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows it, you can configure tasks to execute on a specific worker type. For example, with the CeleryExecutor you can assign tasks to a specific queue (see the sketch below). Assuming there is only one worker consuming from that queue, your tasks will be executed by the same worker, but the fact that it's one worker doesn't mean it will be the same machine. That depends heavily on the infrastructure you use. For example, when you restart your machines, do you get the exact same machine back or is a new one spawned?
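For illustration only, a minimal sketch of that queue mechanism, assuming Airflow 2.x with the CeleryExecutor; the DAG id, queue name and callables are made up:

    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def download_from_ftp():
        ...  # hypothetical stand-in for the real download logic

    def transform():
        ...  # hypothetical stand-in for the real transform logic

    with DAG(
        dag_id="ftp_pipeline",                     # made-up DAG id
        start_date=pendulum.datetime(2024, 1, 1),
        schedule=None,                             # Airflow 2.4+; use schedule_interval on older versions
        catchup=False,
    ):
        # Both tasks are pinned to the same Celery queue; a single worker started
        # with `airflow celery worker --queues single_machine` picks up both.
        download_file_from_ftp = PythonOperator(
            task_id="download_file_from_ftp",
            python_callable=download_from_ftp,
            queue="single_machine",                # made-up queue name
        )
        transform_file = PythonOperator(
            task_id="transform_file",
            python_callable=transform,
            queue="single_machine",
        )
        download_file_from_ftp >> transform_file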
I highly advise you - don't go down this road.
To solve your issue, either download the file to shared storage such as S3 or Google Cloud Storage, so that all workers can read the file from the cloud, or combine the download and transform into a single operator so that both actions are executed together (a sketch of the second option follows).
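A minimal sketch of the second option, assuming Airflow 2.4+ and the TaskFlow API; the path and helper bodies are placeholders:

    import pendulum
    from airflow.decorators import dag, task

    def download_from_ftp(path: str) -> None:
        ...  # stand-in for the real FTP download

    def transform_local_file(path: str) -> None:
        ...  # stand-in for the real transformation

    @dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
    def ftp_pipeline_combined():  # made-up DAG name

        @task
        def download_and_transform():
            # Both steps run inside one task instance, so they are guaranteed
            # to execute on the same worker and see the same local filesystem.
            local_path = "/tmp/downloaded_file"  # made-up path
            download_from_ftp(local_path)
            transform_local_file(local_path)

        download_and_transform()

    ftp_pipeline_combined()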
I'm trying to implement a test case in Catch2 that tests the usage of a FIFO between processes.
To test it, I want to run another process that creates and writes to the FIFO, while my test (using Catch2) reads from it.
Is there a way to run a process using Catch2, or should I just use the regular system APIs to execute another exe file (process) as part of the test?
Thanks
Does anyone know how to run a number of processes in the background, either through a job queue or parallel processing?
I have a number of maintenance updates that take time to run and I want to run them in the background.
I would recommend Gearman server; it has proved quite stable. It lives completely outside of Symfony2, and you have to have the server up and running (I don't know what your hosting options are), but it distributes jobs perfectly. In its skinniest version it just keeps all jobs in memory, but you can configure it to use an SQLite database as a backup, so if for any reason the server reboots or the Gearman daemon breaks, you can just start it again and your jobs will be preserved. I know it has been tested with very large loads (up to 1k jobs added per second) and it stood its ground. It's probably even more stable nowadays; I'm speaking from experience two years ago, when we offloaded some long-running tasks in a ZF application to background processing via Gearman.
Check out RabbitMQ. It's the most popular option according to knpbundles.com.
Take a look at http://github.com/mmoreram/rsqueue-bundle
It uses Redis as the queue core and will be maintained.
Take a look at the enqueue library. There are a lot of transports (AMQP, STOMP, AmazonSQS, Redis, Filesystem, Doctrine DBAL and more) to choose from. It is easy to use and feature rich. That would be enough for a simple job queue, though if you need something more sophisticated, look at enqueue/job-queue. It can run an exclusive job (only one job running at a given time), a job with sub-jobs, or a job with something to do after it has been done.
Of course, there is a bundle for it.
I've got a codebase where I have been using CppUnit for unit testing. I'm now adding some MPI code to the project and I'd like to unit test some abstractions I'm building on top of MPI. For example, I've written some code to manage a single-producer/multiple-consumer relationship, where consumers ask for work and the producer serializes the next bit of work to send to the consumer. I'd like to test just that interaction with a test that generates some fake work items in a producer, distributes them to the consumers, which then send some kind of checksum back to the producer to make sure everything got distributed and nothing deadlocked, etc.
Does anyone have experience with what works best here? Some things I've been thinking about:
Is it reasonable to have all processes execute the test runner so that they all execute the test functions in the same order? Or is it better to have only the master run the test runner and have it send broadcasts to the slaves to tell them what to do next (presumably with some kind of lookup table to map commands to test functions)?
Is it sane in any way to use CPPUNIT_ASSERT inside the slaves, or should all information be sent back to the master for assertions? If slaves can assert, how should all the results be combined to get a single output log?
How should one handle test failures, so that an exception thrown in one process doesn't cause synchronization problems, e.g. another process waiting on an MPI_Recv for which the matching MPI_Send will now never happen?
I have an application that runs as a web service and submits jobs to Spark on user requests. The job queue needs to be limited per user. I am planning to use Airflow as an orchestration framework to manage the job queues, but while it supports parallel DAG execution, it is optimized for batch processing rather than real time. Is Airflow designed to handle ~200 DAG executions per second with multiple queues (one per user), or should I look for alternatives?
Do you have data moving from one task to another? Does timing matter here, since you mentioned real time? With Airflow, workflows are expected to be mostly static or slowly changing, mostly for ETL batch processing. You can speed up the Airflow heartbeat, but it would be good to build a POC with your use case to test it out.
Below is from the official Airflow documentation: https://airflow.apache.org/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban.