Kafka or Airflow for automation - airflow

I would like to automate some of my tasks using Apache Kafka. Previously I
did the same with Apache Airflow, which worked fine, but I want to find out
whether Kafka works better than Airflow for this.
Kafka runs on Server A.
Kafka looks for a file named test.xml on Server B, checking every 10 or 20
minutes whether the file has been created.
Once Kafka detects that the file has been created, the job starts as follows:
a) Create a Jira ticket and update it with the execution status of each
event
b) Trigger an rsync command
c) Unarchive the files using the tar command
d) Execute a script on the unarchived files
e) Archive the files and rsync them to a different location
f) Send an email once all tasks have finished
Please advise whether Kafka is suited to this in the first place. If you
know of any other open-source products that can perform these actions, please
let me know. By the way, I would prefer a docker-compose based
installation.
Alternatively, please suggest the best open-source tools available for this kind of automation.
Thanks

Avoid using Kafka for the use cases you mentioned. Kafka is not designed for defining DAGs or workflows; it works well for streaming use cases (data in motion).
Airflow lets you define DAGs with multiple tasks.
You can use the filesystem sensor [https://github.com/apache/airflow/blob/main/airflow/sensors/filesystem.py] to check whether a file or folder has been created or updated.
After that, you can use Python operators and other operators (and hooks) to implement the remaining tasks.
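To make the "sensor, then tasks" idea concrete, here is a plain-Python sketch of the polling loop that a file sensor performs, followed by running the remaining steps in order. It is not Airflow's implementation; the names wait_for_file and run_steps are hypothetical, and the poke_interval and timeout parameters only borrow the names Airflow sensors use.

```python
import os
import time


def wait_for_file(path, poke_interval=600.0, timeout=7200.0, sleep=time.sleep):
    """Poll until `path` exists. Return True if found, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poke_interval)  # wait before the next check ("poke")


def run_steps(steps):
    """Run (name, callable) pairs in order and return the names that ran.

    An exception raised by a step propagates and stops the remaining steps,
    which is roughly how a failed task blocks its downstream tasks.
    """
    completed = []
    for name, step in steps:
        step()
        completed.append(name)
    return completed
```

In an Airflow DAG, each entry passed to run_steps would instead be its own task (Jira update, rsync, tar, script, archive, email), so that retries and status are tracked per step.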

Related

Do Airflow workers share the same file system, or are they isolated?

I have a task in Airflow which downloads a file from GitHub to the local file system, passes it to spark-submit, and then deletes it. I wanted to know if this will create any issues.
Is it possible that two workers running the same task concurrently in two different DAG runs reference the same file?
Sample code:
def python_task_callback():
    download_file(file_name='script.py')
    spark_submit(path='/temp/script.py')
    delete_file(path='/temp/script.py')
For your use case, if you do all of the actions you mentioned (download, submit, delete) in a single task, then you will have no problems regardless of which executor you are running.
If you split the actions between several tasks, then you should use shared storage such as S3 or Google Cloud Storage. In that case it will also work regardless of which executor you are using.
A possible workflow can be:
1st task: copy file from github to S3
2nd task: submit the file to processing
3rd task: delete the file from S3
As for your general question of whether tasks share a disk: that depends on the executor you are using.
With the Local Executor you have only one worker, so all tasks run on the same machine and share its disk.
With the Celery Executor, the Kubernetes Executor, and others, tasks may run on different workers.
However, as mentioned, don't assume that tasks share a disk: if you later need to scale up from the Local Executor to Celery, you don't want to find yourself having to refactor your code.
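One caveat to consider for the single-task variant: if two DAG runs happen to execute on the same machine at the same time, a fixed path such as /temp/script.py could be shared between them. A minimal sketch of one way around that, assuming the download and submit helpers accept a target path (both are hypothetical stand-ins for the functions in the question), is to give each invocation its own temporary directory:

```python
import os
import tempfile


def run_with_private_copy(download, submit, file_name="script.py"):
    """Download into a directory unique to this call, submit, then clean up."""
    with tempfile.TemporaryDirectory(prefix="airflow_task_") as workdir:
        path = os.path.join(workdir, file_name)
        download(path)  # hypothetical helper: fetches the script to `path`
        submit(path)    # hypothetical helper: e.g. wraps spark-submit
    # Leaving the `with` block removes the directory and the downloaded file,
    # so no separate delete step is needed.
    return path
```

Because every call gets a different directory, concurrent runs never reference the same file, whichever executor is in use.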

Move data generated by a script to Kafka topic and then to HDFS

I have a script that gets the count of messages in a specific Kafka topic, and it can only be executed on the server hosting Kafka. The output of the script has to be moved to Hive/HDFS. Can I move it to a Kafka topic and then move it to HDFS using the HDFS Sink connector? If yes, how can I move the data generated by the script to a Kafka topic? Also let me know if there is a better solution.
"How can I move the data generated from script to Kafka topic"
You can pipe the output of a command into a Kafka topic:
your_script.sh | kafka-console-producer --topic foo --broker-list xx:9092
Or you could rewrite your script in a language that has a Kafka client, e.g. Python.
Alternatively, you can look into setting up Apache NiFi, then run the scripts there and upload the results to HDFS/Hive.
If you want Kafka + Hive integration, Hortonworks just announced Hive-Kafka
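For the Python client option mentioned above, a minimal sketch might look like the following. It assumes a producer object exposing send(topic, value) and flush(), which is the shape of KafkaProducer in the kafka-python package; the function name publish_lines is hypothetical. Like the console producer, it sends one message per line of output.

```python
def publish_lines(producer, topic, text):
    """Send each non-empty line of `text` to `topic`; return the count sent."""
    sent = 0
    for line in text.splitlines():
        if line.strip():
            # Values are sent as bytes, since no serializer is assumed.
            producer.send(topic, line.encode("utf-8"))
            sent += 1
    producer.flush()  # block until buffered records have been handed off
    return sent
```

With kafka-python installed, the producer could be created as KafkaProducer(bootstrap_servers="xx:9092") and the text taken from the script's captured standard output.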

Jenkins - How to stall a job until a notification is received?

Is there any way that a Jenkins job can be paused until a notification is received, ideally with a payload as well?
I have a "test" job which runs a whole bunch of remote tests, and I'd like it to wait until the tests are done, at which point I send an HTTP notification via curl with a payload that includes a test success code.
Is this possible with any default Jenkins plugins?
If Jenkins 2.x is an option for you, I'd consider taking a look at writing a pipeline job.
See https://jenkins.io/doc/book/pipeline/
Perhaps you could create a pipeline with multiple stages, where:
The first batch of work (your test job) is launched by the first pipeline stage.
That stage is configured (via Groovy code) to wait until your tests are complete before continuing. This is of course easy if the command to run your tests blocks, but if your tests launch and then detach without providing an easy way to determine when they exit, you can probably add extra Groovy code to your stage to make it poll the machine where the tests are running, to discover whether the work is complete.
Subsequent stages can be run once the first stage exits.
As for passing a payload from one stage to another, that's possible too - for exit codes and strings, you can use Groovy variables, and for files, I believe you can have a stage archive a file as an artifact; subsequent stages can then access the artifact.
Or, as Hani mentioned in a comment, you could create two Jenkins jobs, and have your tests (launched by the first job) use the Jenkins API to launch the second job when they complete successfully.
As you suggested, curl can be used to trigger jobs via the API, or you can use a Jenkins API wrapper package for your preferred language (I've had success using the Python jenkinsapi package for this sort of work: http://pythonhosted.org/jenkinsapi/).
If you need to pass parameters from your API client code to the second Jenkins job, that's possible by adding parameters to the second job using the Parameterized Build features built into Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Build
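As a rough illustration of triggering the second job with a payload, here is a sketch that builds a POST request against the buildWithParameters endpoint of the Jenkins remote API, using only the Python standard library. The host, job name, parameter name, and credentials are placeholders, and it assumes authentication with a user name and API token over HTTP basic auth; depending on how the server is secured, more may be required.

```python
import base64
import urllib.parse
import urllib.request


def build_trigger_request(base_url, job_name, params, user, api_token):
    """Build (but do not send) a POST request that starts a parameterized job."""
    query = urllib.parse.urlencode(params)
    url = "{}/job/{}/buildWithParameters?{}".format(
        base_url.rstrip("/"), urllib.parse.quote(job_name), query
    )
    request = urllib.request.Request(url, data=b"", method="POST")
    credentials = "{}:{}".format(user, api_token).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request
```

Passing the returned object to urllib.request.urlopen would send it; the job itself must declare the matching parameters (for example a hypothetical TEST_RESULT string parameter).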

Running Apache spark job from Spring Web application using Yarn client or any alternate way

I have recently started using Spark and I want to run a Spark job from a Spring web application.
I am running a web application in a Tomcat server using Spring Boot. My web application receives a REST web service request, and based on that it needs to trigger a Spark calculation job in a YARN cluster. Since my job can take a long time to run and accesses data from HDFS, I want to run the Spark job in yarn-cluster mode, and I don't want to keep a Spark context alive in my web layer. Another reason is that my application is multi-tenant, so each tenant can run its own job; in yarn-cluster mode each tenant's job can start its own driver and run in its own Spark cluster. In the web app JVM, I assume I can't run multiple Spark contexts in one JVM.
I want to trigger Spark jobs in yarn-cluster mode from a Java program in my web application. What is the best way to achieve this? I am exploring various options and am looking for your guidance on which one is best.
1) I can use the spark-submit command-line shell to submit my jobs. But to trigger it from my web application I need to use either the Java ProcessBuilder API or some package built on ProcessBuilder. This has two issues. First, it doesn't sound like a clean way of doing it; I should have a programmatic way of triggering my Spark applications. Second, I will lose the ability to monitor the submitted application and get its status. The only crude way of doing that is reading the output stream of the spark-submit shell, which again doesn't sound like a good approach.
2) I tried using Yarn client to submit the job from spring application. Following is the code that I use to submit spark job using Yarn Client:
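import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

// sparkArgs holds the spark-submit style arguments (defined elsewhere)
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf conf = new SparkConf();
ClientArguments cArgs = new ClientArguments(sparkArgs, conf);
Client client = new Client(cArgs, config, conf);
client.run();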
But when I run the above code, it tries to connect on localhost only. I get this error:
15/08/05 14:06:10 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/08/05 14:06:12 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
So I don't think it can connect to the remote machine.
Please suggest the best way of doing this with the latest version of Spark. Later I plan to deploy this entire application on Amazon EMR, so the approach should work there as well.
Thanks in advance
Spark JobServer might help: https://github.com/spark-jobserver/spark-jobserver. This project receives RESTful web requests and starts a Spark job. Results are returned as a JSON response.
I also had similar issues trying to run a Spark app that connects to a YARN cluster: having no cluster config, it tried to connect to the local machine as the main node of the cluster, which obviously failed.
It worked for me once I placed core-site.xml and yarn-site.xml on the classpath (src/main/resources in a typical sbt or Maven project structure); the application then connected to the cluster correctly.
When using spark-submit, the location of those files is typically specified by the HADOOP_CONF_DIR environment variable, but for a stand-alone application it had no effect.
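The 0.0.0.0:8032 in the log matches the stock Hadoop default for the ResourceManager address, which is what a client falls back to when no yarn-site.xml is found. As a quick diagnostic, a small sketch like the following can confirm what a given yarn-site.xml actually declares before it is put on the classpath. The function name is hypothetical; the property/name/value layout is the standard Hadoop configuration file format.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(path, name, default=None):
    """Return the value of property `name` from a Hadoop XML config file."""
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else default
    return default
```

Looking up yarn.resourcemanager.address (or yarn.resourcemanager.hostname) and getting the default back suggests the file does not point at the remote cluster.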

Apache Mesos Workflows - Event Driven Scheduler

We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie: for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
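The wait-for-X idea can be sketched as two job definitions: a job whose command blocks until a file exists, and a target job that lists it as a parent. The snippet below only builds the JSON bodies; the field names (name, command, owner, parents) follow my reading of the Chronos documentation and should be checked against the version in use, and the job names and paths are made up for illustration. The wait job would additionally need a schedule before being submitted as a time-based job.

```python
import json
import shlex


def wait_for_file_command(path, poll_seconds=60):
    """Shell command that returns only once `path` exists."""
    return "while [ ! -f {} ]; do sleep {}; done".format(
        shlex.quote(path), int(poll_seconds)
    )


def dependent_job(name, command, owner, parents):
    """JSON body for a job that runs after all of `parents` have run."""
    return json.dumps(
        {"name": name, "command": command, "owner": owner, "parents": list(parents)}
    )
```

The body returned by dependent_job is what would be posted to the /scheduler/dependency endpoint mentioned above.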

Resources