Running an Apache Spark job from a Spring web application using YARN client or any alternate way - spring-mvc

I have recently started using Spark and I want to run a Spark job from a Spring web application.
I have a situation where I am running a web application on a Tomcat server using Spring Boot. My web application receives a REST web service request, and based on that it needs to trigger a Spark calculation job on a YARN cluster. Since my job can take long to run and accesses data from HDFS, I want to run the Spark job in yarn-cluster mode, and I don't want to keep a Spark context alive in my web layer. Another reason is that my application is multi-tenant, so each tenant can run its own job; in yarn-cluster mode each tenant's job can start its own driver and run in its own Spark cluster. In the web app JVM, I assume I can't run multiple Spark contexts in one JVM.
I want to trigger Spark jobs in yarn-cluster mode from a Java program in my web application. What is the best way to achieve this? I am exploring various options and would appreciate your guidance on which one is best:
1) I can use the spark-submit command-line shell to submit my jobs, but to trigger it from my web application I need to use either the Java ProcessBuilder API or some package built on top of ProcessBuilder. This has two issues. First, it doesn't sound like a clean way of doing it; I should have a programmatic way of triggering my Spark applications. Second, I lose the ability to monitor the submitted application and get its status. The only crude way of doing that is reading the output stream of the spark-submit shell (see the sketch below), which again doesn't sound like a good approach.
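For illustration, here is a minimal sketch of this ProcessBuilder approach; the jar path, main class, and arguments are placeholders, and it assumes spark-submit is on the PATH of the Tomcat process (the calling method would need to handle IOException and InterruptedException):
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Build the spark-submit command; the class name and paths below are hypothetical.
ProcessBuilder pb = new ProcessBuilder(
        "spark-submit",
        "--master", "yarn-cluster",
        "--class", "com.example.MySparkJob",
        "/opt/jobs/my-spark-job.jar");
pb.redirectErrorStream(true);                       // merge stderr into stdout
Process process = pb.start();

// Crude monitoring: read spark-submit's console output and parse it for the
// YARN application id and final state.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);                   // or log / parse the line
    }
}
int exitCode = process.waitFor();                   // non-zero usually means the submission failed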
2) I tried using the YARN Client class to submit the job from the Spring application. The following is the code I use to submit a Spark job using the YARN Client:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

// Hadoop configuration; without core-site.xml/yarn-site.xml on the classpath the ResourceManager address defaults to 0.0.0.0:8032
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf conf = new SparkConf();
ClientArguments cArgs = new ClientArguments(sparkArgs, conf); // sparkArgs: spark-submit style String[] arguments
Client client = new Client(cArgs, config, conf);
client.run();
But when I run the above code, it only tries to connect to localhost. I get this error:
15/08/05 14:06:10 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/08/05 14:06:12 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
So I don't think it can connect to the remote machine.
Please suggest the best way of doing this with the latest version of Spark. Later I plan to deploy this entire application on Amazon EMR, so the approach should work there as well.
Thanks in advance

Spark JobServer might help: https://github.com/spark-jobserver/spark-jobserver. This project receives RESTful web requests and starts a Spark job; results are returned as a JSON response.

I also had similar issues trying to run a Spark app that connects to a YARN cluster: without any cluster configuration it was trying to connect to the local machine as the cluster's main node, which obviously failed.
It worked for me once I placed core-site.xml and yarn-site.xml on the classpath (src/main/resources in a typical sbt or Maven project structure); the application then connected to the cluster correctly.
When using spark-submit, the location of those files is typically specified by the HADOOP_CONF_DIR environment variable, but for a standalone application that variable had no effect.
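If putting those files on the classpath is not an option, another possibility is to load them into the Hadoop Configuration explicitly before creating the YARN Client. A minimal sketch, assuming the cluster configuration lives under /etc/hadoop/conf (adjust the paths for your environment):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Explicitly point the Hadoop Configuration at the cluster's config files
// instead of relying on classpath discovery.
Configuration config = new Configuration();
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));   // assumed location
config.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));   // assumed location
// Pass this Configuration to the YARN Client as in the question's snippet.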

Related

Kafka or Airflow for automation

I would like to automate some of my tasks using Apache Kafka. Previously I did the same using Apache Airflow, which worked fine, but I want to explore whether Kafka works better than Airflow for this.
Kafka runs on Server A.
Kafka watches for a file named test.xml on Server B, checking every 10 or 20 minutes whether the file has been created.
Once Kafka senses that the file has been created, the job starts as follows:
a) Create a Jira ticket and update all the executions on Jira for each event
b) Trigger an rsync command
c) Then unarchive the files using the tar command
d) Execute some script using the unarchived files
e) Then archive the files and rsync them to a different location
f) Send an email once all tasks are finished
Please advise whether Kafka is a sensible choice to begin with for this, or let me know if you have any other open-source products that can do these actions. By the way, I prefer to set these up with a docker-compose based installation.
Or please suggest the best open-source tools available for this automation purpose.
Thanks
Avoid using Kafka for the use cases you mentioned. Kafka is not good for defining DAGs or workflows; it works great for streaming use cases (data in motion).
Airflow would allow you to define DAGs with multiple tasks.
You can leverage the filesystem sensor [https://github.com/apache/airflow/blob/main/airflow/sensors/filesystem.py] to check if a file/folder has been updated/created.
After that you can leverage Python operators and other operators (hooks) to achieve all the other tasks as well.

Testing GameLift locally, can only create 1 game session?

I am testing our application locally, and it seems like I can only create one game session using GameLift Local.
What I did is run GameLift Local:
java -jar GameLiftLocal.jar -p 9080
then run the custom GameLift server I wrote in C# and Unity,
and use the CLI to create a game session:
aws gamelift create-game-session --endpoint-url http://localhost:9080 --maximum-player-session-count 2 --fleet-id fleet-123d
On the first run, it succeeds and creates the game session.
When I create another game session by issuing the same command above, it results in:
HTTP-Dispatcher - No available process.
Why is this? Can we only create one game session locally?
If you are trying to create another game session, you need to run multiple game server processes.
GameLift tracks each game session's status by receiving server-side API calls from the game server process.
I think this diagram can help you :)
https://docs.aws.amazon.com/gamelift/latest/developerguide/gamelift-sdk-server-api-interaction-vsd.html
According to the docs:
Each server process should only host a single game session.
...
When testing locally with GameLift Local, you can start multiple server processes. Each process will connect to GameLift Local.
It sounds like you need to run multiple instances of your game server process, each connecting to GameLift Local.
Source: https://docs.aws.amazon.com/gamelift/latest/developerguide/integration-testing-local.html

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data in Google BigQuery (GBQ). The odd part is that they have a virtually identical DAG running for a different project, and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have also sanitized the code to check for hidden spaces and the like.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research. Has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the staging environment, which made it impossible to fetch credentials from the connection ID.

amplify push not working as more tables are added

I have an Amplify project with a GraphQL API schema composed of 28 #model types. After adding an additional #model and running amplify push, the Amplify CLI (v5.3) returns...
Uploading files...
...for several minutes. It then returns:
Uploading files...× An error occurred when pushing the resources to the cloud
Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
An error occurred during the push operation: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
The additional model being added is simple with no secondary indices or connections to any other model. I have been working with this project for several weeks without having a problem with any amplify push updates.
I tried the following:
Running amplify pull, making the change, and then running amplify push
Creating a new amplify project.
Anyone have any thoughts on how to approach this?
EDIT
When I exclude the additional model, it takes about 14 seconds for the CLI to complete uploading the files. But when adding just a single additional model, it takes several minutes.

a service which would be able to run jobs on a timed basis

I am working for my client using ASP.NET Web API 2 and AngularJS. My client has the following requirement, but I am unable to understand what type of project I have to create: a Web API project, a Windows service, or something else. Can anyone please tell me what the client actually wants and how I can do it?
QueueManager will need to be some kind of a service which would be able to run jobs on a timed basis. We envision it being a service that runs on a continuous loop, but has a Thread.Sleep at the end of each iteration with a duration of x-seconds (“x” being set in a config file.) You should create this QueueManager service as a new project within the Core.Jobs project; I would like to have the project name be “Core.Jobs.QueueManager”, along with the base namespace.
Here are the functions that the QueueManager will do for each iteration:
1) Do a worker healthcheck (JobsAPI: Queue/WorkerHealthCheck – already created)
a. This method will just return a 200 status code and a count of workers. There is no need to act on the return value.
Look at Hangfire; it is quite easy to set up and simple to use.
http://docs.hangfire.io/en/latest/background-methods/performing-recurrent-tasks.html
