Hue Oozie PySpark job: Reduce data being written to stdout?

I'm a user on a Hue platform and am creating an Oozie job using PySpark [this is through the Hue workflow GUI]. When I submit, a large amount of information about dispatchers and tasks is written to stdout, obscuring the higher-level information I am printing. Is there a way to configure the job to suppress the lower-level output?
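A minimal sketch of one general Spark-side way to cut this down (it is not specific to Hue or Oozie, and it assumes the script creates its own SparkContext): raise the driver's log level so only warnings and errors are emitted by Spark itself.

    from pyspark import SparkContext

    # Hypothetical app name; use whatever your PySpark action already creates.
    sc = SparkContext(appName="my_oozie_pyspark_job")

    # Suppress Spark's INFO-level scheduler/task messages; only WARN and above are logged.
    sc.setLogLevel("WARN")

    print("higher-level information still goes to stdout")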

Related

Creating sequential workflows using an ELT tool (Fivetran/Hevo), dbt and a reverse ETL tool (Hightouch)

I'm working for a startup and am setting up our analytics tech stack from scratch. As a result of limited resources we're focusing on using 3rd-party tools rather than building custom pipelines.
Our stack is as follows:
ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran; Hevo has built-in reverse ETL)
BI Tool: Tableau
The problem I'm having is:
With either Fivetran or Hevo there's a break in the below workflow whereby we have to switch tools and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.
Use case (workflow): load data into the warehouse -> transform using dbt -> reverse ETL the data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs who haven't performed certain actions and to whom we therefore want to send a prompt email, a list produced by a dbt job that runs daily).
Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform)
Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (unable to be triggered within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job
Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way of triggering the Hightouch job based on the completion of the Fivetran/dbt job.
We can of course just sync these up on a time-based schedule, but that means subsequent jobs still run if a previous job fails, incurring unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.
From reading online I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
Thanks in advance.
You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).
All 3 of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure you may want to consider a hosted (paid) solution. (Hosted Airflow is sold under the brand name Astronomer.)
Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
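As an illustration of the "hacky but workable" route, here is a minimal sketch of a script that could run from cron or a GitHub Actions schedule: it triggers a dbt Cloud job over its REST API, polls the run until it finishes, and only then kicks off a reverse ETL sync. The account/job IDs and tokens are placeholders, and the Hightouch endpoint shape is an assumption; check both vendors' current API docs before relying on this.

    import time
    import requests

    DBT_ACCOUNT = "12345"          # hypothetical dbt Cloud account id
    DBT_JOB = "67890"              # hypothetical dbt Cloud job id
    DBT_TOKEN = "dbt-api-token"    # service token with job-trigger permissions
    DBT_API = f"https://cloud.getdbt.com/api/v2/accounts/{DBT_ACCOUNT}"

    # 1. Trigger the dbt Cloud job (run this after the Fivetran/Hevo load finishes).
    run = requests.post(
        f"{DBT_API}/jobs/{DBT_JOB}/run/",
        headers={"Authorization": f"Token {DBT_TOKEN}"},
        json={"cause": "Triggered by nightly orchestration script"},
    ).json()["data"]

    # 2. Poll until the run reaches a terminal state (10 = success, 20 = error,
    #    30 = cancelled in dbt Cloud's API; verify against current docs).
    while True:
        status = requests.get(
            f"{DBT_API}/runs/{run['id']}/",
            headers={"Authorization": f"Token {DBT_TOKEN}"},
        ).json()["data"]["status"]
        if status in (10, 20, 30):
            break
        time.sleep(60)

    # 3. Only trigger the reverse ETL sync if the transformations succeeded.
    if status == 10:
        requests.post(
            "https://api.hightouch.com/api/v1/syncs/SYNC_ID/trigger",  # assumed endpoint shape
            headers={"Authorization": "Bearer hightouch-api-token"},
        ).raise_for_status()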
I use Dagster for orchestrating multiple dbt projects or dbt models together with other data pipeline processes (e.g. database initialization, PySpark, etc.).
Here is a more detailed description and demo code:
Three dbt models (bronze, silver, gold) that are executed in sequence
Dagster orchestration code
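The demo code isn't reproduced here, but a minimal sketch of the shape of that orchestration could look like the following, assuming the dbt CLI is on the path and the three models can be selected by name (the model names and job name are placeholders):

    import subprocess

    from dagster import job, op

    def _dbt_run(selector):
        # Shell out to the dbt CLI; a non-zero exit code fails the op.
        subprocess.run(["dbt", "run", "--select", selector], check=True)

    @op
    def run_bronze():
        _dbt_run("bronze")
        return "bronze done"

    @op
    def run_silver(upstream):
        _dbt_run("silver")
        return "silver done"

    @op
    def run_gold(upstream):
        _dbt_run("gold")

    @job
    def medallion_pipeline():
        # Passing each op's output to the next enforces the bronze -> silver -> gold order.
        run_gold(run_silver(run_bronze()))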
You could try the following workflow, where you'd need a couple of additional tools, but it shouldn't require any custom engineering effort on orchestration.
E+L (fivetran) -> T: Use Shipyard to trigger a dbt cloud job -> Reverse E+L: Trigger a Hightouch or Census sync on completion of a dbt cloud job
This should run your entire pipeline in a single flow.

Correct Approach For Airflow DAG Project

I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs that start at a particular time (or possibly on demand):
1. Query the database for the list of job IDs to sequence through.
2. For each job in the sequence, send a REST request to start the job.
3. Wait until the job completes or fails (via a REST call or DB query).
4. Go to the next job in the sequence.
I am looking for recommendations on how to break down the functionality discussed above into an Airflow DAG. So far my approach would be:
create a Hook for the database and another for the REST server.
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3)
use a sensor to handle waiting for the job to complete
Thanks
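For illustration, a minimal sketch of that breakdown written against recent Airflow (2.4+); the REST endpoints, job IDs and DAG name are hypothetical, and a plain requests call stands in for a custom REST hook:

    from datetime import datetime

    import requests  # stands in for a custom REST hook

    from airflow import DAG
    from airflow.models.baseoperator import BaseOperator
    from airflow.sensors.base import BaseSensorOperator

    JOB_API = "http://jobs.example.com/api"  # hypothetical REST server

    class StartJobOperator(BaseOperator):
        """Step 2: send a REST request to start one job."""
        def __init__(self, job_id, **kwargs):
            super().__init__(**kwargs)
            self.job_id = job_id

        def execute(self, context):
            requests.post(f"{JOB_API}/jobs/{self.job_id}/start").raise_for_status()

    class JobCompleteSensor(BaseSensorOperator):
        """Step 3: poke until the job reports success; raise if it failed."""
        def __init__(self, job_id, **kwargs):
            super().__init__(**kwargs)
            self.job_id = job_id

        def poke(self, context):
            status = requests.get(f"{JOB_API}/jobs/{self.job_id}").json()["status"]
            if status == "FAILED":
                raise RuntimeError(f"job {self.job_id} failed")
            return status == "SUCCEEDED"

    with DAG("job_sequence", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        previous = None
        # In practice the job IDs would come from the step 1 database query;
        # a static list keeps the sketch simple.
        for job_id in ["job_a", "job_b"]:
            start = StartJobOperator(task_id=f"start_{job_id}", job_id=job_id)
            wait = JobCompleteSensor(task_id=f"wait_{job_id}", job_id=job_id, poke_interval=60)
            start >> wait
            if previous is not None:
                previous >> start
            previous = wait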

Use Airflow for batch processing to dynamically start multiple tasks based on the output of a parent task

I am trying to figure out if Airflow can be used to express a workflow where multiple instances of the same task need to be started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that Airflow can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe/direction that would fit this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize the processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition, and if needed the results can be merged again. For local files, the operator runs the split command. In Kubernetes, this works nicely with cluster autoscaling.
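A minimal sketch of that fixed-partition fan-out pattern in plain Airflow (written against recent Airflow 2.4+; the partition count, file paths and task logic are placeholders, and newer Airflow versions also offer dynamic task mapping for this):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    N_PARTITIONS = 4  # predetermined number of partitions (an assumption for this sketch)

    def process_partition(partition_index, **_):
        # Placeholder for Task B: process one partition of Task A's output.
        print(f"processing partition {partition_index}")

    with DAG("partitioned_batch", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        # Task A: split a local input file into N_PARTITIONS chunks.
        split_input = BashOperator(
            task_id="split_input",
            bash_command=f"split -n {N_PARTITIONS} /tmp/input.dat /tmp/part_",
        )
        # Task C: merge the per-partition results once every Task B instance is done.
        merge_results = BashOperator(
            task_id="merge_results",
            bash_command="cat /tmp/out_* > /tmp/merged.out",
        )
        # Task B is replicated once per partition; the copies run in parallel on the worker pool.
        for i in range(N_PARTITIONS):
            process = PythonOperator(
                task_id=f"process_partition_{i}",
                python_callable=process_partition,
                op_kwargs={"partition_index": i},
            )
            split_input >> process >> merge_results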

What's the best way to log in Oozie

We are using Oozie workflows with an Oozie main class in the action. I am not really sure what the best logging strategy is. Should we just use log4j, since it seems like that is the default strategy? Do those logs get collected on the data nodes?
Should we just use log4j since it seems like that is the default strategy?
I have not found any mention of someone using an alternative logger. It seems to be discouraged:
While Oozie can technically use any valid log4j Appender or configurations that violate the above restrictions, certain features related to logs may be disabled and/or not work correctly, and is thus not advised.
Your other question:
Do those logs get collected on the data nodes?
An SO answer mentions that the logs are distributed across your cluster, but by logging them to the rootLogger, you should be able to see them via the job tracker (by drilling down on the Job task attempts).
You can inspect them via the Oozie CLI; for example, to print the last 10 lines of a job's log:
$ oozie job -oozie oozie_URL -log job_ID | tail -n 10

Apache Mesos Workflows - Event Driven Scheduler

We are currently using Apache Mesos with Marathon and Chronos to schedule long running and batch processes.
It would be great if we could create more complex workflows, as with Oozie: for example, kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
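For illustration, a minimal sketch of registering a dependent job against that endpoint with Python's requests (the Chronos host, job names and owner are placeholders; check the Chronos docs for the full set of required job fields):

    import requests

    CHRONOS = "http://chronos.example.com:4400"  # hypothetical Chronos master

    # A dependent job runs once each of its parents has run at least once.
    dependent_job = {
        "name": "send-report",
        "command": "/usr/local/bin/send_report.sh",
        "owner": "data-team@example.com",
        "parents": ["wait-for-input-file", "nightly-etl"],  # names of existing Chronos jobs
    }

    resp = requests.post(f"{CHRONOS}/scheduler/dependency", json=dependent_job)
    resp.raise_for_status()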
