unable to specify master_type in MLEngineTrainingOperator - out-of-memory

I am using Airflow to schedule a pipeline that results in training a scikit-learn model on AI Platform. I use this DAG to train it:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:
    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200212',
        ],
        python_version='3.5')
    training_op
The training package loads the desired CSV files and trains a RandomForestClassifier on them.
This works fine until the number and size of the files increase. Then I get this error:
ERROR - The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL). To find out more about why your job exited please check the logs:
The total size of the files is around 4 GB. I don't know which machine type is used by default, but it seems not to be enough. Hoping this would solve the memory consumption issue, I tried to change the n_jobs parameter of the classifier from -1 to 1, with no more luck.
Looking at the code of MLEngineTrainingOperator and the documentation, I added a custom scale_tier and a master_type of n1-highmem-8 (8 vCPUs and 52 GB of RAM), like this:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:
    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        master_type="n1-highmem-8",
        scale_tier="custom",
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200116',
        ],
        python_version='3.5')
    training_op
This resulted in another error:
ERROR - <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/MY_PROJECT/jobs?alt=json returned "Field: master_type Error: Master type must be specified for the CUSTOM scale tier.">
I don't know what is wrong, but it appears that this is not the way to do it.
EDIT: Using the command line I managed to launch the job:
gcloud ai-platform jobs submit training training_job_name --packages=gs://path/to/package/package.tar.gz --python-version=3.5 --region=europe-west1 --runtime-version=1.14 --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16
However, I would like to do this in Airflow.
Any help would be much appreciated.
EDIT: My environment used an old version of Apache Airflow, 1.10.3, where the master_type argument was not present.
Updating the version to 1.10.6 solved this issue.
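For reference, here is a quick way to check whether the installed operator actually accepts the argument. This is only a sketch: it assumes Airflow 1.10.x, where MLEngineTrainingOperator lives under airflow.contrib.operators.mlengine_operator.

# check whether the installed MLEngineTrainingOperator supports master_type
import inspect
import airflow
from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator

print(airflow.__version__)
# master_type is missing from the operator's signature on 1.10.3 and present on 1.10.6+
params = inspect.signature(MLEngineTrainingOperator.__init__).parameters
print('master_type' in params)

With a version that accepts it, the scale_tier="custom" / master_type="n1-highmem-8" pair from the second DAG above works as-is.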


Related

airflow experimental api, how to set a dag that is running as failed

I'm using the experimental API of Airflow
and would like to mark a running DAG as failed.
I didn't see it in the documentation.
I tried the below, but it creates a new running dag/task.
I would be glad to get the right experimental API method/payload for it.
Thanks in advance.
session.post(
    url=f'{airflow_env}/api/experimental/dags/{dag_name}/dag_runs',
    json={'state': 'failed',
          'dag_run_id': 'scheduled__2022-03-13T20:30:32.887761+00:00'})

Airflow 2 - debugging why dag is not loading

On Airflow 2 my DAG is not showing in the UI, and I'm getting DAG Import Errors (...) for it.
The error message is insufficient for me to debug (it's a custom operator with a lot of custom logic, so I don't want to get into the details of the error itself).
On Airflow 1.x I could use the CLI:
airflow list_dags
to get a more elaborate debug message. Is there anything analogous on Airflow 2?
I'm looking for a CLI command/UI option that will provide me with a more elaborate error message than the one I'm getting on the main screen of the webserver.
As described in Airflow's documentation, to test DAG loading you can simply run:
python your-dag-file.py
If there is any problem during the DAG loading phase you will get a stack trace here.
The later sections also describe how to test custom operators.
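If you prefer to collect the errors programmatically, here is a minimal sketch (assuming Airflow 2.x; /path/to/dags is a placeholder for your DAGs folder) that surfaces the full stack traces the UI only summarizes:

# build a DagBag and print every import error with its stack trace
from airflow.models import DagBag

bag = DagBag(dag_folder="/path/to/dags", include_examples=False)
for file_path, stack_trace in bag.import_errors.items():
    print(file_path)
    print(stack_trace)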
As explained in the upgrade manual, airflow list_dags has been changed to airflow dags list.
The full syntax is:
airflow dags list [-h] [-o table, json, yaml] [-S SUBDIR]
For more information, see the docs.

Apache Airflow problem - "a task with task_id create_tag_template_field_result is already in the DAG"

So, I have a problem with even a blank Airflow installation.
As soon as I try to run
airflow test tutorial print_date 2015-06-01
I get a raised exception which says
PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
What is the reason for this (given that I made literally no changes to the installation whatsoever)?
I also got this when, in a previous installation, I tried to run my own DAG... but "create_tag_template_field_result" was nowhere to be found in my code.
You can set the config arg load_examples = False to solve it.
The test command calls the get_dag function, which constructs a DagBag object; the constructor in turn calls the collect_dags function.
When the config arg LOAD_EXAMPLES is True (the default), collect_dags also collects all the DAGs in the examples path; that's where the task create_tag_template_field_result comes from.
collect_dags then calls the add_task function for every example task, and that's where the create_tag_template_field_result task gets added a second time.
Maybe it was the quickstart that added this task the first time without you realizing it.
In short: set the config arg load_examples = False to solve it.
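To see the mechanism described above, a small sketch (assuming an initialized Airflow environment) comparing the DagBag with and without the bundled examples; the difference is exactly the example DAGs, one of which defines create_tag_template_field_result:

# compare DAG collection with and without the bundled example DAGs
from airflow.models import DagBag

with_examples = DagBag(include_examples=True)
without_examples = DagBag(include_examples=False)
print(set(with_examples.dag_ids) - set(without_examples.dag_ids))  # the example DAGs

Setting load_examples = False under [core] in airflow.cfg (or via the AIRFLOW__CORE__LOAD_EXAMPLES environment variable) makes the CLI behave like the second DagBag.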
This warning is occurring in
/usr/local/lib/python3.7/dist-packages/airflow/example_dags/example_complex.py
so I removed or renamed (for example, to a non-loading name such as *.py.back) this file.
I had the same error with a fresh install.
I don't know if this helps, but I downgraded Airflow to version 1.10.10 (with Python 3.7) and the error was gone.

Dag Seems to be missing

I have a DAG which checks for new workflows to be generated (dynamic DAGs) at a regular interval and, if found, creates them. (Ref: Dynamic dags not getting added by scheduler)
The above DAG is working and the dynamic DAGs are getting created and listed in the webserver. Two issues here:
When clicking on the DAG in the web UI, it says "DAG seems to be missing"
The listed DAGs are not shown by the "airflow list_dags" command
Error:
DAG "app01_user" seems to be missing.
The same goes for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
Edit 1:
I tried clearing all data and running "airflow run". It ran successfully, but none of the dynamically generated DAGs were added to "airflow list_dags". But when running the command "airflow list_dags", it loaded and executed the DAG (which generated the dynamic DAGs). The dynamic DAGs are also listed as below:
[root@cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running it again, all four of the generated DAGs above disappeared and only the base DAG, "testDynDags", is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and restarted the webserver, it went through normally.
From what I can see, this is the error that is thrown when the webserver tries to parse the DAG file and there is an error. In my case it was an error importing a new operator I added to a plugin.
Usually I check in the Airflow UI; sometimes the reason for the broken DAG appears there. If it does not, I run the .py file of my DAG, and the error (the reason the DAG can't be parsed) will appear.
I never got to work on dynamic DAG generation, but I did face this issue when the DAG was not present on all nodes (scheduler, worker and webserver). In case you have an Airflow cluster, please make sure that the DAG is present on all Airflow nodes.
Same error here; the reason was that I renamed my dag_id to uppercase, something like "import_myclientname" into "import_MYCLIENTNAME".
I am a little late to the party, but I faced the error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that airflow fails to recognize a dag defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def makedag(dag_id="a"):
    with DAG(dag_id=dag_id) as dag:
        DummyOperator(task_id="nada")
    return dag  # return the DAG so the module-level "dag" variable is actually set

dag = makedag()
and
# b.py
from a import makedag

dag = makedag(dag_id="b")
Then airflow will only look at a.py. It won't even look at b.py at all, even to notice if there's a syntax error in it! But if you add from airflow import DAG to it and don't change anything else, it will show up.
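If I understand it right, this is Airflow's "safe mode" DAG discovery heuristic: files are pre-filtered on a simple string check (roughly, whether the text mentions airflow and DAG) before being parsed at all. A small sketch, with a placeholder dags path, to see which files the heuristic skips:

# compare DAG discovery with and without the safe-mode pre-filter
from airflow.models import DagBag

filtered = DagBag(dag_folder="/path/to/dags", safe_mode=True)
unfiltered = DagBag(dag_folder="/path/to/dags", safe_mode=False)  # parse every .py file
print(set(unfiltered.dag_ids) - set(filtered.dag_ids))  # DAGs skipped by the heuristic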

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via the CLI, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that there's some additional classpath stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user@host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When run with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command as user@host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from the CLI.
