Are files created in an Airflow task deleted automatically or not? - airflow

In my Airflow task I am creating a file with open() inside the DAG and writing records into it, then sending it by email within the same task. Will the file get deleted automatically, or will it remain on the worker after the DAG run?
filename = to_report_name(context) + '_' + currentNextRunTime.strftime('%m.%d.%Y_%H-%M') + '_' + currentNextRunTime.tzname() + '.' + extension.lower()
with open(filename, "w+b") as file:
    file.write(download_response.content)
    print(file.name)
    send_report(context, file)

The file will not be deleted automatically. The code you execute is pure Python; if you want the file to be deleted once the operation is done, use the tempfile module, which guarantees that the file is deleted once it's closed. Example:
import tempfile, os

with tempfile.NamedTemporaryFile() as file:
    file.write(download_response.content)
    print(file.name)
    # the file is deleted automatically when the with-block exits; if you need
    # a custom name instead, pass delete=False and call
    # os.rename(file.name, '/tmp/my_custom_name.txt') after the block
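If you want to keep the question's human-readable filename (for the mail attachment) and still get automatic cleanup, a temporary directory gives you both. A minimal sketch, reusing filename, download_response, context and send_report from the question:
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, filename)  # custom name inside a throwaway directory
    with open(path, "w+b") as file:
        file.write(download_response.content)
        send_report(context, file)
    # the directory and the file in it are removed when the outer block exits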

Related

How to output Airflow's scheduler log to stdout or S3 / GCS

We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container writes its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated, so disk usage grows until the disk is full. A DAG for cleaning up the log directory is available, but it runs on a worker node, so the log directory on the scheduler container is never cleaned up.
I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket but have not been able to find one. Is there any way to send the scheduler log to stdout or to an S3/GCS bucket?
Finally I managed to output the scheduler's logs to stdout.
The Airflow documentation describes how to use a custom logger; the default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Send processor (scheduler, etc.) logs to stdout
# Referring to https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set the logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg:
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the package.
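For reference, a layout under which config.log_config is importable (assuming the default ${AIRFLOW_HOME} layout and an empty __init__.py added so that config is treated as a package):
${AIRFLOW_HOME}/
    airflow.cfg
    config/
        __init__.py
        log_config.py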
Setting the processor handler to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following environment variables, or set the equivalent options in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection referenced by gcp_conn_id should have the correct permissions to create and delete objects in GCS.
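For reference, the equivalent settings in airflow.cfg would look something like this (assuming an Airflow 2.x-style [logging] section; the bucket name and connection id are the placeholders from above):
[logging]
remote_logging = True
remote_log_conn_id = gcp_conn_id
remote_base_log_folder = gs://bucket_name/AIRFLOW_LOGS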

Order of task execution in DAG in Google Cloud Composer impacts whether task is executed

I'm new to Google Cloud Composer and have run into what seems to be a strange issue in the DAG that I've created. I have a process which takes a tar.gz file from Cloud Storage, rezips it as a .gz file, and then loads the .gz file into BigQuery. Yesterday I tried to add a new step to the process: an insert from the created "shard" into a new table.
I couldn't get this to work until I changed the order of the steps in my DAG. In my DAG I have a step called "delete_tar_gz_files_op". When this was executed before "insert_daily_file_into_nli_table_op", the insert never ran (no failure in Composer; it just seemed not to run at all). When I swap the order of these two steps, with no other changes to the code, the insert works as expected. Does anyone know what might cause this? I have no idea why this would happen, as these two steps aren't related at all: one runs an insert query from one BigQuery table to another, and the other deletes a tar.gz file in Cloud Storage.
My current DAG execution order, which works:
initialize >> FilesToProcess >> download_file >> convert_task >> upload_task >> gcs_to_bq >> archive_files_op >> insert_daily_file_into_nli_table_op >> delete_tar_gz_files_op
Some of the code used:
# The BigQuery operator loads the data from the .gz file into a table in BigQuery.
gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id='load_basket_data_into_big_query' + job_desc,
    bucket="my-processing-bucket",
    bigquery_conn_id='bigquery_default',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    compression='GZIP',
    source_objects=['gzip/myzip_' + process_date + '.gz'],
    destination_project_dataset_table='project.dataset.basket_' + clean_process_date,
    field_delimiter='|',
    skip_leading_rows=0,
    google_cloud_storage_conn_id="bigquery_default",
    schema_object="schema.json",
    dag=dag
)
# The created shard is then inserted into basket_raw_nli.basket_nli. This is a
# partitioned table which contains only the NLI subtype.
insert_daily_file_into_nli_table_op = bigquery_operator.BigQueryOperator(
    task_id='insert_daily_file_into_nli_table_op_' + job_desc,
    bql=bqQuery,
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    destination_dataset_table=False,
    dag=dag)
# The tar file created can now be deleted from the raw folder.
delete_tar_gz_files_op = python_operator.PythonOperator(
    task_id='delete_tar_gz_files_' + job_desc,
    python_callable=delete_tar_gz_files,
    op_args=[file, process_date],
    provide_context=False,
    dag=dag)

def delete_tar_gz_files(file, process_date):
    execution_command = 'gsutil rm ' + source_dir + '/' + file
    print(execution_command)
    returncode = os.system(execution_command)
    if returncode != 0:
        # logging.error("Halting process...")
        exit(1)
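As an aside on the snippet above, a hedged sketch of the same deletion using subprocess instead of os.system, so that a failing gsutil call raises an exception and fails the task (source_dir and file are the names from the question; assumes a Python 3 environment):
import subprocess

def delete_tar_gz_files(file, process_date):
    # check=True raises CalledProcessError on a non-zero gsutil exit code,
    # which fails the task without a manual exit(1)
    subprocess.run(['gsutil', 'rm', source_dir + '/' + file], check=True)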
Manual run status: (screenshot of the manual run status omitted)

How can I read a config file from airflow packaged DAG?

Airflow packaged DAGs seem like a great building block for a sane production airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
  - project_foo
  - project_bar
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using Python's open function, à la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import yaml
from zipfile import ZipFile
import logging
import re

def get_config(yaml_filename):
    """Parses and returns the given YAML config file.

    For packaged DAGs, gracefully handles unzipping.
    """
    zip, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
    if zip:
        contents = ZipFile(zip).read(post_zip.lstrip('/'))
    else:
        contents = open(post_zip).read()
    result = yaml.safe_load(contents)
    logging.info('Parsed config: %s', result)
    return result
which works as you'd expect from the main dag.py:
get_config(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
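To illustrate the path handling (the zipped path below is the one from the error message in the question; the unzipped one is hypothetical):
# inside workflows.zip, __file__ resolves to something like
# /home/airflow/dags/workflows.zip/dag.py, so get_config receives:
get_config('/home/airflow/dags/workflows.zip/config.yaml')  # read via ZipFile
# an unpackaged DAG falls through to a plain open():
get_config('/home/airflow/dags/config.yaml')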

CSV to Neo4j-Graph

I'm trying to convert a database (CSV files) into a Neo4j graph, but I get an error.
The command is:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/PERC/AppData/Roaming/Neo4j%20Desktop/Application/neo4jDatabases/database-37b84fcf-d1b2-4dee-a4ee-5faed9cbaca0/installation-3.3.1/import/customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});
The customers.csv file is in the import folder of the Neo4j Desktop installation, but I get this error:
Couldn't load the external resource at: file:/C:/Users/PERC/AppData/Roaming/Neo4j%20Desktop/Application/neo4jDatabases/database-37b84fcf-d1b2-4dee-a4ee-5faed9cbaca0/installation-3.3.1/import/Users/PERC/AppData/Roaming/Neo4j%20Desktop/Application/neo4jDatabases/database-37b84fcf-d1b2-4dee-a4ee-5faed9cbaca0/installation-3.3.1/import/customers.csv
If you take a look at your error message, you will see that the actually-used file URL repeats the file path.
As stated in the dev manual:
File URLs will be resolved relative to the dbms.directories.import directory. For example, a file URL will typically look like file:///myfile.csv or file:///myproject/myfile.csv.
Since your CSV file is directly in your import directory, try this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});
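For reference, the import directory is controlled by this setting in neo4j.conf (shown here with what I believe is the default relative value):
# neo4j.conf
dbms.directories.import=import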

appengine python remote_api module object has no attribute GoogleCredentials

AttributeError: 'module' object has no attribute 'GoogleCredentials'
I have an App Engine app which is running on localhost.
I have some tests which I run, and I want to use the remote_api to check the DB values.
When I try to access the remote_api by visiting
'http://127.0.0.1:8080/_ah/remote_api'
I get a
"This request did not contain a necessary header"
response, but it's working in the browser.
When I now try to call the remote_api from my tests by calling
remote_api_stub.ConfigureRemoteApiForOAuth('localhost:35887','/_ah/remote_api')
I get the error:
Error
Traceback (most recent call last):
  File "/home/dan/src/gtup/test/test_users.py", line 38, in test_crud
    remote_api_stub.ConfigureRemoteApiForOAuth('localhost:35887','/_ah/remote_api')
  File "/home/dan/Programs/google-cloud-sdk/platform/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 747, in ConfigureRemoteApiForOAuth
    credentials = client.GoogleCredentials.get_application_default()
AttributeError: 'module' object has no attribute 'GoogleCredentials'
I did try to reinstall the whole Google Cloud SDK, but this didn't work.
When I open the client.py at
google-cloud-sdk/platform/google_appengine/lib/google-api-python-client/oauth2client/client.py
which is used by remote_api_stub.py, I can see that there is no GoogleCredentials class inside it.
The GoogleCredentials class does exist, but inside other client.py files, which lie at:
google-cloud-sdk/platform/google_appengine/lib/oauth2client/oauth2client/client.py
google-cloud-sdk/platform/gsutil/third_party/oauth2client/oauth2client/client.py
google-cloud-sdk/platform/bq/third_party/oauth2client/client.py
google-cloud-sdk/lib/third_party/oauth2client/client.py
My app.yaml looks like this:
application: myapp
version: 1
runtime: python27
api_version: 1
threadsafe: true

libraries:
- name: webapp2
  version: latest

builtins:
- remote_api: on

handlers:
- url: /.*
  script: main.app
Is this just a wrong import or a bug inside App Engine, or am I doing something wrong in using the remote_api inside my unit tests?
I solved this problem by replacing the folder:
../google-cloud-sdk/platform/google_appengine/lib/google-api-python-client/oauth2client
with:
../google-cloud-sdk/platform/google_appengine/lib/oauth2client/oauth2client
The copy that now gets picked up from the google-api-python-client folder has the needed GoogleCredentials class in its client.py file.
Then I had a second problem with the connection, and now I have to call:
remote_api_stub.ConfigureRemoteApiForOAuth('localhost:51805','/_ah/remote_api', False)
Note: the port changes every time the server gets restarted.
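If the changing port is a nuisance, the dev server's ports can be pinned when it is started. A hedged sketch (I believe --port and --api_port are the relevant dev_appserver.py flags, but verify them against your SDK version and check which of the two ports serves /_ah/remote_api in your setup):
dev_appserver.py --port=8080 --api_port=51805 app.yaml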
Answering instead of commenting as I cannot post a comment with my reputation -
Similar things have happened to me when running these types of scripts on Mac. Sometimes your PATH variable gets confused as to which files to actually check for functions, especially when you have gcloud installed alongside the App Engine launcher. If on Mac, I would suggest editing your ~/.bash_profile file to fix this (or possibly ~/.bashrc if on Linux). For example, on my Mac I have the following lines to fix my PATH variable:
export PATH="/usr/local/bin:$PATH"
export PYTHONPATH="/usr/local/google_appengine:$PYTHONPATH"
These basically make sure the command line will look in /usr/local/bin (or, in the case of the PYTHONPATH line, that Python will look in /usr/local/google_appengine) BEFORE anything else in the PATH (or PYTHONPATH).
The PATH variable is where the command line looks for executables when you type them into the prompt. The PYTHONPATH is where Python looks for the modules to load at runtime.
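A quick way to check which oauth2client your environment actually resolves (assuming oauth2client is importable on your PYTHONPATH):
# prints the file path of the oauth2client.client module that Python picks up,
# and whether it exposes GoogleCredentials
import oauth2client.client as client

print(client.__file__)
print(hasattr(client, 'GoogleCredentials'))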
