How to install dask on Google Cloud Composer (Airflow)

I tried to install dask on Google Cloud Composer (Airflow). I used the PyPI packages page in the GCP UI to add dask and the required packages below (not sure whether all the Google ones are required; I couldn't find a requirements.txt):
dask
toolz
partd
cloudpickle
google-cloud
google-cloud-storage
google-auth
google-auth-oauthlib
decorator
When I run my DAG, which calls dd.read_csv("a gcp bucket"), the Airflow log shows the error below:
[2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
[2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask: fs, fs_token = get_fs(protocol, options)
[2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
[2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask: "Need to install `gcsfs` library for Google Cloud Storage support\n"
[2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
[2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask: raise RuntimeError(error_msg)
[2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
[2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask: conda install gcsfs -c conda-forge
[2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask: or
[2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask: pip install gcsfs
So I tried to install gcsfs via PyPI, but got the Airflow error below:
{
insertId: "17ks763f726w1i"
logName: "projects/xxxxxxxxx/logs/airflow-worker"
receiveTimestamp: "2018-10-25T15:42:24.935880717Z"
resource: {…}
severity: "ERROR"
textPayload: "Traceback (most recent call last):
File "/usr/local/bin/gcsfuse", line 7, in <module>
from gcsfs.cli.gcsfuse import main
File "/usr/local/lib/python2.7/site-
packages/gcsfs/cli/gcsfuse.py", line 3, in <module>
fuse import FUSE
ImportError: No module named fuse
"
timestamp: "2018-10-25T15:41:53Z"
}
It seems I'm trapped in a loop of required packages! Am I missing anything here? Any thoughts?

You don't need to add google-cloud-storage to your PyPI packages; it's already installed. I ran a DAG (image version: composer-1.3.0-airflow-1.10.0) that logs the version of the pre-installed package, and it appears to be 1.13.0. To replicate your case, I also added the following to my DAG:
import logging

import dask.dataframe as dd

def read_csv_dask():
    df = dd.read_csv('gs://gcs_path/data.csv')
    logging.info("csv from gs://gcs_path/ read alright")
Before anything else, I added the following dependencies via the UI:
dask==0.20.0
toolz==0.9.0
partd==0.3.9
cloudpickle==0.6.1
The corresponding task failed with the same message as yours ("Need to install gcsfs library for Google Cloud Storage support"), at which point I returned to the UI and attempted to add gcsfs==0.1.2. This never succeeded; however, I did not get the error you did. Instead, it repeatedly failed with "Composer Backend timed out".
At this point, you could consider the following alternatives:
1) Install gcsfs with pip in a BashOperator (see the sketch after this list). This is not optimal, as you will be reinstalling gcsfs every time the DAG runs.
2) Use another library. What are you doing with this csv? If you upload it to the gs://composer_gcs_bucket/data/ directory (check here), you can then read it with, e.g., the standard csv library, like so:
import csv

def read_csv():
    with open('/home/airflow/gcs/data/data.csv', 'rU') as f:
        reader = csv.reader(f)
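For option 1, here is a minimal sketch of what such a DAG could look like, assuming Composer 1.x with Airflow 1.10 imports. The gcsfs version pin and bucket path are assumptions, and because Composer can schedule tasks onto different workers the runtime install is not guaranteed to be visible to the read task; treat it as a sketch, not a recommended setup:

import logging

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

def read_csv_dask():
    # Import here so the DAG file still parses on workers without gcsfs installed.
    import dask.dataframe as dd
    df = dd.read_csv('gs://gcs_path/data.csv')
    logging.info("read %d partitions from gs://gcs_path/data.csv", df.npartitions)

dag = DAG(
    dag_id='dask_gcs_example',
    schedule_interval=None,
    start_date=days_ago(1))

install_gcsfs = BashOperator(
    task_id='install_gcsfs',
    bash_command='pip install --user gcsfs==0.1.2',
    dag=dag)

read_csv_from_gcs = PythonOperator(
    task_id='read_csv_from_gcs',
    python_callable=read_csv_dask,
    dag=dag)

install_gcsfs >> read_csv_from_gcs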

Related

Airflow quickstart DagRunNotFound DagRun example_bash_operator not found

I'm learning Airflow and just want to get up and running with the Quickstart: https://airflow.apache.org/docs/apache-airflow/stable/start.html
I'm not sure if this is a virtual environment issue or something obvious I'm missing with Airflow; this may be a duplicate of this question from 2017, Running Airflow task from the command line does not work, but there were no answers there.
My OS is Pop!_OS (Debian-based).
I have created a new virtual environment and installed Airflow by running the script provided:
# Airflow needs a home. `~/airflow` is the default, but you can put it
# somewhere else if you prefer (optional)
export AIRFLOW_HOME=~/airflow
# Install Airflow using the constraints file
AIRFLOW_VERSION=2.5.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
# For example: 3.7
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example: https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.7.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
# The Standalone command will initialise the database, make a user,
# and start all components for you.
airflow standalone
# Visit localhost:8080 in the browser and use the admin account details
# shown on the terminal to login.
# Enable the example_bash_operator dag in the home page
I was expecting this to be plug-and-play and I haven't even reached the tutorials yet.
airflow standalone works and I can run the DAG from the web UI. However, if I run
# run your first task instance
airflow tasks run example_bash_operator runme_0 2015-01-01
from the CLI, I get
airflow.exceptions.DagRunNotFound: DagRun for example_bash_operator with run_id or execution_date of '2015-01-01' not found
full error:
(airflow) jasonstewartnz#pop-os:~$ airflow tasks run example_bash_operator runme_0 2015-01-01
[2022-12-22 11:11:18,776] {dagbag.py:538} INFO - Filling up the DagBag from /home/jasonstewartnz/airflow/dags
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): create_entry_group>, delete_entry_group already registered for DAG: example_complex
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): delete_entry_group>, create_entry_group already registered for DAG: example_complex
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): create_entry_gcs>, delete_entry already registered for DAG: example_complex
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): delete_entry>, create_entry_gcs already registered for DAG: example_complex
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): create_tag>, delete_tag already registered for DAG: example_complex
[2022-12-22 11:11:18,820] {taskmixin.py:205} WARNING - Dependency <Task(BashOperator): delete_tag>, create_tag already registered for DAG: example_complex
[2022-12-22 11:11:18,832] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): print_the_context>, log_sql_query already registered for DAG: example_python_operator
[2022-12-22 11:11:18,832] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): log_sql_query>, print_the_context already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): print_the_context>, log_sql_query already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): log_sql_query>, print_the_context already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): print_the_context>, log_sql_query already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): log_sql_query>, print_the_context already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): print_the_context>, log_sql_query already registered for DAG: example_python_operator
[2022-12-22 11:11:18,833] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): log_sql_query>, print_the_context already registered for DAG: example_python_operator
/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/models/dag.py:3492 RemovedInAirflow3Warning: Param `schedule_interval` is deprecated and will be removed in a future release. Please use `schedule` instead.
[2022-12-22 11:11:18,914] {taskmixin.py:205} WARNING - Dependency <Task(_PythonDecoratedOperator): prepare_email>, send_email already registered for DAG: example_dag_decorator
[2022-12-22 11:11:18,914] {taskmixin.py:205} WARNING - Dependency <Task(EmailOperator): send_email>, prepare_email already registered for DAG: example_dag_decorator
Traceback (most recent call last):
File "/home/jasonstewartnz/.venv/airflow/bin/airflow", line 8, in <module>
sys.exit(main())
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/__main__.py", line 39, in main
args.func(args)
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/cli/cli_parser.py", line 52, in command
return func(*args, **kwargs)
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/utils/cli.py", line 108, in wrapper
return f(*args, **kwargs)
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/cli/commands/task_command.py", line 384, in task_run
ti, _ = _get_ti(task, args.map_index, exec_date_or_run_id=args.execution_date_or_run_id, pool=args.pool)
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/utils/session.py", line 75, in wrapper
return func(*args, session=session, **kwargs)
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/cli/commands/task_command.py", line 159, in _get_ti
dag_run, dr_created = _get_dag_run(
File "/home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/cli/commands/task_command.py", line 115, in _get_dag_run
raise DagRunNotFound(
airflow.exceptions.DagRunNotFound: DagRun for example_bash_operator with run_id or execution_date of '2015-01-01' not found
The web UI tells me my config is /home/jasonstewartnz/airflow/airflow.cfg
The dags_folder in this config, /home/jasonstewartnz/airflow/dags, is empty.
When I go to
http://localhost:8080/dags/example_bash_operator/details
I see that the fileloc attribute for the dag is:
fileloc /home/jasonstewartnz/.venv/airflow/lib/python3.10/site-packages/airflow/example_dags/example_bash_operator.py
Even if I copy this file to the DAGs directory, change the DAGs directory in the config to the path above, or add it to the Python path, the CLI still seems unable to find the DAG.
Did you enable the DAG 'example_bash_operator' in the UI, as the instructions specify? I am referring to this step in the guide's instructions:
# Visit localhost:8080 in the browser and use the admin account details
# shown on the terminal to login.
# Enable the example_bash_operator dag in the home page
This should be done before you attempt to execute the airflow tasks run command.

Error in Cloud composer with data build tool (dbt) path ['name']: 'jaffle_shop' does not match '^[^\\d\\W]\\w*$'

I am testing a deployment of dbt within Cloud Composer. On my local machine (Ubuntu 20.04) I have had success running the dbt models with Airflow. When running on Google Cloud Composer I get the following error:
{subprocess.py:74} INFO - Output:
{subprocess.py:78} INFO - Running with dbt=0.21.0
{subprocess.py:78} INFO - Encountered an error while reading the project:
{subprocess.py:78} INFO - ERROR: Runtime Error
{subprocess.py:78} INFO - at path ['name']: 'jaffle_shop' does not match '^[^\\d\\W]\\w*$'
{subprocess.py:78} INFO -
{subprocess.py:78} INFO - Error encountered in /home/airflow/gcs/dags/dbt_project.yml
{subprocess.py:78} INFO - Encountered an error:
{subprocess.py:78} INFO - Runtime Error
{subprocess.py:78} INFO - Could not run dbt
{subprocess.py:82} INFO - Command exited with return code 2
{taskinstance.py:1503} ERROR - Task failed with exception
We are using a BashOperator to run the dbt models in Airflow.
Initially we had some problems with dependencies, but those were solved.
We are using a standard dbt_project.yml file with a single model, just to test how this works (a sketch of this setup is shown below).
Another option would be to use Docker, but we still need to try whether that works.
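For reference, a minimal sketch of that kind of BashOperator setup; the project directory is inferred from the error path above, and the DAG details are assumptions rather than the actual DAG in use:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumption: the dbt project (dbt_project.yml) sits in the Composer DAGs folder,
# as suggested by the /home/airflow/gcs/dags/dbt_project.yml path in the error.
DBT_PROJECT_DIR = '/home/airflow/gcs/dags'

with DAG(
    dag_id='dbt_run_example',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run --project-dir {}'.format(DBT_PROJECT_DIR),
    )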
Edit
Versions
dbt: 0.21.0
cloud-composer: 1.17.1
airflow: 2.1.2
Pypi Packages
airflow-dbt: 0.4.0
dbt: 0.21.0
jsonschema: 3.1 (added this as PyPI was throwing an error about the version)
I'd really appreciate it if anyone can help.
Pete
The problem here is the jsonschema dependency. Version 3.1.0 does not work, while versions 3.1.1 and 3.2.0 do work, and they should also fit within Composer's dependency requirements.
There appears to have been an issue with jsonschema's switch to js-regex in 3.1.0, which caused the maintainers to revert to the standard re module in 3.1.1.
There are some details here, and a couple of related issues described here and here.
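As a quick sanity check (a small sketch; the pattern is copied from the error message), Python's own re engine accepts the project name, which points at the js-regex backend used by jsonschema 3.1.0 rather than at the name itself:

import re

# Pattern from the dbt error: a project name must start with a letter or
# underscore, followed by word characters.
NAME_PATTERN = r'^[^\d\W]\w*$'

print(bool(re.match(NAME_PATTERN, 'jaffle_shop')))  # True with the standard re module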
In general, it would be much nicer if Cloud Composer supported virtual environments to avoid this entire dependency-collision mess, but apparently Google does not support that approach.

Apache Airflow initdb command fails, due to syntax error

I have created a virtualenv for Python 3 using:
virtualenv -p $(which python3) ENV
Then activated the environment:
source /Users/myusername/ENV/bin/activate
Installed apache-airflow:
pip install apache-airflow
Then which airflow yields /Users/myusername/ENV/bin/airflow.
But when I try to initialize the database using:
airflow initdb
I get the error below:
{db.py:350} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [airflow.utils.log.logging_mixin.LoggingMixin] cryptography not found - values will not be stored encrypted.
ERROR [airflow.models.DagBag] Failed to import: /Library/Python/2.7/site-packages/airflow/example_dags/example_http_operator.py
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/airflow/models/__init__.py", line 413, in process_file
m = imp.load_source(mod_name, filepath)
File "/Library/Python/2.7/site-packages/airflow/example_dags/example_http_operator.py", line 27, in <module>
from airflow.operators.http_operator import SimpleHttpOperator
File "/Library/Python/2.7/site-packages/airflow/operators/http_operator.py", line 21, in <module>
from airflow.hooks.http_hook import HttpHook
File "/Library/Python/2.7/site-packages/airflow/hooks/http_hook.py", line 23, in <module>
import tenacity
File "/Library/Python/2.7/site-packages/tenacity/__init__.py", line 375, in <module>
from tenacity.tornadoweb import TornadoRetrying
File "/Library/Python/2.7/site-packages/tenacity/tornadoweb.py", line 24, in <module>
from tornado import gen
File "/Library/Python/2.7/site-packages/tornado-6.0.3-py2.7-macosx-10.14-intel.egg/tornado/gen.py", line 126
def _value_from_stopiteration(e: Union[StopIteration, "Return"]) -> Any:
^
SyntaxError: invalid syntax
Done.
(ENV) ---------------------------------------------------------
It seems the example scripts are being loaded under Python 2.7, which can't parse the Python 3 annotation syntax in that function definition.
Does the apache-airflow package need a fix in its next release, or is there something I can do to fix this myself?
I tried fixing this by using Python 2.7 instead of Python 3, i.e. installing Airflow on the default Python 2.7 that ships with macOS, but that throws other errors, such as the "six" package not being compatible.
You need to turn off loading of the example DAGs in the config file (load_examples = False under [core] in airflow.cfg) to get past this error.
That said, it does seem odd that the traceback runs out of /Library/Python/2.7/site-packages when you said Airflow was installed into a Python 3 virtual environment; it suggests the airflow command being executed is not the one from your virtualenv.
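One quick way to check which interpreter the airflow entry point actually uses is to look at its shebang line (a minimal sketch, using the path reported by which airflow above):

# Print the shebang of the airflow console script; for an install inside the
# Python 3 virtualenv it should point at ENV/bin/python3, not the system Python 2.7.
with open('/Users/myusername/ENV/bin/airflow') as script:
    print(script.readline().strip())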

airflow trigger_dag command throwing error

I am executing the airflow trigger_dag cng-hello_world command on the Airflow server and it results in the error below. Please suggest.
I followed this link: http://michal.karzynski.pl/blog/2017/03/19/developing-workflows-with-apache-airflow/
The same DAG has been executed via the Airflow UI.
[2019-02-06 11:57:41,755] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=5, pool_recycle=2000
[2019-02-06 11:57:43,326] {plugins_manager.py:97} ERROR - invalid syntax (airflow_api.py, line 7)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/airflow/plugins_manager.py", line 86, in <module>
m = imp.load_source(namespace, filepath)
File "/home/ec2-user/airflow/plugins/airflow_api.py", line 7
<!DOCTYPE html>
^
SyntaxError: invalid syntax
[2019-02-06 11:57:43,326] {plugins_manager.py:98} ERROR - Failed to import plugin /home/ec2-user/airflow/plugins/airflow_api.py
[2019-02-06 11:57:43,326] {plugins_manager.py:97} ERROR - invalid syntax (__init__.py, line 7)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/airflow/plugins_manager.py", line 86, in <module>
m = imp.load_source(namespace, filepath)
File "/home/ec2-user/airflow/plugins/__init__.py", line 7
<!DOCTYPE html>
^
SyntaxError: invalid syntax
[2019-02-06 11:57:43,327] {plugins_manager.py:98} ERROR - Failed to import plugin /home/ec2-user/airflow/plugins/__init__.py
[2019-02-06 11:57:47,236] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-02-06 11:57:48,420] {models.py:258} INFO - Filling up the DagBag from /home/ec2-user/airflow/dags
[2019-02-06 11:57:48,783] {cli.py:237} INFO - Created <DagRun cng-hello_world # 2019-02-06 11:57:48+00:00: manual__2019-02-06T11:57:48+00:00, externally triggered: True>

rJava import not working in airflow

I am trying to schedule an R script in Airflow, and I use the rJava library in the script. rJava and xlsx work fine in the R terminal, but not in the Airflow environment. I am getting this error:
libjvm.so: cannot open shared object file: No such file or directory
In my ~/.bashrc file,
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/bin/jar
export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server
In my ~/.profile file,
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/bin/jar
export HADOOP_HOME='/home/ubuntu/spark-2.2.0-bin-hadoop2.7/hadoop-2.7.4'
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:$LD_LIBRARY_PATH
In my /etc/environment,
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/bin/jar";
LD_LIBRARY_PATH="/usr/lib/jvm/java-8-openjdk-amd64/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server";
Also, I tried adding these lines at the top of my R script, before importing rJava:
system('export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/bin/jar')
system('export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server')
Even then I keep getting the libjvm.so missing-file error, although I can see that file in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server.
When I checked the Airflow log, the DAG runs the script from a temporary location: Temporary script location: /tmp/airflowtmp7Ws3X2//tmp/airflowtmp7Ws3X2/nz-property-report6vTyGr
I think it is not picking up the environment variables, and I get this error:
Loading required package: xlsx
[2018-08-09 21:39:23,755] {base_task_runner.py:98} INFO - Subtask: [2018-08-09 21:39:23,755] {bash_operator.py:101} INFO - Error: package or namespace load failed for ‘xlsx’:
[2018-08-09 21:39:23,755] {base_task_runner.py:98} INFO - Subtask: [2018-08-09 21:39:23,755] {bash_operator.py:101} INFO - .onLoad failed in loadNamespace() for 'rJava', details:
[2018-08-09 21:39:23,755] {base_task_runner.py:98} INFO - Subtask: [2018-08-09 21:39:23,755] {bash_operator.py:101} INFO - call: dyn.load(file, DLLpath = DLLpath, ...)
[2018-08-09 21:39:23,755] {base_task_runner.py:98} INFO - Subtask: [2018-08-09 21:39:23,755] {bash_operator.py:101} INFO - error: unable to load shared object '/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.4/rJava/libs/rJava.so':
[2018-08-09 21:39:23,756] {base_task_runner.py:98} INFO - Subtask: [2018-08-09 21:39:23,755] {bash_operator.py:101} INFO - libjvm.so: cannot open shared object file: No such file or directory
Can anyone help me with using rJava in my R script in Airflow?
EDIT: As requested, here is my DAG script:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#from airflow.models import DAG
from datetime import datetime

dag = DAG(
    dag_id='property_report',
    schedule_interval=None,
)

task = BashOperator(
    task_id='report',
    dag=dag,
    bash_command="Rscript /home/ubuntu/airflow/dags/scripts/r-scripts/recreate_lastmonthreport_from_snapshotdata.R",
    start_date=airflow.utils.dates.days_ago(1),
    owner='airflow')
Just to help anyone looking for an answer to this: I had to source ~/.bashrc in both screen sessions (the ones running the web server and the scheduler) and restart them. After that, the environment variables were picked up fine.
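An alternative that does not depend on the shell environment of whichever process launched Airflow is to pass the variables to the task explicitly via the BashOperator env parameter. A minimal sketch using the paths from the question (note that env replaces the inherited environment, hence the copy of os.environ, and that JAVA_HOME conventionally points at the JDK root rather than at bin/jar):

import os

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Extend the current environment so the Rscript subprocess can locate libjvm.so,
# instead of relying on ~/.bashrc having been sourced by the worker process.
r_env = dict(os.environ)
r_env.update({
    'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64',
    'LD_LIBRARY_PATH': '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server',
})

dag = DAG(dag_id='property_report_with_env', schedule_interval=None)

task = BashOperator(
    task_id='report',
    dag=dag,
    bash_command="Rscript /home/ubuntu/airflow/dags/scripts/r-scripts/recreate_lastmonthreport_from_snapshotdata.R",
    env=r_env,  # passed to the spawned bash process in place of the inherited environment
    start_date=days_ago(1),
    owner='airflow')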
