I added a connection to S3 from the UI
and expected to see AIRFLOW_CONN_S3_CONN among my environment variables, per https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html, but it didn't show up
(I used a BashOperator with the printenv command to list all environment variables, but didn't find any AIRFLOW_CONN_ entries). What could be the problem?
I'm using Airflow 2.2.5
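For reference, a minimal sketch of the kind of check described above (the DAG id and task id are made up for illustration); it prints any AIRFLOW_CONN_ variables visible to the task:
# Sketch only: hypothetical dag_id/task_id, written for Airflow 2.x
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="print_conn_env_vars",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="print_env",
        bash_command="printenv | grep AIRFLOW_CONN_ || echo 'no AIRFLOW_CONN_ variables set'",
    )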
So I have an Airflow task that is currently a BashOperator() whose bash_command is just "python pppp.py". The script has a "__main__" block whose only job is to call mycommand(). I tried to switch it to a @task() where I "import pppp" and then call "pppp.mycommand()". mycommand() connects to an internal messaging system. The BashOperator() works, but the @task() fails to connect. I even used kubectl to get into the pod, started a Python shell, ran "import pppp" and "pppp.mycommand()", and it worked. I have confirmed that both approaches use the same image and have all of the same environment variables set to the same values. Obviously there is some difference between the two approaches; can anyone think of what it is? There is no code to share, since the issue is with an internal messaging system that you wouldn't be able to access. I guess I am interested in what happens before Airflow executes my_task().
@task(**args)
def my_task():
    import pppp
    pppp.mycommand()
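For context, a minimal sketch of the two variants being compared, where pppp and mycommand() stand in for the internal module from the question and are not real packages:
# Sketch only: pppp is the question's internal module, used here as a placeholder.
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

with DAG(dag_id="pppp_compare", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Variant 1: launches "python pppp.py" in a subshell, i.e. a fresh interpreter.
    run_script = BashOperator(task_id="run_pppp", bash_command="python pppp.py")

    # Variant 2: runs inside the Airflow worker's own Python process, which can
    # differ from a fresh shell in working directory, sys.path, and inherited environment.
    @task()
    def call_directly():
        import pppp
        pppp.mycommand()

    call_directly()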
I need to run DAGs in parallel but do not need significant scaling, so LocalExecutor can do the job just fine. Following the Airflow docs, I first created a MySQL database:
CREATE DATABASE airflow_db CHARACTER SET utf8;
CREATE USER <user> IDENTIFIED BY <pass>;
GRANT ALL PRIVILEGES ON airflow_db.* TO <user>;
I then modified the following parameters in the airflow.cfg file:
executor = LocalExecutor
sql_alchemy_conn = mysql+mysqlconnector://<user>:<pass>@localhost:3306/airflow_db
When I run airflow db init, I run into the following error message:
AttributeError: 'MySQLConverter' object has no attribute '_dagruntype_to_mysql'
During handling of the above exception, another exception occurred:
TypeError: Python 'dagruntype' cannot be converted to a MySQL type
Please note that nothing else in the airflow.cfg file was altered and that using the default SequentialExecutor with sqlite lets everything run just fine. Also note that I am using Airflow version 2.2.0
I found the solution to my own question. Instead of using the mysqlconnector, I used the pymysql driver:
pip install PyMySQL
The airflow.cfg parameters can then be adjusted as follows:
sql_alchemy_conn = mysql+pymysql://<user>:<pass>@localhost:3306/airflow_db
All else can stay the same.
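As an optional sanity check before running airflow db init (the credentials below are placeholders), the PyMySQL URL can be tested on its own:
# Placeholder credentials; requires SQLAlchemy and PyMySQL to be installed.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://airflow_user:airflow_pass@localhost:3306/airflow_db")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the connection works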
tl;dr: How can I invoke the system command y | conda create --name gee_interface from an R console, e.g. via system2()? I'm comfortable enough with system2('conda', c('create', '--name', 'gee_interface')), but I don't know how to handle piping in the 'y' via system2().
Details
I am trying to use an R console to run the bash command conda create --name gee_interface (OSX Mojave with Anaconda installed).
In terminal, that command executes just fine, but prompts me to answer with Proceed ([y]/n)? (I answer 'y' and everything works smoothly).
In R, I run
Sys.setenv(PATH = paste(c("/Applications/anaconda3/bin", Sys.getenv("PATH")), collapse = .Platform$path.sep)) # ensures that system2() finds conda
system2('conda', c('create', '--name', 'gee_interface')) # This is the key line for the purposes of this question
When running the second line [i.e. system2('conda', c('create', '--name', 'gee_interface'))], the process never finishes, but quickly falls to zero CPU usage. Presumably the system is waiting for my response to the prompt, but I don't know how to provide it. How does one do this via an R script? Note also that in my particular case, the number of times that I need to respond 'y' is variable, depending on whether an environment of the name gee_interface already exists or not.
The fix to your first problem is to tell conda not to ask for confirmation using -y:
system2('conda', c('create', '--name', 'gee_interface', '-y'))
As to the second part (the variable number of times your input is required), I'm guessing it's to overwrite the environment if it already exists? In that case, you could check for its existence first with conda info --envs, and run conda remove --name gee_interface --all if necessary before creating it.
See:
https://docs.conda.io/projects/conda/en/latest/commands/create.html
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#removing-an-environment
You could also try your system2() call with the argument input = "y", but that doesn't address your second problem of needing to confirm a variable number of times.
See: Invoke a system command and pipe a variable as an argument
I am trying to install mlflow in R and I'm getting this error message:
mlflow::install_mlflow()
Error in mlflow_conda_bin() :
Unable to find conda binary. Is Anaconda installed?
If you are not using conda, you can set the environment variable MLFLOW_PYTHON_BIN to the path of your python executable.
I have tried the following:
export MLFLOW_PYTHON_BIN="/usr/bin/python"
source ~/.bashrc
echo $MLFLOW_PYTHON_BIN -> this prints /usr/bin/python.
or in R,
Sys.setenv(MLFLOW_PYTHON_BIN="/usr/bin/python")
Sys.getenv() -> shows that MLFLOW_PYTHON_BIN is set to /usr/bin/python.
However, it still does not work.
I do not want to use a conda environment.
How do I get past this error?
The install_mlflow command only works with conda right now, sorry about the confusing message. You can either:
install conda - this is the recommended way of installing and using mlflow
or
install mlflow python package yourself via pip
To install mlflow yourself, pip install the Python version of mlflow that matches the R package, and set the MLFLOW_PYTHON_BIN environment variable as well as the MLFLOW_BIN env variable, e.g.:
library(mlflow)
system(paste("pip install -U mlflow==", mlflow:::mlflow_version(), sep=""))
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Just ran across this, and the accepted answer by @Tomas was very helpful. I added a comment above but, for some additional context, I wanted to write a more thorough response in case any other Enterprise Databricks R users run across this post while trying to use the MLflow package for R on Databricks.
The Databricks MLflow quickstart guide will tell you that you need to run the following:
library(mlflow)
install_mlflow()
However, for Enterprise Databricks users, the install_mlflow() function will fail if your cluster doesn't have outside connectivity privileges (as most probably don't) and can't connect to the Anaconda repo to download the necessary packages. You'll likely get an error like this:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/current_repodata.js
The good news is that MLflow should already be installed on your Databricks runtime. So you can reference that install instead and then, as @Tomas mentioned, use it to set your R environment variables MLFLOW_BIN and MLFLOW_PYTHON_BIN. From there, the R MLflow API works as specified (in my experience, but YMMV).
The only catch with the above solution is that when you use the system() function in R, you need to set intern=TRUE in order to capture the output of the command. The default behavior of system() is intern=FALSE. Thus, if you do not explicitly set intern=TRUE, the exit code 0 will be returned from your system() call (or perhaps another exit code upon an error), and Sys.setenv() will set the environment variable to 0!
### intern=TRUE missing ###
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Example output (you can see that the environment variables did not get set correctly):
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN 0
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN 0
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
However, when intern=TRUE, you'll get the correct environment variables:
### intern=TRUE set ###
Sys.setenv(MLFLOW_BIN=system("which mlflow", intern=TRUE))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python", intern=TRUE))
Example output:
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN /databricks/python3/bin/mlflow
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN /databricks/python3/bin/python
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
Note: This was using Databricks runtime 9.1 LTS ML. This may or may not work on other Databricks runtime configurations.
I have a DAG which checks at a regular interval for new workflows to be generated (dynamic DAGs) and, if found, creates them. (Ref: Dynamic dags not getting added by scheduler)
The above DAG is working and the dynamic DAGs are getting created and listed in the webserver. There are two issues here:
When clicking on a DAG in the web UI, it says "DAG seems to be missing"
The DAGs are not listed by the "airflow list_dags" command
Error:
DAG "app01_user" seems to be missing.
The same goes for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
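(For reference, the dynamic-DAG pattern in question is roughly the sketch below, with hypothetical app names; the key requirement is that each generated DAG object ends up in the module's global namespace at parse time:)
# Rough sketch with hypothetical app names, not the actual generator script.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

for app in ["app01", "app02", "app03", "app04"]:
    dag_id = "{}_user".format(app)
    dag = DAG(dag_id=dag_id, start_date=datetime(2019, 1, 1), schedule_interval=None)
    DummyOperator(task_id="placeholder", dag=dag)
    globals()[dag_id] = dag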
Edit1:
I tried clearing all data and running "airflow run". It ran successfully, but no dynamically generated DAGs were added by "airflow list_dags". However, when running the "airflow list_dags" command, it loaded and executed the base DAG (which generates the dynamic DAGs), and the dynamic DAGs were then listed as below:
[root#cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running it again, all 4 of the generated DAGs above disappeared and only the base DAG, "testDynDags", is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and restarted the webserver, it went through normally.
From what I can see, this is the message that appears when the webserver tries to parse the DAG file and hits an error. In my case it was an error importing a new operator I had added to a plugin.
Usually I check the Airflow UI first; sometimes the reason for the broken DAG appears there. If it doesn't, I run the .py file of my DAG directly, and the error (the reason the DAG can't be parsed) will appear.
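A minimal sketch of doing the same check programmatically, assuming Airflow is importable in your environment and the default dags_folder is configured:
# Surfaces parse errors for every file in the configured dags_folder,
# much like running each DAG file directly with python.
from airflow.models import DagBag

bag = DagBag()  # parses the files in the dags_folder
for path, trace in bag.import_errors.items():
    print(path)
    print(trace)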
I never got to work on dynamic DAG generation, but I did face this issue when the DAG was not present on all nodes (scheduler, worker and webserver). If you have an Airflow cluster, please make sure the DAG file is present on all Airflow nodes.
Same error here; the reason was that I renamed my dag_id to uppercase, something like "import_myclientname" to "import_MYCLIENTNAME".
I am a little late to the party, but I faced this error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that Airflow fails to recognize a DAG defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
def makedag(dag_id="a"):
    with DAG(dag_id=dag_id) as dag:
        DummyOperator(task_id="nada")
    return dag

dag = makedag()
and
# b.py
from a import makedag
dag = makedag(dag_id="b")
Then Airflow will only look at a.py. It won't look at b.py at all, not even to notice if there's a syntax error in it! But if you add from airflow import DAG to b.py and don't change anything else, it will show up.
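A sketch of the workaround described above; the unused import exists only so that Airflow's DAG-file scanning picks the file up:
# b.py
from airflow import DAG  # unused, but makes Airflow treat this file as a DAG file
from a import makedag

dag = makedag(dag_id="b")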