Get access to Airflow hooks within a Jupyter notebook

I have Airflow running with the Postgres backend, and that works fine. I also have a Jupyter server running on the same host as Airflow. I assumed I could simply access the Airflow hooks from a notebook:
import pandas as pd
import numpy as np
import matplotlib as plt
from airflow.hooks.mysql_hook import MySqlHook
mysql = MySqlHook(mysql_conn_id = 'mysql-x')
sql = "select 1+1"
But I get this exception message:
OperationalError: (sqlite3.OperationalError) no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?]
[parameters: ('mysql-x',)]
(Background on this error at:
What makes me suspicious is not only that it cannot find the connection ID (which I can clearly see in the Airflow UI), but also that the error is a sqlite3.OperationalError. It very much looks like the notebook is not even connected to the same Postgres database. I have checked os.environ["AIRFLOW_HOME"], which seems to be correct.
When I start the Jupyter notebook server after Airflow, so that all environment variables are set, I get a different error:
/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/ in __get__(self, instance, owner)
352 def __get__(self, instance, owner):
--> 353 retval = self.descriptor.__get__(instance, owner)
354 # detect if this is a plain Python #property, which just returns
355 # itself for class level access. If so, then return us.
/usr/local/lib/python3.7/site-packages/airflow/models/ in get_password(self)
188 "Can't decrypt encrypted password for login={}, \
189 FERNET_KEY configuration is missing".format(self.login))
--> 190 return fernet.decrypt(bytes(self._password, 'utf-8')).decode()
191 else:
192 return self._password
/usr/local/lib/python3.7/site-packages/cryptography/ in decrypt(self, msg, ttl)
169 except InvalidToken:
170 pass
--> 171 raise InvalidToken
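One way to narrow down both symptoms is to check which settings the notebook's Python process actually sees. The sketch below uses only the standard library. It assumes Airflow 1.10-style settings, where the metadata database URL and the Fernet key live in the [core] section of airflow.cfg and can be overridden by the environment variables AIRFLOW__CORE__SQL_ALCHEMY_CONN and AIRFLOW__CORE__FERNET_KEY. The function name describe_airflow_settings is made up for illustration and is not part of Airflow.

```python
import configparser
import os
from pathlib import Path


def describe_airflow_settings(environ=None):
    """Summarise where this process would get its Airflow [core] settings.

    Hypothetical helper, not an Airflow API. Assumes Airflow 1.10-style
    option names in the [core] section.
    """
    environ = os.environ if environ is None else environ
    # Airflow defaults AIRFLOW_HOME to ~/airflow when the variable is unset.
    home = Path(environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()

    # interpolation=None so '%' characters in passwords do not raise errors.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(home / "airflow.cfg")  # a missing file is silently ignored

    report = {"airflow_home": str(home)}
    for key in ("sql_alchemy_conn", "fernet_key"):
        env_name = "AIRFLOW__CORE__" + key.upper()
        if env_name in environ:
            value, source = environ[env_name], "env:" + env_name
        elif cfg.has_option("core", key):
            value, source = cfg.get("core", key), "airflow.cfg"
        else:
            value, source = "", "unset"
        if key == "sql_alchemy_conn":
            # Keep only the dialect (e.g. 'sqlite', 'postgresql+psycopg2')
            # so credentials are never printed.
            summary = value.split("://", 1)[0] if value else None
        else:
            # Never print the key itself, only whether one is configured.
            summary = bool(value)
        report[key] = (summary, source)
    return report


if __name__ == "__main__":
    for name, info in describe_airflow_settings().items():
        print(name, info)
```

If the notebook reports sqlite (or nothing at all) while the same check run in the Airflow shell reports postgresql, or if the Fernet key comes from a different source in each, the notebook is probably not reading the same configuration as the webserver and scheduler.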
You can use this dockerfile to reproduce it:
FROM apache/airflow
USER root
# install mysql client
RUN apt-get update && apt-get install -y mariadb-client-10.3 unzip
# install mssql client and tools
RUN apt-get install -y curl gnupg libicu-dev libicu63
RUN curl -o key
RUN apt-key add < key
RUN curl > /etc/apt/sources.list.d/msprod.list
RUN apt-get update && apt-get install -y mssql-tools msodbcsql17 unixodbc unixodbc-dev unzip libunwind8
RUN curl -Lq '' -o
RUN mkdir /opt/sqlpackage/ && unzip -d /opt/sqlpackage/
RUN chmod a+x /opt/sqlpackage/sqlpackage && ln -sfn /opt/sqlpackage/sqlpackage /usr/bin/sqlpackage
RUN ln -sfn /opt/mssql-tools/bin/sqlcmd /usr/bin/sqlcmd
# install notebooks
RUN pip install jupyterlab pandas numpy scikit-learn matplotlib pymssql
#RUN cat /
# start additional notebook server
# this is a dirty hack but for the sake of this prototype good enough
RUN sed -i -e's/\# Run the command/airflow scheduler \& \njupyter notebook --ip= --port=9000 --NotebookApp.token="" --NotebookApp.password="" \& \n/' /entrypoint
# switch back to airflow user
USER airflow
RUN airflow initdb
RUN alias ll='ls -al'


I am not able to run dask yarn cluster on AWS EMR

I want to run Dask on EMR using YarnCluster.
I used the bootstrap script below, but I ran its instructions manually in an SSH console.
HELP="Usage: bootstrap-dask [OPTIONS]
Example AWS EMR Bootstrap Action to install and configure Dask and Jupyter
By default it does the following things:
- Installs miniconda
- Installs dask, distributed, dask-yarn, pyarrow, and s3fs. This list can be
extended using the --conda-packages flag below.
- Packages this environment for distribution to the workers.
- Installs and starts a jupyter notebook server running on port 8888. This can
be disabled with the --no-jupyter flag below.
--jupyter / --no-jupyter Whether to also install and start a Jupyter
Notebook Server. Default is True.
--password, -pw Set the password for the Jupyter Notebook
Server. Default is 'dask-user'.
--conda-packages Extra packages to install from conda.
# Parse Inputs. This is specific to this script, and can be ignored
# -----------------------------------------------------------------
# -----------------------------------------------------------------------------
# 1. Check if running on the master node. If not, there's nothing to do.
# -----------------------------------------------------------------------------
grep -q '"isMaster": true' /mnt/var/lib/info/instance.json \
|| { echo "Not running on master node, nothing to do" && exit 0; }
# -----------------------------------------------------------------------------
# 2. Install Miniconda
# -----------------------------------------------------------------------------
echo "Installing Miniconda"
curl -o /tmp/
bash /tmp/ -b -p $HOME/miniconda
rm /tmp/
echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc
source $HOME/.bashrc
conda update conda -y
# configure conda environment
#source ~/miniconda/etc/profile.d/
#conda activate base
# -----------------------------------------------------------------------------
# 3. Install packages to use in packaged environment
# We install a few packages by default, and allow users to extend this list
# with a CLI flag:
# - dask-yarn >= 0.7.0, for deploying Dask on YARN.
# - pyarrow for working with hdfs, parquet, ORC, etc...
# - s3fs for access to s3
# - conda-pack for packaging the environment for distribution
# - ensure tornado 5, since tornado 6 doesn't work with jupyter-server-proxy
# -----------------------------------------------------------------------------
echo "Installing base packages"
conda install \
-c conda-forge \
-y \
-q \
dask-yarn \
s3fs \
conda-pack
pip3 install pyarrow
# -----------------------------------------------------------------------------
# 4. Package the environment to be distributed to worker nodes
# -----------------------------------------------------------------------------
echo "Packaging environment"
conda pack -q -o $HOME/environment.tar.gz
# -----------------------------------------------------------------------------
# 5. List all packages in the worker environment
# -----------------------------------------------------------------------------
echo "Packages installed in the worker environment:"
conda list
# -----------------------------------------------------------------------------
# 6. Configure Dask
# This isn't necessary, but for this particular bootstrap script it will make a
# few things easier:
# - Configure the cluster's dashboard link to show the proxied version through
# jupyter-server-proxy. This allows access to the dashboard with only an ssh
# tunnel to the notebook.
# - Specify the pre-packaged python environment, so users don't have to
# - Set the default deploy-mode to local, so the dashboard proxying works
# - Specify the location of the native libhdfs library so pyarrow can find it
# on the workers and the client (if submitting applications).
# ------------------------------------------------------------------------------
echo "Configuring Dask"
mkdir -p $HOME/.config/dask
cat <<EOT >> $HOME/.config/dask/config.yaml
link: "/proxy/{port}/status"
environment: /home/hadoop/environment.tar.gz
deploy-mode: local
ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
# Also set ARROW_LIBHDFS_DIR in ~/.bashrc so it's set for the local user
echo -e '\nexport ARROW_LIBHDFS_DIR=/usr/lib/hadoop/lib/native' >> $HOME/.bashrc
# -----------------------------------------------------------------------------
# 8. Install jupyter notebook server and dependencies
# We do this after packaging the worker environments to keep the tar.gz as
# small as possible.
# We install the following packages:
# - notebook: the Jupyter Notebook Server
# - ipywidgets: used to provide an interactive UI for the YarnCluster objects
# - jupyter-server-proxy: used to proxy the dask dashboard through the notebook server
# -----------------------------------------------------------------------------
echo "Installing Jupyter"
conda install \
-c conda-forge \
-y \
-q \
notebook \
ipywidgets \
jupyter-server-proxy \
# -----------------------------------------------------------------------------
# 9. List all packages in the client environment
# -----------------------------------------------------------------------------
echo "Packages installed in the client environment:"
conda list
# -----------------------------------------------------------------------------
# 10. Configure Jupyter Notebook
# -----------------------------------------------------------------------------
echo "Configuring Jupyter"
mkdir -p $HOME/.jupyter
HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"`
cat <<EOF >> $HOME/.jupyter/
c.NotebookApp.password = u'$HASHED_PASSWORD'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = ''
c.NotebookApp.port = 8888
# -----------------------------------------------------------------------------
# 11. Define an upstart service for the Jupyter Notebook Server
# This sets the notebook server up to properly run as a background service.
# -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook Upstart Service"
cat <<EOF > /tmp/jupyter-notebook.service
Description=Jupyter Notebook
ExecStart=$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/
sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
sudo systemctl enable jupyter-notebook.service
# -----------------------------------------------------------------------------
# 12. Start the Jupyter Notebook Server
# -----------------------------------------------------------------------------
echo "Starting Jupyter Notebook Server"
sudo systemctl daemon-reload
sudo systemctl restart jupyter-notebook.service
#$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/
After this, I start Jupyter Notebook using $HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/
Jupyter Notebook starts successfully.
When I run this code in a notebook:
from dask_yarn import YarnCluster
from dask.distributed import Client
# Create a cluster
cluster = YarnCluster()
# Connect to the cluster
client = Client(cluster)
it gives this error:
AttributeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 client = Client(cluster)
File ~/miniconda/lib/python3.9/site-packages/distributed/, in Client.__init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
832 elif isinstance(getattr(address, "scheduler_address", None), str):
833 # It's a LocalCluster or LocalCluster-compatible object
834 self.cluster = address
--> 835 status = getattr(self.cluster, "status")
836 if status and status in [Status.closed, Status.closing]:
837 raise RuntimeError(
838 f"Trying to connect to an already closed or closing Cluster {self.cluster}."
839 )
AttributeError: 'YarnCluster' object has no attribute 'status'
Also, when I use LocalCluster instead of YarnCluster, it runs perfectly. I have been stuck on this for days, so any help is appreciated. Also, how do we configure the worker nodes?
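The traceback itself shows the mechanism: Client takes the "LocalCluster-compatible" branch because scheduler_address is a string, then reads status with no default value. The following is a minimal sketch of that code path using stand-in classes. StubWithoutStatus, StubWithStatus, and read_status_like_client are hypothetical names for illustration only and are not part of dask-yarn or distributed.

```python
class StubWithoutStatus:
    """Hypothetical stand-in for a cluster object that lacks `status`."""

    scheduler_address = "tcp://10.0.0.1:8786"


class StubWithStatus(StubWithoutStatus):
    """Hypothetical stand-in for a cluster object that exposes `status`."""

    status = "running"


def read_status_like_client(cluster):
    # Mirrors lines 832-835 of the traceback: the branch is taken when
    # `scheduler_address` is a string, then `status` is read with no
    # default, so a missing attribute raises AttributeError.
    if isinstance(getattr(cluster, "scheduler_address", None), str):
        return getattr(cluster, "status")
    return None


if __name__ == "__main__":
    print(read_status_like_client(StubWithStatus()))
    try:
        read_status_like_client(StubWithoutStatus())
    except AttributeError as exc:
        print("AttributeError:", exc)
```

If this is what is happening, the installed dask-yarn and distributed releases are probably out of step with each other, so comparing their versions in the output of conda list would be a reasonable first check.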

SageMaker fails when trying to add Lifecycle Configuration for keeping custom environments persistent after restart

I want to create an environment in SageMaker on AWS with Miniconda and make it available as a kernel in Jupyter when I restart the session, but the SageMaker instance keeps failing.
I followed the instructions found here:
Basically, it says:
"Create a custom, persistent Conda installation on the notebook instance's Amazon Elastic Block Store (Amazon EBS) volume: Run the on-create script in the terminal of an existing notebook instance. This script uses Miniconda to create a separate Conda installation on the EBS volume (/home/ec2-user/SageMaker/). Then, run the on-start script as a lifecycle configuration to make the custom environment available as a kernel in Jupyter. This method is recommended for more technical users, and it is a better long-term solution."
I ran this script in the Jupyter terminal:
set -e
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
# Install a separate conda installation via Miniconda
mkdir -p "$WORKING_DIR"
wget -O "$WORKING_DIR/"
bash "$WORKING_DIR/" -b -u -p "$WORKING_DIR/miniconda"
rm -rf "$WORKING_DIR/"
# Create a custom conda environment
source "$WORKING_DIR/miniconda/bin/activate"
conda create --yes --name "$KERNEL_NAME" python="$PYTHON"
conda activate "$KERNEL_NAME"
pip install --quiet ipykernel
# Customize these lines as necessary to install the required packages
conda install --yes numpy
pip install --quiet boto3
and it creates the "conda-test-env" environment as expected.
Then I add the on-start script as a lifecycle configuration:
set -e
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
source "/home/ec2-user/SageMaker/custom-environments/miniconda/bin/activate"
conda activate conda-test-env
python -m ipykernel install --user --name "conda-test-env" --display-name "conda-test-env"
# Optionally, uncomment these lines to disable SageMaker-provided Conda functionality.
# echo "c.EnvironmentKernelSpecManager.use_conda_directly = False" >> /home/ec2-user/.jupyter/
# rm /home/ec2-user/.condarc
Then I update the instance with the new configuration, and when I start my notebook instance, it fails after a few minutes.
I'd appreciate any help.

Application logs to stdout with Shiny Server and Docker

I have a Docker container running a shiny app (Dockerfile here).
Shiny Server logs are output to stdout and application logs are written to /var/log/shiny-server. I'm deploying this container to AWS Fargate, where logging only displays stdout, which makes debugging a deployed application challenging. I'd like to write the application logs to stdout.
I've tried a number of potential solutions:
I've tried the solution provided here, but have had no luck. I added the exec xtail /var/log/shiny-server/ to my as the last line in the file. App logs are not written to stdout.
I noticed that writing application logs to stdout is now the default behavior in rocker/shiny, but as I'm using rocker/verse:3.6.2 (upgraded from 3.6.0 today) along with RUN export ADD=shiny, I don't think this is standard behavior for the rocker/verse:3.6.2 container with Shiny add-on. As a result, I don't get the default behavior out of the box.
This issue on github suggests an alternative method of forcing application logging to stdout by way of an environment variable SHINY_LOG_STDERR=1 set at runtime but I'm not Linux-savvy enough to know where that env variable needs to be set to be effective. I found this documentation from Shiny Server v1.5.13 which gave suggestions in which file to set the environment variable depending on Linux distro; however, the output from my container when I run cat /etc/os-release is:
which doesn't really line up with any of the distributions in the Shiny Server documentation, thus making the documentation unhelpful.
I tried adding the environment variable from the GitHub issue above in the docker run command, i.e.,
docker run --rm -e SHINY_LOG_STDERR=1 -p 3838:3838 [my image]
as well as
docker run --rm -e APPLICATION_LOGS_TO_STDOUT=true -p 3838:3838 [my image]
and I am still not getting the logs on stdout.
I must be missing something here. Can someone help me identify how to get application logs written to stdout?
You can add the line ENV SHINY_LOG_STDERR=1 to your Dockerfile (this works with rocker/shiny at least; I'm not sure about rocker/verse). For example, with your Dockerfile:
FROM rocker/verse:3.6.2
## Add shiny capabilities to container
RUN export ADD=shiny && bash /etc/cont-init.d/add
## Install curl and xtail
RUN apt-get update && apt-get install -y \
curl \
xtail
## Add pip3 and other Python packages
RUN sudo apt-get update -y && apt-get install -y python3-pip
RUN pip3 install boto3
## Add R packages
RUN R -e "install.packages(c('shiny', 'tidyverse', 'tidyselect', 'knitr', 'rmarkdown', 'jsonlite', 'odbc', 'dbplyr', 'RMySQL', 'DBI', 'pander', 'sciplot', 'lubridate', 'zoo', 'stringr', 'stringi', 'openxlsx', 'promises', 'future', 'scales', 'ggplot2', 'zip', 'Cairo', 'tinytex', 'reticulate'), repos = '')"
## Update and install
RUN tlmgr update --self --all
RUN tlmgr install ms
RUN tlmgr install beamer
RUN tlmgr install pgf
#Copy app dir and theme dirs to their respective locations
COPY iarr /srv/shiny-server/iarr
COPY iarr/reports/interim_annual_report/theme/SwCustom /opt/TinyTeX/texmf-dist/tex/latex/beamer/
#Force texlive to find my custom beamer theme
RUN texhash
## Add shiny-server information
COPY /usr/bin/
COPY shiny-customized.config /etc/shiny-server/shiny-server.conf
## Add dos2unix to eliminate Win-style line-endings and run
RUN apt-get update -y && apt-get install -y dos2unix
RUN dos2unix /usr/bin/ && apt-get --purge remove -y dos2unix && rm -rf /var/lib/apt/lists/*
# Enable Logging from stdout
ENV SHINY_LOG_STDERR=1
RUN ["chmod", "+x", "/usr/bin/"]
CMD ["/usr/bin/"]

How to Choose R Server's R as Default in Operationalization, Remote R Workspace and RStudio Server?

So I've set up an Azure Data Science Virtual Machine on Linux (Ubuntu) and I've executed the following on the terminal to enable Remote R workspace, RStudio Server, R Server Operationalization and hadoop:
sudo apt update
sudo apt -y upgrade
# Hadoop is installed but doesn't seem to appear on the PATH or have its environment variable set by default
sudo echo "" >> ~/.bashrc
sudo echo "export PATH="'$'"PATH:/opt/hadoop/hadoop-2.7.4/bin" >> ~/.bashrc
sudo echo "export HADOOP_HOME=/opt/hadoop/hadoop-2.7.4" >> ~/.bashrc
source ~/.bashrc
#Setting up a password as none exists to begin with because of private key selection in the installation
#RStudio Server requires a password though
echo -e "MyPassword\nMyPassword\n" | sudo passwd sshuser
#Unfortunately hadoop fails on Data Science Virtual Machine
#error: mkdir: Call From IM-DSonUbuntu/ to localhost:9000 failed on connection exception: Connection refused; For more details see:
# hadoop fs -mkdir /user/RevoShare/rserve2
# hadoop fs -chmod uog+rwx /user/RevoShare/rserve2
sudo mkdir -p /var/RevoShare/rserve2
sudo chmod uog+rwx /var/RevoShare/rserve2
# hadoop fs -mkdir /user/RevoShare/sshuser
# hadoop fs -chmod uog+rwx /user/RevoShare/sshuser
sudo mkdir -p /var/RevoShare/sshuser
sudo chmod uog+rwx /var/RevoShare/sshuser
#Setting up R Server Operationalisation
cd /opt/microsoft/mlserver/9.2.1/o16n
sudo dotnet Microsoft.MLServer.Utils.AdminUtil/Microsoft.MLServer.Utils.AdminUtil.dll -silentoneboxinstall MyPassword
#They say this Data Science Virtual Machine already has RStudio Server, but even though the port 8787 is open, it's nowhere to be found! So installing it now, and after the installation it's accessible by refreshing the page that failed before.
#Perhaps it's not installed then? Or a service is not running like it should?
yes | sudo gdebi rstudio-server-1.1.414-amd64.deb
#They are small, leave them for debug reasons - let's have evidence the script ran this far.
#sudo rm rstudio-server-1.1.414-amd64.deb
# Remote R workspace Service needs dotnet sdk
curl | gpg --dearmor > microsoft.gpg
sudo mv microsoft.gpg /etc/apt/trusted.gpg.d/microsoft.gpg
sudo sh -c 'echo "deb [arch=amd64] xenial main" > /etc/apt/sources.list.d/dotnetdev.list'
sudo apt update
sudo apt -y install dotnet-sdk-2.0.0
sudo apt install libxml2-dev
#Downloading and installing the Remote R service
wget -O rtvs-daemon.tar.gz
tar -xvzf rtvs-daemon.tar.gz
sudo ./rtvs-install -s
sudo systemctl enable rtvsd
sudo systemctl start rtvsd
#sudo rm rtvs-daemon.tar.gz
#sudo rm rtvs-install
#Fixing Remote R: For some reason, even though 'sudo systemctl enable rtvsd' runs, after every reboot the service won't become automatically active. So let's fix that.
sudo mv /var/RevoShare/
sudo /sbin/shutdown -r 5
sudo chown root /etc/rc.local
sudo chmod 755 /etc/rc.local
sudo systemctl enable rc-local.service
sudo -s
sudo find /etc/ -name "rc.local" -exec sed -i 's/exit 0//g' {} \;
sudo echo "" >> /etc/rc.local
sudo echo "sh /var/RevoShare/" >> /etc/rc.local
sudo echo "exit 0" >> /etc/rc.local
I've also tried, one by one, these, to see if it makes any difference to the RStudio Server (it didn't, but even if it did, I want a global solution to work on Remote R Workspace Service and R Server Operationalisation as well, not only RStudio Server):
#Configuring RStudio Server to see the R Server R
sudo echo "rsession-which-r=/opt/microsoft/mlserver/9.2.1/bin/R/R" >> /etc/rstudio/rserver.conf
export RSTUDIO_WHICH_R=/opt/microsoft/mlserver/9.2.1/bin/R/R
sudo echo "RSTUDIO_WHICH_R=/opt/microsoft/mlserver/9.2.1/bin/R/R" >> ~/.profile
source ~/.profile
sudo echo "RSTUDIO_WHICH_R=/opt/microsoft/mlserver/9.2.1/bin/R/R" >> ~/.bashrc
source ~/.bashrc
sudo echo "PATH=$PATH:/opt/microsoft/mlserver/9.2.1/bin/R" >> ~/.bashrc
export PATH=$PATH:/opt/microsoft/mlserver/9.2.1/bin/R
source ~/.bashrc
The problem is that even though "which R" points to R Server's R, i.e. typing "sudo R" will show the message "Loading Microsoft R Server packages, version 9.2.1." and will load packages like RevoScaleR, everything else fails to do so.
Accessing the RStudio Server with and logging in with the initial user ("sshuser") (or with any other user for that matter) will NOT load R Server and RevoScaleR rx functions are unavailable
Using my local Visual Studio 2017 to access the remote workspace via "Add connection" on "Workspaces" tab loads MRO and says:
Installed R versions:
[0] Microsoft R Open '' (Default)
And finally, when I use R Server's Operationalisation and log in with "mrsdeploy" package's "remoteLogin()" R Server packages like RevoScaleR are not loaded again, so things like "rxSummary(~., data=iris)" fail with error 'could not find function "rxSummary"'
The exact same thing happened when I deployed from azure a "Machine Learning Server 9.2.1 on Linux (Ubuntu)".
I don't want to just use the regular open source R, I want to be able to use the R Server - that's why I deployed this VM. How can I make it so that everything loads R Server's R, not Microsoft R Open? (Like I'm able to do from terminal using "R")
As a result of my having tried all of this and the fact that R Server is loaded in the console, my mind now goes to permissions. Could it be that by default the Data Science VM doesn't have the correct permissions to allow these?
I'm at a loss
RStudio Server is installed on the Ubuntu DSVM, but the service is disabled by default as it does not support SSL. You can enable it with systemctl enable rstudio-server, then start it with systemctl start rstudio-server.
RStudio Server uses the same R as Microsoft R Server, but the .libPaths are different, which is why you cannot load the MRS packages. You will need to manually set the .libPaths so they match.

EC2 startup script gets stuck on wget

I have the following script for some number crunching
sudo apt-get update -y
sudo apt-get upgrade -y
sudo apt-get install -y r-base r-base-dev htop s3cmd p7zip-full
7z e ###.7z
sudo R CMD BATCH --slave --no-timing --vanilla "--args 0 1 100 200 500 2" SOME-ROUTINE.R
s3cmd put *.results s3://#########/
on EC2. I upload the script as a file under Launch Instance -> Instance Details -> User Data.
The machine fires up, updates, and upgrades, but then it does not execute wget and does not download the file. When I SSH into the instance and run the exact same commands, the process completes without problems.
Any ideas why wget does not work?
Any other alternatives?
It is always a bit of guessing, but here is how I would debug this:
My first suggestion would be to check for special characters in the S3 URL. This might cause the wget call to fail.
Second, I would give an explicit output path to wget with the -O option. While you are editing the command, you can also add -o to output logging information.
Last step is to check your access rights to the S3 bucket. Perhaps you can try to put the file on another webspace to see if the command executes then.
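The first two suggestions can be sketched together. The snippet below is an illustration in Python, not part of the original script: it lists the characters in a URL that an unquoted shell command would interpret, and it builds a wget command line with the URL quoted and with both -O (output file) and -o (log file) set. The function names, the example URL, and the file names are placeholders.

```python
import shlex

# Characters the shell treats specially when they appear unquoted.
SHELL_SPECIAL = set("&;|<>()$`\\\"' \t*?[]#~!{}")


def shell_special_chars(url):
    """Return the shell-special characters found in url, sorted."""
    return sorted({ch for ch in url if ch in SHELL_SPECIAL})


def build_wget_command(url, output_path, log_path):
    """Build a wget command line with every argument safely quoted.

    -O sets the output file and -o sets the log file.
    """
    return shlex.join(["wget", "-O", output_path, "-o", log_path, url])


if __name__ == "__main__":
    example_url = "https://example.com/data.7z?version=1&token=abc"  # placeholder
    print(shell_special_chars(example_url))
    print(build_wget_command(example_url, "data.7z", "wget.log"))
```

If the first function returns anything for the real URL, quoting the URL in the startup script is worth trying before looking at bucket permissions. The log file written by -o should then show what wget actually attempted.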
