I am not able to run dask yarn cluster on AWS EMR - jupyter-notebook

I want to run Dask on EMR using YarnCluster.
I used the bootstrap script below, but I ran its instructions manually in an SSH console.
HELP="Usage: bootstrap-dask [OPTIONS]
Example AWS EMR Bootstrap Action to install and configure Dask and Jupyter
By default it does the following things:
- Installs miniconda
- Installs dask, distributed, dask-yarn, pyarrow, and s3fs. This list can be
extended using the --conda-packages flag below.
- Packages this environment for distribution to the workers.
- Installs and starts a jupyter notebook server running on port 8888. This can
be disabled with the --no-jupyter flag below.
--jupyter / --no-jupyter Whether to also install and start a Jupyter
Notebook Server. Default is True.
--password, -pw Set the password for the Jupyter Notebook
Server. Default is 'dask-user'.
--conda-packages Extra packages to install from conda."
# Parse Inputs. This is specific to this script, and can be ignored
# -----------------------------------------------------------------
# -----------------------------------------------------------------------------
# 1. Check if running on the master node. If not, there's nothing to do.
# -----------------------------------------------------------------------------
grep -q '"isMaster": true' /mnt/var/lib/info/instance.json \
|| { echo "Not running on master node, nothing to do" && exit 0; }
# -----------------------------------------------------------------------------
# 2. Install Miniconda
# -----------------------------------------------------------------------------
echo "Installing Miniconda"
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda
rm /tmp/miniconda.sh
echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc
source $HOME/.bashrc
conda update conda -y
# configure conda environment
#source ~/miniconda/etc/profile.d/conda.sh
#conda activate base
# -----------------------------------------------------------------------------
# 3. Install packages to use in packaged environment
# We install a few packages by default, and allow users to extend this list
# with a CLI flag:
# - dask-yarn >= 0.7.0, for deploying Dask on YARN.
# - pyarrow for working with hdfs, parquet, ORC, etc...
# - s3fs for access to s3
# - conda-pack for packaging the environment for distribution
# - ensure tornado 5, since tornado 6 doesn't work with jupyter-server-proxy
# -----------------------------------------------------------------------------
echo "Installing base packages"
conda install \
-c conda-forge \
-y \
-q \
dask-yarn \
s3fs \
conda-pack
pip3 install pyarrow
# -----------------------------------------------------------------------------
# 4. Package the environment to be distributed to worker nodes
# -----------------------------------------------------------------------------
echo "Packaging environment"
conda pack -q -o $HOME/environment.tar.gz
# -----------------------------------------------------------------------------
# 5. List all packages in the worker environment
# -----------------------------------------------------------------------------
echo "Packages installed in the worker environment:"
conda list
# -----------------------------------------------------------------------------
# 6. Configure Dask
# This isn't necessary, but for this particular bootstrap script it will make a
# few things easier:
# - Configure the cluster's dashboard link to show the proxied version through
# jupyter-server-proxy. This allows access to the dashboard with only an ssh
# tunnel to the notebook.
# - Specify the pre-packaged python environment, so users don't have to
# - Set the default deploy-mode to local, so the dashboard proxying works
# - Specify the location of the native libhdfs library so pyarrow can find it
# on the workers and the client (if submitting applications).
# ------------------------------------------------------------------------------
echo "Configuring Dask"
mkdir -p $HOME/.config/dask
cat <<EOT >> $HOME/.config/dask/config.yaml
distributed:
  dashboard:
    link: "/proxy/{port}/status"
yarn:
  environment: /home/hadoop/environment.tar.gz
  deploy-mode: local
  worker:
    env:
      ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
  client:
    env:
      ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
EOT
# Also set ARROW_LIBHDFS_DIR in ~/.bashrc so it's set for the local user
echo -e '\nexport ARROW_LIBHDFS_DIR=/usr/lib/hadoop/lib/native' >> $HOME/.bashrc
# -----------------------------------------------------------------------------
# 8. Install jupyter notebook server and dependencies
# We do this after packaging the worker environments to keep the tar.gz as
# small as possible.
# We install the following packages:
# - notebook: the Jupyter Notebook Server
# - ipywidgets: used to provide an interactive UI for the YarnCluster objects
# - jupyter-server-proxy: used to proxy the dask dashboard through the notebook server
# -----------------------------------------------------------------------------
echo "Installing Jupyter"
conda install \
-c conda-forge \
-y \
-q \
notebook \
ipywidgets \
jupyter-server-proxy \
# -----------------------------------------------------------------------------
# 9. List all packages in the client environment
# -----------------------------------------------------------------------------
echo "Packages installed in the client environment:"
conda list
# -----------------------------------------------------------------------------
# 10. Configure Jupyter Notebook
# -----------------------------------------------------------------------------
echo "Configuring Jupyter"
mkdir -p $HOME/.jupyter
HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"`
cat <<EOF >> $HOME/.jupyter/jupyter_notebook_config.py
c.NotebookApp.password = u'$HASHED_PASSWORD'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = ''
c.NotebookApp.port = 8888
EOF
# -----------------------------------------------------------------------------
# 11. Define an upstart service for the Jupyter Notebook Server
# This sets the notebook server up to properly run as a background service.
# -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook Upstart Service"
cat <<EOF > /tmp/jupyter-notebook.service
[Unit]
Description=Jupyter Notebook
[Service]
ExecStart=$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/jupyter_notebook_config.py
[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
sudo systemctl enable jupyter-notebook.service
# -----------------------------------------------------------------------------
# 12. Start the Jupyter Notebook Server
# -----------------------------------------------------------------------------
echo "Starting Jupyter Notebook Server"
sudo systemctl daemon-reload
sudo systemctl restart jupyter-notebook.service
#$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/jupyter_notebook_config.py
After this, I start Jupyter Notebook using $HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/jupyter_notebook_config.py
Jupyter Notebook starts successfully.
When I run this code in the notebook:
from dask_yarn import YarnCluster
from dask.distributed import Client
# Create a cluster
cluster = YarnCluster()
# Connect to the cluster
client = Client(cluster)
it gives the following error:
AttributeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 client = Client(cluster)
File ~/miniconda/lib/python3.9/site-packages/distributed/client.py:835, in Client.__init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
832 elif isinstance(getattr(address, "scheduler_address", None), str):
833 # It's a LocalCluster or LocalCluster-compatible object
834 self.cluster = address
--> 835 status = getattr(self.cluster, "status")
836 if status and status in [Status.closed, Status.closing]:
837 raise RuntimeError(
838 f"Trying to connect to an already closed or closing Cluster {self.cluster}."
839 )
AttributeError: 'YarnCluster' object has no attribute 'status'
Also, when I use LocalCluster instead of YarnCluster, it runs perfectly. I have been stuck on this for days; please help. Also, how do we configure the worker nodes?
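The traceback above ends at `getattr(self.cluster, "status")`, which is called without a default value, so any cluster object that lacks a `status` attribute raises `AttributeError`. The following is a minimal sketch that reproduces this behaviour without a cluster. `FakeCluster` is a hypothetical stand-in, not the real `YarnCluster`, and the two helper functions are illustrative only.

```python
class FakeCluster:
    """Hypothetical stand-in for a cluster object that has no `status`."""
    scheduler_address = "tcp://127.0.0.1:8786"


def read_status(cluster):
    # Mirrors the failing line in the traceback: getattr without a default
    # raises AttributeError when the attribute is missing.
    return getattr(cluster, "status")


def read_status_safely(cluster):
    # Supplying a default returns None instead of raising.
    return getattr(cluster, "status", None)


cluster = FakeCluster()
try:
    read_status(cluster)
    raised = False
    message = ""
except AttributeError as exc:
    raised = True
    message = str(exc)

safe_status = read_status_safely(cluster)
print(raised, message, safe_status)
```

This pattern suggests, as an assumption rather than a confirmed diagnosis, that the installed `distributed` expects an attribute that the installed `dask-yarn` does not provide. Comparing the two versions shown by `conda list` would be a reasonable first check.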


How to call Dockerized R Plumber API from an external source?

I have created an R Plumber API and deployed it in a Docker image.
FROM rocker/r-ver:${R_VERSION}
# install the linux libraries needed for plumber
RUN apt-get update -qq && apt-get install -y \
libssl-dev
COPY / /
# Making home & test folders
RUN Rscript required-packages/required-packages.R
# Giving permission to tests to run
RUN chmod +x tests/run_tests.sh
# Run Tests
RUN tests/run_tests.sh
# open port 7575 to traffic
EXPOSE 7575
# when the container starts, start the main.R script
ENTRYPOINT ["Rscript", "./main.R"]
library(plumber)
r <- plumb("./plumber.R")
r$run(port = 7575, host = "")
I run the container with the following command.
docker run --rm -p 7575:7575 'container-name'
On the machine, http://localhost:7575/echo works perfectly fine. However, I cannot call the API from an external computer with http://ip_address:7575/echo.
What could be the problem? As far as I know, the 7575 port is open.
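One way to test the assumption that port 7575 is reachable is to attempt a plain TCP connection from the external computer. The following is a minimal sketch using only the Python standard library. `port_is_open` is a hypothetical helper name, and the two-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("ip_address", 7575)` from the external machine and `port_is_open("localhost", 7575)` on the Docker host separates the two cases. If the first returns False while the second returns True, the likely causes are a firewall or security group in between, or the address the API binds to inside the container.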

SageMaker fails when trying to add Lifecycle Configuration for keeping custom environments persistent after restart

I want to create an environment in SageMaker on AWS with Miniconda and make it available as a kernel in Jupyter when I restart the session, but SageMaker keeps failing.
I followed the instructions found here.
Basically, they say:
"Create a custom, persistent Conda installation on the notebook instance's Amazon Elastic Block Store (Amazon EBS) volume: Run the on-create script in the terminal of an existing notebook instance. This script uses Miniconda to create a separate Conda installation on the EBS volume (/home/ec2-user/SageMaker/). Then, run the on-start script as a lifecycle configuration to make the custom environment available as a kernel in Jupyter. This method is recommended for more technical users, and it is a better long-term solution."
I run this on-create.sh script on the terminal on Jupyter:
set -e
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
# Install a separate conda installation via Miniconda
mkdir -p "$WORKING_DIR"
wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh -O "$WORKING_DIR/miniconda.sh"
bash "$WORKING_DIR/miniconda.sh" -b -u -p "$WORKING_DIR/miniconda"
rm -rf "$WORKING_DIR/miniconda.sh"
# Create a custom conda environment
source "$WORKING_DIR/miniconda/bin/activate"
conda create --yes --name "$KERNEL_NAME" python="$PYTHON"
conda activate "$KERNEL_NAME"
pip install --quiet ipykernel
# Customize these lines as necessary to install the required packages
conda install --yes numpy
pip install --quiet boto3
EOF
and it creates the "conda-test-env" environment as expected.
Then I add on-start.sh as a lifecycle configuration:
set -e
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
source "/home/ec2-user/SageMaker/custom-environments/miniconda/bin/activate"
conda activate conda-test-env
python -m ipykernel install --user --name "conda-test-env" --display-name "conda-test-env"
# Optionally, uncomment these lines to disable SageMaker-provided Conda functionality.
# echo "c.EnvironmentKernelSpecManager.use_conda_directly = False" >> /home/ec2-user/.jupyter/jupyter_notebook_config.py
# rm /home/ec2-user/.condarc
EOF
Then I update the instance with the new configuration,
and when I start my notebook instance, it fails after a few minutes.
I'd appreciate any help.

Running RStudio on Google Colab

Is there a way to install and run RStudio on Google Colab?
I am aware that it is possible to run R code on Google Colab, so I was wondering whether there is also a workaround to install and run RStudio.
I use Google Colab and RStudio from time to time. To make the setup easier, I usually copy and paste the following setup script (and use alternative 2).
After creating a new Google Colab Notebook, execute the following commands:
# Add new user called "rstudio" and define password (here "password123")
!sudo useradd -m -s /bin/bash rstudio
!echo rstudio:password123 | chpasswd
# Install R and RStudio Server (Don't forget to update version to latest version)
!apt-get update
!apt-get install r-base
!apt-get install gdebi-core
!wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.4.1103-amd64.deb
!gdebi -n rstudio-server-1.4.1103-amd64.deb
# ALTERNATIVE 1: Use ngrok
# Advantage: Runs in the background
# Disadvantage: Not so stable
# (often 429 errors during RStudio usage due to max 20 connections without account)
# Optionally register for a free account, which raises this limit to 40 connections:
# https://ngrok.com/pricing
# Install ngrok (https://ngrok.com/)
!wget -c https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -o ngrok-stable-linux-amd64.zip
# Run ngrok to tunnel RStudio app port 8787 to the outside world.
# This command runs in the background.
get_ipython().system_raw('./ngrok http 8787 &')
# Get the public URL where you can access the RStudio app. Copy this URL.
! printf "\n\nClick on the following link: "
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
# ==> To access the RStudio server
# - click on this link and
# - use the username "rstudio" and
# - the password you defined at the first cell ("password123" in this example).
# ALTERNATIVE 2 (preferred): Use localtunnel
# (see also: https://github.com/naru-T/RstudioServer_on_Colab/blob/master/Rstudio_server.ipynb)
# Advantage: Stable usage of RStudio
# Disadvantage: Does not run in the background (i.e. Colab blocked)
# Install localtunnel
!npm install -g npm
!npm install -g localtunnel
# Run localtunnel to tunnel RStudio app port 8787 to the outside world.
# This command does not run in the background; it blocks the cell while the tunnel is open.
!lt --port 8787
# ==> To access the RStudio server
# - click on this link,
# - click button "Click to Continue" on the "friendly reminder" page,
# - use the username "rstudio" and
# - the password you defined at the first cell ("password123" in this example).
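In alternative 1, the public URL is extracted from the JSON returned by ngrok's local API with a Python one-liner. The following is a minimal sketch of the same extraction as a function, with an explicit error when no tunnel has been reported yet. `first_public_url` is a hypothetical helper, and the sample payload is illustrative only; it keeps just the `tunnels` and `public_url` fields that the one-liner relies on.

```python
import json


def first_public_url(payload):
    """Return the public URL of the first tunnel in an ngrok API response."""
    tunnels = json.loads(payload).get("tunnels", [])
    if not tunnels:
        # The one-liner would fail with an IndexError in this situation.
        raise ValueError("no tunnels reported yet; is ngrok running?")
    return tunnels[0]["public_url"]


# Illustrative payload only; real responses contain more fields.
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io"}]}'
url = first_public_url(sample)
print(url)
```

An empty `tunnels` list usually just means the request was made before ngrok finished starting, so retrying after a short pause is a reasonable response to the `ValueError`.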

Get Access to Airflow Hooks within a jupyter notebook

I have Airflow running with the Postgres backend, and that works fine. I also have a Jupyter server running on the same host as Airflow. I thought I could simply access the Airflow hooks from a notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from airflow.hooks.mysql_hook import MySqlHook
mysql = MySqlHook(mysql_conn_id = 'mysql-x')
sql = "select 1+1"
But I get this exception message:
OperationalError: (sqlite3.OperationalError) no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?]
[parameters: ('mysql-x',)]
(Background on this error at: http://sqlalche.me/e/e3q8)
What makes me suspicious is that it not only fails to find the connection ID (which I can clearly see in the Airflow UI), it also reports sqlite3.OperationalError, so it looks like it is not even connected to the same Postgres database. I have checked os.environ["AIRFLOW_HOME"], which seems to be correct.
When I start the Jupyter notebook server after Airflow, so that all environment variables are set, I get a different error:
/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py in __get__(self, instance, owner)
352 def __get__(self, instance, owner):
--> 353 retval = self.descriptor.__get__(instance, owner)
354 # detect if this is a plain Python @property, which just returns
355 # itself for class level access. If so, then return us.
/usr/local/lib/python3.7/site-packages/airflow/models/connection.py in get_password(self)
188 "Can't decrypt encrypted password for login={}, \
189 FERNET_KEY configuration is missing".format(self.login))
--> 190 return fernet.decrypt(bytes(self._password, 'utf-8')).decode()
191 else:
192 return self._password
/usr/local/lib/python3.7/site-packages/cryptography/fernet.py in decrypt(self, msg, ttl)
169 except InvalidToken:
170 pass
--> 171 raise InvalidToken
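Both errors are consistent with the notebook process seeing a different Airflow configuration than the webserver. Airflow falls back to a SQLite database when no connection string is configured, and an `InvalidToken` from Fernet generally means the decryption key differs from the key used to encrypt. The following is a minimal sketch for checking which settings are visible inside the notebook. `airflow_env_report` is a hypothetical helper, and it assumes the configuration is supplied through environment variables rather than `airflow.cfg`.

```python
import os

# Environment variables that Airflow reads. If they are missing in the
# notebook process, Airflow uses airflow.cfg or its built-in defaults.
KEYS = (
    "AIRFLOW_HOME",
    "AIRFLOW__CORE__SQL_ALCHEMY_CONN",
    "AIRFLOW__CORE__FERNET_KEY",
)


def airflow_env_report(environ=None):
    """Map each key to 'set' or 'missing' without exposing secret values."""
    environ = os.environ if environ is None else environ
    return {key: ("set" if environ.get(key) else "missing") for key in KEYS}


print(airflow_env_report())
```

Running this in a notebook cell and in the shell that starts the webserver, then comparing the two reports, shows whether the notebook inherits the same settings. The report deliberately prints only `set` or `missing`, so the Fernet key and database password do not end up in notebook output.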
You can use this dockerfile to reproduce it:
FROM apache/airflow
USER root
# install mysql client
RUN apt-get update && apt-get install -y mariadb-client-10.3 unzip
# install mssql client and tools
RUN apt-get install -y curl gnupg libicu-dev libicu63
RUN curl https://packages.microsoft.com/keys/microsoft.asc -o key
RUN apt-key add < key
RUN curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/msprod.list
RUN apt-get update && apt-get install -y mssql-tools msodbcsql17 unixodbc unixodbc-dev unzip libunwind8
RUN curl -Lq 'https://go.microsoft.com/fwlink/?linkid=2108814' -o sqlpackage-linux-x64-latest.zip
RUN mkdir /opt/sqlpackage/ && unzip sqlpackage-linux-x64-latest.zip -d /opt/sqlpackage/
RUN chmod a+x /opt/sqlpackage/sqlpackage && ln -sfn /opt/sqlpackage/sqlpackage /usr/bin/sqlpackage
RUN ln -sfn /opt/mssql-tools/bin/sqlcmd /usr/bin/sqlcmd
# install notebooks
RUN pip install jupyterlab pandas numpy scikit-learn matplotlib pymssql
#RUN cat /entrypoint.sh
# start additional notebook server
# this is a dirty hack but for the sake of this prototype good enough
RUN sed -i -e's/\# Run the command/airflow scheduler \& \njupyter notebook --ip= --port=9000 --NotebookApp.token="" --NotebookApp.password="" \& \n/' /entrypoint
# switch back to airflow user
USER airflow
RUN airflow initdb
RUN alias ll='ls -al'

How to fix '404 (Not Found)' errors when sourcing CSS and Javascript files in ShinyProxy

I am trying to launch a Shiny app using ShinyProxy, something I have done many times before. However, this app is not correctly using any of the CSS or JS files that are required to make it run.
When I run the app manually with docker run -p 3838:3838 my_app everything works perfectly fine. However, when pointing ShinyProxy to the my_app image, the resulting app fails to load any CSS or JS files.
FROM openanalytics/r-base
MAINTAINER Daniel Beachnau "DannyBeachnau#gmail.com"
# Dependencies outside of R
RUN apt-get update && apt-get install -y \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
xtail \
wget \
libpq-dev \
libmariadb-client-lgpl-dev \
# Might be needed for the archivist R-Library
dbus \
systemd
# needed for odbc
RUN apt-get install apt-transport-https curl -y
RUN curl http://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update
RUN ACCEPT_EULA=Y apt-get install msodbcsql17 -y
# Download R-Packages
# tidyverse
RUN R -e "install.packages('tidyr')"
RUN R -e "install.packages('dplyr')"
RUN R -e "install.packages('readr')"
# Shiny Packages
RUN R -e "install.packages('shiny')"
RUN R -e "install.packages('shinycssloaders')"
RUN R -e "install.packages('shinydashboard')"
RUN R -e "install.packages('shinyWidgets')"
RUN R -e "install.packages('DT')"
RUN R -e "install.packages('shinyjs')"
RUN R -e "install.packages('flexdashboard')"
# Database Packages
RUN R -e "install.packages('odbc')"
RUN R -e "install.packages('RMySQL')"
# Other
RUN R -e "install.packages('devtools')"
RUN R -e "install.packages('lubridate')"
RUN R -e "install.packages('reshape2')"
RUN R -e "install.packages('grid')"
RUN R -e "install.packages('lemon')"
RUN R -e "install.packages('scales')"
RUN R -e "install.packages('ggthemes')"
RUN R -e "install.packages('ggplot2')"
RUN R -e "devtools::install_bitbucket(repo = 'my_repo/my_package', auth_user = 'my_username', password = 'my_password')"
# copy the app to the image
COPY . /root
# run the script to update the app data
RUN Rscript app_data_update.R
WORKDIR /root/app
COPY Rprofile.site /usr/lib/R/etc/
CMD ["R", "-e", "shiny::runApp('/root/app', host='', port=3838)"]
title: ShinyProxy Server
logo-url: /images/logo-image.png
landing-page: /
heartbeat-rate: 10000
heartbeat-timeout: 60000
container-wait-time: 60000
port: 8080
authentication: ldap
# Docker configuration
cert-path: /home/none
url: http://localhost:2375
port-range-start: 20000
container-log-path: ./container-logs
mail-to-address: DannyBeachnau#gmail.com
- name: my_apps_name
display-name: Shiny App
docker-image: dbeachnau/my_app
groups: [Shiny Users Management]
logo-url: /images/logo-image.png
container-volumes: ["/path/to/app:/root/app"]
Here is how the app looks in ShinyProxy.
Here is how my app looks when run manually.
The console in chrome's inspect tool is replete with errors such as
GET https://myshinyserver.com/container_name/font-awesome-5.3.1/css/all.min.css net::ERR_ABORTED 404 (Not Found)
I do have other apps running on ShinyProxy which display properly, but I cannot solve the difference between how those apps are configured to how this app is configured. Let me know if additional details are required for diagnosing the issue. All feedback is appreciated - thank you.
You're probably seeing this with Shiny v1.3.0, and not with earlier versions. If so, it's probably because of a misconfiguration in your NGINX proxy directives. I've written up the details here, but I'll also post the salient details here.
proxy_set_header Connection "upgrade";
This directive causes NGINX to add a Connection: upgrade header to every HTTP request, when it's only supposed to be used for WebSockets.
This line is recommended by NGINX Inc. themselves, however, those recommendations are intended for proxying of traffic that is exclusively WebSockets, whereas Shiny traffic is a combination of normal HTTP requests and WebSockets. Older versions of shiny/httpuv didn't mind this situation, but the new versions are stricter.
A correct configuration looks something like this:
http {
  map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
  }
  server {
    listen 80;
    location / {
      proxy_pass http://localhost:3838;
      proxy_redirect / $scheme://$http_host/;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection $connection_upgrade;
      proxy_read_timeout 20d;
      proxy_buffering off;
    }
  }
}
See the articles linked in the RStudio Community post for other examples.
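The `map` block in the configuration above chooses the `Connection` header per request: an empty `Upgrade` header from the client yields `close`, and any other value yields `upgrade`. The following is a minimal sketch of that decision in Python, for illustration only. `connection_header` is a hypothetical name and is not part of NGINX or Shiny.

```python
def connection_header(http_upgrade):
    """Mimic the nginx map: '' maps to 'close', anything else to 'upgrade'."""
    return "close" if http_upgrade == "" else "upgrade"


# A plain HTTP request for a CSS file sends no Upgrade header.
plain = connection_header("")
# A WebSocket handshake sends "Upgrade: websocket".
websocket = connection_header("websocket")
print(plain, websocket)
```

With the unconditional `proxy_set_header Connection "upgrade";` directive, both requests would carry `upgrade`, which is the situation the answer describes as breaking ordinary HTTP requests.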
You will have to install the requested font at the top of your Dockerfile. You can add it to your list "Dependencies outside of R":
sudo apt-get install fonts-font-awesome
I have solved my problem, although, this still may not count as a sufficient answer or explanation, because I cannot account for why this solution makes a difference. I decided to rewrite the Dockerfile using a different base image which now works. Nothing else in my code changed - just the Dockerfile. The working docker file is as follows:
FROM rocker/shiny-verse
# based on debian 9
MAINTAINER Daniel Beachnau "DannyBeachnau#gmail.com"
# Dependencies outside of R
RUN apt-get update && apt-get install -y \
gnupg2 \
apt-utils \
sudo \
gdebi-core \
libxt-dev \
xtail
# Install ODBC driver from microsoft
RUN apt-get install apt-transport-https curl -y
RUN curl http://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/debian/9/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update
RUN ACCEPT_EULA=Y apt-get install msodbcsql17 -y
# Download R-Packages
# Shiny Packages
RUN R -e "install.packages('shinycssloaders')"
RUN R -e "install.packages('shinydashboard')"
RUN R -e "install.packages('shinyWidgets')"
RUN R -e "install.packages('DT')"
RUN R -e "install.packages('shinyjs')"
RUN R -e "install.packages('flexdashboard')"
# Database Packages
RUN R -e "install.packages('odbc')"
RUN R -e "install.packages('RMySQL')"
# Other
RUN R -e "install.packages('lubridate')"
RUN R -e "install.packages('reshape2')"
RUN R -e "install.packages('scales')"
RUN R -e "install.packages('ggthemes')"
RUN R -e "install.packages('ggplot2')"
RUN R -e "devtools::install_bitbucket(repo = 'my_repo', auth_user = 'my_username', password = 'my_password')"
# copy the app to the image
COPY . /root
# run the script to update the app data
RUN Rscript app_data_update.R
WORKDIR /root/app
COPY Rprofile.site /usr/lib/R/etc/
CMD ["R", "-e", "shiny::runApp('/root/app', host='', port=3838)"]
If anyone has insight to why this behavior is observed I would love to hear it because I am baffled to say the least.
Regarding "why this solution makes a difference":
It seems to be an issue with the Shiny version; changing the base image has very probably fixed that.
See "Shiny apps not rendering after updated to v1.3.0".
