Differences between using a SageMaker notebook vs a Glue (SageMaker) notebook

I have a machine learning job I want to run with SageMaker. For data preparation and transformation, I am using some numpy and pandas steps in a notebook.
I noticed AWS Glue offers both SageMaker and Zeppelin notebooks, which can be created via a development endpoint.
I couldn't find much info online on the difference and the benefit of using one over the other (i.e. a SageMaker notebook importing from S3 vs creating the notebook from Glue).
From what I researched and tried, it seems that I can achieve the same thing with both:
SageMaker notebook, importing directly from S3 + further Python code to process the data
Glue (need to crawl and create a dataset) as shown here, create a dev endpoint and use a similar script to process the data.
Anyone able to shed light on this?

The question isn't clear, but let me explain these points.
When you launch a Glue development endpoint you can attach either a SageMaker notebook or a Zeppelin notebook. Both will be created and configured by Glue, and your script will be executed on the Glue dev endpoint.
If your question is "what is the difference between a SageMaker notebook created from the Glue console and a SageMaker notebook created from the SageMaker console?"
When you create a notebook instance from the Glue console, the created notebook will always have public internet access enabled. This blog explains the difference between the networking configurations of SageMaker notebooks. You also cannot create the notebook with a specific disk size, but you can stop the notebook once it's created and increase the disk size.
If your question is "what is the difference between SageMaker notebooks and Zeppelin notebooks?"
The answer is that the first uses Jupyter (very popular) while the second uses Zeppelin.
If your question is "what is the difference between using only a SageMaker notebook versus using a SageMaker notebook + a Glue dev endpoint?"
The answer is: if you are running plain pandas + numpy without Spark, a SageMaker notebook alone is much cheaper (if you use a small instance type and your data is relatively small). However, if you are trying to process a large dataset and you are planning to use Spark, then SageMaker notebook + Glue dev endpoint is the better option for developing the job, which can later be executed as a (serverless) Glue transformation job.
A SageMaker notebook alone is like running Python code on an EC2 instance, whereas SageMaker notebook + Glue is used to develop ETL jobs that you can later launch to process deltas.
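To make that second option more concrete, here is a minimal sketch of what a notebook attached to a Glue dev endpoint might run, using Glue's PySpark API. The database, table, and S3 path names below are placeholders, not anything from the question:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# On a Glue dev endpoint the notebook kernel talks to a Spark cluster,
# so the heavy lifting runs there rather than on the notebook instance.
glue_context = GlueContext(SparkContext.getOrCreate())

# "my_database" and "my_table" are placeholders for a crawled Glue Data Catalog table
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Convert to a Spark DataFrame for the usual transformations at scale
df = dyf.toDF().dropna()

# Write the prepared data back to S3 as Parquet (bucket/path is a placeholder)
out = DynamicFrame.fromDF(df, glue_context, "prepared")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/prepared/"},
    format="parquet",
)

The same script can usually be moved into a Glue job almost unchanged, which is the main reason to prototype against the dev endpoint rather than inside a plain SageMaker notebook.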

If you are using only numpy and pandas, function-wise it doesn't make a real difference. But it also depends on your data: if you want to work with data sitting in a Glue table, it would be easier to work with Zeppelin notebooks via an endpoint.
Cost-wise, I am pretty sure that SageMaker alone is less expensive.

Related

How to make sure my Jupyter notebook is runnable on any other computer or in any JupyterLab?

An analytic task has been given to me to solve in Python and return the result to the technical staff. I was asked to prepare the result in a Jupyter notebook such that the resulting code would be fully runnable and documented.
Honestly, I just started using Jupyter notebooks and generally found them pretty useful and convenient for generating reports that integrate code and figures. But I ran into some difficulty when I wanted to use specific packages like graphviz and dtreeviz, which went beyond a simple pip install xxx.
So, how should I make sure that my code is runnable when I do not know which packages are available in the destination Jupyter notebook of the next person who wants to run it, or when they want to run it in JupyterLab? Especially regarding these particular packages!
One solution to your problem would be to use Docker to develop and deploy your project.
You can define all your dependencies, create your project, and build a Docker image with them. With this image, you can be sure that anyone who uses it will have the same infrastructure as yours.
It shouldn't take you long to learn Docker, and it will help you in the future.
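If a full Docker setup is more than you need, a lighter-weight complement (my suggestion, not part of the answer above) is to let the notebook's first cell install its own dependencies into whichever kernel runs it. The package list below simply reuses the ones named in the question:

import subprocess
import sys

# Install the notebook's dependencies into the environment of the running kernel.
# graphviz and dtreeviz are the packages mentioned in the question; extend as needed.
for pkg in ["graphviz", "dtreeviz"]:
    subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

Note that this only covers Python packages: the graphviz bindings, for example, still expect the system Graphviz binaries to be present, which is exactly the kind of non-pip dependency a Docker image captures for you.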

Can people without R installed run an R Notebook file successfully?

I have an R Notebook that I am building to provide an analysis for somebody, and I am wondering if I should choose another option as I don't know if she will be able to run the Notebook without having R installed.
Is it possible to run an R Notebook as a single entity or must you have R installed in order to do it?
To rerun the notebook they need R. But the whole point of R Notebooks is that they produce a static document as output. That document (usually in HTML format) can be shared in isolation and does not require any additional software besides a web browser to be viewed.
The notebook will need R to run. Distributing a notebook without the R dependency is a bit more elaborate, for example by installing RStudio Server inside a Docker container. In that case, the user will need to have Docker installed and know how to start a container. From there on, the user can interact with the code through a web browser.
Another option would be to use the cloud solutions that some companies offer. They provide sharing functionality, and you don't have to worry about the infrastructure or distribution of your work. There are some free plans that may work for you, but the real power is in the premium features.

Notebook instance running R with a GPU

I am new to cloud computing and GCP. I am trying to create a notebook instance running R with a GPU. I got a basic instance with 1 core and 0 GPUs to start and I was able to execute some code which was cool. When I try to create an instance with a GPU I keep getting all sorts of errors about something called live migration, or that there are no resources available, etc. Can someone tell me how to start an R notebook instance with a GPU? It can't be this difficult.
CRAN (the Comprehensive R Archive Network) packages don't support GPUs out of the box. However, following this link might help you set up a notebook instance running R with a GPU: you need a machine with the NVIDIA GPU drivers installed, then install R and JupyterLab, and after that compile the R packages that require it for GPU use.

Is there a Jupyter code that I can use "to stop" a Notebook Instance on SageMaker, after running any code?

I use a Jupyter notebook instance on SageMaker to run code that takes around 3 hours to complete. Since I pay for the hours of use, I would like to automatically "Close and Halt" the notebook as well as stop the notebook instance after running that code. Is this possible?
Jupyter notebooks are predominantly designed for exploration and development. If you want to launch long-running or scheduled jobs on ephemeral hardware, it will be a much better experience to use the training API, such as create_training_job in boto3 or estimator.fit() in the Python SDK. The code passed to training jobs can be completely arbitrary - not necessarily ML code - so whatever you write in Jupyter could likely be scheduled and run in those training jobs. See the random forest sklearn demo here for an example. That being said, if you still want to programmatically shut down a SageMaker notebook instance, you can use this boto3 call:
import boto3

# Create a SageMaker client and stop the notebook instance by name.
# Replace 'string' with the name of your notebook instance.
sm = boto3.client('sagemaker')
sm.stop_notebook_instance(NotebookInstanceName='string')
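For completeness, here is a minimal sketch of the training-job route mentioned above, using the scikit-learn estimator from the SageMaker Python SDK. The entry-point script, instance type, and S3 path are placeholders, not details from the question:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# The training job runs train.py on ephemeral hardware and tears it down when finished,
# so nothing keeps billing after the long-running code completes.
estimator = SKLearn(
    entry_point="train.py",                 # placeholder: the script with your 3-hour code
    framework_version="0.23-1",             # a supported scikit-learn version
    instance_type="ml.m5.xlarge",           # placeholder instance type
    role=sagemaker.get_execution_role(),
)
estimator.fit({"train": "s3://my-bucket/input/"})  # placeholder S3 input path

This keeps the notebook instance for lightweight editing only, while the heavy work happens in the managed training job.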

Workflow for using command line R?

I am used to using R in RStudio. For a new project, I have to use R on the command line, because the data storage and analysis are only allowed on a specific server that I connect to using ssh. This server doesn't have rstudio-server installed to support remote RStudio sessions.
The project involves an extremely large dataset and some pre-written code to load/format the data, which I have been told to run using source() before I do anything else. Loading the data this way takes several minutes each time.
What would a good workflow be for something like this? Editing my code in a .R file, saving it, and then running it would take several minutes to load the data each time. But just running R in an interactive session would make it hard to keep track of what I am doing and to repeat things if necessary.
Is there some command-line equivalent to RStudio where you can have an interactive session but still edit/save a file of your code as you go?
Sounds like JuPyteR might be your friend here. The R kernel works great. You can use it on a remote server either by exposing an open port (and setting up Jupyter login credentials) or via port forwarding over SSH. It is a lot like an interactive REPL, except it holds state, and you can go back and rerun cells. (Of course, state can be dangerous for reproducibility.)
For RStudio, you can launch the console and ssh to your remote servers even if they don't run the paid RStudio Server platform. You can then execute commands from RStudio directly into the ssh session with the default shortcut key. This might allow you to keep using RStudio, track what you're doing in an R script, and execute it interactively.
