Run preprocessor using nbconvert as a library - jupyter-notebook

I would like to run nbconvert with a preprocessor that removes cells marked with the tag "skip". I am able to do this from the command line, but when I try to use the nbconvert API within a notebook I run into problems.
An example
Following the example in the documentation, I get a notebook to work with.
from urllib.request import urlopen
url = 'http://jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb'
response = urlopen(url).read().decode()
import nbformat
nb = nbformat.reads(response, as_version=4)
I'll modify one cell so it gets skipped in the output.
nb.cells[1].metadata = {'tags': ['skip']}
Command line
Saving the file, and then running nbconvert from the command line:
nbformat.write(nb, 'nb.ipynb')
%%bash
jupyter nbconvert --to latex \
--TagRemovePreprocessor.remove_cell_tags='{"skip"}' \
--TagRemovePreprocessor.enabled=True \
'nb.ipynb'
This works. The output nb.tex file does not contain the cell tagged "skip".
API
Now let's try it using the API instead. First, without any preprocessing:
from nbconvert import LatexExporter
latex, _ = LatexExporter().from_notebook_node(nb)
print(latex[:25])
\documentclass[11pt]{arti
Again, no problem. The conversion is working.
Now, trying to use the same preprocessor I used from the command line:
from traitlets.config import Config
c = Config()
c.RemovePreprocessor.remove_cell_tags = ('skip',)
c.LatexExporter.preprocessors = ['TagRemovePreprocessor']
LatexExporter(config=c).from_notebook_node(nb)
This time, I get:
ModuleNotFoundError: No module named 'TagRemovePreprocessor'
As far as I can see, this code matches the code sample in the documentation, except that I'm using the Latex exporter instead of HTML. So why isn't it working?

For your particular case, I believe you can resolve the issue by changing
c.RemovePreprocessor.remove_cell_tags = ('skip',)
to
c.TagRemovePreprocessor.remove_cell_tags = ('skip',)
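Putting that together with the exporter from the question, a minimal sketch of the corrected configuration might look like this (I've spelled out the preprocessor by its full dotted path and enabled it explicitly; the rest matches the question's code):
from traitlets.config import Config
from nbconvert import LatexExporter

c = Config()
c.TagRemovePreprocessor.enabled = True                 # make sure the preprocessor runs
c.TagRemovePreprocessor.remove_cell_tags = ('skip',)   # drop cells tagged "skip"
# full dotted path so nbconvert imports the class, not a module named 'TagRemovePreprocessor'
c.LatexExporter.preprocessors = ['nbconvert.preprocessors.TagRemovePreprocessor']

latex, _ = LatexExporter(config=c).from_notebook_node(nb)
print(latex[:25])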
For the benefit of others who come across this thread, as I did, by searching for
ModuleNotFoundError: No module named 'TagRemovePreprocessor'
There is an open issue with TagRemovePreprocessor that causes all exporters other than the HTMLExporter (and LatexExporter?) to automatically disable this preprocessor.
In my case, I was attempting to use the NotebookExporter and needed to explicitly enable the preprocessor and set the preprocessors list at the top level of the config, like so:
import json
from traitlets.config import Config
from nbconvert import NotebookExporter
import nbformat
c = Config()
c.TagRemovePreprocessor.enabled = True  # add this line to enable the preprocessor
c.TagRemovePreprocessor.remove_cell_tags = ["del_cell"]
c.preprocessors = ['TagRemovePreprocessor']  # was previously: c.NotebookExporter.preprocessors
nb_body, resources = NotebookExporter(config=c).from_filename('notebook.ipynb')
nbformat.write(nbformat.from_dict(json.loads(nb_body)), 'stripped_notebook.ipynb', 4)

Related

ModuleNotFoundError using CaseReader in OpenMDAO

I'm trying to use a recorded case in OpenMDAO contained in "my_file.db". When I execute the following code:
import openmdao.api as om
cr = om.CaseReader('my_file.db')
I get the following error:
ModuleNotFoundError: No module named 'groups'
'groups' is a folder from the openMDAO code that I used to record the case and now I'm trying to import it from a different directory. How can I redefine the path for om.CaseReader to look for the modules it needs?
Try setting your PYTHONPATH, as discussed here:
https://bic-berkeley.github.io/psych-214-fall-2016/using_pythonpath.html
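A minimal sketch of that idea from inside the script itself, assuming the recorded model's 'groups' package lives under /path/to/original/project (a hypothetical path):
import sys

# make the 'groups' package importable before loading the recorded case
sys.path.append('/path/to/original/project')  # hypothetical: the directory that contains 'groups'

import openmdao.api as om
cr = om.CaseReader('my_file.db')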
Solved using:
import os
import sys

# add this script's directory to the module search path so the 'groups' package can be found
dirname = os.path.dirname(__file__)
sys.path.append(dirname)

How to run R Code within Python interface?

I am trying to run the code below, which should read a CSV file and then write it out as "sas7bdat". This is what I have tried.
The prerequisite R libraries are already installed on the system.
from rpy2 import robjects
robjects.r('''
library(haven)
data <- read_csv("filename.csv")
write_sas(data, "filename.sas7bdat")
''')
After running the code above, no output is generated and I am not getting any error either.
Expected output: read the .csv file and then export that data in .sas7bdat format (in a standard Python 3.9.2 editor).
Python does not have a native library for this, hence I am trying this way to export data in .sas7bdat format.
Please suggest some change to the above code, or any other way in Python through which I can create/export the .sas7bdat format.
Thanks.
I have experience using R in Python Jupyter Notebooks; it is a bit complicated at the beginning, but it does work. Here I have pasted my personal notes, and I hope they help:
# Major steps in installing "rpy2":
# Step 1: install R on Jupyter Notebook: conda install -c r r-essentials
# Step 2: install the "rpy2" Python package: pip install rpy2 (you may have to check the version)
# Step 3: create the environment variables: R_HOME, R_USER and R_LIBS_USER
# (you can set these environment variables in the system settings on your Windows PC, or set them in code every time)
# load the rpy2 module after installation
# Then you will be able to enable R cells within the Python Jupyter Notebook
# run this line in your Jupyter Notebook
%load_ext rpy2.ipython
My work was to do ggplot2 in Python, so I did:
# now use R to access this dataframe and plot it using ggplot2
# tell Jupyter Notebook that you are going to use R in this cell, passing in the "test_data" dataframe generated in Python
%%R -i test_data
library(ggplot2)
plot <- ggplot(test_data) +
geom_point(aes(x,y),size = 20)
plot
ggsave('test.png')
Before you run the code, please make sure that haven and readr are installed in your R kernel.
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage
string = """
write_sas <- function(file, col_names = TRUE, write_to){
    data <- readr::read_csv(file, col_names = col_names)
    haven::write_sas(data, path = write_to)
    print(paste("Data is written to ", write_to))
}
"""
rwrap = SignatureTranslatedAnonymousPackage(string, "rwrap")
rwrap.write_sas(file = "https://robjhyndman.com/data/ausretail.csv",
                col_names = False,
                write_to = "~/Downloads/filename.sas7bdat")
You can pass any of the R function's arguments, just as I used col_names here.
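If you prefer to stay close to the original robjects.r approach, a minimal sketch (assuming filename.csv sits in the working directory) is to load readr explicitly, since the snippet in the question calls read_csv without attaching the package that provides it:
from rpy2 import robjects

# read the CSV with readr and write it out with haven; both packages must be installed in R
robjects.r('''
library(readr)
library(haven)
data <- read_csv("filename.csv")
write_sas(data, "filename.sas7bdat")
''')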

How to convert Tensorflow Object Detection API model to TFLite?

I am trying to convert a TensorFlow Object Detection model (ssd-mobilenet-v2-fpnlite, from the TensorFlow 2 Detection Model Zoo) to TFLite. First, I train the model using model_main_tf2.py, and then I use export_tflite_graph_tf2.py to export a saved model (.pb). However, when it comes to converting the .pb file to .tflite, it throws this error:
OSError: SavedModel file does not exist at: /content/gdrive/My Drive/models/research/object_detection/fine_tuned_model/saved_model/saved_model.pb/{saved_model.pbtxt|saved_model.pb}
To convert the .pb file I used:
import os
import tensorflow as tf
SAVED_MODEL_PATH = os.path.join(os.getcwd(), 'object_detection', 'fine_tuned_model', 'saved_model', 'saved_model.pb')
# SAVED_MODEL_PATH: '/content/gdrive/My Drive/models/research/object_detection/exported_model/saved_model/saved_model.pb'
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_PATH)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.experimental_new_converter = True
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
open("detect.tflite", "wb").write(tflite_model)
or "tflite_convert" from command line, but with the same error. I also tried to run it with the latest tf-nightly version as it suggests here, but the outcome is the same. I tried to pass the path with various ways, it seems like the .pd is not well written (not the right file). Is there a way to manage to convert the model to tflite so as to implement it to android? Thank you!
Your saved_model path should be "/content/gdrive/My Drive/models/research/object_detection/fine_tuned_model/saved_model/". It is the folder you need to pass, not a file inside that folder.
For a quick test, try typing this in the terminal:
tflite_convert \
  --saved_model_dir="path to saved_folder" \
  --output_file="path to tflite file you want to save"
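Applying the same fix to the Python snippet from the question, a sketch of the corrected call (same paths as above, just without the trailing file name) would be:
import os
import tensorflow as tf

# point the converter at the saved_model directory, not at the .pb file inside it
SAVED_MODEL_DIR = os.path.join(os.getcwd(), 'object_detection', 'fine_tuned_model', 'saved_model')

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()

with open('detect.tflite', 'wb') as f:
    f.write(tflite_model)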
I don't have enough reputation to just comment, but the problem here seems to be your SAVED_MODEL_PATH.
You could try hardcoding the path and removing the .pb file name from it. I don't remember exactly what the trick is here, but it's definitely due to the path.

Writing a partitioned parquet file with SparkR

I've got two scripts, one in R and a short second one in pyspark that uses the output. I'm trying to copy that functionality into the first script for simplicity.
The second script is very simple -- read a bunch of csv files and emit them as partitioned parquet:
spark.read.csv(path_to_csv, header = True) \
    .repartition(partition_column).write \
    .partitionBy(partition_column).mode('overwrite') \
    .parquet(path_to_parquet)
This should be equally simple in R but I can't figure out how to match the partitionBy functionality in SparkR. I've got this so far:
library(SparkR); library(magrittr)
read.df(path_to_csv, 'csv', header = TRUE) %>%
    repartition(col = .$partition_column) %>%
    write.df(path_to_parquet, 'parquet', mode = 'overwrite')
This successfully writes one parquet file for each value of partition_column. The issue is that the emitted files have the wrong directory structure; whereas Python produces something like
/path/to/parquet/
    partition_column=key1/
        file.parquet.gz
    partition_column=key2/
        file.parquet.gz
    ...
R produces only
/path/to/parquet/
    file_for_key1.parquet.gz
    file_for_key2.parquet.gz
    ...
Am I missing something? The partitionBy function in SparkR appears to refer only to the context of window functions, and I don't see anything else in the manual that could be related. Perhaps there's a way to pass something in ..., but I don't see any examples in the documentation or from searching online.
Partitioning of the output is not supported in Spark <= 2.x.
However, it will be supported in SparkR >= 3.0.0 (SPARK-21291 - R partitionBy API), with the following syntax:
write.df(
    df, path_to_parquet, "parquet", mode = "overwrite",
    partitionBy = "partition_column"
)
Since the corresponding PR modifies only R files, you should be able to patch any SparkR 2.x distribution if upgrading to the development version is not an option:
git clone https://github.com/apache/spark.git
git checkout v2.4.3 # Or whatever branch you use
# https://github.com/apache/spark/commit/cb77a6689137916e64bc5692b0c942e86ca1a0ea
git cherry-pick cb77a6689137916e64bc5692b0c942e86ca1a0ea
R -e "devtools::install('R/pkg')"
In client mode this should be required only on the driver node. You may see some warnings along the way, but these are not fatal and shouldn't cause any serious issues.

Modules in Airflow

I have several classes defined in Python which represent jobs. In my orchestrator I define the needed functions for Airflow as follows:
from jobs.package.job import ToBeExecuted

def run_job(**context):
    ti = context['ti']
    date = context['ds']
    job = ToBeExecuted()
    input = ti.xcom_pull(task_ids='previous_job')
    output = 'output.csv'
    job.run(input, output, date)
    return output
As mentioned in the Airflow docs (https://pythonhosted.org/airflow/concepts.html?highlight=zip#packaged-dags), you cannot use external packages without packaging them together with your DAGs.
But I just don't understand the described solution. You package everything into the zip file, but then what? How do you launch it? How do you backfill it?
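For what it's worth, here is only a sketch of what the packaged-DAGs docs describe (with made-up file names, not something from this thread): the DAG file sits at the root of the zip and the extra packages sit next to it inside the same zip, roughly
my_dags.zip
    my_dag.py        <- at the zip root; defines the DAG and does `from jobs.package.job import ToBeExecuted`
    jobs/
        __init__.py
        package/
            __init__.py
            job.py
You then drop the zip into the dags folder in place of a plain .py file; the scheduler parses it like any other DAG file, so launching and backfilling work the same way as for an unpackaged DAG.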
