Modules in Airflow

I have several classes defined in Python which represent jobs. In my orchestrator I define the needed functions for Airflow as follows:
from jobs.package.job import ToBeExecuted

def run_job(**context):
    ti = context['ti']
    date = context['ds']
    job = ToBeExecuted()
    input = ti.xcom_pull(task_ids='previous_job')
    output = 'output.csv'
    job.run(input, output, date)
    return output
As mentioned in the Airflow docs (https://pythonhosted.org/airflow/concepts.html?highlight=zip#packaged-dags), you cannot use external packages without packaging them.
But I just don't understand the described solution. You package everything in the zip folder, but then what? How do you launch it? How do you backfill it?
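For reference, this is roughly the layout I think the docs describe, with the DAG file at the root of the zip (a minimal sketch; the DAG id, file names and schedule below are just placeholders):

# my_dag.py -- sits at the root of my_dag.zip, next to the jobs/ package:
#
#   my_dag.zip
#   |-- my_dag.py
#   `-- jobs/
#       `-- package/
#           `-- job.py        <- contains ToBeExecuted
#
# The whole zip is then dropped into the dags folder ($AIRFLOW_HOME/dags/).
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from jobs.package.job import ToBeExecuted  # importable because the zip root is put on sys.path

def run_job(**context):
    ti = context['ti']
    date = context['ds']
    job = ToBeExecuted()
    input = ti.xcom_pull(task_ids='previous_job')
    output = 'output.csv'
    job.run(input, output, date)
    return output

dag = DAG('my_dag', start_date=datetime(2019, 1, 1), schedule_interval='@daily')

run_job_task = PythonOperator(
    task_id='run_job',
    python_callable=run_job,
    provide_context=True,  # Airflow 1.x: pass ti/ds into **context
    dag=dag,
)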

Related

R: Meaning of "extdata"?

Can someone please explain what "extdata" means in R?
For instance, I was looking at the "cronR" library in R (used for automatically scheduling jobs), and came across the term "extdata":
f <- system.file(package = "cronR", "extdata", "helloworld.R")
cmd <- cron_rscript(f)
cmd
cron_add(command = cmd, frequency = 'minutely',
         id = 'test1', description = 'My process 1', tags = c('lab', 'xyz'))
cron_add(command = cmd, frequency = 'daily', at='7AM', id = 'test2')
cron_njobs()
cron_ls()
cron_clear(ask=TRUE)
cron_ls()
Similarly, the "taskscheduleR" package (also used for automatically scheduling jobs) also makes reference to "extdata":
library(taskscheduleR)
myscript <- system.file("extdata", "helloworld.R", package = "taskscheduleR")
## run script once within 62 seconds
taskscheduler_create(taskname = "myfancyscript", rscript = myscript,
                     schedule = "ONCE", starttime = format(Sys.time() + 62, "%H:%M"))
My Question: Can someone please explain what "extdata" is? Is this just some "formality" that needs to be added to the system.file() command? Can someone please explain its relevance here?
Thanks!
References:
https://cran.r-project.org/web/packages/cronR/cronR.pdf
https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html
This is a convention, not a formally defined term. (However, it's a convention defined by the package authors and coded in the package structure; it's not something you can change unless you mess around with the package structure yourself.) "extdata" is presumably short for "external data".
However, this doesn't mean that you need to use "extdata" when you are structuring your own code; you only need it when finding the files that are included by the package. cron_rscript("~/my_cron_jobs/foo.R") should work fine (provided you actually have something there, and provided that the ~ == home directory shortcut works across OS, which I think it does).
system.file() takes a package argument, but otherwise strings its arguments together into a file path; i.e. system.file(package = "cronR", "extdata", "helloworld.R") means:
- look in the system folder that R has set up for the cronR package (in my case that is /usr/local/lib/R/site-library/cronR, but the precise location will vary by OS and configuration)
- within that folder, look in the extdata folder
- within that folder, look for helloworld.R
So this command will refer in my case to /usr/local/lib/R/site-library/cronR/extdata/helloworld.R.
Since "/" works as a path separator (at least when used from within R) for all current operating systems, you would get the same results from system.file(package="cronR", "extdata/helloworld.R")

Run preprocessor using nbconvert as a library

I would like to run nbconvert with a preprocessor that removes cells marked with the tag "skip". I am able to do this from the command line, but when I try to use the nbconvert API within a notebook I run into problems.
An example
Following the example in the documentation, I get a notebook to work with.
from urllib.request import urlopen
url = 'http://jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb'
response = urlopen(url).read().decode()
import nbformat
nb = nbformat.reads(response, as_version=4)
I'll modify one cell so it gets skipped in the output.
nb.cells[1].metadata = {'tags': ['skip']}
Command line
Saving the file, and then running nbconvert from the command line:
nbformat.write(nb, 'nb.ipynb')
%%bash
jupyter nbconvert --to latex \
--TagRemovePreprocessor.remove_cell_tags='{"skip"}' \
--TagRemovePreprocessor.enabled=True \
'nb.ipynb'
This works. The output nb.tex file does not contain the cell tagged "skip".
API
Now let's try it using the API instead. First, without any preprocessing:
from nbconvert import LatexExporter
latex, _ = LatexExporter().from_notebook_node(nb)
print(latex[:25])
\documentclass[11pt]{arti
Again, no problem. The conversion is working.
Now, trying to use the same preprocessor I used from the command line:
from traitlets.config import Config
c = Config()
c.RemovePreprocessor.remove_cell_tags = ('skip',)
c.LatexExporter.preprocessors = ['TagRemovePreprocessor']
LatexExporter(config=c).from_notebook_node(nb)
This time, I get:
ModuleNotFoundError: No module named 'TagRemovePreprocessor'
As far as I can see, this code matches the code sample in the documentation, except that I'm using the Latex exporter instead of HTML. So why isn't it working?
For your particular case, I believe you can resolve the issue by changing
c.RemovePreprocessor.remove_cell_tags = ('skip',)
to
c.TagRemovePreprocessor.remove_cell_tags = ('skip',)
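In full, the corrected call would then look roughly like this (a sketch; depending on the nbconvert version you may also need to enable the preprocessor explicitly, or reference it by its full path 'nbconvert.preprocessors.TagRemovePreprocessor'):
from traitlets.config import Config
from nbconvert import LatexExporter

c = Config()
c.TagRemovePreprocessor.remove_cell_tags = ('skip',)
c.TagRemovePreprocessor.enabled = True  # explicit enable, in case the exporter disables it by default
c.LatexExporter.preprocessors = ['TagRemovePreprocessor']

latex, _ = LatexExporter(config=c).from_notebook_node(nb)  # nb as built in the question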
For the benefit of others that come across this thread as I did by searching
ModuleNotFoundError: No module named 'TagRemovePreprocessor'
There is an open issue with TagRemovePreprocessor that causes all exporters other than the HTMLExporter (and LatexExporter?) to automatically disable this preprocessor.
In my case, I was attempting to use the NotebookExporter and needed to explicitly enable the preprocessor and change the preprocessing level like so:
import json
from traitlets.config import Config
from nbconvert import NotebookExporter
import nbformat
c = Config()
c.TagRemovePreprocessor.enabled = True  # add this line to enable the preprocessor
c.TagRemovePreprocessor.remove_cell_tags = ["del_cell"]
c.preprocessors = ['TagRemovePreprocessor']  # was previously: c.NotebookExporter.preprocessors
nb_body, resources = NotebookExporter(config=c).from_filename('notebook.ipynb')
nbformat.write(nbformat.from_dict(json.loads(nb_body)), 'stripped_notebook.ipynb', 4)

Writing a partitioned parquet file with SparkR

I've got two scripts, one in R and a short second one in pyspark that uses the output. I'm trying to copy that functionality into the first script for simplicity.
The second script is very simple -- read a bunch of csv files and emit them as partitioned parquet:
spark.read.csv(path_to_csv, header = True) \
    .repartition(partition_column).write \
    .partitionBy(partition_column).mode('overwrite') \
    .parquet(path_to_parquet)
This should be equally simple in R but I can't figure out how to match the partitionBy functionality in SparkR. I've got this so far:
library(SparkR); library(magrittr)
read.df(path_to_csv, 'csv', header = TRUE) %>%
    repartition(col = .$partition_column) %>%
    write.df(path_to_parquet, 'parquet', mode = 'overwrite')
This successfully writes one parquet file for each value of partition_column. The issue is the emitted files have the wrong directory structure; whereas Python produces something like
/path/to/parquet/
    partition_column=key1/
        file.parquet.gz
    partition_column=key2/
        file.parquet.gz
    ...
R produces only
/path/to/parquet/
    file_for_key1.parquet.gz
    file_for_key2.parquet.gz
    ...
Am I missing something? The partitionBy function in SparkR appears to apply only in the context of window functions, and I don't see anything else in the manual that could be related. Perhaps there's a way to pass something through ..., but I don't see any examples in the documentation or from a search online.
Partitioning of the output is not supported by SparkR in Spark <= 2.x.
However, it will be supported in SparkR >= 3.0.0 (SPARK-21291 - R partitionBy API), with the following syntax:
write.df(
    df, path_to_parquet, "parquet", mode = "overwrite",
    partitionBy = "partition_column"
)
Since the corresponding PR modifies only R files, you should be able to patch any SparkR 2.x distribution if upgrading to a development version is not an option:
git clone https://github.com/apache/spark.git
git checkout v2.4.3 # Or whatever branch you use
# https://github.com/apache/spark/commit/cb77a6689137916e64bc5692b0c942e86ca1a0ea
git cherry-pick cb77a6689137916e64bc5692b0c942e86ca1a0ea
R -e "devtools::install('R/pkg')"
In client mode this should be required only on the driver node. The install step may print some warnings, but these are not fatal and shouldn't cause any serious issues.

Use Azure custom-vision trained model with tensorflow.js

I've trained a model with Azure Custom Vision and downloaded the TensorFlow files for Android
(see: https://learn.microsoft.com/en-au/azure/cognitive-services/custom-vision-service/export-your-model). How can I use this with tensorflow.js?
I need a model (.pb file) and weights (.json file). However, Azure gives me a .pb file and a text file with tags.
From my research I understand that there are also different kinds of .pb files, but I can't find out which type Azure Custom Vision exports.
I found the tfjs converter. It is meant to convert a TensorFlow SavedModel (is the *.pb file from Azure a SavedModel?) or a Keras model to a web-friendly format. However, I need to fill in "output_node_names" (how do I get these?). I'm also not 100% sure whether my .pb file for Android is equal to a "tf_saved_model".
I hope someone has a tip or a starting point.
Just parroting what I said here to save you a click. I do hope that the option to export directly to tfjs is available soon.
These are the steps I did to get an exported TensorFlow model working for me:
Replace PadV2 operations with Pad. This Python function should do it. input_filepath is the path to the .pb model file and output_filepath is the full path of the updated .pb file that will be created.
import tensorflow as tf

def ReplacePadV2(input_filepath, output_filepath):
    graph_def = tf.GraphDef()
    with open(input_filepath, 'rb') as f:
        graph_def.ParseFromString(f.read())

    for node in graph_def.node:
        if node.op == 'PadV2':
            node.op = 'Pad'
            del node.input[-1]
            print("Replaced PadV2 node: {}".format(node.name))

    with open(output_filepath, 'wb') as f:
        f.write(graph_def.SerializeToString())
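For example (the file names here are just placeholders):
ReplacePadV2('model.pb', 'model_padfix.pb')  # write the patched graph to a new .pb, then convert that file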
Install tensorflowjs 0.8.6 or earlier. Converting frozen models is deprecated in later versions.
When calling the converter, set --input_format as tf_frozen_model and set output_node_names as model_outputs. This is the command I used.
tensorflowjs_converter --input_format=tf_frozen_model --output_json=true --output_node_names='model_outputs' --saved_model_tags=serve path\to\modified\model.pb folder\to\save\converted\output
Ideally, tf.loadGraphModel('path/to/converted/model.json') should now work (tested for tfjs 1.0.0 and above).
Partial answer:
Trying to achieve the same thing - here is the start of an answer on how to make use of the output_node_names:
tensorflowjs_converter --input_format=tf_frozen_model --output_node_names='model_outputs' model.pb web_model
I am not yet sure how to incorporate this into the same code - do you have anything, @Kasper Kamperman?

calling an Rscript from node.js

I have been trying to execute an Rscript from my node.js server. I tried to follow an example online, but I keep getting a null returned object, or sometimes the process keeps running forever. I have included the code snippet below. Thank you.
example.js:
var R = require("r-script");
var out = R("scripts/testScript.R")
    .data("hello world", 20)
    .callSync(function(err, resp){
        console.log(out);
    });
testScript.R file:
needs(magrittr)
set.seed(512)
do.call(rep, input) %>%
    strsplit(NULL) %>%
    sapply(sample) %>%
    apply(2, paste, collapse = "")
For Windows users:
You need to add R to Windows's %PATH% environment variable. The r-script package needs to call the "R" command from the CMD; if R.exe is not on the PATH, it will never be able to call the "R" command from anywhere.
Look up how to add environment variables in Windows, and remember: if the path to the folder containing the executables contains a space, it must be wrapped in double quotes, e.g. "C:\Program Files\R\R-3.3.2\bin\x64".
If you have already done this but the problem persists, I can only think of two reasons:
There's something wrong with your R method and it's giving an internal exception inside the R session.
The system can't find the file. Maybe check the file path.
You can use child processes in node to call other languages. I find it easiest to call Python from node, and use Python's subprocess module to then call R:
NODE
var spawn = require("child_process").spawn
var process = spawn('python',["call_r.py", script_choice, function_choice]);
This calls our call_r.py file passing along our script and function choices:
PYTHON (call_r.py)
import subprocess
import sys
script_choice = sys.argv[1]
function_choice = sys.argv[2]
call_script = 'R_Scripts/' + script_choice + '.R'
cmd = ['Rscript', call_script] + [function_choice]
result = subprocess.check_output(cmd, universal_newlines=True)
print(result)
sys.stdout.flush()
This parses the passed script and function choices, calling R via Python's subprocess module.
R (script that was chosen)
myArgs <- commandArgs(trailingOnly = TRUE)
function_choice <- myArgs[1]
# add your R functions here
eval(parse(text=function_choice))
Here, R parses the passed function choice and evaluates it. Note that arguments can be passed to the R function of choice by simply including them in the function argument (e.g. my_function('hey there'))
