How to add Hudi Package to local AWS Glue Interactive Notebook - jupyter-notebook

I have set up Glue interactive sessions locally by following https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html.
However, I am not able to add any additional packages, such as Hudi, to the interactive session.
There are a few magic commands available, but I am not sure which one is appropriate or how to use it:
%additional_python_modules
%extra_jars
%extra_py_files

I am not able to comment on the question, so I am adding a link to a similar question that has received an answer.
Regarding the magic commands, you will find their descriptions once you start the Glue interactive notebook. I am also adding them here:
%additional_python_modules (List): Comma-separated list of pip packages, S3 paths, or private pip arguments.
%additional_python_modules ['path_to_pip_package_1', 'path_to_pip_package_2']
%extra_jars (List): Comma-separated list of additional jars to include in the cluster.
%extra_py_files (List): Comma-separated list of additional Python files from S3.

In my case, I have a few Python helper functions in *.py and *.zip files (the zip files also just contain *.py files). This works:
%extra_py_files 's3://bucket/a.py,s3://bucket/b.py,s3://bucket/c.zip'
%additional_python_modules doesn't work for me, so I assume this magic is for .whl files only ¯\_(ツ)_/¯
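For Hudi itself, %extra_jars is the magic to try, since Hudi ships as a Spark bundle jar rather than a pip package. A minimal sketch, where the bucket, path, and bundle version are placeholders you would need to replace with your own:
%extra_jars 's3://your-bucket/jars/hudi-spark3-bundle_2.12-0.12.1.jar'
Since the session runs on remote Glue workers, the jar presumably needs to be uploaded to S3 first rather than referenced from the local machine.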

Related

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .
# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two files outputs data, nor does the merge script explicitly import data; it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names, or functions with the same names from different packages. Another issue with this setup seems to be that the Makefile cannot propagate changes in the first two files downstream, because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default, R CMD BATCH will save your workspace to a hidden .RData file after running, unless you choose --no-save. That's why it's not really the recommended way to run an R script. The recommended way is with Rscript, which does not save by default; you must write code explicitly to save if that's what you want. This is different from the .Rout file, which should only contain the output from the commands run in the script.
In this case, execution doesn't happen in the exact same environment. R is still called three times, but the workspace is serialized to .RData and reloaded between each run.
You are correct that there may be a lot of problems with saving and re-loading workspaces by default. That's why most people recommend you do not do that. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
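One way to be explicit, sketched here with invented file and object names, is to have each cleaning script write its result with saveRDS() and have the merge script read those files back, so nothing depends on a restored workspace:
# clean1.R (hypothetical): write the cleaned data to disk
cleaned1 <- data.frame(id = 1:3, value = rnorm(3))
saveRDS(cleaned1, "data/cleaned1.rds")

# merge.R (hypothetical): read both inputs explicitly
cleaned1 <- readRDS("data/cleaned1.rds")
cleaned2 <- readRDS("data/cleaned2.rds")
merged <- merge(cleaned1, cleaned2, by = "id")
saveRDS(merged, "data/merged.rds")
The Makefile can then list data/cleaned1.rds and data/cleaned2.rds as prerequisites of the merge target, so changes in the upstream scripts propagate downstream.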

How to use Makefiles with R CMD build

I am developing an R package. It is based on a project that only used a Makefile. Most of it translated easily to the R CMD build workflow. However, the PDFs I need to create are a bit complex, and I don't get them right unless I tinker; so far I have only figured out how to do it with a Makefile.
In the R package documentation I find references to using Makefiles for sources and even for vignettes.
I don't grasp how these should be applied. From this documentation I had the impression that the Makefiles would be called as part of R CMD build, but when I put a Makefile in the described directories it is simply ignored. However, R CMD check recognises them and reports passing tests.
I have also seen some Makefiles that call R CMD build themselves, but I keep wondering how those would execute when I use install.packages. That doesn't seem right either: why would R CMD check the Makefiles if it didn't care about them? And there's also the page in R Packages about adding SystemRequirements: GNU make; why do that for a file you don't use?
So what is the best practice nowadays? And are there examples in the wild that I can look at?
Updates
As I was asked for an example:
I want to build a vignette similar to what is described in "Writing package vignettes": there is a master LaTeX file which includes several Rnw files.
The concrete dilemmas are:
How do I build the PDF vignette?
How can I enforce dependencies? Obviously the Rnw files need to be rendered first.
The Rnw files need slowly calculated data that is intended to go neither into the package nor into the repo (it's some gigabytes), but it is reused several times during the build.
So far I do it with a Makefile, the general pattern is like this:
tmp/test.pdf: tmp/test.tex tmp/rnw1.tex tmp/rnw2.tex
	latexmk -outdir=$(@D) $<
tmp/%.tex: r/%.rnw
	Rscript -e "knitr::knit('$<', output='$@')"
tmp/rnw1.tex tmp/rnw2.tex: tmp/slowdata.Rdata
tmp/slowdata.Rdata: r/ireallytakeforever.R
	Rscript $<
Bdecaf,
Ok, answer version 2.0 - chuckle.
You mentioned that "The question is how Makefiles and the package build workflow are supposed to go together". In that context, my recommendation is you review a set of example R package makefiles:
Makefile for Yihui Xie's knitr package for R.
Makefile for my R/qtlcharts package.
The knitr package Makefile (in my view) provides a good example of how to build vignettes. Review the Makefile and directory structure; that is the template I would recommend you review and use.
I'd also recommend you look at maker, a Makefile for R package development. On top of this, I would start with Karl Broman's guides (this is what I used myself as a source reference a while back; it is now eclipsed by Hadley's book on packages but still useful, in my view):
Minimal make: A minimal tutorial on Make
R package Primer.
The other recommendation is to read Rob Hyndman's article I referenced previously:
Makefiles for R/LaTeX projects
Between them, you should be able to do what you request. Above and beyond that, you have the base R packages manual you referenced.
I hope the above helps.
T.
Referenced pages:
minimal make: A minimal tutorial on make - Author: Karl Broman
I would argue that the most important tool for reproducible research is not Sweave or knitr but GNU make.
Consider, for example, all of the files associated with a manuscript. In the simplest case, I would have an R script for each figure plus a LaTeX file for the main text. And then a BibTeX file for the references.
Compiling the final PDF is a bit of work:
Run each R script through R to produce the relevant figure.
Run latex and then bibtex and then latex a couple more times.
And the R scripts need to be run before latex is, and only if they’ve changed.
A simple example
GNU make makes this easy. In your directory for the manuscript, you create a text file called Makefile that looks something like the following (here using pdflatex).
mypaper.pdf: mypaper.bib mypaper.tex Figs/fig1.pdf Figs/fig2.pdf
	pdflatex mypaper
	bibtex mypaper
	pdflatex mypaper
	pdflatex mypaper

Figs/fig1.pdf: R/fig1.R
	cd R;R CMD BATCH fig1.R

Figs/fig2.pdf: R/fig2.R
	cd R;R CMD BATCH fig2.R
Each batch of lines indicates a file to be created (the target), the files it depends on (the prerequisites), and then a set of commands needed to construct the target from the dependent files. Note that the lines with the commands must start with a tab character (not spaces).
Another great feature: in the example above, you’d only build fig1.pdf when fig1.R changed. And note that the dependencies propagate. If you change fig1.R, then fig1.pdf will change, and so mypaper.pdf will be re-built.
One oddity: if you need to change directories to run a command, do the cd on the same line as the related command. The following would not work:
### this doesn't work ###
Figs/fig1.pdf: R/fig1.R
	cd R
	R CMD BATCH fig1.R
You can, however, use \ for a continuation line, like so:
### this works ###
Figs/fig1.pdf: R/fig1.R
	cd R;\
	R CMD BATCH fig1.R
Note that you still need to use the semicolon (;).
Using GNU make
You probably already have GNU make installed on your computer. Type make --version in a terminal/shell to see. (On Windows, you may need to download and install make separately.)
To use make:
Go into the directory for your project.
Create the Makefile file.
Every time you want to build the project, type make.
In the example above, if you want to build fig1.pdf without building mypaper.pdf, just type make Figs/fig1.pdf.
Frills
You can go a long way with just simple make files as above, specifying the target files, their dependencies, and the commands to create them. But there are a lot of frills you can add, to save some typing.
Here are some of the options that I use. (See the make documentation for further details.)
Variables
If you’ll be repeating the same piece of code multiple times, you might want to define a variable.
For example, you might want to run R with the flag --vanilla. You could then define a variable R_OPTS:
R_OPTS=--vanilla
You refer to this variable as $(R_OPTS) (or ${R_OPTS}; either parentheses or curly braces are allowed), so in the R commands you would use something like
cd R;R CMD BATCH $(R_OPTS) fig1.R
An advantage of this is that you just need to type out the options you want once; if you change your mind about the R options you want to use, you just have to change them in the one place.
For example, I actually like to use the following:
R_OPTS=--no-save --no-restore --no-init-file --no-site-file
This is like --vanilla but without --no-environ (which I need because I use the .Renviron file to define R_LIBS, to say that I have R packages installed in an alternative directory).
Automatic variables
There are a bunch of automatic variables that you can use to save yourself a lot of typing. Here are the ones that I use most:
$@ the file name of the target
$< the name of the first prerequisite (i.e., dependency)
$^ the names of all prerequisites (i.e., dependencies)
$(@D) the directory part of the target
$(@F) the file part of the target
$(<D) the directory part of the first prerequisite (i.e., dependency)
$(<F) the file part of the first prerequisite (i.e., dependency)
For example, in our simple example, we could simplify the lines
Figs/fig1.pdf: R/fig1.R
	cd R;R CMD BATCH fig1.R
We could instead write
Figs/fig1.pdf: R/fig1.R
	cd $(<D);R CMD BATCH $(<F)
The automatic variable $(<D) will take the value of the directory of the first prerequisite, R in this case. $(<F) will take the value of the file part of the first prerequisite, fig1.R in this case.
Okay, that’s not really a simplification. There doesn’t seem to be much advantage to this, unless perhaps the directory were an obnoxiously long string and we wanted to avoid having to type it twice. The main advantage comes in the next section.
Pattern rules
If a number of files are to be built in the same way, you may want to use a pattern rule. The key idea is that you can use the symbol % as a wildcard, to be expanded to any string of text.
For example, our two figures are being built in basically the same way. We could simplify the example by including one set of lines covering both fig1.pdf and fig2.pdf:
Figs/%.pdf: R/%.R
	cd $(<D);R CMD BATCH $(<F)
This saves typing and makes the file easier to maintain and extend. If you want to add a third figure, you just add it as another dependency (i.e., prerequisite) for mypaper.pdf.
Our example, with the frills
Adding all of this together, here’s what our example Makefile will look like.
R_OPTS=--vanilla

mypaper.pdf: mypaper.bib mypaper.tex Figs/fig1.pdf Figs/fig2.pdf
	pdflatex mypaper
	bibtex mypaper
	pdflatex mypaper
	pdflatex mypaper

Figs/%.pdf: R/%.R
	cd $(<D);R CMD BATCH $(R_OPTS) $(<F)
The advantage of the added frills: less typing, and it’s easier to extend to include additional figures. The disadvantage: it’s harder for others who are less familiar with GNU Make to understand what it’s doing.
More complicated examples
There are complicated Makefiles all over the place. Poke around github and study them.
Here are some of my own examples:
Makefile for my AIL probabilities paper
Makefile for my phylo QTL paper
Makefile for my pre-CC probabilities paper
Makefile for a talk on interactive graphs.
Makefile for a talk on QTL mapping for function-valued traits.
Makefile for my R/qtlcharts package.
And here are some examples from Mike Bostock:
Makefile for us-rivers
Makefile for protovis
Makefile for topotree
Also look at the Makefile for Yihui Xie’s knitr package for R.
Also of interest is maker, a Makefile for R package development.
Resources
GNU make webpage
Official manual
O’Reilly Managing projects with GNU make book (part of the Open Books project)
Software carpentry’s make tutorial
Mike Bostock’s “Why Use Make”
GNU Make for reproducible data analysis by Zachary Jones
Makefiles for R/LaTeX projects by Rob Hyndman
R package primer: a minimal tutorial
A minimal tutorial on how to make an R package.
R packages are the best way to distribute R code and documentation, and, despite the impression that the official manual (Writing R Extensions) might give, they really are quite simple to create.
You should make an R package even for code that you don't plan to distribute. You'll find it is easier to keep track of your own personal R functions if they are in a package. And it's good to write documentation, even if it's just for your future self.
Hadley Wickham wrote a book about R packages (free online; also available in paper form from Amazon). You might just jump straight there.
Hilary Parker wrote a short and clear tutorial on writing R packages. If you want a crash course, you should start there. A lot of people have successfully built R packages from her instructions.
But there is value in having a diversity of resources, so I thought I'd go ahead and write my own minimal tutorial.
The following list of topics looks forbidding, but each is short and straightforward (and hopefully clear). If you're put off by the list of topics, and you've not already abandoned me in favor of Hadley's book, then why aren't you reading Hilary's tutorial?
If anyone's still with me, the following pages cover the essentials of making an R package.
Why write an R package?
The minimal R package
Building and installing an R package
Making it a proper package
Writing documentation with Roxygen2
Software licenses
Checking an R package
The following are important but not essential.
Putting it on GitHub
Getting it on CRAN
Writing vignettes
Writing tests
Including datasets
Connecting to other packages
The following contains links to other resources:
Further resources
If anything here is confusing (or wrong!), or if I've missed important details, please submit an issue, or (even better) fork the GitHub repository for this website, make modifications, and submit a pull request.
The source for this tutorial is on GitHub.
Also see my tutorials on git/GitHub, GNU make, knitr, making a web site with GitHub Pages, data organization, and reproducible research.

Sourcing So many R scripts

I have lots of .r scripts that I want to source all at once. I have written a function like the one below to source them.
sourcer <- function() {
  source("wil.r")
  source("k.r")
  source("l.r")
}
Can anyone please tell me how to get this code activated and how to call each script any time I want to use it?
In addition to the answer by @user2885462, if the amount of R code you need to source becomes bigger, you might want to wrap the code into an R package. This provides a convenient way of loading the code, and allows you to add tests, documentation, etc. Reading the official package writing tutorial is a good place to start for that.
For an individual project, I like to have all (or most) of my R functions in separate .r files, all in the same folder: e.g., AllFunctions
Then at the beginning of my main code I run the following line, which sources all .r files (and files with other matching extensions, if any exist, which they usually don't) in the AllFunctions folder:
for (nm in list.files("AllFunctions", pattern = ".[RrSsQq]$")) source(file.path("AllFunctions", nm))
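If you take the package route suggested in the first answer, a minimal sketch of loading the code while you develop (assuming a hypothetical package directory myhelpers/ with your functions in its R/ folder) would be:
# Load every function from the package source without installing it
devtools::load_all("myhelpers")
# Once the package is built and installed, a plain library() call replaces this
library(myhelpers)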

How to manage R extension / package documentation with grace (or at least without pain)

Now and then I wrap project-specific code in R packages. I use the documentation files as suggested by Writing R Extensions to document the application of the code.
So once you have set up your project and done all the editing of the .Rd files, how do you manage painless and clean versioning without rewriting or intensive copy-pasting of all the documentation files when the code or, even worse, the code structure changes?
To be more verbose, my current workflow is that I issue package.skeleton(), do the editing of the .Rd files, followed by R CMD check and R CMD build. When I make changes to my code I need to redo the above, maybe appending '.2.0.1' or whatever in order to preserve the precursor version. Before running the R CMD check command I need to repopulate all the .Rd files with great care in order to get a clean check and successful compilation of the TeX files. This is really silly and sometimes a real pain, e.g. if you want to address all the warnings or LaTeX has a bad day.
What tricks do you use? Please share your workflow.
The solution you're looking for is roxygen2.
RStudio provides a handy guide, but briefly you document your function in-line with the function definition:
#' Function doing something
#'
#' Extended description goes here
#'
#' @param x Input foo blah
#' @return A numeric vector of length one containing foo
myFunc <- function(x) NULL
If you're using RStudio (and maybe ESS too?), the Build Package command automagically creates the .Rd files for you. If not, you can read the roxygen2 documentation for the commands to generate the docs.
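For reference, a minimal sketch of generating the .Rd files from an R session, assuming the package source sits in a hypothetical directory mypackage/:
# Regenerate man/*.Rd (and NAMESPACE) from the roxygen comments
roxygen2::roxygenise("mypackage")
# or, if you prefer the devtools wrapper:
devtools::document("mypackage")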

Prevent line overflow in R documentation?

This is more of an annoyance than it is a problem, but is there any way to prevent the line "overflow" that occurs when documentation in R is compiled and a line is too long?
A snippet of some documentation created with R CMD Rd2pdf [options] files:
I can't find mention of this anywhere, and the only options for Rd2pdf are:
Options:
-h, --help print short help message and exit
-v, --version print version info and exit
--batch no interaction
--no-clean do not remove created temporary files
--no-preview do not preview generated PDF file
--encoding=enc use 'enc' as the default input encoding
--outputEncoding=outenc
use 'outenc' as the default output encoding
--os=NAME use OS subdir 'NAME' (unix or windows)
--OS=NAME the same as '--os'
-o, --output=FILE write output to FILE
--force overwrite output file if it exists
--title=NAME use NAME as the title of the document
--no-index don't index output
--no-description don't typeset the description of a package
--internals typeset 'internal' documentation (usually skipped)
Sorry for the spoiler: one solution is not to use roxygen2 in extremis for package maintenance. Why not maintain your DESCRIPTION manually? You don't need many changes, and it looks so much better...
The 'Collate' field is really not needed for the vast majority of packages.
Why not follow the tradition of many Bioconductor packages, 'Matrix', etc., of putting S4 class definitions (including reference classes) in a file 'AllClasses.R', and maybe use 'AllGenerics.R' as well? For the rest, collation order should not matter.
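To illustrate that layout, a minimal sketch with invented class, generic, and method names:
# R/AllClasses.R: S4 class definitions live here
setClass("MyData", slots = c(values = "numeric"))

# R/AllGenerics.R: generic declarations live here
setGeneric("summarize", function(object, ...) standardGeneric("summarize"))

# R/summarize-methods.R: methods can then sit in any later file
setMethod("summarize", "MyData", function(object, ...) mean(object@values))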
