Get 'make html' to process jupyter notebooks with markdown instead of with restructuredtext - jupyter-notebook

The GitHub repo on which I work with many others contains many Python files and about ten Jupyter notebooks (JNs). 'make html' currently assumes that the markdown JN cells are written in reST, which can produce meaningless and ugly results. Is it possible to configure sphinx (or maybe nbsphinx??) so that the Read the Docs output for the JN markdown cells is rendered using markdown (preferably in the JN flavour)?
There is a website https://gist.github.com/dupuy/1855764 that addresses this problem by discussing constructs that are common to markdown and reST, but the document is at least 10 years old. For example, it lacks the "click here" link construct that works in both markup languages, namely:
[click here](urlname).
There remain in our JNs constructs that do not seem to have a common syntax that produces decent rendering in both markdown and reST, or at least I have not been successful in searching for one. An example (and there may be others) is making nested lists, especially lists without numbers or bullets.
An alternative would be for the JN text cells to be rendered by reST. There is a website https://nbsphinx.readthedocs.io/en/0.8.8/raw-cells.html that explains how to use reST in a JN. I have two problems with that. Firstly, JNs in our environment do not behave as this website explains, and I do not know how to change our environment (configuration files for JNs??) to make them behave in the way claimed by the website. Secondly, our JNs are designed to be used by naive users (even more naive than I am), and so the JN must work when the naive user uses JN in its "out of the box" configuration.

Related

Where can I find information on how to structure long R code? [duplicate]

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:
1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.
2. The analyst downloads some data, munges it, and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).
3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).
4. Rinse and repeat until the tables and graphics meet QA/QC and satisfy the client.
5. Write the report incorporating the tables and graphics.
6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data with a new download (e.g. getting the building permits from the last year) and pressing a "RECALCULATE" button, unless specifications change.
At the moment, I just start a directory and ad-hoc it the best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ARCGIS, R, and Unix tools.
Thanks!
PS:
Below is a basic Makefile that checks for dependencies between various intermediate datasets (.RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that this file is newer than all the files / targets that depend on it, and execute the given scripts in order to update them accordingly. This is still a work in progress: I still need to add a step for loading into a SQL database, and a step for a templating language like Sweave. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!
http://www.gnu.org/software/make/manual/html_node/index.html#Top
R=/home/wsprague/R-2.9.2/bin/R

persondata.RData : ImportData.R ../../DATA/ss07por.csv Functions.R
	$R --slave -f ImportData.R

persondata.Munged.RData : MungeData.R persondata.RData Functions.R
	$R --slave -f MungeData.R

report.txt : TabulateAndGraph.R persondata.Munged.RData Functions.R
	$R --slave -f TabulateAndGraph.R > report.txt
I generally break my projects into 4 pieces:
load.R
clean.R
func.R
do.R
load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
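To make the split concrete, here is a minimal sketch of a do.R that ties the four files together; the object and function names (raw, dat, run_analysis) are invented for illustration and are not from the answer above.
# do.R -- illustrative sketch of the four-file layout
source("load.R")    # reads raw data from files/URLs/ODBC into `raw`
source("clean.R")   # handles missing values, merges, outliers; produces `dat`
source("func.R")    # defines analysis functions only, no side effects
results <- run_analysis(dat)   # hypothetical function defined in func.R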
If you'd like to see some examples, I have a few small (and not so small) data cleaning and analysis projects available online. In most, you'll find a script to download the data, one to clean it up, and a few to do exploration and analysis:
Baby names from the social security administration
30+ years of fuel economy data from the EPA
A big collection of data about the housing crisis
Movie ratings from the IMDB
House sale data in the Bay Area
Recently I have started numbering the scripts, so it's completely obvious in which order they should be run. (If I'm feeling really fancy I'll sometimes make it so that the exploration script will call the cleaning script which in turn calls the download script, each doing the minimal work necessary - usually by checking for the presence of output files with file.exists. However, most times this seems like overkill).
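As a rough sketch of that "minimal work" check (script and file names here are hypothetical):
# at the top of the exploration script
if (!file.exists("data/bnames.csv")) source("0-download.R")   # fetch raw data
if (!file.exists("data/bnames.rds")) source("1-clean.R")      # clean and cache
bnames <- readRDS("data/bnames.rds")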
I use git (a source code management system) for all my projects, so it's easy to collaborate with others, see what is changing, and easily roll back to previous versions.
If I do a formal report, I usually keep R and LaTeX separate, but I always make sure that I can source my R code to produce all the code and output that I need for the report. For the sorts of reports that I do, I find this easier and cleaner than working with LaTeX.
I agree with the other responders: Sweave is excellent for report writing with R. And rebuilding the report with updated results is as simple as re-calling the Sweave function. It's completely self-contained, including all the analysis, data, etc. And you can version control the whole file.
I use the StatET plugin for Eclipse for developing the reports, and Sweave is integrated (Eclipse recognizes LaTeX formatting, etc.). On Windows, it's easy to use MiKTeX.
I would also add, that you can create beautiful reports with Beamer. Creating a normal report is just as simple. I included an example below that pulls data from Yahoo! and creates a chart and a table (using quantmod). You can build this report like so:
Sweave(file = "test.Rnw")
Here's the Beamer document itself:
%
\documentclass[compress]{beamer}
\usepackage{Sweave}
\usetheme{PaloAlto}
\begin{document}
\title{test report}
\author{john doe}
\date{September 3, 2009}
\maketitle
\begin{frame}[fragile]\frametitle{Page 1: chart}
<<echo=FALSE,fig=TRUE,height=4, width=7>>=
library(quantmod)
getSymbols("PFE", from="2009-06-01")
chartSeries(PFE)
@
\end{frame}
\begin{frame}[fragile]\frametitle{Page 2: table}
<<echo=FALSE,results=tex>>=
library(xtable)
xtable(PFE[1:10,1:4], caption = "PFE")
@
\end{frame}
\end{document}
I just wanted to add, in case anyone missed it, that there's a great post on the learnr blog about creating repetitive reports with Jeffrey Horner's brew package. Matt and Kevin both mentioned brew above. I haven't actually used it much myself.
The post follows a nice workflow, so it's well worth a read:
Prepare the data.
Prepare the report template.
Produce the report.
Actually producing the report once the first two steps are complete is very simple:
library(tools)
library(brew)
brew("population.brew", "population.tex")
texi2dvi("population.tex", pdf = TRUE)
For creating custom reports, I've found it useful to incorporate many of the existing tips suggested here.
Generating reports:
A good strategy for generating reports involves the combination of Sweave, make, and R.
Editor:
Good editors for preparing Sweave documents include:
StatET and Eclipse
Emacs and ESS
Vim and Vim-R
R Studio
Code organisation:
In terms of code organisation, I find two strategies useful:
Read up about analysis workflow (e.g., ProjectTemplate, Josh Reich's ideas, and my own presentation on R workflow: slides and video)
Study example reports and discern the workflow
Hadley Wickham's examples
My examples on github
Examples of reproducible research listed on Cross Validated
I use Sweave for the report-producing side of this, but I've also been hearing about the brew package - though I haven't yet looked into it.
Essentially, I have a number of surveys for which I produce summary statistics. Same surveys, same reports every time. I built a Sweave template for the reports (which takes a bit of work). But once that work is done, I have a separate R script that lets me point it at the new data. I press "Go", Sweave dumps out a few score .tex files, and I run a little Python script to pdflatex them all. My predecessor spent ~6 weeks each year on these reports; I spend about 3 days (mostly on cleaning data; escape characters are hazardous).
It's very possible that there are better approaches now, but if you do decide to go this route, let me know - I've been meaning to put up some of my Sweave hacks, and that would be a good kick in the pants to do so.
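For what it's worth, a driver along these lines might look like the sketch below; the file names are invented, and tools::texi2pdf stands in for the little Python/pdflatex script described above.
# run_reports.R -- hypothetical "press Go" script
library(tools)
new_data <- "surveys/2024.csv"        # point the template at the new data
Sys.setenv(SURVEY_DATA = new_data)    # assumes the .Rnw template reads this variable
Sweave("summary_template.Rnw")        # writes summary_template.tex
texi2pdf("summary_template.tex")      # stand-in for the separate pdflatex step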
I'm going to suggest something in a different sort of direction from the other submitters, based on the fact that you asked specifically about project workflow, rather than tools. Assuming you're relatively happy with your document-production model, it sounds like your challenges really may be centered more around issues of version tracking, asset management, and review/publishing process.
If that sounds correct, I would suggest looking into an integrated ticketing/source management/documentation tool like Redmine. Keeping related project artifacts such as pending tasks, discussion threads, and versioned data/code files together can be a great help even for projects well outside the traditional "programming" bailiwick.
Agreed that Sweave is the way to go, with xtable for generating LaTeX tables. Although I haven't spent too much time working with them, the recently released tikzDevice package looks really promising, particularly when coupled with pgfSweave (which, as far as I know is only available on rforge.net at this time -- there is a link to r-forge from there, but it's not responding for me at the moment).
Between the two, you'll get consistent formatting between text and figures (fonts, etc.). With brew, these might constitute the holy grail of report generation.
At a more "meta" level, you might be interested in the CRISP-DM process model.
"make" is great because (1) you can use it for all your work in any language (unlike, say, Sweave and Brew), (2) it is very powerful (enough to build all the software on your machine), and (3) it avoids repeating work. This last point is important to me because a lot of the work is slow; when I latex a file, I like to see the result in a few seconds, not the hour it would take to recreate the figures.
I use project templates along with RStudio; currently mine contains the following folders:
info : pdfs, powerpoints, docs... which won't be used by any script
data input : data that will be used by my scripts but not generated by them
data output : data generated by my scripts for further use but not as a proper report.
reports : Only files that will actually be shown to someone else
R : All R scripts
SAS : Because I sometimes have to :'(
I wrote custom functions so I can call smart_save(x,y) or smart_load(x) to save or load RDS files to and from the data output folder (files named with variable names) so I'm not bothered by paths during my analysis.
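The original functions aren't shown, but a minimal version might look like this; the second argument of smart_save() is assumed here to be the destination folder.
# assumed implementations, not the author's actual code
smart_save <- function(x, folder = "data output") {
  saveRDS(x, file.path(folder, paste0(deparse(substitute(x)), ".RDS")))
}
smart_load <- function(x, folder = "data output") {
  name <- deparse(substitute(x))
  assign(name, readRDS(file.path(folder, paste0(name, ".RDS"))),
         envir = parent.frame())
}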
A custom function new_project creates a numbered project folder, copies all the files from the template, renames the RProj file, edits the setwd calls, and sets the working directory to the new project.
All R scripts are in the R folder, structured as follows:
00_main.R
setwd
calls scripts 1 to 5
00_functions.R
All functions, and only functions, go there; if there are too many I'll split them into several files, all named like 00_functions_something.R. In particular, if I plan to make a package out of some of them, I'll put those apart
00_explore.R
a bunch of script chunks where I'm testing things or exploring my data
It's the only file where I'm allowed to be messy.
01_initialize.R
Prefilled with a call to a more general initialize_general.R script from my template folder which loads the packages and data I always use and don't mind having in my workspace
loads 00_functions.R (prefilled)
loads additional libraries
sets global variables
02_load data.R
loads csv / txt / xlsx / RDS files; there's a prefilled commented line for every type of file
displays which files have been created in the workspace
03_pull data from DB.R
Uses dbplyr to fetch filtered and grouped tables from the DB
some prefilled commented lines to set up connections and fetch.
Keeps client-side operations to a bare minimum
No server-side operations outside of this script
Displays which files have been created in the workspace
Saves these variables so they can be reloaded faster
Once it's been done once, I switch off a query_db boolean and the data will be reloaded from RDS next time (see the sketch below).
It can happen that I have to feed data back to the DBs; if so, I'll create additional steps.
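A sketch of that query_db switch (connection details, table and file names are placeholders, not from the original template):
# inside 03_pull data from DB.R -- illustrative only
query_db <- FALSE                    # flip to TRUE to hit the database again
if (query_db) {
  con    <- DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")
  orders <- dplyr::collect(dplyr::filter(dplyr::tbl(con, "orders"), year == 2020))
  DBI::dbDisconnect(con)
  saveRDS(orders, "data output/orders.RDS")
} else {
  orders <- readRDS("data output/orders.RDS")   # reload the cached copy
}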
04_Build.R
Data wrangling, all the fun dplyr / tidyr stuff goes there
displays which files have been created in the workspace
saves these variables
Once it's been done once, I switch off a build boolean and the data will be reloaded from RDS next time.
05_Analyse.R
Summarize, model...
writes Excel and csv report files
95_build ppt.R
template for a PowerPoint report using officer
96_prepare markdown.R
setwd
load data
set markdown parameters if needed
render
97_prepare shiny.R
setwd
load data
set shiny parameters if needed
runApp
98_Markdown report.Rmd
A report template
99_Shiny report.Rmd
An app template
For writing a quick preliminary report or email to a colleague, I find that it can be very efficient to copy-and-paste plots into MS Word or an email or wiki page -- often best is a bitmapped screenshot (e.g. on mac, Apple-Shift-(Ctrl)-4). I think this is an underrated technique.
For a more final report, writing R functions to easily regenerate all the plots (as files) is very important. It does take more time to code this up.
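For example, such a function might look like the following sketch; the plots and file names are placeholders.
# regenerate all report figures as files -- purely illustrative
make_figures <- function(dat, outdir = "figures") {
  dir.create(outdir, showWarnings = FALSE)
  png(file.path(outdir, "age_histogram.png"), width = 800, height = 600)
  hist(dat$age, main = "Age distribution")
  dev.off()
  png(file.path(outdir, "income_vs_age.png"), width = 800, height = 600)
  plot(dat$age, dat$income, xlab = "Age", ylab = "Income")
  dev.off()
}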
On the larger workflow issues, I like Hadley's answer on enumerating the code/data files for the cleaning and analysis flow. All of my data analysis projects have a similar structure.
I'll add my voice to Sweave. For complicated, multi-step analyses you can use a makefile to specify the different parts. This can prevent having to repeat the whole analysis if just one part has changed.
I also do what Josh Reich does, only I do it by creating my personal R packages, as this helps me structure my code and data, and it is also quite easy to share those with others.
create my package
load
clean
functions
do
creating my package: devtools::create('package_name')
load and clean: I create scripts in the data-raw/ subfolder of my package for loading, cleaning, and storing the resulting data objects in the package using devtools::use_data(object_name). Then I compile the package.
From now on, calling library(package_name) makes these data available (and they are not loaded until necessary).
functions: I put the functions for my analyses into the R/ subfolder of my package, and export only those that need to be called from outside (and not the helper functions, which can remain invisible).
do: I create a script that uses the data and functions stored in my package.
(If the analyses only need to be done once, I can put this script as well into the data-raw/ subfolder, run it, and store the results in the package to make it easily accessible.)
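Put together, the workflow might look roughly like this; the package and object names are invented for illustration.
# one-time setup
devtools::create("myanalysis")
# data-raw/prepare_survey.R -- load, clean, and store the data object
# (run from inside the package directory)
survey <- read.csv("data-raw/survey.csv")
survey <- survey[!is.na(survey$age), ]            # minimal cleaning step
devtools::use_data(survey, overwrite = TRUE)      # writes data/survey.rda
# do: after rebuilding/installing the package
library(myanalysis)
summary(survey)                                   # data is lazily loaded from the package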

emacs: using mmm-mode to combine markdown-mode and ESS for editing rmarkdown files

I'm playing with mmm-mode to combine markdown-mode and ESS for editing Rmarkdown files. I'm using GNU Emacs 24.3 on Windows 7 and up-to-date versions of the aforementioned modes. This is what I've got in my .emacs file:
(require 'mmm-mode) ;;; possibly init with (require 'mmm-auto) instead
(mmm-add-classes
 '((rmarkdown
    :submode r-mode
    :face mmm-declaration-submode-face
    :front "^```[{]r.*[}] *$"
    :back "^``` *$")))
(setq mmm-global-mode 'maybe)
(mmm-add-mode-ext-class 'markdown-mode "\\.rmd\\'" 'rmarkdown)
That works, in so far as within a buffer showing an Rmarkdown file, R code blocks are recognized and I get proper syntactically aware font-locking within both R code blocks and markdown blocks. Moreover, when I have the point in an R code block I get the ESS and Imenu-R menus, and when it's in a markdown region I get a markdown menu. So far so good.
Here are my issues. Within R code blocks electric left assignment doesn't work. I can't simply hit the underscore key to get '<-' and to toggle between that and '_'.
Also, I don't get syntactically aware auto indentation for R code.
Both of these things work when I'm using ESS to edit files containing pure R code.
Any thoughts on how to tune this up? I'm aware of this previous post from nearly a year ago: How can I use Emacs ESS mode with R markdown? and the pointer to polymode, but polymode seems to be advancing slowly. I've also seen other pointers to org-mode for similar functionality and while that's a plunge I may take at some point, today my questions are about getting the most out of the combination of mmm-mode, markdown-mode and ESS. Thanks for your help.
Polymode is the way to go. Unfortunately it is still in development, but it works for most things.

How to disable spell checking in code regions in RMD files (Markdown, knitr, R)

I am using the Vim-R-plugin to edit files containing markdown and R code blocks, such that the files can be compiled using knitr. The filetype is RMD. I have enabled spell checking. How can I disable the spell checking within the code blocks?
Spell checking is attached to certain syntax groups. Find the :syn region that covers the R code blocks, and append / edit in contains=@NoSpell.
Instead of trying to get @NoSpell working by region, my approach is to toggle between languages.
I work in three languages, which are set up to toggle with a function key, and I include "nospell" as one of the options. This makes turning spellchecking on and off as easy as pressing F7. When coding and writing, nospell is turned on; when finalizing the edits, I toggle to the appropriate language.
In fact, I find spellchecks in my code to be a plus. I make mistakes in the comment sections too, sometimes even in variable names/plot labels etc. This way you have a quick last check of all language items that are going to be visible.
I got this to work on OS X by editing ~/.vim/syntax/R.vim and doing a search and replace of all instances of @Spell with @NoSpell, then restarting vim. All the red underscores were gone from the code chunks but were still in the rest of the rmarkdown.
Interestingly, this has not affected the spell checking in pure R documents that have a .R extension, so, having thought I understood what I was doing, perhaps I have to admit I don't fully. But at least it has turned off spell checking of the code chunks in rmarkdown (Rmd) documents while leaving it still working elsewhere in the document.

Mathematica-like (LaTeX) typesetting for own CAS application

As I am using Mathematica a lot, I got the idea to write a small and free CAS which just exposes a very small subset of necessary functions and packages, and I want to present the results to the user in an appropriate way, like Mathematica does.
My first idea was to create LaTeX code in the background, pdflatex the source, and include the resulting PDF in the view... however this seems like overkill! I want to write this CAS in either C++ or C#, and I want to know if there are any recommended solutions for producing nicely rendered formulas like that.
My first thought was a "real-time formula editing view", but it would be OK to have an input box for entering the commands and formulas, with the upper view being uneditable output.
A few ways come to mind:
1. Use LaTeX behind the scenes to typeset equations, as you say. Cadabra does this.
2. Use TeXmacs as the front end. Again, Cadabra does this.
3. Use MathJax. This is a JavaScript framework which renders TeX equations to images or MathML. It's very easy to use if you have an HTML view in your UI toolkit. MathJax is used on the sister site MathOverflow, for example.
I find route 3 the most attractive.
For calling LaTeX in the background, don't use pdflatex; use the non-PDF latex to produce a DVI file, and then convert it to PNG with dvipng.
Have a look at the preview package or the standalone class to get the output in the right size (i.e. only the formula, not a whole page).
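For instance, a minimal document using the preview package might look like the sketch below (the formula is just a placeholder); compile it with latex and then run something like dvipng -D 150 formula.dvi to get the PNG.
% formula.tex -- minimal sketch
\documentclass{article}
\usepackage[active,tightpage]{preview}
\begin{document}
\begin{preview}
$\displaystyle \int_0^1 x^2\,dx = \frac{1}{3}$
\end{preview}
\end{document}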

Using org-mode to structure an analysis

I am trying to make better use of org-mode for my projects. I think literate programming is especially applicable to the realm of data analysis and org-mode lets us do some pretty awesome literate programming.
I think most of you will agree with me that the workflow for writing an analysis is different than most other types of programming. I don't just write a program, I explore the data. And, while many of these explorations are dead-ends, I don't want to delete/ignore them completely. I just don't want to re-run them every time I execute the org file. I also tend to find or develop chunks of useful code that I would like to put into an analytic template, but some of these chunks won't be relevant for every project and I'd like to know how to make org-mode ignore these chunks when I am executing the entire buffer. Here's a simplified example.
* Import
- I want org-mode to ignore import-sql.
#+srcname: import-data
#+begin_src R :exports none :noweb yes
<<import-csv>>
#+end_src
#+srcname: import-csv
#+begin_src R :exports none
data <- read.csv("foo-clean.csv")
#+end_src
#+srcname: import-sql
#+begin_src R :exports none
library(RSQLite)
blah blah blah
#+end_src
* Clean
- This is run on foo.csv, producing foo-clean.csv
- Fixes the mess of -9 and -13 to NA for my sanity.
- This only needs to be run once; after that, just reference foo-clean.csv.
- How can I tell org-mode to skip this?
#+srcname: clean-csv
#+begin_src sh :exports none
sed .....
#+end_src
* Explore
** Explore by a factor (1)
- Dead end. Did not pan out. Ignore.
- Produces a couple of charts showing there is no interaction.
#+srcname: explore-by-a-factor-1
#+begin_src R :exports none :noweb yes
#+end_src
** Explore by a factor (2)
- A useful exploration that I will reference later in a report.
- Produces a couple of charts showing the interaction of my variables.
#+srcname: explore-by-a-factor-2
#+begin_src R :exports none :noweb yes
#+end_src
I would like to be able to use org-babel-execute-buffer and have org-mode somehow know to skip over the code blocks import-sql, clean-csv and explore-by-a-factor-1. I want them in the org file, because they are relevant to the project. After all, tomorrow someone might want to know why I was so sure explore-by-a-factor-1 was not useful. I want to keep that code around, so I can bang out the plot or the analysis or whatever and go on, but not have it run every time I rerun everything, because there's no reason to run it. Ditto with the clean-csv stuff. I want it around, to document what I did to the data (and why), but I don't want to re-run it every time. I'll just import foo-clean.csv.
I Googled all over this and read a bunch of org-mode mailing list archives, and I was able to find a couple of ideas, but not what I want. EXPORT_SELECT_TAGS and EXPORT_EXCLUDE_TAGS are great when exporting the file, and the :tangle header works well when creating the actual source files. I don't want to do either of these. I just want to execute the buffer. I would like to be able to define code blocks in a similar fashion, to be executed or ignored. I guess I would like to find a way to have an org variable such as:
EXECUTE_SELECT_TAGS
This way I could simply tag my various code blocks and be done with it. It would be even nicer if I could then run the file, using only source blocks with specific tags. I can't find a way to do this and I thought I would ask before asking/begging for a new feature in org-mode.
I figured it out. From the Org manual (since updated):
The :eval header argument can be used to limit the evaluation of specific code blocks. :eval accepts two arguments “never” and “query”. :eval never will ensure that a code block is never evaluated, this can be useful for protecting against the evaluation of dangerous code blocks. :eval query will require a query for every execution of a code block regardless of the value of the org-confirm-babel-evaluate variable.
So you just have to add :eval never to the header of the blocks that you don't want to execute, and voilà!
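Applied to one of the blocks from the question, the header would look like this; only the :eval never part is new.
#+srcname: import-sql
#+begin_src R :exports none :eval never
library(RSQLite)
blah blah blah
#+end_src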
While I never did get an answer to my question, the discussion was interesting and apparently an org-mode based template for R strikes a few people as an interesting idea. I downloaded the source code to org-mode and looked at org-babel-execute-buffer. It is, as I feared, a naive function which does precisely what it says it does and nothing more. It is not (currently) possible to pass it any additional parameters to affect its behavior. (Unless I am badly misreading the lisp, which is entirely possible.)
Eventually, I decided org-babel-execute-buffer is not necessary for a useful R template system. Babel's noweb functionality is really flexible and I think it is possible to build a workable solution using noweb, rather than trying to develop a complex tagging schema to define how/when to run things.
For tangling/export it should still be possible to use tags to create usable/sane output.
For anyone who is interested: LiterateR
It's probably a little rude to use this thread to put this out there but this is why I asked the question in the first place. TemplateR is my attempt to make R a little easier to use. Right now it is just a template with two simplistic functions. I consider it to be a proof of concept at this point. Eventually, I want to develop something that does more to help people develop R projects more quickly. TemplateR will accomplish this by:
1. Provide a strong structure to develop around.
2. Provide built-in functions to support common tasks, especially in the realm of reproducible research.
3. Provide snippets of tested code that can be rapidly re-purposed for the current project.
Right now, all it provides is a basic structure/framework and two simple functions.
1. Identifies which R packages are missing (based on what is manually entered into a table) and
2. Creates project directories (plots, data, reports).
More will come in future versions. The README.org and TODO.org go into further detail.

Resources