RStudio and version control facilities - r

I have a question regarding the version control integration in RStudio.
What's the advantage of using it instead of an external version control manager like SourceTree?
From what I understand, the main advantage is avoiding another tool and the constant switching from one window to another.
But the problem is that we often need to use other languages in data mining projects (SQL, Python, ...), so in almost every scenario we also need a dedicated version control manager.

It's not an either/or. Think of RStudio's version control as a way to accomplish your most common source control tasks without leaving the IDE. For instance, press Ctrl + Alt + D to see a diff of the current file. How long would that take using an external tool?
Likewise, you can easily commit a change in just a few keystrokes (try Ctrl + Alt + M), and you're never more than a click away from seeing which branch you're on.
Most IDEs have some basic form of version control built-in for this reason. It isn't intended to completely replace any VCS tools you're using; it's intended to keep you from having to switch context when you're doing common tasks. You will still need to use an external VCS for more complicated operations.
Another thing to keep in mind is that RStudio can manage your whole data science project, not just its R code. You can use its VCS integration to manage and commit changes to files of any kind. We sometimes use it as a VCS interface to projects that contain no R code at all!
Some people prefer to do everything VCS-related in a single interface. That's great, and hopefully the RStudio VCS integration stays out of your way if that's your workflow. But most IDE users like being able to do day-to-day VCS operations without breaking context, and that's what the VCS integration in RStudio is for.

Related

R and RStudio: Docker vs Binder

My problem is that I can't use RStudio at my workplace because IT does not support it. I want to use R and RStudio installed on my personal laptop from my company laptop (through a modern browser that sits behind a firewall). I am considering two options:
Should I build a Docker image for R and RStudio (I see base images are already available)? I am mostly interested in base R and dplyr (plus haven, xporter, and reticulate).
Or should I use Binder? I am not a technical person and my programming skills are very limited, so can anyone suggest a way?
What exactly is the difference between using the Docker option vs Binder?
I know I can use RStudio online and get my work done, but with the new paid account I am running out of project hours and it is sometimes very slow. Thanks in advance.
Here are some examples beyond the modern RStudio MyBinder example:
https://github.com/fomightez/pythonista_skewedf
https://github.com/fomightez/r_phylogenetics_worshop
https://github.com/fomightez/chapter7/tree/master/binder
The modern RStudio MyBinder example has been set up as a template on GitHub so you can use it as a starting point for your own repository.
The first one is for a special use of a package not on conda. And I started that one from square one.
The other two were converted from content by others to aid in making them Binder-ready.
You essentially list everything you need from conda in the environment.yml along with the appropriate channels. If you need special stuff not on conda, you need the other configuration files included there.
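For example, a minimal environment.yml along these lines might cover the packages mentioned in the question (the package list is illustrative, and the RStudio template repo already carries whatever else it needs to launch RStudio itself):
channels:
  - conda-forge
  - defaults
dependencies:
  - r-base
  - r-dplyr
  - r-haven
  - r-reticulate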
Getting everything working can take some iterations of adding things, letting the image get built, and testing that your libraries are available, although your situation doesn't sound overly complex.
The binder launch badges you see are just images wrapping a URL that points the MyBinder federation at your repository. Look at a working launch URL and you should see the pattern: it points at your repo, with the RStudio interface requested at the end (see the example below). The form at the MyBinder.org site can help with this; however, it is most often easier to adapt a working launch badge's code copied from elsewhere, since the form isn't set up at this time for making URLs that launch into RStudio.
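For example, a launch URL for the RStudio interface follows this general pattern (user, repo, and branch are placeholders):
https://mybinder.org/v2/gh/<your-username>/<your-repo>/<branch>?urlpath=rstudio
and a launch badge is just an image link wrapping a URL of that form.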
Download anything useful you create in a running session. Sessions time out after 10 minutes of inactivity, although working in RStudio usually keeps them active.
Lack of persistence and limited memory, storage, and compute power can be drawbacks. The inherent reproducibility and portability are advantages.
MyBinder.org doesn't work with private repos. If you have code you don't want to share, you can upload it into the temporary session and use the repo only for specifying the environment. You could host a private BinderHub that does allow the use of private git repositories; however, that is probably overkill for your use case and would exceed your ability level at this time.
GitHub isn't the only place to host repositories that can be pointed at the MyBinder system. If you go to the MyBinder.org page and click where it says 'GitHub' on the left side of the top line of the form, you can see a list of the sources at which you can host a repository and point the system to build an image and launch a container with that specified image.
Building the image from a source repository takes some minutes the first time. Once the image has been built on the service, though, launch typically takes less than 30 seconds. Each time you make a change to the source repo, a new build is necessary. Some changes don't make the new build take as long as the initial one, since some optimizing is done to rebuild only what is necessary after a change. Keep in mind that there are several members of the federation around the world, and if your traffic gets routed to a member where the built image isn't yet available, it will first be built from scratch there again.
The Holepunch project exists to offer some help to users working in the R ecosystem; however, with the R-Conda system now integrated into MyBinder it is pretty much as easy to do it the way I described. Last I knew, the Holepunch route produces a Dockerfile that isn't as easy to troubleshoot as the current R-Conda route. Dockerfiles are essentially the last-ditch configuration option that MyBinder can handle; the other configuration files are much easier and don't require knowing Dockerfile syntax. MyBinder aims to let you take advantage of Docker containers with a specified environment without needing to know anything about Docker.
There is a Binder Help category for posting to get help at the Jupyter Discourse Forum. Some other examples of posts already there may help you troubleshoot.
Notice of a common pitfall
Most of the configuration files for making a repository Binder-ready are simply text and can be edited right in the GitHub browser interface, without needing git or even cloning the repo locally.
Last I knew, there are two exceptions to this. The postBuild and start configuration files have settings that allow them to be run as scripts, and these get altered in a way that stops them from working if you edit them via the GitHub browser interface. (This was my experience when I last tried; your mileage may vary, or things may have changed by now.) To edit those, you need git available on a machine of your own: pull a copy from some other source, edit it on the machine where git works, add it to your repo, and push it back up from your local computer.
(If this is a problem, you can post in the Jupyter Discourse Forum Binder help category and you and I could coordinate where I fork and edit those files in your repo to your specifications and then make a pull request to update your source of the fork with those changes.)
If you are using Jupyter notebooks extensively, then it may make sense to use Binder.
But if you simply want to use R and RStudio, then all you need is Docker. A good resource is
https://github.com/rocker-org/rocker
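If you go the Docker route, the rocker/rstudio image gives you RStudio Server in a browser; a typical invocation (the password is just a placeholder you choose) looks something like:
docker run --rm -p 8787:8787 -e PASSWORD=yourpassword rocker/rstudio
after which you browse to http://localhost:8787 and log in as user rstudio with that password.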

What is the easiest way to create a webapp from an interactive Jupyter Notebook?

I have a Jupyter Notebook that plots some data and lets the user interact with it via a slider.
What would be the easiest way to make a web app with similar functionality (reusing as much of the code as possible)?
I believe the easiest way is to use voilà.
After installing you just have to run:
voila <path-to-notebook> <options>
And you will have a server running your notebook as a web app, with all the input code omitted.
AppMode is "A Jupyter extension that turns notebooks into web applications".
From the README:
Appmode consists of a server-side and a notebook extension for Jupyter.
Together these two extensions provide the following features:
One can view any notebook in appmode by clicking on the Appmode button in the toolbar. Alternatively one can change the url from baseurl/notebooks/foo.ipynb to baseurl/apps/foo.ipynb. This also allows for direct links into appmode.
When a notebook is opened in appmode, all code cells are automatically executed. In order to present a clean UI, all code cells are hidden and the markdown cells are read-only.
A notebook can be opened multiple times in appmode without interference. This is achieved by creating temporary copies of the notebook for each active appmode view. Each appmode view has its dedicated ipython kernel. When an appmode page is closed the kernel is shutdown and the temporary copy gets removed.
To allow for passing information between notebooks via url parameters, the current url is injected into the variable jupyter_notebook_url.
To be complete: there is also https://www.streamlit.io/ .
I still don't understand the exact difference between voilà and streamlit.
At the moment I am struggling with how to re-run everything with new parameters... I still have bad luck with voilà.
Edit: I see that streamlit requires raw Python, not .ipynb, which would mean this answer is completely wrong; I will look into streamlit a bit more before further action/comment.
Edit 2: Voilà looks great. However, I found a few things that reveal the underlying complexity and the troubles that may arise:
Callbacks: widgets work great in Jupyter, but since it is not possible to re-run a single cell, the logic sometimes must be restructured around callbacks to work in Voilà (see the sketch after this list).
Interactive JavaScript objects need special treatment: e.g. matplotlib has a cheap solution, but there was nothing for e.g. JSROOT.
Links: it is easy to create (a file and) a download link in Jupyter; Voilà can also serve a file, but that needs yet more special treatment.
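To illustrate the callbacks point: here is a minimal sketch (assuming ipywidgets and matplotlib are available; the figure itself is just an example) of moving the re-drawing logic into a callback attached to an Output widget, which is the pattern that keeps working under Voilà where you cannot re-run a cell:
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

slider = widgets.IntSlider(value=3, min=1, max=10, description="freq")
out = widgets.Output()

def redraw(change=None):
    # redraw the plot inside the Output widget whenever the slider moves
    with out:
        out.clear_output(wait=True)
        x = np.linspace(0, 2 * np.pi, 200)
        plt.plot(x, np.sin(slider.value * x))
        plt.show()

slider.observe(redraw, names="value")
redraw()  # initial draw with the starting value
display(slider, out)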
After all this, I ask myself a question: is it better to learn many tricks and modifications to Jupyter, or to use some other system? I am going to see if streamlit can give me some answers.
The Jupyter Dashboards Bundlers extension from the Jupyter Incubator is one way to do it while retaining interactivity.
EDIT: While pip-installing this package will also install the cms package dependency, cms (like dashboard_bundlers) needs to be explicitly enabled/quick-set-up as a notebook extension for the dashboard tools to work.
@raphaelts has the right idea and that should be the accepted answer. As of Dec 2019, Voilà is the most appropriate method for deploying Jupyter notebooks to production as stand-alone web apps. Think internal data science teams sharing their analytics workload with internal C-suite teams using SPA-style notebooks with all the code hidden and custom GUI/interactions thrown in. Recently discussed on HN
https://news.ycombinator.com/item?id=20160634 and the official announcement from the Jupyter maintainers https://blog.jupyter.org/and-voil%C3%A0-f6a2c08a4a93
As mentioned above, voilà is a very powerful tool which hides the input cells from your notebooks and therefore provides a clean interface. In order to deploy your notebook with voilà you need to follow the specific steps of your organization. But if you want to quickly run it on your machine, simply install it with pip install voila. Then you can start it from the command line with voila my_notebook.ipynb, or use the "Voila" button which should have appeared in your Jupyter notebook.
Note, however, that using voilà is only one part of the story. You also need to build the required interactivity, i.e. to set up how to respond to input changes. There are quite a few frameworks for this.
The simplest one is to use the interact function or the observe method from the ipywidgets library, as in the sketch below. This is very direct, but things can easily get out of control as you start having more and more widgets and complexity.
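As a minimal sketch of the interact approach (the function and value range here are made up purely for illustration):
from ipywidgets import interact

def show_square(n=5):
    # re-executed automatically every time the slider value changes
    print(n * n)

interact(show_square, n=(1, 20))  # builds an integer slider from 1 to 20
interact builds the slider from the (1, 20) range and re-runs the function on every change; observe gives you finer-grained control when that isn't enough.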
There are complete frameworks, some of them mentioned above. E.g. streamlit, dash and holoviz. These are very powerful and suited for larger projects.
But if you want to keep it simple, I also recommend checking out autocalc. It is a very easy-to-use library which lets you define the dependencies between your widgets/variables and have all the recalculation triggered automatically. A tutorial can be found here.
Disclaimer: I am the author of the autocalc package.
The easiest way is to use the Mercury framework. You can reuse all your code. To convert the notebook to a web app you will need to add a YAML header in the first cell of the notebook (very similar to R Markdown). The widgets are generated based on the YAML. The end user can tweak widget values and click the Run button to execute the notebook from top to bottom. You can easily hide the notebook's code (if you want) by setting show-code: False in the YAML. The example notebook and corresponding web app are below.
Example of the notebook with YAML header
Example web app generated from notebook with Mercury
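For reference, the first-cell YAML header looks roughly like the sketch below (the params block and its field names follow Mercury's documented slider examples as I remember them, so double-check the exact schema against the current Mercury docs; show-code: False is the switch mentioned above):
---
title: My report
description: Notebook served as a web app
show-code: False
params:
    n_samples:
        input: slider
        label: Number of samples
        value: 100
        min: 10
        max: 500
---
The end user then gets a slider for n_samples and a Run button, while the notebook code stays hidden.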

How to switch R architectures dynamically in RStudio

In RStudio there's a Tools menu which allows you to select an installed version/architecture of R under Global Options.
That's great, but my issue with that is that, as the name implies, it is a Global option, so once you select a different architecture (or version number) you then have to restart RStudio and it applies to all of your RStudio instances and projects.
This is a problem for me because:
I have some scripts within a given project that strictly require 32-bit R due to the fact that they're interfacing with 32-bit databases, such as Hortonworks' Hadoop
I have other scripts within the same project which strictly require 64-bit R, due to (a) availability of certain packages and (b) memory limits being prohibitively small in 32-bit R on my OS
which we can call "Issue #1" and it's also a problem because I have certain projects which require a specific architecture, though all the scripts within the project use the same architecture (which should theoretically be an easier to solve problem that we can call "Issue #2").
If we can solve Issue #1 then Issue #2 is solved as well. If we can solve Issue #2 I'll still be better off, even if Issue #1 is unsolved.
I'm basically asking if anyone has a hack, work-around, or better workflow to address this need for frequently switching architectures and/or needing to run different architectures in different R/RStudio sessions simultaneously for different projects on a regular basis.
I know that this functionality would probably represent a feature request for RStudio and if this question is not appropriate for StackOverflow for that reason then let me know and I'll delete it. I just figured that a lot of other people probably have this issue, so maybe someone has found a work-around/hack?
There's no simple way to do this, but there are some workarounds. One you might consider is launching the correct bit-flavor of R from the current bit-flavor of R via system2 invoking Rscript.exe, e.g. (untested code):
source32 <- function(file) {
  # Path to the 32-bit Rscript executable; adjust to match your installed R version
  rscript32 <- "C:\\Program Files\\R\\R-3.1.0\\bin\\i386\\Rscript.exe"
  # Quote the path in case it contains spaces
  system2(rscript32, shQuote(normalizePath(file)))
}
...
# Run a 64-bit script in the current (64-bit) session
source("my64.R")
# Run a 32-bit script in a separate 32-bit Rscript process
source32("my32.R")
Of course that doesn't really give you a 32 bit interactive session so much as the ability to run code as 32 bit.
One other tip: If you hold down CTRL while launching RStudio, you can pick the R flavor and bitness to launch on startup. This will save you some time if you're switching a lot.

Tools Commonly used to Program in R

I apologize if this has already been asked a different way but I couldn't find anything getting at what I wanted.
I am really getting into R from other packages (SPSS). As I learn about what truly can be done, I realize that there are additional "tools" that I need. This gets me to my question.
What setup do you have for developing R code? I can't see myself actually developing R packages anywhere in the near future, but I do see myself wanting to manage my R projects efficiently, as well as create reports and presentations in LaTeX.
For context, I develop my R code in Eclipse for Windows, but I have had a real hard time successfully setting up the LaTeX/Sweave and GitHub plugins.
Lastly, do you develop code using Windows or something else?
Many thanks in advance for any insight you can lend.
Emacs has everything I commonly need:
ESS (for R),
AucTeX (for LaTeX),
similarly rich 'modes' for other languages I use (C++, make, shell, ...),
plus a lot of other modes you get quite used to, e.g. dired for directory/file browsing or org-mode as a planner/to-do list,
the SVN integration is very good too
and there are probably a number of tools within Emacs I am now forgetting.
Works in text mode as well as graphical mode, and works essentially the same (incl. ESS and AucTeX) on several operating systems (Linux mostly, and Windows when I must). On Debian/Ubuntu all this is prepackaged and tends to work out of the box as well. For both Windows and OS X, Vincent Goulet has packaged very handy bundles; see here.
The 'daemon mode' is outstanding too -- I keep the same main Emacs session running and just connect and re-connect to it even when accessing the machine (via ssh or directly) from different computers.
Also see the EmacsWiki for more tips around Emacs.
Back to Emacs and R in particular. The R FAQ says it pretty well:
6.1 Is there Emacs support for R?
6.2 Should I run R from within Emacs?
and I like the affirmative and resounding answer to the second question: "Yes, definitely". I fully concur.
I'll second the suggestion that Emacs complements R nicely, but let me share what the "killer feature" is for me.
Using Org-mode with Org-babel, I can write whole reports with inline graphs produced from R in raster and vector formats, which compile seamlessly into a PDF report via LaTeX. I can also view the graphs while editing, similar to a WYSIWYG editor.
I just wrapped up a major report with over 70 inline graphs with little effort: no editing external files, no issues keeping names consistent between figures in my report and external files, and no forgetting to recompile the latest version of a figure. Org and Babel do it all.
Org-mode:
http://orgmode.org/
Org-Babel:
http://orgmode.org/worg/org-contrib/babel/index.php
Example of inline R with Babel and PDF output, see the first example in multiple formats:
http://orgmode.org/worg/org-contrib/babel/languages/ob-doc-R.php
Enjoy!
This is probably more relevant for package development, but it is also worth mentioning the roxygen R package that allows in-source documentation of your code. Note that even though you can't see yourself developing R packages anywhere in the near future, a package can be a very handy way of grouping related functions you develop and maintain, consistently documenting the code and keeping track of updates, even if you do not plan to distribute it.
I use a mac, and my most important tools are:
the command line, for running R
git, for keeping track of changes
github for publishing my code, bug tracking and collaboration
textmate for writing R code
Has anyone tried RStudio? It's the shiny new editor for R.
I use windows... (don't say it).
I like Notepad++ and NPPtoR. Makes it pretty easy to send things back and forth.
I use Eclipse on Windows and Linux. I compile LaTeX code (with Sweave) on Linux and I haven't bothered yet to set up the whole process in Eclipse. I need to run pdflatex and bibtex on the files several times anyway, so I just keep a terminal window with the specific string of commands handy. I tried ESS and Eclipse and they're very similar in functionality (and in my opinion the best two editors out there).
I use Eclipse/StatET on Windows, and it rocks! For LaTeX/Sweave I use MiKTeX, which works well for me. For help setting things up, check out this document and this post.
Other tools you may find useful include:
If you want to build R packages on Windows, get the Rtools toolchain.
For creating documents, you may want to check out odfWeave, LibreOffice (formerly OpenOffice), and the MS Office ODF plugin.
I have also dabbled with Git but didn't get very far on Windows; that was a while ago, though.
For presentations in LaTeX I recommend Beamer.
I use Eclipse for both R and LaTeX while working on research papers. The plugins for both are very mature now. The nice thing is that you don't have to switch applications while writing papers. I used different combinations before, but I found this one to be the best.
I just got home from our local R User meeting (find one near you here) and of the 20 or so people there, all of us used a different program or tool to write R code in. I think that goes to show the diversity of the tools used to write and edit R code is just as diverse as the R community itself.

What artifacts to save for a nightly build?

Assume that I set up an automatic nightly build. What artifacts of the build should I save?
For example:
Input source code
output binaries
Also, how long should I save them, and where?
Do your answers change if I do Continuous Integration?
You shouldn't save anything for the sake of saving it. You should save it because you need it (i.e., QA uses nightly builds to test). At which point, "how long to save it" becomes however long QA wants them.
I wouldn't "save" source code so much as tag/label it. I don't know what source control you're using, but tagging is trivial (in performance and disk space) for any quality source control system. Once your build is tagged, there really isn't any benefit to keeping copies around: unless you need the binaries, you can simply re-compile from source when necessary.
Most CI tools let you tag on each successful build. This can become problematic for some systems as you can easily have 100+ tags a day. For such cases I recommend still running a nightly build and only tagging that.
Here are some artifacts/information that I usually keep for each build:
The tag name of the snapshot you are building (tag and do a clean checkout before you build)
The build scripts themselves, or their version number (if you treat them as a separate project with its own version control)
The output of the build script: logs and final product
A snapshot of your environment:
compiler version
build tool version
libraries and dll/libs versions
database version (client & server)
ide version
script interpreter version
OS version
source control version (client and server)
versions of other tools used in the process and everything else that might influence the content of your build products. I usually do this with a script that queries all this information and logs it to a text file that should be stored with the other build artifacts.
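A minimal sketch of such a script in Python (the tool list and output file name are only examples; substitute the compilers, SDKs, and clients your builds actually depend on):
import datetime
import pathlib
import platform
import subprocess

# Example tool list -- replace with the tools that actually shape your builds
TOOLS = {
    "gcc": ["gcc", "--version"],
    "make": ["make", "--version"],
    "git": ["git", "--version"],
}

def first_line(cmd):
    # Capture the first line of each tool's version banner
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        text = (result.stdout or result.stderr).strip()
        return text.splitlines()[0] if text else "no output"
    except OSError as exc:
        return "not available: %s" % exc

lines = ["snapshot taken: %s" % datetime.datetime.now().isoformat(),
         "os: %s" % platform.platform()]
lines += ["%s: %s" % (name, first_line(cmd)) for name, cmd in TOOLS.items()]

# Store the log next to the other build artifacts
pathlib.Path("build_environment.txt").write_text("\n".join(lines) + "\n")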
Ask yourself this question: "if something destroys entirely my build/development environment what information would I need to create a new one so I can redo my build #6547 and end up with the exact same result I got the first time?"
Your answer is what you should keep at each build and it will be a subset or superset of the things I already mentioned.
You can store everything in your SCM (I'd recommend a separate repository), but in that case your question about how long to keep the items loses its meaning. Or you can store it in zipped folders or burn a CD/DVD with the build result and artifacts. Whatever you choose, have a backup copy.
You should store them as long as you might need them. How long, will depend on your development team pace and your release cycle.
And no, I don't think it changes if you do continuous integration.
This isn't a direct answer to your question, but don't forget to version control the nightly build setup itself. When the project structure changes, you may have to change the build process, which will break older builds from that point on.
In addition to the binaries, as everyone else has mentioned, I would recommend setting up a symbol server and a source server and making sure you get the correct information out and into those. It will aid in debugging tremendously.
We save the binaries, stripped and unstripped (so we have exactly the same binary, once with and once without debug symbols). Further, we build everything twice, once with debug output enabled and once without (again, stripped and unstripped, so every build results in 4 binaries). The build is stored in a directory named after the SVN revision number. That way we can always retrieve the source from the SVN repository by simply checking out that very revision (so the source is archived as well).
A surprising one I learned about recently: If you're in an environment that might be audited you'll want to save all the output of your build, the script output, the compiler output, etc.
That's the only way you can verify your compiler settings, build steps, etc.
Also, how long to save them for, and where to save them?
Save them until you know that build won't be going to production; in other words, as long as you have the compiled bits around.
One logical place to save them is your SCM system. Another option is to use a tool that will automatically save them for you, like AnthillPro and its ilk.
We're doing something close to "embedded" development here, and I can tell you what we save:
the SVN revision number and timestamp, as well as the machine it was built on and by whom (also burned into the build binaries)
a full build log, showing whether it was a full/incremental build, any interesting (STDERR) output the data baking tools produced, a list of files compiled and any compiler warnings (this compresses very well, being text)
the actual binaries (for anywhere from 1-8 build configurations)
files produced as a side effect of linking: a linker command file, address map, and a sort of "manifest" file indicating what was burned into the final binaries (CRC and size for each), as well as the debugging database (.pdb equivalent)
We also mail out the result of running some tools over the "side-effect" files to interested users. We don't actually archive these since we can reproduce them later, but these reports include:
total and delta of filesystem size, broken down by file type and/or directory
total and delta of code section sizes (.text, .data, .rodata, .bss, .sinit, etc)
When we have unit tests or functional tests (e.g. smoke tests) running, those results show up in the build log.
We've not thrown out anything yet -- granted, our target builds usually end up at ~16 or 32 MiB per configuration, and they're fairly compressible.
We do keep uncompressed copies of the binaries around for 1 week for ease of access; after that we keep only the lightly compressed version. About once a month we have a script that extracts each .zip that the build process produces and 7-zips a whole month of build outputs together (which takes advantage of only having small differences per build).
An average day might have a dozen or two builds per project... The buildserver wakes up about every 5 minutes to check for relevant differences and builds. A full .7z on a large very active project for one month might be 7-10GiB, but it's certainly affordable.
For the most part, we've been able to diagnose everything this way. Occasionally there's a hiccup in the build system and a file isn't actually at the revision it's supposed to be when a build happens, but there's usually enough evidence of this in the logs. Sometimes we have to dig out a tool that understands the debugging database format and feed it a few addresses to diagnose a crash (we have automatic stack dumps built into the product). But usually all the information needed is there.
We haven't had to crack open the .7z archives yet, I should mention. But we have the info there, and I have some interesting ideas on how to mine bits of useful data from it.
Save what can't be reproduced easily. I work on FPGAs where only the FPGA team has the tools, and some cores (libraries) of the design are licensed to compile on only one machine. So we save the output bitstreams. But try to check them against one another rather than relying only on a date/time/version stamp.
Save as in check in to source code control, or just save on disk? Save nothing to source code control. All derived files should be visible in the file system and available to developers. Don't check in binaries, code generated from XML files, message digests, etc. A separate packaging step will make these end products available. As you have the change number, you can always reproduce the build if necessary, assuming of course that everything you need to do a build is completely in the tree and available to all builds by syncing.
I would save your built binaries for exactly as long as they have a chance to go into production or be used by some other team (like a QA group). Once something has left production, what you do with it can vary a lot. For a lot of teams, they'll keep just their most recent prior build around (for rollback) and otherwise discard their builds.
Others have regulatory requirements to keep anything that went into production around for as long as seven years (banks). If you are a product company, I'd keep around any binary a customer might have installed in case a tech support guy wants to install the same version.
