I know it is not best practice to use the R package drake within a notebook tool, but I'm doing it anyway as a workaround for the limitations of the collaboration infrastructure we have on my team at work. Since my code is broken up into chunks distributed across sections of the notebook, it would be useful to have multiple analysis plans: I would execute each plan in the appropriate section, and write and execute further plans in subsequent sections of the notebook. Is it possible to write multiple plans in drake?
Sorry I am late to this thread. I am the maintainer of the drake R package, and I usually expect to receive questions on the issue tracker. A drake-r-package StackOverflow tag would really help me keep up, but I do not have that privilege.
Anyway, interesting use case. I do see some workarounds:
Separate caches for separate plans. drake uses storr to cache its targets, and you could create different caches for different sections of your report. Essentially, your report would manage a bunch of separate drake projects. See this chapter in the manual for more on the caching system. Use the cache argument to make() to supply a manual or non-default storr cache.
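A minimal sketch of this approach, assuming two report sections with hypothetical targets and cache locations:

```r
library(drake)
library(storr)

# One cache per report section (cache paths are hypothetical)
cache_a <- storr_rds(".drake_section_a")
cache_b <- storr_rds(".drake_section_b")

# Hypothetical plans for two different sections of the notebook
plan_a <- drake_plan(raw = read.csv(file_in("data.csv")),
                     tbl = summary(raw))
plan_b <- drake_plan(fit = lm(y ~ x, data = read.csv(file_in("data.csv"))))

make(plan_a, cache = cache_a)  # run in section A's chunk
make(plan_b, cache = cache_b)  # run in section B's chunk
```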
Separate plans and a single cache. Here, you would need to ensure that each plan has a completely unique set of targets. If there is overlap, then some targets will always rebuild whenever you run the report.
A single cumulative plan. Essentially, when it comes time to build additional targets as you move through the report, you can add new rows to an existing plan. In fact, this is the recommended approach for large complex projects (related example here). For even more control, use the targets argument to make() to only build a select few targets and their out-of-date dependencies.
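A sketch of the cumulative approach (target names and commands are made up):

```r
library(drake)

# Section 1: the initial plan
plan <- drake_plan(raw     = read.csv(file_in("data.csv")),
                   cleaned = na.omit(raw))
make(plan)

# A later section: append rows to the same plan and build only what is new
plan <- bind_plans(plan, drake_plan(fit = lm(y ~ x, data = cleaned)))
make(plan, targets = "fit")  # builds fit plus any out-of-date dependencies
```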
I'm not sure I understand the question. We often use Drake in Jupyter notebooks, and we do try to support that use case (via the Python bindings).
By “plans”, do you mean mathematical programs? Or inverse kinematics calls? Both should be ok in a notebook framework. Or are you actually calling them in parallel?
Not sure how R fits into it?
I have created an R script that:
Reads some data from a database
Makes some transformations
Exports the modified table to a CSV file
This code needs to run on a client's machine, but we need to "hide" the actual code from the user.
Are there any useful suggestions on how we can achieve that?
Up front
... it will be nearly impossible to deploy an R <something> to another computer in a way that prevents curious users from accessing the source code.
From a mailing list conversation in 2011, in response to "I would not like anyone to be able to read the code.",
R is an open source project, so providing ways for you to do this is not one of our goals.
Duncan Murdoch https://stat.ethz.ch/pipermail/r-help/2011-July/282755.html
(Prof Murdoch was on the R Core Team and R Foundation for many years.)
Background
Several (many?) programming languages provide the ability to compile a script or program into an executable, the .exe you reference. For example, Python has tools like py2exe and PyInstaller. These tools range from merely compacting the script into a zip-ball, perhaps obfuscating the script, to actually creating an exe with the script tightly embedded. (This part could use some more citations/research.)
This is usually good enough for many people: it keeps honest and merely curious users out of your source. I say it that way because all you need to do is google phrases like decompile py2exe and you'll find tools, how-tos, tutorials, etc., whose intent may honestly be to help somebody recover lost code. Whatever the intent, such tools mean these measures will only slow a curious user down.
Unfortunately, there are no tools that do this easily for R.
There are tools intended to make it easy for non-R-users to use R-based tools. For instance, RInno and DesktopDeployR are two tools for creating Windows installers (no macOS/Linux) that bundle R or R/Shiny tools. But the intent of tools like this is to facilitate the IT tasks involved in getting a user/client to install and maintain R on their computer, not to protect the code that it runs.
Constrain R.exe?
There have been questions (elsewhere) asking whether one can modify the R interpreter itself so that it does not do everything it normally does. For instance, one could redefine base::print so that functions' contents cannot be dumped, ensure debug does not show the code it is about to execute, and perhaps take several other protective steps.
There are a few problems with this approach:
There is always another way to get at a function's contents. Even if you stop print.default and the debugger from doing this, there are other ways to get to the functions (body(.), for one; a small illustration follows this list of problems). How many of these rabbit holes do you think you can accurately traverse, getting them all, with no adverse effect on normal R code?
Even if you feel you can get to them all, are you encrypting the source .R files that contain your proprietary content? Okay, encrypting is good, except you need to decrypt the contents somehow. Many tools that ship encrypted contents do so to thwart reverse engineering, so they also embed (obfuscated, of course) the decryption key in the application itself. Just give it time; somebody will find and extract it.
You might think that you can download the key on start-up (not stored within the app), so that the code is decrypted in real-time. Sorry, network sniffers will get the key. Even if you retrieve it over https://, tools such as https://mitmproxy.org/ will render this step much less effective.
Let's say you have recompiled R to mask print and such, have a way to distribute the source code encrypted, and can decrypt it in a way that does not easily reveal the key (needed for full decryption of the source files). It would take a dedicated user to wade through everything above to get to your source code, but none of those steps may even be required: because the modified interpreter is itself GPL (see the next point), they may legally compel you to release your changes to the R interpreter (the ones you put in place to prevent printing function contents). That does not reveal your source code, but it does reveal many of your methods, which might be sufficient. (Or the mere risk of legal costs might be.)
R is GPL, and that means that anything that links to it is also "tainted" with the GPL. This means that anything compiled with Rcpp, for instance, will also be constrained/liberated (your choice) by the GPL. This includes thoughts of using RInside: it is also GPL (>= 2).
To avoid the GPL entirely, you would need to write your own interpreter (likely from scratch) without code from the R project.
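As a small illustration of the first problem above: masking print does nothing to stop the other introspection tools built into base R (the function below is a made-up stand-in):

```r
# Pretend this function holds the proprietary logic
score_client <- function(x) 0.42 * x + 7

body(score_client)                       # the unevaluated body: 0.42 * x + 7
cat(deparse(score_client), sep = "\n")   # the full definition as plain text
```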
Alternatives
Ultimately, if you want to release R-based utilities/apps/functionality to clients, the only sure-fire way to allow them to use your code without seeing it is to ... control the computers on which R will run (and source code will reside). I'll add more links supporting this claim as I find them, but a small start:
https://stat.ethz.ch/pipermail/r-help/2011-July/282717.html
https://www.researchgate.net/post/How_to_make_invisible_the_R_code
Options include anything that keeps the R code and R interpreter completely under your control. Simple examples:
Shiny apps, self-hosted (or on shinyapps.io if you trust their security); servers include Shiny Server (both free and commercial versions), RStudio Connect (commercial only), and ShinyProxy. (This list is not exhaustive.)
plumber is an API server, not a Shiny server. It is intended for single HTTP(S) endpoint calls, possibly authenticated, supporting whatever HTTP supports (POST, GET, etc.). It can be served in various ways; see its hosting page for options.
Rserve. I know less about this, but from what I've experienced with it, I've not had as much luck integrating it with enterprise systems (where, e.g., authentication and fine-grained control over authorization are important). It does allow near-raw access to R, so it might not be what you want (especially when the intent is to hand something to clients who may not be strong R users themselves).
OpenCPU should be discussed, but not as a viable candidate for "protect your code". It is very similar to rplumber in that it provides HTTP endpoints, but it supports endpoints for every exported function in every package installed in its R library. This includes the base package, so it is not at all difficult to get the source code of any function that you could get on the R console. I believe this is a design feature, even if it is perfectly at odds with your intent to protect your code.
Anything that can call R or Rscript. This might be PHP or mod_python or similar. Any web-page serving language that can exec("/usr/bin/Rscript",...) can take its output and turn it around to the calling agent. (It might also be possible, for example, for a PHP front-end to call an opencpu endpoint that only permits connections from the PHP-serving host.)
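For that last option, here is a minimal sketch of what the R side of the hand-off could look like (the file and column names are hypothetical); any web-serving language can exec() it and capture stdout:

```r
#!/usr/bin/env Rscript
# score.R -- invoked as: Rscript score.R path/to/input.csv
args  <- commandArgs(trailingOnly = TRUE)
input <- read.csv(args[1])

# ... the proprietary transformation/scoring would happen here ...
result <- data.frame(id = input$id, score = seq_len(nrow(input)))  # placeholder

write.csv(result, stdout(), row.names = FALSE)  # the caller captures stdout
```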
I just joined a company that needs to build an ETL pipeline inside an AWS account owned by a client.
There's one part of the ETL pipeline that runs code written in R. The problem is, this R code is a very important part of our business and our intellectual property. Our client must not be able to see this code.
Is there any way to run this in their AWS environment without them having access to our code? R is not compilable, so we can't just deploy an executable file there. And we HAVE to run this in their environment. I suggested creating an API to run this in our AWS environment, but this is not an option.
In my experience, these are the options I've seen used in situations like this, in increasing order of difficulty:
Take the computation off-premises. This sounds like not an option for you.
Generate an API (e.g., shiny, opencpu, plumber) that is callable from their premises. This might require some finessing on their end, as I'm inferring (since they want it all done within their environment) that they might prefer a locked-down computation (perhaps disabling network access).
Rewrite the sensitive portions in Rcpp. While this has the possible benefit of speed improvements, it makes it only slightly harder for them to "discover" the underlying intellectual property. Realize that R and Rcpp are both GPL, which means that anything that links to them must also be GPL, meaning the source code must be made available. (It is feasible that, since you are not distributing it publicly, you could argue your case here, but I am not a lawyer and would not want to be the first consultant found on the wrong side of GPL case law. Again, IANAL.)
Rewrite the sensitive portions as a non-R executable (note that I don't say "as a non-R library linked to via R calls", since the act of linking taints the library with R's GPL). This executable can be called by your otherwise releasable R package (via system or processx::run; see the sketch below).
(For the record, one might infer C or C++ here, but other higher-level languages do allow compilable executables and are not GPL. Python has some such modes. Be sure to obfuscate your variables :-)
I think your "safest" options are #2 and #4.
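For option 4, a sketch of the R side of that call (the binary name and its flags are hypothetical):

```r
# The compiled, non-R executable does the sensitive work; the R package just shells out.
res <- processx::run(
  command = "./score_engine",                                   # hypothetical binary
  args    = c("--input", "input.csv", "--output", "scores.csv"),
  error_on_status = TRUE
)
cat(res$stdout)                   # any diagnostics the binary printed
scores <- read.csv("scores.csv")  # results written by the binary
```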
We have 2 solutions.
SolutionA is an internal solution where we put reusable code shared across our products.
For the sake of the question, it has only two projects, NugetProjectA and NugetProjectB, where NugetProjectB has a project reference to NugetProjectA.
SolutionB is a solution that references SolutionA's projects as NuGet packages.
The workflow that troubles us is:
Add a new method in NugetProjectA
Add a new method in NugetProjectB that uses the new NugetProjectA method
Publish a new version of NugetProjectB
Update the NuGet reference in a project of SolutionB
Execute the newly added NugetProjectB method in that project
Since we didn't publish the updated version of NugetProjectA, the last step fails.
This seems like an easy problem to solve, but imagine it with many more projects in SolutionB and many more in SolutionA.
Your question is vague and open-ended, so there isn't a simple, concise answer. Honestly it could easily be a multi-post blog series or even a short book. I'll try to give some general suggestions, but I'm not going to go into a lot of details to avoid making this answer too long.
Mono-repo vs multiple repos
Just like the microservice craze from a few years ago, you should first ask yourself if you need it. Having all your source code in a single repository and solution might make it feel "legacy", and it sure seems nicer to have a 3 minute build of a component rather than a 30 minute build of a whole system when checking in a 1-line bug fix. But are the problems you listed in your question worth the benefit of a shorter CI build?
On the other extreme, Google famously runs almost everything out of a single repository, but they have teams of people doing nothing but managing their mono-repo and build system, because a large number of developers working on a single repo have a different set of problems. These are engineers not working on customer applications; if their work makes other teams more productive, then it can be worthwhile.
You need to figure out what's best for you, but given the problem you described in the question, maybe you have too many repos/solutions and your processes could be faster with fewer mistakes if you consolidate a little.
Automation
The next best thing is to just use good engineering practices. Automate as much as possible to reduce the risk of human mistakes, including automating processes that validate that manual processes are followed correctly.
Have a CI or CD pipeline push packages, so you can't forget to push dependencies, as in your example.
There are tools that will automatically generate a version number from the git history based on the number of commits since the versioning config file was checked in. This prevents you from forgetting to increment the package version when you check in a change to the package.
Have tests to make sure the package works before pushing it. You could make it as simple as making sure that dependencies exist, but you could also go further and have test programs run that use the package to ensure that backwards compatibility isn't broken and that new features actually work as expected. Have these tests run before pushing the package, so if the tests fail, you don't push a bad package.
You can have a pipeline that automatically creates pull requests on solutions that use a package when a new version of one of your packages is published. You can even merge them automatically, but you'll want to make sure you have really good tests to avoid a buggy package cascading into a huge mess, particularly if you do automatic deployments on successful builds.
Be creative and think about other ways you can automate your processes to make things easier for yourself and your team.
But be pragmatic about what you automate. There's no point spending a week full time to automate something that only takes you 5 minutes once a month to do manually. But if the manual process sometimes goes wrong and causes hours or days of effort to fix, then that makes automating the process more worthwhile.
Use modern features
The .NET ecosystem has changed a lot in the last 2-3 years, since .NET Core was announced. Packing with MSBuild (either dotnet pack or msbuild -t:pack) is now an easier way to create packages than writing .nuspec files and making sure you do the right things to get project dependencies packed as NuGet dependencies, get all the files in the right places, and so on. If your class library uses SDK-style projects, there's nothing extra to do. If your project is a traditional project, you'll need to use PackageReference for your NuGet dependencies (or specify <ProjectStyle>PackageReference</ProjectStyle> as an MSBuild property in your project file), and then reference the NuGet.Build.Tasks.Pack package.
Version the application, not each package
Like the mono-repo vs multiple repos point, consider versioning all packages in the application with a single version number, rather than versioning each package individually. Yes, this means you'll sometimes (or maybe often) publish a new version of a package that has no code changes relative to the previous version, but it simplifies the release considerably. Coupled with packing with MSBuild as in the section above, you can create a Directory.Build.props file in your repository root and set the <Version> property to your app version, and all projects in the repo will have the same version. So, when you're ready to release, bump the version in a single file and every project and every NuGet package will have the same version.
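A minimal sketch of such a Directory.Build.props (the version number is just an example):

```xml
<!-- Directory.Build.props at the repository root; MSBuild imports it into every project -->
<Project>
  <PropertyGroup>
    <Version>2.3.0</Version>
  </PropertyGroup>
</Project>
```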
Summary
In an ideal world each component would be reusable across different applications, live in a separate source code repository, and be individually versioned using semantic versioning. But in the real world this adds a lot of development-time complexity. Your customers may be happier to get bug fixes and new features more quickly, even if the version numbers of packages are less meaningful. So, make data-driven decisions. If you're frequently having dependency version problems, reduce your dependencies so there are fewer things that can go wrong.
Don't get me wrong, there are many good reasons to have multiple projects, multiple solutions, and multiple repositories. Just make sure you're doing it because it helps your team/company be more productive, not for idealistic reasons that slow you down or cause bugs.
There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data.
I understand that the open PMML standard can be used to export models as an XML specification. This can then be used for in-database scoring/prediction. However, it seems that to make this work you need to use the PMML plugin by Zementis, which means the solution is not truly open source. Is there an easier, open way to map PMML to SQL for scoring?
Another option would be to use JSON instead of XML to output model predictions. But in this case, where would the R model sit? I'm assuming it would always need to be mapped to SQL...unless the R model could sit on the same server as the data and then run against that incoming data using an R script?
Any other options out there?
The following is a list of the alternatives that I have found so far to deploy an R model in production. Please note that the workflow to use these products varies significantly between each other, but they are all somehow oriented to facilitate the process of exposing a trained R model as a service:
openCPU
AzureML
DeployR
yhat (already mentioned by @Ramnath)
Domino
Sense.io
The answer really depends on what your production environment is.
If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.
Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save to store your fitted models in .RData files, then load them on the server and run the corresponding predict. (That is bound to be slow, but you can always try throwing more hardware at it.)
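A sketch of that save/load/predict workflow (the model, file, and column names are hypothetical):

```r
# On the modelling machine
fit <- lm(mpg ~ wt + hp, data = mtcars)
save(fit, file = "fit.RData")

# On the production server (R installed, fit.RData copied over)
load("fit.RData")                             # restores the object `fit`
new_data <- read.csv("incoming.csv")          # must contain wt and hp columns
new_data$score <- predict(fit, newdata = new_data)
write.csv(new_data, "scored.csv", row.names = FALSE)
```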
How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R; the term is UDF (user-defined function). In Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons), or you can use RHadoop to write simple map-reduce code that loads the model and calls predict in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script.
There are also vendor-specific ways to add functions written in R to various SQL databases. Again look for UDF in the documentation. For instance, PostgreSQL has PL/R.
You can create RESTful APIs for your R scripts using plumber (https://github.com/trestletech/plumber).
I wrote a blog post about it (http://www.knowru.com/blog/how-create-restful-api-for-machine-learning-credit-model-in-r/) using deploying credit models as an example.
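A minimal plumber sketch, assuming a previously saved model (the endpoint path, parameters, and file name are made up):

```r
# api.R -- plumber turns specially annotated functions into HTTP endpoints
library(plumber)

#* Score a single observation with a previously saved model
#* @param wt numeric predictor
#* @param hp numeric predictor
#* @post /predict
function(wt, hp) {
  fit <- readRDS("fit.rds")  # hypothetical saved model
  predict(fit, newdata = data.frame(wt = as.numeric(wt), hp = as.numeric(hp)))
}

# Launch the API with:
#   plumber::plumb("api.R")$run(port = 8000)
```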
In general, I do not recommend PMML because the packages you used might not support translation to PMML.
A common practice is scoring a new/updated dataset in R and moving only the results (IDs, scores, probabilities, other necessary fields) into the production environment/data warehouse.
I know this has its limitations (infrequent refreshes, reliance upon IT, data set size/computing power restrictions) and may not be the cutting edge answer many (of your bosses) are looking for; but for many use-cases this works well (and is cost friendly!).
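A sketch of that batch pattern, assuming a DBI-compatible warehouse connection (the driver, table, and column names are stand-ins):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")  # stand-in for the real warehouse

fit      <- readRDS("fit.rds")                     # previously trained model
new_data <- dbReadTable(con, "incoming_customers")

results <- data.frame(
  id          = new_data$id,
  probability = predict(fit, newdata = new_data, type = "response"),
  scored_at   = Sys.time()
)

dbWriteTable(con, "model_scores", results, append = TRUE)  # move only the results
dbDisconnect(con)
```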
It’s been a few years since the question was originally asked.
For rapid prototyping I would argue the easiest approach currently is to use the Jupyter Kernel Gateway. This allows you to add REST endpoints to any cell in your Jupyter notebook. This works for both R and Python, depending on the kernel you’re using.
This means you can easily call any R or Python code through a web interface. When used in conjunction with Docker it lends itself to a microservices approach to deploying and scaling your application.
Here’s an article that takes you from start to finish to quickly set up your Jupyter Notebook with the Jupyter Kernel Gateway.
Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users
For moving solutions to production the leading approach in 2019 is to use Kubeflow. Kubeflow was created and is maintained by Google, and makes "scaling machine learning (ML) models and deploying them to production as simple as possible."
From their website:
You adapt the configuration to choose the platforms and services that you want to use for each stage of the ML workflow: data preparation, model training, prediction serving, and service management.
You can choose to deploy your workloads locally or to a cloud environment.
Elise from Yhat here.
As @Ramnath and @leo9r mentioned, our software allows you to put any R (or Python, for that matter) model directly into production via REST API endpoints.
We handle real-time or batch, as well as all of the model testing and versioning + systems management associated with the process.
This case study we co-authored with VIA SMS might be useful if you're thinking about how to get R models into production (their data science team was recoding models into PHP before using Yhat).
Cheers!
I am looking for a solution that allows me to keep track of a multitude of R scripts that I create for various projects and purposes. Some scripts are easily tracked to specific projects, whereas others are "convenience" functions created to serve a set of tasks.
Is there a way I can create a central DB and query it to find which scripts match most appropriately?
I could build such a system manually with a DBMS, but is anyone aware of an existing software tool (ideally FOSS), either general-purpose or specific to R, that does this?
EDIT: Thank you for the responses. My current system is just a set of scripts with comments that allow me to identify their intended task. Though I use StatET with SVN, I would like a search utility along the lines of the "sos" package.
The question
I am looking for a solution that allows me to keep track of a multitude of R scripts that I create for various projects and purposes. Some scripts are easily tracked to specific projects, whereas others are "convenience" functions created to serve a set of tasks.
fails to address the obvious follow-up of why the existing mechanisms are not suitable:
1. Create a local package for each project
2. Create one or more local packages for local utility functions
3. Use R's already existing mechanisms for searching, indexing, testing, and cross-referencing
4. Use any revision control system of your liking, local or on the web, to host the code for 1. to 3. above
Reinventing an RDBMS schema for 1. to 3. is just wrong in my book. But if you must, go ahead and replicate what you can already (mostly) get for free in tested and widely used code.
R comes with several mechanisms for searching for help, most of which naturally use CRAN. Some examples: the sos package, cranberries, crantastic, and rseek. In many cases these could be adapted to use a local repository (the R manuals explain how to create one, and it is easy to do). Otherwise, if you package your scripts and submit them to CRAN, you will naturally have these available to you. I would also highly recommend this presentation on the subject: Creating R Packages, Using CRAN, R-Forge, And Local R Archive Networks And Subversion (SVN) Repositories from Spencer Graves and Sundar Dorai-Raj.
These would require you to put your code in packages, and create documentation, all of which is worth doing anyway. The package documentation turns out to be very useful for both documenting what things do, and helping your find them in the future. You can use roxygen to create this documentation in-line with your code. Also read this related question: Organizing R Source Code.
Alternatively, the help.search() function can be very useful for searching local packages, regardless of whether you have a repository set up.
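As a small illustration of that packaging-plus-search workflow (the package and function names are hypothetical):

```r
# R/zscore.R inside a local package, say `mytools`
#' Standardize a numeric vector
#'
#' @param x A numeric vector.
#' @return `x` centred and scaled to unit variance.
#' @export
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# After documenting and installing the package:
#   roxygen2::roxygenise()
#   install.packages(".", repos = NULL, type = "source")
# the function is discoverable like any other:
help.search("standardize")   # finds ?zscore among installed packages
??standardize                 # shorthand for the same search
```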
You'd probably be best served by working with a version control system. Many can be indexed and made searchable. At my work, a stack of R, Eclipse, StatET, Subversion, and Subclipse works very well for us.