I am writing a framework for testing the validity of R scripts and for analyzing their output. I don't want the scripts to be able to interact with each other in any way, but I do want direct access to the R objects they leave behind after execution.
The natural way to design this framework is to create a custom environment for each script and execute the script in it with something like local(code, envir=custom_envir). Then I could inspect the environment for the leftover objects and work with them efficiently.
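A minimal sketch of that design, assuming the script lives in a file such as script1.R (a hypothetical path):

    custom_envir <- new.env(parent = globalenv())
    source("script1.R", local = custom_envir)  # evaluate the script inside custom_envir
    ls(custom_envir)                           # inspect the objects it left behind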
Unfortunately, calls to library or require "leak" outside the script's execution environment - both functions insert the package environment just after R_GlobalEnv on the search path, which is eventually inherited by every script. So loading a library in one script changes the way the other scripts work - possibly breaking them (by changing the order in which masked objects are found).
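The leak is easy to demonstrate (MASS is just an example package here):

    e <- new.env(parent = globalenv())
    eval(quote(library(MASS)), envir = e)
    "package:MASS" %in% search()   # TRUE - the package is now visible to every other script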
I don't want to run the scripts in separate R processes because the scripts produce heavy objects, and copying them across processes may not fit in RAM. Serializing them to disk is even worse.
How to "chroot jail" the execution of the script inside the existing R session, so that all memory effects of execution will be contained inside custom environment(s)?
I am developing processes for collecting, cleaning, and storing various data sets. The development is done with RStudio projects. I won't say I'm following every tidyverse/RStudio workflow recommendation, but in general I'm using that framework. What's relevant here is that I'm using the standard subdirectories and the here package for referencing them.
Every project has a MAIN.R script that ultimately sources the functions from the other scripts, so one only needs to run MAIN.R to execute the process. I did this not only for simplicity but also because the long-term intent is for this to be a scheduled process.
For now, at least, my method for scheduling R scripts is Windows Task Scheduler. Getting an R script scheduled and running is not a problem. The issue is the contextual assumptions of developing within a project: source(here("CODE", "some-file.R")) fails when I run MAIN.R outside the scope of the project.
One obvious solution would be to hard-code the project location as a parameter. I would need two different MAIN.R files: one for development that uses the project, and one that uses that parameter for scheduling. I don't hate that idea, but I don't love it either, as it somewhat nullifies the whole point of the project/here approach. Is there a more elegant solution that someone else has created that I couldn't find on Google, or a better workaround?
I ended up using the solution described here: https://community.rstudio.com/t/how-to-play-nice-with-taskscheduler-r-studio-projects-and-here/24406/2 .
I didn't have to make any changes to the MAIN.R script. Instead, I scheduled it directly but set the project directory as the "Start in" argument of the Windows Task Scheduler task.
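For illustration, a rough sketch of what the task settings might look like (the Rscript path and project directory here are hypothetical); with the working directory set to the project folder, here() can find the project root as usual:

    Program/script:  "C:\Program Files\R\R-4.2.1\bin\Rscript.exe"
    Add arguments:   MAIN.R
    Start in:        C:\projects\my-data-project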
I'm working on an R package at work. My package has gotten large enough that I've decided I need some form of repeatable testing. I settled upon using testthat and mockery. I'm not a developer, so this is the first time I'm writing tests at this level.
I deal with a lot of data files, and it's very convenient to have functions in my package to help locate files. These functions interact with the file system via calls to dir. For example:
Data from one event can be split over multiple files. If I have the file datafile_2017.10.20_12.00.00, I have a function that can find the next file that is part of the same event, i.e. datafile_2017.10.20_12.05.00.
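Roughly, such a helper looks something like this simplified sketch (the function name and file-name pattern are illustrative):

    find_next_file <- function(current_file, dir_path = ".") {
      # list the event files; the timestamp format sorts chronologically
      files <- sort(dir(dir_path, pattern = "^datafile_"))
      idx <- match(basename(current_file), files)
      if (is.na(idx) || idx == length(files)) return(NA_character_)
      file.path(dir_path, files[idx + 1])
    }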
My question is this: what is the best way to test functions like this? My intuition is to avoid using actual files stored somewhere else in my repository, because that can fail for a number of reasons, e.g. different paths or different repo states between systems. I searched around, and it looks like different languages have mocking libraries that allow for mocking directory structures. I haven't found anything like that for R (except for testthatsomemore, but it was removed from CRAN sometime in 2016).
Is there an R package that allows for mocking directory structures? Or am I wrong to move away from storing small test files in my repo?
I want to execute R code from an SSIS package. How can I add a data control step that executes R code? SSIS scripting supports only VB.NET and C#.
SSIS has many data transformations available, but R is very friendly when it comes to data manipulation.
I want to run R code from SSIS scripts or some other way. Basically, I'm trying to integrate R into the ETL process.
I want to extract (E) data from a CSV file, transform (T) it in R, and load (L) it into a Microsoft database.
Is it possible to get this workflow done in an SSIS package by executing an R script using SSIS data control items? Thanks!
Here are a few ways you could integrate R into your ETL process.
Crude, fast and dirty - Execute Process Task in the Control Flow. This would be similar to calling Rscript from the command line. You would likely make your transformation, save it to a file on disk, and get that filename from your Execute Process Task so you can feed it into a Data Flow Task. The upside is you're keeping your R clean and separate from your C#/VB.
Integrated via RDotNet - You could use the RDotNet library (I believe; I haven't tried to integrate it). You would need to register the DLLs in the GAC, and then you can either work with .NET objects in your SSIS scripts or call R scripts directly.
Integrated in SQL Server 2016 - Microsoft has added R support via extended stored procedures. You call the R script via a stored procedure, use a SQL query for the input data, and can store the output. See more detail here. This would mean using an Execute SQL Task in SSIS.
I hope this helps you or someone else. Since you want data processing, you might bring your dataset into a CSV file (through a Data Flow Task), then run your script with Rscript (it can be executed as a command with the Execute Process Task). Inside the script, you load the dataset into a data frame (e.g. with read.csv()), do all the math/calculations you need, write the data or calculation results to a CSV file, and read it back in from SSIS.
It is not an elegant solution, but it works :) - at least until Microsoft integrates R as a control/data flow process.
CYA
PS: here is how to execute R scripts from the command line: Run R script from command line
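As a rough sketch of what that R file could contain (the file paths, column names, and the transformation itself are just placeholders):

    # read the dataset that SSIS exported (hypothetical path)
    input <- read.csv("C:/ETL/staging/input.csv", stringsAsFactors = FALSE)

    # do whatever math/calculations you need - a trivial placeholder transformation
    output <- aggregate(value ~ group, data = input, FUN = mean)

    # write the results so SSIS can pick them up again
    write.csv(output, "C:/ETL/staging/output.csv", row.names = FALSE)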
I'm creating an R package for handling a specific dataset that is regularly updated in our organization, but not on a fixed schedule (making it unsuitable for something like a cron job). As a result, users must currently run a set of two scripts for data processing before they begin to analyze the data. In converting this set of functions into a package, I'm hoping to alleviate this by having the scripts be called whenever the package is first loaded into R (with analogous functions if people would like to manually check for an update in the middle of a multi-day session).
I've seen ways to deal with compiling external files upon package installation, but nothing on how to get R to run a script whenever the package is loaded (not just installed). Does anyone know if this is possible, and if so, how to do it?
Thanks!
These functions are outlined in the Writing R Extensions manual (which, if you're writing a package, you should be reading carefully), specifically section 1.5.3, Load hooks.
You can define an .onLoad function that will be called when your package loads.
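A minimal sketch, assuming your update check lives in a (hypothetical) helper called check_for_data_update(); by convention this goes in a file such as R/zzz.R:

    .onLoad <- function(libname, pkgname) {
      # runs every time the package is loaded, not just when it is installed
      check_for_data_update()
    }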
Say I want to have a simple web app that takes some user input, performs a quick calculation in some predefined R script, and returns some cool-looking graphic with, say, ggplot. One way to do this would be:
Have PHP accept some input from a web form
Sanitize the user input in PHP
Send the arguments to some pre-written R script using some combination of the PHP exec() command and Rscript
R does some calculations and saves the plot graphic to the server, as well as some meta info to a MySQL database (a rough sketch of such a script follows this list)
The client can then access their cool new graphic from their web browser
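For concreteness, a sketch of what the pre-written R script mentioned above might look like (the arguments, paths, and the plot itself are hypothetical, and the MySQL part is omitted):

    #!/usr/bin/env Rscript
    library(ggplot2)

    # arguments passed from PHP exec(), e.g. "3.5" and "/var/www/plots/abc123.png"
    args     <- commandArgs(trailingOnly = TRUE)
    x_max    <- as.numeric(args[1])
    out_file <- args[2]

    # a quick calculation and plot
    df   <- data.frame(x = seq(0, x_max, length.out = 100))
    df$y <- sin(df$x)
    p    <- ggplot(df, aes(x, y)) + geom_line()

    ggsave(out_file, p, width = 6, height = 4)  # the browser can then fetch this file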
This seems fairly straightforward to me. Thus my question is: what advantages would the rapache package have over the process described?
First off, rapache is not a package. It's an Apache module and a set of conventions, really a system, for creating web applications written in R...
The advantage is speed. The disadvantage is you'd have to write a bunch of R code. Some might disagree with me on that one, though.