I'd like to create a zip archive from within R, and need maximal cross-platform compatibility, so I would prefer not to use a system("zip") command.
Within utils there's zip.file.extract (aka unzip), which uses [a lot of] C code derived from zlib 1.1.3, in a file called dounzip.c. I couldn't find any similar capabilities for creating zip files.
It's also tricky to construct a specific Google query for "cran create zip" or equivalent!
Also, a tar will not suffice; I need to create zips to use as input for another set of non-R tools.
I'd appreciate any pointers?
cheers,
mark
As usual the amazing Omega Project for Statistical Computing is a valuable resource! Take a look at the Rcompression package and try, for example, something like:
?gzip
txt <- paste(rep("This is a string", 40), collapse = "\n")
v <- gzip(txt)
writeBin(v, "test.txt.zip")
HTH
I think the command gzfile() may also do what you're looking for. Also note that the upcoming version 2.10.0 includes some enhancements to the compression functions that may be relevant (see https://svn.r-project.org/R/trunk/NEWS -- the svn server may ask you to accept a certificate).
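For reference, a minimal sketch of the gzfile() route (note that gzfile() writes a gzip-compressed .gz file, not a multi-file .zip archive; the file name below is just an example):
txt <- paste(rep("This is a string", 40), collapse = "\n")
con <- gzfile("test.txt.gz", open = "w")    # open a gzip-compressed connection
writeLines(txt, con)                        # write the text through it
close(con)
readLines(gzfile("test.txt.gz"))[1]         # read back to check the round trip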
Related
I'm writing an R package that uses Rcpp to call functions written in C++ from the R code. Some of these functions and templates live in files with a .hpp extension, following the convention used by boost (and also discussed here).
This does not result in an error when building (R CMD build .) or checking (R CMD check --as-cran package.tar.gz) the package, but the check emits the following warning:
Subdirectory ‘src’ contains:
file.hpp example.hpp
These are unlikely file names for src files
OK, this is not a big issue, but my concern is: why the warning? Is naming files *.hpp considered bad practice in the R community? Are there objective or community reasons why I should use *.cpp/*.h files instead of *.hpp for the templates?
I originally left this information as a comment, but realized it actually answers your question, I think, so here goes:
As Dirk Eddelbuettel points out in the comments, when you have a question about an R Core Team policy on R extension packages, your best bet is to look through their excellent Writing R Extensions manual. This manual tells you almost anything you could ever need to know.
In your case specifically, you needed to look at Section 1.1.5, which explains that "[the R Core Team] recommend[s] using .h for headers" because (as they explain in footnote 18) "Using .hpp is not guaranteed to be portable."
sessionInfo() includes very useful info that will improve the chances of someone being able to run your code on their machine, including
OS and version
R version
Attached packages
What other info can be provided with an R script to ensure someone else will be able to run it in their environment?
NB please include how to get that info (i.e. what command to run or where to look for it)
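For context, the sessionInfo() output itself can be recorded from within the script, e.g. (base R only; the file name is just an example):
sessionInfo()                                                     # print to the console at the end of the script
writeLines(capture.output(sessionInfo()), "session-info.txt")    # or save it next to the results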
While this is not a complete answer, I tend to include this function with the scripts I send along, as it will download a package if the computer does not have it. This is more of a suggestion for scripts; for packages, you can explicitly state which versions of other packages your package depends on.
package_load <- function(packages = NULL, quiet = TRUE,
                         verbose = FALSE, warn.conflicts = FALSE) {
  # download required packages if they're not already installed
  pkgsToDownload <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(pkgsToDownload) > 0)
    install.packages(pkgsToDownload, repos = "http://cran.us.r-project.org",
                     quiet = quiet, verbose = verbose)
  # then load them
  for (i in seq_along(packages))
    require(packages[i], character.only = TRUE, quietly = quiet,
            warn.conflicts = warn.conflicts)
}
## Example of use
package_load(c('dplyr', 'rgdal'))
This is helpful for one-off scripts as it gets over the hurdle of a different computer not having the appropriate packages. However, I generally suggest to folks that they make sure their version of R is up to date as well.
Is this the best solution? Probably not, but it does help with minor scripts you send along to others. For a larger code base, it would probably be better to put together a package or a docker image.
I think the criteria you listed are the "basics" of script reusability. The next level would be the possible interactions of your script with its environment (e.g. R Shiny scripts use web features, so stating the web browser and version used to produce the script is good practice). Another useful kind of information is comments specifying the expected inputs and outputs.
NB: I would say "attached packages and their versions", just to be sure...
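For the package versions specifically, a small base-R sketch (the package names here are just examples):
pkgs <- c("dplyr", "rgdal")
sapply(pkgs, function(p) as.character(packageVersion(p)))   # named vector of installed versions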
I need to "industrialize" an R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow even for people who have not worked on the project before and they should be able to redo the whole workflow quite quickly. Therefore I am looking for tips, suggestions, resources and best-practices on how to achieve this objective.
Thank you for your help in advance!
You can make an R package out of your project, because a package has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it enables you to document every function precisely, and it makes the code clearer since you can avoid inline comments (but please still add them where needed).
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv.
R also provides a long-format documentation system via vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, and so on. Once the package is installed, the vignettes are automatically included and available to all users.
The only downside is the following: since R is a functional programming language, a package mainly consists of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about that last point: if your project consists of a script that calls a set of functions to do something, the script cannot directly appear within the package. Two options here: a) you write a dispatcher function that runs a set of functions to do the job, so that users just have to call one function to run the whole method (not really good for maintenance); b) you make the whole script appear in a vignette (see above). With the second method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole workflow from a terminal or a remote machine with Rscript, with something like the following (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
And finally if you write a bash file calling Rscript, you can automate everything !
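As a rough sketch of what myautomatedtask.R could contain (it reuses the placeholder functions from the example above; exactly how the parsed arguments feed into them is up to you):
library(argparse)
library(mydatascienceproject)

parser <- ArgumentParser(description = "Run the whole analysis")   # declare the command-line interface
parser$add_argument("--arg1", help = "first argument")
parser$add_argument("--arg2", help = "second argument")
args <- parser$parse_args()

dothis(args$arg1)      # call the package functions with the parsed arguments
dothat(args$arg2)
finishwork()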
Feel free to read Hadley Wickham's book about R packages, it is super clear, full of best practices and of great help in writing your packages.
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed, so that the outputs are reproducible (see the sketch after this list).
Documentation is important: you can insert a roxygen skeleton in RStudio (default shortcut Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts and use a main.R script that sources the others (again, see the sketch below).
If you use a special set of libraries, you can consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
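A minimal sketch of the main.R pattern mentioned above (the script names are just examples):
# main.R -- entry point that ties the smaller scripts together
set.seed(1234)             # fix the random seed so the outputs are reproducible

source("01_load_data.R")   # each sourced script covers one logically cohesive step
source("02_clean_data.R")
source("03_fit_model.R")
source("04_make_report.R")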
I wrote an R package which uses awk to do some initial filtering on data, but the awk the package uses needs to be at or above a certain version.
Where do you recommend to specify this dependency?
I can list the dependency in the SystemRequirements field of the DESCRIPTION file:
http://r.789695.n4.nabble.com/how-to-list-external-dependencies-i-e-non-R-packages-td4693947.html
But that doesn't actually perform a check. Still, it may be good enough.
If you really wanted to go the long way, we now provide a template:
package x13binary installs the X-13ARIMA-SEATS binary from US Census
it takes a binary from a matching GitHub repo x13prebuilt we have set up just to provide these binaries
packages needing X-13 deseasonalization such as seasonal simply depend on x13binary
they then use wrapper scripts that find the binary as part of x13binary and use it
X-13ARIMA-SEATS is open source under a somewhat unusual US Government license, meaning that it is available as source, but with terms that vary a little between the US and the rest of the world -- reflecting its origin in a world where licenses were less well understood.
This scheme might be overkill for you. On the other hand, you have simply no way to ensure that you will get the correct / minimally required awk version on Windows, OS X or arbitrary Linux.
List it in the SystemRequirements field of the DESCRIPTION file. For an example, see Ryacas.
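If you also want a run-time check, a hedged sketch along these lines could go in the package (the version reporting is simplistic and assumes a GNU-style awk --version; adjust to whatever awk flavours you support):
# DESCRIPTION:
#   SystemRequirements: GNU awk (>= 4.1)

# R/zzz.R: warn at load time if awk looks missing
.onLoad <- function(libname, pkgname) {
  awk <- Sys.which("awk")
  if (!nzchar(awk)) {
    packageStartupMessage("awk was not found on the PATH; the filtering functions will not work.")
    return(invisible(NULL))
  }
  ver <- tryCatch(system2(awk, "--version", stdout = TRUE, stderr = TRUE)[1],
                  error = function(e) "")
  packageStartupMessage("Found awk: ", ver)
}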
A user can work on many PCs. Good code runs no matter which PC it is running on. Assuming one does not want to rely on preference and option files, what is the best way to make sure a package is loaded (and installed first if needed)?
The library command is cool, but the require command is much better. Yet even require does not get the job done on its own.
Triggering a re-install that is not needed (e.g., in RStudio) causes an interesting prompt to restart the R session - and this is why unnecessary installs are best avoided.
One possible trick, A, is to do this (so as not to type the package name too often):
doInstall <- TRUE
toInstall <- c("downloader")
if (doInstall) install.packages(toInstall)
lapply(toInstall, library, character.only = TRUE)
or a worse trick B would be
if (!require(downloader)) {install.packages("downloader"); require(downloader)}
Is there a "2015 way" of doing it with one command - something like
justdoitall(c("downloader","dplyr"))
Here is an example of installing package zipcode using the pacman approach.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(zipcode)
Assuming one does not want to rely on preference and option files
That rules out putting anything in .Rprofile or using external packages, so we're stuck with base R to solve your problem. If that's the case, then the answer is that you can't do this much better than what you have written in your question (I prefer B to A).
If you're willing to bend a little bit and require the user to load a package first (which could be done on startup by using .Rprofile) there are a few options that do exactly what you want.
installr::require2 and pacman::p_load do what you ask. Disclosure: I am an author/maintainer of pacman. I agree with your sentiment that we shouldn't rely on options or external files, though, especially if we plan on sharing the code. I use pacman pretty much every day (it has much more use than just installing/loading packages), but for the most part these types of functions should be treated as useful for interactive use. If you want portable, shareable code without worries about whether packages will be available, you will have to resort to something along the lines of what you have in your question.