This question already has answers here:
How to create, structure, maintain and update data codebooks in R?
(5 answers)
Closed 5 years ago.
When analyzing data, the metadata about variables is extremely important. How do you manage this information in R?
For example, is there a way to specify a label that will be printed instead of the variable name?
What facilities are there in R for this?
Quick suggestions that come to mind are
attributes to store data along with an object (as Frank Harrell has championed for a long time)
the comment() function can do parts of this
the whole gamut of object-orientation to achieve different printing behaviour etc
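As a minimal sketch of the first two suggestions (the data frame, column names, and the "label" attribute name are made up for illustration; the "label" convention happens to match what Hmisc/haven use, but plain attr() is base R):

# hypothetical data frame
df <- data.frame(wt = c(61, 74, 68), ht = c(1.72, 1.81, 1.65))

attr(df$wt, "label") <- "Body weight (kg)"                        # store a label as an attribute
comment(df$ht)       <- "Standing height in metres, at baseline"  # store free-text metadata

attr(df$wt, "label")   # retrieve the label
comment(df$ht)         # retrieve the comment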
Use the repo package. You can assign each variable a long name, a description, a set of tags, a remote URL, and dependency relations, and also attach figures or generic external files to it. Find the latest stable release on CRAN (install.packages("repo")) or the latest development version on GitHub. A quick overview here. Hope it helps.
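A rough sketch of what that can look like, written from memory of the package vignette (the object, description, and tags are made up; check ?repo_open and the vignette for the exact interface before relying on it):

library(repo)
rp <- repo_open(file.path(tempdir(), "repo_demo"), force = TRUE)  # open/create a repository
rp$put(mydata, "mydata",
       "Weights measured at baseline",        # long description
       c("baseline", "anthropometry"))        # tags
rp$info("mydata")                             # show the stored metadata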
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I found the following line in Hadley Wickham's book about R packages:
While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file (See here).
But why? These would seem to be the two options that make the most sense to me. Keeping just one file seems especially appealing.
From a user's perspective:
When I want to know how something works in a package I often go to the respective GitHub page and look at the code. Having functions organised in different files makes this a lot harder and I regularly end up cloning a repository just so I can search the content of all files (e.g. via grep -rnw '/path/to/somewhere/' -e 'function <-').
From a developer's perspective:
I also don't really see the upside for developing a package. Browsing through a big file doesn't seem much harder than browsing through a small one if you use the outline window in RStudio. I know about the Ctrl + . shortcut, but with many files it still means I have to open a new file when working on a different function, whereas Ctrl + . could do basically the same job if I kept just one file.
Wouldn't it make more sense to keep all functions in one single file? I know different people like to organise their projects in different ways, and that is fine. I'm not asking for opinions here. Rather, I would like to know if there are any real disadvantages to keeping everything in one file.
There is obviously not one single answer to that. One obvious question is the size of your package: if it contains only those two neat functions and a plot command, why bother organizing it in any elaborate manner? Just hack it into one file and you are good to go.
If your project is large, say you are trying to overthrow the R graphics system and write lots and lots of functions (geoms and stats and ...), a lot of small files might be a better idea. But then, having more files than there is room for tabs in RStudio might not be such a good idea either.
Another important question is whether you intend to develop alone or with hundreds of people on GitHub. You might prefer to review a change to a small file rather than to "the one big file" that was so easy to search back when you were working alone.
The people who invented Java originally went with "one file per class", and C# seems to have something similar. That does not mean that those people are less clever than Hadley. It just means that your mileage may vary and you have the right to disagree with Hadley's opinions.
Why not put all files on your computer in the root directory?
Ultimately, if you use a file tree, you are back to using everything as single entities.
Putting things that conceptually belong together into the same file is the logical continuation of putting things into directories/libraries.
If you write a library and define a function as well as some convenience wrappers around it, it makes sense to put them in one file.
Navigating the file tree is made easier because you have fewer files, and navigating the files is easier because you don't have all functions in the same file.
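For example (the functions are hypothetical), a core function and its convenience wrappers naturally live together in one file:

# R/summarise_measurements.R -- one file grouping a core function and its wrappers
summarise_measurements <- function(x, centre = mean, spread = sd) {
  c(centre = centre(x), spread = spread(x))
}

# convenience wrappers around the core function
summarise_robust  <- function(x) summarise_measurements(x, median, IQR)
summarise_trimmed <- function(x) summarise_measurements(x, function(v) mean(v, trim = 0.1), mad)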
This is likely an easy question that's been answered numerous times. I just lack the terminology to search for the answer.
I'm creating a package. When the package loads I want to have instant access to the data sets.
So let's say DICTIONARY is a data set.
I want to be able to do DICTIONARY rather than data(DICTIONARY) to get it to load. How do I go about doing this?
From R-exts.pdf (online source):
The ‘data’ subdirectory is for data files, either to be made available
via lazy-loading or for loading using data(). (The choice is made by
the ‘LazyData’ field in the ‘DESCRIPTION’ file: the default is not to
do so.)
Adding the following to the DESCRIPTION file should do it:
LazyData: true
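For completeness, a minimal sketch of how the data file itself gets into the package (the object name DICTIONARY comes from the question; the source file and package name are placeholders):

# run once from the package source directory
DICTIONARY <- read.csv("data-raw/dictionary.csv")   # hypothetical raw source
save(DICTIONARY, file = "data/DICTIONARY.rda")

# with LazyData: true in DESCRIPTION, users then simply do
library(yourpackage)   # hypothetical package name
head(DICTIONARY)       # available without calling data()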
This question already has answers here:
Sharing R functionality with multiple users without exposing code [closed]
(2 answers)
Distributing a compiled executable with an R package
(2 answers)
Closed 7 years ago.
Is there a provision in R, or in other free tools such as RStudio, that can hide the R code (not show it to the user) while still performing its intended job by taking specific pre-defined inputs from the user?
I have written a few lines of R code that take an input from the user and produce an output in the form of a table and a graph. Is there any way I can protect (hide) this code from the user while the user runs it as per the given instructions? Making an R package could be a way; however, is it totally foolproof (i.e., the user doesn't get to see the code inside the functions)?
Thanks!
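For what it's worth, a plain R package is not foolproof in that sense: the R source of an installed package's functions can still be printed at the console, as this sketch shows (package and function names are hypothetical):

library(yourpackage)         # hypothetical package
yourfunction                 # typing the name prints the R source of the function
getAnywhere("yourfunction")  # finds the source even if the function is not exported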
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am trying to get some good practices happening and have recently moved to using git for version control.
One set of scripts I use is for producing measurement uncertainty estimates from laboratory data. The same scripts are used on different data files and produce a set of files and graphs based on that data. The core scripts change infrequently.
Should I create a branch for each new data set? Is this efficient?
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
There are a few different aspects here that should be touched on. I will try to provide my opinions/recommendations for each.
The core scripts change infrequently.
This sounds to me like you should make an R package of your own. If you have some core functions that aren't supposed to change, it would probably be best to package them together. Ideally, you design the functions so that the code behind each doesn't need to be modified and you just change an argument (or even begin exploring R S3 or S4 classes).
For the custom scripting, you could provide a vignette for yourself demonstrating how you approach a data set. If you want to save each final script and don't want to keep them locally, you could store them in the package's inst/examples directory so you can call them again whenever you need to re-run an analysis.
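A sketch of the one-time setup for such a personal package (the package name is a placeholder; usethis/devtools are the usual tooling, but plain package.skeleton() works too):

# install.packages(c("usethis", "devtools"))
usethis::create_package("uncertaintytools")    # skeleton with R/, DESCRIPTION, ...
# then, from within the new package project:
usethis::use_vignette("workflow")              # vignette describing how you approach a data set
dir.create("inst/examples", recursive = TRUE)  # store final per-dataset scripts here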
Should I create a branch for each new data set? Is this efficient?
No, I generally would not recommend that someone put their data on GitHub. It is also not "efficient" to create a new branch for a new data set. The idea behind creating another branch is to add a new aspect/component to an existing project. Simply adding a dataset and modifying some scripts is, IMHO, a poor use of a branch.
What you should do with your data depends on the data characteristics. Is the data large? Would it benefit from an RDBMS? You at least want to have it backed up on a local laboratory hard drive. Secondly, if you are academically minded, once you finish analyzing the data you should look into an online repository so that others can also analyze it. If the datasets are small and not sensitive, you could also put them in your package's data directory.
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
No; for your core functions/scripts, I would recommend looking into creating a wrapper for this part and providing an argument to specify the output path.
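A minimal sketch of such a wrapper (the function, argument, and output file names are made up, and estimate_uncertainty stands in for whatever your core scripts do):

run_uncertainty_analysis <- function(data_file, output_dir = ".") {
  dat <- read.csv(data_file)
  results <- estimate_uncertainty(dat)            # hypothetical core function from your package
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  write.csv(results, file.path(output_dir, "uncertainty_estimates.csv"), row.names = FALSE)
  pdf(file.path(output_dir, "uncertainty_plots.pdf"))
  plot(results)                                   # whatever graphs your scripts already produce
  dev.off()
  invisible(results)
}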
I hope these comments help you.
This question already has answers here:
Possible Duplicate: Workflow for statistical analysis and report writing
Closed 10 years ago.
I have been programming with R for not too long, but am running into a project-organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it, and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding. It is a huge benefit of interpreted languages. However, the issue that comes up is that I end up having a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question, I guess. Should I document my entire project at each stage and be rigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data is in a database, but during my analysis the processed data are saved as .RData files. They are recreated from the database only when I delete the .RData files and run the analysis again. It works kind of like a cache and saves database access and data-processing time when I rerun (parts of) my analysis.
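A sketch of that caching pattern (the file name, database, table, and processing step are placeholders; DBI/RSQLite stand in for whichever database backend is actually used):

cache_file <- "processed_data.RData"

if (file.exists(cache_file)) {
  load(cache_file)                                          # reuse the processed data
} else {
  con <- DBI::dbConnect(RSQLite::SQLite(), "lab.sqlite")    # hypothetical database
  raw <- DBI::dbGetQuery(con, "SELECT * FROM measurements")
  DBI::dbDisconnect(con)
  processed <- transform(raw, value = as.numeric(value))    # placeholder processing step
  save(processed, file = cache_file)                        # cache for the next run
}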
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all the R code, but only of a small section. That saves quite a bit of running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)