project organization with R [duplicate] - r

Possible Duplicate:
Workflow for statistical analysis and report writing
I have not been programming with R for very long, but I am running into a project-organization question that I was hoping somebody could give me some tips on. I find that a lot of the analysis I do is ad hoc: I run something, think about the results, tweak it, and run some more. This is conceptually different from a language like C++, where you think through the entire program before coding; it is a huge benefit of interpreted languages. The issue, however, is that I end up with a lot of .RData files saved so that I don't have to source my script every time. Does anyone have good ideas about how to organize a project so that I can return to it a month later and still have a good idea of what each file is associated with?
This is partly a documentation question, I guess. Should I document the entire project at each stage and be rigorous about cleaning up files that are no longer necessary but were byproducts of the research? That is my current system, but it is a bit cumbersome. Does anyone have other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.

Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.

My analysis is a knitr document, with some external .R files which are called from it.
All data live in a database, but during my analysis the processed data are saved as .RData files. They are recreated from the database only when I delete the .RData files and run the analysis again. It works like a cache and saves database access and data-processing time when I rerun (parts of) my analysis.
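For illustration, that caching idea boils down to something like the sketch below (the file name, query and transform_raw() step are placeholders, and an already-open DBI connection con is assumed):

    cache_file <- "processed_data.RData"              # hypothetical cache file name

    if (file.exists(cache_file)) {
      load(cache_file)                                # restores the object `processed`
    } else {
      raw <- DBI::dbGetQuery(con, "SELECT * FROM measurements")  # `con` is assumed to exist
      processed <- transform_raw(raw)                 # placeholder for the real processing step
      save(processed, file = cache_file)              # delete the .RData file to force a rebuild
    }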
Using a knitr (Sweave, etc.) document for the analysis lets you easily write a documented workflow with the results included. knitr also caches the results of the analysis, so small changes usually do not trigger a full rerun of all the R code, only of a small section. That saves quite a bit of running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)

Related

Shiny apps in R: how to structure them correctly

I created my first Shiny app, which runs perfectly in my laptop.
However, I need to submit it to my professor and I want to make sure he will be able to run it.
I have a UI file, a server file, a global file and a process file.
The process file stores the data preparation.
The global file reads two RDS files which are the datasets that I use in the server.
Where should my libraries be loaded? For example, the app does not run without leaflet; how can I ensure the libraries are loaded automatically?
My RDS files are saved on my local drive, which means my professor would need to change the path in order to use them. How can I avoid this?
Should I put the UI, server and global code into one R script, or is it OK to keep them in separate scripts?
Thank you!
As you describe your current setup, the most obvious place to load your libraries is at the start of your global file (or at the start of app.R if you move to a single-file configuration). Though not exactly what the reprex package is designed for, you could probably use reprex to make sure that your code is reproducible and independent of anything you may have overlooked. (You've already identified the obvious issue with data files.) Look here for more information on reprex.
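As a rough sketch of that layout (file and object names below are only examples), global.R might look like:

    # global.R -- sourced by Shiny before ui.R and server.R, so anything defined
    # here is visible to both. Library calls at the top make the dependencies obvious.
    library(shiny)
    library(leaflet)

    # Read the prepared datasets with paths relative to the app folder, so nobody
    # has to edit absolute paths (the .rds files travel with the app directory).
    map_data   <- readRDS("data/map_data.rds")      # example file names
    table_data <- readRDS("data/table_data.rds")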
Indeed. This is a problem. If you must load the data from files, you need to find a way of providing them to your professor. Telling him to edit your code by hand is not a good way to start. Exactly how to do this depends on the infrastructure your institution has set up, so it's difficult to advise you what to do. Do you not have a shared area for your course? Ask your fellow students, or even your professor.
Whether you bundle the app in a single file or in separate files is really a matter of choice. I don't think there's a wrong way and a right way. For me, the deciding factor is usually the size of the app. Large apps get several files. For small apps, separation is just an unnecessary complication. Another factor, which doesn't seem to be an issue here, is whether I see the app as a potential front end for some methodology that might be worthy of its own package. In that case, I develop the app and the package as separate entities.
Questions 1 and 3 depend on criteria for code quality such as performance and maintainability. As the code grows, there will come a point when it is more difficult to handle all the code in one file. Once it grows even further, you will reach the point where you need to split the app into modules to keep the code easy to maintain.
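Should it ever come to that, a minimal Shiny module looks roughly like this (shiny >= 1.5 for moduleServer(); all names here are made up):

    library(shiny)

    counterUI <- function(id) {                      # module UI: input/output ids are namespaced
      ns <- NS(id)
      tagList(
        actionButton(ns("add"), "Add 1"),
        textOutput(ns("count"))
      )
    }

    counterServer <- function(id) {                  # module server logic
      moduleServer(id, function(input, output, session) {
        n <- reactiveVal(0)
        observeEvent(input$add, n(n() + 1))
        output$count <- renderText(n())
      })
    }

    ui <- fluidPage(counterUI("c1"))
    server <- function(input, output, session) counterServer("c1")
    shinyApp(ui, server)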
Regarding libraries, I advise declaring them at the entry point of the app (though that is basically a matter of taste and style). That way you provide maximum clarity about the app's dependencies. If the app becomes very large and not all parts rely on the same packages, it can improve performance and maintainability to have each part load the packages it requires, since you then do not need to load all packages at once. However, that is probably only true for very large apps.
However, since this all seems to be an exercise at university, I doubt that your app will reach higher levels of complexity.
Question 2: in a Shiny app you can provide a fileInput widget. This SO question shows you how.
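A minimal sketch of such a widget (assuming the uploaded .rds contains a data frame):

    library(shiny)

    ui <- fluidPage(
      fileInput("dataset", "Upload an .rds file", accept = ".rds"),
      tableOutput("preview")
    )

    server <- function(input, output, session) {
      data <- reactive({
        req(input$dataset)                   # wait until a file has been uploaded
        readRDS(input$dataset$datapath)      # datapath is the temporary copy of the upload
      })
      output$preview <- renderTable(head(data()))
    }

    shinyApp(ui, server)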

Automating excel process -- Applying a macro to several files [closed]

I am working on files used to monitor health plan data, and each type of report comes in the same template. I have created macros to automate the oversight of the files (finding errors, gaps in the data, logical improbabilities, etc.). Now that this is in place and the code has been cleaned up, I am trying to automate applying those macros to the files. For instance, for one type of report I have 10 client files to review. Currently I go through the painstaking process of opening each file, dropping in the macro, applying it to the file, removing the macro (so that the clients can't take my code), and saving the resulting file. I repeat that process for each client, and then do a similar process for many other reports. I know there has to be a better way, and I am wondering if anyone with experience in this area might be able to point me in the right direction. I use RStudio for another process we automated and I believe I could utilize it for this process as well; I just need a jumping-off point.
Manual intervention will still be needed to review the results of the macros, but I am hoping to eliminate the unnecessary manual touch points
Really appreciate any advice / knowledge you can share.
Unless the macro you've written contains some very specific functions that don't have Python equivalents, I'd recommend simply abandoning VBA and manipulating your Excel sheets in Python via xlwings or openpyxl. If your data is very "database-like", in that the top row is simply column headers and every additional row contains nothing but data aligned to those headers, you can also use pandas to process the data.
If you do need access to functions built directly into Excel that don't have Python equivalents, you can use win32com to communicate with Excel via Python. This library basically drives Excel via its COM interface. You can then either use the COM libraries directly to execute an equivalent of your VBA file from within Python, or, if you prefer to stick with VBA, you can simply paste your VBA code into your Python script and inject it into the workbook as in this example. From there, you can also remove your VBA code from the Excel sheet as shown in this example.
A pure VB solution would involve essentially making these same calls to inject and subsequently remove the CodeComponent in an environment outside your Excel workbook.
You may find it vastly easier to solve problems like this with a popular scripting language like Python since the support community around it is much larger than VBA's. VBA tends to be unpopular among developers and thus its support community also tends to be small. Large support communities also mean well-maintained and highly-convenient libraries such as the aforementioned xlwings and openpyxl.
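Since the question mentions already using RStudio, staying in R is also possible, although the answer above does not cover it. A rough sketch with the RDCOMClient package follows (Windows plus an installed Excel are assumed; the workbook, folder and macro names are placeholders, and the macro is assumed to operate on the active workbook so it never has to be pasted into the client files):

    library(RDCOMClient)

    xl <- COMCreate("Excel.Application")
    xl[["Visible"]] <- FALSE
    xl$Workbooks()$Open("C:/oversight/PersonalMacros.xlsm")   # workbook that holds the macro

    files <- list.files("C:/reports/clients", pattern = "\\.xlsx$", full.names = TRUE)
    for (f in files) {
      wb <- xl$Workbooks()$Open(normalizePath(f))
      xl$Run("'PersonalMacros.xlsm'!CheckReport")             # macro runs against the active workbook
      wb$Close(TRUE)                                          # TRUE = save changes
    }
    xl$Quit()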

Is there a way to make R code only able to be run and not edited? Essentially read-only?

I am in the midst of writing some scripts to perform data analysis on large Excel sheets faster than by hand. However, my company has a strict quality review system where the program used needs to be validated and secure (i.e. no one can edit it, there is proof of what code was run, etc.). So essentially I would like my code to be runnable by my coworkers without them being able to edit the script. I am also interested in inserting prompts that they can fill in (e.g. "Which column would you like to analyze?").
Is all of this possible? I have read a few things online about file permissions but I know that these can easily be changed by the user. I also read about obfuscators but am entirely unfamiliar with their use.
One thought I have is to use R Markdown as a way of displaying which lines were run for which results. However, I believe that document could be edited as well, and it would still leave the issue of the script itself being editable.
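For the prompt part at least, base R's readline() covers it in an interactive session; a small sketch (file and column names are made up):

    dat <- read.csv("results.csv")                       # placeholder input file

    col <- readline("Which column would you like to analyze? ")
    if (!col %in% names(dat)) stop("No column named ", col)

    print(summary(dat[[col]]))
    # Note: readline() only prompts in an interactive session; under Rscript or
    # R CMD BATCH it returns an empty string.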

Automating file treatment between multiple programmes

I'm trying to find a way to automate a series of processes that use several different programmes (on Ubuntu or Windows, either will do).
For each programme, I've boiled the process down to either a macro or a script, so I feel confident that the entire process can be almost entirely automated. I just can't figure out how to create a unifying tool.
The process is the following:
I have a simple text file with data, and I use a jEdit macro to tidy the data. This then goes into an OpenCalc template to create a graph; that data is then imported into a programme called TXM, which (after many clicks) generates a column of data; this is exported to a CSV file; and that CSV file is imported into an R session, whereupon a script is executed.
I have to repeat this process (and a few more similar ones) dozens of times a day, and it's driving me nuts.
My research into how to automate the import/treatment/export process has shown a few glimmers of hope, but I haven't been able to make any real progress.
I tried autoexpect, but couldn't make it work on Ubuntu. Tcl, I think, only works for internet applications, and I also haven't been able to make Fabric work.
I'm willing to spend a lot of time learning and developing a tool to achieve this, but at the moment I'm not even sure what terms to search for.
I've figured it out for Windows: I created a .bat file in a text editor which, when clicked, prompts the user for names, etc., and rewrites another text file. It then executes that .txt file as an R script with the command
R.exe CMD BATCH c:\user\desktop\mymacro.txt
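The generated mymacro.txt is simply plain R code that R CMD BATCH runs from top to bottom; purely as an illustration (every file name and the run_analysis() call below are made up and would be written in by the .bat file), it could look like:

    dat <- read.csv("c:/user/desktop/txm_output.csv")        # the CSV exported from TXM
    source("c:/user/desktop/analysis_functions.R")           # the existing R analysis code

    result <- run_analysis(dat)                              # hypothetical entry point
    write.csv(result, "c:/user/desktop/result.csv", row.names = FALSE)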

how to manage an R program where the functions are stored in different source files [closed]

I have finally arrived at a point in using R where my programs are no longer overgrown command-line scripts but real programs. At this point, I think it no longer makes sense to keep all the functions used by the main code in the same source file. Now, if I understand correctly, the way to use a function myfunction, stored in the file hereliesfunction.r, from a script stored in the file myscript.r, is to add the line
source("hereliesfunction.r")
in file myscript.r, before the part of the script code where myfunction is used.
1. Is this the right approach in R?
2. Do I need a different source command for each function used by my main code? I guess it works "recursively", i.e., I can put source commands in hereliesfunction.r to let myfunction use other functions.
3. What happens when I return from myfunction? Do these other functions remain in memory, ready to be accessed by the main code too, or are they destroyed just like any other object created by myfunction?
4. Finally, is there some guideline on whether to store all the functions used by a main code in the same directory as the main code, or not?
Once you source an R file, it runs all the commands in that file. If it contains a function definition, the function is stored in the global environment and is at your disposal until you remove it or close the R session (so 3., yes).
Your entire post is screaming R package. As @docendodiscimus has pointed out, you should invest some time in developing a package. Not only does it hold your code in one place and make it easy to maintain, it also offers a great platform for documenting your code (probably the most important part of code development/analysis) through help files and vignettes, and it offers easy version control through local and remote repositories (git, svn, ...).
[about sourcing] Is this the right approach in R?
Yes, but in the mid term, consider building a package, as stated by @docendo discimus. devtools::create() and, if you use RStudio, Projects > New Package are your friends. Learning to build packages is made simple by Hadley's R-pkg and was, personally, the best investment I ever made in R. Plus, documenting, writing tutorials/vignettes and writing tests is always useful: it may seem time-consuming at first glance, but you will probably soon benefit hugely from it (better understanding of your code, realizing you can improve the package architecture, etc.).
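A very small sketch of that route (the package name and example function are placeholders):

    # install.packages(c("devtools", "roxygen2"))     # one-off setup

    devtools::create("mytools")        # scaffolds mytools/ with DESCRIPTION, R/, ...

    # Put each function in mytools/R/, documented with roxygen comments, e.g.
    #   #' Square a number
    #   #' @param x a numeric vector
    #   #' @export
    #   myfunction <- function(x) x^2

    devtools::document("mytools")      # generates NAMESPACE and the man/ pages
    devtools::install("mytools")       # after this, library(mytools) works anywhere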
Do I need a different source command for each function used by my main code?
All the functions, and more generally all the code, in the sourced file will be executed in R, so the functions will be declared and available; you can check this with ls().
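A tiny illustration, reusing the file and function names from the question (the second helper is made up):

    ## hereliesfunction.r ----------------------------------------------
    myfunction <- function(x) x^2
    another_helper <- function(x) x + 1    # a second function in the same file

    ## myscript.r ------------------------------------------------------
    source("hereliesfunction.r")   # runs the whole file: both functions are now defined
    ls()                           # includes "another_helper" and "myfunction"
    myfunction(3)                  # 9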
I guess it work "recursively",i.e., I can put source commands in hereliesfunction.r to let myfunction use other functions.
Yes
What happens when I return from myfunction? Do these other functions remain in memory, ready to be accessed by the main code too, or are they destroyed just like any other object created by myfunction?
I'm not sure I understand this one, but it may be covered by the previous points.
Finally, is there some guideline on whether to store all the functions used by a main code in the same directory as the main code, or not?
You can store them wherever you want, as long as the path given to source is the right one. But it is generally better practice to store all your functions in the same directory (or in a subfolder, e.g. /code), so that you only change your working directory once (and if you use RStudio projects, you don't even need to bother: you just open the project). As a side effect, as long as everyone works in the same directory, the relative paths will still work, so you can share the folder via Dropbox or similar, which eases collaboration.
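For example, a short way to pull in everything from such a /code subfolder:

    # Source every .R file in the code/ subfolder of the working directory
    # (or of the RStudio project); relative paths keep this portable.
    for (f in list.files("code", pattern = "\\.[rR]$", full.names = TRUE)) {
      source(f)    # each file's functions end up in the global environment
    }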
Again, in the mid term or if many projects use the same source files, it's probably a good idea to write a package (for your own use, or to share on GitHub or CRAN or...)
