How to Track Software Run/Output [closed] - software-design

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Background:
I'm beginning to forge my way into more of a software engineering role at work. I have no computer science background and have taught myself a few high-level languages (mostly R, Python, and Ruby). Before I start inventing my own solutions to problems, I want to know the best practices for keeping track of a program's previous runs.
Specifically, I'm writing a program that will clean data in a database (find missing data, imputation, etc...). It needs to know when it was last run so it does not retrieve too much data.
Question:
How do I best keep track of previous code runs?
I'm writing production-level code. These scripts and functions will be run automatically (maybe on a nightly or weekly basis), and the results will be output to a file. Each of these programs will depend on when it was last run. I can see this being dealt with in a few ways:
The output file name (or a diagnostic file name) contains the date/time of the last run, e.g. 'output_file_2014_07_11_01_00_04.txt'. From this name, the program can determine when it was last run.
Keep a separate info file to which the program appends each run time, building up a list of run times.
These solutions seem prone to problems. Is there a more secure and efficient method for recording/reading the last run date?

I like the idea of putting it in the filename. That binds the run time to the actual data. If you keep the run time in a separate file, the data can become separated from its meta-data (i.e. the run time).
This works in a trusted environment. If accidental or malicious tampering, such as someone changing a filename, is a concern, then a lot of other things become problematic too.
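As a rough illustration of the filename approach, here is a minimal Python sketch. It assumes the naming scheme from the question's example ('output_file_YYYY_MM_DD_HH_MM_SS.txt'); the function names are hypothetical and only there for illustration.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming scheme, copied from the example file name in the question.
PATTERN = re.compile(r"output_file_(\d{4}(?:_\d{2}){5})\.txt")
STAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"


def output_name_for(run_time):
    """Build the output file name for a run that happened at run_time."""
    return "output_file_" + run_time.strftime(STAMP_FORMAT) + ".txt"


def run_time_from_name(name):
    """Return the run time encoded in a file name, or None if it does not match."""
    match = PATTERN.fullmatch(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), STAMP_FORMAT)


def last_run_time(output_dir):
    """Return the most recent run time among the files in output_dir, or None."""
    times = [run_time_from_name(path.name) for path in Path(output_dir).iterdir()]
    times = [t for t in times if t is not None]
    return max(times) if times else None
```

A nightly job could call last_run_time() at start-up, fetch only the records newer than that, and write its results to output_name_for(datetime.now()). Files that do not follow the naming scheme are simply ignored.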
A third alternative is to create a "header" or comment section in the data file itself and put the run time in the header. When you read the data, your program can either skip the header and go straight to the data, or examine the header and extract the meta-data (i.e. run time or other attributes).
This approach has the advantage that (a) the meta-data and the data are kept together and (b) you can include more meta-data than just the run time. This approach has the disadvantage that any program reading the data must first skip the header. For an example of this approach, see the Attribute-Relation File Format (ARFF) at http://www.cs.waikato.ac.nz/ml/weka/arff.html
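A minimal Python sketch of the header idea follows. The "# key: value" header convention and the function names are assumptions made for this example; this is not ARFF syntax, just the same general idea applied to a CSV-like file.

```python
import csv

HEADER_PREFIX = "# "  # assumed convention: meta-data lines start with "# "


def write_with_header(stream, metadata, rows):
    """Write meta-data as '# key: value' lines, then the rows as CSV."""
    for key, value in metadata.items():
        stream.write(f"{HEADER_PREFIX}{key}: {value}\n")
    csv.writer(stream, lineterminator="\n").writerows(rows)


def read_with_header(stream):
    """Return (metadata, rows); header lines are only recognised before the data."""
    metadata = {}
    data_lines = []
    for line in stream:
        if line.startswith(HEADER_PREFIX) and not data_lines:
            text = line[len(HEADER_PREFIX):].rstrip("\n")
            key, _, value = text.partition(": ")
            metadata[key] = value
        else:
            data_lines.append(line)
    return metadata, list(csv.reader(data_lines))
```

If the run time is stored as an ISO string (for example datetime.now().isoformat()), a later run can recover it with datetime.fromisoformat(metadata["run_time"]). Any other reader of the file has to know to skip the "# " lines, which is the disadvantage mentioned above.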

Related

Automating excel process -- Applying a macro to several files [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I am working on files used to monitor health plan data, and each type of report comes in the same template. I have created macros to automate the oversight of the files (find errors, gaps in data, logical improbabilities, etc). Now that I have that in place and the code has been cleaned up -- I am trying to automate the application of those macros to these files. For instance, for one type of report I have 10 client files I have to review. Currently I am going through the painstaking process of opening each file, dropping in the macro, applying it to the file, removing the macro (so that the clients can't take my code), and saving the resulting file. I repeat that process for each client -- and then do a similar process for many other reports. I know there has to be a better way, and I am wondering if anyone has experience in this field and might be able to point me in the direction of how to achieve it. I use RStudio for another process we automated and I believe I could utilize it for this process as well -- I just need to find a jumping-off point.
Manual intervention will still be needed to review the results of the macros, but I am hoping to eliminate the unnecessary manual touch points.
Really appreciate any advice / knowledge you can share.
Unless the macro you've written contains some very specific functions that don't have Python equivalents, I'd recommend simply abandoning VBA and manipulating your Excel sheets in Python via xlwings or openpyxl. If your data is very "database-like" in that the top row is simply column headers and every additional row contains nothing but data aligned to those headers, you can also use pandas to process the data.
If you do need access to functions built directly into Excel that don't have Python equivalents, you can use win32com to communicate with Excel from Python. This library basically drives Excel via its COM interface. You can then either use the COM libraries directly to execute an equivalent of your VBA code from within Python or, if you prefer to stick with VBA, simply paste your VBA code into your Python script and inject it into the workbook like in this example. From there, you can also remove your VBA code from the workbook as shown in this example.
A pure VB solution would involve essentially making these same calls to inject and subsequently remove the CodeComponent in an environment outside your Excel workbook.
You may find it vastly easier to solve problems like this with a popular scripting language like Python since the support community around it is much larger than VBA's. VBA tends to be unpopular among developers and thus its support community also tends to be small. Large support communities also mean well-maintained and highly-convenient libraries such as the aforementioned xlwings and openpyxl.
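As a rough sketch of the openpyxl route, the loop below opens every workbook in a folder and runs one check on it. The check itself (flagging empty cells below the header row) and the function names are hypothetical stand-ins for the asker's macros, and the sketch assumes each report's data sits on the first sheet with a single header row.

```python
from pathlib import Path

from openpyxl import load_workbook


def find_gaps(sheet):
    """Return (row, column) positions of empty cells below the header row."""
    gaps = []
    rows = sheet.iter_rows(min_row=2, values_only=True)
    for row_number, row in enumerate(rows, start=2):
        for column_number, value in enumerate(row, start=1):
            if value is None:
                gaps.append((row_number, column_number))
    return gaps


def check_folder(folder):
    """Run find_gaps on the active sheet of every .xlsx file in folder."""
    results = {}
    for path in sorted(Path(folder).glob("*.xlsx")):
        workbook = load_workbook(str(path))
        results[path.name] = find_gaps(workbook.active)
    return results
```

Because the checking code lives in the script rather than in the workbooks, nothing has to be injected into or removed from the client files; the findings can be written to a separate report for the manual review step.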

Why not organise all functions in a package in one file? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I found the following line in Hadley Wickham's book about R packages:
While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file (See here).
But why? This would seem to be the two options which make most sense to me. Especially keeping just one file seems appealing to me.
From a user's perspective:
When I want to know how something works in a package I often go to the respective GitHub page and look at the code. Having functions organised in different files makes this a lot harder and I regularly end up cloning a repository just so I can search the content of all files (e.g. via grep -rnw '/path/to/somewhere/' -e 'function <-').
From a developer's perspective:
I also don't really see the upside for developing a package. Browsing through a big file doesn't seem much harder than browsing through a small one if you employ the outline window in RStudio. I know about the Ctrl + . shortcut, but it still means I have to open a new file when working on a different function, while Ctrl + . could basically do the same job if I kept just one file.
Wouldn't it make more sense to keep all functions in one single file? I know different people like to organise their projects in different ways and that is fine. I'm not asking for opinions here. Rather, I would like to know if there are any real disadvantages to keeping everything in one file.
There is obviously not one single answer to that. One obvious question is the size of your package: If it contains only those two neat functions and a plot command, why bother organizing it in any difficult manner: Just hack it into one file and you are good to go.
If your project is large, say you are trying to overhaul the R graphics system and write lots and lots of functions (geoms and stats and ...), a lot of small files might be a better idea. But then, having more files than there is room for tabs in RStudio might not be such a good idea either.
Another important question is whether you intend to develop alone or with hundreds of people on GitHub. You might prefer to accept a change in a small file as opposed to "the one big file" that was so easy to search back when you were alone.
The people who invented Java originally had a "one file per class" convention going, and C# seems to have something similar. That does not mean that those people are less clever than Hadley. It just means that your mileage may vary and you have the right to disagree with Hadley's opinions.
Why not put all files on your computer in the root directory?
Ultimately if you use a file tree you are back to using everything as single entities.
Putting things that conceptually belong together into the same file is the logical continuation of putting things into directories/libraries.
If you write a library and define a function as well as some convenience wrappers around it, it makes sense to put them in one file.
Navigating the file tree is made easier as you have fewer files and navigating the files is easier as you don't have all functions in the same file.

how to manage an R program where the functions are stored in different source files [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have finally arrived at a point in using R where my programs are no longer overgrown command-line scripts, but real code. At this point, I think it doesn't make sense to keep all the functions used by the main code in the same source file. Now, if I understand correctly, the way to use function myfunction, stored in file hereliesfunction.r, from a script stored in file myscript.r, is to add the line
source("hereliesfunction.r")
in file myscript.r, before the part of the script code where myfunction is used.
1. Is this the right approach in R?
2. Do I need a different source command for each function used by my main code? I guess it works "recursively", i.e., I can put source commands in hereliesfunction.r to let myfunction use other functions.
3. What happens when I return from myfunction? Do these other functions remain in memory, ready to be accessed by the main code too, or are they destroyed just like any other object created by myfunction?
4. Finally, is there some guideline on whether to store all the functions used by a main code in the same directory as the main code, or not?
Once you source an R file, it runs all the commands in that file. If it contains a function definition, the function is stored in the global environment and is at your disposal until you remove it or close the R session (so 3., yes).
Your entire post is screaming R package. As @docendo discimus has pointed out, you should invest some time in developing a package. Not only does it hold your code in one place and make it easy to maintain, it also offers a great platform to document your code (probably the most important part of code development/analysis) through help files and vignettes, and it offers easy version control through local and remote repositories (git, svn...).
[about sourcing] Is this the right approach in R?
Yes, but in the mid-term, consider building a package as stated by @docendo discimus. devtools::create() and, if you use RStudio, Projects > New package are your friends. Learning to build packages is made simple by Hadley's R-pkg and was, personally, the best investment I ever made in R. Plus, documenting, writing tutorials/vignettes and writing tests are always useful: it may seem time-consuming at first glance, but you will probably soon benefit hugely from it (better understanding of your code, realizing you can improve the package architecture, etc.).
Do I need a different source command for each function used by my main code?
All the code in the sourced file, function definitions included, will be executed in R, so the functions will be declared and available (you can check this with ls()).
I guess it works "recursively", i.e., I can put source commands in hereliesfunction.r to let myfunction use other functions.
Yes
What happens when I return from myfunction? Do these other functions remain in memory, ready to be accessed by the main code too, or are they destroyed just like any other object created by myfunction?
I'm not sure I understand, but this may be covered by the previous points.
Finally, is there some guideline on whether to store all the functions used by a main code in the same directory as the main code, or not?
You can store them wherever you want, as long as the path given to source is the right one. But it's generally better practice to store all your functions in the same directory (or in a subfolder, e.g. /code), so that you just change your working directory once (or, if you use RStudio's projects, you don't even need to bother; you just open the project). As a side effect, as long as one is working in the same directory, the relative paths will still work. And thus you can share the folder with Dropbox or similar, which eases collaboration.
Again, in the mid term or if many projects use the same source files, it's probably a good idea to write a package (for your own use, or to share on GitHub or CRAN or...)

What is a functional structure for multiple files using the same R scripts? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am trying to get some good practices happening and have recently moved to using git for version control.
One set of scripts I use is for producing measurement uncertainty estimates from laboratory data. The same scripts are used on different data files and produce a set of files and graphs based on that data. The core scripts change infrequently.
Should I create a branch for each new data set? Is this efficient?
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
There are a few different aspects here that should be touched on. I will try to provide my opinions/recommendations for each.
The core scripts change infrequently.
This sounds to me like you should make an R package of your own. If you have some core functions that aren't supposed to change, it would probably be best to package them together. Ideally, you design the functions so that the code behind each doesn't need to be modified and you just change an argument (or even begin exploring R S3 or S4 classes).
For the custom scripting, you could provide a vignette for yourself demonstrating how you approach a data set. If you wanted to save each final script, I would probably store them in the inst/examples directory, so that you can call them again if you need to re-run them and don't want to store them locally.
Should I create a branch for each new data set? Is this efficient?
No, I generally would not ever recommend that someone put their data on GitHub. It is also not 'efficient' to create a new branch for a new data set. The idea behind creating another branch is to add a new aspect/component to an existing project. Simply adding a dataset and modifying some scripts is, IMHO, a poor use of a branch.
What you should do with your data depends on the data characteristics. Is this data large? Would it benefit from an RDBMS? You at least want to have it backed up on a local laboratory hard drive. Secondly, if you are academically minded, once you finish analyzing the data you should look into an online repository so that others can also analyze the data. If these datasets are small, you could also put them in your package in the data directory if they are not sensitive.
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
No, I would recommend that you look into creating a wrapper around your core functions/scripts for this part and give it an argument to specify the output path.
I hope these comments help you.

project organization with R [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Workflow for statistical analysis and report writing
I have been programming with R for not too long but am running into a project organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding. It is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question I guess. Should I document my entire project at each leg and be vigorous about cleaning up files that will no longer be necessary but were a byproduct of the research? This is my current system but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data is in a database, but during my analysis the processed data are saved as .RData files. Only when I delete the .RData files are they recreated from the database the next time I run the analysis. Kinda like a cache: it saves database access and data processing time when I rerun (parts of) my analysis.
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all the R code, but only of a small section. That saves quite some running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)

Resources