Current situation
My scientific project consists mainly of analytical and numerical calculations for physics, i.e. I work with Mathematica, MATLAB, and Fortran. Of course, it is a good idea to track the progress with a version control system like git.
I asked some friends what the best way to structure my files would be. Apart from the simple and well-known programmer's layout (/bin/, /dat/, /doc/, /src/, ...), I found no satisfying answers, only similar questions:
What is the best project structure for a Python application?
Directory structure for a C++ library
Web Application (Django) typical project folder structure
Proper folder structure for lots of source files
https://stackoverflow.com/q/7715704/2062965
I hope this question does not get closed just because it is asked by a physicist ;) I guess many scientists face the same question at some point, to a greater or lesser extent. Since programmers have clearly faced such issues before, I am not asking this on Physics Stack Exchange, where it would probably be marked as off-topic or too broad.
Example
In my last project I used the following global structure, with some interaction between the different project items:
/External_Code_Contribution/
/Experimental_Data/
/Analytical_Calculations/
    /Project_Issue_1/
    /Project_Issue_5/
/Documentations/
    /Thesis/
    /Papers/
    /Talks/
    /Literature/
    /Project_Organization/
/Numerical_Calculations/
    /Project_Issue_1/
        /SubIssue_A/
        /SubIssue_B/
        /SubIssue_C/
    /Project_Issue_2/
    /Project_Issue_3/
    /Project_Issue_4/
    /Project_Issue_6/
Description:
In this project I divided the files according to their functionality (external contributions, experimental data, numerics, documentation, ...) and then subdivided them by issue (what was calculated at that step). Each issue in turn contained sub-issues (e.g. for analogous calculations with a different data set) holding source code, binaries and processed data (for simplicity, all in the same directory).
Unfortunately, with this approach I either had to:
refer to the source code in the corresponding issue (with the danger that any change there breaks other issues), or
copy the source code into each new issue (so that files with the same name in different folders could silently diverge).
Question
Who has faced this issue of structuring projects into directories? Please tell me your approach and the arguments that led you to that structure!
(Possible addition: what should I keep in mind when setting up a file structure for a project like the one described above?)
If you do not completely understand the question, please do not hesitate to ask instead of flagging it. Thank you!
Related
I created my first Shiny app, which runs perfectly on my laptop.
However, I need to submit it to my professor and I want to make sure he will be able to run it.
I have a UI file, a server file, a global file and a process file.
The process file stores the data preparation.
The global file reads two RDS files which are the datasets that I use in the server.
Where should my libraries be loaded? For example, the app does not run without leaflet; how can I ensure the libraries are loaded automatically?
My RDS files are saved to my local drive, which means my professor will need to change the path in order to use them. How can I avoid this?
Shall I put the UI, server and global into one R script, or is it OK to keep them in separate scripts?
Thank you!
As you describe your current setup, the most obvious place to load your libraries is at the start of your global file (or at the start of app.R if you move to a single-file configuration). Though not exactly what the reprex package is designed for, you could probably use reprex to make sure that your code is reproducible and independent of anything you may have overlooked. (You've already identified the obvious issue with data files.) Look here for more information on reprex.
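For illustration, a minimal sketch of what the top of global.R could look like under this setup (the data/ folder and the file names below are assumptions, not taken from the question):
# global.R -- runs once at startup; packages and objects defined here are
# visible to both ui.R and server.R
library(shiny)
library(leaflet)   # the app does not run without it, so load it up front

# read the prepared data sets using paths relative to the app directory
# (file names here are placeholders)
map_data   <- readRDS("data/map_data.rds")
table_data <- readRDS("data/table_data.rds")
Keeping the .rds files in a folder inside the app directory and reading them with relative paths also means whoever runs the app does not have to edit any paths.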
Indeed, this is a problem. If you must load the data from files, you need to find a way of providing them to your professor. Telling him to edit your code by hand is not a good way to start. Exactly how to do this depends on the infrastructure your institution has set up, so it's difficult to advise you what to do. Do you not have a shared area for your course? Ask your fellow students, or even your professor.
Whether you bundle the app in a single file or in separate files is really a matter of choice. I don't think there's a wrong way and a right way. For me, the deciding factor is usually the size of the app. Large apps get several files; for small apps, separation is just an unnecessary complication. Another factor (which doesn't seem to be an issue here) is whether I see the app as a potential front end for some methodology that might be worthy of its own package. In that case, I develop the app and the package as separate entities.
Questions 1 and 3 depend on criteria for code quality such as performance and maintainability. As the code grows, there will come a point when it is more difficult to handle all the code in one file. Once it grows even further, you will reach the point where you need to split the app into modules to keep your code easy to maintain.
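As a rough illustration of what splitting a Shiny app into modules looks like (the module and its contents are invented for this sketch, not taken from the question):
library(shiny)

# UI half of a module: all input/output IDs are namespaced with NS(id)
histogramUI <- function(id) {
  ns <- NS(id)
  tagList(
    sliderInput(ns("bins"), "Number of bins", min = 5, max = 50, value = 20),
    plotOutput(ns("plot"))
  )
}

# server half of the same module (moduleServer() requires shiny >= 1.5.0)
histogramServer <- function(id, data) {
  moduleServer(id, function(input, output, session) {
    output$plot <- renderPlot(hist(data, breaks = input$bins))
  })
}

# the main app just wires the modules together
ui <- fluidPage(histogramUI("hist1"))
server <- function(input, output, session) {
  histogramServer("hist1", data = faithful$eruptions)
}
shinyApp(ui, server)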
Regarding libraries, I advise declaring them at the entry point of the app (though, basically, that is a matter of taste and style). That way, you provide maximum clarity regarding the dependencies of the app. Again, if the app becomes very large and not all parts of it rely on the same packages, it could improve performance and maintainability to have each part load the packages it requires, since you do not need to load all packages at once. However, that is probably only true for very large apps.
However, since this all seems to be an exercise at university, I doubt that your app will reach higher levels of complexity.
Question 2: in a Shiny app you can provide a fileInput widget. This SO question shows you how.
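A minimal sketch of that approach (widget labels and the preview output are made up for illustration): the user selects the .rds file at runtime, and the server reads it from the temporary upload path, so no local path has to be hard-coded.
library(shiny)

ui <- fluidPage(
  fileInput("dataset_file", "Choose an .rds file", accept = ".rds"),
  tableOutput("preview")
)

server <- function(input, output, session) {
  dataset <- reactive({
    req(input$dataset_file)                # wait until a file has been uploaded
    readRDS(input$dataset_file$datapath)   # shiny stores the upload in a temp file
  })
  output$preview <- renderTable(head(dataset()))
}

shinyApp(ui, server)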
I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
Re-create the data cleaning process
But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
Access the relevant scripts/functions by changing the working directory
But: even if I used here it seems that this would introduce poor reproducibility.
Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried #landau's suggestion of using a single drake plan, which worked well for a while, until (similar to #vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. Therefore, I added some clarifications above to my intentions with the question.
My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.
library(drake)   # provides drake_plan(), file_in(), file_out()
library(readr)   # provides write_csv(), read_csv()

# upstream project: write the shared dataset and declare it as a file target
upstream_plan <- drake_plan(
  export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)

# downstream project: declare the same file as an input, so upstream changes trigger a rebuild
downstream_plan <- drake_plan(
  dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code nor that you shouldn’t use packages. He’s saying that you shouldn’t use packages for everything. But reusable code (i.e. code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.
That being said, first off, pay attention to Will Landau’s advice.
Secondly, you can make your RStudio projects configurable such that they can load data based on paths given in a configuration. Once that’s accomplished, nothing speaks against hard-coding paths to data in different projects inside that config file.
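One possible sketch of that configuration idea, using the config package (the file layout and field names here are just examples, not something the answer prescribes):
# config.yml in the project root might contain, per machine or user:
#   default:
#     upstream_data: "~/research/upstream_project/exported_data/dataset.csv"

library(config)   # reads config.yml from the project directory
library(readr)

cfg <- config::get()                    # picks the active configuration
dataset <- read_csv(cfg$upstream_data)  # the path comes from the config file, not the code
Each project can then keep its own config.yml (under version control, or git-ignored if the paths are machine-specific) without the analysis code changing.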
I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version-controlled project. This got out of hand as more sub-projects were spawned and my git history got cluttered from working on projects in parallel. That could be down to my lack of skills with git. My folder structure looked something like this:
project/.git
project/main/
project/sub-project_1/
project/sub-project_2/
project/sub-project_n/
I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something in the main dataset (e.g. parts I might not have cleaned), then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?
I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.
My current solution is to have the main data as an R-package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load in a specific version of the data (using renv for package versions).
My folder structure now looks something like this:
main/.git
sub-project_1/.git
sub-project_2/.git
sub-project_n/.git
And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.
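As a rough sketch of how this might look in practice (the object and function names are placeholders), the cleaned data can be stored in the main package with usethis::use_data() and then loaded from any sub-project:
# in the "main" package, e.g. in data-raw/prepare_data.R (run once)
library(usethis)
cleaned_data <- clean_raw_data()                   # placeholder for the real cleaning code
usethis::use_data(cleaned_data, overwrite = TRUE)  # stores the object as package data

# in a sub-project, with the package version pinned by renv:
# renv::install("path/to/main"); renv::snapshot()
library(main)
data(cleaned_data)      # makes the exported dataset available
summary(cleaned_data)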
Over the past few weeks I have been getting into Ada, for various reasons; my personal reasons for using Ada are out of scope for this question.
The other day I started using the gprbuild command that comes with the Windows version of GNAT, in order to get the benefits of a system for managing my applications on a per-project basis. That is, being able to define certain attributes per project, rather than setting up the compile phase manually myself.
Currently my file names follow what seems to be the standard for gprbuild (although I could very well be wrong): periods in the package name become a - in the file name, and underscores stay as _. So a package named App.Test.File_Utils would live in the files app-test-file_utils.ads and app-test-file_utils.adb.
In the .gpr project file I have specified:
for Source_Dirs use ("app/src/**");
so that I am allowed to use multiple directories for storing my files, rather than needing to have them all in the same directory.
The Problem
The problem, however, is that file names tend to get very long. Since I am already putting each file in a directory based on the name of the package it contains, I was wondering whether there is a way to make the compiler understand that the package name can be derived from the file's directory.
That is, rather than naming the file for App.Test.File_Utils app-test-file_utils, I would like it to reside under the app/test directory with the name file_utils.
Is this doable, or will I be stuck with the horrors of eventually having to name my files along the lines of app-test-some-then-one-has-more_files-another_package-knew-test-more-important_package.ads? That is, assuming I have not missed something about how an Ada application should actually be structured.
What I have tried
I tried looking for answers in the package Naming configuration of .gpr files in the documentation, but to no avail. I have also been browsing the web for information, but decided it might be better to ask on Stack Overflow, so that other people who struggle with this problem in the future (granted it is a problem in the first place) can also find help.
Any pointers in the right direction would be very helpful!
In the top-secret GNAT documentation there is a description of how to use non-default file names. It's a great deal of effort. You will probably give up, use the default names, and put them all in a single directory.
You can also simplify much of the effort by using GPS and letting it build your project file as you add files to your source directories.
I am attempting to create a GROMACS simulation based on the POPC128a.pdb and topol11885.top files, but with 4 times the number of POP and Sol molecules while maintaining all the other physical properties. Essentially, I want to create a POPC512a.pdb and topol47540.top to run the simulation. I know how to use editconf for the box size and a few other features, but I am new to GROMACS; I have looked through the documentation for how to do this with editconf and other tools, but could not find a clear answer. Is there a simple way to make these changes, or would I need to make these .pdb and .top files from scratch?
Thank you for any advice.
The tag "pdb-files" refers to "Program Database" files which are not related at all to the software gromacs and the world of bioinfromatics. Researchgate website might be more suitable.
You might want to consult the GROMACS resource/tutorial webpage, which has plenty of useful links.
You might replicate your simulation system coordinates with gmx genconf, and edit the topology file directly to specify the changed composition.
Possible Duplicate:
Workflow for statistical analysis and report writing
I have not been programming with R for very long, but I am running into a project-organization question that I was hoping somebody could give me some tips on. I find that a lot of the analysis I do is ad hoc: I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding; it is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas on how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question, I guess. Should I document my entire project at each stage and be rigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data live in a database, but during my analysis the processed data are saved as .RData files. Only when I delete the .RData files are they recreated from the database the next time I run the analysis. It works like a cache and saves database access and data-processing time when I rerun (parts of) my analysis.
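That cache pattern might look roughly like this (the file, table and function names are invented for illustration):
cache_file <- "processed_data.RData"

if (file.exists(cache_file)) {
  load(cache_file)    # reuse the cached processed data
} else {
  # 'con' is assumed to be an open DBI database connection
  raw <- DBI::dbGetQuery(con, "SELECT * FROM measurements")
  processed_data <- process_measurements(raw)   # placeholder for the real processing code
  save(processed_data, file = cache_file)       # cache it for the next run
}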
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all R code, only of a small section. This saves quite a bit of running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)