Controlling finite-state machines

I'm attempting to improve my understanding of FSMs by using them in an application responsible for deploying software to a staging system. A component of this application is responsible for ensuring that git repositories are installed, up-to-date and correctly checked-out. I have modelled the possible states of the repository as a finite-state machine.

Most examples that I have seen assume that there will be a series of external events driving the FSM, but in this case there is just an end goal: ensure that the repository is installed, up-to-date and checked-out. I could hard-wire the FSM so that it fires the events necessary to ensure that it reaches the checked-out state: if state is missing, install; if state is out-of-date, update; and so on.

But what if I want the system to be more flexible, such that it can be extended to allow multiple predetermined outcomes? I could write a number of conditional statements that control the flow of events sent to the FSM, but that doesn't feel right: it defeats the object of using an FSM in the first place. It feels as though multiple FSMs are required: one that encapsulates the behaviour of the git repository, and one or more that describe the path to take to reach the desired end goal.
At this point, I wonder if my understanding of FSMs is flawed. Is it common to have an FSM control itself by issuing the events required to reach a particular state? And what about having one FSM control another? Is there a name for this type of system?
Update
As requested, here's my state diagram:

I don't think your understanding is flawed, but the FSM machinery does seem unwarranted.
"A component of this application is responsible for ensuring that git repositories are installed, up-to-date and correctly checked-out. I have modelled the possible states of the repository as a finite-state machine."
Would you mind posting one of your FSMs? My guess is that you are not getting very interesting ones, as FSMs are overkill for expressing the sort of state transitions you described, e.g.:
0 -> git repos are installed -1-> git repos are up-to-date -2-> correctly checked out
You get an uninteresting FSM, which just goes through a sequence of states. Alternation (the other building block of FSMs) doesn't make much sense if you have dependencies.
I guess a simple sequence of dependencies would be enough to model your system:
git repos are correctly checked out <- git repos are up-to-date
git repos are up-to-date <- git repos are installed
meaning that for the git repos to be correctly checked out, one has to have (depends on) the git repos being up-to-date. If you represent your whole system with just the dependencies, a topological sort will be enough to produce a correct ordering. The nice thing is that whenever the topological sort yields a set of nodes with no unmet dependencies (think of the naive algorithm), you can check any of them in any order... or in parallel if you wish. The FSM formalism does not easily give you that.
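For illustration, here is a minimal sketch of this idea (Python, with invented task names; not the original poster's actual component):

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # dependency graph: task -> the set of tasks it depends on
    deps = {
        "up-to-date": {"installed"},
        "checked-out": {"up-to-date"},
    }

    # walk the tasks in dependency order: installed, up-to-date, checked-out
    for task in TopologicalSorter(deps).static_order():
        print("ensure:", task)

TopologicalSorter also exposes a prepare()/get_ready()/done() interface that hands you exactly those sets of nodes with no unmet dependencies, ready to be processed in any order, or in parallel.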

In an event-driven FSM, transitions occur in response to events. If you were to model this as an FSM, then each action should emit an event on completion, which is fed back into the FSM. The event then determines the next transition. This approach is not uncommon. I think an FSM works well here, provided you restructure your code to be event-driven.
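A minimal sketch of that feedback loop (Python, with invented state and event names; the real actions would shell out to git):

    # transition table: (state, event) -> next state
    TRANSITIONS = {
        ("missing", "installed"): "out-of-date",
        ("out-of-date", "updated"): "not-checked-out",
        ("not-checked-out", "checked-out"): "ready",
    }

    # the action to run in each state; each returns its completion event
    ACTIONS = {
        "missing": lambda: "installed",            # would run `git clone`
        "out-of-date": lambda: "updated",          # would run `git fetch`/`git merge`
        "not-checked-out": lambda: "checked-out",  # would run `git checkout`
    }

    state = "missing"
    while state in ACTIONS:                  # "ready" has no action: goal reached
        event = ACTIONS[state]()             # run the action; it emits an event
        state = TRANSITIONS[(state, event)]  # the event drives the transition
    print("final state:", state)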
Regarding FSMs working together: there is a notion of hierarchical FSMs (statecharts).
Another possibility: you may want to use a process or job-control DSL. Often these have an FSM hidden under the covers, but you express your logic using a higher-level language.

Related

R and RStudio Docker vs Binder

My problem is that I can't use RStudio at my workplace, as IT does not support it. I want to use the R and RStudio that are installed on my personal laptop from my company laptop (using a modern browser, which is behind a firewall). I am considering two things:
Should I build a Docker image for R and RStudio (I see base images are already available)? I am mostly interested in base R and the dplyr (plus haven, xporter, and reticulate) packages.
Or should I use Binder? I am not a technical person and my programming skills are very limited; can anyone suggest a way?
What exactly is the difference between using the Docker option vs Binder?
I know I can use RStudio online and get my work done, but with the new paid account I am running out of project hours, and it is sometimes very slow. Thanks in advance.
Here are some examples beyond the modern RStudio MyBinder example:
https://github.com/fomightez/pythonista_skewedf
https://github.com/fomightez/r_phylogenetics_worshop
https://github.com/fomightez/chapter7/tree/master/binder
The modern RStudio MyBinder example has been set up as a template on GitHub, so you can use it as the starting point for your own repository.
The first one is for a special use of a package that isn't on conda, and I started that one from square one.
The other two were converted from content by others to aid in making them Binder-ready.
You essentially list everything you need from conda in the environment.yml, along with the appropriate channels. If you need special things that aren't on conda, you need the other configuration files included there.
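For the packages mentioned in the question, a minimal environment.yml sketch might look like the following (the r-* package names are assumed to be the current conda-forge ones; verify them before relying on this). Note that xporter isn't on conda as far as I know, so it would need one of those other configuration files:

    channels:
      - conda-forge
    dependencies:
      - r-base
      - r-dplyr
      - r-haven
      - r-reticulate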
Getting everything working can take some iterations of adding things, letting the image build, and testing that your libraries are available, although, as you describe it, your situation doesn't sound overly complex.
The binder launch badges you see are just images whose link URL points the MyBinder federation at your repository. Look at a working launch URL and you should see the pattern: the path identifies your repo, and a suffix (typically ?urlpath=rstudio) tells it to open RStudio. The form at the MyBinder.org site can help with this; however, it is usually easier to adapt a working launch badge's code copied from elsewhere, because the form isn't set up at this time for making the URLs that launch into RStudio.
Download anything useful you create in a running session. Sessions time out after about 10 minutes of inactivity, although an active RStudio session usually keeps them alive.
Lack of persistence and limited memory, storage, and compute power can be drawbacks. The inherent reproducibility and portability are advantages.
MyBinder.org doesn't work with private repos. If you have code you don't want to share, you can upload it to the temporary session, while using the repo only to specify the environment. You could host a private BinderHub that does allow the use of private git repositories; however, that is probably overkill for your use case and beyond your ability level at this time.
GitHub isn't the only place to host repositories that can be pointed at the MyBinder system. If you go to the MyBinder.org page and click where it says 'GitHub' on the left side of the top line of the form, you can see a list of the sources at which you can host a repository and point the system to build an image and launch a container with that specified image.
Building the image from a source repository takes some minutes the first time. Once the image is built on the service, though, launch typically takes less than 30 seconds. Each time you make a change to the source repo, a new build is necessary. Some changes don't cause the new build to take as long as the initial one, as some optimisation is done to rebuild only what is necessary after a change. Keep in mind that there are several members of the federation around the world, and if your traffic gets routed to one where the built image isn't yet available, it will be built from scratch again first.
The Holepunch project is out there to offer some help for users working in the R ecosystem; however, with the R-Conda system that is now integrated into MyBinder, it is pretty much as easy to do it the way I described. Last I knew, the Holepunch route makes a Dockerfile that isn't as easy to troubleshoot as the current R-Conda route. Dockerfiles are essentially a last-ditch configuration file for MyBinder: the other configuration files are much easier and don't require knowing Dockerfile syntax. MyBinder aims to offer the advantage of Docker, containers with a specified environment, without users needing to know anything about Docker.
There is a Binder Help category for posting to get help at the Jupyter Discourse Forum. Some other examples of posts already there may help you troubleshoot.
Notice of a common pitfall
Most of the configuration files for making a repository Binder-ready are simply text, and can be edited right in the GitHub browser interface, without needing git or even cloning the repo locally.
Last I knew, there are two exceptions to this. The postBuild and start configuration files have settings that allow them to be run as scripts, and these get altered in a way that stops them working if you edit them via the GitHub browser interface. (This was my experience when I last tried; your mileage may vary, or things may have changed by now.) To edit those, you have to have git available on a system of yours: pull one from some other source, edit it on the machine where git works, add it to your repo, and push it back up from your local computer.
(If this is a problem, you can post in the Jupyter Discourse Forum Binder help category, and you and I could coordinate: I would fork your repo, edit those files to your specifications, and then make a pull request so you can pull those changes back into your source.)
If you are using Jupyter notebooks extensively, then it may make sense to use Binder.
But if you simply want to use R and RStudio, then all you need is Docker. A good resource is
https://github.com/rocker-org/rocker

Is there a convention for git version control between multiple operating systems in R?

Apologies if this isn't an appropriate question for SO; if not, please let me know and I'll delete/move it. I just haven't found any resources on this myself. Anything I google related to "multiple operating systems git" gives me pages for applications that work on multiple OSes, like GitHub or Tower.
I currently work regularly between two operating systems: a PC at the office, a Mac at home. I've been managing this with git by using my master branch for PC/Windows R code, while using an OSXversion branch for Mac R code. This is fine whenever I'm updating Windows- or Mac-specific code on each branch (such as package installation instructions in the comments). Where this gets tricky is general improvements to my code that apply to both Mac and PC. What I've been doing is manually copy-pasting any general improvements between my Mac/PC code, or cherry-picking my merges. Is there a better way to be doing this?
It's fine to store code that runs on different operating systems in a Git repository. Simply check out the repository on each of the operating systems you're going to be working on.
The only thing you need to watch out for is line endings. Unless you're dealing with files that specifically require native CRLF/LF end-of-line styles, you're best off turning the automatic conversion off.
This can be done with:
git config --global core.autocrlf false
Further notes on autocrlf can be found on the GitHub help page itself.
As for your actual committing, you'll want to follow Git Flow: have a develop branch that is based off master, and from there create individual feature branches. These feature branches can be worked on from either Windows or Mac. The package installation instructions should really go in your README.md file.

Dev and deploy management with SVN of a Web Site

We have a .NET solution for a website, consisting of 5 projects, and there are a few (fewer than 10) developers working on the solution. We deploy almost on a daily basis.
The question is: how do we set up the SVN repo to support this scenario (the daily deploy), bearing in mind that not every committed file should go to production; there is a QA check before deploying.
Try out TeamCity (a CI tool), as it's free for smaller amounts of CI. It may be better for you than CruiseControl.NET, as CCNet is very configuration-heavy (it's all done via XML), whereas TeamCity uses wizards to create the scripts that manage releases.
If you need any other help on CI, let me know, as it's something I am evangelistic about.
What you want to do is commonly referred to as Continuous integration (CI).
While you can do that using Subversion, it is probably not the right tool for the job.
There is special CI software, which will allow you to easily automate the necessary tasks (checkout from version control, compiling / building, running automatic tests, deployment etc). An example would be CruiseControl.NET.
As to "not every commited file should go to production", the common solution is to have a special "release" branch, which gets deployed. Only tested code is merged there (or have the trunk always be stable, otherwise same model). Of course, you can also (better: additionally) have tests before your automatic deployment, and only deploy if all tests pass.
Working with a release branch
In practice, this means that people check in their code as they produce it. Sometimes this code will work, sometimes not. When the release time draws nearer, a "release branch" is created in Subversion. This release branch is then effectively a frozen snapshot of the source as it was at the time of branching. Now this branch can be used to compile & deploy the application, which can then be tested.
No new code is checked into the branch (but check-ins can continue elsewhere). Only if a bug is detected in the branch will there be a check-in to fix it. This continues until the branch passes all tests. Then the branch can be released as a new version of the software; afterwards, the branch will only be used if the released version needs to be patched.
Of course, any bugfixes checked into the branch need to also be put into the trunk (either by merging branch -> trunk, for which Subversion provides special support, or by reimplementing the fix in the trunk, as appropriate).

Questions on setting up automated builds

When using an automated build system, it is usually a source-control commit which triggers the build and tests (though I assume this can be configured not to run on every commit in a large team). How come build applications have actions for source-code check-ins? Is there any need for this? So to summarise, is a build script executed on a source-control commit or at a certain time every day?
Also, the term "break the build": does this mean code is put into source control, and when the build is executed it fails, due to the code not passing a unit test or a code-coverage tool returning results below a certain threshold?
Finally, what does a "step" mean (e.g. a one-step build)?
Thanks
So to summarise, is a build script executed on a source-control commit or at a certain time every day?
This depends. Some teams use a commit in the version control system as the trigger; some teams use a temporal event as the trigger (e.g. every hour). If you run the build after each change, you get immediate feedback. If you let some time pass between two builds, you delay that feedback, and in the case of a build failure it's harder to identify the change(s) that caused it; it may require more investigation.
Just to clarify the vocabulary: for me, "the build" is actually the script/tool that automates all the things that need to be done (compiling, running tests, etc). Running this automated build continuously is what people call "continuous integration". And triggering a build on an event (time-based or on commit), pulling the sources from the repository, running the build script, and notifying people in case of failure is the responsibility of a "continuous integration engine".
Also, the term "break the build": does this mean code is put into source control, and when the build is executed it fails, due to the code not passing a unit test or a code-coverage tool returning results below a certain threshold?
This is very binary indeed: the build passes, or it doesn't. When it doesn't, there can be many reasons: the code didn't compile, a test failed, a quality check failed (coding standards, code coverage, etc). If you commit some code that causes a build failure (whatever the reason), then you "broke the build".
Finally, what does a "step" mean (e.g. a one-step build)?
In my opinion, a one-step build means that you can build your entire application, run the tests, run the quality checks, produce reports, assemble the application, deploy it, etc. with one command. This is a synonym for an automated build (if you can't run your build in one step, i.e. if it requires human intervention, then it isn't fully automated).
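As an illustration, a one-step build can be as simple as a minimal driver script like this sketch (Python; the make/pytest commands are placeholders for whatever compiler, test runner, and packager your project actually uses):

    #!/usr/bin/env python3
    # One-step build: a single command that does everything, and fails fast.
    import subprocess
    import sys

    # placeholder steps; substitute your real compile/test/package commands
    STEPS = [
        ["make", "all"],                      # compile
        ["python", "-m", "pytest", "tests"],  # run the tests
        ["make", "package"],                  # assemble the deliverable
    ]

    for cmd in STEPS:
        print("->", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            sys.exit("build broken at: " + " ".join(cmd))
    print("build OK")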
Also, the term "break the build": does this mean code is put into source control, and when the build is executed it fails, due to the code not passing a unit test or a code-coverage tool returning results below a certain threshold?
This could mean different things depending on the company, project, or team.
Usually the "build" is some reference procedure (usually automated) which either succeeds or fails.
Thus "breaking the build" is doing something that causes this reference procedure to fail.
It could include or exclude running unit tests, running regression tests, deploying your product, or whatever else your team thinks should never fail.

What artifacts to save for a nightly build?

Assume that I set up an automatic nightly build. What artifacts of the build should I save?
For example:
Input source code
Output binaries
Also, how long should I save them, and where?
Do your answers change if I do Continuous Integration?
You shouldn't save anything for the sake of saving it. You should save it because you need it (e.g., QA uses nightly builds to test). At which point, "how long to save it" becomes however long QA wants them.
I wouldn't "save" source code so much as tag/label it. I don't know what source control you're using, but tagging is trivial (in performance and disk space) for any quality source-control system. Once your build is tagged, unless you need the binaries, there really isn't any benefit to keeping them around, because you can simply re-compile from source when necessary.
Most CI tools let you tag on each successful build. This can become problematic for some systems as you can easily have 100+ tags a day. For such cases I recommend still running a nightly build and only tagging that.
Here are some artifacts and pieces of information that I usually keep for each build:
The tag name of the snapshot you are building (tag and do a clean checkout before you build)
The build scripts themselves, or their version number (if you treat them as a separate project with its own version control)
The output of the build script: logs and final product
A snapshot of your environment:
compiler version
build tool version
libraries and DLL/lib versions
database version (client & server)
IDE version
script interpreter version
OS version
source control version (client and server)
versions of other tools used in the process, and everything else that might influence the content of your build products. I usually do this with a script that queries all this information and logs it to a text file that is stored with the other build artifacts; a sketch of such a script follows below.
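A minimal sketch of such a snapshot script (Python; the tool list is an assumption, adjust it to your own stack):

    import platform
    import subprocess

    TOOLS = ["gcc --version", "make --version", "git --version"]  # assumed stack

    # log the environment next to the other build artifacts
    with open("build_environment.txt", "w") as log:
        log.write("OS: %s\n" % platform.platform())
        for tool in TOOLS:
            try:
                result = subprocess.run(tool.split(), capture_output=True, text=True)
                lines = (result.stdout or result.stderr).splitlines()
                version = lines[0] if lines else "(no output)"
            except FileNotFoundError:
                version = "not installed"
            log.write("%s: %s\n" % (tool, version))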
Ask yourself this question: "if something entirely destroys my build/development environment, what information would I need to create a new one, so I can redo build #6547 and end up with exactly the same result I got the first time?"
Your answer is what you should keep at each build and it will be a subset or superset of the things I already mentioned.
You can store everything in your SCM (I'd recommend a separate repository), but in that case your question of how long to keep the items loses its meaning. Alternatively, store them in zipped folders, or burn a CD/DVD with the build result and artifacts. Whatever you choose, keep a backup copy.
You should store them as long as you might need them. How long, will depend on your development team pace and your release cycle.
And no, I don't think it changes if you do continuous integration.
This isn't a direct answer to your question, but don't forget to version control the nightly build setup itself. When the project structure changes, you may have to change the build process, which will break older builds from that point on.
In addition to the binaries, as everyone else has mentioned, I would recommend setting up a symbol server and a source server, and making sure you get the correct information out and into them. It will aid debugging tremendously.
We save the binaries, stripped and unstripped (so we have exactly the same binary, once with and once without debug symbols). Furthermore, we build everything twice, once with debug output enabled and once without (again, stripped and unstripped, so every build results in 4 binaries). The build is stored in a directory named after the SVN revision number. That way we can always retrieve the source from the SVN repository by simply checking out that very revision (so the source is archived as well).
A surprising one I learned about recently: If you're in an environment that might be audited you'll want to save all the output of your build, the script output, the compiler output, etc.
That's the only way you can verify your compiler settings, build steps, etc.
Also, how long to save them for, and where to save them?
Save them until you know that the build won't be going to production; in other words, as long as you have the compiled bits around.
One logical place to save them is your SCM system. Another option is to use a tool that will automatically save them for you, like AnthillPro and its ilk.
We're doing something close to "embedded" development here, and I can tell you what we save:
the SVN revision number and timestamp, as well as the machine it was built on and by whom (also burned into the build binaries)
a full build log, showing whether it was a full or incremental build, any interesting (STDERR) output the data-baking tools produced, a list of files compiled, and any compiler warnings (this compresses very well, being text)
the actual binaries (for anywhere from 1-8 build configurations)
files produced as a side effect of linking: a linker command file, address map, and a sort of "manifest" file indicating what was burned into the final binaries (CRC and size for each), as well as the debugging database (.pdb equivalent)
We also mail out the result of running some tools over the "side-effect" files to interested users. We don't actually archive these since we can reproduce them later, but these reports include:
total and delta of filesystem size, broken down by file type and/or directory
total and delta of code section sizes (.text, .data, .rodata, .bss, .sinit, etc)
When we have unit tests or functional tests (e.g. smoke tests) running, those results show up in the build log.
We've not thrown out anything yet; granted, our target builds usually end up at ~16 or 32 MiB per configuration, and they're fairly compressible.
We do keep uncompressed copies of the binaries around for 1 week for ease of access; after that we keep only the lightly compressed version. About once a month we have a script that extracts each .zip that the build process produces and 7-zips a whole month of build outputs together (which takes advantage of only having small differences per build).
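A minimal sketch of that monthly consolidation (Python; the paths and date are hypothetical, and the 7z command-line tool must be on PATH):

    import pathlib
    import subprocess
    import zipfile

    month_dir = pathlib.Path("builds/2009-07")    # hypothetical layout
    work = month_dir / "unpacked"
    for zip_path in sorted(month_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(work / zip_path.stem)   # one folder per build

    # solid-compress the whole month together so 7-Zip can exploit the
    # small per-build differences
    subprocess.run(["7z", "a", "-mx=9", str(month_dir) + ".7z", str(work)],
                   check=True)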
An average day might have a dozen or two builds per project... The buildserver wakes up about every 5 minutes to check for relevant differences and builds. A full .7z on a large very active project for one month might be 7-10GiB, but it's certainly affordable.
For the most part, we've been able to diagnose everything this way. Occasionally there's a hiccup on the build system and a file isn't actually at the revision it's supposed to be when a build happens, but there's usually enough evidence of this in the logs. Sometimes we have to dig out a tool that understands the debugging database format and feed it a few addresses to diagnose a crash (we have automatic stack dumps built into the product). But usually all the information needed is there.
We haven't had to crack open the .7z archives yet, I should mention. But we have the information there, and I have some interesting ideas on how to mine useful data from it.
Save what can't be reproduced easily. I work on FPGAs, where only the FPGA team has the tools, and some cores (libraries) of the design are licensed to compile on only one machine. So we save the output bitstreams. But try to check them against one another rather than relying on a date/time/version stamp.
Save as in "check in to source code control", or just on disk? Save nothing to source code control. All derived files should be visible in the file system and available to developers. Don't check in binaries, code generated from XML files, message digests, etc. A separate packaging step will make these end products available. As you have the change number, you can always reproduce the build if necessary, assuming of course that everything you need to do a build is completely in the tree and available to all builds by syncing.
I would save your built binaries for exactly as long as they have a chance to go into production or be used by some other team (like a QA group). Once something has left production, what you do with it can vary a lot. For a lot of teams, they'll keep just their most recent prior build around (for rollback) and otherwise discard their builds.
Others have regulatory requirements to keep anything that went into production around for as long as seven years (banks). If you are a product company, I'd keep around any binary a customer might have installed in case a tech support guy wants to install the same version.
