Does the non-deterministic nature of property-based testing hurt build repeatability? - functional-programming

I am learning FP and was introduced to the concept of property-based testing (PBT). For someone from the OOP world, PBT looks both useful and dangerous. It does check a lot of inputs, but what if one (or some) of them would fail, just not during your first, let's say, Jenkins build? The next time you run the build the test may or may not fail. Doesn't that kill the entire idea of repeatable builds?
I see that some people have explored options to make the tests deterministic, but then if such a test doesn't catch an error, it will never catch it.
So what's the better approach here? Do we sacrifice build repeatability to eventually uncover a bug, or do we take the risk of never uncovering it but get our repeatability back?
(I hope that I properly understood the concept of PBT, but if I didn't I would appreciate if somebody could point out my misconceptions)

Doing a lot of property-based testing, I don't see non-determinism as a big problem. I basically experience three types of it:
1. A property is really non-deterministic because some external factor (e.g. a timeout, a delay, the db config) makes it so. Such flaky tests also show up in example-based testing and should be eliminated by making the external factor deterministic.
2. A property fails rarely because the triggering condition is only sometimes met by pseudo-random data generation. Most PBT libraries have ways to reproduce those failing runs, e.g. by re-using the random seed of the failing test run or even by remembering the exact failing input in a database of some sort. Those failures reveal real problems and are one of the reasons why we're doing random test case generation in the first place.
3. Coverage assertions ("this condition will be hit in at least 5 percent of all cases") may fail from time to time even though they are generally true. This can be mitigated by raising the number of tries. Some libs, e.g. quickcheck, do their own calculation of how many tries are needed to prove/disprove coverage assumptions and thereby mostly eliminate those false positives.
The important thing is to always follow up on flaky failures and find the bug, the non-deterministic external factor or the wrong assumption in the property's invariant. When you do that, sporadic failures will occur less and less often. My personal experience is mostly with jqwik, but other people have been telling me similar stories.
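To make the seed-replay idea from the second point concrete, here is a minimal sketch in Python. It uses only the standard library and a tiny hand-rolled runner, not jqwik or any real PBT library; `check_property`, `gen_int_list` and the deliberately broken `buggy_max` are hypothetical names made up for this illustration.

```python
import random


def check_property(prop, gen, tries=100, seed=None):
    """Run prop on `tries` generated inputs.

    Returns (seed, counterexample); counterexample is None if every input passed.
    """
    if seed is None:
        # Fresh seed for exploratory runs; it is returned so it can be logged.
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)
    for _ in range(tries):
        value = gen(rng)
        if not prop(value):
            return seed, value
    return seed, None


def gen_int_list(rng):
    """Hypothetical generator: short lists of small integers."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]


def buggy_max(xs):
    """Deliberately wrong: starts from 0, so all-negative lists yield 0."""
    best = 0
    for x in xs:
        best = max(best, x)
    return best


def prop_max_is_member(xs):
    """The maximum of a non-empty list must be one of its elements."""
    return not xs or buggy_max(xs) in xs


if __name__ == "__main__":
    seed, counterexample = check_property(prop_max_is_member, gen_int_list)
    if counterexample is not None:
        print(f"failed for {counterexample!r}; replay with seed={seed}")
        # Replaying with the logged seed regenerates the same counterexample.
        _, replayed = check_property(prop_max_is_member, gen_int_list, seed=seed)
        assert replayed == counterexample
```

The point of returning the seed is that a failure seen once on the build server can be replayed locally as an ordinary deterministic test until it is fixed.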

You can have both non-determinism and reproducible builds by generating the randomness outside the build process. You could generate it during development or during external testing.
One example would be to seed your property based tests, and to automatically modify this seed on commit. You're still making a tradeoff. A developer could be alerted of a bug unrelated to what they're working on, and you lose some test capacity since the tests might change less often.
You can tip the tradeoff further in the deterministic direction by making the seed change less often. You could for example have one seed for each program component or file, and only change it when a related file is committed.
A different approach would be to not change the seed during development at all. You would instead have automatic QA doing periodic or continuous testing with random seeds and use them to generate bug reports/issues that can be dealt with when convenient.
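One way the "seed per component, changed only on a related commit" idea could look is sketched below in Python. This is an assumption about how you might wire it up, not an existing tool: `seed_for`, `generate_cases` and the component/revision strings are hypothetical, and how you obtain the revision (e.g. from your VCS) is left out.

```python
import hashlib
import random


def seed_for(component: str, revision: str) -> int:
    """Derive a stable seed from a component name and the last revision touching it."""
    digest = hashlib.sha256(f"{component}@{revision}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_cases(component: str, revision: str, count: int = 5):
    """Generate test inputs that only change when the component's revision changes."""
    rng = random.Random(seed_for(component, revision))
    return [rng.randint(0, 10**6) for _ in range(count)]


if __name__ == "__main__":
    # Same component and revision: identical inputs on every build.
    assert generate_cases("parser", "abc123") == generate_cases("parser", "abc123")
    # A new revision of the component yields a new seed and a new set of inputs.
    print(seed_for("parser", "abc123"), seed_for("parser", "def456"))
```

Rebuilding the same revision then always exercises the same inputs, while the inputs are refreshed exactly when the related code changes.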

johanneslink's analysis of non-determinism is spot on.
There's one thing I would like to add: non-determinism is not only a rare and small cost, it's also beneficial. If the first run of your test suite is successful, insisting on determinism means insisting that future runs (of the same suite against the same system) will find zero bugs.
Usually most test suites contain many independent tests of many independent system parts, and commits rarely change large parts of the system. So even across commits, most tests test exactly the same thing before and after, where once again determinism guarantees that you will find zero bugs.
Allowing for randomness means every run has at least a chance of discovering a bug.
That of course raises the question of regression tests. I think the standard argument is something like this: to maximize value per effort you should focus your testing on the most bug-prone parts of the code. Having observed a bug in the past provides evidence about which part of the code is buggy (and which kind of bug it's likely to have). You should use that evidence to guide your testing effort. (Often with a laser-like focus on one concrete bug.)
I think this is a very reasonable argument. I also think there's more than one way of making good use of the evidence provided by bugs.
For example, you might write a generator which produces data of the same kind and shape as the data which triggered the bug the first time, and/or which is tailor made to trigger the bug.
And/or, you might want to write tests verifying specifically those properties that were violated by the buggy behavior.
If you want to judge how good these tests are, I recommend running them a couple of times (on normally sized input batches). If they trigger the bug every time, they are likely to do so in the future as well.
Here's a (hopefully thought-)provoking question: is it worse to release software which has a bug it has had before, or to release software with new bugs? In other words: is catching past bugs more important than catching new ones, or do we do it primarily because it's easier?
If you think we do it in part because it's easier, then I don't think it matters that re-catching the bug is probabilistic: what you should really care about is something like the average bug-catching abilities of property testing—its benefits elsewhere should outweigh the fairly small chance that an old bug squeaks through, even though it got caught in (say) 5 consecutive runs of the tests when you evaluated your regression tests.
Now, if you can't reliably generate random inputs that trigger the bug even though you understand the bug just fine, or the generator which does it is large and complicated and thus costly to maintain, hand-picking a regression example seems like a perfectly reasonable choice.
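To put rough numbers on "probabilistic re-catching", here is a small back-of-the-envelope sketch in Python. It assumes each generated input triggers the bug independently with a fixed probability, which is a simplification; the function names are made up for this example.

```python
import math


def detection_probability(hit_rate: float, tries: int) -> float:
    """Chance that at least one of `tries` independent inputs triggers the bug."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return 1.0 - (1.0 - hit_rate) ** tries


def tries_needed(hit_rate: float, confidence: float) -> int:
    """Smallest number of tries that detects the bug with at least `confidence`."""
    if not (0.0 < hit_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("hit_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


if __name__ == "__main__":
    # A bug hit by 1% of inputs is found in roughly 63% of 100-try runs.
    print(round(detection_probability(0.01, 100), 3))
    # Tries required to reach 95% detection for the same bug.
    print(tries_needed(0.01, 0.95))
```

A tailor-made generator effectively raises `hit_rate`, which is why a targeted property can re-catch an old bug almost every run even though it is still random.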


why should I not use find optimizations?

I read in the manual and info pages the sections about optimization levels in the find command and I cannot understand why I should not use the most aggressive optimization level.
The only relevant sentences I found were (from the find man page, version 4.4.2):
Conversely, optimisations that prove to be reliable, robust and effective may be enabled at lower optimisation levels over time.
The findutils test suite runs all the tests on find at each optimisation level and ensures that the result is the same.
If I understand correctly, this is about verifying the correct behaviour of find through the findutils test suite, but that test suite ensures that all optimization levels give the same result.
You're missing this sentence:
The cost-based optimiser has a fixed idea of how likely any given test is to succeed.
That means if you have a directory with highly atypical contents (e.g. a lot of named pipes and very few "regular" files), the optimizer may actually worsen the performance of your query (in this case by assuming -type f is more likely to succeed than -type p when the reverse is true). In a situation like this, you're better off hand-optimizing it, which is only really possible at -O1 or -O2.
Even ignoring this issue, the fixed costs of the cost based optimizer are difficult to get right. There are multiple pieces of hardware and software involved (the hard disk, the kernel, the filesystem) which all do some caching and optimization of their own. As a result, it is very hard to predict how expensive different operations will be, even relative to one another (e.g. we know readdir(2) is cheaper than stat(2), but we don't know how much cheaper). This means cost-based optimization is not always guaranteed to produce the best optimization even assuming typical filesystem contents. The lower optimization levels allow you to hand-tune your query by trial and error, which may be more reliable, if more laborious.
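A toy model may help show why fixed success probabilities can backfire. The Python sketch below is not find's actual cost model: the costs, probabilities and function names are all invented for illustration. It evaluates two tests joined by a short-circuit OR and compares the order chosen from assumed probabilities with a hand-picked order in a directory full of named pipes.

```python
def expected_or_cost(tests, success_prob):
    """Expected cost of evaluating `tests` left to right with short-circuit OR.

    tests: list of (name, cost); success_prob: name -> probability of success.
    Tests are treated as independent, which is a simplification.
    """
    total = 0.0
    p_reached = 1.0  # probability that evaluation gets as far as this test
    for name, cost in tests:
        total += p_reached * cost
        p_reached *= 1.0 - success_prob[name]
    return total


def order_by_assumed_odds(tests, assumed_prob):
    """Put the tests that are assumed most likely to succeed per unit cost first."""
    return sorted(tests, key=lambda t: assumed_prob[t[0]] / t[1], reverse=True)


# Hypothetical numbers: the optimizer assumes regular files dominate,
# but this directory actually contains mostly named pipes.
TESTS = [("type_p", 1.0), ("type_f", 1.0)]
ASSUMED = {"type_f": 0.9, "type_p": 0.05}
ACTUAL = {"type_f": 0.05, "type_p": 0.9}

if __name__ == "__main__":
    optimizer_order = order_by_assumed_odds(TESTS, ASSUMED)
    hand_order = [("type_p", 1.0), ("type_f", 1.0)]
    print("optimizer order:", expected_or_cost(optimizer_order, ACTUAL))
    print("hand-tuned order:", expected_or_cost(hand_order, ACTUAL))
```

With these made-up numbers the order derived from the assumed probabilities costs noticeably more per file than the hand-tuned order, which is the effect described above.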

How to approach writing developer tests (unit tests, integration tests, etc) for a system?

I have a WCF service which interacts with a database, the file system and a few external web services, then builds the result, XML-serializes it and finally returns it.
I'd like to write tests for this solution and I'm thinking how (it's all using dependency injection and design by contract).
There are 3 main approaches I can take.
1) I can pick the smallest units of code/methods and write tests for them: pick one class and isolate it from its dependencies (other classes, etc.). Although this guarantees quality, writing such tests takes a lot of time and is slow.
2) Only make the interaction with external systems mockable and write some tests that cover the main scenarios from when the request is made until the response is serialized and returned. This will test all the interactions between my classes but mocks all external resource accesses.
3) I can set up a test environment where the interaction with external web services does happen, file access happens, database access happens, etc., and then write the tests from end to end. This requires environmental setup and depends on all other systems being up and running.
About #1, I see no point in investing the time/money/energy on writing the tests for every single method or codes that I have. I mean it's a waste of time.
About #3, since it has dependency on external resources/systems, it's hard to set it up and running.
#2, sounds to be the best option to me. Since it will test what it should be testing. Only my system and all its classes and mocking all other external systems.
So basically, my conclusion after some years of experience with unit tests is that writing unit tests is a waste to be avoided, and that isolated system tests give the best return on investment instead.
Even if I were going to write the tests first (TDD) and then the production code, I think #2 would still be best.
What's your view on this? Would you write small unit tests for your application? Would you consider it a good practice and the best use of time/budget/energy?
If you want to talk about quality, you should have all 3:
Unit tests to ensure your code does what you think it does, expose any edge cases and help with regression. You (developer) should write such tests.
Integration tests to verify correctness of entire process, whether components talk to each other correctly and so on. And again, you as a developer write such tests.
System-wide tests in production-like environment (with some limitations naturally - you might not have access to client database, but you should have its exact copy on your local machines). Those tests are usually written by dedicated testers (often in programming languages different from application code), but of course can be written by you.
Second and third type of tests (integration and system) will be way too much effort to test edge cases of smaller components. This is what you usually want unit tests for. You need integration because something might fail on hooking-up of tested, verified and correct modules. And of course system tests is what you do daily, during development, or have assigned people (manual testers) do it.
Going for selected type of tests from the list might work to some point, but is far from complete solution or quality software.
All 3 are important and target different kinds of tests: think of a matrix of unit/integration/system categories, with positive and negative testing in each category.
For code coverage Unit testing will yield the highest percentage, followed by Integration then System.
You also need to consider whether or not the purpose of the test is Validation (will meet the final user\customer requirements, i.e Value) or Verification (written to specification, i.e. Correct).
In summary the answer is 'it depends', and I would recommend following the SEI CMMi model for Verification and Validation (i.e. testing) which begins with the goals (value) of each activity then subjecting that activity to measures that will ultimately allow the whole process to be subjected to continuous improvement. In this way you have isolated the What and Why from the How and you will be able to answer time and value type questions for your given environment (which could be a Life support System or a Tweet of the day, to your favorite Aunt, App).
Summary: #2 (integration testing) seems most logical, but you shouldn't hesitate to use a variety of tests to achieve the best coverage for pieces of your codebase that need it most. Shooting for having tests for "everything" is not a worthy goal.
Long version
There is a school of thought out there where devs are convinced that adopting unit\integration\system tests means striving for every single chunk of code being tested. It's either no test coverage at all, or committing to testing "everything". This binary thinking always makes adopting any kind of testing strategy seem very expensive.
The truth is, forcing every single line of code\function\module to be tested is about as sound as writing all your code to be as fast as possible. It takes too much time and effort, and most of it nets very little return. Another truth is that you can never achieve true 100% coverage in a non-trivial project.
Testing is not a goal unto itself. It's a means to achieve other things: final product quality, maintainability, interoperability, and so on, all while expending the least amount of effort possible.
With that in mind, step back and evaluate your particular circumstances. Why do you want to "write tests for this solution"? Are you unhappy with the overall quality of the project today? Have you experienced high regression rates? Are you perhaps unsure about how some module works (and more importantly, what bugs it might have)? Regardless of what your exact goal is, you should be able to select pieces that pose particular challenges and focus your attention on them. Depending on what those pieces are, an appropriate testing approach can be selected.
If you have a particularly tricky function or a class, consider unit testing them. If you're faced with a complicated architecture with multiple, hard to understand interactions, consider writing integration tests to establish a clean baseline for your trickiest scenarios and to better understand where the problems are coming from (you'll probably flush out some bugs along the way). System testing can help if your concerns are not addressed in more localized tests.
Based on the information you provided for your particular scenario, external-facing unit testing\integration testing (#2) looks most promising. It seems like you have a lot of external dependencies, so I'd guess this is where most of the complexity hides. Comprehensive unit testing (#1) is a superset of #2, with all the extra internal stuff carrying questionable value. #3 (full system testing) will probably not allow you to test external edge cases\error conditions as well as you would like.
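As a rough illustration of what option #2 can look like, here is a sketch in Python (the original service is WCF/.NET, so treat this purely as a language-neutral outline). The external resources are injected as plain callables and replaced by stubs in the test, while everything from request handling to XML serialization runs for real. `ReportService`, `load_orders` and `fetch_rate` are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET


class ReportService:
    """Hypothetical service: reads orders, converts currency, returns XML."""

    def __init__(self, load_orders, fetch_rate):
        # External dependencies (database, web service) are injected.
        self._load_orders = load_orders
        self._fetch_rate = fetch_rate

    def handle(self, customer_id: str, currency: str) -> str:
        orders = self._load_orders(customer_id)
        rate = self._fetch_rate(currency)
        root = ET.Element("report", customer=customer_id, currency=currency)
        total = 0.0
        for order_id, amount in orders:
            converted = round(amount * rate, 2)
            total += converted
            ET.SubElement(root, "order", id=order_id).text = f"{converted:.2f}"
        ET.SubElement(root, "total").text = f"{total:.2f}"
        return ET.tostring(root, encoding="unicode")


def test_report_end_to_end_with_stubs():
    # Only the external systems are stubbed; all internal logic is exercised.
    service = ReportService(
        load_orders=lambda customer_id: [("A1", 10.0), ("A2", 2.5)],
        fetch_rate=lambda currency: 2.0,
    )
    parsed = ET.fromstring(service.handle("c42", "EUR"))
    assert parsed.attrib == {"customer": "c42", "currency": "EUR"}
    assert [o.text for o in parsed.findall("order")] == ["20.00", "5.00"]
    assert parsed.find("total").text == "25.00"


if __name__ == "__main__":
    test_report_end_to_end_with_stubs()
    print("ok")
```

Error conditions of the external systems (timeouts, malformed responses) can be covered the same way by making a stub raise or return bad data, which is hard to arrange in a full system test.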

Project nearing completion. Time to begin testing. Which methods are feasible towards the end of the development cycle?

Let's assume one joins a project near the end of its development cycle. The project has been passed on across many teams and has been an overall free-for-all with no testing whatsoever taking place along the whole time. The other members on this team have no knowledge of testing (shame!) and unit testing each method seems infeasible at this point.
What would the recommended strategy for testing a product be at this point, besides usability testing? Is this normally the point where you're stuck with manual point-and-click expected output/actual output work?
I typically take a bottom-up approach to testing, but I think in this case you want to go top-down. Test the biggest components you can wrap unit-tests around and see how they fail. Those failures should point you towards what sub-components need tests of their own. You'll have a pretty spotty test suite when this is done, but it's a start.
If you have the budget for it, get a testing automation suite. HP/Mercury QuickTest is the leader in this space, but is very expensive. The idea is that you record test cases like macros by driving your GUI through use cases. You fill out inputs on a form (web, .net, swing, pretty much any sort of GUI), the engine learns the form elements names. Then you can check for expected output on the GUI and in the db. Then you can plug in a table or spreadsheet of various test inputs, including invalid cases where it should fail and run it through hundreds of scenarios if you like. After the tests are recorded, you can also edit the generated scripts to customize them. It builds a neat report for you in the end showing you exactly what failed.
There are also some cheap and free GUI automation testing suites that do pretty much the same thing but with fewer features. In general, the more expensive the suite, the less manual customization is necessary. Check out this list: http://www.testingfaqs.org/t-gui.html
I think this is where a good Quality Assurance test would come in. Write out old fashioned test cases and hand out to multiple people on the team to test.
What would the recommended strategy for testing a product be at this point, besides usability testing?
I'd recommend code inspection, by someone/people who know (or who can develop) the product's functional specification.
An extreme, purist way would be to say that, because it "has been an overall free-for-all with no testing whatsoever", therefore one can't trust any of it: not the existing testing, nor the code, nor the developers, nor the development process, nor management, nothing about the project. Furthermore, testing doesn't add quality to software (quality has to be built-in, part of the development process). The only way to have a quality product is to build a quality product; this product had no quality in its build, and therefore one needs to rebuild it:
Treat the existing source code as a throw-away prototype or documentation
Build a new product piece-by-piece, optionally incorporating suitable fragments (if any) of the old source code.
But doing code inspection (and correcting defects found via code inspection) might be quicker. That would be in addition to functional testing.
Whether you'll want to not only test it but also spend the extra time and effort to develop automated tests depends on whether you'll want to maintain the software (i.e., in the future, to change it in any way and then retest it).
You'll also need:
Either:
Knowledge of the functional specification (and non-functional specification)
Developers and/or QA people with a clue
Or:
A small, simple product
Patient, forgiving end-users
Continuing technical support after the product is delivered
One technique that I incorporate into my development practice when entering a project at this time in the lifecycle is to add unit tests as defects are reported (by QA or end users). You won't get full code coverage of the existing code base, but at least this way future development can be driven and documented by tests. Also this way you should be assured that your tests fail before working on the implementation. If you write the test and it doesn't fail, the test is faulty.
Additionally, as you add new functionality to the system, start those with tests so that at least those sub-systems are tested. As the new systems interact with existing, try adding tests around the old boundary layers and work your way in over time. While these won't be Unit tests, these integration tests are better than nothing.
Refactoring is yet another prime target for testing. Refactoring without tests is like walking a tight rope without a net. You may get to the other side successfully, but is the risk worth the reward?

When is it best to change code to match standards?

I have recently been put in charge of debugging two different programs which will eventually need to share, at minimum, an XML parsing script. One was written with PureMVC, and the other was built from scratch. It made sense, originally, to write that one from scratch (it saved a good deal of memory), but the memory problems have since been resolved.
Porting the non-PureMVC application will take a good deal of time and effort which does not need to be used, but it will make documentation and code-sharing easier. It will also lower the overall learning curve. With that in mind:
1. What should be taken into account when considering whether it is best to move things to one standard?
(On a related note)
Some of the code is a little odd. Because the interpreting App had to convert commands from one syntax to another, it made sense to have an interpreter Object. Because there needed to be communication with the external environment, it made more sense to have one object interact with the environment, and for that to deal with the interpreter exclusively.
Effectively, an anti-Singleton was created. The object would only interface with the interpreter, and that's it. If a member of another class were to try to call one of its public methods, the object would raise an Exception.
There are better ways to accomplish this, but it is definitely a bit odd. There are more standard means of accomplishing the same thing, though they often involve the creation of classes or class files which are extraordinarily large. The only solution which I could find that was standards compliant would involve as much commenting and explanation as is currently required, if not more. Considering this:
2. If some code is quirky, but effective, is it better to change it to make it less quirky, even if that makes it more unwieldy?
In my opinion this type of refactoring is often not considered in schedules and can only be done when there is extra time.
More often than not, the criterion for shipping code is if it works, not necessarily if it's the best possible code solution.
So in answer to your question, I try and refactor when I have time to do so. Priority One still remains to produce a functional piece of code.
Things to take into account:
Does it work as-is?
As Galwegian notes, this is the only criterion in many shops. However, IMO just as important is:
How skilled are the programmers who are going to maintain it? Have they ever encountered nonstandard code? Compare the cost of their time to learn it (including the cost of delayed dot releases) to the cost of your time to refactor it.
If you're maintaining it, then instead consider:
How much time will dealing with the nonstandard code cost you over the intended lifecycle of the codebase (e.g., the time between now and when the whole thing is rewritten)?
That's hard to guess, but consider that many codebases FAR outlive the usefulness envisioned by their original authors. (Y2K anyone?) I've gradually developed a sense of when a refactoring is worthwhile and when it's not, mostly by erring on the side of "not" too often and regretting it later.
Only change it if you need to be making changes anyway. But less quirky is always a good goal. Most of the time spent on a particular piece of software is in maintenance, so if you can do something to make that easier, you'll be reducing the overall time spent on that piece of code. Nonetheless, don't change something if it's working and doesn't need any modifications.
If you have time, now. If you don't have time and it can be avoided, later.

How do you handle unit/regression tests which are expected to fail during development?

During software development, there may be bugs in the codebase which are known issues. These bugs will cause the regression/unit tests to fail, if the tests have been written well.
There is constant debate in our teams about how failing tests should be managed:
Comment out failing test cases with a REVISIT or TODO comment.
Advantage: We will always know when a new defect has been introduced, and not one we are already aware of.
Disadvantage: May forget to REVISIT the commented-out test case, meaning that the defect could slip through the cracks.
Leave the test cases failing.
Advantage: Will not forget to fix the defects, as the script failures will constantly remind you that a defect is present.
Disadvantage: Difficult to detect when a new defect is introduced, due to failure noise.
I'd like to explore what the best practices are in this regard. Personally, I think a tri-state solution is the best for determining whether a script is passing. For example when you run a script, you could see the following:
Percentage passed: 75%
Percentage failed (expected): 20%
Percentage failed (unexpected): 5%
You would basically mark any test cases which you expect to fail (due to some defect) with some metadata. This ensures you still see the failure result at the end of the test, but immediately know if there is a new failure which you weren't expecting. This appears to take the best parts of the 2 proposals above.
Does anyone have any best practices for managing this?
I would leave your test cases in. In my experience, commenting out code with something like
// TODO: fix test case
is akin to doing:
// HAHA: you'll never revisit me
In all seriousness, as you get closer to shipping, the desire to revisit TODO's in code tends to fade, especially with things like unit tests because you are concentrating on fixing other parts of the code.
Leave the tests in, perhaps with your "tri-state" solution. However, I would strongly encourage fixing those cases ASAP. My problem with constant reminders is that after people see them, they tend to gloss over them and say "oh yeah, we get those errors all the time..."
Case in point -- in some of our code, we have introduced the idea of "skippable asserts" -- asserts which are there to let you know there is a problem, but allow our testers to move past them on into the rest of the code. We've come to find out that QA started saying things like "oh yeah, we get that assert all the time and we were told it was skippable" and bugs didn't get reported.
I guess what I'm suggesting is that there is another alternative, which is to fix the bugs that your test cases find immediately. There may be practical reasons not to do so, but getting in that habit now could be more beneficial in the long run.
Fix the bug right away.
If it's too complex to do right away, it's probably too large a unit for unit testing.
Lose the unit test, and put the defect in your bug database. That way it has visibility, can be prioritized, etc.
I generally work in Perl and Perl's Test::* modules allow you to insert TODO blocks:
TODO: {
    local $TODO = "This has not been implemented yet.";
    # Tests expected to fail go here
}
In the detailed output of the test run, the message in $TODO is appended to the pass/fail report for each test in the TODO block, so as to explain why it was expected to fail. For the summary of test results, all TODO tests are treated as having succeeded, but, if any actually return a successful result, the summary will also count those up and report the number of tests which unexpectedly succeeded.
My recommendation, then, would be to find a testing tool which has similar capabilities. (Or just use Perl for your testing, even if the code being tested is in another language...)
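As one example of a tool with similar capabilities, Python's standard `unittest` module has an `expectedFailure` decorator, and its result object tracks expected failures and unexpected successes separately. The sketch below shows how the "tri-state" summary from the question could be derived from that; the demo tests and the `summarise` helper are made up for illustration.

```python
import io
import unittest


def build_demo_case():
    """Build a small TestCase with one pass, one known defect, one stale marker."""

    class DemoTests(unittest.TestCase):
        def test_passes(self):
            self.assertEqual(1 + 1, 2)

        @unittest.expectedFailure
        def test_known_defect(self):
            # Stands in for a test that exposes a known, tracked bug.
            self.assertEqual(1 + 1, 3)

        @unittest.expectedFailure
        def test_marked_but_fixed(self):
            # The defect is gone, so this is reported as an unexpected success.
            self.assertEqual(2 * 2, 4)

    return DemoTests


def summarise(case):
    """Run a TestCase class and split the outcome into the tri-state buckets."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    result = runner.run(suite)
    failed_expected = len(result.expectedFailures)
    failed_unexpected = len(result.failures) + len(result.errors)
    passed_unexpected = len(result.unexpectedSuccesses)
    passed = (result.testsRun - failed_expected - failed_unexpected
              - passed_unexpected - len(result.skipped))
    return {
        "run": result.testsRun,
        "passed": passed,
        "failed_expected": failed_expected,
        "failed_unexpected": failed_unexpected,
        "passed_unexpected": passed_unexpected,
    }


if __name__ == "__main__":
    print(summarise(build_demo_case()))
```

The "unexpected success" bucket plays the same role as Perl's count of TODO tests that unexpectedly succeeded: it tells you when a marker is stale and can be removed.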
We did the following: Put a hierarchy on the tests.
Example: You have to test 3 things.
Test the login (login, retrieve the user name, get the "last login date" or something similar etc.)
Test the database retrieval (search for a given "schnitzelmitkartoffelsalat" - tag, search the latest tags)
Test web services (connect, get the version number, retrieve simple data, retrieve detailed data, change data)
Every testing point has subpoints, as stated in brackets. We split these hierarchically. Take the last example:
3. Connect to a web service
...
3.1. Get the version number
...
3.2. Data:
3.2.1. Get the version number
3.2.2. Retrieve simple data
3.2.3. Retrieve detailed data
3.2.4. Change data
If a point fails during development, report exactly one error message, e.g. "3.2.2 failed". The test runner then does not execute the tests for 3.2.3 and 3.2.4. This way you get one exact error message, "3.2.2 failed", and the programmer solves that problem first instead of dealing with 3.2.3 and 3.2.4, which could not work anyway.
That helped a lot to clarify the problem and to make clear what has to be done first.
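A minimal sketch of such a hierarchical runner in Python might look as follows. It is an assumption about how this could be implemented, not our actual tooling: `run_hierarchy` and the check functions are hypothetical, and the checks are stand-in lambdas rather than real web service calls.

```python
def run_hierarchy(nodes):
    """Run checks depth-first; after the first failure, skip everything else.

    nodes: list of (label, check, children), where check() returns True/False.
    Returns a list of (label, status) in execution order.
    """
    report = []
    failed = False

    def visit(node_list):
        nonlocal failed
        for label, check, children in node_list:
            if failed:
                report.append((label, "skipped"))
            else:
                try:
                    ok = bool(check())
                except Exception:
                    ok = False  # an exception counts as a failure
                report.append((label, "passed" if ok else "FAILED"))
                failed = not ok
            visit(children)

    visit(nodes)
    return report


# Stand-in checks mirroring the web service example; 3.2.2 is made to fail.
WEB_SERVICE_TESTS = [
    ("3. Connect to a web service", lambda: True, [
        ("3.1. Get the version number", lambda: True, []),
        ("3.2. Data", lambda: True, [
            ("3.2.1. Get the version number", lambda: True, []),
            ("3.2.2. Retrieve simple data", lambda: False, []),
            ("3.2.3. Retrieve detailed data", lambda: True, []),
            ("3.2.4. Change data", lambda: True, []),
        ]),
    ]),
]

if __name__ == "__main__":
    for label, status in run_hierarchy(WEB_SERVICE_TESTS):
        print(f"{status:8} {label}")
```

Only 3.2.2 is reported as failed here; 3.2.3 and 3.2.4 show up as skipped, so the noise from dependent failures never reaches the programmer.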
I tend to leave these in, with an Ignore attribute (this is using NUnit) - the test is mentioned in the test run output, so it's visible, hopefully meaning we won't forget it. Consider adding the issue/ticket ID in the "ignore" message. That way it will be resolved when the underlying problem is considered to be ripe - it'd be nice to fix failing tests right away, but sometimes small bugs have to wait until the time is right.
I've considered the Explicit attribute, which has the advantage of being able to be run without a recompile, but it doesn't take a "reason" argument, and in the version of NUnit we run, the test doesn't show up in the output as unrun.
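For comparison, the same "ignore with a reason" pattern exists in Python's `unittest` as the `skip` decorator, whose reason string is kept in the test result. The sketch below is only an analogue of the NUnit setup described above; the ticket ID `BUG-1234` and the test names are invented for the example.

```python
import io
import unittest


def build_case():
    """Build a TestCase with one test skipped for a (hypothetical) ticket."""

    class LoginTests(unittest.TestCase):
        @unittest.skip("BUG-1234: last-login date is off by one day")
        def test_last_login_date(self):
            self.fail("known defect, tracked in the ticket above")

        def test_login(self):
            self.assertTrue(True)

    return LoginTests


def skipped_reasons(case):
    """Run the TestCase class and return the reasons of all skipped tests."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    result = runner.run(suite)
    return [reason for _test, reason in result.skipped]


if __name__ == "__main__":
    for reason in skipped_reasons(build_case()):
        print("skipped:", reason)
```

Because the reason travels with the result, a build report can list every skipped test next to its ticket, which keeps the known failures visible.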
I think you need a TODO watcher that produces the "TODO" comments from the code base. The TODO is your test metadata. It's one line in front of the known failure message and very easy to correlate.
TODO's are good. Use them. Actively manage them by actually putting them into the backlog on a regular basis.
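A TODO watcher can be very small. The Python sketch below scans source text for `TODO` and `REVISIT` comments in `#` or `//` style and returns them with line numbers so they can be copied into a backlog; the pattern and the `find_todos` helper are assumptions for illustration, and a real watcher would walk the files of the code base.

```python
import re

# Matches "# TODO: ..." and "// REVISIT ..." style comments.
TODO_PATTERN = re.compile(r"(?:#|//)\s*(TODO|REVISIT)\b[:\s]*(.*)")


def find_todos(source: str, filename: str = "<memory>"):
    """Return (filename, line number, marker, message) for each TODO-style comment."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TODO_PATTERN.search(line)
        if match:
            hits.append((filename, lineno, match.group(1), match.group(2).strip()))
    return hits


if __name__ == "__main__":
    sample = "x = 1\n// TODO: fix test case\n# REVISIT flaky on CI\n"
    for filename, lineno, marker, message in find_todos(sample, "example.txt"):
        print(f"{filename}:{lineno}: {marker}: {message}")
```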
#5 on Joel's "12 Steps to Better Code" is fixing bugs before you write new code:
When you have a bug in your code that you see the first time you try to run it, you will be able to fix it in no time at all, because all the code is still fresh in your mind.
If you find a bug in some code that you wrote a few days ago, it will take you a while to hunt it down, but when you reread the code you wrote, you'll remember everything and you'll be able to fix the bug in a reasonable amount of time.
But if you find a bug in code that you wrote a few months ago, you'll probably have forgotten a lot of things about that code, and it's much harder to fix. By that time you may be fixing somebody else's code, and they may be in Aruba on vacation, in which case, fixing the bug is like science: you have to be slow, methodical, and meticulous, and you can't be sure how long it will take to discover the cure.
And if you find a bug in code that has already shipped, you're going to incur incredible expense getting it fixed.
But if you really want to ignore failing tests, use the [Ignore] attribute or its equivalent in whatever test framework you use. In MbUnit's HTML output, ignored tests are displayed in yellow, compared to the red of failing tests. This lets you easily notice a newly-failing test, but you won't lose track of the known-failing tests.
