We have a suite of converters that take complex data and transform it. Mostly the input is EDI and the output XML, or vice-versa, although there are other formats.
There are many inter-dependencies in the data. What methods or software are available that can generate complex input data like this?
Right now we use two methods: (1) a suite of sample files that we've built over the years, mostly from filed bugs and samples in documentation, and (2) generating pseudo-random test data. But the former only covers a fraction of the cases, and the latter has lots of compromises and only tests a subset of the fields.
Before we go further down the path of implementing (reinventing?) a complex table-driven data generator, what options have you found successful?
Well, the answer is in your question. Unless you implement a complex table-driven data generator, you're doing things right with (1) and (2).
(1) covers the rule of "1 bug verified, 1 new test case".
And if the structure of the pseudo-random test data in (2) corresponds at all to real-life situations, it is fine.
(2) can always be improved, and it will improve mainly over time, as you think of new edge cases. The problem with random test data is that it can only be so random before computing the expected output from the random data becomes so difficult that you basically have to rewrite the tested algorithm inside the test case.
So (2) will always match only a fraction of the cases. If one day it matches all the cases, it will in fact be a new version of your algorithm.
I'd advise against using random data, as it can make it difficult if not impossible to reproduce a reported error (I know you said 'pseudo-random', I'm just not sure what you mean by that exactly).
Operating over entire files of data would likely be considered functional or integration testing. I would suggest taking your set of files with known bugs and translating these into unit tests, or at least doing so for any future bugs you come across. Then you can also extend these unit tests to cover the other erroneous conditions for which you don't have any sample data. This will likely be easier than coming up with a whole new data file every time you think of a condition/rule violation you want to check for.
Make sure your parsing of the data format is encapsulated from the interpretation of the data in the format. This will make unit testing as described above much easier.
If you definitely need data to drive your testing, you may want to consider getting a machine-readable description of the file format and writing a test data generator that analyzes the format and generates valid/invalid files based upon it. This also allows your test data to evolve as the file formats do.
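As a rough sketch of that idea (in Python; the FORMAT_SPEC fields and the generate_record helper are made up for illustration, not part of any existing tool): a machine-readable description of the format drives a seeded pseudo-random generator, so every run is reproducible, which also addresses the earlier concern about re-creating reported failures.

```python
import random

# Hypothetical machine-readable description of a (simplified) format:
# each field has a type, a length limit and whether it is required.
FORMAT_SPEC = {
    "ISA_SENDER_ID": {"type": "alnum", "max_len": 15, "required": True},
    "INVOICE_TOTAL": {"type": "decimal", "max_len": 10, "required": True},
    "NOTE":          {"type": "alnum", "max_len": 30, "required": False},
}

def generate_record(spec, rng, valid=True):
    """Build one record from the spec; if valid=False, break one rule on purpose."""
    record = {}
    for name, rules in spec.items():
        if rules["type"] == "decimal":
            value = f"{rng.uniform(0, 10 ** (rules['max_len'] - 3)):.2f}"
        else:
            length = rng.randint(1, rules["max_len"])
            value = "".join(rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
                            for _ in range(length))
        record[name] = value
    if not valid:
        # Violate exactly one constraint so the failure is easy to diagnose.
        name, rules = rng.choice(list(spec.items()))
        record[name] = "X" * (rules["max_len"] + 1)  # overlong field
    return record

# Seeding the generator keeps the "random" data reproducible from run to run.
rng = random.Random(42)
print(generate_record(FORMAT_SPEC, rng, valid=True))
print(generate_record(FORMAT_SPEC, rng, valid=False))
```

A table-driven generator is essentially this pattern with the spec moved out into configuration tables instead of code.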
I'm trying to train a learning model on real estate sale data that includes dates. I've looked into 1-of-K binary encoding, per the advice in this thread; however, my initial assessment is that it may have the weakness of not training well on data that is not predictably cyclic. While real estate value crashes are recurring, I'm concerned (maybe wrongly so, you tell me) that 1-of-K encoding will inadvertently overfit on potentially irrelevant features if the recurrence is not explainable by a combination of year-month-day.
That said, I think there is potentially value in that method. I think there is also merit to the argument for converting time series data to ordinal, as recommended in the same thread. Which brings me to the real question: is it bad practice to duplicate the same initial feature (the date data) in two different forms in the same training data? I'm concerned that, if I use methods that rely on the assumption of feature independence, I may be violating that assumption by doing so.
If so, what are suggestions for how best to extract the maximum information from this date data?
Is it bad practice?
No, sometimes transformations make a feature more accessible to your algorithm. Following this line of thought, converting features is completely fine.
Does it skew your algorithm?
Concerning runtime, it might be better not to have to transform your data every time. Depending on your algorithm and on the type of transformation, you might also get worse interpretability (if that is important to you).
Also, if you want to restrict the number or set of features your algorithm should use, you might add information redundancy by adding transformed features.
So what should you do?
Transform your data/features as much as you want and as often as you want.
That doesn't hurt anyone; rather, it helps by enriching the feature space. But after you have done so, run a PCA or something similar to find redundancies among your features and reduce the feature space again.
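To make that concrete, here is a minimal sketch in Python with pandas and scikit-learn, purely as an illustration of the idea (the toy data and column names are invented): the sale date is encoded both as one-hot month indicators and as an ordinal days-since-start value, and PCA then strips out the redundancy that the duplicated information introduces.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy real-estate-style data; column names are made up for illustration.
df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2014-01-15", "2014-06-30", "2015-03-10", "2015-11-22"]),
    "sqft": [1200, 1500, 900, 2000],
})

# Encoding 1: one-hot (1-of-K) month indicators, which can capture cyclic effects.
month_dummies = pd.get_dummies(df["sale_date"].dt.month, prefix="month")

# Encoding 2: ordinal time (days since the earliest sale), which captures trend.
days_ordinal = (df["sale_date"] - df["sale_date"].min()).dt.days

features = pd.concat(
    [month_dummies, days_ordinal.rename("days_since_start"), df["sqft"]],
    axis=1,
).astype(float)

# PCA on the standardized feature matrix removes linear redundancy between
# the two date encodings while keeping most of the variance.
standardized = (features - features.mean()) / features.std(ddof=0)
pca = PCA(n_components=0.95)  # keep components explaining 95% of the variance
reduced = pca.fit_transform(standardized.fillna(0.0))
print(reduced.shape, pca.explained_variance_ratio_)
```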
Note:
I tried to keep this general; obviously it is highly dependent on the kind of algorithm you're using.
First, I want to say that I'm a beginner in OCaml. I made a simple app that takes data from a JSON file, does some calculations or replaces some values with arguments from the command line, then writes another JSON file with the new data; it also replaces those values in an HTML template and writes that out too. You can see my project here: https://github.com/ralcr/invoice-cmd/blob/master/invoice.ml
The question is: how do I deal with that number of variables? In the languages I know I would probably repeat myself twice, but here it's more like six times. Thanks for any advice.
First of all, I would like to note that Stack Exchange Code Review is probably a better place to post such questions, as this is more about design than about the language.
I have two suggestions for how to improve your code. The first is to use string maps (or hash tables) to store your variables. The other, much more radical, is to rewrite the code in a more functional way.
Use maps
In your code, you're doing a lot of pouring the same water from one bucket into another without doing actual work. The first thing that comes to mind is whether that is necessary at all. When you parse JSON definitions into a set of variables, you do not actually reduce complexity or enforce any particular invariants. Basically, you're confusing data with code: these variables are actually data that you're processing, not part of the logic of your application. So the first step would be to use a string map and store them in it. Then you can easily process a big set of variables with fold and map.
Use functions
Another approach is not to store the variables at all and to express everything as stateless transformations on JSON data. Your application looks like a JSON processor, so I don't really see any reason why you should first read everything and store it in memory, and only later produce the result. It is more natural to process data on the fly and express your logic as a set of small transformations. Try to split everything into small functions, so that each individual transformation can be easily understood, then compose your transformation from smaller parts. This is a functional style, where the flow of data is explicit.
I'm giving TDD a serious try today and have found it really helpful, in line with all the praise it receives.
In my quest for exercises on which to learn both Python and TDD, I have started to code SPOJ exercises using the TDD technique, and I have arrived at a question:
Given that SPOJ's exercises are mostly math applied to programming, how does one test a math procedure in the TDD fashion? With sample known-to-be-correct data? Against a known implementation?
I have found that using the sample data given in the problem itself is valuable, but it feels like overkill for something you can check so quickly in the console, not to mention the overhead of designing your algorithms in a testable fashion (proxying the stdout and stdin objects is far too much work for a really small reward). And while it is good because it forces you to think about your solutions in testable terms, I think I might be trying way too hard on this.
All guidance is welcome
Test all edge cases. Your program is most likely to fail when the input is special for some reason: negative or zero values, very large values, inputs in reverse order, empty inputs. You might also want some destructive tests to see how large the inputs can be before things break or grind to a halt.
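As a sketch of what such edge-case tests might look like with pytest, using a made-up running_maximum function standing in for a typical SPOJ-style routine:

```python
import pytest

def running_maximum(values):
    """Return the running maximum of a sequence (stand-in for a SPOJ-style routine)."""
    result, current = [], None
    for v in values:
        current = v if current is None else max(current, v)
        result.append(current)
    return result

@pytest.mark.parametrize("values, expected", [
    ([], []),                          # empty input
    ([0], [0]),                        # single zero value
    ([-5, -1, -7], [-5, -1, -1]),      # all negative
    ([3, 2, 1], [3, 3, 3]),            # reverse-sorted input
    ([10**9, 1], [10**9, 10**9]),      # very large value
])
def test_running_maximum_edge_cases(values, expected):
    assert running_maximum(values) == expected

def test_running_maximum_large_input_does_not_choke():
    # A crude "destructive" test: a million elements should still finish.
    n = 1_000_000
    assert running_maximum(range(n))[-1] == n - 1
```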
Sphere Online Judge may not be the best fit for TDD. For one, the input data might be better behaved than what a real person might enter. Second, there is a code size limit on some problems, and an extensive test suite might put you over that limit.
You might want to take a look at Uncle Bob's "Transformation Priority Premise." It offers some guidance on how to pick a sequence of tests to test drive an algorithm.
Use sample inputs for which you know the results (outputs). Use equivalence partitioning to identify a suitable set of test cases. With maths code you might find that you cannot implement as incrementally as for other code: you might need several test cases for each incremental improvement. By that I mean that non-maths code can typically be thought of as having a set of "features" that you can implement one at a time, but maths code is not much like that.
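As a small illustration of equivalence partitioning (a sketch only, with a hypothetical classify_triangle function): split the input domain into classes and test one representative per class, rather than many near-identical cases.

```python
import pytest

def classify_triangle(a, b, c):
    """Hypothetical example function: classify a triangle by its side lengths."""
    if min(a, b, c) <= 0 or a + b <= c or a + c <= b or b + c <= a:
        return "invalid"
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

# One representative per equivalence class of the input domain.
@pytest.mark.parametrize("sides, expected", [
    ((3, 3, 3), "equilateral"),   # class: all sides equal
    ((3, 3, 5), "isosceles"),     # class: exactly two sides equal
    ((3, 4, 5), "scalene"),       # class: all sides different
    ((1, 2, 3), "invalid"),       # class: violates the triangle inequality
    ((0, 4, 5), "invalid"),       # class: non-positive side length
])
def test_classify_triangle_partitions(sides, expected):
    assert classify_triangle(*sides) == expected
```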
I have a WCF service that interacts with a database, the file system and a few external web services, then creates the result, XML-serializes it and finally returns it.
I'd like to write tests for this solution and I'm thinking about how (it all uses dependency injection and design by contract).
There are 3 main approaches I can take.
1) I can pick the smallest units of code/methods and write tests for them: pick one class and isolate it from its dependencies (other classes, etc.). Although this guarantees quality, writing the tests takes a lot of time, and that's slow.
2) I can make only the interactions with external systems mockable and write some tests that cover the main scenarios from when the request is made until the response is serialized and returned. This tests all the interactions between my classes but mocks all external resource access.
3) I can set up a test environment where the interactions with external web services, the file system and the database actually happen, then write the tests end to end. This requires environment setup and depends on all the other systems being up and running.
About #1: I see no point in investing the time/money/energy in writing tests for every single method or piece of code that I have; it feels like a waste of time.
About #3: since it depends on external resources/systems, it's hard to set up and keep running.
#2 sounds like the best option to me, since it tests what should be tested: my system and all its classes, while mocking all the external systems.
So basically, my conclusion after some years of experience with unit tests is that writing unit tests is a waste to be avoided, and that isolated system tests instead give the best return on investment.
Even if I were going to write the tests first (TDD) and then the production code, I still think #2 would be best.
What's your view on this? Would you write small unit tests for your application? Would you consider it a good practice and the best use of time/budget/energy?
If you want to talk about quality, you should have all 3:
Unit tests to ensure your code does what you think it does, to expose edge cases and to guard against regressions. You (the developer) should write such tests.
Integration tests to verify the correctness of the entire process, whether components talk to each other correctly and so on. Again, you as a developer write such tests.
System-wide tests in a production-like environment (with some limitations, naturally - you might not have access to the client database, but you should have an exact copy of it on your local machines). Those tests are usually written by dedicated testers (often in programming languages different from the application code), but of course they can be written by you.
The second and third types of tests (integration and system) take far too much effort to exercise the edge cases of smaller components; that is what you usually want unit tests for. You need integration tests because something might fail when hooking up individually tested, verified and correct modules. And system tests are, of course, something you run daily during development, or have assigned people (manual testers) do for you.
Going for only a selected type of test from the list might work up to a point, but it is far from a complete solution or quality software.
All three are important and target different test types: a matrix of unit/integration/system categories, with positive and negative testing in each category.
For code coverage, unit testing will yield the highest percentage, followed by integration, then system testing.
You also need to consider whether the purpose of a test is validation (does it meet the final user/customer requirements, i.e. value) or verification (is it written to specification, i.e. correctness).
In summary, the answer is 'it depends'. I would recommend following the SEI CMMI model for verification and validation (i.e. testing), which begins with the goals (value) of each activity and then subjects that activity to measures that ultimately allow the whole process to undergo continuous improvement. In this way you have separated the what and why from the how, and you will be able to answer time- and value-type questions for your given environment (which could be anything from a life-support system to a tweet-of-the-day app for your favourite aunt).
Summary: #2 (integration testing) seems most logical, but you shouldn't hesitate to use a variety of tests to achieve the best coverage for pieces of your codebase that need it most. Shooting for having tests for "everything" is not a worthy goal.
Long version
There is a school of thought out there where devs are convinced that adopting unit/integration/system tests means striving to have every single chunk of code tested. It's either no test coverage at all, or committing to testing "everything". This binary thinking always makes adopting any kind of testing strategy seem very expensive.
The truth is, forcing every single line of code/function/module to be tested is about as sound as writing all your code to be as fast as possible. It takes too much time and effort, and most of it nets very little return. Another truth is that you can never achieve true 100% coverage in a non-trivial project.
Testing is not a goal unto itself. It's a means to achieve other things: final product quality, maintainability, interoperability, and so on, all while expending the least amount of effort possible.
With that in mind, step back and evaluate your particular circumstances. Why do you want to "write tests for this solution"? Are you unhappy with the overall quality of the project today? Have you experienced high regression rates? Are you perhaps unsure about how some module works (and more importantly, what bugs it might have)? Regardless of what your exact goal is, you should be able to select pieces that pose particular challenges and focus your attention on them. Depending on what those pieces are, an appropriate testing approach can be selected.
If you have a particularly tricky function or a class, consider unit testing them. If you're faced with a complicated architecture with multiple, hard to understand interactions, consider writing integration tests to establish a clean baseline for your trickiest scenarios and to better understand where the problems are coming from (you'll probably flush out some bugs along the way). System testing can help if your concerns are not addressed in more localized tests.
Based on the information you provided for your particular scenario, external-facing unit testing/integration testing (#2) looks most promising. It seems like you have a lot of external dependencies, so I'd guess this is where most of the complexity hides. Comprehensive unit testing (#1) is a superset of #2, with all the extra internal stuff carrying questionable value. #3 (full system testing) will probably not allow you to test external edge cases/error conditions as well as you would like.
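To sketch what option #2 looks like in practice - shown in Python with unittest.mock purely for brevity, with made-up InvoiceService, database and payment-gateway names; the same boundary-mocking pattern applies with any .NET mocking library:

```python
from unittest import mock
import unittest

# Made-up service wiring, standing in for the real service's dependency injection.
class InvoiceService:
    def __init__(self, database, payment_gateway):
        self.database = database
        self.payment_gateway = payment_gateway

    def build_invoice(self, order_id):
        order = self.database.load_order(order_id)
        fee = self.payment_gateway.lookup_fee(order["amount"])
        # Internal logic under test: everything between the external boundaries.
        return {"order_id": order_id, "total": order["amount"] + fee}

class InvoiceServiceTest(unittest.TestCase):
    def test_main_scenario_with_external_systems_mocked(self):
        # Only the external boundaries (database, web service) are mocked;
        # the collaboration between the internal classes runs for real.
        database = mock.Mock()
        database.load_order.return_value = {"amount": 100.0}
        gateway = mock.Mock()
        gateway.lookup_fee.return_value = 2.5

        service = InvoiceService(database, gateway)
        invoice = service.build_invoice(order_id=42)

        self.assertEqual(invoice["total"], 102.5)
        database.load_order.assert_called_once_with(42)

if __name__ == "__main__":
    unittest.main()
```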
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
1. A markup language should be used (YAML?)
2. All sub-directories should be scanned
3. To facilitate (2), a standard extension for a dataset descriptor should be used
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources you'll find, such as surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results is a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
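Putting points 2 through 6 together, here is a minimal sketch, in Python for illustration (the equivalent is only a few lines of R with list.files and a JSON reader); the dataset.json layout, its fields and the sentence template are assumptions, not an existing tool:

```python
import json
from pathlib import Path

# Assumed layout: original_data/<dataset>/dataset.json describing that source.
# Assumed descriptor fields: name, url, date, and a "variables" mapping of
# raw column names to the names they take on after cleaning.
EXAMPLE_DESCRIPTOR = {
    "name": "county_sales",
    "url": "http://example.org/county_sales.csv",
    "date": "2011-03-02",
    "variables": {"SALE_AMT": "sale_price", "PARCEL_ID": "parcel_id"},
}

def collect_descriptors(root):
    """Scan all sub-directories of `root` and load every dataset.json found."""
    return [json.loads(p.read_text()) for p in Path(root).rglob("dataset.json")]

def methods_sentences(descriptors):
    """Render one templated sentence per variable, for pasting into a methods section."""
    template = "We pulled the {var} variable (stored as {clean}) from the {dset} dataset on {date}."
    for d in descriptors:
        # Minimal documentation is just a name; fall back gracefully if fields are missing.
        for raw, clean in d.get("variables", {}).items():
            yield template.format(var=raw, clean=clean,
                                  dset=d.get("name", "unknown"),
                                  date=d.get("date", "an unrecorded date"))

if __name__ == "__main__":
    for sentence in methods_sentences(collect_descriptors("original_data")):
        print(sentence)
```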
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These names are stored in a descriptor file for each binary file, and that process encourages creating descriptors that match variable names (and matrices) to descriptions. If you store your data in a database or in other external files supporting random access and multiple R/W access (e.g. memory-mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.