Allowing nulls vs default values - ASP.NET

I'm working on an ASP.NET project that replaces many existing paper forms. One of the requirements is that the user can save the form in any state, i.e. they could create a new blank form and immediately save it with no data or with partial data. I'm validating for data type on every save but validation for required fields does not occur until the user marks the form as completed.
I'm not sure what the best approach is to handle this requirement in the database and domain model. As I see it, I have two options:
Allow nulls for any field that may not have data. This feels like the "correct" approach, but it requires that almost every database field allow nulls, and I have to code around a lot of nullable types. Also, when the form is finalized, none of the required fields are enforced in the database.
Populate my business objects with meaningful default values. In some cases there are meaningful default values for many (but not all) fields that I could use. This approach verges on "magic numbers", which makes me uncomfortable.
Which approach is best? Or is there a third way? I'm not willing to go to extremes, such as splitting the tables.
Edited to add: I wanted to expand on this a bit since I accepted a response. The primary reason that I'm not interested in splitting the tables is that once a project is submitted, the data on the forms is used to generate data for another system that is the system of record. At that point the original form data is unlikely to be revised or used for reporting.

I don't understand why you don't want to split the tables. I don't know what domain you're in, but in any domain I can imagine there are two classes of people:
people who have submitted the form
people who haven't
And as a business executive I don't care about the second. But the first I care deeply about, and they need to have all their data in correctly.
It also improves efficiency - most of your queries about aggregate data will be over the first table, not the second. The second table will only be used for index seeks.

If splitting the table(s) (are there more than one?) is not an option, I would consider creating a single table to store serialisations of the incomplete form objects, and only committing a form to the "real" tables when it is fully submitted by the user.

If there isn't a sensible default, and you don't want to split the data, then nulls are almost certainly your best option. Re the DB not being able to verify that fields are non-null when the form is completed... well, if you don't want to split the table there isn't much you can do (short of using a CHECK constraint, or an INSTEAD OF trigger to run validation). But the DB isn't the only place responsible for data validation; your app logic can do that too.
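For what it's worth, here is a minimal sketch of the CHECK constraint idea (table and column names are invented): required columns stay nullable for drafts, and a table-level constraint only enforces them once the row is flagged complete.

-- Hypothetical Forms table: required fields are nullable for drafts,
-- but a table-level CHECK enforces them once IsComplete is set.
CREATE TABLE dbo.Forms
(
    FormId        int IDENTITY(1,1) PRIMARY KEY,
    ApplicantName nvarchar(100) NULL,   -- required only on completion
    BirthDate     date NULL,            -- required only on completion
    IsComplete    bit NOT NULL DEFAULT 0,
    CONSTRAINT CK_Forms_Complete CHECK
    (
        IsComplete = 0
        OR (ApplicantName IS NOT NULL AND BirthDate IS NOT NULL)
    )
);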

You could use a temporary table with "allow nulls" on every column to store the form while it contains partial or no data, and copy/move the data to the final table when the user marks the form as completed. This way you do not depend on default values (which the user may forget to change), you can save in any state, and you still get the validation at the end.
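If it helps, a minimal sketch of the final copy/move step, assuming hypothetical dbo.FormDrafts (every column nullable) and dbo.Forms (required columns NOT NULL) tables:

DECLARE @FormId int = 42;   -- id of the draft being finalized

BEGIN TRANSACTION;

-- The insert fails here if any required column is still NULL,
-- because dbo.Forms declares those columns NOT NULL.
INSERT INTO dbo.Forms (FormId, ApplicantName, BirthDate)
SELECT FormId, ApplicantName, BirthDate
FROM dbo.FormDrafts
WHERE FormId = @FormId;

DELETE FROM dbo.FormDrafts
WHERE FormId = @FormId;

COMMIT TRANSACTION;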

This is a situation that cries out for split tables. I know you said you don't want to do that, and in a comment even said "this project doesn't warrant that level of effort", but it's really the best solution.
Set up preliminary table(s) in which everything except the key is nullable. When the user marks the form complete, and it passes validation, move it to the final table(s). Not only is this The Right Thing To Do, but it's probably less effort than coding around nullable values when working with finished forms.
If you need to see all forms, finished or not, make a Union view.
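Something like this, with invented names: drafts and finished forms live in separate tables, and a UNION ALL view covers the cases where you need both.

-- Hypothetical union view over split tables.
CREATE VIEW dbo.AllForms
AS
SELECT FormId, ApplicantName, BirthDate,
       CAST(1 AS bit) AS IsFinished
FROM dbo.FinishedForms          -- all required columns NOT NULL here
UNION ALL
SELECT FormId, ApplicantName, BirthDate,
       CAST(0 AS bit) AS IsFinished
FROM dbo.DraftForms;            -- everything except the key nullable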

I'd take the first option, but add a column to the database tables that flags when a form has been completed. Anything consuming the form data then only needs to check that the completed flag is set.
That's my suggestion for a way around this.

In Oracle, rows whose indexed columns are all NULL are not included in a B-tree index, so NULL values are effectively not searchable through the index.
If you need to issue a query like "select the first 10 forms with a certain field unfilled", that query will use a full table scan, which may not be efficient.
Oracle does not distinguish between NULL and an empty string, but other databases do. You'll probably want to make an empty string the DEFAULT for unfilled fields and use it in searches.
If you don't need to search on unfilled fields, then just make them NULL.

NULL generally means "Don't Know" (in a database) whereas an empty string could actually represent an empty string.
I would tend to use NULL as the "Don't Know" value in your case. When you print out data you'll just have to assume that any NULL value means an empty string.

CHECK CONSTRAINT + VIEW
If you don't have a status field, add one so you can tell that a form is finished.
Add a check constraint on that status field so a form can't be marked finished while any of the required columns are null.
When you write your queries on "finished" forms you can skip checking for nulls everywhere if you do one of these two things:
just add Status = 'F'inished in the WHERE clause
make a view of only the finished ones
When using the "finished" view you don't have to do all the validation checks or worry about unfinished forms showing up in the results.
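A compact sketch of the constraint-plus-view combination (the Status column and all names are assumptions):

-- A form can only be marked 'F'inished once required columns are filled.
ALTER TABLE dbo.Forms ADD CONSTRAINT CK_Forms_Finished CHECK
(
    Status <> 'F'
    OR (ApplicantName IS NOT NULL AND BirthDate IS NOT NULL)
);
GO
-- Queries against this view never see unfinished forms or NULLs
-- in required columns.
CREATE VIEW dbo.vwFinishedForms
AS
SELECT FormId, ApplicantName, BirthDate
FROM dbo.Forms
WHERE Status = 'F';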

I've had a similar situation, and while I haven't yet come up with a solution, I have been toying with the idea of just using simple XML serialization to store the temporary document data. If you generate simple classes that model the data in the objects (using nullable types where needed, perhaps), it would be easy to stuff data from the screen into those objects, serialize them to XML, and then store them in a temporary "staging" table. When your users are done working and want to submit or finalize the document, you perform all of your needed validation against the serialized data, eventually putting it into the "real" tables with the proper data structures and constraints.
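On the database side, the staging table for that idea can be as simple as this (names are invented; SQL Server's xml type is assumed):

-- Hypothetical staging table: the whole partially-filled document is
-- stored as one xml blob and only parsed out into the real,
-- constrained tables on final submit.
CREATE TABLE dbo.DocumentStaging
(
    DocumentId  int IDENTITY(1,1) PRIMARY KEY,
    UserId      int NOT NULL,
    DocumentXml xml NOT NULL,                    -- the serialized draft
    SavedAtUtc  datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);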

Related

How to setup data model for customizable application

I have an ASP.NET data entry application that is used by multiple clients. The application consists of multiple data entry modules that are common to all clients.
I now have multiple clients that want their own custom module added which will typically consist of a dozen or so data points. Some values will be text, others numeric, some will be dropdown selections, etc.
I'm in need of suggestions for handling the data model for this. I have two thoughts on how to handle it. The first would be to create a new table for each new module for each client. This is pretty clean, but I don't particularly like it. My other thought is to have one table with columns for each custom data point for each client. This table would end up with a lot of columns and a lot of NULL values. I don't really like either solution and suspect there's a better way to do this, so any feedback you have will be appreciated.
I'm using SQL Server 2008.
As always with these questions, "it depends".
The dreaded key-value table.
This approach relies on a table which lists the fields and their values as individual records.
CustomFields(clientId int, fieldName sysname, fieldValue varbinary)
Benefits:
Infinitely flexible
Easy to implement
Easy to index
Non-existing values take no space
Disadvantage:
Showing a list of all records with complete field list is a very dirty query
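To illustrate that disadvantage, here is roughly what flattening the key-value rows back into one record per client looks like, one MAX(CASE ...) per field (the field names are made up; the CAST is needed because fieldValue is varbinary):

SELECT clientId,
       MAX(CASE WHEN fieldName = 'FavoriteColor'
                THEN CAST(fieldValue AS nvarchar(100)) END) AS FavoriteColor,
       MAX(CASE WHEN fieldName = 'ShoeSize'
                THEN CAST(fieldValue AS nvarchar(100)) END) AS ShoeSize
FROM dbo.CustomFields
GROUP BY clientId;   -- and so on, one expression per custom field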
The Microsoft way
The Microsoft way of this kind of problem is "sparse columns" (introduced in SQL 2008)
Benefits:
Blessed by the people who design SQL Server
Records can be queried without having to apply fancy pivots
Fields without data don't take space on disk
Disadvantage:
Many technical restrictions
A new field requires DDL (ALTER TABLE)
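A sketch of what that looks like (column names invented); the ALTER TABLE at the end is the DDL a new field requires:

CREATE TABLE dbo.ClientModuleData
(
    clientId  int NOT NULL,
    Field1    nvarchar(100) SPARSE NULL,   -- takes no space when NULL
    Field2    decimal(10,2) SPARSE NULL,
    AllFields xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
GO
-- Adding a client-specific field later is a one-line change:
ALTER TABLE dbo.ClientModuleData ADD Field3 int SPARSE NULL;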
The xml tax
You can add an xml field to the table which will be used to store all the "extra" fields.
Benefits:
Unlimited flexibility
Can be indexed
Storage efficient (when it fits in a page)
With some XPath gymnastics the fields can be included in a flat recordset
Schema can be enforced with XML schema collections
Disadvantages:
Not clearly visible what's in the field
XQuery support in SQL Server has gaps, which sometimes makes getting your data out a real nightmare
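For illustration, the XPath gymnastics to flatten the extra fields look roughly like this (table, column, and element names are assumptions):

SELECT clientId,
       ExtraFields.value('(/fields/favoriteColor)[1]', 'nvarchar(100)') AS FavoriteColor,
       ExtraFields.value('(/fields/shoeSize)[1]', 'int') AS ShoeSize
FROM dbo.ClientData;   -- ExtraFields is the xml column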
There are maybe more solutions, but to me these are the main contenders. Which one to choose:
Key-value seems appropriate when the number of extra fields is limited (say, no more than 10-20 or so).
Sparse columns are more suitable for data with many properties that are filled out infrequently; they sound more appropriate when you can have many extra fields.
An xml column is very flexible, but a pain to query. It is appropriate for solutions that write rarely and query rarely, i.e. don't run aggregates etc. on the data stored in this field.
I'd suggest you go with the first option you described. I wouldn't overthink it. The second option you outlined would be a bad idea in my opinion.
If there are fields common to all the modules you're adding to the system you should consider keeping those in a single table then have other tables with the fields specific to a particular module related back to the primary key in the common table. This is basically table inheritance (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server) and will centralize the common module data and make it easier to query across modules.
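A minimal sketch of that table-inheritance layout, with invented names: common fields in one table, and a client-specific module table keyed back to it.

CREATE TABLE dbo.ModuleEntry
(
    EntryId   int IDENTITY(1,1) PRIMARY KEY,
    ClientId  int NOT NULL,
    EnteredAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
CREATE TABLE dbo.ClientXModuleEntry
(
    EntryId    int NOT NULL PRIMARY KEY
               REFERENCES dbo.ModuleEntry (EntryId),  -- one-to-one with the common row
    CustomText nvarchar(200) NULL,
    CustomNum  decimal(10,2) NULL
);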

How do I check which values in my Form have changed before saving?

The situation is like this. We have a form with a large number of fields (over 30 spread over several tabs) and what I want to do is find which values have changed before saving with minimum impact on performance. What happens right now is, for editing, single records are queried from several databases. The values are passed over to the client side as value objects. At the moment they are not bound to any fields in the form.
My initial idea was to have a boolean flag for each field to set true or false each time any of the fields were changed. At the time of saving the program would run through the list of flags to see which fields have changed. This seems more than a bit clunky to me so I was thinking maybe it could be done on the server side. But then I don't want to go through each field one by one checking to see which ones don't match the db records.
Any ideas on what to do here?
This is a very common problem for a lot of Flex applications. Because it happens so often there are a number of commercial implementations for Data Management. Queries are stored into entities and those entities are bound to a form on the client side. Whenever a field is updated, it will automatically perform the steps to persist the changes to the db and do rollbacks when requested.
Adobe LCDS Data Management - If you are dealing with a Java environment
WebOrb - If you are dealing with a .net, php, java, rails environment
Of course you can reinvent the wheel and roll out your own: set up PropertyChangeEvent listeners on each field, listen for the changes as they are dispatched, and write handlers for each one.
This sounds exactly like what we're doing with one of the projects I'm working on for a client.
What we do is dupe the value objects once they come back to the UI. Then when calling the update service, I send both the original object and the new object. In the service, I do a field-by-field compare on the server to determine which values should be sent to the database.
If you need to update every field/property conditionally based on whether or not it changed; then I don't see a way to avoid the check with every field/property. Even if you implement your Boolean idea and swap the flag in the UI whenever anything changes; you're still going to have to check those Boolean values when creating your query to determine what should be updated or not.
In my situation, three different databases are queried to create the value object that gets sent back to the UI. Field updates are saved in one of those databases and given first order of preference when doing the select. So we have an explicit field-by-field comparison happening inside a stored procedure.
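The T-SQL flavor of such a comparison might look like this simplified sketch (all names invented): the UPDATE is skipped entirely when nothing differs from the incoming values.

CREATE PROCEDURE dbo.UpdateRecordIfChanged
    @Id     int,
    @FieldA nvarchar(100),
    @FieldB nvarchar(100)
AS
BEGIN
    UPDATE dbo.Records
    SET FieldA = @FieldA,
        FieldB = @FieldB
    WHERE Id = @Id
      -- NULL and '' compare as equal here; adjust if that matters.
      AND (ISNULL(FieldA, N'') <> ISNULL(@FieldA, N'')
        OR ISNULL(FieldB, N'') <> ISNULL(@FieldB, N''));
END;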
If you don't need field-by-field comparisons, but rather record-by-record comparisons, then the Boolean approach to let you know the record/value object has changed is going to save you some time and coding.

Primary keys on webforms (load initially or on save)?

This is just a general question irrespective of database architecture.
I am maintaining an ASP.NET web application. The structure is such that, say, on an 'Add a new employee' webform:
The primary key (or the record id the form will be saved with) is loaded in the form load event and displayed as a label.
So when the form loads, the record id it will be saved with is already shown to the user.
Positives:
The end user already knows the id/serial of the form (even before he saves it).
So on form save, when he is directed to the gridview screen (with all entries), he can find the record easily (although the most recent one is at the top anyway).
Negatives:
If he does not save the form, say he just cancels after loading the data entry form, the id/key initially fetched is wasted (in my case it is a sequence value fetched from the database on form load).
What do you guys do in these scenarios? Which approach would you recommend for web applications, and how would you facilitate the user with a different approach? Is our current approach recommended? (To me, it wastes ids/sequence values from the database.)
I'd always recommend not presenting the identity field value for the record being created until the record has been created. The "create a temporary placeholder record first to obtain the identity field value ahead of time" approach can, as you mention, result in wasted IDs, unless you have a process in place to reclaim them.
You can always pop-up a message box when the user presses save that tells them the identity field value of the newly created record.
In this situation you could use a GUID created by the application itself. The database would then only have the PK set to be a unique identifier (GUID) that must not be null. In this situation you are not wasting any unique keys, as each call to get a new GUID should by definition produce a (mathematically) unique identifier. It is worth noting that if you use this method, it is best to make sure your PK is not set up to be clustered; the resulting index reorganisation upon insert could quickly result in an application that suffers performance hits.
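A sketch of that setup (names invented): the application can generate the key with Guid.NewGuid() before anything is saved, nothing is wasted on cancel, and the primary key is declared NONCLUSTERED to avoid the page splits random GUIDs cause in a clustered index.

CREATE TABLE dbo.Employee
(
    EmployeeId uniqueidentifier NOT NULL
        CONSTRAINT DF_Employee_Id DEFAULT NEWID()   -- fallback if the app doesn't supply one
        CONSTRAINT PK_Employee PRIMARY KEY NONCLUSTERED,
    FullName   nvarchar(100) NOT NULL
);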
For one: I wouldn't care so much about wasted id values. When you are in danger of running out of int32 values (and when did that last happen to you?), use int64. The user experience is far more important than wasting a few id values.
Having said that, I would not want the primary key to be anything the user would need to type in. If you have a primary key that users need to type in, chances are it is (or will be requested to be) more than just an int32/64 value and carries (or will carry) meaning in its composition and/or formatting. Primary keys should not have that. (Tons of reasons; google for "meaningless primary keys" or similar terms.)
If you need a meaningful key, make it a secondary index that is in no way related to the primary key. If part of that is a sequential number taken from some counter value in your database, decide whether it is functionally a problem for gaps to appear in the sequence (the tax people generally don't want gaps in invoice numbers). If functionally it is no problem, then certainly don't start worrying about it technically. If functionally it is a problem, then yes, you have no option but to wait for the save in order to show it to the user. But please, when you do, don't do it in a popup. Popups are horribly intrusive, as they have to be dismissed. Just put up an informative message on the screen the user is sent to after (s)he saves the new employee, much like Gmail tells you about actions you have just performed above the list of messages.

Store in DB or not to store?

There are a few string lists in my web application and I don't know whether to store them in the DB or just in a class.
For example, I track the 7 major browsers with which users enter the site. I want to save these stats, so I need a browser column in the UserLogin table. I don't want to waste space and resources, so I can't save the full browser name in each login row. So I either need to save a browserID field and hook it up to a Browsers table, which stores the names following DB normalization rules, or have some sort of DataHolder abstract class holding a list of browsers from which I can retrieve a browser name by its ID...
The question is: what should I do? These few data lists contain no more than 200 items each, so I think it makes sense to have them as an abstract class, but then again I don't know whether MS SQL will handle the multiple joins well. Think of the case where I have a user with country, IP, language, browser, and a few more stats...
Thanks
I have been on both sides of the fence about this.
My rule of thumb is:
If one of these lists changes, will I have to make changes to the code, too?
(e.g..: in your case, if someone writes "yet another browser" tomorrow, will I need to write code that caters for it?)
If the answer is "most probably yes" or "definitely" you can leave it inside code.
In all other cases (even just a "maybe", 50%-50%) you better put it in the DB, or at the very least in a property file.
And please consider this, too: if you expect to have to provide statistics based on this data (e.g.: "how many users use Explorer") you better put it in the DB anyway: it becomes part of your domain data and therefore it must be there.
About the "domain data" part.
The information stored in your DB is the "domain data" of your application. It is, in a sense, a (hopefully consistent) representation of what your application is about - it represents the "known universe" for your application.
If you agree with this definition, then you must also accept that it does not make sense to have 99.9% of your "reality" in the DB and 0.1% outside of it; if nothing else, it makes some operations cumbersome (if you only store the smallint, you can't create meaningful reports without either post-processing them using the class to decode "1" into "Firefox", or providing some other key for the end user).
It also makes it impossible for you to leverage some inherent DB techniques, like foreign keys (if you just use a smallint without correlating it to any other table, who guarantees that "10" is an acceptable value in your domain?).
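As a sketch of the lookup-table route (the Browsers and UserLogin names follow the question; the columns are assumptions), the foreign key guarantees every login row points at a known browser, and the "how many users use Explorer" report becomes a plain join:

CREATE TABLE dbo.Browsers
(
    BrowserId   smallint NOT NULL PRIMARY KEY,
    BrowserName nvarchar(50) NOT NULL
);
GO
CREATE TABLE dbo.UserLogin
(
    LoginId   int IDENTITY(1,1) PRIMARY KEY,
    UserId    int NOT NULL,
    BrowserId smallint NOT NULL
              REFERENCES dbo.Browsers (BrowserId)   -- only known values allowed
);
GO
-- Logins per browser:
SELECT b.BrowserName, COUNT(*) AS Logins
FROM dbo.UserLogin AS l
JOIN dbo.Browsers AS b ON b.BrowserId = l.BrowserId
GROUP BY b.BrowserName;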
MS SQL handles multiple joins really well; it's up to you where you want to store the data. You could also consider XML as another option. I would consider the database or XML; it is easier to change the values there than when they are in code (you have to recompile/deploy to change them in production).
HTH.

Bulk Collection Manipulation through a REST (RESTful) API

I'd like some advice on designing a REST API which will allow clients to add/remove large numbers of objects to a collection efficiently.
Via the API, clients need to be able to add items to the collection and remove items from it, as well as manipulating existing items. In many cases the client will want to make bulk updates to the collection, e.g. adding 1000 items and deleting 500 different items. It feels like the client should be able to do this in a single transaction with the server, rather than requiring 1000 separate POST requests and 500 DELETEs.
Does anyone have any info on the best practices or conventions for achieving this?
My current thinking is that one should be able to PUT an object representing the change to the collection URI, but this seems at odds with the HTTP 1.1 RFC, which suggests that the data sent in a PUT request should be interpreted independently of the data already present at the URI. This implies that the client would have to send a complete description of the new state of the collection in one go, which may well be very much larger than the change, or even more than the client knows when making the request.
Obviously, I'd be happy to deviate from the RFC if necessary but would prefer to do this in a conventional way if such a convention exists.
You might want to think of the change task as a resource in itself. So you're really PUT-ing a single object, which is a Bulk Data Update object. Maybe it's got a name, owner, and big blob of CSV, XML, etc. that needs to be parsed and executed. In the case of CSV you might want to also identify what type of objects are represented in the CSV data.
List jobs, add a job, view the status of a job, update a job (probably in order to start/stop it), delete a job (stopping it if it's running) etc. Those operations map easily onto a REST API design.
Once you have this in place, you can easily add different data types that your bulk data updater can handle, maybe even mixed together in the same task. There's no need to have this same API duplicated all over your app for each type of thing you want to import, in other words.
This also lends itself very easily to a background-task implementation. In that case you probably want to add fields to the individual task objects that allow the API client to specify how they want to be notified (a URL they want you to GET when it's done, or send them an e-mail, etc.).
Yes, PUT creates/overwrites, but does not partially update.
If you need partial update semantics, use PATCH. See http://greenbytes.de/tech/webdav/draft-dusseault-http-patch-14.html.
You should use AtomPub. It is specifically designed for managing collections via HTTP. There might even be an implementation for your language of choice.
For the POSTs, at least, it seems like you should be able to POST to a list URL and have the body of the request contain a list of new resources instead of a single new resource.
As far as I understand it, REST means REpresentational State Transfer, so you should transfer the state from client to server.
If that means too much data going back and forth, perhaps you need to change your representation. A collectionChange structure would work, with a series of deletions (by id) and additions (with embedded full XML representations), POSTed to a handling interface URL. The interface implementation can choose its own method for deletions and additions server-side.
The purest version would probably be to define the items by URL, and the collection contain a series of URLs. The new collection can be PUT after changes by the client, followed by a series of PUTs of the items being added, and perhaps a series of deletions if you want to actually remove the items from the server rather than just remove them from that list.
You could introduce meta-representation of existing collection elements that don't need their entire state transfered, so in some abstract code your update could look like this:
{existing elements 1-100}
{new element foo with values "bar", "baz"}
{existing element 105}
{new element foobar with values "bar", "foo"}
{existing elements 110-200}
Adding (and modifying) elements is done by defining their values, deleting elements is done by not mentioning them in the new collection, and reordering elements is done by specifying the new order (if order is stored at all).
This way you can easily represent the entire new collection without having to re-transmit the entire content. Using an If-Unmodified-Since header makes sure that your idea of the content matches the server's (so that you don't accidentally remove elements that you simply didn't know about when the request was submitted).
The best way is:
1. Pass only an array of the ids of the deletable objects from the front-end application to the Web API.
2. Then you have two options:
2.1 Web API way: find all the collections/entities using the id array and delete them in the API, but you need to take care of dependent entities, such as foreign-key-related table data, too.
2.2 Database way: pass the ids to your database side, find all the records in the foreign key tables and primary key tables, and delete them in that order, i.e. F-key table records first, then P-key table records.
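For the database way, a sketch using a table-valued parameter (available in SQL Server 2008+; all table and type names here are assumptions): child rows go first, then the parent rows, inside one transaction.

CREATE TYPE dbo.IdList AS TABLE (Id int NOT NULL PRIMARY KEY);
GO
CREATE PROCEDURE dbo.DeleteItems
    @Ids dbo.IdList READONLY
AS
BEGIN
    BEGIN TRANSACTION;

    -- F-key table records first...
    DELETE c
    FROM dbo.ChildTable AS c
    JOIN @Ids AS i ON i.Id = c.ParentId;

    -- ...then P-key table records.
    DELETE p
    FROM dbo.ParentTable AS p
    JOIN @Ids AS i ON i.Id = p.Id;

    COMMIT TRANSACTION;
END;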
