How to implement gapless, user-friendly IDs in NHibernate? - asp.net

I'm designing an application where my Order objects need to have a sequential and user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of the major disadvantages of it:
(From: NHibernate POID Generators revealed)
Post insert generators, as the name
suggest, assigns the id’s after the
entity is stored in the database. A
select statement is executed against
database. They have many drawbacks,
and in my opinion they must be used
only on brownfield projects. Those
generators are what WE DO NOT SUGGEST
as NH Team.
> Some of the drawbacks are the
following:
Unit Of Work is broken with the use of
those strategies. It doesn’t matter if
you’re using FlushMode.Commit, each
Save results in an insert statement
against DB. As a best practice, we
should defer insertions to the commit,
but using a post insert generator
makes it commit on save (which is what
UoW doesn’t do).
Those strategies
nullify batcher, you can’t take the
advantage of sending multiple queries
at once(as it must go to database at
the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User friendly Id fields are ones my corporate users can memorize and even discuss and/or have phone conversations talking about a particular Order by its code, e.g. "I'm calling to know why the order #1625 was denied.".
The Id doesn't need to be strictly gapless, but I am worried that my users would get confused when they see gaps like 100, 201, 305. For my older projects, I currently implement NHibernate using Oracle sequences which occasionally lose a few sequences when exceptions are thrown, but yet keep a rather tidy order to them. The downside to them is how they break the Unit of Work which results in additional hits to the database for every Save command with or without the Session.Flush.

One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
The bottom-line is that it sounds like you want to use a surrogate key, but have the benefits of a natural key. It can be very difficult to have it both ways and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.

Related

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. I you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference if you create 1 overloaded GSI, or 2 non-overloaded GSIs.
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be cost of storage and provisioned throughput.
Apart from that not sure with new limit of 20

CosmosDB/DocumentDB partitioning with multiple types in same collection

Official recommendation from the team is, to my knowledge, to put all datatypes into single collection that have something like type=someType field on documents to distinguish types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment, in same collection, how do we ensure that user, his blogposts and all the corresponding comments end up in same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant which will group many documents (which can be of different types as you have mentioned) So the Partition Key in this instance should be the TenantId. The partitioning is more about the data relating to a group than the type of data. If the data is related to a User then you could use the UserId, however many users may comment on the same posts so it doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doest have to relate back to the other users directly.. if that makes sense?

Using auto-number database fields theory

I was on "another" programming forum, and we were talking about getting the next number from an auto-increment field BEFORE an insert takes place (there is a way using ADOX). This was in an MS-Access database btw.
Anyway, the discussion veered off into the area of SHOULD you use auto-increment fields for things like invoice numbers, PO numbers, bill of lading numbers, or anything else that needs an unique, incrementing number.
My thoughts were "why not"? Other people are arguing that an Invoice number (for instance) should be managed as a separate table and incremented with code, not using an auto-number field.
Can someone give me a good reason why that would be true?
I've used auto-number fields for years for just this type of thing and have never had problem one.
Your thoughts?
I have always avoided number auto_increment. As it turns out for good reason. But originally my reasons were because that was what the professor told us.
Facebook had a major breach a few years ago - simply because they were use AUTO_INCREMENT fields for user id's. Doesn't take a calculator to figure out that if my ID is 10320 there is likely someone with ID 10319, etc.
When debugging (or proofing design) having a key that implicit of the data it represents is a heck of a lot easier.
Have keys that are implicit of the data reduces the potencial for corrupted data (type's and user guessing).
Implicit keys require the developer think about they're data. I have never come across a table using implicit keys that was not normalized.
Other than the fact deadlines often run tight - there is no great reason for auto increment.
Normally I use and autonumbering field for the ID so I don't need to think about how's generated.
The recordset operation like insert and delete alter the sequence skipping block of numbers.
When you manage CustomerID, Invoice Numbers and so on, it's better to have the full control over them instead of letting them under system's control.
You can create a function that generates for you the desired numbers using a rule (e.g. the invoice can be a function that include the invoicing date).
With autonumbering you can't manage this.
After that there is NO FIXED RULES about what to do and what not do.
It's just your practice and experience and the degree of freedom you want to have.
Bye:-)

What's the best way to cache complicated search queries in a .NET webapp?

I have a website that allows users to query for specific recipes using various search criteria. For example, you can say "Show me all recipes that I can make in under 30 minutes that will use chicken, garlic and pasta but not olive oil."
This query is sent to the web server over JSON, and deserialized into a SearchQuery object (which has various properties, arrays, etc).
The actual database query itself is fairly expensive, and there's a lot of default search templates that would be used quite frequently. For this reason, I'd like to start caching common queries. I've done a little investigation into various caching technologies and read plenty of other SO posts on the subject, but I'm still looking for advice on which way to go. Right now, I'm considering the following options:
Built in System.Web.Caching: This would provide a lot of control over how many items are in the cache, when they expire, and their priority. However, cached objects are keyed by a string, rather than a hashable object. Not only would I need to be able to convert a SearchQuery object into a string, but the hash would have to be perfect and not produce any collisions.
Develop my own InMemory cache: What I'd really like is a Dictionary<SearchQuery, Results> object that persists in memory across all sessions. Since search results can start to get fairly large, I'd want to be able to cap how many queries would be cached and provide a way for older queries to expire. Something like a FIFO queue would work well here. I'm worried about things like thread safety, and am wondering if writing my own cache is worth the effort here.
I've also looked into some other third party cache providers such as NCache and Velocity. These are both distributed cache providers and are probably completely overkill for what I need at the moment. Plus, it seems every cache system I've seen still requires objects to be keyed by a string. Ideally, I want something that holds a cache in process, allows me to key by an object's hash value, and allows me to control expiration times and priorities.
I'd appreciate any advice or references to free and preferably open source solutions that could help me out here. Thanks!
Based on what you are saying, I recommend you use System.Web.Caching and build that into your DataAccess layer shielding it from the rest of you system. When called you can make your real time query or pull from a cached object based on your business/application needs. I do this today, but with Memcached.
An in-memory cache should be pretty easy to implement. I can't think of any reason why you should have particular concerns about validating the uniqueness of a SearchQuery object versus any other - that is, while the key must be a string, you can just store the original object along with the results in the cache, and validate equality directly after you've got a hit on the hash. I would use System.Web.Caching for the benefits you've noted (expiration, etc.). If there happened to be a collision, then the 2nd one would just not get cached. But this would be extremely rare.
Also, the amount of memory needed to store search results should be trivial. You don't need to keep the data of every single field, of every single row, in complete detail. You just need to keep a fast way to access each result, e.g. an int primary key.
Finally, if there are possibly thousands of results for a search that could be cached, you don't even need to keep an ID for each one - just keep the first 100 or something (as well as the total number of hits). I suspect if you analyzed how people use search results, it's a rare person that goes beyond a few pages. If someone did, then you can just run the query again.
So basically you're just storing a primary key for the first X records of each common search, and then if you get a hit on your cache, all you have to do is run a very inexpensive lookup of a handful of indexed keys.
Give a quick look to the Enterprise library Caching Application Block. Assuming you want a web application wide cache, this might be the solution your looking for.
I'm assuming that generating a database query from a SearchQuery object is not expensive, and you want to cache the result (i.e. rowset) obtained from executing the query.
You could generate the query text from your SearchQuery object and use that text as the key for a lookup using System.Web.Caching.
From a quick reading the documentation for the Cache class it appears that the keys have to be unique - which they would be if you used they query text - not the hash of the key.
EDIT
If you are concerned about long cache keys then check the following links:
Cache key length in asp.net
Maximum length of cache keys in HttpRuntime.Cache object?
It seems that the Cache class stores the cached items in an internal dictionary, which uses the key's hash. Keys (query text) with the same hash would end-up in the same bucket in the dictionary, where its just a quick linear search to find the required one when do a cache lookup. So I think you'd be okay with long key strings.
The asp.net caching is pretty well thought out, and I don't think this is a case where you need something else.

How to setup data model for customizable application

I have an ASP.NET data entry application that is used by multiple clients. The application consists of multiple data entry modules that are common to all clients.
I now have multiple clients that want their own custom module added which will typically consist of a dozen or so data points. Some values will be text, others numeric, some will be dropdown selections, etc.
I'm in need of suggestions for handling the data model for this. I have two thoughts on how to handle. First would be to create a new table for each new module for each client. This is pretty clean but I don't particular like it. My other thought is to have one table with columns for each custom data point for each client. This table would end up with a lot of columns and a lot of NULL values. I don't really like either solution and suspect there's a better way to do this, so any feedback you have will be appreciated.
I'm using SQL Server 2008.
As always with these questions, "it depends".
The dreaded key-value table.
This approach relies on a table which lists the fields and their values as individual records.
CustomFields(clientId int, fieldName sysname, fieldValue varbinary)
Benefits:
Infinitely flexible
Easy to implement
Easy to index
non existing values take no space
Disadvantage:
Showing a list of all records with complete field list is a very dirty query
The Microsoft way
The Microsoft way of this kind of problem is "sparse columns" (introduced in SQL 2008)
Benefits:
Blessed by the people who design SQL Server
records can be queried without having to apply fancy pivots
Fields without data don't take space on disk
Disadvantage:
Many technical restrictions
a new field requires DML
The xml tax
You can add an xml field to the table which will be used to store all the "extra" fields.
Benefits:
unlimited flexibility
can be indexed
storage efficient (when it fits in a page)
With some xpath gymnastics the fields can be included in a flat recordset.
schema can be enforced with schema collections
Disadvantages:
not clearly visible what's in the field
xquery support in SQL Server has gaps which makes getting your data a real nightmare sometimes
There are maybe more solutions, but to me these are the main contenders. Which one to choose:
key-value seems appropriate when the number of extra fields is limited. (say no more than 10-20 or so)
Sparse columns is more suitable for data with many properties which are filled out infrequent. Sounds more appropriate when you can have many extra fields
xml column is very flexible, but a pain to query. Appropriate for solutions that write rarely and query rarely. ie: don't run aggregates etc on the data stored in this field.
I'd suggest you go with the first option you described. I wouldn't over think it. The second option you outlined would be a bad idea in my opinion.
If there are fields common to all the modules you're adding to the system you should consider keeping those in a single table then have other tables with the fields specific to a particular module related back to the primary key in the common table. This is basically table inheritance (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server) and will centralize the common module data and make it easier to query across modules.

Resources