I have a table that will contain large amounts of data. The purpose of this table is user transactions.
I will be inserting into this table from a web service, which a third party will be calling frequently.
The third party will be supplying a reference code (most probably a string).
The requirement here is that I will need to check whether this reference code has already been inserted. If it exists, just return the details and do nothing else. If it doesn't, create the transaction as expected. The reasoning behind this is the possibility of losing communication with the service after the request is received.
I have some performance concerns with this, as the search will be done on a string value, and also on a large table. Most of the time the transaction will not exist in the database, as this is just a precaution.
I am not asking for code here, but for the best approach for performance.
As your subject indicates, if you are evaluating EXISTS (SELECT 1 FROM SomeTable ...), there will not be much of a performance penalty. The inner query never materializes its rows of 1s; the engine only needs to find a single match to resolve the result to a boolean, and it can stop scanning as soon as it does.
The other aspect is a nonclustered index on the reference code field. If the reference code is a fixed-length string, say CHAR(50), the B-tree will also be close to optimal.
I am not sure about your data consistency requirements, but I expect the default READ COMMITTED isolation level will do no harm unless you have a highly transactional read/write workload.
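For illustration, here is a minimal C# sketch of that lookup, assuming a SQL Server table named Transactions with a ReferenceCode column (the table, column, and connection string are all placeholders). The unique index shown in the comment is what makes the lookup a seek, and it doubles as a safety net if two requests with the same reference code race each other:

    // Minimal sketch; table and column names are assumptions.
    // A unique nonclustered index makes the lookup an index seek and
    // catches races between concurrent callers:
    //     CREATE UNIQUE NONCLUSTERED INDEX IX_Transactions_ReferenceCode
    //         ON Transactions (ReferenceCode);
    using System.Data.SqlClient;

    static class TransactionStore
    {
        const string ConnStr = "...";  // placeholder connection string

        public static bool Exists(string referenceCode)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "SELECT CASE WHEN EXISTS (SELECT 1 FROM Transactions " +
                "WHERE ReferenceCode = @code) THEN 1 ELSE 0 END", conn))
            {
                cmd.Parameters.AddWithValue("@code", referenceCode);
                conn.Open();
                return (int)cmd.ExecuteScalar() == 1;
            }
        }
    }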
I am preparing for an interview.
At Geeks for Geeks I found the following report of an interview:
Asked me a technical question about implementing a database with key-value pairs that can be stored efficiently, where the key-value pairs are only valid for a specific time. Modified the question every time to make it more efficient. It includes implementation of hashing, dictionary, garbage collection, and heap.
I cannot figure out the part where we have to make sure the key-value pair is valid for a specific time.
I am assuming the first bit, where a database with key-value pairs is to be implemented, is a dictionary/hashtable, but I can only make that much out. Looking for someone to help me out with the rest.
"Database" means durability, and to implement that efficiently, you're going to use an append-only transaction log combined with asynchronously updated snapshots of the dictionary. The in-memory dictionary is updated as updates are written to the log, and when the system is restarted, the dictionary is reconstituted from the latest available snapshot and any logged updates that it's missing.
Expiry is easily implemented in a system like that, and actually makes it simpler. In addition to updating the in-memory dictionary as updates are logged, you also process the log on a delay equal to the expiry time, and delete any records corresponding to delayed updates, if they are still current. You don't even need the snapshots in this scheme, since the dictionary can be reconstituted from only the recent portion of the log.
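To make that delayed-replay idea concrete, here is a minimal in-memory C# sketch under the assumptions above: a single fixed expiry time, and a log that is just an in-memory queue (a real database would persist it to disk; all names are illustrative):

    using System;
    using System.Collections.Generic;

    class ExpiringStore
    {
        readonly TimeSpan _ttl;
        readonly Dictionary<string, (string Value, DateTime WrittenAt)> _dict
            = new Dictionary<string, (string, DateTime)>();
        // The "log": every update is appended here; in a real store this
        // would be the on-disk append-only transaction log.
        readonly Queue<(string Key, DateTime WrittenAt)> _log
            = new Queue<(string, DateTime)>();

        public ExpiringStore(TimeSpan ttl) { _ttl = ttl; }

        public void Put(string key, string value)
        {
            var now = DateTime.UtcNow;
            _dict[key] = (value, now);
            _log.Enqueue((key, now));
        }

        public bool TryGet(string key, out string value)
        {
            Expire();
            if (_dict.TryGetValue(key, out var entry))
            {
                value = entry.Value;
                return true;
            }
            value = null;
            return false;
        }

        // Replay the log on a delay equal to the expiry time: a delayed
        // update causes a delete only if it is still the current version.
        void Expire()
        {
            var cutoff = DateTime.UtcNow - _ttl;
            while (_log.Count > 0 && _log.Peek().WrittenAt <= cutoff)
            {
                var (key, writtenAt) = _log.Dequeue();
                if (_dict.TryGetValue(key, out var entry)
                    && entry.WrittenAt == writtenAt)
                    _dict.Remove(key);
            }
        }
    }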
You don't often get to use an expiry mechanism like this in real life, though, because real systems usually let you specify a different expiry time for each record. Even then, it usually boils down to a similar kind of process involving multiple "expiry buckets".
If you want a lot of detail, you should look online for design information about memcache and redis.
How would one validate a unique constraint using DDD? Let's say that an Entity has a property name that must be unique across the system, and there is a specific EntityRepository method nameExists(name): bool... This is what I found people suggest to do, because the repository is the abstraction of the collection of all the Entities and should be able to perform this check.
So before creating/adding the new Entity the command / domain service could check for the existence of a newName against the repository, but I think that this will not always work because of concurrency.
In a concurrent scenario where two transactions are started simultaneously, the EntityRepository's nameExists method might return false for both transactions, and as a result of this two entries with the same name will be incorrectly inserted.
I am sure that I am missing something basic, but the answers I found all point to the repository's exists method. TBH, others say that a UNIQUE constraint should be put on the DB to catch the concurrency case, but what if one uses Event Sourcing or a persistence layer that does not have unique constraints?
Follow-up question
What if the uniqueness constraint is to be applied in different levels of a hierarchy?
A Container's name must be unique in the system and then Child names must be unique inside a Container.
Let's say that a transactional DB takes care of the uniqueness at the lowest possible level, what about the domain?
Should I still express the uniqueness logic at the domain level, e.g. with a Domain Service for the system-level uniqueness and embedding Child entities inside the Container entity and having a business rule (and therefore making Container the aggregate root)?
Or should I not bother with "replicating" the uniqueness in the domain and (given there are no other rules to apply between the two) split Container and Child? Will the domain lack expressiveness then?
I am sure that I am missing something basic
Not something basic.
The term we normally use for enforcing a constraint, like uniqueness, across a set of entities is set validation. Greg Young calls your attention to a specific question:
What is the business impact of having a failure
Most set constraints fall into one of two categories:
constraints that need to be true when the system reaches steady state, but may not hold while work is in progress. In business processes, these are often handled by detecting conflicts in the stored data, and then invoking various mitigation processes to resolve the conflict.
constraints that need to be true always.
The first category includes things like double booking a seat on an airplane; it's not necessarily a problem unless both people show up, and even then you can handle it by bumping someone to another seat, or another flight.
In these cases, you make a best effort - you look at a recent copy of the set, make sure there are no conflicts there, then hope for the best (accepting that some percentage of the time, you'll have missed a change).
See Memories, Guesses and Apologies (Pat Helland, 2007).
The second category is the hard one; to ensure the invariant always holds, you have to lock the entire set so that races don't allow two different writers to insert conflicting information.
Relational databases tend to be really good at set validation - putting the entire set into a single database is going to be the right answer (note the assumption that the set is small enough to fit into a single database -- trying to lock two databases at the same time is hard).
Another possibility is to ensure that only one writer can update the set at any given time -- you don't have to worry about losing a race when you are the only one running in it.
Sometimes you can lock a smaller set -- imagine, for example, having a collection of locks with numbers, and the hash code for the name tells you which lock you have to grab.
The simplest version of this is when you can use the name as the aggregate identifier itself.
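A minimal C# sketch of that hashed-lock idea, assuming in-process locks and an arbitrary stripe count (across multiple processes you would need distributed locks instead):

    using System;

    class NameLocks
    {
        readonly object[] _stripes;

        public NameLocks(int stripeCount = 16)
        {
            _stripes = new object[stripeCount];
            for (int i = 0; i < stripeCount; i++)
                _stripes[i] = new object();
        }

        // The hash code of the name picks the lock, so two writers
        // inserting the same name always contend on the same stripe.
        public void WithNameLock(string name, Action action)
        {
            int index = (name.GetHashCode() & int.MaxValue) % _stripes.Length;
            lock (_stripes[index])
                action();
        }
    }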
if one uses Event Sourcing or a persistence layer that does not have unique constraints?
Sometimes, you introduce a persistent store dedicated to the set, just to ensure that you can maintain the invariant. See "microservices".
But if you can't change the database, and you can't use a database with the locking guarantees that you need, and the business absolutely has to have the set valid at all times... then you single thread that part of the work.
Everybody that wants to change a name puts a request into a queue, and the one thread responsible for managing the invariant certifies each and every change.
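A minimal C# sketch of that single-writer scheme, with the set held in memory for brevity (all names are illustrative):

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class NameRegistrar
    {
        class Request
        {
            public string Name;
            public TaskCompletionSource<bool> Reply
                = new TaskCompletionSource<bool>();
        }

        readonly BlockingCollection<Request> _queue
            = new BlockingCollection<Request>();
        readonly HashSet<string> _taken = new HashSet<string>();

        public NameRegistrar() { Task.Run(Loop); }

        public Task<bool> TryReserveAsync(string name)
        {
            var req = new Request { Name = name };
            _queue.Add(req);
            return req.Reply.Task;
        }

        // The one thread responsible for the invariant: no races are
        // possible because it is the only writer of the set.
        void Loop()
        {
            foreach (var req in _queue.GetConsumingEnumerable())
                req.Reply.SetResult(_taken.Add(req.Name));
        }
    }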
There's no magic; just hard work and trade offs.
The underlying data structure of a HashSet is a hashtable. How does it identify duplicates, and why is it a good fit if our most frequent operation is search?
It uses the hash code of the object, which is a quickly computed integer. This hash code tries to be distributed as evenly as possible over all potential object values.
As a result it can distribute the inserted values into an array (the hashtable) with a very low probability of collision. The search operation is then quite quick: compute the hash code, access the array, compare, and get the value, usually in constant time. The same thing actually happens when finding duplicates.
Hash code collisions are resolved as well: there can be more than one value for the same entry within the hashtable, and that is where equals comes into play. But collisions are rare, so they don't affect average performance significantly.
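As a toy C# sketch, purely illustrative (real implementations add resizing, null handling, and smarter collision resolution): hash to a bucket, then compare with Equals only within that bucket.

    using System.Collections.Generic;

    class TinyHashSet<T>
    {
        readonly List<T>[] _buckets = new List<T>[16];

        int BucketOf(T item)
            => (item.GetHashCode() & int.MaxValue) % _buckets.Length;

        public bool Add(T item)
        {
            var bucket = _buckets[BucketOf(item)]
                ?? (_buckets[BucketOf(item)] = new List<T>());
            foreach (var existing in bucket)      // usually 0 or 1 entries,
                if (existing.Equals(item))        // so this stays O(1)
                    return false;                 // duplicate found
            bucket.Add(item);
            return true;
        }

        public bool Contains(T item)
        {
            var bucket = _buckets[BucketOf(item)];
            if (bucket == null) return false;
            foreach (var existing in bucket)
                if (existing.Equals(item)) return true;
            return false;
        }
    }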
I use SQL Server, and when I create a new table I make a specific field an auto increment primary key. The problem is that some people told me that making the field an auto increment primary key means that when any record is deleted (they don't care about the auto increment field number), the field keeps increasing, so at some point, if the type of my field is integer for example, the integer range will be consumed completely and I will be in trouble. So they tell me not to use this feature any more.
The solution they propose is to handle this in code by getting the max of the primary key: if no value exists the new key will be 1, otherwise max + 1.
Any suggestions about this problem? Can I use the auto increment feature?
I also want to know the cases in which using auto increment is not preferable, and the alternatives.
Note: this question is general, not specific to any DBMS; I want to know whether this is also true for DBMSs like Oracle, MySQL, Informix, and so on.
Thanks so much.
You should use identity (auto increment) columns. The bigint data type can store values up to 2^63-1 (9,223,372,036,854,775,807). I don't think your system is going to reach this value soon, even if you are inserting and deleting lots of records.
If you implement the method you propose properly, you will end up with a lot of locking problems. Otherwise, you will have to deal with exceptions thrown because of constraint violation (or even worse - non-unique values, if there is no primary key constraint).
An int datatype in SQL Server can hold values from -2,147,483,648 through 2,147,483,647.
If you seed your identity column with -2,147,483,648, e.g. FooId INT IDENTITY(-2147483648, 1), then you have over 4 billion values to play with.
If you really think this is still not enough, you could use a bigint, which can hold values from -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807, but this is almost guaranteed to be overkill. Even with large data volumes and/or a large number of transactions, you will probably either run out of disk space or exhaust the lifetime of your application before you exhaust the identity values when using an int, and almost certainly when using a bigint.
To summarise, you should use an identity column and you should not care about gaps in the values since a) you have enough candidate values and b) it's an abstract number with no logical meaning.
If you were to implement the solution you suggest, with the code deriving the next identity column, you would have to consider concurrency, since you will have to synchronise access to the current maximum identity value between two competing transactions. Indeed, you may end up introducing a significant performance degradation, since you will have to first read the max value, calculate and then insert (not to mention the extra work involved in synchronising concurrent transactions). If, however, you use an identity column, concurrency will be handled for you by the database engine.
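For illustration, a minimal C# sketch of letting the engine assign the key; the OUTPUT clause returns the generated value without a separate query (table and column names are assumptions):

    // Schema assumed:
    //     CREATE TABLE Foo (
    //         FooId INT IDENTITY(1, 1) PRIMARY KEY,
    //         Name  NVARCHAR(100) NOT NULL);
    using System.Data.SqlClient;

    static class FooRepository
    {
        const string ConnStr = "...";  // placeholder connection string

        public static int Insert(string name)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "INSERT INTO Foo (Name) OUTPUT INSERTED.FooId VALUES (@name)",
                conn))
            {
                cmd.Parameters.AddWithValue("@name", name);
                conn.Open();
                return (int)cmd.ExecuteScalar();  // no SELECT MAX, no race
            }
        }
    }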
The solution they suggest can, and most likely will, create a concurrency problem and/or scalability problem. If two sessions use the Max technique you describe at the same time, they can come up with the same number and then both try to add it at the same time. This will create a constraint violation.
You can work around that problem by locking the table or by catching exceptions and re-inserting, but that's a really bad way to do things. Locking will reduce performance and cause scalability issues (and if you're planning on enough records to worry about overflowing an int, you will need scalability).
Identity value generation is atomic. Two sessions cannot create the same identity value, so this problem is non-existent when using it.
If you're concerned that an identity field may overflow, then use a larger datatype, such as bigint. You would be hard pressed to generate enough records to overflow that.
Now, there are valid reasons NOT to use an identity field, but this is not one of them.
Continue to use the identity feature with the PK in SQL Server. MySQL also has an auto increment feature. Don't worry about running out of the integer range; you will run out of hard disk space before that happens.
I would advise AGAINST using identity/auto-increment, because:
Its implementation is broken in SQL Server 2005/2008. Read more
It doesn't work well if you are going to use an ORM to map your database to objects. Read more
I would advise you to use the Hi/Lo generator if you usually access your database through a program and don't depend on sending insert statements manually to the DB. You can read more about it in the second link.
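A minimal C# sketch of the Hi/Lo idea. The fetchNextHi delegate stands in for one database round trip per block, e.g. transactionally incrementing a row in a HiLo table (all names are illustrative):

    using System;

    class HiLoGenerator
    {
        readonly int _blockSize;
        readonly Func<long> _fetchNextHi;  // one DB round trip per block
        readonly object _gate = new object();
        long _hi;
        int _lo;

        public HiLoGenerator(int blockSize, Func<long> fetchNextHi)
        {
            _blockSize = blockSize;
            _fetchNextHi = fetchNextHi;
            _lo = blockSize;  // force a fetch on first use
        }

        // All IDs in a block are handed out locally, so most inserts need
        // no round trip, and the ID is known before the INSERT runs.
        public long NextId()
        {
            lock (_gate)
            {
                if (_lo >= _blockSize)
                {
                    _hi = _fetchNextHi();
                    _lo = 0;
                }
                return _hi * _blockSize + _lo++;
            }
        }
    }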
One requirement is that when persisting my C# objects to the database I must decide the database ID (surrogate primary key) in code.
Second requirement is that the database type for the key must be int or char(x)... so no uniqueidentifier or binary(16) or the like.
These are unchangeable requirements.
What would be the best way to go about handling this?
One idea is base64-encoded GUIDs, looking like "XSiZtdXcKU68QWe7N96Dig". These are easily created in code and are, to me, acceptable in URLs if necessary. But will it be too expensive performance-wise (indexing, size) to have all primary and foreign keys be char(22)? Offhand I really like this idea.
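For what it's worth, a minimal C# sketch of producing such an ID: a GUID's 16 bytes encode to 24 base64 characters, the two padding characters are dropped, and '+' and '/' are swapped for URL-safe ones.

    using System;

    static class ShortGuid
    {
        public static string NewId()
        {
            return Convert.ToBase64String(Guid.NewGuid().ToByteArray())
                          .Replace('+', '-')
                          .Replace('/', '_')
                          .Substring(0, 22);  // strip the "==" padding
        }
    }

    // Example: ShortGuid.NewId() -> something like "XSiZtdXcKU68QWe7N96Dig"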
Another idea would be to create a code version of a database sequence, creating incremented integers for me. But I don't know if this is plausible, and I would need some guidance to ensure its reliability. The sequencer must know how far it has come, and what about threads that I don't control, etc.?
I imagine that no table involved will ever exceed 1.000.000 rows... will probably be far less.
You could have a table called "sequences". For each table there would be a row with a counter. Then, when you need another number, fetch it from the counter table and increment it. Put it in a transaction and you will have uniqueness.
However this will suffer in terms of performance, of course.
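A minimal C# sketch of that counter table, with assumed names (Sequences, TableName, NextValue) and assuming a row for each table already exists. The single UPDATE both increments and returns the value, and its row lock is what provides the uniqueness:

    // Schema assumed:
    //     CREATE TABLE Sequences (
    //         TableName VARCHAR(50) PRIMARY KEY,
    //         NextValue INT NOT NULL);
    using System.Data.SqlClient;

    static class Sequences
    {
        const string ConnStr = "...";  // placeholder connection string

        public static int Next(string tableName)
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "UPDATE Sequences SET NextValue = NextValue + 1 " +
                "OUTPUT INSERTED.NextValue WHERE TableName = @t", conn))
            {
                cmd.Parameters.AddWithValue("@t", tableName);
                conn.Open();
                return (int)cmd.ExecuteScalar();  // increment + read, atomically
            }
        }
    }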
A simple incrementing int would be the easiest way to ensure uniqueness. This is what the database will do if you let it. If you set the column to auto_increment, the database will do this for you automatically.
There are no security issues with this, but since you will be handling it yourself instead of letting the database engine take care of it, you will need to ensure that you don't generate the same id twice. This should be simple if you are on a single threaded system, but if your program is distributed you will need to put some effort into ensuring the uniqueness.
Seeing that you have an ASP.NET app, you could do the following (hoping and assuming all users must authenticate themselves before using your app!):
Assign each user a unique "UserID" in your database (can be INT, or CHAR)
Assign each user a "HighestSequentialID" (INT) in your database
When the user logs on, read those values from the database and store them in e.g. a custom principal, or in a cookie, or something else
Whenever the user is about to insert a row, create a segmented ID, (UserID).(user's sequential number), and store it as VARCHAR(20); e.g. your UserID is 15, and thus this user's entries would have unique IDs of "15.00001", "15.00002" and so on (a sketch follows below)
When the user logs off (or at any other time), update their new, highest used sequential ID in the database so that next time around, you'll know what this user used last
Again - you'll have to do a lot more housekeeping work yourself, and it's always prone to a mishap (assigning a duplicate user ID, or misinterpreting the highest sequential number for that user).
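For concreteness, a minimal C# sketch of that segmented-ID scheme (the padding width and all names are illustrative):

    class SegmentedIdGenerator
    {
        readonly int _userId;
        int _highestSequential;  // loaded from the database at logon

        public SegmentedIdGenerator(int userId, int highestSequential)
        {
            _userId = userId;
            _highestSequential = highestSequential;
        }

        // "15.00001", "15.00002", ... unique as long as each user has one
        // active generator and the counter is persisted at logoff.
        public string NextId()
            => string.Format("{0}.{1:D5}", _userId, ++_highestSequential);
    }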
I would strongly recommend trying to get these requirements changed - with these in place, all solutions will be sub-optimal at best, while using the database to handle this would be totally painless.
Marc
For a table below 1.000.000 rows, I would not be too terribly concerned about a char(22) Primary key. Of course the ideal solution for a situation like this would be for each object to have something unique about it that you could leverage for the key, even if it is a multi-part key. The next ideal solution would be to have the requirements changed :)