Is there any reason to use an index other than RecId (SurrogateKey in AX2012) as the clustered index?
Confirmed by a quick Google search (*), one should consider at least 4 criteria when deciding on clustered indexes:
Index must be unique.
Index must be narrow (As few fields as possible - since these would be copied to every other index).
Index must be static (As updating the index field value(s) will cause SQL server to physically move the record to a new location)
Index must be ordered (Ascending / Descending).
RecId adheres to all of the above, in a better way than any index you can create yourself. Any index you create yourself will violate at least the 2nd and/or the 4th, since it would automatically include DataAreaId.
What I think...
Could it be that the option to set this is just a legacy property from AX3.0 or lower, and that its use could be deprecated now?
(*) TechNet SQL Server Index Design Guide and Effective Clustered Indexes

While RecId is a good choice, you can make a shorter key on, say, an int field in a global table (SaveDataPerCompany = No).
Access patterns matter: if you often access your customers by account number, you might as well store the records in that order.
Also, if you only have one index, as is often the case for group and parameter tables, you are not punished for having a longer key; it will need storage somewhere anyway.
See also What do Clustered and Non clustered index actually mean?


separate indexes for select optimization

I have a table 'data' with columns
id (auto_increment), id_device (integer), timestamp (numeric)
I need to execute these selects:
select * from data where id<10000000 and id_device=345
select * from data where id<10000000 and id_device=345 and timestamp>'2017-01-01 10:00:00' and timestamp<'2017-03-01 08:00:00'
For first select:
Is it better to make a separate index for "id" and a separate one for "id_device"?
Or is it better for performance to make an index like INDEX id, id_device?
For second select:
Is it better to make a separate index for "id", a separate one for "id_device" and a separate one for "timestamp"?
Or is it better for performance to make an index like INDEX id, id_device, timestamp?
My short answer: it depends on your data.
Longer: if id_device=345 is true for fewer rows than id<10000000 then id_device should be listed first in a multi-column index: ...ON data(id_device,id). Also if select speed is more important to you/your users than insert/update/delete speed, then why not add a lot of indexes and leave it to the query planner to choose which ones to use:
create index i01_tbl on tbl(id);
create index i02_tbl on tbl(id_device);
create index i03_tbl on tbl(timestamp);
create index i04_tbl on tbl(id,id_device);
create index i05_tbl on tbl(id_device,id);
create index i06_tbl on tbl(timestamp,id);
create index i07_tbl on tbl(id,timestamp);
create index i08_tbl on tbl(id_device,timestamp);
create index i09_tbl on tbl(timestamp,id_device);
create index i10_tbl on tbl(id, id_device, timestamp);
create index i11_tbl on tbl(id_device, id, timestamp);
create index i12_tbl on tbl(id_device, timestamp, id);
create index i13_tbl on tbl(id, timestamp, id_device);
create index i14_tbl on tbl(timestamp, id_device, id);
create index i15_tbl on tbl(timestamp, id, id_device);
The query planner algorithms in your database (sqlite has them too) usually make good choices here, especially if you run the sqlite ANALYZE command periodically or after changing lots of data. The downside of having many indexes is slower inserts and deletes (and updates if they involve indexed columns) and more disk/memory usage. Use explain plan on your important SQL statements (important when it comes to speed) to check which indexes are used and which are not. If an index is never used, or only used in queries that are fast anyway without it, then you can drop it. Also be aware that newer versions of your database (sqlite, oracle, postgresql) can have newer query planner algorithms which are better for most SELECTs, but can be worse for some. Realistic tests on realistic datasets are the best way to tell. Choosing which indexes to create is not an exact science, and there are no definitive rules that fit all cases.
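To make the "check which indexes are used" step concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite as the database and a plain (non primary key) id column. The helper name query_plan and the sample rows are made up for illustration; the index name follows the i05 naming used above, applied to the data table.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan text lines SQLite reports for a statement (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the readable plan detail

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, id_device INTEGER, timestamp NUMERIC)")

# Made-up sample data: 1000 rows spread over 50 devices.
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(i, i % 50, 1483264800 + i) for i in range(1000)],
)

# The selective equality column first, the range column second.
conn.execute("CREATE INDEX i05_data ON data (id_device, id)")
conn.execute("ANALYZE")  # refresh the statistics the planner relies on

plan = query_plan(
    conn,
    "SELECT * FROM data WHERE id < ? AND id_device = ?",
    (10000000, 345),
)
for line in plan:
    print(line)
```

If the printed plan mentions i05_data, the planner picked that index for the first select. Repeating this for each candidate index, on realistic data, shows which ones are worth keeping.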

How to make values unique in cassandra

I want to create a unique constraint in Cassandra.
I want all the values in a column of my column family to be unique.
Now I want no row with values equal to rahul, 123 and abc to be inserted again. Searching the DataStax documentation, I found that I can achieve this for the partition key by using IF NOT EXISTS, but I cannot find a solution for making all 3 values unique.
That means if
name- jacob
this should also not be inserted into my database, because its phone column has the same value as the one I showed for name-rahul.
The short answer is that constraints of any type are not supported in Cassandra. They are simply too expensive as they must involve multiple nodes, thus defeating the purpose of having eventual consistency in the first place. If you needed to make a single column unique, then there could be a solution, but not for several unique columns. For the same reason, there is no isolation and no consistency (the C and I of ACID). If you really need to use Cassandra with this type of enforcement, then you will need to create some kind of synchronization application layer which will intercept all requests to the database and make sure that the values are unique and all constraints are enforced. But this won't have anything to do with Cassandra.
I know this is an old question and the existing answer is correct (you can't do constraints in C*), but you can solve the problem using batched creates. Create one or more additional tables, each with the constrained column as the primary key and then batch the creates, which is an atomic operation. If any of those column values already exist the entire batch will fail. For example if the table is named Foo, also create Foo_by_Name (primary key Name), Foo_by_Phone (primary key Phone), and Foo_by_Address (primary key Address) tables. Then when you want to add a row, create a batch with all 4 tables. You can either duplicate all of the columns in each table (handy if you want to fetch by Name, Phone, or Address), or you can have a single column of just the Name, Phone, or Address.

How does GAE datastore index null values

I'm concerned about read performance. I want to know if setting an indexed field value to null is faster than giving it a value.
I have lots of items with a status field. The status can be "pending", "invalid", "banned", etc...
My typical request is to find the status "ok" (or null). Since null fields are not saved to datastore, it is already a win to avoid having a "useless" default value that I can replace with null. So I already use less disk space.
But I was wondering: since datastore is NoSQL, it doesn't know about the data structure and it doesn't know there is a missing status column. So how does it do the status = null request check?
Does it have to check all columns of each row trying to find my column? or is there some smarter mechanism?
For example, index (null=Entity,key) when we pass a column explicitly saying it is null (if this is the case, does Objectify respect that and keep the field in the list when passing it to the native API if it's null?)
And mainly, which request is more efficient?
The low level API (and Objectify) stores and indexes nulls if you specify that a field/property should be indexed. For Objectify, you can specify @Ignore(IfNull.class) or @Unindex(IfNull.class) if you want to alter this behavior. You are probably confusing this with documentation for other data access APIs.
Since GAE only allows you to query for indexed fields, your question is really: Is it better to index nulls and query for them, or to query for everything and filter out non-null values?
This is purely a question of sparsity. If the overwhelming majority of your records contain null values, then you're probably better off querying for everything and filtering out the ones you don't want manually. A handful of extra entity reads are probably cheaper than updating and storing an extra index. On the other hand, if null records are a small percentage of your data, then you will certainly want the index.
This indexing dilemma is not unique to GAE. All databases present this question with respect to low-cardinality fields; it's just that they'll do the table scan (testing and skipping rows) for you.
If you really want to fine-tune this behavior, read Objectify's documentation on Partial Indexes.
null is also treated as a value in datastore and there will be entries for null values in indexes. Datastore doc says, "Datastore distinguishes between an entity that does not possess a property and one that possesses the property with a null value"
Datastore will never check all columns or all records. If you have this property indexed, it will get records from the index only. If it is not indexed, you cannot query by that property.
In terms of query performance, it should be the same, but you can always profile and check.

Database Table Data Types To Store Key/Value Cache

I am working on a project that requires key/value caching function but the application will exist in a very limited environment that does not support any of the go-to industry standard memory caching methods such as ASP.NET Cache, memcached, AppFabric.
The only option we have in this restrictive environment is a MS SQL database. We have to create a simple key/value table to meet our key/value caching needs. We will most likely serialize the data as JSON but I am not sure what would be the best data type for the key. It will obviously need to be a unique key and need to be readable by the programmer getting and setting the cache. It also needs to be a fast lookup since we will already be losing performance by not having access to an "in memory" cache solution.
I am used to having my primary key column be an int or bigint value. In this case, should the primary key (the cache key) be a char or varchar data type, as all queries will be:
SELECT value FROM CacheTable WHERE key = 'keyname'
I also saw posts about using an md5 hash but other posts pointed out that hashing cannot be relied on to produce unique keys all the time. I'm basically after some advice on the data type and whether or not the 'key' column should be the primary key or if I should still create an int or bigint primary key (even though it probably will not be used).
The end result we are after is creating a caching class similar to .NET's native caching where we can create a static class that pulls from the database table such as:
CustomDatabaseCache.Set(string key, object value);
CustomDatabaseCache.Get(string key)
I think in your scenario having a clustered primary key on your keyname column would work fine. However, it's worth experimenting with fill factors, because you want a fill factor that is low enough that you don't cause excessive page splits, but one that is high enough to keep the number of page reads low.
A clustered IDENTITY index works better in terms of eliminating page splits on the clustered index - and you could use a unique index on keyname which used an INCLUDE clause to include your values. However - in your case, I don't see the benefit in doing that because you'd have exactly the same page-split problem on your unique index, and the clustered index on keyname would be no more expensive to read because you wouldn't have any extra columns. Plus you would then have index update cost on two indexes for write.
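As a rough illustration of the "key column is the primary key" layout, here is a minimal sketch in Python. It uses the built-in sqlite3 module as a stand-in for MS SQL, so fill factor and clustered index behaviour are not modelled. The class, table and column names (DatabaseCache, cache_table, cache_key, cache_value) are made up for the example.

```python
import json
import sqlite3

class DatabaseCache:
    """Key/value cache backed by one table whose string key is the primary key."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache_table ("
            "cache_key TEXT PRIMARY KEY, "
            "cache_value TEXT NOT NULL)"
        )

    def set(self, key, value):
        # Values are serialized as JSON; an existing key is overwritten.
        self.conn.execute(
            "INSERT OR REPLACE INTO cache_table (cache_key, cache_value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()

    def get(self, key, default=None):
        # Every read is a single lookup on the primary key.
        row = self.conn.execute(
            "SELECT cache_value FROM cache_table WHERE cache_key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

cache = DatabaseCache(sqlite3.connect(":memory:"))
cache.set("user:42", {"name": "Ada", "roles": ["admin"]})
print(cache.get("user:42"))
print(cache.get("missing", default="not cached"))
```

There is no separate int or bigint surrogate column here, because every query goes through the string key anyway.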
Hope that helps.

What exactly are hashtables?

What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Notice: This is not my homework; I go to school but they teach us only the BASICs in informatics)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to use a pre-determined pattern to look onward from the location in the array with the hash for somewhere that is free, the other (chaining) is to store a linked list hanging off each entry in the array (so you do a linear lookup over what is hopefully a short list). The cases of production code where I've read the source code have all used chaining with dynamic rebuilding of the hash table when the load factor is excessive.
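A minimal sketch of the chaining approach with rebuilding on a high load factor, in Python. It is only an illustration: the class name ChainedHashTable and the 0.75 load limit are arbitrary choices, Python's built-in hash() stands in for the hash function, and small Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining (illustrative only)."""

    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load

    def _bucket(self, key):
        # The hash function turns the key into an index into the array.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:  # key already stored: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # the key is stored alongside the value
        self._size += 1
        if self._size > self._max_load * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        # Linear search over what is hopefully a short chain.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def _resize(self, capacity):
        # Rebuild: every entry is re-hashed into a larger array.
        entries = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(capacity)]
        for key, value in entries:
            self._bucket(key).append((key, value))

    def __len__(self):
        return self._size

table = ChainedHashTable()
for n in range(100):
    table.put("key%d" % n, n * n)
print(table.get("key7"), len(table))
```

Storing the key next to the value is what lets get() tell apart two keys that landed in the same bucket.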
Good hash functions are one-way functions that allow you to create a well-distributed value from any given input. Therefore, you will get somewhat unique values for each input value. They are also repeatable, such that any given input will always generate the same output.
An example of a good hash function is SHA1 or SHA256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find my record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
WHERE hash_value = 'AE32ABC31234CAD984EA8';
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.
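The myHashFunction call above could look roughly like this in Python, assuming SHA-256 from the standard hashlib module. The function name row_hash and the separator byte are choices made for this sketch and are not part of the original example.

```python
import hashlib

def row_hash(*fields):
    """Hash several column values into one hex string (hypothetical helper)."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing to the same value.
    joined = "\x1f".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

lookup_key = row_hash("Marcus", "Adams", "1234 Main St", "555-1212")
print(lookup_key)
```

The same function has to be used both when the row is written and when it is looked up, otherwise the stored hash_value and the computed one will not match.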
