How and which column is chosen as primary index in teradata - teradata

How and which column is chosen as primary index in teradata when it is not defined while creating a table?

If you don't define an index, Teradata will implicitly take the first column as the Primary Index. Besides this, you can either choose one or more columns as the Primary Index or define the table with NO PRIMARY INDEX.
The Primary Index defines the distribution key of the data across the AMPs. If NO PRIMARY INDEX is defined, rows are distributed round-robin.
Choosing the PI is part of Physical Design and there is no answer to rule them all. There is a dedicated document in the Documentation covering this topic ("Database Design"). You have to think of:
1) distribution of the data (prevent high skew)
2) possible access and joins
ad 1) should be clear.
ad 2) Because data is distributed by the PI, a GROUP BY on columns other than the PI, or a JOIN on join fields other than the PI (at least the PI has to be part of them), will result in redistribution of your spool data, which is bad for the performance of the query.
If you would like to test different PIs with your data, you can do it with the following SQL (e.g. myTable with a PI of column_1 and column_2):
SELECT HASHAMP (HASHBUCKET (HASHROW (column_1,column_2))) as targetAMP
,COUNT (*) as CountRecords
FROM myTable
GROUP BY targetAMP;
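To make the idea behind the skew query concrete, here is a minimal Python sketch that simulates it outside the database. It is only an illustration: MD5 stands in for Teradata's row hash, and `amp_distribution`, `num_amps` and the sample rows are hypothetical names and data, not anything from Teradata.

```python
import hashlib
from collections import Counter

def amp_distribution(rows, pi_columns, num_amps=4):
    """Count rows per simulated AMP for a candidate primary index.

    MD5 is only a stand-in hash here; Teradata uses its own row hash.
    """
    counts = Counter()
    for row in rows:
        key = "|".join(str(row[c]) for c in pi_columns)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        amp = int.from_bytes(digest[:4], "big") % num_amps
        counts[amp] += 1
    return counts

# Hypothetical sample data: column_1 is nearly unique, status is constant.
rows = [{"column_1": i, "column_2": i % 3, "status": "A"} for i in range(1000)]

even = amp_distribution(rows, ["column_1", "column_2"])  # spread over all AMPs
skewed = amp_distribution(rows, ["status"])               # everything on one AMP
print(sorted(even.items()))
print(sorted(skewed.items()))
```

A candidate PI with few distinct values piles all rows onto a few AMPs, which is the skew the SQL above is meant to reveal on real data.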

Related

Limiting the number of rows a table can contain based on the value of a column - SQLite

Since SQLite doesn't support TRUE and FALSE, I have a boolean column that stores 0 and 1. For the boolean column in question, I want a check on the number of 1's the column contains, limiting the total number for the table.
For example, the table can have columns: name, isAdult. If there are already 5 adults in the table, the system would not allow a user to add a 6th entry with isAdult = 1. There is no restriction on how many rows the table can contain, since there is no limit on the number of entries where isAdult = 0.
You can use a trigger to prevent inserting the sixth entry:
CREATE TRIGGER five_adults
BEFORE INSERT ON MyTable
WHEN NEW.isAdult
AND (SELECT COUNT(*)
FROM MyTable
WHERE isAdult
) >= 5
BEGIN
SELECT RAISE(FAIL, 'only five adults allowed');
END;
(You might need a similar trigger for UPDATEs.)
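The trigger can be tried out from Python's built-in sqlite3 module. This is a sketch under assumptions: the table definition and the `five_adults_update` trigger are one possible reading of "a similar trigger for UPDATEs", and `try_execute` is just a helper for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (name TEXT, isAdult INTEGER NOT NULL DEFAULT 0);

CREATE TRIGGER five_adults
BEFORE INSERT ON MyTable
WHEN NEW.isAdult
 AND (SELECT COUNT(*) FROM MyTable WHERE isAdult) >= 5
BEGIN
    SELECT RAISE(FAIL, 'only five adults allowed');
END;

-- Assumed companion trigger: block turning a non-adult row into an adult one.
CREATE TRIGGER five_adults_update
BEFORE UPDATE OF isAdult ON MyTable
WHEN NEW.isAdult AND NOT OLD.isAdult
 AND (SELECT COUNT(*) FROM MyTable WHERE isAdult) >= 5
BEGIN
    SELECT RAISE(FAIL, 'only five adults allowed');
END;
""")

def try_execute(sql, params=()):
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.DatabaseError:
        return False

for i in range(5):
    conn.execute("INSERT INTO MyTable VALUES (?, 1)", (f"adult{i}",))
conn.execute("INSERT INTO MyTable VALUES ('child', 0)")

sixth_insert_ok = try_execute("INSERT INTO MyTable VALUES ('adult5', 1)")
update_ok = try_execute("UPDATE MyTable SET isAdult = 1 WHERE name = 'child'")
adult_count = conn.execute(
    "SELECT COUNT(*) FROM MyTable WHERE isAdult"
).fetchone()[0]
print(sixth_insert_ok, update_ok, adult_count)
```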
The SQL-99 standard would solve this with an ASSERTION, a type of constraint that can validate data changes with respect to an arbitrary SELECT statement. Unfortunately, I don't know of any SQL database currently on the market that implements ASSERTION constraints. It's an optional feature of the SQL standard, and SQL implementors are not required to provide it.
A workaround is to create a foreign key constraint so isAdult can be an integer value referencing a lookup table that contains only values 1 through 5. Then also put a UNIQUE constraint on isAdult. Use NULL for "false" when the row is for a user who is not an adult (NULL is ignored by UNIQUE).
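A minimal sketch of this foreign-key workaround, again with Python's sqlite3 module. The table names `AdultSlot` and `Person` and the helper `try_insert` are made up for the example, and note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are not enforced by default
conn.executescript("""
-- Lookup table holding the five allowed "adult slots".
CREATE TABLE AdultSlot (slot INTEGER PRIMARY KEY);
INSERT INTO AdultSlot (slot) VALUES (1), (2), (3), (4), (5);

-- isAdult is NULL for non-adults, or a unique slot number 1..5 for adults.
CREATE TABLE Person (
    name    TEXT NOT NULL,
    isAdult INTEGER UNIQUE REFERENCES AdultSlot (slot)
);
""")

def try_insert(name, slot):
    """Insert a person and report whether the constraints accepted the row."""
    try:
        conn.execute(
            "INSERT INTO Person (name, isAdult) VALUES (?, ?)", (name, slot)
        )
        return True
    except sqlite3.IntegrityError:
        return False

adults_ok = [try_insert(f"adult{slot}", slot) for slot in range(1, 6)]
children_ok = [try_insert(f"child{i}", None) for i in range(3)]
reused_slot_ok = try_insert("extra_adult", 3)    # violates UNIQUE
unknown_slot_ok = try_insert("sixth_adult", 6)   # violates the foreign key
print(adults_ok, children_ok, reused_slot_ok, unknown_slot_ok)
```

The cost of this design is that the application has to pick a free slot number for each new adult instead of simply storing 1.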
Another workaround is to do this in application code. SELECT from the database before changing it, to make sure your change won't break your app's business rules. Normally in a multi-user RDBMS this is unreliable due to race conditions, but since you're using SQLite you might be the sole user.

How can I return inserted ids for multiple rows in SQLite?

Given a table:
CREATE TABLE Foo(
Id INTEGER PRIMARY KEY AUTOINCREMENT,
Name TEXT
);
How can I return the ids of the multiple rows inserted at the same time using:
INSERT INTO Foo (Name) VALUES
('A'),
('B'),
('C');
I am aware of last_insert_rowid() but I have not found any examples of using it for multiple rows.
What I am trying to achieve can be seen in this SQL Server example:
DECLARE @InsertedRows AS TABLE (Id BIGINT);
INSERT INTO [Foo] (Name) OUTPUT Inserted.Id INTO @InsertedRows VALUES
('A'),
('B'),
('C');
SELECT Id FROM @InsertedRows;
Any help is very much appreciated.
This is not possible. If you want to get three values, you have to execute three INSERT statements.
Given SQLite3 locking:
An EXCLUSIVE lock is needed in order to write to the database file. Only one EXCLUSIVE lock is allowed on the file and no other locks of any kind are allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency, SQLite works to minimize the amount of time that EXCLUSIVE locks are held.
And how Last Insert Rowid works:
...returns the rowid of the most recent successful INSERT into a rowid table or virtual table on database connection D.
It should be safe to assume that while a writer executes its batch INSERT into a ROWID table, no other writer can make the generated primary keys non-consecutive. Thus the inserted primary keys are [lastrowid - rowcount + 1, lastrowid]. Or in the Python SQLite3 API:
cursor.execute(...) # multi-VALUE INSERT
assert cursor.rowcount == len(values)
lastrowids = range(cursor.lastrowid - cursor.rowcount + 1, cursor.lastrowid + 1)
This holds in normal circumstances, when you don't mix provided and expected-to-be-generated keys, or as the AUTOINCREMENT-mode documentation states:
The normal ROWID selection algorithm described above will generate monotonically increasing unique ROWIDs as long as you never use the maximum ROWID value and you never delete the entry in the table with the largest ROWID.
The above should work as expected.
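For reference, here is a self-contained version of the snippet above that can be run against an in-memory database. It assumes a single writer and no explicitly provided ids, as discussed; the `seed` row is only there so that the ids do not start at 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (Id INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)")
conn.execute("INSERT INTO Foo (Name) VALUES ('seed')")  # pre-existing row

names = ["A", "B", "C"]
placeholders = ", ".join("(?)" for _ in names)

# One multi-VALUES INSERT; lastrowid then refers to the last row inserted.
cursor = conn.execute(f"INSERT INTO Foo (Name) VALUES {placeholders}", names)
assert cursor.rowcount == len(names)

inferred_ids = list(
    range(cursor.lastrowid - cursor.rowcount + 1, cursor.lastrowid + 1)
)

# Cross-check the inferred ids against what is actually stored.
actual_ids = [
    row[0]
    for row in conn.execute(
        "SELECT Id FROM Foo WHERE Name IN ('A', 'B', 'C') ORDER BY Id"
    )
]
print(inferred_ids, actual_ids)
```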
This Python script can be used to test correctness of the above for multi-threaded and multi-process setup.
Other databases
For instance, MySQL InnoDB (at least in the default innodb_autoinc_lock_mode = 1 "consecutive" lock mode) works in a similar way (though obviously under much more concurrent conditions) and guarantees that inserted PKs can be inferred from lastrowid:
"Simple inserts" (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes

How to make values unique in cassandra

I want to enforce a unique constraint in Cassandra: all the values in each column of my column family should be unique.
For example:
name-rahul
phone-123
address-abc
Now I want no row with the values rahul, 123 or abc to be inserted again. Searching the DataStax documentation, I found that I can achieve this for the partition key with IF NOT EXISTS, but I can't find a solution that keeps all 3 values unique.
That means if
name- jacob
phone-123
address-qwe
is inserted, it should also be rejected, because the phone column has the same value as in the row with name-rahul.
The short answer is that constraints of any type are not supported in Cassandra. They are simply too expensive, as they must involve multiple nodes, thus defeating the purpose of having eventual consistency in the first place. If you needed to make a single column unique, then there could be a solution, but not for several unique columns. For the same reason, there is no isolation and no consistency (the C and I from ACID). If you really need to use Cassandra with this type of enforcement, then you will need to create some kind of synchronization application layer which will intercept all requests to the database and make sure that the values are unique and all constraints are enforced. But this won't have anything to do with Cassandra.
I know this is an old question and the existing answer is correct (you can't do constraints in C*), but you can solve the problem using batched creates. Create one or more additional tables, each with the constrained column as the primary key and then batch the creates, which is an atomic operation. If any of those column values already exist the entire batch will fail. For example if the table is named Foo, also create Foo_by_Name (primary key Name), Foo_by_Phone (primary key Phone), and Foo_by_Address (primary key Address) tables. Then when you want to add a row, create a batch with all 4 tables. You can either duplicate all of the columns in each table (handy if you want to fetch by Name, Phone, or Address), or you can have a single column of just the Name, Phone, or Address.

Explanation on index on a datetime field and included columns

I have a sqlserver table with the usual
intID(primary key),field1,field2,manyotherfields..., datetime TimeOperation
99% of my different kind of queries start with a TimeOperation BETWEEN startTime AND endTime, and then select * (or count(*)) where fieldA=xxx, and join with other smaller tables.
select * because more or less I need all the fields.
I obviously created an index on TimeOperation ... but performance is not good enough, so I want to add some index key columns or index included columns, but I'm a little bit confused.
I get the difference between the two, but I don't get how much adding a column in each case impacts on speed and on size.
I guess that the biggest improvement would be to create an index including ALL the columns, is it right? (but I can't afford it in terms of space)
And if I often use field1=xxx for example, adding field1 to the index key columns (after TimeOperation) would give better performance right?
Also... just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, SQL Server seeks my TimeOperation index for the rows I'm interested in, and that is faster than scanning the whole table because in the index the TimeOperation values are in ascending order, is that right? But then I need the rest of the data fields of those rows... how does SQL Server retrieve the data? I guess it has a sort of bookmark to those rows in the index, right? But then it has to hit the table multiple times... so including all the columns in the index would save the time needed to hit the table, is that correct?
Thanks!
Mattia
We will need more information on your table and examples of your queries to address this fully, but:
DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not add all columns blindly to an index, not even as included columns: this will make the index page density worse and be counterproductive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation in the index if it's not in the WHERE clause of your query.
Yes, you are right: when SQL Server locates a record in a nonclustered index, it needs to do a key (or RID) lookup back into the clustered index (or heap) to retrieve the rest of the columns. If your nonclustered index includes the other columns in your select statement, the lookup can be avoided. But since you are using SELECT *, covering indexes are unlikely to help.
Edit
Explanation - Selectivity and density are explained in detail here. Only if your queries against TimeOperation return a small number of rows (a rule of thumb is < 5%, but this doesn't always hold) will the index be used, i.e. your query must be selective enough for SQL Server to choose the index on TimeOperation.
The basic starting point would be:
CREATE TABLE [MyTable]
(
intID INT IDENTITY(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,
CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);
And the basic indexes will be
CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);
Clustering Consideration / Option
If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).
However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.
Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.
Covering Indexes
As others have mentioned, your use of SELECT * is bad practice and, inter alia, means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT *, but instead e.g.
SELECT TimeOperation, field1
FROM [MyTable]
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.
Then altering your index on TimeOperation to include field1
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);
OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
Either will avoid the RID / key lookup. The second option (both columns in the index key) will address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.
Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?
e.g.
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
will not be useful for the query
SELECT ... FROM MyTable WHERE Field1 = 'xyz';
The index will only be useful for the queries which have TimeOperation
SELECT ... FROM MyTable WHERE TimeOperation between x and y;
OR
SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';
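The leading-column rule can be observed with a small experiment. The sketch below deliberately uses SQLite (via Python's sqlite3 module) rather than SQL Server, so it only illustrates the general behaviour of composite B-tree indexes; SQL Server's optimizer and plan output differ. The `Other` column and the `plan` helper are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (
    intID         INTEGER PRIMARY KEY,
    Field1        TEXT,
    Other         TEXT,   -- extra column so the index does not cover SELECT *
    TimeOperation TEXT
);
CREATE INDEX IX_MyTable ON MyTable (TimeOperation, Field1);
""")

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Filter on the second index column only: the index cannot be seeked.
field1_only = plan("SELECT * FROM MyTable WHERE Field1 = 'xyz'")

# Filter on the leading index column: the index can be seeked.
time_range = plan(
    "SELECT * FROM MyTable "
    "WHERE TimeOperation BETWEEN '2020-01-01' AND '2020-02-01'"
)
print(field1_only)
print(time_range)
```

In SQLite's plan text a SCAN step reads every row, while a SEARCH step uses an index to jump to the matching range.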
Hope this helps?
An index, at its most basic, creates a tree structure (a B-tree) behind the scenes, which allows the SQL engine to more easily find rows with particular values for indexed columns. Each index creates a different way to "drill down" into the table's data using a tree search (log N performance). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be put in and then the indexes must be updated).
An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.
NEVER simply create an index including all columns of a table, especially a wide one such as this.

Generating Order Numbers - Keep unique across multiple machines - Unique string seed

I'm attempting to create an order number for customers to use. I will have multiple machines that do not have access to the same database (so can't use primary keys and generate a unique ID).
I will have a unique string that I could use for a seed for some algorithm that will generate a unique looking alphanumeric ID # for the order number. I do not want to use this unique string as the order # because its contents would not be appropriate in appearance for a customer to use for order #.
Would it be possible to combine the use of a GUID & my unique string with some algorithm to create a unique order #?
Open to any suggestions.
If you have a relatively small number of machines and each one can have its own configuration file or setting, you can assign a letter to each machine (A,B,C...) and then append the letter onto the order number, which could just be an auto-incrementing integer in each DB.
i.e.
Starting each database ID at 1000:
1001A // First order on database A
1001B // First order on database B
1001C // First order on database C
1002A // Second order on database A
1003A // Third order on database A
1004A // etc...
1002B
1002C
Your order table in each database would have an ID column (integer) and "machine" identifier (character A,B,C...) so in case you ever needed to combine DBs into one, each order would still be unique.
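A minimal sketch of this scheme in Python. The function name `order_number_generator` is made up, and the in-memory counter is only a stand-in for the auto-incrementing ID column each database would provide.

```python
from itertools import count

def order_number_generator(machine_letter, start=1001):
    """Yield order numbers such as 1001A, 1002A, ... for one machine.

    The counter stands in for the per-database auto-increment column.
    """
    for n in count(start):
        yield f"{n}{machine_letter}"

gen_a = order_number_generator("A")
gen_b = order_number_generator("B")

first_a = next(gen_a)   # first order on machine A
second_a = next(gen_a)  # second order on machine A
first_b = next(gen_b)   # first order on machine B
print(first_a, second_a, first_b)
```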
Just use a straight up GUID/UUID. Version 1 UUIDs take into account the MAC address of the network interface to make them unique to that machine.
http://en.wikipedia.org/wiki/Uuid
You can use integer IDs as a primary key if you generate the ID from a stored procedure (or perhaps, in Oracle, using a sequence).
What you have to do is make each machine generate IDs in a different range, e.g. machine A from 1 to 1000000, machine B from 1000001 to 2000000, etc.
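The range arithmetic can be sketched as follows. `BLOCK_SIZE` and `id_range_for_machine` are hypothetical names, and machines are numbered from 0 here (machine A = 0, machine B = 1).

```python
BLOCK_SIZE = 1_000_000  # assumed number of ids reserved per machine

def id_range_for_machine(machine_index):
    """Return the block of ids reserved for a machine (index 0 = machine A)."""
    start = machine_index * BLOCK_SIZE + 1
    return range(start, start + BLOCK_SIZE)

range_a = id_range_for_machine(0)
range_b = id_range_for_machine(1)
print(range_a[0], range_a[-1])
print(range_b[0], range_b[-1])
```

Each machine then has to refuse to issue ids once its block is exhausted, otherwise it would run into its neighbour's range.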
You say you have a unique string that would not be 'appropriate' to show to customers.
If it's only inappropriate in appearance and not security/privacy related, you could just transform it somehow. A simple example would be ROT13.
But generally I too would suggest using UUID (but version 4) for random numbers. The probability for generating duplicates is extremely low and there are libraries for many programming languages available.
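In Python, for instance, a version 4 UUID is available from the standard library. This sketch uses the full UUID string as the order number; `new_order_number` is a made-up helper name, and shortening the value to something more customer-friendly would raise the collision probability.

```python
import uuid

def new_order_number():
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())

order_a = new_order_number()
order_b = new_order_number()
print(order_a)
print(order_b)
```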
