I'm starting a new project and one of the requirements is to use Teradata. I'm proficient in many different database systems but Teradata is fairly new to me.
On the client end they have removed all foreign keys from their database under the recommendations of "a consultant".
Every part of me cringes.
I'm using a new database instance so I'm not constrained by what they've already done on other databases. I haven't been explicitly told not to use foreign keys and my relation with the customer is such that they will at the very least hear me out. However, my decision and case should be well-informed.
Is there any intrinsic, technological reason that I should not use FKs in Teradata to maintain referential integrity based upon Teradata's design, performance, side-effects, etc...
Of note, I'm accessing Teradata using the .Net Data Provider v16 which only supports up to EF5.
Assuming that the new project is implementing a Data Warehouse there's a simple reason (and this is true for any DWH, not only Teradata): a DWH is not the same as an OLTP system.
Of course you still got Primary & Foreign Keys in the Logical data model, but maybe not implemented in the Physical model (although they are supported by Teradata). There are several reasons:
Data is usually loaded in batches into a DWH and both PK & FKs must be validated by the loading process before Insert/Update/Delete. Otherwise you load 1,000,000 rows and there's a single row failing the constraints. Now you got a Rollback and an error message and try to find the bad data, good luck. But when all the validation is already done during load there's no reason to do the same checks a 2nd time within the database.
Some tables in the DWH will be Slowly Changing Dimensions and there's no way to define a PK/FK on that usibg Standard SQL syntay, you need something like TableA.column references TableB.column and TableA.Timestamp between TableB.ValidFrom and TableB.ValidTo (it is possible when you create Temporal Table)
Sometimes a table is recreated or reloaded from scratch, hard to do if there's a FK referencing it.
Some PKs are never used for any access/join, so why implementing them physically, it's just a huge overhead in CPU/IO/storage.
Knowledge about PK/FK is important for the optimizer, so there's a so-called Soft Foreign Key (REFERENCES WITH NO CHECK OPTION), which is a kind of dummy: applied during optimization, but never actually checked by the DBMS (it's like telling the optimizer trust me, it's correct).
Related
There are some groups of queries that include creating tables, fields, etc. How to implement a mechanism by which a group of requests should be executed all, or if somewhere the error was canceled all? That is, the principle of transactionality with ALTER TABLE queries for example (which are committed implicitly)
You are talking about Commit and Rollback.
However if you are mixing normal SQL with DML (CREATE/ALTER/DROP etc) the DML commands have an implied COMMIT as part of there execution, you cannot avoid it. So this will likely cause you problems if mixed with your normal insert/update/delete type queries
Transactional DML is not possible until MySQL 8.0. In that version, there is a "Data Dictionary" that is stored in InnoDB tables. (I don't want to think about the bootstrap required!) The DD allows, at least in theory, the ROLLBACK of DDL actions such as ALTER and DROP TABLE. Study the 8.0 docs to see if it does enough for your needs.
In the past (pre-8.0), Many DML operations were at least reasonable "crash safe". For example, an ALTER used to copy the table over, then quickly swap in the new table. This provided a reasonably good way to recover from, a crash during an ALTER.
There have been a lot of major improvements to ALTER since 5.6. Before then, the only really "instant" alter was adding an option to an ENUM, and that had caveats. There are still things that mandate a complete, time-consuming, rebuild of the table -- such as any change to the PRIMARY KEY.
DML operations should be minimized; they are not designed for frequent use.
Good day.
I have a basic question on SQL and table structure.
What we have now: 17 tables. These tables include 1 admin table. The other 13 tables are all branched off 3 "main" tables: customers, CareWorkers, Staff.
If I'm wanting to adhere to ACID ideology, I want to then create tables that each houses unique information.
My question is, and what I'm trying to wrap my head around, when I create each of these "nested-deeper" (not sure what to call it) tables, I simply do an inner join statement to grab the foreign key on my ASP.NET app correct?
First, inner join is how you get your tables "back together", and #SpectralGhost's example is how you do it. But you might want to consider doing it in the database rather than in your ASP code. The way you do that is with views. If you create a view (the syntax is CREATE VIEW and there are plenty of examples out there) then you can make the database schema as complex as you need to without making it hard to use in your ASP application. You can even make views updatable (you define an "INSTEAD OF" trigger, again, many examples if you search).
But you probably don't want to update a view, or a table, directly from your ASP code. You probably want to define STORED PROCEDUREs that update your data, and call those from your ASP code. This allows you to restrict access to your tables and views to read only and force any writes to come through a stored procedure you can control better. This prevents SQL INJECTION, making your ASP application much more secure. If the service account the application pool you ASP page runs under can pass raw queries to the database then any compromise can do tremendous damage to your database. If all it can do is execute a stored procedure where the parameters can be changed but not the functionality, they can only put some junk values in, or maybe not even that if you range check well.
The last bit of advice is that you are not preserving "ACID", you are preserving "NORMALIZED". It's definitely a tough concept to wrap your head around, here's a resource that helped me out a great deal when I was starting out. http://www.marcrettig.com/data-normalization-poster/ I still have a copy on my wall. You shouldn't obsess over normalization, but you should definitely keep it in mind and stick to it when you reasonably can. Again, there are numerous resources a search will get you, but the basic benefit is a normalized database is much more resistant to consistency problems, and is more storage efficient. And since disk IO is slow, storage efficient is usually query efficient too.
They are related tables. You should have at least one table with a primary key and often several that related back to that table from that table's foreign key.
TableOne
TableOneID
TableTwo
TableTwoID
TableOneID
TableTwo relates to TableOne via TableOneID. An inner join would give you where there are records in both tables based on your join. Example:
SELECT *
FROM TableOne t1
INNER JOIN TableTwo t2 ON t1.TableOneID=t2.TableOneID
Specifically how to do this in your application depends on your design. If you are using an ORM, then actual SQL is not terribly important. If you are using stored procedures, then it is.
I am designing a simple messaging service using ASP.NET MVC / Windows Azure Table Storage. I have two kinds of entities - messages and message threads. Relation between them is simple - each thread can have multiple messages but the message can only be assigned to one thread.
Table storage is not a relational DB, so representing relations is always a bit tricky. I need to decide between 2 approaches:
Having one big table for threads and one for messages. And having threadId as a partition key of message entity so that messages are partitioned by threads.
Dynamically creating a special table for each message thread and having threadId as a name of the table.
I tend to prefer the second because it fits better into architecture of the rest of the service. But there will obviously be large number of tables created in a storage account.
Do you think this may be a problem?
You could also consider having just one table, that stores both Thread and Message entities. This would give you transaction support, and you could use Lucifure's hybrid approach on this table.
Creating a large number of tables may be an issue, depending on how you want to manage them. The underlying REST API for listing tables works like a query for table entities. It only returns the first 1000 tables, after that you have to use a continuation token. All of the storage explorers I've seen don't allow you to query tables based on name, they simply like the first 1000 tables. If you end up with 20000 threads, it could take you a while to get to the table you want.
One way you could mitigate this is to put your message table in its own storage account. This way your storage account with all of your other tables won't get crowded out by all of these dynamic tables that you will be creating and possibly deleting.
Deleting is actually one of the ways in which using a separate table for each thread would be easier. To delete all of the related messages you simply have to delete one table rather than iterating over each message and deleting it.
Everything else however will be more complicated than keeping all of the messages in one table. If this is core functionality to your app and you can dedicate enough time to develop it this way, one table per thread is probably a good idea. Otherwise the easy way to do things is with one big table.
You may consider a hybrid approach to keep the number of tables to a manageable level, depending on your scalability needs.
My experience has been that date based partitioning at the table level is a very effective approach and can be leverage across the board.
For example you could partition tables based on date and with a granularity of day or month. So a table name like “Thread201202” could be used for all threads started in February 2012.
Your thread id would implicitly include the “201202” and be something like “201202-myid01” although you would not need to explicitly store it in the partition key since it would be implied in the table name.
Aged threads could then be easily disposed by deleting tables say more than a year old.
I have a reasonably complex query to extract the Id field of the results I am interested in based on parameters entered by the user.
After extracting the relevant Ids I am using the resulting set of Ids several times, in separate queries, to extract the actual output record sets I want (by joining to other tables, using aggregate functions, etc).
I would like to avoid running the initial query separately for every set of results I want to return. I imagine my situation is a common pattern so I am interested in what the best approach is.
The database is in MS SQL Server and I am using .NET 3.5.
It would definitely help if the question contained some measurements of the unoptimized solution (data sizes, timings). There is a variety of techniques that could be considered here, some listed in the other answers. I will assume that the reason why you do not want to run the same query repeatedly is performance.
If all the uses of the set of cached IDs consist of joins of the whole set to additional tables, the solution should definitely not involve caching the set of IDs outside of the database. Data should not travel there and back again if you can avoid it.
In some cases (when cursors or extremely complex SQL are not involved) it may be best (even if counterintuitive) to perform no caching and simply join the repetitive SQL to all desired queries. After all, each query needs to be traversed based on one of the joined tables and then the performance depends to a large degree on availability of indexes necessary to join and evaluate all the remaining information quickly.
The most intuitive approach to "caching" the set of IDs within the database is a temporary table (if named #something, it is private to the connection and therefore usable by parallel independent clients; or it can be named ##something and be global). If the table is going to have many records, indexes are necessary. For optimum performance, the index should be a clustered index (only one per table allowed), or be only created after constructing that set, where index creation is slightly faster.
Indexed views are cleary preferable to temporary tables except when the underlying data is read only during the whole process or when you can and want to ignore such updates to keep the whole set of reports consistent as far as the set goes. However, the ability of indexed views to always accurately project the underlying data comes at a cost of slowing down those updates.
One other answer to this question mentions stored procedures. This is largely a way of organizing your code. However, it if you go this way, it is preferable to avoid using temporary tables, because such references to a temporary table prevent pre-compilation of the stored procedure; go for views or indexed views if you can.
Regardless of the approach you choose, do not guess at the performance characteristics and query optimizer behavior. Learn to display query execution plans (within SQL Server Management Studio) and make sure that you see index accesses as opposed to nested loops combining multiple large sets of data; only add indexes that demonstrably and drastically change the performance of your queries. A well chosen index can often change the performance of a query by a factor of 1000, so this is somewhat complex to learn but crucial for success.
And last but not least, make sure you use UPDATE STATISTICS when repopulating the database (and nightly in production), or your query optimizer will not be able to put the indexes you have created to their best uses.
If you are planning to cache the result set in your application code, then ASP.NET has cache, Your Winform will have the object holding the data with it with which you can reuse the data.
If planning to do the same in SQL Server, you might consider using indexed views to find out the Id's. The view will be materialized and hence you can get the results faster. You might even consider using a staging table to hold the id's temporarily.
With SQL Server 2008 you can pass table variables as params to SQL. Just cache the IDs and then pass them as a table variable to the queries that fetch the data. The only caveat of this approach is that you have to predefine the table type as UDT.
http://msdn.microsoft.com/en-us/library/bb510489.aspx
For SQL Server, Microsoft generally recommends using stored procedures whenever practical.
Here are a few of the advantages:
http://blog.sqlauthority.com/2007/04/13/sql-server-stored-procedures-advantages-and-best-advantage/
* Execution plan retention and reuse
* Query auto-parameterization
* Encapsulation of business rules and policies
* Application modularization
* Sharing of application logic between applications
* Access to database objects that is both secure and uniform
* Consistent, safe data modification
* Network bandwidth conservation
* Support for automatic execution at system start-up
* Enhanced hardware and software capabilities
* Improved security
* Reduced development cost and increased reliability
* Centralized security, administration, and maintenance for common routines
It's also worth noting that, unlike other RDBMS vendors (like Oracle, for example), MSSQL automatically caches all execution plans:
http://msdn.microsoft.com/en-us/library/ms973918.aspx
However, for the last couple of versions of SQL Server, execution
plans are cached for all T-SQL batches, regardless of whether or not
they are in a stored procedure
The best approach depends on how often the Id changes, or how often you want to look it up again.
One technique is to simply store the result in the ASP.NET object cache, using the Cache object (also accessible from HttpRuntime.Cache). For example (from a page):
this.Cache["key"] = "value";
There are many possible variations on this theme.
You can use Memcached to cache values in the memory.
As I see there are some .net ports.
How frequently does the data change that you'll be querying? To me, this sounds like a perfect scenario for data warehousing, where you flatting the data for quicker data retrieval and create the tables exactly as your 'DTO' wants to see the data. This method is different than an indexed view in that it's simply a table which will have quick seek operations, and could especially be improved if you setup the indexes properly on the columns that you plan to query
You can create Global temporary Table. Create the table on the fly. Now insert the records as per your request. Access this table in your next request in your joins... for reusability
Instead create only stored procedures and call them from from the code?
There is a place for dynamic SQL and/or ad hoc SQL, but it needs to be justified based on the particular usage needs.
Stored procedures are by far a best practice for almost all situations and should be strongly considered first.
This issue is a little bigger than just procs or ad hoc, because the database has a wide variety of tools to define its interface, including tables, views, functions and procedures.
People here have mentioned the execution plans and parameterization but, by far, the most important thing in my mind is that any technique which relies on exposed base tables to users means that you lose any ability for the database to change its underlying implementation or control security vertically or horizontally. At the very least, I would expose only views to a typical application/user/role.
In a scenario where the application or user's account only has access to EXEC SPs, then there is no possibility of that account being able to even have a hope of using a SQL injection of the form: "; SELECT name, password from USERS;" or "; DELETE FROM USERS;" or "; DROP TABLE USERS;" because the account doesn't have anything but EXEC (and certainly no DDL). You can control column visibility at the SP level and not have to deny select on an employee salary column, for example.
In other words, unless you are comfortable granting db_datareader to public (because that's effectively what you are doing when you LINQ-to-tables), then you need some sort of realistic security in your application, and SPs are the only way to go, with LINQ-to-views possibly being acceptable.
Depends entirely on what you're doing.
As a general rule a stored proc will have it's query plan cached better than a dynamically generated SQL statement. It will also be slightly easier to maintain indexes for.
However, dynamically generated SQL statements can have their query plans cached, so the difference is marginal.
Dynamically generated SQL statement can also introduce security risk - always parameterise them.
That said sprocs are a pain to maintain and update, they separate DB-logic and .Net code in a way that makes it harder for developers to piece together what a data access method is doing.
Also, to fix or update a SQL string you just change code. To fix or update a sproc you have to change the database - often a much messier option.
So I wouldn't recommend that as a 'one size fits all' best practice.
There is no right or wrong answer here. There are benefits to both which can be easily obtained through a google search. Different projects with different requirements may lead you to different solutions. It's not as black or white as you might want it to be. You might as well throw ORMs into the mix. If you prefer sql queries in your data layer as opposed to stored procs, make sure you use parametrized queries.
sql in sp- easy to maintain, sql in app- pain in the butt ot maintain.
it's so much faster and easier to hop onto a sql instance, modify an sp, test it, then deploy the sp, instead of having to modify the code in the app, test it, then deploy the app.
It depends on the data distribution in your table. Prepared query plans and stored procedures get cached, and the plan itself depends on the table statistics.
Suppose you've building a blog and that your posts table has a user_id. And that you're frequently doing stuff like:
select posts.* from posts where user_id = ? order by published desc limit 20;
Suppose indexes on posts (user_id) and posts (published desc).
Additionally suppose that you've two authors, author1 which wrote 3 posts a long time ago, and author2 who has written 10k posts since.
In this case, the query plan of the ad hoc query will be very different depending on whether you're fetching the author1 posts or the author2 posts:
For author1, the database will decide to use the index on user_id and sort the results.
For author2, the database will read the first 20 rows using the index on published.
If you prepare the statement, the planner will pick either of the two. Suppose the second (which I think is likely): applied to author1, this means going through the whole table by way of the index -- which is much slower than the optimal plan.
If simplicity is your goal, then an ORM would be a good practice for your simple database operations
ORMs like Entity Framework, nHibernate, LINQ to SQL, etc. will manage the code creation of the data access and repository layers and provide you with strongly typed objects representing your tables. This can lead to a cleaner, more maintainable architecture.
Save the stored procedures for your more complex queries. This is where you can take advantage of advanced SQL and cached query plans.
Dynamic SQL - Bad
Stored Procedures - Better
Linq-To-SQL or Linq-to-EF (or ORM tools) - Best
You do not want dynamic SQL inside your application since you do not have compile-time checking. Stored procedures will at least be checked, but it is still not part of a cohesive usnit and removes business logic to the database layer. Linq-To-EF will allow business logic to stay inside your application and allow you to have compile-time checking of syntax.