SQL Database Design - Cache Tables? - sqlite

What's a common/best practice for database design when it comes to improving performance on count(1) queries? (I'm currently using SQLite)
I've normalized my data, it exists on multiple tables, and for simple things I want to do on a single table with a good index -- queries are acceptably quick for my purposes.
eg:
SELECT count(1) from actions where type='3' and area='5' and employee='2533';
But when I start getting into multiple table queries, things get too slow (> 1 second).
SELECT count(1)
from
(SELECT SID from actions
where type='3' and employee='2533'
INTERSECT
SELECT SID from transactions where currency='USD') x;
How should I cache my results? What is a good design?
My natural reaction is to add a table solely for storing rows of cached results per employee?

Edit
Design patterns like Command Query Responsibility Segregation (CQRS) specifically aim to improve the read side performance of data access, often in distributed systems and at enterprise scale.
Commands are issued to indicate 'transactions' or 'change / updates' to data
When a system processes these commands (e.g. by updating database tables), the new state of the affected objects is 'broadcast'
Systems which are interested (such as a user interface or a queryable REST API) will then subscribe to these data changes, and then 'shape' the updated data to their specific needs
This updated data is then cached (often called a 'Read Store')
Another pattern commonly associated with CQRS is "Event Sourcing", which stores, and then allows 'replay' of Commands, for various use cases.
The above may be overkill for your scenario, but a very simple implementation of caching at an internal app level could be via a SQLite trigger.
Assuming that there are many more 'reads' than writes to your actions or transactions tables,
You could create a cache table specifically for "SID for actions by type by employee" and one for "SID for transactions by currency", or even combine the two (depending on what other querying scenarios you have)
You would then need to update these cache table(s) every time the underlying action or transactions tables update. One cheap (and nasty) way would be to provide an INSERT, UPDATE and DELETE trigger on the action and transactions table, which would then update the appropriate cache table(s).
Your 'query' interface would now primarily interact with the cache tables, using the 'derived' data (such as the counts).
You may still however need to handle cache miss scenarios, such as the initial 'seed' of these cache tables, or if the cache tables need to be regenerated.
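As a very rough illustration of the trigger idea (all names here are invented, and this variant caches per-employee/per-type counts rather than SIDs, which is closer to the cached-results table you mention in the question), in SQLite it could look like this:
-- Hypothetical cache table: one row per (employee, type).
CREATE TABLE IF NOT EXISTS action_counts (
    employee TEXT NOT NULL,
    type     TEXT NOT NULL,
    cnt      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (employee, type)
);

-- One-off seed from the existing data (also usable to regenerate the cache).
INSERT INTO action_counts (employee, type, cnt)
SELECT employee, type, count(1) FROM actions GROUP BY employee, type;

-- Keep the cache in step with inserts; similar triggers would be needed
-- for UPDATE and DELETE on actions, and for the transactions table.
CREATE TRIGGER IF NOT EXISTS actions_after_insert AFTER INSERT ON actions
BEGIN
    INSERT OR IGNORE INTO action_counts (employee, type, cnt)
    VALUES (NEW.employee, NEW.type, 0);
    UPDATE action_counts
       SET cnt = cnt + 1
     WHERE employee = NEW.employee AND type = NEW.type;
END;
A count for one employee/type then becomes a single-row primary key lookup on action_counts.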
In addition to a local relational database like SQLite, NoSQL databases like MongoDB, Cassandra and Redis are frequently used as read-side caches in read-heavy environments (depending on the type and format of data that you need to cache). You would however need another way to synchronize data from your 'master' (e.g. SQLite) database to these read stores - triggers obviously won't cut it here.
Original Answer
If you are 100% sure that you are always repeating exactly the same query for the same customer, sure, persist the result.
However, in most other instances, RDBMS usually handles caching just fine.
The INTERSECT with the query
SELECT SID from transactions where currency='USD'
could be problematic if there are a large number of transaction records in USD.
Possibly you could replace this with a join?
SELECT count(1) from
(
    SELECT t.SID
    from transactions as t
    inner join
    (
        SELECT SID from actions where type='3' and employee='2533'
    ) as a
    on t.SID = a.SID
    where t.currency = 'USD'
) as x
You might just check your indexes however:
For
SELECT count(1) from actions where type='3' and area='5' and employee='2533';
SELECT SID from actions where type='3' and employee='2533';
An index on Actions(Employee, Type) or Actions(Employee, Type, Area) would make sense (assuming Employee has highest selectivity, and depending on the selectivity of Type and Area).
You can also compare this to an index on Actions(Employee, Type, Area, SID) as a covering index for your second query.
And for the join above, you need an index on Transactions(SID, Currency)
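Expressed as DDL, those suggestions would be along these lines (index names are illustrative):
CREATE INDEX idx_actions_emp_type_area ON actions (employee, type, area);
-- or, as a covering index for the SID lookup:
CREATE INDEX idx_actions_emp_type_area_sid ON actions (employee, type, area, SID);
CREATE INDEX idx_transactions_sid_currency ON transactions (SID, currency);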

Related

DynamoDB table structure

We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only a minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
Most prominent search cases are,
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either / for our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable / widely used setup for such a use case?
If not, kindly suggest alternatives that can be used.
Using a UUID as the partition key will efficiently distribute the data amongst internal partitions, so you will have the ability to utilize all of the provisioned capacity.
Using a sortable (ISO format) timestamp as the range/sort key will store the data in order, so it will be possible to retrieve it in order.
However, for retrieving logs by anything other than the timestamp, you may have to create secondary indexes (GSIs), which are charged separately.
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
In general DynamoDB seems like a bad solution for storing logs:
It is more expensive than CloudWatch
It has poor querying capabilities, unless you start utilising global secondary indexes which will double or triple your expenses
Unless you use a random UUID for the hash key, you risk creating hot partitions/keys in your db (for example, using a component ID as a primary or global secondary key might result in throttling if some component writes much more often than others)
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend:
Use JobId or Component name as hash key (one as primary, one as GSI)
Use timestamp as a sort key
If you need to search by log level often, then you can create another local sort key, or you can combine level and timestamp into single sort key. If you only care about searching for ERROR level logs most of the time, then it might be better to create a sparse GSI for that.
Create a new table each day (let's call it the "hot table"), and only store that day's logs in that table. This table will have high write throughput. Once the day finishes, significantly reduce its write throughput (maybe to 0) and only leave some read capacity. This way you will reduce the risk of running into the 10 GB limit per hash key that DynamoDB has.
This approach also has an advantage in terms of log retention. It is very easy and cheap to remove logs older than X days this way. By keeping the old tables' capacity very low you will also avoid very high costs. For more complicated ad-hoc analysis, use EMR.

How can I improve performance while altering a large mysql table?

I have 600 million records in a table and I am not able to add a column to it, because every time I try, the operation times out.
Suppose in your MySQL database you have a giant table with 600 million rows. Schema operations on it, such as adding a unique key, altering a column, or even adding one more column, are very cumbersome; they can take hours to process and sometimes hit a server timeout. To overcome that, you have to come up with a very good migration plan; one such plan is jotted down below (a consolidated SQL sketch follows the numbered steps).
1) Suppose there is a table Orig_X to which I have to add a new column colNew with a default value of 0.
2) A dummy table Dummy_X is created which is a replica of Orig_X, except with the new column colNew.
3) Data is inserted from Orig_X into Dummy_X with the following settings:
4) Auto commit is set to zero, so that data is not committed after each insert statement, which would hinder performance.
5) Binary logging is set to zero, so that no data will be written to the binary logs.
6) After the data is inserted, both settings are set back to one.
SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
Insert into Dummy_X (col1, col2, col3, colNew)
Select col1, col2, col3, 0 from Orig_X;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;
7) Now the primary key can be created, including the newly added column if it is to be part of the primary key.
8) All the unique keys can now be created.
9) We can check the status of the server by issuing the following command
SHOW MASTER STATUS
10) It’s also helpful to issue FLUSH LOGS so MySQL will clear the old logs.
11) In order to boost performance when running similar types of queries repeatedly, the query cache variable should be on (query_cache_type = 1 in the server configuration). You can check whether the cache is available with:
SHOW VARIABLES LIKE 'have_query_cache';
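Putting steps 1-8 together, a consolidated sketch might look like the following. Column names and the unique key are placeholders, the explicit COMMIT and the final table swap are additions not spelled out above, and SET sql_log_bin requires the SUPER privilege:
-- Steps 1-2: dummy table is a structural copy of the original plus the new column.
-- (CREATE TABLE ... LIKE copies indexes too; drop any secondary indexes on
-- Dummy_X here and recreate them after the load for the faster bulk-load path.)
CREATE TABLE Dummy_X LIKE Orig_X;
ALTER TABLE Dummy_X ADD COLUMN colNew INT NOT NULL DEFAULT 0;

-- Steps 3-6: bulk copy with autocommit and binary logging disabled for this session.
SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
INSERT INTO Dummy_X (col1, col2, col3, colNew)
SELECT col1, col2, col3, 0 FROM Orig_X;
COMMIT;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;

-- Steps 7-8: recreate any keys that were dropped, then swap the tables atomically.
ALTER TABLE Dummy_X ADD UNIQUE KEY uq_col2 (col2);
RENAME TABLE Orig_X TO Orig_X_old, Dummy_X TO Orig_X;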
Those were the steps of the migration strategy for the large table; below I am writing some steps to improve the performance of the database/queries.
1) Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes, as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
2) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them all at once after the data is loaded, then InnoDB is able to apply a pre-sort and bulk-load process which is both faster and typically results in more compact indexes.
3) More memory can actually help in performance optimization. If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly for your server).
4) Normally your database table gets re-indexed after every insert. That's some heavy lifting for your database, but when your queries are wrapped inside a transaction, the table does not get re-indexed until the entire bulk is processed, which saves a lot of work.
5) Most MySQL servers have query caching enabled. It's one of the most effective methods of improving performance that is quietly handled by the database engine. When the same query is executed multiple times, the result is fetched from the cache, which is quite fast.
6) Using the EXPLAIN keyword can give you insight into what MySQL is doing to execute your query. This can help you spot bottlenecks and other problems with your query or table structures. The results of an EXPLAIN query will show you which indexes are being utilized, how the table is being scanned and sorted, and so on (a short example follows this list).
7) If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
8) In every table have an id column that is the PRIMARY KEY, AUTO_INCREMENT and one of the flavors of INT. Also preferably UNSIGNED, since the value cannot be negative.
9) Even if you have a user’s table that has a unique username field, do not make that your primary key. VARCHAR fields as primary keys are slower. And you will have a better structure in your code by referring to all users with their id's internally.
10) Normally when you perform a query from a script, it will wait for the execution of that query to finish before it can continue. You can change that by using unbuffered queries. This saves a considerable amount of memory with SQL queries that produce large result sets, and you can start working on the result set immediately after the first row has been retrieved as you don't have to wait until the complete SQL query has been performed.
11) With database engines, disk is perhaps the most significant bottleneck. Keeping things smaller and more compact is usually helpful in terms of performance, to reduce the amount of disk transfer.
12) The two main storage engines in MySQL are MyISAM and InnoDB. Each has its own pros and cons. MyISAM is good for read-heavy applications, but it doesn't scale very well when there are a lot of writes. Even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is, however, very fast at calculating
SELECT COUNT(*)
types of queries. InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications. But it supports row-level locking, which scales better. It also supports some more advanced features such as transactions.
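As a quick illustration of points 6 and 7, against hypothetical customers/orders tables:
-- Make sure the join column is indexed on both sides.
CREATE INDEX idx_orders_customer ON orders (customer_id);

EXPLAIN
SELECT c.name, COUNT(*) AS order_count
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name;

-- In the EXPLAIN output, the "key" column for orders should show
-- idx_orders_customer rather than NULL, and "type" should not be ALL
-- (which would indicate a full table scan).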

Is there a way to find the SQL that updated a particular field at a particular time?

Let's assume that I know when a particular database record was updated. I know that somewhere exists a history of all SQL that's executed, perhaps only accessible by a DBA. If I could access this history, I could SELECT from it where the query text is LIKE '%fieldname%'. While this would pretty much pull up any transactional query containing the field name I am looking for, it's a great start, especially if I can filter the recordset down to a particular date/time range.
I've discovered the dbc.DBQLogTbl view, but it doesn't appear to work as I expect. Is there another view that contains the information I am looking for?
It depends on the level of database query logging (DBQL) that has been enabled by the DBA.
Some DBAs may elect not to log detailed information for tactical queries, so it is best to consult with your DBA team to understand what is being captured. You can also query DBC.DBQLRules to determine what level of logging has been enabled.
The following data dictionary objects will be of particular interest to your question:
DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement, CPU, IO, and other fields associated with a particular query.
DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length it is split across multiple rows, which is denoted by a column in this table. If you join this to the main Query Log table, care must be taken if you are aggregating any metrics from the Query Log table, although more often than not, if you are joining the Query Log table to the SQL table, you are not doing any aggregation.
DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables, columns, and indexes referenced by a particular query.
These tables can be joined together in DBC via QueryID and ProcID. There are a few other tables that capture information about the queries but are beyond the scope of this particular question. You can find out about those in the Teradata Manuals.
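As a rough starting point (assuming SQL-text logging is enabled, using an illustrative date range, and remembering that long statements are split across multiple rows of DBC.QryLogSQL):
SELECT q.ProcID,
       q.QueryID,
       q.UserName,
       q.StartTime,
       s.SqlTextInfo
FROM   DBC.QryLog q
JOIN   DBC.QryLogSQL s
  ON   s.ProcID = q.ProcID
 AND   s.QueryID = q.QueryID
WHERE  q.StartTime BETWEEN TIMESTAMP '2016-01-15 00:00:00'
                       AND TIMESTAMP '2016-01-15 23:59:59'
  AND  s.SqlTextInfo LIKE '%fieldname%';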
Check with your DBA team to determine the level of logging being done and where the historical DBQL data is retained. DBQL data is often moved nightly to a historical database, and there is often a ten-minute delay before data is flushed from cache to the DBC tables. Your DBA team can tell you where to find historical DBQL data.

Caching result of SELECT statement for reuse in multiple queries

I have a reasonably complex query to extract the Id field of the results I am interested in based on parameters entered by the user.
After extracting the relevant Ids I am using the resulting set of Ids several times, in separate queries, to extract the actual output record sets I want (by joining to other tables, using aggregate functions, etc).
I would like to avoid running the initial query separately for every set of results I want to return. I imagine my situation is a common pattern so I am interested in what the best approach is.
The database is in MS SQL Server and I am using .NET 3.5.
It would definitely help if the question contained some measurements of the unoptimized solution (data sizes, timings). There is a variety of techniques that could be considered here, some listed in the other answers. I will assume that the reason why you do not want to run the same query repeatedly is performance.
If all the uses of the set of cached IDs consist of joins of the whole set to additional tables, the solution should definitely not involve caching the set of IDs outside of the database. Data should not travel there and back again if you can avoid it.
In some cases (when cursors or extremely complex SQL are not involved) it may be best (even if counterintuitive) to perform no caching and simply join the repetitive SQL to all desired queries. After all, each query needs to be traversed based on one of the joined tables and then the performance depends to a large degree on availability of indexes necessary to join and evaluate all the remaining information quickly.
The most intuitive approach to "caching" the set of IDs within the database is a temporary table (if named #something, it is private to the connection and therefore usable by parallel independent clients; or it can be named ##something and be global). If the table is going to have many records, indexes are necessary. For optimum performance, the index should be a clustered index (only one per table allowed), or be only created after constructing that set, where index creation is slightly faster.
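A minimal sketch of the temporary-table approach (all object names here are invented):
-- Cache the expensive set of IDs once per connection.
SELECT Id
INTO   #MatchingIds
FROM   dbo.SomeComplexSource          -- stands in for the real parameterized query
WHERE  /* user-supplied filters */ 1 = 1;

-- Index it if the set is large; a clustered index works well here.
CREATE CLUSTERED INDEX IX_MatchingIds ON #MatchingIds (Id);

-- Reuse the cached set in each of the output queries.
SELECT o.CustomerId, SUM(o.Amount) AS Total
FROM   dbo.Orders AS o
JOIN   #MatchingIds AS m ON m.Id = o.Id
GROUP  BY o.CustomerId;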
Indexed views are clearly preferable to temporary tables except when the underlying data is read only during the whole process or when you can and want to ignore such updates to keep the whole set of reports consistent as far as the set goes. However, the ability of indexed views to always accurately project the underlying data comes at a cost of slowing down those updates.
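For completeness, an indexed view is sketched below; the underlying table and columns are placeholders, and indexed views come with restrictions (schema binding, two-part names, COUNT_BIG for grouped views, and so on):
CREATE VIEW dbo.vMatchingIds
WITH SCHEMABINDING
AS
SELECT Id, Category, COUNT_BIG(*) AS RowCnt
FROM   dbo.SourceTable
GROUP  BY Id, Category;
GO
-- Materializing the view lets later queries seek into it directly.
CREATE UNIQUE CLUSTERED INDEX IX_vMatchingIds ON dbo.vMatchingIds (Id, Category);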
One other answer to this question mentions stored procedures. This is largely a way of organizing your code. However, if you go this way, it is preferable to avoid using temporary tables, because such references to a temporary table prevent pre-compilation of the stored procedure; go for views or indexed views if you can.
Regardless of the approach you choose, do not guess at the performance characteristics and query optimizer behavior. Learn to display query execution plans (within SQL Server Management Studio) and make sure that you see index accesses as opposed to nested loops combining multiple large sets of data; only add indexes that demonstrably and drastically change the performance of your queries. A well chosen index can often change the performance of a query by a factor of 1000, so this is somewhat complex to learn but crucial for success.
And last but not least, make sure you use UPDATE STATISTICS when repopulating the database (and nightly in production), or your query optimizer will not be able to put the indexes you have created to their best uses.
If you are planning to cache the result set in your application code, ASP.NET has its Cache object, and a WinForms app can simply keep the data in an object and reuse it.
If you are planning to do the same in SQL Server, you might consider using indexed views to find the Ids. The view will be materialized and hence you can get the results faster. You might even consider using a staging table to hold the Ids temporarily.
With SQL Server 2008 you can pass table variables as parameters to SQL. Just cache the IDs and then pass them as a table-valued parameter to the queries that fetch the data. The only caveat of this approach is that you have to predefine the table type as a UDT (a rough sketch follows below).
http://msdn.microsoft.com/en-us/library/bb510489.aspx
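A rough sketch of that approach (the type and procedure names are made up):
-- One-time setup: define the table type.
CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY);
GO
-- A procedure that produces one of the output sets for a given set of IDs.
CREATE PROCEDURE dbo.GetOrderTotals
    @Ids dbo.IdList READONLY
AS
BEGIN
    SELECT o.CustomerId, SUM(o.Amount) AS Total
    FROM   dbo.Orders AS o
    JOIN   @Ids AS i ON i.Id = o.Id
    GROUP  BY o.CustomerId;
END;
GO
-- From .NET, fill a DataTable with the cached IDs once and pass it to each
-- such procedure as a parameter of SqlDbType.Structured.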
For SQL Server, Microsoft generally recommends using stored procedures whenever practical.
Here are a few of the advantages:
http://blog.sqlauthority.com/2007/04/13/sql-server-stored-procedures-advantages-and-best-advantage/
* Execution plan retention and reuse
* Query auto-parameterization
* Encapsulation of business rules and policies
* Application modularization
* Sharing of application logic between applications
* Access to database objects that is both secure and uniform
* Consistent, safe data modification
* Network bandwidth conservation
* Support for automatic execution at system start-up
* Enhanced hardware and software capabilities
* Improved security
* Reduced development cost and increased reliability
* Centralized security, administration, and maintenance for common routines
It's also worth noting that, unlike other RDBMS vendors (like Oracle, for example), MSSQL automatically caches all execution plans:
http://msdn.microsoft.com/en-us/library/ms973918.aspx
However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure.
The best approach depends on how often the Id changes, or how often you want to look it up again.
One technique is to simply store the result in the ASP.NET object cache, using the Cache object (also accessible from HttpRuntime.Cache). For example (from a page):
this.Cache["key"] = "value";
There are many possible variations on this theme.
You can use Memcached to cache values in memory. As far as I can see, there are some .NET ports.
How frequently does the data change that you'll be querying? To me, this sounds like a perfect scenario for data warehousing, where you flatten the data for quicker retrieval and create the tables exactly as your 'DTO' wants to see the data. This method is different from an indexed view in that it's simply a table which will have quick seek operations, and it can especially be improved if you set up the indexes properly on the columns that you plan to query.
You can also create a global temporary table on the fly, insert the records for your request, and then access this table in your subsequent requests via joins, for reusability.

sql server database design

I am planning to create a website using ASP.NET and SQL Server. However, my plan for the database design leaves me wondering if there is a better way.
The website will serve as a repository of information for various users. I figure I would have two databases, a Membership and Profile database.
The profile database would contain user data for all users, where each user may have ~20 tables. I would create the tables when the user account is created and generate a key used to name the tables. The tables are not directly related.
For Example a set of tables for two different users could look like:
User1 Tables - TransactionTable_Key1, AssetTable_Key1, ResearchTable_Key1 ....;
User2 Tables - TransactionTable_Key2, AssetTable_Key2, ResearchTable_Key2 ....;
The Key1, Key2 etc.. values would be retrieved based on the MembershipID data when the account was created. This could result in a very large number of tables over time. I'm not sure if this will limit scalability by setting up the database in this way. Any recommendations?
Edit: I should mention that some of these tables would contain 20k+ rows.
Realistically it sounds like you only really need one database for this.
From the way you worded your question, it sounds like you're trying to dynamically create tables for users as they create accounts. I wouldn't recommend this method.
What you want to do is create a master table that contains a primary key for each individual user. I'm assuming this is the Membership table. Then create the ~20 tables that you need for the profiles of these members. Every record, no matter the number of users that you have, will go into these tables. These 20 tables would need to have a foreign key pointing to the unique identifier of the Membership table.
When you want to query a Member for their user information, just select from the tables where the membership table's primary Id matches the foreign key in the profile tables.
This would result in only a few tables in the end and is easily maintainable and follows better database design.
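For example (table and column names are purely illustrative), the shared-table layout would look something like this:
CREATE TABLE dbo.Membership (
    MemberId INT IDENTITY(1,1) PRIMARY KEY,
    UserName NVARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE dbo.Transactions (
    TransactionId INT IDENTITY(1,1) PRIMARY KEY,
    MemberId      INT NOT NULL REFERENCES dbo.Membership (MemberId),
    Amount        DECIMAL(18, 2) NOT NULL,
    CreatedOn     DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
-- ...and similarly for Assets, Research and the rest of the ~20 tables.

-- All rows for one user are fetched by filtering on the foreign key.
SELECT t.*
FROM   dbo.Transactions AS t
WHERE  t.MemberId = @MemberId;   -- @MemberId supplied by the application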
Your ORM layer (EF, LINQ, DAL code) will hate having to deal with one set of tables per tenant. It is much better to have either one set of tables for all tenants in a single database, or a separate database per tenant. The latter is only better if schema upgrades have to be vetted per tenant (as Salesforce.com does). If you can afford to upgrade all tenants to a new schema at once, then there is no reason for a database per tenant.
When you design a schema that holds multiple tenants, the important things to remember are:
don't use heaps; every table must have a clustered index
add the tenant ID as the leftmost key of every clustered index
add the tenant ID as the leftmost key of every non-clustered index too
add a left.TenantID = right.TenantID predicate to every join
add a table.TenantID = @currentTenantID predicate to every query
These are fairly simple rules, and if you obey them (with no exceptions) you will get perfect partitioning per tenant of every query (no query will ever scan rows in a range belonging to a different tenant), so you eliminate contention between tenants. To be more thorough, you can disable lock escalation to make sure no tenant escalates to a lock that blocks every other tenant. A brief sketch of this layout follows below.
This design also lends itself to table partitioning and to sharding the database for scale-out.
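A brief sketch of those rules (names are illustrative, and dbo.Assets is assumed to follow the same clustering pattern):
CREATE TABLE dbo.Transactions (
    TenantID      INT    NOT NULL,
    TransactionID BIGINT NOT NULL,
    AssetID       BIGINT NOT NULL,
    Amount        DECIMAL(18, 2) NOT NULL,
    -- tenant ID leads the clustered index...
    CONSTRAINT PK_Transactions PRIMARY KEY CLUSTERED (TenantID, TransactionID)
);
-- ...and every non-clustered index.
CREATE INDEX IX_Transactions_Asset ON dbo.Transactions (TenantID, AssetID);

-- Every join carries the tenant predicate, and every query filters on it.
SELECT t.TransactionID, a.AssetName
FROM   dbo.Transactions AS t
JOIN   dbo.Assets AS a
  ON   a.TenantID = t.TenantID AND a.AssetID = t.AssetID
WHERE  t.TenantID = @currentTenantID;   -- supplied by the application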
You definitely don't want to create a set of tables for each user, and you would want these only in one database. Even with SQL Server 2008's large capacity for tables (the limit is really on total objects in the database), it would quickly become unmanageable. Your best bet is to use 20 tables and separate the user areas via a column. You might consider partitioning the tables by this user value, but that should be tested for performance reasons too.
Yes, since the tables only contain id, key, and value, why not make one single table?
Have the columns:
id, user ID, key, value
Put an index on the user ID field (a minimal sketch of this layout follows below).
A key idea behind a relational database is that the table structure does not change. You create a solid set of tables, and these are the "bones" of your application.
Cheers,
Daniel
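A minimal sketch of the single-table layout Daniel describes (column types are assumptions):
CREATE TABLE dbo.UserData (
    Id     INT IDENTITY(1,1) PRIMARY KEY,
    UserId INT            NOT NULL,
    [Key]  NVARCHAR(100)  NOT NULL,
    Value  NVARCHAR(MAX)  NULL
);
CREATE INDEX IX_UserData_UserId ON dbo.UserData (UserId);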
Neal,
The solution really depends on your requirements. If security and data access are a concern and you have only a handful of users, you can set up a different database for each user, with each user's access restricted to his/her own database.
Otherwise, what Daniel Williams suggested is a good alternative, where you have one DB and tables laid out with an indexed column partitioning the users' data rows.
It's hard to tell from the summary, but it looks like you are designing for dynamic attribution by user. This design approach is called EAV (Entity-Attribute-Value) and consists of a simple base collection key (UserID, SiteID, ProductID...) and then rows consisting of name/value pairs. In a more complex version, categories are sometimes added as "super columns" to the tuple/row and provide sub-groupings for a set of name/value pairs.
Designing in this way moves responsibility for data type integrity, relational integrity and tuple integrity to the application layer.
The risk with doing this in a relational system involves the breaking of the tuple or row into a set of rows. Updates, deletes, missing values and the definition of a tuple are no longer easily accessible through human interaction. As your application evolves and the definition of a tuple changes, it becomes almost impossible to tell if a name/value pair is missing because it's part of an earlier-version tuple or because it was unintentionally deleted. Ad-hoc research as well becomes harder to manage as business analysts must keep an understanding of the virtual structure either in their heads or in documentation provided.
If you are looking to implement an EAV model, I would suggest you look at a non-relational solution (nosql) like MongoDB or CouchDB. These stores allow a developer to save and retrieve "documents" or json-formatted messages that are essentially made up of a collection of name/value pairs and can look very much like a serialized object. The advantage here is that you can store dynamic attribution without breaking your tuple. You always know that you have a complete tuple because you can store and retrieve it as a single "blob" of information that can be serialized and deserialized at-will. You can also update single attributes within the tuple, if that's a concern.
MongoDB also provides some database-like features such as multiple-attribute indexes, a query engine that is robust in comparison to other similar non-relational offerings and a sharding solution that is much less trouble than trying to do it with MySQL.
I hope this helps.
