Is there a way to find the SQL that updated a particular field at a particular time? - teradata

Let's assume that I know when a particular database record was updated. I know that somewhere exists a history of all SQL that's executed, perhaps only accessible by a DBA. If I could access this history, I could SELECT from it where the query text is LIKE '%fieldname%'. While this would pretty much pull up any transactional query containing the field name I am looking for, it's a great start, especially if I can filter the recordset down to a particular date/time range.
I've discovered the dbc.DBQLogTbl view, but it doesn't appear to work as I expect. Is there another view that contains the information I am looking for?

It depends on the level of database query logging (DBQL) that has been enabled by the DBA.
Some DBAs may elect not to log detailed information for tactical queries, so it is best to consult with your DBA team to understand what is being captured. You can also query DBC.DBQLRules to determine what level of logging has been enabled.
The following data dictionary objects will be of particular interest to your question:
DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement, CPU, IO, and other fields associated with a particular query.
DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length it is split across multiple rows, which is denoted by a column in this table. If you join this to the main Query Log table, care must be taken if you are aggregating any metrics in the Query Log table; although more often than not, if you are joining the Query Log table to the SQL table you are not doing any aggregation.
DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables, columns, and indexes referenced by a particular query.
These tables can be joined together in DBC via QueryID and ProcID. There are a few other tables that capture information about the queries but are beyond the scope of this particular question. You can find out about those in the Teradata Manuals.
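For example, something along these lines pulls the captured SQL text for a given window (the timestamp range and the '%fieldname%' pattern are placeholders, and the column names assume the standard DBQL layout, so verify them against your dictionary views):

SELECT l.UserName,
       l.StartTime,
       s.SqlRowNo,
       s.SqlTextInfo
FROM DBC.QryLog AS l
JOIN DBC.QryLogSQL AS s
  ON l.ProcID = s.ProcID
 AND l.QueryID = s.QueryID
WHERE l.StartTime BETWEEN TIMESTAMP '2019-01-15 09:00:00'
                      AND TIMESTAMP '2019-01-15 10:00:00'
  AND s.SqlTextInfo LIKE '%fieldname%'
ORDER BY l.StartTime, s.SqlRowNo;

Keep in mind that a long statement is split across several SqlRowNo rows, so a LIKE filter can miss text that happens to span the split.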
Check with your DBA team to determine the level of logging being done and where the historical DBQL data is retained. Often DBQL data is moved nightly to a historical database, and there is often a ten-minute delay before data is flushed from cache to the DBC tables. Your DBA team can tell you where to find historical DBQL data.

Related

Constraints (Meta Information) in Kusto / Azure Data Explorer

For some of our logs we have the following schema:
A "master" event table that collects events. Each event comes with a unique id (guid).
For each event we collect additional IoT data (sensor data), which also contains the guid as a link to the event table.
Now, we often see the pattern where someone starts with the IoT data and then wants to query the master event table. The join or query criterion is the guid. As we have a lot of data, the unconditioned query, of course, does not return within a short time frame, if at all.
What our analysts do is use the time range as a factor. Typically, the sensor data refers to events that happened on the same day, or +/- a few hours, minutes, or seconds (depending on the events). This query typically returns, but not always as fast as it could. Given that the guid is unique, queries that explicitly state this knowledge are typically much faster than those that don't, e.g.
Event_Table | where ... | take 1
Unfortunately, everyone needs to remember those properties of the data.
After this long intro: is there a way in Kusto to speed up those queries without explicitly writing "take 1"? As in, telling the Kusto engine that this column holds unique keys? I am not talking about enforcing that (as a DB unique key would do), but just about giving hints to Kusto on how to improve the query. Can this be done somehow?
It sounds like you could benefit from introducing a server-side function:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/stored-functions
Using this method, you can define a function on the server, and users will provide a parameter to the function.

Filtering results from ClickHouse using values from dictionaries

I'm a little unfamiliar with ClickHouse and am still learning it by trial and error. Got a question about it.
Talking about the star schema of data representation, with dimensions and facts. Currently, I keep everything in PostgreSQL, but OLAP queries with aggregations are starting to show bad timings, so I'm going to move some fact tables to ClickHouse. Initial tests of CH show incredible performance; however, in real life the queries should include joins to dimension tables from PostgreSQL. I know I can connect them as dictionaries.
Question: I found that using dictionaries I can make requests similar to LEFT JOINs in a good old RDBMS, i.e. values from the result set can be joined with corresponding values from the dictionary. But can they be filtered by some restrictions on dictionary keys (as in an INNER JOIN)? For example, in PostgreSQL I have a table users (id, name, ...) and in ClickHouse I have a table visits (user_id, source, medium, session_time, timestamp, ...) with metrics about their visits to the site. Can I make a query to CH to fetch aggregated metrics (number of daily visits for a given date range) of users whose name matches some condition (LIKE 'EVE%' for example)?
It sounds like the ODBC table function is what you're looking for. ClickHouse has a bunch of table functions which work like Postgres foreign tables. The setup is similar to dictionaries, but you gain the traditional JOIN behavior. It currently doesn't show up in the official documentation; you can refer to this test case: https://github.com/yandex/ClickHouse/blob/master/dbms/tests/integration/test_odbc_interaction/test.py#L84 . And in the near future (this year), ClickHouse will support the standard JOIN statement.
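For illustration only, a filtered join through the odbc table function could look roughly like this (the DSN name and the user/visit column names are assumptions based on your example, so adjust them to your setup):

SELECT toDate(timestamp) AS day,
       count() AS visits
FROM visits
ANY INNER JOIN
(
    SELECT id AS user_id
    FROM odbc('DSN=postgresql_dsn', 'public', 'users')
    WHERE name LIKE 'EVE%'
) USING user_id
WHERE timestamp >= toDateTime('2018-10-01 00:00:00')
GROUP BY day

Only the users whose names match the condition survive the join, which gives you the INNER JOIN style filtering you were asking about.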
The dictionary will basically replace the value first. As I understand it, your dictionary would be based on your users table.
Here is an example. Hopefully I am understanding your question.
SELECT
    dictGetString('accountidmap', 'domain', tuple(toString(account_id))) AS domain,
    sum(session) AS sessions
FROM session_distributed
WHERE date = '2018-10-15'
  AND like(domain, '%cats%')
GROUP BY domain
This is a real query on our database, so if there is something you want to try or confirm, let me know.

Can I use ZCatalogs Query Plan to Optimise Catalog Queries?

I'm wondering if I can make use of the information provided by the Query Report and Query Plan tabs on the portal catalog. Can I optimize ZCatalog queries based on the query report? How does the ZCatalog's Query Plan differ from a query plan of an SQL database?
The query plan information is used to improve catalog performance, but you cannot optimize your own queries based on plan information.
The catalog only builds up that information as needed, based on your index sizes; unlike a SQL database the catalog does not plan each query based on such information but rather looks up pre-calculated plans from the structure reflected in the Query Plan tab.
The query report tab does give you information about what indexes are performing poorly for your code; you may want to rethink code that uses those combinations of indexes and/or look into why those indexes performed poorly; perhaps your query didn't limit the result quickly enough or the slow index is very large, indicating that perhaps your ZODB cache is too small to hold that large index or that other results keep pushing it out.
On the whole, for large applications it is a good idea to retain the query plan; in one project we dump cache information before stopping instances and reload that after starting again, and that includes the catalog query plan:
# save the current catalog query plan before the instance is stopped
plan = site.portal_catalog.getCatalogPlan()
with open(PLAN_PATH, 'w') as out:
    out.write(plan)
and on load:
import os
from Products.ZCatalog.plan import PriorityMap

# restore the saved query plan; ignore a missing or unreadable file
if os.path.exists(PLAN_PATH):
    try:
        PriorityMap.load_from_path(PLAN_PATH)
    except Exception:
        pass

SQL Server database design

I am planning to create a website using ASP.NET and SQL Server. However, my plan for the database design leaves me wondering if there is a better way.
The website will serve as a repository of information for various users. I figure I would have two databases, a Membership and Profile database.
The profile database would contain user data for all users, where each user may have ~20 tables. I would create the tables when the user account is created and generate a key used to name the tables. The tables are not directly related.
For Example a set of tables for two different users could look like:
User1 Tables - TransactionTable_Key1, AssetTable_Key1, ResearchTable_Key1 ....;
User2 Tables - TransactionTable_Key2, AssetTable_Key2, ResearchTable_Key2 ....;
The Key1, Key2, etc. values would be retrieved based on the MembershipID data when the account was created. This could result in a very large number of tables over time. I'm not sure if setting up the database in this way will limit scalability. Any recommendations?
Edit: I should mention that some of these tables would contain 20k+ rows.
Realistically it sounds like you only really need one database for this.
From the way you worded your question, it sounds like you're trying to dynamically create tables for users as they create accounts. I wouldn't recommend this method.
What you want to do is create a master table that contains a primary key for each individual user. I'm assuming this is the Membership table. Then create the ~20 tables that you need for the profiles of these members. Every record, no matter the number of users that you have, will go into these tables. These 20 tables would need to have a foreign key pointing to the unique identifier of the Membership table.
When you want to query a Member for their user information, just select from the tables where the membership table's primary Id matches the foreign key in the profile tables.
This would result in only a few tables in the end and is easily maintainable and follows better database design.
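As a minimal sketch (the table and column names here are only illustrative), the Membership table plus one of the profile tables could look like:

CREATE TABLE Membership (
    MembershipID INT IDENTITY(1,1) PRIMARY KEY,
    UserName     NVARCHAR(256) NOT NULL
);

CREATE TABLE Transactions (
    TransactionID INT IDENTITY(1,1) PRIMARY KEY,
    MembershipID  INT NOT NULL REFERENCES Membership (MembershipID),  -- one shared table for all users
    Amount        DECIMAL(18, 2) NOT NULL,
    CreatedOn     DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- one user's transactions, no matter how many users exist
SELECT t.TransactionID, t.Amount, t.CreatedOn
FROM Transactions AS t
WHERE t.MembershipID = @MembershipID;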
Your ORM layer (EF, LINQ, DAL code) will hate having to deal with one set of tables per tenant. It is much better to have either one set of tables for all tenants in a single database, or a separate database per tenant. The latter is only better if schema upgrades have to be vetted per tenant (as Salesforce.com does). If you can afford to upgrade all tenants to a new schema at once, then there is no reason for a database per tenant.
When you design a schema that holds multiple tenants, the important things to remember are:
don't use heaps; every table should have a clustered index
add the tenant ID as the leftmost key of every clustered index
add the tenant ID as the leftmost key of every non-clustered index too
add the left.TenantID = right.TenantID predicate to every join
add the table.TenantID = @currentTenantID predicate to every query
These are fairly simple rules, and if you obey them (with no exceptions) you will get perfect partitioning of every query per tenant (no query will ever scan rows in a range belonging to a different tenant), so you eliminate contention between tenants. To be more thorough, you can disable lock escalation to make sure no tenant's locks escalate and block every other tenant.
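A minimal sketch of those rules in T-SQL (the table and column names are made up for illustration):

CREATE TABLE Assets (
    TenantID INT NOT NULL,
    AssetID  INT NOT NULL,
    Name     NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Assets PRIMARY KEY CLUSTERED (TenantID, AssetID)              -- tenant ID leftmost
);

CREATE TABLE Transactions (
    TenantID      INT NOT NULL,
    TransactionID INT NOT NULL,
    AssetID       INT NOT NULL,
    Amount        DECIMAL(18, 2) NOT NULL,
    CONSTRAINT PK_Transactions PRIMARY KEY CLUSTERED (TenantID, TransactionID)
);

CREATE NONCLUSTERED INDEX IX_Transactions_Asset
    ON Transactions (TenantID, AssetID);                                        -- leftmost here too

-- the tenant predicate appears on the join and on the query itself
SELECT t.TransactionID, a.Name, t.Amount
FROM Transactions AS t
JOIN Assets AS a
  ON a.TenantID = t.TenantID
 AND a.AssetID  = t.AssetID
WHERE t.TenantID = @currentTenantID;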
This design also lends itself to table partitioning and to sharding the database for scale-out.
You definitely don't want to create a set of tables for each user, and you would want these in only one database. Even with SQL Server 2008's large capacity for tables (really a limit on total objects in the database), it would quickly become unmanageable. Your best bet is to use 20 tables and separate each user's data via a column. You might consider partitioning the tables by this user value, but that should be tested for performance reasons too.
Yes, since the tables only contain id, key, and value, why not make one single table?
Have the columns:
id, user ID, key, value
Put an Index on the user ID field.
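A minimal sketch of that single table (the names are illustrative):

CREATE TABLE ProfileData (
    Id     INT IDENTITY(1,1) PRIMARY KEY,
    UserId INT NOT NULL,            -- points back at the Membership record
    [Key]  NVARCHAR(100) NOT NULL,
    Value  NVARCHAR(MAX) NULL
);

CREATE INDEX IX_ProfileData_UserId ON ProfileData (UserId);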
A key idea behind a relational database is that the table structure does not change. You create a solid set of tables, and these are the "bones" of your application.
Cheers,
Daniel
Neal,
The solution really depends on your requirements. If security and data access are a concern and you have only a handful of users, you can set up a different database for each user, with each user's access restricted to his/her own database.
Otherwise, what Daniel Williams suggested is a good alternative, where you have one DB and tables laid out with an indexed column partitioning the users' data rows.
It's hard to tell from the summary, but it looks like you are designing for dynamic attribution by user. This design approach is called EAV (Entity-Attribute-Value) and consists of a simple base collection key (UserID, SiteID, ProductID...) and then rows consisting of name/value pairs. In a more complex version, categories are sometimes added as "super columns" to the tuple/row and provide sub-groupings for a set of name/value pairs.
Designing in this way moves responsibility for data type integrity, relational integrity and tuple integrity to the application layer.
The risk with doing this in a relational system involves the breaking of the tuple or row into a set of rows. Updates, deletes, missing values and the definition of a tuple are no longer easily accessible through human interaction. As your application evolves and the definition of a tuple changes, it becomes almost impossible to tell if a name/value pair is missing because it's part of an earlier-version tuple or because it was unintentionally deleted. Ad-hoc research as well becomes harder to manage as business analysts must keep an understanding of the virtual structure either in their heads or in documentation provided.
If you are looking to implement an EAV model, I would suggest you look at a non-relational (NoSQL) solution like MongoDB or CouchDB. These stores allow a developer to save and retrieve "documents", or JSON-formatted messages, that are essentially made up of a collection of name/value pairs and can look very much like a serialized object. The advantage here is that you can store dynamic attribution without breaking your tuple. You always know that you have a complete tuple because you can store and retrieve it as a single "blob" of information that can be serialized and deserialized at will. You can also update single attributes within the tuple, if that's a concern.
MongoDB also provides some database-like features such as multiple-attribute indexes, a query engine that is robust in comparison to other similar non-relational offerings and a sharding solution that is much less trouble than trying to do it with MySQL.
I hope this helps.

SQL Database Design - Cache Tables?

What's a common/best practice for database design when it comes to improving performance on count(1) queries? (I'm currently using SQLite)
I've normalized my data, it exists on multiple tables, and for simple things I want to do on a single table with a good index -- queries are acceptably quick for my purposes.
eg:
SELECT count(1) from actions where type='3' and area='5' and employee='2533';
But when I start getting into multiple table queries, things get too slow (> 1 second).
SELECT count(1)
from
(SELECT SID from actions
where type='3' and employee='2533'
INTERSECT
SELECT SID from transactions where currency='USD') x;
How should I cache my results? What is a good design?
My natural reaction is to add a table solely for storing rows of cached results per employee?
Edit
Design patterns like Command Query Responsibility Segregation (CQRS) specifically aim to improve the read side performance of data access, often in distributed systems and at enterprise scale.
Commands are issued to indicate 'transactions' or 'change / updates' to data
When a system processes these commands (e.g. by updating database tables), the new state of the affected objects is 'broadcast'
Systems which are interested (such as a user interface or a queryable REST API) will then subscribe to these data changes, and then 'shape' the updated data to their specific needs
This updated data is then cached (often called a 'Read Store')
Another pattern commonly associated with CQRS is "Event Sourcing", which stores, and then allows 'replay' of Commands, for various use cases.
The above may be overkill for your scenario, but a very simple implementation of caching at an internal app level could be via a SQLite trigger.
Assuming that there are many more 'reads' than writes to your actions or transactions tables,
You could create a cache table specifically for "SID for actions by type by employee" and one for "SID for transactions by currency", or even combine the two (depending on what other scenarios you have for querying).
You would then need to update these cache table(s) every time the underlying actions or transactions tables are updated. One cheap (and nasty) way would be to provide an INSERT, UPDATE and DELETE trigger on the actions and transactions tables, which would then update the appropriate cache table(s).
Your 'query' interface would now primarily interact with the cache tables, using the 'derived' data (such as the counts).
You may still however need to handle cache miss scenarios, such as the initial 'seed' of these cache tables, or if the cache tables need to be regenerated.
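As a rough sketch in SQLite (the cache table and trigger names are made up, and actions is assumed to have the type and employee columns from your queries):

CREATE TABLE action_counts (
    employee TEXT    NOT NULL,
    type     TEXT    NOT NULL,
    cnt      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (employee, type)
);

-- keep the cache in step with inserts; UPDATE and DELETE triggers follow the same pattern
CREATE TRIGGER actions_after_insert AFTER INSERT ON actions
BEGIN
    INSERT OR IGNORE INTO action_counts (employee, type, cnt)
        VALUES (NEW.employee, NEW.type, 0);
    UPDATE action_counts
       SET cnt = cnt + 1
     WHERE employee = NEW.employee
       AND type     = NEW.type;
END;

The query side then reads the count straight from action_counts instead of scanning actions.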
In addition to a local relational database like SQLite, NoSQL databases like MongoDB, Cassandra, and Redis are frequently used for read-side caching in read-heavy environments (depending on the type and format of data that you need to cache). You would, however, need an alternative way to synchronize data from your 'master' (e.g. SQLite) database to these cache read stores; triggers obviously won't cut it here.
Original Answer
If you are 100% sure that you are always repeating exactly the same query for the same customer, sure, persist the result.
However, in most other instances, RDBMS usually handles caching just fine.
The INTERSECT with the query
SELECT SID from transactions where currency='USD'
Could be problematic if there are a large number of transaction records with USD.
Possibly you could replace this with a join?
SELECT count(1) from
(
SELECT t.[SID]
from
transactions as t
inner join
(
SELECT SID from actions where type='3' and employee='2533'
) as a
on t.SID = a.SID
where t.currency= 'USD'
) as a
You might just check your indexes however:
For
SELECT count(1) from actions where
type='3' and area='5' and
employee='2533'
SELECT SID from actions where
type='3' and employee='2533'
An index on Actions(Employee, Type) or Actions(Employee, Type, Area) would make sense (assuming Employee has highest selectivity, and depending on the selectivity of Type and Area).
You can also compare this to an index on Actions(Employee, Type, Area, SID) as a covering index for your second query.
And for the join above, you need an index on Transactions(SID, Currency)
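In SQLite terms those suggestions translate to something like the following (the index names are arbitrary, and you would pick one of the two actions indexes rather than create both):

CREATE INDEX idx_actions_emp_type_area ON actions (employee, type, area);
-- or, as a covering index for the second query:
CREATE INDEX idx_actions_emp_type_area_sid ON actions (employee, type, area, SID);
CREATE INDEX idx_transactions_sid_currency ON transactions (SID, currency);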
