Constraints (Meta Information) in Kusto / Azure Data Explorer

For some of our logs we have the following schema:
A "master" event table that collects events. Each event comes with a unique id (guid).
For each event we collect additional IoT data (sensor data), which also contains the guid as a link to the event table.
Now, we often see the pattern that someone starts with the IoT data and then wants to query the master event table. The join or query criterion is the guid. Since we have a lot of data, the unconstrained query of course does not return within a short time frame, if at all.
What our analysts do is use the time range as a factor. Typically, the sensor data refers to events that happened on the same day, or +/- a few hours, minutes or seconds (depending on the event). Such a query typically returns, but not always as fast as it could. Given that the guid is unique, queries that explicitly state this knowledge are typically way faster than those that don't, e.g.
Event_Table | where ... | take 1
Unfortunately, everyone needs to remember those properties of the data.
After this long intro: is there a way in Kusto to speed up those queries without explicitly writing "take 1"? As in, telling the Kusto engine that this column holds unique keys? I am not talking about enforcing that (as a DB unique key would do), but just about giving hints to Kusto on how to improve the query. Can this be done somehow?

It sounds like you could benefit from introducing a server-side (stored) function:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/stored-functions
Using this method, you define the function once on the server, and users only need to provide its parameters.
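For illustration, a rough sketch of what that might look like, registering the function through the Python azure-kusto-data client. The cluster URI, database name, and the Event_Table column names (EventId, Timestamp) are assumptions about your schema, not anything stated in the question.

    # Rough sketch: register a stored function that bakes in the
    # "narrow the time window, then take 1" pattern, so analysts no
    # longer need to remember it. Names below are placeholders.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westeurope.kusto.windows.net")
    client = KustoClient(kcsb)

    create_fn = """.create-or-alter function with (docstring = 'Master event for a sensor guid')
    GetEvent(eventId: guid, around: datetime, window: timespan)
    {
        Event_Table
        | where Timestamp between ((around - window) .. (around + window))
        | where EventId == eventId
        | take 1    // the guid is unique, so stop after the first match
    }
    """
    client.execute_mgmt("MyDatabase", create_fn)

    # Analysts then only call, e.g.:
    #   GetEvent(guid(11111111-2222-3333-4444-555555555555), datetime(2024-01-01), 2h)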

Related

Total ordering of transactions in Sqlite WAL mode

I have a use-case where my application does a count (with a filter) at time T1 and transmits this information to the UI.
In parallel (T0.9, T1.1), rows are inserted/updated and fire events to the interface. From the UI's perspective, it is hard to know if the count included a given event or not.
I would like to return some integer X with the count transaction and another integer Y from the insert/update transaction so that the interface only considers events where Y > X.
Since SQLite provides snapshot isolation in WAL mode, I was thinking that there must be information in the snapshot about which records to read or not that could be leveraged for this.
I cannot use the max ROWID either, because an update might change older rows that should now be counted given the filter.
Time also seems unreliable, since a write transaction could start before a read transaction but still not be included in its snapshot.
I have no issue coding a custom plugin for SQLite if one is needed to access this data.
Any idea is appreciated!
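One way to realize the X/Y scheme described above might be a single-row counter that every write transaction increments inside the transaction; a rough sketch with Python's sqlite3 (table and column names are invented, and this is only one possible approach):

    # The counter value read inside the count transaction is X; the value
    # stamped on each inserted row is Y. Writers are serialized by SQLite,
    # so Y values are totally ordered with commits.
    import sqlite3

    conn = sqlite3.connect("app.db", isolation_level=None)  # explicit BEGIN/COMMIT
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS events(id INTEGER PRIMARY KEY, payload TEXT, seq INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS write_seq(id INTEGER PRIMARY KEY CHECK(id = 1), value INTEGER)")
    conn.execute("INSERT OR IGNORE INTO write_seq VALUES (1, 0)")

    def insert_event(payload):
        # Write transaction: bump the counter and stamp the new row with it (Y).
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("UPDATE write_seq SET value = value + 1 WHERE id = 1")
        y = conn.execute("SELECT value FROM write_seq").fetchone()[0]
        conn.execute("INSERT INTO events(payload, seq) VALUES (?, ?)", (payload, y))
        conn.execute("COMMIT")
        return y  # sent to the UI together with the event

    def count_events(where_clause="1=1"):
        # Read transaction: whatever counter value this snapshot sees is X.
        conn.execute("BEGIN")
        x = conn.execute("SELECT value FROM write_seq").fetchone()[0]
        n = conn.execute("SELECT COUNT(*) FROM events WHERE " + where_clause).fetchone()[0]
        conn.execute("COMMIT")
        return n, x  # the UI then only applies events whose Y > X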

What is the best way to schedule tasks in a serverless stack?

I am using NextJS and Firebase for an application. Users are able to rent products for a certain period. After that period, a serverless function should be triggered which updates the database etc. Since NextJS is event-driven, I cannot seem to figure out how to schedule a task that executes when the rental period ends and updates the database.
Perhaps cron jobs handled elsewhere (Easy Cron etc) are a solution. Or maybe an EC2 instance just for scheduling these tasks.
Since this is tagged with AWS EC2, I've assumed it's OK to suggest a solution with AWS services in mind.
What you could do is leverage DynamoDB's speed & sort capabilities. If you specify a table with both a partition key and a range key, the data is automatically sorted in UTF-8 byte order. This means ISO-timestamp values can be used to sort data historically.
With this in mind, you could design your table to have a partition key with a global, constant value across all users (to group them all) and a sort key of isoDate#userId, while also creating a GSI (Global Secondary Index) with the userId as the partition key and the isoDate as the range key.
With your data sorted, you can use the BETWEEN query to extract the entries that fit to your time window.
Schedule 1 lambda to run every minute (or so) and extract the entries that are about to expire to notify them about it.
Important note: this sorting method works when ALL range keys have the same length, due to how UTF-8 sorting works. You can easily accomplish this if your application uses UUIDs as ids. If not, you can simply generate a random UUID to attach to the ISO timestamp, as you only need it to avoid the rare exact-timestamp collision.
Example: let's say you want to extract all entries expiring near 2022-10-10T12:00:00.000Z:
your query would be BETWEEN 2022-10-10T11:59:00.000Z#00000000-0000-0000-0000-000000000000 and 2022-10-10T12:00:59.999Z#zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
The timestamps could be a little off, but you get the idea: 00... is the lowest possible UUID, and since UUIDs are hexadecimal, ff... is the highest (zz... also works as an upper bound because it sorts after any hex digit).
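For reference, that query could look roughly like this with boto3; the table name ("Rentals") and the key attribute names ("pk", "expiresAt") are assumptions for the sketch:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Rentals")

    def expiring_between(start_iso, end_iso):
        # Constant partition key groups all rentals; the sort key is
        # "<isoDate>#<uuid>", so pad the range ends with min/max UUIDs.
        low = f"{start_iso}#00000000-0000-0000-0000-000000000000"
        high = f"{end_iso}#ffffffff-ffff-ffff-ffff-ffffffffffff"
        resp = table.query(
            KeyConditionExpression=Key("pk").eq("RENTAL")
            & Key("expiresAt").between(low, high)
        )
        return resp["Items"]

    # e.g. expiring_between("2022-10-10T11:59:00.000Z", "2022-10-10T12:00:59.999Z")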
In AWS, creating periodic triggers for Lambda using the AWS Console is quite simple and straightforward.
Login to console and navigate to CloudWatch.
Under Events, select Rules & click “Create Rule”
You can either select fixed rate or select Cron Expression for more control
Cron expressions in CloudWatch start at the minutes field, not seconds; this is important to remember if you are copying a cron expression from somewhere else.
Click “Add Target”, select “Lambda Function” from drop down & then select appropriate Lambda function.
If you want to pass some data to the target function when triggered, you can do so by expanding “Configure Input”
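The same setup can also be scripted; a rough boto3 sketch, where the rule name, function name, account/ARN and the input payload are all placeholders:

    import json
    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    rule_arn = events.put_rule(
        Name="check-expiring-rentals",
        ScheduleExpression="rate(1 minute)",  # or cron(...) - minutes first, no seconds field
        State="ENABLED",
    )["RuleArn"]

    # The rule needs permission to invoke the function before it is attached as a target.
    lambda_client.add_permission(
        FunctionName="checkExpiringRentals",
        StatementId="allow-cloudwatch-rule",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(
        Rule="check-expiring-rentals",
        Targets=[{
            "Id": "1",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:checkExpiringRentals",
            "Input": json.dumps({"windowMinutes": 2}),  # the "Configure Input" step
        }],
    )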

Efficient DynamoDB schema for time series data

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.
The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.
In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?
Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?
I guess we can assume that there are many "active" conversations in parallel, right? Meaning - we're not dealing with the case where all the traffic is regarding a single conversation (or a few).
If that's the case, and you're using a random number/GUID as your HASH key, your objects will be spread evenly across the nodes and, as far as I know, you shouldn't be afraid of skew. Since CreateDate is only the RANGE key, all messages for the same conversation will be stored on the same node (based on their ConversationID), so it actually doesn't matter whether you query for the latest 5 records or the earliest 5; in both cases the query uses the index on CreateDate.
I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section) and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).
I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a query command using the ConversationID as the HASH key and ordering results by CreateDate descending. In that case, I'd return only the first page of results (I think it returns up to 1MB of data, so depending on the average message length it may or may not be enough) and only fetch the next page if the user keeps scrolling. Otherwise, you might use a lot of your throughput on really long conversations, and anyway the client doesn't really want to get stuck for a long time waiting for megabytes of data to appear on screen.
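A minimal sketch of that paged query with boto3, assuming a "Messages" table with ConversationId as the hash key and CreateDate as the range key (names are placeholders):

    import boto3
    from boto3.dynamodb.conditions import Key

    messages = boto3.resource("dynamodb").Table("Messages")

    def latest_messages(conversation_id, page_size=10, start_key=None):
        kwargs = dict(
            KeyConditionExpression=Key("ConversationId").eq(conversation_id),
            ScanIndexForward=False,  # newest first, i.e. CreateDate descending
            Limit=page_size,
        )
        if start_key:  # pass the previous page's LastEvaluatedKey to scroll further back
            kwargs["ExclusiveStartKey"] = start_key
        resp = messages.query(**kwargs)
        return resp["Items"], resp.get("LastEvaluatedKey")

    # First page on load; fetch the next one only when the user keeps scrolling.
    items, next_key = latest_messages("conv-123")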
Hope this helps

Is there a way to find the SQL that updated a particular field at a particular time?

Let's assume that I know when a particular database record was updated. I know that somewhere exists a history of all SQL that's executed, perhaps only accessible by a DBA. If I could access this history, I could SELECT from it where the query text is LIKE '%fieldname%'. While this would pretty much pull up any transactional query containing the field name I am looking for, it's a great start, especially if I can filter the recordset down to a particular date/time range.
I've discovered the dbc.DBQLogTbl view, but it doesn't appear to work as I expect. Is there another view that contains the information I am looking for?
It depends on the level of database query logging (DBQL) that has been enabled by the DBA.
Some DBAs may elect not to log detailed information for tactical queries, so it is best to consult with your DBA team to understand what is being captured. You can also query DBC.DBQLRules to determine what level of logging has been enabled.
The following data dictionary objects will be of particular interest to your question:
DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement, CPU, IO, and other fields associated with a particular query.
DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length, it is split across multiple rows, which is denoted by a column in this table. If you join this to the main Query Log table, care must be taken if you are aggregating metrics in the Query Log table; although, more often than not, if you are joining the Query Log table to the SQL table you are not doing any aggregation.
DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables, columns, and indexes referenced by a particular query.
These tables can be joined together in DBC via QueryID and ProcID. There are a few other tables that capture information about the queries but are beyond the scope of this particular question. You can find out about those in the Teradata Manuals.
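As a rough illustration only (the exact columns available depend on your DBQL release and logging rules), a query along these lines joins the log to its SQL text for a time window and field name, here issued through the teradatasql Python driver with placeholder credentials:

    import teradatasql

    sql = """
    SELECT l.QueryID, l.UserName, l.StartTime, s.SqlTextInfo
    FROM DBC.QryLog l
    JOIN DBC.QryLogSQL s
      ON s.ProcID = l.ProcID
     AND s.QueryID = l.QueryID
    WHERE l.StartTime BETWEEN TIMESTAMP '2024-01-01 00:00:00'
                          AND TIMESTAMP '2024-01-02 00:00:00'
      AND s.SqlTextInfo LIKE '%fieldname%'
    ORDER BY l.StartTime, s.SqlRowNo
    """

    con = teradatasql.connect(host="tdhost", user="dbuser", password="***")
    cur = con.cursor()
    cur.execute(sql)
    for row in cur.fetchall():
        print(row)
    con.close()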
Check with your DBA team to determine the level of logging being done and where the historical DBQL data is retained. Often DBQL data is moved nightly to a historical database, and there is often a ten-minute delay before data is flushed from cache to the DBC tables. Your DBA team can tell you where to find historical DBQL data.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
Problem is after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slow.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page loads, you calculate them offline, either via a batch process at night or incrementally whenever a log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
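As a concrete illustration of the batch-aggregation idea (shown with sqlite3 so it runs standalone; table and column names are invented, and the same SQL shape applies to whatever database the site actually uses):

    import sqlite3

    conn = sqlite3.connect("stats.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS hits(zone TEXT, user_id TEXT, hit_time TEXT);
    CREATE TABLE IF NOT EXISTS daily_stats(
        zone TEXT, day TEXT, unique_users INTEGER,
        PRIMARY KEY (zone, day));
    """)

    def aggregate_day(day):  # e.g. "2024-01-15", called from a nightly job
        conn.execute("""
            INSERT OR REPLACE INTO daily_stats(zone, day, unique_users)
            SELECT zone, date(hit_time), COUNT(DISTINCT user_id)
            FROM hits
            WHERE date(hit_time) = ?
            GROUP BY zone, date(hit_time)""", (day,))
        conn.commit()

    # The report pages then read the handful of rows in daily_stats instead of
    # scanning every raw hit.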
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, data outside that range is not even considered; that can lead to very significant time savings, even better than an index (an index still grows with your data and has to be traversed, whereas partitioning simply excludes whole sub-tables, e.g. one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
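For illustration only, here is what date-range partitioning could look like using PostgreSQL-style declarative partitioning (the question doesn't name an engine, and e.g. SQL Server uses partition functions and schemes instead), issued here through psycopg2 with invented table names:

    import psycopg2

    ddl = """
    CREATE TABLE hits (
        zone     text,
        user_id  text,
        hit_time timestamptz
    ) PARTITION BY RANGE (hit_time);

    CREATE TABLE hits_2024_01 PARTITION OF hits
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE hits_2024_02 PARTITION OF hits
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """

    conn = psycopg2.connect("dbname=stats")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)

    # A report query constrained to February, e.g.
    #   WHERE hit_time >= '2024-02-01' AND hit_time < '2024-03-01'
    # only ever touches hits_2024_02.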
