I have a Kusto table, which has incoming streaming data. However, I need to store the aggregates per day in another table. Is there a way to run scheduled functions in Kusto which for example runs at midnight and ingests the result into an aggregate table? If no, how can I achieve my goal? Are there some connectors available with which I can achieve this?
Materialized views are the ideal solution for this. Note that materialized views over tables with streaming data are currently in private preview, so you will need to open a support ticket. See the applicable note in the "create materialized views" document:
Materialized views over streaming ingestion tables are supported in
preview mode. Enabling this feature on your cluster requires creating
a support ticket.
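As a sketch of what this looks like once enabled: a materialized view maintains the daily aggregate continuously, so no midnight job is needed. The table, column, and view names below (Events, amount, created_at, DailyAggs) are hypothetical, and the cluster URI is a placeholder; the command could be sent with the azure-kusto-data Python client:

```python
# Sketch: a materialized view that maintains per-day sums. All names here
# are hypothetical; adapt to your actual table and columns.

def build_daily_agg_view_command(view: str, table: str) -> str:
    """Build the KQL management command for a daily-aggregate materialized view."""
    return (
        f".create materialized-view {view} on table {table}\n"
        "{\n"
        f"    {table}\n"
        "    | summarize total_amount = sum(amount) by bin(created_at, 1d)\n"
        "}"
    )

if __name__ == "__main__":
    # Requires `pip install azure-kusto-data` and the streaming-preview
    # support ticket mentioned above.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.kusto.windows.net"  # hypothetical cluster URI
    )
    client = KustoClient(kcsb)
    client.execute_mgmt("MyDatabase", build_daily_agg_view_command("DailyAggs", "Events"))
```

Queries against DailyAggs then always reflect the latest ingested data, which is usually preferable to a scheduled midnight snapshot.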
TL;DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API with MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database, with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-bigquery library to create a Node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database when looking up a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so a single-row retrieval runs as slowly as a full table scan). Most of my API calls fetch very little data from the database.
I am not sure which approach is best. I don't know if having 2 databases for 1 web application is a bad practice. I don't know if it's possible to do (1) with the daily transfers from one db to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think if it is easy to add (1) to my data engineering workflow, that this is preferred, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a python script that queries tables from BigQuery, transforms, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), however this is much easier to handle than writing a whole new node/bigquery API.
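A minimal sketch of that daily transfer script, assuming google-cloud-bigquery and pymongo are installed and credentials are configured; the project, dataset, and collection names are hypothetical. Using a stable _id makes the job idempotent, which covers the "incremental changes" concern for re-runs:

```python
# Sketch of the daily BigQuery -> MongoDB transfer (option 1).

def row_to_doc(row: dict) -> dict:
    """Turn a BigQuery row into a Mongo document, with a stable _id so
    re-running the job upserts instead of duplicating documents."""
    doc = dict(row)
    doc["_id"] = doc.pop("id")  # assumes each row has a unique 'id' column
    return doc

def transfer(bq_client, mongo_collection, query: str) -> int:
    """Query BigQuery and upsert every row into Mongo; returns the row count."""
    count = 0
    for row in bq_client.query(query).result():
        doc = row_to_doc(dict(row))
        mongo_collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        count += 1
    return count

if __name__ == "__main__":
    from google.cloud import bigquery
    from pymongo import MongoClient

    bq = bigquery.Client()
    coll = MongoClient()["mydb"]["products"]  # hypothetical db/collection
    n = transfer(bq, coll, "SELECT * FROM `myproject.mydataset.products`")
    print(f"transferred {n} rows")
```

This slots naturally into the existing Airflow workflow as one more daily task after the SQL steps.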
FWIW, in a past life I worked on a web ecommerce site that had 4 different DB back ends (Mongo, MySQL, Redis, ElasticSearch), so more than 1 is not an issue at all, but you need to consider one as the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and ElasticSearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Now, MySQL and Mongo at the same time was a bit odd, in that we were doing a slow-roll migration. This means various record types were being transitioned from MySQL over to Mongo. That process looked a bit like:
- ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
- data is regularly compared.
- a few months elapse with no irregularities, then writes to MySQL are turned off and reads are moved to Mongo.
The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, other things I don't keep up to date with, etc.) and eventually move the reads, as appropriate, to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending them.
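The dual-write layer described above can be sketched roughly like this; the interface is hypothetical, and any dict-like store stands in for the real databases:

```python
class DualWriteDao:
    """Writes go to both stores; reads come from the current DB of record."""

    def __init__(self, primary, secondary):
        self.primary = primary      # DB of record (e.g. MySQL)
        self.secondary = secondary  # shadow store being migrated to (e.g. Mongo)

    def write(self, key, value):
        # Dual-write: both stores receive every write during the migration.
        self.primary[key] = value
        self.secondary[key] = value

    def read(self, key):
        # Flip this to self.secondary once comparisons stay clean long enough.
        return self.primary[key]

    def compare(self):
        """The regular comparison step: any key where the stores disagree is suspect."""
        return [k for k in self.primary if self.secondary.get(k) != self.primary[k]]
```

Buffering writes at this layer until a batch threshold is reached, as mentioned above, would slot into write().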
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is the data is always stale by a day.
I am new to Apache Flink and building a simple application where I am reading events from a Kinesis stream, say something like
TestEvent {
    String id;
    DateTime created_at;
    Long amount;
}
performing an aggregation (sum) on the amount field of the above stream, keyed by id. The transformation is equivalent to the SQL select sum(amount) from testevents group by id, where testevents is the set of all events received so far.
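Independent of the Flink API, the keyed running sum that this SQL expresses can be sketched in plain Python; Flink maintains the equivalent per-key totals in managed keyed state:

```python
from collections import defaultdict

def running_sums(events):
    """events: iterable of (id, amount) pairs -> dict mapping id to sum(amount),
    i.e. select sum(amount) from testevents group by id over everything seen."""
    totals = defaultdict(int)
    for event_id, amount in events:
        totals[event_id] += amount
    return dict(totals)
```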
The aggregated result is stored in Flink state, and I want the result to be exposed via an API. Is there any way to do so?
PS: Can we store the Flink state in DynamoDB and create an API there? Or is there any other way to persist the state and expose it to the outside world?
I'd recommend ignoring state for now and instead looking at sinks as the primary way for a streaming application to output results.
If you are already using Kinesis for input, you could also use Kinesis to output the results from Flink. You can then use the Kinesis adapter for DynamoDB provided by AWS, as further described in a related Stack Overflow post.
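For instance, once the aggregates land in DynamoDB, the lookup API becomes a simple get by key. A rough sketch with boto3, where the table and attribute names are hypothetical:

```python
def to_item(event_id: str, total: int) -> dict:
    """Shape one per-id aggregate as a DynamoDB item (attribute names hypothetical)."""
    return {"id": event_id, "total_amount": total}

if __name__ == "__main__":
    import boto3  # requires AWS credentials to be configured

    # Hypothetical table with partition key "id".
    table = boto3.resource("dynamodb").Table("test_event_aggregates")
    table.put_item(Item=to_item("abc-123", 42))
    # A thin API (e.g. API Gateway + Lambda) can then serve
    # table.get_item(Key={"id": ...}) lookups by id.
```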
Coming back to your original question: you can query Flink's state and ship a REST API together with your streaming application, but that's a whole lot of work that is not needed to achieve your goal. You could also access checkpointed/savepointed state through the State Processor API, but again that's quite a bit of manual work that can be saved by going the usual route outlined above.
Flink's documentation describes some use cases for Queryable State.
You can also read the state offline using the State Processor API.
What if we have copy files of persistent storage (blobs) for a given Kusto database and want to be able to access these outside Kusto? Is there any way or API available for reading these files? It appears that these are binary files in Kusto's proprietary format so can't just be read without some sort of API/bridge available from Kusto.
There is an API for accessing Kusto data through Kusto: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/api/.
You really don't want to access the blobs directly as they are stored in a heavily compressed and indexed column store format. You would have to replicate most of the Kusto database engine to do so. To do it right, you would effectively end up building another node on your Kusto cluster locally, and it's not clear that you would gain anything from that. For example, you'd be further from the data, so your queries would be slower. Better to just ask your Kusto cluster to do the work and send the results.
If you need to access the data from another platform, you can export it with the .export command.
If you really need to access the data directly, and are willing to sacrifice some performance, then your best bet is probably to store the data outside Kusto and map it as an external table, or use one of the SQL plugins to query the data in its native format.
If you want to access Kusto data from a non-Kusto environment, you need to move the data out of Kusto into SQL or blob storage using the .export command.
https://learn.microsoft.com/en-us/azure/kusto/management/data-export/
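A hedged sketch of such an export, sent as a management command via the azure-kusto-data Python client; the cluster URI, storage URI (including the account-key placeholder), and query are all hypothetical:

```python
# Sketch: exporting Kusto data to blob storage with the .export command.

def build_export_command(storage_uri: str, query: str) -> str:
    """Build a .export management command writing the query result as CSV."""
    return (
        ".export to csv (\n"
        f'    h@"{storage_uri}"\n'
        ") <|\n"
        f"{query}"
    )

if __name__ == "__main__":
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.kusto.windows.net"  # hypothetical cluster URI
    )
    KustoClient(kcsb).execute_mgmt(
        "MyDatabase",
        build_export_command(
            "https://mystorage.blob.core.windows.net/exports;<account-key>",
            "MyTable | where created_at > ago(1d)",
        ),
    )
```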
The information isn't duplicated by ADX, it's indexed and compressed by ADX to enable ad-hoc interactive exploration experience.
In addition to the Kusto APIs, you can query the data in Kusto using the Kusto (ADX) Spark connector.
AFAIK, Memcached does not support synchronization with databases (at least not SQL Server or Oracle). We are planning to use Memcached (it is free) with our OLTP database.
In some business processes we do heavy validations that require a lot of data from the database. We cannot keep a static copy of this data because we don't know whether it has been modified, so we fetch it every time, which slows the process down.
One possible solution could be
Write triggers on the database to create/update prefixed-postfixed (table-PK1-PK2-PK3-column) files when records change
Monitor these file changes using FileSystemWatcher and expire the corresponding key (table-PK1-PK2-PK3-column) so updated data is fetched
Problem: There would be around 100,000 users using any combination of data for 10 hours. So we will end up having a lot of files e.g. categ1-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-78-data250, categ2-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-33-data100, etc.
I am expecting 5 million files at least. Now it looks like a pathetic solution :(
Other possibilities are
call a web service asynchronously from the trigger, passing the key to be expired
call an exe from the trigger without waiting for it to finish, and have this exe expire the key (I have had some success with this approach on SQL Server using xp_cmdshell to call an exe; calling an exe from an Oracle trigger looks a bit difficult)
Still sounds pathetic, doesn't it?
Any intelligent suggestions, please?
It's not clear (to me) whether the use of Memcached is mandatory or not. I would personally avoid it and instead use SqlDependency and OracleDependency. Both allow you to pass a DB command and get notified when the data that the command would return changes.
If Memcached is mandatory, you can still use these two classes to trigger the invalidation.
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table, and lets you query for changes to records made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2005.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which fields might have been modified. Then you can join the tables on those keys to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise edition.
Change Data Capture
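A rough sketch of a sync loop that combines change tracking with Memcached invalidation, assuming pyodbc and pymemcache, a hypothetical single-column primary key named ID, and the table-PK key scheme from the question:

```python
# Sketch: poll SQL Server change tracking and expire Memcached keys.
# DSN, table name, and the PK column "ID" are all hypothetical.

def cache_key(table: str, pk) -> str:
    """Same table-PK key scheme the question describes."""
    return f"{table}-{pk}"

def expire_changed(cursor, cache, table: str, last_version: int) -> int:
    """Expire the cache entry for every primary key changed since last_version."""
    cursor.execute(
        # CHANGETABLE returns one row per changed primary key since the version.
        f"SELECT ID FROM CHANGETABLE(CHANGES dbo.{table}, ?) AS CT",
        (last_version,),
    )
    count = 0
    for (pk,) in cursor.fetchall():
        cache.delete(cache_key(table, pk))
        count += 1
    return count

if __name__ == "__main__":
    import pyodbc
    from pymemcache.client.base import Client

    conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
    cache = Client(("localhost", 11211))
    expire_changed(conn.cursor(), cache, "Products", last_version=0)
    # Persist the version from CHANGE_TRACKING_CURRENT_VERSION() between runs.
```

Compared to the file-per-key idea above, this needs no triggers and no filesystem watching: one small poller replaces millions of files.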
I have no experience with Oracle, but I believe it may have some tracking functionality as well. This article might get you started:
20 Using Oracle Streams to Record Table Changes
Using the Orchestration Debugger, one can see useful timing information on the left about entering and leaving shapes. Unfortunately, one cannot copy the information from that window. I would like to do some benchmarks and save statistics in Excel.
Does anyone know the SQL query to get the same data from the DB? I have tried to find out with SQL Profiler, but did not hit anything that looks like the correct query or stored procedure.
I know I could use BAM, but I just need a quick one for temporary use.
If you are trying to watch with a SQL trace, be sure you have stopped BizTalk and are looking at the BizTalkDTADb database; otherwise it is guaranteed to be an exercise in futility, as BizTalk constantly interacts with SQL Server.
The exact stored procedure it calls to display the orchestration info is dtasp_LocalCallGetActions. You will likely have to do some fancy joins to get some meaningful data out of it. A good place to start is the views in the BizTalkDTADb database which can show the same data you see in the HAT views and will allow you to run the same queries over in query analyzer.