I'm looking to implement a scenario where the aggregates materialized in ADX can be used for further downstream processing in a streaming fashion instead of having to query them periodically (e.g., Azure Functions listening to these aggregates in Event Hub and then triggering certain actions).
Does ADX support a change feed on top of a materialized view, so that newly calculated aggregates become available in a configured Event Hub for stream processing?
Seems like an XY problem.
What would be the added value of putting ADX in the middle of a streaming architecture?
Why not work directly against what's feeding ADX?
If you insist on using ADX for this, you could use:
- An update policy to capture changes in a source table
- Continuous data export to spill the data to blobs
  - Doesn't support streaming ingestion
  - Minimum interval of 1 minute
- Event Grid to listen for the creation of new files
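For illustration only, here is a rough Python sketch (using the azure-kusto-data client) of the control commands involved. The cluster, database, table, and export names are made up, and the continuous-export syntax should be double-checked against the current docs:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)
db = "MyDatabase"

# 1) Update policy: copy rows from SourceTable into ChangesTable on ingestion
#    (in practice the Query would usually call a stored function doing the aggregation)
client.execute_mgmt(db, """.alter table ChangesTable policy update
@'[{"IsEnabled": true, "Source": "SourceTable", "Query": "SourceTable", "IsTransactional": false}]'""")

# 2) Continuous export: periodically (minimum 1 minute) spill ChangesTable to blobs
#    (assumes an external table 'ChangesExternalTable' over blob storage already exists)
client.execute_mgmt(db, """.create-or-alter continuous-export ChangesExport
to table ChangesExternalTable
with (intervalBetweenRuns=1m)
<| ChangesTable""")

# 3) An Event Grid subscription on the storage account can then notify a consumer
#    (e.g. an Azure Function) whenever the export writes a new blob.
```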
I noticed something interesting. I have a bunch of stored functions in my ADX database, and each one corresponds to an update policy with a common target table. Since data is continuously being pumped into the source tables, the update policies are getting triggered and data is getting loaded into the target table. That means at any given time a bunch of instances of all the individual functions are executing (one instance per source extent), but when I issue the .show capacity command, it always shows me that the consumed 'Queries' resource is 0.

I was expecting this number to be fairly big, because an update policy function execution is essentially a query, the query text being the function call itself. And since multiple source extents are created for every table at a time, there will be multiple instances too, which is why I expect a big number here. But no matter how many times I issue .show capacity, I always see consumed 'Queries' as 0. Why is this so? And where else can I see the exact number of these update policy query instances running?
I always see consumed 'Queries' as 0
Queries that are executed internally by commands (e.g. by .set-or-append commands, .export commands, or as part of update policies triggered by commands like .ingest) aren't counted under the Queries capacity.
The commands that invoke them adhere to the capacity constraints of the virtual resource (e.g. Ingestion) they fall under.
where else can I see exact number of these update policy query instances running?
The aforementioned queries (that are run internally, by commands) aren't exposed to you as of this writing.
Using .show table * policy update, you can see the update policies you have defined and derive which queries are run when you ingest into a specific table.
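If it helps, a minimal Python sketch (using azure-kusto-data; the cluster and database names are placeholders) for listing the defined update policies and the current capacity figures:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"))
db = "MyDatabase"

# Which update policies are defined (and therefore which queries run on ingestion)
policies = client.execute_mgmt(db, ".show table * policy update")
for row in policies.primary_results[0]:
    print(row)

# Capacity figures; update-policy queries triggered by ingestion are not counted under 'Queries'
capacity = client.execute_mgmt(db, ".show capacity")
for row in capacity.primary_results[0]:
    print(row["Resource"], row["Total"], row["Consumed"])
```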
I have configured a Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We noticed a problem with the intraday sync (e.g., a 15-minute delay): the volume of streamed data grows dramatically with the sync frequency.
For example:
Every day, the (T-1) batch data (ga_sessions_yyyymmdd) syncs about 15-20 GB with 3.5M-5M records.
On the other side, the intraday data (with a 15-minute delay) streams more than ~150 GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It's not cost-effective to persist and query this data.
Is this a product bug or expected behavior, given that the exponentially growing data is not cost-effective to use?
Querying BigQuery costs $5 per TB, and streaming inserts cost ~$50 per TB.
In my view, it is not a bug; it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, every time a new hit (or group of hits) occurs, the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already created some streaming pipelines in Google Dataflow with session windows (not sure if that is what Google uses internally), and I faced the same dilemma: wait and export the aggregate only once, or export continuously and face the exponential growth.
Some advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no SELECT *);
Use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd table; it doesn't affect the size of the query, but it will prevent you from using duplicated data.
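As a rough example (Python with google-cloud-bigquery; the project, dataset, view, and column names are placeholders, and the exact name of the exported view may differ in your dataset), selecting only the needed columns and dry-running the query first to see how many bytes it would scan:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query the deduplicating view exported next to ga_realtime_sessions_YYYYMMDD,
# and only the columns that are actually needed (no SELECT *).
sql = """
SELECT visitId, visitStartTime, totals.pageviews
FROM `my-project.my_ga_dataset.ga_realtime_sessions_view_20190515`
"""

# Dry run: estimate scanned bytes (and hence cost at ~$5/TB) without running the query
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"Would process {dry.total_bytes_processed / 1e9:.2f} GB")

for row in client.query(sql).result():
    print(row.visitId, row.visitStartTime, row.pageviews)
```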
I was wondering, how does Firestore handle real-time syncing of deeply nested objects? Specifically, does it only sync the diff?
For example, say my app state is just an array of 3 values, and this state is synced between devices. If I then change one of the values, will the whole new array be synced (transmitted over the network) or only the diff? What if my state is a nested object?
I'm asking because I want to sync the whole state, which is an object with multiple fields, but I don't want to sync the whole object when I only change a single field.
Like Realtime Database, Cloud Firestore uses data synchronization to update data on any connected device. However, it's also designed to make simple, one-time fetch queries efficiently.
Queries are indexed by default: Query performance is proportional to the size of your result set, not your data set.
Cloud Firestore will send your device only the difference of the document.
Tips:
Add queries to limit the data that your listen operations return and use listeners that only download updates to data.
Place your listeners as far down the path as you can to limit the amount of data they sync. Your listeners should be close to the data you want them to get. Don't listen at the database root, as that results in downloads of your entire database.
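For example, here is a minimal Python sketch (collection and document names are made up) of a narrowly scoped listener: one way to keep each update small is to split the state into several small documents and listen to each one individually, rather than listening to one big document or a whole collection.

```python
from google.cloud import firestore

db = firestore.Client()

def on_state_change(doc_snapshots, changes, read_time):
    # Called whenever the listened-to document changes
    for doc in doc_snapshots:
        print(f"{doc.id} -> {doc.to_dict()}")

# Listen to one small document (e.g. just the settings part of the app state)
# instead of a single large document holding the whole state.
watch = db.collection("app_state").document("settings").on_snapshot(on_state_change)

# ... later, when the listener is no longer needed:
# watch.unsubscribe()
```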
Hope it helps!
I'm building an ETL process that extracts data from a REST API and then pushes update messages to a queue. The API doesn't support delta detection and uses hard deletes (a record just disappears). I currently detect changes by keeping a table in DynamoDB that contains all the record ids along with their CRCs. Whenever the API data is extracted, I compare every record's CRC against the CRC stored in DynamoDB, thus detecting whether a change has occurred.
This allows me to detect updates/inserts, but it wouldn't detect deletions. Is there a best practice for detecting hard deletes without loading the whole dataset into memory?
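For context, a simplified version of the CRC comparison I'm doing today (Python; the DynamoDB table and field names are placeholders):

```python
import json
import zlib

import boto3

checksums = boto3.resource("dynamodb").Table("record_checksums")  # placeholder table

def detect_change(record):
    """Return 'insert' or 'update' if the record is new or changed, else None."""
    crc = zlib.crc32(json.dumps(record, sort_keys=True).encode())
    stored = checksums.get_item(Key={"record_id": record["id"]}).get("Item")
    if stored is None:
        change = "insert"
    elif int(stored["crc"]) != crc:
        change = "update"
    else:
        return None
    checksums.put_item(Item={"record_id": record["id"], "crc": crc})
    return change
```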
I'm currently thinking of this:
1. Have a Redis/DynamoDB table where the last extracted data snapshot would be temporarily saved
2. When the data extraction is complete, do the reverse pass: stream the data from DynamoDB, comparing it against the Redis dataset to detect missing key values
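Roughly what I have in mind for step 2 (a Python sketch with redis-py; key names are placeholders, and here both runs' id sets live in Redis so the set difference gives the deleted keys without loading everything into application memory):

```python
import redis

r = redis.Redis()

# During extraction: remember every id seen in the current run
def record_seen(run_id, record_id):
    r.sadd(f"extract:{run_id}:ids", record_id)

# After extraction: anything in the previous run's set but not in the current one was hard-deleted
def detect_deletes(prev_run_id, run_id):
    return r.sdiff(f"extract:{prev_run_id}:ids", f"extract:{run_id}:ids")

# Optionally expire old snapshots so only the last couple of runs are kept
# r.expire(f"extract:{prev_run_id}:ids", 24 * 3600)
```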
Is there a best practice / better approach with regard to this?
I am using Kafka as a pipeline to store analytics data before it gets flushed to S3 and ultimately to Redshift. I am thinking about the best architecture to store data in Kafka, so that it can easily be flushed to a data warehouse.
The issue is that I get data from three separate page events:
When the page is requested.
When the page is loaded.
When the page is unloaded.
These events fire at different times (all usually within a few seconds of each other, but up to minutes/hours away from each other).
I want to eventually store a single event about a web page view in my data warehouse. For example, a single log entry as follows:
pageid=abcd-123456-abcde, site='yahoo.com' created='2015-03-09 15:15:15' loaded='2015-03-09 15:15:17' unloaded='2015-03-09 15:23:09'
How should I partition Kafka so that this can happen? I am struggling to find a partition scheme in Kafka that does not need a process using a data store like Redis to temporarily store data while merging the CREATE (initial page view) and UPDATE (subsequent load/unload events).
Assuming:
you have multiple interleaved sessions
you have some kind of a sessionid to identify and correlate separate events
you're free to implement consumer logic
absolute ordering of merged events is not important
wouldn't it then be possible to use separate topics with the same number of partitions for the three kinds of events and have the consumer merge those into a single event during the flush to S3?
As long as you have more than one partition in total, you would then have to make sure to use the same partition key for the different event types (e.g. a mod-hash of the sessionid), and they would end up in the corresponding partition of each topic. They could then be merged using a simple consumer that reads the three topics from one partition at a time. Kafka guarantees ordering within partitions but not across partitions.
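A rough sketch of that consumer-side merge in Python (using kafka-python; the topic names, field names, and the flush step are made up, and in practice you'd merge per partition rather than with one shared consumer loop):

```python
import json
from collections import defaultdict

from kafka import KafkaConsumer, KafkaProducer

# Producer side: key every event by the page id so the default partitioner sends
# all three event types for a page view to the same partition number in each topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("page_requested", key="abcd-123456-abcde",
              value={"site": "yahoo.com", "created": "2015-03-09 15:15:15"})
producer.flush()

# Consumer side: read the three topics and merge events that share a pageid.
consumer = KafkaConsumer(
    "page_requested", "page_loaded", "page_unloaded",
    bootstrap_servers="localhost:9092",
    group_id="pageview-merger",
    value_deserializer=lambda v: json.loads(v.decode()),
)

def flush_to_s3(pageid, event):
    # Placeholder: write the merged row out (e.g. with boto3) for the Redshift COPY
    print(pageid, event)

pending = defaultdict(dict)  # pageid -> partially merged event

for msg in consumer:
    pageid = msg.key.decode()
    pending[pageid].update(msg.value)
    # Once all three timestamps have arrived, the merged row can be flushed
    if {"created", "loaded", "unloaded"} <= pending[pageid].keys():
        flush_to_s3(pageid, pending.pop(pageid))
```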
Big warning for the edge case where a broker goes down between the page request and the page load events, though.