Best configuration for repeated data on Azure Data Explorer

I'm using Azure Data Explorer to store data from some systems, and we load the full data every day to keep track of changes (the data has no "last modified" field).
Considering that the same data will probably be repeated across various days, I want to know if there is any recommended partition/sharding policy specific to this case to optimize compression and/or performance?
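For reference, "partitioning" in ADX is configured through a table-level partitioning policy. The sketch below shows how such a policy might be applied from the Python SDK; the cluster URI, database, table and column names are hypothetical, and the policy JSON follows my reading of the partitioning-policy docs rather than a tested configuration.

    # Minimal sketch: applying a hash partitioning policy with the azure-kusto-data
    # Python SDK. Cluster URI, database, table and column names are hypothetical.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    CLUSTER = "https://mycluster.westeurope.kusto.windows.net"  # hypothetical cluster
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
    client = KustoClient(kcsb)

    # Hash-partition on a column most queries filter or join on, so rows that
    # repeat day after day tend to land in the same extents.
    COMMAND = """.alter table DailySnapshots policy partitioning '{"PartitionKeys":[{"ColumnName":"SourceSystemId","Kind":"Hash","Properties":{"Function":"XxHash64","MaxPartitionCount":128}}]}'"""

    client.execute_mgmt("MyDatabase", COMMAND)
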

Related

Which DB to use for comparing courses of data by days?

I'm currently thinking about a little "big data" project where I want to record some utilization values every 10 minutes and write them to a DB over several months or years.
I then want to analyze the data e.g. in these ways:
Which time of the day is best (in terms of a low utilization)?
What are the differences in utilization between normal weekdays and days on the weekend?
At what time does the higher part of the utilization begin on a normal Monday?
For this I obviously need the possibility to build averaged graphs for, e.g., all Mondays that were recorded so far.
For the first "proof of concept" I set up an InfluxDB and Grafana, which works quite well for seeing the data being written to the DB, but the more I research on the internet, the more I see that InfluxDB is not made for what I want to do (or cannot do it yet).
So which database would be best to record and analyze data like that? Or is it more a question of which tool to use to analyze the data? Which tool could that be?
InfluxDB's query language is not flexible enough for these kinds of questions.
The SQL databases supported by Grafana (MySQL, Postgres, TimescaleDB, ClickHouse) seem to fit better. The choice depends on your preferences and the amount of data. For smaller datasets plain MySQL or Postgres may be enough. For higher loads consider TimescaleDB. For billions of datapoints ClickHouse is probably the better fit.
If you want a lightweight but scalable NoSQL time-series solution, have a look at VictoriaMetrics.
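To make the "averaged graphs for all Mondays" part concrete, the kind of query a SQL store handles easily looks roughly like this from Python with psycopg2 (a sketch only; the utilization table and its ts/value columns are hypothetical):

    # Minimal sketch, assuming a Postgres/TimescaleDB instance and a hypothetical
    # table utilization(ts timestamptz, value double precision).
    import psycopg2

    QUERY = """
        SELECT extract(isodow FROM ts) AS weekday,      -- 1 = Monday ... 7 = Sunday
               extract(hour   FROM ts) AS hour_of_day,
               avg(value)              AS avg_utilization
        FROM utilization
        GROUP BY 1, 2
        ORDER BY 1, 2;
    """

    def weekday_profile(dsn: str):
        """Average utilization per weekday/hour, e.g. 'all Mondays recorded so far'."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()
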

Silo data for users into a particular physical location

We have a system we're designing which has to hold data for people globally, including countries with very strict data protection policies, specifically where data about their citizens must physically reside in that country.
Now we could engineer a mechanism for silo-ing/querying the data so it is pulled from a particular location, but as the system will be Azure based, we were hoping that Cosmos DB's partitioning feature might be an option.
Based on the information available to date on partitioning, it seems like it's possible to assign a location-specific partition for some data, but it's not very clear. Any search on partitioning in general goes on about high availability and low latency - good things - but not what I'm looking for.
To this end, can location-specific data be assigned in Cosmos DB as part of its feature set, or is this something that has to be engineered on top?
For data sovereignty, you must engineer a data access layer across multiple Cosmos DB accounts. Cosmos DB by default will replicate your data across all regions within your account (which is not what you need).
While not specifically for this scenario, you can see a description of how to build such a layer here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
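While not an officially documented pattern, a minimal sketch of such a data access layer with the azure-cosmos Python SDK could look like the following; the account endpoints, keys, database/container names and the countryCode field are all hypothetical, and each account would be limited to the regions its jurisdiction allows:

    # Sketch only: route each write to a jurisdiction-specific Cosmos DB account.
    # Endpoints, keys, names and the countryCode field are hypothetical.
    from azure.cosmos import CosmosClient

    ACCOUNTS = {
        "DE": {"url": "https://people-germany.documents.azure.com:443/", "key": "<key>"},
        "US": {"url": "https://people-us.documents.azure.com:443/", "key": "<key>"},
    }
    _clients = {}

    def _container(country_code: str):
        if country_code not in _clients:
            cfg = ACCOUNTS[country_code]
            _clients[country_code] = CosmosClient(cfg["url"], credential=cfg["key"])
        return (_clients[country_code]
                .get_database_client("people")
                .get_container_client("profiles"))

    def save_person(person: dict) -> None:
        # The person's residency decides which account (and therefore which regions)
        # ends up holding the record.
        _container(person["countryCode"]).upsert_item(person)
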

BigQuery streaming best practice

I have been using Google BigQuery for some time now, loading data via file uploads.
As I get some delays with this method, I am now trying to convert my code to streaming.
Looking for the best solution here - what is the correct way of working with BQ:
1. Using multiple (up to 40) different streaming machines, or directing traffic to one or more endpoints to upload data?
2. Uploading one row at a time, or batching 100-500 events into a list and uploading it?
3. Is streaming the way to go, or should I stick with file uploads, in terms of high volumes?
Some more details:
- We are uploading ~1500-2500 rows per second.
- Using the .NET API.
- Data needs to be available within ~5 minutes.
I didn't find such a reference elsewhere.
The big difference between streaming data and uploading files is that streaming is intended for live data that is being produced in real time while being streamed, whereas with file uploads, you upload data that was stored previously.
In your case, I think streaming makes more sense. If something goes wrong, you only need to re-send the failed rows instead of the whole file, and it adapts better to the continuously growing data you seem to be getting.
The best practices in any case are:
Trying to reduce the number of sources that send the data.
Sending bigger chunks of data in each request instead of multiple tiny chunks.
Using exponential back-off to retry those requests that could fail due to server errors (These are common and should be expected).
There are certain limits that apply to Load Jobs as well as to Streaming inserts.
For example, when using streaming you should insert less than 500 rows per request and up to 10,000 rows per second per table.
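The question mentions the .NET API; purely as an illustration of the same pattern (batches of a few hundred rows plus exponential back-off), a sketch with the Python client might look like this, with a hypothetical table id:

    # Sketch: streaming inserts in batches with exponential back-off
    # (Python client; the project/dataset/table id is hypothetical).
    import random
    import time

    from google.cloud import bigquery

    client = bigquery.Client()
    TABLE_ID = "my-project.my_dataset.events"  # hypothetical

    def stream_batch(rows, max_retries=5):
        """Insert a batch of ~100-500 JSON rows, retrying transient failures."""
        for attempt in range(max_retries):
            try:
                errors = client.insert_rows_json(TABLE_ID, rows)
            except Exception:  # transient server/network error: back off and retry
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt + random.random())
                continue
            if errors:  # row-level errors (bad schema/values) are usually not retryable
                raise RuntimeError(f"insert errors: {errors}")
            return
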

How well does scriptDB work as substitute for storing data directly in Google Spreadsheets?

I want to retrieve data from Google Analytics API, create custom calculations and then push the aggregations to a Google Spreadsheets in order to reuse in Google Visualisation API app. My concern is that I'll hit the Spreadsheet cell quota very quickly with the raw data needed for the calculation.
I know scriptDB quota is 100MB but before I invest time and resources in learning how it works I'd like to get an idea whether it's feasible for storing raw analytics data (provided it's not too granular and it's just designed to answer specific questions) and how much of it I could realistically store in scriptDB (relative to spreadsheets) before I hit the quota.
Thanks
For bulk data access (e.g. reading a table for Visualization), a spreadsheet will have a speed advantage over ScriptDb (see "What is faster: ScriptDb or SpreadsheetApp?"). If you wish to support more sophisticated queries though, to "answer specific questions" as you mention, then ScriptDb will give you an edge, as query times vary with the number of results but should be unaffected by the query criteria themselves.
With data in a spreadsheet, you will be able to obtain a DataTable for Visualization with a single Range.getDataTable() operation. With ScriptDb, you will need to write a script to build your DataTable.
Regarding size constraints, it's not possible to really compare the two without knowing the size of your individual data elements. You're already aware of the general constraints:
Spreadsheet, 40K cells, but may hit (unspecified) size limit before that, depending on data element sizes.
ScriptDb, 50MB, 100MB or 200MB depending on account type. The number of objects that can be stored is affected by the complexity (depth) of the object and the size of the property names, and of course the size of data contained in the objects.
Ultimately, the question of which is best for your application is a matter of opinion, and of which factors matter most for the application. If the analytics data is tabular, then a spreadsheet offers advantages for implementation largely because of Range.getDataTable(), and is faster for bulk access. I'd recommend starting there, and considering a move to ScriptDb if and when you actually hit spreadsheet size or query performance limitations.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when rendering the page that displays them, you calculate them offline, either via a batch process at night or incrementally when a log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
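As a concrete illustration of the batch variant, a nightly job could roll raw hits up into one row per zone and day, so the reporting pages read a small summary table instead of the raw log. The sketch below uses sqlite3 for brevity; the hits, zone, hit_time and daily_zone_counts names are hypothetical:

    # Sketch of nightly batch aggregation, using sqlite3 and hypothetical table names.
    import sqlite3

    def aggregate_daily(db_path: str) -> None:
        with sqlite3.connect(db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS daily_zone_counts (
                    day   TEXT NOT NULL,
                    zone  TEXT NOT NULL,
                    hits  INTEGER NOT NULL,
                    PRIMARY KEY (day, zone)
                )
            """)
            # Recompute yesterday's totals from the raw hits table; the reporting
            # pages then query daily_zone_counts instead of scanning every hit.
            conn.execute("""
                INSERT OR REPLACE INTO daily_zone_counts (day, zone, hits)
                SELECT date(hit_time), zone, count(*)
                FROM hits
                WHERE date(hit_time) = date('now', '-1 day')
                GROUP BY date(hit_time), zone
            """)
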
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
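A sketch of such a periodic export, again with sqlite3 and hypothetical table names (hits in the transactional database, hits_reporting in the reporting database), copying only the rows above the highest id already exported:

    # Sketch of a periodic export job; table names and the watermark scheme are hypothetical.
    import sqlite3

    def export_new_hits(transactional_db: str, reporting_db: str) -> None:
        src = sqlite3.connect(transactional_db)
        dst = sqlite3.connect(reporting_db)
        try:
            last_id = dst.execute(
                "SELECT COALESCE(MAX(id), 0) FROM hits_reporting").fetchone()[0]
            rows = src.execute(
                "SELECT id, zone, hit_time FROM hits WHERE id > ?", (last_id,)).fetchall()
            with dst:  # one transaction for the whole batch
                dst.executemany(
                    "INSERT INTO hits_reporting (id, zone, hit_time) VALUES (?, ?, ?)", rows)
        finally:
            src.close()
            dst.close()
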
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index still has an entry for every row, so it grows with your data, whereas a query over a date range only touches the partitions for those days).
This makes both the online queries (ones issued when you hit your ASP page) and the aggregation queries you use to pre-calculate the necessary statistics much faster.
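For example, with PostgreSQL's declarative partitioning (other databases use different syntax), monthly range partitions on the hit timestamp could be declared as below; the table, columns and boundary dates are hypothetical. A query constrained to a date range then only scans the partitions that overlap it:

    # Sketch of PostgreSQL declarative range partitioning (PostgreSQL 10+ syntax);
    # table, column names and partition boundaries are hypothetical examples.
    import psycopg2

    DDL = """
        CREATE TABLE hits (
            id        bigserial,
            zone      text        NOT NULL,
            hit_time  timestamptz NOT NULL
        ) PARTITION BY RANGE (hit_time);

        CREATE TABLE hits_2024_01 PARTITION OF hits
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
        CREATE TABLE hits_2024_02 PARTITION OF hits
            FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """

    def create_partitioned_hits(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(DDL)
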

Resources