Separation of clusters for ingestion and export - azure-data-explorer

Considering that a cluster is dominated by ingestion processes in terms of memory and cpu usage , is it better to have a separate follower cluster dedicated to only export? The use case is to export huge amount of data out of an ADX cluster by letting all the nodes participate in export. In other words, is there any disadvantage in using a follower cluster for export that the leader cluster itself? Or it will be a better strategy to simply scale up/out the main(leader) cluster itself for facilitating heavy export without having to do it through a follower cluster ? What is the best way to optimize export in this case? The export is to an external table which points to a storage in the same region as the cluster.

I suggest scaling up/out the existing cluster, instead of creating a follower cluster. It will allow you easier management and you'll pay less.
To have efficient export, the recommendation is to export into parquet format, and use the useNativeParquetWriter flag, see more details here.

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one and it appears mostly to be working using a single-table design. We have, however a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB or is this a case where full scans would be required and we would just be better off keeping the log stuff as a relational table?
Thanks in advance,
Nik
Those keywords and phrases "log" and "queried on many attributes" sound to me like DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well-known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. And you could use secondary or global secondary indexes for queries involving other attributes similarly.
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test your access patterns are fulfilled.
Access Parterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading, it will allow you to group the data better and make your access patterns more efficient.
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB and many many AWS customers use it for that sole purpose. But I can't over emphasize the importance of understanding your access patterns and working backwards from those to create your model.
Alternatives
If you require query capability or free text search ability, then you could use DynamoDB integrations with OpenSearch (via Lambda/EventBridge) for example, with OpenSearch providing you the flexibility for your queries.
Doesn't seem like a good use case - I have done it and wasn't at all happy with the result - now I load 'log like' data into elasticsearch and much happier with the result.
In my case, I insert the data to dynamodb - to archive it - but also feed data in ES, but once in a while if I kill my ES cluster, I can reload all or some of the data from ddb.

Estimate duration of DynamoDB data export via Data Pipeline

My DynamoDB table has around 100 million (30GB) items and I provisioned it with 10k RCUs. I'm using a data pipeline job to export the data.
The DataPipeline Read Throughput Ratio set to 0.9.
How do I calculate the time for the export to be completed (The pipeline is taking more than 4 hrs to complete the export)
How can I optimize this, so that export completes in less time.
How does the Read Throughput Ratio relate to DynamoDB export?
The answer to this question addresses most of your questions in regard to estimating the time for the Data Pipeline job to complete.
There is now a much better solution to export data from DynamoDB to S3, which was announced in November 2020. There is now a way to do that from DynamoDB directly without provisioning an EMR Cluster and tons of RCUs.
Check out the documentation for: Exporting DynamoDB table data to Amazon S3

Should bulk data be included in the graph?

I have been using ArangoDB for a while now for smaller system requirements and love it. We have recently been tasked by a client to analyze a large amount of financial data which is currently housed in SQL but I was hoping to more efficiently query the data in ArangoDB.
One of the more simplistic requirements is to rollup gl entry amounts to determine account totals across their general ledger. There are approximately 2200 accounts in their general ledger with a maximum depth of approximately 10. The number of gl entries is approximately 150 million and I was wondering what the most efficient method of aggregating account totals would be?
I plan on using a graph to manage the account hierarchy/structure but should edges be created for 150 million gl entries or is it more efficient to traverse the inbound relationships and run sub queries on the gl entry collections to calculate total the amounts?
I would normally just run the tests myself but I am struggling with simply loading the data in my local instance of arango and thought I would get some insight while I work at loading the data.
Thanks in advance!
What is the benefit you're looking to gain by moving the data into a graph model. If it's to build connections between accounts, customers, GL's, and such, then it might be best to go with a hybrid model.
It's possible to build a hierarchical graph style relationship between your accounts and GL's, but then store your GL entries in a flat document collection.
This way you can use AQL style graph queries to quickly determine relationships between accounts and GLs. If you need to SUM entries in a GL, then you can have queries that identify the GL._id's and then sum the flat collections that have foreign keys that reference the GL._id they are associated with.
By adding indexes on your foreign keys you will speed up queries, and by using Foxx Micro Services you can provide a layer of abstraction between a REST style query and the actual data model you are using. That way if you find you need to change your database model under the covers, by updating your Foxx MicroServices the consumer doesn't need to be aware of those changes.
I can't answer your question on performance, you'll just need to ensure your hardware is appropriately spec'ed.

Riak and time-sorted records

I'd like to sort some records, stored in riak, by a function of the each record's score and "age" (current time - creation date). What is the best way do do a "time-sensitive" query in riak? Thus far, the options I'm aware of are:
Realtime mapreduce - Do the entire calculation in a mapreduce job, at query-time
ETL job - Periodically do the query in a background job, and store the result back into riak
Punt it to the app layer - Don't sort at all using riak, and instead use an application-level layer to sort and cache the records.
Mapreduce seems the best on paper, however, I've read mixed-reports about the real-world latency of riak mapreduce.
MapReduce is a quite expensive operation and not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode where the number of concurrent mapreduce jobs can be controlled, and I would therefore not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice as described in the second option could work and allow efficient access to the prepared data through direct key access. The aggregation process could, if you are using leveldb, be based around a secondary index holding a timestamp. One downside could however be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possibly, preferably through direct key access, and then perform filtering of data that is not required as well as sorting and aggregation on the application side.

getting data on hourly basis from couchdb with millions of objects

I have a couchdb database setup at a AWS EC2 medium on-demand instance, there are around 4 million objects in it, with growing rate of around 100 objects per second.
I want to write some map/reduce queries on top of it, but it takes forever for my map jobs to complete.
SO I am wondering if i should copy over the data to some other machine, and delete all data here on the master machine, keeping it clean, and i should rather write my map jobs on the second instance where the data is copied; i am also thinking of moving over this data to a s3 instance, and keep just one week's data here.
Am i thinking in right direction
Unfortunately for such a big database you can only use build-in reduce functions:
_sum
_count
_stats
These functions works MUCH faster than javascript ones. And this is the only possible option for huge databases.
http://wiki.apache.org/couchdb/Built-In_Reduce_Functions
You could write your own View Server or use one of the available implementations to test if that helps with performance.

Resources