Load a huge amount of dummy data into a DynamoDB global table

I want to create a DynamoDB global table and load a large amount of data into it for testing purposes. The data is dummy data with no special requirements. The total table size will be about 70 GB.
I plan to use JMeter to run a Java sampler on an EC2 instance (with a DynamoDB VPC endpoint); the sampler will use BatchWriteItem to write the data to the table.
DynamoDB limits item size to 400 KB, so even with BatchWriteItem and multiple JMeter threads, loading 70 GB of data looks like it will take more than 24 hours.
Is there a better-performing way to load data into a DynamoDB table? Any comments are appreciated.
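For reference, a rough Python sketch of the kind of parallel batch-write load described above (the question uses a JMeter Java sampler, but the batching logic is the same; the table name, item shape, and thread count here are illustrative assumptions, not part of the original setup):

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "load-test-table"  # hypothetical table name

def write_chunk(start, count):
    # each worker uses its own table resource; batch_writer handles the
    # 25-item batching and resends unprocessed items automatically
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    with table.batch_writer() as batch:
        for i in range(start, start + count):
            batch.put_item(Item={
                "pk": f"user#{i}",
                "sk": "profile",
                "payload": "x" * 1000,  # dummy attribute to pad the item size
            })

# spread the key range across worker threads, analogous to multiple JMeter threads
with ThreadPoolExecutor(max_workers=16) as pool:
    for w in range(16):
        pool.submit(write_chunk, w * 100_000, 100_000)

Whatever the client, the write rate is ultimately bounded by the table's write capacity (or on-demand throughput limits), so raising parallelism only helps up to that ceiling.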

Related

DolphinDB Python API: Issues with partitionColName when writing data to dfs table with compo domain

It's my understanding that PartitionedTableAppender in the DolphinDB Python API can implement concurrent data writes. I'm trying to write data to a dfs table with a COMPO domain where the partitions are determined by the values of "datetime" and "symbol". The data I'd like to write includes records of 150 symbols for one day. This is what I tried:
But it seems only one partitioning column can be specified in partitionColName. Please let me know if I'm going about this the wrong way.
Just specify one partitioning column in this case, even though the table uses a COMPO domain. Based on the given information, it is suggested to set partitionColName to "symbol", which allows the writes to run concurrently. The script still works if you set it to "datetime", but the data cannot be written concurrently, because it contains only one day's records and therefore falls into a single partition.
Refer to the basic operating rules when you are using PartitionedTableAppender:
DolphinDB does not allow multiple writers to write data to one partition at the same time. Therefore, when multiple threads are writing to the same table concurrently, it is recommended to make sure each of them writes to a different partition. The Python API provides a convenient way to do this by dividing the data by partition automatically.
With DolphinDB server version 1.30 or above, we can write to DFS tables with the PartitionedTableAppender object in the Python API. The user needs to first specify a connection pool. The system obtains the partitioning information and assigns the partitions to the connections in the pool for concurrent writing. A partition can only be written to by one connection at a time.
Therefore, only one partitioning column needs to be specified even for a table with a COMPO domain. Choose a partitioning column with many distinct values so that the data is split into many partitions, which can then be distributed across the connections in the pool. That way the data is written concurrently to the DFS table.
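A minimal sketch of that usage with the DolphinDB Python API, assuming a local server with default credentials and an existing DFS database dfs://demo containing a table pt partitioned by a COMPO domain of datetime and symbol (all names and values here are illustrative):

import dolphindb as ddb
import pandas as pd

# a pool of 8 connections; partitions are distributed across the connections
pool = ddb.DBConnectionPool("localhost", 8848, 8, "admin", "123456")

# specify only the "symbol" partitioning column, even though the database
# uses a COMPO domain of datetime and symbol
appender = ddb.PartitionedTableAppender("dfs://demo", "pt", "symbol", pool)

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-02 09:30:00"] * 3),
    "symbol": ["AAA", "BBB", "CCC"],
    "price": [10.5, 20.1, 30.7],
})
rows_written = appender.append(df)  # rows are routed by symbol and written concurrently
pool.shutDown()

Because the data spans only one day, choosing "datetime" instead of "symbol" here would route everything to a single partition and serialize the writes, which is exactly the behaviour described above.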

Mule 4: Design: How to process data (files/database records) in Mule 4 without getting an "out of memory" error?

Scenario :
I have a database containing 100k records that take up about 10 GB when loaded into memory.
My objective is to
fetch these records,
segregate the data based on certain conditions
then generate CSV files for each group of data
write these CSV files to a NAS (storage drive accessible over the same network)
To achieve this, I am thinking of the design as follows:
Use a Scheduler component that triggers the flow daily (at 9 am, for example)
Use a database select operation to fetch the records
Use a batch processing scope
In the batch step, use the reduce function in a Transform Message component and segregate the data in the aggregator into a format like:
{
"group_1" : [...],
"group_2" : [...]
}
In the On Complete phase of batch processing, use a File component to write the data to files on the NAS drive
Questions/Concerns :
Case 1: The database select operation loads all 100k records into memory.
Question: How can I optimize this step so that I still process all 100k records but without a spike in memory usage?
Case 2: When segregating the data, I store the isolated data in the aggregator object built by the reduce operation, and that object stays in memory until I write it out to files.
Question: Is there a way I can segregate the data and write it directly to files in the batch aggregator step, then quickly free the memory used by the aggregator object?
Please treat this as a design question for Mule 4 flows. Thanking the community for your help and support.
Don't load 100K records into memory. Loading high volumes of data in memory will probably cause an out-of-memory error. You don't provide configuration details, but the Database connector 'streams' pages of records by default, so that part is taken care of. Use the fetchSize attribute to tune the number of records read per page; the default is 10. The batch scope uses disk space to buffer data, to avoid using RAM. It also has parameters to help tune the number of records processed per step, for example batch block size and batch aggregator size. With the default values you would be nowhere near 100K records in memory at once. Also be sure to control concurrency to limit resource usage.
Note that even with all of these settings tuned down, there will still be some spike while processing; any processing consumes resources. The idea is to have a predictable, controlled spike instead of an uncontrolled one that can exhaust the available resources.
The second question is not clear. You can't control the aggregator memory other than through the aggregator size, but it looks like it only keeps the most recent aggregated records, not all the records. Are you actually having problems with that, or is this a theoretical question?

Speeding up database lookups using externally provided statistics during initialization

I have a large table that I cannot fit in memory. I am considering moving the table to a database, for example SQLite. However, some of the lookups will be made very often, so those could stay in memory while the rest stays on the hard disk. I want to speed up the table lookups by supplying some statistics during initialization, so that the number of lookups that go to the hard disk is reduced. I have tried this with SQLite, but I am not able to supply it with a list of external statistics. Any suggestions?
Thank you,

The right way to make DynamoDB data searchable in Athena?

I need a slowly changing AWS DynamoDB table periodically dumped to S3 for querying with Athena. It needs to be ensured that the data available to Athena is not far behind what's available in DynamoDB (a maximum lag of 1 hour).
I am aware of the following two approaches:
Use EMR (from Data Pipeline) to export the entire DynamoDB table
The advantage of this approach is that with a single EMR script (run hourly), compressed Parquet files, which are directly queryable by Athena, can be dumped to S3. However, a big disadvantage is that even though only a small number of records change in an hour, the entire table has to be exported, requiring significantly higher read capacity in DynamoDB and larger EMR resources.
Use DynamoDB Streams to reflect any changes in DynamoDB on S3.
This has the advantage of not needing to process unchanged data, which avoids the need for read capacity significantly above what's needed in normal operations. However, a follow-up script (probably another EMR job) would be needed to consolidate the per-record files generated by DynamoDB Streams; otherwise Athena's performance is severely impacted by the large number of files.
Are there any other approaches which can do better than these?
I think the best solution from a performance/cost perspective would be to use DynamoDB Streams and a Glue Job to consolidate the files once a day (or once a week, depending on your data's velocity).
One downside in the DynamoDB Streams approach (and all solutions that read data incrementally) is that you have to handle the complexity of updating/deleting records from a Parquet file.
If your load is not exclusively "appending" new data to the table you should write any updated/deleted item somewhere (probably a DynamoDB Table or an S3 file) and let Glue delete those records before writing the consolidated file to S3.
All the actors will be:
a Lambda processing streams that should:
write newly added items to Parquet files in S3,
write updates (even PutItem on an existing item) and deletions to a DynamoDB table;
a Glue Job running less frequently that should:
consolidate the many small files created by the Lambda into fewer, bigger Parquet files,
merge all update/delete operations logged in the DynamoDB table into the resulting Parquet.
This is much more effort than using EMR to dump the full table hourly: you should judge for yourself whether it is worth it. :)
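A rough Python sketch of the stream-processing Lambda described above, assuming illustrative bucket and table names and writing new items as raw JSON objects (the conversion to Parquet could happen here or be left to the Glue job):

import json
import boto3

s3 = boto3.client("s3")
change_log = boto3.resource("dynamodb").Table("table-change-log")  # hypothetical log table

BUCKET = "athena-staging-bucket"  # hypothetical bucket

def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        if record["eventName"] == "INSERT":
            # write the new item (still in DynamoDB's typed JSON form) to S3;
            # the Glue job later consolidates these small files
            s3.put_object(
                Bucket=BUCKET,
                Key=f"incoming/{record['eventID']}.json",
                Body=json.dumps(record["dynamodb"]["NewImage"]),
            )
        else:
            # MODIFY / REMOVE: log the key so Glue can drop stale rows before consolidating
            change_log.put_item(Item={
                "pk": keys["pk"]["S"],  # assumes a string partition key named "pk"
                "event": record["eventName"],
            })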
I have implemented the following data pipeline for real-time data streaming to Athena:
DynamoDB -> DynamoDB Streams -> Lambda -> Kinesis Firehose -> S3 <- Athena
Do let me know if you have any doubts.
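A minimal sketch of the Lambda-to-Firehose hop in that pipeline, assuming a delivery stream named athena-delivery-stream (Firehose then buffers the records and writes batches to S3):

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # forward each stream record to Firehose; put_record_batch accepts up to
    # 500 records per call, so very large stream batches would need chunking
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode("utf-8")}
        for r in event["Records"]
    ]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="athena-delivery-stream",  # hypothetical name
            Records=records,
        )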

Require advice on Time Series Database solution that fits with Apache Spark

For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this with the database (via cURL/HTTP, or a Go API if one exists).
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list) but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases which I have looked at briefly (InfluxDB, RiakTS and Prometheus) seem to be optimised more for handling metric data.
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. - Apache Spark Website
In addition, the time-series database should store the data in a fashion that exposes it directly to Spark (as this is time-series data, it should be immutable and therefore satisfies the requirements of an RDD, so it can be loaded natively by Spark into the cluster).
Once the data (or a reduced-dimensionality version with the categorical elements dropped) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality).
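As a rough illustration of that modelling step (the question plans to use sparklyr from R, but the flow is the same through PySpark; the data source, column names and model choice below are assumptions for the sketch):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ts-regression").getOrCreate()

# load the month's data; a Parquet dump stands in for whichever TSDB connector is chosen
df = spark.read.parquet("s3a://example-bucket/timeseries/month=2017-01/")  # hypothetical path

# keep the numeric features only, as the question suggests dropping the categorical columns
feature_cols = ["f1", "f2", "f3"]  # illustrative column names
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="target").fit(train)
print(model.evaluate(test).rootMeanSquaredError)

sparklyr exposes the same MLlib functionality, so the choice of language is mostly a matter of preference.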
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with a columnar store and a time-series database; please do not recommend an SQL/relational DBMS)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.
