Corda performance lag on high-volume transactions

I ran a load test by sending a few million records over a period of 12 hours. Here is the analysis:
In hour 1, transaction commits are very fast, within a few hundred milliseconds. As the hours go by and the number of transactions committed to the Corda DB increases, the performance of the Corda node decreases.
After around 2 million transactions have been committed, the node slows to a few seconds per transaction. After a DB refresh of the nodes, i.e. resetting the DB to a version with no data, transactions execute within the milliseconds range again.
My questions are:
Does the MQ in the Corda node have an impact on this?
Is there a Corda query that is causing the drop in performance?
P.S.: I am working with the Corda 3.3 Enterprise version.

There are many factors to consider in this question. For example: how is your CorDapp written, what is the size of your node, are your flows linear transactions, etc. Furthermore, performance has been boosted significantly since Corda 4.x.
You can find more information on sizing and performance: https://docs.corda.r3.com/sizing-and-performance.html
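One concrete example of how CorDapp code can contribute to this: a vault query without paging has to materialise an ever-larger result set as the vault grows. A minimal sketch of a paged vault query, assuming the standard Corda vault-query API (QueryCriteria, PageSpecification) and generic ContractState results; the page size is an arbitrary example value:

import net.corda.core.contracts.ContractState;
import net.corda.core.contracts.StateAndRef;
import net.corda.core.node.ServiceHub;
import net.corda.core.node.services.Vault;
import net.corda.core.node.services.vault.PageSpecification;
import net.corda.core.node.services.vault.QueryCriteria;

public class PagedVaultQueryExample {
    private static final int PAGE_SIZE = 200;

    // Iterates the vault page by page instead of loading everything in one query,
    // which keeps the per-query cost bounded as the vault grows.
    public static void processUnconsumedStates(ServiceHub serviceHub) {
        QueryCriteria criteria = new QueryCriteria.VaultQueryCriteria(Vault.StateStatus.UNCONSUMED);
        int pageNumber = 1;
        Vault.Page<ContractState> page;
        do {
            PageSpecification paging = new PageSpecification(pageNumber, PAGE_SIZE);
            page = serviceHub.getVaultService().queryBy(ContractState.class, criteria, paging);
            for (StateAndRef<ContractState> stateAndRef : page.getStates()) {
                // process stateAndRef ...
            }
            pageNumber++;
        } while ((long) (pageNumber - 1) * PAGE_SIZE < page.getTotalStatesAvailable());
    }
}

Whether this is actually the bottleneck depends on the CorDapp; it is only one of the factors listed above.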

Related

AWS Neptune Gremlin query slowness on cold call

I'm currently running some queries with a big gap in performance between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a big amount of data, I expect the issue comes from the caching functionality of Neptune in its default configuration. I was not able to find any way to improve this behaviour through configuration and would be glad to have some advice on how to reduce the length of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution the CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly above 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 types of labels, and most of them are not used in the current query.
Query:
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel("record")
// Convert local id to Neptune id.
.out('local_id')
// Go to the tree parent link (either myself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children; this step loads a big number of nodes belonging to the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limit not reached; the result is between 80k and 100k nodes.
.limit(200000)
// Convert back to local ids for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture comprises a shared cluster "volume" (where all data is persisted and replicated six times across three Availability Zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for any newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. You can improve this by writing and issuing "prefetch" queries that pull in objects you think you might need in the near future.
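To make the "prefetch" idea concrete, here is a minimal sketch using the Apache TinkerPop Java driver. The endpoint and record ids are placeholders; the traversal simply walks the same labels the production query uses, so the relevant vertices and edges land in the buffer pool cache ahead of the latency-sensitive call:

import java.util.Arrays;
import java.util.List;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class NeptunePrefetch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: use your cluster's (reader) endpoint.
        Cluster cluster = Cluster.build("my-neptune-cluster.cluster-ro-xxxx.us-east-1.neptune.amazonaws.com")
                .port(8182)
                .enableSsl(true)
                .create();
        GraphTraversalSource g = AnonymousTraversalSource.traversal()
                .withRemote(DriverRemoteConnection.using(cluster));

        // Placeholder ids: the same record ids the real query will use.
        List<Object> recordIds = Arrays.asList("id-1", "id-2");

        // Walk the same path as the real query and just count, so the objects are read
        // from the cluster volume into the buffer pool cache before the real call.
        long warmed = g.V(recordIds.toArray())
                .hasLabel("record")
                .out("local_id")
                .bothE("tree_top_parent")
                .count()
                .next();
        System.out.println("Prefetched around " + warmed + " edges");

        cluster.close();
    }
}

Whether such a prefetch pays off depends on how much of the working set fits in the cache; it only moves the cost of the cold read to a moment you control.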
If you have a use case that is filling the buffer pool cache and you constantly see buffer pool cache misses (a metric visible in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (e.g. r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

Azure Data Explorer batching policy modifications

I have a huge amount of data flowing from Event Hub to Azure Data Explorer. Currently we have not made any modification to the batching policy, so it batches every 5 minutes. But we need to reduce this to a lower value so that the end-to-end lag is reduced.
How can I calculate the ideal batching time for this setup? Is there any calculation based on the CPU of ADX and the data ingestion on Event Hub, so that I can figure out an ideal time without affecting the CPU usage of ADX?
There is no tool or other functionality that allows you to do this today; you will need to try the desired setting for "MaximumBatchingTimeSpan" and observe the impact on CPU usage.
Essentially, if you are ingesting huge volumes of data (per table), you are probably not using the full 5-minute batching window, or can decrease it significantly without detrimental impact.
Please have a look at the latency and batching metrics for your cluster (https://learn.microsoft.com/en-us/azure/data-explorer/using-metrics#ingestion-metrics) and check a) whether your actual latency is below 5 minutes, which would indicate the batching is not driven by time, and b) which "batching type" your cluster most often enacts: time, size, or number of items.
Based on these numbers you can tweak down the time component of your ingestion batching policy.
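For reference, tweaking the time component comes down to altering the table's ingestion batching policy with a Kusto control command along these lines; the table name and the 30-second value are placeholders, and the other properties are shown only with example values:

.alter table MyTable policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'

Lowering MaximumBatchingTimeSpan trades end-to-end latency against more, smaller ingestion operations, which is why the CPU impact mentioned above should be watched after the change.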

Corda Ledger size questions

I have multiple questions about the disk size that Corda needs over time and could not find any information online.
How much disk space does a Corda transaction need?
How much disk space does Corda need over the course of 10 years with 4.5 million transactions per month on average (without attachments etc.)?
The size of a transaction is not fixed. It will depend on the states, contracts, attachments and other components used.
We do not have any rough guides currently, but we will likely be doing some tests shortly in the run-up to the release of Corda's enterprise version. This will give an idea of the storage requirements of running a node.
As was said, the answer is that it depends on the transaction size. The average Bitcoin transaction runs about 560 bytes, giving around 2,000 transactions per 1 MB block. Ethereum runs an average of about 2 KB per transaction, so it can store 500 per 1 MB block, and from the best numbers I can get, Hyperledger runs about 5 KB per transaction, or around 205 per block. Assuming Corda will be somewhere in this spectrum, and assuming you follow the less-is-more axiom (store as little as possible in the block, defer all else to a side DB or off-chain storage), let's choose something easy to calculate with: say Corda averages 1 KB per transaction, i.e. 1,000 transactions per MB. With the 1 KB size, multiply TPS * seconds of processing in a day * actual processing days per year to get your number. In your case, (4,500,000 * 1024 * 12 * 10) / (1024^3) should give you gigabytes (about 515 GB at a 1 KB transaction size).
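Worked out step by step with the numbers from the question (and the assumed 1 KB per transaction):

4,500,000 transactions/month * 1,024 bytes/transaction = 4,608,000,000 bytes/month
4,608,000,000 bytes/month * 12 months/year * 10 years = 552,960,000,000 bytes
552,960,000,000 bytes / 1024^3 ≈ 515 GB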
I tried the CorDapp example of an ultra-simple IOU transaction to measure this. A single IOU transaction contains the identities of two counterparties and one notary, and a single double value (requiring 8 bytes).
Looking at the database, I see that the serialised transaction requires 11 kB.
I am asking for alternative ways of serialisation in: Corda: Large serialized transaction size: Are there alternatives to current serialization design?

Is there a way to read from a DynamoDB stream with a fixed number of workers and leases without any issues?

I am continuously publishing data into DynamoDB with streams enabled. I am reading this stream using the DynamoDB adapter of the KCL.
I am using 1 KCL worker with 5 leases. At the time of creation my DynamoDB table had 1 partition (1 RCU and 999 WCU). As I keep publishing data into DynamoDB, the number of partitions grows, and so does the number of active shards. Reading is fine while the number of active shards is 5 or fewer. As soon as it crosses 5, the KCL is not able to read from one of the shards (the TPS drops).
Is there any config/parameter that I can set that will allow me to read from a growing number of shards using a fixed number of leases?
You're looking for the maxLeasesPerWorker property.
From the javadoc:
Worker will not acquire more than the specified max number of leases even if there are more shards that need to be processed. This can be used in scenarios where a worker is resource constrained or to prevent lease thrashing when small number of workers pick up all leases for small amount of time during deployment.
Make sure to take note of the warning in the javadoc as well:
Note that setting a low value may cause data loss (e.g. if there aren't enough Workers to make progress on all shards). When setting the value for this property, one must ensure enough workers are present to process shards and should consider future resharding, child shards that may be blocked on parent shards, some workers becoming unhealthy, etc.
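Assuming KCL 1.x with the DynamoDB Streams Kinesis adapter, the cap is set on the client library configuration; a minimal sketch, with the application name, stream ARN and worker id as placeholders:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

public class KclLeaseCapExample {
    public static KinesisClientLibConfiguration buildConfig() {
        return new KinesisClientLibConfiguration(
                "my-streams-app",                            // application name (placeholder)
                "arn:aws:dynamodb:us-east-1:111122223333:table/MyTable/stream/2020-01-01T00:00:00.000", // stream ARN (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                "worker-1")                                  // worker id (placeholder)
                // Cap the number of leases this worker will acquire, per the javadoc quoted above.
                .withMaxLeasesForWorker(5);
    }
}

In KCL 1.x the corresponding setter is withMaxLeasesForWorker; given the warning above, make sure other workers exist to pick up the shards this worker leaves unleased.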

Peak read capacity units for a DynamoDB table

I need to find out the peak read capacity units consumed in the last 20 seconds for one of my DynamoDB tables. I need to find this programmatically in Java and set an auto-scaling action based on the usage.
Please can you share a sample Java program to find the peak read capacity units consumed in the last 20 seconds for a particular DynamoDB table?
Note: there are unusual spikes in the DynamoDB requests on the database, hence the need for dynamic auto-scaling.
I've tried this:
DescribeTableResult result = DYNAMODB_CLIENT.describeTable(recomtableName);
Long readCapacityUnits = result.getTable()
        .getProvisionedThroughput().getReadCapacityUnits();
but this gives the provisioned capacity, whereas I need the consumed capacity for the last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use CloudWatch:
GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest()
metricStatisticsRequest.setStartTime(startDate)
metricStatisticsRequest.setEndTime(endDate)
metricStatisticsRequest.setNamespace("AWS/DynamoDB")
// Use 'ConsumedReadCapacityUnits' for the read side.
metricStatisticsRequest.setMetricName('ConsumedWriteCapacityUnits')
metricStatisticsRequest.setPeriod(60)
metricStatisticsRequest.setStatistics([
    'SampleCount',
    'Average',
    'Sum',
    'Minimum',
    'Maximum'
])
List<Dimension> dimensions = []
Dimension dimension = new Dimension()
dimension.setName('TableName')
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk))
dimensions << dimension
metricStatisticsRequest.setDimensions(dimensions)
client.getMetricStatistics(metricStatisticsRequest)
But I bet you'd get results that are older than 5 minutes.
Actually, the current off-the-shelf autoscaling uses CloudWatch. This has a drawback that is unacceptable for some applications.
When a spike load hits your table, the table does not have enough capacity to respond with. Reserving some headroom is not enough, and the table starts throttling. If records are kept in memory while waiting for the table to respond, this can simply blow up the memory. CloudWatch, on the other hand, reacts after some time, often when the spike is already gone; based on our tests it was at least 5 minutes, and it then raised the capacity gradually, when what was needed was to go straight up to the max.
Long story short: we created a custom solution with our own speedometers. It counts whatever it has to count and changes the table's capacity accordingly. There is still a delay, because
the app itself takes a bit of time to understand what to do, and
the DynamoDB table takes ~30 seconds to get updated with the new capacity details.
On top of that we also have a throttling detector, so if a write/read request gets throttled we immediately raise the capacity accordingly. Sometimes the level of capacity looks all right, but requests are still throttled because of a hot-key issue.
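A minimal sketch of that kind of throttling detector, using the AWS SDK for Java v1; the capacity step and ceiling are made-up values and the surrounding bookkeeping is omitted:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class ThrottlingDetector {
    private static final long READ_STEP = 100L;     // hypothetical increment per throttling event
    private static final long READ_CEILING = 2000L; // hypothetical upper bound

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Wraps a read; on throttling, immediately raises the table's provisioned read capacity
    // instead of waiting for CloudWatch-driven autoscaling to react.
    public void readWithReactiveScaling(GetItemRequest request, long currentRead, long currentWrite) {
        try {
            dynamo.getItem(request);
        } catch (ProvisionedThroughputExceededException e) {
            long newRead = Math.min(currentRead + READ_STEP, READ_CEILING);
            dynamo.updateTable(new UpdateTableRequest()
                    .withTableName(request.getTableName())
                    .withProvisionedThroughput(new ProvisionedThroughput(newRead, currentWrite)));
            // As noted above, the table still takes roughly 30 seconds to apply the new capacity,
            // and a hot partition key can keep throttling even when table-level capacity looks fine.
        }
    }
}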
