How to handle large Vault Size in Corda? - bigdata

The data in our vault is manageable. Eventually, we would accumulate a large volume. It is not possible to retain such large data for every day transactions. We would want to periodically archive or warehouse the data, so that query performance is maintained.
May I know if you have thought about handling large scale datasets and what would be your advise.

From the corda-dev mailing list:
Yep, we should do some design work around this. As you note it’s not a pressing issue right now but may become one in future.
Our current implementation is actually designed to keep data around even when it’s no longer ‘current’ on the ledger. The ORM mapped vault tables prefer to mark a row as obsolete rather than actually delete the data from the underlying database. Also, the transaction store has no concept of garbage collection or pruning so it never deletes data either. This has clear benefits from the perspective of understanding the history of the ledger and how it got into its current state, but it poses operational issues as well.
I think people will have different preferences here depending on their resources and jurisdiction. Let’s tackle the two data stores separately:
Making the relationally mapped tables delete data is easy, it’s just a policy change. Instead of marking a row as gone, we actually issue a SQL DELETE call.
The transaction store is trickier. Corda benefits from its blockless design here; in theory we can garbage collect old transactions. The devil is in the details however because for nodes that use SGX the tx store will be encrypted. Thus not only do we need to develop a parallel GC for the tx graph, but also, run it entirely inside the enclaves. A fun systems engineering problem.
If the concern is just query performance, one obvious move is to shift the tx store into a scalable K/V store like Cassandra, hosted BigTable etc. There’s no deep reason the tx store must be in the same RDBMS as the rest of the data, it’s just convenient to have a single database to backup. Scalable K/V stores don’t really lose query performance as the dataset grows, so, this is also a nice solution.
W.R.T. things like the GDPR, being able to delete data might help or it might be irrelevant. As with all things GDPR related nobody knows because the EU didn’t bother to define any answers - auditing a distributed ledger might count as a “legitimate need” for data, or it might not, depending on who the judge is on the day of the case.
It is at any rate only an issue when personal data is stored on ledger, which is not most use cases today.

Related

Storing data on edges of GraphDB

It's being proposed that we store a data about a relationship between two vertices on the edge between them. The idea would be that these two vertices are related and there are user level pieces of information that are looking to be stored in graph. The best example I can think of would be a Book, and a Reader, and the Reader can store cliff notes on the edges for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living in edges and that a vast majority of GraphDB data be derived data, rather than using it as an actual data store. Given that its in memory, what happens when it goes down? (We're using Neptune so.. there are technically backups).
Sorry if the question is a bit vague, but I'm not sure else how to ask. I've googled around looking for best practices and its all pretty generic data related to the concepts and theories of graph db.
An additional question, is it common practice to expose the gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without too much additional detail it is hard to provide exact modeling advice , but in general one of the advantages of using a graph databases is that edges are first class citizens and allow for properties on edges. A common use case for this would be something like PERSON - purchases -> Product where you might have a purchase_date on the purchases edge to represent the date of the purchase, as someone might buy the same thing multiple times.
I am not sure what exactly you mean by that a vast majority of GraphDB data be derived data as you can use graphs to derive and infer data/relationships based on the connections but they do fully support storing data in them as well.
Given that its in memory, what happens when it goes down? - Amazon Neptune (and most other DBS) use a buffer cache to store some data in memory, but that data is also persisted to disk, so if the instance goes down, there is no problem with recovering it from the durable storage.
An additional question, is it common practice to expose the gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be transparent to the users. They should be interacting with an interface like REST/GraphQL that is designed to answer business related questions and not really know or care that there is a graph database backing those requests.

Firestore database model for Notion-like modules [duplicate]

I have seen videos and read the documentation of Cloud firestore, from Google Firebase service, but I can't figure this out coming from realtime database.
I have this web app in mind in which I want to store my providers from different category of products. I want perform a search query through all my products to find what providers I have for such product, and eventually access that provider info.
I am planning to use this structure for this purpose:
Providers ( Collection )
Provider 1 ( Document )
Name
City
Categories
Provider 2
Name
City
Products ( Collection )
Product 1 ( Document )
Name
Description
Category
Provider ID
Product 2
Name
Description
Category
Provider ID
So my question is, is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the realtime database, using the provider ID I could search for that provider in the providers section, but with Firestore I am not sure if its possible or if this is right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the solution that fits your needs and makes your job easier. Bear also in mind that there is also no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app, may be insufficient for another app. So there is not a correct solution for everyone. An effective structure for a NoSQL type database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first one would be to keep a reference of the provider in the product object (as you already do) or to copy the entire provider object within the product document. This last technique is called denormalization and is a quite common practice when it comes to Firebase. So we often duplicate data in NoSQL databases, to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing that needs to keep in mind. In the same way, you are adding data, you need to maintain it. In other words, if you want to update/delete a provider object, you need to do it in every place that it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to duplicate or simply keep it as references:
Is the static or will it change over time?
If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync in the same time, then you might have a hard time in the future keeping all those duplicates up to date. This will also might imply you spend a lot of money keeping all those documents fresh, as it will require a read and write for each document for each change. In this case, holding only references will be the winning variant.
In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.
If you need your queries to be very fast, you may want to prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to depend on local client caches makes this cheaper, depending on the data the client has to read.
In this approach, you duplicate all data for a provider for each product document. This means that the code to write this data is more complex, and you're definitely storing more data, one more provider object for each product document. And you'll need to figure out if and how to keep up to date on each document. But on the other hand, reading a product document now gives you all information about the provider document in one read.
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches, and neither of them is pertinently better than the other. It all depends on what your use-cases are and how comfortable you are with this new technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately in return, you have a more complex update and higher storage/memory usage. But you need to note that extra calls in Firebase real-time database, are not expensive, in Firestore are. How much duplication data versus extra database calls is optimal for you, depends on your needs and your willingness to let go of the "Single Point of Definition mindset", which can be called very subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also, that some database structures are easier to be protected with some security rules. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.

Firebase Fan out - the most cost effective way?

I know this issue may have been raised multiple times but I have read on most of the questions available but did not found any that can exactly help to answer my question. As proposed by the Firebase team the fan out technique is the recommended way to ensure fast data read, but with the cost of data duplication. I know this question is subjective and depends on the application, but which is the best solution in terms of cost saving($) and data read?
Post same node in multiple child (save data read only called once,
but have redundant, so consume more Firebase storage) (see image Firebase Database - the "Fan Out" technique)
Post only one node, and other reference to the node by its key (not redundant and consume less Firbase storage, but need to read twice - get the key, and get the node for the key) (see image https://stackoverflow.com/a/38215398/1423345)
For context, I am building a non profit marketplace app, so I need to apply the best solution in terms of balancing both between cost saving ($) and fast data read.
On the other hand, read twice (bandwidth) vs bigger storage? Which one is more cost effective?
I would start by saying that ideally in Firebase you read or sync only what's necessary. So your database queries are coupled by other filters to make the query as specific as possible. If you can nail that then you will anyway build a very intelligent data structure which will be cost effective.
Now the real debate Fan - Out technique or just post reference to the nodes. As I personally prefer Fan-Out and also use it successfully so I will answer in reference to that technique only which will also give you indications of the reason that make me not wanna use keeping a reference and all.
First and foremost thing is end-user experience and performance. Which comes in the form of the Big Data Chunk Synchronization. Well in general it means that instead of downloading small chunks you aim for the biggest possible so that you reduce High Cell radio usage, High Battery Drain, High bandwidth and also keep the app updated and in sync as fast as possible.
If you aim for that kind of app performance then you clearly see that Fan-Out is the clear winner over other technique due to following reasons.
You download A Big Data Chunk stored in other node which doesn't let your cell radio stay on for long.
As you download whole info at once, your app performs better than others. Obviously by whole I don't mean that you should download full database. It's all about that smart balance which makes you download just what is required in first go.
It's not that this is the only technique which will give you faster reads and better data structure. There are other techniques like indexing, data validation and security rules which are equally important. All coupled up properly with correct data structure will give you far better performance.
In a situation where you have just a reference to other node and not actual data, then you might end up in a situation where you don't actually have anything to show to your users. Let's say your users aren't getting good connectivity so after one read which gave you just the reference, the network falls. So till the network is up again your users don't see anything and trust me that is a very bad situation for the app. Your aim as a developer should be to reduce the chances of those situations
So, I would recommend you to go for FAN - OUT technique as it is faster and cost effective when you see other factors like data filtering, indexing and security rules as well. Yes it comes with a slight price of high storage usage. But what does a less storage mean when you don't have happy users ? Still it all comes down to personal preference. But I have shared my experience and thoughts hope it helps you make right decision.
I would encourage you to got through this and have a more deeper understanding of no SQL Data modelling
Do let me know if this info helped you.

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When season becomes 'old' and messages no longer needed, table is deleted
Is it a good pattern and does it encouraged by Amazon?
ps: I'm asking, because I'm afraid of two things, met in different Amazon services -
In Amazon S3 you have to delete each item before you can fully delete bucket. When you have billions of items, it becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour'. When using SQS api you can act badly regarding SQS infrastructure (for example not polling messages) and thus could be penalized for it.
Yes, this is an acceptable design pattern, it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a table a DynamoDB table that still contains records, if you have a large number of records you have to regularly delete this is actually a best practice by using a rolling set of tables
Creating and deleting tables is an asynchronous operation so you do not want to have your application depend on the time it takes for these operations to complete. Make sure you create tables well in advance of you needing them. Under normal circumstances tables create in just a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size of how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes) but as long your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.

De-normalize live data for the sake of reports - Good or Bad?

What are the pros/cons of de-normalizing an enterprise application database because it will make writing reports easier?
Pro - designing reports in SSRS will probably be "easier" since no joins will be necessary.
Con - developing/maintaining the app to handle de-normalized data will become more difficult due to duplication of data and synchronization.
Others?
Denormalization for the sake of reports is Bad, m'kay.
Creating views, or a denormalized data warehouse is good.
Views have solved most of my reporting related needs. Data warehouses are great when users will be generating reports almost constantly or when your views start to slow down.
This is why you want to normalize your database
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
—E.F. Codd, "Further Normalization of the Data Base Relational Model" via wikipedia
The only time you should consider de-normaliozation is when the time it takes the report to generate is not acceptable. De-normalization will cause consistentcy issues that are sometimes impossible to determine especially in large datasets
Don't denormalize just to get rid of complexity in reporting, it can cause huge problems in the rest of the application. Either you don't enforce the rules resulting in bad data or if you do then inserts, deletes and updates can be seriously slowed for everyone not just the two or three people who run reports.
If the reports truly can't run well, then create a data warehouse that is denormalized and populate it in a nightly or weekly feed. The kind of reports that typically need this do not generally care if the data is up-to-the minute as they are usually monthly, quarterly, or annual reports that process (and especially aggregate) large amounts of data after the fact.
You can do both... let the normalized database for applications.
Then create a denormalized database for reports, and create an application which regulary copy data from one database to the other.
After all, reports don't always need to have the latest updated data, most of the time you can easily launch an update every 1 hour on the reporting database, and only once a day day.
Beyond the data warehouse and views solutions provided in other answers, which are good in some ways, if you are willing to sacrifice some performance to get a good to the last second data, but still want a normalized database, you could use on Oracle a Materialized View with fast refresh on commit, or in Sql Server, you could use clustered indexes for a view.
Another Con is that the data is likely not to be real-time as there is some time moving around the data to go from a normalized form to a de-normalized. If someone wants the report to be up to the very second it was requested, that can be tough to do in this situation.
If this is a duplication of the synchronization in the original post, sorry I didn't quite see it that way.

Resources