Debezium initial data snapshot and related entities order - constraints

After the first launch, Debezium will take an initial snapshot of the already existing data.
Let's say I have two tables, A and B. Table B has a NOT NULL FK constraint referencing A. With Debezium's default approach, Debezium will create two separate Kafka topics for the data from tables A and B.
As I understand it, there is a very good chance that the sink will try to create a record in the new table B while the corresponding record is not yet present in the new table A. That way I'll run into a constraint violation error.
Do I need to use some internal third-party buffer and organize the proper order of inserts into the sink database myself, or is there some standard mechanism in Debezium to handle such situations?
For example, can I use Debezium Topic Routing (https://debezium.io/documentation/reference/configuration/topic-routing.html) to fix this issue? I could configure Topic Routing to send all dependent events (from tables A and B in my example above) to the same topic. With a Kafka topic that has a single partition, all events are totally ordered. Will that work, and will it give me the correct order of related entities for the initial snapshot data load?
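Concretely, I mean something like this routing configuration (a sketch only; the server/schema names are made up and the regex would need to match the topic names Debezium actually generates):

# Topic routing SMT: merge the change topics for tables A and B into one topic.
transforms=Reroute
transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
transforms.Reroute.topic.regex=dbserver1.inventory.(A|B)
transforms.Reroute.topic.replacement=dbserver1.inventory.a_and_b
# key.enforce.uniqueness (default true) adds a field identifying the source
# table to the record key, so rows from A and B don't collide on the same key.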

The IBM IDR (Data Replication) product solves this with a mechanism that provides exactly-once semantics and re-creates both the ordering of operations within a transaction and the ordering of transactions.
Kafka's built-in exactly-once features have some limitations beyond performance: you don't inherently get the transaction re-ordered by operation, which is important for things like applying changes under referential integrity constraints.
So in our product we have a proper way and a poor man's way to solve the problem. The poor man's way is to send the data for all the tables to a single topic. Obviously this is sub-optimal, but our product will produce data in operation order from a single producer if you do this. You'd probably also want producer idempotence to avoid batches showing up out of order.
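On the plain Kafka side, the relevant producer settings look roughly like this (generic Kafka producer configuration, not specific to our product):

# Idempotent, in-order delivery from a single producer.
enable.idempotence=true
acks=all
max.in.flight.requests.per.connection=1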
Now the pro-level way to solve this is a feature called the TCC (Transactionally Consistent Consumer).
I'm not sure whether you need an enterprise-level solution, performance- and feature-wise.
If this is a non-critical project, you might find the following talk useful for understanding how we approach delivering the features you're looking for.
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
And here's our docs on the feature for reference.
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkatcc.html
Hopefully that gives some background on why this problem is hard to solve and what goes into a solution.

Related

How to find which kinds are not being used in Google Datastore

Is there any way to list the kinds that are not being used by our App Engine app in Google's Datastore, without having to look into our code and/or logic? :)
I'm not talking about indexes, which I can list by issuing
gcloud datastore indexes list
and then comparing with datastore-indexes.xml or index.yaml.
I tried checking the Datastore kind statistics and other metadata, but I could not find anything useful to help me on this matter.
Should I give up on finding ways for Datastore to provide me useful stats, and instead code something that keeps collecting Datastore statistics (like data size) over a long period so I have at least a clue of which kinds are not being used, and only after that research look into our app code to see whether the kind's model was removed?
Example:
select bytes from __Stat_Kind__
Store the result somewhere and keep updating it for a period. If a kind's bytes size does not change, then that kind is probably not being used anymore.
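Something like this is what I have in mind (a sketch using the google-cloud-datastore Java client; kind_name and bytes are the documented __Stat_Kind__ properties, and where you store the output is up to you):

// Sketch: read the per-kind statistics entities and record their sizes.
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;

public class KindSizeSnapshot {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        Query<Entity> query = Query.newEntityQueryBuilder()
                .setKind("__Stat_Kind__")
                .build();
        QueryResults<Entity> stats = datastore.run(query);
        while (stats.hasNext()) {
            Entity stat = stats.next();
            // Print "kind,bytes" lines; persist them wherever suits you.
            System.out.printf("%s,%d%n",
                    stat.getString("kind_name"), stat.getLong("bytes"));
        }
    }
}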
The idea is to do some cleaning in Datastore.
I would like to find which kinds are not being used anymore, or haven't been used for a long time, or were created manually to be used once... You know, like a table in Oracle that nobody knows the purpose of, and when you look into the statistics of that table you see it was only used once, five years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were last used a while ago, then ask around and back them up / delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and to instill an organizational practice that requires this documentation to be maintained in the future as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp entities, and keys (rightly) aren't incremental. So there appears to be no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only give you a definitive answer (and identify owners), it also pays off a significant technical debt that has been incurred, and it heads off this problem and probably similar future requirements (e.g. GDPR-like ones) that will arise.
Assuming you are referring to records being created/updated, I can think of the following options:
Via the Cloud Console (Datastore > Dashboard) - this lists all your kinds and the number of records in each kind. Theoretically, you can take a screenshot and compare the counts later, so you know which kinds have grown and which have not.
Use of Created/LastModified date columns - I usually add these two columns to most of my Datastore tables. If you have them, then you can have a stored function that queries them. For example, for each kind you run a query sorted in descending order of creation (or last-modified) date and pull only the first record; that tells you the last time a record in that kind was created or modified (sketched in code below).
I would write such a function as part of my app, put it behind a page that requires admin privileges (only the app creator can run it), and then just clicking a link in my app would give me the information.
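That function could look roughly like this (a sketch with the google-cloud-datastore Java client; the kind and the lastModified property are just examples, and ordering on a property requires it to be indexed):

// Sketch: return the most recent lastModified timestamp for one kind.
import com.google.cloud.Timestamp;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.OrderBy;

public class LastWriteCheck {
    static Timestamp lastWrite(Datastore datastore, String kind) {
        QueryResults<Entity> results = datastore.run(
                Query.newEntityQueryBuilder()
                        .setKind(kind)
                        .setOrderBy(OrderBy.desc("lastModified"))
                        .setLimit(1)
                        .build());
        return results.hasNext() ? results.next().getTimestamp("lastModified") : null;
    }

    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        System.out.println("Task last written: " + lastWrite(datastore, "Task"));
    }
}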

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. If you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create one overloaded GSI or two non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea, in a nutshell, is to avoid the overhead of doing joins at the database layer, or of having to go back to the database to effectively do the join at the application layer. By having the data already sliced in the format your application requires, all you really need to do is basically one "select * from table where x = y" call, which returns multiple entity types in one request (in your example that could be Users and Documents). This means it will be extremely efficient and scalable at the db level. But it also means you'll be less flexible, since you need to know the access patterns in advance and model your data accordingly.
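For example, that "one call" against an overloaded GSI might look roughly like this (a sketch using the AWS SDK for Java v2; the table, index and key attribute names are made up):

// Sketch: one Query against an overloaded GSI returns every item that shares
// the partition key value, regardless of entity type (user, documents, ...).
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class OverloadedGsiQuery {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        QueryRequest request = QueryRequest.builder()
                .tableName("AppTable")      // made-up table name
                .indexName("GSI1")          // made-up overloaded index
                .keyConditionExpression("GSI1PK = :pk AND begins_with(GSI1SK, :prefix)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("USER#123").build(),
                        ":prefix", AttributeValue.builder().s("PUBLISHED#").build()))
                .build();
        QueryResponse response = ddb.query(request);
        // Each returned item can be a different entity type; dispatch on its attributes.
        response.items().forEach(System.out::println);
    }
}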
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that aren't called out, which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons why you'd want to go with a single table: it allows you to keep your infrastructure somewhat simple, since you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be that it comes down to the cost of storage and provisioned throughput.
Apart from that, I'm not sure it matters much with the new limit of 20 GSIs per table.

Traversing the Ledger

I have a cordapp set up that uploads an attachment with each transaction. The attachment is a zipped file of a list of unique identifiers related to the tx. I am trying to implement logic that forbids the same unique identifier to appear again in a subsequent transaction. Let's say I have an initial tx with an attachment listing A,B,C,D,E and it passes. Then I have Tx 2a with attachment F,G,H and Tx 2b with attachment C,F,G,H. I would want 2a to be accepted but 2b to be rejected.
I'm trying to figure out the best way to store and query the history of identifiers. I know that the attachment will be saved to the tx history, but traversing the ledger and opening/reading all attachments to ensure there are no duplicates seems extremely intensive as we scale (the attachments are more likely to list thousands of unique identifiers rather than 5).
Is it practical to create a table on the db - perhaps even the off-ledger portion of the vault - that just contains all of the ids that have been used? The node responsible for checking redundancy could read the incoming attachment, query the table, check redundancy, sign the tx, and then insert the new ids into the table? Or is there something better we can do that involves actually traversing the ledger?
Thank you
Assuming there are not millions of identifiers, and you don't mind all of the past identifiers being in the current version of the state, you can accumulate them inside the state, in a Set. The Set will ensure there are no dupes, and the benefit of this approach is that you can then perform the checking logic inside the contract.
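Very roughly, the shape could be something like this (a sketch only; the class and field names are invented, and for simplicity it assumes the identifiers added by the transaction are carried in the command rather than parsed out of the attachment):

// Sketch: accumulate identifiers in the state and reject dupes in the contract.
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import net.corda.core.contracts.CommandData;
import net.corda.core.contracts.Contract;
import net.corda.core.contracts.ContractState;
import net.corda.core.identity.AbstractParty;
import net.corda.core.identity.Party;
import net.corda.core.transactions.LedgerTransaction;

class IdRegistryState implements ContractState {
    private final Party owner;
    private final Set<String> usedIds;   // every identifier seen so far

    IdRegistryState(Party owner, Set<String> usedIds) {
        this.owner = owner;
        this.usedIds = Collections.unmodifiableSet(new HashSet<>(usedIds));
    }

    Set<String> getUsedIds() { return usedIds; }

    @Override
    public List<AbstractParty> getParticipants() {
        return Collections.singletonList(owner);
    }
}

public class IdRegistryContract implements Contract {
    // Command carrying the identifiers added by this transaction.
    public static class Add implements CommandData {
        public final Set<String> newIds;
        public Add(Set<String> newIds) { this.newIds = newIds; }
    }

    @Override
    public void verify(LedgerTransaction tx) {
        IdRegistryState input = tx.inputsOfType(IdRegistryState.class).get(0);
        IdRegistryState output = tx.outputsOfType(IdRegistryState.class).get(0);
        Add add = tx.commandsOfType(Add.class).get(0).getValue();

        // Reject the transaction if any identifier has been seen before.
        if (!Collections.disjoint(input.getUsedIds(), add.newIds))
            throw new IllegalArgumentException("Duplicate identifier in transaction");

        // The output state must accumulate exactly the old ids plus the new ones.
        Set<String> expected = new HashSet<>(input.getUsedIds());
        expected.addAll(add.newIds);
        if (!output.getUsedIds().equals(expected))
            throw new IllegalArgumentException("Identifier set not accumulated correctly");
    }
}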
If you don't care about performing these checks inside the contract then you can do one of the approaches you suggested:
"traversing the ledger" is really just performing a bunch of inefficient database queries queries as you rightly note
the other approach you suggested seems like a good idea. Keep an off-ledger DB table with the identifiers in. Currently working on a feature to make this much easier. In the meantime you can use ServiceHub.jdbcConnection to execute queries against the DB.
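The check itself could look something like this (a sketch; the table and column names are invented, and it assumes you call it from a flow or service that has access to a ServiceHub):

// Sketch: query an off-ledger table of already-used identifiers through the
// node's JDBC connection. Don't close the connection itself; it belongs to the node.
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import net.corda.core.node.ServiceHub;

public class UsedIdChecker {
    static boolean alreadyUsed(ServiceHub serviceHub, String id) throws SQLException {
        try (PreparedStatement ps = serviceHub.jdbcSession().prepareStatement(
                "SELECT COUNT(*) FROM used_identifiers WHERE id = ?")) {   // invented table
            ps.setString(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1) > 0;
            }
        }
    }
}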
Which one you choose really depends on other aspects of your use-case.
One thing you could try is maintaining a Bloom filter inside your state object. That way you get a space-efficient data structure and quick set-membership checks. You'll have to update the filter each time an identifier is added, and bear in mind that a Bloom filter can return false positives, so a hit would still need a definitive check. Could be something to look at.
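As a rough illustration of the idea, using Guava's BloomFilter (whether and how such a filter is serialised inside a Corda state is something you'd need to work out):

// Sketch: space-efficient membership checks with a Bloom filter.
// Bloom filters can return false positives, never false negatives.
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class IdFilterDemo {
    public static void main(String[] args) {
        BloomFilter<String> seenIds = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,   // expected number of identifiers
                0.001);      // target false-positive probability

        seenIds.put("C");
        System.out.println(seenIds.mightContain("C"));   // true
        System.out.println(seenIds.mightContain("F"));   // almost certainly false
    }
}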
Cheers

Corda dynamic data handling at the time of creating a contract

I am looking for a solution to handle dynamic data / lists for all nodes, or for specified nodes, at the time of creating a contract in Corda. I don't think an oracle is a good approach to use in my case, for the following reasons:
The data can be a list of, for example, legal entity names; it does not come from the outside world and is not a single value;
The list depends on the particular field(s) selected, so it will perhaps need a centralized place to maintain the data relationship.
Appreciate if anyone can help on this. Thanks.
Kwan
This question is a little difficult to answer without further details on your use-case. However, on the surface, an Oracle doesn't sound like a bad solution:
The data provided by an oracle can be a list
The term "outside world" simply refers to any information not included in the transaction itself. This term should not be taken too literally.
Ultimately, you can think of an Oracle as a provider of "official" data. You request a command including the data from the oracle, include it in the transaction, and the oracle will sign over the transaction if and only if it agrees that the data in the command is true. As long as the Oracle is trusted by all parties involved, this allows data from outside the transaction to be included in the transaction in a reliable way.
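As a very rough sketch of that flow (the command class and its contents are invented for illustration; a real oracle service would check the data and sign over a filtered transaction):

// Sketch: the oracle-provided list travels inside a command; the oracle only
// signs the transaction if it agrees the list is correct.
import java.util.List;
import net.corda.core.contracts.CommandData;

public class LegalEntityListCommand implements CommandData {
    public final String selectedField;            // what the list was derived from
    public final List<String> legalEntityNames;   // the oracle-provided list

    public LegalEntityListCommand(String selectedField, List<String> legalEntityNames) {
        this.selectedField = selectedField;
        this.legalEntityNames = legalEntityNames;
    }
}

// Oracle side (pseudocode): sign the transaction only if
// legalEntityNames.equals(currentListFor(selectedField)).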

Firebase data structure and url to use

I'm really new to Firebase and want to try out a simple mixed-client app on it (Android, JS). I have a users table and a tasks table. The very first question that comes to my mind is how to store them (and thus what the URLs should be). For example, based on the tasks table, should I use:
/tasks/{userid}/task1, /tasks/{userid}/task2, ...
Or
/{userid}/tasks/task1, /{userid}/tasks/task2, ...
The next question, based on the answer to the first one: why use one of these versions over the other?
In my opinion, the first version is good because domains are separated.
The second approach is good because data is stored per-user which may make some of the operations easier.
Any ideas/suggestions?
Update: For the current case, let's say there are the following features:
show list of tasks for each user
add new task to the list
edit/delete a task by user.
Simple operations.
This answer might come in late, but here's how I feel about the question after a year's experience with Firebase.
For your very first question: it totally depends on which data your application will mostly read, and how and in which order (kind of like sorting) you expect to read the data.
Your first proposed data structure, that is "/tasks/{userid}/task1", "/tasks/{userid}/task2", ..., is good if the application will often read the tasks per user, with the added advantage of possibly sorting the data by any task "attribute", if I might call it so.
Say each task has a priority attribute; then:
// Get all of a user's tasks with a priority of 25.
var userTasksRef = firebase.database().ref(`tasks/${auth.uid}`);
userTasksRef.orderByChild("priority").equalTo(25).on(
  "value", // or "child_added", etc., depending on the event you want
  (snapshot) => {
    // do something useful with snapshot.val() here
  });
As for your second proposed structure, I'd highly advise against it. Generally, most if not all of the data associated with a user will be stored under the "/{userid}/" node, and with Firebase's read mechanism, whenever you need a single datum at that path level you end up fetching it together with all the other data under that user's node (tasks and everything else included). I wouldn't want that behaviour on my database. Nonetheless, this approach still permits you to store the tasks per user, or to make multiple RESTful requests and collect the required data datum by datum; I'd suggest fanning out the data structure if you find yourself in that situation. It is only a valid structure if there is no use case in which you need a single datum at the first level of the path on its own, rather than the whole block of data at that level together with everything in the paths below it (the 2nd, 3rd, ... levels).
Going by the use cases you've described, and if the database structure you've given is your entire database structure, I'd say it isn't enough to cover those use cases.
I suggest reading the docs here; it's great and exhaustive documentation of theirs.
If I had to pick, the first approach is the better way to model this data use case in NoSQL, and more specifically in Firebase's NoSQL database.
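Concretely, the tree for the first approach would look something like this (the field names are just examples):

{
  "tasks": {
    "userid1": {
      "task1": { "title": "Buy milk", "priority": 25, "done": false },
      "task2": { "title": "Write report", "priority": 10, "done": true }
    },
    "userid2": {
      "task1": { "title": "Call bank", "priority": 5, "done": false }
    }
  },
  "users": {
    "userid1": { "name": "Alice" },
    "userid2": { "name": "Bob" }
  }
}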

Resources