How to create surrogate Key in BigData

How to create surrogate Key in BigData - bigdata

We are planning to move our Transactional data into BigData platform and do the analysis there. One challenge we faced is how can we create auto-increment in bigData. We need it to generate Surrogate keys.

Most common approach is to use a type 3 UUID, i.e. a pseudo-random identifier with extremely, extremely low collision chance.
If you really need sequential (or at least monotonic) identifiers for some reason, then you will need to generate them from a single source, and this single source may need to be separated out as a service, e.g. Twitter Snowflake.

Yes. I agree with UUID approach.
but please make sure that you refactor your ER model to have proper balance between normalised and deNormalised entity.
If you move your existing application ER model as is in BigData architecture then it would slow down performance as it might have to do joins with BigTable.
Also make sure that you know your Key to access data is strong and not changing when data is updated while storing in NoSql database
This link will give u some idea about above
Transition-RDBMS-NoSQL
relational-databases-vs-non-relational-databases

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one and it appears mostly to be working using a single-table design. We have, however a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB or is this a case where full scans would be required and we would just be better off keeping the log stuff as a relational table?
Thanks in advance,
Nik

Those keywords and phrases "log" and "queried on many attributes" sound to me like DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well-known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. And you could use secondary or global secondary indexes for queries involving other attributes similarly.
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.

Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test your access patterns are fulfilled.
Access Parterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading, it will allow you to group the data better and make your access patterns more efficient.
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB and many many AWS customers use it for that sole purpose. But I can't over emphasize the importance of understanding your access patterns and working backwards from those to create your model.
Alternatives
If you require query capability or free text search ability, then you could use DynamoDB integrations with OpenSearch (via Lambda/EventBridge) for example, with OpenSearch providing you the flexibility for your queries.

Doesn't seem like a good use case - I have done it and wasn't at all happy with the result - now I load 'log like' data into elasticsearch and much happier with the result.
In my case, I insert the data to dynamodb - to archive it - but also feed data in ES, but once in a while if I kill my ES cluster, I can reload all or some of the data from ddb.

Debezium initial data snapshot and related entities order

After the first launch Debezium will do initial data snapshot of the already existing data.
Let's say I have two tables - A and B. Table B have NOT NULL FK constraint on A. According to Debezium default approach - Debezium will create two separate Kafka topics for data from tables A and B.
In my understanding, there is a very big chance that I'll potentially try to create record in new table B while appropriate record A will not be present in the appropriate new table A. This way I'll run into constraint violation error.
Do I need to use some internal 3rd party buffer and organize the proper order of insert into the sink database by myself or there is some standard mechanism in Debezium in order to handle such situations?
For example - can I use Debezium Topic Routing https://debezium.io/documentation/reference/configuration/topic-routing.html in order to fix such issue? I can potentially configure Topic Routing to send all depended events (from tables A and B in my example above) to the same topic. In case of the Kafka topic with a single partition all events must be ordered in a correct way. Will it work and this way will I have a correct related entities order for initial snapshot data load?

The IBM IDR (Data Replication) Product solved this with a solution that allows for exactly once semantics and re-creates the ordering of operations within a transaction and ordering of transactions.
Kafka's built in exactly once features has some limitations beyond performance, you don't inherently get the transaction re-ordered by operation, which is important for things like applying with referential integrity constraints.
So in our product we have a proper and a poor man's way to solve the problem. The poor man's is to send all the data for all the tables to a single topic. Obviously this is sub-optimal, but our product will produce data in operation order from a single producer if you do this. You'd probably want idempotence to avoid batches showing up out of order.
Now the pro-level way to solve this is a feature called the TCC (Transactionally Consistent Consumer).
I'm not sure if you need an enterprise level solution performance and feature wise.
If this is a non-critical project you might find the following discussion useful in how we approach delivering the features your looking for.
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
And here's our docs on the feature for reference.
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkatcc.html
That should give background as to why this problem is hard to solve and what goes into a solution hopefully.

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. I you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference if you create 1 overloaded GSI, or 2 non-overloaded GSIs.
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.

The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.

My opinion would be cost of storage and provisioned throughput.
Apart from that not sure with new limit of 20

Best way to store key value pairs in BizTalk

We have requirements where we need to store Key-Value pair type data for quick retrieval in BizTalk Map.
Is there any best practice by which we can store it. The data stored should be easy to maintain and should have a cache mechanism for easy retrieval as the number of a key-value pair to be stored can be high in a number ranging from 1- 100 or more.
We do not need to store confidential information so I am not preferring SSO. but still is it a preferable method?
We need to use it in BizTalk map and data retrieval might happen for each row of data so there is performance pressure as well.

For this I would use the Get Common Value and Get Application Value functoids XRef functiods. It both gives you the ability to have key value pairs (with an additional element so you can scope it per application), and they also do Caching.
I wrote a blog post about it BizTalk Pattern: Translating Reference Data in a Map using Xref

You do have the xRef Functoids, but those are somewhat difficult to maintain and use beyond their original design requirement.
What I have done for a similar situation is to pre-Fetch the lookup tables as SQL Table Types into an Orchestration, then use a Multi-Input Map passing the business message and lookup tables. That way, all lookups are internal to the transform.
Retrieving the entire lookup table once is in many case less taxing than doing many lookups.

How to implement gapless, user-friendly IDs in NHibernate?

I'm designing an application where my Order objects need to have a sequential and user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of the major disadvantages of it:
(From: NHibernate POID Generators revealed)
Post insert generators, as the name
suggest, assigns the id’s after the
entity is stored in the database. A
select statement is executed against
database. They have many drawbacks,
and in my opinion they must be used
only on brownfield projects. Those
generators are what WE DO NOT SUGGEST
as NH Team.
> Some of the drawbacks are the
following:
Unit Of Work is broken with the use of
those strategies. It doesn’t matter if
you’re using FlushMode.Commit, each
Save results in an insert statement
against DB. As a best practice, we
should defer insertions to the commit,
but using a post insert generator
makes it commit on save (which is what
UoW doesn’t do).
Those strategies
nullify batcher, you can’t take the
advantage of sending multiple queries
at once(as it must go to database at
the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User friendly Id fields are ones my corporate users can memorize and even discuss and/or have phone conversations talking about a particular Order by its code, e.g. "I'm calling to know why the order #1625 was denied.".
The Id doesn't need to be strictly gapless, but I am worried that my users would get confused when they see gaps like 100, 201, 305. For my older projects, I currently implement NHibernate using Oracle sequences which occasionally lose a few sequences when exceptions are thrown, but yet keep a rather tidy order to them. The downside to them is how they break the Unit of Work which results in additional hits to the database for every Save command with or without the Session.Flush.

One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
The bottom-line is that it sounds like you want to use a surrogate key, but have the benefits of a natural key. It can be very difficult to have it both ways and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex