I have two flowfiles from NiFi as follows:
Type:A
id:1
Type:B
id:1
I have some written rules that tell me Type A links to Type B if the ids match. So in the example above, it's a match!
Is there any way I can use NiFi to compare flowfiles and figure out this connection? If not, what would be the best way to do this? What tools/technologies should I look into in order to accomplish this?
Thanks in advance!
The scenario you described sounds like a streaming join which can be done with a stream processing system like Storm, Flink, Spark, etc.
NiFi focuses more on lookup joins, where there is some kind of reference dataset, something from the incoming data is used to perform a lookup against that reference dataset, and the incoming data is then enriched with the result.
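To make the distinction concrete, here is a small conceptual sketch in Python (this is not NiFi code; the record shapes and names are invented):

    # Lookup join: a fixed reference dataset is consulted to enrich incoming records.
    reference = {1: {"type": "B", "details": "known record for id 1"}}

    def enrich(record):
        # Enrich an incoming Type A record from the reference dataset, if present.
        match = reference.get(record["id"])
        return {**record, "linked": match} if match else record

    # Streaming join: records of both types arrive over time and must be buffered
    # until their counterpart shows up.
    pending_a, pending_b = {}, {}

    def on_event(record):
        # Buffer the record and emit a pair as soon as both sides exist for an id.
        mine, other = (pending_a, pending_b) if record["type"] == "A" else (pending_b, pending_a)
        if record["id"] in other:
            return record, other.pop(record["id"])   # matched pair
        mine[record["id"]] = record                  # wait for the other side
        return None

    print(enrich({"type": "A", "id": 1}))
    print(on_event({"type": "A", "id": 1}))
    print(on_event({"type": "B", "id": 1}))

The second style requires keeping state for unmatched records, which is what the stream processors above are built for.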
After the first launch, Debezium will take an initial snapshot of the already existing data.
Let's say I have two tables - A and B. Table B has a NOT NULL FK constraint referencing A. Following Debezium's default approach, Debezium will create two separate Kafka topics for the data from tables A and B.
In my understanding, there is a very real chance that I'll try to create a record in the new table B while the corresponding record is not yet present in the new table A. This way I'll run into a constraint violation error.
Do I need to use some internal 3rd-party buffer and organize the proper order of inserts into the sink database myself, or is there some standard mechanism in Debezium to handle such situations?
For example, can I use Debezium Topic Routing https://debezium.io/documentation/reference/configuration/topic-routing.html to fix such an issue? I could configure Topic Routing to send all dependent events (from tables A and B in my example above) to the same topic. With a single-partition Kafka topic, all events would be ordered correctly. Will this work, and will it give me the correct order of related entities for the initial snapshot data load?
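Roughly, what I have in mind is a connector config like the sketch below (the connector class, server/table names, and the regex are just placeholders for my setup), pushed through the Kafka Connect REST API:

    # Sketch only: route change events from tables A and B into one shared topic
    # using the Debezium topic-routing SMT, so a single partition preserves order.
    import requests

    config = {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        # ... usual database connection settings ...
        "transforms": "reroute",
        "transforms.reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
        "transforms.reroute.topic.regex": "dbserver1\\.public\\.(a|b)",
        "transforms.reroute.topic.replacement": "dbserver1.public.a_and_b",
    }

    # Placeholder Connect host and connector name.
    resp = requests.put("http://localhost:8083/connectors/my-connector/config", json=config)
    print(resp.status_code, resp.text)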
The IBM IDR (Data Replication) product solved this with a solution that allows for exactly-once semantics and re-creates the ordering of operations within a transaction and the ordering of transactions.
Kafka's built-in exactly-once features have some limitations beyond performance: you don't inherently get the transaction re-ordered by operation, which is important for things like applying changes with referential integrity constraints.
So in our product we have a proper way and a poor man's way to solve the problem. The poor man's way is to send all the data for all the tables to a single topic. Obviously this is sub-optimal, but our product will produce data in operation order from a single producer if you do this. You'd probably want idempotence to avoid batches showing up out of order.
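As a rough illustration of that poor man's setup (this is not our product's code; it assumes the confluent_kafka Python client and a single-partition topic shared by all tables):

    import json
    from confluent_kafka import Producer

    # One producer, one topic for all tables; idempotence guards against
    # retried batches being duplicated or reordered.
    producer = Producer({
        "bootstrap.servers": "localhost:9092",   # placeholder broker address
        "enable.idempotence": True,
        "acks": "all",
    })

    def publish(change_event):
        # Every table's change events go to the same topic (assumed to have a
        # single partition so the operation order is global).
        producer.produce("all-tables-changes", value=json.dumps(change_event).encode())

    publish({"table": "A", "op": "insert", "id": 1})
    publish({"table": "B", "op": "insert", "id": 1, "a_id": 1})
    producer.flush()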
Now the pro-level way to solve this is a feature called the TCC (Transactionally Consistent Consumer).
I'm not sure if you need an enterprise-level solution, performance- and feature-wise.
If this is a non-critical project, you might find the following discussion useful in how we approach delivering the features you're looking for.
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
And here's our docs on the feature for reference.
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkatcc.html
Hopefully that gives some background as to why this problem is hard to solve and what goes into a solution.
I am looking for a solution to provide dynamic data / lists to all nodes, or to specified nodes, at the time a contract is created in Corda. I don't think an oracle is a good approach in my case, for the following reasons:
The data can be a list of, for example, legal entity names; it does not come from the outside world and is not a single value;
The list depends on the particular field(s) selected, so we would perhaps need a centralized place to maintain the data relationships.
I'd appreciate it if anyone can help with this. Thanks.
Kwan
This question is a little difficult to answer without further details on your use case. However, on the surface, an oracle doesn't sound like a bad solution:
The data provided by an oracle can be a list
The term "outside world" simply refers to any information not included in the transaction itself. This term should not be taken too literally.
Ultimately, you can think of an oracle as a provider of "official" data. You request a command including the data from the oracle, include it in the transaction, and the oracle will sign over the transaction if and only if it agrees that the data in the command is true. As long as the oracle is trusted by all parties involved, this allows data from outside the transaction to be included in the transaction in a reliable way.
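To make the interaction concrete, here is a language-agnostic sketch in Python of the pattern described above. This is not the Corda API; the names, data, and signing scheme are all invented purely to show the query-then-sign-if-true shape:

    import hashlib
    import hmac

    ORACLE_SECRET = b"oracle-signing-key"   # stands in for the oracle's key pair

    # The "official" data the oracle is willing to attest to, e.g. the list of
    # legal entity names that are valid for a given field.
    OFFICIAL_LISTS = {"field-x": ["Entity A Ltd", "Entity B GmbH"]}

    def oracle_query(field):
        # Step 1: the requesting node asks the oracle for the data to embed in a command.
        return OFFICIAL_LISTS[field]

    def oracle_sign(command):
        # Step 2: the oracle signs only if the data in the command matches what
        # it believes to be true; otherwise it refuses.
        if command["data"] != OFFICIAL_LISTS.get(command["field"]):
            raise ValueError("oracle disagrees with the data in the command")
        return hmac.new(ORACLE_SECRET, repr(command).encode(), hashlib.sha256).hexdigest()

    cmd = {"field": "field-x", "data": oracle_query("field-x")}
    print("oracle signature:", oracle_sign(cmd))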
As I've worked in my personal lab instance of OpenTSDB, I've started to wonder if it is possible to get it to index on tags as well as metric names. My understanding (correction is welcome...) is that OpenTSDB indexes only on metric names. So, suppose I have something like the following, borrowed from the docs:
tsd.hbase.rpcs{type=*,host=tsd1}
My understanding is that tsd.hbase.rpcs is indexed for searching, but that the keys (type=, host=, etc) are not. Is that correct? If so, is there a way to have them be indexed, or some reasonable approximation of it? Thanks.
Yes, you are correct. According to the documentation, OpenTSDB creates keys in the 'tsdb' HBase table of the form:
[salt]<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
When you do a query with a specific tagk and tagv, OpenTSDB can construct the key and look it up. If you have a range of tagk and tagv values, it will look up all the rows and either aggregate them or return multiple time series, depending on your query.
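To make that concrete, here is a rough sketch of how such a row key is composed, assuming the default 3-byte UIDs and no salting (the UID values are made up):

    import struct

    def row_key(metric_uid, base_timestamp, tags):
        # tags is a dict of {tagk_uid: tagv_uid}; tag pairs are stored sorted by tagk.
        key = metric_uid.to_bytes(3, "big") + struct.pack(">I", base_timestamp)
        for tagk_uid in sorted(tags):
            key += tagk_uid.to_bytes(3, "big") + tags[tagk_uid].to_bytes(3, "big")
        return key

    # e.g. metric tsd.hbase.rpcs -> uid 0x000001, tagk/tagv uids invented here
    print(row_key(0x000001, 1577836800, {0x000001: 0x000003, 0x000002: 0x000004}).hex())

The point is that a full key can only be built when the metric is known, which is why queries always start from a metric name.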
If you are interested in asking questions about tagks, you should use the OpenTSDB search/lookup API; however, this still requires a metric name.
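For example, something along these lines (the host, port, and metric filter are placeholders; check the exact query syntax against your OpenTSDB version's docs):

    import requests

    resp = requests.get(
        "http://opentsdb.example.com:4242/api/search/lookup",
        params={"m": "tsd.hbase.rpcs{host=*}"},   # a metric name is still required
    )
    print(resp.json())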
If you want to formulate your question around tagks only, you could consider forwarding your data to Bosun for indexing and using its API:
/api/metric/{tagk}/{tagv}
Returns the metrics that are available for the specified tagk/tagv pair. For example, you can see what metrics are available for host=server01
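A call to that endpoint might look like this (the Bosun host and port are placeholders):

    import requests

    resp = requests.get("http://bosun.example.com:8070/api/metric/host/server01")
    print(resp.json())   # metrics available for host=server01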
I've been testing beanstalkd today and I am wondering whether there is any way to make job data unique on a beanstalkd tube. In other words, is it possible to make a tube hold only unique values? If not, maybe someone could suggest a similar MQ system which has this feature.
Thank you in advance for your answers!
Well, what you are looking for is a SET and not a QUEUE.
Sets do not allow duplicate members.
Redis has Sorted Sets if you want to keep an order, as in a queue.
You can achieve what you want with Redis; read about its data types here:
http://redis.io/topics/data-types
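For example, a minimal sketch using the redis-py client (assuming a local Redis instance; the job names are made up), where a sorted set keeps queue order while silently rejecting duplicate job data:

    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def enqueue_unique(job):
        # NX: only add the member if it is not already in the sorted set, so
        # duplicate job data is dropped; the timestamp score preserves FIFO order.
        added = r.zadd("jobs", {job: time.time()}, nx=True)
        return bool(added)

    def dequeue():
        # Pop the lowest-scored (oldest) member, i.e. the head of the "queue".
        popped = r.zpopmin("jobs", 1)
        return popped[0][0] if popped else None

    print(enqueue_unique("send-email:42"))   # True, queued
    print(enqueue_unique("send-email:42"))   # False, duplicate ignored
    print(dequeue())                          # b'send-email:42'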
Suppose I have a resource, let's say customers/3, which returns the customer object, and I want to return this object with different fields or some other changes. For example, say I need the customer object to also include his latest purchase (for the sake of speed I don't want to make two different queries).
As I see it, my options are:
customers/3/with-latest-purchase
customers/3?display=with-latest-purchase
In the first option there is a distinct URI for the new representation, but is this REALLY needed? Also, how do I tell the client that this URI exists?
In the second option there is a GET parameter telling the server what kind of representation to return. The URI parameters can be explained through the OPTIONS method, and it is easier to tell the client where to look for the data, as all the representations are in one place.
So my question is: which of these is better (more RESTful), and/or is there some better way to do this that I do not know about?
I think it is best to define atomic, indivisible service objects, e.g. customer and customer-latest-purchase: nice, clean, simple. Then if the client wants a customer with his latest purchase, they invoke both service calls instead of jamming it all into one with funky parameters.
Different representations of an object are fine in Java through interfaces, but I think it is a bad idea for REST because it compromises its simplicity.
There is a misconception that making query parameters look like file paths is more RESTful. The query portion of the address is included when determining a distinct URI, so the second option is fine.
Is there much of a performance hit in including the latest purchase data in all customer GET requests? If not, the simplest thing would be to do that, so there would be neither weird URL params nor double requests. If getting the latest order is a significant hardship (which it probably shouldn't be), there is nothing wrong with adding a flag in the query string to include it.
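If you go the query-string route, here is a small sketch of the idea, using Flask purely for illustration (the route, data, and parameter name are made up):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    CUSTOMERS = {3: {"id": 3, "name": "Alice"}}
    LATEST_PURCHASE = {3: {"order_id": 99, "total": 42.0}}

    @app.route("/customers/<int:cid>")
    def get_customer(cid):
        customer = dict(CUSTOMERS[cid])
        # Same URI, same resource; the flag only selects a richer representation.
        if request.args.get("display") == "with-latest-purchase":
            customer["latest_purchase"] = LATEST_PURCHASE.get(cid)
        return jsonify(customer)

    # GET /customers/3                                -> base representation
    # GET /customers/3?display=with-latest-purchase   -> includes the latest purchase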