Unique job data in beanstalkd - mq

I've been testing beanstalkd today and I am wondering whether there is any way to make job data unique within a beanstalkd tube. In other words, is it possible to make a tube hold only unique values? If not, could someone suggest a similar MQ system that has this feature?
Thank you in advance for your answers!

Well, what you are looking for is a SET and not a QUEUE.
Sets do not allow duplicate members.
Redis has Sorted Sets if you want to keep an order, as in a queue.
You can achieve what you want with Redis; read about its data types here:
http://redis.io/topics/data-types
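To make the set idea concrete, here is a minimal in-memory Python sketch. It is not beanstalkd or Redis code: the `UniqueQueue` class and its method names are made up for illustration. With Redis, the duplicate check would be done by the server instead, for example with `SADD`, which returns how many members were newly added.

```python
from collections import deque


class UniqueQueue:
    """FIFO queue that refuses job payloads it already holds (hypothetical helper)."""

    def __init__(self):
        self._order = deque()   # preserves arrival order
        self._members = set()   # gives fast duplicate checks

    def put(self, payload):
        """Enqueue payload; return True if added, False if it was a duplicate."""
        if payload in self._members:
            return False
        self._members.add(payload)
        self._order.append(payload)
        return True

    def reserve(self):
        """Remove and return the oldest payload; it may be enqueued again afterwards."""
        payload = self._order.popleft()
        self._members.discard(payload)
        return payload

    def __len__(self):
        return len(self._order)


q = UniqueQueue()
results = [q.put("resize:42"), q.put("resize:42"), q.put("resize:43")]
```

Whether a payload should become eligible again once it has been reserved is a design choice; this sketch allows it.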

Related

Debezium initial data snapshot and related entities order

After the first launch, Debezium takes an initial snapshot of the already existing data.
Let's say I have two tables, A and B. Table B has a NOT NULL FK constraint on A. By default, Debezium will create two separate Kafka topics for the data from tables A and B.
As I understand it, there is a good chance that I will try to create a record in the new table B before the corresponding record exists in the new table A, and so run into a constraint violation error.
Do I need to use some internal 3rd party buffer and organize the proper insert order into the sink database myself, or is there some standard mechanism in Debezium to handle such situations?
For example, can I use Debezium Topic Routing https://debezium.io/documentation/reference/configuration/topic-routing.html to fix this issue? I could configure Topic Routing to send all dependent events (from tables A and B in my example above) to the same topic. With a single-partition Kafka topic, all events should be ordered correctly. Will this work, and will it give me the correct order of related entities for the initial snapshot load?
The IBM IDR (Data Replication) product solved this with a solution that allows for exactly-once semantics and re-creates the ordering of operations within a transaction and the ordering of transactions.
Kafka's built-in exactly-once features have some limitations beyond performance: you don't inherently get the transaction re-ordered by operation, which is important for things like applying changes under referential integrity constraints.
So in our product we have a proper way and a poor man's way to solve the problem. The poor man's way is to send all the data for all the tables to a single topic. Obviously this is sub-optimal, but our product will produce data in operation order from a single producer if you do this. You'd probably want idempotence enabled to avoid batches showing up out of order.
Now the pro-level way to solve this is a feature called the TCC (Transactionally Consistent Consumer).
I'm not sure whether you need an enterprise-level solution, performance- and feature-wise.
If this is a non-critical project, you might find the following discussion useful for how we approach delivering the features you're looking for.
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
And here's our docs on the feature for reference.
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkatcc.html
Hopefully that gives some background on why this problem is hard to solve and what goes into a solution.
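For the "buffer it yourself" option raised in the question, a rough Python sketch of the idea follows. It is neither Debezium nor IBM CDC code: the event shape (`table`, `id`, `a_id`) and the `apply_events` helper are assumptions made up for this example. Child rows from B are parked until their parent row from A has been written.

```python
from collections import defaultdict


def apply_events(events):
    """Order writes so that every B row comes after its parent A row.

    Each event is a dict: {"table": "A", "id": ...} or
    {"table": "B", "id": ..., "a_id": ...} (hypothetical shape).
    Returns the write order and any B events still waiting for a parent.
    """
    written_a = set()              # ids of A rows already written to the sink
    waiting = defaultdict(list)    # a_id -> B events parked until A arrives
    applied = []                   # (table, id) in the order actually written

    for ev in events:
        if ev["table"] == "A":
            written_a.add(ev["id"])
            applied.append(("A", ev["id"]))
            # Flush any children that were waiting for this parent.
            for child in waiting.pop(ev["id"], []):
                applied.append(("B", child["id"]))
        elif ev["a_id"] in written_a:
            applied.append(("B", ev["id"]))
        else:
            waiting[ev["a_id"]].append(ev)

    return applied, dict(waiting)


events = [
    {"table": "B", "id": 10, "a_id": 1},   # child arrives before its parent
    {"table": "A", "id": 1},
    {"table": "B", "id": 11, "a_id": 1},
]
order, still_waiting = apply_events(events)
```

The sketch keeps everything in memory and ignores updates, deletes, and restarts, all of which a real consumer would have to handle.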

Most efficient way to change synthetic partition key values

I have a collection with thousands of documents all of which have a synthetic partition key property like:
partitionKey: ‘some-document-related-value’
Now I need to change the values of partitionKey. Of course, this requires recreating the documents, but I am wondering what the most efficient/straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
Change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, change feed is a common way to migrate data from one container to another. Another simple option may be to use the Data Migration Tool, where you build your new partition key in the select statement.
Hopefully this is helpful.
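Whichever mechanism delivers the documents, the per-document work is the same: copy the document, recompute partitionKey, and write the copy to the new container. Below is a minimal Python sketch of that step only. It uses plain dicts rather than the Cosmos DB SDK, and the `rekey` and `migrate` helpers, the `write` callback, and the `tenantId` and `type` fields are all hypothetical.

```python
def rekey(doc, build_key):
    """Return a copy of doc with partitionKey recomputed by build_key(doc)."""
    new_doc = dict(doc)                       # shallow copy; the source is untouched
    new_doc["partitionKey"] = build_key(doc)
    return new_doc


def migrate(batch, build_key, write):
    """Re-key every document in a batch and hand each copy to write()."""
    count = 0
    for doc in batch:
        write(rekey(doc, build_key))
        count += 1
    return count


# Hypothetical usage: the new key combines two made-up fields,
# and a plain list stands in for the target container.
target = []
source = [
    {"id": "1", "tenantId": "t1", "type": "order", "partitionKey": "old-a"},
    {"id": "2", "tenantId": "t2", "type": "invoice", "partitionKey": "old-b"},
]
migrated = migrate(source, lambda d: f'{d["tenantId"]}-{d["type"]}', target.append)
```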

NiFi matching flowfiles based on some rules

I have two flowfiles from NiFi as follows:
Type:A
id:1
Type:B
id:1
I have some written rules that tell me Type A links to Type B if the id's match. So in the example above, it's a match!
Is there any way I can use NiFi to compare flowfiles and figure out this connection? If not, what would be the best way to do this? What tools/technologies should I look into in order to accomplish this?
Thanks in advance!
The scenario you described sounds like a streaming join which can be done with a stream processing system like Storm, Flink, Spark, etc.
NiFi focuses more on lookup joins where there is some kind of reference dataset, and something from incoming data is used to perform a look up into the reference dataset, and then enrich the incoming data.
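To illustrate what a streaming join on id does, here is a small Python sketch, independent of NiFi, Storm, Flink, or Spark. The `join_by_id` function is made up for this example, and records are modelled as dicts with the `Type` and `id` fields from the question. Each record waits in memory until a record of the other type with the same id shows up.

```python
def join_by_id(records):
    """Yield (a, b) pairs with matching ids as records arrive one at a time."""
    unmatched = {"A": {}, "B": {}}    # type -> id -> record still waiting
    other = {"A": "B", "B": "A"}

    for record in records:
        kind, key = record["Type"], record["id"]
        partner = unmatched[other[kind]].pop(key, None)
        if partner is None:
            unmatched[kind][key] = record     # wait for the other side
        elif kind == "A":
            yield record, partner
        else:
            yield partner, record


stream = [
    {"Type": "A", "id": 1},
    {"Type": "B", "id": 2},
    {"Type": "B", "id": 1},
]
matches = list(join_by_id(stream))
```

This version holds unmatched records indefinitely and keeps only one waiting record per type and id. Stream processors usually bound that state with a time window.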

How does setRange in GAE datastore work?

I need to mark the first (earliest) 10 entries in my table as '1'. How can I do this?
I read about the query.setRange(0,10) option. I just want to confirm whether setRange, without any sort applied, retrieves the earliest 10 entities of a table.
I don't want to save any date and time to measure this.
If you have some other simpler way to do this, please advise.
Thanks,
Karthick.
You should never count on a database (no matter which one) to sort results without explicitly requesting a specific order. Even if the datastore happens to return the earliest entities today, that behavior might change in the future and your software will stop working.
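The point can be shown with plain Python lists, without the GAE API. In the sketch below, `seq` is a hypothetical stored property that records insertion order (a counter rather than a timestamp), and `mark_earliest` is a made-up helper. The earliest 10 are chosen by sorting on that property explicitly, not by trusting the order in which rows come back.

```python
def mark_earliest(entities, n=10, order_key="seq"):
    """Set flag = 1 on the n entities with the smallest order_key value."""
    earliest = sorted(entities, key=lambda e: e[order_key])[:n]
    for entity in earliest:
        entity["flag"] = 1
    return earliest


# 25 entities returned in an order unrelated to insertion order.
rows = [{"seq": s, "flag": 0} for s in reversed(range(25))]
marked = mark_earliest(rows)
```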

Partition Or Separate Table

I am designing a database for a fleet management system.
I will be getting n records every 3 seconds. Obviously, there will be millions of records in the current_location table, where I am going to store the current information for each vehicle. Performance is a BIG issue here.
To solve this, I received the following suggestions:
Create a separate table for each vehicle.
Here a table will be created at run time as soon as I click on create new table, and all the data related to a particular vehicle will be inserted into and retrieved from that table.
Go for partition.
Please answer the following questions about these solutions.
What is difference between the two?
Which is best and why?
At what point will the number of rows in the tables cause performance issues?
Are there any other solutions?
Now, if I go for range partitioning in SQL Server 2008, how should I partition on a varchar(20) column?
I am planning to partition based on vehicle no., e.g. MH30 q 1234.
In the vehicle no., say MH30 q 1234, only the 30 and the q are going to change, so my question is how I should write the partition function.
Note: this question was first asked for MySQL; I have since moved from MySQL to SQL Server, with the same question.
definitely use partitioning. why go to all of the hassle to figure out which table to use to answer a question when mysql will do it for you? and good luck finding the current location of all of your trucks if you're not using partitioning!
partitioning gives you the performance benefits of multiple tables, but with automatic pruning (selection of just the tables needed to answer the query).
nothing is ever "best". the question is: what is best for your problem?
this is impossible to answer. you will just have to monitor your system for performance issues and adjust server settings or scale as necessary.
at least as far as mysql is concerned, none as good as partitioning!
Don't bother with partitioning for 28,800 rows per day.
We don't (yet) with over 5 million per day. (The "yet" means we have no business input on what data retention policy they want)
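The 28,800 figure appears to correspond to one record every 3 seconds for a single vehicle. A quick Python check of that arithmetic is below. The `rows_per_day` helper and the fleet size of 200 are made up for illustration; the question only says "n" records per interval.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(vehicles, interval_seconds=3):
    """Rows inserted per day if each vehicle reports once per interval."""
    return vehicles * SECONDS_PER_DAY // interval_seconds


per_vehicle = rows_per_day(1)       # one vehicle reporting every 3 seconds
fleet_of_200 = rows_per_day(200)    # hypothetical fleet size
```

Here `per_vehicle` is 28,800 and `fleet_of_200` is 5,760,000, so the daily volume scales linearly with the number of vehicles.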
There should be very little performance difference between making a separate table for each vehicle, and making the vehicle ID the first field in the primary key. You get the same grouping on disk either way, and mysql should have no trouble with millions of rows in a table.
Partitions are only useful if you have multiple disks on your machine and want to spread the load across disks.
So I guess my answer is do neither. Designing this in up front seems like overkill.
One thing I want to point out is that having one table (which you can partition later when you need to) will be much easier to maintain both in the database and in terms of querying the data.
