confluent-kafka-python library: read offset per topic per consumer_group - pykafka

Due to pykafka reaching EOL, we are in the process of migrating to confluent-kafka-python. For pykafka we wrote an elaborate script that produced output in the following format:
topic          consumer_group    offset
topic_alpha    total_messages    100
topic_alpha    consumer_a        10
topic_alpha    consumer_b        25
I am wondering whether there is Python code that can do something similar for confluent-kafka-python?
Small print: there is a partial example of how to read offsets for a given consumer_group. However, I am struggling to get the list of consumer groups per topic without manually parsing __consumer_offsets.

Use admin_client.list_groups() to get the list of consumer groups, admin_client.list_topics() to get all topics and partitions in the cluster, and consumer.get_watermark_offsets() for the given topics.
Then, for each consumer group, instantiate a new consumer with the corresponding group.id, create a TopicPartition list to query committed offsets for, and call c.committed() to retrieve the committed offsets.
Subtract the committed offsets from the high watermarks to get the lag (remaining message count) per topic, partition, and consumer group.
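Putting that together, a minimal sketch (assuming a broker at localhost:9092; security settings, error handling and timeouts would need adjusting for a real cluster):

# A minimal sketch, assuming confluent-kafka-python's AdminClient/Consumer APIs
# and a broker at localhost:9092 (assumption).
from confluent_kafka import Consumer, TopicPartition
from confluent_kafka.admin import AdminClient

BROKERS = "localhost:9092"  # assumption: adjust to your cluster

admin = AdminClient({"bootstrap.servers": BROKERS})
topics = admin.list_topics(timeout=10).topics            # topic name -> metadata
groups = [g.id for g in admin.list_groups(timeout=10)]   # all consumer groups

print("topic\tconsumer_group\toffset\tlag")
for group in groups:
    consumer = Consumer({"bootstrap.servers": BROKERS, "group.id": group})
    for topic, meta in topics.items():
        if topic.startswith("__"):      # skip internal topics
            continue
        partitions = [TopicPartition(topic, p) for p in meta.partitions]
        for tp in consumer.committed(partitions, timeout=10):
            if tp.offset < 0:           # nothing committed for this partition
                continue
            _, high = consumer.get_watermark_offsets(tp, timeout=10)
            print(f"{topic}\t{group}\t{tp.offset}\t{high - tp.offset}")
    consumer.close()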

Related

Need help to iterate through array of JSON response in TOSCA

I just started working with TOSCA, and I need help with a technical issue I am facing.
I have an API which returns an array of objects, and my test condition should validate a particular field in every object in the array.
When I scanned my API in TOSCA, I see that an item attribute is created which has all the fields within it. As per the source, we can extract data from the item either by making the item "$index" or by setting the value as index==1 (an index value).
But I don't want to iterate like this, as the number of items may vary for each set of test data, and I don't want to hard-code the response by index, as that fails with a new data set as below.
With one set of test data I got four records; in the next iteration the response has only three records and the data has changed, so the verification fails.
Can someone help me find a solution to iterate/loop through all the items at once (using some kind of loop) and extract the data into a buffer?

Debatching Biztalk flat file message into individual grouped flat files based on value

I have an issue where I am trying to debatch a flat file in BizTalk Server (comma-delimited to tab-delimited) into individual flat files based on a value (in this example, PONumber) in the original file.
Sample input:
PartNumber,Weight,PONumber,Other
21519,234,46788,1
81919,456,47115,1
91910,789,47115,1
This would result in 2 messages, such as:
PartNumber Weight PONumber Other
21519 234 46788 1
and
PartNumber Weight PONumber Other
81919 456 47115 1
91910 789 47115 1
I have seen similar things but no definite answers, or the samples are dead links. Does anyone have a sample where they have done something like this, or a good solution?
Option 1: Convoy pattern
Change your schema so that it has a max occurs of 1 for the PO line; this will debatch each line into its own message when it is received.
Promote the PONumber so that it is a promoted property in the message context.
Have an Orchestration that has a correlation set based on the PO number, and initialises this on the first receive shape.
Have a receive shape with a following correlation that is in a wait shape inside a loop to receive all the other lines with the same PO number and combine them into a single message.
Option 2: Staging database
The other option is to just insert all of the rows into a SQL database, and then have a stored procedure that you poll that gets all the lines for a single PO.
This can sometimes be simpler, and avoids the issue of zombies, as you can implement this as a messaging-only pattern or with a simpler Orchestration without a loop.
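For reference, outside of BizTalk the transformation itself is just "group rows by PONumber and emit one tab-delimited output per group"; a rough Python sketch of that logic, with hypothetical file names:

# Rough sketch of the grouping only -- not a BizTalk artifact.
# Input/output file names are hypothetical.
import csv
from collections import defaultdict

groups = defaultdict(list)
with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)          # header: PartNumber,Weight,PONumber,Other
    fieldnames = reader.fieldnames
    for row in reader:
        groups[row["PONumber"]].append(row)

for po, rows in groups.items():
    with open(f"po_{po}.txt", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)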

How to design a recommendation system in DynamoDb based on likes

Considering performance as the main concern, what would be the best design approach for building a recommendation system in DynamoDB?
The system would need to store a URL and the number of times that topic was 'liked', but my requirements include the need for daily, weekly, monthly and yearly searches, e.g.:
Give me the top 10 of the week
Give me the top 10 of the month
I was thinking about including the date and time information so that the query could filter on this field, but I'm not sure whether that is good in terms of performance.
If the only data structure you had was a hash map, how would you solve this problem?
What if on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 per second?
How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.
How real-time does your system need to be? Can the system be eventually consistent (meaning, would it be OK if you only reported likes as of several minutes ago)?
Let's give this a shot
Disclaimer: this is very much a didactic exercise; in practice you may want to explore an analytics product, or technologies other than DynamoDB, to accomplish this task.
Part 1. Representing an Item And Updating Like Counts
First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".
I'm going to take it literally and assume that "top 10 of the week" means the top 10 items from the week that started on the most recent Monday (or Sunday, if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month".
In this case, you will probably want to store, for each item:
a count of total all-time likes
a count of likes since the beginning of current month
a count of likes since the beginning of current week
current month number - needed to determine if we need to reset
current week number - needed to determine if we need to reset
And each week, reset the count for the current week; each month, reset the count for the current month.
In DynamoDB, this might be represented like so:
{
id: "<item-id>",
likes_all: <numeric>, // total likes of all time
likes_wk: <numeric>, // total likes for the current week
likes_mo: <numeric>, // total likes for the current month
curr_wk: <numeric>, // number of the current week of year, eg. 27
curr_mo: <numeric>, // number of the current month of year, eg. 6
}
Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc value can be any number (not just 1). This will come in handy below.
But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:
update the total count (and get the most recent values back)
conditionally update the week and month counts
So, our update sequence becomes:
Step 1. update total count and read back the updated item:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:
{
id: "<item-id>",
likes_all: 1000, // total likes of all time
likes_wk: 70, // total likes for the current week
likes_mo: 150, // total likes for the current month
curr_wk: 26, // number of the week of last update
curr_mo: 6, // number of the month of year of last update
}
curr_wk is 26, but at the time of the update the actual current week is 27.
Then your update query would look like this:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET curr_wk = :new_wk, likes_wk = :lc, likes_mo = likes_mo + :lc" \
--condition-expression "curr_wk = :wk AND curr_mo = :mo" \
--expression-attribute-values '{":lc": {"N":"1"}, ":new_wk": {"N":"27"}, ":wk": {"N":"26"}, ":mo": {"N":"6"}}' \
--return-values ALL_NEW
The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.
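As a rough Python (boto3) sketch of that two-step sequence, with the rollover decision made in application code (the table name, key schema and ISO week numbering are assumptions; retries after a failed condition are omitted):

# Sketch only: assumes a table named "items" with partition key "id",
# ISO week numbers for curr_wk, and no retry handling for failed conditions.
import datetime
import boto3

table = boto3.resource("dynamodb").Table("items")

def like(item_id, count=1):
    today = datetime.date.today()
    wk, mo = today.isocalendar()[1], today.month

    # Step 1: atomically bump the all-time count and read the item back.
    item = table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET likes_all = likes_all + :lc",
        ExpressionAttributeValues={":lc": count},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    # Step 2: increment or reset the week/month counters depending on rollover.
    sets, values = [], {":lc": count}
    if item["curr_wk"] == wk:
        sets.append("likes_wk = likes_wk + :lc")
    else:
        sets += ["likes_wk = :lc", "curr_wk = :new_wk"]
        values[":new_wk"] = wk
    if item["curr_mo"] == mo:
        sets.append("likes_mo = likes_mo + :lc")
    else:
        sets += ["likes_mo = :lc", "curr_mo = :new_mo"]
        values[":new_mo"] = mo

    values[":old_wk"], values[":old_mo"] = item["curr_wk"], item["curr_mo"]
    table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET " + ", ".join(sets),
        # Guard against two concurrent writers both resetting the counters;
        # if this fails, re-read and retry as a plain increment.
        ConditionExpression="curr_wk = :old_wk AND curr_mo = :old_mo",
        ExpressionAttributeValues=values,
    )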
Part 2 - Keeping Track of Statistics
To take care of your statistics you need to keep track of most likes per week and per month.
You can keep a sorted list of hottest items per week and per month. You also can store these lists in Dynamo.
For example, let's say you want to keep track of top 3. You might store something like:
{
id: "item-stats",
week_top: ["item3:4000", "item2:2000", "item9:700"],
month_top: ["item2:100000", "item4:50000", "item3:12000"],
curr_wk: 26,
curr_mo: 6,
sequence: <optimistic-lock-token>
}
Whenever you perform an update for items, you would also update the statistics.
The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.
First, you read the current values for the item-stats special item, including the value of the current sequence (this is important to detect clobbering)
Then, you figure out if the item(s) whose counts you've just updated would make it into the Top-N weekly or monthly list. If so, you would update the week_top and/or month_top attributes and prepare a conditional PutItem request.
The PutItem request must include a conditional check that verifies the sequence of the item-stats is the same as what you read earlier. If not, you need to read the item again, re-compute the top-N lists, and then attempt the put again.
Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.
When you make the PutItem request, make sure to generate a new sequence value.
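A rough boto3 sketch of that read-modify-write loop for one of the lists (attribute names follow the example above; the top-N merge logic here is a simplification, and the weekly/monthly reset is left out):

# Sketch only: optimistic-lock update of the "item-stats" record.
# Assumes the same table and attribute names as the example above.
import uuid
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("items")
TOP_N = 3

def record_in_top(list_name, item_id, new_count):
    while True:
        stats = table.get_item(Key={"id": "item-stats"})["Item"]

        # Re-compute the top-N list with this item's new count.
        top = [e for e in stats[list_name] if not e.startswith(item_id + ":")]
        top.append(f"{item_id}:{new_count}")
        top.sort(key=lambda e: int(e.split(":")[1]), reverse=True)
        stats[list_name] = top[:TOP_N]

        old_sequence = stats["sequence"]
        stats["sequence"] = str(uuid.uuid4())   # new sequence for this write
        try:
            table.put_item(
                Item=stats,
                # Fail if someone else updated item-stats since we read it.
                ConditionExpression="#seq = :old",
                ExpressionAttributeNames={"#seq": "sequence"},
                ExpressionAttributeValues={":old": old_sequence},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race: loop, re-read, and try again.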
Part 3 - Putting It All Together
In Part 1 and Part 2 we figured out how to keep track of likes and keep track of statistics, but there are big problems with our approach: performance would be pretty bad at any kind of real-life scale; hot items would create problems for us; and updating the Top-N stats would be a significant bottleneck.
To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".
We can achieve a good balance of performance and scalability using a combination of queues + dynamodb + compute resource.
create a queue to store pending likes
let "likes API" would enqueue a message tagging a post with a like, instead of applying them as they come
implement a queue consumer (could be a Lambda, or some other periodically running process) to pull messages off the queue and aggregate likes per item, then update items and the item-stats
By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.
We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats update would be applied per batch processor.
Depending on volume of incoming likes, you may need to spin up more processors.
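A minimal sketch of such a batch processor, assuming an SQS queue whose message bodies carry an item_id (the queue URL is hypothetical) and reusing the like() helper sketched in Part 1:

# Sketch only: drains a batch of pending likes from SQS and applies
# one aggregated update per item. Queue URL is hypothetical.
from collections import Counter
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-likes"

def process_batch():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        return

    # Aggregate likes per item so each item gets a single update.
    counts = Counter(json.loads(m["Body"])["item_id"] for m in messages)
    for item_id, n in counts.items():
        like(item_id, count=n)
    # ...followed by one item-stats update for the whole batch, as in Part 2.

    sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                 for i, m in enumerate(messages)],
    )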

MS Project: How to set daily actual work for a task using a JavaScript Add-In?

I want to synchronize data for actual work from a web-based application of my company with MS Project. I am currently developing an Add-In with JavaScript in order to achieve this:
The red circle in my screenshot shows the data that I want to set programmatically. However, I have no idea how to achieve this.
I understand that I can get Task GUIDs and then set task fields using the task GUID and the field ID. This way I can save the cumulative actual work, but not per day like in my screenshot.
The API docs on the MS Office website are rather hard to read and navigate. Any help would be appreciated!
Let's first separate the language from the operation.
Operationally, based on your circle, you want to set work for a task on individual days. This is done using TimeScaleData; see https://learn.microsoft.com/en-us/previous-versions/office/developer/office-2003/aa206255(v=office.11). When I did something similar (in VBA), I had to (1) get an array of time-scale values, then (2) walk/iterate through that array and set the work for those days:
Set timeScaleValsArry = myTask.Assignments(1).TimeScaleData(startDay, endDay, pjAssignmentTimescaledWork, pjTimescaleDays)
For a = 1 To timeScaleValsArry.Count
    timeScaleValsArry(a).Value = hoursToWorkThatDay
Next a
Breaking down the elements above:
myTask is the task (of type task) I want to manipulate.
Assignments is an array representing each resource assigned to the task; for my purposes, I only ever had 1 resource assigned, hence the index of (1).
TimeScaleData is the function that returns the array, starting on startDay (whatever you want that to be) and ending on endDay; pjAssignmentTimescaledWork tells this function what data we want to work with (work here, but there are alternates), and pjTimescaleDays is the frequency you want to work with (for instance, you can go down to minutes, or up to years).
Then the returned array timeScaleValsArry is walked, and inside the loop the daily assignment for each value is manipulated. You'd need to customize this part to meet your needs; alternatively, you don't even need to loop if you always have three days: just hard-code the array indices.
As far as language goes, this is clearly doable in VBA. Doing this in C# as a VSTO add-in has very similar syntax. I'd presume JavaScript (what are you using, ScriptLab?) would also have similar syntax.

Dynamodb data model for process/transaction monitoring

I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There are relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (fewer than 10 statuses).
Instead of using multiple rows for every status change, if you updated the same event row instead, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId') which would be derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help in preventing the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well, which may be a good property if events are expected to finish processing eventually, even though the records stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
From your question, it seems like recording status history is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row. DynamoDB supports list data types and also has a generous 400KB item limit, which may just allow you to capture all the desired historical information in the same record.
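To make the calculated-value / sparse-index idea concrete, here is a minimal boto3 sketch; the table name, terminal-status cutoff and suffix range are assumptions:

# Sketch only: write-sharded "derivedStatusId" for a GSI, omitted when
# the event reaches a terminal status so the index stays sparse.
# Table name and terminal-status cutoff are hypothetical.
import random
import boto3

table = boto3.resource("dynamodb").Table("batch_events")
FINISHED_STATUS = 9          # assumption: statuses >= 9 are terminal

def set_status(batch_id, event_id, status_id, message, timestamp):
    item = {
        "batchId": batch_id,
        "eventId": event_id,
        "statusId": status_id,
        "timestamp": timestamp,
        "message": message,
    }
    if status_id < FINISHED_STATUS:
        # Append a random two-digit suffix to spread writes across the GSI.
        item["derivedStatusId"] = f"{status_id}-{random.randint(0, 99):02d}"
    # Finished events carry no derivedStatusId, so they never appear
    # in the sparse GSI.
    table.put_item(Item=item)

Querying for unfinished events with a given status then means issuing one query against the derivedStatusId GSI per suffix (e.g. "2-00" through "2-99") and merging the results.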
