Relational DB and kafka consumer sync - spring-kafka

I have a unique requirement where I have to fetch messages from a topic, persist them in a DB, and then poll at a 15-minute interval. Could somebody please suggest how to do this effectively using Spring-Kafka? Thanks.

If I understand the requirement correctly, you have pretty much described the solution: call poll every 15 minutes using a scheduled job. If you have a cluster of consumers, use a cron-based schedule so they all poll at the same time and all messages are retrieved every 15 minutes; if it's just a single consumer, a simple fixed-rate schedule running every 15 minutes is enough.
Each time you poll, persist the data from the records in your DB and commit the DB transaction. Use auto-commit in Kafka with auto.commit.interval.ms less than 15 minutes - that means each time you poll, the previous batch of records (which by then have definitely been persisted in the DB) will have its offsets committed in Kafka (note that auto-commit performs the offset commit synchronously inside the poll method).
The other important configuration to set is max.poll.interval.ms - make it greater than 15 minutes, otherwise a rebalance will be triggered between polls.
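Putting those settings together, a consumer configuration for this approach might look like the following (the values are illustrative, not prescriptive):

```properties
# Consumer settings for the 15-minute scheduled-poll approach (illustrative values)
enable.auto.commit=true
# < 15 min, so the previous (already-persisted) batch is committed on the next poll
auto.commit.interval.ms=300000
# > 15 min, so the gap between polls does not trigger a rebalance
max.poll.interval.ms=1200000
```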
Note that if a rebalance does occur (as is generally inevitable at some point) you will consume the same records again. The simplest approach here is to use a unique index in the database and catch and ignore the exception when you try to store a duplicate record.
I would be interested in why you have the 15-minute requirement - the simplest way of consuming is just to poll continuously in a loop, or to use a framework like spring-kafka, which again polls frequently. If you are just persisting the data from the records, wouldn't it work regardless of the interval between batches? I guess you have some other constraint to deal with (or I have just misunderstood your requirements).
Note, as mentioned in my comments below if you want to use spring-kafka there is an idle between polls property - https://docs.spring.io/spring-kafka/api/org/springframework/kafka/listener/ContainerProperties.html#setIdleBetweenPolls-long-
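With Spring Boot's spring-kafka auto-configuration there is also a property form of that setting (property name taken from the Spring Boot configuration reference; verify it against your version):

```properties
# 15 minutes between polls, applied to the listener container
spring.kafka.listener.idle-between-polls=900000
```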
The other info above is still important if you take this approach rather than scheduling your own polling.

Related

Monitor DocumentDb RU usage

Is there a way to programmatically monitor the Request Unit utilization of a DocumentDB database so we can manually increase the Request Units proactively?
There isn't currently a call you can execute to see the remaining RUs, since they are replenished every second. Chances are that the time it would take to request and process the current RU level for a given second would mean the data was already stale by the time it returned.
To proactively increase RU/s, the best that can be done would be to use the data from the monitoring blade.
I think you could try the steps below:
Use the Azure Cosmos DB Database - List Metrics REST API from here.
Create an Azure Timer Trigger Function to call the API above on a schedule (maybe every 12 hours). If the metrics reach the threshold value, send a warning email to yourself.

If persistence is enabled, what counts as read operations when data exists in cache?

If the listener is disconnected for more than 30 minutes (for example,
if the user goes offline), you will be charged for reads as if you had
issued a brand-new query.
Does this still apply if persistence is enabled?
Situation 1: The app is offline for over 30 minutes. Persistence is enabled and data is read from the cache. Does reading documents from the cache count as read operations?
Situation 2: The app is online but no add/modify/delete operations occur. Persistence is enabled and all data exists in the cache. Does opening my app after 30 minutes cause read operations if no new data has been added/modified/deleted?
Firestore documentation
In both cases, if some read operation is satisfied only by the local cache, it is not billed.
The issue with the documentation that you quoted about listeners is specifically regarding the total results of a query that could return multiple documents over time. Note that a query listener can generate updates for new or changed documents indefinitely over time. But if your query listener is disconnected for more than 30 minutes, you are billed for the entire query again, and do not pick up where the listener may have left off previously with partial or in-progress results.

Timer function for chat application

I am actually building a chat application which has to show current users. I have a column 'IsOnline' in the DB whose value toggles between 1 and 0 as a user logs in or out. I need a function that hits an API every 15 seconds to get the latest users who are currently online.
Since I am using Entity Framework, which does not support SignalR and SQL dependency, I have decided to go this way.
How can I have a method that runs every 15 seconds on a separate thread, so as not to interfere with my other CRUD operations, as long as I have a user in session?
Polling every 15 seconds is not a good solution, especially if the call hits the DB. Think about the latency of this approach. I think you need to look for a different approach rather than calling the DB every 15 seconds.
If you want to check online/offline status, maintain it in memory rather than persisting it in the DB (persist every hour or two if you want to keep it in the DB at all).
Store the status in memory, for instance in memcached or redis, and have the client issue a request every 15 seconds. The online status is transient; it does not need to be stored in the DB.
It's hard to advise in depth because you did not describe the architecture of your app.
In general, efficient implementation of presence notifications is tricky. It may be easier to take something off the shelf instead of developing your own.
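As a sketch of the in-memory approach suggested above (all names here are hypothetical; in a multi-server deployment this map would live in redis or memcached rather than in-process):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks which users are online based on a periodic heartbeat request.
// A user is considered online if a heartbeat arrived within timeoutMillis.
class PresenceTracker {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    PresenceTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called from the client's 15-second heartbeat request
    void heartbeat(String userId, long nowMillis) {
        lastSeen.put(userId, nowMillis);
    }

    // Online = heartbeat seen within the timeout window
    boolean isOnline(String userId, long nowMillis) {
        Long seen = lastSeen.get(userId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }
}
```

The time is passed in as a parameter only to keep the sketch deterministic; real code would use System.currentTimeMillis().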

Ensure In Process Records are Unique ActiveMQ

I'm working on a system where clients enter data into a program and the save action posts a message to activemq for more time intensive processing.
We are running into rare occasions where a record is updated by a client twice in a row and a consumer on that ActiveMQ queue processes the two records at the same time. I'm looking for a way to ensure that messages containing records with the same identity are processed in order and only one at a time. To be clear: if records with IDs 1, 1, and 2 (in that order) are sent to ActiveMQ, the first 1 would process, then 2 (if the first 1 was still in process), and finally the second 1.
Another requirement, (due to volume) requires that the consumer be multi-threaded, so there may be 16 threads accessing that queue. This would have to be taken into consideration.
So if you have multiple threads reading that queue and you want the solution to stay close to ActiveMQ, you have to think about how you scale with respect to ordering.
If you have multiple consumers, they may operate at different speeds and you can never be sure which consumer goes before the other. The only way to guarantee order is to have a single consumer (you can still achieve high availability by using exclusive consumers).
You can, however, segment the load in other ways; how depends a lot on your application. If you can create, say, 16 "worker" queues (or whatever your max consumer count is) and distribute the load to them while guaranteeing that requests from a single user always go to the same worker queue, message order will be preserved per user.
If you have no good way to divide users into groups, simply take the userID mod MAX_CONSUMER_THREADS as a simple solution.
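That modulo routing could be sketched like this (queue naming is hypothetical; Math.floorMod guards against negative hash codes if your user IDs are not numeric):

```java
// Pins all messages for a given user to one worker queue, so per-user order
// is preserved even with 16 consumer threads each owning one queue.
class WorkerQueueRouter {
    private final int maxConsumerThreads;

    WorkerQueueRouter(int maxConsumerThreads) {
        this.maxConsumerThreads = maxConsumerThreads;
    }

    // Same user always maps to the same worker queue name
    String queueFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), maxConsumerThreads);
        return "worker.queue." + bucket;
    }
}
```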
There may be better ways to deal with this problem in the consumer logic itself, like keeping track of a sequence number and postponing updates that arrive out of order (a scheduled delay can be used for that).

How to implement real time updates in ASP.NET

I have seen several websites that show you a real time update of what's going on in the database. An example could be
A stock ticker website that shows stock prices in real time
Showing data like "What other users are searching for currently.."
I'd assume this would involve some kind of polling mechanism that queries the database every few seconds and renders it on a web page. But the thought scares me from a performance standpoint.
In an application I am working on, I need to display the real time status of an operation that a user has submitted. Users wait for the process to be completed. As and when an operation is completed, the status is updated by another process (could be a windows service). Should I query the database every second to get the updated status?
It's not necessarily done in the DB; as you suggested, that's expensive. Although the DB might be the backing store, a more efficient mechanism typically accompanies the polling operation, such as keeping the real-time status in memory in addition to eventually writing it to the DB. You can poll memory much more efficiently than running SELECT status FROM Table every second.
Also, as I mentioned in a comment, in some circumstances you can get a lot of mileage out of forging the appearance of status updates through animations and such, employing estimation, and checking the data source less often.
Edit
(an optimization to use less db resources for real time)
Instead of polling the database per user to check job status every X seconds, slightly alter the behaviour: each time a job is added to the database, read the database once and put metadata about all jobs in the cache. For example, the memory cache would reflect 50 users' jobs, one job each: [user49 ... user3, user2, user1, userCurrent] (perhaps better written as [job49 ... job2, job1, job_current], but same idea).
Then individual users' web pages poll that cache, which is always kept current. In this example the DB was read just 50 times into the cache (once per job submission). If those 50 users wait an average of 1 minute for job processing and poll for status every second, then the user base polls the cache a total of 50 users x 60 secs = 3000 times.
That's 50 database reads instead of 3000 over a period of 50 minutes (on average one per minute). The cache is always fresh and yet handles the load. It's much less scary than hitting the DB every second for each user. You can store other stats and info in the cache to help with estimation and such. As long as a fresh cache provides more efficiency, it's a viable alternative to massive DB hits.
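The shape of that cache can be sketched as follows (all names illustrative; the dbReads counter stands in for the single SELECT per job event):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// The DB is read once per job submission / status change; the per-second
// user polls hit only the in-memory map.
class JobStatusCache {
    private final Map<String, String> statusByJob = new ConcurrentHashMap<>();
    final AtomicInteger dbReads = new AtomicInteger();

    // Called once when a job is added or its status changes
    // (e.g. by the processing service, or kicked off by a DB trigger)
    void refreshFromDb(String jobId, String status) {
        dbReads.incrementAndGet();   // stands in for the single DB read
        statusByJob.put(jobId, status);
    }

    // Called every second by each waiting user's page; never touches the DB
    String poll(String jobId) {
        return statusByJob.getOrDefault(jobId, "UNKNOWN");
    }
}
```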
Note: By cache I mean a global store like Application or 3rd party solution, not the ASP.NET page cache which goes out of scope quickly. Caching using ASP.NET's mechanisms might not suit your situation.
Another Note: The db will know when another job record is appended no matter from where, so a trigger can init the cache update.
Despite a good database solution, so many users polling frequently is likely to create problems with web server connections and you might need a different solution at that level, depending on traffic.
Maybe have a cache and work against it so you don't hit the database each time the data is modified, and flush to the database every few seconds or minutes, or whatever you like.
The problem touches many layers of a web application.
On the client, you either use an iframe whose content auto-refreshes every n seconds using the meta refresh tag (HTML), or JavaScript triggered by a timer that updates a named div (AJAX).
On the server, you have at least two places to cache your data:
One is in the Application object, where you keep a timestamp of the last update and refresh the cached data when your refresh interval elapses.
If you want to present data from a database, keep aggregated values or cache relevant data for faster retrieval.
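The timestamp-based refresh just described can be sketched generically (names are illustrative; the Supplier stands in for the aggregate DB query):

```java
import java.util.function.Supplier;

// Cached value plus a timestamp of the last update; the source is re-read
// only when the refresh interval has elapsed.
class TimedCache<T> {
    private final Supplier<T> source;     // e.g. the aggregate DB query
    private final long refreshMillis;
    private T value;
    private long lastRefresh;
    private boolean loaded = false;

    TimedCache(Supplier<T> source, long refreshMillis) {
        this.source = source;
        this.refreshMillis = refreshMillis;
    }

    // Time is passed in to keep the sketch deterministic; real code would
    // use System.currentTimeMillis()
    synchronized T get(long nowMillis) {
        if (!loaded || nowMillis - lastRefresh >= refreshMillis) {
            value = source.get();
            lastRefresh = nowMillis;
            loaded = true;
        }
        return value;
    }
}
```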
