Confluent Batch Consumer. Consumer not working if Time out is specified - confluent-kafka-dotnet

I am trying to consume a maximum of 1000 messages from Kafka at a time (I need to batch insert into MSSQL). I was under the impression that the Kafka client keeps an internal queue that is filled with messages fetched from the brokers, and that consumer.Consume() simply checks that queue and returns if it finds something; otherwise it blocks until the queue is updated or the timeout expires.
I tried the solution suggested here: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1164#issuecomment-610308425
But when I specify TimeSpan.Zero (or any other timespan up to 1000 ms), the consumer never consumes any messages. If I remove the timeout it does consume messages, but then I cannot exit the loop when there are no more messages left to read.
I also saw another question on Stack Overflow that suggested reading the offset of the last message sent to Kafka and then consuming until I reach that offset before breaking out of the loop. However, I currently have one consumer and 6 partitions for the topic. I haven't tried it yet, but I think managing offsets for each partition might make the code messy.
Can someone please tell me what to do?
    static List<RealTime> getBatch()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = ConfigurationManager.AppSettings["BootstrapServers"],
            GroupId = ConfigurationManager.AppSettings["ConsumerGroupID"],
            AutoOffsetReset = AutoOffsetReset.Earliest,
        };
        List<RealTime> results = new List<RealTime>();
        List<string> malformedJson = new List<string>();
        using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
        {
            consumer.Subscribe("RealTimeTopic");
            int count = 0;
            while (count < batchSize)
            {
                // Wait up to 1000 ms for the next message.
                var consumerResult = consumer.Consume(1000);
                if (consumerResult?.Message is null)
                {
                    break;
                }
                Console.WriteLine("read");
                try
                {
                    RealTime item = JsonSerializer.Deserialize<RealTime>(consumerResult.Message.Value);
                    results.Add(item);
                    count += 1;
                }
                catch (Exception e)
                {
                    Console.WriteLine("malformed");
                    malformedJson.Add(consumerResult.Message.Value);
                }
            }
            consumer.Close();
        }
        Console.WriteLine(malformedJson.Count);
        return results;
    }

I found a workaround.
For some reason, the first call to Consume needs to be made without a timeout, so it blocks until at least one message arrives. After that, calling Consume with a zero timeout fetches the remaining messages one by one from the internal queue. This seems to work best.
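The shape of that workaround can be sketched in isolation. The following is a minimal toy model in Python, not the Confluent API: a standard `queue.Queue` stands in for the client's internal queue, and `get_batch`, `first_wait` and `buffered` are hypothetical names chosen for illustration.

```python
import queue

def get_batch(q, batch_size, first_wait=None):
    """Block for the first item, then drain whatever is already buffered."""
    batch = []
    try:
        # First read blocks (forever if first_wait is None) until an item arrives.
        batch.append(q.get(timeout=first_wait))
    except queue.Empty:
        return batch
    while len(batch) < batch_size:
        try:
            # Later reads never wait: they only take what is already queued.
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

# Toy "internal queue" pre-filled with five messages.
buffered = queue.Queue()
for i in range(5):
    buffered.put(f"msg-{i}")
```

The blocking first read guarantees the batch is not empty, and the non-blocking reads let the loop end as soon as the buffer runs dry instead of waiting for a full batch.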

I had a similar problem; updating the Confluent.Kafka and librdkafka libraries from version 1.8.2 to 2.0.2 helped.

Related

Consume one message per time in Netcore and Kafka

I'm trying to write a Kafka consumer using .NET Core 2.1. The consumer should read one message, compare its timestamp, and either commit or not, so that it can stay on the same message until the validation succeeds. See my code:
    while (true)
    {
        try
        {
            var cr = consumer.Consume(TimeSpan.FromMilliseconds(4000));
            if (cr == null)
            {
                Console.WriteLine("Exiting ... no messages to process");
                break;
            }
            // Compare UTC with UTC; DateTime.Now is local time and would skew the result.
            double totalSeconds = (DateTime.UtcNow - cr.Timestamp.UtcDateTime).TotalSeconds;
            Console.WriteLine($"TotalSeconds = {totalSeconds} , Resume = {resumeTimeSeconds}");
            if (totalSeconds > resumeTimeSeconds)
            {
                Console.WriteLine($"Message = {cr.Value}");
                consumer.Commit();
            }
            else
            {
                Console.WriteLine($"Skipping... {cr.Value}");
                continue;
            }
        }
        catch (ConsumeException e)
        {
            Console.WriteLine($"Error occurred: {e.Error.Reason}");
        }
    }
So, I have 10 messages in my topic and the lag is 2. I want the next message to be delivered only after I Commit() the previous one, but consumer.Consume() always returns the next message.
The committed offset only comes into play when your consumer starts (or recovers from a crash). While it is running, the consumer internally keeps track of the last received offset for each partition.
What you can do is use Seek() to go back to the offset you just tried to process, and then retry.
Yannick
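To illustrate why consuming advances regardless of commits, and how seeking back gives a retry, here is a small toy model in Python. It is not the Confluent API: `ToyPartitionReader` and `process` are hypothetical names, and a single in-memory list stands in for one partition.

```python
class ToyPartitionReader:
    """Toy stand-in for a consumer reading a single partition."""

    def __init__(self, messages):
        self.messages = messages
        self.position = 0   # next offset handed out by consume()
        self.committed = 0  # offset a restarted consumer would resume from

    def consume(self):
        if self.position >= len(self.messages):
            return None
        offset = self.position
        self.position += 1  # advances whether or not anything is committed
        return offset, self.messages[offset]

    def seek(self, offset):
        self.position = offset

    def commit(self):
        self.committed = self.position


def process(reader, is_ready, max_polls):
    """Handle messages in order, retrying any message that is not ready yet."""
    done = []
    for _ in range(max_polls):
        item = reader.consume()
        if item is None:
            break
        offset, value = item
        if is_ready(value):
            done.append(value)
            reader.commit()
        else:
            reader.seek(offset)  # rewind so the same message comes back next time
    return done
```

In a real consumer a short pause before the retry would usually be sensible, so the loop does not spin on a message that is not ready.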

Microsoft's MPEG-2 demuxer filter - can I change an elementary stream pin's PID while the graph is running?

I'm working with multi-program UDP MPEG-2 TS streams that, unfortunately, dynamically re-map their elementary stream PIDs at random intervals. The stream is being demuxed using Microsoft's MPEG-2 demultiplexer filter.
I'm using the PSI-Parser filter (an example filter included in the DirectShow base classes) to react to the PAT/PMT changes.
The code reacts properly to the change, yet I am experiencing some odd crashes (heap memory corruption) right after I remap the demuxer pins to their new IDs. (The re-mapping is performed inside the thread that processes graph events, while the EC_PROGRAMCHANGED message is being handled.)
The crash could be due to faulty code on my part, yet I have not found any reference that tells me whether changing the pin PID mapping is safe while the graph is running.
Can anyone tell me whether this operation is safe and, if it is not, what I could do to minimize capture disruption?
I managed to find the source code for a Windows CE version of the demuxer filter. Inspecting it, indeed, it seems that it is safe to remap a pin while the filter is running.
I also managed to find the source of my problems with the PSI-Parser filter.
When a new transport stream is detected, or the PAT version changes, the PAT is flushed: all programs are removed, and the table is re-parsed and repopulated.
There is a subtle bug within the CPATProcessor::flush() method.
    //
    // flush
    //
    // flush an array of struct: m_mpeg2_program[];
    // and unmap all PMT_PIDs pids, except one: PAT
    BOOL CPATProcessor::flush()
    {
        BOOL bResult = TRUE;
        bResult = m_pPrograms->free_programs(); // CPrograms::free_programs() call
        if (bResult == FALSE)
            return bResult;
        bResult = UnmapPmtPid();
        return bResult;
    } // flush
Here's the CPrograms::free_programs() implementation.
    _inline BOOL free_programs()
    {
        for (int i = 0; i < m_ProgramCount; i++) {
            if (!HeapFree(GetProcessHeap(), 0, (LPVOID) m_programs[i]))
                return FALSE;
        }
        return TRUE;
    }
The problem here is that the m_ProgramCount member is never cleared. So, apart from reporting the wrong number of programs in the table after a flush (since it is incremented for each program found in the table), the next flush will try to release memory that was already released.
Here's my updated version that fixes the heap corruption errors:
    _inline BOOL free_programs()
    {
        for (int i = 0; i < m_ProgramCount; i++) {
            if (!HeapFree(GetProcessHeap(), 0, (LPVOID) m_programs[i]))
                return FALSE;
        }
        m_ProgramCount = 0; // This was missing, next call will try to free memory twice
        return TRUE;
    }
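The effect of the stale counter can be reproduced with a small toy model. This Python sketch is not the DirectShow code: `ProgramTable` and `double_frees` are hypothetical names, a list of released ids stands in for the HeapFree calls, and program ids are assumed to be unique.

```python
class ProgramTable:
    """Toy model of a fixed-size program array with a separate element count."""

    def __init__(self, reset_count_on_free, capacity=8):
        self.slots = [None] * capacity
        self.count = 0
        self.reset_count_on_free = reset_count_on_free
        self.released = []  # every id passed to the (simulated) free call

    def add(self, program_id):
        self.slots[self.count] = program_id
        self.count += 1

    def free_programs(self):
        for i in range(self.count):
            # Stale slots below `count` get released again on later flushes.
            self.released.append(self.slots[i])
        if self.reset_count_on_free:
            self.count = 0  # the missing step in the original code


def double_frees(table):
    """Flush twice with one new program in between; report ids freed twice."""
    table.add("p1")
    table.add("p2")
    table.free_programs()
    table.add("p3")
    table.free_programs()
    return sorted(p for p in set(table.released) if table.released.count(p) > 1)
```

With the counter left untouched, the second flush walks over the two slots that were already released; resetting it makes each entry get released exactly once.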

Asynchronous hive query execution : OperationHandle gets cleaned up at server side as soon as the query initiator client disconnects

Is it possible to execute a query asynchronously in HiveServer?
For example, is it possible to do something like this from the client?
    QueryHandle handle = executeAsyncQuery(hiveQuery);
    Status status = handle.checkStatus();
    if (status.isCompleted()) {
        QueryResult result = handle.fetchResult();
    }
I also had a look at "How do I make an async call to Hive in Java?", but it did not help. The answers were mostly about the Thrift clients taking a callback argument.
Any help would be appreciated. Thanks!
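The handle-based pattern asked about here (submit, check status, fetch the result) can be sketched generically. The following Python example has nothing Hive-specific in it: `run_query` is a hypothetical stand-in for a long-running query, and a `concurrent.futures` future plays the role of the query handle.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    """Hypothetical stand-in for a slow query; returns fake rows."""
    time.sleep(0.05)
    return [("rows for", sql)]

executor = ThreadPoolExecutor(max_workers=1)

# Submitting returns a handle immediately; the work runs in the background.
handle = executor.submit(run_query, "SELECT 1")

# Poll the handle instead of blocking on the call itself.
while not handle.done():
    time.sleep(0.01)

result = handle.result()  # fetch the result once the status says completed
executor.shutdown()
```

Note that this handle only lives inside the submitting process, so it does not address surviving a client disconnect.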
[EDIT 1]
I went through HiveConnection.java in hive-jdbc. By default, hive-jdbc uses the async Thrift APIs: it submits a query and polls for result sets (see HiveStatement.java). I am now able to write code that is purely non-blocking. The problem is that as soon as the client disconnects, all trace of the query is lost.
Client 1
    final TCLIService.Client client = new TCLIService.Client(createBinaryTransport(host, port, loginTimeout, sessConf, false)); // from HiveConnection.java
    TSessionHandle sessionHandle = openSession(client); // from HiveConnection.java
    TExecuteStatementReq execReq = new TExecuteStatementReq(sessionHandle, sql);
    execReq.setRunAsync(true);
    execReq.setConfOverlay(sessConf);
    final TGetOperationStatusReq handle = client.ExecuteStatement(execReq);
    writeHandleToFile("~/handle", handle);
Client 2
    final TGetOperationStatusReq handle = readHandleFromFile("~/handle");
    final TCLIService.Client client = new TCLIService.Client(createBinaryTransport(host, port, loginTimeout, sessConf, false));
    while (true) {
        System.out.println(client.GetOperationStatus(handle).getOperationState());
        Thread.sleep(1000);
    }
Client 2 keeps printing FINISHED_STATE as long as Client 1 is alive. But if the Client 1 process completes or is killed, Client 2 starts printing null, which means HiveServer2 cleans up the resources as soon as the client disconnects.
Is it possible to configure HiveServer2 so that this cleanup is based on time or something similar?
Thanks!
I did some research and figured out that this happens only with the binary transport (TCP):
    @Override
    public void deleteContext(ServerContext serverContext,
            TProtocol input, TProtocol output) {
        Metrics metrics = MetricsFactory.getInstance();
        if (metrics != null) {
            try {
                metrics.decrementCounter(MetricsConstant.OPEN_CONNECTIONS);
            } catch (Exception e) {
                LOG.warn("Error Reporting JDO operation to Metrics system", e);
            }
        }
        ThriftCLIServerContext context = (ThriftCLIServerContext) serverContext;
        SessionHandle sessionHandle = context.getSessionHandle();
        if (sessionHandle != null) {
            LOG.info("Session disconnected without closing properly, close it now");
            try {
                cliService.closeSession(sessionHandle);
            } catch (HiveSQLException e) {
                LOG.warn("Failed to close session: " + e, e);
            }
        }
    }
The above method (from ThriftBinaryCLIService) is invoked by this piece of code in TThreadPoolServer, which ThriftBinaryCLIService uses:
    eventHandler.deleteContext(connectionContext, inputProtocol,
            outputProtocol);
Apparently the HTTP transport (ThriftHttpCLIService) has a different strategy for cleaning up operation handles (not eager like TCP).
I will check with the Hive community to understand this a bit more and to see whether there is already an issue addressing it.

Loading any persistent workflow containing delay activity when it is a runnable instance in the store

We are trying to load and resume workflows that contain a delay. I have seen the Microsoft Absolute Delay sample, which uses store.WaitForEvents and LoadRunnableInstance to load the workflow. However, in that sample the workflow is already known.
In our case we want to call store.WaitForEvents every 5 seconds or so to check whether there is a runnable instance and, if so, load and run only that instance (or those instances). Is there a way to know which workflow instance is ready?
We maintain the workflow ID and its associated XAML in our database, so if we knew the workflow instance ID we could look up the XAML mapped to it, create the workflow, and then call LoadRunnableInstance on it.
Any help would be greatly appreciated.
Microsoft sample (Absolute Delay)
    public void Run()
    {
        wfHostTypeName = XName.Get("Version" + Guid.NewGuid().ToString(),
            typeof(WorkflowWithDelay).FullName);
        this.instanceStore = SetupSqlpersistenceStore();
        this.instanceHandle =
            CreateInstanceStoreOwnerHandle(instanceStore, wfHostTypeName);
        WorkflowApplication wfApp = CreateWorkflowApp();
        wfApp.Run();
        while (true)
        {
            this.waitHandler.WaitOne();
            if (completed)
            {
                break;
            }
            WaitForRunnableInstance(this.instanceHandle);
            wfApp = CreateWorkflowApp();
            try
            {
                wfApp.LoadRunnableInstance();
                waitHandler.Reset();
                wfApp.Run();
            }
            catch (InstanceNotReadyException)
            {
                Console.WriteLine("Handled expected InstanceNotReadyException, retrying...");
            }
        }
        Console.WriteLine("workflow completed.");
    }

    public void WaitForRunnableInstance(InstanceHandle handle)
    {
        var events = instanceStore.WaitForEvents(handle, TimeSpan.MaxValue);
        bool foundRunnable = false;
        foreach (var persistenceEvent in events)
        {
            if (persistenceEvent.Equals(HasRunnableWorkflowEvent.Value))
            {
                foundRunnable = true;
                break;
            }
        }
        if (!foundRunnable)
        {
            Console.WriteLine("no runnable instance");
        }
    }
Thanks
Anamika
I had a similar problem with durable delay activities and WorkflowApplicationHost. I ended up creating my own 'Delay' activity that worked essentially the same way as the out-of-the-box one (it takes an argument that describes when to resume the workflow, and then bookmarks itself). Instead of saving the delay info in the SqlInstanceStore, though, my Delay activity created a record in a separate database (similar to the one you are using to track the workflow IDs and XAML). I then wrote a simple service that polled that database for expired delays and initiated a resume of the corresponding workflow.
Oh, and the Delay activity deleted its record from that database on bookmark resume.
HTH
I'd suggest having a separate SqlPersistenceStore for each workflow definition you're hosting.

How often should I open/close my Booksleeve connection?

I'm using the Booksleeve library in a C#/ASP.NET 4 application. Currently the RedisConnection object is a static object in my MonoLink class. Should I keep this connection open, or should I open and close it after each query/transaction (as I'm doing now)? I'm slightly confused. Here's how I'm using it at the moment:
    public static MonoLink CreateMonolink(string URL)
    {
        redis.Open();
        var transaction = redis.CreateTransaction();
        string Key = null;
        try
        {
            var IncrementTask = transaction.Strings.Increment(0, "nextmonolink");
            if (!IncrementTask.Wait(5000))
            {
                transaction.Discard();
                throw new System.TimeoutException("Monolink index increment timed out.");
            }
            // Increment complete
            Key = string.Format("monolink:{0}", IncrementTask.Result);
            var AddLinkTask = transaction.Strings.Set(0, Key, URL);
            if (!AddLinkTask.Wait(5000))
            {
                transaction.Discard();
                throw new System.TimeoutException("Add monolink creation timed out.");
            }
            // Run the transaction
            var ExecTransaction = transaction.Execute();
            if (!ExecTransaction.Wait(5000))
            {
                throw new System.TimeoutException("Add monolink transaction timed out.");
            }
        }
        catch (Exception)
        {
            transaction.Discard();
            throw; // rethrow without resetting the stack trace
        }
        finally
        {
            redis.Close(false);
        }
        // Link has been added to redis
        MonoLink ml = new MonoLink();
        ml.Key = Key;
        ml.URL = URL;
        return ml;
    }
Thanks, in advance, for any responses/insight. Also, is there any sort of official documentation for this library? Thank you S.O. ^_^.
According to the author of Booksleeve: "The connection is thread safe and intended to be massively shared; don't do a connection per operation."
"Should I be keeping this connection open, or should I be open/closing it after each query/transaction (as I'm doing now)?"
There is probably some overhead if you open a new connection each time you want to make a query/transaction, and although Redis is designed for a high number of concurrently connected clients, there might be performance problems if that number is in the tens of thousands. As far as I know, connection pooling should be done by the client libraries (because Redis itself doesn't have this functionality), so you should check whether Booksleeve supports it. Otherwise you should open the connection when your application starts and keep it open for its lifetime (unless you need parallel clients connected to Redis for some reason).
"Also, is there any sort of official documentation for this library?"
The only documentation I was able to find on how to use it was the tests folder in its source code.
For reference (continuing @bzlm's answer), I created a singleton that always provides the same Redis connection using BookSleeve (if the connection is closed, a new one is created; otherwise the existing connection is returned).
Look at this: https://stackoverflow.com/a/8777999/290343
You consume it like this:
    RedisConnection connection = Redis.RedisConnectionGateway.Current.GetConnection();
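The idea behind such a gateway can be sketched in a few lines. The following Python example is a toy model and not the linked C# implementation: `FakeConnection` is a hypothetical stand-in for a real Redis connection, and `ConnectionGateway` is an illustrative name.

```python
import threading

class FakeConnection:
    """Hypothetical stand-in for a real connection object."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


class ConnectionGateway:
    """Hands out one shared connection, recreating it if it was closed."""

    _lock = threading.Lock()
    _connection = None

    @classmethod
    def get_connection(cls):
        # The lock keeps two threads from creating a connection at the same time.
        with cls._lock:
            if cls._connection is None or not cls._connection.is_open:
                cls._connection = FakeConnection()
            return cls._connection
```

Callers always go through `get_connection()` rather than holding their own instance, so a dropped connection is replaced transparently on the next call.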

Resources