Cassandra Async reads and writes, Best practices - asynchronous

To set the context:
We have four tables in Cassandra: one data table and three search tables (let's assume they are called DATA, SEARCH1, SEARCH2 and SEARCH3).
We have an initial-load requirement of up to 15k rows per request for the DATA table, and the same rows must be written to the search tables to keep them in sync.
We do this with batch inserts, each batch containing 4 queries (one per table), to keep the tables consistent.
But for every batch we first need to read the data. If the row exists, we update only the DATA table's lastUpdatedDate column; otherwise we insert into all 4 tables.
Below is a code snippet showing how we are doing it:
public List<Item> loadData(List<Item> items) {
    CountDownLatch latch = new CountDownLatch(items.size());
    ForkJoinPool pool = new ForkJoinPool(6);
    pool.submit(() -> items.parallelStream().forEach(item -> {
        BatchStatement batch = prepareBatchForCreateOrUpdate(item);
        batch.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        ResultSetFuture future = getSession().executeAsync(batch);
        Futures.addCallback(future, new AsyncCallBack(latch), pool);
    }));
    try {
        latch.await();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    //TODO Consider what to do with the failed Items, Retry? or remove from the items in the return type
    return items;
}
private BatchStatement prepareBatchForCreateOrUpdate(Item item) {
    BatchStatement batch = new BatchStatement();
    Item existingItem = getExisting(item); //synchronous read
    if (null != existingItem) {
        existingItem.setLastUpdatedDateTime(new Timestamp(System.currentTimeMillis()));
        batch.add(existingItem);
        return batch;
    }
    batch.add(item);
    batch.add(convertItemToSearch1(item));
    batch.add(convertItemToSearch2(item));
    batch.add(convertItemToSearch3(item));
    return batch;
}
class AsyncCallBack implements FutureCallback<ResultSet> {
    private CountDownLatch latch;
    AsyncCallBack(CountDownLatch latch) {
        this.latch = latch;
    }
    // Count down the latch on either success or failure so that the thread waiting on latch.await() knows when all the async calls have completed.
    @Override
    public void onSuccess(ResultSet result) {
        latch.countDown();
    }
    @Override
    public void onFailure(Throwable t) {
        LOGGER.warn("Failed async query execution, Cause:{}:{}", t.getCause(), t.getMessage());
        latch.countDown();
    }
}
The execution takes about 1.5 to 2 minutes for 15k items, including the network round trip between the application and the Cassandra cluster (both reside on the same DNS but in different pods on Kubernetes).
We have considered making the read call getExisting(item) async as well, but handling the failure cases is becoming complex.
Is there a better approach to data loads for Cassandra (considering only async writes through the DataStax Enterprise Java driver)?

First, batches in Cassandra are a different thing from batches in relational DBs, and by using them you're putting more load on the cluster.
Regarding making everything async, I thought about the following possibility:
make the query to the DB, obtain a Future and add a listener to it, which will be executed when the query is finished (override onSuccess);
from that method, you can schedule the execution of the next actions based on the result obtained from Cassandra.
One thing you need to check is that you don't issue too many simultaneous requests. In version 3 of the protocol you can have up to 32k in-flight requests per connection, but in your case you may issue up to 60k (4x15k) requests. I use a wrapper around the Session class to limit the number of in-flight requests.
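The read-then-decide chain described above could look roughly like the following. This is only a sketch: it uses the JDK's CompletableFuture and in-memory maps as stand-ins for the driver and the tables, and readAsync, touchAsync, insertAllAsync and createOrUpdate are hypothetical names, not driver APIs. With the driver's Guava-based ResultSetFuture, Futures.transformAsync should play a role similar to thenCompose.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class AsyncReadThenWrite {
    // In-memory stand-ins for the DATA table and the search tables (assumptions for the demo).
    static final Map<String, Long> dataTable = new ConcurrentHashMap<>();
    static final Map<String, String> searchTable = new ConcurrentHashMap<>();

    // Hypothetical async read: completes with the stored timestamp, or null if the row is absent.
    static CompletableFuture<Long> readAsync(String key) {
        return CompletableFuture.supplyAsync(() -> dataTable.get(key));
    }

    // Hypothetical async update of only the lastUpdated value in the DATA table.
    static CompletableFuture<String> touchAsync(String key) {
        return CompletableFuture.supplyAsync(() -> {
            dataTable.put(key, System.currentTimeMillis());
            return "updated";
        });
    }

    // Hypothetical async insert into the DATA table and the search tables.
    static CompletableFuture<String> insertAllAsync(String key) {
        return CompletableFuture.supplyAsync(() -> {
            dataTable.put(key, System.currentTimeMillis());
            searchTable.put(key, key);
            return "inserted";
        });
    }

    // The next action is scheduled from the read's completion; no thread blocks on the read.
    static CompletableFuture<String> createOrUpdate(String key) {
        return readAsync(key)
                .thenCompose(existing -> existing != null ? touchAsync(key) : insertAllAsync(key))
                .exceptionally(t -> "failed"); // one place to decide on retry or skip
    }

    public static void main(String[] args) {
        System.out.println(createOrUpdate("item-1").join()); // prints "inserted"
        System.out.println(createOrUpdate("item-1").join()); // prints "updated"
    }
}
```

Keeping the failure handling in a single exceptionally step per item is one way to stop the error cases from spreading across callbacks.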
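The wrapper itself is not shown here, so the following is not the answerer's code. It is a minimal sketch of one common way to cap in-flight requests, assuming a Semaphore whose permit is taken before each request is issued and released when its future completes. ThrottledExecutor, submit and runDemo are hypothetical names, and CompletableFuture stands in for the driver's future type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ThrottledExecutor {
    private final Semaphore permits;

    public ThrottledExecutor(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller until a permit is free, then issues the request.
    public <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request) {
        permits.acquireUninterruptibly();
        CompletableFuture<T> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            permits.release(); // the request was never issued
            throw e;
        }
        // Release the permit when the request finishes, on success or failure.
        future.whenComplete((result, error) -> permits.release());
        return future;
    }

    // Runs `tasks` simulated requests and returns the highest concurrency observed.
    static int runDemo(int maxInFlight, int tasks) {
        ThrottledExecutor throttle = new ThrottledExecutor(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int n = i;
            futures.add(throttle.submit(() -> CompletableFuture.supplyAsync(() -> {
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(5); // simulated round trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                inFlight.decrementAndGet();
                return n;
            })));
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in-flight: " + runDemo(4, 100));
    }
}
```

The submitting thread is the one that blocks, so the limit also acts as back-pressure on the loop that generates the 15k items.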


How to find the number of received event record count when using the filter strategy?

EDIT: I would like to find the number of records that were rejected by the filter in one polled batch.
For example: if 1000 records are consumed from a partition in one poll and 500 of them are eliminated by the filter strategy, the remaining 500 records reach the listener for processing.
Right now the issue is that I can see only the count of records received at the listener, not the total number of messages that were eliminated. Basically, I would like to get the count of eliminated records in one polled batch, or the total count received at the filter.
Listener:
public class MyConsumer {
    @Autowired
    private TestEventProcessor testEventProcessor;
    @KafkaListener(topics = "${input.topic}", containerFactory = "testBatchListenerContainerFactory")
    public void onMessage(
            @Payload List<ConsumerRecord<String, String>> consumerRecords, Acknowledgment acknowledgment) {
        log.info("Total no of records in this batch :" + consumerRecords.size());
        testEventProcessor.processAndAcknowledgeBatchMessages(consumerRecords, acknowledgment);
    }
}
Kafka Config:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> batchListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory(CGID));
factory.setConcurrency(10);
factory.setBatchListener(true);
factory.setAckDiscarded(true);
int counter = 0;
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
factory.setRecordFilterStrategy(
(consumerRecord) -> {
try {
counter++;
long start = System.currentTimeMillis();
if (hasRequiredFields(consumerRecord)) {
return false;
}
} catch (Exception ex) {
log.error("exception occured:",
consumerRecord.value(),
consumerRecord.partition(),
consumerRecord.offset(),
ex.getStackTrace());
return true;
}
return true;
});
}
I have some filter criteria in hasRequiredFields()
which eliminate records. Although I configure a batch size for consumption, I always see a very small count on the listener side. I would like to know how many records are being received, how many are being excluded, and how many are being processed.
There is no built-in mechanism for a record listener to get the number of records in the batch of records returned by the poll.
One solution would be to implement ConsumerInterceptor and add it to the consumer configuration.
Then, onConsume(ConsumerRecords) will be called for each batch (before the first record is sent to the listener) and you can get the count() there.
If you have only one listener with concurrency=1, you could store the count in a static field. Otherwise you would need to use a static ConcurrentMap with the thread name as the key and the count in the value.
Probably simplest to implement ConsumerInterceptor on your filter.
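The per-thread bookkeeping described above might be sketched as follows. This uses only the JDK so it can run on its own; PollCounts, recordPolled and filteredOut are hypothetical names. In a real application, recordPolled would be called from ConsumerInterceptor.onConsume(records) with records.count(), and filteredOut from the listener with the size of the batch it received, on the assumption that both run on the same consumer thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PollCounts {
    // Thread name -> number of records returned by that consumer thread's latest poll.
    private static final ConcurrentMap<String, Integer> LAST_POLL = new ConcurrentHashMap<>();

    // Intended to be called from the interceptor, before filtering happens.
    public static void recordPolled(int count) {
        LAST_POLL.put(Thread.currentThread().getName(), count);
    }

    // Intended to be called from the listener with the size of the filtered batch.
    public static int filteredOut(int receivedByListener) {
        Integer polled = LAST_POLL.get(Thread.currentThread().getName());
        return polled == null ? 0 : polled - receivedByListener;
    }

    public static void main(String[] args) {
        recordPolled(1000);                                      // poll returned 1000 records
        System.out.println("filtered out: " + filteredOut(500)); // prints "filtered out: 500"
    }
}
```

Keying by thread name keeps the counts separate when the container runs with concurrency greater than 1.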

DbUpdateConcurrencyException with EntityFramework on RemoveRange

We are experiencing something very curious with EntityFramework that we are having a hard time debugging and understanding.
We have a service, which debounces index updates, such that we'll update the index when all changes are made and not for every change. The code looks something like this:
var messages = await _debouncingDbContext.DebounceMessages
.Where(message => message.ElapsedTime <= now)
.Take(_internalDebouncingOptionsMonitor.CurrentValue.ReturnLimit)
.ToListAsync(stoppingToken)
.ConfigureAwait(false);
if (!messages.Any())
return;
var tasks = messages.Select(m =>
_messageSession.Send(m.ReplyTo, new HandleDebouncedMessage {Message = m.OriginalMessage}))
.ToList();
try
{
await Task.WhenAll(tasks).ConfigureAwait(false);
}
catch (Exception e)
{
//Exception handling
}
_debouncingDbContext.DebounceMessages.RemoveRange(messages);
await _debouncingDbContext.SaveChangesAsync().ConfigureAwait(false);
While this is being run, we have another thread that can update the ElapsedTime on the entries. This happens if a new event comes in, before the debounce timer expires.
What we experience is that await _debouncingDbContext.SaveChangesAsync().ConfigureAwait(false); throws a DbUpdateConcurrencyException.
The result is that the entries are not deleted and are consequently returned over and over again by the initial query. This leads to exponential growth in our index updates, where the same few items are updated over and over again. Eventually, the system dies.
The only fix we have for this right now is restarting the service. Once that is done, the next iteration picks up the troublesome messages just fine and everything works again.
I am having a hard time understanding how this can happen. It appears that the DbContext thinks the entries are deleted when in fact they are not. Somehow the DbContext gets decoupled from the database state.
I cannot understand how this can happen when the only thing potentially being changed on the database entry itself is a timestamp and not the actual ID by which it is deleted.
EDIT 18th of November.
Adding a little more context.
The database model, looks like this:
[DatabaseGenerated(DatabaseGeneratedOption.Identity)]
public int Id { get; set; }
public string Key { get; set; }
public string OriginalMessage { get; set; }
public string ReplyTo { get; set; }
public DateTimeOffset ElapsedTime { get; set; }
The only thing configured on the dbcontext is two indexes:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
base.OnModelCreating(modelBuilder);
modelBuilder.Entity<DebounceMessageWrapper>()
.HasIndex(m => m.Key);
modelBuilder.Entity<DebounceMessageWrapper>()
.HasIndex(m => m.ElapsedTime);
}
The flow is quite simple.
We have a dotnet hosted service extending the abstract BackgroundService class from dotnet core. This runs in a while(!stoppingToken.IsCancellationRequested) loop with the initial code above and a Task.Delay(Y) in each loop. All the above code does is query all the messages whose debounce time has elapsed (ElapsedTime <= now). It sends each of those messages to its ReplyTo and then deletes all the corresponding database entries. It is this delete that fails.
Secondly, we have a MessageHandler listening for events on RabbitMQ. This spawns a thread per physical core on the host. Each of these threads receives messages and looks them up by the Key on the database model. If the message already exists, the ElapsedTime is updated; if not, the message is inserted into the database.
This gives us X+1 threads, where X equals the number of physical cores on the host, that can potentially alter the database. Each of these threads uses its own scope and thereby a unique instance of the DbContext.
The idea of this service, is as mentioned to debounce index updates. The nature of our system makes these index updates come in bulks and there is no reason to update the index for each of the updates, if it can be done by one index update when all the changes are finished.

What Exactly Does HttpApplicationState.Lock Do?

My application stores two related bits of data in application state. Each time I read these two values, I may (depending on their values) need to update both of them.
So to prevent updating them while another thread is in the middle of reading them, I'm locking application state.
But the documentation for HttpApplicationState.Lock Method really doesn't tell me exactly what it does.
For example:
How does it lock? Does it block any other thread from writing the data?
Does it also block read access? If not, then this exercise is pointless because the two values could be updated after another thread has read the first value but before it has read the second.
In addition to preventing multiple threads from writing the data at the same time, it is helpful to also prevent a thread from reading while another thread is writing; otherwise, the first thread could think it needs to refresh the data when it's not necessary. I want to limit the number of times I perform the refresh.
Looking at the code, Lock() acquires only the write lock, not the read lock.
public void Lock()
{
this._lock.AcquireWrite();
}
public void UnLock()
{
this._lock.ReleaseWrite();
}
public object this[string name]
{
get
{
return this.Get(name);
}
set
{
// here is the effect on the lock
this.Set(name, value);
}
}
public void Set(string name, object value)
{
this._lock.AcquireWrite();
try
{
base.BaseSet(name, value);
}
finally
{
this._lock.ReleaseWrite();
}
}
public object Get(string name)
{
object obj2 = null;
this._lock.AcquireRead();
try
{
obj2 = base.BaseGet(name);
}
finally
{
this._lock.ReleaseRead();
}
return obj2;
}
Individual reads and writes are thread safe, meaning they already use the lock mechanism. So if you are going to read data in a loop, you can take the lock outside the loop to prevent other threads from changing the list in the meantime.
It's also good to read this answer: Using static variables instead of Application state in ASP.NET
It's better to avoid using Application state to store data and to use a static member with your own lock mechanism directly: first, because Microsoft suggests it, and second, because every read or write of Application state takes the lock on each access.

Asp.net c#, Rollback or Commit after multiple process

I want to call Rollback() or Commit() after multiple processes have run.
There is no error, but Commit() does not update the DB.
Here is my example code:
public void startTransaction(){
    using(Ads_A_Connection = new AdsConnection(Ads_A_connection_string))
    using(Ads_B_Connection = new AdsConnection(Ads_B_connection_string))
    {
        Ads_A_Connection.Open();
        Ads_B_Connection.Open();
        AdsTransaction aTxn = Ads_A_Connection.BeginTransaction();
        AdsTransaction bTxn = Ads_B_Connection.BeginTransaction();
        try{
            string aResult = this.process1(Ads_A_Connection);
            this.process2(Ads_B_Connection, aResult);
            this.process3(Ads_A_Connection, Ads_B_Connection);
            aTxn.Commit();
            bTxn.Commit();
            // there is no error, but it couldn't commit.
        }catch(Exception e){
            aTxn.Rollback();
            bTxn.Rollback();
        }
    }
}
public string process1(conn){
    // Insert data
    return result;
}
public void process2(conn, aResult){
    // update
}
public void process3(aConn, bConn){
    // delete
    // update
}
I guess it's because the calls happen outside the using scope, because when I put all the code into the
startTransaction() method, it works. But that looks too dirty.
How can I use Rollback() or Commit() after multiple (METHOD) processes?
If anybody knows, please advise me.
Thanks!
[EDIT]
I just added a TransactionScope before the connections:
using (TransactionScope scope = new TransactionScope())
{
using(Ads_A_Connection = new AdsConnection(Ads_A_connection_string))
using(Ads_B_Connection = new AdsConnection(Ads_B_connection_string))
{
.
.
but it raises an error: "Error 5047: The transaction command was not in valid sequence."
I need a little more of a hint, please :)
To extend what Etch mentioned, there are several issues with manually managing transactions on your connections:
You need to pass the SQL connection around your methods.
You need to remember to commit or roll back manually when you are finished.
If you have more than one connection to manage under a transaction, you should really use DTC or XA to enlist the transactions in a distributed / two-phase transaction.
TransactionScopes are supported with the Advantage Database Server, although you will need to enable the MSDTC service and possibly also enable XA compliance.
Note that I'm assuming that the advantage .NET client has some sort of connection pooling mechanism - this makes the cost of obtaining connections very lightweight.
Ultimately, this means that your code can be refactored to something like the following, which is easier to maintain:
private void Method1()
{
using(Ads_A_Connection = new AdsConnection(Ads_A_connection_string))
{
Ads_A_Connection.Open();
string aResult = this.process1(Ads_A_Connection);
} // Can logically 'close' the connection here, although it is actually now held by the transaction manager
}
private void Method2()
{
using(Ads_B_Connection = new AdsConnection(Ads_B_connection_string))
{
Ads_B_Connection.Open();
this.process2(Ads_B_Connection, aResult);
} // Can logically 'close' the connection here, although it is actually now held by the transaction manager
}
public void MyServiceWhichNeedToBeTransactional(){
using(TransactionScope ts = new TransactionScope()) { // NB : Watch isolation here. Recommend change to READ_COMMITTED
try{
Method1();
Method2();
ts.Complete();
}
catch(Exception e){
// Do Logging etc. No need to rollback, as this is done by default if Complete() not called
}
}
}
TransactionScope is your friend!

Keeping memory in Tasks discrete

I've heard a LOT in the past about how programming with Threads and Tasks is very dangerous to the naive. Well, I'm naive, but I've got to learn sometime. I am making a program (really, it's a Generic Handler for ASP.Net) that needs to call a 3rd party and wait for a response. While waiting, I'd like to have the handler continue doing some other things, so I am trying to figure out how to do the 3rd party web request asynchronously. Based on some answers to other questions I've asked, here is what I've come up with, but I want to make sure I won't get into big problems when my handler is called multiple times concurrently.
To test this I've built a console project.
class Program
{
static void Main(string[] args)
{
RunRequestAsynch test = new RunRequestAsynch();
test.TestingThreadSafety = Guid.NewGuid().ToString();
Console.WriteLine("Started:" + test.TestingThreadSafety);
Task tTest = new Task(test.RunWebRequest);
tTest.Start();
while (test.Done== false)
{
Console.WriteLine("Still waiting...");
Thread.Sleep(100);
}
Console.WriteLine("Done. " + test.sResponse);
Console.ReadKey();
}
}
I instantiate a separate object (RunRequestAsynch) set some values on it, and then start it. While that is processing I'm just outputting a string to the console window.
public class RunRequestAsynch
{
public bool Done = false;
public string sResponse = "";
public string sXMLToSend = "";
public string TestingThreadSafety = "";
public RunRequestAsynch() { }
public void RunWebRequest()
{
Thread.Sleep(500);
// HttpWebRequest stuff goes here
sResponse = TestingThreadSafety;
Done = true;
Thread.Sleep(500);
}
}
So...if I run 1000 of these simultaneously, I can count on the fact that each instance has its own memory and properties, right? And that the line "Done = true;" won't fire and then every one of the instances of the Generic Handler die, right?
I wrote a .bat file to run several instances, and the guid I set on each specific object seems to stay the same for each instance, which is what I want...but I want to make sure I'm not doing something really stupid that will bite me in the butt under full load.
I don't see any glaring problems; however, you should consider using Task.Factory.StartNew instead of Start. Each task will only be executed once, so there isn't any problem with multiple tasks running simultaneously.
If you want to simplify your code a little and take advantage of Task.Factory.StartNew, in your handler you could do something like this (from what I remember of your last question):
Task<byte[]> task = Task.Factory.StartNew<byte[]>(() => // Begin task
{
//Replace with your web request, I guessed that it's downloading data
//change this to whatever makes sense
using (var wc = new System.Net.WebClient())
return wc.DownloadData("Some Address");
});
//call method to parse xml, will run in parallel
byte[] result = task.Result; // Wait for task to finish and fetch result.
