I have a requirement to process 10 million records in an MS SQL database using WSO2 ESB.
The input file can be XML or a flat file.
I have created a data service in WSO2 ESB.
Now, when I start the process that reads from the XML and inserts into the MS SQL database, I want to commit every 5000 records so that if record 5001 fails, I can restart processing from record 5001 instead of from 0.
First problem: the commit is happening for all records at once. I want to configure it so that it processes 5000 records, commits them in the DB, and then proceeds with the next set of records. Additionally, if the batch job fails after processing 10000 records, I want it to restart processing from record 10001 and not from 0.
Please suggest ideas.
Thanks,
Abhishek
This is a more or less common pattern. Create an agent/process that continuously reads from an IPC buffer (memory or file).
The ESB endpoint simply writes into the buffer.
The agent is responsible for retrying and/or notifying asynchronously if it ultimately cannot commit.
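Roughly, the agent side could be sketched as below; the queue type, retry count, backoff and notifier are all assumptions, and the real database call goes where the placeholder is. The ESB endpoint would only ever do buffer.offer(record) and return immediately.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class CommitAgent implements Runnable {

    private final BlockingQueue<String> buffer;       // the ESB endpoint just writes records here
    private final Consumer<String> failureNotifier;   // asynchronous notification if we finally give up
    private static final int MAX_ATTEMPTS = 5;        // assumed retry budget

    public CommitAgent(BlockingQueue<String> buffer, Consumer<String> failureNotifier) {
        this.buffer = buffer;
        this.failureNotifier = failureNotifier;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String record = buffer.take();        // blocks until the ESB has written something
                commitWithRetry(record);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void commitWithRetry(String record) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                commitToDatabase(record);             // the actual insert/commit
                return;
            } catch (Exception e) {
                TimeUnit.SECONDS.sleep(attempt * 2L); // simple backoff between attempts
            }
        }
        failureNotifier.accept(record);               // could not commit: notify asynchronously
    }

    private void commitToDatabase(String record) {
        // Placeholder for the real JDBC / data-service call.
    }
}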
What you can do is keep a checkpoint file on the ESB that records the batch boundary. When the schedule starts, it picks the record boundary from the file (in your case 5000) and processes that batch in DSS. If the DSS response is successful, you increment the boundary and write it back to the file (now 10000). If the DSS response is not successful, the file is not advanced; once you find the root cause of the failure, fix it and run the schedule again, it will pick the value from the file (10000) and, on success, write 15000. This continues until the end condition is met.
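As a concrete illustration of that checkpoint idea, here is a minimal sketch. In your setup the processBatch step would be the DSS call and the file handling would sit in the ESB scheduled task rather than in standalone Java; the checkpoint file name, batch size and total are assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointedBatchRunner {

    private static final Path CHECKPOINT = Path.of("checkpoint.txt"); // assumed location
    private static final int BATCH_SIZE = 5000;

    public static void main(String[] args) throws IOException {
        long totalRecords = 10_000_000L;             // known up front, or derived from the source file
        long start = readCheckpoint();               // e.g. 5000 means records up to 5000 are committed

        while (start < totalRecords) {
            long end = Math.min(start + BATCH_SIZE, totalRecords);
            boolean ok = processBatch(start, end);   // read this slice, call DSS, commit in the DB
            if (!ok) {
                // Leave the checkpoint untouched: the next scheduled run retries this same batch.
                System.err.println("Batch " + start + "-" + end + " failed; will resume here next run");
                return;
            }
            writeCheckpoint(end);                    // advance only after a successful commit
            start = end;
        }
        System.out.println("All records processed");
    }

    private static long readCheckpoint() throws IOException {
        return Files.exists(CHECKPOINT)
                ? Long.parseLong(Files.readString(CHECKPOINT, StandardCharsets.UTF_8).trim())
                : 0L;
    }

    private static void writeCheckpoint(long value) throws IOException {
        Files.writeString(CHECKPOINT, Long.toString(value), StandardCharsets.UTF_8);
    }

    private static boolean processBatch(long start, long end) {
        // Placeholder for the real work: read records (start, end] from the XML/flat file
        // and send them to the data service, which inserts and commits them.
        return true;
    }
}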
I am looking for design advice on the below use case.
I am designing an application which can process EXCEL/CSV/JSON files. They all contain
the same columns/attributes, about 72 of them. These files may contain up to 1 million records.
Now I have two options to process those files.
Option 1
Service 1: Read the content from the given file, convert each row into JSON, and save the records into a SQL table via batch processing (3k records per batch); a sketch of this batch-insert step follows below.
Service 2: Fetch those JSON records from the database table (saved in step 1), process them (validation and calculation), and save the final results into a separate table.
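For reference, the batch-insert step of Service 1 boils down to the pattern below. It is sketched in Java/JDBC only to keep the example self-contained (in .NET Core the equivalent would be batched commands or SqlBulkCopy), and the staging table and columns are invented.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class StagingWriter {

    private static final int BATCH_SIZE = 3000;   // matches the 3k-per-batch choice described above
    private static final String SQL =
            "INSERT INTO STAGING_RECORDS (FILE_NAME, ROW_JSON) VALUES (?, ?)"; // hypothetical table

    /** Writes one JSON row per source record, committing after every 3000 rows. */
    public static void stage(Connection con, String fileName, List<String> jsonRows) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            int pending = 0;
            for (String json : jsonRows) {
                ps.setString(1, fileName);
                ps.setString(2, json);
                ps.addBatch();
                if (++pending % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    con.commit();                 // short transactions keep lock footprints small
                }
            }
            ps.executeBatch();                    // flush and commit the final partial batch
            con.commit();
        }
    }
}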
Option 2 (using RabbitMQ)
Service 1: Read the content from the given file and send every row as a message into a queue; a sketch of this publishing step follows below. If the file contains 1 million records, this service will send 1 million messages to the queue.
Service 2: Listen to the queue created in step 1, process those messages (validation and calculation), and save the final results into a separate table.
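For reference, Service 1's publishing loop in this option looks roughly like the sketch below (shown with the Java RabbitMQ client; the .NET client exposes a very similar API). The queue name, host and the chunked publisher confirms are assumptions; whether you wait for a confirm per message, per chunk, or not at all tends to have a large effect on how long 1 million publishes take.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;
import java.util.List;

public class RowPublisher {

    private static final String QUEUE = "file-rows";      // assumed queue name
    private static final int CONFIRM_EVERY = 1000;        // confirm in chunks instead of per message

    /** Service 1: publish every row of the file as one message. */
    public static void publishRows(List<String> jsonRows) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                      // the Docker instance from the question
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            channel.queueDeclare(QUEUE, true, false, false, null);
            channel.confirmSelect();                       // enable publisher confirms

            int sinceConfirm = 0;
            for (String row : jsonRows) {
                channel.basicPublish("", QUEUE,
                        MessageProperties.PERSISTENT_TEXT_PLAIN,
                        row.getBytes(StandardCharsets.UTF_8));
                if (++sinceConfirm % CONFIRM_EVERY == 0) {
                    channel.waitForConfirmsOrDie(30_000);  // wait for the broker once per chunk
                }
            }
            channel.waitForConfirmsOrDie(30_000);          // confirm the tail
        }
    }
}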
POC experience with Option 1:
It took 5 minutes to read and batch-save the data into the table for 100K records (the job of Service 1).
If the application tries to process multiple files in parallel, each containing 200K records, I sometimes see deadlocks.
No indexes or relationships are created on this batch-processing table.
Saving 3000 records per batch to avoid table locks.
While the services are processing, results are trackable and I can query the progress; for example, for "File 1.JSON", 50000 records are processed successfully and the remaining 1000 are in progress.
If Service 1 finishes its job correctly and something goes wrong with Service 2, we still have good control to reprocess those records, as they are persisted in the database.
I am planning to delete the data in the batch-processing table with a nightly SQL job once all records have been processed by Service 2, so the table will be fresh and ready to store data for the next day's processing.
POC experience with option 2:
Producing (Service 1) and consuming (Service 2) the messages for a 100K-record file took around 2 hr 30 mins.
No storage of file data in the database, so no deadlocks (unlike Option 1).
Results are not as trackable as in Option 1 while the services are processing the records, which makes it harder to share the status with the clients who sent the file for processing.
We can see the status of messages on the RabbitMQ management screen for monitoring purposes.
If Service 1 partially reads the data from a given file and errors out due to some issue, there is, to my knowledge, no way to roll back the already-published messages in RabbitMQ, so the consumer keeps working on those published messages.
I can horizontally scale the application on both of these options to speed up the process.
Given the above, both options have advantages and disadvantages. Is this a good use case for RabbitMQ? Is it advisable to produce and consume millions of records through RabbitMQ? Is there a better way to deal with this use case apart from these two options?
Please advise.
*** I am using .NET Core 5.0 and SQL Server 2019. Service 1 and Service 2 are .NET Core worker services (Windows jobs). All tests were done on my local machine, and RabbitMQ is installed in Docker (Docker is on my local machine).
By default, the update policy on a Kusto table is non-transactional. Let's say I have an update policy defined on a table MyTarget, whose source is defined in the update policy as MySource, and the update policy is defined as transactional. Ingestion has been set up on the table MySource, so data is continuously loaded into MySource. Now say a certain ingestion data batch is loaded into MySource; right after that, the query defined in the update policy is triggered. Now let's say this query fails, due to memory issues etc. Then even the data batch loaded into MySource will not be committed (because the update policy is transactional).

I have heard that in this case the ingestion will be retried automatically. Is that so? I haven't found any documentation regarding this retry. Anyway, my simple question is: how many times will the retry be attempted, and how long is the interval after each attempt? Are these configurable properties (I am talking about an ADX cluster available through Azure), given that I am the owner of the ADX cluster?
Yes, there is an automatic retry for ingestions that failed due to a failure in a transactional update policy.
The full details can be found here: https://learn.microsoft.com/en-us/azure/kusto/management/updatepolicy#failures
Failures are treated as follows:
Non-transactional policy: The failure is ignored by Kusto. Any retry is the responsibility of the data owner.
Transactional policy: The original ingestion operation that triggered the update will fail as well. The source table and the database will not be modified with new data.
In case the ingestion method is pull (Kusto's Data Management service is involved in the ingestion process), there's an automated retry on the entire ingestion operation, orchestrated by Kusto's Data Management service, according to the following logic:
Retries are done until reaching the earliest between the maximum retry period (2 days) and maximum retry attempts (10 attempts).
The backoff period starts from 2 minutes, and grows exponentially (2 -> 4 -> 8 -> 16 ... minutes)
In any other case, any retry is the responsibility of the data owner.
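Putting those two limits together, and assuming the backoff keeps doubling between attempts: the waits between the 10 attempts add up to roughly 2 + 4 + 8 + ... + 512 = 1022 minutes, i.e. about 17 hours, so in practice the 10-attempt cap is reached well before the 2-day retry period expires.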
For a particular requirement, I have to iterate through a list of 50000 records and insert them into the database. The requirement is that if any one of the 50000 records fails, all the other records should be rolled back, and hence we did not include any commit in the processing. But this resulted in the following error:
[2/1/16 14:01:47:939 CST] 000000be SystemOut O ERROR
org.springframework.jdbc.UncategorizedSQLException:
PreparedStatementCallback; uncategorized SQLException for SQL [INSERT
INTO ...) VALUES (...)]; SQL state [null]; error code [0]; Current
thread has not commited in more than [120] seconds and may incur
unwanted blocking locks. Please refactor code to commit more
frequently.
Now that we have implemented batching (we use the PreparedStatement.executeBatch() method to insert data in batches of 100), the above error doesn't arise. autoCommit is set to true by default for the batching, so a commit happens after every batch execution.
Could anyone suggest how we can handle the rollback mechanism in this case? If the 50th batch execution fails, we want all of the previous 49 batch executions to be reverted. We are using Spring Data/JDBC, an Oracle 11g database, and WebSphere application server. I have read somewhere that the 120-second commit timeout above can also be configured in the JTA settings of WebSphere. Is that so? Please suggest any alternatives or other possible solutions.
Thank you in advance.
You must set autocommit to false and only commit at the end if all your batches executed successfully.
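A minimal plain-JDBC sketch of that structure (table and column names are invented; with Spring you would typically get the same effect by running the whole loop inside a single transaction, for example via TransactionTemplate or a @Transactional method):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public final class AllOrNothingBatchInsert {

    // Hypothetical SQL; substitute your real table and columns.
    private static final String INSERT_SQL =
            "INSERT INTO MY_TABLE (COL_A, COL_B) VALUES (?, ?)";

    /** Inserts every row or none: one commit at the end, rollback on any failure. */
    public static void insertAllOrNothing(Connection con, List<Object[]> rows) throws SQLException {
        boolean previousAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);                    // one transaction for the whole run
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
                if (++pending % 100 == 0) {
                    ps.executeBatch();               // send a batch of 100, but do NOT commit yet
                }
            }
            ps.executeBatch();                       // flush the final partial batch
            con.commit();                            // single commit, only if every batch succeeded
        } catch (SQLException e) {
            con.rollback();                          // any failure reverts all earlier batches
            throw e;
        } finally {
            con.setAutoCommit(previousAutoCommit);
        }
    }

    private AllOrNothingBatchInsert() {}
}

Keep in mind that an all-or-nothing requirement over 50000 rows means one long-lived transaction, so whatever raises the 120-second warning will still need its threshold increased; the sketch only shows the commit/rollback structure.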
Problem Description:
1. There is a BizTalk application receiving a formatted/zipped data file containing > 2 million data records.
2. I created a pipeline component that processes the file and de-batches these 2 million records into smaller slice-messages of ~2000 records each.
3. Slice-messages are sent to a SQL port and processed by a stored procedure. Each slice-message contains the filename and a batch id.
Questions:
A. What would be the best way to know that all slice-messages have been received and processing of the whole file has completed on the SQL side?
B. Is there any way in a BizTalk port to say "do not send messages of type B until all messages of type A have been sent" (message priority)?
Here are the possible solutions I've tried:
S1. Add a specific 'end of file' tag to the last slice-message saying that the file has been fully sent, so that when the stored procedure receives this part of the message it marks the file as completed.
But because messages are delivered asynchronously, the last message can arrive on the SQL side earlier than other messages and I will get a false 'completed' event.
So this solution only works with ordered-delivery ports, but that type of port has poor performance because it sends only one message at a time.
S2. Include the total record count in every slice-message and run a count() SQL statement after every slice-message is received.
Because the table where the data is stored is very large, even running the count with the filename as a parameter takes time.
I'm wondering if there is a better solution to know that all messages have been received?
Have your pipeline component emit a "batch" message that contains the count of the records in the batch and some unique identifier that can link it back to the slice-message records.
Have both the stored procedure that processes the slice-messages and the one that processes the batch message check whether the batch total (if it exists yet, in the slice-message case) matches the processed total; if they match, you've finished processing them all.
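In SQL terms the check both procedures run is just a count comparison. Here is a sketch with invented table and column names, wrapped in plain JDBC only to keep it self-contained; in practice it would live inside the stored procedures.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BatchCompletionCheck {

    // Hypothetical schema: BATCH_HEADER holds the expected count emitted by the pipeline
    // component; SLICE_LOG gets one row per slice-message the stored procedure has processed.
    private static final String SQL =
            "SELECT CASE WHEN h.EXPECTED_RECORDS = COALESCE(SUM(s.RECORD_COUNT), 0) " +
            "            THEN 1 ELSE 0 END " +
            "FROM BATCH_HEADER h " +
            "LEFT JOIN SLICE_LOG s ON s.BATCH_ID = h.BATCH_ID " +
            "WHERE h.BATCH_ID = ? " +
            "GROUP BY h.EXPECTED_RECORDS";

    /** Returns true once the processed total matches the expected total for this batch. */
    public static boolean isBatchComplete(Connection con, String batchId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            ps.setString(1, batchId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getInt(1) == 1;   // no header row yet means not complete
            }
        }
    }
}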
Here's how I would approach this.
Load the 2MM records into a SQL Server table or tables with SSIS.
Drain the table at whatever rate gives you an acceptable performance profile.
Delete records as they are processed (completed).
When no more records for "FILE001.txt" exist, SQL Server will return a flag saying "FILE001.txt complete".
Do further processing.
When the staging table is empty, the Polling SP can either return nothing (the Adapter will silently ignore the response) or return a flag that says "nothing to do" and you handle that yourself.
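A rough sketch of the drain-and-delete step with the "file complete" check, using an invented staging schema; in the BizTalk case this logic would sit in the polling stored procedure rather than in Java.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

public class StagingDrain {

    // Hypothetical staging table loaded by SSIS: RECORD_ID is the key, FILE_NAME ties rows to a file.
    private static final String TAKE_BATCH =
            "SELECT TOP (2000) RECORD_ID, PAYLOAD FROM STAGING WHERE FILE_NAME = ? ORDER BY RECORD_ID";
    private static final String DELETE_ROW =
            "DELETE FROM STAGING WHERE RECORD_ID = ?";
    private static final String REMAINING =
            "SELECT COUNT(*) FROM STAGING WHERE FILE_NAME = ?";

    /** Drains one batch; returns true when nothing is left for the file (the "FILE001.txt complete" flag). */
    public static boolean drainOneBatch(Connection con, String fileName) throws SQLException {
        Map<Long, String> batch = new LinkedHashMap<>();
        try (PreparedStatement take = con.prepareStatement(TAKE_BATCH)) {
            take.setString(1, fileName);
            try (ResultSet rs = take.executeQuery()) {
                while (rs.next()) {
                    batch.put(rs.getLong("RECORD_ID"), rs.getString("PAYLOAD"));
                }
            }
        }
        try (PreparedStatement del = con.prepareStatement(DELETE_ROW)) {
            for (Map.Entry<Long, String> row : batch.entrySet()) {
                process(row.getValue());          // hand the record to the real processing step
                del.setLong(1, row.getKey());
                del.executeUpdate();              // delete only after it has been processed
            }
        }
        try (PreparedStatement left = con.prepareStatement(REMAINING)) {
            left.setString(1, fileName);
            try (ResultSet rs = left.executeQuery()) {
                rs.next();
                return rs.getInt(1) == 0;         // no rows left means the file is complete
            }
        }
    }

    private static void process(String payload) {
        // Placeholder for the downstream processing of a single record.
    }
}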
We receive many large data files daily in a variety of formats (i.e. CSV, Excel, XML, etc.). In order to process these large files we transform the incoming data into one of our standard 'collection' message classes (using XSLT and a pipeline component - either built-in or custom), disassemble the large transformed message into individual 'object' messages and then call a series of SOAP web service methods to handle business logic and database operations.
Unlike other files received, the latest file will contain all data rows each day and therefore, we have to handle the differences to prevent identical records from being re-processed each day.
I have a suitable mechanism for handling inserts and updates but am currently struggling with the deletes (where the record exists in the database but not in the latest file).
My current thought process is to flag the deleted records in the database using a 'cleanup' task at the end of the entire process but this would require a method to be called once all 'object' messages from the disassembled file have completed.
Is it possible to monitor individual messages from a multi-record file and call a method on completion of the whole file? Currently, all research is pointing to an orchestration with some sort of 'wait' but is this the only option?
Example: File contains 100 vehicle records. This is disassembled into 100 individual XML messages which are processed using 100 calls to a web service method. Wish to call cleanup operation when all 100 messages are complete.
The best way I've found to handle the 'all rows every day' scenario is to pre-stage the data in SQL Server where it's easier to compare the 'current' set to the 'previous' set. The INTERSECT and EXCEPT operators make it pretty easy in most cases.
Then drain the records with a Polling statement.
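For example, once both days are staged, the logical deletes (rows present yesterday but absent from today's full file) fall out of a single EXCEPT query; the table and column names below are invented.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DeleteDetector {

    // Rows present in yesterday's load but missing from today's full file are the logical deletes.
    private static final String DELETED_KEYS =
            "SELECT VEHICLE_ID FROM PREVIOUS_LOAD " +
            "EXCEPT " +
            "SELECT VEHICLE_ID FROM CURRENT_LOAD";

    public static List<String> findDeletedKeys(Connection con) throws SQLException {
        List<String> deleted = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(DELETED_KEYS);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                deleted.add(rs.getString(1));   // flag or soft-delete these in the target table
            }
        }
        return deleted;
    }
}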
The component that does the de-batching would need to publish a start of batch message with the number of individual records and a correlation key.
The components that do the insert and update would need to publish a completion message with the same correlation key when they have completed processing.
The start-of-batch message would spin up an orchestration that listens for the completion messages with that correlation key and counts them; either after it has received the correct number, or after a timeout period, it would call the cleanup or raise an exception.