I am looking for design advice on the use case below.
I am designing an application that can process Excel/CSV/JSON files. They all contain the
same columns/attributes, about 72 of them, and each file may contain up to 1 million records.
Now I have two options for processing those files.
Option 1
Service 1: Read the content from the given file, convert each row into JSON, and save the records into a SQL table using batch processing (3K records per batch).
Service 2: Fetch the JSON records saved in step 1 from the database table, process them (validation and calculation), and save the final results into a separate table.
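For reference, a rough sketch of what the Service 1 batch save could look like (illustrative only, not my actual code; the dbo.FileStagingRecords table and its FileName/RowJson columns are placeholder names):

// Batch-saves one file's rows into a staging table via SqlBulkCopy.
// Table and column names are placeholders for illustration only.
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

public static class StagingWriter
{
    public static void SaveBatch(string connectionString, string fileName, IEnumerable<string> jsonRows)
    {
        var table = new DataTable();
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("RowJson", typeof(string));

        foreach (var json in jsonRows)
            table.Rows.Add(fileName, json);

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        using var bulkCopy = new SqlBulkCopy(connection)
        {
            DestinationTableName = "dbo.FileStagingRecords",
            BatchSize = 3000            // matches the 3K-records-per-batch approach above
        };
        bulkCopy.ColumnMappings.Add("FileName", "FileName");
        bulkCopy.ColumnMappings.Add("RowJson", "RowJson");
        bulkCopy.WriteToServer(table);
    }
}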
Option 2 (using RabbitMQ)
Service 1: Read the content from the given file and send every row as a message to the queue. If a file contains 1 million records, this service will send 1 million messages to the queue.
Service 2: Listen to the queue created in step 1, process those messages (validation and calculation), and save the final results into a separate table.
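Similarly, a rough sketch of what the Option 2 producer (Service 1) could look like with the RabbitMQ.Client 6.x API (the queue name and confirm interval are placeholders, not my actual code):

// Publishes each row as a persistent message and waits for broker confirms
// every 1,000 messages. Queue name and confirm interval are placeholders.
using System;
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;

public static class RowPublisher
{
    public static void Publish(IEnumerable<string> jsonRows)
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        channel.QueueDeclare(queue: "file-rows", durable: true, exclusive: false,
                             autoDelete: false, arguments: null);
        channel.ConfirmSelect();                       // enable publisher confirms

        var props = channel.CreateBasicProperties();
        props.Persistent = true;

        var count = 0;
        foreach (var json in jsonRows)
        {
            channel.BasicPublish(exchange: "", routingKey: "file-rows",
                                 basicProperties: props, body: Encoding.UTF8.GetBytes(json));

            if (++count % 1000 == 0)
                channel.WaitForConfirmsOrDie(TimeSpan.FromSeconds(10));
        }
        channel.WaitForConfirmsOrDie(TimeSpan.FromSeconds(10));
    }
}

Waiting for confirms in batches rather than after every single message usually makes a large throughput difference, which may be relevant to the timing numbers below.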
POC experience with Option 1:
It took 5 minutes to read and batch-save the data into the table for 100K records (the job of Service 1).
If the application processes multiple files in parallel, each containing 200K records, I sometimes see deadlocks.
No indexes or relationships are created on this batch-processing table.
I save 3,000 records per batch to avoid table locks.
While the services are processing, the results are trackable and I can query the progress. For example, for "File 1.JSON", 50,000 records have been processed successfully and the remaining 1,000 are in progress (a sketch of such a progress query follows these notes).
If Service 1 finishes its job correctly and something goes wrong with Service 2, we still have good control for reprocessing those records because they are persisted in the database.
I am planning to delete the data in the batch-processing table with a nightly SQL job once all records have been processed by Service 2, so the table is fresh and ready to store data for the next day's processing.
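A sketch of the kind of progress query I have in mind (it assumes the staging table also carries a Status column that Service 2 updates; the table, column, and status names are placeholders, not my real schema):

// Returns processed vs. in-progress counts for one file from the staging table.
// All table, column, and status names are placeholders for illustration only.
using Microsoft.Data.SqlClient;

public static class ProgressReader
{
    public static (int processed, int inProgress) GetFileProgress(string connectionString, string fileName)
    {
        const string sql = @"
            SELECT
                ISNULL(SUM(CASE WHEN Status = 'Processed'  THEN 1 ELSE 0 END), 0),
                ISNULL(SUM(CASE WHEN Status = 'InProgress' THEN 1 ELSE 0 END), 0)
            FROM dbo.FileStagingRecords
            WHERE FileName = @FileName;";

        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@FileName", fileName);

        connection.Open();
        using var reader = command.ExecuteReader();
        reader.Read();
        return (reader.GetInt32(0), reader.GetInt32(1));
    }
}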
POC experience with Option 2:
Producing (Service 1) and consuming (Service 2) the messages for a 100K-record file took around 2 hours 30 minutes.
No file data is stored in the database, so there are no deadlocks (unlike Option 1).
Results are not as trackable as in Option 1 while the services are processing the records, which makes it harder to share the status with the clients who sent the file for processing.
We can see the status of messages on the RabbitMQ management screen for monitoring purposes.
If Service 1 partially reads the data from a given file and then errors out, there is, to my knowledge, no way to roll back messages already published to RabbitMQ, so the consumer keeps working on those published messages.
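For context, a rough sketch of what the Option 2 consumer (Service 2) could look like with RabbitMQ.Client, using manual acks and a prefetch window (the queue name and prefetch value are placeholders, not my actual code):

// Consumes row messages with a prefetch window and acknowledges each one only
// after validation/calculation/save succeeds. Names are placeholders.
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

public static class RowConsumer
{
    public static IModel Start(IConnection connection)
    {
        var channel = connection.CreateModel();
        channel.BasicQos(prefetchSize: 0, prefetchCount: 500, global: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (sender, ea) =>
        {
            var json = Encoding.UTF8.GetString(ea.Body.ToArray());
            // ... validate, calculate, and save the final result here ...
            channel.BasicAck(deliveryTag: ea.DeliveryTag, multiple: false);
        };

        channel.BasicConsume(queue: "file-rows", autoAck: false, consumer: consumer);
        return channel;   // keep the channel alive for the lifetime of the worker service
    }
}

Acknowledging only after the save means a crashed consumer does not lose rows, and the prefetch count lets the broker stream batches of messages instead of delivering them one at a time.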
I can horizontally scale the application with either option to speed up the process.
Given the above, both options have advantages and disadvantages. Is this a good use case for RabbitMQ? Is it advisable to produce and consume millions of records through RabbitMQ? Is there a better way to deal with this use case apart from these two options?
Please advise.
*** I am using .NET Core 5.0 and SQL Server 2019. Service 1 and Service 2 are .NET Core worker services (Windows jobs). All tests were done on my local machine, and RabbitMQ is installed in Docker (Docker is on my local machine).
Related
I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump results segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited ones, and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
It either takes around 30 seconds to run a simple query, "tripping" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever with no results returned and the emulator has to be shut down.
My understanding is that with 50 partitions I can make 10 unlimited tables/collections, since each is "worth" 5. See the documentation.
I have tried with rate limiting on and off, and jacked the RU/s to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.
I have a requirement to process 10 million records in an MS SQL database using WSO2 ESB.
The input file can be XML or a flat file.
I have created a data service in WSO2 ESB.
Now, when I start the process to read from the XML and insert into the MS SQL database, I want to commit every 5,000 records so that if record 5,001 fails, I can restart processing from record 5,001 instead of from 0.
The first problem is that the commit happens for all records at once. I want to configure it so that it processes 5,000 records, commits them in the DB, and then proceeds with the next set of records. Additionally, if the batch job fails after processing 10,000 records, I want the batch job to restart from record 10,001 and not from 0.
Please suggest ideas.
Thanks,
Abhishek
This is a more or less common pattern. Create an agent/process that continuously reads from an IPC buffer (memory or file).
The ESB endpoint simply writes into the buffer.
The agent is responsible for retrying and/or notifying asynchronously if it ultimately cannot commit.
What you can do is write the start and end record numbers in a file in the ESB. When the schedule starts, it picks the record number from the file (5,000 in your case) and processes that batch in DSS. If the DSS response is successful, you increment the record number and update it in the file (10,000 in this case). If the DSS response is not successful, 10,000 stays in the file; once you find the root cause of the failure, fix it, and run the schedule again, it picks up from record 10,000, and if that succeeds it writes 15,000 to the file. This continues until the end condition is met.
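A rough sketch of that checkpoint-file idea, written in C# purely for illustration (the checkpoint file name, batch size, and processBatch callback are assumptions, not anything WSO2-specific):

// Hedged sketch of the checkpoint-file approach described above.
// The checkpoint path, batch size, and processBatch delegate are illustrative.
using System;
using System.IO;

public static class CheckpointedBatchRunner
{
    const int BatchSize = 5000;
    const string CheckpointFile = "last-committed-record.txt";

    public static void Run(int totalRecords, Func<int, int, bool> processBatch)
    {
        // Resume from the last committed record (0 if no checkpoint exists yet).
        int start = File.Exists(CheckpointFile)
            ? int.Parse(File.ReadAllText(CheckpointFile))
            : 0;

        while (start < totalRecords)
        {
            int end = Math.Min(start + BatchSize, totalRecords);

            // processBatch(start, end) reads records [start, end) and commits them.
            if (!processBatch(start, end))
                throw new Exception($"Batch starting at record {start} failed; checkpoint stays at {start}.");

            // Only advance the checkpoint after a successful commit.
            File.WriteAllText(CheckpointFile, end.ToString());
            start = end;
        }
    }
}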
We receive many large data files daily in a variety of formats (e.g. CSV, Excel, XML). In order to process these large files we transform the incoming data into one of our standard 'collection' message classes (using XSLT and a pipeline component, either built-in or custom), disassemble the large transformed message into individual 'object' messages, and then call a series of SOAP web service methods to handle business logic and database operations.
Unlike other files we receive, the latest file contains all data rows each day; therefore, we have to handle the differences to prevent identical records from being re-processed each day.
I have a suitable mechanism for handling inserts and updates but am currently struggling with the deletes (where the record exists in the database but not in the latest file).
My current thought process is to flag the deleted records in the database using a 'cleanup' task at the end of the entire process but this would require a method to be called once all 'object' messages from the disassembled file have completed.
Is it possible to monitor individual messages from a multi-record file and call a method on completion of the whole file? Currently, all research is pointing to an orchestration with some sort of 'wait' but is this the only option?
Example: File contains 100 vehicle records. This is disassembled into 100 individual XML messages which are processed using 100 calls to a web service method. Wish to call cleanup operation when all 100 messages are complete.
The best way I've found to handle the 'all rows every day' scenario is to pre-stage the data in SQL Server where it's easier to compare the 'current' set to the 'previous' set. The INTERSECT and EXCEPT operators make it pretty easy in most cases.
Then drain the records with a Polling statement.
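A rough sketch of the EXCEPT-based comparison for finding the deletes (the staging/target table names, the VehicleId key, and the IsDeleted flag are made up for illustration):

// Flags rows that existed in yesterday's staged set but are missing from today's.
// All table and column names are illustrative assumptions.
using Microsoft.Data.SqlClient;

public static class DeleteDetector
{
    public static int FlagDeletedRecords(string connectionString)
    {
        const string sql = @"
            UPDATE t
            SET    t.IsDeleted = 1
            FROM   dbo.Vehicles AS t
            WHERE  t.VehicleId IN
            (
                SELECT VehicleId FROM dbo.Staging_Previous
                EXCEPT
                SELECT VehicleId FROM dbo.Staging_Current   -- present yesterday, missing today
            );";

        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(sql, connection);
        connection.Open();
        return command.ExecuteNonQuery();   // number of rows flagged as deleted
    }
}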
The component that does the de-batching would need to publish a start of batch message with the number of individual records and a correlation key.
The components that do the insert & update would need to publish a completion message with the same correlation key when it is completed processing.
The start-of-batch message would have spun up an orchestration that would listen for the completion messages with that correlation key and count them, and either after it has received the correct number, or after a timeout period, it would call the cleanup or raise an exception.
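A rough sketch of just the counting logic (in BizTalk this would live in the orchestration itself; the class and member names below are made up for illustration):

// Tracks expected vs. received completion messages per correlation key.
// All names are illustrative assumptions, not BizTalk APIs.
using System;
using System.Collections.Concurrent;

public class BatchCompletionTracker
{
    private readonly ConcurrentDictionary<string, (int expected, int received)> _batches = new();

    // Called when the start-of-batch message is published.
    public void StartBatch(string correlationKey, int expectedCount) =>
        _batches[correlationKey] = (expectedCount, 0);

    // Called for each completion message; returns true when the whole file is done.
    public bool RecordCompletion(string correlationKey)
    {
        var updated = _batches.AddOrUpdate(
            correlationKey,
            _ => throw new InvalidOperationException("Completion received before start-of-batch message."),
            (_, state) => (state.expected, state.received + 1));

        if (updated.received < updated.expected) return false;

        _batches.TryRemove(correlationKey, out _);
        return true;   // caller can now trigger the cleanup operation
    }
}

The timeout branch is omitted here; in BizTalk that would typically be handled with a Listen shape that includes a Delay branch.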
How are the connections being calculated?
Let's assume I have a web app which, on load, sends a message to all connected clients, and let's say I have 5 connected clients. Does that mean that as long as the browser tab with the web app is open it counts as 1 connection, so I will have 6 concurrent connections, and that counts towards what you define as a "Connection" on the pricing page?
If not, please explain how you calculate the "Connection". Thanks
This question has been bugging me ever since I ran through the thinkster.io Angular + Firebase tutorial and saw my Firebase analytics tab showing a peak concurrent count of 6, even though I only remember having the one page open. I looked back at the code and thought it might be due to how the tutorial has you create a new Firebase(url) for each location in your Firebase.
I wanted to test the difference between creating a new Firebase(url) vs taking the root reference and then accessing the .child() location. My theory was that new Firebase(url) would create a new connection each time, while .child() would re-use the existing connection.
Setup
Created two new Firebases, each with identical data
Set up an AngularJS project using Yeoman
Included AngularFire
Code
For simplicity, I just put everything in the main controller of the generated code.
To test out the connections created with new Firebase() I did the following:
$scope.fb_root = $firebase(new Firebase(FBURL_NEW));
$scope.fb_root_apps = $firebase(new Firebase(FBURL_NEW + '/apps'));
$scope.fb_root_someApp = $firebase(new Firebase(FBURL_NEW + '/apps/someApp'));
$scope.fb_root_users = $firebase(new Firebase(FBURL_NEW + '/users'));
$scope.fb_root_mike = $firebase(new Firebase(FBURL_NEW + '/users/mike'));
To test out the connections created with ref.$child() I did the following:
$scope.fb_child = $firebase(new Firebase(FBURL_CHILD));
$scope.fb_child_apps = $scope.fb_child.$child("apps");
$scope.fb_child_someApp = $scope.fb_child_apps.$child("someApp");
$scope.fb_child_users = $scope.fb_child.$child("users");
$scope.fb_child_mike = $scope.fb_child_users.$child("mike");
I then bound these objects in my view so I could see them, and I played around with updating data via my Firebase Forge and watching the data update live in my app.
Results
I opened up my local app into 17 browser tabs, hoping that a large number of tabs would exaggerate any differences between the connection methods.
What I found is that each tab only opened up one single web socket connection back to firebase for each firebase db. So at the end of the test, both methods resulted in the same peak count of 17 connections.
Conclusion
From this simple test I think it's safe to say that the Firebase JS library does a good job of managing its connection.
Regardless of whether your code calls new Firebase() a bunch of times or references child locations via .child(), the library will only create a single connection as far as your metering is concerned. That connection will stay online for as long as your app is open.
So in your example - yes I believe you will see 6 concurrent connections, 1 for the app where someone is sending the message, and 5 for the apps receiving the message.
Update
One other thing worth mentioning is that Firebase measures connections for paid plans based on the 95th percentile of usage during the month. This is listed in the FAQ section of their pricing page at https://www.firebase.com/pricing.html.
Update 11-Mar-16: Firebase no longer appears to measure connections based on the 95th percentile. Instead, the 101st concurrent connection is denied.
From https://www.firebase.com/pricing.html:
All our plans have a hard limit on the number of database connections. Our Free and Spark plans are limited to 100. The limit cannot be raised. All other plans have a courtesy limit of 10,000 database connections. This can be removed to permanently allow Unlimited connections if you email us at firebase-support@google.com. The reason we impose this courtesy limit is to prevent abuse and to ensure that we are prepared to handle our largest customers. Please contact us at least 24 hours in advance so we can lift this limit and ensure we have enough capacity available for your needs.
Before I tackle this solution, I wanted to run it by the community to get feedback.
Questions:
Is my approach feasible? i.e. can it even be done this way?
Is it the right/most efficient solution?
If it isn’t the right solution, what would be a better approach?
Problems:
Need to send mass emails through the application.
The shared hosting server only permits a maximum of 500 emails to be sent per hour before we get labeled a spammer.
The server times out while sending batch emails.
Proposed Solution:
Upon task submittal (i.e. the user provides all necessary email information using a form and frontend template, selects the target audience, etc.), the action will then:
Determine how many records (from a stored DB of contacts) the email will be sent to
If the number of records in #1 above is more than 400:
Assign a batch number to all these records in the DB.
Run a cron job that:
Every hour, selects 400 records in batch “X” and sends the saved email template until there are no more records with batch “X”. Each time a batch of 400 is sent, its batch number is erased (so it won’t be selected again the following hour).
If there is an unfinished cron job scheduled ahead of it (i.e. one currently running), it will be placed in a queue.
Other clarification:
To send these emails I simply iterate with Swift Mailer using the following code:
foreach ($list as $record)
{
    mailers::sendMemberSpam($record, $emailParamsArray);
    // where the above simply contains: sfContext::getInstance()->getMailer()->send($message);
}
*where $list is the list of records with a batch_number of “X”.
I’m not sure this is the most efficient solution, because it seems to be bogging down the server, and it will eventually time out if the list or email is long.
So, I’m just looking for opinions at this point... thanks in advance.