Mule 4 : Design : How to process data (files / database records) in Mule 4 without getting an "out of memory" error?

Scenario :
I have a database that contains 100k records, which take up about 10 GB in memory.
My objective is to
fetch these records,
segregate the data based on certain conditions,
then generate CSV files for each group of data,
and write these CSV files to a NAS (a storage drive accessible over the same network)
To achieve this, I am thinking of the design as follows:
Use a Scheduler component that triggers the flow daily (at 9 am, for example)
Use a database select operation to fetch the records
Use a batch processing scope
In the batch step, use the reduce function in a Transform Message component and segregate the data in an aggregator into a format like:
{
"group_1" : [...],
"group_2" : [...]
}
In the On Complete phase of the batch job, use the File connector to write the data to files on the NAS drive
Questions/Concerns :
Case 1: the database select operation loads all 100k records into memory.
Question: how can I optimize this step so that I can still process all 100k records but without a spike in memory usage?
Case 2: when segregating the data, I store the grouped data in the aggregator object built by the reduce operation, and that object stays in memory until I write it to files.
Question: is there a way to segregate the data and write it directly to files in the batch aggregator step, then quickly free the memory held by the aggregator object?
Please treat it as a design question for Mule 4 flows and help me. Thanks to the community for your help and support.

Don't load 100K records into memory. Loading high volumes of data in memory will probably cause an out of memory error. You are not providing details of the configurations, but the Database connector 'streams' pages of records by default, so that is taken care of. Use the fetchSize attribute to tune the number of records per page that are read; the default is 10. The batch scope uses disk space to buffer data, to avoid using RAM. It also has some parameters to help tune the number of records processed per step, for example the batch block size and the batch aggregator size. With default values you would not be anywhere near 100K records in memory. Also be sure to control concurrency to limit resource usage.
Note that even after tuning all of these configurations, it doesn't mean there will be no spike when processing. Any processing consumes resources. The idea is to have a predictable, controlled spike, instead of an uncontrolled one that can exhaust the available resources.
The second question is not clear. You can't control the aggregator's memory usage other than through the aggregator size, but it looks like it only keeps the most recent aggregated records, not all the records. Are you having any problems with that, or is this a theoretical question?
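To make the streaming idea concrete outside of Mule configuration, here is a rough Python sketch of the same principle the connector and batch scope apply for you: read the records in pages and append each record to its group's CSV file immediately, so only one page is ever held in memory. The table, column names, and grouping rule are made up for illustration.

# Sketch only: paged reads plus incremental per-group CSV writes.
# Table name, column names, and the grouping condition are hypothetical.
import csv
import sqlite3  # stand-in for whatever database driver you actually use

PAGE_SIZE = 1000  # analogous to tuning fetchSize / batch block size

def export_groups(db_path, out_dir):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT id, category, payload FROM records")
    files, writers = {}, {}
    try:
        while True:
            page = cur.fetchmany(PAGE_SIZE)      # only one page in memory at a time
            if not page:
                break
            for row_id, category, payload in page:
                group = "group_" + str(category)  # hypothetical grouping condition
                if group not in writers:
                    f = open(f"{out_dir}/{group}.csv", "w", newline="")
                    files[group] = f
                    writers[group] = csv.writer(f)
                    writers[group].writerow(["id", "payload"])
                # appended to the group's file, not held in an in-memory aggregator
                writers[group].writerow([row_id, payload])
    finally:
        for f in files.values():
            f.close()
        conn.close()

In Mule terms, the equivalent levers are the connector's streaming with fetchSize, the batch block size, and the batch aggregator size discussed above.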

Related

Column pruning on parquet files defined as an external table

Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use the SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric: it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.

Realtime database CPU usage 100%

Whatever method I try, the database gets locked
and the CPU stays at 100% for at least 5 minutes.
I use this data for logging.
Example structure (there are at least 10 million records here):
RoomName:
  playcount:
    user_id:
      process:
        value:
        time:
For example:
123232132321312313443:
  gmc_10:
    xasdddfdsdffdsdfff:
      remove_begin:
        value: 1200
        timestamp: 9888439944
What I tried:
1- I tried deleting nodes. I can't delete a single node, because I can't even read a single node.
FirebaseDatabase.DefaultInstance.RootReference.Child("player_room").LimitToFirst(1).GetValueAsync();
Result: CPU usage 100%
2- I tried OrderByKey with StartAt and EndAt:
FirebaseDatabase.DefaultInstance.RootReference.Child("player_room").OrderByKey().StartAt("0").EndAt("100").LimitToFirst(1).GetValueAsync().ContinueWith(task => { /* handle result */ });
Result: CPU usage 100%
3- I tried to export from the Firebase console.
Result: CPU usage 100%
How can I read any single node?
I want to read the "123232132321312313443" node (or any node matching a wildcard).
Ordering, sorting, and equality filters are not important.
I just want to read any node.
If you have a list of 10 million nodes, you're going to have a hard time reading that list, and you can only read data if you know its exact path. So it is somewhat expected that the database will be at 100% capacity for a bit when you try to access that list. But I would expect it to eventually complete the operation, as long as you request a reasonable subset of the data. If that doesn't happen for you, add information about the exact error you get to the question, or reach out to Firebase support for personalized help in troubleshooting.
Aside from that: you might want to enable Firebase's nightly backups, and use those to locally determine the exact path of the specific node you want to read/delete from the online database.
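As an illustration of "read data by its exact path", here is a small sketch using the Realtime Database REST API rather than the Unity SDK; the database URL and auth token are placeholders.

# Sketch only: read (or delete) one node directly by its exact path via the
# Realtime Database REST API. URL and credentials below are placeholders.
import requests

DB_URL = "https://<your-project>.firebaseio.com"
AUTH = {"auth": "<database-secret-or-token>"}
NODE = "player_room/123232132321312313443"

# Direct-path read: no query or ordering over the 10-million-node list.
resp = requests.get(f"{DB_URL}/{NODE}.json", params=AUTH)
resp.raise_for_status()
print(resp.json())

# Once the exact path is known, the node can be deleted the same way.
requests.delete(f"{DB_URL}/{NODE}.json", params=AUTH).raise_for_status()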

Loading Bulk data in Firebase

I am trying to use the set API to set an object in Firebase. The object is fairly large; the serialized JSON is 2.6 MB in size. The root node has around 90 children, and in all there are around 10,000 nodes in the JSON tree.
The set api seems to hang and does not call the callback.
It also seems to cause problems with the firebase instance.
Any ideas on how to work around this?
Since this is a commonly requested feature, I'll go ahead and merge Robert and Puf's comments into an answer for others.
There are some tools available to help with big data imports, like firebase-streaming-import. What they do internally can also be engineered fairly easily for the do-it-yourselfer (see the sketch after these steps):
1) Get a list of keys without downloading all the data, using a GET request and shallow=true. Possibly do this recursively depending on the data structure and dynamics of the app.
2) In some sort of throttled fashion, upload the "chunks" to Firebase using PUT requests or the API's set() method.
The critical things to keep in mind here are that the number of bytes in a request and the frequency of requests will have an impact on performance for others using the application, and will also count against your bandwidth.
A good rule of thumb is that you don't want to do more than ~100 writes per second during your import, preferably lower than 20 to maximize your realtime speeds for other users, and that you should keep the data chunks in low MBs--certainly not GBs per chunk. Keep in mind that all of this has to go over the internets.
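A rough Python sketch of those two steps, assuming the REST API with a placeholder URL and credentials (firebase-streaming-import itself may work differently internally):

# Sketch only: step 1 (shallow key listing) and step 2 (throttled chunked PUTs).
# Database URL, auth, and paths are placeholders.
import time
import requests

DB_URL = "https://<your-project>.firebaseio.com"
AUTH = {"auth": "<database-secret-or-token>"}

def list_keys(path):
    # List the keys at a path without downloading the data underneath them.
    resp = requests.get(f"{DB_URL}/{path}.json", params={**AUTH, "shallow": "true"})
    resp.raise_for_status()
    return list(resp.json() or {})

def import_tree(path, tree, writes_per_second=20):
    # Upload a large dict one top-level child at a time, throttled.
    delay = 1.0 / writes_per_second
    for key, subtree in tree.items():
        # PUT one modest chunk instead of one giant set() of the whole object.
        resp = requests.put(f"{DB_URL}/{path}/{key}.json", params=AUTH, json=subtree)
        resp.raise_for_status()
        time.sleep(delay)  # stay under roughly 20 writes per second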

BigQuery streaming best practice

I have been using Google BigQuery for some time now, uploading files.
As I get some delays with this method, I am now trying to convert my code to streaming.
Looking for the best solution here: what is the more correct way of working with BQ?
1. Using multiple (up to 40) different streaming machines, or directing traffic to one or more endpoints to upload data?
2. Uploading one row at a time, or batching 100-500 events into a list and uploading that?
3. Is streaming the way to go, or should I stick with file uploads for high volumes?
Some more data:
- we are uploading ~1500-2500 rows per second.
- using the .NET API.
- we need the data to be available within ~5 minutes.
I didn't find a reference for this elsewhere.
The big difference between streaming data and uploading files is that streaming is intended for live data that is being produced in real time while it is streamed, whereas with file uploads you upload data that was stored previously.
In your case, I think streaming makes more sense. If something goes wrong, you would only need to re-send the failed rows instead of the whole file, and it adapts better to the steadily growing volume of data that I think you're getting.
The best practices in any case are:
Trying to reduce the number of sources that send the data.
Sending bigger chunks of data in each request instead of multiple tiny chunks.
Using exponential back-off to retry those requests that could fail due to server errors (These are common and should be expected).
There are certain limits that apply to load jobs as well as to streaming inserts.
For example, when using streaming you should insert fewer than 500 rows per request, and you can insert up to 10,000 rows per second per table.
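For illustration, here is a sketch of batched streaming inserts with exponential back-off using the Python client (the question uses the .NET API, where the same pattern applies); the table ID and row shape are hypothetical.

# Sketch only: batch rows into groups of up to 500 and retry failed requests
# with exponential back-off. Table ID and row contents are hypothetical.
import time
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.events"
BATCH_SIZE = 500  # stay at or below ~500 rows per streaming request

def stream_rows(rows):
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        for attempt in range(5):
            errors = client.insert_rows_json(TABLE_ID, batch)
            if not errors:
                break
            time.sleep(2 ** attempt)  # back off on (expected) transient server errors
        else:
            raise RuntimeError(f"Failed to insert batch starting at row {start}")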

How can I improve the performance of the SQLite database?

Background: I am using an SQLite database in my Flex application. The database is 4 MB in size and has 5 tables:
table 1 has 2500 records
table 2 has 8700 records
table 3 has 3000 records
table 4 has 5000 records
table 5 has 2000 records.
Problem: Whenever I run a select query on any table, it takes around 50 seconds to fetch data from the database tables. This makes the application quite slow and unresponsive while it fetches the data.
How can I improve the performance of the SQLite database so that the time taken to fetch the data from the tables is reduced?
Thanks
As I told you in a comment, without knowing what structures your database consists of and what queries you run against the data, there is nothing we can infer about why your queries take so much time.
However, here is some interesting reading about indexes: Use the index, Luke!. It explains what an index is, how you should design your indexes, and what benefits you can expect.
Also, if you can post the queries and the table schemas and cardinalities (not the contents) maybe it could help.
Are you using asynchronous or synchronous execution modes? The difference between them is that asynchronous execution runs in the background while your application continues to run. Your application will then have to listen for a dispatched event and then carry out any subsequent operations. In synchronous mode, however, the user will not be able to interact with the application until the database operation is complete since those operations run in the same execution sequence as the application. Synchronous mode is conceptually simpler to implement, but asynchronous mode will yield better usability.
The first time SQLStatement.execute() is called on a SQLStatement instance, the statement is prepared automatically before executing. Subsequent calls will execute faster as long as the SQLStatement.text property has not changed. Reusing the same SQLStatement instance is better than creating new instances again and again. If you need to change your queries, then consider using parameterized statements.
You can also use techniques such as deferring what data you need at runtime. If you only need a subset of data, pull that back first and then retrieve other data as necessary. This may depend on your application scope and what needs you have to fulfill though.
Specifying the database along with the table names will prevent the runtime from checking each database to find a matching table if you have multiple databases. It also prevents the runtime from choosing the wrong database. Do SELECT email FROM main.users; instead of SELECT email FROM users; even if you only have one database. (main is automatically assigned as the database name when you call SQLConnection.open.)
If you happen to be writing lots of changes to the database (multiple INSERT or UPDATE statements), then consider wrapping them in a transaction. Changes will be made in memory by the runtime and then written to disk. If you don't use a transaction, each statement will result in multiple disk writes to the database file, which can be slow and consume a lot of time.
Try to avoid any schema changes. The table definition data is kept at the start of the database file, and the runtime loads these definitions when the database connection is opened. Data added to tables is kept after the table definition data in the database file. If you make changes such as adding columns or tables, the new table definitions will be mixed in with the table data in the database file. The effect of this is that the runtime has to read the table definition data from different parts of the file rather than just the beginning. The SQLConnection.compact() method restructures the table definition data so it is at the beginning of the file, but its downside is that it can also consume a lot of time, more so if the database file is large.
Lastly, as Benoit pointed out in his comment, consider improving your own SQL queries and the table structure that you're using. It would be helpful to know whether your database structure and queries are the actual cause of the slow performance. My guess is that you're using synchronous execution. If you switch to asynchronous mode, you'll see better performance, but that doesn't mean it has to stop there.
The Adobe Flex documentation online has more information on improving database performance and best practices working with local SQL databases.
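The answer above is about Flex/AIR's SQLStatement API, but the same two ideas (reusing one parameterized statement and wrapping bulk writes in a single transaction) can be sketched with Python's sqlite3 module; the table and columns here are made up.

# Sketch only: parameterized statement reuse and one transaction for bulk writes.
# Table and column names are hypothetical; the original context is Flex/AIR.
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS scores (username TEXT, score INTEGER)")

rows = [("alice", 10), ("bob", 20), ("carol", 30)]
with conn:  # a single transaction: one commit, far fewer disk writes
    conn.executemany(  # the parameterized SQL is prepared once and reused per row
        "INSERT INTO scores (username, score) VALUES (?, ?)",
        rows,
    )

# A parameterized SELECT: only the bound value changes between executions.
cur = conn.execute("SELECT score FROM scores WHERE username = ?", ("alice",))
print(cur.fetchone())
conn.close()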
You could try indexing some of the columns used in the WHERE clause of your SELECT statements. You might also try minimizing usage of the LIKE keyword.
If you are joining your tables together, you might try simplifying the table relationships.
Like others have said, it's hard to get specific without knowing more about your schema and the SQL you are using.
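For example, if your SELECT statements filter on a users.email column (a hypothetical schema, since the real one isn't shown), adding an index on that column lets SQLite avoid scanning the whole table:

# Sketch only: index a column that appears in WHERE clauses. Names are hypothetical.
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")
conn.commit()

# The query planner can now seek via the index instead of doing a full table scan.
for row in conn.execute("SELECT id FROM users WHERE email = ?", ("a@example.com",)):
    print(row)
conn.close()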
