Number of records in WARC file

I am currently parsing WARC files from the Common Crawl corpus and would like to know upfront, without iterating through all WARC records, how many records there are.
Does the WARC 1.1 standard define such information?

The WARC standard does not define a standard way to indicate the number of WARC records in the WARC file itself. The number of response records in Common Crawl WARC files is usually between 30,000 and 50,000; note that there are also request and metadata records. The WARC standard recommends 1 GB as the target size of WARC files, which puts a natural limit on the number of records.
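Since the count is not stored anywhere in the file, the only general way to obtain it is to iterate over the records anyway. A minimal sketch using the warcio library, assuming a local (possibly gzipped) WARC file whose path is a placeholder:

from warcio.archiveiterator import ArchiveIterator

def count_records(path):
    # Tally records by type (response, request, metadata, ...).
    counts = {}
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            counts[record.rec_type] = counts.get(record.rec_type, 0) + 1
    return counts

print(count_records('CC-MAIN-example.warc.gz'))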

Related

Mule 4: Design: How to process data [files/database records] in Mule 4 without getting an "out-of-memory" error?

Scenario:
I have a database that contains 100k records, amounting to about 10 GB in memory.
My objective is to:
fetch these records,
segregate the data based on certain conditions,
then generate CSV files for each group of data,
and write these CSV files to a NAS (a storage drive accessible over the same network).
To achieve this, I am thinking of the design as follows:
Use a Scheduler component that triggers the flow daily (at 9 am, for example)
Use a database select operation to fetch the records
Use a batch processing scope
In a batch step, use the reduce function in a Transform Message component and segregate the data in the aggregator in a format like:
{
"group_1" : [...],
"group_2" : [...]
}
In the On Complete step of the batch processing scope, use a File component to write the data to files on the NAS drive
Questions/Concerns:
Case 1: When reading from the database, the select operation loads all 100k records into memory.
Question: How can I optimize this step so that I still process all 100k records but avoid a spike in memory usage?
Case 2: When segregating the data, I store the isolated data in the aggregator object built by the reduce operation, and that object stays in memory until I write it to files.
Question: Is there a way to segregate the data and write it directly to files in the batch aggregator step, then quickly free the memory used by the aggregator object?
Please treat this as a design question for Mule 4 flows. Thanks to the community for your help and support.
Don't load 100K records into memory. Loading high volumes of data in memory will probably cause an out-of-memory error. You don't provide configuration details, but the database connector streams pages of records by default, so that part is taken care of. Use the fetchSize attribute to tune the number of records read per page; the default is 10. The batch scope buffers data on disk to avoid using RAM, and it also has parameters to help tune the number of records processed per step, for example the batch block size and the batch aggregator size. With default values, nowhere near 100K records would be held at once. Also be sure to control concurrency to limit resource usage.
Note that even with reduced configuration values there will still be some spike when processing; any processing consumes resources. The idea is to have a predictable, controlled spike instead of an uncontrolled one that can exhaust available resources.
This question is not clear. You can't control the aggregator memory other than through the aggregator size, and it appears to keep only the most recently aggregated records, not all of them. Are you actually having a problem with this, or is it a theoretical question?
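The Mule configuration itself is XML, but the underlying idea of the answer, reading the records in pages and flushing each group to disk as it is processed instead of aggregating everything in memory, can be sketched in plain Python with the standard sqlite3 and csv modules; the table, column, and group names below are made up for illustration:

import csv
import sqlite3

def export_groups(db_path, out_dir, page_size=200):
    # Stream the rows in pages rather than loading them all at once,
    # analogous to tuning the connector's fetchSize.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.arraysize = page_size
    cur.execute("SELECT grp, col_a, col_b FROM records")  # hypothetical table/columns
    files, writers = {}, {}
    while True:
        page = cur.fetchmany()  # only one page is held in memory at a time
        if not page:
            break
        for grp, col_a, col_b in page:
            if grp not in writers:
                f = open(f"{out_dir}/group_{grp}.csv", "w", newline="")
                files[grp] = f
                writers[grp] = csv.writer(f)
            writers[grp].writerow([col_a, col_b])  # written out as we go, not kept in RAM
    for f in files.values():
        f.close()
    conn.close()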

Organizing tables with data-heavy rows to optimize access times

I am working with a sqlite3 database of around 70 gigabytes right now. This db has three tables: one with about 30 million rows, and two more with ~150 and ~300 million each, with each table running from 6-11 columns.
The table with the fewest rows is consuming the bulk of the space, as it contains a raw data column of zipped BLOBs, generally running between 1 and 6 kilobytes per row; all other columns in the database are numeric, and the zipped data is immutable so inefficiency in modification is not a concern.
I have noticed that creating indexes on the numeric columns of this table:
[15:52:36] Query finished in 723.253 second(s).
takes several times as long as creating a comparable index on the table with five times as many rows:
[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).
Would it be better practice to store the BLOB data in a separate table to access with JOINs? The extra width of each row is the most likely candidate for the slow scan rate of this table.
My current suspicions are:
This is mostly due to the way data is read from disk, making skipping medium-sized amounts of data impractical and yielding a very low ratio of usable data per sector read from the disk by the operating system, and
It is therefore probably standard practice, which I did not know as a relative newcomer to relational databases, to avoid putting large, variable-width data in the same table as other data that may need to be scanned without indices,
but I would appreciate some feedback from someone with more knowledge in the field.
In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.
When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.
Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:
CREATE TABLE t (
    id INTEGER PRIMARY KEY,
    a INTEGER,
    [...],
    i REAL,
    data BLOB NOT NULL
);
With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.
If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.
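If the blob does move out, the split layout might look like the following minimal sketch, here driven through Python's built-in sqlite3 module; the table and column names are placeholders:

import sqlite3

conn = sqlite3.connect("example.db")
conn.executescript("""
    -- Narrow table: numeric columns only, so scans and index builds
    -- never have to step over blob overflow pages.
    CREATE TABLE t (
        id INTEGER PRIMARY KEY,
        a  INTEGER,
        i  REAL
    );
    -- The payload lives in a companion table, keyed by the same id.
    CREATE TABLE t_blob (
        id   INTEGER PRIMARY KEY REFERENCES t(id),
        data BLOB NOT NULL
    );
""")
# The blob is read only when it is explicitly joined in.
row = conn.execute(
    "SELECT t.a, t_blob.data FROM t JOIN t_blob USING (id) WHERE t.id = ?",
    (1,),
).fetchone()
conn.close()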

sqlite drop index very slow

I have a sqlite database which is about 75 GB. It takes almost 4 hours to create an index on the database. After indexing, the file size is about 100 GB.
Often, I have to modify (insert/delete/update) large chunks (few GBs) of data. As of now, I am deleting the index before modifying the tables. After the modifications are complete, the indexes are recreated.
Dropping the index takes a huge amount of time (of the same order as creating it).
In some very special cases (when the entire data set needs to be regenerated), I am able to write to a new database file and replace the original one. That strategy does not require me to drop the indices.
What can I do to speed up index deletion in the cases where I cannot just switch database files? Any suggestions/ideas are welcome.
This is, I think, one of the limitations of single-file databases. If tables/indices were stored in separate files, those files could simply be marked as deleted.

Limiting number of rows in a table

My table contains around 16 columns. I want to limit the number of rows to 10,000. I can:
Insert only if the current size is less than 10,000.
Put a configuration limit (either through some external configuration file or through some dynamic parameter) on the maximum number of rows.
I prefer option 2 because it avoids checking the size on every insert (my table is insert-intensive; reading is occasional). It would be useful if this limit could be set dynamically (for example via an sqlite3_limit()-like API), but an /etc/*-style configuration file would also do.
Is this possible with SQLite 3.7.7.1 on Linux (SLES 11)?
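As far as I know, SQLite has no configuration setting or sqlite3_limit() category that caps the number of rows in a table, so option 1 (or a trigger that implements it inside the database) is the usual route. A minimal sketch via Python's sqlite3 module, with a hypothetical table name; note that the COUNT(*) on every insert is exactly the per-insert cost you were hoping to avoid, so it would need measuring for an insert-intensive workload:

import sqlite3

conn = sqlite3.connect("example.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT);

    -- Abort any insert once the table already holds 10000 rows.
    CREATE TRIGGER IF NOT EXISTS events_row_cap
    BEFORE INSERT ON events
    WHEN (SELECT COUNT(*) FROM events) >= 10000
    BEGIN
        SELECT RAISE(ABORT, 'row limit reached');
    END;
""")
conn.close()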

Import small number of records from a very large CSV file in Biztalk 2006

I have a BizTalk project that imports an incoming CSV file and dumps it to a database table. The import works fine, but I only need to keep about 200-300 records from a file with upwards of a million rows. My orchestration discards these rows, but the problem is that the flat file I'm importing is still 250 MB, and when converted to XML using a regular flat file pipeline, it takes hours to process and sometimes causes the server to run out of memory.
Is there something I can do to have the Custom Pipeline itself discard rows I don't care about? The very first item in each CSV row is one of a few strings, and I only want to keep rows that start with a certain string.
Thanks for any help you're able to provide.
A custom pipeline component would certainly be the best solution; but it would need to execute in the decode stage before the disassembler component.
Making it 100% streaming-enabled would be complex (though certainly doable). Depending on the size of the resulting trimmed CSV file, you could simply pre-process the entire input file as soon as your custom component runs and either generate the results in memory (in a MemoryStream) if they are small, or write them to a file and then return the resulting FileStream to BizTalk to continue processing from there.
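The pipeline component itself would be .NET code, but the pre-processing step it describes, streaming the input and keeping only the rows whose first field matches the wanted value, is easy to sketch; here is the idea in Python, with the wanted value as a placeholder:

import csv

WANTED = {"ORDER"}  # placeholder for the "certain string" in the question

def trim_csv(src_path, dst_path):
    # Stream row by row so the 250 MB input is never held in memory at once.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[0] in WANTED:
                writer.writerow(row)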
