Import a small number of records from a very large CSV file in BizTalk 2006

I have a BizTalk project that imports an incoming CSV file and dumps it to a database table. The import works fine, but I only need to keep about 200-300 records from a file with upwards of a million rows. My orchestration discards the unwanted rows, but the problem is that the flat file I'm importing is still 250MB, and when converted to XML using a regular flat file pipeline, it takes hours to process and sometimes causes the server to run out of memory.
Is there something I can do to have the Custom Pipeline itself discard rows I don't care about? The very first item in each CSV row is one of a few strings, and I only want to keep rows that start with a certain string.
Thanks for any help you're able to provide.

A custom pipeline component would certainly be the best solution, but it would need to execute in the Decode stage, before the disassembler component.
Making it 100% streaming-enabled would be complex (though certainly doable). Depending on the size of the resulting trimmed CSV file, you could simply pre-process the entire input file as soon as your custom component runs: either build the result in memory (in a MemoryStream) if it's small, or write it to a file and then return the resulting FileStream to BizTalk to continue processing from there.
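As a rough illustration (not a drop-in component - the IComponent/IPersistPropertyBag plumbing of a real Decode-stage component is omitted, and keepPrefix is just a placeholder), the core of it is a simple streaming pass over the incoming message body:

using System.IO;

// Sketch of the filtering pass a Decode-stage pipeline component could run.
// It streams the incoming CSV line by line and keeps only the rows whose first
// field starts with the wanted prefix, so the flat file disassembler only ever
// sees the 200-300 rows that matter.
public static class CsvPrefilter
{
    public static Stream FilterRows(Stream original, string keepPrefix)
    {
        // For a few hundred surviving rows a MemoryStream is fine; switch to a
        // temporary FileStream if the trimmed output could still be large.
        var filtered = new MemoryStream();
        var writer = new StreamWriter(filtered);
        using (var reader = new StreamReader(original))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(keepPrefix))
                    writer.WriteLine(line);
            }
        }
        writer.Flush();
        filtered.Position = 0; // rewind so the next pipeline stage reads from the start
        return filtered;
    }
}

Roughly speaking, the component's Execute method would call this with the stream returned by pInMsg.BodyPart.GetOriginalDataStream() and assign the result back to pInMsg.BodyPart.Data before the flat file disassembler runs.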

Related

Increase U-SQL 1GB limit on input file?

It seems like I'm hitting a 1GB upper boundary on my U-SQL input file size. Is there such a limit, and if so, how can it be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor that processes XML files of roughly 2.5GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe I've generated .NET classes for XML serialization, and the custom extractor uses these deserialized .NET objects to populate the output rows.
This all works pretty neatly when running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3GB for a 2.5GB input XML, so this should fit comfortably on a single vertex per file.
It also still works great using <1GB input files on the Data Lake.
However, when trying to scale things up on the Data Lake Store, it seems the job got terminated by hitting the 1GB input file size boundary.
I know that streaming the outer XML and then deserializing the inner XML fragments is an alternative option, but we don't want to create - and particularly maintain - too much custom code that depends on those externally managed schemas.
Therefore, raising the upper-limit would be great.
I see two issues right now: one that we can address today, and one for which we have a feature under development for later this year.
By default, U-SQL assumes that you want to scale out processing over your file and will split it into 1GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process each file atomically (no splitting) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    // ...
}
Now, while a vertex has 3GB of memory, we currently limit the memory size for a UDO like an extractor to 500MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to the UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address that today are either to make your data small enough, or - in the case of XML, for example - to use a streaming parsing strategy that does not allocate too much memory (e.g., use the XmlReader interface).
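As a rough illustration of that streaming approach, here is a minimal sketch of an atomic extractor built around XmlReader; the element name "Record", the output column "RecordXml", and the single-column output schema are invented for the example and would need to match your actual data:

using System.Collections.Generic;
using System.Xml;
using Microsoft.Analytics.Interfaces;

// Illustrative only: streams <Record> fragments out of a large XML input one at
// a time instead of deserializing the whole document, keeping memory usage low.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class StreamingXmlExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = XmlReader.Create(input.BaseStream))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
                {
                    // ReadOuterXml returns the fragment and advances the reader past it,
                    // so only one record is held in memory at a time.
                    output.Set<string>("RecordXml", reader.ReadOuterXml());
                    yield return output.AsReadOnly();
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}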

The best way to store a large number of files with quick search

In my Windows 8/RT app I use a SQLite database (sqlite-net) which is stored in Isolated Storage. The database holds a lot of data, including links to files (images, PDFs and others). I get those links from a web server. When I get a link, I want to download the file and store it locally.
My question is: what is the best way to store a large number of files (100+)? One important thing: I need to be able to find the desired file quickly.
I have three ideas:
Create another database only for files (I can't modify the existing one)
Create a folder in Isolated Storage and store the files there directly.
Create a list of files and store it in Isolated Storage.
Which would be better/faster? Or does somebody have another great solution?
100 files isn't such a big number; you can easily store up to 100k files (or folders) in a single (NTFS) directory.
If you receive the files from a web server, then the question is whether the source guarantees there are no duplicate filenames. If this can't be assured, I'd recommend having a database table that maps the original filename and metadata to the file's hash (SHA256 or similar) and storing the file under a filename corresponding to that hash.
Then, when using the file, you can pass it to the user under the original filename using the StorageFile API.
Going beyond 100k files, you could create a subfolder structure from the first two letters of the hash.
Either way, storing the file metadata in a database and the files in a directory has been the most useful approach for us in the past.
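A minimal sketch of that layout (the folder root, helper names, and return value are placeholders, and a Windows 8/RT app would express the same idea with the StorageFolder/StorageFile APIs rather than System.IO):

using System;
using System.IO;
using System.Security.Cryptography;

public static class HashedFileStore
{
    // Copies a downloaded file into a folder named after the first two characters
    // of its SHA-256 hash and returns the hash, so it can be recorded in the
    // database next to the original filename and other metadata.
    public static string Store(string rootFolder, string downloadedFile)
    {
        string hashHex;
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(downloadedFile))
        {
            hashHex = BitConverter.ToString(sha.ComputeHash(stream))
                                  .Replace("-", "").ToLowerInvariant();
        }

        // The two-character subfolder keeps any single directory small even if
        // the store grows well beyond 100k files.
        string folder = Path.Combine(rootFolder, hashHex.Substring(0, 2));
        Directory.CreateDirectory(folder);
        File.Copy(downloadedFile, Path.Combine(folder, hashHex), overwrite: true);
        return hashHex;
    }
}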
100 files with an average size of 1MB is only 100MB.
Most people say that storing binary files in a database is wrong and suggest storing the files separately, keeping only the file names in the database, but I think it is fine provided you know what you are doing and why.
A big advantage of storing files in the database is that you keep the files together with their properties logically in one place. Also, you can simply copy one file and that backs up everything.
A database also gives you transaction support. You may have some problems reading and writing BLOBs to the database, but it is not very difficult.
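If you do go that route with sqlite-net, a byte[] property maps straight to a BLOB column; the class and property names below are invented for illustration:

using SQLite;

// sqlite-net maps byte[] properties to BLOB columns, so a stored file and its
// properties live together in one row.
public class StoredFile
{
    [PrimaryKey, AutoIncrement]
    public int Id { get; set; }

    public string OriginalName { get; set; }
    public byte[] Content { get; set; }
}

public static class FileTable
{
    public static void Save(SQLiteConnection db, string name, byte[] data)
    {
        db.CreateTable<StoredFile>(); // no-op if the table already exists
        db.Insert(new StoredFile { OriginalName = name, Content = data });
    }
}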

Best format for regularly storing incremental data using R

I have a database that is used to store transactional records; these records are created, and another process picks them up and then removes them. Occasionally this process breaks down and the number of records builds up. I want to set up a (semi-)automated way to monitor things, and as my tool set is limited and I have an R-shaped hammer, this looks like an R-shaped nail problem.
My plan is to write a short R script that will query the database via ODBC, and then write a single record with the datetime, the number of records in the query, and the datetime of the oldest record. I'll then have a separate script that will process the data file and produce some reports.
What's the best way to create my data file? At the moment my options are:
Load a dataframe, add the record and then resave it
Append a row to a text file (i.e. a csv file)
Any alternatives, or a recommendation?
I would be tempted by the second option because, from a semantic point of view, you don't need the old entries to write the new ones, so there is no reason to reload all the data each time. Doing that would consume more time and resources.

Using SQLite to replace filesystem

I have an application that stores configuration files as XML on disk. I'd like to reduce the risk of data file corruption in the case of a crash, etc. It seems like the common recommendation is to use SQLite.
What is your opinion on just using BLOBs to store the current XML format? The table would look like:
CREATE TABLE t ( filename TEXT, filedata BLOB )
On the one hand, this seems inelegant, but on the other it would avoid all the work (and corresponding bugs) of converting the configuration to an appropriate format.
Sounds inefficient. You'll need to load and parse the BLOB to get your configuration values as well as save the entire configuration file for every change.
I'm assuming the reason you're switching to a SQLite database is that the transaction mechanism will give you some amount of fault tolerance against crashes. If you store each of your configuration files as one BLOB, then you will need to save the entire file before the transaction completes, as opposed to just saving the updated values, which should be quicker.
In addition if you're using a DOM based XML parser you'll end up loading both the BLOB and the parsed DOM tree into memory at the same time. Depending on the size and number of your configuration files that could be resource intensive.
IMHO you're better off creating a table for each configuration file, with a row for each of your configuration values. You'll get better read/write performance, lower memory usage, and you'll be able to use all the relational mechanisms of SQLite.
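As a hedged sketch of that suggestion (C# with Microsoft.Data.Sqlite is assumed here, and the table and column names are invented), each configuration value becomes its own row, so changing one setting rewrites one small row inside a transaction instead of the whole XML document:

using Microsoft.Data.Sqlite;

public static class ConfigStore
{
    // Upserts a single configuration value; the explicit transaction gives the
    // crash tolerance the question is after.
    public static void SetValue(string dbPath, string key, string value)
    {
        using (var conn = new SqliteConnection($"Data Source={dbPath}"))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                var cmd = conn.CreateCommand();
                cmd.Transaction = tx;
                cmd.CommandText =
                    "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT);" +
                    "INSERT OR REPLACE INTO config (key, value) VALUES ($key, $value);";
                cmd.Parameters.AddWithValue("$key", key);
                cmd.Parameters.AddWithValue("$value", value);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}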

Does DB2 OS/390 BLOB support .docx file

An ASP.NET app inserts a Microsoft Word 2007 .docx file into a row in a DB2 OS/390 BLOB table. A different VB.NET app retrieves the DB2 OS/390 BLOB data and launches Microsoft Word to open the .docx file, but Microsoft Word pops up a message that the data is corrupted. Word will allow you to repair the data so the file can be viewed, but that adds extra steps and users complain.
I've seen some examples where .docx can be converted to .doc, but they only talk about stripping out the text. Some of our .docx files have pictures in them.
Any ideas?
I see that this question is 10 months old. I hope it's not too late to be helpful.
Neither DB2 nor any other database that allows a "Blob" data type would know that the data came from a .docx file, or do anything that would cause Word to complain. The data is supposed to be an exact copy of whatever data you pass to it.
Similarly, the Word document does not "know" that it has been copied to a BLOB object and then back.
Therefore, the problem is almost certainly with your handling of the BLOB data, in one or both of your programs.
Please run your first program to copy the .docx file into the database, then run the second one to read it back out. Then use a byte-by-byte tool to compare the two files. One way to do this would be to open a command window and type:
fc /b Doc1.docx Doc2.docx
If you have access to better comparison tools, by all means use them... but make sure they look at EVERY BYTE, not just the printable characters.
Obviously, you ARE going to find differences, or else Microsoft Word wouldn't give you errors on the second one when the first one is just fine. Once you see what the differences are, hopefully you will understand what is going wrong and how to fix them.
I had a similar problem several years ago (I was storing graphics, but it's the same basic problem). It turns out that the document size was being affected - I would store 8005 bytes into the BLOB object, and when I read it back out I was getting 8192 bytes. NUL (0) bytes were being appended to the end of the data.
My solution at the time was to append an "X" to the end of the BLOB data when I wrote it to the database. Then, when I read it back, I would search for the very last "X" in the data and remove it, along with any data after it. That way, I could recover the original data. What I should have done was store the data length in the database along with the BLOB data. Then you could truncate the file to that size, eliminating the corruption.
If appended NUL bytes aren't your problem, then you'll need to do something else to fix it. But you won't have a clue until you know what changed. Something did.
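If you want to automate that check, a small byte-level comparison (plain .NET; the method names are invented) will show where the round-tripped copy starts to diverge, and if the original length is stored alongside the BLOB, trimming on the way out removes any padding:

using System;
using System.IO;

public static class BlobCheck
{
    // Returns the first byte offset where the round-tripped file differs from the
    // original, or -1 if the two files are identical.
    public static long FirstDifference(string originalPath, string roundTrippedPath)
    {
        byte[] a = File.ReadAllBytes(originalPath);
        byte[] b = File.ReadAllBytes(roundTrippedPath);
        long shorter = Math.Min(a.Length, b.Length);
        for (long i = 0; i < shorter; i++)
            if (a[i] != b[i]) return i;
        // Same up to the shorter length: either identical, or one copy was padded (e.g. with NULs).
        return a.Length == b.Length ? -1 : shorter;
    }

    // If the original length is stored next to the BLOB, truncate to it on the way out.
    public static byte[] TrimToStoredLength(byte[] blob, int storedLength)
    {
        byte[] trimmed = new byte[storedLength];
        Array.Copy(blob, trimmed, storedLength);
        return trimmed;
    }
}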
