It seems like I'm hitting a 1 GB upper limit on my U-SQL input file size. Is there such a limit, and if so, how can it be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor where I'm processing XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.
This all works pretty neatly when running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3 GB for a 2.5 GB input XML, so this should fit perfectly on a single vertex per file.
This still works great with <1 GB input files on the Data Lake.
However, when trying to scale things up on the Data Lake Store, the job seems to get terminated when it hits the 1 GB input file size boundary.
I know that streaming the outer XML and then deserializing the inner XML fragments is an alternative option, but we don't want to create - and particularly maintain - too much custom code depending on those externally managed schemas.
Therefore, raising the upper limit would be great.
I see two issues right now: one that we can address, and one for which we have a feature under development for later this year.
By default, U-SQL assumes that you want to scale out processing over your file and will split it into 1 GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process the files atomically (not splitting them) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{ ...
Now, while a vertex has 3 GB of data, we currently limit the memory size for a UDO like an extractor to 500 MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to the UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address that right now are either to make your data small enough, or - in the case of XML, for example - to use a streaming parsing strategy that does not allocate too much memory (e.g., use the XmlReader interface).
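For illustration, here is a minimal sketch of such a streaming extractor (the element name "Record" and the output columns "Id"/"Name" are placeholders, not from your schema): it reads the input with XmlReader and only ever materializes one inner fragment at a time.

using System.Collections.Generic;
using System.Xml;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class StreamingXmlExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = XmlReader.Create(input.BaseStream))
        {
            while (reader.ReadToFollowing("Record"))
            {
                // ReadSubtree() exposes just the current fragment, so only one
                // record is held in memory at any time.
                using (var fragment = reader.ReadSubtree())
                {
                    var doc = new XmlDocument();
                    doc.Load(fragment);

                    output.Set("Id", doc.DocumentElement.GetAttribute("id"));
                    output.Set("Name", doc.DocumentElement.SelectSingleNode("Name")?.InnerText);
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}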
I have a huge dataset that must be inserted into a graph database via Gremlin (Gremlin Server). Because the XML file is too big (over 8 GB in size), I decided to split it into nine smaller, manageable XML files (about 1 GB each). My question is: is there a way to insert each of these data files into my TinkerPop graph database via Gremlin Server, i.e. by trying something like this? Or what is the best way to insert this data?
graph.io(IoCore.graphml()).readGraph("data01.xml")
graph.io(IoCore.graphml()).readGraph("data02.xml")
graph.io(IoCore.graphml()).readGraph("data03.xml")
graph.io(IoCore.graphml()).readGraph("data04.xml")
graph.io(IoCore.graphml()).readGraph("data05.xml")
That's a large GraphML file. I'm not sure I've ever come across one so large. I'd wonder how you went about splitting it, since GraphML files aren't easily split: they are XML based, have a header, and have a structure where vertices and edges are in separate nodes. It was for these reasons (and others) that TinkerPop developed formats like Gryo and GraphSON, which are easily split for processing in Hadoop-like file structures.
That said, assuming you split your GraphML files correctly so that each file is a complete subgraph, I suppose you would be able to load them the way you are suggesting; however, I'd be concerned about how much memory you'd need to do this. The io() loader is not meant for bulk parallel loading and basically holds an in-memory cache of vertices to speed loading. That in-memory cache is essentially just a HashMap that doesn't expire its contents. Therefore, while the loading is occurring, you would need to be able to hold all of the Vertex instances in memory for a particular file.
I don't know what your requirements are or how you ended up with such a large GraphML file, but for a graph of this size I would look at either the provider-specific bulk loading tools of the graph you are using, or some custom method that uses spark-gremlin or a Gremlin script of some sort to parallel-load your data.
From a Linux bash script, I want to read the structured data stored by a particular Firefox add-on called FB-Purity.
I have found a folder called .mozilla/firefox/b8eab5j0.default/storage/default/moz-extension+++37a9788c-671d-4cae-ba5c-fbdb8788499a^userContextId=4294967295/ that contains a .metadata file which contains the string moz-extension://37a9788c-671d-4cae-ba5c-fbdb8788499a, a URL which, when opened in Firefox, shows the add-on's details, so I am pretty sure that this folder belongs to the add-on.
That folder contains an idb directory, which sounds like the Indexed Database API, a W3C standard apparently used by Firefox since last year to store add-on data.
The idb folder only contains an empty folder and an SQLite file.
The SQLite file, unfortunately, does not contain much structured application data, but the object_data table contains a 95 KB blob which probably contains the real structured data:
INSERT INTO `object_data` VALUES (1,'0pmegsjfoetupsf.742612367',NULL,NULL,
X'e08b0d0403000101c0f1ffe5a201000400ffff7b00220032003100380035003000320022003a002
2005300610074006f0072007500200055007205105861006e00690022002c00220036003100350036
[... 95KB ...]
00780022007d00000000000000');
Question: any clue what this blob's format is? How can I extract it (using the command line, any library, or any Linux tool) to JSON or any other readable format?
Well, I had a fun day today figuring this out and ended up creating a Python tool that can read the data from these IndexedDB database files and print it (and maybe more at some point): moz-idb-edit
To answer the technical parts of the question first:
Both the key (name) and the data (value) use Mozilla-proprietary formats whose only documentation appears to be their source code at this time.
The keys use a special just-for-this-use-case encoding whose rough description is available in mozilla-central/dom/indexedDB/Key.cpp – the file also contains the only known implementation. Its unique selling point appears to be the fact that it is relatively compact while being compatible with all the possible index types websites may throw at you, as well as being in the correct binary sorting order by default.
The values are stored using SpiderMonkey's internal StructuredClone representation, which is also used when moving values between processes in the browser. Again, there are no docs to speak of, but one can read the source code, which fortunately is quite easy to understand. Before being added to the database, however, the generated binary is compressed on the fly using Google's Snappy compression, which “does not aim for maximum compression [but instead …] aims for very high speeds and reasonable compression” – probably not a bad idea considering that we're dealing with wasteful web content here.
To locate the correct IndexedDB file for an extension's local storage data, one needs to resolve the extension's static ID to a so-called “internal UUID” whose value is different in every browser profile instance (to make tracking based on installed add-ons a lot harder). The mapping table for this is stored as a pref (“extensions.webextensions.uuids”) in prefs.js. The IDB path then is ${MOZ_PROFILE}/storage/default/moz-extension+++${EXT_UUID}^userContextId=4294967295/idb/3647222921wleabcEoxlt-eengsairo.sqlite
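For illustration only - this is not part of moz-idb-edit, and the regex and unescaping are simplified assumptions - the lookup described above could be sketched in C# roughly like this:

using System;
using System.IO;
using System.Text.Json;
using System.Text.RegularExpressions;

static class ExtensionIdbPath
{
    public static string Resolve(string profileDir, string extensionId)
    {
        // prefs.js stores the mapping as a JSON object inside a JS string literal, e.g.
        // user_pref("extensions.webextensions.uuids", "{\"id@example\":\"uuid\"}");
        string prefs = File.ReadAllText(Path.Combine(profileDir, "prefs.js"));
        var match = Regex.Match(prefs,
            "user_pref\\(\"extensions\\.webextensions\\.uuids\",\\s*\"(.+)\"\\);");
        if (!match.Success)
            throw new InvalidOperationException("uuid mapping pref not found");

        // Undo the JS string escaping, then parse the JSON mapping.
        string json = match.Groups[1].Value.Replace("\\\"", "\"");
        using var map = JsonDocument.Parse(json);
        string uuid = map.RootElement.GetProperty(extensionId).GetString();

        return Path.Combine(profileDir, "storage", "default",
            $"moz-extension+++{uuid}^userContextId=4294967295",
            "idb", "3647222921wleabcEoxlt-eengsairo.sqlite");
    }
}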
For all practical purposes, you can read the value of a single storage key of any extension by downloading the project mentioned above. Basic usage is:
$ ./moz-idb-edit --extension "${EXT_ID}" --profile "${MOZ_PROFILE}" "${STORAGE_KEY}"
Where ${EXT_ID} is the extension's static ID (check its manifest.json file or look in about:support#extensions-tbody if you're unsure), ${MOZ_PROFILE} is the Firefox profile directory (also shown in about:support), and ${STORAGE_KEY} is the name of the key you'd like to query (unfortunately, querying all keys is not supported yet).
Writing data is not currently supported either.
I'll update this answer as I implement more features (or drop me an issue on the project page!).
A website serves continuously updated content (think stock exchange), is required to generate reports on demand, and lets users download the resulting files. Users can customize the downloaded report based on lots of parameters.
What is the best practice for handling highly customized reports downloaded as files (.xls)?
How can I cache them and improve performance?
It might be good to mention that the data is stored in RavenDB and the reports are expected to handle result sets of around 100K rows.
Here are some pointers:
Make sure you have static indexes defined in RavenDB to match all possible reports. You don't want to use dynamically generated temp indexes for this.
Probably one or more parameters will drastically change the query, so you may have some conditional logic to choose which of several queries to run. This is especially true for different groupings, as they'll require a different map-reduce index.
Choose whether you want to limit your result set using standard paging with Skip and Take operators, or whether you are going to stream unbounded result sets.
However you build the actual report, do it in memory. Do not try to write it to disk first. Managing file permissions, locks, and cleanup is not worth the hassle. Plus, you risk taking servers down if they run out of disk space.
Preferably, you should build the response and stream it out to your user in a single step, so as not to require large amounts of memory on the server. Make sure you understand the yield keyword in C#, and that you work with IEnumerable and IQueryable directly whenever possible. Don't try to use .ToList() or .ToArray(), which will put the whole result set into memory (see the sketch after these pointers).
With regard to caching, you could consider using a front-end cache like Memcached, but I'm not sure whether it will help you here or not. You probably want data from your database that is as accurate as possible. Introducing any sort of cache requires that you understand how and when to reset that cache. Keep in mind that Raven has several caching layers built in already. Build your solution without caching first, and then add caching if you need it.
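To illustrate the static-index and streaming pointers above, here is a minimal sketch using the RavenDB .NET client's streaming API; the StockQuote type and the Reports_ByDate index are made-up placeholders, not your actual model. It streams an unbounded result set without ever calling .ToList():

using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Session;

public class StockQuote
{
    public string Symbol { get; set; }
    public decimal Price { get; set; }
    public System.DateTime Date { get; set; }
}

// Pointer 1: a static index defined up front, rather than a temp index at query time.
public class Reports_ByDate : AbstractIndexCreationTask<StockQuote>
{
    public Reports_ByDate()
    {
        Map = quotes => from q in quotes select new { q.Date };
    }
}

public class ReportStreamer
{
    private readonly IDocumentStore store;
    public ReportStreamer(IDocumentStore store) => this.store = store;

    // Returns an IEnumerable the report writer can consume lazily, so only one
    // document is held in memory at a time while the response is written out.
    public IEnumerable<StockQuote> StreamReport()
    {
        using (IDocumentSession session = store.OpenSession())
        {
            var query = session.Query<StockQuote, Reports_ByDate>();

            using (var results = session.Advanced.Stream(query))
            {
                while (results.MoveNext())
                    yield return results.Current.Document;
            }
        }
    }
}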
I have an application that stores configuration files as XML on disk. I'd like to reduce the risk of data file corruption in the case of crashing etc. It seems like the common recommendation is to use SQLite.
What is your opinion on just using BLOBs to store the current XML format? The table would look like:
CREATE TABLE t ( filename TEXT, filedata BLOB )
On the one hand, this seems inelegant, but on the other it would avoid all the work (and corresponding bugs) of converting the configuration to an appropriate format.
Sounds inefficient. You'll need to load and parse the BLOB to get your configuration values as well as save the entire configuration file for every change.
I'm assuming the reason you're switching to a SQLite database is that the transaction mechanism will give you some amount of fault tolerance to crashes. If you store each of your configuration files as one BLOB, then you will need to save the entire file before the transaction completes, as opposed to just saving the updated values, which would be quicker.
In addition, if you're using a DOM-based XML parser, you'll end up loading both the BLOB and the parsed DOM tree into memory at the same time. Depending on the size and number of your configuration files, that could be resource-intensive.
IMHO you're better off creating a table for each configuration file, with a row for each of your configuration values. You'll get better read/write performance, less memory usage, and the ability to use all the relational mechanisms of SQLite.
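As a rough sketch of that approach - assuming the Microsoft.Data.Sqlite client, and using a single table keyed by (file, key) rather than literally one table per configuration file; all names here are illustrative:

using Microsoft.Data.Sqlite;

class ConfigStore
{
    private readonly string connectionString;

    public ConfigStore(string dbPath) => connectionString = $"Data Source={dbPath}";

    public void SetValue(string file, string key, string value)
    {
        using var connection = new SqliteConnection(connectionString);
        connection.Open();

        var cmd = connection.CreateCommand();
        cmd.CommandText =
            @"CREATE TABLE IF NOT EXISTS config (file TEXT, key TEXT, value TEXT,
                                                 PRIMARY KEY (file, key));
              INSERT INTO config (file, key, value) VALUES ($file, $key, $value)
              ON CONFLICT (file, key) DO UPDATE SET value = excluded.value;";
        cmd.Parameters.AddWithValue("$file", file);
        cmd.Parameters.AddWithValue("$key", key);
        cmd.Parameters.AddWithValue("$value", value);

        // Each change writes a single row transactionally, instead of rewriting
        // the whole serialized XML blob on every save.
        cmd.ExecuteNonQuery();
    }

    public string GetValue(string file, string key)
    {
        using var connection = new SqliteConnection(connectionString);
        connection.Open();

        var cmd = connection.CreateCommand();
        cmd.CommandText = "SELECT value FROM config WHERE file = $file AND key = $key";
        cmd.Parameters.AddWithValue("$file", file);
        cmd.Parameters.AddWithValue("$key", key);
        return cmd.ExecuteScalar() as string;
    }
}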
I have a BizTalk project that imports an incoming CSV file and dumps it to a database table. The import works fine, but I only need to keep about 200-300 records from a file with upwards of a million rows. My orchestration discards these rows, but the problem is that the flat file I'm importing is still 250 MB, and when converted to XML using a regular flat file pipeline, it takes hours to process and sometimes causes the server to run out of memory.
Is there something I can do to have the Custom Pipeline itself discard rows I don't care about? The very first item in each CSV row is one of a few strings, and I only want to keep rows that start with a certain string.
Thanks for any help you're able to provide.
A custom pipeline component would certainly be the best solution, but it would need to execute in the decode stage, before the disassembler component.
Making it 100% streaming-enabled would be complex (but certainly doable). Depending on the size of the resulting trimmed CSV file, though, you could simply pre-process the entire input file as soon as your custom component runs and either generate the results in memory (in a MemoryStream) if it's small, or write them to a file and then return the resulting FileStream to BizTalk to continue processing from there.
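As a rough sketch, the core Execute method of such a decode-stage component could look something like the following (the "KEEP" prefix and UTF-8 encoding are assumptions, and the remaining pipeline component interfaces - IBaseComponent, IPersistPropertyBag, IComponentUI - are omitted for brevity):

using System;
using System.IO;
using System.Text;
using Microsoft.BizTalk.Component.Interop;
using Microsoft.BizTalk.Message.Interop;

public class CsvPreFilterComponent : IComponent
{
    public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
    {
        var filtered = new MemoryStream();
        var writer = new StreamWriter(filtered, Encoding.UTF8);

        using (var reader = new StreamReader(pInMsg.BodyPart.GetOriginalDataStream()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Keep only the handful of rows that matter, so the flat file
                // disassembler never sees the other ~1 million lines.
                if (line.StartsWith("KEEP", StringComparison.Ordinal))
                    writer.WriteLine(line);
            }
        }

        writer.Flush();
        filtered.Position = 0;

        pInMsg.BodyPart.Data = filtered;
        pContext.ResourceTracker.AddResource(filtered);  // let BizTalk dispose it later
        return pInMsg;
    }
}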