I have an application that stores configuration files as XML on disk. I'd like to reduce the risk of data file corruption in the case of crashing etc. It seems like the common recommendation is to use SQLite.
What is your opinion on just using BLOBs to store the current XML format? The table would look like:
CREATE TABLE t ( filename TEXT, filedata BLOB )
On the one hand, this seems inelegant, but on the other it would avoid all the work (and corresponding bugs) of converting the configuration to an appropriate format.
Sounds inefficient. You'll need to load and parse the BLOB to get your configuration values as well as save the entire configuration file for every change.
I'm assuming the reason you're switching to a SQLite database is that the transaction mechanism will give you some amount of fault tolerance to crashes. If you store each of your configuration files as one BLOB, then you will need to save the entire file before the transaction completes, as opposed to just saving the updated values, which should be quicker.
In addition if you're using a DOM based XML parser you'll end up loading both the BLOB and the parsed DOM tree into memory at the same time. Depending on the size and number of your configuration files that could be resource intensive.
IMHO you're better off creating a table for each configuration file, with a row for each of your configuration values. You'll get better read/write performance, lower memory usage, and be able to use all the relational mechanisms of SQLite.
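A minimal sketch of the row-per-value layout, using Python's built-in sqlite3 module. The table name `app_config`, the helper names, and the sample setting are assumptions made for illustration; they are not taken from the question.

```python
import sqlite3

def open_config(path=":memory:"):
    """Open (or create) a config database with one row per setting."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_config ("
        " name TEXT PRIMARY KEY,"
        " value TEXT NOT NULL)"
    )
    return conn

def set_value(conn, name, value):
    """Write a single setting inside its own transaction."""
    # 'with conn' commits on success and rolls back if an exception is raised
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO app_config (name, value) VALUES (?, ?)",
            (name, value),
        )

def get_value(conn, name, default=None):
    """Read a single setting without touching any other row."""
    row = conn.execute(
        "SELECT value FROM app_config WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else default

conn = open_config()
set_value(conn, "window_width", "1024")
set_value(conn, "window_width", "1280")  # only this one row is rewritten
print(get_value(conn, "window_width"))
```

Each change touches one small row rather than re-serializing the whole XML document, which is the performance point made above.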
Let's say we create an external ADX table with compressed set to true:
with(compressed=true)
If we then export data to this external table (assuming the external table kind is adl, though I don't think this matters), the compression happens in memory in the ADX cluster before the data is exported, so I believe less data gets exported, which saves bandwidth. Is that a correct assumption? That said, if the external table data format is either orc or parquet, this might not matter, as these formats are already compressed considerably.
Yes, exporting compressed data will write less data to the storage account and consume less bandwidth. You can use gzip or snappy compression with parquet format as well.
It seems like I'm hitting a 1GB upper-boundary on my U-SQL input file size. Is there such a limit, and if so, how can this be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor that processes XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe, I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.
This all works well when running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3 GB for a 2.5 GB input XML, so each file should fit comfortably on a single vertex.
This still works great using <1gb input files on the Data Lake.
However, when trying to scale things up at the Data Lake Store, it seems the job got terminated by hitting the 1gb input file size boundary.
I know streaming the outer XML, and then serializing the inner XML fragments is an alternative option, but we don't want to create - and particularly maintain - too much custom code depending on those externally managed schemas.
Therefore, raising the upper-limit would be great.
I see two issues right now. One that we can address, and one for which we have a feature under development for later this year.
By default, U-SQL assumes that you want to scale out processing over your file and will split it into 1GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process files atomically (without splitting them) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{ ...
Now, while a vertex has 3GB of memory, we currently limit the memory size for a UDO like an extractor to 500MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to the UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address that are to either make your data small enough, or, in the case of XML for example, use a streaming parsing strategy that does not allocate too much memory (e.g., use the XML Reader interface).
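To illustrate what a streaming parsing strategy looks like, here is a rough sketch in Python using the standard library's `xml.etree.ElementTree.iterparse`. It is an analogue of the forward-only reader approach mentioned above, not the .NET API itself. The `<order>` element layout and the `iter_records` helper are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag):
    """Yield one dict (child tag -> text) per <tag> element, one at a time."""
    # The "end" event fires once an element and all its children are parsed
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop this element's children and text so they can be freed;
            # the emptied element itself stays attached to the root.
            elem.clear()

# Hypothetical input standing in for a large file on disk
sample = io.BytesIO(
    b"<orders>"
    b"<order><id>1</id><total>9.50</total></order>"
    b"<order><id>2</id><total>3.00</total></order>"
    b"</orders>"
)
rows = list(iter_records(sample, "order"))
print(rows)
```

Only one record's worth of child nodes is held at a time, so memory use is driven by the largest record rather than by the whole document.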
In my Windows 8/RT app I use a SQLite database (sqlite-net), which is stored in Isolated Storage. The database holds a lot of data, including links to files (images, PDFs and others). I get those links from a web server. When I get a link, I want to download the file and store it locally.
My question is: what is the best way to store a large number of files (100+)? One important thing: I need to be able to find the desired file quickly.
I have three ideas:
Create another database only for files (I can't modify the existing one).
Create a folder in IS and store the files there directly.
Create a list of files and store it in IS.
Which would be better/faster? Or does somebody have another good solution?
100 files isn't such a big number as you can easily store up to 100k files (or folders) in a single (NTFS) directory.
If you receive the files from a webserver then the question is whether the source makes sure there are no duplicate filenames. If this can't be assured, I'd recommend having a database table mapping from original filename and metadata to its hash (SHA256 or similar) and store the file with a filename corresponding to its hash.
Then, when using the file, you can pass it to the user under its original filename using the StorageFile API.
Going beyond 100k files, you could create a subfolder structure from the first two letters of the hash.
Either way, storing the file metadata in a database and the files in a directory has been the most useful approach for us in the past.
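The hash-named layout described above could look roughly like the following Python sketch. The `files` table, the `store_file` and `find_file` helpers, and the sample filenames are all hypothetical; a real app would use its own schema and the platform's storage API.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

def store_file(conn, root, original_name, data):
    """Write data under its SHA-256 name and record the original filename."""
    digest = hashlib.sha256(data).hexdigest()
    # Sub-folder named after the first two hex characters of the hash
    target = Path(root) / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is written only once
        target.write_bytes(data)
    with conn:  # commit the metadata row, or roll back on error
        conn.execute(
            "INSERT INTO files (original_name, sha256) VALUES (?, ?)",
            (original_name, digest),
        )
    return target

def find_file(conn, root, original_name):
    """Map an original filename back to the on-disk path, or None."""
    row = conn.execute(
        "SELECT sha256 FROM files WHERE original_name = ?", (original_name,)
    ).fetchone()
    return Path(root) / row[0][:2] / row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (original_name TEXT, sha256 TEXT)")
root = tempfile.mkdtemp()
store_file(conn, root, "logo.png", b"fake image bytes")
store_file(conn, root, "copy-of-logo.png", b"fake image bytes")
print(find_file(conn, root, "logo.png"))
```

Because the on-disk name is derived from the content, two downloads with clashing filenames cannot overwrite each other, and two names with identical content share one file.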
100 files with average size of 1MB is only 100MB.
Most people say that storing binary files in database is wrong and suggest storing files separately and only keep file names in database, but I think it is fine provided you know what you are doing and why.
A big advantage of storing files in the database is that you keep the files together with their properties, logically in one place. Also, you can simply copy one file and this will back up everything.
Database also affords you transaction support. You may have some problems reading and writing BLOBs into database, but it is not very difficult.
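As a rough idea of what reading and writing BLOBs involves, here is a small sketch using Python's sqlite3 module rather than sqlite-net. The `documents` table and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " name TEXT PRIMARY KEY,"
    " mime_type TEXT,"
    " content BLOB)"
)

payload = bytes(range(256))  # stand-in for real file contents

# The file and its properties are written together in one transaction
with conn:
    conn.execute(
        "INSERT INTO documents (name, mime_type, content) VALUES (?, ?, ?)",
        ("sample.bin", "application/octet-stream", payload),
    )

# Reading the BLOB back returns a bytes object
mime_type, content = conn.execute(
    "SELECT mime_type, content FROM documents WHERE name = ?",
    ("sample.bin",),
).fetchone()
print(mime_type, len(content))
```

Passing the bytes as a bound parameter avoids any manual escaping of binary data.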
I am writing an NSDocument based application that has an SQLite database that I wish to tag along in the document bundle. It consists of some number of tables, with the schema for each table being a timestamp and a value. This database will start out small, but could grow to a sizable amount over time. The application is updating the database "behind NSDocument's back."
What I have attempted to do thus far is override the writeToURL:ofType:error: method, which can get passed some convoluted URL like:
file://localhost/private/var/folders/mr/l1z6gdls0fb3t28m3z1bz6lw0000gn/T/TemporaryItems/(A%20Document%20Being%20Saved%20By%MyApp%2031)/Untitled.wsdoc
At this point I am forced, if you will, to use an in-memory database, then suck up the entire contents of that database into an NSFileWrapper. It works, but it doesn't scale well. Doing this each time someone presses Command-S (or worse yet, when I turn on autosaving) could be a very expensive operation if the database is huge, say 200-300MB or more (which is not out of the realm of possibility for this application).
So I'm wondering: is it possible to manage an SQLite file outside of the purview of NSDocument while still having it reside IN the document bundle so that the database can exist within the bundle as it is moved/copied?
We can save an image in two ways:
Upload the image to the server and save the image URL in the database.
Save the image directly in the database.
Which one is better?
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256K in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures. Do not store the employee photo in the employee table; keep it in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(....... define the fields here ......)
ON Data -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
Like many questions, the answer is "it depends." Systems like SharePoint use option 2. Many ticket tracking systems (I know for sure Trac does this) use option 1.
Think also of any (potential) limitations. As your volume increases, are you going to be limited by the size of your database? This has particular relevance to hosted databases and applications where increasing the size of your database is much more expensive than increasing your storage allotment.
Saving the image to the server will work better for a website, given that these are incidental to your website, like per customer branding images - if you're setting up the next Flickr obviously the answer would be different :). You'd want to set up one server to act as a file server, share out the /uploaded_images directory (or whatever you name it), and set up an application variable defining the base url of uploaded images.
Why is it better? Cost. File servers are dirt cheap commodity hardware. You can back up the file contents using dirt cheap commodity (even just consumer grade) backup software. And if your file server croaks and someone loses a day of uploaded images? Who cares. They just upload them again.
Our database server is an enterprise cluster running on SSD SAN. Our backups and tran logs are shipped to remote sites over expensive bandwidth and maintained even on tape for x period. We use it for all the data where we need the ACID (atomicity, consistency, isolation, durability) benefits of a RDBMS. We don't use it for company logos.
Store them in the database unless you have a good reason not to.
Storing them in the filesystem is premature optimization.
With a database you get referential integrity, you can back everything up at once, integrated security, etc.
The book SQL Antipatterns calls storing files in the filesystem an antipattern.