compressed property in ADX external table creation - azure-data-explorer

Let's say we create an external ADX table with compressed set to true:
with(compressed=true)
After this, if we export data to this external table (assuming the external table kind is adl, though I don't think that matters), the compression happens in memory on the ADX cluster before the data is exported, so a smaller amount of data should be written out, saving bandwidth. Is that a correct assumption? Though I suspect that if the external table dataformat is either orc or parquet, this might not matter much, as those formats are already considerably compressed.

Yes, exporting compressed data will write less data to the storage account and consume less bandwidth. You can use gzip or snappy compression with the parquet format as well.
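For illustration, a minimal sketch of the setup being discussed - the table, column, and storage names are hypothetical, and the exact kind/connection-string details depend on your environment:

.create external table LogsExternal (Timestamp:datetime, Message:string)
kind=adl
dataformat=csv
(
    h@'abfss://filesystem@account.dfs.core.windows.net/logs;impersonate'
)
with (compressed=true)

// Rows are compressed on the cluster before upload, so less data crosses the wire.
.export to table LogsExternal <| SourceTable | where Timestamp > ago(1d)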

Related

Column pruning on parquet files defined as an external table

Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric - it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
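In other words, a query along these lines (hypothetical external table and column names) should only fetch the projected columns from the parquet files, even though the metric reported the full folder size:

external_table('MyParquetLogs')
| where Timestamp > ago(1d)
| project Timestamp, Level, Message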

How to export a large table from Teradata

What would be the best way to export a large table (e.g. over 10 billion rows) from Teradata, in terms of speed and resource consumption?
I know the FastExport tool, which uses fastexport mode, but it still requires the results to be put in spool before they are sent to the client. Previously it was possible to avoid spool by forcing nospoolonly mode, but this seems to be broken in recent releases. Therefore, if I use it with a select * from table query, the whole table will be copied into spool, which would require a massive spool allowance.
I also came across the JDBC driver and the PT API, but they seem to use the same underlying mechanisms.
Is there a better way?

Increase U-SQL 1 GB limit on input file?

It seems like I'm hitting a 1 GB upper boundary on my U-SQL input file size. Is there such a limit, and if so, how can it be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor where I'm processing XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.
This all works pretty neatly running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3 GB for a 2.5 GB input XML, so this should fit perfectly on a single vertex per file.
This still works great using <1 GB input files on the Data Lake.
However, when trying to scale things up at the Data Lake Store, it seems the job gets terminated by hitting the 1 GB input file size boundary.
I know streaming the outer XML and then deserializing the inner XML fragments is an alternative option, but we don't want to create - and particularly maintain - too much custom code depending on those externally managed schemas.
Therefore, raising the upper-limit would be great.
I see two issues right now. One that we can address, and one for which we have a feature under development for later this year.
U-SQL by default assumes that you want to scale out processing over your file and will split it into 1 GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process the files atomically (not splitting them) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{ ...
Now while a vertex has 3 GB of data, we currently limit the memory size for a UDO like an extractor to 500 MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to the UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address that are to either make your data small enough or - in the case of XML, for example - use a streaming parsing strategy that does not allocate too much memory (e.g., use the XmlReader interface).
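As a rough sketch of that streaming approach - the element name, output column, and class name below are assumptions for illustration, not part of the original question:

using System.Collections.Generic;
using System.Xml;
using Microsoft.Analytics.Interfaces;

// Sketch only: process the whole file on one vertex, but stream the inner elements
// with XmlReader instead of deserializing the entire document into memory.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class StreamingXmlExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = XmlReader.Create(input.BaseStream))
        {
            while (reader.ReadToFollowing("Record"))           // "Record" is a hypothetical element name
            {
                output.Set("fragment", reader.ReadOuterXml()); // "fragment" is a hypothetical output column
                yield return output.AsReadOnly();
            }
        }
    }
}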

Save image url or save image file in sql database?

We can save an image in two ways:
upload the image to the server and save the image URL in the database
save the image directly into the database
Which one is better?
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
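Adding the filegroup later might look roughly like this (database, logical, and physical file names are placeholders):

-- add the filegroup, then add at least one data file to it
ALTER DATABASE YourDatabase ADD FILEGROUP LARGE_DATA;

ALTER DATABASE YourDatabase
ADD FILE (NAME = N'LargeData1', FILENAME = N'D:\SQLData\YourDatabase_LargeData1.ndf')
TO FILEGROUP LARGE_DATA;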
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(....... define the fields here ......)
ON Data -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
Like many questions, the answer is "it depends." Systems like SharePoint use option 2. Many ticket tracking systems (I know for sure Trac does this) use option 1.
Think also of any (potential) limitations. As your volume increases, are you going to be limited by the size of your database? This has particular relevance to hosted databases and applications where increasing the size of your database is much more expensive than increasing your storage allotment.
Saving the image to the server will work better for a website, given that these are incidental to your website, like per customer branding images - if you're setting up the next Flickr obviously the answer would be different :). You'd want to set up one server to act as a file server, share out the /uploaded_images directory (or whatever you name it), and set up an application variable defining the base url of uploaded images.
Why is it better? Cost. File servers are dirt cheap commodity hardware. You can back up the file contents using dirt cheap commodity (even just consumer grade) backup software. And if your file server croaks and someone loses a day of uploaded images? Who cares. They just upload them again.
Our database server is an enterprise cluster running on SSD SAN. Our backups and tran logs are shipped to remote sites over expensive bandwidth and maintained even on tape for x period. We use it for all the data where we need the ACID (atomicity, consistency, isolation, durability) benefits of a RDBMS. We don't use it for company logos.
Store them in the database unless you have a good reason not to.
Storing them in the filesystem is premature optimization.
With a database you get referential integrity, you can back everything up at once, integrated security, etc.
The book SQL Anti-Patterns calls storing files in the filesystem an anti-pattern.

Using SQLite to replace filesystem

I have an application that stores configuration files as XML on disk. I'd like to reduce the risk of data file corruption in the case of crashing etc. It seems like the common recommendation is to use SQLite.
What is your opinion on just using BLOBs to store the current XML format? The table would look like:
CREATE TABLE t ( filename TEXT, filedata BLOB )
On the one hand, this seems inelegant, but on the other it would avoid all the work (and corresponding bugs) of converting the configuration to an appropriate format.
Sounds inefficient. You'll need to load and parse the BLOB to get your configuration values as well as save the entire configuration file for every change.
I'm assuming the reason you're switching to a SQLite database is that the transaction mechanism will give you some amount of fault tolerance against crashes. If you store each of your configuration files as one BLOB, then you will need to write the entire file before the transaction completes, as opposed to just saving the updated values, which would be quicker.
In addition if you're using a DOM based XML parser you'll end up loading both the BLOB and the parsed DOM tree into memory at the same time. Depending on the size and number of your configuration files that could be resource intensive.
IMHO you're better off creating a table for each configuration file, with a row for each of your configuration values. You'll get better read/write performance, lower memory usage, and be able to use all the relational mechanisms of SQLite.
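A rough sketch of that layout - the table and column names are just placeholders:

-- one table per configuration file, one row per configuration value
CREATE TABLE display_config (
    name  TEXT PRIMARY KEY,
    value TEXT
);

-- changing one setting rewrites a single row inside a transaction,
-- instead of rewriting a whole serialized XML blob
UPDATE display_config SET value = '1920' WHERE name = 'width';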

Resources