Writing Kusto output to file on local computer - azure-data-explorer

I wrote the following query to export data from Kusto to my local computer. However, the command below is not working. Can someone please recommend how I can do this?
.export to csv ("C:\\Downloads\\file.csv")
<| fullDay13_0to1_30
| project R, A

There is no reason this should work, since you are using the service-side export method.
Export data to storage
Executes a query and writes the first result set to an external
storage, specified by a storage connection string.
Syntax
.export [async] [compressed] to OutputDataFormat ( StorageConnectionString [, ...] ) [with ( PropertyName = PropertyValue [, ...] )] <| Query
Storage connection strings
The following types of external storage are supported:
Azure Blob Storage
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen1
You are looking for client-side export.
Client-side export
In its simplest form, data export can be done on the client side. The client runs a query against the service, reads
back the results, and then writes them. This form of data export
depends on the client tool to do the export, usually to the local
filesystem where the tool runs. Among tools that support this model
are Kusto.Explorer and Web UI.
There are other methods as well, including the Kusto CLI, the SDKs, PowerShell, and the REST API.
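For example, a minimal client-side export sketch using the Python SDK (azure-kusto-data) plus pandas could look like the following; the cluster URI and database name are placeholders, and Azure CLI authentication is assumed:
# Minimal client-side export sketch (assumes the azure-kusto-data and pandas packages,
# and that you are signed in with the Azure CLI; cluster and database names are placeholders).
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

response = client.execute("<database>", "fullDay13_0to1_30 | project R, A")
df = dataframe_from_result_table(response.primary_results[0])
df.to_csv(r"C:\Downloads\file.csv", index=False)  # written on the client machine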

Related

Pulling data from serverless SQL external table to spark for sentiment analysis

I'm not experienced at all with Azure, yet I've been tasked with setting up the above. The serverless SQL external tables were set up by a contracted company and use SynapseDeltaFormat as the format, if that matters. One of the tables created in this manner has a column where we want the records that are at least 25 characters long to be processed for sentiment analysis. All of the examples I've been able to find online have you use some other external file as the source, and in this case that isn't what we want to do; the data has already been pulled into the serverless SQL environment. I assume that a pipeline and one or more notebooks would be necessary to move data around, but I can't for the life of me find a resource that explains how such an engine would be set up. In fact, I can't even find a reference for adding the sentiment analysis results to the external data tables, as it appears one can't be created without an already-existing data source. Does anyone have sources or information available to assist with what I'm trying to do?
Before starting with Synapse, you must first understand the requirement and architecture.
An external table points to data located in Hadoop, Azure Storage
blob, or Azure Data Lake Storage. External tables are used to read
data from files or write data to files in Azure Storage. With Synapse
SQL, you can use external tables to read external data using dedicated
SQL pool or serverless SQL pool.
You mentioned -
The data has already been pulled into the serverless SQL environment.
Where exactly is it located? Is it in Azure Data Lake, Cosmos DB, Dataverse, or a dedicated SQL pool?
Based on the source, you need to design the architecture.
If the data is in a dedicated SQL pool, I recommend you use dedicated SQL pool analytics instead of the serverless pool. To know whether the data is in a dedicated pool, you need to check how the data was created. For example, when used in conjunction with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into a table within the dedicated SQL pool.
Since you mentioned Serverless SQL Pool, I'm assuming that you have created a Data Source in Azure Storage account.
Any external tables in Synapse SQL pools follow the below hierarchy:
CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
You can CREATE EXTERNAL DATA SOURCE using the below syntax:
CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
( LOCATION = '<prefix>://<path>'
[, CREDENTIAL = <database scoped credential> ]
, TYPE = HADOOP
)
[;]
Secondly, to CREATE EXTERNAL FILE FORMAT, you can refer to the below syntax:
-- Create an external file format for PARQUET files.
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
FORMAT_TYPE = PARQUET
[ , DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.SnappyCodec'
| 'org.apache.hadoop.io.compress.GzipCodec' }
]);
--Create an external file format for DELIMITED TEXT files
CREATE EXTERNAL FILE FORMAT file_format_name
WITH (
FORMAT_TYPE = DELIMITEDTEXT
[ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec' ]
[ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
);
<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| FIRST_ROW = integer
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| ENCODING = {'UTF8' | 'UTF16'}
| PARSER_VERSION = {'parser_version'}
}
Last, you can CREATE EXTERNAL TABLE using the below syntax:
CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name
[, TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}' ]
[, <reject_options> [ ,...n ] ]
)
[;]
<column_definition> ::=
column_name <data_type>
[ COLLATE collation_name ]
<reject_options> ::=
{
| REJECT_TYPE = value,
| REJECT_VALUE = reject_value,
| REJECT_SAMPLE_VALUE = reject_sample_value,
| REJECTED_ROW_LOCATION = '/REJECT_Directory'
}
Refer to the official documents Analyze data with a serverless SQL pool and Use external tables with Synapse SQL.
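Once the external table exists, it can be queried from the serverless endpoint like a regular table, for example from a notebook or script. A hedged sketch of doing that from Python over ODBC; the pyodbc package, the "ODBC Driver 17 for SQL Server" driver, and all object names are assumptions:
# Hedged sketch: querying an external table through the serverless SQL endpoint.
# Assumes the pyodbc package and ODBC Driver 17 for SQL Server; every name below is a placeholder.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=<database>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
# Pull only rows whose text column is at least 25 characters long, as in the question.
cursor.execute("SELECT <id_column>, <text_column> FROM <schema>.<external_table> WHERE LEN(<text_column>) >= 25")
rows = cursor.fetchall()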
I'm a bit torn between deleting this and posting what I did to complete this particular task. Ultimately, I created a Synapse notebook (C#, to be specific about the language) that would pull the required data from the serverless SQL instance we are using in Synapse. It would then pull already-processed records from the destination and remove any repeats from the new data that still needed processing. It would then call the Language Analysis endpoint for sentiment analysis and write the results back to the destination. The notebook was then added to a Synapse pipeline that automated gathering the required values from a key vault and a database and passing them into the notebook so it could do its thing.
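The notebook itself was in C#, but for illustration the sentiment-analysis step looks roughly like this Python sketch; it is not the author's code, and the endpoint, key, and input rows are placeholders, assuming the azure-ai-textanalytics package:
# Rough sketch of the sentiment-analysis call (not the author's C# notebook).
# Assumes the azure-ai-textanalytics package; endpoint, key, and `rows` are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

rows = ["text pulled from the serverless SQL table ...", "another record ..."]
# Only process records that are at least 25 characters long, as in the question.
candidates = [r for r in rows if len(r) >= 25]

for start in range(0, len(candidates), 10):  # the service accepts small batches of documents
    batch = candidates[start:start + 10]
    for doc in client.analyze_sentiment(documents=batch):
        if not doc.is_error:
            print(doc.sentiment, doc.confidence_scores.positive)  # write back to the destination here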

How to connect to parquet files in Azure Blob Storage with arrow::open_dataset?

I am open to other ways of doing this. Here are my constraints:
I have parquet files in a container in Azure Blob Storage
These parquet files will be partitioned by a product id, as well as the date (year/month/day)
I am doing this in R, and want to be able to connect interactively (not just set up a notebook in databricks, though that is something I will probably want to figure out later)
Here's what I am able to do:
I understand how to use arrow::open_dataset() to connect to a local parquet directory: ds <- arrow::open_dataset(filepath, partitioning = "product")
I can connect to, view, and download from my blob container with the AzureStor package. I can download a single parquet file this way and turn it into a data frame:
blob <- AzureStor::storage_endpoint("{URL}", key="{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")
parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
df <- arrow::read_parquet(parq)
What I haven't been able to figure out is how to use arrow::open_dataset() to reference the parent directory of {FILE-PATH}, where I have all the parquet files, using the connection to the container that I'm creating with AzureStor. arrow::open_dataset() only accepts a character vector as the "sources" parameter. If I just give it the URL with the path, I'm not passing any kind of credential to access the container.
Unfortunately, you probably are not going to be able to do this today purely from R.
Arrow-R is based on Arrow-C++, and Arrow-C++ does not yet have a filesystem implementation for Azure. There are JIRA tickets (ARROW-9611, ARROW-2034) for creating one, but these tickets are not in progress at the moment.
In Python it is possible to create a filesystem purely in Python using the fsspec adapter. Since there is a Python SDK for Azure Blob Storage, it should be possible to do what you want today in Python.
Presumably something similar could be created for R, but you would still need to create the R equivalent of the fsspec adapter, and that would involve some C++ code.
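For reference, a minimal sketch of that Python route, assuming the adlfs and pyarrow packages; the account, container, and path names are placeholders:
# Minimal sketch of the Python route (assumes the adlfs and pyarrow packages;
# account, container, and path names below are placeholders).
import adlfs
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

fs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")
pa_fs = PyFileSystem(FSSpecHandler(fs))  # wrap the fsspec filesystem for Arrow

dataset = ds.dataset(
    "<container>/<path-to-parquet-root>",
    filesystem=pa_fs,
    format="parquet",
    partitioning="hive",  # if directories look like product=.../year=.../month=.../day=...
)
table = dataset.to_table()  # or pass a filter expression to push down predicates
df = table.to_pandas()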
If you use Azure Synapse, you can connect to your data with ODBC as if it were a SQL Server database, and it has support for partitioning and other file types as well. The pricing, from what I recall, is like $5/month fixed plus $5/TB queried.
Querying data would look something like this...
library(odbc)
syncon <- dbConnect(odbc(),
Driver = "SQL Server Native Client 11.0",
Server = "yourname-ondemand.sql.azuresynapse.net",
Database = "dbname",
UID = "sqladminuser",
PWD = rstudioapi::askForPassword("Database password"),
Port = 1433)
somedata <- dbGetQuery(syncon, r"---{SELECT top 100
result.filepath(1) as year,
result.filepath(2) as month,
*
FROM
OPENROWSET(
BULK 'blobcontainer/directory/*/*/*.parquet',
DATA_SOURCE='blobname',
FORMAT = 'parquet'
) as [result]
order by node, pricedate, hour}---")
The filepath keyword refers to the name of the directory in the BULK path.
Here's the MS website https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files
You can also make views so that people who like SQL but not parquet files can query the views without having to know anything about the underlying data structure, it'll just look like a SQL Server database to them.

Dump to Storage Container failed ingestion in ADX

I'm having problems when I ingest data from an IoT Hub into my Azure Data Explorer cluster.
When I execute the command on ADX to see the errors:
.show ingestion failures | where FailedOn >= make_datetime('2021-09-09 12:00:00')
I get the error message:
BadRequest_NoRecordsOrWrongFormat: The input stream produced 0 bytes. This usually means that the input JSON stream was ill formed.
There is a column with IngestionSourcePath, but it seems to be an internal URI of the product itself.
I read in another Stack Overflow question that there is a command in ADX to dump the failed ingestion blob into a container with this syntax:
.dup-next-failed-ingest into TableName to h#'Path to Azure blob container'
The problem is that this command is not documented through Microsoft documentation.
The questions are:
What is the full syntax of this command?
Can you show me some examples?
Which permissions are needed to run this command over ADX and also over the blob container?
Is there another command to remove this dump after fixing the ingestion errors?
The full syntax of the command is:
.dup-next-failed-ingest into TableName to h#'[Path to Azure blob container];account-key'
or
.dup-next-failed-ingest into TableName to h#'[Path to Azure blob container]?SAS-key'
We will add the documentation for this command.
The error you encountered most likely indicates that the JSON flavor you are using does not match the flavor you specified for the data connection, or that the JSON objects are not syntactically valid. My recommendation would be to make sure you use "MultiJSON" as the data connection format for any JSON payloads.
When looking at the interim blobs created using this command, please keep in mind that you will not be looking at the original events sent into the IoT Hub, but batches of these events, created by ADX internal batching mechanism.

cosmosdb - archive data older than n years into cold storage

I researched several places and could not find any direction on what options there are to archive old data from Cosmos DB into cold storage. I see that for DynamoDB in AWS it is mentioned that you can move DynamoDB data into S3, but I'm not sure what the options are for Cosmos DB. I understand there is a time-to-live (TTL) option where the data will be deleted after a certain date, but I am interested in archiving rather than deleting. Any direction would be greatly appreciated. Thanks
I don't think there is a single-click built-in feature in CosmosDB to achieve that.
Still, as you mentioned appreciating any direction, I suggest you consider the DocumentDB Data Migration Tool.
Notes about the Data Migration Tool:
You can specify a query to extract only the cold data (for example, by a creation date stored within the documents).
It supports exporting to various targets (JSON file, blob storage, a database, another Cosmos DB collection, etc.).
It compacts the data in the process, and can merge documents into a single array document and zip it.
Once you have the configuration set up, you can script this to be triggered automatically using your favorite scheduling tool.
You can easily reverse the source and target to restore the cold data to the active store (or to dev, test, backup, etc.).
To remove exported data you could use the mentioned TTL feature, but that could cause data loss should your export step fail. I would suggest writing and executing a stored procedure to query and delete all exported documents with a single call. That SP would not execute automatically, but could be included in the automation script and executed only if the data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
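If you prefer to script the whole thing yourself rather than use the Data Migration Tool, the shape of it looks roughly like the sketch below, assuming the azure-cosmos and azure-storage-blob Python SDKs; all names, the partition key field, and the _ts-based cutoff are placeholders:
# Rough sketch of a scripted archive step (an alternative to the Data Migration Tool).
# Assumes the azure-cosmos and azure-storage-blob packages; every name below is a placeholder.
import json, time
from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

cosmos = CosmosClient("https://<account>.documents.azure.com:443/", credential="<cosmos-key>")
container = cosmos.get_database_client("<db>").get_container_client("<collection>")

cutoff = int(time.time()) - 2 * 365 * 24 * 3600  # documents older than ~2 years, by the _ts system property
cold_docs = list(container.query_items(
    query="SELECT * FROM c WHERE c._ts < @cutoff",
    parameters=[{"name": "@cutoff", "value": cutoff}],
    enable_cross_partition_query=True,
))

# Write the batch to cold (blob) storage first, then delete only if the upload succeeded.
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = blob_service.get_blob_client(container="archive", blob=f"cold-{cutoff}.json")
blob.upload_blob(json.dumps(cold_docs), overwrite=True)

for doc in cold_docs:
    container.delete_item(item=doc["id"], partition_key=doc["<partition-key-field>"])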
UPDATE:
These days Cosmos DB has added the change feed, which really simplifies writing a carbon copy somewhere else.

Import large data (json) into Firebase periodically

We are in the situation that we will have to update large amounts of data (ca. 5 million records) in Firebase periodically. At the moment we have a few JSON files that are around ~1 GB in size.
Existing third-party solutions (here and here) have some reliability issues (importing object by object, or needing an open connection) and are quite disconnected from the Google Cloud Platform ecosystem, so I wonder if there is now an "official" way using, e.g., the new Google Cloud Functions? Or a combination with App Engine / Google Cloud Storage / Google Cloud Datastore.
I would really like not to have to deal with authentication (something that Cloud Functions seems to handle well), but I assume the function would time out (?).
With the new Firebase tooling available, how do I:
Have long-running Cloud Functions do the data fetching / inserts? (Does that make sense?)
Get the JSON files into and out of somewhere inside the Google Cloud Platform?
And does it make sense to first throw the large data into Google Cloud Datastore (i.e. too $$$ expensive to store in Firebase), or can the Firebase Realtime Database be reliably treated as large data storage?
I'm finally posting the answer, as it aligns with the new Google Cloud Platform tooling of 2017.
The newly introduced Google Cloud Functions have a limited run time of approximately 9 minutes (540 seconds). However, Cloud Functions are able to create a Node.js read stream from Cloud Storage like so (@google-cloud/storage on npm):
var gcs = require('@google-cloud/storage')({
  // You don't need extra authentication when running the function
  // online in the same project
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});

// Reference an existing bucket.
var bucket = gcs.bucket('json-upload-bucket');
var remoteReadStream = bucket.file('superlarge.json').createReadStream();
Even though it is a remote stream, it is highly efficient. In tests I was able to parse JSON files larger than 3 GB in under 4 minutes, doing simple JSON transformations.
As we are working with Node.js streams now, any JSONStream library can efficiently transform the data on the fly (JSONStream on npm), dealing with the data asynchronously just like a large array with event streams (event-stream on npm).
var JSONStream = require('JSONStream')
var es = require('event-stream')

remoteReadStream.pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    console.error(data)
    // Insert data into Firebase here.
    callback(null, data) // ! Return data if you want to make further transformations.
  }))
Return only null in the callback at the end of the pipe to prevent a memory leak blocking the whole function.
If you do heavier transformations that require a longer run time, either use a "job db" in Firebase to track where you are at and only do e.g. 100,000 transformations per invocation before calling the function again, or set up an additional function which listens on inserts into a "forimport db" and finally transforms the raw JSON object records into your target format and production system asynchronously. Split import and computation.
Additionally, you can run Cloud Functions code in a Node.js App Engine app, but not necessarily the other way around.
