Column pruning on parquet files defined as an external table - azure-data-explorer

Context: We store historical data in Azure Data Lake as versioned parquet files written by our existing Databricks pipeline, which writes to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer against the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that the entire folder is downloaded when I query it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use the SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.

As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric: it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
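For reference, a query that benefits from this column pruning might look like the following (the external table name and column names here are illustrative, not from the original question):

```kusto
// Only the projected columns are read from the parquet files, even though
// the "Downloaded Size" metric currently reports the full folder size.
external_table('HistoricalLogs')
| where Timestamp between (datetime(2021-01-01) .. datetime(2021-01-02))
| project Timestamp, Level, Message
| take 100
```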

Related

Bigquery and R: Cost and where is the data stored?

I'm using RStudio to run analysis on large datasets stored in BigQuery. The dataset is private and from a large retailer that shared the dataset with me via BigQuery to run the required analyses.
I used the bigrquery library to connect R to BigQuery, but couldn't find answers to the following two questions:
1) When I use R to run the analyses (e.g. first using SELECT to get the data and storing it in a data frame in R), is the data then stored locally on my laptop? The company is concerned about confidentiality and probably doesn't want me to store the data locally but leave it in the cloud. But is it even possible to use R then?
2) My BigQuery free version has 1 TB/month for analyses. If I use SELECT in R to get the data, it for instance tells me "18.1 gigabytes processed", but do I also use up my 1 TB if I run analyses in R instead of running queries on BigQuery? If it doesn't incur cost, then I'm wondering what the advantage is of running queries on BigQuery instead of in R, if the former might cost me money in the end?
Best
Jennifer
As far as I know, Google's BigQuery is an entirely cloud-based database. This means that when you run a query or report on BigQuery, it happens in the cloud, not locally (i.e. not in R). This is not to say that your source data can't be local; in fact, as you have seen, you may upload a local data set from R. But the query executes in the cloud, and then returns the result set to R.
With regard to your other question, the source data in the BigQuery tables would remain in the cloud, and the only exposure to the data you would have locally would be the results of any query you might execute from R. Obviously, if you ran SELECT * on every table, you could see all the data in a particular database. So I'm not sure how much of a separation of concerns there would really be in your setup.
As for pricing, from the BigQuery documentation on pricing:
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed. You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
So you get 1TB of free processing per month of data, after which you would start getting billed.
Unless you explicitly save to a file, R stores the data in memory. Because of the way sessions work, however, RStudio will keep a copy of the session unless you tell it not to, which is why it asks you whether you want to save your session when you exit or switch projects. To be sure you are not storing anything, when you are done for the day (or whatever) use the broom icon in the Environment tab to delete everything in the environment. You can also delete a data frame or other object individually with rm(obj), or go to the Environment window, change "List" to "Grid", and select individual objects to remove. See How do I clear only a few specific objects from the workspace?, which addresses this part of my answer (but is not a duplicate question).

Incremental data load/processing

We have a lot of applications that are tracked by Azure Application Insights. We have configured continuous export to Azure Blob Storage. The data is partitioned by export date. The folder structure has the format
{yyyy-MM-dd}/{guid}.blob
There is a problem: the partition key relies on the export date, not on the event dates within the *.blob files.
So a file placed in the folder
/2016-11-05/
can contain events for dates
2016-11-09, 2016-11-11, 2016-11-12
We would like to have this data in an Azure Data Lake database to analyze it with Azure Data Lake Analytics. We would also like to have an event table partitioned by event generation time.
To orchestrate the entire ETL process we have chosen ADF. We like its incremental model, scheduling, and concurrency model. We have chosen a data slice of ~1 day. One of ADF's requirements is that every activity in a pipeline should be repeatable (if I schedule a rerun for a random data slice, it should be reprocessed consistently, e.g. clean up the old data and load the new data).
So we have the pipeline with these activities:
1) Data Movement: JSON blobs → CSV files on the Data Lake Store file system; the resulting CSV files have the same partitioning key as the source data, aligned to the export date.
2) U-SQL Activity: we planned to invoke a U-SQL job and pass it the parameters SliceStart and SliceEnd, pointing to the current data slice. When the source's and target's partitioning keys are aligned, we can simply truncate and reload the partition. But when these partitions are misaligned, it is not so simple.
It looks like it is hard to implement a repeatable U-SQL step in this case, due to:
the misaligned partitioning keys (export date vs. event date)
the lack of dynamic U-SQL
But the root cause here is the partitions' misalignment.
The only ideas I have are:
an additional task that repartitions the source Application Insights data
workarounds with dynamic U-SQL emulation: https://social.msdn.microsoft.com/Forums/en-US/aa475035-2d57-49b8-bdff-9cccc9c8b48f/usql-loading-a-dynamic-set-of-files?forum=AzureDataLake
I would be happy if someone gave me an idea of how to solve the problem. Ideally we should avoid massive reloads/rescans of data.
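The first idea above (a pre-processing task that repartitions the exported data by event date) can be sketched as follows. This is a minimal illustration in Python; the record layout, the `timestamp` field name, and the in-memory bucketing are assumptions, and a real task would read from and write to the blob/Data Lake stores instead:

```python
from collections import defaultdict

def repartition_by_event_date(records):
    """Bucket records (all exported under one export date) by their event date.

    Each record is assumed to carry an ISO-8601 timestamp in a 'timestamp'
    field; the bucket key is the date part ('YYYY-MM-DD') of that timestamp.
    """
    buckets = defaultdict(list)
    for rec in records:
        event_date = rec["timestamp"][:10]  # take the 'YYYY-MM-DD' prefix
        buckets[event_date].append(rec)
    return dict(buckets)

# Records landing in the /2016-11-05/ export folder may carry later event dates.
exported = [
    {"timestamp": "2016-11-09T12:00:00Z", "msg": "a"},
    {"timestamp": "2016-11-11T08:30:00Z", "msg": "b"},
    {"timestamp": "2016-11-09T23:59:00Z", "msg": "c"},
]
partitions = repartition_by_event_date(exported)
print(sorted(partitions))  # ['2016-11-09', '2016-11-11']
```

Once the data is rewritten into event-date folders, the source and target partitioning keys are aligned, and the truncate-and-reload pattern in the U-SQL step becomes repeatable.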

Exporting all Marketo Leads in a CSV?

I am trying to export all of my leads from Marketo (we have over 20M+) into a CSV file, but there is a 10k row limit per CSV export.
Is there any other way that I can export a CSV file with more than 10k rows? I tried searching for various data loader tools on Marketo LaunchPoint but couldn't find one that would work.
Have you considered using the API? It may not be practical unless you have a developer on your team (I'm a programmer).
marketo lead api
If your leads are in Salesforce and Marketo/Salesforce are in parity, then instead of exporting all your leads, do a sync from Salesforce to the new MA tool (if you are switching). It's a cleaner, easier sync.
For important campaigns etc, you can create smart lists and export those.
There is no 10k row limit for exporting Leads from a list. However, there is a practical limit, especially if you choose to export all columns (instead of only the visible columns). I would generally advise on exporting a maximum of 200,000-300,000 leads per list, so you'd need to create multiple Lists.
As Michael mentioned, the API is also a good option. I would still advise to create multiple Lists, so you can run multiple processes in parallel, which will speed things up. You will need to look at your daily API quota: the default is either 10,000 or 50,000. 10,000 API calls allow you to download 3 million Leads (batch size 300).
I am trying out Data Loader for Marketo on Marketo LaunchPoint to export my lead and activity data to my local database. Although it cannot export Marketo data to a CSV file directly, you can download leads into your local database and then export them to get a CSV file. For reference, we have 100K leads and 1 billion activity records.
You might have to run multiple times for 20M leads, but the tool is quite easy and convenient to use so maybe it’s worth a try.
There are 4 steps to bulk-extract leads from Marketo:
1. Create an Export Lead Job
2. Enqueue the Job
3. Poll the Job Status
4. Retrieve Your Data
http://developers.marketo.com/rest-api/bulk-extract/bulk-lead-extract/
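The four steps above map to the endpoints described on that page. A minimal sketch of the flow in Python (the Munchkin-ID base URL and the export id are placeholders; real calls also need an OAuth access token and an HTTP client such as requests):

```python
# Endpoint URLs for the Marketo Bulk Lead Extract flow.
# Replace <munchkin-id> with your instance's REST endpoint host.
BASE = "https://<munchkin-id>.mktorest.com/bulk/v1/leads/export"

def create_job_url():
    # Step 1: POST a job definition (fields, filter) here.
    return f"{BASE}/create.json"

def enqueue_url(export_id):
    # Step 2: POST here to queue the job for processing.
    return f"{BASE}/{export_id}/enqueue.json"

def status_url(export_id):
    # Step 3: GET here repeatedly until the status is 'Completed'.
    return f"{BASE}/{export_id}/status.json"

def file_url(export_id):
    # Step 4: GET here to stream the finished CSV file.
    return f"{BASE}/{export_id}/file.json"

print(enqueue_url("1234-abcd"))
```

Since each export job has a size limit, splitting 20M leads across multiple jobs (e.g. by smart list, as suggested above) and polling them in parallel is the practical approach.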

Best format to store incremental data in regularly using R

I have a database that is used to store transactional records; these records are created, and another process picks them up and then removes them. Occasionally this process breaks down and the number of records builds up. I want to set up a (semi-)automated way to monitor things, and as my tool set is limited and I have an R-shaped hammer, this looks like an R-shaped nail problem.
My plan is to write a short R script that will query the database via ODBC, and then write a single record with the datetime, the number of records in the query, and the datetime of the oldest record. I'll then have a separate script that will process the data file and produce some reports.
What's the best way to create my data file? At the moment my options are:
Load a dataframe, add the record and then resave it
Append a row to a text file (i.e. a csv file)
Any alternatives, or a recommendation?
I would be tempted by the second option because, from a semantic point of view, you don't need the old entries to write the new ones, so there is no reason to reload all the data each time. Doing so would consume more time and resources.
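The append-only approach can be sketched as below (Python here for illustration; in R, `write.table(df, file, append = TRUE, col.names = FALSE, sep = ",")` achieves the same one-row append). The file name and column layout are assumptions matching the monitoring record described in the question:

```python
import csv
from datetime import datetime, timezone

def append_snapshot(path, record_count, oldest_record_ts):
    """Append one monitoring row: run time, queue depth, age of oldest record."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),  # when this snapshot ran
            record_count,                            # rows waiting in the queue
            oldest_record_ts,                        # datetime of oldest record
        ])

# Each scheduled run appends exactly one row; nothing is reloaded or rewritten.
append_snapshot("queue_monitor.csv", 42, "2016-11-01T09:00:00Z")
```

The reporting script can then read the whole file at its leisure, since it is the only consumer that needs the full history.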

How can I improve the performance of the SQLite database?

Background: I am using an SQLite database in my Flex application. The database is 4 MB in size and has 5 tables:
table 1 has 2500 records
table 2 has 8700 records
table 3 has 3000 records
table 4 has 5000 records
table 5 has 2000 records.
Problem: Whenever I run a select query on any table, it takes approximately 50 seconds to fetch data from the database tables. This makes the application quite slow and unresponsive while it fetches the data.
How can I improve the performance of the SQLite database so that the time taken to fetch data from the tables is reduced?
Thanks
As I said in a comment, without knowing what structures your database consists of and what queries you run against the data, there is nothing we can infer about why your queries take so much time.
However, here is an interesting read about indexes: Use the Index, Luke!. It explains what an index is, how you should design your indexes, and what benefits you can reap.
Also, if you can post the queries and the table schemas and cardinalities (not the contents) maybe it could help.
Are you using asynchronous or synchronous execution modes? The difference between them is that asynchronous execution runs in the background while your application continues to run. Your application will then have to listen for a dispatched event and then carry out any subsequent operations. In synchronous mode, however, the user will not be able to interact with the application until the database operation is complete since those operations run in the same execution sequence as the application. Synchronous mode is conceptually simpler to implement, but asynchronous mode will yield better usability.
The first time SQLStatement.execute() is called on a SQLStatement instance, the statement is automatically prepared before executing. Subsequent calls will execute faster as long as the SQLStatement.text property has not changed. Reusing the same SQLStatement instance is better than creating new instances again and again. If you need to change your queries, consider using parameterized statements.
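The same prepare-once, bind-many principle can be illustrated with Python's sqlite3 module (the table and data are made up; in AIR the analogue is keeping one SQLStatement with fixed text and setting its parameters property between executions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",)])

# One parameterized statement, reused with different bindings: the statement
# text never changes, so the prepared form is cached and reused.
query = "SELECT email FROM users WHERE id = ?"
for user_id in (1, 2):
    print(conn.execute(query, (user_id,)).fetchone()[0])
```

Parameterized statements also protect against SQL injection, which string-concatenated query text does not.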
You can also use techniques such as deferring what data you need at runtime. If you only need a subset of data, pull that back first and then retrieve other data as necessary. This may depend on your application scope and what needs you have to fulfill though.
Specifying the database with the table names will prevent the runtime from checking each database to find a matching table if you have multiple databases. It also helps prevent the runtime will choose the wrong database if this isn't specified. Do SELECT email FROM main.users; instead of SELECT email FROM users; even if you only have one single database. (main is automatically assigned as the database name when you call SQLConnection.open.)
If you happen to be writing lots of changes to the database (multiple INSERT or UPDATE statements), consider wrapping them in a transaction. The changes will be made in memory by the runtime and then written to disk. If you don't use a transaction, each statement will result in multiple disk writes to the database file, which can be slow and consume a lot of time.
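A sketch of the transaction advice, again using Python's sqlite3 for illustration (in AIR you would use SQLConnection.begin() and SQLConnection.commit() around the batch; the table and row counts here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg TEXT)")

# Wrap many INSERTs in one explicit transaction: the changes accumulate
# and hit the disk at COMMIT instead of once per statement.
with conn:  # opens a transaction; commits on success, rolls back on error
    conn.executemany("INSERT INTO log (msg) VALUES (?)",
                     [(f"event {i}",) for i in range(1000)])

print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 1000
```

For bulk loads the difference is often an order of magnitude or more, because without a transaction every statement pays the cost of its own durable write.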
Try to avoid any schema changes. The table definition data is kept at the start of the database file, and the runtime loads these definitions when the database connection is opened. Data added to tables is kept after the table definition data in the database file. If changes are made, such as adding columns or tables, the new table definitions will be mixed in with the table data in the database file. The effect is that the runtime has to read the table definition data from different parts of the file rather than from the beginning. The SQLConnection.compact() method restructures the table definition data so it is at the beginning of the file, but the downside is that this method can consume a lot of time, more so if the database file is large.
Lastly, as Benoit pointed out in his comment, consider improving the SQL queries and table structure you're using. It would be helpful to know whether your database structure and queries are the actual cause of the slow performance. My guess is that you're using synchronous execution; if you switch to asynchronous mode, you'll see better responsiveness, but that doesn't mean it has to stop there.
The Adobe Flex documentation online has more information on improving database performance and best practices working with local SQL databases.
You could try indexing some of the columns used in the WHERE clause of your SELECT statements. You might also try minimizing usage of the LIKE keyword.
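For example (table and column names hypothetical), an index on a column used in your WHERE clause lets SQLite seek directly to the matching rows instead of scanning the whole table, which EXPLAIN QUERY PLAN can confirm:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# EXPLAIN QUERY PLAN shows whether the index is used for the lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer = ?",
    ("acme",),
).fetchall()
print(plan)  # mentions idx_orders_customer rather than a full table scan
```

Note the connection to the LIKE advice: a pattern with a leading wildcard (LIKE '%acme') cannot use such an index, which is one reason to minimize LIKE in hot queries.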
If you are joining your tables together, you might try simplifying the table relationships.
Like others have said, it's hard to get specific without knowing more about your schema and the SQL you are using.
