Incremental data load/processing - azure-application-insights

We have a lot of applications that are tracked by Azure Application Insights, and we have configured continuous export to Azure Blob Storage. The data are partitioned by export date; the folder structure has the format
{yyyy-MM-dd}/{guid}.blob
The problem is that the partition key relies on the export date, not on the dates of the events inside the *.blob files.
So a file placed in the folder
/2016-11-05/
can contain events for dates
2016-11-09, 2016-11-11, 2016-11-12
We would like to have these data in an Azure Data Lake database so we can analyze them with Azure Data Lake Analytics, and we would like the event table to be partitioned by event generation time.
To orchestrate the entire ETL process we have chosen ADF. We like its incremental model, scheduling, and concurrency model. We have chosen a data slice of ~1 day. One of ADF's requirements is that every activity in a pipeline should be repeatable (if I schedule a rerun for an arbitrary data slice, it should be reprocessed consistently, e.g. clean up the old data and load the new data).
So we have the pipeline with these activities:
1) Data Movement: JSON blobs ---> CSV files in the Data Lake Store file system. The resulting CSV files have the same partitioning key as the source data, aligned to the export date.
2) U-SQL Activity. We planned to invoke a U-SQL job and pass it SliceStart and SliceEnd parameters pointing to the current data slice. When the source's and target's partitioning keys are aligned, we can simply truncate and reload the target partition. But when these partitions are misaligned, that approach no longer works.
It looks like it is hard to implement a repeatable U-SQL step in this case due to:
the misaligned partitioning keys (export date vs. event date)
the lack of dynamic U-SQL
But the root cause here is the partition misalignment.
I have only these ideas:
an additional task that repartitions the source Application Insights data (a sketch of this is shown below)
"crazy dances" with dynamic U-SQL emulation, as discussed in https://social.msdn.microsoft.com/Forums/en-US/aa475035-2d57-49b8-bdff-9cccc9c8b48f/usql-loading-a-dynamic-set-of-files?forum=AzureDataLake
I would be happy if someone could give me an idea of how to solve this problem. Ideally we should avoid massive reloads/rescans of the data.
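For reference, here is a minimal sketch of the repartitioning idea, assuming the azure-storage-blob Python package, that each exported blob is line-delimited JSON, and that the event timestamp can be read from context.data.eventTime (verify against your export schema); the container names and output layout are hypothetical:

```python
# Sketch of the "repartition by event date" idea, not production code.
# Assumptions: azure-storage-blob, line-delimited JSON blobs, an event
# timestamp at context.data.eventTime, and made-up container names.
import json
from collections import defaultdict

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("appinsights-export")        # export-date layout
target = service.get_container_client("appinsights-by-eventdate")  # event-date layout

export_date = "2016-11-05"  # the data slice being (re)processed

# Bucket every event line by the date the event was actually generated.
buckets = defaultdict(list)
for blob in source.list_blobs(name_starts_with=f"{export_date}/"):
    content = source.download_blob(blob.name).readall().decode("utf-8")
    for line in content.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        event_date = event["context"]["data"]["eventTime"][:10]  # yyyy-MM-dd
        buckets[event_date].append(line)

# One output blob per event date, named after the slice so a rerun of the
# same slice overwrites its own previous output (repeatable activity).
for event_date, lines in buckets.items():
    target.upload_blob(
        name=f"{event_date}/{export_date}.json",
        data="\n".join(lines),
        overwrite=True,
    )
```

With the data laid out by event date, the U-SQL step can go back to a plain truncate-and-reload of the partition that matches the slice.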

Related

What strategy to manage the EF Core change tracker when a large number of entities are tracked

We are using Entity Framework Core 6 in an ASP.NET Core accounting software application.
A few operations consist of importing a large number of different entities into the database (a backup restore process and an XML import from another piece of software). The amount of data in these source files can be quite large (several tens of thousands of entities).
Since the number of entities is too large to handle in a single transaction, we have a batching system that calls "SaveChanges" on the db context every few hundred inserts (otherwise the final "SaveChanges" simply wouldn't work).
We're running into a performance problem: when the change tracker contains many entities (a few thousand or more), every call to DetectChanges takes a very long time (several seconds), so the whole import process becomes almost exponentially slower as the dataset grows.
We are experimenting with creating new, short-lived contexts to save some of the more numerous entities instead of loading them into the initial db context, but that is rather hard to code properly: there are many objects that we need to copy (in part or in full) and pass back to the calling context in order to rebuild the data structure properly.
So I was wondering whether there is another approach. Maybe a way to tell the change tracker that a set of entities should be kept around for reference but no longer serialized (and, of course, skipped by the change detection process).
Edit: I was asked for a specific business case, so here it is: accounting data is stored per fiscal year.
Each fiscal year contains the data itself but also all the configuration options necessary for the software to work. This data is actually a rather complex set of relationships: accounts contain references to tax templates (to be used when creating entry lines for the account), which themselves contain several references to accounts (indicating which accounts should be used to create the entry lines that record the tax amount). There are many such circular relationships in the model.
The load process therefore needs to load the accounts first and record, for each one, which tax template it references. Then we load the tax templates, fill in their references to the accounts, and then process the accounts again to enter the IDs of the newly created tax templates.
We're using an ORM because the data model is defined by the class model: saving data directly to the database would certainly be possible, but every time we changed the model we would have to manually adjust all these methods as well. I'm trying to limit the number of ways my (small) team can shoot themselves in the foot when improving our model (which is evolving fast), and having a single reference for the data model seems like the way to go.

Column pruning on parquet files defined as an external table

Context: We store historical data in Azure Data Lake as versioned parquet files written from our existing Databricks pipeline, where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer against the external table I have defined for that log source. In the query summary window of Kusto.Explorer I can see that the entire folder is downloaded when I query it, even when using the project operator. The only exception seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage, either when creating the external table or with an operator at query time?
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in, which reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there is a bug with the "Downloaded Size" metric - it shows the total data size regardless of the selected columns. We'll fix it. Thanks for reporting it.

Bigquery and R: Cost and where is the data stored?

I'm using RStudio to run analysis on large datasets stored in BigQuery. The dataset is private and from a large retailer that shared the dataset with me via BigQuery to run the required analyses.
I used the bigrquery library to connect R to BigQuery, but couldn't find answers to the following two questions:
1) When I use R to run the analyses (e.g. first using SELECT to get the data and storing it in a data frame in R), is the data then somehow stored locally on my laptop? The company is concerned about confidentiality and probably doesn't want me to store the data locally but rather leave it in the cloud. But is it even possible to use R then?
2) My BigQuery free tier includes 1 TB of query processing per month. If I use SELECT in R to get the data, it tells me, for instance, "18.1 gigabytes processed" - but do I also use up my 1 TB if I run the analyses in R instead of running queries in BigQuery? If it doesn't incur cost, then I'm wondering what the advantage is of running queries in BigQuery rather than in R, if the former might cost me money in the end.
Best
Jennifer
As far as I know, Google's BigQuery is an entirely cloud-based database. This means that when you run a query or report in BigQuery, it happens in the cloud, not locally (i.e. not in R). That is not to say that your source data can't be local; in fact, as you have seen, you may upload a local data set from R. But the query executes in the cloud and then returns the result set to R.
With regard to your other question, the source data in the BigQuery tables would remain in the cloud, and the only exposure to the data you would have locally would be the results of any query you might execute from R. Obviously, if you ran SELECT * on every table, you could see all the data in a particular database. So I'm not sure how much of a separation of concerns there would really be in your setup.
As for pricing, from the BigQuery documentation on pricing:
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed. You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
So you get 1 TB of free query processing per month (at 18.1 GB per query, that is roughly 55 such queries), after which you would start getting billed.
Unless you explicitly save it to a file, R stores the data in memory. Because of the way sessions work, however, RStudio will keep a copy of the session unless you tell it not to, which is why it asks whether you want to save your session when you exit or switch projects. To be sure you are not storing anything, use the broom icon in the Environment tab when you are done for the day (or whenever) to delete everything in the environment. You can also delete a data frame or other object individually with rm(obj), or go to the Environment pane, change "List" to "Grid", and select individual objects to remove. See How do I clear only a few specific objects from the workspace?, which addresses this part of my answer (but is not a duplicate of this question).

How to periodically update a moderate amount of data (~2.5m entries) in Google Datastore?

I'm trying to do the following periodically (let's say once a week):
download a couple of public datasets
merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
upload/synchronize the result to Cloud Datastore so that I have it as "reference data" for other things running in the project
Synchronization can mean that some entries are updated, others are deleted (if they were removed from the public datasets) or new entries are created.
I've put together a Python script using google-cloud-datastore, but the performance is abysmal - it takes around 10 hours (!) to do this. What I'm doing:
iterate over the entries from the datastore
look them up in my dictionary and decide whether they need to be updated or deleted (if they are no longer present in the dictionary)
write them back / delete them as needed
insert any new elements from the dictionary
I already batch the requests (using .put_multi, .delete_multi, etc).
Some things I considered:
Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time- and memory-consuming
Use the managed import/export. The problem is that it produces/consumes an undocumented binary format (I would guess entities serialized as protocol buffers?)
Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide the entities from Datastore into chunks that could be processed by different threads
How could I improve the performance?
Assuming that your Datastore entities are only updated during the sync, you should be able to eliminate the "iterate over the entries from the datastore" step and instead store the entity keys directly in your dictionary. Then, if any updates or deletes are necessary, just reference the appropriate entity key stored in the dictionary; a minimal sketch of this approach is shown below.
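A minimal sketch of that approach with the google-cloud-datastore client; the kind name, the batch size, and the idea that you keep the ids from the previous sync (e.g. persisted alongside the dictionary) are assumptions:

```python
# Sketch only: upsert everything in the freshly built dictionary and delete
# ids that disappeared, without ever scanning Datastore.
from google.cloud import datastore

client = datastore.Client()
KIND = "Reference"   # hypothetical kind name
BATCH = 500          # Datastore allows at most 500 mutations per commit


def sync(new_data: dict, previous_ids: set) -> None:
    """new_data maps id -> property dict; previous_ids are the ids from the last sync."""
    pending = []
    for key_id, props in new_data.items():
        entity = datastore.Entity(key=client.key(KIND, key_id))
        entity.update(props)
        pending.append(entity)
        if len(pending) == BATCH:
            client.put_multi(pending)
            pending = []
    if pending:
        client.put_multi(pending)

    # Entries that were present last time but are gone from the public datasets.
    stale = [client.key(KIND, key_id) for key_id in previous_ids - new_data.keys()]
    for i in range(0, len(stale), BATCH):
        client.delete_multi(stale[i:i + BATCH])
```

Since the keys are derived from the dictionary, no read of the existing entities is needed; the wall-clock time is then dominated by the put_multi/delete_multi round trips, which could also be issued from a few threads.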
You might be able to leverage multiple threads if you pre-generate empty entities (or keys) in advance and store cursors at a given interval (say every 100,000 entities). There's probably some overhead involved as you'll have to build a custom system to manage and track those cursors.
If you use Dataflow, instead of loading your entire dictionary you could first import it into a new project (a clean Datastore database). Then, in your Dataflow function, you could look up the key given to you through Dataflow in the clean project. If the lookup returns a value, upsert it to your production project; if it doesn't exist, delete the value from your production project.

Attaching two memory databases

I am collecting data every second and storing it in a ":memory:" database. Inserting data into this database happens inside a transaction.
Every time a request is sent to the server, the server reads data from the first in-memory database, does some calculation, stores the result in the second database, and sends it back to the client. For this I am creating another ":memory:" database to store the aggregated information from the first db. I cannot use the same db because I need to do some heavy calculation to get the aggregated result. This cannot be done inside the transaction (because if one calculation takes 5 seconds, I will lose 4 seconds of collected data). I cannot create the table in the same database because I would not be able to write the aggregated data while it is collecting and inserting the original data (that happens inside a transaction, and it collects every second).
-- Sometimes I want to retrieve data from both databases. How can I link these two in-memory databases? Using the ATTACH DATABASE statement, I can attach the second db to the first one. But the problem is: the next time a request comes in, how will I check whether the second db exists or not?
-- Suppose I attach the second in-memory db to the first one. Will the second database be locked when we write data to the first db?
-- Is there any other way to store this aggregated data?
As far as I understand your idea, I don't think you need two databases at all. I suspect you are misinterpreting how transactions work in SQL.
If you begin a transaction, other processes are still allowed to read data. If you are only reading data, you probably don't need a database lock.
A possible workflow could look like the following.
Insert some data into the database (use a transaction just for the insertion process).
Perform the heavy calculations on the database (but do not use a transaction, otherwise it will prevent other processes from inserting any data into your database). Even if this step involves really heavy computation, you can still insert and read data from another process, as SELECT statements will not lock your database.
Write the results back to the database (again, using a transaction).
Just make sure that the heavy calculations are not performed within a transaction; a sketch of this workflow using Python's sqlite3 module follows below.
If you want a more detailed description of this solution, look at the documentation about the file locking behaviour of sqlite3: http://www.sqlite.org/lockingv3.html
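As an illustration, the workflow above might look like this with Python's sqlite3 module; the table names and the aggregation are made up, and note that a plain ":memory:" database is private to one connection (sharing it between connections would need a shared-cache URI such as "file:mem?mode=memory&cache=shared" with uri=True, or a file-backed database):

```python
# Sketch of the single-database workflow: short write transactions,
# heavy calculation done outside any write transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts REAL, value REAL)")
conn.execute("CREATE TABLE aggregates (window_start REAL, avg_value REAL)")


def insert_sample(ts, value):
    # Step 1: a transaction just for the insertion.
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO samples VALUES (?, ?)", (ts, value))


def aggregate(window_start, window_end):
    # Step 2: read and compute without holding a write transaction,
    # so the calculation itself does not block ongoing inserts.
    rows = conn.execute(
        "SELECT value FROM samples WHERE ts >= ? AND ts < ?",
        (window_start, window_end),
    ).fetchall()
    avg_value = sum(v for (v,) in rows) / len(rows) if rows else None

    # Step 3: a transaction just for writing the result back.
    with conn:
        conn.execute("INSERT INTO aggregates VALUES (?, ?)", (window_start, avg_value))
    return avg_value
```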

Resources