I have a requirement to get the results of different transaction codes (TCodes), and that extract has to be loaded into an SQL database. Some TCodes have complex logic, so replicating the logic is not an option.
Is there any way to have a daily process that runs all the TCodes and places the extract in OneDrive or some other location?
I just need the same result as if a user opened the TCode, executed it, and extracted the result to a CSV file.
Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data fetched from external storage, whether during external table creation or via an operator at query time?
Background: The reason I ask is that in Databricks it is possible to use the SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric: it presents the total data size, regardless of the selected columns. We'll fix it. Thanks for reporting.
AWS Glue is great for transforming data from a raw form into whichever format you need, and keeping the source and destination data sets synchronized.
However, I have a scenario where data lands in a 'landing area' bucket from untrusted external sources, and the first ETL step needs to be a data validation step which only allows valid data to pass to the data lake, while non-valid data is moved to a quarantine bucket for manual inspection.
Non-valid data includes:
bad file formats/encodings
unparseable contents
mismatched schemas
even some sanity checks on the data itself
The 'landing area' bucket is not part of the data lake, it is only a temporary dead drop for incoming data, and so I need the validation job to delete the files from this bucket once it has moved them to the lake and/or quarantine buckets.
Is this possible with Glue? If the data is deleted from the source bucket, won't Glue end up removing it downstream in a subsequent update?
Am I going to need a different tool (e.g. StreamSets, NiFi, or Step Functions with AWS Batch) for this validation step, and only use Glue once the data is in the lake?
(I know I can set lifecycle rules on the bucket itself to delete the data after a certain time, like 24 hours, but in theory this could delete data before Glue has processed it, e.g. in case of a problem with the Glue job)
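The validation-and-routing step described above can be sketched in plain Python, independent of Glue. This is a minimal illustration only: the expected schema and the checks are hypothetical, and in a real job the return value would drive S3 copies to the lake or quarantine bucket followed by a delete from the landing bucket.

```python
import json

EXPECTED_FIELDS = {"id", "timestamp", "value"}  # hypothetical expected schema

def route_record(raw: bytes) -> str:
    """Return 'lake' if the record passes validation, else 'quarantine'."""
    # 1) bad file formats/encodings
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "quarantine"
    # 2) unparseable contents
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return "quarantine"
    # 3) mismatched schemas
    if not isinstance(record, dict) or set(record) != EXPECTED_FIELDS:
        return "quarantine"
    # 4) a sanity check on the data itself (e.g. no negative values)
    if not isinstance(record["value"], (int, float)) or record["value"] < 0:
        return "quarantine"
    return "lake"

print(route_record(b'{"id": 1, "timestamp": "2020-01-01", "value": 3}'))  # lake
print(route_record(b"not json at all"))                                   # quarantine
```

Keeping the routing decision in one pure function like this makes the step easy to unit-test before wiring it into whichever orchestrator ends up running it.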
Please see purge_s3_path in the docs:
glueContext.purge_s3_path(s3_path, options={}, transformation_ctx="")
Deletes files from the specified Amazon S3 path recursively.
Also, make sure your AWSGlueServiceRole has s3:DeleteObject permissions.
Your Glue environment comes with boto3. You would be better off using the boto3 S3 client/resource to delete the landing files after you've completed processing the data via Glue.
I'm using RStudio to run analysis on large datasets stored in BigQuery. The dataset is private and from a large retailer that shared the dataset with me via BigQuery to run the required analyses.
I used the bigrquery library to connect R to BigQuery, but couldn't find answers to the following two questions:
1) When I use R to run the analyses (e.g. I first used SELECT to get the data and stored it in a data frame in R), is the data then somehow stored locally on my laptop? The company is concerned about confidentiality and probably doesn't want me to store the data locally but to leave it in the cloud. But is it even possible to use R then?
2) My BigQuery free tier gives me 1 TB/month for analyses. If I use SELECT in R to get the data, it tells me, for instance, "18.1 gigabytes processed", but do I also use up my 1 TB if I run analyses in R instead of running queries on BigQuery? If it doesn't incur cost, then I'm wondering what the advantage is of running queries on BigQuery instead of in R, if the former might cost me money in the end?
Best
Jennifer
As far as I know, Google's BigQuery is an entirely cloud-based database. This means that when you run a query or report on BigQuery, it happens in the cloud, not locally (i.e. not in R). This is not to say that your source data can't be local; in fact, as you have seen, you may upload a local data set from R. But the query executes in the cloud and then returns the result set to R.
With regard to your other question, the source data in the BigQuery tables remains in the cloud, and the only local exposure you would have to the data would be the results of any query you execute from R. Obviously, if you ran SELECT * on every table, you could see all the data in a particular database, so I'm not sure how much separation of concerns there would really be in your setup.
As for pricing, from the BigQuery documentation on pricing:
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed. You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
So you get 1 TB of free query processing per month, after which you start getting billed.
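To put the free tier in perspective against the 18.1 GB query from the question, here is a rough back-of-the-envelope calculation (using decimal units for simplicity; BigQuery's exact byte accounting may differ slightly):

```python
# Rough free-tier arithmetic, decimal units (1 TB = 10**12 bytes)
free_tier_bytes = 10**12        # 1 TB of free query processing per month
query_bytes = 18.1 * 10**9      # the 18.1 GB query from the question

queries_per_month = int(free_tier_bytes / query_bytes)
print(queries_per_month)  # roughly 55 such queries before billing starts
```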
Unless you explicitly save to a file, R stores the data in memory. Because of the way sessions work, however, RStudio will keep a copy of the session unless you tell it not to, which is why it asks whether you want to save your session when you exit or switch projects. To be sure you are not storing anything, use the broom icon in the Environment tab to delete everything in the environment when you are done for the day (or whenever). Alternatively, you can delete a data frame or other object individually with rm(obj), or go to the Environment window, change "List" to "Grid", and select individual objects to remove. See How do I clear only a few specific objects from the workspace?, which addresses this part of my answer (but is not a duplicate question).
We have a lot of applications that are tracked by Azure Application Insights. We have configured continuous export to Azure Blob Storage. The data is partitioned by export date; the folder structure has the format
{yyyy-MM-dd}/{guid}.blob
The problem is that the partition key relies on the export date, not on the dates of the events within the *.blob files.
So the file that is placed to the folder
/2016-11-05/
can contain events for dates
2016-11-09, 2016-11-11, 2016-11-12
We would like to have this data in an Azure Data Lake database so we can analyze it with Azure Data Lake Analytics. We would also like an event table partitioned by event generation time.
To orchestrate the entire ETL process we have chosen ADF. We like its incremental model, scheduling, and concurrency model. We have chosen a data slice of ~1 day. One of ADF's requirements is that every activity in a pipeline should be repeatable (if I schedule a rerun for a random data slice, it should be reprocessed consistently, for example by cleaning up the old data and loading the new data).
So we have the pipeline with these activities:
1) Data Movement: JSON blobs ---> CSV files on the Data Lake Store file system; the resulting CSV files have the same partitioning key as the source data, aligned to the export date.
2) U-SQL Activity: we planned to invoke a U-SQL job and pass it SliceStart and SliceEnd parameters pointing to the current data slice. When the source's and target's partitioning keys are aligned, we can simply truncate and reload the partition. But when these partitions are misaligned, this no longer works.
It looks like there is a problem implementing a repeatable U-SQL step in this case, due to:
the misaligned partitioning key (export date vs. event date)
the lack of dynamic U-SQL
But the root cause here is the partitions' misalignment.
I have only these ideas:
an additional task that repartitions the source Application Insights data
crazy dances with dynamic U-SQL emulation: https://social.msdn.microsoft.com/Forums/en-US/aa475035-2d57-49b8-bdff-9cccc9c8b48f/usql-loading-a-dynamic-set-of-files?forum=AzureDataLake
I would be happy if someone gave me an idea of how to solve this problem. Ideally we should avoid massive reloads/rescans of the data.
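The first idea (a repartitioning task) amounts to regrouping events keyed by export date into groups keyed by event date. A minimal sketch in plain Python, purely for illustration: the folder layout and the "eventDate" field name are assumptions, and in practice this step would be a Data Movement or custom activity operating on the actual blobs.

```python
from collections import defaultdict

def repartition_by_event_date(export_folders):
    """Regroup events keyed by export date into groups keyed by event date.

    export_folders: {export_date: [event, ...]} where each event carries
    its own 'eventDate' field (an assumption about the blob contents).
    """
    by_event_date = defaultdict(list)
    for export_date, events in export_folders.items():
        for event in events:
            by_event_date[event["eventDate"]].append(event)
    return dict(by_event_date)

# Mirrors the example from the question: the 2016-11-05 export folder
# contains events generated on later dates.
source = {
    "2016-11-05": [
        {"eventDate": "2016-11-09", "name": "pageView"},
        {"eventDate": "2016-11-11", "name": "request"},
    ],
    "2016-11-06": [
        {"eventDate": "2016-11-11", "name": "exception"},
    ],
}
print(sorted(repartition_by_event_date(source)))  # ['2016-11-09', '2016-11-11']
```

Once the data is laid out by event date, each U-SQL slice maps cleanly onto one partition, and truncate/reload becomes repeatable again.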
I have a database that is used to store transactional records; these records are created, and another process picks them up and then removes them. Occasionally this process breaks down and the number of records builds up. I want to set up a (semi-)automated way to monitor things, and as my tool set is limited and I have an R-shaped hammer, this looks like an R-shaped nail problem.
My plan is to write a short R script that will query the database via ODBC, and then write a single record with the datetime, the number of records in the query, and the datetime of the oldest record. I'll then have a separate script that will process the data file and produce some reports.
What's the best way to create my data file? At the moment my options are:
Load a dataframe, add the record and then resave it
Append a row to a text file (i.e. a csv file)
Any alternatives, or a recommendation?
I would be tempted by the second option because, from a semantic point of view, you don't need the old entries in order to write the new ones, so there is no reason to reload all the data each time. It would be more time- and resource-consuming to do that.
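The append approach can be sketched as follows. This is in Python for illustration (the same pattern applies in R with write.table(..., append = TRUE)), and the file name and column names are made up for the example:

```python
import csv
import os

def append_snapshot(path, checked_at, record_count, oldest_record):
    """Append one monitoring row to a CSV file, writing a header on first use."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["checked_at", "record_count", "oldest_record"])
        writer.writerow([checked_at, record_count, oldest_record])

# Each scheduled run appends exactly one row; nothing is reloaded or rewritten.
append_snapshot("queue_monitor.csv", "2020-01-01T09:00:00", 42, "2019-12-31T23:50:00")
append_snapshot("queue_monitor.csv", "2020-01-01T10:00:00", 7, "2020-01-01T09:55:00")
```

Because each run only ever appends, a crash mid-run can at worst lose one row; it never corrupts the history the way a failed full rewrite of the data frame could.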