Add model parameters and Run timestamp to scored results - azure-machine-learning-studio

I have run an Azure Machine Learning experiment in Studio and am currently exporting the scored results to Azure SQL.
I want to include the run timestamp and the model configuration as extra columns in the exported data.
This is useful because I can then compare the results from different runs against each other in my Azure SQL database.
Does anyone know how to add this data to the scored dataset?
Thanks in advance,
Oliver

You could use the Apply SQL Transformation module to generate the timestamp for each row of scored data, as well as a new string column containing the model configuration.
If you are more comfortable with R, you could use the Execute R Script module to create the new columns.
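For the Execute R Script route, a minimal sketch (assuming the classic Studio maml.mapInputPort / maml.mapOutputPort conventions; the model description string below is a placeholder you would keep in sync with your experiment, or pass in via a second input port):

    # Inside the Execute R Script module, with the scored dataset on input port 1
    scored <- maml.mapInputPort(1)

    # Same timestamp for every row of this run
    scored$run_timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC")

    # Model configuration as a plain string column (placeholder text)
    scored$model_config <- "Two-Class Boosted Decision Tree; 100 trees; learning rate 0.2"

    # Return the widened dataset to the experiment for export
    maml.mapOutputPort("scored")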

Related

Setting up data adapter: can’t connect data to model

After I adapted my SQL dataset to my model, the sensor list disappeared from localhost, and the SQL command is not generating the expected data (a table from sample.db). The default dataset was Hyperion, but I changed it to my own sensors dataset.
https://forge.autodesk.com/en/docs/dataviz/v1/developers_guide/advanced_topics/sqlite_adapter/
What is the problem? What kind of table is expected to come from the SQL database?
Please note that the Data Viz Extension tutorial is a work-in-progress.
Before these tutorials are finished, I'd suggest that you take a look at a sample app that I've been working on: https://github.com/petrbroz/forge-iot-extensions-demo. It's also using the Data Visualization Extensions but it aims to be simpler and easier to reuse. By default, the IoT sensors, channels, and samples are defined in a simple JSON, but I've put together a separate code branch (sample/sqlite) where the IoT data is fetched from a sample sqlite database.

Column pruning on parquet files defined as an external table

Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in, which reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there is a bug with the "Downloaded Size" metric: it reports the total data size regardless of the selected columns. We'll fix it. Thanks for reporting.

Bigquery and R: Cost and where is the data stored?

I'm using RStudio to run analysis on large datasets stored in BigQuery. The dataset is private and from a large retailer that shared the dataset with me via BigQuery to run the required analyses.
I used the bigrquery library to connect R to BigQuery, but couldn't find answers to the following two questions:
1) When I use R to run the analyses (e.g. first using SELECT to get the data and storing it in a data frame in R), is the data then somehow stored locally on my laptop? The company is concerned about confidentiality and probably doesn't want me to store the data locally but rather leave it in the cloud. But is it even possible to use R then?
2) My BigQuery free tier allows 1 TB of analysis per month. When I use SELECT in R to get the data, it tells me, for instance, "18.1 gigabytes processed", but do I also use up my 1 TB if I run analyses in R instead of running queries in BigQuery? If it doesn't incur cost, then I'm wondering what the advantage is of running queries in BigQuery instead of in R, if the former might end up costing me money?
Best
Jennifer
As far as I know, Google's BigQuery is an entirely cloud-based database. This means that when you run a query or report in BigQuery, it happens in the cloud, not locally (i.e. not in R). That is not to say your source data can't be local; in fact, as you have seen, you may upload a local dataset from R. But the query executes in the cloud and then returns the result set to R.
With regard to your other question, the source data in the BigQuery tables remains in the cloud, and the only local exposure to the data you would have is the result of any query you execute from R. Obviously, if you ran SELECT * on every table, you could see all the data in a particular database, so I'm not sure how much separation of concerns there would really be in your setup.
As for pricing, from the BigQuery documentation on pricing:
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed. You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
So you get 1 TB of free query processing per month, after which you start getting billed.
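As a hedged sketch of what this looks like from R using bigrquery's DBI interface (the project, dataset, billing project, table, and column names below are placeholders): the query itself always runs in BigQuery, so the bytes processed count against the same 1 TB/month free tier whether you submit it from R or from the BigQuery console; only the result set comes back to R.

    # Minimal sketch using bigrquery's DBI interface; all names are placeholders.
    library(DBI)
    library(bigrquery)

    con <- dbConnect(
      bigquery(),
      project = "retailer-project",    # project that owns the shared dataset
      dataset = "retailer_dataset",
      billing = "my-billing-project"   # project whose quota / free tier is used
    )

    # The query executes in BigQuery (and is metered there);
    # only the aggregated result set is returned as an R data frame.
    sales <- dbGetQuery(con, "
      SELECT store_id, SUM(amount) AS total_sales
      FROM transactions
      GROUP BY store_id
    ")

    dbDisconnect(con)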
Unless you explicitly save to a file, R stores the data in memory. Because of the way sessions work, however, RStudio will keep a copy of the session unless you tell it not to, which is why it asks whether you want to save your session when you exit or switch projects. To be sure nothing is stored, when you are done for the day use the broom icon in the Environment tab to delete everything in the environment. You can also delete individual objects with rm(obj), or switch the Environment pane from "List" to "Grid" view and select individual objects to remove. See How do I clear only a few specific objects from the workspace?, which addresses this part of my answer (but is not a duplicate question).
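If you prefer to do that cleanup from code rather than the broom icon, the standard commands are (sales here stands for whatever data frame holds your query results):

    rm(sales)          # remove a single object, e.g. a data frame of query results
    rm(list = ls())    # wipe the whole environment (roughly what the broom icon does)
    # q(save = "no")   # quit without writing an .RData workspace file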

Writing data captured via variables to a file in the PeopleSoft Test Framework

I am using the PeopleSoft Test Framework. When creating a script, I captured several values using variables, and now I want to write those values to a file. I have tried several searches and found nothing on this subject. Once I have written the values to a file, I also want to compare them against values captured in another script. Any help would be much appreciated.
Could you update a column in a table with the value from script 1, update a different column with the value from script 2 and then compare the columns?
Would this satisfy your requirement?

Refer database tables in R

I have a database named Team which has 40 tables. How can I connect to that database and refer to a particular table without using a SQL query, working only with R data structures?
I am not sure what you mean by "How can I connect to that database and refer to a particular table without using a SQL query".
I am not aware of a way to "see" DB tables as R data frames or arrays without first importing the tuples through some sort of (SQL) query; this seems to be the most practical way to use R with DB data (short of exporting the tables as .csv files first and re-reading them into R).
There are a couple of ways to import data from a DB into R so that the result of a query becomes an R data structure (ideally with proper type conversion); see the sketch after the links below.
Here is a short guide on how to do that with SQL-R
A similar brief introduction to the DBI family
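For completeness, a minimal sketch with DBI (assuming the Team database is reachable through an ODBC DSN; "TeamDSN" and the table name "Players" are made-up placeholders). Under the hood dbReadTable and dplyr's lazy tables still issue SQL, but you never have to write it yourself:

    library(DBI)

    con <- dbConnect(odbc::odbc(), dsn = "TeamDSN")   # placeholder DSN for the Team database

    dbListTables(con)                       # list all 40 tables
    players <- dbReadTable(con, "Players")  # pull one table into a data.frame

    # With dplyr you can refer to a table lazily and let it generate the SQL:
    # library(dplyr)
    # tbl(con, "Players") %>% filter(active == 1) %>% collect()

    dbDisconnect(con)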
