Can AZ ML workbench reference multiple data sources from Data Prep Transform Dataflow expression - azure-machine-learning-studio

Using AZ ML Workbench for a class project (it's the required tool), I coded the desired logic below in an exploration notebook, but I cannot find a way to include it in a Data Prep Transform Dataflow.
all_columns = df.columns
sum_columns = [col_name for col_name in all_columns if col_name not in ['NPI', 'Gender', 'State', 'Credentials', 'Specialty']]
sum_op_columns = list(set(sum_columns) & set(df_op['Drug Name'].values))
The logic uses the column names from one data source, df_op (opioid drugs), to choose which subset of columns to include from another data source, df (all drugs). When adding a Python script/expression Transform to the Data Flow, I only see the ability to reference the single df. Are there alternatives?

I may have a way for you to access both data frames.
In Workbench, once you have the data sources that you need loaded, right click on one and select "Generate Data Access Code File".
Once there you're automatically given code to access that specific file. However, you can use the same code to access the other files.
In my case, I have two data sources. I can use the code below to access them both as pandas data frames and manipulate them as I need.
df_salary = datasource.load_datasource('SalaryData.dsource')
df_startup = datasource.load_datasource('50-Startups.dsource')
I believe from there you can save your updated data frame to a CSV and then use that in the train script.
Hope that helps or at least points you to another solution.

Related

In R, do functions applied to data frames replace the original frame?

In R, when a data frame is filtered, for example, are any changes made to the source data frame? What are best practices for preserving the original data frame?
Okay, I don't understand exactly what you mean, but if you have a .csv file (for example "example.csv") in your working directory and you create an R object (example) from it, the original .csv file remains intact.
The example object, however, changes whenever you apply functions or filters to it and overwrite it with the result. The easiest way to preserve an original data frame is to assign the result of those functions to a differently named object (e.g. example2).
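As a minimal sketch of that idea (the file name and the value column used in the filter are just hypothetical):
example  <- read.csv("example.csv")       # example.csv on disk stays untouched
example2 <- subset(example, value > 10)   # filtered copy; 'example' itself is unchanged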
You may also save the result as another data frame, or write objects out to files for preservation:
library(dplyr)

mtcars1 <- mtcars %>%
  select(mpg, cyl, hp, vs)

# Save one object to a file
saveRDS(mtcars1, file = "my_data.rds")
# Restore the object
mtcars1 <- readRDS(file = "my_data.rds")
# Save multiple objects
save(mtcars, mtcars1, file = "multi_data.RData")
# Restore multiple objects again
load("multi_data.RData")

How to save an empty data frame to ADLS Gen2 using R?

I've searched but didn't find any suggestions.
I want to know how to save an empty file to ADLS Gen2 using R and then read it back in the same code.
Thanks for the help.
Since you want to write an empty file to ADLS Gen2 using R and read it back from that location, first create a data frame.
The data frame can either be completely empty, or it can have column names but no rows. You can use the following code to create one.
df <- data.frame()                         # a completely empty data frame
df <- data.frame(col_name = character())   # or: one named column (here 'col_name') with zero rows
Once you have created a data frame, you must establish a connection to your ADLS Gen2 account. You can use the method described in the following link to do that.
https://saketbi.wordpress.com/2019/05/11/how-to-connect-to-adls-gen2-using-sparkr-from-databricks-rstudio-while-integrating-securely-with-azure-key-vault/
After making the connection, you can use R's read and write functions with the path to the ADLS Gen2 storage. The following link describes a number of functions you can use, depending on your requirements.
https://www.geeksforgeeks.org/reading-files-in-r-programming/
Sharing the solution to my question:
I used SparkR to create an empty data frame and save it to ADLS after setting up the Spark context.
The solution is below:
# Step 1: Create a schema and an empty SparkR DataFrame from it
fcstSchema <- structType(structField("ABC", "string", TRUE))
new_df <- data.frame(ABC = NULL, stringsAsFactors = FALSE)
n <- createDataFrame(new_df, fcstSchema)
# Step 2: Write the empty DataFrame to ADLS Gen2 as a Delta table
SparkR::saveDF(n, path = "abfss://<container>@<account_name>.dfs.core.windows.net/<path>/", source = "delta", mode = "overwrite")
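To read the empty frame back in the same code, something along these lines should work (same placeholder path as above; this assumes the Delta source used in the snippet above is available):
n_back <- SparkR::read.df(path = "abfss://<container>@<account_name>.dfs.core.windows.net/<path>/", source = "delta")
printSchema(n_back)   # the ABC column is preserved even though there are no rows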

PowerBI: How to save result of R script?

Is it possible to implement the following scenario in Power BI Desktop?
Load data from Excel file to several tables
Make calculations with an R script using several data sources
Store the results of the calculation in a new table in Power BI (.pbix)
The idea is to use Power BI Desktop to solve a "transportation problem" with linear programming in R. Before the solver runs, we need to transform data from several data sources. I'm new to Power BI. I see that it is possible to use R scripts for loading, transforming and visualizing data, but I also need to be able to save the results of the calculation so they can then be visualized with the regular Power BI tools. Is that possible?
As I mentioned in my comment, this post would have solved most of your challenges. That approach replaces one of the tables with a new one after the R script, but you're specifically asking to produce a new table, presumably leaving the input tables untouched. I've recently written a post where you can do this using Python in the Power Query Editor. The only difference in your case would be the R script itself.
Here's how I would do it with an R script:
Data samples:
Table1
Date,Value1
2108-10-12,1
2108-10-13,2
2108-10-14,3
2108-10-15,4
2108-10-16,5
Table2
Date,Value2
2108-10-12,10
2108-10-13,11
2108-10-14,12
2108-10-15,13
2108-10-16,14
Power Query Editor:
With these tables loaded either from Excel or CSV files, you have both of them available in the Power Query Editor.
Now you can follow these steps to get a new table using an R script:
1. Change the data type of the Date Column to Text
2. Click Enter Data and click OK to get an empty table named Table3 by default.
3. Select the Transform tab and click Run R Script to open the Run R Script editor.
4. Leave it empty and click OK.
5. Remove = R.Execute("# 'dataset' holds the input data for this script",[dataset=#"Changed Type"]) from the Formula Bar and insert this: = R.Execute("# R Script:",[df1=Table1, df2=Table2]).
6. If you're prompted to do so, click Edit Permission and Run.
7. Click the gear symbol next to Run R Script under APPLIED STEPS and insert the following snippet:
R script:
df3 <- merge(x = df1, y = df2, by = "Date", all.x = TRUE)
df3$Value3 <- df1$Value1 + df2$Value2
This snippet produces a new dataframe, df3, by joining df1 and df2, and adds a new column, Value3. This is a very simple setup, but now you can do pretty much anything by just replacing the join and calculation methods. (The expected contents of Table3 for this sample data are shown after the steps.)
8. Click Home > Close & Apply to get back to Power BI Desktop. (Consider changing the data type of the Date column in Table3 from Text to Date before you do that, depending on how you'd like your tables, charts and slicers to behave.)
9. Insert a simple table to make sure everything went smoothly
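With the sample data above, Table3 should end up with these contents (Value3 is simply Value1 + Value2 for each date):
Date,Value1,Value2,Value3
2108-10-12,1,10,11
2108-10-13,2,11,13
2108-10-14,3,12,15
2108-10-15,4,13,17
2108-10-16,5,14,19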
I hope this was exactly what you were looking for. Let me know if not and I'll take another look at it.

R: give data frames new names based on contents of their current name

I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string, may be different depending on where the data is from, though its position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files <- basename(list.files(pattern = ".csv", recursive = TRUE, full.names = FALSE))
sfsplit <- lapply(site.files, function(x) strsplit(x, ".", fixed = TRUE)[[1]])
for (i in 1:length(site.files)) assign(site.files[i], read.csv(site.files[i]))
for (i in 1:length(site.files)) {
  if (sfsplit[[i]][10] == grep("PARQL", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])
  } else if (sfsplit[[i]][10] == grep("IRBT", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
  }
}
...and so on for each data frame I'd like to eventually plot from. Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire filename as it was read into R. Is there something like data.frame[1] to generically refer to the 1st data frame in the global environment?
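For what it's worth, here is a minimal sketch of the strsplit() approach the colleague suggests, assuming the identifying string (e.g. WDSD_30 or PARQL) is always the 10th dot-separated field of the file name:
site.files <- basename(list.files(pattern = ".csv", recursive = TRUE, full.names = FALSE))
for (f in site.files) {
  parts <- strsplit(f, ".", fixed = TRUE)[[1]]
  short.name <- parts[10]              # e.g. "WDSD_30"; position 10 is an assumption
  assign(short.name, read.csv(f))      # the data frame is now available under its short name
}
# The plotting code can then refer to the short name directly:
if (exists("WDSD_30")) plot(WDSD_30$meas)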

R rvest connect with local host

I am creating a way to read SPSS labels into R. Using library(sjPlot) and view_spss(df, useViewer = FALSE), I can create a local HTML page such as http://localhost:11773/session/file1e0c67270a5.html that shows a nice table with columns for the variable names and the labels I am looking for.
Now I want to use rvest to scrape it, but when I start with a command such as page <- rvest::html("http://localhost:11773/session/file1e0c67270a5.html"), R just seems to get stuck.
I've tried searching for "connect with local host" but I can't seem to find any questions or answers related to the R package.
This doesn't really answer your specific question, as I think the reason is that R spins up a non-persistent process to serve that HTML view of your data. But your approach seems quite roundabout just to get at the variable labels. This is a general way that works quite well:
library(foreign)
d <- read.spss("your_data.sav", use.value.labels=TRUE, to.data.frame=FALSE)
var_labels <- attr(d, "variable.labels")
## To access the label of a variable named 'var_name':
var_labels[["var_name"]]
Here d is a list holding the data, and var_labels is a named vector of labels keyed by variable/column name.
If you want to get the variable and/or value labels of SPSS-imported data, you can use get_val_labels and get_var_labels from the sjmisc package.
Both functions accept either a single variable (vector) or a data frame and return the associated variable and value labels.
The sjmisc package supports data frames imported with either the haven or foreign package.
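As a minimal sketch using the function names given in this answer (they come from older sjmisc versions; newer releases moved these helpers to the sjlabelled package), with a hypothetical file and column name:
library(foreign)
library(sjmisc)

d <- read.spss("your_data.sav", to.data.frame = TRUE)

get_var_labels(d)          # variable labels for the whole data frame
get_val_labels(d$gender)   # value labels for a single variable ('gender' is hypothetical)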
