How to manage more than one dataset - Machine Learning Azure - azure-machine-learning-studio

Is there any module that accepts more than one dataset for processing?
For instance, "Split Data", "Edit Metadata" and "Select Columns in Dataset" do not accept more than one dataset as input.
This is what I did:
There are several numeric and categorical variables in my model. I used the "Convert to Indicator Values" module to create dummy variables for my data. How do I combine the indicator variables and the numeric variables into one dataset so that I can split the data for my model?
As of now, I'm doing the data wrangling in Python and moving the datasets into Azure MLS for modeling. Ideally, I would do the data wrangling in Azure MLS as well.
I expect there to be one module that consolidates both the categorical binned variables and the numeric variables in Azure MLS.

Yup, several modules accept multiple datasets: Add Columns, Apply SQL Transformation, and Execute Python Script, to name a few.
Not sure why you need them for indicator values though. Assuming you're talking about a train/test split, I would just split the data after invoking the "Convert to Indicator Values" module.

To add to the above answer: you can also use Execute R Script, or Join Data if the datasets have common keys.
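To make the suggestions above concrete, here is a minimal sketch in Python with pandas of what a two-input Execute Python Script step might look like. It assumes the classic Studio entry point `azureml_main(dataframe1, dataframe2)` that returns a tuple of data frames, and it assumes both inputs have their rows in the same order. The toy data, the column names and the 75/25 split are made up for illustration.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # Assumed entry-point shape for Execute Python Script:
    #   dataframe1 = numeric columns, dataframe2 = indicator columns.
    # pd.concat(axis=1) aligns on the index, so reset both indexes first.
    # This only gives a correct result if the rows are in the same order.
    combined = pd.concat(
        [dataframe1.reset_index(drop=True), dataframe2.reset_index(drop=True)],
        axis=1,
    )
    return combined,


# Hypothetical toy data standing in for the real dataset.
raw = pd.DataFrame({
    "age": [23, 35, 41, 52],
    "income": [30.0, 52.5, 61.0, 48.0],
    "color": ["red", "blue", "red", "green"],
})

numeric = raw[["age", "income"]]
# Rough local equivalent of creating indicator (dummy) columns.
indicators = pd.get_dummies(raw[["color"]], columns=["color"])

(combined,) = azureml_main(numeric, indicators)

# Split only after the columns have been combined.
train = combined.sample(frac=0.75, random_state=0)
test = combined.drop(train.index)
```

If the two inputs share a key column instead of a guaranteed row order, a key-based join (the Join Data module, or `DataFrame.merge` in pandas) is the safer choice than positional concatenation.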

Related

How can I use the R arctools package on a data set that contains multiple subjects?

I want to use the activity_stats function (and others) on a data set that has several dozen subjects. Based on the documentation, it looks like I have to make a separate data frame for each subject, and then run the functions on each individual data frame. Is that the case?
https://github.com/martakarass/arctools#using-arctools-package-to-compute-physical-activity-summaries

How can I train a model to detect multiple values for the same key in one page?

I am using Form Recognizer v2, specifically the Sample Labeling tool, to build and train a model that parses information from packing lists. After labeling more than 5 documents, I train the model and then pass it one of the documents to perform a prediction. However, the model predicts only one value for each key/tag, even though I assigned multiple values to each tag during labeling. Is this a limitation in the current version of the tool, or am I missing something?
Best regards,
João Amaro
We can handle multiple value text lines if they are very close to each other (e.g., an address that runs multiple lines next to each other). However, if the multiple values are spread in different places, we suggest you use different keys for them.

rbind and cbind commands equivalent in SPSS

Can someone provide equivalent code in SPSS that merges datasets, to replicate the rbind and cbind commands available in R? Many thanks!
To add rows from dataset1 to dataset2, you can use ADD FILES. This requires that both datasets hold the same variables, with matching variable names and formats.
To add columns from dataset1 to dataset2, use MATCH FILES. This command matches the values for the desired variables in dataset2 to the right rows in dataset1 using keys present in both files (such as a respondent id). The keys are defined in the BY subcommand.
Please note that R and SPSS work in totally different ways. In short, SPSS (mainly) works with datasets in which variables are defined and formatted, while R can handle single values, vectors, matrices, data frames, etc. Simply copying columns from one dataset to another (without paying attention to how the files are sorted), or simply adding rows without matching the variable names and types in the existing dataset, is very unusual in SPSS.
If you post an example of what you are trying to achieve, I could give you a more useful answer...

Comparison of good vs bad dataset using R

I'm stuck on a problem. There are two datasets, A and B, from two factories. Factory A is performing really well, whereas Factory B is not. Both datasets contain output data from the manufacturing units and have the same variables. How can I identify the problematic variable in Factory B that needs immediate attention, so that it can be fixed and Factory B starts performing well too?
Looking forward to your response.
P.S. The coding language being used is R.
Well, this is a shameless plug for the dataMaid package, which I helped write and which sort of does what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge of the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the
clean function on a data frame (if you try to clean the same data
frame several times then it may be necessary to add the replace=TRUE
argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(trees)
clean(trees)
This will create a report with summaries and error checks for each
variable in the trees data frame. A summary of all the variables is provided, and for the trees data it looks like this:
[Image: summary table of all variables in the trees data]
while the information for each variable may look like this:
[Image: report section for a single variable]
Here we get a status about the variable type, summary statistics, a plot and, in this case, an indicator that there might be a problem with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default the standard battery of tests is run depending on the
variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.

Macro variable in R

I have over 300 variables in my table. I want to choose only a handful of those variables to run through many procedures: lm(), glm(), etc. I have over 10 procedures that I need to run those variables through every time. The handful of variables can change each time, depending on whether the output is satisfactory or not.
I would like to know how to do this in R. Any help, or even a pointer to a previous thread, would be appreciated.
If you just want to select several variables, and not the entire data frame (or table in SQL parlance), a simple way to do this is to subset the data frame before running your set of procedures, using the "subset" function, e.g.:
newdata <- subset(mydata, select=c(ID, Weight))
This will only pull 2 variables out of the "mydata" data frame (ID and Weight).
You can then change this statement every time your variables change.
BTW: Macro variable is a SAS term, are you converting something from SAS?
