Replacing values in dataset within Azure Machine Learning Studio

In Azure Machine Learning Studio, I need to convert a column that has three categorical values, 'yes', 'no' and 'maybe', so that the 'no' and 'maybe' values are combined as just 'no'.
I can do this easily using SQL, R, or Python, but for these purposes I need to show whether it is possible without using these languages. I can't seem to find a way to do this.
Does anyone have any ideas? I'm fine if the answer is no but I don't want to say it's not possible if it is.

It can be done! :)
You would just use the "Group Categorical Values" module. Choose the column that has the data you want to group, and you can set the values like the following:
Here, the default level, which is used for any value not matched by another level, is set to "yes". Any value of "no" or "maybe" is then grouped into a single category of "no".
However, the module will raise an error unless that column is a categorical type, so you would first need to use the "Edit Metadata" module to make it categorical.
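For anyone who wants to sanity-check the grouped column outside the designer, the mapping described above can be imitated in a few lines of Python. This is only a sketch of the logic, not the module's implementation; the function name group_levels and the sample values are made up for illustration.

```python
def group_levels(values, mapping, default):
    """Map each value to its grouped level; unmatched values get the default level."""
    return [mapping.get(value, default) for value in values]


# Hypothetical sample column with the three original categories.
column = ["yes", "no", "maybe", "yes"]

# "no" and "maybe" are grouped into "no"; anything else falls back to "yes".
grouped = group_levels(column, {"no": "no", "maybe": "no"}, default="yes")
print(grouped)
```

The default plays the same role as the module's default level: any value not listed in the mapping ends up there.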
The example I used is published to the gallery, if you need to reference it.
If you need more info, just let me know.

Related

Filling in spreadsheet text using R

I am looking for a way to fill in a blank column with values based on text in a different column. The column containing the fill-in criteria is the job Title, and I need to pull out keywords that determine what goes into the blank column. An example of the spreadsheet with Title and Job Role column headers:
The goal would be to say if the Title contains "HR, Human Resources" fill in HR in the Job Role column and if Title contains "IT, Information, Technology" then fill in IT in the Job Role column.
I have tried using Excel formulas but keep running into limitations like too many arguments or not being able to properly nest multiple statements.
Let me know if something is unclear or if anything needs to be explained more.
Thanks!
If you read the spreadsheet into R, you could probably just run the following:
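# Loop over every row and fill in the role when the title contains an HR keyword
for (i in seq_len(nrow(df))) {
  if (grepl("HR|Human Resources", df$JobTitle[i])) {
    df$JobRole[i] <- "HR"
  }
}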
Although you'll have to change the names JobTitle and JobRole to match whatever the column names end up being. More if statements could be added for other job titles and roles as well.
It would also be more helpful if you incorporated some reproducible code in the question too.
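The same keyword-to-role idea can be sketched in Python for comparison. The keyword lists come from the question; the names ROLE_KEYWORDS and job_role and the sample titles are hypothetical, and matching on whole words (case-sensitive) is an assumption about what "contains" should mean here.

```python
import re

# Keywords from the question, grouped by the role they should produce.
ROLE_KEYWORDS = {
    "HR": ["HR", "Human Resources"],
    "IT": ["IT", "Information", "Technology"],
}


def job_role(title):
    """Return the first role whose keyword appears as a whole word in the title."""
    for role, keywords in ROLE_KEYWORDS.items():
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, title):
            return role
    return ""  # leave the Job Role blank when nothing matches


# Hypothetical titles for illustration.
titles = ["Director of Human Resources", "IT Support Technician", "Chef"]
roles = [job_role(t) for t in titles]
print(roles)
```

Keeping the keywords in one dictionary means a new role only needs a new entry rather than another if statement.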

How to Add Column (script) transform that queries another column for content

I’m looking for a simple expression that puts a ‘1’ in column E if ‘SomeContent’ is contained in column D. I’m doing this in Azure ML Workbench through their Add Column (script) function. Here are some examples they give:
row.ColumnA + row.ColumnB is the same as row["ColumnA"] + row["ColumnB"]
1 if row.ColumnA < 4 else 2
datetime.datetime.now()
float(row.ColumnA) / float(row.ColumnB - 1)
'Bad' if pd.isnull(row.ColumnA) else 'Good'
Any ideas on a 1 line script I could use for this? Thanks
Without really knowing what you want to look for in column 'D', I still think you can find all the information you need in the examples they give.
The script is being wrapped by a function that collects the value you calculate/provide and puts it in the new column. This assignment happens for each row individually. The value could be a static value, an arbitrary calculation, or it could be dependent on the values in the other columns for the specific row.
In the "Hint" section, you can see two different ways of obtaining the values from the other columns:
The current row is referenced using 'row' and then a column qualifier, for example row.colname or row['colname'].
In your case, you obtain the value for column 'D' either by row.D or row['D']
After that, all you need to do is come up with the specific logic for checking whether 'SomeContent' is contained in column 'D' for that specific row. In your case, the '1 line script' would look something like this:
1 if [logic ensuring 'SomeContent' is contained in row.D] else 0
If you need help with the logic, you need to provide more specific examples.
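As a rough sketch, assuming column 'D' holds text and a plain substring match is what is wanted, the logic could be Python's in operator, giving a one-liner such as 1 if 'SomeContent' in str(row.D) else 0. The snippet below tries that expression outside Workbench; SimpleNamespace only stands in for the row object, and contains_flag is a made-up helper name.

```python
from types import SimpleNamespace


def contains_flag(row, needle="SomeContent"):
    """Return 1 when the text in column D contains the needle, else 0."""
    # str() guards against non-text values such as None or numbers.
    return 1 if needle in str(row.D) else 0


# Stand-in rows; in Workbench the row object is supplied for you.
rows = [
    SimpleNamespace(D="prefix SomeContent suffix"),
    SimpleNamespace(D="something else"),
    SimpleNamespace(D=None),
]
flags = [contains_flag(r) for r in rows]
print(flags)
```

The match is case-sensitive; comparing lowercased strings on both sides would make it case-insensitive.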
You can read more in the Azure Machine Learning Documentation:
Sample of custom column transforms (Python)
Data Preparations Python extensions
Hope this helps

Label only part of the omitted variables with Stargazer

Using stargazer, I want to omit some of the control variables from the report. I want to label some of them, while not labeling the rest (or group them all under "Other control variables").
omit and omit.labels should be of the same length, so just ignoring some of the variables won't do. Is there any way to do that?
Found a way to do it. Use omit and omit.labels for the omitted variables you want to state (FE etc.), and use keep to state the only variables you wish to keep in the report. All other variables won't be mentioned in the exported report.

Grouping, missing data - Cognos Report Studio

In IBM Cognos Report Studio
I have a data structure like so, plain dump of the customer details:
Account|Type|Value
123-123| 19 |2000
123-123| 20 |2000
123-123| 21 |3000
If I remove the Type from my report I get:
Account|Value
123-123|2000
123-123|3000
It seems to have treated the two rows with an amount '2000' as some kind of duplicated amount and removed it from my report.
My assumption was that Cognos will aggregate the data automatically?
Account|Value
123-123|7000
I am lost on what it is doing. Any pointers? If it is not grouping it, I would at least expect 3 rows still:
Account|Value
123-123|2000
123-123|2000
123-123|3000
In any case I would like to end up with 1 line. The behaviour I'm getting is something I can't figure out. Thanks for any help.
Gemmo
The 'Auto-group & Summarize' feature is the default on new queries. This will find all unique combinations of attributes and roll up all measures to these unique combinations.
There are three ways to disable auto-group & summarize behavior:
1. Explicitly turn it off at the query level
2. Include a grain-level unique column, e.g. a key, in the query
3. Not include any measures in the query
My guess is that your problem is #3. The [Value] column in your example has to have its 'Aggregate Function' set to an aggregate function or 'Automatic' for the auto-group behavior to work. It's possible that column's 'Aggregate Function' property is set to 'None'. This is the standard setting for an attribute value and would prevent the roll up from occurring.
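The difference between the two outcomes can be imitated in plain Python. This is not Cognos code, only a sketch of the behavior described above under the assumption that an attribute column is de-duplicated while a measure is summed; the function names are made up, and the rows are the ones from the question.

```python
# (Account, Type, Value) rows from the question.
rows = [
    ("123-123", 19, 2000),
    ("123-123", 20, 2000),
    ("123-123", 21, 3000),
]


def value_as_attribute(rows):
    """Value treated as an attribute: each distinct (Account, Value) pair appears once."""
    distinct = []
    for account, _type, value in rows:
        if (account, value) not in distinct:
            distinct.append((account, value))
    return distinct


def value_as_measure(rows):
    """Value treated as a measure: it is summed per Account."""
    totals = {}
    for account, _type, value in rows:
        totals[account] = totals.get(account, 0) + value
    return list(totals.items())


print(value_as_attribute(rows))  # two rows: 2000 and 3000
print(value_as_measure(rows))    # one row with the total
```

With Type removed, the attribute version reproduces the two-row output from the question, while the measure version yields the single rolled-up line.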

R: creating factor using data from multiple columns

I want to create a column that codes for whether patients have had a comorbid diagnosis of depression or not. Problem is, the diagnosis can be recorded in one of 4 columns:
ComorbidDiagnosis;
OtherDiagnosis;
DischargeDiagnosis;
OtherDischargeDiagnosis.
I've been using
levels(dataframe$ynDepression)[levels(dataframe$ComorbidDiagnosis)=="Depression"]<-"Yes"
for all 4 columns but I don't know how to code those who don't have a diagnosis in any of the columns. I tried:
levels(dataframe$ynDepression)[levels(dataframe$DischOtherDiagnosis &
dataframe$OtherDiagnosis &
dataframe$ComorbidDiagnosis &
dataframe$DischComorbidDiagnosis)==""]<-"No"
I also tried using && instead but it didn't work. Am I missing something?
Thanks in advance!
Edit: I tried uploading an image of some example data but I don't have enough reputation to upload images yet. I'll try to put an example here but it might not work:
Patient ID|PrimaryDiagnosis|OtherDiagnosis|ComorbidDiagnosis
|AN|Depression|
|AN||
|AN|Depression|PTSD
|AN||Depression
What's inside the [] must be (transformable to) a boolean for the subset to work. For example:
x<-1:5
x[x>3]
#4 5
x>3
# F F F T T
works because the condition is a boolean vector. Sometimes the boolean nature is implicit, as in dataframe[,"var"], which means dataframe[,colnames(dataframe)=="var"], but R must be able to make it a boolean somehow.
EDIT : As pointed out by beginneR, you can also subset with something like df[,c(1,3)], which is numeric but works the same way as df[,"var"]. I like to see that kind of subset as implicit booleans as it enables a yes/no choice but you may very well not agree and only consider that they enable R to select columns and rows.
In your case, the conditions you use are invalid (dataframe$OtherDiagnosis, for example, is not a boolean).
You would need something like rowSums(df[,c("var1","var2","var3")]=="")==3, which is a valid condition.
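The underlying row-wise check can be sketched in Python as well: a patient is coded "Yes" when any of the four diagnosis columns equals "Depression", and "No" otherwise. The column names are taken from the question; the sample patients and the helper name has_depression are made up for illustration.

```python
# The four columns listed in the question.
DIAGNOSIS_COLUMNS = [
    "ComorbidDiagnosis",
    "OtherDiagnosis",
    "DischargeDiagnosis",
    "OtherDischargeDiagnosis",
]


def has_depression(patient):
    """Return "Yes" if any diagnosis column records Depression, otherwise "No"."""
    found = any(patient.get(col, "") == "Depression" for col in DIAGNOSIS_COLUMNS)
    return "Yes" if found else "No"


# Hypothetical patients; missing columns count as empty.
patients = [
    {"OtherDiagnosis": "Depression"},
    {},
    {"OtherDiagnosis": "Depression", "ComorbidDiagnosis": "PTSD"},
    {"ComorbidDiagnosis": "Depression"},
    {"ComorbidDiagnosis": "PTSD"},
]
yn_depression = [has_depression(p) for p in patients]
print(yn_depression)
```

Checking "any column matches" directly also covers patients whose columns hold other diagnoses, which an "all columns empty" test would miss.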
