How to Add Column (script) transform that queries another column for content - azure-machine-learning-workbench

I’m looking for a simple expression that puts a ‘1’ in column E if ‘SomeContent’ is contained in column D. I’m doing this in Azure ML Workbench through their Add Column (script) function. Here are some examples they give.
row.ColumnA + row.ColumnB is the same as row["ColumnA"] + row["ColumnB"]
1 if row.ColumnA < 4 else 2
datetime.datetime.now()
float(row.ColumnA) / float(row.ColumnB - 1)
'Bad' if pd.isnull(row.ColumnA) else 'Good'
Any ideas on a 1 line script I could use for this? Thanks

Without really knowing what you want to look for in column 'D', I still think you can find all the information you need in the examples they give.
The script is being wrapped by a function that collects the value you calculate/provide and puts it in the new column. This assignment happens for each row individually. The value could be a static value, an arbitrary calculation, or it could be dependent on the values in the other columns for the specific row.
In the "Hint" section, you can see two different ways of obtaining the values from the other columns:
The current row is referenced using 'row' and then a column qualifier, for example row.colname or row['colname'].
In your case, you obtain the value for column 'D' with either row.D or row['D'].
After that, all you need to do is come up with the specific logic for checking whether 'SomeContent' is contained in column 'D' for that specific row. In your case, the '1 line script' would look something like this:
1 if [logic ensuring 'SomeContent' is contained in row.D] else 0
If you need help with the logic, you need to provide more specific examples.
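As a minimal sketch of that logic, assuming a plain substring match is what is wanted, the expression typed into the Add Column (script) box could be `1 if 'SomeContent' in str(row.D) else 0`. The snippet below only simulates it outside Workbench: the `flag_some_content` helper and the `SimpleNamespace` stand-ins for `row` are hypothetical and not part of the Workbench API.

```python
from types import SimpleNamespace


def flag_some_content(row, needle="SomeContent"):
    """Return 1 if `needle` occurs in column D of `row`, else 0."""
    # str() guards against missing or non-text values in column D
    return 1 if needle in str(row.D) else 0


# Hypothetical rows standing in for the `row` object Workbench passes in
rows = [
    SimpleNamespace(D="has SomeContent inside"),
    SimpleNamespace(D="nothing here"),
    SimpleNamespace(D=None),
]
flags = [flag_some_content(r) for r in rows]
print(flags)
```

The match is case-sensitive; lowercasing both sides first would make it case-insensitive.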
You can read more in the Azure Machine Learning Documentation:
Sample of custom column transforms (Python)
Data Preparations Python extensions
Hope this helps


Can sqlite-utils convert function select two columns?

I'm using sqlite-utils to load a csv into sqlite which will later be served via Datasette. I have two columns, likes and dislikes. I would like to have a third column, quality-score, by adding likes and dislikes together then dividing likes by the total.
The sqlite-utils convert function should be my best bet, but all I see in the documentation is how to select a single column for conversion.
sqlite-utils convert content.db articles headline 'value.upper()'
From the example given, it looks like convert is followed by the db filename, the table name, then the col you want to operate on. Is it possible to simply add another col name or is there a flag for selecting more than one column to operate on? I would be really surprised if this wasn't possible, I just can't find any documentation to support it.
This isn't a perfect answer as it doesn't resolve whether sqlite-utils supports multiple column selection for transforms, but this is how I solved this particular problem.
Since my quality_score column would just be basic math, I was able to make use of sqlite's Generated Columns. I created a file called quality_score.sql that contained:
ALTER TABLE testtable
ADD COLUMN quality_score GENERATED ALWAYS AS (CAST(likes AS REAL) / (likes + dislikes));
and then implemented it by:
$ sqlite3 mydb.db < quality_score.sql
You do need to make sure you are using a compatible version of sqlite, as this only works with version 3.31 or later.
Another consideration is to make sure you are performing math on numbers (integers or floats) and not text. Keep in mind that SQLite's / operator truncates the result when both operands are integers, so cast one operand to REAL if you want a fractional score.
I also attempted to create the table with the virtual generated column first and then fill it with my data later, but that didn't work in my case: it threw an error saying the number of items provided didn't match the number of columns available. So I just stuck with the ALTER operation after the fact.
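The generated-column approach can be tried end to end with Python's built-in sqlite3 module. This is only a sketch under assumptions: the in-memory database, the `title` column, and the sample rows are made up for illustration, and the expression casts `likes` to REAL so that integer inputs do not truncate. If the bundled SQLite is older than 3.31, the sketch falls back to computing the same expression in a plain SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (title TEXT, likes INTEGER, dislikes INTEGER)")
conn.executemany(
    "INSERT INTO testtable VALUES (?, ?, ?)",
    [("a", 3, 1), ("b", 1, 1)],  # hypothetical sample rows
)

# Cast to REAL so 3 / 4 yields 0.75 instead of integer-dividing to 0
expr = "CAST(likes AS REAL) / (likes + dislikes)"

if sqlite3.sqlite_version_info >= (3, 31, 0):
    # Generated columns need SQLite 3.31+; ALTER TABLE can only add VIRTUAL ones
    conn.execute(
        f"ALTER TABLE testtable ADD COLUMN quality_score GENERATED ALWAYS AS ({expr})"
    )
    query = "SELECT title, quality_score FROM testtable ORDER BY title"
else:
    # Fallback for older SQLite builds: compute the score at query time
    query = f"SELECT title, {expr} AS quality_score FROM testtable ORDER BY title"

scores = dict(conn.execute(query).fetchall())
print(scores)
```

Rows where likes + dislikes is 0 come back as NULL (None in Python), because SQLite returns NULL for division by zero.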

R - looping through the values of a column that has a dot in its name

I'm learning the very basics of the R language.
I would like to loop (with either a for or a while loop, since that's what I'm learning) through all the values of a specific dataset column that is called "Kid.Height". Let's say the dataset is called test.
I can target a standard column like this : "test$KidHeight". But not with a dot in its name ("test$Kid.Height").
I would like to do something like this:
for(i in test$Kid.Height) {
print(Kid.Height.value);
}
So that I can read all the row values of that column
I can't find any example on the web that tells me how to deal with dots in column names.
I know how to target a column by its index but not by name so that it always works, however fancy it is.
PS: since I'm learning the basics, if I can ask you the most clean and recommended way to achieve this, so that I can learn from scratch, I would be grateful.
Thank you.
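Columns with difficult names can also be accessed using square brackets. Single brackets (test['Kid.Height']) return a one-column data frame, while double brackets return the column's values as a vector, which is what you want to loop over. Try:
test[['Kid.Height']]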

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. Haven't really found a built in function that accomplishes what I need. The data is totally unstructured, and changes from row to row with no consistency even within variations of the same product. Does anyone have an idea how to do this?? Or how to write an OpenOffice Calc function to accomplish this? Also open to other better methods of doing this if anyone has any experience or ideas in how to approach this...
OK, so I figured out how to do this on my own. I created many different columns, each with a keyword I was looking to extract as its header.
Then I used this formula to extract the keywords into the correct row beneath the column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
The SEARCH function returns the position of the search string as a number, or an error if the string is not found. ISERROR detects the error condition, and the IF statement leaves the cell blank when there is an error and otherwise takes the value of the header. I had over 100 columns of specific information to extract, plus one final column that joins all the previous cells in the row together into the final list. It worked like a charm, and I recommend this approach to anyone who has to do a similar task.
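For anyone open to doing this outside the spreadsheet, the same idea can be sketched in a few lines of Python. This is an illustration rather than the method used above: `extract_keywords` is a hypothetical helper, the sample string is an abbreviated version of the example data, and the comparison is case-insensitive to mirror how SEARCH behaves.

```python
def extract_keywords(text, keywords):
    """Return the keywords that occur in `text`, in lookup-table order."""
    lowered = text.lower()
    # Case-insensitive substring test, one keyword at a time
    return [kw for kw in keywords if kw.lower() in lowered]


# Abbreviated version of the example row from the question
row = "[ROOT];Earrings;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Gifts>For>Her"
lookup = ["Zircon", "Diamond", "Pearl", "Ruby"]

matches = extract_keywords(row, lookup)
print(", ".join(matches))
```

Joining the matches with a separator plays the role of the final column that concatenates the per-keyword cells.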

Suppress/Filter a row

I am fairly new to using PeopleSoft BI Publisher plugin for MS Word and integrating it with PS Query Manager. My question is whether in the RTF file you can put logic to suppress or filter out data?
I have a for-each grouping that prints a line (row). I would like to add logic to NOT print the line if the Withholding amount field (M.WTHD_AMT) is equal to 0 (zero). My question is what the syntax would look like, and where I should place it (on the For Each grouping below, at the Field level, or somewhere else?). I know I can alter the PS Query (data source) to do the filtering, but I would like to leave that as-is and handle this in the template.
I see that there is another conditional IF statement ("rmt_") so I'm not sure if I can add this additional logic to that element or if I need a separate one. I appreciate any feedback!
EDIT:
I've added a new "Conditional Region" as suggested, and it works with just the WTHD_AMT criterion (not equal to zero). However, I tried adding a second criterion where L.PYMNT_TYPE = 'R', and when I run the process it doesn't display data on the PDF output. Is there something wrong with the syntax? Do I need to have a separate Conditional Region for this second criterion? I've seen another BI report where they have 2 or 3 criteria as part of one element.
<?if:number(M.WTHD_AMT)!=0.00?> and <?if:L.PYMNT_TYPE='R'?>
Option 1
You can nest <?if?> statements. Just add another <?end if?> at the end. Make sure there are no spaces between any of the IF or END IF objects at the beginning or end of the content/row, or else the row may still be displayed.
Option 2
You can add conditions in the repeating section. The following repeats the region for every record where M.WTHD_AMT is not 0.00:
<?for-each:record_path/record[M.WTHD_AMT!='0.00']?>
'Conditional Region' is the button you are looking for.
When using this button, make sure to double check where the if/endif or C/EC elements are added. It tends to ignore the selected element and join the elements to the start and end of the line. You will then need to cut and paste it into the right spot. For you this will probably be right after the F element and before the E element.

R: creating factor using data from multiple columns

I want to create a column that codes for whether patients have had a comorbid diagnosis of depression or not. Problem is, the diagnosis can be recorded in one of 4 columns:
ComorbidDiagnosis;
OtherDiagnosis;
DischargeDiagnosis;
OtherDischargeDiagnosis.
I've been using
levels(dataframe$ynDepression)[levels(dataframe$ComorbidDiagnosis)=="Depression"]<-"Yes"
for all 4 columns but I don't know how to code those who don't have a diagnosis in any of the columns. I tried:
levels(dataframe$ynDepression)[levels(dataframe$DischOtherDiagnosis &
dataframe$OtherDiagnosis &
dataframe$ComorbidDiagnosis &
dataframe$DischComorbidDiagnosis)==""]<-"No"
I also tried using && instead but it didn't work. Am I missing something?
Thanks in advance!
Edit: I tried uploading an image of some example data but I don't have enough reputation to upload images yet. I'll try to put an example here, but it might not work:
Patient ID PrimaryDiagnosis OtherDiagnosis ComorbidDiagnosis
_________AN__________Depression
_________AN
_________AN__________Depression______PTSD
_________AN_________________________Depression
What's inside the [] must be (transformable to) a boolean for the subset to work. For example:
x<-1:5
x[x>3]
#4 5
x>3
# F F F T T
works because the condition is a boolean vector. Sometimes the boolean nature can be implicit, as in dataframe[,"var"], which means dataframe[,colnames(dataframe)=="var"], but R must be able to make it a boolean somehow.
EDIT : As pointed out by beginneR, you can also subset with something like df[,c(1,3)], which is numeric but works the same way as df[,"var"]. I like to see that kind of subset as implicit booleans as it enables a yes/no choice but you may very well not agree and only consider that they enable R to select columns and rows.
In your case, the conditions you use are invalid: dataframe$OtherDiagnosis, for example, is not a boolean vector.
You would need something like rowSums(df[,c("var1","var2","var3")]=="")==3, which is a valid condition.
