Sagemaker Pipelines | Pass list of strings as parameter - pipeline

SageMaker Pipelines only has Parameter classes for single values (a string, a float, etc.), but how can I deal with a parameter that is best represented by a list (e.g. the list of features to select for training from a file with many features)?

Background: as a general best practice, feature names (e.g., the column names of a pandas DataFrame) should not contain spaces.
Base case
To work around the problem, you can pass a single string parameter in which the feature names are separated by spaces.
features = "feature_0 feature_1 feature_2"
and then use it as usual with ParameterString.
If the names cannot be space-free, I recommend putting a distinct separator between the names instead of a space, and splitting the whole string into a list of features later.
In the training script, pass the parameter to an ArgumentParser, which you can configure to collect the space-separated words into a list of individual strings.
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--features",
        nargs="*",
        type=str,
        default=[]
    )
    args, _ = parser.parse_known_args()
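To see what nargs="*" does with different inputs, here is a minimal sketch that feeds the parser explicit argument lists instead of reading the real command line. The build_parser helper and the sample argument lists are assumptions made for illustration; how the pipeline actually hands the string to the script is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: same --features option as in the training script.
    parser = argparse.ArgumentParser()
    parser.add_argument("--features", nargs="*", type=str, default=[])
    return parser


parser = build_parser()

# Names arriving as separate command-line tokens become a list directly.
separate, _ = parser.parse_known_args(
    ["--features", "feature_0", "feature_1", "feature_2"]
)
print(separate.features)

# A single token that contains spaces stays one list element,
# so it still has to be split on whitespace afterwards.
joined, _ = parser.parse_known_args(
    ["--features", "feature_0 feature_1 feature_2"]
)
print(joined.features)
flattened = [name for token in joined.features for name in token.split()]
print(flattened)
```

If the second situation occurs in practice, the whitespace split on the last lines is the repair step.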
Extra case
If the whole string ends up as a single list element when the argument is passed to a pipeline component (e.g., to a preprocessor), the input can be repaired with a small reinterpretation function.
import itertools

def decode_list_of_strings_input(str_input: list) -> list:
    # Split every element on whitespace, then flatten the pieces into one list.
    str_input = [s.split() for s in str_input]
    return list(itertools.chain.from_iterable(str_input))
Here is an example of the use of this code:
features = ['a b c']
features = decode_list_of_strings_input(features)
print(features)  # ['a', 'b', 'c']

Related

Why are names(x)<-y and "names<-"(x,y) not equivalent?

Consider the following:
y<-c("A","B","C")
x<-z<-c(1,2,3)
names(x)<-y
"names<-"(z,y)
If you run this code, you will discover that names(x)<-y is not identical to "names<-"(z,y). In particular, one sees that names(x)<-y actually changes the names of x whereas "names<-"(z,y) returns z with its names changed.
Why is this? I was under the impression that the difference between writing a function normally and writing it as an infix operator was only one of syntax, rather than something that actually changes the output. Where in the documentation is this difference discussed?
Short answer: names(x)<-y is actually sugar for x<-"names<-"(x,y) and not just "names<-"(x,y). See the R-lang manual, pages 18-19 (pages 23-24 of the PDF), which gives basically the same example.
For example, names(x) <- c("a","b") is equivalent to:
`*tmp*`<-x
x <- "names<-"(`*tmp*`, value=c("a","b"))
rm(`*tmp*`)
If you are more familiar with getters and setters, you can think of it this way: if somefunction is a getter function, somefunction<- is the corresponding setter. In R, where each object is immutable, it is more correct to call the setter a replacement function, because the function actually creates a new object identical to the old one, but with an attribute added, modified, or removed, and that new object then replaces the old one.
In the example, for instance, the names attribute is not just added to x; rather, a new object with the same values as x but with the names is created and bound to the x symbol.
Since there are still some doubts about why the issue is discussed in the language documentation instead of directly in ?names, here is a small recap of this property of the R language.
You can define a function with any name you wish (there are some restrictions, of course), and the name has no effect on how the function behaves when it is called "normally".
However, if you name a function with the <- suffix, it becomes a replacement function, and R applies the mechanism described at the beginning of this answer when the function is called with the syntax foo(x)<-value. Note that you do not call foo<- explicitly; the slightly different syntax gives you an object replacement because of the name.
Although there are no formal restrictions, it is common to define getters and setters in R with the same name (for instance names and names<-). In this case, the function with the <- suffix is the replacement function for the corresponding version without the suffix.
As stated at the beginning, this behaviour is general and a property of the language, so it doesn't need to be discussed in the documentation of any individual replacement function.
In particular, one sees that names(x)<-y actually changes the names of x whereas "names<-"(z,y) returns z with its names changed.
That’s because `names<-` [1] is a regular function, albeit with an odd name [2]. It performs no assignment; it returns a new object with the names attribute set. In fact `names<-` is a primitive function in R but it could be implemented as follows (there are shorter, better ways of writing this in R, but I want the separate steps to be explicit):
`names<-` = function (x, value) {
  new = x
  attr(new, 'names') = value
  new
}
That is, it
… creates a new object that’s a copy of x,
… sets the names attribute on that newly created object, and
… returns the new object.
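The same three steps can be mimicked outside R. The following is only a loose Python analogy, not a description of R's internals: the Vec class and this with_names function are hypothetical stand-ins for an immutable R vector and for `names<-`.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Vec:
    # Hypothetical immutable stand-in for an R vector with an optional names attribute.
    values: Tuple[float, ...]
    names: Optional[Tuple[str, ...]] = None


def with_names(x: Vec, value) -> Vec:
    # Builds a copy of x with the names field set; x itself is not modified.
    return replace(x, names=tuple(value))


y = ["A", "B", "C"]
x = Vec((1, 2, 3))
z = Vec((1, 2, 3))

with_names(z, y)      # like "names<-"(z, y): the result is returned, then discarded
x = with_names(x, y)  # like names(x) <- y: the new object is bound back to x

print(z.names)
print(x.names)
```

Only the second call changes what the variable refers to, because only there is the returned object assigned back.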
Since virtually all objects in R are immutable, this fits naturally into R’s semantics. In fact, a better name for this exact function would be with_names [3]. But the creators of R found it convenient to be able to write such an assignment without repeating the name of the object. So instead of writing
x = with_names(x, c('foo', 'bar'))
or
x = `names<-`(x, c('foo', 'bar'))
R allows us to write
names(x) = c('foo', 'bar')
R handles this syntax specially by internally converting it to another expression, documented in the Subset assignment section of the R language definition, as explained in the answer by Nicola.
But the gist is that names(x) = y and `names<-`(x, y) are different because … they just are. The former is a special syntactic form that gets recognised and transformed by the R parser. The latter is a regular function call, and the weird function name is a red herring: it doesn’t affect the execution whatsoever. It does the same as if the function was named differently, and you can confirm this by assigning it a different name:
with_names = `names<-`
`another weird(!) name` = `names<-`
# These are all identical:
`names<-`(x, y)
with_names(x, y)
`another weird(!) name`(x, y)
[1] I strongly encourage using backtick quotes (`) instead of straight quotes (' or ") to quote R variable names. While both are allowed in some circumstances, the latter invites confusion with strings, and is conceptually bonkers. These are not strings. Consider:
"a" = "b"
"c" = "a"
Rather than copy the value of a into c, what this code actually does is set c to literal "a", because quotes now mean different things on the left- and right-hand side of assignment.
The R documentation confirms that
The preferred quote [for variable names] is the backtick (`)
[2] Regular variable names (aka “identifiers” or just “names”) in R can only contain letters, digits, underscore and the dot, must start with a letter, or with a dot not followed by a digit, and can’t be reserved words. But R allows using pretty much arbitrary characters (including punctuation and even spaces!) in variable names, provided the name is backtick-quoted.
[3] In fact, R has an almost-alias for this function, called setNames, which isn’t a great name, since set… implies mutating the object, but of course it doesn’t do that.

Building an Expression tree for Comparison Operations with value concatenated e.g ">= 1"

The following is a great example of how to create an Expression tree when the operator and value are passed into the method as separate parameters.
Get list on basis of dropdownlist data in asp.net mvc3
In my example I have several dropdown boxes where the operator and value are combined, e.g. ">= 1", "< 3" etc. I could potentially split this into operator and value, passing both to the example above, but I was wondering if there is a simpler way to write the expression where I can just use the operator and value as one parameter, replacing the MakeBinary method with an alternative.
I am relatively new to Expression trees so some guidance would be helpful. Thanks.
No. Expression trees are quite low level, and don't handle string->code. They aren't eval; they build code at runtime (technically, they build a descriptor for code at runtime which, if you really want, you can compile).
Use a regex to split the operator and the value, if they are in the form "<something".
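using System.Text.RegularExpressions;

// Group 1 captures the operator (<, <=, >, >=, ==); group 2 captures the rest.
var rx = new Regex("([<>]=?|==)(.*)");
string str = "<=1234";
var match = rx.Match(str);
string op = match.Groups[1].Value;
// Trim so that inputs with a space, such as ">= 1", yield "1".
string val = match.Groups[2].Value.Trim();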
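The split-then-dispatch idea is not specific to C#. As a rough sketch in Python (not an Expression tree, and not part of the original answer), the operator text can be looked up in a table of comparison functions once it has been separated from the value. The names OPERATORS, parse_condition and make_predicate are made up for this example, and the value is assumed to be numeric.

```python
import operator
import re

# Map each comparison symbol to the matching function from the operator module.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}

# Two-character operators come first so "<=" is not read as "<" followed by "=".
CONDITION = re.compile(r"\s*(<=|>=|==|<|>)\s*(.+?)\s*")


def parse_condition(text):
    """Split a combined string such as ">= 1" into (operator, value)."""
    match = CONDITION.fullmatch(text)
    if match is None:
        raise ValueError(f"not an operator/value pair: {text!r}")
    return match.group(1), match.group(2)


def make_predicate(text):
    """Return a one-argument function that applies the parsed comparison."""
    symbol, raw_value = parse_condition(text)
    threshold = float(raw_value)  # assumption: the value is numeric
    compare = OPERATORS[symbol]
    return lambda candidate: compare(candidate, threshold)


print(parse_condition(">= 1"))
print(make_predicate("< 3")(2))
```

In the C# setting, the looked-up entry would be an ExpressionType passed on to MakeBinary rather than a Python function.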

How to access a list-within-a-list inside a hash env in R, like for a Python dict

I am trying to use the hash package in R to replicate dictionary behavior in python. I have created it like this,
library(hash)
titles = hash(NAME = list("exact"=list('NAME','Age'), "partial"=list()),
              Dt = list("exact"=list('Dt'), "partial"=list()),
              CC = list("exact"=list(), "partial"=list()))
I can access the keys in the hash using keys(titles), the values using values(titles), and the values for a particular key using values(titles['NAME']).
But how can I access the elements of the inner list, e.g. list('NAME','Age')?
I need to access the elements by name, in this case "exact", or else I need to know which element of the outer list an element belongs to, whether it's "exact" or "partial".
Simply:
titles[["NAME"]][["exact"]]
as hrbmstr wrote. There's nothing special about this whatsoever.
In your nested-list, "exact" and "partial" are simply two string keys. Again, there's no special magic significance to their names.
Also, this is in fact the recommended proper R syntax (esp. when the key is variable), it's not "bringing gosh-awful Python syntax".
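For comparison, this is roughly the Python dict behaviour the question is trying to replicate. It is a sketch only: the find_match_kind helper is a hypothetical name introduced here to answer the "which inner list does this element belong to" part.

```python
titles = {
    "NAME": {"exact": ["NAME", "Age"], "partial": []},
    "Dt": {"exact": ["Dt"], "partial": []},
    "CC": {"exact": [], "partial": []},
}

# Chained lookups, the counterpart of titles[["NAME"]][["exact"]] in R.
exact_names = titles["NAME"]["exact"]
print(exact_names)


def find_match_kind(titles, key, element):
    # Return the inner key ("exact" or "partial") whose list contains element,
    # or None when the element is in neither list.
    for kind, elements in titles[key].items():
        if element in elements:
            return kind
    return None


print(find_match_kind(titles, "NAME", "Age"))
```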

Token matching order in PLY

I have a parser written in PLY that has the following token definition
def t_COMMAND(t):
    r'create|show'
    return t

def t_SCOPE(t):
    r'user|domain'
    return t

def t_STRING(t):
    r'[a-zA-Z_#\*\.]*'
    return t
I am trying to parse the following string
show user where created_on = foo
Here is my grammar
S:COMMAND SCOPE FILTER;
FILTER:WHERE EXP |;
EXP:STRING OP STRING
...
I get a syntax error at the created_on token, probably because its prefix create gets matched as a COMMAND rather than the whole word being matched as a STRING.
Is there a way to make PLY take the largest possible match?
I found two possible approaches:
Use a reserved-words tuple and append it to the token list, as described in "Specification of tokens".
If possible, add quotes to the STRING as '[a-zA-Z_#\*\.]*', so that it can be distinguished from COMMAND.
I chose the second approach, as I have a lot of so-called reserved words.
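The sketch below uses only the standard re module, not PLY itself, to illustrate both the problem and the reserved-words idea. It assumes that rules are tried in the order written, which is how PLY treats tokens defined as functions. The RESERVED table, the mapping of "where" to WHERE and of "=" to OP, and the use of + instead of * in the STRING pattern (to avoid empty matches) are assumptions made for this example.

```python
import re

# Alternatives are tried left to right, mirroring rules defined in this order.
ORDERED_RULES = re.compile(
    r"(?P<COMMAND>create|show)|(?P<SCOPE>user|domain)|(?P<STRING>[a-zA-Z_#*.]+)"
)

# The COMMAND alternative wins on the prefix "create" of "created_on".
first = ORDERED_RULES.match("created_on")
print(first.lastgroup, first.group())

# Reserved-words approach: match the whole word first, then decide its type.
RESERVED = {
    "create": "COMMAND",
    "show": "COMMAND",
    "user": "SCOPE",
    "domain": "SCOPE",
    "where": "WHERE",
}
WORD_OR_OP = re.compile(r"(?P<STRING>[a-zA-Z_#*.]+)|(?P<OP>=)")


def tokenize(text):
    tokens = []
    # finditer skips characters that match nothing, such as the spaces here.
    for match in WORD_OR_OP.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "STRING":
            # Only an exact reserved word changes type; "created_on" stays STRING.
            kind = RESERVED.get(value, "STRING")
        tokens.append((kind, value))
    return tokens


for token in tokenize("show user where created_on = foo"):
    print(token)
```

In a PLY lexer the same lookup would typically sit inside the t_STRING function, reassigning t.type from the reserved table before returning t.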

Acquiring all nodes that have ids beginning with "ABC"

I'm attempting to scrape a page that has about 10 columns using Ruby and Nokogiri, with most of the columns being pretty straightforward by having unique class names. However, some of them have class ids that seem to have long number strings appended to what would be the standard class name.
For example, gametimes are all picked up with .eventLine-time, team names with .team-name, but this particular one has, for example:
<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>
.eventLine-book-value is not specific to this column, so it's not useful. The 13 digits are different for every game, and trying something like:
def nodes_by_selector(filename,selector)
  file = open(filename)
  doc = Nokogiri::HTML(file)
  doc.css(^selector)
end
Has left me with errors. I've seen ^ and ~ be used in other languages, but I'm new to this and I have tried searching for ways to pick up all data under id=eventLineOpener-XXXX to no avail.
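To pick up all data under id=eventLineOpener-XXXX, you need to pass 'div[id*=eventLineOpener]' as the selector (*= matches ids that contain the text; 'div[id^=eventLineOpener]' would match only ids that begin with it):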
require 'nokogiri'

def nodes_by_selector(filename, selector)
  file = open(filename)
  doc = Nokogiri::HTML(file)
  doc.css(selector) # e.g. doc.css('div[id*=eventLineOpener]')
end
The above method returns a Nokogiri::XML::NodeSet containing the Nokogiri::XML::Element objects whose id matches eventLineOpener-XXXX.
To extract the content of these elements, iterate over the set and call the text method on each one. For example:
doc.css('div[id*=eventLineOpener]')[0].text
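For readers working outside Ruby, the same "id begins with a prefix" filter can be sketched with Python's standard html.parser module. This is a stand-in for illustration, not Nokogiri: the IdPrefixCollector class is hypothetical, the second div in the sample markup is invented, and the sketch assumes the matching elements do not contain nested tags of the same name.

```python
from html.parser import HTMLParser


class IdPrefixCollector(HTMLParser):
    """Collect (id, text) pairs for elements whose id starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.results = []
        self._current_id = None  # id of the element being collected, if any
        self._current_tag = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id") or ""
        if self._current_id is None and element_id.startswith(self.prefix):
            self._current_id = element_id
            self._current_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested element with the same tag name inside the match.
        if self._current_id is not None and tag == self._current_tag:
            self.results.append((self._current_id, "".join(self._chunks)))
            self._current_id = None


html = (
    '<div class="eventLine-book-value" '
    'id="eventLineOpener-118079-19-1522-1">-3 -120</div>'
    '<div class="eventLine-book-value" id="other-1">ignored</div>'
)
collector = IdPrefixCollector("eventLineOpener")
collector.feed(html)
print(collector.results)
```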
