H2O categorical_encoding: understanding when to use which option and why - R

I'm trying to understand the pros/cons and when to use the various encoding options that are available to me in h2o with the parameter 'categorical_encoding'.
It would be helpful if people could point out general rules of thumb on how to use this.
Typically I use the 'Enum' value because I like how all categorical values are grouped together when looking at feature importance. On the other hand, xgboost's default value is 'label-encoder', I believe, which breaks things up by categorical level/value.
Unfortunately, I don't really know where to begin or what questions to ask about the other available values:
one_hot_internal
one_hot_explicit
sort_by_response
enum_limited
enum
label_encoder
Again, I primarily stick with enum, and sometimes label_encoder, but honestly I don't know the practical implications of these various options. I would love a generalized understanding of when one might be better than another from someone knowledgeable!
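For reference, here is a minimal sketch of where this parameter plugs in (shown with the Python h2o client and a toy frame made up for illustration; the same categorical_encoding argument exists in h2o's R interface as well):

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()

    # Toy frame; "color" is the categorical predictor, "y" the target.
    df = h2o.H2OFrame({
        "color": ["red", "green", "blue", "red", "green", "blue"],
        "x":     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "y":     [0, 1, 0, 1, 0, 1],
    })
    df["color"] = df["color"].asfactor()
    df["y"] = df["y"].asfactor()

    # Swap in any of: "auto", "enum", "one_hot_internal", "one_hot_explicit",
    # "label_encoder", "sort_by_response", "enum_limited", ...
    model = H2OGradientBoostingEstimator(categorical_encoding="enum",
                                         ntrees=10, min_rows=1)
    model.train(x=["color", "x"], y="y", training_frame=df)

    # The feature-importance listing is where the encodings become visible:
    # with "enum" the categorical column appears as a single feature, while
    # an explicit one-hot encoding splits the importance per level.
    print(model.varimp(use_pandas=False))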

As requested (thanks!), this question was reposted to Cross Validated, so the answer on the pros and cons can be found at: https://stats.stackexchange.com/questions/376203/categorical-encoding-in-h2o-what-is-the-difference-between-the-options

Related

RaptorQ FEC Implementation Obstacle

I am trying to implement the RaptorQ Forward Error Correction Scheme in java as specified here:
https://datatracker.ietf.org/doc/html/draft-ietf-rmt-bb-fec-raptorq-04#section-5.3.3
The core of the problem is to perform Gaussian elimination on a matrix A in a smart way so that it is fast.
The matrix A is composed of submatrices; among them are G_LDPC,1 and G_LDPC,2
(generator matrices for Low Density Parity Checks).
On page 22, in section "5.3.3.3. Pre-coding relationships", it is stated that these matrices can be deduced from the code snippet on the same page.
My problem: I am not able to derive the structure of these two submatrices from the code snippet.
Does anyone see how to do that, or what the structure looks like?
Thanks for any kind of help!
Max
I'm also trying to implement RaptorQ, and ran into exactly the same problem. My suggestion is this book:
Raptor Codes (Foundations and Trends(R) in Communications and Information Theory) [Paperback]
Amin Shokrollahi (Author), Michael Luby (Author)
It has a better explanation of constructing the constraint matrix in section 3.3.3 (I'd quote it, but I don't have a digital copy).
@Max, is there any way we can chat, or could you share your RFC 5053 implementation? I could really use someone familiar with these difficulties to talk to and share some doubts/ideas with.
After being stuck with the problem, I decided to implement the Raptor codec according to RFC 5053 as described here:
https://www.rfc-editor.org/rfc/rfc5053
This is actually the predecessor version of RaptorQ.
The general working principle seems to be the same, but it is less optimized and therefore has worse properties, especially in terms of reception efficiency.
But on the other hand it was less complex and more intuitive to me, and therefore I was able to code a working implementation in Java.
All in all, I have to admit that I'm quite astonished by the capabilities of the resulting codec!
With the deeper understanding gained from coding the RFC 5053 implementation, I can probably now implement the RaptorQ codec as well.
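A follow-up note for anyone stuck at the same point: one generic trick for recovering the structure of G_LDPC,1 and G_LDPC,2 from this kind of symbol-level pseudocode is to run the loop symbolically over unit bit-vectors instead of data symbols; whichever source indices get XORed into a check accumulator are exactly the nonzero entries of that row of the submatrix. The sketch below (Python, over GF(2)) uses a placeholder loop that only mimics the general shape of the spec's pre-coding code, not the actual constants from section 5.3.3.3, so substitute the real loop from the draft.

    import numpy as np

    def ldpc_submatrix(K, S, precoding_loop):
        """Recover an S x K binary matrix from a loop that XORs source
        symbols C[0..K-1] into check accumulators D[0..S-1]."""
        # Represent C[i] as the i-th unit vector over GF(2); every
        # "D[b] ^= C[i]" then marks column i in row b of the matrix.
        C = np.eye(K, dtype=np.uint8)
        D = np.zeros((S, K), dtype=np.uint8)
        precoding_loop(C, D)
        return D

    def example_loop(C, D):
        # Placeholder with only the rough shape of the spec's LDPC loop;
        # replace this with the actual code snippet from the draft.
        K, S = C.shape[0], D.shape[0]
        for i in range(K):
            a = 1 + (i // S) % (S - 1)
            b = i % S
            for _ in range(3):          # each source symbol hits 3 checks
                D[b] ^= C[i]
                b = (b + a) % S

    print(ldpc_submatrix(K=10, S=7, precoding_loop=example_loop))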

Static code analysis tool for Common Lisp?

I'm busy learning Common Lisp, and I'm looking for a static code analysis tool that will help me develop better style and avoid falling into common traps.
I've found Lisp Critic and I think it looks good, but I was hoping that someone may be able to recommend some other tools, and / or share their experiences with them.
Given the dynamic nature of Lisp, static analysis ranges from tough to impossible, depending on the type of source code.
For some purposes I would recommend using the SBCL compiler. Check out its manual for the features it provides. One feature is some form of type inference. It also provides a lot of standard warnings for things like undeclared variables, type problems, calling functions with the wrong number of arguments, using undefined functions, violating the ANSI CL standard in various ways, and more.
The best way to learn about good style is to read a lot of code and ask for others to review your code. This isn't something that's specific to Common Lisp.
I think that one great tool to use is lisp-critic; you can get some information
here:
http://quickdocs.org/lisp-critic/
and a remake done by @Xach:
http://xach.com/lisp/

Is there a set of "universal" error/exception codes?

I'm a bit of a polyglot when it comes to programming languages, and most of the languages I use have Error/Exception handling of some sort.
In most languages there's a default implementation of error IDs with their associated messages, but I've never found a list of production codes to base my own error codes on.
Does such a thing exist?
If not would it be useful, or just noise that most programmers ignore?
The closest thing I can think of is POSIX error constants (though their numeric values are not standardized).
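To illustrate that point, here is a small Python sketch; the symbolic names are the POSIX-standardized part, while the printed numbers are whatever the local platform happens to use:

    import errno, os

    # Portable code compares against the named constants, never the raw numbers.
    for name in ("ENOENT", "EACCES", "ECONNREFUSED"):
        code = getattr(errno, name)
        print(name, code, os.strerror(code))

    try:
        open("/no/such/file")
    except OSError as e:
        # errno.errorcode maps the platform-specific number back to its name.
        print("caught:", errno.errorcode[e.errno], "-", e.strerror)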
Short answer - no, it doesn't exist. Every OS, platform and piece of software pretty much has its own error IDs. These are not synchronized or based on any standard set.
I would say that apart from the common errors, this would indeed just be noise, and even with the common ones, someone would need to standardize them and ensure they are used universally.

Fiddling with point-free code?

I have been learning the Factor and J languages to experiment with point-free programming. The basic mechanics of the languages seem clear, but getting a feeling for how to approach algorithm design is a challenge.
A particular source of confusion for me is how one should structure code so that it is easy to experiment with different parameters. By this, I mean the sort of thing Mathematica and Matlab are so good at; you set up an algorithm then manipulate the variables and watch what happens.
How do you do this without explicit variables? Maybe I'm thinking about this all wrong. How should I approach this in point-free programming?
Here are three pieces of advice that I found really helpful when dealing with the concatenative paradigm (applied to the Factor programming language in my case):
Factor your code mercilessly. Write extremely small functions: if there are more than 3-4 stack parameters, you can probably break it into smaller parts.
Invest your time in learning the dataflow combinators (bi, tri, cleave, spread, ...). They let you express common dataflow patterns while removing the need for complex stack shuffling.
Learn to build quotations from other quotations. Use currying techniques (curry, with, ...) to build simple quotations from stack parameters, and when things get too complex, use fried quotations (the "fry" vocab). They let you easily build complex nested quotations from patterns without any stack shuffling. (A rough Python analogy of this parameter-binding idea is sketched after this answer.)
And, as always, read and step through existing code. In Factor it is quite easy to explore the runtime and see how things work.
For your specific source of confusion: if you have a lot of input parameters in your algorithm, the most important thing to do is to study how they will be used. Hunt for dataflow patterns. You really must THINK about the best way to "schedule" operations on the smallest set of related parameters.
It is quite a difficult experience, but it is also a really rewarding one when it succeeds. You feel like a human compiler afterwards.
Good luck!
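The parameter-binding idea mentioned above, as a loose Python analogy (not Factor; compose and the two toy stages below are made up for illustration): bind the "experiment knobs" with partial application and rebuild the pipeline instead of naming intermediate variables, which is roughly what curry/with/fry give you on the stack.

    from functools import partial, reduce

    def compose(*fs):
        """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
        return lambda x: reduce(lambda acc, f: f(acc), fs, x)

    # Two tiny pipeline stages, each with an explicit "knob" as first argument.
    scale     = lambda k, xs: [k * x for x in xs]
    threshold = lambda t, xs: [x for x in xs if x > t]

    # Experimenting means re-binding the knobs and re-running the pipeline;
    # no intermediate variables are needed.
    pipeline = compose(partial(scale, 3), partial(threshold, 10))
    print(pipeline([1, 2, 5, 7]))   # [15, 21]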
I have had a little experience with the concatenative programming language Joy and with a Backus FP-like language. Regarding algorithm design, I can say that it is a very structured process.
How do you do this without explicit variables?
In fact, there are no global variables in languages like Backus FP. However, there is nothing to prevent the use of somewhat restricted local variables, such as instance variables.

Should I use an expression parser in my Math game?

I'm writing some children's Math Education software for a class.
I'm going to try to present randomly generated math problems of different types to students of varying skill levels, in fun ways.
One of the frustrations of using computer-based math software is its rigidity. If you have ever taken an online math class, you'll know all about the frustration of taking an online quiz and having your correct answer thrown out because it isn't formatted exactly the way they expect, or because of some weird spacing issue.
So, originally I thought, "I know! I'll use an expression parser on the answer box so I'll be able to evaluate anything they enter and even if it isn't in the same form I'll be able to check if it is the same answer." So I fire up my IDE and start implementing the Shunting Yard Algorithm.
This would solve the problem of correct answers being rejected because a fraction isn't in its smallest form, along with other issues.
However, it then hit me that a tricky student would simply be able to enter most of the problems into the answer box, and my expression parser would dutifully parse and evaluate it to the correct answer!
So, should I not be using an expression parser in this instance? Do I really have to generate a single form of the answer and do a string comparison?
One possible solution is to note how many steps your expression evaluator takes to evaluate the problem's original expression, and to compare this to the optimal answer. If there's too much difference, then the problem hasn't been reduced enough and you can suggest that the student keep going.
Don't be surprised if students come up with better answers than your own definition of "optimal", though! I was a TA/grader for several classes, and the brightest students routinely had answers on their problem sets that were superior to the ones provided by the professor.
For simple problems where you're looking for an exact answer, then removing whitespace and doing a string compare is reasonable.
For more advanced problems, you might do the Shunting Yard Algorithm (or similar) but perhaps parametrize it so you could turn on/off reductions to guard against the tricky student. You'll notice that "simple" answers can still use the parser, but you would disable all reductions.
For example, on a division question, you'd disable the "/" reduction.
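A hedged sketch of that idea in Python, using the standard ast module in place of a hand-rolled shunting-yard parser (the evaluate helper and the operator whitelist are mine): the student's answer is parsed and evaluated as usual, but any operator on the disabled list makes the answer invalid, so typing the original problem back in is rejected.

    import ast, operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expr, disabled=()):
        """Evaluate a numeric expression, refusing any disabled operator."""
        def walk(node):
            if isinstance(node, ast.Constant):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                if type(node.op) in disabled:
                    raise ValueError("operator not allowed in an answer")
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            raise ValueError("unsupported syntax")
        return walk(ast.parse(expr, mode="eval").body)

    print(evaluate("24/6"))                        # 4.0
    # On a division drill, disable "/" so "24/6" is rejected but "4" passes:
    # evaluate("24/6", disabled=(ast.Div,)) raises ValueError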
This is a great question.
If you are writing an expression system and an evaluation/transformation/equivalence engine (isn't there one available somewhere? I'm almost 100% sure there is an open-source one), then it's more of an education/algebra problem: is the student's answer algebraically closer to the original expression or to the expected expression?
I'm not sure how to answer that, but here is an idea (not necessarily practical): perhaps your evaluation engine can count transformation steps to equivalence. If the answer takes fewer steps to reach the expected expression than to reach the original, it might be OK. If it's too close to the original, it's not.
You could use an expression parser, but apply restrictions on the complexity of the expressions permitted in the answer.
For example, if the goal is to reduce (4/5)*(1/2) and you want to allow either (2/5) or (4/10), then you could restrict the set of allowable answers to expressions whose trees take the form (x/y) and which also evaluate to the correct number. Perhaps you would also allow "0.4", i.e. expressions of the form (x) which evaluate to the correct number.
This is exactly what you would (implicitly) be doing if you graded the problem manually -- you would be looking for an answer that is correct but which also falls into an acceptable class.
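A small sketch of that "acceptable class" check in Python (the accept helper is made up for illustration): the answer must be either a bare number or a single fraction x/y, and it must evaluate to the expected value, so 2/5, 4/10, and 0.4 all pass while (4/5)*(1/2) is rejected even though it evaluates to the right number.

    import ast
    from fractions import Fraction

    def accept(answer, expected):
        """True if the answer is a bare number or a single fraction x/y equal to expected."""
        tree = ast.parse(answer, mode="eval").body
        if isinstance(tree, ast.Constant):                     # e.g. "0.4"
            value = Fraction(str(tree.value))
        elif (isinstance(tree, ast.BinOp) and isinstance(tree.op, ast.Div)
              and isinstance(tree.left, ast.Constant)
              and isinstance(tree.right, ast.Constant)):       # e.g. "4/10"
            value = Fraction(tree.left.value, tree.right.value)
        else:
            return False                                       # wrong shape
        return value == Fraction(expected)

    for ans in ("2/5", "4/10", "0.4", "(4/5)*(1/2)"):
        print(ans, accept(ans, "2/5"))    # True, True, True, False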
The usual way of doing this in mathematics assessment software is to allow the question setter to specify expressions/strings that are not allowed in a correct answer.
If you happen to be interested in existing software, there's the open-source Stack http://www.stack.bham.ac.uk/ (or various commercial options such as MapleTA). I suspect most of the problems that you'll come across have also been encountered by Stack so even if you don't want to use it, it might be educational to look at how it approaches things.
