Not sure what keys should be in class_weight dict to pass to Keras model.fit() method - dictionary

I'm working on a text classification problem with very imbalanced classes, of which there are 94 total. I think that passing a dictionary of class weights to the class_weight parameter in model.fit() might prove fruitful for me. I've read up on how to generate good ratios; that's not a problem.
If I understand correctly, the dictionary should contain pairs of (class: weight). But in my model's parlance, a sample's class (or label) is a one-hot encoded vector — not an integer (which I've seen as the keys in discussion online about the class_weight parameter, and which confuses me), and certainly not the label string itself. I can't have a vector as a key in a dictionary... I'd appreciate any insight.
I have tried a couple things. I did try making a dictionary of (class: weight) pairs in which the class value was merely an integer (maybe Keras ingeniously knows to associate an integer label with the one-hot vectors whose 1 is in the associated index — wishful thinking!), but performance didn't change. Then I updated the dictionary to feature an INCORRECT number of classes, and got no error messages and the same performance. So... fit was just neglecting it, or...?
Any advice would be greatly appreciated.

Related

The Idiomatic Way To Do OOP, Types and Methods in Julia

As I'm learning Julia, I am wondering how to properly do things I might have done in Python, Java or C++ before. For example, previously I might have used an abstract base class (or interface) to define a family of models through classes. Each class might then have a method like calculate. So to call it I might have model.calculate(), where the model is an object from one of the inheriting classes.
I get that Julia uses multiple dispatch to overload functions with different signatures such as calculate(model). The question I have is how to create different models. Do I use the type system for that and create different types like:
abstract type Model end
type BlackScholes <: Model end
type Heston <: Model end
where BlackScholes and Heston are different types of model? If so, then I can overload different calculate methods:
function calculate(model::BlackScholes)
# code
end
function calculate(model::Heston)
# code
end
But I'm not sure if this is a proper and idiomatic use of types in Julia. I will greatly appreciate your guidance!
This is a hard question to answer. Julia offers a wide range of tools to solve any given problem, and it would be hard for even a core developer of the language to assert that one particular approach is "right" or even "idiomatic".
For example, in the realm of simulating and solving stochastic differential equations, you could look at the approach taken by Chris Rackauckas (and many others) in the suite of packages under the JuliaDiffEq umbrella. However, many of these people are extremely experienced Julia coders, and what they do may be somewhat out of reach for less experienced Julia coders who just want to model something in a manner that is reasonably sensible and attainable for a mere mortal.
It is is possible that the only "right" answer to this question is to direct users to the Performance Tips section of the docs, and then assert that as long as you aren't violating any of the recommendations there, then what you are doing is probably okay.
I think the best way I can answer this question from my own personal experience is to provide an example of how I (a mere mortal) would approach the problem of simulating different Ito processes. It is actually not too far off what you have put in the question, although with one additional layer. To be clear, I make no claim that this is the "right" way to do things, merely that it is one approach that utilizes multiple dispatch and Julia's type system in a reasonably sensible fashion.
I start off with an abstract type, for nesting specific subtypes that represent specific models.
abstract type ItoProcess ; end
Now I define some specific model subtypes, e.g.
struct GeometricBrownianMotion <: ItoProcess
mu::Float64
sigma::Float64
end
struct Heston <: ItoProcess
mu::Float64
kappa::Float64
theta::Float64
xi::Float64
end
Note, in this case I don't need to add constructors that convert arguments to Float64, since Julia does this automatically, e.g. GeometricBrownianMotion(1, 2.0) will work out-of-the-box, as Julia will automatically convert 1 to 1.0 when constructing the type.
However, I might want to add some constructors for common parameterizations, e.g.
GeometricBrownianMotion() = GeometricBrownianMotion(0.0, 1.0)
I might also want some functions that return useful information about my models, e.g.
number_parameter(model::GeometricBrownianMotion) = 2
number_parameter(model::Heston) = 4
In fact, given how I've defined the models above, I could actually be a bit sneaky and define a method that works for all subtypes:
number_parameter(model::T) where {T<:ItoProcess} = length(fieldnames(typeof(model)))
Now I want to add some code that allows me to simulate my models:
function simulate(model::T, numobs::Int, stval) where {T<:ItoProcess}
# code here that is common to all subtypes of ItoProcess
simulate_inner(model, somethingelse)
# maybe more code that is common to all subtypes of ItoProcess
end
function simulate_inner(model::GeometricBrownianMotion, somethingelse)
# code here that is specific to GeometricBrownianMotion
end
function simulate_inner(model::Heston, somethingelse)
# code here that is specific to Heston
end
Note that I have used the abstract type to allow me to group all code that is common to all subtypes of ItoProcess in the simulate function. I then use multiple dispatch and simulate_inner to run any code that needs to be specific to a particular subtype of ItoProcess. For the aforementioned reasons, I hesitate to use the phrase "idiomatic", but let me instead say that the above is quite a common pattern in typical Julia code.
The one thing to be careful of in the above code is to ensure that the output type of the simulate function is type-stable, that is, the output type can be uniquely determined by the input types. Type stability is usually an important factor in ensuring performant Julia code. An easy way in this case to ensure type-stability is to always return Matrix{Float64} (if the output type is fixed for all subtypes of ItoProcess then obviously it is uniquely determined). I examine a case where the output type depends on input types below for my estimate example. Anyway, for simulate I might always return Matrix{Float64} since for GeometricBrownianMotion I only need one column, but for Heston I will need two (the first for price of the asset, the second for the volatility process).
In fact, depending on how the code is used, type-stability is not always necessary for performant code (see eg using function barriers to prevent type-instability from flowing through to other parts of your program), but it is a good habit to be in (for Julia code).
I might also want routines to estimate these models. Again, I can follow the same approach (but with a small twist):
function estimate(modeltype::Type{T}, data)::T where {T<:ItoProcess}
# again, code common to all subtypes of ItoProcess
estimate_inner(modeltype, data)
# more common code
return T(some stuff generated from function that can be used to construct T)
end
function estimate_inner(modeltype::Type{GeometricBrownianMotion}, data)
# code specific to GeometricBrownianMotion
end
function estimate_inner(modeltype::Type{Heston}, data)
# code specific to Heston
end
There are a few differences from the simulate case. Instead of inputting an instance of GeometricBrownianMotion or Heston, I instead input the type itself. This is because I don't actually need an instance of the type with defined values for the fields. In fact, the values of those fields is the very thing I am attempting to estimate! But I still want to use multiple dispatch, hence the ::Type{T} construct. Note also I have specified an output type for estimate. This output type is dependent on the ::Type{T} input, and so the function is type-stable (output type can be uniquely determined by input types). But common with the simulate case, I have structured the code so that code that is common to all subtypes of ItoProcess only needs to be written once, and code that is specific to the subtypes is separted out.
This answer is turning into an essay, so I should tie it off here. Hopefully this is useful to the OP, as well as anyone else getting into Julia. I just want to finish by emphasizing that what I have done above is only one approach, there are others that will be just as performant, but I have personally found the above to be useful from a structural perspective, as well as reasonably common across the Julia ecosystem.

How to find Hash/Cipher

is there any tool or method to figure out what is this hash/cipher function?
i have only a 500 item list of input and output plus i know all of the inputs are numeric, and output is always 2 Byte long hexadecimal representation.
here's some samples:
794352:6657
983447:efbf
479537:0796
793670:dee4
1063060:623c
1063059:bc1b
1063058:b8bc
1063057:b534
1063056:b0cc
1063055:181f
1063054:9f95
1063053:f73c
1063052:a365
1063051:1738
1063050:7489
i looked around and couldn't find any hash this short, is this a hash folded on itself? (with xor maybe?) or maybe a simple trivial cipher?
is there any tool or method for finding the output of other numbers?
(i want to figure this out; my next option would be training a Neural Network or Regression, so i thought i ask before taking any drastic action )
Edit: The Numbers are directory names, and for accessing them, the Hex parts are required.
Actually, Wikipedia's page on hashes lists three CRCs and three checksum methods that it could be. It could also be only half the output from some more complex hashing mechanism. Cross your fingers and hope that it's of the former. Hashes are specifically meant to be difficult (if not impossible) to reverse engineer.
What it's being used for should be a very strong hint about whether or not it's more likely to be a checksum/CRC or a hash.

How to use a vector as a type parameter in Julia

This is similar to my previous question, but a bit more complicated.
Before I was defining a type with an associated integer as a parameter, Intp{p}. Now I would like to define a type using a vector as a parameter.
The following is the closest I can write to what I want:
type Extp{g::Vector{T}}
c::Vector{T}
end
In other words, Extp should be defined with respect to a Vector, g, and I want the contents, c, to be another Vector, whose entries should be the of the same type as the entries of g.
Well, this does not work.
Problem 1: I don't think I can use :: in the type parameter.
Problem 2: I could work around that by making the types of g and c arbitary and just making sure the types in the vectors match up in the constructor. But, even if I completely take everything out and use
type Extp{g}
c
end
it still doesn't seem to like this. When I try to use it the way I want to,
julia> Extp{[1,1,1]}([0,0,1])
ERROR: type: apply_type: in Extp, expected Type{T<:Top}, got Array{Int64,1}
So, does Julia just not like particular Vectors being associated with types? Does what I'm trying to do only work with integers, like in my Intp question?
EDIT: In the documentation I see that type parameters "can be any type at all (or an integer, actually, although here it’s clearly used as a type)." Does that mean that what I'm asking is impossible, and that that only types and integers work for Type parameters? If so, why? (what makes integers special over other types in Julia in this way?)
In Julia 0.4, you can use any "bitstype" as a parameter of a type. However, a vector is not a bitstype, so this is not going to work. The closest analog is to use a tuple: for example, (3.2, 1.5) is a perfectly valid type parameter.
In a sense vectors (or any mutable object) are antithetical to types, which cannot change at runtime.
Here is the relevant quote:
Both abstract and concrete types can be parameterized by other types
and by certain other values (currently integers, symbols, bools, and
tuples thereof).
So, your EDIT is correct. Widening this has come up on the Julia issues page (e.g., #5102 and #6081 were two related issues I found with some discussion), so this may change in the future - I'm guessing not in v0.4 though. It'd have to be an immutable type really to make any sense, so not Vector. I'm not sure I understand your application, but would a Tuple work?

How to compute the exponent values for error correction in QR Code

When i started reading about Qr Codes every article browsed i can see one QR Code Exponents of αx Table which having values for the specific powers of αx . I am not sure how this table is getting created. Can anybody explain me the logic behind this table.
For reference the table can be found at http://www.matchadesign.com/_blog/Matcha_Design_Blog/post/QR_Code_Demystified_-_Part_4/#
(The zxing source code for this might help you.)
It would take a lot to explain all the math here. For the Reed-Solomon error correction, you need a Galois field of 256 elements (nothing fancy -- just a set of 256 things that have addition and exponentiation and such defined.)
This is defined not in terms of numbers, but in terms of polynomials whose coefficients are all 0 or 1. We work with polynomials with 8 coefficient -- conveniently these map to 8-bit values. While it's tempting to think of those values as numbers, they're really something different.
In fact for addition and such to make sense such that all the operations land you back in a value in the Galois field, all the results are computed modulo an irreducible polynomial in the field. (Skip what that means now.)
To make operations faster, it helps to pre-compute what the powers of the polynomial "x" are in the field. This is alpha. You can think of this as "2", since the polynomial "x" is 00000010, though that's not entirely accurate.
So then you just compute the powers of x in the field. Because it's a field you'll hit every element of the field this way. The sequence seems to be the powers of two, which it happens to map to for a short while, until the first "modulo" of the primitive polynomial takes effect. Multiplying by x is indeed still something like multiplying by 2 but it's a bit of coincidence in this field, really.

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str string squared as unique as unique_str or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc)? This may sound ignorant, but I simply would not know as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate it is far more likely to be a programming error such as a failure to correctly initialize the random number generator than genuine bad luck. In which case the approach you are using it will not make it any better - an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation random.seed(None) you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an a issue you would have to solve first. Also, if the OS doesn't provide a source of randomness the system time will be used which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids you need to generate before the probability of generating a duplicate exceeds some probability p. An approcimate formula for this is:
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant then you also square n.
Since uuid4 is based off a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.

Resources