Translational equivariance and its relationship with convolutional layers and spatial pooling layers

In the context of convolutional neural network models, I once heard the statement:
One desirable property of convolutions is that they are
translationally equivariant; the introduction of spatial pooling
can corrupt the property of translational equivariance.
What does this statement mean, and why?

Most probably you heard it from Bengio's book. I will try to give you my explanation.
In a rough sense, a transformation f is equivariant to a transformation g if f(g(x)) = g(f(x)). In your case of convolutions and translations, this means that convolve(translate(x)) gives the same result as translate(convolve(x)). This is desirable because if your convolution can find the eye of a cat in an image, it will still find that eye if you shift the image.
You can see this for yourself (I use a 1d convolution only because it is easy to calculate by hand). Let's convolve v = [4, 1, 3, 2, 3, 2, 9, 1] with k = [5, 1, 2]. The result is [27, 12, 23, 17, 35, 21].
Now let's shift our v by prepending something to it: v' = [8] + v. Convolving with k you get [46, 27, 12, 23, 17, 35, 21]. As you can see, the result is just the previous result prepended with one new value.
Now the part about spatial pooling. Let's do a max-pooling of size 3 on the first result and on the second one. In the first case you get [27, 35], in the second [46, 35, 21]. As you can see, 27 somehow disappeared (the result was corrupted). It would be corrupted even more if you used average pooling.
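If you want to verify these numbers yourself, here is a minimal C++ sketch of the same example (note that, as is usual for CNN layers, the "convolution" does not flip the kernel, i.e. it is a cross-correlation, which is what reproduces the numbers above):

#include <algorithm>
#include <iostream>
#include <vector>

// "Valid" 1-D convolution as used in CNNs (kernel is not flipped).
std::vector<int> conv1d(const std::vector<int>& v, const std::vector<int>& k) {
    std::vector<int> out;
    for (std::size_t i = 0; i + k.size() <= v.size(); ++i) {
        int s = 0;
        for (std::size_t j = 0; j < k.size(); ++j) s += v[i + j] * k[j];
        out.push_back(s);
    }
    return out;
}

// Non-overlapping max-pooling with the given window size.
std::vector<int> max_pool(const std::vector<int>& v, std::size_t size) {
    std::vector<int> out;
    for (std::size_t i = 0; i < v.size(); i += size) {
        std::size_t end = std::min(i + size, v.size());
        out.push_back(*std::max_element(v.begin() + i, v.begin() + end));
    }
    return out;
}

int main() {
    std::vector<int> v  = {4, 1, 3, 2, 3, 2, 9, 1};
    std::vector<int> k  = {5, 1, 2};
    std::vector<int> v2 = {8, 4, 1, 3, 2, 3, 2, 9, 1};       // v shifted (prepended with 8)

    for (int x : conv1d(v, k))               std::cout << x << ' ';  // 27 12 23 17 35 21
    std::cout << '\n';
    for (int x : conv1d(v2, k))              std::cout << x << ' ';  // 46 27 12 23 17 35 21
    std::cout << '\n';
    for (int x : max_pool(conv1d(v, k), 3))  std::cout << x << ' ';  // 27 35
    std::cout << '\n';
    for (int x : max_pool(conv1d(v2, k), 3)) std::cout << x << ' ';  // 46 35 21
    std::cout << '\n';
}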
P.S. Max/min pooling is the most translationally invariant of all poolings (if one can say so, judging by the number of uncorrupted elements).

A note on the terms translation equivariance and translation invariance: they are different.
Translation equivariance means that a translation of the input features results in an equivalent translation of the outputs. This is desirable when we need to locate the pattern (e.g., its bounding rectangle).
Translation invariance means that a translation of the input does not change the outputs at all.
Translation invariance is important to achieve. It effectively means that after learning a certain pattern in the lower-left corner of a picture, our convnet can recognize the pattern anywhere, including the upper-right corner.
As we know, a densely connected network alone, without convolutional layers in between, cannot achieve translation invariance.
We need to introduce convolutional layers to bring generalization power to deep networks and to learn representations from fewer training samples.

How to find the value of the common ratio in a geometric series without the method of substitution

It has been some time since I last solved mathematical equations,
so I can't seem to find a way to arrive at a simplified equation for the common ratio, knowing the final sum and the value of n in the formula:
finalSum = r(1 - r^n)/(1 - r);
For example: 3, 9, 27, 81, 243
363 = r(1 - r^5)/(1 - r);
Given above is a very simple example, but I'll be dealing with these in decimals. Is there any way of getting a simplified equation to get the value of r? Or is the method of substitution the only way?
PS: This is for a program I'm writing
Please let me know.
Thanks
Substitution is the easiest method for finding the common ratio.
This resource may be helpful when dealing with common ratios of geometric sequences:
Geometric Sequences
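If you need to do this in a program rather than by hand, note that for a positive ratio the sum S(r) = r(1 - r^n)/(1 - r) = r + r^2 + ... + r^n is strictly increasing in r, so you can solve for r numerically with a simple bisection. A minimal C++ sketch (the function names are mine, and it assumes the series has the same form as in the question, i.e. the first term equals r):

#include <cmath>
#include <iostream>

// Sum of the series r + r^2 + ... + r^n, guarded near r == 1.
double geomSum(double r, int n) {
    if (std::fabs(r - 1.0) < 1e-12) return r * n;
    return r * (1.0 - std::pow(r, n)) / (1.0 - r);
}

// Find r > 0 such that geomSum(r, n) == finalSum, by bisection.
double findRatio(double finalSum, int n) {
    double lo = 0.0, hi = 2.0;
    while (geomSum(hi, n) < finalSum) hi *= 2.0;   // grow the bracket if needed
    for (int i = 0; i < 100; ++i) {
        double mid = 0.5 * (lo + hi);
        if (geomSum(mid, n) < finalSum) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main() {
    std::cout << findRatio(363.0, 5) << '\n';      // prints 3 for the example above
}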

How exactly does the Evaluate function work in Wolfram Mathematica, and how does it make a difference in the two plots below?

I'm trying to understand how exactly the function Evaluate works. Here I have two examples, and the only difference between them is the use of Evaluate.
First plot with Evaluate.
ReliefPlot[
Table[Evaluate[Sum[Sin[RandomReal[9, 2].{x, y}], {20}]], {x, 1, 2, .02},
{y, 1, 2, .02}],
ColorFunction ->
(Blend[{Darker[Green, .8], Lighter[Brown, .2],White}, #] &),
Frame -> False, Background -> None, PlotLegends -> Automatic]
https://imgur.com/itBRYEv.png "plot1"
Second plot without Evaluate.
ReliefPlot[
Table[Sum[Sin[RandomReal[9, 2].{x, y}], {20}], {x, 1, 2, .02},
{y, 1,2, .02}],
ColorFunction ->
(Blend[{Darker[Green, .8], Lighter[Brown, .2], White}, #] &),
Frame -> False, Background -> None,
PlotLegends -> Automatic]
https://i.imgur.com/fvdiSCm.png "plot2"
Please explain how Evaluate makes a difference here.
Compare this
count=0;
ReliefPlot[Table[Sum[Sin[count++;RandomReal[9,2].{x,y}],{20}],{x,1,2,.02},{y,1,2,.02}]]
count
which should display your plot followed by 52020 = 51*51*20, because you have a 51*51 Table and each entry needs to evaluate the 20 iterations of your Sum,
with this
count=0;
ReliefPlot[Table[Evaluate[Sum[Sin[count++;RandomReal[9,2].{x,y}],{20}]],{x,1,2,.02},{y,1,2,.02}]]
count
which should display your plot followed by 20 because the Evaluate needed to do the 20 iterations of your Sum only once, even though you do see 51*51 blocks of different colors on the screen.
You will get the same counts displayed, without the graphics, if you remove the ReliefPlot from each of these; that seems to show it isn't the ReliefPlot that is responsible for the number of times your RandomReal is calculated, it is the Table.
So Evaluate translates the external text of your Table entry into an internal form and tells Table that this has already been done and does not need to be repeated for every iteration of the Table.
What you put and see on the screen is the front end of Mathematica. Hidden behind that is the back end where most of the actual calculations are done. The front and back ends communicate with each other during your input, calculations, output and display.
But this still doesn't answer the question of why the two plots look so different. I am guessing that when you don't use Evaluate, and thus don't mark the result of the Table as complete and finished, the ReliefPlot will repeatedly probe that expression in your Table; that expression will be different every time because of the RandomReal, and this is what produces the smoother, higher-resolution graphic. But when you do use Evaluate, and the Table is thus marked as done, finished, and in need of no further evaluation, the ReliefPlot just uses the 51*51 values without recalculating or probing, and you get a lower-resolution ReliefPlot.
As with almost all of Mathematica, the details of the algorithms used for each of the thousands of different functions are not available. Sometimes the Options and Details tab in the help page for a given function can give you some additional information. Experimentation can sometimes help you guess what is going on behind the code. Sometimes other very bright people have figured out parts of the behavior and posted descriptions. But that is probably all there is.
Table has the HoldAll attribute
Attributes[Table]
(* {HoldAll, Protected} *)
Read this and this to learn more about evaluation in the WL.

Why do we have lower and upper bounds for datatypes in MPI?

Or the question can be paraphrased like this:
Why may one need a datatype with a non-zero lower bound?
Consider the following example:
struct S {
    int   a;
    int   b;
    float c;
    float d;
} array[N];
If I had an array of type S[] and I wanted to send only values of fields b and
d, I would create a datatype with the type map { (4, MPI_INT), (12, MPI_FLOAT) }.
At first, it seems that such a type could be used to correctly send an array of
struct S:
MPI_Send(array, N, datatype, ...);
But this doesn't work if N > 1.
Such a type would have lb = 4, ub = 16 and extent = ub - lb = 12. That is,
MPI would consider that the second element of the array starts 12 bytes from the
first one, which is not true.
Well, that may not be a big deal. After all, for such partially sent structures
we generally have to specify the exact size of the structure anyway:
MPI_Type_create_resized(datatype, 0, sizeof(struct S), &resized);
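(For concreteness, the full construction I have in mind looks roughly like this sketch, with the offsets taken via offsetof instead of the hard-coded 4 and 12:)

#include <mpi.h>
#include <stddef.h>

struct S { int a; int b; float c; float d; };

/* Datatype that picks out fields b and d, resized so that consecutive
   array elements are found sizeof(struct S) bytes apart. */
MPI_Datatype make_bd_type(void)
{
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     disps[2]     = { offsetof(struct S, b), offsetof(struct S, d) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_FLOAT };

    MPI_Datatype tmp, bd_type;
    MPI_Type_create_struct(2, blocklens, disps, types, &tmp);
    /* Force lb = 0 and extent = sizeof(struct S). */
    MPI_Type_create_resized(tmp, 0, sizeof(struct S), &bd_type);
    MPI_Type_commit(&bd_type);
    MPI_Type_free(&tmp);
    return bd_type;   /* then: MPI_Send(array, N, bd_type, ...) */
}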
But I wonder why we always need to specify a zero lower bound. Why would
someone need a non-zero lower bound? Datatypes with non-zero lower bounds look extremely confusing to me, and I cannot make any sense of them.
If I were to design a type system for MPI, I would describe a type with a single
parameter - its size (extent), which is the stride between two adjacent elements of an array. In terms of MPI, I would always set lb = 0 and extent = ub. Such a system looks much clearer to me, and it would work correctly in the example described above.
But MPI has chosen a different way. We have two independent parameters instead: the lower
and the upper bounds. Why is that? What's the use of this additional flexibility? When should one use datatypes with a non-zero lower bound?
You have no idea what kind of weird and complex structures one finds in scientific and engineering codes. The standard is designed to be as general as possible and to provide maximum flexibility. Section 4.1.6 Lower-Bound and Upper-Bound Markers begins like this:
It is often convenient to define explicitly the lower bound and upper bound of a type map, and override the definition given on page 105. This allows one to define a datatype that has "holes" at its beginning or its end, or a datatype with entries that extend above the upper bound or below the lower bound. Examples of such usage are provided in Section 4.1.14.
Also, the user may want to overide [sic] the alignment rules that are used to compute upper bounds and extents. E.g., a C compiler may allow the user to overide [sic] default alignment rules for some of the structures within a program. The user has to specify explicitly the bounds of the datatypes that match these structures.
The simplest example of a datatype with a non-zero lower bound is a structure with absolute addresses used as offsets, useful e.g. when sending structures with pointers to data scattered in memory. Such a datatype is used with MPI_BOTTOM specified as the buffer address, which corresponds to the bottom of the memory space (0 on most systems). If the lower bound were fixed to 0, you would have to find the data item with the lowest address first and compute all offsets relative to it.
Another example is the use of MPI_Type_create_subarray to create a datatype that describes a subarray of an n-dimensional array. With zero lower bounds you will have to provide a pointer to the beginning of the subarray. With non-zero lower bounds you just give a pointer to the beginning of the whole array instead. And you can also create a contiguous datatype of such subarray datatypes in order to send such n-dimensional "slices" from an n+1-dimensional array.
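A minimal sketch of that case (the 4x4 geometry is purely illustrative): the buffer passed to MPI_Send is the whole array, and the datatype carries the block's position inside it.

#include <mpi.h>

/* Send the 2x2 block that starts at row 1, column 1 of a 4x4 int array. */
void send_block(int full[4][4], int dest, int tag)
{
    int sizes[2]    = { 4, 4 };   /* shape of the full array */
    int subsizes[2] = { 2, 2 };   /* shape of the block      */
    int starts[2]   = { 1, 1 };   /* where the block begins  */

    MPI_Datatype block;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_INT, &block);
    MPI_Type_commit(&block);
    MPI_Send(full, 1, block, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&block);
}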
IMO the only reason to have both LB and UB markers is that it simplifies the description of datatype construction. MPI datatypes are described by a typemap (a list of offsets and types, including possible LB/UB markers), and all the datatype construction calls define the new typemap in terms of the old typemap.
When you have LB/UB markers in the old typemap and you follow the rules of construction of the new typemap from the old, you get a natural definition of the LB/UB marker in the new type which defines the extent of the new type. If extent were a separate property on the side of the typemap, you'd have to define what the new extent is for every datatype construction call.
Other than that, I fundamentally agree with you on the meaninglessness of having LB/UB as two separate pieces of data, when the only thing they're used for is to define the extent. Once you add LB/UB markers, their meaning is completely disconnected from any notion of actual data offsets.
If you wanted to put an int at displacement 4 and have its extent be 8, it would be fine to construct
[(LB,4), (int,4), (UB,12)]
but it would be equally fine to construct any of
[(LB,0),(int,4),(UB,8)]
[(LB,1000000),(int,4),(UB,1000008)]
[(LB,-1000),(int,4),(UB,-992)]
The above are all completely equivalent in behavior because they have the same extent.
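(For what it's worth, here is a sketch of how the first of those typemaps could be constructed; MPI_Type_create_resized sets the LB and the extent directly:)

#include <mpi.h>

/* Build the typemap [(LB,4), (int,4), (UB,12)]: an int at displacement 4,
   with lb = 4 and extent = 8 (so ub = 12). */
MPI_Datatype make_lb4_int(void)
{
    int          blocklen = 1;
    MPI_Aint     disp     = 4;
    MPI_Datatype base     = MPI_INT;

    MPI_Datatype placed, resized;
    MPI_Type_create_struct(1, &blocklen, &disp, &base, &placed);  /* [(int,4)] */
    MPI_Type_create_resized(placed, 4, 8, &resized);              /* lb = 4, extent = 8 */
    MPI_Type_commit(&resized);
    MPI_Type_free(&placed);
    return resized;
}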
When explanations of LB/UB markers talk about how you need to have datatypes where the first data displacement is non-0, I think that's misleading. It's true you need to be able to make types like that, but the LB/UB markers aren't fundamentally connected to the data displacements. I'm concerned that suggesting they are connected will lead an MPI user to write invalid code if they think the LB is intrinsically related to the data offsets.

How to understand time complexity from a plot?

I have written a program in C where I allocate memory to store an n-by-n matrix and then feed it to a linear algebra subroutine. I'm having big trouble understanding how to identify the time complexity of these operations from a plot. In particular, I'm interested in identifying how CPU time scales as a function of n, where n is my matrix size.
To do so, I created an array of sizes n = 2, 4, 8, ..., 512 and computed the CPU time for both operations. I repeated this process 10000 times for each n and eventually took the mean. I therefore end up with a second array that I can match with my array of n.
It was suggested that I use a double logarithmic (log-log) plot, and I read here and here that, plotted this way, "powers show up as straight lines". This is the resulting figure (dgesv is the linear algebra subroutine I used).
Now, I'm guessing that my time complexity is O(log n), since I get straight lines for both my operations (I do not take the red line into consideration). I have seen the differences in shape between, say, linear complexity, logarithmic complexity, etc., but I still have doubts about whether I can say anything about the time complexity of dgesv, for instance. I'm sure there's a way to do this that I don't know at all, so I'd be glad if someone could help me understand how to read this plot properly.
PS: if there's a more appropriate community to post this question in, please let me know and I'll move it to avoid making more of a mess here. Thanks everyone.
Take your yellow line: it appears to go from (0.9, -2.6) to (2.7, 1.6), giving it a slope of roughly 2.5. As you're plotting log(t) versus log(n), this means that:
log(t) = 2.5 log(n) + c
or, exponentiating both sides:
t = exp(2.5 log(n) + c) = c' n^2.5
The power of 2.5 may be an underestimate, as your dgesv likely has a cost of (2/3)n^3 (though O(n^2.5) is theoretically possible).
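If you prefer a number to eyeballing the slope, you can fit the exponent by least squares on the log-log data. A small C++ sketch (the timings in main are made up for illustration; with your real measurements, the returned slope is the estimated exponent p in t ~ c*n^p):

#include <cmath>
#include <iostream>
#include <vector>

// Least-squares slope of log(t) against log(n); this slope estimates the exponent p.
double fitExponent(const std::vector<double>& n, const std::vector<double>& t) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double m = static_cast<double>(n.size());
    for (std::size_t i = 0; i < n.size(); ++i) {
        const double x = std::log(n[i]), y = std::log(t[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    return (m * sxy - sx * sy) / (m * sxx - sx * sx);
}

int main() {
    std::vector<double> n = {64, 128, 256, 512};
    std::vector<double> t = {0.002, 0.014, 0.100, 0.750};            // illustrative timings only
    std::cout << "estimated exponent: " << fitExponent(n, t) << '\n';  // roughly 2.8 here
}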

Making a cryptaritmetic solver in C++

I am planning out a C++ program that takes 3 strings that represent a cryptarithmetic puzzle. For example, given TWO, TWO, and FOUR, the program would find digit substitutions for each letter such that the mathematical expression
  TWO
+ TWO
-----
 FOUR
is true, with the inputs assumed to be right justified. One way to go about this would of course be to just brute force it, assigning every possible substitution for each letter with nested loops, trying the sum repeatedly, etc., until the answer is finally found.
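Concretely, the brute force I have in mind is something like the following sketch (hard-coded to TWO + TWO = FOUR; it walks the digit assignments with std::next_permutation rather than literal nested loops, but the idea is the same):

#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

// Numeric value of a word under an assignment letter -> digit.
long value(const std::string& word, const std::array<int, 26>& digit) {
    long v = 0;
    for (char c : word) v = v * 10 + digit[c - 'A'];
    return v;
}

int main() {
    const std::string a = "TWO", b = "TWO", sum = "FOUR";
    const std::string letters = "TWOFUR";   // the distinct letters of the puzzle

    // Try every ordering of the ten digits; the first letters.size() entries
    // form one candidate (injective) assignment. Redundant but simple.
    std::vector<int> digits = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    do {
        std::array<int, 26> digit{};
        for (std::size_t i = 0; i < letters.size(); ++i)
            digit[letters[i] - 'A'] = digits[i];
        if (digit['T' - 'A'] == 0 || digit['F' - 'A'] == 0) continue;  // no leading zeros
        if (value(a, digit) + value(b, digit) == value(sum, digit)) {
            std::cout << value(a, digit) << " + " << value(b, digit)
                      << " = " << value(sum, digit) << '\n';
            break;   // stop at the first solution
        }
    } while (std::next_permutation(digits.begin(), digits.end()));
}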
My thought is that though this is terribly inefficient, the underlying loop-and-check approach may be a feasible (or even necessary) way to go, after a series of deductions is performed to limit the domains of each variable. I'm finding it kind of hard to visualize, but would it be reasonable to first assume a general/padded structure like this (each X represents a not-necessarily-distinct digit, and each C is a carry digit, which in this case will be either 0 or 1)?:
  CCC.....CCC
  XXX.....XXXX
+ XXX.....XXXX
--------------
 CXXX.....XXXX
With that in mind, some more planning thoughts:
-Though leading zeros will not be given in the problem, I probably ought to add enough of them where appropriate to even things out/match operands up.
-I'm thinking I should start with a set of possible values 0-9 for each letter, perhaps stored as vectors in a 'domains' table, and eliminate values from this as deductions are made. For example, if I see some letters lined up like this
A
C
--
A
, I can tell that C is zero and thus eliminate all other values from its domain. I can think of quite a few deductions, but generalizing them to all kinds of little situations and putting it into code seems kind of tricky at first glance.
-Assuming I have a good series of deductions that run through things and boot out lots of values from the domains table, I suppose I'd still just loop over everything and hope that the state space is small enough to generate a solution in a reasonable amount of time. But it feels like there has to be more to it than that! -- maybe some clever equations to set up or something along those lines.
Tips are appreciated!
You could iterate over this problem from right to left, i.e. the way you'd perform the actual operation. Start with the rightmost column. For every digit you encounter, you check whether there already is an assignment for that digit. If there is, you use its value and go on. If there isn't, then you enter a loop over all possible digits (perhaps omitting already used ones if you want a bijective map) and recursively continue with each possible assignment. When you reach the sum row, you again check whether the variable for the digit given there is already assigned. If it is not, you assign the last digit of your current sum, and then continue to the next higher valued column, taking the carry with you. If there already is an assignment, and it agrees with the last digit of your result, you proceed in the same way. If there is an assignment and it disagrees, then you abort the current branch, and return to the closest loop where you had other digits to choose from.
The benefit of this approach should be that many variables are determined by a sum, instead of guessed up front. Particularly for letters which only occur in the sum row, this might be a huge win. Furthermore, you might be able to spot errors early on, thus avoiding choices for letters in some cases where the choices you made so far are already inconsistent. A drawback might be the slightly more complicated recursive structure of your program. But once you got that right, you'll also have learned a good deal about turning thoughts into code.
I solved this problem at my blog using a randomized hill-climbing algorithm. The basic idea is to choose a random assignment of digits to letters, "score" the assignment by computing the difference between the two sides of the equation, then altering the assignment (swap two digits) and recompute the score, keeping those changes that improve the score and discarding those changes that don't. That's hill-climbing, because you only accept changes in one direction. The problem with hill-climbing is that it sometimes gets stuck in a local maximum, so every so often you throw out the current attempt and start over; that's the randomization part of the algorithm. The algorithm is very fast: it solves every cryptarithm I have given it in fractions of a second.
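A rough C++ sketch of that idea (simplified and hard-coded to TWO + TWO = FOUR; the leading-zero rule is folded into the score as a penalty, and swaps that don't worsen the score are kept):

#include <algorithm>
#include <array>
#include <cstdlib>
#include <iostream>
#include <random>
#include <string>

long value(const std::string& w, const std::array<int, 26>& d) {
    long v = 0;
    for (char c : w) v = v * 10 + d[c - 'A'];
    return v;
}

int main() {
    const std::string a = "TWO", sum = "FOUR", letters = "TWOFUR";
    std::mt19937 rng(std::random_device{}());
    std::array<int, 26> d{};
    std::array<int, 10> digits{};

    auto assign = [&] {                     // letter i takes the digit in slot i
        for (std::size_t i = 0; i < letters.size(); ++i)
            d[letters[i] - 'A'] = digits[i];
    };
    auto score = [&] {                      // 0 means solved
        long s = std::labs(2 * value(a, d) - value(sum, d));
        if (d['T' - 'A'] == 0 || d['F' - 'A'] == 0) s += 1000000;  // leading-zero penalty
        return s;
    };

    for (;;) {                                            // random restart
        for (int i = 0; i < 10; ++i) digits[i] = i;
        std::shuffle(digits.begin(), digits.end(), rng);
        assign();
        for (int step = 0; step < 20000; ++step) {        // hill climbing
            if (score() == 0) {
                std::cout << value(a, d) << " + " << value(a, d)
                          << " = " << value(sum, d) << '\n';
                return 0;
            }
            const long before = score();
            const int i = rng() % 10, j = rng() % 10;     // swap two slots (possibly unused ones)
            std::swap(digits[i], digits[j]);
            assign();
            if (score() > before) {                       // revert worsening swaps
                std::swap(digits[i], digits[j]);
                assign();
            }
        }
    }
}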
Cryptarithmetic problems are classic constraint satisfaction problems. Basically, what you need to do is have your program generate constraints based on the inputs such that you end up with something like the following, using your given example:
O + O = 2O = R + 10Carry1
W + W + Carry1 = 2W + Carry1 = U + 10Carry2
T + T + Carry2 = 2T + Carry2 = O + 10Carry3 = O + 10F
Generalized pseudocode:
for i in range of the shorter input (or either input if they're the same length):
    shorterInput[i] + longerInput[i] + Carry[i] = result[i] + 10*Carry[i+1]   // Carry[0] == 0
for the rest of the longer input, if one is longer:
    longerInput[i] + Carry[i] = result[i] + 10*Carry[i+1]
Additional constraints based on the definition of the problem:
Range(digits) == {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Range(auxiliary_carries) == {0, 1}
So for your example:
Range(O, W, T) == {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Range(Carry1, Carry2, F) == {0, 1}
Once you've generated the constraints to limit your search space, you can use CSP resolution techniques as described in the linked article to walk the search space and determine your solution (if one exists, of course). The concept of (local) consistency is very important here and taking advantage of it allows you to possibly greatly reduce the search space for CSPs.
As a simple example, note that cryptarithmetic generally does not use leading zeroes, meaning if the result is longer than both inputs the final digit, i.e. the last carry digit, must be 1 (so in your example, it means F == 1). This constraint can then be propagated backwards, as it means that 2T + Carry2 == O + 10; in other words, the minimum value for T must be 5, as Carry2 can be at most 1 and 2(4)+1==9. There are other methods of enhancing the search (min-conflicts algorithm, etc.), but I'd rather not turn this answer into a full-fledged CSP class so I'll leave further investigation up to you.
(Note that you can't make assumptions like A+C=A -> C == 0 except for in least significant column due to the possibility of C being 9 and the carry digit into the column being 1. That does mean that C in general will be limited to the domain {0, 9}, however, so you weren't completely off with that.)
