Is there a way to prove that this example of a hashcode algorithm will give unique values - hashtable

I was watching https://www.youtube.com/watch?v=UPo-M8bzRrc&index=21&list=PL4BBB74C7D2A1049C (CS 61B Lecture 21: Hash Tables), and the example the professor gave was this:
You have a two-letter word, where each letter falls between a and z.
public class Word {
    public static final int LETTERS = 26, WORDS = LETTERS * LETTERS;
    private String word;

    public int hashCode() {
        return LETTERS * (word.charAt(0) - 'a') + word.charAt(1) - 'a';
    }
}
Is there a way to prove (mathematically?) that each possible word will map to a different value between 0 and 675?
I've proved that the range will be between 0 and 675 (given "aa" and "zz"), but I'm unsure how to prove uniqueness.

The formula to get the hash code is:
hash = 26 * (A - 'a') + (B - 'a')
= 26 * (A - 97) + (B - 97) // 'a' == 97 in ASCII
= 26A + B - 27*97
So what we need to prove is that 26A + B has distinct values for any A and B in the range <97; 122> (decimal values for <'a'; 'z'>). We ignore the constant 27 * 97 part, as it does not change the reasoning.
Let's look at the opposite statement: when would the hash code not be distinct? It would not be distinct when a change in A is compensated by a change in B. So the following would need to be true:
26 * A1 + B1 = 26 * A2 + B2
Let's assume that A2 = A1 + 1:
26 * A1 + B1 = 26 * (A1 + 1) + B2
= 26 * A1 + B2 + 26
Which means:
B1 = B2 + 26
B1 - B2 = 26
Which is impossible, because B is the char code for letters in the range <'a'; 'z'>, and the width of that range, in decimal ASCII values, is only 25 (122 - 97). A larger difference in A only makes it worse: if A1 and A2 differ by k >= 1, then B1 and B2 would have to differ by 26 * k, which is at least 26.
So, by proving the opposite is impossible, we've proved that the hash code is unique for those characters.
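The argument above can also be checked by brute force, since there are only 676 words. The following is a minimal Python sketch, not the lecture's code: `hash_code` mirrors the Java `hashCode()` from the question, and `unhash` is a hypothetical helper added here to show that the mapping can be inverted with a division by 26, which is another way of seeing that no two words can collide.

```python
from itertools import product
from string import ascii_lowercase

LETTERS = 26


def hash_code(word: str) -> int:
    """Mirror of Word.hashCode() from the question, for a two-letter lowercase word."""
    return LETTERS * (ord(word[0]) - ord("a")) + (ord(word[1]) - ord("a"))


def unhash(code: int) -> str:
    """Hypothetical inverse: divmod by 26 recovers both letters from the code."""
    first, second = divmod(code, LETTERS)
    return chr(first + ord("a")) + chr(second + ord("a"))


# Enumerate every two-letter word from "aa" to "zz" and hash it.
words = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
codes = [hash_code(w) for w in words]

# All codes are distinct exactly when the set has as many elements as the list.
all_distinct = len(set(codes)) == len(words)
```

Because `unhash(hash_code(w))` gives back `w` for every word, the hash is a bijection between the 676 words and the integers 0 to 675.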

Related

What are the different versions of arithmetic swap and why do they work?

I think we should all be familiar with the arithmetic swap algorithm, which swaps two variables without using a third variable. I recently found out that there are two variations of the arithmetic swap. Please consider the following:
Variation 1.
int a = 2;
int b = 3;
a = a + b;
b = a - b;
a = a - b;
Variation 2.
int a = 2;
int b = 3;
b = b - a;
a = a + b;
b = a - b;
I want to know, why are there two distinct variations of the arithmetic swap and why do they work? Are there also other variations of the arithmetic swap that achieve the same result? How are they related? Is there any elegant mathematical formula that justifies why the arithmetic swap works the way it does, for all variations? Is there anything related between these two variations of the two arithmetic swap, like an underlying truth?
Break each variable out as what it represents.
Variation 1:
a = 2
b = 3
a1 = a + b
b1 = a1 - b = (a + b) - b = a
a2 = a1 - b1 = (a + b) - a = b
Variation 2:
a = 2
b = 3
b1 = b - a
a1 = a + b1 = a + (b - a) = b
b2 = a1 - b1 = b - (b - a) = a
There's no underlying truth other than the fact that the math works out. Remember that each time you do an assignment, it's effectively a new "variable" from the math side.
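To see that both variations work for more than the single pair (2, 3), here is a small Python sketch that wraps each variation in a function (the names `swap_v1` and `swap_v2` are made up for this example). Python integers never overflow, so this only checks the algebra; it says nothing about fixed-width types, where a + b may exceed the range of the type.

```python
def swap_v1(a: int, b: int) -> tuple[int, int]:
    """Variation 1: store the sum in a, then peel off each original value."""
    a = a + b  # a now holds a0 + b0
    b = a - b  # (a0 + b0) - b0 = a0
    a = a - b  # (a0 + b0) - a0 = b0
    return a, b


def swap_v2(a: int, b: int) -> tuple[int, int]:
    """Variation 2: store the difference in b, then rebuild both values."""
    b = b - a  # b now holds b0 - a0
    a = a + b  # a0 + (b0 - a0) = b0
    b = a - b  # b0 - (b0 - a0) = a0
    return a, b


# Check both variations on a grid of small integers, including negatives.
pairs = [(a, b) for a in range(-5, 6) for b in range(-5, 6)]
both_swap = all(swap_v1(a, b) == (b, a) == swap_v2(a, b) for a, b in pairs)
```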

Dynamic forecasting the data.table way?

I need to create dynamic forecast using
v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2)
The dataset looks like this at time 0. In actual data, there are 80 different x's and 100K "dates".
date        v   vLag1  vLag2  x   xLag1  xLag2  b1   b2   b3    b4
2016-06-30  NA  105    95     33  11     23     0.2  3.2  -1.2  0.4
2016-07-01  NA  NA     NA     43  33     11     0.2  3.2  -1.2  0.4
2016-07-02  NA  NA     NA     52  43     33     0.2  3.2  -1.2  0.4
The goal is to predict v's, replacing all NA's with values. I created vLag1, vLag2, xLag1, xLag2 so that I have all I need to calculate v in one row.
All x's and b's are known ahead of time, so I created lags of x shown above. The b's are the coefficients.
For each date, the v(t) would be predicted, and the predicted v(t)'s will feed into the next date's v prediction as the lagged regressors.
To avoid looping over rows like this:
for (i in 2:nrows) {
  df$v[i] <- df$v[i-1] * df$coeff[i]
}
I have tried to use repeated substitution, so that all the future v's only reference v1, which is easy to calculate because v1's calculation involves other values in the same row.
v2 = b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0
v3 = b0 + b1*v2 + b2*v1 + b3*x2 + b4*x1
(substitute v2) v3 = b0 + b1*(b0 + b1*v1 + b2*v0 + b3*x1 + b4*x0) + b2*v1 + b3*x2 + b4*x1
v4 = ...
But with so many lags of v's and x's to keep track, this also got out of control.
I have been browsing questions about data.table's shift function on SO. But in my case, where the values need to be dynamically obtained and then shifted, is there any way to predict dynamically with data.table's functions?
Instead of data.table (where you can't do this easily) this looks like an easy job for Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// x: known regressor values x(0), x(1), ...
// v1, v2 and x1, x2: first and second lags of v and x at the first forecast date.
// b0..b4: coefficients of the recursion.
// [[Rcpp::export]]
NumericVector dyn_fore(const NumericVector x,
                       const double v1, const double v2,
                       const double x1, const double x2,
                       const double b0, const double b1, const double b2,
                       const double b3, const double b4) {
  int n = x.size();
  NumericVector v(n);
  // The first two forecasts still depend on the supplied lags.
  if (n > 0) v(0) = b0 + b1*v1 + b2*v2 + b3*x1 + b4*x2;
  if (n > 1) v(1) = b0 + b1*v(0) + b2*v1 + b3*x(0) + b4*x1;
  // From the third forecast on, every lag comes from v and x themselves.
  for (int i = 2; i < n; i++) {
    v(i) = b0 + b1*v(i-1) + b2*v(i-2) + b3*x(i-1) + b4*x(i-2);
  }
  return v;
}
(If you use Windows, make sure you have a working Rtools installation. Put this in a C++ file in RStudio and source it. Check that I got the coefficients and indices right.)
Then in R:
x <- c(33, 43, 52, 67)
dyn_fore(x, 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4)
#[1]  321.00  365.00 1061.80 1335.16
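As a cross-check on the indices, here is a minimal pure-Python sketch of the same recursion, written directly from the formula in the question, v(t) = b0 + b1*v(t-1) + b2*v(t-2) + b3*x(t-1) + b4*x(t-2). It assumes the same argument order as the function above and is meant for verifying a few rows by hand, not for 100K dates.

```python
def dyn_fore(x, v1, v2, x1, x2, b0, b1, b2, b3, b4):
    """Forecast v(0..n-1) from the recursion in the question.

    x holds the known regressor values x(0), x(1), ...;
    v1, v2 and x1, x2 are the first and second lags at the first forecast date.
    """
    v_hist = [v2, v1]            # v(-2), v(-1); forecasts get appended here
    x_hist = [x2, x1] + list(x)  # x(-2), x(-1), x(0), x(1), ...
    out = []
    for t in range(len(x)):
        # With the two-element offset, x_hist[t + 1] is x(t-1) and x_hist[t] is x(t-2).
        v_t = (b0
               + b1 * v_hist[-1] + b2 * v_hist[-2]
               + b3 * x_hist[t + 1] + b4 * x_hist[t])
        v_hist.append(v_t)
        out.append(v_t)
    return out


forecast = dyn_fore([33, 43, 52, 67], 105, 95, 11, 23, 0, 0.2, 3.2, -1.2, 0.4)
```

Keeping the lags in plain lists makes the off-by-one bookkeeping explicit, which is the part most worth double-checking in any compiled version.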

Curve Fit 5 points

I am trying to curve fit 5 points in C. I have used this code from a previous post (Can sombody simplify this equation for me?) to do 4 points, but now I need to add another point.
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4] - X values
// y[1],y[2],y[3],y[4] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
S1 = x[1] + x[2] + x[3] + x[4]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[2]*x[3] + x[2]*x[4] + x[3]*x[4]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[3]*x[4] + x[2]*x[3]*x[4]
for i = 1 to 4 loop
C0 = y[i]/(((4*x[i]-3*S1)*x[i]+2*S2)*x[i]-S3)
C1 = C0*(S1 - x[i])
C2 = S2*C0 - C1*x[i]
C3 = S3*C0 - C2*x[i]
A = A + C0
B = B - C1
C = C + C2
D = D - C3
end-loop
// Result: A, B, C, D
I have been trying to convert this to a 5 point curve fit, but am having trouble figuring out what goes inside the loop:
// Input data: arrays x[] and y[]
// x[1],x[2],x[3],x[4],x[5] - X values
// y[1],y[2],y[3],y[4],y[5] - Y values
// Calculations
A = 0
B = 0
C = 0
D = 0
E = 0
S1 = x[1] + x[2] + x[3] + x[4]
S2 = x[1]*x[2] + x[1]*x[3] + x[1]*x[4] + x[2]*x[3] + x[2]*x[4] + x[3]*x[4]
S3 = x[1]*x[2]*x[3] + x[1]*x[2]*x[4] + x[1]*x[3]*x[4] + x[2]*x[3]*x[4]
S4 = x[1]*x[2]*x[3]*x[4] + x[1]*x[2]*x[3]*x[5] + x[1]*x[2]*x[4]*x[5] + x[1]*x[3]*x[4]*x[5] + x[2]*x[3]*x[4]*x[5]
for i = 1 to 4 loop
C0 = ??
C1 = ??
C2 = ??
C3 = ??
C4 = ??
A = A + C0
B = B - C1
C = C + C2
D = D - C3
E = E + C4
end-loop
// Result: A, B, C, D, E
Any help in filling out C0...C4 would be appreciated. I know this has to do with matrices, but I have not been able to figure it out. Examples with pseudocode or real code would be most helpful.
Thanks
I refuse to miss this opportunity to generalize. :)
Instead, we're going to learn a little bit about Lagrange polynomials and the Newton Divided Difference Method of their computation.
Lagrange Polynomials
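Given n+1 data points (x_0, y_0), ..., (x_n, y_n), the interpolating polynomial is
$$L(x) = \sum_{j=0}^{n} y_j \, \ell_j(x)$$
where \ell_j(x) is
$$\ell_j(x) = \prod_{\substack{0 \le m \le n \\ m \ne j}} \frac{x - x_m}{x_j - x_m}.$$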
What this means is that we can find the polynomial approximating the n+1 points, regardless of spacing, etc, by just summing these polynomials. However, this is a bit of a pain and I wouldn't want to do it in C. Let's take a look at Newton Polynomials.
Newton Polynomials
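Same start: given n+1 data points, the approximating polynomial is going to be
$$N(x) = \sum_{j=0}^{n} a_j \, n_j(x)$$
where each n_j(x) is
$$n_j(x) = \prod_{i=0}^{j-1} (x - x_i)$$
with a coefficient of
$$a_j = [y_0, \ldots, y_j],$$
being the divided difference.
The final form ends up looking like
$$N(x) = [y_0] + [y_0, y_1](x - x_0) + \cdots + [y_0, \ldots, y_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).$$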
As you can see, the formula is pretty easy given the divided difference values. You just do each new divided difference and multiply by each point so far. It should be noted that you'll end up with a polynomial of degree n from n+1 points.
Divided Difference
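All that's left is to define the divided difference, which is really best explained by these two formulas:
$$[y_i] = y_i$$
and
$$[y_i, \ldots, y_{i+j}] = \frac{[y_{i+1}, \ldots, y_{i+j}] - [y_i, \ldots, y_{i+j-1}]}{x_{i+j} - x_i}.$$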
With this information, a C implementation should be reasonable to do. I hope this helps and I hope you learned something! :)
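As a starting point for a C port, here is a minimal Python sketch of the Newton divided-difference approach described above. The function names are made up for this example, and the x values are assumed to be pairwise distinct (otherwise a division by zero occurs). The coefficient table is computed in place, and the polynomial is evaluated in nested (Horner-like) form.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients [y_0], [y_0, y_1], ..., [y_0, ..., y_n]."""
    coef = [float(y) for y in ys]
    n = len(xs)
    for j in range(1, n):
        # Walk backwards so coef[i - 1] still holds the order j-1 difference.
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate N(x) = c0 + (x - x0) * (c1 + (x - x1) * (c2 + ...))."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result


# Five unevenly spaced sample points taken from y = x^4 - 2x + 1.
sample_xs = [0.0, 1.0, 2.0, 3.0, 5.0]
sample_ys = [x**4 - 2 * x + 1 for x in sample_xs]
sample_coef = divided_differences(sample_xs, sample_ys)
```

With five points the result is a polynomial of degree at most 4, so for this sample `newton_eval` reproduces the quartic everywhere, not only at the five points. Note that these are Newton-form coefficients, not the A...E coefficients of the power-basis form in the question.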
If the x values are equally spaced with x2-x1=h, x3-x2=h, x4-x3=h and x5-x4=h then
C0 = y1;
C1 = -(25*y1-48*y2+36*y3-16*y4+3*y5)/(12*h);
C2 = (35*y1-104*y2+114*y3-56*y4+11*y5)/(24*h*h);
C3 = -(5*y1-18*y2+24*y3-14*y4+3*y5)/(12*h*h*h);
C4 = (y1-4*y2+6*y3-4*y4+y5)/(24*h*h*h*h);
y(x) = C0+C1*(x-x1)+C2*(x-x1)^2+C3*(x-x1)^3+C4*(x-x1)^4
// where `^` denotes exponentiation (and not XOR).
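The closed-form coefficients for equally spaced points can be sanity-checked numerically. This is a small Python sketch, with hypothetical function names, that computes C0...C4 exactly as written above and evaluates y(x); the sample values are arbitrary.

```python
def equispaced_quartic(y1, y2, y3, y4, y5, h):
    """Coefficients C0..C4 of y(x) in powers of (x - x1), for spacing h."""
    c0 = y1
    c1 = -(25*y1 - 48*y2 + 36*y3 - 16*y4 + 3*y5) / (12 * h)
    c2 = (35*y1 - 104*y2 + 114*y3 - 56*y4 + 11*y5) / (24 * h**2)
    c3 = -(5*y1 - 18*y2 + 24*y3 - 14*y4 + 3*y5) / (12 * h**3)
    c4 = (y1 - 4*y2 + 6*y3 - 4*y4 + y5) / (24 * h**4)
    return c0, c1, c2, c3, c4


def evaluate(coeffs, x, x1):
    """Evaluate C0 + C1*(x-x1) + ... + C4*(x-x1)^4."""
    d = x - x1
    return sum(c * d**k for k, c in enumerate(coeffs))


# Arbitrary sample: five y values at x = 1.0, 1.5, 2.0, 2.5, 3.0.
start, step = 1.0, 0.5
values = [2.0, -1.0, 4.0, 0.0, 3.0]
coeffs = equispaced_quartic(*values, step)
fitted = [evaluate(coeffs, start + k * step, start) for k in range(5)]
```

If the formulas are right, `fitted` matches `values` up to rounding, because a quartic through five points is unique.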

Determining whether there is a descending pattern between two sampled numbers

I have two numbers that are samples of two different quantities (it doesn't really matter what they are). They both fluctuate with time. I have samples of these values from two different points in time. Call them a0, a1, b0, b1. I can use the differences (a1-a0, b1-b0), as well as the difference and the sum of those differences: (a1-a0)-(b1-b0) and (a1-a0)+(b1-b0).
My question is how to determine when both of them are descending, in a fashion that doesn't hard-code any constants. Let me explain.
I want to detect when both of these quantities have decreased by a certain amount but that amount may change if I change the quantities I'm sampling so I can't hard code a constant.
I'm sorry if this is vague but that's really all the information I have. I was just wondering if this is even solvable.
if (a1 - a0 < 0)
    if (b1 - b0 < 0) {
        //... descending
    }
or:
if (a1 - a0 + b1 - b0 < a1 - a0)          // b1 - b0 is negative
    if (a1 - a0 + b1 - b0 < b1 - b0) {    // a1 - a0 is negative
        //... descending
    }
To add a threshold is simple:
if (a1 - a0 < -K)
    if (b1 - b0 < -K) {
        //... descending, more than K
    }
or:
if (a1 - a0 + b1 - b0 < a1 - a0 - K)          // b1 - b0 is less than -K
    if (a1 - a0 + b1 - b0 < b1 - b0 - K) {    // a1 - a0 is less than -K
        //... descending more than K
    }
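The two forms above are equivalent, which a short Python sketch can confirm. The function names are made up for this example, and the threshold `k` is passed in as a parameter rather than hard-coded, with `k = 0` giving the plain "both descending" test. The comparison is done on integers; with floating-point samples the sum-based form may differ from the direct form by rounding.

```python
def descending_direct(a0, a1, b0, b1, k=0):
    """True when both quantities dropped by more than k between the two samples."""
    return (a1 - a0) < -k and (b1 - b0) < -k


def descending_via_sum(a0, a1, b0, b1, k=0):
    """Same test, expressed through the sum of the two differences."""
    da, db = a1 - a0, b1 - b0
    # da + db < da - k  is the same as  db < -k, and likewise for da.
    return (da + db) < (da - k) and (da + db) < (db - k)


# Compare the two forms on a grid of integer samples and thresholds.
grid = range(-3, 4)
forms_agree = all(
    descending_direct(a0, a1, b0, b1, k) == descending_via_sum(a0, a1, b0, b1, k)
    for a0 in grid for a1 in grid for b0 in grid for b1 in grid for k in (0, 1, 2)
)
```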

Adding two fractions, why a (minor) optimization works

I was adding a Fraction class to my codebase the other day (the first time, never needed one before and I doubt I do now, but what the hell :-)). When writing the addition between two fractions, I found a small optimization but it doesn't make sense (in the mathematical sense) why it is like it is.
To illustrate, I will use fractions A and B, effectively consisting of An, Bn, Ad and Bd for numerator and denominator respectively.
Here are two functions I use for GCD/LCM, the formulas are on Wikipedia as well. They're simple enough to understand. The LCM one could just as well be (A*B)/C of course.
static unsigned int GreatestCommonDivisor(unsigned int A, unsigned int B)
{
    return (!B) ? A : GreatestCommonDivisor(B, A % B);
}
static unsigned int LeastCommonMultiple(unsigned int A, unsigned int B)
{
    const unsigned int gcDivisor = GreatestCommonDivisor(A, B);
    return (A / gcDivisor) * B;
}
First let's go through the 1st approach:
least_common_mul = least_common_multiple(Ad, Bd)
new_numerator = An * (least_common_mul / Ad) + Bn * (least_common_mul / Bd)
new_denominator = least_common_mul
Voila, works, obvious, done.
Then through some scribbling on my notepad I came across another one that works:
greatest_common_div = greatest_common_divisor(Ad, Bd)
den_quot_a = Ad / greatest_common_div
den_quot_b = Bd / greatest_common_div
new_numerator = An * den_quot_b + Bn * den_quot_a
new_denominator = den_quot_a * Bd
Now the new denominator is fairly obvious, as it's exactly what happens in the LCM function. The other lines seem to make sense too, except that the factors the original numerators are multiplied by are swapped, in this line to be specific:
new_numerator = An * den_quot_b + Bn * den_quot_a
Why is that not An * den_quot_a + Bn * den_quot_b?
Input example: 5/12 & 11/18
greatest_common_div = 6
den_quot_a = 12/6 = 2;
den_quot_b = 18/6 = 3;
new_numerator = 5*3 + 11*2 = 37;
new_denominator = 36;
It's pretty straightforward, it's what you'd normally do to make fractions be over the same denominator - multiply each fraction's numerator and denominator by the factors that the other fraction has in its denominator that aren't present in the first.
2 is the factor of 36 which is missing from 18; 3 is the factor of 36 which is missing from 12. Thus, you multiply:
(5/12) * (3/3) ==> 15/36
(11/18) * (2/2) ==> 22/36
Perhaps you're missing one of the identities of number theory... for any two positive numbers m and n,
m*n = gcd(m,n) * lcm(m,n)
examples:
4*18 = 2 * 36
15*9 = 3 * 45
Finding a common denominator to fractions a/b and c/d involves using the lcm(b,d) or equivalently, bd/gcd(b,d).
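To make the "swapped" factors concrete, here is a minimal Python sketch of the second approach from the question, checked against the standard library's `fractions.Fraction`. The function name `add_fractions` is made up for this example, and the result is returned unreduced, as in the question.

```python
from fractions import Fraction
from math import gcd


def add_fractions(an, ad, bn, bd):
    """Add an/ad + bn/bd using the GCD-based shortcut; result is not reduced."""
    g = gcd(ad, bd)
    den_quot_a = ad // g  # factors of ad that bd is missing
    den_quot_b = bd // g  # factors of bd that ad is missing
    # Each numerator is scaled by what ITS denominator lacks, i.e. the other quotient.
    numerator = an * den_quot_b + bn * den_quot_a
    denominator = den_quot_a * bd  # equals lcm(ad, bd)
    return numerator, denominator


example = add_fractions(5, 12, 11, 18)  # the 5/12 + 11/18 example from the question
```

The denominator `den_quot_a * bd` is `ad * bd / gcd(ad, bd)`, which is the least common multiple by the identity m*n = gcd(m,n) * lcm(m,n).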

Resources