Hash collisions for golang built-in map and string keys?

I wrote this function to generate random unique id's for my test cases:
func uuid(t *testing.T) string {
	uidCounterLock.Lock()
	defer uidCounterLock.Unlock()
	uidCounter++
	//return "[" + t.Name() + "|" + strconv.FormatInt(uidCounter, 10) + "]"
	return "[" + t.Name() + "|" + string(uidCounter) + "]"
}

var uidCounter int64 = 1
var uidCounterLock sync.Mutex
In order to test it, I generate a bunch of values from it in different goroutines and send them to the main thread, which puts the result in a map[string]int by doing seen[v] = seen[v] + 1. There is no concurrent access to this map; it's private to the main thread.
var seen = make(map[string]int)
for v := range ch {
	seen[v] = seen[v] + 1
	if count := seen[v]; count > 1 {
		fmt.Printf("Generated the same uuid %d times: %#v\n", count, v)
	}
}
When I just cast the uidCounter to a string, I get a ton of collisions on a single key. When I use strconv.FormatInt, I get no collisions at all.
When I say a ton, I mean I just got 1115919 collisions for the value [TestUuidIsUnique|�] out of 2227980 generated values, i.e. 50% of the values collide on the same key. The values are not equal. I do always get the same number of collisions for the same source code, so at least it's somewhat deterministic, i.e. probably not related to race conditions.
I'm not surprised integer overflow in a rune would be an issue, but I'm nowhere near 2^31, and that wouldn't explain why the map thinks 50% of the values have the same key. Also, I wouldn't expect a hash collision to impact correctness, just performance, since I can iterate over the keys in a map, so the values are stored there somewhere.
In the output, all runes printed are 0xEFBFBD. It's the same number of bits as the highest valid unicode code point, but that doesn't really match either.
Generated the same uuid 2 times: "[TestUuidIsUnique|�]"
Generated the same uuid 3 times: "[TestUuidIsUnique|�]"
Generated the same uuid 4 times: "[TestUuidIsUnique|�]"
Generated the same uuid 5 times: "[TestUuidIsUnique|�]"
...
Generated the same uuid 2047 times: "[TestUuidIsUnique|�]"
Generated the same uuid 2048 times: "[TestUuidIsUnique|�]"
Generated the same uuid 2049 times: "[TestUuidIsUnique|�]"
...
What's going on here? Did the go authors assume that hash(a) == hash(b) implies a == b for strings? Or am I just missing something silly? go test -race isn't complaining either.
I'm on macOS 10.13.2, and go version go1.9.2 darwin/amd64.

Converting an integer to string yields the UTF-8 encoding of that code point. Any value that is not a valid code point, which includes everything above 0x10FFFF (1114111), converts to a string containing the Unicode replacement character: "�". Since your counter climbs past 0x10FFFF, every value beyond that point produces the identical key, which is why roughly half of your 2.2 million values collide.
Use the strconv package to convert an integer to text.

Related

Filter for highest numeric value from Results

The objective is to get the object from Realm that contains the highest numeric value in one of its properties. The object has a person's name (a string) and a person_id (also a string).
class PersonClass: Object {
    @objc dynamic var name = ""
    @objc dynamic var person_id = ""
}
The person_id can be a number, a string, or a combination. For this filter, all strings that do not contain only a number should be ignored. The table may look like this:

name   person_id
Henry  0000
Leroy  test
Frank  3333
Steve  a123

and the result should be

Henry  0000
Frank  3333  <- .last, or the highest
For a simple example, let's take this array
let arrayOfWords = ["thing", "1", "stuff", "2"]
and some code to get the string "1" and "2".
let swiftResults = arrayOfWords.compactMap { Int($0) } //knowing that Int has limitations
swiftResults.forEach { print($0) } //note we can use .last to get the last one
While there is a solution that fetches all Realm objects and then applies a Swift filter, the problem is that there could be thousands of persons, and as soon as the results object is manipulated as a Swift collection (like an array), it not only breaks the connection with Realm (losing the objects' live-updating ability) but the objects are no longer lazily loaded, taking up memory and possibly overwhelming the device.
Since Realm queries do not support regex, you'd have to check that no letter is contained in person_id and then take the maximum of what remains (e.g. via @max or sorting). As rough pseudocode, with the middle letters elided:

let numericOnly = realm.objects(PersonClass.self).filter("person_id NOT CONTAINS[c] 'a' AND person_id NOT CONTAINS[c] 'b' AND ..[all letters in between].. AND person_id NOT CONTAINS[c] 'z'")
Based on the insight provided in the answer by rs7, here's a coding solution that meets the criteria in the question.
TLDR: we build a compound AND predicate that filters Realm for all strings that have no alpha characters. i.e. get all strings that do not have 'a' AND 'b' AND 'c' etc.
This is important because there could be thousands of objects and while filtering using pure Swift is super simple code-wise, doing that goes around the lazy-loading characteristic of Realm and the entire dataset would need to be loaded into the array for filtering, which could overwhelm the device.
This solution keeps the Realm objects lazy avoiding that issue.
let charStringToOmit = "abcdefghijklmnopqrstuvwxyz" // the invalid chars
let charsOmitArray = Array(charStringToOmit)        // make an array of Character objects
var predicateArray = [NSPredicate]()                // the array of predicates

// iterate over the array of char objects
for char in charsOmitArray {
    let c = String(char) // make each char a string
    let pred = NSPredicate(format: "!(person_id CONTAINS[cd] %@)", c) // craft a predicate
    predicateArray.append(pred) // append the predicate to the array
}

// craft the compound AND predicate
let compound = NSCompoundPredicate(andPredicateWithSubpredicates: predicateArray)

let realm = try! Realm()
let results = realm.objects(PersonClass.self).filter(compound).sorted(byKeyPath: "person_id")
if let lastPerson = results.last {
    print(lastPerson.person_id)
}
I simplified the initial dataset provided for this example and limited it to only have a-z and 0-9 characters but that could be expanded on.

How to sum the digits in an integer using recursion?

Write a recursive method that computes the sum of the digits in an integer. Use the following method header:
public static int sumDigits(long n)
For example, sumDigits(234) returns 2 + 3 + 4 = 9. Write a program that prompts the user to enter an integer and displays the sum of its digits.
Receive an integer as a parameter
Convert it to a string
Split the string into its individual characters
Remove one character (first or last, it doesn't matter)
Join the remaining characters back into a single string
Parse that string back to an integer
Compute result = (removed char as integer) + sumDigits(remaining chars as integer) <-- this is the recursion, stopping once only a single digit remains
In the future, you should at least post one attempt of your own for others to help you with when you ask an obvious homework question ;)
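The steps above are language-agnostic pseudocode (the question's header is Java); here is the same string-based recursion sketched in Go, matching the rest of this thread, and assuming a non-negative input:

```go
package main

import (
	"fmt"
	"strconv"
)

// sumDigits follows the steps above: convert the number to a string,
// peel off the last character, and recurse on the rest.
// Assumes n is non-negative.
func sumDigits(n int64) int64 {
	s := strconv.FormatInt(n, 10)
	if len(s) == 1 {
		return n // base case: a single digit sums to itself
	}
	last, _ := strconv.ParseInt(s[len(s)-1:], 10, 64) // the removed character
	rest, _ := strconv.ParseInt(s[:len(s)-1], 10, 64) // the remaining characters
	return last + sumDigits(rest)
}

func main() {
	fmt.Println(sumDigits(234)) // 2 + 3 + 4 = 9
}
```

(The purely arithmetic version, n%10 + sumDigits(n/10), avoids the string round-trip entirely, but the sketch above mirrors the steps as listed.)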

Split string based on byte length in golang

The http request header has a 4k length limit.
I want to split the string which I want to include in the header based on this limit.
Should I use []byte(str) to split first then convert back to string using string([]byte) for each split part?
Is there any simpler way to do it?
In Go, a string is really just a sequence of bytes, and indexing a string produces bytes. So you could simply split your string into substrings by slicing it into 4kB substrings.
However, since UTF-8 characters can span multiple bytes, there is the chance that you will split in the middle of a character sequence. This isn't a problem if the split strings will always be joined together again in the same order at the other end before decoding, but if you try to decode each individually, you might end up with invalid leading or trailing byte sequences. If you want to guard against this, you could use the unicode/utf8 package to check that you are splitting on a valid leading byte, like this:
package httputil

import "unicode/utf8"

const maxLen = 4096

func SplitHeader(longString string) []string {
	splits := []string{}
	var l, r int
	for l, r = 0, maxLen; r < len(longString); l, r = r, r+maxLen {
		for !utf8.RuneStart(longString[r]) {
			r--
		}
		splits = append(splits, longString[l:r])
	}
	splits = append(splits, longString[l:])
	return splits
}
Slicing the string directly is more efficient than converting to []byte and back because, since a string is immutable and a []byte isn't, the data must be copied to new memory upon conversion, taking O(n) time (both ways!), whereas slicing a string simply returns a new string header backed by the same array as the original (taking constant time).

How does timestamp hashing work?

Suppose you have an existing hash g84t5tw73y487tb38wo4bq8o34q384o7nfw3q434hqa which was created from the original string dont downvote my stupid question
Now I timestamp this hash like this (in JS/pseudo-code):
var hash = 'g84t5tw73y487tb38wo4bq8o34q384o7nfw3q434hqa';
var today = new Date(); // 2017-10-19
var timestamped = hash + today;
var new_hash = SHA256(timestamped);
// new_hash is 34t346tf3847tr8qrot3r8q248rtbrq4brtqti4t
If I wanted to verify my original string I can do:
var verified = goodHash('dont downvote my stupid question',hash); // true
If I wanted to verify the timestamped version I can do:
var original_hash = 'g84t5tw73y487tb38wo4bq8o34q384o7nfw3q434hqa';
var today = '2017-10-19';
var verified = goodHash(original_hash+today, timestamped_hash); // true
But if I tried to verify the original string against the timestamp, I CANT do:
var today = '2017-10-19';
var verified = goodHash('dont downvote my stupid question'+today, timestamped_hash); // FALSE
Now suppose this original string is hashed and timestamped over and over again for n iterations.
I would only ever be able to verify the n-1th timestamp, provided I have the n-1th hash.
But what if I have the original string dont downvote my stupid question and want to verify any ith timestamp, where 0 < i < n.
Basically, I want to verify whether a string that only I should have knowledge of, has been timestamped with a given date, regardless of how many times it may have been timestamped and without increasing the length of the string (too much - although any increase in length would approach infinity as n grows).
Is this even possible? Can a hash even contain all this information?
Let's look at the mathematics involved here:
First, you have an input string s and a sequence t of timestamps. I will use t[i] to denote the ith timestamp. Your repeated hashing is a recurrence relation:
f(i) = hash(f(i-1) + t[i])
where + denotes string concatenation. Now we want to determine whether there is a closed formula F(i) which will calculate the ith hash with less time complexity than evaluating the recurrence relation f(i).
One way to accomplish this is to find a string x(i) with the same hash as f(i-1) + t[i]. For a good hashing algorithm, such collisions are exceedingly rare. My gut instinct is that finding such a string (other than f(i-1) + t[i] itself) is harder than simply evaluating the recurrence relation directly.
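To make the recurrence concrete, here is a small Go sketch (the names hash and chain are illustrative, standing in for the question's SHA256 pseudo-code). It also shows what the answer implies: verifying the ith timestamp starting from the original string means replaying all i timestamps in order; there is no shortcut.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hash is SHA-256 rendered as a hex string.
func hash(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// chain computes f(i) = hash(f(i-1) + t[i]), with f(0) = hash(original).
func chain(original string, timestamps []string) string {
	h := hash(original)
	for _, t := range timestamps {
		h = hash(h + t)
	}
	return h
}

func main() {
	ts := []string{"2017-10-17", "2017-10-18", "2017-10-19"}
	final := chain("dont downvote my stupid question", ts)

	// Verifying the final hash from the original string requires
	// replaying every timestamp in order -- no intermediate hashes
	// are needed, but all timestamps up to i are.
	fmt.Println(chain("dont downvote my stupid question", ts) == final) // true
}
```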

openresty: convert int64 to string

I am using openresty/1.7.7.2 with Lua 5.1.4. I receive an int64 in the request, and I have its string form saved in the DB (I can't change the DB schema or the request format). I am not able to match the two.
local i = 913034578410143848   -- from the request
local p = "913034578410143848" -- stored in DB
print(p == tostring(i)) -- returns false
print(i % 10)           -- returns 0 .. this also doesn't work
Is there a way to convert int64 to string and vice versa if possible?
Update: I am getting i from a protobuf object. The proto file describes i as int64. I am using the pb4lua library for protobuf.
ngx.req.read_body()
local body = ngx.req.get_body_data()
local request, err = Request:load(body)
local i = request.id
Lua 5.1 cannot exactly represent integer values larger than 2^53, and number literals are no exception, so you cannot simply write
local i = 913034578410143848
LuaJIT, however, can represent int64 values as boxed (cdata) values. There are also Lua libraries for dealing with large numbers, e.g. the bn library.
I do not know how pb4lua handles this problem. The lua-pb library, for example, uses LuaJIT boxed values, and it also provides a way to specify a user-defined callback for constructing int64 values.
First, I suggest figuring out the real type of your i value (use the type function). Everything else depends on that.
If it is number, then pb4lua has probably already lost some precision.
Maybe it returns a string, in which case you can simply compare it as a string.
If it provides LuaJIT cdata, then here is a basic function to convert a string to an int64 value:
local ffi = require "ffi"

local function to_jit_uint64(str)
    -- parse the first (up to) 9 digits, which safely fit in a double
    local v = tonumber(string.sub(str, 1, 9))
    v = ffi.new('uint64_t', v)
    if #str > 9 then
        -- shift by the number of remaining digits, then add them
        str = string.sub(str, 10)
        v = v * (10 ^ #str) + tonumber(str)
    end
    return v
end
