URL-compact representation of GUID/UUID?

I need to generate a GUID and save it via a string representation. The string representation should be as short as possible as it will be used as part of an already-long URL string.
Right now, instead of using the normal abcd-efgh-... representation, I base64-encode the raw generated bytes, which results in a somewhat shorter string.
But is it possible to make it even shorter?
I'm OK with losing some degree of uniqueness and keeping a counter, but scanning all existing keys is not an option. Suggestions?

I used an Ascii85 encoding for writing a Guid to a database column in 20 ASCII characters. I've posted the C# code in case it is useful. The specific character set may be different for a URL encoding, but you can pick whichever characters suit your application. It's available here: What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?

Sure, just use a base larger than 64. You'll have to encode them using a custom alphabet, but you should be able to find a few more "url-safe" printable ASCII characters.
Base64 encodes 6 bits per character, so a 16-byte (128-bit) GUID becomes 22 characters. You may be able to shave off a character or two with a larger alphabet, but not much more.
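A quick sketch of the base64 approach in Python (not the poster's code; uses the URL-safe alphabet so no further escaping is needed):

```python
import base64
import uuid

def short_guid():
    """Encode a random UUID's 16 raw bytes as URL-safe base64.

    128 bits / 6 bits per character = 22 characters; the two
    trailing '=' padding characters carry no information and
    can safely be stripped.
    """
    raw = uuid.uuid4().bytes  # 16 raw bytes, not the 36-char text form
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

print(short_guid())  # 22 URL-safe characters
```

To decode, re-append `==` and call `base64.urlsafe_b64decode`.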

I found this discussion interesting: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
Basically you take the 36 characters and turn them into 16 bytes of binary, but first reorder the three temporal pieces using a stored function:
set @uuid := uuid();
select @uuid;
+--------------------------------------+
| @uuid                                |
+--------------------------------------+
| 59f3ac1e-06fe-11e6-ac3c-9b18a7fcf9ed |
+--------------------------------------+
CREATE DEFINER=`root`@`localhost`
FUNCTION `ordered_uuid`(uuid BINARY(36))
RETURNS BINARY(16) DETERMINISTIC
RETURN UNHEX(CONCAT(SUBSTR(uuid, 15, 4), SUBSTR(uuid, 10, 4), SUBSTR(uuid, 1, 8), SUBSTR(uuid, 20, 4), SUBSTR(uuid, 25)));
select hex(ordered_uuid(@uuid));
+----------------------------------+
| hex(ordered_uuid(@uuid))         |
+----------------------------------+
| 11e606fe59f3ac1eac3c9b18a7fcf9ed |
+----------------------------------+
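The same field reordering is easy to do outside the database; here is a sketch in Python, checked against the example above:

```python
def ordered_uuid(u):
    """Reorder a textual UUIDv1 the way the MySQL ordered_uuid()
    function does: time_hi, time_mid, time_low first, so that
    time-adjacent UUIDs sort near each other as binary."""
    h = u.replace('-', '')
    return h[12:16] + h[8:12] + h[0:8] + h[16:]

print(ordered_uuid('59f3ac1e-06fe-11e6-ac3c-9b18a7fcf9ed'))
# -> 11e606fe59f3ac1eac3c9b18a7fcf9ed (matches the hex() output above)
```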

I'm not sure if this is feasible, but you could store all the generated GUIDs in a table and use only the GUID's table index in the URL.
You could also reduce the length of the GUID - for example, use 2 bytes for the number of days since 2010 and 4 bytes for the number of milliseconds since the start of the current day. You will get collisions only for two GUIDs generated in the same millisecond. You could also add 2 more random bytes, which makes this even better.
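A minimal sketch of that days + milliseconds + random-bytes layout in Python (the 2010 epoch and field widths follow the suggestion above; all names are my own):

```python
import os
import struct
from datetime import datetime, timezone

EPOCH = datetime(2010, 1, 1, tzinfo=timezone.utc)

def compact_id():
    """8-byte id: 2 bytes of days since 2010, 4 bytes of milliseconds
    since midnight, 2 random bytes. A collision requires two ids in the
    same millisecond that also draw the same random 16 bits."""
    now = datetime.now(timezone.utc)
    days = (now - EPOCH).days  # fits in 16 bits for ~179 years
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    ms = int((now - midnight).total_seconds() * 1000)
    return struct.pack('>HI', days, ms) + os.urandom(2)

print(compact_id().hex())  # 16 hex chars instead of the UUID's 32
```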

(long time, but I just ran into the same need today)
UUIDs are 128 bits long, represented as 32 hex digits plus 4 hyphens.
If we use a dictionary of 64 (2^6) printable ASCII characters, it is just a matter of converting 32 groups of 4 bits (one hex digit each) into 22 groups of 6 bits.
Here is a UUID shortener. Instead of 36 chars you get 22, without losing any of the original bits.
https://gist.github.com/tomlobato/e932818fa7eb989e645f2e64645cf7a5
class UUIDShortner
  IGNORE = '-'
  BASE6_SLAB = ' ' * 22

  # 64-item (6-bit) dictionary
  DICT = 'a'.upto('z').to_a +
         'A'.upto('Z').to_a +
         '0'.upto('9').to_a +
         ['_', '-']

  def self.uuid_to_base6 uuid
    # Pack the 32 hex digits into one 128-bit integer
    uuid_bits = 0
    uuid.each_char do |c|
      next if c == IGNORE
      uuid_bits = (uuid_bits << 4) | c.hex
    end

    # Emit 22 characters, 6 bits at a time
    base6 = BASE6_SLAB.dup
    base6.size.times { |i|
      base6[i] = DICT[uuid_bits & 0b111111]
      uuid_bits >>= 6
    }
    base6
  end
end
# Examples:
require 'securerandom'
uuid = ARGV[0] || SecureRandom.uuid
short = UUIDShortner.uuid_to_base6 uuid
puts "#{uuid}\n#{short}"
# ruby uuid_to_base6.rb
# c7e6a9e5-1fc6-4d5a-b889-4734e42b9ecc
# m75kKtZrjIRwnz8hLNQ5hd

You could approach this from the other direction. Produce the shortest possible string representation and map it into a Guid.
Generate the key using a defined alphabet as below:
In pseudocode:
string RandomString(char[] alphabet, int length)
{
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < length; i++)
        result.Append(alphabet[RandomInt(0, alphabet.Length)]);
    return result.ToString();
}
If you keep the string length < 16, you can simply hex encode the result and pass it to the Guid constructor to parse.
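The same idea as a runnable sketch in Python (not the poster's code; the alphabet and length are whatever your URL scheme allows):

```python
import secrets

def random_string(alphabet, length):
    """Pick `length` characters uniformly from `alphabet`
    using a cryptographically secure RNG."""
    return ''.join(secrets.choice(alphabet) for _ in range(length))

# 12 hex characters: short enough to zero-pad to 32 and parse as a GUID
key = random_string('0123456789abcdef', 12)
print(key)
```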

Not the exact same problem, but very close - I have used CRC64 and base64-encoded the result, which gives you 11 characters (unpadded). CRC64 has been tested (though not proven) not to produce duplicates across a wide range of strings.
And since it is 64 bits long by definition, you get a key half the size.
To directly answer the original question: you can CRC64-hash any representation of your GUIDs.
Or just run CRC64 on the business key and you will have a 64-bit value, unique in practice, that you can then base64-encode.

Related

Interface design for binary file viewer

I'm currently working on a simple binary viewer for different file formats.
In essence, it parses the binary file to some human readable representation (usually something similar to json).
(Using a JSON result as an example) each field would then have a location in the binary data associated with it. This would be an array of intervals (one interval if the source is a contiguous block).
My problem is how do I design a generic interface for a File Format Reader which can support a reasonable range of different file formats.
Some examples to illustrate the point better
In case the binary file contains a point consisting of 2 integer coordinates each 4 bytes
function read_point( stream ){
    const point = new Record();
    point.addField( "x", Integer( stream.read( 4 ) ) );
    point.addField( "y", Integer( stream.read( 4 ) ) );
    return point;
}
Here stream.read(n) would return n bytes as well as the location of those bytes, for example like a struct of the form { data: ?, location: ?}
A function like Integer would then interpret the bytes as an int and keep the location.
Record could then work similarly to the results of stream.read and Integer: it has data, namely the values of x and y, and also a location, the union of the locations of x and y.
This sort of design with each output being of the form { data: ?, location: ?} works fine for data whose size is known in advance and whose location is somewhat continuous.
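The { data: ?, location: ? } design described above can be sketched briefly; here in Python, with every name (Located, Stream, integer, join) hypothetical, including a join helper for the non-contiguous case asked about below:

```python
class Located:
    """A value plus the list of (start, end) byte intervals it came from."""
    def __init__(self, data, intervals):
        self.data = data
        self.intervals = intervals

class Stream:
    def __init__(self, raw):
        self.raw, self.pos = raw, 0

    def read(self, n):
        """Return n bytes together with their location."""
        chunk = Located(self.raw[self.pos:self.pos + n],
                        [(self.pos, self.pos + n)])
        self.pos += n
        return chunk

def integer(located):
    """Interpret located bytes as a big-endian int, keeping the location."""
    return Located(int.from_bytes(located.data, 'big'), located.intervals)

def join(*parts):
    """Concatenate located chunks - this covers non-contiguous fields."""
    data = b''.join(p.data for p in parts)
    intervals = [iv for p in parts for iv in p.intervals]
    return Located(data, intervals)
```

Under this sketch, a 4-byte integer split by padding is `integer(join(higher, lower))`, and its location is the list of the two intervals.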
Can this idea be extended or reworked for cases like:
an Integer of 4 bytes whose bytes are not contiguous
function exotic_example_1( stream ){
    const higher = stream.read( 2 );
    stream.read( 1 ); // Some padding data, for example
    const lower = stream.read( 2 );
    Integer( ?join?( higher, lower ) );
}
Or, for example, a string terminated by a \0 byte, where you might start decoding the data at the same time as you are reading it.
I want the interface to be of a form where keeping track of the origin of bytes is somewhat automated.

How to extract n characters from a string of undefined length using regex in varnish?

I am writing a Varnish module (VCL) for my backend server. It requires logic for extracting n characters from a string of undefined length.
I tried the regsub() function of VCL with a regex to replace part of the string with empty space.
I need to extract the first 20 characters of the string. When the string length is 36, I used a regex to replace the last 16 characters with empty space.
But when the length of the string is different, say 40, I get 24 characters instead of 20. How do I achieve this?
set req.http.mysubstr = regsub(req.http.mystring, ".{16}$", "");
set req.http.mysubstr = regsub(req.http.mystring, ".{($variable)}$", ""); # $variable should be the length of the string - first 20 characters
Use capturing groups:
regsub(req.http.mystring, "^(.{20}).*", "\1")
test it on regex101.com
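The capturing-group approach is easy to sanity-check outside Varnish; here with Python's re module (the pattern is the same, and strings shorter than 20 characters pass through unchanged because the pattern simply fails to match):

```python
import re

def first20(s):
    # Keep only the first 20 characters via a capturing group.
    return re.sub(r'^(.{20}).*', r'\1', s)

print(first20('0123456789012345678901234567890123456789'))  # 40 chars in, 20 out
```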

Nginx: How to convert post data encoding from UTF-8 to TIS-620 before pass them through proxy

I would like to convert the POST data received with the request from UTF-8 to TIS-620 before passing it to the backend via proxy_pass, using the config below, but I am not sure how to do it:
location / {
proxy_pass http://targetwebsite;
}
If I am not wrong, I believe I have to use the Lua module to manipulate the request, but I don't know whether it supports any character conversion.
Could anybody help me with sample code to convert POST data from UTF-8 to TIS-620 using Lua, and with validating that the POST data is UTF-8 before converting it? Or is there a better way to manipulate/convert POST data in nginx?
This solution works on Lua 5.1/5.2/5.3
local function utf8_to_unicode(utf8str, pos)
    -- pos = starting byte position inside input string
    local code, size = utf8str:byte(pos), 1
    if code >= 0xC0 and code < 0xFE then
        local mask = 64
        code = code - 128
        repeat
            local next_byte = utf8str:byte(pos + size)
            if next_byte and next_byte >= 0x80 and next_byte < 0xC0 then
                code, size = (code - mask - 2) * 64 + next_byte, size + 1
            else
                return
            end
            mask = mask * 32
        until code < mask
    elseif code >= 0x80 then
        return
    end
    -- returns code, number of bytes in this utf8 char
    return code, size
end

function utf8to620(utf8str)
    local pos, result_620 = 1, {}
    while pos <= #utf8str do
        local code, size = utf8_to_unicode(utf8str, pos)
        if code then
            pos = pos + size
            code =
                (code < 128 or code == 0xA0) and code
                or (code >= 0x0E01 and code <= 0x0E3A or code >= 0x0E3F and code <= 0x0E5B)
                    and code - 0x0E5B + 0xFB
        end
        if not code then
            return utf8str -- invalid UTF-8 symbol; not a UTF-8 string, return the original
        end
        table.insert(result_620, string.char(code))
    end
    return table.concat(result_620) -- return converted string
end
Usage:
local utf8string = "UTF-8 Thai text here"
local tis620string = utf8to620(utf8string)
I looked up the encoding on Wikipedia and came up with the following solution for converting from UTF-8 to TIS-620. It assumes that all the codepoints in the UTF-8 string have an encoding in TIS-620. It will work if the UTF-8 string only contains ASCII printable characters (codepoints " " to "~") or Thai characters (codepoints "ก" to "๛"). Otherwise, it will give wrong and possibly very strange results.
This assumes you have Lua 5.3's utf8 library or an equivalent. If you're using an earlier version of Lua, one possibility is the pure-Lua version of the ustring library from MediaWiki (used by Wikipedia and Wiktionary, for instance). It provides a function to validate UTF-8, and many of the other functions will validate strings automatically. (That is, they throw an error if the string is invalid UTF-8.) If you use that library, you just have to replace utf8.codepoint with ustring.codepoint in the code below.
-- Add this number to TIS-620 values above 0x80 to get the Unicode codepoint.
-- 0x0E00 is the start of the Thai block; 0xA0 is the corresponding offset in TIS-620.
local difference = 0xE00 - 0xA0

function UTF8_to_TIS620(UTF8_string)
    local TIS620_string = UTF8_string:gsub(
        '[\194-\244][\128-\191]+',
        function (non_ASCII)
            return string.char(utf8.codepoint(non_ASCII) - difference)
        end)
    return TIS620_string
end
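Where a ready-made TIS-620 codec is available, the whole conversion is a one-liner; Python ships one in its standard encodings, which is also handy for cross-checking the Lua functions above:

```python
def utf8_to_tis620(data: bytes) -> bytes:
    """Decode as UTF-8 (raises UnicodeDecodeError on invalid input,
    which doubles as the validation step), then re-encode as TIS-620."""
    return data.decode('utf-8').encode('tis-620')

# Thai ko kai (U+0E01) and kho khai (U+0E02) map to 0xA1 and 0xA2 in TIS-620
print(utf8_to_tis620('กข ab'.encode('utf-8')))  # b'\xa1\xa2 ab'
```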

what is the fastest way to replace multiple words in a very long string simultaneously

Suppose I have a string which contains more than 10,000 words - for example, the text of the famous novel "The Old Man and the Sea" -
and a dictionary of 1,000 word pairs, for example:
before, after
--------------
fish, net
black, white
good, bad
....
....
round, rect
What I want to do is, according to the dictionary, replace all 'fish' in the string with 'net', 'black' with 'white', and so on.
The simplest and most intuitive algorithm is:
foreach(line, dict)
    str.replace(line.before, line.after)
but it is very inefficient.
One improvement I can think of is to
separate the string into multiple small strings, then
use multiple threads to handle each small string respectively, then combine the results.
Are there any other ideas?
BTW, I am using Qt.
I think it's a better idea to have a vector of 10k words, not a string of characters.
Just like this:
QVector<QString> myLongString;
Your dictionary can be implemented as a hash table:
QHash<QString, QString> dict;
This will provide constant access time to your dictionary words:
QString replaceWith = dict.value("fish"); // replaceWith == "net"
Then you can iterate through your vector and replace words:
for (int i = 0; i < myLongString.size(); ++i)
{
    QString word = myLongString[i];
    if (dict.contains(word))
    {
        myLongString[i] = dict.value(word);
    }
}
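An alternative that keeps the text as one string is to build a single alternation regex from the dictionary and replace everything in one pass; this also avoids a subtle bug of the naive loop, where one pair's output can be matched by a later pair. A sketch in Python rather than Qt:

```python
import re

def replace_all(text, mapping):
    # Sort longer keys first so e.g. 'blackbird' wins over 'black'
    # if both are dictionary keys; \b keeps matches to whole words.
    keys = sorted(map(re.escape, mapping), key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
    return pattern.sub(lambda m: mapping[m.group(1)], text)

text = 'the good fish saw a black net'
print(replace_all(text, {'fish': 'net', 'black': 'white', 'good': 'bad'}))
# -> 'the bad net saw a white net'
```

Note that the existing 'net' in the input is untouched, while a sequential str.replace loop running the fish→net pair first would leave it indistinguishable from a replacement.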

Substring with Multiple Arguments

I have a dropdownlist whose items combine the values of two columns: a number ranging from 5 to 8 characters long, then a space, then the '|' character and another space, followed by a description for that set of numbers.
An example:
12345678 | Description of Product
In order to pull the items from the dropdownlist into my database, I need to utilize a substring to pull out only the sequence of numbers.
Is it possible to write a substring that handles multiple character lengths? (Sometimes it may be 6 numbers, sometimes 5, sometimes 8 - it depends on what the user selected from the dropdownlist.)
Use a regular expression for this.
Assuming the number is at the start of the string, you can use the following:
^[0-9]+
Usage:
var theNumbers = Regex.Match(myDropdownValue, "^[0-9]+").Value;
You could also use string.Split to get the parts separated by | if you know the first part is what you need and will always be numeric:
var theNumbers = myDropdownValue.Split("| ".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries)[0];
Either of these approaches will result in a string. You can use int.Parse on the result in order to get an integer from it.
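Both approaches are easy to sanity-check; here is the same regex and the same split idea in Python (assumed separator " | ", as in the example value):

```python
import re

value = '12345678 | Description of Product'

by_regex = re.match(r'^[0-9]+', value).group(0)  # leading digits only
by_split = value.split(' | ')[0]                 # first field before the separator

print(by_regex, by_split)  # both '12345678'
```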
This is how I would do it:
string str = "12345678 | Description of Product";
int delimiter = str.IndexOf("|") - 1;
string id = str.Substring(0, delimiter);
string desc = str.Substring(delimiter + 3);
Try using a regex to pull out the first match of a sequence of numbers of any length. The regex will look something like "^\d+" - starts with any number of decimal digits.
Instead of using Substring, you should use the Split function:
var words = phrase.Split(new string[] { " | " },
    StringSplitOptions.RemoveEmptyEntries);
var number = words[0];
