What does "bucket entries" mean in the context of a hashtable? - hashtable

What does "bucket entries" mean in the context of a hashtable?

A bucket is simply a fast-access location (like an array index) that is the result of the hash function.
The idea with hashing is to turn a complex input value into a different value which can be used to rapidly extract or store data.
Consider the following hash function for mapping people's names into street addresses.
First take the initials from the first and last name and turn them both into numeric values (0 through 25, from A through Z). Multiply the first by 26 and add the second, and this gives you a value from 0 to 675 (26 * 26 distinct values, or bucket IDs). This bucket ID is then used to store or retrieve the information.
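In code, that hash might look something like this (a minimal Python sketch for illustration):

def bucket_id(first_name: str, last_name: str) -> int:
    first = ord(first_name[0].upper()) - ord("A")    # A=0 .. Z=25
    last = ord(last_name[0].upper()) - ord("A")
    return first * 26 + last                         # 0 .. 675

print(bucket_id("George", "Washington"))             # 178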
Now you can have a perfect hash (where each allowable input value maps to a distinct bucket ID) so that a simple array will suffice for the buckets. In that case, you can just maintain an array of 676 street addresses and use the bucket ID to find the one you want:
+-------------------+
| George Washington | -> hash(GW)
+-------------------+       |
                            +-> GwBucket[George's address]

+-------------------+
| Abraham Lincoln   | -> hash(AL)
+-------------------+       |
                            +-> AlBucket[Abe's address]
However, this means that George Wendt and Allan Langer are going to cause problems in the future.
Or you can have an imperfect hash (such as one where John Smith and Jane Seymour would end up with the same bucket ID).
In that case, you need a more complex backing data structure than a simple array, to maintain a collection of addresses. This could be as simple as a linked list, or as complex as yet another hash:
+------------+           +--------------+
| John Smith |           | Jane Seymour |
+------------+           +--------------+
      |                         |
      V                         V
   hash(JS)                  hash(JS)
      |                         |
      +--------> JsBucket <-----+
                    |
                    V
+-----------------------------------+
| John Smith   -> [John's address]  |
| Jane Seymour -> [Jane's address]  |
+-----------------------------------+
Then, as well as the initial hash lookup, an extra level of searching needs to be carried out within the bucket itself, to find the specific information.
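A rough sketch of that arrangement (Python purely for illustration; each bucket is a plain list of (key, value) pairs, reusing the initials hash from above):

def bucket_id(name):
    first, last = name.split()
    return (ord(first[0].upper()) - ord("A")) * 26 + (ord(last[0].upper()) - ord("A"))

buckets = [[] for _ in range(676)]        # each bucket holds (key, value) pairs

def put(name, address):
    buckets[bucket_id(name)].append((name, address))

def get(name):
    for key, value in buckets[bucket_id(name)]:   # the extra search inside the bucket
        if key == name:
            return value
    raise KeyError(name)

put("John Smith", "John's address")
put("Jane Seymour", "Jane's address")     # collides with John Smith: same bucket ID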

From Wikipedia:
A hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
Each entry in the array/vector is called a bucket.

I think a bucket is a structure that contains at least a hash value, which works as an index (hash values are generated by hash functions), but the structure itself may or may not contain the entries (data).
illustration:
[hash value][pointer to actual data] ---> [actual data]
|<--------- bucket structure -------->|

[hash value][actual data]
|<---- bucket structure ---->|

It is the [hash value] part that works as the index.
The diagrams on the hash table Wikipedia page are pretty straightforward. They show that entries (data) can either be stored within the buckets themselves, or be stored in their own data structure, with the bucket simply pointing to the data.

Both rehashing and coalesced hashing assume fixed table sizes determined in advance. If the number of records grows beyond the number of table positions, it is impossible to insert them without allocating a larger table and recomputing the hash values.
Another method of resolving hash clashes is separate chaining. The term bucket is generally used with separate chaining. Separate chaining involves keeping a distinct linked list for all records whose keys hash into a particular value.
Suppose that the hash function produces values between 0 and tablesize - 1. Then an array bucket of header nodes of size tablesize is declared. This array is called the hash table.
bucket[i], a bucket entry, points to the list of all records whose keys hash into i. To insert a record, the list head bucket[i] is accessed and the record is inserted at the tail end.
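A minimal sketch of that layout (Python for illustration; the node and variable names are mine): bucket[i] heads the linked list of records whose keys hash to i, and a new record goes at the tail.

class Node:
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

tablesize = 101
bucket = [None] * tablesize             # the hash table: one header slot per hash value

def insert(key, value):
    i = hash(key) % tablesize           # any hash producing 0 .. tablesize - 1
    node = Node(key, value)
    if bucket[i] is None:               # empty chain: the new node becomes the head
        bucket[i] = node
        return
    tail = bucket[i]
    while tail.next is not None:        # walk to the tail end
        tail = tail.next
    tail.next = node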

Related

DynamoDB Global Secondary Index "Batch" Retrieval

I've seen older posts around this but hoping to bring this topic up again. I have a table in DynamoDB that has a UUID for the primary key and I created a global secondary index (GSI) for a more business-friendly key. For example:
| account_id  | email           | first_name | last_name |
|-------------|-----------------|------------|-----------|
| 4f9cb231... | linda@gmail.com | Linda      | James     |
| a0302e59... | bruce@gmail.com | Bruce      | Thomas    |
| 3e0c1dde... | harry@gmail.com | Harry      | Styles    |
If account_id is my primary key and email is my GSI, how do I query the table to get accounts with email in ('linda@gmail.com', 'harry@gmail.com')? I looked at the IN conditional expression but it doesn't appear to work with a GSI. I'm using the Go SDK v2 library but will take any guidance. Thanks.
Short answer, you can't.
DDB is designed to return a single item, via GetItem(), or a set of related items, via Query(). Related meaning that you're using a composite primary key (hash key & sort key) and the related items all have the same hash key (aka partition key).
Another way to think of it, you can't Query() a DDB Table/index. You can only Query() a specific partition in a table or index.
Scan() is the only operation that works across partitions in one shot. But scanning is very inefficient and costly since it reads the entire table every time.
You'll need to issue a GetItem() for every email you want returned.
Luckily, DDB now offers BatchGetItem(), which allows you to send multiple GetItem() requests, up to 100, in a single call. It saves a little bit of network time and automatically runs the requests in parallel, but otherwise it is little different from what your application could do itself directly with GetItem(). Make no mistake, BatchGetItem() is making individual GetItem() requests behind the scenes. In fact, the requests in a BatchGetItem() don't even have to be against the same tables/indexes. The cost for each request in a batch will be the same as if you'd used GetItem() directly.
One difference to make note of: BatchGetItem() can only return 16 MB of data. So if your DDB items are large, you may not get as many returned as you requested.
For example, if you ask to retrieve 100 items, but each individual
item is 300 KB in size, the system returns 52 items (so as not to
exceed the 16 MB limit). It also returns an appropriate
UnprocessedKeys value so you can get the next page of results. If
desired, your application can include its own logic to assemble the
pages of results into one dataset.
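As a sketch of those mechanics (Python with boto3 rather than the Go SDK v2 the question mentions; the table name is assumed, and note that BatchGetItem addresses items by the table's primary key, account_id here, not by a GSI):

import boto3

ddb = boto3.client("dynamodb")

resp = ddb.batch_get_item(
    RequestItems={
        "accounts": {                                   # table name assumed
            "Keys": [
                {"account_id": {"S": "4f9cb231..."}},
                {"account_id": {"S": "3e0c1dde..."}},
            ]
        }
    }
)
items = resp["Responses"]["accounts"]
# Keys dropped because of the 16 MB limit show up in resp["UnprocessedKeys"].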
Because you have a GSI with a PK of email (from what I understand) you can use a PartiQL command to get your batch of emails back. The API is called ExecuteStatement and you use a SQL-like syntax:
SELECT * FROM mytable.myindex WHERE email IN ['email@email.com','email1@email.com']
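A sketch of the same statement via boto3's ExecuteStatement (the Go SDK v2 exposes an equivalent call; mytable/myindex are placeholders as above):

import boto3

ddb = boto3.client("dynamodb")

resp = ddb.execute_statement(
    Statement='SELECT * FROM "mytable"."myindex" WHERE email IN [?, ?]',
    Parameters=[{"S": "linda@gmail.com"}, {"S": "harry@gmail.com"}],
)
items = resp["Items"]        # items come back as DynamoDB-typed attribute maps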

DynamoDB Single Table Schema Design with Adjacency Lists

I am trying to understand how to properly design a DynamoDB schema. I've read a few articles, watched some YouTube videos but, to be honest, I don't yet feel quite comfortable.
This is what I am trying to design properly:
two entities, "location" (id & name) and "vehicle" (id & name)
a location can have 0-n vehicles
a vehicle can be in 0-1 locations
Access patterns:
get a list of all available locations (id & name)
get a list of all available vehicles and their current location (id, name, location-id, location-name)
get a list of all vehicles in a given location (id, name)
I've read about adjacency lists, and because there will be n-m relations, I've decided to give them a try.
This is what I've come up with:
# | PK (GSI1-SK)         | SK (GSI1-PK)       | DATA
==|======================|====================|==============
1 | LOCATION#locationId1 | A                  | locationName1
2 | LOCATION#locationId2 | A                  | locationName2
3 | LOCATION#locationId1 | VEHICLE#vehicleId1 |
4 | LOCATION#locationId1 | VEHICLE#vehicleId2 |
5 | LOCATION#locationId2 | VEHICLE#vehicleId3 |
6 | VEHICLE#vehicleId1   | A                  | vehicleName1
7 | VEHICLE#vehicleId2   | A                  | vehicleName2
8 | VEHICLE#vehicleId3   | A                  | vehicleName3
#1-2 & #6-8 are my entity records, those with additional data for the entity itself (e.g. its name).
#3-5 is an example of how I would design a relationship. I've added an inverted GSI in order to be able to search in both ways.
Back to my access patterns:
get a list of all available locations (id & name)
query GSI1 for SK=A and PK begins with LOCATION#
get a list of all available vehicles and their current location (id, name, location-id, location-name)
query GSI1 for SK=A and PK begins with VEHICLE#
for each result item, query GSI1 for SK=VEHICLE#vehicleId and PK begins with LOCATION#
for each result item, query table for PK=LOCATION#locationId and SK=A
... this doesn't seem right
get a list of all vehicles in a given location (id, name)
query table for PK=LOCATION#locationId and SK begins with VEHICLE#
for each result item, query table for PK=VEHICLE#vehicleId and SK=A
... this doesn't seem right
Adjacency lists look like a nice and clean way to design complex relationships, but either I am doing something wrong (probably) or they come with a lot of queries that are necessary to look things up.
Any advice is appreciated.
I modelled this in DynamoDB Workbench:
Main Index (PK -> SK)
GSI1 (PK1 -> SK)
In order to:
"get a list of all available locations (id & name)"
select * from GSI1 where PK1="ALL#LOCATION"
get a list of all available vehicles and their current location (id, name, location-id, location-name)
select * from MAIN-INDEX where PK="ALL#VEHICLE"
get a list of all vehicles in a given location (id, name)
select * from GSI1 where PK1="LOC#ID"
Several things to note here:
It's important to distribute the traffic across all partition keys. I'm using "ALL#" partition keys in this design. Ideally you shard that somehow; there are several tricks, like using dates or timestamps truncated to the beginning of the day. You can randomly spread them across a fixed number of "ALL#" records and then randomly query one if your use case allows it. If you have millions of locations this is probably ok. That's how you take these decisions: think of the traffic and the behaviour of the data.
In order to use both indexes I put the "ALL#LOCATION" and the "ALL#VEHICLE" partition keys in different indexes.
Notice that vehicle 4 doesn't have a PK1. See what happens to GSI1. This is what's called a sparse index.
I denormalized the vehicle-location relationship. Assuming that the location ID and the location name are immutable, it's OK to do this; the problem is when the attributes you denormalize are mutable, so avoid that if possible.
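For reference, a rough boto3 sketch of the three access patterns above (the table name and the PK/SK/PK1 attribute names are assumptions based on this description):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("fleet")        # table name assumed

# get a list of all available locations (id & name)
locations = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("PK1").eq("ALL#LOCATION"),
)["Items"]

# get a list of all available vehicles and their (denormalized) location
vehicles = table.query(
    KeyConditionExpression=Key("PK").eq("ALL#VEHICLE"),
)["Items"]

# get a list of all vehicles in a given location
in_location = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("PK1").eq("LOC#locationId1"),
)["Items"]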

Determining a partition key in Dynamo DB for GSI

I am new to DynamoDB and I am finding it hard to think of how I should decide my partition key. I am using a condensed version of my use case:
I have an attribute which is a boolean value => B
For a given ID, I need to return all the data for it. The ID is stored in either attribute X or attribute Y: if B is true, I need to read attribute X, else Y.
While inserting into the table I know the value of B and hence I can fill in either X or Y depending on its value.
However, while fetching, I am just given an ID, and I need to figure out whether it exists in column X or column Y (I won't be getting the value of B in the input).
In an RDBMS I could run a query like select * from tab where (B == true && X == ID) || (B == false && Y == ID).
I think creating a GSI in DynamoDB will be the way to go about solving this. However, I am not able to figure out the best way to approach it. Could I get suggestions?
Not sure if I got your use case correctly, but why not just swap the target columns based on the value of B while inserting a row?
Consider the following input:
+-----+------+--------+
|  X  |  Y   |   B    |
+-----+------+--------+
| ID1 | ID2  | true   |
+-----+------+--------+
| ID3 | ID4  | true   |
+-----+------+--------+
| ID5 | ID6  | false  |
+-----+------+--------+
| ID7 | ID8  | false  |
+-----+------+--------+
What if you store the values like this:
+------------+-------------------------+
| id         | opposite id             |
| (hash key) | or whatever you call it |
+------------+-------------------------+
| ID1        | ID2                     |
+------------+-------------------------+
| ID3        | ID4                     |
+------------+-------------------------+
| ID6        | ID5                     |
+------------+-------------------------+
| ID8        | ID7                     |
+------------+-------------------------+
This way, when fetching an item by an IDxxx value, you only need to perform a query on the single id column.
UPD: Note that if your use case allows having multiple records with the same id, you would need another field to serve as a range key. This holds true whether or not you swap columns as shown above.
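A small sketch of the swap-on-insert idea (Python with boto3; the table and attribute names are made up for illustration):

import boto3

table = boto3.resource("dynamodb").Table("my-table")     # table name assumed

def put_record(x_id, y_id, b):
    # Store whichever ID will be looked up later as the hash key,
    # so a read needs only the ID and no knowledge of B.
    item = {
        "id": x_id if b else y_id,
        "opposite_id": y_id if b else x_id,
        "B": b,
    }
    table.put_item(Item=item)

# Fetching by ID later is a single-key lookup:
resp = table.get_item(Key={"id": "ID1"})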
As per the AWS DynamoDB blog post Choosing the Right DynamoDB Partition Key:
Choosing the Right DynamoDB Partition Key is an important step in the
design and building of scalable and reliable applications on top of
DynamoDB.
What is a partition key?
DynamoDB supports two types of primary keys:
Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in
many ways to fields or columns in other database systems.
Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
Why do I need a partition key?
DynamoDB stores data as groups of attributes, known as items. Items
are similar to rows or records in other database systems. DynamoDB
stores and retrieves each item based on the primary key value which
must be unique. Items are distributed across 10 GB storage units,
called partitions (physical storage internal to DynamoDB). Each table
has one or more partitions, as shown in Figure 2. For more
information, see the Understand Partition Behavior in the DynamoDB
Developer Guide.
DynamoDB uses the partition key’s value as an input to an internal
hash function. The output from the hash function determines the
partition in which the item will be stored. Each item’s location is
determined by the hash value of its partition key.
All items with the same partition key are stored together, and for
composite partition keys, are ordered by the sort key value. DynamoDB
will split partitions by sort key if the collection size grows bigger
than 10 GB.
Recommendations for partition keys
Use high-cardinality attributes. These are attributes that have
distinct values for each item, like e-mail id, employee_no,
customerid, sessionid, orderid, and so on.
Use composite attributes. Try to combine more than one attribute to
form a unique key, if that meets your access pattern. For example,
consider an orders table with customerid+productid+countrycode as the
partition key and order_date as the sort key.
Cache the popular items when there is a high volume of read traffic.
The cache acts as a low-pass filter, preventing reads of unusually
popular items from swamping partitions. For example, consider a table
that has deals information for products. Some deals are expected to be
more popular than others during major sale events like Black Friday or
Cyber Monday.
Add random numbers/digits from a predetermined range for write-heavy
use cases. If you expect a large volume of writes for a partition key,
use an additional prefix or suffix (a fixed number from a predetermined
range, say 1-10) and add it to the partition key. For example,
consider a table of invoice transactions. A single invoice can contain
thousands of transactions per client.
Read more: Choosing the Right DynamoDB Partition Key
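As a tiny illustration of the write-sharding suggestion above (a sketch; the key format and the shard count of 10 are assumptions):

import random

SHARDS = 10   # the predetermined range mentioned above

def sharded_partition_key(invoice_number: str) -> str:
    # Spread writes for one hot invoice across 10 partition-key values.
    return f"{invoice_number}#{random.randint(1, SHARDS)}"

# Reading the whole invoice back then means querying all 10 suffixed
# keys and merging the results client-side.
all_keys = [f"INV-1001#{i}" for i in range(1, SHARDS + 1)]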

How to create lines/stops relationship

I'm not a database expert and I'm simply building a prototype app, so nothing really important.
Anyway, the app is about a subway: this subway has many lines and sometimes some stops are shared between lines (so, for example, stops 3 and 4 are stops of lines 2, 7 and 9).
So, I made up a SQLite stops table:
+---------+-------------+------+
| Field   | Type        | Auto |
+---------+-------------+------+
| id      | integer     | YES  |
| name    | varchar(20) | NO   |
| lines   | ?           | NO   |
+---------+-------------+------+
What's the best way to deal with shared stops? My idea was to create a lines table and then, in the lines field of the stops table, put a comma-separated list of lines.id. I don't know why, but I feel there could be a better way.
Any suggestion is appreciated, and sorry for the really noob question.
I would keep it simple and use a table lines which has an ID (primary key) along with other metadata for a line (such as name):
lines
(id, name)
Then, create a table for the stops:
stops
(id, name)
Finally, you can create a bridge table which will connect lines with stops:
bridge
(lineId, stopId)
Each record in the bridge table represents one line having a given stop.
Note that using CSV to represent a line having multiple stops is totally not the way to go here, as it renders the powers of your relational database useless.
Update:
If you want to record the position of a stop in a given line (and assuming that positions would differ across lines), you could use the following table:
stopNumbers
(lineId, stopId, stopPosition)
The stop position can be obtained knowing the line's ID and the stop's ID.
You need a many-to-many relation, which is stored in a separate table like this:
table lines_to_stops
line_fk
stop_fk
That's the relational world ...
Note that records in the database are not in any specific order. If you need to put the stops into a specific order (which you most probably do), you have to store this order in the database as well:
table lines_to_stops
line_fk
stop_fk
order_in_line
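A minimal sketch of that schema in SQLite via Python's sqlite3 (the composite primary key and the sample IDs are assumptions):

import sqlite3

conn = sqlite3.connect("subway.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, name VARCHAR(20));
CREATE TABLE IF NOT EXISTS stops (id INTEGER PRIMARY KEY, name VARCHAR(20));
CREATE TABLE IF NOT EXISTS lines_to_stops (
    line_fk       INTEGER REFERENCES lines(id),
    stop_fk       INTEGER REFERENCES stops(id),
    order_in_line INTEGER,
    PRIMARY KEY (line_fk, stop_fk)
);
""")

# all stops of line 2, in their stored order
rows = conn.execute("""
    SELECT s.id, s.name
    FROM stops s
    JOIN lines_to_stops ls ON ls.stop_fk = s.id
    WHERE ls.line_fk = ?
    ORDER BY ls.order_in_line
""", (2,)).fetchall()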

Analyze a scenario performance?

I want to design something like a dynamic form in which an admin defines each form's fields.
I designed 3 tables: a mainform table for shared properties, then a formfield table which has mainformID as a foreign key and defines each form's fields,
e.g:
AutoID | FormID | FieldName
_____________________________
100    | Form1  | weight
101    | Form1  | height
102    | Form1  | color
103    | Form2  | Size
104    | Form2  | Type
....
and lastly a formvalues table like below:
FormFieldID | Value | UniqueResponseID
___________________________________________
100         | 50px  | 200
101         | 60px  | 200
102         | Red   | 200
100         | 30px  | 201
101         | 20px  | 201
102         | Black | 201
103         | 20x10 | 201
104         | Y     | 201
....
For each form I have to join these 3 tables to fetch all fields and values. I wonder if this is the only way to design such a scenario? Does it decrease SQL performance? Or is there any faster and better way?
This is a form of EAV, and I'm gonna assume you absolutely have to do it instead of the "static" design.
does it decrease sql performance?
Yes, getting a bunch of rows (under EAV) is always going to be slower than getting just one (under the static design).
or is there any fast and better way?
Not from the logical standpoint, but there are significant optimizations (for query performance at least) that can be done at the physical level. Specifically, you can carefully design your keys to minimize the I/O (by putting related data close together) and even eliminate the JOIN itself.
For example:
This model migrates keys through FOREIGN KEY hierarchy all the way down to the ATTRIBUTE_VALUE table. The resulting natural composite key in ATTRIBUTE_VALUE table enables us to:
Get all attributes [1] of a given form by a single index range scan + table heap access on the ATTRIBUTE_VALUE table, and without doing any JOINs at all. In addition to that, you can cluster [2] it, eliminating the table heap access and leaving you with only the index range scan [3].
If you need to only get the data for a specific response, change the order of the fields in the composite key, so the RESPONSE_ID is at the leading edge.
If you need both "by form" and "by response" queries, you'll need both indexes, at which point, I'd recommend the secondary index to also cover [4] the VALUE field.
For example:
-- Since we haven't used NONCLUSTERED clause, this is a B-tree
-- that covers all fields. Table heap doesn't exist.
CREATE TABLE ATTRIBUTE_VALUE (
    FORM_ID INT,
    ATTRIBUTE_NAME VARCHAR(50),
    RESPONSE_ID INT,
    VALUE VARCHAR(50),
    PRIMARY KEY (FORM_ID, ATTRIBUTE_NAME, RESPONSE_ID)
    -- FOREIGN KEYs omitted for brevity.
);

-- We have included VALUE, so this B-tree covers all fields as well.
CREATE UNIQUE INDEX ATTRIBUTE_VALUE_IE1 ON
    ATTRIBUTE_VALUE (RESPONSE_ID, FORM_ID, ATTRIBUTE_NAME)
    INCLUDE (VALUE);
[1] Or a specific attribute, or a specific response for a specific attribute.
[2] MS SQL Server clusters all tables by default, unless you specify the NONCLUSTERED clause.
[3] Friendliness to clustering and elimination of JOINs are some of the main strengths of natural keys (as opposed to surrogate keys). But they also make tables "fatter" and don't isolate from ON UPDATE CASCADE. I believe the pros outweigh the cons in this particular case. For more info on natural vs. surrogate keys, look here.
[4] Fortunately, MS SQL Server supports including fields in an index solely for covering purposes (as opposed to actually searching through the index). This makes the index leaner than a "normal" index on the same fields.
I like Branko's approach, and it is quite similar to metadata models I have created in the past, so this post is by way of extension to his. You may want to add a datatype table, which can work both for native types (int, varchar, bit, datetime, etc.) and your own definitions (although I don't see the necessity off the cuff).
Then, Branko's "value" column becomes:
value_tinyint tinyint
value_int int
value_varchar varchar(xx)
etc.
with a datatype_id (probably tinyint) as a foreign key into the "mydatatype" table.
[excuse the lack of pretty ER diagrams like BD's]
mydatatype
datatype_id tinyint
code varchar(16)
description varchar(64) -- for reference purposes
This extension should:
a. save you a good deal of casting when reading or writing your data
b. allow both reads and writes with some easily constructed dynamic SQL
Furthermore (and maybe this is out of scope), you may want to store the order in which these objects are created/saved, as well as conditional display based on button push/checkbox/radio button selection, etc.
I won't go into detail here, since I'm not sure you need these things, but if you do I'll check this every so often and respond with stuff.
