Hash table open addressing: could searching for a removed or nonexistent key result in an infinite probing loop?

Hash table open addressing, removal: what happens if we search for a key that was removed and replaced by a tombstone (without our knowing it)? Since the key no longer exists, will the search become an infinite loop, probing forever without finding the key?
See the screenshot from William Fiset's data structures course. What if k3 was already removed, or never existed in the hash table in the first place? It looks like we keep probing until we find k3.

When looking for a key, you keep probing until you find the key or reach an empty slot (note that a "removed"/tombstone slot is not empty). If you reach an empty slot, you know the key is not present. As long as there is at least one empty slot, you'll eventually get to it. This potentially long probing is why open-addressing hash tables generally limit their load factor: the higher the occupancy (the fewer empty slots), the longer you have to probe before hitting an empty slot when searching for a key that isn't present.
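The rule above can be sketched in a few lines. This is a minimal linear-probing set with tombstones, written for illustration only (no real library's implementation); the point is that a lookup skips tombstones but stops at the first truly empty slot, and the probe sequence visits each slot at most once, so it can never loop forever:

```python
EMPTY, TOMBSTONE = object(), object()  # sentinels: never-used vs. removed

class OpenAddressingSet:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Yield slot indices in linear-probe order, starting at the hash position.
        # At most len(self.slots) indices, so every loop below is bounded.
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def add(self, key):
        target = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is TOMBSTONE:
                if target is None:
                    target = i            # remember the first reusable tombstone
            elif slot is EMPTY:
                self.slots[target if target is not None else i] = key
                return
            elif slot == key:
                return                    # already present
        if target is not None:
            self.slots[target] = key      # table held only keys and tombstones

    def remove(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return                    # not present
            if self.slots[i] is not TOMBSTONE and self.slots[i] == key:
                self.slots[i] = TOMBSTONE # keep the probe chain intact
                return

    def __contains__(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False              # empty slot terminates the search
            if self.slots[i] is not TOMBSTONE and self.slots[i] == key:
                return True
        return False                      # probed every slot: key absent
```

Searching for the removed k3 from the screenshot walks past its tombstone, reaches an empty slot, and returns "not found" rather than looping.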

Related

Dynamodb: Index on List attribute and query NOT_CONTAINS

I am trying to figure out (at this point I think the answer is no) whether it is possible to build an index on a List attribute and query NOT_CONTAINS on that attribute.
Example table:
Tasks
Task_id: string
solved_by: List<String> # stores list of user_ids who previously solved this task.
My query would be:
Get me all the tasks not yet solved by current_user
select * from tasks where tasks.solved_by NOT_CONTAINS current_user_id
Is it possible to do this without full scans? I tried creating an attribute of type L, but the AWS CLI errors out saying Member must satisfy enum value set: [B, N, S].
If this is not possible with dynamodb, please suggest what datastore I can use.
Any help is highly appreciated. Thanks!
As you found out, and as the error you got suggests, this is NOT possible.
However, I'd ask whether your design could be improved. Storing a potentially unbounded list of entries (users, in your case) inside a single item, which is limited to 400 KB, seems dangerous.
If, instead, you stored the fact that a particular user solved a task as a separate item (partition key: task_id, sort key: user_id), then you could easily look up whether a user has solved a task or not. You could also store additional information about the particular solution or attempts.
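To make the one-item-per-(task, user) layout concrete, here is a sketch with the data held in a plain Python list standing in for the table; the attribute names (PK, SK, solvedAt) are made up for the example, and in DynamoDB the lookup would be a single GetItem on the composite key:

```python
# One item per (task, user-who-solved-it) pair.
items = [
    {"PK": "TASK#1", "SK": "USER#alice", "solvedAt": "2023-01-01"},
    {"PK": "TASK#1", "SK": "USER#bob",   "solvedAt": "2023-01-02"},
    {"PK": "TASK#2", "SK": "USER#alice", "solvedAt": "2023-01-03"},
]

def has_solved(task_id, user_id):
    # Simulates GetItem(PK="TASK#{task_id}", SK="USER#{user_id}").
    return any(i["PK"] == f"TASK#{task_id}" and i["SK"] == f"USER#{user_id}"
               for i in items)
```

A Query on PK alone would also list everyone who solved a given task, which the single-list design cannot do efficiently once the list grows.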
If you haven't heard of DynamoDB single table design yet, or how to overload indexes, I can recommend looking at
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
https://www.dynamodbbook.com/
Update
I just realised that you care about a negation (NOT_CONTAINS), and for those you can't use an index anyway. For the sort key you can only use positive comparisons (=, <, >, <=, >=, between, begins_with): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.KeyConditionExpressions
So you might have to rethink the whole approach: either pre-process the data stored in DDB so it's easier to fetch, or pick a different database.
In your original question, you defined your access pattern as
Get me all the tasks not yet solved by current_user
In a later comment, you clarified that the access pattern is
A solver should be shown a task that is not yet solved by them.
which is a slightly different access pattern.
Here's one way you could fetch a task not yet solved by a user.
In this data model, I chose to model Users and Tasks as separate items. Tasks have numerically increasing IDs. Each User item starts with the lastSolved attribute set to 1. Each time you fetch a new Task for a user, you fetch TASK#{lastSolved + 1} and increment the lastSolved attribute by 1.
You could probably take a similar approach using timestamps instead of numbers... anything sortable, really.
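The counter pattern can be sketched without DynamoDB at all. Dictionaries stand in for the two item types; the key format (TASK#{n}) and the lastSolved attribute follow the description above, with the counter starting at 0 here (an assumption for the example) so the first fetch returns TASK#1. In DynamoDB the read would be a GetItem and the increment a conditional UpdateItem:

```python
# In-memory stand-ins for the Task and User items.
tasks = {f"TASK#{n}": {"title": f"task {n}"} for n in range(1, 4)}
user = {"PK": "USER#alice", "lastSolved": 0}  # assumption: counter starts at 0

def next_unsolved_task(user):
    # Fetch TASK#{lastSolved + 1}; if it exists, advance the counter.
    task_id = f"TASK#{user['lastSolved'] + 1}"
    if task_id in tasks:
        user["lastSolved"] += 1
        return task_id
    return None  # the user has been shown every task
```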

Use RocksDB to support key-key-value (RowKey->Containers) by splitting the container

Suppose I have a key/value store where the value is a logical list of strings to which I can append. To avoid the situation where inserting a single string item causes a rewrite of the entire list, I'd use multiple key-value pairs to represent it:
Key -> metadata of the value such as length and subkey format
Key-l1 -> value of item 1 in list
Key-l2 -> value of item 2 in list
Key-ln -> the latest value in the list
I'd override the key comparator in RocksDB so that keys of the form Key-ln sort by the Key part first and by ln second (i.e. group by Key, and within the same Key sort by ln). This way, all the list items, along with their root key and metadata, are grouped together in the SSTs during the initial bulk insert and during later SST compactions.
Appending a new list item becomes: (1) read Key's metadata to get the current list size n; (2) insert Key-l(n+1) with the new value. Deleting a list item works as usual in RocksDB: delete Key-ln and update the metadata.
To ensure consistency, (1) and (2) are done inside a RocksDB transaction.
Does this design seem OK?
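As an aside, the grouping the custom comparator is meant to achieve can also be had purely through key encoding, so the default bytewise comparator already sorts correctly. The sketch below uses plain Python byte strings to stand in for RocksDB keys; the '#' separator and the zero-padded index width are assumptions of the example (and the root key must not itself contain '#'):

```python
def encode_key(root, index=None):
    # The metadata key is just the root. Item keys append '#' plus a
    # zero-padded index, so plain byte order equals (root, index) order:
    # '#' (0x23) sorts below alphanumerics, keeping each group contiguous.
    if index is None:
        return root.encode()
    return f"{root}#{index:010d}".encode()

keys = [encode_key("Key", 2), encode_key("Key"),
        encode_key("Key", 10), encode_key("Key", 1)]
keys.sort()  # bytewise sort, like RocksDB's default comparator
# -> metadata first, then items 1, 2, 10 in numeric order
```

Avoiding a custom comparator keeps tooling (ldb, sst_dump) and prefix iterators working without extra configuration.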
Now, if I want to add another feature, a TTL for the entire key-value (list), I'd use the TTL support already in RocksDB. My understanding is that removal of expired items happens during compaction. However, such compaction is not done under a transaction, and RocksDB doesn't know that the Key-metadata and Key-ln entries are related. It is entirely possible that there is a time window where Key->metadata (the root node) has been deleted while the child nodes (Key-ln) have not been deleted yet (or the reverse). If someone reads or updates the list during this window, they will get an inconsistent view of the Key list. Is there any remedy for this?
Thanks
You should use a Merge Operator; it's designed for exactly this value-append use case. Your design is read-before-write, which carries a performance penalty and in general should be avoided if possible: What's read-before-write in NoSQL?.
Options options;
options.merge_operator.reset(new StringAppendOperator(','));
DB::Open(options, kDBPath, &db);
...
db->Merge(WriteOptions(), "key", "value1");
db->Merge(WriteOptions(), "key", "value2");
db->Get(ReadOptions(), "key", &result); // result == "value1,value2"
The above example uses the predefined StringAppendOperator, which simply appends new values at the end. You can define your own MergeOperator to customize the merge operation.
Internally, the merge is applied on the read path (and during compaction, to reduce the number of stored operands); for details see Merge Operator Implementation.

What is the model for checking if a GSI key exists or not in DDB?

I have a pretty straightforward question:
I want to know if some GSI hash key exists or not.
The best I can find right now is
DynamoDBQueryExpression<T> queryExpression;
// Logic for constructing the query
queryExpression.withIndexName(SomeIndexName);
QueryResultPage<T> queryResponse = mapper.queryPage(T.class, queryExpression, someMapperConfig);
Here the query result page contains a list of results; I can check whether that list has anything in it and conclude whether the key exists or not.
The obvious problem is the efficiency drop when matching items are present. Is there a way to avoid moving the items' contents across network I/O just for verification (i.e. a fully server-side evaluation of the predicate "does some GSI key exist")?
When writing an item with, for example, PutItem, you can add a condition specifying that the key must not already exist. That way DynamoDB checks whether the provided key is already taken and returns an error when you try to put something in. Just catch the error, and then you know the key was already taken.
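The put-if-absent semantics look like this. The sketch below simulates the behaviour with an in-memory dict so it is self-contained; in real DynamoDB the same check-and-write happens atomically on the server when you pass ConditionExpression="attribute_not_exists(pk)" to PutItem, and the failure surfaces as a ConditionalCheckFailedException:

```python
class ConditionalCheckFailed(Exception):
    """Stands in for DynamoDB's ConditionalCheckFailedException."""

table = {}

def put_if_absent(key, item):
    # Mirrors PutItem with ConditionExpression="attribute_not_exists(pk)":
    # check and write are one atomic operation.
    if key in table:
        raise ConditionalCheckFailed(key)
    table[key] = item

put_if_absent("user#1", {"name": "alice"})
try:
    put_if_absent("user#1", {"name": "mallory"})  # second write is rejected
except ConditionalCheckFailed:
    pass  # the key already existed; the original item is untouched
```

Note that conditional writes apply to the base table's key; for a GSI hash key, the condition would have to be expressed on whatever base-table attribute feeds the index.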

CustTableListPage filtering is too slow

When I try to filter on the CustAccount field on CustTableListPage, it takes too long. On the other fields there is no latency. I'm trying to filter on just part of the account number, like "*123".
I have reindexed CustTable and also updated statistics, but there was no appreciable difference at all.
When I added the list page's query to a view, it filtered the CustAccount field normally, like the other fields.
Any suggestion?
Edit:
Our version is AX 2012 R2 CU8. It's not a user-specific problem; it occurs for every user. The interaction class has some customizations, but just for setting some buttons' enabled/disabled properties, etc. I tried to look at the query execution; what I found is not clear, something like FETCH_API_CURSOR_000000..x
Record a trace of this execution and locate the bottleneck.
Keep in mind that wildcards (such as *) have to be used with care. Using a filter string that starts with a wildcard kills performance, because the SQL indexes cannot be used.
Using a wildcard at the end
Imagine that you have a dictionary and have to list all the words starting with 'Foo'. You can skip all entries before 'F', then all those before 'Fo', then all those before 'Foo', and start your result list from there.
Similarly, asking the underlying SQL engine to list all CustAccount entries starting with '123' (= filter string '123*') allows using an index on CustAccount to quickly skip to the relevant data.
Using a wildcard at the start
Imagine that you still have that dictionary and have to list all the words ending with 'ing'. You would have no choice other than going through the entire dictionary and checking the ending of every word (due to the alphabetical sorting).
This explains why asking the SQL engine to list all CustAccount entries ending with '123' (= filter string '*123') means that all CustAccount values must be investigated. So the AOS loops through all the entries and uses an SQL cursor to do this. That is the FETCH_API_CURSOR statement you see on the SQL level.
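The dictionary analogy maps directly onto any sorted column. A small sketch (plain Python with binary search standing in for an index seek; the account numbers are invented) shows why a trailing wildcard can seek while a leading wildcard must scan:

```python
import bisect

accounts = sorted(["1001", "1230", "1234", "4123", "5123", "9000"])

def prefix_matches(prefix):
    # '123*': binary-search to the first candidate, then read forward until
    # the prefix stops matching. This is what an index seek does.
    lo = bisect.bisect_left(accounts, prefix)
    out = []
    for a in accounts[lo:]:
        if not a.startswith(prefix):
            break
        out.append(a)
    return out

def suffix_matches(suffix):
    # '*123': sorted order doesn't help; every row must be checked, which is
    # the full scan behind the FETCH_API_CURSOR you saw on the SQL level.
    return [a for a in accounts if a.endswith(suffix)]
```

The prefix search touches only the matching run of rows; the suffix search always touches all of them, no matter how good the index is.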
Possible solutions
Educate your end user that using a wildcard at the beginning of a filter string will always be slow on a large table.
Step up the SQL server hardware / allocated resources (faster CPU, more RAM, faster disk, ...).
Create a full text index on CustAccount (not a fan of this one and performance impact should be thoroughly investigated).
I've solved the problem. The CustTableListPage query had a sort on the DirPartyTable.Name field. When I removed this sorting, filtering with a wildcard worked like a charm.

Is there anything like pointers in Lua?

I'm new to Lua and I want to create a table [doh] which would store values like:
parent.child[1].value = "whaterver"
parent.child[2].value = "blah"
however, most often there's only one child, so it would be easier to access the value like this:
parent.child.value
To make things simpler, I would like to store my values in such a way that
parent.child[1].value == parent.child.value
But to do this I would have to store the value twice in memory.
Is there any way I could do it, so that:
parent.child.value points to parent.child[1].value
without storing the value twice in memory?
Additional question is, how to check how much memory does a table take?
But the value will be stored as a string, so it's a string that needs to be referenced in both places, not a table.
First, all types (except booleans, numbers and light userdata) are references: if t is a table and you do t2 = t, then both t and t2 are references to the same table in memory.
Second, strings are interned in Lua. That means all equal strings, like "abc" and the result of "ab".."c", are actually a single string, and Lua only stores references to strings. So you should not worry about memory: there is only a single instance of any given string at a time.
You can safely do parent.child.value = parent.child[1].value, you will only use a memory for one slot in a table (a few bytes), no string will be copied, only referenced.
Lua tables (often used as objects) are not copied, but referenced.
(internally, a pointer is used to them)
This is a nice application for using metatables:
parent = {
  child = {
    {value = "whatever"},
    {value = "blah"}
  }
}
setmetatable(parent.child, {__index = parent.child[1]})
If an index is not found in the child table (like 'value'), it gets looked up in the table that's the value of __index of the metatable (the first element of child in this case).
Now there is a problem with the above code, which we can see as follows:
print(parent.child.value) -- prints whatever
parent.child[1]=nil --remove first child
print(parent.child.value) -- still prints whatever!
This is because the metatable keeps a reference to the first child table, preventing it from being collected. The workaround for this kind of thing is (A) making the metatable a weak table, or (B) making the __index field a function instead of pointing it at a table.
-- A)
setmetatable(parent.child, setmetatable(
  {__index = parent.child[1]}, -- metatable for the child table
  {__mode = 'v'} -- metatable for the metatable, making its values weak
))
parent.child[1] = nil
collectgarbage() -- the weakly referenced table is reclaimed here
print(parent.child.value) -- prints nil
parent.child[1] = {value = 'foo'}
print(parent.child.value) -- still prints nil; __index pointed at the collected table.
-- hence solution B)
setmetatable(parent.child, {__index = function(t, k)
  local first = rawget(t, 1) -- rawget avoids re-triggering __index
  if first then return first[k] end
end})
print(parent.child.value) -- 'whatever'
parent.child[1] = nil
print(parent.child.value) -- nil
parent.child[1] = {value = 'foobar'}
print(parent.child.value) -- foobar; it always refers to the current table at child[1], even when it changes.
If you're really interested to read up on metatables, try reading Programming in Lua, chapter 13 and chapter 17 (weak tables). Lua-Users wiki on MetaMethods might also be interesting.
With C arrays, parent.child and parent.child[0] are equivalent because of pointer arithmetic. You really shouldn't try to emulate one of the most error-prone, confusing and redundant features of C just because you like the style.
