What is the complexity of insertion in min heap? - math

Once I have a min heap backed by a fixed-size array, isn't the complexity of insert O(n)? To insert, I need to increase the size of the array, but the size is fixed, so I have to create a new array, copy the old one into it, and only then add the new value.
So what is the complexity of insert under these conditions?
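The scenario described in the question can be sketched as follows. This is only an illustration: the class name `ArrayMinHeap`, the `copied` counter, and the choice to double the capacity on reallocation are assumptions, since the question does not state a growth policy. It shows that the insert which triggers a reallocation pays O(n) for the copy, while every other insert only pays the O(log n) sift-up.

```python
class ArrayMinHeap:
    """Min-heap stored in a fixed-capacity array that is reallocated when full."""

    def __init__(self, capacity=4):
        self.data = [None] * capacity  # stands in for a fixed-size array
        self.size = 0
        self.copied = 0  # total elements copied during reallocations

    def insert(self, value):
        if self.size == len(self.data):
            # Array is full: allocate a larger one and copy everything, O(n).
            # Doubling is an assumed growth policy, not taken from the question.
            bigger = [None] * (2 * len(self.data))
            bigger[:self.size] = self.data
            self.copied += self.size
            self.data = bigger
        # Place the value in the next free slot and sift it up, O(log n).
        i = self.size
        self.data[i] = value
        self.size += 1
        while i > 0:
            parent = (i - 1) // 2
            if self.data[parent] <= self.data[i]:
                break
            self.data[parent], self.data[i] = self.data[i], self.data[parent]
            i = parent

    def peek(self):
        """Return the smallest element without removing it."""
        return self.data[0]


heap = ArrayMinHeap(capacity=4)
for v in [9, 4, 7, 1, 8, 2, 6, 3, 5]:
    heap.insert(v)
```

With nine inserts and a starting capacity of 4, only two inserts (the fifth and the ninth) trigger a copy; the other seven touch at most one root-to-leaf path.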


What will happen to my sqlite database ten years from now in terms of capacity and query speed

I have created a database with a single table (see the code below). I plan to insert 10 rows per minute, which comes to about 52 million rows ten years from now.
My question is: what can I expect in terms of database capacity, and how long will it take to execute a select query? Of course, I know you cannot give me absolute values, but I would be very glad for any tips on growth and speed rates, traps, etc.
I should mention that there will be 10 different observations (this is why I will insert ten rows per minute).
create table if not exists my_table (
    date_observation default current_timestamp,
    observation_name text,
    value_1 real(20),
    value_1_name text,
    value_2 real(20),
    value_2_name text,
    value_3 real(20),
    value_3_name text);
The maximum database capacity exceeds the capacity of any known storage device, as per Limits In SQLite.
The most pertinent paragraphs are:
Maximum Number Of Rows In A Table
The theoretical maximum number of rows in a table is 2^64
(18446744073709551616 or about 1.8e+19). This limit is unreachable
since the maximum database size of 140 terabytes will be reached
first. A 140 terabytes database can hold no more than approximately
1e+13 rows, and then only if there are no indices and if each row
contains very little data.
Maximum Database Size
Every database consists of one or more "pages". Within a single
database, every page is the same size, but different database can have
page sizes that are powers of two between 512 and 65536, inclusive.
The maximum size of a database file is 2147483646 pages. At the
maximum page size of 65536 bytes, this translates into a maximum
database size of approximately 1.4e+14 bytes (140 terabytes, or 128
tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not
have access to hardware capable of reaching this limit. However, tests
do verify that SQLite behaves correctly and sanely when a database
reaches the maximum file size of the underlying filesystem (which is
usually much less than the maximum theoretical database size) and when
a database is unable to grow due to disk space exhaustion.
Speed depends on many factors, so it is not a simple "how fast will it go" question, as it would be for a car. The file system, the memory, and optimisation are all factors that need to be taken into consideration. As such, the answer is the same as the answer to "how long is a piece of string?".
Note that 18446744073709551616 applies only if you utilise negative numbers; otherwise the more frequently mentioned number of 9223372036854775807 is the limit (i.e. the maximum of a 64-bit signed integer).
To utilise negative rowid numbers, and therefore the higher range, you have to explicitly insert at least one negative value into a rowid or an alias thereof, as per: "If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero."
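The figures in the question and in the quoted limits can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are made up for illustration, and the year is taken as 365 days, which is an assumption (leap days are ignored).

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: leap days ignored

# Row count from the question: 10 rows per minute for 10 years.
rows_per_minute = 10
years = 10
total_rows = rows_per_minute * MINUTES_PER_YEAR * years

# Maximum database size from the quoted limits: pages * bytes per page.
max_pages = 2147483646
max_page_size = 65536
max_db_bytes = max_pages * max_page_size

# Rowid ranges: positive signed 64-bit values versus all 64-bit values.
signed_rowid_max = 2**63 - 1
rowid_values = 2**64

print(total_rows)        # 52560000
print(max_db_bytes)      # 140737488224256
print(signed_rowid_max)  # 9223372036854775807
print(rowid_values)      # 18446744073709551616
```

About 52.6 million rows is many orders of magnitude below both the quoted 1e+13 row estimate and the rowid range, so in this scenario the quoted limits are not the constraint.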

How is consumed throughput influenced by a write to a local secondary index with no change in data?

Consider a table A with index A-index. I write around 100 items into A in batches (using PutRequest within BatchWriteItem).
If I repeat the operation with the same set of items, they will be just replacing the existing items. But how does that impact the local secondary index? Since it's a complete replace, does it replace in index also, thereby consuming throughput there too? Or does it figure out the items are exactly same and hence doesn't perform any operation, thereby resulting in no additional consumed throughput for index?
Found the answer by running a trial program and noticing the results in ConsumedCapacity attribute for table and indices.
During replace, if there are no changes, the consumed throughput is not calculated as DynamoDB figures out it's exactly the same. But if there are changes, throughput per item is calculated.

hash table lookup time

When we insert or look up a key in a hash table, the textbook says it takes O(1) time. Yet how is it possible to have an O(1) lookup time? If the hash table stores the keys in a vector, it will cost O(N); if in a binary tree, it will be O(logN). I just can't imagine a data structure with O(1) access time.
The hash table hashes your key and uses the result as an index into an array.
For example, hash(x) = 3, where x is your key. The table then puts it into array[3]. Accessing an array element by index is O(1).
At a minimum, hash tables consist of an array and hash function. When an object is added to the table, the hash function is computed on the object and it is stored in the array at the index of the corresponding value that was computed. e.g., if hash(obj) = 2 then arr[2] = obj.
The average insert/lookup on a hash table is O(1).
However it is possible to have collisions when objects compute the same hash value.
In the general case there are "buckets" at each index of the array to handle these collisions. This means that all objects that collide are stored in some other data structure (maybe a linked list or another array) at that index of the hash table.
Therefore, the worst case for lookup on a hash table is O(n) because it is possible that all objects stored in the hash table have collided and are stored in the same bucket.
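The array-plus-buckets layout described above can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it; the class name `ChainedHashTable` and the `hash_fn` parameter are assumptions introduced so that the worst case can be forced with a deliberately bad hash function.

```python
class ChainedHashTable:
    """Hash table with separate chaining: an array of buckets (Python lists)."""

    def __init__(self, num_buckets=8, hash_fn=hash):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def _bucket(self, key):
        # Hash the key and reduce it to an array index: an O(1) array access.
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key (possibly a collision)

    def get(self, key):
        # Only the one bucket is scanned, not the whole table.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)


table = ChainedHashTable()
for word in ["apple", "pear", "plum"]:
    table.put(word, len(word))

# Worst case: a hash function that sends every key to the same bucket,
# so a lookup degrades to a linear scan of that bucket.
worst = ChainedHashTable(hash_fn=lambda key: 0)
for word in ["apple", "pear", "plum"]:
    worst.put(word, len(word))
```

In `table` the keys are typically spread over several short buckets, whereas in `worst` all three entries sit in bucket 0, which is the O(n) case.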
Technically speaking, hash table lookup, if there is no collision, is O(logn). This is because hashing time is linear with respect to the size (in bytes) of the identifier, and the smallest that a new identifier added to the hash table can be, for that identifier to be unique, is O(logn).
However, the log of all the computer memory in the world is just such a small number, which means that we have very good upper bounds on hash table identifier size. Case in point, log10 of the number of particles in the observable universe is estimated at slightly over 80; in log2 it's about 3.3 times as much. Logarithms grow very slowly.
As a result, most log terms can be treated as constant terms. It's just that traditionally we only apply this fact to hash tables, but not to search trees, in order to teach recurrence relations to students.
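The numbers in this argument are easy to check. The sketch below takes 10^80 as the particle count, which is the rough estimate the answer alludes to rather than a precise figure; the variable names are illustrative.

```python
import math

particles = 10**80                   # rough estimate used in the answer

log10_particles = math.log10(particles)
log2_particles = math.log2(particles)

# Ratio between the two logarithms, i.e. log2(10).
ratio = log2_particles / log10_particles

print(log10_particles)  # about 80
print(log2_particles)   # about 265.75
print(ratio)            # about 3.32
```

So even an identifier able to distinguish every particle in the observable universe would need fewer than 300 bits, which is why the log term is treated as a constant in practice.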

What exactly is table size in SAS HashTable specified by hashexp?

I would like a little clarification on the definition of a bucket in a SAS hash table. The question is specifically about the hashexp parameter.
According to the SAS DOCs, hashexp is:
The hash object's internal table size, where the size of the hash table is 2^n.
The value of HASHEXP is used as a power-of-two exponent to create the hash table size. For example, a value of 4 for HASHEXP equates to a hash table size of 2^4, or 16. The maximum value for HASHEXP is 20.
The hash table size is not equal to the number of items that can be stored. Imagine the hash table as an array of 'buckets.' A hash table size of 16 would have 16 'buckets.' Each bucket can hold an infinite number of items. The efficiency of the hash table lies in the ability of the hashing function to map items to and retrieve items from the buckets.
You should set the hash table size relative to the amount of data in the hash object in order to maximize the efficiency of the hash object lookup routines. Try different HASHEXP values until you get the best result. For example, if the hash object contains one million items, a hash table size of 16 (HASHEXP = 4) would work, but not very efficiently. A hash table size of 512 or 1024 (HASHEXP = 9 or 10) would result in the best performance.
The question is: what exactly is the hash table size, if it is not the amount of data in the hash object?
Should it be understood as allocating as much memory as may be necessary, no less and no more? It is a power of two to make things work fast. But it does not limit the amount of data that can be stored; it only indicates roughly how much is going to be used, right?
Paul Dorfman (the master of hashing) goes into a fair bit of detail on page 10 of this whitepaper:
As I understand it, hash tables store their data in binary trees. The number of buckets created by hashexp is the number of binary trees that will be used to store the data. A hashexp of 0 would use a single tree, while a hashexp of 8 would use 256 trees. When a lookup is performed against the hash object, an internal algorithm determines which tree the key should be in (based on the hashed value) and then checks that tree for the value. By knowing immediately which of the 256 trees to look in, for example, it saves itself 8 comparisons (256 = 2^8) compared with searching a single binary tree.
The whole thing seems a lot more complex than that but that's my interpretation of why it works out faster.
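The "8 comparisons saved" estimate can be reproduced with a rough model. The sketch below assumes that items are spread evenly across buckets and that each bucket is a balanced binary tree whose lookup cost is about log2 of its size; both are simplifying assumptions, and `bucket_stats` is a made-up helper, not a SAS function.

```python
import math


def bucket_stats(num_items, hashexp):
    """Return (buckets, items per bucket, approximate tree depth).

    Assumes a uniform spread of items and balanced trees (illustrative model).
    """
    num_buckets = 2 ** hashexp
    items_per_bucket = num_items / num_buckets
    depth = math.log2(items_per_bucket)
    return num_buckets, items_per_bucket, depth


n = 1_000_000
for hashexp in (0, 4, 8, 10):
    buckets, per_bucket, depth = bucket_stats(n, hashexp)
    print(hashexp, buckets, round(per_bucket), round(depth, 2))

_, _, depth_single = bucket_stats(n, 0)   # one tree holding everything
_, _, depth_256 = bucket_stats(n, 8)      # 256 trees
saved = depth_single - depth_256          # about 8 comparisons
```

Under this model each increment of hashexp halves the expected number of items per bucket and removes roughly one comparison from a lookup.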
As Rob Penridge pointed out, Paul Dorfman is indeed the SAS Hash Object Guru. Hashexp does not determine how much data the hash table holds or how much memory it uses, again as mentioned in Rob's answer.
If you have a table with 100 observations and 10 numeric variables loaded into a hash table, then the size of the hash table is simply 100 obs * 10 vars * 8 bytes (assuming all numeric variables are stored as 8-byte fields) = 8,000 bytes, or about 7.8 KB, give or take 10%.
Remember that SAS dynamically allocates RAM as records are added to the hash table in memory, so you do not need to specify in advance what size it should be. [I've been using hash tables regularly, but can't think of any place where one can specify the size in advance.]
General tip: if you want to know how big your hash table is going to be, run a PROC CONTENTS on the dataset you want to load into Hash table and multiply "Observation Length" & "No. of obs in dataset", this will give the memory size needed in bytes. If you have that much memory then you can load it into memory.
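The rule of thumb above is a single multiplication. As a minimal sketch, using the 100-observation example from this answer (the function name `hash_memory_bytes` is hypothetical, and the result ignores any per-item overhead the hash object adds):

```python
def hash_memory_bytes(num_obs, observation_length):
    """Estimate memory as observation length (bytes) times number of observations."""
    return num_obs * observation_length


# 10 numeric variables stored as 8-byte fields gives a 80-byte observation.
observation_length = 10 * 8
estimate = hash_memory_bytes(100, observation_length)
estimate_kb = estimate / 1024

print(estimate)               # 8000
print(round(estimate_kb, 1))  # 7.8
```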

Time complexity of Hash table

I am confused about the time complexity of hash tables. Many articles state that they are "amortized O(1)" rather than true O(1). What does this mean in real applications? What is the average time complexity of the operations in a hash table in an actual implementation, not in theory, and why are the operations not true O(1)?
It's impossible to know in advance how many collisions you will get with your hash function, as well as things like needing to resize. This can add an element of unpredictability to the performance of a hash table, making it not true O(1). However, virtually all hash table implementations offer O(1) on the vast, vast, vast majority of inserts. This is the same as array inserting - it's O(1) unless you need to resize, in which case it's O(n), plus the collision uncertainty.
In reality, hash collisions are very rare and the only condition in which you'd need to worry about these details is when your specific code has a very tight time window in which it must run. For virtually every use case, hash tables are O(1). More impressive than O(1) insertion is O(1) lookup.
For some uses of hash tables, it's impossible to create them with the "right" size in advance, because it is not known how many elements will need to be held simultaneously during the lifetime of the table. If you want to keep access fast, you need to resize the table from time to time as the number of elements grows. This resizing takes linear time with respect to the number of elements already in the table, and is usually done on an insertion, when the number of elements passes a threshold.
These resizing operations can be made infrequent enough that the amortized cost of insertion is still constant (by following a geometric progression for the size of the table, for instance doubling the size each time it is resized). But from time to time a single insertion takes O(n) time because it triggers a resize.
In practice, this is not a problem unless you are building hard real-time applications.
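The amortized argument can be made concrete by counting how many elements get moved across all resizes. The sketch below only simulates the bookkeeping, not a real hash table; `simulate_inserts` is a made-up helper, and resizing exactly when the table is full (rather than at some lower load factor) is an assumption chosen for simplicity.

```python
def simulate_inserts(n, initial_capacity=8):
    """Count elements moved by resizes when capacity doubles each time."""
    capacity = initial_capacity
    size = 0
    moved = 0          # total elements re-inserted during all resizes
    worst_single = 0   # most elements moved by any single insert
    for _ in range(n):
        if size == capacity:
            # This insert triggers a resize: every existing element moves, O(n).
            moved += size
            worst_single = max(worst_single, size)
            capacity *= 2
        size += 1
    return moved, worst_single


n = 1_000_000
moved, worst_single = simulate_inserts(n)
average_moves = moved / n

print(moved)          # 1048568
print(worst_single)   # 524288
print(average_moves)  # about 1.05 moves per insert on average
```

One insert had to move more than half a million elements, yet the total extra work averaged over all inserts stays below two moves per insert, which is what "amortized O(1)" means.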
Inserting a value into a hash table takes, in the average case, O(1) time. The hash function is computed, the bucket is chosen from the hash table, and then the item is inserted. In the worst-case scenario, all of the elements will have hashed to the same value, which means either the entire bucket list must be traversed or, in the case of open addressing, the entire table must be probed until an empty spot is found. Therefore, in the worst case, insertion takes O(n) time.
refer: http://www.cs.unc.edu/~plaisted/comp550/Neyer%20paper.pdf (Hash Table Section)
