effect of size of the hash table on access speed - hashtable

I read about this question in an interview forum & I am not sure about the answer. Please help me out. Is it better to have multiple small hash tables or one big hash table in terms of access speed (assuming both fit in RAM)? I think the answer should be; both give same performance as the access time would be same. But I am confused about multiple small hash tables.

Such a question depends on a lot of factors, such as the implementation of the hash function and how you choose which table.
In general, if both fit in memory, the speed should be comparable. A hash table takes a key, converts it to a number, and looks up the value. With multiple small tables, you have to choose which hash table, so that adds an extra layer. On the other hand, modern computer architectures have multiple levels of caching in RAM, so a small hash table might be in a cache, making accesses quicker.

Related

Partitioning vs extra database

Where i work, we got a dilemma. We are using a database(MariaDB 10) that has 1 table that is growing very large(107.4GiB as i write this,. so 1.181 million rows..). This does off course affect the performance of the system.
Me and a coworker had a discussion, he suggested using partitions on that table. This will likely increase the performance, but does not reduce the size of the DB.
In previous times, i however, have been working on writing a cronjob that will move data older then 2 years from that table to a exact copy of the database on a other location.
I feel that that is the more effective way. I expect that doing this will not only increase performance(except during the times when the cronjob is running) but i know that it will also reduce the size of the table.
We don't expect that our customers are interested in this old data anyway.
Question is: What would you choose? I prefer my option, because old data is not used anyway and it keeps the main DB a lot cleaner, my coworker prefers his solution because it means less load at all times and customers can still access the old data.
I have read some of the pro's to use partitioning but haven't found a comparison yet between partitioning and moving old data to another database/place
The table in question uses several query's, This is the most important insert:
INSERT INTO ".$defaultDataTable." (
sensor_data_type_id,
sequence_number,
value,
flag,
datetime
) VALUES (
'".Database::esc($sdtid)."',
'".Database::esc($valueSequence)."',
'".Database::esc($value)."',
'".Database::esc($valueSensorDataFlagsExtended)."',
'".Database::esc($valueDateTime)."'
);
The data is selected in several pages of the application, but 1 example is the following.
SELECT
ws_sensor_data_type.sensor_data_type_id as sensor_data_type_id,
ws_sensor_data_type.name as sensor_data_type_name,
ws_sensor_data_type.equation_id as equation_id,
ws_sensor.name as sensor_name,
ws_equation.description as data_type_name,
ws_basestation.network_id as network_id,
ws_basestation.name as basestation_name,
ws_basestation.worldwide_id as worldwide_id,
ws_client.name as client_name,
ws_sensor.device_type_id as device_type,
ws_sensor.device_id as device_id
FROM
ws_sensor_data_type,
ws_sensor,
ws_basestation,
ws_client_basestation,
ws_client,
ws_equation
WHERE ws_sensor.sensor_id = ws_sensor_data_type.sensor_id
AND ws_sensor.basestation_id = ws_basestation.basestation_id
AND ws_basestation.basestation_id = ws_client_basestation.basestation_id
AND ws_client_basestation.client_id = ws_client.client_id
AND ws_sensor_data_type.equation_id = ws_equation.equation_id
AND ws_sensor_data_type.sensor_data_type_id = '".Database::esc($sdtid)."'
");
In this example, the data, along with some other information is being selected to create a .CSV export file.
The create table statement will follow as i am creating a copy of the Development DB right now to test partitioning on.
We do not use UUID's so that should not be a problem.
It depends.
Partitioning does not inherently improve performance. Only a very limited number of use cases show any performance improvement. More details .
If you are only fetching "recent" rows from the table and you have adequate indexing, then "neither" is the answer -- your million rows could grow to a billion without any performance degradation.
If you are using UUIDs, you are doomed. Performance declines terribly once the data is too big to be cached.
You have done some "hand waving". So have I. If you want to continue this discussion, please provide more specifics. CREATE TABLE, sample queries, proposed partition mechanism, proposed mechanism for accessing 'old' data, etc.

(CHORD) Peer-2-Peer How does it work/What does it do?

https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
I've looked into Chord and i'm having trouble understanding exactly what it does.
It's a protocol for a distributed hash table which stores various keys/values for later usage? Is it just an efficient way to look up in the hash table what value for a given key?
Any help such as a basic example would be much appreciated
An example question is say if I hashed inserting string "Hi" to 3 and there were no peers at 3 it would go to the next available peer and store it there right? Or where does it store it's values to?
I already answered a similar question for bittorrent/kademlia, so just to summarize in a more general sense:
DHTs store the values with some redundancy on N nodes whose ID is closest to the target hash.
Considering the vastness of >= 128bit keyspaces it would be extremely unlikely for nodes to exactly match the key. At least in routing schemes where nodes don't adjust their IDs based on content, and chord is one of those.
It's pretty much the same as regular hash tables, hence distributed hash table. You have a limited set of buckets into which the entries are hashed, where the bucket space is much smaller than the potential input keyspace and thus does not precisely match the keys either.

Sqlite3 database performance

I want create database. Simple I think. Just to storage number of phone, date, time and note.
Better (for database perfomance) use new table for every phone number and notes or one table and all information in it?
The right way is to normalize your data (hence, use as much tables as needed).
If you split your data into several tables (assuming you use indexed) write performance will be better.
Regarding read performance, depends on the size of the data (namely notes), but I would argue that having more tables is also better - except if indexing is out of the question (no reason for that really) and if you would otherwise need to join tables to get data. Even then, I don't think it would be a big trade-off.
SQLite can write millions of rows/s and read another more, are you sure you want to ask this question?

Alternative to GUID with Scalablity in mind and Friendly URL

I've decided to use GUID as primary key for many of my project DB tables. I think it is a good practice, especially for scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID and search for an alternative approach. I was actually interested to know what Pinterest i using as primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even it it is stores as string. Is there any way to achieve this?
Furthermore, youtube also use a different kind of hashing technique which I can't figure it out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me shorten url like scheme.
I prefer the shortest one, but I know that it won't guarantee to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Result Example:
1350433430523.66
Prints the total milliseconds since 1970, But what happens if I have hundreds thousands of writes per second.
I mainly prefer the non BIGINT Auto-Increment solution because it makes a lot less headache to scale the DB using 3rd party tools as well as less problematic backup/restore functionality because I can transfer data between servers and such if I want.
Another sophisticated approach is to tailor the solution towards my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number which will give me a unique numerical string. Because the user doesn't insert data as such a high rate, the numerical ID is guarantee to be unique. I can also remove the last 5 figures and still get a unique ID, because I assume that the user won't insert data at more than 1 per second the most, but I would probably won't do that (what do you think about this idea?)
So I ask for your help. My data is assumes to grow very big, 2TB a year with ten of thousands new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube like GUID's you can see this answer. They are basically keeping a database table of all random video ID's they are generating. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key, however if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so each server doesn't generate the same unique identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long ID's is to use snowflake ID's. This is what Twitter uses to generate it's ID's. All generated ID's are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors e.g. If Pintrest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user, this could lead to some ridiculously short URL's.
The first, simplest and practical scenario for unique keys
is the increasing numbering sequence of the write order,
This represent the record number inside one database providing unique numbering on a local scale : this is the -- often met -- application level requirement.
Next, the numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions in same wagons will have unique ids before writing.
When the system gets highly threaded and distributed, like in highly concurrent situations, do some constraints need to be relaxed, before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the evoked scenario occurs : backup, restore, scale, migrate and perhaps prove some authenticity.
This article Generating Globally Unique Identifiers for Use with MongoDB
by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight about database and informatics.
UUID are 128 bits length. They introduce an amount of entropy
high enough to ensure a practical uniqueness of labels.
They can be represented by a 32 hex character strings.
Enough to write several thousands of billions of billions
of decimal number.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should primary keys of database
and Unique Resource Location be kept as two different entities ?
does this numbering destruct the sequentiality in the system ?
Does providing a machine host number (h),
followed by a user number (u) and time (t) along a write index (i)
guarantee the PK huti to stay unique ?
Now considering the DB system:
primary keys should be preserved as numerical (be it hexa)
the database system relies on it and this implies performance considerations.
their size should be fixed,
the system must answer rapidly to tell if it's potentially dealing with a PK or not.
Hashids
The hashing technique of Youtube is hashids.
It's a good choice :
the hash are shorts and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as short reference to the primary keys),
it can use salt.
it's design to hash positive numbers.
However it is a hash and as such the probability exists that a collision happen. They can be detected : unique constraint is violated before they are stored and in such case, should be run again.
Consider the comment to this answer to figure out how much entropy it's possible to get from a shorten sha1+b64 recipe.
To anticipate on the colliding scenario,
calls for the estimation of the future dimension of the database, that is, the potential number of records. Recommended reading : Z.Bloom, How Long Does An ID Need To Be ?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand with a nice synthetic style
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into only numeric by converting all the letters into numbers in the guid. Here is a example of what that would look like. It's abit long but if that is not a problem this could be one way of going about generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code i used to generate the string above. But i would probably recommend you using a long primary key insteed although it can be abit of a pain it's probably a safer way to do it then the function below.
string generateKey()
{
Guid guid = Guid.NewGuid();
string newKey = "";
foreach(char c in guid.ToString().Replace("-", "").ToCharArray())
{
if(char.IsLetter(c))
{
newKey += (int)c;
}
else
{
newKey += c;
}
}
return newKey;
}
Edit:
I did some testing with only taking the 20 first numbers and out of 5000000 generated keys 4999978 was uniqe. But when using 25 first numbers it is 5000000 out of 5000000. I would recommend you to do some more testing if going with this method.

Calculate at runtime vs Lookup from SQL Server Table

I have an MVC application that needs to run several tillion calculations. Of those, I am interested in only about 8 million results. I have to do this work because I need to see an overall high and low score. I will save this data, and store it is in a single table of 16 floats. I have a few indexes too on this table for lookups. So far I have only processed 5% of my data.
As users enter data into my website, I have to do calculations based on their data. I have to determine the Best and Worst outcomes. This is only about 4 million calculations. Right now, that takes about a second or less to calculate on my local PC. Or it is a simple query that will always return 2 records from my stored data. The Best and The Worst. Right now, the query to get the results is the same speed or faster than calculating the result, but I don't have all 8 million records yet. I am worried that the DB will get slow.
I was thinking I would use the Database Lookup, and if performance became an issue, switch to runtime calculation.
QUESTION: Should I just save myself the trouble and do the runtime calculation anyway?
I am not sure which option is more scalable. I don't expect a large user base for this website.
The site needs to be snappy.
Your question is a little vague to provide a clear cut answer, but my guess is using the db to calculate the totals will be far more efficient than you writing the code on the website. Sql Server will attempt to optimize the query to use as much of the server resources as possible to make it more efficient. Your code won't do that unless you specifically write it to do so.
I would start by loading the data and doing tests before making an optimization strategy. You have no idea where the real bottlenecks of the system will be before you load data that is remotely close to what you are going to have to deal with.
If I understand the question performing the calculation is more scalable has it is on that single data set. As you add data to a table even with indexes lookups will get slower. Also the indexes increase table size and increase the time required to insert a record.
If I've understood you correctly, this is a question about caching - should you calculate on the fly, or lookup the results in a cache?
In most web architectures, your SQL database is a brilliant cache, right up to the point where it becomes a terrible cache. Scaling your (SQL) database is notoriously tricky - introducing clustering, sharding etc. becomes a production in its own right.
My - very general - advice is to use your relational database for managing transactional data, and to use caching technology for caching. 8 million records should fit into RAM on a decent server these days - and you can add web servers far more cheaply than scaling your database.

Resources