I have a table with the following columns:
.create-merge table events
(
Time: datetime,
SiteId: int,
SiteCode: string,
...
)
Site ID and code both provide a unique value for a site, so in theory it does not matter which one I use unless I need a certain data type in the output. However, I see a noticeable difference in performance between the queries:
events | summarize count() by SiteCode
~ 300 ms on a 150M rows table
events | summarize count() by SiteId
~ 560 ms on a 150M rows table
The difference is small in absolute terms, but the string query is almost twice as fast as the integer one (for consistent results, I issue requests from a client in the same region). The string code consists of 10-20 characters and intuitively should have a larger memory footprint than a 4-byte integer, so I would expect the string query to take longer, but the opposite happens.
What could be the reason for that? Am I missing something fundamental about ADX internals?
Assuming that you are using EngineV3, you are seeing the impact of the dictionary encoding optimization implemented in this engine: in certain cases string values are encoded as small, efficient int values, hence the better performance. As EngineV3 continues to improve, this optimization may be added for int values as well.
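To illustrate the general idea (this is not ADX's actual implementation, which the answer does not describe), here is a minimal Python sketch of dictionary encoding: each distinct string is replaced with a small integer code, the aggregation runs over the codes, and the codes are mapped back to strings only at the end. The function names and the sample site codes are made up for illustration.

```python
from collections import Counter

def dictionary_encode(values):
    """Replace each distinct string with a small integer code."""
    dictionary = {}  # string -> code, assigned in order of first appearance
    codes = []
    for v in values:
        code = dictionary.setdefault(v, len(dictionary))
        codes.append(code)
    return dictionary, codes

def count_by(values):
    """Count occurrences per string by aggregating over the integer codes."""
    dictionary, codes = dictionary_encode(values)
    counts_by_code = Counter(codes)
    decode = {code: s for s, code in dictionary.items()}
    return {decode[c]: n for c, n in counts_by_code.items()}

# Hypothetical site codes, 10-20 characters each.
site_codes = ["SITE-ALPHA-0001", "SITE-BETA-0002", "SITE-ALPHA-0001"]
result = count_by(site_codes)
```

With few distinct values, the codes are much smaller than both the original strings and a general-purpose 4-byte integer column, which is one plausible reason grouping can be cheaper.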
So I am looking for feedback on the issues and benefits of storing many hundreds of millions of rows of memory address data.
The addresses consist of 8 x UInt32 positions. I was initially going to store the data in 8 columns, with an 8-way multipart index on those columns.
CREATE TABLE address (position1 INTEGER, position2 INTEGER, positionN ......)
But it then dawned on me that I could also have a BLOB column, and just store 64 bytes of data and index that.
CREATE TABLE address (position BLOB)
So I guess it becomes a question about any known efficiencies or detrimental effects of looking up on an 8 way index vs a far larger binary index, both with high cardinality.
There will be duplicate, timestamped, entries for the same positions although they will be fairly infrequent.
The table will be read and added to constantly if that helps, with a 3(read)/1(write) ratio.
Many thanks.
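For anyone who wants to try both layouts side by side, here is a minimal sketch using Python's built-in sqlite3 module. The table names (address_cols, address_blob), the ts timestamp column, and the sample values are assumptions made for the example. Note that eight 32-bit values pack into 32 bytes; a 64-byte blob would correspond to eight 64-bit values. The sketch makes no claim about which layout is faster; that would have to be measured on real data.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")

# Option 1: eight integer columns with one composite index.
cols = ", ".join(f"position{i} INTEGER" for i in range(1, 9))
conn.execute(f"CREATE TABLE address_cols (ts INTEGER, {cols})")
idx_cols = ", ".join(f"position{i}" for i in range(1, 9))
conn.execute(f"CREATE INDEX idx_cols ON address_cols ({idx_cols})")

# Option 2: one BLOB column holding the packed positions.
conn.execute("CREATE TABLE address_blob (ts INTEGER, position BLOB)")
conn.execute("CREATE INDEX idx_blob ON address_blob (position)")

def pack_positions(positions):
    # Big-endian, so bytewise BLOB comparison orders like the integer tuple.
    return struct.pack(">8I", *positions)

positions = (1, 2, 3, 4, 5, 6, 7, 4294967295)
conn.execute(
    "INSERT INTO address_cols VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1000, *positions),
)
conn.execute(
    "INSERT INTO address_blob VALUES (?, ?)",
    (1000, pack_positions(positions)),
)

# Exact-match lookups against each layout.
where = " AND ".join(f"position{i} = ?" for i in range(1, 9))
row_cols = conn.execute(
    f"SELECT ts FROM address_cols WHERE {where}", positions
).fetchone()
row_blob = conn.execute(
    "SELECT ts FROM address_blob WHERE position = ?",
    (pack_positions(positions),),
).fetchone()
```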
I want to build a 20,000,000 record table in sqlite, but the file size is slightly larger than its TAB-separated plaintext representation.
What are the ways to optimize data size storage, specific to sqlite?
Details:
Each record has:
8 integers
3 enums (represented for now as 1 byte text),
7 text
I suspect that the numbers are not stored efficiently (value range 10,000,000 to 900,000,000)
According to the docs, I expect them to take 3-4 bytes if stored as a number, and 8-9 bytes if stored as text (plus maybe an additional termination byte or size indicator byte), meaning roughly a 1:2 ratio between storing as int and storing as text.
But it doesn't appear so.
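Your integers should take 4 bytes each. SQLite stores integers as signed values in 1, 2, 3, 4, 6, or 8 bytes, and the 3-byte form only reaches 2^23 - 1 = 8,388,607, which is below your minimum of 10,000,000. Additionally, SQLite always stores at least one byte of type and size information for every column (also for your 1-byte texts, so 2 bytes in total for each).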
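As a rough check of the per-value size, the sketch below mirrors the integer widths of SQLite's record format (signed 1, 2, 3, 4, 6, or 8 bytes, with the constants 0 and 1 taking no payload in the default schema format). The function name is made up, and the one-byte type code that every column also carries in the record header is not included in the result.

```python
def sqlite_int_payload_bytes(value):
    """Payload bytes SQLite needs for an INTEGER value (header byte excluded)."""
    if value in (0, 1):
        return 0  # the constants 0 and 1 are encoded in the type code itself
    for size in (1, 2, 3, 4, 6, 8):
        bits = size * 8
        # Integers are stored as signed two's-complement values.
        if -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1:
            return size
    raise OverflowError("value does not fit in a 64-bit signed integer")

# The value range from the question.
sizes = {v: sqlite_int_payload_bytes(v) for v in (10_000_000, 900_000_000)}
```

Both ends of the stated range land on the 4-byte width, so each of the eight integers costs about 5 bytes including its header byte.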
Some questions:
Do you use a compound primary key or a primary key other than a plain integer?
Do you use other indexes?
Did you try to vacuum the database? (command "vacuum") -- a SQLite database is not necessarily auto-vacuumed, so when data is deleted, the space stays reserved.
One further question:
Do you already have your 20,000,000 entries or less? For small databases the storage overhead can be much larger than the real content size.
I'm storing phone country codes. They range from 1 to about 300. What's going to be more performant for datatype: int or string? I'm using SQL server 2008 and linq-to-sql.
Thanks.
Note: Whoa, really weird - you asked about phone codes and I wrote about ZIP codes. Sorry about that! I think the advice still stands though...
Original answer: The performance difference will most likely be negligible - assign the proper type based on what the data is. ZIP codes, while numeric (in the US at least), aren't numbers - they should be stored as strings.
It is very important to understand the semantic nature of the data you are storing. Once you understand what something is then you can begin to reason about how it should be stored. I am assuming that currently you are storing only the first 5 numbers of a US postal code (like this: 12345).
If you were to store this data as a number this would work. Then imagine that your manager tells you that there is a new requirement that the app you are building will start to collect ZIP codes in the ZIP+4 format (which looks like this: 12345-6789). Now you are stuck with a nasty refactoring that involves either changing the type in the database to varchar(10) or doing some crazy voodoo in your app to strip out the dash when you save the ZIP code and then add it back in for display later.
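A small Python sketch of the problem, using made-up sample values: a numeric type drops leading zeros, and the ZIP+4 format cannot be parsed as a number at all.

```python
zip_as_text = "02134"          # sample ZIP code with a leading zero
zip_as_int = int(zip_as_text)  # stored as a number, the leading zero is lost

def parses_as_int(s):
    """Return True if the string can be stored in an integer column as-is."""
    try:
        int(s)
        return True
    except ValueError:
        return False

zip_plus4 = "12345-6789"       # ZIP+4 format from the example above
```

Here str(zip_as_int) no longer matches the original text, and parses_as_int(zip_plus4) is False, so the app would need extra formatting logic on every read and write.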
If you're really worried about space and performance then you could use a smallint (which equates to an int16). This will mean that the data will only take 2 bytes of storage (and 2 bytes in memory).
Given an option where I know the datatype will always be integer, I'll go for an integer, albeit a smaller one: smallint / tinyint (depending on the required range).
I don't expect much difference in performance though.
How are you going to be using them and do any have leading zeros?
If you are going to be combining with phone numbers that are usually stored as string, you want to store them as a string as well or you will waste processing power converting them in every query.
If you aren't planning on doing math or joins with it, it is probably a bad idea to store it as a number. Your data set is likely so small and the strings so tiny (300 is the max value) that using an int would probably gain you nothing in a join either.
Country codes are strings (notwithstanding that they use only the characters 0..9) and should be stored as such.
They are so few that you don't need to be concerned about this, though it would be simpler to apply a check constraint with an integer type.
My rule of thumb has always been: do I need an average? For example, you can store a zip code as an integer, but are you ever going to need the average zip code? Probably not. As such, store it as char, unless you may need more than 5 characters, in which case store it as varchar.
I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
What is a hash table?
A hash table, also known as a hash map, is a data structure used to implement an associative array. It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the diagram below; it explains this clearly.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array, and when we get right down to it, an array can only be addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
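As a minimal sketch of that trick, the Python below turns a string key into an array index. The polynomial hash with multiplier 31 is just one common, simple choice picked for the example, not what any particular language runtime uses. A hand-written hash is used because Python's built-in hash() for strings is salted per process and therefore not reproducible across runs.

```python
def string_hash(key: str) -> int:
    """Reproducible polynomial hash over the UTF-8 bytes of the key."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * 31 + byte) % (1 << 32)  # keep the value within 32 bits
    return h

def slot_for(key: str, table_size: int) -> int:
    """Map the hash onto a valid nonnegative array index."""
    return string_hash(key) % table_size

table = [None] * 16
# Store the key next to the value so a lookup can confirm the match.
table[slot_for("apple", 16)] = ("apple", 1)
```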
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(log n) time.
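To make the logarithmic cost concrete, the sketch below counts the loop iterations of a binary search over a sorted list, used here as a simple stand-in for a balanced BST, and builds a dict (Python's hash table) over the same keys for comparison. The key set is arbitrary.

```python
def binary_search_steps(sorted_keys, target):
    """Return (found, number of loop iterations) for a binary search."""
    lo, hi, steps = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return True, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps

keys = list(range(0, 200_000, 2))  # 100,000 sorted even numbers
found, steps = binary_search_steps(keys, 199_998)

# A dict computes the slot from the key's hash instead of narrowing a range.
lookup = {k: True for k in keys}
in_table = 199_998 in lookup
```

For 100,000 keys the search needs at most 17 iterations, and that bound grows with log n, while the expected work for the dict lookup does not depend on the number of keys.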
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide direct access to its items. That is why it calculates the "hash code" of the key and uses it to store the item, instead of the key itself.
The idea is to have only one hash code per key. Often the hash function that generates the hash code divides the key by a prime number and uses the remainder as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
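In Python, this function and a few sample keys (chosen arbitrarily for the example) look like the sketch below. Each key always gets the same slot, but different keys can land on the same slot, which is a collision the table has to handle.

```python
TABLE_SIZE = 13  # number of positions in the table

def f(x: int) -> int:
    """Hash function from the example: remainder of division by 13."""
    return x % TABLE_SIZE

keys = [5, 18, 27, 40]
slots = {k: f(k) for k in keys}
```

Here 5 and 18 both map to slot 5, and 27 and 40 both map to slot 1.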
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
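The three ingredients above can be sketched as a small Python class. This is a teaching sketch under stated assumptions, not how any particular runtime implements its hash table: collisions are handled by chaining (a list per slot), the table only grows and never shrinks, and the 0.75 load factor and starting capacity of 8 are arbitrary choices.

```python
class HashTable:
    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _index(self, key):
        # 1. Turn the key into an array index.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        # 2. Collisions: keys sharing a slot are told apart by comparing keys.
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing mapping
                return
        bucket.append((key, value))
        self._size += 1
        # 3. Grow the array when it gets too full.
        if self._size > 0.75 * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        old_items = [item for bucket in self._buckets for item in bucket]
        self._buckets = [[] for _ in range(new_capacity)]
        for k, v in old_items:
            # Every key must be re-indexed because the modulus changed.
            self._buckets[hash(k) % new_capacity].append((k, v))

table = HashTable()
for i in range(100):
    table.put(f"key{i}", i)
```

Note that the original key is stored next to the value: the hash only selects the slot, and the key comparison confirms the match.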
Generally the point of a hash table is to store some sparse value, i.e. there is a large space of keys and a small number of things to store. Think about strings. There is an infinite number of possible strings. If you are storing the variable names used in a program, then you are actually using only a relatively small number of those possible strings, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 9999 using the hash function i / 10 (integer division).
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
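Here is a minimal Python sketch of that layout. It uses integer division by 10 as the hash, since that is what yields 1000 blocks of 10 entries each and sends 1234 to entry 123 (i % 10 would instead give 10 blocks of 1000 entries). The function names are made up for the example.

```python
NUM_BUCKETS = 1000

def bucket_of(i: int) -> int:
    """Integer division by 10, so 1234 lands in bucket 123."""
    return i // 10

buckets = [[] for _ in range(NUM_BUCKETS)]
for i in range(10_000):  # the integers 0 .. 9999
    buckets[bucket_of(i)].append(i)

def contains(i: int) -> bool:
    # Jump straight to one bucket, then compare at most 10 entries.
    return i in buckets[bucket_of(i)]

found_1234 = contains(1234)
```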
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
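The effect of a bad hash function can be seen by measuring the longest bucket. The sketch below compares the constant hash with a simple identity hash over 1000 integer keys and 100 buckets; the helper name and the sizes are arbitrary choices for the example.

```python
from collections import Counter

def longest_bucket(keys, hash_fn, num_buckets):
    """Size of the fullest bucket when keys are placed by hash_fn."""
    counts = Counter(hash_fn(k) % num_buckets for k in keys)
    return max(counts.values())

keys = range(1000)
worst = longest_bucket(keys, lambda x: 0, 100)   # hash(x) = 0
spread = longest_bucket(keys, lambda x: x, 100)  # hash(x) = x
```

With the constant hash every key collides and one bucket holds all 1000 entries, so each lookup degenerates into a linear scan. With the identity hash no bucket holds more than 10.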
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which, BTW, is typically stored in the table as well, since the hash is not reversible), is that...
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time spent dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such a tree is more efficient, space-wise).
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.