Encryption security - encryption

Let's say we have a number (12345) and we want to store it in the database but encrypted somehow.
We would like to avoid using any common encryption method.
We would like to know if doing this is secure, and if it is, HOW secure.
Original number: 12345
Shuffle: 35124
Add some data: 53412-35124-14352
then you store it in the db...
You can read the original number since you know where to look.
Is this method easily reverse engineered?

Let's say I have some way of sending you some numbers to store. I send you a few numbers, then I inspect your "encrypted" numbers.
Number I sent | Number you stored
--------------+-------------------
12345         | 53412-35124-14352
11111         | 73671-11111-78162
67890         | 98126-80679-98983
Just by looking at that, you can quite easily see what's going on.
You really should not invent your own crypto algorithm. I'll just quote Bruce Schneier:
Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break. It's not even hard. What is hard is creating an algorithm that no one else can break, even after years of analysis. And the only way to prove that is to subject the algorithm to years of analysis by the best cryptographers around.

The method is very easily reverse engineered and broken.
Something this simple would probably be broken by a human just looking at it and noticing that there is a constant mapping of positions: I give you abcd and you give me back bdca-bafd-jc6f. The extra data you added does not obscure the fact that the first four characters are linked to the input.
If the method were more complicated, something similar to machine learning could be applied, where a computer detects these direct patterns. Google Translate uses a version of this to produce translations through pure statistics over books that have been translated into multiple languages.
In addition, if you are only shuffling digits, the sample space of any encrypted value is quite small.
E.g. if the encrypted text is: 1342
Better encryption would mean it could originally have been anything from 0 to 9999 (10,000 possibilities).
Your encryption tells me it started as one of these 24 permutations:
1234 | 2134 | 3124 | 4123
1243 | 2143 | 3142 | 4132
1324 | 2314 | 3214 | 4213
1342 | 2341 | 3241 | 4231
1423 | 2413 | 3412 | 4312
1432 | 2431 | 3421 | 4321
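A minimal sketch (Python, with a hypothetical observed value) of how quickly an attacker can enumerate that whole candidate space:

from itertools import permutations

observed = "1342"  # the shuffled digits an attacker sees in the database

# A fixed positional shuffle of 4 distinct digits has only 4! = 24 candidates.
candidates = sorted("".join(p) for p in permutations(observed))
print(len(candidates))  # 24
print(candidates)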
Finally:
Because the "secrecy" of your encryption comes from people not knowing how to perform it, you cannot ask the community to examine its strength without them learning how you did it and thus cracking it.
All good encryption methods derive their secrecy from a key or password, which means you can share the method and ask the community to test its strength, because you don't give them the key.

Related

256-bit encryption with small random alphanumeric password

We are considering using email to transmit PDFs containing personal health-related information ("PHI"). There is nothing of commercial value, no social security numbers or credit card numbers, or anything like that in these documents. Only recommendations for treatment of medical conditions.
The PDFs would be password-encrypted using Adobe Acrobat Pro's 256-bit password encryption.
Using very long passwords is not logistically desirable because the recipient of the emails with PDF attachment is the patient, not a technical person. We want to make the password easy-to-type, and yet not so short that any desktop PC has the CPU capacity to crack it in a few minutes.
If a password does not use any dictionary words but is simply a four-character random ASCII alphanumeric string, like DT4K (alphas all uppercase, not mixed), how long would it take a typical desktop business or home computer with no specialized hardware to crack the encryption? Does going to 5 characters significantly increase the cracking time?
Short answer: no, and no.
Longer answer: alphanumeric means A-Za-z0-9, right? That's 62 possible characters, or about 5.95 bits of entropy per character. Since entropy is additive, 4 characters give roughly 24 bits, and 5 give about 30. To put that into perspective: 10 bits mean the attacker has to try about a thousand possible keys, 20 bits a million, 30 bits about a billion. That's almost nothing these days. 56-bit DES was cracked by brute force in 1998; today people worry that 128-bit AES might not be safe enough.
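To make the arithmetic concrete, a quick back-of-the-envelope check in Python (the guess rate is an illustrative assumption, not a benchmark):

import math

CHARSET = 62             # A-Z, a-z, 0-9
GUESSES_PER_SEC = 10**9  # assumed: a billion guesses/s on a desktop

for length in (4, 5):
    keys = CHARSET ** length
    bits = length * math.log2(CHARSET)
    print(f"{length} chars: {bits:.1f} bits, {keys:,} keys, "
          f"{keys / GUESSES_PER_SEC:.2f} s to exhaust")
# 4 chars: 23.8 bits, 14,776,336 keys, 0.01 s to exhaust
# 5 chars: 29.8 bits, 916,132,832 keys, 0.92 s to exhaust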
If I were you, I'd try to use something like Diceware. That's a list of 7776 easily pronounced words. You can use a random number generator to pick a passphrase from these words, and each word will have about 12.9 bits of entropy. So 5 words are about 65 bits, which for the kind of data you have might be an acceptable level of security, while being easily remembered or communicated via phone.
Why 7776 words? Well, 7776 is 6*6*6*6*6, so you can roll a die five times and get a number, and just look up the corresponding word on the list.
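A sketch of picking such a passphrase with Python's cryptographically secure RNG (diceware.txt is a placeholder for a local copy of the published word list, one roll-number and word per line):

import secrets

with open("diceware.txt") as f:
    words = [line.split()[-1] for line in f if line.strip()]
assert len(words) == 6 ** 5        # 7776 words, ~12.9 bits each

passphrase = " ".join(secrets.choice(words) for _ in range(5))
print(passphrase)                  # 5 words ~ 65 bits of entropy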
My bank sends statements encrypted and uses a combination of my name and birth date. I'm not a huge fan of that idea, but provided you use information that's unlikely to be known to an attacker you'll get a greater level of security than from four or five character alphanumeric passwords.
This would take less than 25 seconds even with the most rudimentary tools. There are precompiled rainbow tables for passwords this short that can run in seconds on decent PCs. Password length, NOT complexity, is what makes a password difficult to crack. I would highly recommend giving them a longer password, but make it something easily recalled. Maybe your entire business name salted with your street address number at the end. Please take at least some precautions. Having a four-character password is barely better than not having one at all.
How Strong is your Password?

Berkeleydb - B-Tree versus Hash Table

I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-tree versus Hash table.
A hash table provides O(1) lookup, but inserts are expensive (using linear/extendible hashing we get amortized O(1) inserts). B-trees provide O(log_B N) lookup and insert times. A B-tree can also support range queries and allows access in sorted order.
Apart from these considerations what else should be factored in?
If I don't need to support range queries can I just use a Hashtable access method?
When your data sets get very large, B-trees are still better because the majority of the internal metadata may still fit in cache. Hashes, by their nature (uniform random distribution of data) are inherently cache-unfriendly. I.e., once the total size of the data set exceeds the working memory size, hash performance drops off a cliff while B-tree performance degrades gracefully (logarithmically, actually).
It depends on your data set and keys. On small data sets your benchmarks will be close to the same; however, on larger data sets it can vary depending on what type of keys and how much data you have. Usually B-tree is better, until its metadata exceeds your cache and it ends up doing lots of I/O; in that case hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you find you will be doing lots of inserts and few reads, hash may be better; if you do few inserts but lots of reads, B-tree may be better.
If you are really concerned about performance you could try both methods and run your own benchmarks =]
For many applications, a database is accessed at random, interactively
or with "transactions". This might happen if you have data coming in
from a web server. However, you often have to populate a large
database all at once, as a "batch" operation. This might happen if you
are doing a data analysis project, or migrating an old database to a
new format.
When you are populating a database all at once, a B-Tree or other
sorted index is preferable because it allows the batch insertions to
be done much more efficiently. This is accomplished by sorting the
keys before putting them into the database. Populating a BerkeleyDB
database with 10 million entries might take an hour when the entries
are unsorted, because every access is a cache miss. But when the
entries are sorted, the same procedure might take only ten minutes.
The proximity of consecutive keys means you'll be utilizing various
caches for almost all of the insertions. Sorting can be done very
quickly, so the whole operation could be sped up by several times just
by sorting the data before inserting it. With hashtable indexing,
because you don't know in advance which keys will end up next to each
other, this optimization is not possible.
Update: I decided to provide an actual example. It is based on the
following script "db-test"
#!/usr/bin/perl
# Usage: db-test <BerkeleyDB::Btree|BerkeleyDB::Hash> < key-file
use warnings;
use strict;
use BerkeleyDB;
my %hash;
unlink "test.db";
# The access method class (Btree or Hash) comes from the first argument.
tie %hash, (shift), -Filename=>"test.db", -Flags=>DB_CREATE or die;
while(<>) { $hash{$_}=1; }    # insert each input line as a key
untie %hash;
We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)
$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M enw.tab
$ time shuf enw.tab > test-shuf
16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G test.db
You can see that pre-sorting the Btree keys drops the insertion time
down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so
there's a big net gain - the database creation goes 5x faster. With
the Hash format, the creation times are equally slow whether the input
is sorted or not. Also note that the database file size is smaller for
the sorted insertions; presumably this has to do with tree balancing.
The speedup must be due to some kind of caching, but I'm not sure
where. It is likely that we have fewer cache misses in the kernel's
page cache with sorted insertions. This would be consistent with the
CPU usage numbers - when there is a page cache miss, then the process
has to wait while the page is retrieved from disk, so the CPU usage is
lower.
I ran the same tests with two smaller files as well, for comparison.
File       | WP index          | Wikt. words       | /usr/share/dict/words
Entries    | 16e6              | 4.7e6             | 1.2e5
Size       | 700M              | 65M               | 1.1M
shuf time  | 34s               | 4s                | 0.06s
sort time  | 1:10              | 6s                | 0.12s
---------------------------------------------------------------------------
           | (each cell: total time, DB size, CPU usage)
---------------------------------------------------------------------------
Btree shuf | 41m, 1.3G, 42%    | 5:00, 180M, 88%   | 6.4s, 3.9M, 86%
      sort |  7m, 920M, 91%    | 1:50, 120M, 99%   | 2.9s, 2.6M, 97%
Hash  shuf | 44m, 1.1G, 39%    | 5:30, 129M, 87%   | 6.2s, 2.4M, 98%
      sort | 47m, 1.1G, 36%    | 5:30, 129M, 86%   | 6.2s, 2.4M, 94%
---------------------------------------------------------------------------
Speedup    | 5x                | 2.7x              | 2.2x
With the largest dataset, sorted insertions give us a 5x speedup.
With the smallest, we still get a 2x speedup - even though the data
fits easily into memory, so that CPU usage is always high. This seems
to imply that we are benefiting from another source of efficiency in
addition to the page cache, and that the 5x speedup was actually due
in equal parts to page cache and something else - perhaps the better
tree balancing?
In any case, I tend to prefer the Btree format because it allows
faster batch operations. Even if the final database is accessed at
random, I use batch operations for development, testing, and
maintenance. Life is easier if I can find a way to speed these up.
To quote the two main authors of Berkeley DB in this write up of the architecture:
The main difference between Btree and Hash access methods is that
Btree offers locality of reference for keys, while Hash does not. This
implies that Btree is the right access method for almost all data
sets; however, the Hash access method is appropriate for data sets so
large that not even the Btree indexing structures fit into memory. At
that point, it's better to use the memory for data than for indexing
structures. This trade-off made a lot more sense in 1990 when main
memory was typically much smaller than today.
So perhaps in embedded devices and specialized use cases a hash table may work. B-trees are used in modern filesystems like Btrfs, and they are pretty much the ideal data structure for building either databases or filesystems.

How can I reduce the size of a cookie value?

I need to know the best way to reduce the size of data to be stored in a cookie.
You could store a unique identifier or token as the cookie's value and then store all the data you want associated with it on the server side in the database.
User shows up with token abcdefg. You query the db and get all your info for token abcdefg.
Also depends on the kind of data you want to store. You can express a subset of known, possibly applicable property values as powers of 2 (bit flags), e.g.
Car Wash Properties:
Basic Air Dry            1
Hand wipe with chamois   2
Steam clean wheels       4
Steam clean engine       8
Hot Wax                 16
Interior vacuum         32
Tire treatment          64
such that Basic Air Dry + Interior vacuum = 33. All you'd need to store is the value 33.
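A sketch of packing and unpacking those flags in Python (constant names are illustrative):

# Each property is a distinct power of two, so any combination fits in one int.
BASIC_AIR_DRY   = 1
HAND_WIPE       = 2
STEAM_WHEELS    = 4
STEAM_ENGINE    = 8
HOT_WAX         = 16
INTERIOR_VACUUM = 32
TIRE_TREATMENT  = 64

cookie_value = BASIC_AIR_DRY | INTERIOR_VACUUM   # 33
assert cookie_value & INTERIOR_VACUUM            # flag set
assert not cookie_value & HOT_WAX                # flag not set
print(cookie_value)                              # store "33" in the cookie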

Entire range - Reverse MD5 lookup

I am learning about encryption methods and I have a question about MD5.
I have seen that there are several websites offering 'rainbow tables' for reverse MD5 lookup, but they can't look up all possible combinations.
For knowledge's sake, my question is this:
Hypothetically, if a group of people were to consider an upper limit (e.g. 5 or 6 characters) and decide to map out the entire MD5 hash space for all values inside that range, storing the results in a database for reverse lookup:
1. Do you think such a thing is probable?
2. If you can speculate, what kind of scale of resources would this require?
3. To your knowledge, have there been any public or private attempts to do this?
I am not referring to tables that have select entries based on a dictionary, but to mapping the entire range up to a certain number of characters.
(I have referred to this question already.)
It is possible. For a small number of characters, it has already been done. In the near future, it will be easy for larger numbers of characters. MD5 isn't getting any stronger.
That's a function of time. To reverse the entire 6-or-fewer-character alphanumeric space would require computing roughly 62^6 entries. That's about 57 billion MD5s. That's doable by a determined small group or easy for a government, right now. In the future, it will be doable on a home computer. Remember, though, that as the number of allowable characters or the maximum length increases, the difficulty increase is exponential.
People already have done it. But, honestly, it doesn't matter - because anyone with half an ounce of sense uses a random salt. If you precompute the entire MD5 space and reverse it, that doesn't mean jack dandy if someone is using key strengthening or a good salt! Read up on salting.
5 or 6 characters is easy. 6 bytes is doable (that's 2^48 combinations), even with limited hardware.
Namely, a simple Core2 CPU from Intel will be able to hash one password in about 150 clock cycles (assuming you use an SSE2 implementation, which will hash four passwords in parallel in 600 clock cycles). With a 2.4 GHz quad-core CPU (that's my PC, not exactly the newest machine available), I can then try about 2^26 passwords per second. For that kind of job, a massively parallel architecture is fine, hence it makes sense to use a GPU. For maybe $200, you can buy an NVidia video card which will be about four times faster (i.e. 2^28 passwords per second). 6 alphanumeric characters (uppercase, lowercase and digits) are close to 2^36 combinations; trying them all is then a matter of 2^(36-28) seconds, which is less than five minutes. With 6 random bytes, it will need 2^20 seconds, i.e. a bit less than a fortnight.
That's for the CPU cost. If you want to speed up the actual attack, you store the hash results: thus you will not need to recompute all those hashed passwords every time you attack a password (but you still have to do it once). 2^36 hash results (16 bytes each) mean 1 terabyte. You can buy a hard disk that big for $100. 2^48 hash results imply 4096 times that storage space; in plain hard disks this will cost as much as a house: a bit expensive for the average bored student, but affordable for most kinds of governmental or criminal organizations.
Rainbow tables are an optimization trick for the storage. In rough terms, you store only one in every t hash results, in exchange for having to do t lookups and t^2 hash computations for every attack. E.g., you choose t=1000; you only have to buy four hard disks instead of four thousand, but you will need to make 1000 lookups and a million hashes every time you want to crack a password (this will take a dozen seconds at most, if you do it right).
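To illustrate the store-one-in-t idea, here is a toy Python sketch of Hellman-style hash chains over a 4-digit password space (real rainbow tables use a different reduction function at each step, but the storage/time trade-off is the same):

import hashlib

def H(p):                                  # the hash under attack
    return hashlib.md5(p.encode()).hexdigest()

def R(h):                                  # reduction: digest -> password space
    return f"{int(h, 16) % 10000:04d}"     # toy space: 4-digit PINs

def chain_end(start, t):
    p = start
    for _ in range(t):
        p = R(H(p))
    return p

# Store only (start, end) per chain: each pair covers about t passwords,
# shrinking storage by a factor of t at the cost of up to t lookups and
# t^2 hash computations per attacked password.
print(chain_end("1234", 1000))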
Hence you have two costs:
The CPU cost is about computing hashes for the complete password space; with a table (rainbow or not) you have to do it once, and then can reuse that computational effort for every attacked password.
The storage cost is about storing the hash results in order to easily attack several passwords. Harddisks are not very expensive, as shown above. Rainbow tables help you lower storage costs.
Salting defeats cost sharing through precomputed tables (whether they are rainbow tables or just plain tables has no effect here: tables are about reusing precomputed values for several attacked passwords, and salts prevent such recycling).
The CPU cost can be increased by defining that the hash procedure is not just a single hash computation; for instance, you can define the "password hash" as applying MD5 over the concatenation of 1000 copies of the password. This will make each attacker guess one thousand times more expensive. It also makes legitimate password validation one thousand times more expensive, but most users will not mind (the user has just typed his password; he cannot really see whether the password verification took 10ms or 10µs).
Modern Unix-like systems (e.g. Linux) use "MD5" passwords which actually combine salting and iterated hashing, as described above. (Actually, a modern Linux system may use another hash function, such as SHA-256, but that does not change things much here.) So precomputed tables will not help, and on-the-fly password cracking is expensive. A password with 6 alphanumeric characters can still be cracked within a few days, because 6 characters are kind of weak anyway. Also, many longer passwords are crackable because it turns out that human beings are bad at remembering passwords; hence they will not choose just any random sequence of characters, they will select passwords which have some "meaning". This reduces the space of possible passwords.
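A sketch of the salt-plus-iteration idea using Python's standard library (PBKDF2 here rather than the crypt()-style MD5 scheme; the parameters are illustrative):

import hashlib, os

password = b"hunter2"
salt = os.urandom(16)    # per-password random salt defeats precomputed tables
iterations = 100_000     # iteration count makes every guess proportionally costlier

digest = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
print(salt.hex(), digest.hex())   # store salt + hash, never the password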
It's called a rainbow table, and it's easily defeated with salting.
Yes, it is not only probable, it has probably been done before.
It depends on whether they are mapping the entire possible range or just a range of ASCII characters. Let's say you need 128 bits + 6 bytes to store each match. That's 22 bytes per entry. You'd need:
6.32 GB to store all lowercase alphabetic combinations [a-z]
405 GB for all alphabetic combinations [a-zA-Z]
1.13 TB for all alphanumeric combinations [a-zA-Z0-9]
5.24 TB for all combinations that consists of letters, numbers and 18 symbols.
As you can see, it increases exponentially, but even at 5.24 TB that's nothing to agencies like, say, the NSA or the CIA. They have probably done it.
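The storage figures above are easy to reproduce: 22 bytes per entry times charset^6 entries, in binary units.

SIZES = [("a-z", 26), ("a-zA-Z", 52), ("a-zA-Z0-9", 62), ("+18 symbols", 80)]
for name, charset in SIZES:
    size_bytes = 22 * charset ** 6
    print(f"{name:>12}: {size_bytes / 2**30:,.2f} GB")
# a-z: 6.33 GB, a-zA-Z: 405.08 GB, a-zA-Z0-9: 1,163.93 GB, +18 symbols: 5,371.58 GB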
As everyone else said, salting can easily defeat rainbow tables and that's almost as important as hashing. Read this: Just hashing is far from enough - How to position against dictionary and rainbow attacks

Please advise about our case of using encryption

Our client wants to give us a database. The original database has a phone number column, but he doesn't want to give us the phone numbers. I'm not sure why, but it has been decided that the client will give us phone numbers encrypted with a 128-bit AES key.
We will tell the client which phone numbers are to be shortlisted for some purpose, but we will never know the actual phone numbers; we'll only know the encrypted numbers.
Here are things I don't understand:
1. Is using 128-bit AES encryption suitable for this purpose?
2. Should the client preserve the AES key used to convert the numbers, or should the client instead create a database mapping the original numbers to the encrypted numbers?
3. Should the same key be used to convert all numbers, or different keys?
4. If randomly generated keys are used to encrypt the numbers, isn't it possible that the encrypted text for two different phone numbers may be the same?
IMO this is the wrong approach. Instead of encrypting the phone numbers, which still leaves a chance of you decrypting them (e.g. because someone leaks the key), the client should just replace them with an ID that points to a table with the real telephone numbers; of course, this lookup table stays with him, and you never get it.
I.e.:
Original table:
Name | Phone
-------+---------
Erich | 555-4245
Max | 1234-567
You get:
Name | Phone
-------+---------
Erich | 1
Max | 2
Only your client has:
ID | Phone
---+---------
1 | 555-4245
2 | 1234-567
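A sketch of that tokenization on the client's side (Python; names are illustrative):

import itertools

# The client keeps phone_to_id private; only the IDs ever leave the building.
next_id = itertools.count(1)
phone_to_id = {}

def tokenize(phone):
    if phone not in phone_to_id:
        phone_to_id[phone] = next(next_id)
    return phone_to_id[phone]

print(tokenize("555-4245"))   # 1: what the vendor receives
print(tokenize("1234-567"))   # 2
print(tokenize("555-4245"))   # 1 again: the mapping is stable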
Addressing your concerns in order:
It may be, it may not be. You haven't really mentioned what the purpose is at all, in fact:
Why the need for encryption?
Who is it being protected from?
What's the value (or liability to you, if lost) of the data?
How motivated are the hypothetical attackers assumed to be?
What performance loss is acceptable for the security gain?
What hardware do you have available?
Who has what physical/logical access to various parts of the system?
And so on, and so forth. Without knowing the situation, it's not possible to say whether this is an appropriate encryption scheme. (Though it is likely to be a solid choice).
Surely that's for the client to decide? I will say, though, that the latter case seems to defeat the purpose of encryption entirely.
The same key ought to be used to convert all numbers, unless you fancy juggling keys around to try and remember which one to use to decrypt which phone numbers. If the security system is well designed, this wouldn't give any extra security and would just be a bizarre headache.
By definition, no. Encryption is always a reversible mapping, which means there's no loss of information such as you would get with a hash. Consequently, every instance of ciphertext has a single unique plaintext that will encrypt to it (with a given key).
Though all in all, this doesn't sound like it's needed. It sounds to me like someone's been making decisions based on appearances rather than technical merit - "We encrypt your phone numbers with the same 128-bit encryption used in browsers" sounds good, but is it actually needed?