Say you have a software server that uses hash functions, and some external party wants to exploit that and keeps attacking the server with keys that they know (or strongly suspect) will result in collisions. How would you prevent this in practice?
I think one way is to choose the hash function randomly at the beginning of the problem, but this method seems slow in the sense that every time you change hash functions you have to rehash everything.
As you obviously realise, the best defence is to make sure they don't know what your hash function will produce - ideally not your bucket count either. If the hash function is strong, hard to reverse and produces a large range of outputs (say, 64-bit unsigned integers), then finding two keys that produce the same hash may be time-consuming, but finding a key that hashes to a specific bucket after modding by N only needs on average N attempts with any random, distinct keys.
"choose the hash function randomly at the beginning of the problem, but this method seems slow in the sense that every time you change hash functions you have to rehash everything."
There's not necessarily a need to repeatedly change the hash function... you just need to make it unguessable based on exposed data/code and observable behaviours. For example, you might generate a random seed value on your server, write that to a secure file somewhere, and use it as a seed for your hash function (or if your hash function doesn't support a seed value, just XOR the hash output with the random value). Even if someone knows your hash function, if they don't know the seed then they can't engineer collisions.
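To make that concrete, here's a minimal Python sketch of the idea (the seed file location, the bucket count and the use of BLAKE2's keyed mode are my own choices for illustration, not anything prescribed above): generate a secret seed once, persist it, and mix it into every hash so an outsider can't predict which bucket a key lands in.

```python
import os
import hashlib

SEED_FILE = "hash_seed.bin"   # hypothetical location; in practice somewhere only the server can read
BUCKET_COUNT = 1024

def load_or_create_seed(path=SEED_FILE):
    """Generate a random seed once and persist it; reuse it on later runs."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        seed = os.urandom(16)            # unpredictable, from the OS CSPRNG
        with open(path, "wb") as f:
            f.write(seed)
        return seed

SEED = load_or_create_seed()

def bucket_for(key: bytes) -> int:
    """Seeded hash: without knowing SEED, an attacker can't predict the bucket."""
    digest = hashlib.blake2b(key, key=SEED, digest_size=8).digest()
    return int.from_bytes(digest, "big") % BUCKET_COUNT
```

Because the seed here enters the hash computation itself (rather than just being XORed onto the output), collisions engineered against the unseeded function shouldn't carry over either.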
You could also count the collisions a particular client has had, and if a client is obviously malicious, disconnect them and remove their keys.
For the purpose of reproducibility, one has to choose a seed. In R, we can use set.seed().
My question is, when the seed is not set explicitly, how does the computer choose the seed?
Why is there no default seed?
A pseudo random number generator (PRNG) needs a start value, which you can set with set.seed(). If none is given, it generally takes computer-based information. This could be the time, CPU temperature or something similar. If you want a more random start value, it is possible to use physical values, like white noise or nuclear decay, but you generally need an external information source for this kind of random input.
The documentation mentions that R uses the current time and the process ID:
Initially, there is no seed; a new one is created from the current time and the process ID when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored.
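For illustration only, here's a rough Python analogue of that default behaviour; this isn't R's actual algorithm, just a sketch of the "current time plus process ID" idea the documentation describes.

```python
import os
import random
import time

# Roughly what "no explicit seed" amounts to: derive a start value from
# volatile process state, so each session starts differently by default.
default_seed = int(time.time() * 1000) ^ os.getpid()

rng = random.Random(default_seed)
print(rng.random())

# Reusing the same seed reproduces the same stream, which is why an
# explicit set.seed() call gives reproducible simulations.
assert random.Random(default_seed).random() == random.Random(default_seed).random()
```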
A default seed is a bad idea, since the random number generator would then always produce the same sequence of numbers by default. If you always use the same seed, the output isn't randomized any more, since you always get the same numbers. You would just be providing a fixed data sample, which is not the intended output of a PRNG. You could of course turn the default seed off (if there were one), but the primary intended function is to generate a completely random set of data, not a fixed one.
For statistical approaches this matters for validation and verification reasons, but it becomes even more important when you get to cryptography. In that field a good PRNG is mandatory.
I am faced with a need to send my data in parts, and at the same time I am expected to provide the sha256 of my WHOLE data.
Something like this: cat large file | chunker | receiver
where receiver is an application that is expected to receive the data, possibly in chunks, each carrying the sha256 of its payload in a header followed by the payload itself. After collecting all chunks, it is supposed to store the whole transmitted data, and the sha256 of all the data (that particular sha256 will be used later only to rehash and confirm the integrity of the data).
Of course, the simplest thing would be for the receiver to generate the sha256 from the whole streamed data, but I was wondering if there is a simpler way: collecting the hashes of all the chunks and combining them to generate one final hash, which would be the same as the hash calculated from all the data.
In other words - and I copy this from the title - I wonder if there is a function F that would receive a list of hashes of chunks of data and then generate a final hash equal to the hash generated over all the data.
And again, in other words, having this formula:
F(sha256(data[0]), sha256(data[1]), ... sha256(data[N])) = sha256(data[0..N])
What would be the function F?
Would it be a universal function, or is there no such thing given the way hashing is calculated?
I suspect there is no such function, or that this is too complicated a question to answer.
AFAIK there are still no known collisions for SHA-256, but I bet that once one is found, i.e. someone finds two messages m1 and m2 such that SHA-256(m1) = SHA-256(m2), then for almost any prefix a the hashes SHA-256(a || m1) and SHA-256(a || m2) will be different, i.e. the function you ask for is actually not a function (it would have different outputs for the same inputs). To put it another way, SHA-2 is susceptible to length-extension attacks but AFAIK not to prefixing attacks. Still, even if this actually were a function, existence is not enough for you; you also want it to be fast. And I believe there is no such fast-to-compute function.
On the other hand, SHA-256 works by splitting the original message into 512-bit chunks and processing them using a well-defined process (which is based on the state from all the previous chunks), so theoretically you could modify some implementation of SHA-256 to compute two hashes at the same time (by applying the same logic to different initial states):
Hash of your application-defined chunk (using standard initial state)
Hash of all chunks up to this point (using the state output by the previous invocation of this step as the initial state).
This will probably be slightly faster than doing those things independently, but I don't know whether it will be so much faster as to justify such a custom implementation.
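In practice, if the receiver sees every chunk anyway, the simplest robust approach is to feed a single running SHA-256 context as chunks arrive. Here's a hedged Python sketch (the chunk size and helper name are mine) showing that the incremental hash equals the hash of the whole data, and that the chunk hashes alone aren't enough to reconstruct it:

```python
import hashlib

def receive(chunks):
    """Consume payload chunks, hashing each one and the whole stream.

    Returns (per-chunk hashes, hash of all the data) without ever holding
    more than one chunk in memory.
    """
    whole = hashlib.sha256()          # running hash over the concatenated data
    chunk_hashes = []
    for payload in chunks:
        chunk_hashes.append(hashlib.sha256(payload).hexdigest())
        whole.update(payload)         # same result as hashing data[0..N] at once
    return chunk_hashes, whole.hexdigest()

data = b"some large file contents"
parts = [data[i:i + 8] for i in range(0, len(data), 8)]
hashes, total = receive(parts)
assert total == hashlib.sha256(data).hexdigest()
# Note: `total` cannot be recomputed from `hashes` alone, which is why no
# general function F exists; the running state depends on the raw bytes.
```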
I want to compare a hash function and RSA encryption using some common parameter.
I have an algorithm that uses some hash functions, and I want to claim that the computational load of these hashes is less than that of one RSA operation.
Can I compare them by counting multiplications, for example how many multiplications each of them requires?
How can I compare them in terms of communication load? How can I determine the length of the output of RSA?
It sounds like you're trying to compare apples and oranges.
A hash function is generally expected to accept arbitrarily long inputs, and the time needed to compute it should generally scale linearly with the length of the input. Thus, a useful measure of hash function performance would be, say, "megabytes per second".
(Specifically, that would be a measure of throughput, which is the relevant measure when hashing long inputs. For short messages, a more relevant measure is the latency, which is basically the minimum time needed to hash zero-length input. Given the throughput and the latency, one can generally calculate a fairly good approximation of the time needed to hash an input of any given length as time = latency + length / throughput.)
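As a rough illustration of those two measures, here's a small Python sketch (input sizes and repetition counts are arbitrary) that estimates latency and throughput for SHA-256 and applies the time = latency + length / throughput approximation:

```python
import hashlib
import time

def measure(length, repeats):
    """Average wall-clock time to hash `length` bytes of zeros."""
    data = bytes(length)
    start = time.perf_counter()
    for _ in range(repeats):
        hashlib.sha256(data).digest()
    return (time.perf_counter() - start) / repeats

latency = measure(0, 10_000)                      # time to hash a zero-length input
big = 8 * 1024 * 1024
throughput = big / (measure(big, 20) - latency)   # bytes per second, roughly

# Predicted time for an arbitrary length, per the approximation above.
length = 1_000_000
predicted = latency + length / throughput
print(f"latency ~{latency * 1e6:.1f} us, throughput ~{throughput / 1e6:.0f} MB/s, "
      f"predicted {predicted * 1e3:.2f} ms for {length} bytes")
```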
RSA, on the other hand, can only encrypt messages shorter than the modulus, which is chosen at the time the key is generated. (Typical modulus sizes might be, say, from 1024 to 4096 bits.) To "encrypt a long message with RSA" one would normally use hybrid encryption: first encrypt the message using a symmetric cipher like AES, using a suitable mode of operation and a randomly chosen key, and then encrypt the AES key with RSA.
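Here's a minimal sketch of that hybrid pattern using the pyca/cryptography package; the choice of AES-GCM and a 2048-bit key is mine, just to show the shape of it:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

message = b"a message far longer than one RSA block could ever hold" * 100

# Long message: encrypt with a fresh symmetric key...
aes_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, message, None)

# ...then encrypt only the short AES key with RSA (it fits under the modulus).
rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = rsa_key.public_key().encrypt(aes_key, oaep)

# The receiver unwraps the AES key with the RSA private key, then decrypts.
recovered = AESGCM(rsa_key.decrypt(wrapped_key, oaep)).decrypt(nonce, ciphertext, None)
assert recovered == message
```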
The same length limits apply to signing messages with RSA — by itself, RSA can only sign messages shorter than the modulus. The standard workaround in this case is to first hash the message, and then sign the hash value. (There's also a lot of important details like padding involved that I'm not going to go into here, since we're not on crypto.SE, but which are absolutely crucial for security.)
The point is that, in both cases, the RSA operation itself takes a fixed amount of time regardless of the message length, and thus, for sufficiently long messages, most of the time will be consumed by AES or the hash function, not by RSA itself. So when you say you want to "claim that computation load of these hashes is less than one RSA", I would say that's meaningless, at least unless you fixed a specific input length for your hash. (And if you did, my next question would be "what's so special about that particular input length?")
I understand how standard random number generators work. But when working with cryptography, the random numbers really have to be random.
I know there are instruments that read cosmic white noise to help generate secure hashes, but your standard PC doesn't have this.
How does a cryptographically secure random number generator get its values with no repeatable patterns?
A cryptographically secure random number generator, as you might use for generating encryption keys, works by gathering entropy - that is, unpredictable input - from a source which other people can't observe.
For instance, /dev/random(4) on Linux collects information from the variation in timing of hardware interrupts from sources such as hard disks returning data, keypresses and incoming network packets. This approach is secure provided that the kernel does not overestimate how much entropy it has collected. A few years back the estimates of entropy from the various sources were all reduced, making them far more conservative. Here's an explanation of how Linux estimates entropy.
None of the above is particularly high-throughput. /dev/random(4) probably is secure, but it maintains that security by refusing to give out data once it can't be sure that that data is securely random. If you want to, for example, generate a lot of cryptographic keys and nonces then you'll probably want to resort to hardware random number generators.
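In practice you usually ask the kernel's CSPRNG through a library call rather than reading the device file yourself. A small Python sketch of both (the direct /dev/random read is Linux-specific and only mirrors the discussion above):

```python
import os
import secrets

# Preferred: the OS CSPRNG, seeded from the kernel's entropy pool.
key = secrets.token_bytes(32)          # e.g. a 256-bit encryption key
nonce = os.urandom(16)

# Mirrors the discussion above: read the device file directly (Linux only).
# On older kernels /dev/random may block until enough entropy is estimated.
with open("/dev/random", "rb") as dev:
    raw = dev.read(16)

print(key.hex(), nonce.hex(), raw.hex())
```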
Often hardware RNGs are designed around sampling the difference between a pair of oscillators that run at close to the same speed, but whose rates are varied slightly according to thermal noise. If I remember rightly, the random number generator that's used for the UK's premium bond lottery, ERNIE, works this way.
Alternate schemes include sampling the noise on a CCD (see lavaRND), radioactive decay (see hotbits) or atmospheric noise (see random.org, or just plug an AM radio tuned somewhere other than a station into your sound card). Or you can directly ask the computer's user to bang on their keyboard like a deranged chimpanzee for a minute, whatever floats your boat.
As andras pointed out, I only thought to talk about some of the most common entropy gathering schemes. Thomas Pornin's answer and Johannes Rössel's answer both do good jobs of explaining how one can go about mangling gathered entropy in order to hand bits of it out again.
For cryptographic purposes, what is needed is that the stream shall be "computationally indistinguishable from uniformly random bits". "Computationally" means that it need not be truly random, only that it appears so to anybody without access to God's own computer.
In practice, this means that the system must first gather a sequence of n truly random bits. n shall be large enough to thwart exhaustive search, i.e. it shall be infeasible to try all 2^n combinations of n bits. This is achieved, with regards to today's technology, as long as n is greater than 90-or-so, but cryptographers just love powers of two, so it is customary to use n = 128.
These n random bits are obtained by gathering "physical events" which should be unpredictable, as far as physics is concerned. Usually, timing is used: the CPU has a cycle counter which is updated several billion times per second, and some events occur with an inevitable amount of jitter (incoming network packets, mouse movements, key strokes...). The system encodes these events and then "compresses" them by applying a cryptographically secure hash function such as SHA-256 (the output is then truncated to yield our n bits). What matters here is that the encoding of the physical events has enough entropy: roughly speaking, that the said events could have collectively assumed at least 2^n combinations. The hash function, by its definition, should do a good job of concentrating that entropy into an n-bit string.
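Here's a toy Python sketch of that "gather and compress" step; the timing loop is only a stand-in for real physical events, so don't treat it as an actual entropy source:

```python
import hashlib
import time

def gather_events(count=64):
    """Stand-in for unpredictable physical events: here, nanosecond timing jitter."""
    events = []
    for _ in range(count):
        events.append(time.perf_counter_ns().to_bytes(8, "big"))
    return b"".join(events)

# "Compress" whatever entropy the encoded events carry into n = 128 bits.
seed = hashlib.sha256(gather_events()).digest()[:16]
print(seed.hex())
```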
Once we have n bits, we use a PRNG (Pseudo-Random Number Generator) to crank out as many bits as necessary. A PRNG is said to be cryptographically secure if, assuming that it operates over a wide enough unknown n-bit key, its output is computationally indistinguishable from uniformly random bits. In the 90's, a popular choice was RC4, which is very simple to implement and quite fast. However, it turned out to have measurable biases, i.e. it was not as indistinguishable as was initially hoped. The eSTREAM Project consisted of gathering newer PRNG designs (actually stream ciphers, because most stream ciphers consist of a PRNG whose output is XORed with the data to encrypt), documenting them, and promoting analysis by cryptographers. The eSTREAM Portfolio contains seven PRNG designs which were deemed secure enough (i.e. they resisted analysis and cryptographers tend to have a good understanding of why they resisted). Among them, four are "optimized for software". The good news is that while these new PRNGs seem to be much more secure than RC4, they are also noticeably faster (we are talking about hundreds of megabytes per second here). Three of them are "free for any use" and source code is provided.
From a design point of view, PRNGs reuse many of the elements of block ciphers. The same concepts of avalanche and diffusion of bits into a wide internal state are used. Alternatively, a decent PRNG can be built from a block cipher: simply use the n-bit sequence as the key of a block cipher, and encrypt successive values of a counter (expressed as an m-bit sequence, if the block cipher uses m-bit blocks). This produces a pseudo-random stream of bits which is computationally indistinguishable from random, as long as the block cipher is secure and the produced stream is no longer than m*2^(m/2) bits (for m = 128, this means about 300 billion gigabytes, so that's big enough for most purposes). That kind of usage is known as counter mode (CTR).
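Here's roughly what that block-cipher-in-CTR-mode construction looks like as a Python sketch with the pyca/cryptography package (key handling and the fixed starting counter are simplifications):

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

class CtrPrng:
    """Pseudo-random byte stream: AES encryption of an incrementing counter."""

    def __init__(self, key: bytes):
        # CTR mode maintains the counter internally; we only supply its start value.
        self._cipher = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).encryptor()

    def generate(self, n: int) -> bytes:
        # Encrypting zero bytes returns the raw keystream, i.e. E_k(counter_i).
        return self._cipher.update(b"\x00" * n)

prng = CtrPrng(os.urandom(32))       # the n-bit seed becomes the AES key
print(prng.generate(32).hex())
```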
Usually, a block cipher in CTR mode is not as fast as a dedicated stream cipher (the point of a stream cipher is that, by forfeiting the flexibility of a block cipher, better performance is expected). However, if you happen to have one of the more recent CPUs from Intel with the AES-NI instructions (which are basically an AES implementation in hardware, integrated in the CPU), then AES in CTR mode will yield unbeatable speed (several gigabytes per second).
First of all, the point of a cryptographically secure PRNG is not to generate entirely unpredictable sequences. As you noted, the absence of something that generates large volumes of (more or less) true randomness1 makes that impossible.
So you resort to something which is only hard to predict. “Hard” meaning here that it would take infeasibly long, by which time whatever it was needed for would be obsolete anyway. There are a number of mathematical algorithms that play a part in this; you can get a glimpse if you take some well-known CSPRNGs and look at how they work.
The most common variants to build such a PRNG are:
Using a stream cipher, which already outputs a (supposedly secure) pseudo-random bit stream.
Using a block cipher in counter mode
Hash functions on a counter are also sometimes used. Wikipedia has more on this.
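As a bare-bones sketch of that last variant in Python, hashing a secret seed together with a counter (real designs add reseeding and backtracking resistance, omitted here):

```python
import hashlib
import os

class HashCounterPrng:
    """Output block i is SHA-256(seed || i); secure only while the seed stays secret."""

    def __init__(self, seed: bytes):
        self._seed = seed
        self._counter = 0

    def generate(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            block = self._counter.to_bytes(8, "big")
            out += hashlib.sha256(self._seed + block).digest()
            self._counter += 1
        return out[:n]

prng = HashCounterPrng(os.urandom(32))
print(prng.generate(48).hex())
```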
General requirements are just that it's infeasible to determine the original initialization vector from a generator's bit stream and that the next bit cannot be easily predicted.
As for initialization, most CSPRNGs use various sources available on the system, ranging from truly random things like line noise, interrupts or other events in the system to other things like certain memory locations, &c. The initialization vector is preferably really random and not dependent on a mathematical algorithm. This initialization was broken for some time in Debian's implementation of OpenSSL, which led to severe security problems.
1 Which has its problems too, and one has to be careful in eliminating bias, as things such as thermal noise have different characteristics depending on the temperature. You almost always have bias and need to eliminate it, and that's not a trivial task in itself.
In order for a random number generator to be considered cryptographically secure, it needs to be secure against attack by an adversary who knows the algorithm and a (large) number of previously generated bits. What this means is that someone with that information can't reconstruct any of the hidden internal state of the generator or predict what the next bits produced will be with better than 50% accuracy.
Normal pseudo-random number generators are generally not cryptographically secure, as reconstructing the internal state from previously output bits is generally trivial (often, the entire internal state is just the last N bits produced directly). Any random number generator without good statistical properties is also not cryptographically secure, as its output is at least partly predictable even without knowing the internal state.
So, as to how they work, any good crypto system can be used as a cryptographically secure random number generator -- use the crypto system to encrypt the output of a 'normal' random number generator. Since an adversary can't reconstruct the plaintext output of the normal random number generator, he can't attack it directly. This is a somewhat circular definition and begs the question of how you key the crypto system to keep it secure, which is a whole other problem.
Each generator will use its own seeding strategy, but here's a bit from the Windows API documentation on CryptGenRandom:
With Microsoft CSPs, CryptGenRandom uses the same random number generator used by other security components. This allows numerous processes to contribute to a system-wide seed. CryptoAPI stores an intermediate random seed with every user. To form the seed for the random number generator, a calling application supplies bits it might have—for instance, mouse or keyboard timing input—that are then combined with both the stored seed and various system data and user data such as the process ID and thread ID, the system clock, the system time, the system counter, memory status, free disk clusters, the hashed user environment block. This result is used to seed the pseudorandom number generator (PRNG).
In Windows Vista with Service Pack 1 (SP1) and later, an implementation of the AES counter-mode based PRNG specified in NIST Special Publication 800-90 is used. In Windows Vista, Windows Storage Server 2003, and Windows XP, the PRNG specified in Federal Information Processing Standard (FIPS) 186-2 is used. If an application has access to a good random source, it can fill the pbBuffer buffer with some random data before calling CryptGenRandom. The CSP then uses this data to further randomize its internal seed. It is acceptable to omit the step of initializing the pbBuffer buffer before calling CryptGenRandom.