How to safely de-duplicate files encrypted at the client's side?

Bitcasa claims to provide infinite storage for a fixed fee.
According to a TechCrunch interview, Bitcasa uses client-side convergent encryption, so no unencrypted data ever reaches the server. With convergent encryption, the encryption key is derived from the source data that is to be encrypted.
Basically, Bitcasa uses a hash function to identify identical files uploaded by different users, so that each file is stored only once on its servers.
I wonder how the provider can ensure that no two different files get mapped to the same encrypted file or the same encrypted data stream, since hash functions aren't injective.
Technical question: what do I have to implement so that such a collision can never happen?

Most deduplication schemes make the assumption that hash collisions are so unlikely to happen that they can be ignored. This allows clients to skip reuploading already-present data. It does break down when you have two files with the same hash, but that's unlikely to happen by chance (and you did pick a secure hash function to prevent people from doing it intentionally, right?)
If you insist on being absolutely sure, all clients must reupload their data (even if it's already on the server), and once this data is reuploaded, you must check that it's identical to the currently-present data. If it's not, you need to pick a new ID rather than using the hash (and sound the alarm that a collision has been found in SHA1!)
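The "be absolutely sure" variant can be sketched roughly as follows. This is only a toy in-memory model, not how Bitcasa or any real provider does it: the class name DedupStore, the pluggable hash_fn parameter and the "collision-" ID prefix are all made up for illustration.

```python
import hashlib
import secrets


class DedupStore:
    """Toy in-memory store that never trusts the hash alone (hypothetical design)."""

    def __init__(self, hash_fn=None):
        # hash_fn is pluggable only so the collision path can be exercised in tests.
        self._hash_fn = hash_fn or (lambda data: hashlib.sha256(data).hexdigest())
        self._blobs = {}  # storage ID -> stored (already encrypted) bytes

    def put(self, blob: bytes) -> str:
        """Store an uploaded blob and return the ID under which it can be fetched."""
        blob_id = self._hash_fn(blob)
        existing = self._blobs.get(blob_id)
        if existing is None:
            # First time this hash is seen: store under the hash.
            self._blobs[blob_id] = blob
            return blob_id
        if existing == blob:
            # Same hash and same bytes: a genuine duplicate, store nothing new.
            return blob_id
        # Same hash but different bytes: a real collision, so use a fresh random ID.
        fresh_id = "collision-" + secrets.token_hex(16)
        self._blobs[fresh_id] = blob
        return fresh_id

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]
```

Note that the byte-for-byte comparison is what costs you the bandwidth saving: the client has to send the full blob every time, which is exactly the trade-off described above.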

Related

Is HMAC still needed if encrypted data is always saved and retrieved locally

My understanding of HMAC is that it can help to verify the integrity of encrypted data before the data is processed i.e. it can be used to determine whether or not the data being sent to a decryption routine has been modified in any way.
That being the case, is there any advantage in incorporating it into an encryption scheme if the data is never transmitted outside of the application generating it? My use case is quite simple - a user submits data (in plaintext) to the scripts I've written to store customer details. My scripts then encrypt this data and save it to the database, and my scripts then provide a way for the user to retrieve the data and decrypt it based on the record ID they supply. There is no way for my users to send encrypted data directly to the decryption routine and I don't need to provide an external API.
Therefore, is it reasonable to assume that there is a chain of trust in the application by default because the same application is responsible for writing and retrieving the data? If I add HMAC to this scheme, is it redundant in this context or is it best practice to always implement HMAC regardless of the context? I'm intending to use the Defuse library but I'd like to understand what the benefit of HMAC is to my project.
Thanks in advance for any advice or input :)

First, you should understand that there are attacks that allow an attacker to modify encrypted data without decrypting it. See Is there an attack that can modify ciphertext while still allowing it to be decrypted? on Security.SE and Malleability attacks against encryption without authentication on Crypto.SE. If an attacker gets write access to the encrypted data -- even without any decryption keys -- they could cause significant havoc.
You say that the encrypted data is "never transmitted outside of the application generating it", but in the next two sentences you say that you "save it to the database", which appears (to me) to be something of a contradiction. Trusting the processing of encrypted data in memory is one thing; trusting its serialization to disk is quite another, especially if it is done by another program (such as a database system) and/or on a separate physical machine (now or in the future, as the system evolves).
The significant question here is: would it ever be possible for an attacker to modify or replace the encrypted data with alternate encrypted data, without access to the application and keys? If the attacker is an insider and runs the program as a normal user, then it's not generally possible to defend your data: anything the program allows the attacker to do is on the table. However, HMAC is relevant when write access to the data is possible for a non-user (or for a user in excess of their normal permissions). If the database is compromised, an attacker could possibly modify data with impunity, even without access to the application itself. Using HMAC verification severely limits the attacker's ability to modify the data usefully, even if they get write access.
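To make the idea concrete, here is a minimal sketch of checking a MAC over a stored ciphertext before it is ever handed to the decryption routine. It uses only Python's hmac and hashlib modules; the function names seal and open_sealed, the tag-first record layout and the separate MAC key are my own assumptions for illustration, not the Defuse library's API or format.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 32  # HMAC-SHA256 produces a 32-byte tag


def seal(mac_key: bytes, ciphertext: bytes) -> bytes:
    """Prefix the ciphertext with an HMAC-SHA256 tag before writing it to the database."""
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return tag + ciphertext


def open_sealed(mac_key: bytes, record: bytes) -> bytes:
    """Verify the tag and return the ciphertext; raise ValueError if the record was altered."""
    if len(record) < TAG_LEN:
        raise ValueError("record too short to contain a tag")
    tag, ciphertext = record[:TAG_LEN], record[TAG_LEN:]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    # compare_digest runs in constant time, so the check does not leak how many bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("record failed integrity check")
    return ciphertext


# Usage: the MAC key lives with the application, not in the database.
mac_key = secrets.token_bytes(32)
stored = seal(mac_key, b"opaque ciphertext bytes")
recovered = open_sealed(mac_key, stored)
```

An attacker with write access to the database but without mac_key cannot produce a record that passes open_sealed, which is the property described above.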

My OCD usually dictates that implementing HMAC is always good practice, if for no other reason, to remove the warning from logs.
In your case I do not believe there is a defined upside to implementing HMAC other than ensuring the integrity of the plain text submission. Your script may encrypt the data but it would not be useful in the unlikely event that bad data is passed to it.

Is end-to-end encryption possible with Realm Mobile Platform?

On the client device, a synced Realm can be set up with an encryption key that's unique to the user and stored in the device keychain, so data is stored encrypted on the client.
(related question: Can "data at rest" in the Realm Mobile Platform be encrypted?)
Realm Object Server and the clients can communicate via TLS, so data is encrypted in transit.
But the Realm Object Server does not appear to store data using encryption, since an admin user is able to access all the database contents via Realm Browser (https://realm.io/docs/realm-object-server/#data-browser).
Is it possible to set up Realm Mobile Platform so that user data is encrypted end-to-end, such that no one but the user (not even server admins) has access to the decryption key?

Due to the way we handle conflict resolution, we currently are unable to provide end-to-end encryption, as you correctly deduced. Let's go a tiny bit into detail with regards to the conflict resolution.
In order to handle conflicts the way we do, we use something called operational transformation. This means that instead of sending the data over directly, the client tells the server the intent of the change, rather than the result. For example, when two users edit a text field, we would tell the server insert(data='new text', offset=0) because the first user prepended data at the beginning of the text field, and insert(data='some more stuff', offset=10) because the second user added data in the middle of the field. These two separate operations allow the server to uniquely resolve what happened, and have conflictless resolution of the two writes.
This also means that if we encrypt everything, the server would be unable to handle this conflict resolution.
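To illustrate why the server needs to see the operations in the clear, here is a deliberately tiny sketch of transforming two concurrent inserts. It is not Realm's actual algorithm; the Insert type, the site field used for tie-breaking and the transform rule are simplifications assumed for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    data: str    # text being inserted
    offset: int  # position in the text the author saw when editing
    site: int    # hypothetical client ID, used only to break ties


def apply(text: str, op: Insert) -> str:
    """Apply a single insert to the text."""
    return text[:op.offset] + op.data + text[op.offset:]


def transform(op: Insert, applied: Insert) -> Insert:
    """Rebase op so it can run after `applied`, a concurrent edit of the same base text."""
    lands_before = applied.offset < op.offset or (
        applied.offset == op.offset and applied.site < op.site
    )
    if lands_before:
        # The other insert pushed our target position to the right by its length.
        return Insert(op.data, op.offset + len(applied.data), op.site)
    return op


base = "0123456789abcdef"
first = Insert(data="new text ", offset=0, site=1)
second = Insert(data="some more stuff", offset=10, site=2)

# Whichever operation arrives first, both orders converge on the same result.
one_order = apply(apply(base, first), transform(second, first))
other_order = apply(apply(base, second), transform(first, second))
```

The transform step reads the offset and the length of the inserted data of each operation. If those were encrypted with a key the server does not hold, it could not do this rebasing.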
This being said, that's for the current version. We do have a number of thoughts on how we could handle this in the future while providing some degree of encryption. Mainly this would mean more work on the client, and maybe finding a new algorithm that would allow us to tell the client the intent and let the client figure out how to merge everything. This is a quadratic problem, though, so we're reluctant to put too much work on the client side, as it could really drain the battery.
That might be acceptable for some users, which is why we're looking into it. Basically, there will be a trade-off. As the old adage goes: fast, secure, convenient: pick two. We just have to figure out how to handle this properly.

I just opened a feature request around possibly using Tresorit's ZeroKit to solve the end-to-end encryption question posed. Sounds like the conflict resolution implementation will still cause an issue though, but maybe there is a different conflict resolution level that can be applied for those that don't need the realtime dynamic editing of individual data fields (like patient health data, where only a single clinician ever really edits a record at any given time).
https://github.com/realm/realm-mobile-platform/issues/96

Where to Store Encryption Keys MVC Application

I am using an AES encryption/decryption class that needs a key value and a vector value to encrypt and decrypt data in an MVC3 application.
On saving a record I encrypt the data and then store it in a database. When I retrieve the record I decrypt it in the controller and pass the unencrypted value to the view.
The concern is not protecting data as it traverses the network, but protecting the database should it be compromised.
I have read many posts that say don't put the keys for encryption in your code.
Ok so where should they be kept? File system? Another Database?
Looking for some direction.

Common sense says, if an intruder gets access to your database, they will most likely also have access to your file system. It really comes down to you. For one, you can try to hide it. In configuration files, in plain files somewhere in file system, encrypt it with another key that is within the application ... and so on and so forth.
Configuration files are a logical answer, but why take a chance - mix it. Feel free to mix keys with multi-level encryptions - one requiring something from the record itself and being unique to every record, other one requiring a configuration value, third one requiring an application-specific value, and perhaps a fourth one from a library hidden well within your application's references? This way, even if one layer somehow gets compromised, you will have several others protecting it.
Yes, it adds overhead. Yes, it is relatively expensive. But is it worth it if you have sensitive data like user credit card details? You bet it is.
I'm using similar encryption and hashing techniques in one of my personal pet projects that is highly security focused and carefully controlled. It depends on how much data you need to display at any one time; for example, mine will only ever fetch 10 records at a time, most likely even fewer.
... To specify what I mean by mixing: encrypt once, then encrypt that data again with a different key and preferably a different algorithm.

I would use Registry Keys protected by ACL, so only the account under which your app pool is running can read them.

Is it insecure to pass initialization vector and salt along with ciphertext?

I'm new to implementing encryption and am still learning basics, it seems.
I have need for symmetric encryption capabilities in my open source codebase. There are three components to this system:
A server that stores some user data, and information about whether or not it is encrypted, and how
A C# client that lets a user encrypt their data with a simple password when sending to the server, and decrypt with the same password when receiving
A JavaScript client that does the same and therefore must be compatible with the C# client's encryption method
Looking at various JavaScript libraries, I came across SJCL, which has a lovely demo page here: http://bitwiseshiftleft.github.com/sjcl/demo/
From this, it seems that what a client needs to know (besides the password used) in order to decrypt the ciphertext is:
The initialization vector
Any salt used on the password
The key size
Authentication strength (I'm not totally sure what this is)
Is it relatively safe to keep all of this data with the ciphertext? Keep in mind that this is an open source codebase, and there is no way I can reasonably hide these variables unless I ask the user to remember them (yeah, right).
Any advice appreciated.

Initialization vectors and salts are called such, and not keys, precisely because they need not be kept secret. It is safe, and customary, to encode such data along with the encrypted/hashed element.
What an IV or salt needs is to be used only once with a given key or password. For some algorithms (e.g. CBC encryption) there may be some additional requirements, fulfilled by choosing the IV randomly, with uniform probability and a cryptographically strong random number generator. However, confidentiality is not a needed property for an IV or salt.
Symmetric encryption is rarely enough to provide security; by itself, encryption protects against passive attacks, where the attacker observes but does not interfere. To protect against active attacks, you also need some kind of authentication. SJCL uses CCM or OCB2 encryption modes which combine encryption and authentication, so that's fine. The "authentication strength" is the length (in bits) of a field dedicated to authentication within the encrypted text; a strength of "64 bits" means that an attacker trying to alter a message has a maximum probability of 2^-64 to succeed in doing so without being detected by the authentication mechanism (and he cannot know whether he has succeeded without trying, i.e. having the altered message sent to someone who knows the key/password). That's enough for most purposes. A larger authentication strength implies a larger ciphertext, by (roughly) the same amount.
I have not looked at the implementation, but from the documentation it seems that the SJCL authors know their trade, and did things properly. I recommend using it.
Remember the usual caveats of passwords and Javascript:
Javascript is code which runs on the client side but is downloaded from the server. This requires that the download be integrity-protected in some way; otherwise, an attacker could inject some of his own code, for instance a simple patch which also logs a copy of the password entered by the user somewhere. In practice, this means that the SJCL code should be served across an SSL/TLS session (i.e. HTTPS).
Users are human beings and human beings are bad at choosing passwords. It is a limitation of the human brain. Moreover, computers keep getting more and more powerful while human brains keep getting more or less unchanged. This makes passwords increasingly weak towards dictionary attacks, i.e. exhaustive searches on passwords (the attacker tries to guess the user's password by trying "probable" passwords). A ciphertext produced by SJCL can be used in an offline dictionary attack: the attacker can "try" passwords on his own computers, without having to check them against your server, and he is limited only by his own computing abilities. SJCL includes some features to make offline dictionary attacks more difficult:
SJCL uses a salt, which prevents cost sharing (usually known as "precomputed tables", in particular "rainbow tables" which are a special kind of precomputed tables). At least the attacker will have to pay the full price of dictionary search for each attacked password.
SJCL uses the salt repeatedly, by hashing it with the password over and over in order to produce the key. This is what SJCL calls the "password strengthening factor". This makes the password-to-key transformation more expensive for the client, but also for the attacker, which is the point. Making the key transformation 1000 times longer means that the user will have to wait, maybe, half a second; but it also multiplies by 1000 the cost for the attacker.
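The salt and the iteration count can be sketched with Python's standard library as below. This assumes PBKDF2 with HMAC-SHA256 as the strengthening function; the helper names derive_key and new_header and the header field names are made up for the example and are not SJCL's API. The count of 1000 mirrors the figure used above, and far higher counts are commonly recommended today.

```python
import hashlib
import secrets


def derive_key(password: str, salt: bytes, iterations: int = 1000, key_len: int = 32) -> bytes:
    """Stretch a password into a key; each extra iteration costs the attacker as much as the user."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations, dklen=key_len)


def new_header(iterations: int = 1000) -> dict:
    """Public parameters that can be stored in the clear next to the ciphertext."""
    return {
        "salt": secrets.token_bytes(16).hex(),  # random per password, not secret
        "iter": iterations,                     # the strengthening factor
        "key_bits": 256,
    }


# Encrypting side: pick fresh parameters, derive the key, store the header with the ciphertext.
header = new_header()
key = derive_key("correct horse battery staple", bytes.fromhex(header["salt"]), header["iter"])

# Decrypting side: read the header back and re-derive the same key from the password.
same_key = derive_key("correct horse battery staple", bytes.fromhex(header["salt"]), header["iter"])
```

Nothing in the header helps an attacker skip the work: without the password, each guess still costs the full number of iterations, and the random salt means the work cannot be shared across users.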

AES Encryption and key storage?

A few years ago, when first being introduced to ASP.net and the .NET Framework, I built a very simple online file storage system.
This system used Rijndael encryption for storing the files encrypted on the server's hard drive, and an HttpHandler to decrypt and send those files to the client.
As this was one of my first projects with ASP.net and databases, and I did not understand much about how the whole thing works (as well as falling into the same trap described by Jeff Atwood on this subject), I decided to store freshly generated keys and IVs together with each file entry in the database.
To make things a bit clearer, encryption was only to protect files from direct access to the server, and keys were not generated by user-entered passwords.
My question is, assuming I don't want to keep one key for all files, how should I store encryption keys for best security? What is considered best practice? (i.e: On a different server, on a plain-text file, encrypted).
Also, what is the initialization vector used for in this type of encryption algorithm? Should it be constant in a system?

Keys should be protected and kept secret, simple as that. The implementation is not. Key Management Systems get sold for large amounts of money by trusted vendors because solving the problem is hard.
You certainly don't want to use the same key for each user: the more a key is used, the "easier" it becomes to break it, or at least to leak some information. AES is a block cipher; in a chaining mode such as CBC, the data is split into blocks and the result of each block's encryption is fed into the next. An initialization vector is the initial feed into the algorithm, because at the starting point there is nothing to start with. Using random IVs with the same key lowers the risk of information leaks; the IV should be different for every single piece of data encrypted.
How you store the keys depends on how your system is architected. I've just finished a KMS where the keys are kept away from the main system and functions to encrypt and decrypt are exposed via WCF. You send in plain text and get a reference to a key and the ciphered text back - that way the KMS is responsible for all cryptography in the system. This may be overkill in your case. If the user enters a password into your system then you could use that to generate a key pair. This keypair could then be used to encrypt a key store for that user - XML, SQL, whatever, and used to decrypt each key which is used to protect data.
Without knowing more about how your system is configured, or its purpose, it's hard to recommend anything other than "Keys must be protected, keys and IVs must not be reused."

There's a very good article on this one at http://web.archive.org/web/20121017062956/http://www.di-mgt.com.au/cryptoCreditcard.html which covers both the IV and salting issues and the problems with ECB referred to above.
It still doesn't quite cover "where do I store the key", admittedly, but after reading and digesting it, it won't be a huge leap to a solution hopefully....

As a pretty good solution, you could store your Key/IV pair in a table:
ID                    Key            IV
skjsh-38798-1298-hjj  FHDJK398720==  HFkjdf87923==
When you save an encrypted value, save the ID and a random Salt value along with it.
Then, when you need to decrypt the value, look up the key/IV pair using the ID and the salt stored with the data.
You'd want to make sure you have a good security model around the key storage. If you went with SQL server, don't grant SELECT rights to the user that accesses the database from the application. You wouldn't want to give someone access to the whole table.

What if you simply generated a key for each user, then encrypted it with a "master key"? Then make sure to use random IVs; as long as you keep the master key secret, no one should be able to make much use of any number of stolen keys. Of course, the encryption and decryption functions would have to be server-side, and the master key must not be exposed at all, not even to the rest of the server. This would be a decent way to go about it, but there are some issues: if you have stored your master key unsafely, there goes your security. You could encrypt the master key, but then you're just kicking the can down the road. Maybe you could have an AES key encrypted with an RSA key, with the RSA private key in turn secured by a secret passphrase. This would mitigate the problem: with a decent-sized RSA key you should be good, and you could then expose the encryption functions to the client (though you still probably shouldn't), since key encryption uses the public key, which can safely be disclosed. For added security, you could rotate the RSA key every few months, or even weeks if need be. These are just a few ideas, and I know this isn't bulletproof, but it is more secure than just stuffing the key in a SQL database.
