SHA3 algorithm encryption and backtracking it

I have three questions:
Q1: Suppose I have transaction IDs in sequential order: 1, 2, 3, 4, and so on. If I am running the SHA3 algorithm on each of these transaction IDs, is there a way to figure out, before a transaction ID gets encrypted, whether the numbers will be picked up in sequence (1, 2, 3, 4, ...) and encrypted?
Q2: Can I backtrack and see which transaction ID got encrypted?
Q3: What type of characters are present in a hashcode?
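For concreteness, here is a minimal Python sketch of the setup described above, hashing sequential transaction IDs with SHA3-256. (Strictly speaking, SHA-3 is a hash function rather than encryption; the IDs here are made up for illustration.)

import hashlib

for tx_id in [1, 2, 3, 4]:
    # hexdigest() returns 64 hexadecimal characters: digits 0-9 and letters a-f
    digest = hashlib.sha3_256(str(tx_id).encode("utf-8")).hexdigest()
    print(tx_id, digest)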

Related

May two DynamoDB scan segments contain the same hash key?

I'm scanning a huge table (> 1B docs) so I'm using a parallel scan (using one segment per worker).
The table has a hash key and a sort key.
Intuitively a segment should contain a set of hash keys (including all their sort keys), so one hash key shouldn't appear in more than one segment, but I haven't found any documentation indicating this.
Does anyone know how DynamoDB behaves in this scenario?
Thanks
This is an interesting question. I thought it would be easy to find a document stating that each segment contains a disjoint range of hash keys, and the same hash key cannot appear in more than one segment - but I too failed to find any such document. I am curious if anyone else can find such a document. In the meantime, I can try to offer additional intuitions on why your conjecture is likely correct - but also might be wrong:
My first intuition would be that you are right:
DynamoDB uses the hash key, also known as a partition key, to decide on which of the many storage nodes to store a copy of this data. All of the items sharing the same partition key (with different sort key values) are stored together, in sort-key order, so they can be queried together in order. DynamoDB uses a hash function on the partition key to decide the placement of each item (hence the name "hash key").
Now, if DynamoDB needs to divide the task of scanning all the data into "segments", the most sensible thing for it to do is to divide the space of hash values (i.e., the hash function applied to the hash keys) into equal-sized pieces. This division is easy to do (just a numeric division by TotalSegments), it ensures roughly the same number of items in each segment (assuming there are many different partitions), and it ensures that the scanning of each segment involves a different storage node, so the parallel scan can proceed faster than what a single storage node is capable of.
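A minimal sketch of that numeric division, assuming (purely for illustration) a 128-bit hash space; the actual width of DynamoDB's internal hash is not documented:

TOTAL_SEGMENTS = 4
HASH_SPACE = 2 ** 128  # assumed width, not documented by DynamoDB

def segment_range(segment):
    # Each segment covers an equal-sized, non-overlapping slice of the hash space.
    size = HASH_SPACE // TOTAL_SEGMENTS
    start = segment * size
    end = HASH_SPACE if segment == TOTAL_SEGMENTS - 1 else start + size
    return start, end

for s in range(TOTAL_SEGMENTS):
    print(s, segment_range(s))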
However, there is one indication that this might not be the entire story.
The DynamoDB documentation claims that
In general, there is no practical limit on the number of distinct sort key values per partition key value.
This means that in theory at least, your entire database, perhaps one petabyte of it, may be in a single partition with billions of different sort keys. Since Amazon's individual storage nodes do have a size limit, it means DynamoDB must (unless the above statement is false) support splitting a single huge partition across multiple storage nodes. This means that when GetItem is looking for a particular item, DynamoDB needs to know which sort key is on which storage node. It also means that a parallel scan might - possibly - divide this huge partition into pieces, all scanning the same partition but different sort-key ranges in it. I am not sure we can completely rule out this possibility. I am guessing it will never happen when you only have smallish partitions.
Every DynamoDB table has a "hashspace", and data is partitioned as per the hash value of the partition key. When a parallel Scan is intended and the TotalSegments and Segment values are provided, the table's complete hashspace is logically divided into these "Segments", such that the TotalSegments segments together cover the complete hashspace without overlapping. It is quite possible that some segments do not actually have any data corresponding to them, since there may not be any data in the hashspace allocated to the segment. This can be observed when the TotalSegments value chosen is very high, for instance.
For each Segment value passed in the Scan request (with the TotalSegments value held constant), the Scan would return distinct items without any overlap.
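As a hedged illustration, here is roughly what such a parallel Scan looks like with boto3 (the table name and segment count are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "my-table"   # placeholder
TOTAL_SEGMENTS = 4        # one segment per worker

def scan_segment(segment):
    # Give each worker its own resource; boto3 sessions are not thread-safe.
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    items, kwargs = [], {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = list(pool.map(scan_segment, range(TOTAL_SEGMENTS)))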
FAQs
Q. What is the ideal number for TotalSegments?
-> You might need to experiment with values to find the sweet spot for your table and the number of workers you use, until your application achieves its best performance.
Q. One or more segments do not return any records. Why?
-> This is possible if the hash range that is allocated as per the TotalSegments value does not have any items. In this case, the TotalSegments value can be decreased, for better performance.
Q. A Scan for a segment failed midway. Can a Scan for that segment alone be retried now?
-> As long as the TotalSegments value remains the same, a Scan for one of the segments can be re-run, since it would have the same hash range allocated at any given time.
Q. Can I perform a Scan for a single segment, without performing the Scan for other segments as per TotalSegments value?
-> Yes. Multiple Scan operations for different Segments are not linked and do not depend on previous or other Segment Scans (see the sketch below).
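For instance, retrying only a failed segment is just a Scan call with the same Segment/TotalSegments pair (a minimal sketch; the table name and segment numbers are placeholders):

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # placeholder name
# Retry only the failed segment; TotalSegments must match the original scan.
page = table.scan(TotalSegments=4, Segment=2)
items = page["Items"]  # continue paginating with LastEvaluatedKey as usual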

DynamoDB - Get N random items? (Schema Question)

Quick background - I want to store every possible 5-character base64 product: AAAAA, Afjsf, 00ZZ0, etc.
I want to be able to grab 1000 of them randomly, then delete them from the DB so they're not used again.
It's trivial to generate and shuffle these. If I store them in an RDBMS, I could use an auto-incrementing int ID, grab the first 1000, then delete those records. Assuming I inserted them in randomized order, that totally works.
I'd like to see if it's feasible to accomplish this with DynamoDB, or if this problem is best left to an RDBMS.
I could use an int ID as the key, the 5-char string as the value, and do something similar.
Unless I'm misunderstanding, I can't just walk the keys and grab 1000 records, can I? I need to provide a key. That sounds fine, except now I have to maintain DB state at the app layer, or introduce another table just to keep track of the IDs I've iterated and deleted.
You can do the following:
(1) Each item will have a fixed partition key (that is, the same partition key value for all items; the exact value does not matter, as long as it is the same for all items, so let's assume it is simply the string "foo").
(2) The sort key will be something random, for instance a randomly generated 32-bit integer.
(3) The 5-character base64 string will be stored in a third attribute (which is neither the partition key nor the sort key).
When you want to grab 1000 random items, issue a DynamoDB Query on partition key = "foo". Items returned from a Query are sorted by the sort key. Since you chose a random sort key (see (2) above), you will get 1000 random items.
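A minimal boto3 sketch of that Query (the table name and the attribute names pk/rand are placeholders matching steps (1)-(3) above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("codes")  # placeholder name

# Partition key "pk" is the fixed value "foo"; the sort key "rand" is random,
# so the first 1000 items in sort order are effectively 1000 random items.
resp = table.query(
    KeyConditionExpression=Key("pk").eq("foo"),
    Limit=1000,
)
random_items = resp["Items"]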
Sort key considerations
The set of all 5-character base64 strings is a space of size 2^30. Thus your sort key needs to be large enough to distinguish 2^30 items, so pragmatically, picking a random 32-bit int will be enough. However, if you need to ensure that the selection of 1000 items is really random, you may want to pick something whose randomness is better than your runtime's random function. For instance, you can compute SHA-384 of the base64 value that you store and use it as the sort key value. The max length of a sort key is 1024 bytes, so 384 bits is well within the limits.
In particular, do not use a UUID as your sort key. UUIDs are typically not that random.
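For example, a hedged hashlib sketch of deriving such a sort key (the attribute layout is as above):

import hashlib

value = "AAAAA"  # one of the 5-character base64 strings
# Hex SHA-384 of the stored value: 96 characters, well under the 1024-byte sort key limit.
sort_key = hashlib.sha384(value.encode("ascii")).hexdigest()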

Amazon DynamoDB hash algorithm

I have used DynamoDB for a while.
I was told that the hash keys I insert are not distributed uniformly, and that there is a hot spot in one partition.
May I have the hash algorithm, so I can judge my hash keys?
DynamoDB does not expose their internal hashing algorithm but that should not affect your hash key distribution. A good hashing algorithm will randomly distribute your hash key values (i.e. "key1" and "key2" will hash to 2 strings that are not correlated to each other in any way).
If you are suffering from hot key issues in your DynamoDB table, it likely means you are accessing one hash key (or a small range of hash keys) more frequently than others, or that your hash key values are not distributed enough (i.e. not enough unique values).
Where did you get the information regarding the hot spot in your partition? It may be helpful to go back to that source and dig more into the details of the unevenly distributed hash key values.
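If you want a rough way to eyeball your own key distribution, you can bucket a sample of keys with a generic hash (a sketch only; DynamoDB's real hash function is not public, so this merely approximates its behavior):

import hashlib
from collections import Counter

sample_keys = ["key1", "key2", "key3"]  # replace with a sample of your hash key values
BUCKETS = 16

counts = Counter(
    int.from_bytes(hashlib.sha256(k.encode("utf-8")).digest()[:4], "big") % BUCKETS
    for k in sample_keys
)
print(counts)  # a heavily skewed distribution suggests too few distinct key values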

GCM - Max length for Registration ID

Update: GCM is deprecated, use FCM
What is the maximum length of a Registration ID issued by GCM servers? The GCM documentation does not provide this info. Googling reveals that the Registration ID is not fixed-length and can be up to 4K (4096 bytes) long, but these are not official answers from Google. I am currently receiving Registration IDs which are 162 characters long. Can anybody help?
On the android-gcm forum, a Google developer confirms it's 4K.
I am interested in knowing about this too. My reg id size is 183 chars. I suspect it won't be longer than 512 chars, let alone 4K. Imagine sending a bulk notification: a 4K reg id x 1000 = 4MB of message size!
In the end, I just use the 'text' type in my MySQL table to store the registration id. So even if Google sends me a 1K, 2K, or 4K (very unlikely) reg id, I will be able to handle it.
Update: I have come across a new reg id size: 205.
This is what the GCM doc says:
A JSON object whose fields represent the key-value pairs of the message's payload data. If present, the payload data will be included in the Intent as application data, with the key being the extra's name. For instance, "data":{"score":"3x1"} would result in an intent extra named score whose value is the string 3x1.
There is no limit on the number of key/value pairs, though there is a limit on the total size of the message (4KB). The values could be any JSON object, but we recommend using strings, since the values will be converted to strings in the GCM server anyway.
If you want to include objects or other non-string data types (such as integers or booleans), you have to do the conversion to string yourself. Also note that the key cannot be a reserved word (from or any word starting with google.).
To complicate things slightly, there are some reserved words (such as collapse_key) that are technically allowed in payload data. However, if the request also contains that word, the value in the request will overwrite the value in the payload data. Hence using words that are defined as field names in this table is not recommended, even in cases where they are technically allowed. Optional.
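To illustrate the shape of such a message, here is a hedged Python sketch (the registration ID is a placeholder), converting non-string values to strings by hand as the doc recommends:

import json

message = {
    "registration_ids": ["REGISTRATION_ID_PLACEHOLDER"],
    "data": {
        "score": "3x1",
        "level": str(7),       # non-string values converted to strings by hand
        "unlocked": str(True),
    },
}
body = json.dumps(message)     # the whole message must stay under the 4KB limit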

Should I generate a key from a hash for encryption?

I am currently using Rijndael 256-bit in CBC mode to encrypt some data which needs to be sent elsewhere. In order to enhance security, I'm taking a randomly generated SHA-256 hash and using a formula to chop off different parts of it to use as the encryption key and initialization vector (of course, the hash is sent with the data). The formula to generate the key and IV is fairly basic, and because the code is written in PHP, it's coded into a user-accessible page. What I'm wondering is: is this more or less safe than having one constant key and/or IV?
This is probably NOT the way you wish to go. In essence, it will not take a good hacker long to figure out your mathematical formula for manipulating the hash to generate your key and IV. Thus you are essentially sending the keys to the kingdom along with the kingdom itself.
Generally the way this type of operation is done is to generate a session key (could be the same way you are doing it now), then use a public key encryption method to encrypt that session key and send it to the location your data is to be sent. The receiver holds the private key and can decrypt the comm. channel session key.
Now both sides have the comm. channel session key, and your REAL data can be encrypted using this key; the session key itself has never been sent in the clear.
Rijndael is an example of a symmetric crypto algorithm, whereas public key crypto algorithms are asymmetric. Examples of public key crypto algorithms are RSA, ECDSA, etc.
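As a hedged sketch of that flow in Python (using the third-party cryptography package; the key size and padding choices are illustrative assumptions, not a vetted design):

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Receiver generates a key pair and publishes the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Sender: random 256-bit session key, encrypted under the receiver's public key.
session_key = os.urandom(32)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped = public_key.encrypt(session_key, oaep)

# Receiver: only the private key holder can recover the session key.
assert private_key.decrypt(wrapped, oaep) == session_key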
On generating short-use keys: have a long-term key and agree on a date format with the receiver. Each day, concatenate your long-term key with the day's date and hash it with SHA-256 to generate a day key for use on that date only:
dayKey <- SHA256("my very secret long term key" + "2012-06-27")
The receiver will have all the information they need to generate exactly the same key at their end. Any attacker will know the date, but will not know the long term key.
You will need to agree on protocols for behaviour around midnight and a few other details.
Change the long-term key every month or two, depending on the amount of encrypted data you are passing. The more data you pass, the more often you need to change the long-term key.
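A minimal Python sketch of that derivation (the long-term key here is the placeholder from the formula above):

import datetime
import hashlib

LONG_TERM_KEY = "my very secret long term key"  # placeholder; rotate it periodically

day = datetime.date.today().isoformat()         # e.g. "2012-06-27"
day_key = hashlib.sha256((LONG_TERM_KEY + day).encode("utf-8")).digest()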
