How to extract the keyword from a mono-alphabetic substitution cipher - encryption

I am a beginner to cryptography and i have the following the question. I know that to extract a plaintext from a mono-alphabetic substitution cipher i can perform shifting the characters by n times until i have the plaintext, or i can also use the frequency letter analysis. However by doing this im getting only the plaintext and not they keyword thats been used.
The answer might include creating a table with the alphabet linked to the corresponding letters used in ciphertext by following the frequency letter analysis but im not sure how this works. Can someone explain to me how i will be able to get the keyword?

Related

Google OCR: Special Characters Affecting OCR Number Recognition

I've been playing around with Google's OCR recently using the default tutorial and was trying to parse numbers. I've seen previous issues dealing with numbers on license plates, but was wondering if there was a solution when special characters affect the results of OCR. Most notably, including the '#' character with a number, such as #1, #2, etc as shown below results in the output ##Z#T# and even occasionally gives me Chinese characters even after I set the language to/from settings to English.
Numbers with pound sign
For a similar comparison, the image below is easily read by the OCR:
Numbers without pound sign
Is there a setting that I'm missing that can improve the results or is this just a constraint by the model?

How to replace english abbreviated form to their dictionary form

I'm working on a system to analyze texts in english: I use stanford-core nlp to make sentences from whole documents and to make tokens from sentences. I also use the maxent tagger to get tokens pos tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. to its standard form(are, is, having, saying). I've been searching for some english dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to realize: is there some similar work or whole project that I could use?
Ideas:
I) use string edit distance on a subset of your text and try to match words that do not exist in the dictionary using edit distance against existing words in the dictionary.
II) The key feature of lots of those examples you have is that they are only 1 character different from the correct spelling. So, I suggest for those words that you fail to match with a dictionary entry, try and add all english characters to the front or back and lookup the resulting word in a dictionary. This is very expensive in the beginning but if you keep track of those misspellings in a lookup table (re -> are) at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their actual correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper and clean english text (i.e. newspaper articles), then run it over the entire corpus that you have and see for those words that your language model considers as unknown words (which means it hasn't seen them in training phase), what is the highest probable word according to the language model. Most probably the language model top-10 prediction will be the correct spelled word.

Finding spaces in many time pad

I'm currently completing an online course in Cryptography, and have been give an exercise to complete. The course been running for a while and I know the answer is on the web but would like to complete it myself thought actions and research.
I have a list of 13 cipertext based on one/many time Pad - the same cipher key has been used to encrypt plain text. My task is to decrypt the last ciphertext.
The steps I have taken so far are based on cribing techniques at the following location:
http://adamsblog.aperturelabs.com/2013/05/back-to-skule-one-pad-two-pad-me-pad.html
https://crypto.stackexchange.com/questions/6020/many-time-pad-attack
and I'm using the following tool to XOR the ciphertexts.
In the tutorial I'm following the author suggest that the first step is to identify spaces I have tried to follow the steps but still cannot Identify the spaces once I Xor the cipher's
When I XOR the first cipertext i.e cipher 1 with cipher 2 and 3 I get the following:
15040900180952180C4549114F190E0159490A49120E00521201064C0A4F144F281B13490341004F480203161600000046071B1E4119061D1A411848090F4E0D0000161A0A41140C16160600151C00170B0B090653014E410D4C530F01150116000307520F104D200345020615041C541A49151744490F4C0D0015061C0A1F454F1F4509074D2F01544C080A090028061C1D002E413D004E0B141118
000D064819001E0303490A124C5615001647160C1515451A041D544D0B1D124C3F4F0252021707440D0B4C1100001E075400491E4F1F0A5211070A490B080B0A0700190D044E034F110A00001300490F054F0E08100357001E0853D4315FCEACFA7112C3E55D74AAF3394BB08F7504A8E5019C4E3E838E0F364946F31721A49AD2D24FF6775EFCB4F79FE4217A01B43CB5068BF3B52CA76543187274
000000003E010609164E0C07001F16520D4801490B09160645071950011D0341281B5253040F094C0D4F08010545050150050C1D544D061C5415044548090717074F0611454F164F1F101F411A4F430E0F0219071A0B411505034E461C1B0310454F12480D55040F18451E1B1F0A1C541646410D054C0D4C1B410F1B1B03149AD2D24FF6775EFCB4F79FE4217A01B43CB5068BF3B52CA76543187274
I'm getting confused at to where the space are based on the ASCII table which gives a 20(HEX) to the value a space.
Sorry if this is not enough information I can if more is required. Thanks Mark.
The question you're linking to has the correct answer.
You don't appear to use that method. You should have c1⊕c2 and c1⊕c3 but your question contains three blocks of strings, one with 6 lines, one with 5 lines and one with 4 lines. That's insufficient for us to even guess what your problem is. Go back to the linked answer, read it, and follow the steps listed there.

Which printable ASCII characters will usually appear in an english text?

I have been trying to solve Project Euler's problem #59 for a while, and I am having trouble because some of it seems somewhat more ambiguous than previous problems.
As background, the problem says that the given text file is encrypted text with the ASCII codes saved as numbers. The encryption method is to XOR 3 lowercase letters cyclically with the plaintext (so it is reversible). The problem asks for the key that decrypts the file to English text. How should I restrict the character set of my output to get the answer, without trying to sift through all possible plaintexts (26^3)?
I have tried restricting to letters, spaces, and punctuation, and that did not work.
To clarify: I want to determine, out of all printable ASCII characters, which ones I can probably discard and which ones I can expect to be in the plaintext string.
Have you tried two of the most "basic" and common tools in analyzing the algorithm used?
Analyze the frequency of the characters and try to match it against English letter frequency
Bruteforce using keys from a wordlist, most often common words are used as keys by "dumb" users
To analyze the frequency for this particular problem you would have to split the string every third element since the key is of length 3, you should now be able to produce three columns:
79 59 12
2 79 35
8 28 20
2 3 68
...
you have to analyse the frequency for each column, since now they are independent of the key.
Ok, actually took my time and constructed the 3 complete columns and counted the frequency for each of the columns and got the two most frequent item or each column:
Col1 Col2 Col3
71 79 68
2 1 1
Now if you check for instance: http://en.wikipedia.org/wiki/Letter_frequency
You have the most frequent letters, and don't forget you have spaces and other characters which is not present on that page, but I think you can assume that space is the most frequent character.
So now it is just a matter of xor:ing the most frequent characters in the table I provided with the most frequent characters in English language, and see if you get any lowercase characters, I found a three letter word which I think is the answer with only this data.
Good luck and by the way, it was a nice problem!
A possible solution is to simply assume the presence of a given three-character sequence in the encrypted text. You can use a three-letter word, or a three letter sequence which is likely to appear in English text (e.g. " a ": the letter 'a' enclosed between two spaces). Then simply try all possible positions of that sequence in the encrypted text. Each position allows you to simply recompute the key, then decrypt the whole text into a file.
Since the original text has length 1201, you get 1199 files to skim through. At that point it is only a matter of patience, but you can make it much faster by using a simple text search utility on another frequent sequence in English (e.g. "are"), for instance with the Unix tool grep.
I did just that, and got the decrypted text in less than five minutes.
I'll admit upfront I'm not familiar with an XOR cipher.
However, it seems very similar to the concept of the vigenere cipher. Escpecially in the line where they mention for unbreakable encryption the keylength equals the message length. That screams Vernam Cipher.
As mentioned in the other answer, the strategical approach to breaking a vigenere cipher involves a probabilistic approach. I will not go into detail because most of the theory I learned was relatively complicated, but it can be found here keeping in mind that vignere is a series of caesar ciphers.
The problem makes it easy for you though because you already know the keylength. Because of that, as you mentioned, you can simply bruteforce by trying every single 3 letter combination.
Here's what I would do: take a reasonably sized chunk of the ciphertext, say maybe 10-20 characters, and try the brute force approach on that. Keep track of all the keys that seem to create understandable sequences of letters and then use those on the whole ciphertext. That way we can employ the obvious brute forcing method, but without bruteforcing the entire problem, so I don't think you'll have to worry about limiting your output.
That said, I agree that as you're creating the output, if you ever get a non printable character, you could probably break your loop and move on to the next key. I wouldn't try anything more specific than that because who knows what the original message could have, never make assumptions about the data you're dealing with. Short circuiting logic like that is always a good idea, especially when implementing a brute force solution.
Split the ciphertext into 3.
Ciphertext1 comprises the 1st, 4th, 7th, 10th...numbers
Ciphertext2 comprises the 2nd, 5th, 8th, 11th...numbers
Ciphertext3 comprises the 3rd, 6th, 9th, 12th...numbers
Now you know that each cyphertext is encrypted with the same key letter. Now do a standard frequency analysis on it. That should give you enough clues as to what the letter is.
I just solved this problem a few days ago. Without spoiling it for you, I want to describe my approach to this problem. Some of what I say may be redundant to what you already knew, but was part of my approach.
First I assumed that the key is exactly as described, three lowercase ASCII letters. So I began brute forcing at 'aaa' and went to 'zzz'. While decrypting, if any resulting byte was a value lower than 32 (the ASCII value of space, the lowest "printable" ASCII value) or higher than 126 (the ASCII value of the tilde '~' which is the highest printable character in ASCII) than I assumed the key was invalid because any value outside 32 and 126 would be an invalid character for a plain text stretch of English. As soon as a single byte is outside of this range, I stopped decrypting and went to the next possible key.
Once I decrypted the entire message using a particular key (after passing the first test of all bytes being printable characters), I needed a way to verify it as a valid decryption. I was expecting the result to be a simple list of words with no particular order or meaning. Through other cryptography experience, I thought back to letter frequency, and most simply that your average English word in text is 5 characters long. The file contains 1201 input bytes. So that would mean that there would be (on average) 240 words. After decrypting, I counted how many spaces were in the resulting output string. Since Project Euler is anything but average, I compared the number of spaces to 200 accounting for longer, more obscure words. When an output had more than 200 spaces in it, I printed out the key it was decrypted with and the output text. The one and only output that has more than 200 spaces is the answer. Let me tell you that it's more than obvious that you have the answer when you see it.
Something to point out is that the answer to the question is NOT the key. It is the sum of all the ASCII values of the output string. This approach will also solve the equation under the one minute mark, in fact, it times in around 3 or 4 seconds.

Find HEX patterns and number of occurrences

I'd like to find patterns and sort them by number of occurrences on an HEX file I have.
I am not looking for some specific pattern, just to make some statistics of the occurrences happening there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (The link is for Perl, but it seems general enough to understand the algorithm). Your patterns are named N-grams. You will have to limit the maximal pattern, though.
This is a pretty classic CS problem. The code in general is non-trivial to implement as it will require at least one full parse of the sequence, and depending on your efficiency and memory/processor constraints might require several. See here.
You will need to partition your input string in some way to ensure that you get a good subsequence across it.
If there is a specific problem we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use Regular Expressions to make a pattern to search for.
The regex needed would be very simple. Just use the exact phrase you're searching for. Then there should be a regular expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.

Resources