Finding the size of a protocol buffer message in Julia - julia

I've looked at the code in ProtoBuf.jl and didn't see anything obvious. Given an instance of a protocol buffer object in Julia, is there a way of finding out how many bytes it needs before actually calling writeproto()?
As a little background, I want to know this because following the protobuf message, I'm sending a good-sized data array. The protobuf documentation discourages incorporating large arrays as part of a message, so my intention is to publish: a) the size of the protobuf structure, b) the protobuf structure itself, and c) the data array. The receiver of the message can then do the necessary bookkeeping to read the protobuf contents and the data array.
In the meantime, I've come up with an ugly workaround:
write(iob, UInt32(0)) # set aside space for header size
headerSize = UInt32(writeproto(iob, rdmMessage.header))
sizeBytes = reinterpret(UInt8, [headerSize])
iob.data[1:4] = sizeBytes[:]
# Back to the respectable world
write(iob, rdmMessage.body[:])
Not pretty, but if anyone else is in the same boat that I am, it might be useful to them.
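For readers who want the framing spelled out independently of ProtoBuf.jl, here is a minimal Python sketch of the same layout: a 4-byte size, the serialized header, then the raw array. The function names frame and unframe, the little-endian "<I" prefix, and the placeholder byte strings standing in for the serialized header and the data array are all assumptions made for illustration; none of them come from ProtoBuf.jl.

```python
import struct

def frame(header: bytes, body: bytes) -> bytes:
    """Prefix the serialized header with its length so a reader can split the stream."""
    return struct.pack("<I", len(header)) + header + body

def unframe(buf: bytes):
    """Inverse of frame(): return (header, body)."""
    (header_size,) = struct.unpack_from("<I", buf, 0)
    header = buf[4:4 + header_size]
    body = buf[4 + header_size:]
    return header, body

# Placeholder bytes standing in for a serialized protobuf header and a data array.
header_bytes = b"\x08\x96\x01"
body_bytes = bytes(range(16))

message = frame(header_bytes, body_bytes)
assert unframe(message) == (header_bytes, body_bytes)
```

Serializing the header into its own buffer first gives its size before anything is written to the output stream, at the cost of one extra copy of the (small) header.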

Related

Finding AES Key in binary using Ghidra and FindCrypt

I would like to learn more about RE.
I wrote a simple program on a STM32F107 which does nothing else than encrypting and decrypting a text once using AES128-ECB.
Here is the C code (I intentionally left out the key so far):
struct AES_ctx TestAes;
uint8_t key[16] =
{ MY_KEY_IS_HERE };
uint8_t InputText[16] =
{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0 };
AES_init_ctx(&TestAes, key);
AES_ECB_encrypt(&TestAes, InputText);
AES_ECB_decrypt(&TestAes, InputText);
Now I want to find the 16-byte secret key in my binary.
When I open the binary in a hex editor and search for my key I find all 16 bytes in a row.
I loaded the binary in Ghidra, with FindCrypt installed beforehand, and ran the analysis.
FindCrypt finds AES_Decrytion_SBox_Inverse and AES_Ecryption_SBox.
But neither of these is my AES key; they are the S-boxes. How do I proceed from there? In all the tutorials I find, it looks quite simple, because the function finder locates the AES functions. Since this project is bare metal, that will probably not work.
I thought FindCrypt looks for some kind of hex pattern which could result in a key...
I have attached the binary. endian is little, architecture is ARM Cortex (I think?!)
"I thought FindCrypt looks for some kind of hex pattern which could result in a key..."
That would literally be any sequence of 16 or 32 bytes.
When reverse engineering larger apps, this is often a bit easier because the key will tend to be surrounded by a sea of zeros for uninitialized memory. So you just hunt for exactly 16/32 aligned bytes of "no more than 1 zero in a row surrounded by lots of zeros" and you'll probably find keys. But given your program structure, this may not happen. There are never any promises when you're reverse engineering. You'll often have to use lots of different approaches.
In your case, you'd want to hunt for the call to AES_init_ctx, which will reference the key, but it's a little harder to find the key itself automatically.
The important lesson here is that proper (random) AES keys have absolutely no structure other than their length. So it's impossible to look at a sequence of bytes and say "that's definitely an AES key." On the other hand, because they have no structure (and most data does), it's often quite easy to look at a sequence of bytes and say "that's almost certainly an AES key."
(It's also worth noting that almost everyone creates their AES keys incorrectly, basing them on a human-readable password, so that they have quite a lot of structure. Looking for ASCII strings that happen to be exactly 16/32 bytes long is often much more rewarding than looking for 16/32-bytes of random binary data.)
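The ASCII-string heuristic from the last paragraph is easy to try. A minimal Python sketch, assuming the binary has already been read into a bytes object; the function name ascii_key_candidates and the sample firmware blob are made up for illustration, and a hit is only a candidate, never proof of a key:

```python
import re

def ascii_key_candidates(blob: bytes, lengths=(16, 32)):
    """Yield (offset, text) for maximal printable-ASCII runs of exactly 16 or 32 bytes."""
    for match in re.finditer(rb"[\x20-\x7e]+", blob):
        run = match.group()
        if len(run) in lengths:
            yield match.start(), run.decode("ascii")

# Made-up blob: a 16-character password-like string surrounded by non-printable bytes.
firmware = b"\x00" * 8 + b"hunter2hunter2!!" + b"\x00" * 8 + b"short" + b"\xff"
print(list(ascii_key_candidates(firmware)))  # [(8, 'hunter2hunter2!!')]
```

Only runs whose full length is 16 or 32 are reported, so a key embedded inside a longer printable string would be missed; that trade-off keeps the candidate list short.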

Is it ok to create big array of AVX/SSE values

I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I calculate a column of a matrix where each cell is a structure of AVX2 registers (__m256i). I use values from the previous matrix column as input values for calculating the current column. Columns can be big, so I keep an array of structures (on the stack), where each structure has two __m256i elements.
Structure:
#include <immintrin.h>
struct Cell {
    __m256i first;
    __m256i second;
};
And then I have an array like this: Cell prevColumn[N]. N will typically be a few hundred.
I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (the number of AVX registers). Is it good practice to create such an array, or is there some better approach that I should use when storing a lot of __m256i values that are going to be reused very soon?
Also, is there any alignment I should be doing with these structures? I have read a lot about alignment, but I am still not sure how and when to do it exactly.
It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to be too bad for performance, but don't do that unless you need to.
I think it's more typical to write your code with buffers of int [] type, rather than __m256i type, but I'm not sure. Either way works, and should get the compiler to generate efficient code. But the int [] way means less code has to be different for the SSE, AVX2, and AVX512 versions. And it might make it easier to examine things with a debugger, to have your data in an array with a type that will get the data formatted nicely.
As I understand it, the load/store intrinsics are partly there as a cast between __m256i and int [], since AVX doesn't fault on unaligned, just slows down on cacheline boundaries. Assigning to / from an array of __m256i should work fine, and generate load/store instructions where needed, or otherwise generate vector instructions with memory source operands (for more compact code and fewer fused-domain uops).

How to find Hash/Cipher

Is there any tool or method to figure out what this hash/cipher function is?
I have only a 500-item list of inputs and outputs. I know that all of the inputs are numeric and that the output is always 2 bytes long, written in hexadecimal.
Here are some samples:
794352:6657
983447:efbf
479537:0796
793670:dee4
1063060:623c
1063059:bc1b
1063058:b8bc
1063057:b534
1063056:b0cc
1063055:181f
1063054:9f95
1063053:f73c
1063052:a365
1063051:1738
1063050:7489
I looked around and couldn't find any hash this short. Is this a hash folded on itself (with XOR, maybe), or maybe a simple trivial cipher?
Is there any tool or method for finding the output for other numbers?
(I want to figure this out; my next option would be training a neural network or a regression, so I thought I'd ask before taking any drastic action.)
Edit: The numbers are directory names, and the hex parts are required to access them.
Actually, Wikipedia's page on hashes lists three CRCs and three checksum methods that it could be. It could also be only half the output from some more complex hashing mechanism. Cross your fingers and hope that it's one of the former. Hashes are specifically meant to be difficult (if not impossible) to reverse engineer.
What it's being used for should be a very strong hint about whether or not it's more likely to be a checksum/CRC or a hash.
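Checking a list of known pairs against a handful of 16-bit candidates is cheap to automate. The Python sketch below is only a starting point under loud assumptions: that the input is hashed as its ASCII decimal string, and that the function is one of three illustrative candidates (two common non-reflected CRC-16 parameter sets and a plain byte sum). Nothing here claims that the samples in the question match any of them.

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0x0000) -> int:
    """Bitwise, non-reflected CRC-16 with no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Illustrative candidates only; extend with other polynomials, inits, or encodings.
CANDIDATES = {
    "crc16-xmodem": lambda s: crc16(s.encode("ascii"), 0x1021, 0x0000),
    "crc16-ccitt-false": lambda s: crc16(s.encode("ascii"), 0x1021, 0xFFFF),
    "sum16": lambda s: sum(s.encode("ascii")) & 0xFFFF,
}

def matching_candidates(pairs):
    """Return the names of candidates that reproduce every (input, hex_output) pair."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(fn(inp) == int(out, 16) for inp, out in pairs)
    ]

# Usage: feed it the known pairs, e.g. [("794352", "6657"), ("983447", "efbf"), ...]
```

An empty result only rules out these particular candidates; reflected variants, different input encodings (binary integer instead of text), or a truncated larger hash would all need their own entries.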

Testing whether buffers have been flushed in R

I have some big, big files that I work with and I use several different I/O functions to access them. The most common one is the bigmemory package.
When writing to the files, I've learned the hard way to flush output buffers, otherwise all bets are off on whether the data was saved. However, this can lead to some very long wait times while bigmemory does its thing (many minutes). I don't know why this happens - it doesn't always occur and it's not easily reproduced.
Is there some way to determine whether or not I/O buffers have been flushed in R, especially for bigmemory? If the operating system matters, then feel free to constrain the answer in that way.
If an answer can be generalized beyond bigmemory, that would be great, as I sometimes rely on other memory mapping functions or I/O streams.
If there are no good solutions to checking whether buffers have been flushed, are there cases in which it can be assumed that buffers have been flushed? I.e. besides using flush().
Update: I should clarify that these are all binary connections. @RichieCotton suggested isIncomplete(), though the help documentation only mentions text connections. It's not clear whether it is usable for binary connections.
Is this more convincing that isIncomplete() works with binary files?
# R process 1
zz <- file("~/test", "wb")
writeBin(c(1:100000),con=zz)
close(zz)
# R process 2
zz2 <- file("~/test", "rb")
inpp <- readBin(con=zz2, integer(), 10000)
while(isIncomplete(zz2)) {Sys.sleep(1); inpp <- c(inpp, readBin(zz2, integer(), 10000))}
close(zz2)
(Modified from the help(connections) file.)
I'll put forward my own answer, but I welcome anything that is clearer.
From what I've seen so far, the various connection functions, e.g. file, open, close, flush, isOpen, and isIncomplete (among others), are based on specific connection types, e.g. files, pipes, URLs, and a few other things.
In contrast, bigmemory has its own connection type and the bigmemory object is an S4 object with a slot for a memory address for operating system buffers. Once placed there, the OS is in charge of flushing those buffers. Since it's an OS responsibility, then getting information on "dirty" buffers requires interacting with the OS, not with R.
Thus, the answer for bigmemory is "no" as the data is stored in the kernel buffer, though it may be "yes" for other connections that are handled through STDIO (i.e. stored in "user space").
For more insight on the OS / kernel side of things, see this question on SO; I am investigating a couple of programs (not just R + bigmemory) that are producing buffer flushing curiosities, and that thread helped to enlighten me about the kernel side of things.
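The split between user-space buffers and kernel buffers is not specific to R. As a rough illustration in Python (chosen only for brevity; the helper names write_durably and update_mapped are hypothetical), a buffered stream needs two steps to reach the device, while a memory-mapped file has no user-space buffer at all and relies on the OS, which can only be asked to write back:

```python
import mmap
import os

def write_durably(path: str, payload: bytes) -> None:
    """Write through a buffered stream, then push the data all the way down."""
    with open(path, "wb") as f:
        f.write(payload)       # lands in the process's user-space buffer
        f.flush()              # user-space buffer -> kernel page cache
        os.fsync(f.fileno())   # kernel page cache -> storage device

def update_mapped(path: str, offset: int, payload: bytes) -> None:
    """Modify a non-empty file through a memory mapping (pages owned by the kernel)."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(payload)] = payload
            mm.flush()         # ask the OS to write dirty pages back to the file
```

In neither case does the process get a portable way to ask whether dirty pages are still pending; it can only request the write-back and wait for the call to return, which matches the "no" above for memory-mapped data.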

How to decide if the chosen password is correct?

If an encrypted file exists and someone wants to decrypt it, there are several methods to try.
For example, if you choose a brute-force attack, that's easy: just try all possible keys and you will find the correct one. For this question, it doesn't matter that this might take too long.
But trying keys means the following steps:
1. Choose key
2. Decrypt data with key
3. Check if decryption was successful
Besides the problem that you would need to know the algorithm that was used for the encryption, I cannot imagine how one would do #3.
Here is why: After decrypting the data, I get some "other" data. In the case of an encrypted plain-text file in a language that I can understand, I can now check whether the result is a text in that language.
If it would be a known file type, I could check for specific file headers.
But since one tries to decrypt something secret, it is most likely unknown what kind of information there will be if correctly decrypted.
How would one check if a decryption result is correct if it is unknown what to look for?
Like you suggest, one would expect the plaintext to be of some known format, e.g., a JPEG image, a PDF file, etc. The idea would be that it is very unlikely that a given ciphertext can be decrypted into both a valid JPEG image and a valid PDF file (but see below).
But it is actually not that important. When one talks about a cryptosystem being secure, one (roughly) talks about the odds of you being able to guess the plaintext corresponding to a given ciphertext. So I pick a random message m and encrypt it as c = E(m). I give you c, and if you cannot guess m, then we say the cryptosystem is secure; otherwise it's broken.
This is just a simple security definition. There are other definitions that require the system to be able to hide known plaintexts (semantic security): you give me two messages, I encrypt one of them, and you will not be able to tell which message I chose.
The point is, that in these definitions, we are not concerned with the format of the plaintexts, all we require is that you cannot guess the plaintext that was encrypted. So there is no step 3 :-)
By not considering your step 3, we make the question of security as clear as possible: instead of arguing about how hard it is to guess which format you used (zip, gzip, bzip2, ...) we only talk about the odds of breaking the system compared to the odds of guessing the key. It is an old principle that you should concentrate all your security in the key -- it simplifies things dramatically when your only assumption is the secrecy of the key.
Finally, note that some encryption schemes make it impossible for you to verify whether you have the correct key, since all keys are legal. The one-time pad is an extreme example of such a scheme: you take your plaintext m, choose a perfectly random key k and compute the ciphertext as c = m XOR k. This gives you a completely random ciphertext; it is perfectly secure (the only perfectly secure cryptosystem, btw).
When searching for a one-time pad key, you cannot know when you've found the right one. This is because c could be an encryption of any file with the same length as m: if you encrypt the message m' with the key k' = c XOR m', you get the same ciphertext again, so you cannot know whether m or m' was the original message.
Instead of thinking of exclusive-or, you can think of the one-time pad like this: I give you the number 42 and tell you that it is the sum of two integers (negative, positive, you don't know). One integer is the message, the other is the key, and 42 is the ciphertext. Like above, it makes no sense for you to guess the key: if you want the message to be 100, you claim the key is -58; if you want the message to be 0, you claim the key is 42; and so on. The one-time pad works exactly like this, but on bit values instead.
About reusing the key in a one-time pad: let's say my key is 7 and you see the ciphertexts 10 and 20, corresponding to plaintexts 3 and 13. From the ciphertexts alone, you now know that the difference between the plaintexts is 10. If you somehow gain knowledge of one of the plaintexts, you can now derive the other! If the numbers correspond to individual letters, you can begin looking at several such differences and try to solve the resulting crossword puzzle (or let a program do it based on frequency analysis of the language in question).
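Both one-time pad observations can be checked in a few lines. This is a small Python sketch with made-up messages; xor_bytes is a helper written for the example, not a library function:

```python
import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

m = b"attack at dawn"
k = os.urandom(len(m))           # one-time pad key, as long as the message
c = xor_bytes(m, k)

# Any other message of the same length has a key that "decrypts" c to it.
m_alt = b"retreat at ten"
k_alt = xor_bytes(c, m_alt)
assert xor_bytes(c, k_alt) == m_alt
assert xor_bytes(c, k) == m      # the real key works too; nothing tells them apart

# Reusing k leaks the XOR of the plaintexts, independent of the key.
m2 = b"meet me at six"
c2 = xor_bytes(m2, k)
assert xor_bytes(c, c2) == xor_bytes(m, m2)
```

The first pair of assertions is the reason a brute-force search over one-time pad keys cannot succeed: every candidate plaintext of the right length is reachable. The last assertion is the XOR counterpart of the "difference is 10" example.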
You could use heuristics, like the Unix file command does, to check for a known file type. If you have decrypted data that has no recognizable type, decrypting it won't help you anyway, since you can't interpret it, so it's still as good as encrypted.
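A much-reduced version of that heuristic is a lookup of leading magic bytes. The Python sketch below uses a handful of well-known signatures; the table and the guess_type name are illustrative, and the real file command consults a far larger database:

```python
# A few well-known file signatures (leading "magic" bytes).
MAGIC_NUMBERS = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def guess_type(data: bytes):
    """Return a file-type name if the data starts with a known signature, else None."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None

print(guess_type(b"%PDF-1.7\n..."))        # pdf
print(guess_type(b"\x13\x37 random junk"))  # None
```

Short signatures are weak evidence when many keys are tried: a 2-byte signature matches random data about once in 65,536 attempts, so a hit should be followed by a deeper validity check of the format.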
I wrote a tool a little while ago that checked if a file was possibly encrypted by simply checking the distribution of byte values, since encrypted files should be indistinguishable from random noise. The assumption here then is that an improperly decrypted file retains the random nature, while a properly decrypted file will exhibit structure.
#!/usr/bin/env python3
"""Flag files whose byte distribution does not look like random noise."""
import math
import os
import sys

MAGIC_COEFF = 3


def get_random_bytes(filename):
    """Return the first block of the file plus up to BLOCKS blocks at random offsets."""
    BLOCK_SIZE = 1024 * 1024
    BLOCKS = 10
    # Binary mode: list() over the bytes read yields integer byte values.
    with open(filename, "rb") as f:
        data = list(f.read(BLOCK_SIZE))
        if len(data) < BLOCK_SIZE:
            return data
        f.seek(0, 2)
        file_len = f.tell()
        index = BLOCK_SIZE
        cnt = 0
        while index < file_len and cnt < BLOCKS:
            f.seek(index)
            data.extend(f.read(BLOCK_SIZE))
            # Skip ahead by a random number of blocks (0-255).
            index += os.urandom(1)[0] * BLOCK_SIZE
            cnt += 1
    return data


def failed_n_gram(n, data):
    """Return True if the n-gram histogram deviates too far from uniform."""
    print("\t%d-gram analysis" % n)
    N = len(data) // n
    states = 2 ** (8 * n)
    print("\tN: %d states: %d" % (N, states))
    if N < states:
        print("\tinsufficient data")
        return False
    histo = [0] * states
    P = 1.0 / states
    # Integer division is kept from the original Python 2 version.
    expected = float(N // states)
    # I forgot how this was derived, or what it is supposed to be.
    # (sqrt(N*P*(1-P)) is the standard deviation of a binomial count,
    # so this is a MAGIC_COEFF-sigma threshold.)
    magic = math.sqrt(N * P * (1 - P)) * MAGIC_COEFF
    print("\texpected: %f magic: %f" % (expected, magic))
    idx = 0
    while idx < len(data) - n:
        # Pack the n bytes starting at idx into one integer histogram index.
        val = 0
        for x in range(n):
            val = (val << 8) | data[idx + x]
        histo[val] += 1
        idx += 1
        count = histo[val]
        if count - expected > magic:
            print("\tfailed: %s occured %d times" % (hex(val), count))
            return True
    # need this check because the absence of certain bytes is also
    # a sign something is up
    for i in range(len(histo)):
        count = histo[i]
        if expected - count > magic:
            print("\tfailed: %s occured %d times" % (hex(i), count))
            return True
    print()
    return False


def main():
    for f in sys.argv[1:]:
        print(f)
        rand_bytes = get_random_bytes(f)
        if failed_n_gram(3, rand_bytes):
            continue
        if failed_n_gram(2, rand_bytes):
            continue
        if failed_n_gram(1, rand_bytes):
            continue


if __name__ == "__main__":
    main()
I find this works reasonably well:
$ entropy.py ~/bin/entropy.py entropy.py.enc entropy.py.zip
/Users/steve/bin/entropy.py
1-gram analysis
N: 1680 states: 256
expected: 6.000000 magic: 10.226918
failed: 0xa occured 17 times
entropy.py.enc
1-gram analysis
N: 1744 states: 256
expected: 6.000000 magic: 10.419895
entropy.py.zip
1-gram analysis
N: 821 states: 256
expected: 3.000000 magic: 7.149270
failed: 0x0 occured 11 times
Here .enc is the source run through:
openssl enc -aes-256-cbc -in entropy.py -out entropy.py.enc
And .zip is self-explanatory.
A few caveats:
It doesn't check the entire file, just the first MB (BLOCK_SIZE in the code), then random blocks from the file. So if a file was random data with, say, a JPEG appended, it will fool the program. The only way to be sure is to check the entire file.
In my experience, the code reliably detects when a file is unencrypted (since nearly all useful data has structure), but due to its statistical nature may sometimes misdiagnose an encrypted/random file.
As it has been pointed out, this kind of analysis will fail for OTP, since you can make it say anything you want.
Use at your own risk, and most certainly not as the only means of checking your results.
One way is to compress the source data with some standard algorithm like zip. If, after decryption, you can unzip the result, it was decrypted correctly. Compression is almost always done by encryption programs prior to encryption, both because it's another step the brute-forcer will need to repeat for each trial and lose time on, and because encrypted data is almost surely incompressible (its size doesn't decrease after compression with a chained algorithm).
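The "can I decompress it?" test is simple to demonstrate. In the Python sketch below, the single-byte XOR is a deliberately toy stand-in for a real cipher (an assumption made so the whole key space can be tried instantly), and toy_xor and keys_that_decompress are names invented for the example:

```python
import zlib

def toy_xor(data: bytes, key: int) -> bytes:
    """Toy single-byte XOR 'cipher'; a stand-in for a real cipher, not secure."""
    return bytes(b ^ key for b in data)

def keys_that_decompress(ciphertext: bytes):
    """Try every toy key and keep those whose output is a valid zlib stream."""
    found = []
    for key in range(256):
        try:
            zlib.decompress(toy_xor(ciphertext, key))
        except zlib.error:
            continue  # wrong key: header, stream, or checksum did not validate
        found.append(key)
    return found

plaintext = b"the quick brown fox jumps over the lazy dog " * 4
secret_key = 0x5A
ciphertext = toy_xor(zlib.compress(plaintext), secret_key)

candidates = keys_that_decompress(ciphertext)
assert secret_key in candidates
```

The check works because a zlib stream carries a header and an Adler-32 checksum, so garbage from a wrong key is very unlikely to validate. That same property is what gives an attacker a cheap way to recognize the correct key.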
Without a more clearly defined scenario, I can only point to cryptanalysis methods. I would say it's generally accepted that validating the result is an easy part of cryptanalysis. In comparison to decrypting even a known cypher, a thorough validation check costs little cpu.
Are you seriously asking questions like this?
Well, if it were known what's inside, you would not need to decrypt it anyway, right?
Somehow this has little to do with programming; it's more of a mathematical question. I took some encryption math classes at my university.
And you cannot confirm it without a lot of data points.
Sure, you can tell if your result makes sense and is clearly meaningful in plain English (or whatever language is used), but to answer your question:
If you were able to decrypt, you should be able to encrypt as well.
So encrypt the result using the reverse process of the decryption, and if you get the same result you might be golden; if not, something is possibly wrong.

Resources