I just made a simple text editor using inbuilt opening, writing, overwriting file.I used python installed with tkinter. But, I also want to extend the text editor to add some new features like search, replace etc efficiently. In order to make it efficient, I need to know data structure that Unix uses to store data in file and to calculating the time complexity for search.
A text file is stored as a stream of bytes. Depending on the encoding used (ASCII, UTF-8, Unicode, etc.), it can be a fixed value of one or more bytes per character, or in the case of UTF-8 and some other encodings, a varying number of bytes per character.
The best search algorithms have a worst case complexity of O(n + m), where n is the length of the string you're searching for, and m is the length of the string you're searching in. A good example is the Boyer-Moore search algorithm. If the file you're working with is larger than will fit in memory, then you have to worry about buffering and such, which is an added complication, but doesn't impact the efficiency of the search. You'll have to be creative about buffering the input so that you don't miss a string that crosses input buffer boundaries.
Related
I have never worked on the security side of web apps, as I am just out of college. Now, I am looking for a job and working on some websites on the side, to keep my skills sharp and gain new ones. One site I am working on is pretty much copied from the original MEAN stack from the guys that created it, but trying to understand it and do things better where I can.
To compute the hash & salt, the creators used PBKDF2. I am not interested in hearing about arguments for or against PBKDF2, as that is not what this question is about. They seem to have used buffers for everything here, which I understand is a common practice in node. What I am interested in are their reasons for using base64 for the buffer encoding, rather than simply using UTF-8, which is an option with the buffer object. Most computers nowadays can handle many of the characters in Unicode, if not all of them, but the creators could have chosen to encode the passwords in a subset of Unicode without restricting themselves to the 65 characters of base64.
By "the choice between encoding as UTF-8 or base64", I mean transforming the binary of the hash, computed from the password, into the given encoding. node.js specifies a couple ways to encode binary data into a Buffer object. From the documentation page for the Buffer class:
Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP
streams or the file system, it's necessary to handle octet streams. Node has several
strategies for manipulating, creating, and consuming octet streams.
Raw data is stored in instances of the Buffer class. A Buffer is similar to an array
of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer
cannot be resized.
What the Buffer class does, as I understand it, is take some binary data and calculate the value of each 8 (usually) bits. It then converts each set of bits into a character corresponding to its value in the encoding you specify. For example, if the binary data is 00101100 (8 bits), and you specify UTF-8 as the encoding, the output would be , (a comma). This is what anyone looking at the output of the buffer would see when looking at it with a text editor such as vim, as well as what a computer would "see" when "reading" them. The Buffer class has several encodings available, such as UTF-8, base64, and binary.
I think they felt that, while storing any UTF-8 character imaginable in the hash, as they would have to do, would not phase most modern computers, with their gigabytes of RAM and terabytes of space, actually showing all these characters, as they may want to do in logs, etc., would freak out users, who would have to look at weird Chinese, Greek, Bulgarian, etc. characters, as well as control characters, like the Ctrl button or the Backspace button or even beeps. They would never really need to make sense of any of them, unless they were experienced users testing PBKDF2 itself, but the programmer's first duty is to not give any of his users a heart attack. Using base64 increases the overhead by about a third, which is hardly worth noting these days, and decreases the character set, which does nothing to decrease the security. After all, computers are written completely in binary. As I said before, they could have chosen a different subset of Unicode, but base64 is already standard, which makes it easier and reduces programmer work.
Am I right about the reasons why the creators of this repository chose to encode its passwords in base64, instead of all of Unicode? Is it better to stick with their example, or should I go with Unicode or a larger subset of it?
A hash value is a sequence of bytes. This is binary information. It is not a sequence of characters.
UTF-8 is an encoding for turning sequences of characters into sequences of bytes. Storing a hash value "as UTF-8" makes no sense, since it is already a sequence of bytes, and not a sequence of characters.
Unfortunately, many people have took to the habit of considering a byte as some sort of character in disguise; it was at the basis of the C programming language and still infects some rather modern and widespread frameworks such as Python. However, only confusion and sorrow lie down that path. The usual symptoms are people wailing and whining about the dreadful "character zero" -- meaning, a byte of value 0 (a perfectly fine value for a byte) that, turned into a character, becomes the special character that serves as end-of-string indicator in languages from the C family. This confusion can even lead to vulnerabilities (the zero implying, for the comparison function, an earlier-than-expected termination).
Once you have understood that binary is binary, the problem becomes: how are we to handle and store our hash value ? In particular in JavaScript, a language that is known to be especially poor at handling binary values. The solution is an encoding that turns the bytes into characters, not just any character, but a very small subset of well-behaved characters. This is called Base64. Base64 is a generic scheme for encoding bytes into character strings that don't include problematic characters (no zero, only ASCII printable characters, excluding all the control characters and a few others such as quotes).
Not using Base64 would imply assuming that JavaScript can manage an arbitrary sequence of bytes as if it was just "normal characters", and that is simply not true.
There is a fundamental, security-related reason to store as Base64 rather than Unicode: the hash may contain the byte value "0", used by many programming languages as an end-of-string marker.
If you store your hash as Unicode, you, another programmer, or some library code you use may treat it as a string rather than a collection of bytes, and compare using strcmp() or a similar string-comparison function. If your hash contains the byte value "0", you've effectively truncated your hash to just the portion before the "0", making attacks much easier.
Base64 encoding avoids this problem: the byte value "0" cannot occur in the encoded form of the hash, so it doesn't matter if you compare encoded hashes using memcmp() (the right way) or strcmp() (the wrong way).
This isn't just a theoretical concern, either: there have been multiple cases of code for checking digital signatures using strcmp(), greatly weakening security.
This is an easy answer, since there are an abundance of byte sequences which are not well-formed UTF-8 strings. The most common one is a continuation byte (0x80-0xbf) that is not preceded by a leading byte in a multibyte sequence (0xc0-0xf7); bytes 0xf8-0xff aren't valid either.
So these byte sequences are not valid UTF-8 strings:
0x80
0x40 0xa0
0xff
0xfe
0xfa
If you want to encode arbitrary data as a string, use a scheme that allows it. Base64 is one of those schemes.
An addtional point: you might think to yourself, well, I don't really care whether they're well-formed UTF-8 strings, I'm never going to use the data as a string, I just want to hand this byte sequence to store for later.
The problem with that, is if you give an arbitrary byte sequence to an application expecting a UTF-8 string, and it is not well-formed, the application is not obligated to make use of this byte sequence. It might reject it with an error, it might truncate the string, it might try to "fix" it.
So don't try to store arbitrary byte sequences as a UTF-8 string.
Base64 is better, but consider a websafe base64 alphabet for transport. Base64 can conflict with querystring syntax.
Another option you might consider is using hex. Its longer but seldom conflicts with any syntax.
I'm increasingly looking at using QR codes to transmit binary information, such as images, since it seems whenever I demo my app, it's happening in situations where the WiFi or 3G/4G just doesn't work.
I'm wondering if it's possible to split a binary file up into multiple parts to be encoded by a series of QR codes?
Would this be as simple as splitting up a text file, or would some sort of complex data coherency check be required?
Yes, you could convert any arbitrary file into a series of QR codes,
something like Books2Barcodes.
The standard way of encoding data too big to fit in one QR code is with the "Structured Append Feature" of the QR code standard.
Alas, I hear that most QR encoders or decoders -- such as zxing -- currently do not (yet) support generating or reading such a series of barcodes that use the structured append feature.
QR codes already have a pretty strong internal error correction.
If you are lucky, perhaps splitting up your file with the "split" utility
into pieces small enough to fit into a easily-readable QR code,
then later scanning them in (hopefully) the right order and using "cat" to re-assemble them,
might be adequate for your application.
You surely can store a lot of data in a QR code; it can store 2953 bytes of data, which is nearly twice the size of a standard TCP/IP packet originated on an Ethernet network, so it's pretty powerful.
You will need to define some header for each QR code that describes its position in the stream required to rebuild the data. It'll be something like filename chunk 12 of 96, though encoded in something better than plain text. (Eight bytes for filename, one byte each for chunk number and total number of chunks -- a maximum of 256 QR codes, one simple ten-byte answer, still leaving 2943 bytes per code.)
You will probably also want to use some form of forward error correction such as erasure codes to encode sufficient redundant data to allow for mis-reads of either individual QR codes or entire missing QR codes to be transparently handled well. While you may be able to take an existing library, such as for Reed-Solomon codes to provide the ability to fix mis-reads within a QR code, handling missing QR codes entirely may take significantly more effort on your part.
Using erasure codes will of course reduce the amount of data you can transmit -- instead of all 753,408 bytes (256 * 2943), you will only have 512k or 384k or even less available to your final images -- depending upon what code rate you choose.
I think it is theoretically possible and as simple as splitting up text file. However, you probably need to design some kind of header to know that the data is multi-part and to make sure different parts can be merged together correctly regardless of the order of scanning.
I am assuming that the QR reader library returns raw binary data, and you will you the job of converting it to whatever form you want.
If you want automated creation and transmission, see
gre/qrloop: Encode a big binary blob to a loop of QR codes
maxg0/displaysocket.js: DisplaySocket.js - a JavaScript library for sending data from one device to another via QR ocdes using only a display and a camera
Note - I haven't used either.
See also: How can I publish data from a private network without adding a bidirectional link to another network - Security StackExchange
Goal (General)
My ultimate (long term) goal is to write an importer for a binary file into another application
Question Background
I am interested in two fields within a binary file format. One is
encrypted, and the other is compressed and possibly also encrypted
(See how I arrived at this conclusion here).
I have a viewer program (I'll call it viewer.exe) which can open these files for viewing. I'm hoping this can offer up some clues.
I will (soon) have a correlated deciphered output to compare and have values to search for.
This is the most relevant stackoverflow Q/A I have found
Question Specific
What is the best strategy given the resources I have to identify the algorithm being used?
Current Ideas
I realize that without the key, identifying the algo from just data is practically impossible
Having a file and a viewer.exe, I must have the key somewhere. Whether it's public, private, symmetric etc...that would be nice to figure out.
I would like to disassemble the viewer.exe using OllyDbg with the findcrypt plugin as a first step. I'm just not proficient enough in this kind of thing to accomplish it yet.
Resources
full example file
extracted binary from the field I am interested in
decrypted data In this zip archive there is a binary list of floats representing x,y,z (model2.vertices) and a binary list of integers (model2.faces). I have also included an "stl" file which you can view with many free programs but because of the weird way the data is stored in STL's, this is not what we expect to come out of the original file.
Progress
1. I disassembled the program with Olly, then did the only thing I know how to do at this poing and "searched for all referenced text" after pausing the porgram right before it imports of of the files. Then I searched for words stings like "crypt, hash, AES, encrypt, SHA, etc etc." I came up with a bunch of things, most notably "Blowfish64" which seems to go nicely with the fact that mydata occasionally is 4 bytes too long (and since it is guranteed to be mod 12 = 0) this to me looks like padding for 64 bit block size (odd amounts of vertices result in non mod 8 amounts of bytes). I also found error messages like...
“Invalid data size, (Size-4) mod 8 must be 0"
After reading Igor's response below, here is the output from signsrch. I've updated this image with green dot's which cause no problems when replaced by int3, red if the program can't start, and orange if it fails when loading a file of interest. No dot means I haven't tested it yet.
Accessory Info
Im using windows 7 64 bit
viewer.exe is win32 x86 application
The data is base64 encoded as well as encrypted
The deciphered data is groups of 12 bytes representing 3 floats (x,y,z coordinates)
I have OllyDb v1.1 with the findcrypt plugin but my useage is limited to following along with this guys youtube videos
Many encryption algorithms use very specific constants to initialize the encryption state. You can check if the binary has them with a program like signsrch. If you get any plausible hits, open the file in IDA and search for the constants (Alt-B (binary search) would help here), then follow cross-references to try and identify the key(s) used.
You can't differentiate good encryption (AES with XTS mode for example) from random data. It's not possible. Try using ent to compare /dev/urandom data and TrueCrypt volumes. There's no way to distinguish them from each other.
Edit: Re-reading your question. The best way to determine which symmetric algorithm, hash and mode is being used (when you have a decryption key) is to try them all. Brute-force the possible combinations and have some test to determine if you do successfully decrypt. This is how TrueCrypt mounts a volume. It does not know the algo beforehand so it tries all the possibilities and tests that the first few bytes decrypt to TRUE.
Is there a simple and quick way to detect encrypted files? I heard about enthropy calculation, but if I calculate it for every file on a drive, it will take days to detect encryption.
Is it possible to, say it, calculate some value for first 100 bytes or 1024 bytes and then decide? Anyone has a sources for that?
I would use a cross-entropy calculation. Calculate the cross-entropy value for X bytes for known encrypted data (it should be near 1, regardless of type of encryption, etc) - you may want to avoid file headers and footers as this may contain non-encrypted file meta data.
Calculate the entropy for a file; if it's close to 1, then it's either encrypted or /dev/random. If it's quite far away from 1, then it's likely not encrypted. I'm sure you could apply signifance tests to this to get a baseline.
It's about 10 lines of Perl; I can't remember what library is used (although, this may be useful: http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/addonecross.txt)
You could just make a system that recognizes particular common forms of encrypted files (ex: recognize encrypted zip, rar, vim, gpg, ssl, ecryptfs, and truecrypt). Any attempt to determine encryption based on the raw data will quickly run into a steganography discussion.
One of the advantages of good encryption is that you can design it so that it can't be detected - see the Wikipedia article on deniable encryption for example.
Every statistical approach to detect encryption will give you various "false alarms", like
compressed data or random looking data in general.
Imagine I'd write a program that outputs two files: file1 contains 1024 bit of π and file2 is an encrypted version of file1. If you don't know anything about the contents of file1 or file2, there's no way to distinguish them. In fact, it's quite likely that π contains the contents of file2 somewhere!
EDIT:
By the way, it's not even working the other way round (detecting unencrypted files). You could write a program that transforms encrypted data to readable english text by assigning words or whole sentences to bits/bytes of it.
Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating it's checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also match.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1mb)? Or would perhaps the amount of data needed to transfer the checksums almost as large as the file, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: If you could communicate a n byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that one out of every 256 files won't be expressible using this method - this is often referred to as the pidegonhole principle.
Now, even if that wasn't a problem, there's no guarentee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in a N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossible huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data, in fact to be so it would effectively need to have a many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from but in most cases it can't prove it.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so you can gaurantee that if you were to evaluate every possibility, you are guaranteed an MD5 collision (as MD5 is a 128 bit number).
Short answer: Not in any meaningfull form.
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum,
you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the colisions, you will have sent at least 8000 bits i.e. an amount equal or greater to the file size.
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, ayway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge on the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willingly to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.