Detecting format of raw burned data which was burned as CDDA - cd-burning

Given a file containing raw data burned to a CD/DVD, I'm trying to figure out which format was used (i.e. ISO, UDF, etc.).
For example, I can detect the ISO9660 format because in the beginning of one of the 2048-byte sectors I can find the string "CD001".
However, when it comes to CDDA (Redbook) data, it seems to be much harder. I tried looking for the synchronization bytes, but wasn't very lucky so far. Another complication in this case is that each track is written to another file. In case of ISO, for example, everything is on a single track; In case of audio I got a separate file for each track.
I'm aware that this format is abused by many companies, so I'm just looking for a pretty good detection.
The specific case is to detect audio CD (in CDDA format) created by Windows Media Player.
Any suggestions?

Related

Disassemble to identify encryption algorithm

Goal (General)
My ultimate (long term) goal is to write an importer for a binary file into another application
Question Background
I am interested in two fields within a binary file format. One is
encrypted, and the other is compressed and possibly also encrypted
(See how I arrived at this conclusion here).
I have a viewer program (I'll call it viewer.exe) which can open these files for viewing. I'm hoping this can offer up some clues.
I will (soon) have a correlated deciphered output to compare and have values to search for.
This is the most relevant stackoverflow Q/A I have found
Question Specific
What is the best strategy given the resources I have to identify the algorithm being used?
Current Ideas
I realize that without the key, identifying the algo from just data is practically impossible
Having a file and a viewer.exe, I must have the key somewhere. Whether it's public, private, symmetric etc...that would be nice to figure out.
I would like to disassemble the viewer.exe using OllyDbg with the findcrypt plugin as a first step. I'm just not proficient enough in this kind of thing to accomplish it yet.
Resources
full example file
extracted binary from the field I am interested in
decrypted data In this zip archive there is a binary list of floats representing x,y,z (model2.vertices) and a binary list of integers (model2.faces). I have also included an "stl" file which you can view with many free programs but because of the weird way the data is stored in STL's, this is not what we expect to come out of the original file.
Progress
1. I disassembled the program with Olly, then did the only thing I know how to do at this poing and "searched for all referenced text" after pausing the porgram right before it imports of of the files. Then I searched for words stings like "crypt, hash, AES, encrypt, SHA, etc etc." I came up with a bunch of things, most notably "Blowfish64" which seems to go nicely with the fact that mydata occasionally is 4 bytes too long (and since it is guranteed to be mod 12 = 0) this to me looks like padding for 64 bit block size (odd amounts of vertices result in non mod 8 amounts of bytes). I also found error messages like...
“Invalid data size, (Size-4) mod 8 must be 0"
After reading Igor's response below, here is the output from signsrch. I've updated this image with green dot's which cause no problems when replaced by int3, red if the program can't start, and orange if it fails when loading a file of interest. No dot means I haven't tested it yet.
Accessory Info
Im using windows 7 64 bit
viewer.exe is win32 x86 application
The data is base64 encoded as well as encrypted
The deciphered data is groups of 12 bytes representing 3 floats (x,y,z coordinates)
I have OllyDb v1.1 with the findcrypt plugin but my useage is limited to following along with this guys youtube videos
Many encryption algorithms use very specific constants to initialize the encryption state. You can check if the binary has them with a program like signsrch. If you get any plausible hits, open the file in IDA and search for the constants (Alt-B (binary search) would help here), then follow cross-references to try and identify the key(s) used.
You can't differentiate good encryption (AES with XTS mode for example) from random data. It's not possible. Try using ent to compare /dev/urandom data and TrueCrypt volumes. There's no way to distinguish them from each other.
Edit: Re-reading your question. The best way to determine which symmetric algorithm, hash and mode is being used (when you have a decryption key) is to try them all. Brute-force the possible combinations and have some test to determine if you do successfully decrypt. This is how TrueCrypt mounts a volume. It does not know the algo beforehand so it tries all the possibilities and tests that the first few bytes decrypt to TRUE.

What file format do people use when logging data to a FAT32 file system using a 8bit microcontroller?

Updated question to be less vague.
I plan to log sensor data by time so something like sqlite would be perfect, but it requires too much resources in something like an atmega328p. Most of the searching will be done off the uC.
What do other people use? Flat text files? XML? A more complicated data structure?
Thanks for the feedback. It is good to know what other people are using. I've decided to serialize my data structures and save them in a binary file to eliminate string processing on the uC for now.
I've used flat text files for similar projects, albiet years ago, but I believe it's still a good approach for that environment. Since you don't need to process the data on-chip, you want it to be as efficient as possible (as little overhead as possible).
However, if you want more flexibility and weren't as concerned about space, perhaps saving JSON objects would be better, where each field is identified clearly. A tiny bit of overhead for creating the objects, but allows you to add and remove fields without complex logic on the interpreting side. I would pick JSON over XML just because you have about half the overhead (in space, and probably in processing).
With a small micro-controller like the 328, it is very important to determine the space requirements.
How big is each record? How many records do you want to store? How will you get the records off of the micro-controller?
Like Doug, I usually use a flat text to store data. So each record might contain year, day of the year and a value if I am storing a value once a day.
The file would look like:
11,314,100<cr>
11,315,99<cr>
11,316,98<cr>
11,317,220<cr>
You could store approximately 90-100 records, requiring you to dump the data every three months
If you need more then the 1kEEprom holds (200 5 byte records, 100 10 byte records or simliar) then you will need additional memory using an IC, SD or Flash.
If you want to unplug the memory and plug it into a PC, then the SD or Flash would be best.
You could use a vinculum chip from FTDIChip.com to simplify writing the fat files to flash drive.

Math - big number from couple of numbers export-able

Let's say I have some numbers, like
5,10,7,8,9,6,2,4,8,5,3,9,78,5,6
I need to send this to another computer, but as the least number of possible bytes. I know what there is a way to do that, I just forgot what it's called and how it works, but generally doing some math with those numbers, getting a big number that, from this number, I'll be able to export the data and get this numbers from this number. Thanks in advance.
EDIT
OK so I need to send this text in UDP but I need it as less bits as possible. I'm sending some options, like firstcolor-secondcolor, let's say I have 15 colors. Every color is just number, from 1 to 199, but maybe there is a better way to send this data? thanks.
No one can say which compression scheme is the best for you. We don't have any information about the numbers. But as a first try, you could just write them into a file and use gzip compression on it. Or bzip2, or 7zip.
And only if all these don't help, you should think about doing the compression yourself.
You also didn't tell us your operating systems (source computer, destination computer) and from where you get the data.
[Update, based on the edit in the question:] So basically you want to send some numbers in the range of 1 to 199. This is pretty close to what a single byte can hold.
If it is ok that you use 8 bits per number (meaning you waste 0.4 bits per number), this is trivial but highly depends on the programming language. Here is how it might look like in Java syntax:
ByteBuffer buf = new ByteBuffer();
buf.add(1);
buf.add(199);
buf.add(78);
buf.add(7);
udpSocket.send(buf.toArray());
Get a compression library (like zlib, for example) and feed your numbers in (as an array of integers, for example). This is compressing your data. That same library should allow you to reverse the process and decompress the data at the other end to get your values back out.
If you want to improve your algorithmic knowledge and your requirements are simple and non-critical I'd recommend having a go at writing your own compression/decompression code. If not, grab some code off the shelf - there are loads of good libraries around.

I want to convert a sound from Mic to binary and match it from the database

I want to convert a sound from Mic to binary and match it from the database(a type of voice identification program but don't getting idea how to get sound from Mic directly so that i can convert it to binary?Also it is possible or not. Please guide me )
See this:
http://www.dotnetspider.com/resources/4967-How-record-voice-from-microphone.aspx
You're not going to be able to identify voices by doing a binary comparison on sound data. The binary of a particular sound will not be identical to an imitation of that sound unless it is literally the same file because of minor variations in just about everything. You'll need to do some signals processing to do a fuzzy comparison of the data. You can read about signal processing on wikipedia.
You will probably find it easier to use a third party library to process the sound for you. Something like this might be a good start.
You're looking at two very distinct problems here.
The first is pretty technical: Getting sound from the microphone into a digital waveform. How you do this exactly depends on the OS and API you're using (on Windows, you're probably looking at DirectX audio or, if available, ASIO). Typically, this is how you'd proceed:
Set up a recording buffer for the microphone, with suitable parameters (number of channels, physical input on the sound card, sample rate, bit depth, buffer size)
Start the recording. This usually involves pointing the sound library to a callback function to process the recorded buffer.
In the callback, read the buffer, convert it to a suitable format, and append it to the audio file of your choice. (You could also record to RAM only, but longer recordings may exceed available storage).
Store the recorded audio in a suitable database field (some kind of binary blob)
This is the easy part though; the harder part is matching a chunk of audio data against other chunks. A naïve approach would be to try and find exact matches, but that won't help you much, because the chance that you find one is practically zero - recording equipment, even the best, introduces a bit of random noise, and recording setups vary slightly whether you want to or not, so even if you'd have someone say something twice, perfectly identical, you'd still see differences in the recorded audio.
What you need to do, then, is find certain typical characteristics of the waveform. Things you could look for are:
Overall amplitude shape
Base frequencies
Selected harmonics (formants)
Extracting these is non-trivial and involves pretty severe math; and then you'll have to condense them into some sort of fingerprint, and find a way to compare them with some fuzziness (so that a near-match is good enough, rather than requiring exact matches). Finding the right parameters and comparison algorithms isn't easy, and it takes a lot of tweaking and testing; your best bet is to go find a library that does this for you.

Intelligent Voice Recording: Request for Ideas

Say you have a conference room and meetings take place at arbitrary impromptu times. You would like to keep an audio record of all meetings. In order to make it as easy to use as possible, no action would be required on the part of meeting attenders, they just know that when they have a meeting in a specific room they will have a record of it.
Obviously just recording nonstop would be inefficient as it would be a waste of data storage and a pain to sift through.
I figure there are two basic ways to go about it.
Recording simply starts and stops according to sound level thresholds.
Recording is continuous, but split into X minute blocks. Blocks found to contain no content are discarded.
I like the second way better because I feel there is less risk for losing data because of late starts, or triggers failing.
I would like to implement in Python, and on Windows if possible.
Implementation suggestions?
Bonus considerations that probably deserve their own questions:
best audio format and compression for this purpose
any way of determining how many speakers are present, assuming identification is unrealistic
This is one of those projects where the path is going to be defined more about what's on hand for ready reuse.
You'll probably find it easier to continuously record and saving the data off in chunks (for example, hour long pieces).
Format is going to be dependent on what you in the form of recording tools and audio processing library. You may even find that you use two. One format, like PCM encoded WAV for recording and processing, but compressed MP3 for storage.
Once you have an audio stream, you'll need to access it in a PCM form (list of amplitude values). A simple averaging approach will probably be good enough to detect when there is a conversation. Typical tuning attributes:
* Average energy level to trigger
* Amount of time you need to be at the energy level or below to identify stop and start (I recommend two different values)
* Size of analysis window for averaging
As for number of participants, unless you find a library that does this, I don't see an easy solution. I've used speech recognition engines before and also done a reasonable amount of audio processing and I haven't seen any 'easy' ways to do this. If you were to look, search out universities doing speech analysis research. You may find some prototypes you can modify to give your software some clues.
I think you'll have difficulty doing this entirely in Python. You're talking about doing frequency/amplitude analysis of MP3 files. You would have to open up the file and look for a volume threshold, then cut out the portions that go below that threshold. Figuring out how many speakers are present would require very advanced signal processing.
A cursory Google search turned up nothing for me. You might have better luck looking for an off-the-shelf solution.
As an aside- there may be legal complications to having a recorder running 24/7 without letting people know.

Resources