What does sound data look like? - r

I read how sounds are represented with numbers in a computer here.
And I figured out that the usual representation is that we get 44,100 numbers in the range [-32767, 32767] per second.
So in my imagination, there has to be one big one-column matrix, right?
I'm an R user, so speaking in R, sound data for 3 seconds would be:
s <- 3
sound <- matrix(0, ncol = 1, nrow = 44100 * s)
nrow(sound)
#> [1] 132300
one-column matrix with 132,300 rows.
Is this really the case?
I want an analogous picture in my head. Say, in the case of a 256 * 256 image,
if we split that picture into RGB channels, we get 3 matrices, each 256 * 256.
And in the case of sound, we get one long, long column? Thinking about it again, it's not even a matrix after all. It's just a column.
Am I right? I can't find any similar dataset searching the Internet.
Any advice is welcome. Thanks.

The raw format that is created early in that linked question could look a lot like a single dimension array. And probably the signal that is sent to the speaker to make the sound could be represented similarly.
But you're unlikely to find a file on your computer that looks like that for several reasons:
Sound can be stored at different bit depths, that is, how many bits are used for each 'number'. CD audio tracks have a 16-bit depth, but you could have 8 or 32 bits, etc. In a straight stream of these numbers you somehow need to know how far to read to get to the next number, so that information needs to be saved somewhere.
Sample rate can vary. If you've got a sequence of numbers representing an audio signal, you need to know how long each number lasts.
Most sounds are more complex. Instead of a single source, you have stereo, or 5 channels, or whatever, so the system needs to be able to store/decode multiple pieces of information for the sound you want to hear at a particular time.
Much of sound is repetitive, so it can often benefit from compression.
So most sounds are stored in a compressed format that includes wrapper information about how to decode it. The wrapper information includes how to decode the different audio channels, what sort of compression was used etc.
The closest you're likely to find is a .wav file (Windows) or .aiff (Mac). But even these include some metadata (sample rate and bit depth, for a start).
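If you want to look at real samples in R, here is a minimal sketch, assuming the tuneR package and a 16-bit WAV file called "sound.wav" (both the package choice and the file name are my assumptions, not part of the question):

library(tuneR)              # assumed to be installed

w <- readWave("sound.wav")  # hypothetical file
w@samp.rate                 # samples per second, e.g. 44100
w@bit                       # bit depth, e.g. 16
length(w@left)              # one long vector of samples for the left channel
range(w@left)               # for 16-bit audio, roughly within [-32767, 32767]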

Related

Encoding DNA strand in Binary

Hey guys I have the following question:
Suppose we are working with strands of DNA, each strand consisting of
a sequence of 10 nucleotides. Each nucleotide can be any one of four
different types: A, G, T or C. How many bits does it take to encode a
DNA strand?
Here is my approach to it and I want to know if that is correct.
We have 10 spots. Each spot can have 4 different symbols. This means we need to represent 4^10 combinations using our binary digits.
4^10 = 1048576.
We will then find the log base 2 of that. What do you guys think of my approach?
Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.
It would be fewer bits of information if there were any combinations that couldn't appear, e.g. some codons (sequences of three base-pairs) that never appear. But ten independent 2-bit pieces of information sum to 20 bits.
If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file-format, but unlikely to be good in-memory when you're working with them.
Densely packing your data into an array of 2-bit fields makes it slower to access a single base-pair, but comparing the whole chunk for equality with another chunk is still efficient (e.g. with memcmp).
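As a rough illustration of that 2-bit packing idea, here is a sketch in R (the A/C/G/T-to-number mapping and the function name are arbitrary choices of mine); a 10-nucleotide strand fits comfortably in R's 32-bit integers:

# Map each nucleotide to a 2-bit code (the assignment is arbitrary).
code <- c(A = 0L, C = 1L, G = 2L, T = 3L)

# Pack a 10-nucleotide strand into a single integer using 20 bits.
pack_strand <- function(strand) {
  bits <- code[strsplit(strand, "")[[1]]]
  Reduce(function(acc, b) bitwOr(bitwShiftL(acc, 2L), b), bits, 0L)
}

pack_strand("ACGTACGTAC")
#> [1] 111025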
20 bits is unfortunately just slightly too large for a 16-bit integer (which computers are good at). Storing in an array of 32-bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24-bit zero-extended values is OK (do a 32-bit load and mask the high 8 bits; storing is even less convenient, though: probably a 16-bit store and an 8-bit store, or else load the old value, merge the high 8 bits, then do a 32-bit store. But that's not atomic.).
This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.
Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4-bit nibble.
I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.
Do the maths:
4^10 = (2^2)^10 = 2^20
Answer: 20 bits
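A quick check in R gives the same number:

log2(4^10)
#> [1] 20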

Why is Unix/Terminal faster than R?

I'm new to Unix; however, I have recently realized that very simple Unix commands can do very simple things to large data sets very, very quickly. My question is: why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all, basic R functions, like Unix commands, are written in low-level languages like C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it must first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate the data, both must obtain it from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose; I was hoping to discuss this idea in general rather than focus on a specific example.
Regardless, I'll generate an example of counting the number of rows.
First I'll generate a big data set.
row <- 1e7
col <- 50
df <- matrix(rpois(row * col, 1), row, col)
write.csv(df, "df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is: how is the I/O different between Unix and R, and why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
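As a sketch of that streaming style in R, here is a line counter that reads the df.csv from the question through a connection, a chunk of lines at a time, so memory use stays small no matter how big the file gets (the chunk size is arbitrary):

count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)  # read a batch of lines
    if (length(chunk) == 0L) break           # end of file
    n <- n + length(chunk)                   # keep only a running count
  }
  n
}

count_lines("df.csv")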
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially since the style of processing we're doing affects which way is most convenient.
1. Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
2. Unbuffered large, block-sized reads. We ask the operating system for big chunks of data -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next one.
3. Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than #2 and #3. (I've seen factors of 10 or 100.) But no well-written program uses #1, so we can pretty much forget about it. As long as you're using #2 or #3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little extra efficiency by using #2 instead of #3.)
Now let's talk about the way each program processes the data. wc has basically 5 steps:
1. Read characters one at a time. (I can assure you it uses method 3.)
2. For each character read, add one to the character count.
3. If the character read was a newline, add one to the line count.
4. Depending on whether the character read was a word-separator character, update the word count.
5. At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
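For a feel of what that kind of loop looks like, here is a rough R sketch of just the wc -l part, reading the file in block-sized chunks (in the spirit of style #2 above); the 8 KB block size is arbitrary and the df.csv file name is borrowed from the question:

con <- file("df.csv", open = "rb")
n_lines <- 0
repeat {
  block <- readBin(con, what = "raw", n = 8192)       # one block-sized read
  if (length(block) == 0) break
  n_lines <- n_lines + sum(block == charToRaw("\n"))  # count newline bytes
}
close(con)
n_lines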
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
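The same arithmetic, spelled out in R:

digits <- c(1, 2, 3, 4, 5)
sum(digits * 10^((length(digits) - 1):0))
#> [1] 12345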
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.
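In R terms, one way to avoid those false starts is to tell the reader the size and column types up front, so it can allocate once. A sketch, assuming the df.csv written in the question (which has a row-name column plus 50 integer columns):

dat <- read.csv("df.csv",
                nrows = 1e7,                                    # expected row count
                colClasses = c("character", rep("integer", 50)),
                row.names = 1)                                  # first column holds row names
nrow(dat)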

Finding similar hashes

I'm trying to find 2 different plain text words that create very similar hashes.
I'm using the hashing method 'whirlpool', but I don't really need my question to be answered in terms of Whirlpool; if you can use MD5 or something easier, that's OK.
The similarity I'm looking for is that the hashes contain the same characters, in the same quantities (it doesn't matter how jumbled up they are).
e.g.
plaintext 'test'
hash 1: abbb5 has 1 a, 3 b's, one 5
plaintext 'blahblah'
hash 2: b5bab must have the same, but it doesn't matter in what order.
I'm sure I could read up on how hashes are created, break one down, and reverse it, but I'm just wondering whether what I'm talking about occurs.
I'm wondering because I haven't found a match for what I'm describing (I created a PoC to run through random words/letters until it recreated a similar match), but then again, it would take forever doing it the way I was doing it. I was wondering if anyone with real knowledge of hashes/encryption could help me out.
So you can do it like this:
1. Create an empty sorted map.
2. Create a 64-bit counter (you don't need more than 2^63 inputs, in all probability, since you would be dead before they could all be calculated - unless quantum crypto really takes off).
3. Use the counter as input; it is probably easiest to encode it in 8 bytes.
4. Use this as input for your hash function.
5. Encode the output of the hash in hex (use ASCII bytes, for speed).
6. Sort the hex by number / alphabetically (same thing, really).
7. Check whether the sorted hex result is a key in the map.
8. If it is, show the hex result, the old counter from the map and the current counter (and stop).
9. If it isn't, put the sorted hex result in the map, with the counter as value.
10. Increase the counter and go to step 3.
That's all folks. Results for SHA-1:
011122344667788899999aaaabbbcccddeeeefff for both 320324 and 429678
I don't know why you want to do this with hex; the hashes will be so large that they won't look too much alike anyway. If your alphabet is smaller, your code will run (even) quicker. If you use whole output bytes (i.e. values 00 to FF instead of hex digits 0 to F), it will take much more time - a quick (non-optimized) test on my machine doesn't finish in minutes and then runs out of memory.
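For reference, here is a rough R translation of the steps above, assuming the digest package for SHA-1. Unlike the original (which hashes the counter encoded as 8 bytes), this sketch hashes the counter's decimal string, so the specific collision it finds, if any within the limit, will differ from the one reported above:

library(digest)                                   # assumed to be installed

find_similar_hashes <- function(limit = 1e6) {
  seen <- new.env(hash = TRUE)                    # the "map": sorted hex -> counter
  for (i in 0:limit) {
    h <- digest(as.character(i), algo = "sha1", serialize = FALSE)
    key <- paste(sort(strsplit(h, "")[[1]]), collapse = "")  # sort the hex digits
    if (exists(key, envir = seen, inherits = FALSE)) {
      return(list(hex = key, first = seen[[key]], second = i))
    }
    seen[[key]] <- i
  }
  NULL                                            # no collision found within the limit
}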

Checksum in network connection

I'm making a network application which doesn't send good data every time (most of the time the data is broken), so I thought I'd add a checksum. At the end of the data I will add a checksum to verify that the data is valid. I'm not sure whether it's a good idea to multiply every value (they range from 1 to 100) by 100, 100^2, 100^3, ..., and sum them.
Do you have any suggestions for what to do without producing a really big number (there are many values in every packet)?
Example:
Data: 1, 4, 2, 77, 12, 32, 5, 52, 23
My solution: 1, 4, 2, 77, 12, 32, 5, 52, 23, 100 + 40000 + 2000000 + 77*100^4 + ...
When the client receives the packet, it will check whether the last value equals the sum of the other values.
Is there any better solution?
Multiplying the data results in a very large number to transmit, and not a lot of confidence that the numbers are correct. And addition runs into potential overflow issues. That is why it is customary to use an XOR.
Or you can read up on http://en.wikipedia.org/wiki/Error-correcting_code to get even fancier solutions that can detect, and sometimes correct, small numbers of errors.
Best explanation here:
http://www.textfiles.com/programming/crc.txt
CRC functions will be available in your language's networking library.
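A minimal sketch of the XOR idea in R, using the example data from the question (the values are assumed to fit in a byte, and the packet layout is only illustrative):

data <- c(1, 4, 2, 77, 12, 32, 5, 52, 23)
checksum <- Reduce(bitwXor, as.integer(data))     # fold XOR over all values
packet <- c(data, checksum)                       # sender appends the checksum

received <- packet                                # receiver recomputes and compares
Reduce(bitwXor, as.integer(head(received, -1))) == tail(received, 1)
#> [1] TRUE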

Maximizing Stored Information (Entropy?)

So I'm not sure if this question belongs here or maybe Math overflow. In any case, my question is about information theory.
Let's say I have a 16-bit word. There are 65,536 unique configurations of 1's and 0's in that number. What each one of those configurations represents is unimportant, as depending on your notation (2's complement vs. signed magnitude, etc.) the same configuration can mean different things.
What I'm wondering is: are there any techniques to store more information than that in a 16-bit word?
My original ideas were along the lines of odd/even parity or something, but then I realized that's already determined by the configuration... i.e. there is no extra information encoded in that. I'm beginning to wonder if no such thing exists.
EDIT For example, let's say some magical computer (thinking quantum or something here) could understand 0, 1, a. Then obviously we have 3^16 configurations and can now store more than the numbers 0 through 65,535. Are there any other properties of a 16-bit word that you can mess with in order to encode extra information in your bit stream?
EDIT2 I am really struggling to put this into words. Right now, when I look at a 16-bit word in the computer, the property which conveys information to me is the relative ordering of the individual 1's and 0's. Is there another property or way of looking at a 16-bit word which would allow more than 2^16 unique "configurations"? (Note it would no longer be a configuration, but 2^16 xxxx's, where xxxx is a noun describing an instance of that property.) The only thing I can really think of is something like looking at the number of 1-to-0 transitions rather than at whether each bit is a 1 or 0. Now, transitions do not yield more than 2^16 combinations, because they are ultimately solely dependent on the configuration of 1's and 0's. I'm looking for properties that would derive from the configuration of 1's and 0's AND something else, thus resulting in MORE than 2^16. Does anyone even know what this would be called if it did exist?
EDIT3 OK, I got it. My question boils down to this: how do we prove that the configuration of 1's and 0's in a word completely defines it? I.e., how do we prove that you need no other information besides the bitmap to show equality between two 16-bit words?
FINAL EDIT
I have an example... If instead of looking at the presence of 1's and 0's we look at the transitions between bits, we can still only represent 2^16 symbols. If the bit to the left is the same, treat it as a 1; if it transitions, treat it as a 0. Using the 16-bit word as a circularly-linked-list type structure where each link represents a 0/1, we basically form a 16-bit word out of the transitions between bits. That is an exact example of what I was looking for, but it still results in 2^16, nothing better. I am convinced that you cannot do better and am marking the correct answer =(
The amount of information in a particular configuration of 16 0/1s is determined by the probability of this configuration (this is called self-information). This can be bigger than 16 bits if the configuration is less likely than 1/(2^16), but that means that some other configurations are more likely than 1/(2^16) and so will contain less information than 16 bits.
To take into account all the possible configurations, you have to use the expected value of self-information (called entropy) of individual configurations. This value will reach its maximum when the probabilities of all configurations are equal (that is 1/(2^16)) and then it will be exactly 16 bits.
So the answer is no, you cannot store more than 16 bits of information in 16 0/1s.
See
http://en.wikipedia.org/wiki/Information_theory
http://en.wikipedia.org/wiki/Self-information
EDIT It is important to realize that a bit does not stand for a 0 or 1; rather, it is a unit of information, namely -log_2 P(w), where P(w) is the probability of a particular configuration.
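As a quick numerical check of that statement, a sketch in R of the entropy of a uniform distribution over all 2^16 configurations:

p <- rep(1 / 2^16, 2^16)   # every configuration equally likely
-sum(p * log2(p))          # entropy in bits
#> [1] 16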
You cannot store more than 2 states in one digit of a semiconductor device. You answered it yourself. The only way more information can be fitted into 16 digits is if each digit has more possible values.

Resources