Related
I'm currently working a project where I use MARIE Assembly Language to create a version of Hamming code. My initial work has been to first have the user input the 8 bits for the data bits, and to then have the program output the correct 12-bit code word.
In order to find the correct values for the parity bits, I thought that the easiest way might be to load each data bit, which corresponds to a certain parity bit, and to count the number of 1s. However, I have yet to have found a way to have Marie count bits. I know there is no direct instruction to make Marie count bits, but does anyone have any advice on which instruction(s) might lead me towards an answer? I apologize in advance if clarity is lacking, and if it is please let me know so I can edit my question accordingly.
I read how sounds represented with numbers in computer here.
And I figured out that usual representation is that, we get 44,100 numbers between [-32767, 32767] per second.
Then to my imagination, there's got to be a big one-column matrix, right?
I'm a R user, so speaking in R, sound data of 3 seconds would be,
s <- 3
sound <- matrix(0, ncol = 1, nrow = 44100 * s)
nrow(sound)
#> [1] 132300
one-column matrix with 132,300 rows.
Is this really the case?
I want some analogous picture in my head, say, in case of a picture with 256 * 256,
if we RGB that picture, we get 3 matrices each with 256 * 256.
And in the case of sounds, we get a long long column? As I think about this again, it's not even a matrix after all. It's a column.
Am I right? I can't find any similar dataset searching Internet.
Any advices will be welcomed. Thanks.
The raw format that is created early in that linked question could look a lot like a single dimension array. And probably the signal that is sent to the speaker to make the sound could be represented similarly.
But you're unlikely to find a file on your computer that looks like that for several reasons:
Sound can be stored at different bit depth - that is how many bits for each 'number' CD Audio tracks have a 16 bit depth, but you could have 8 or 32 bits etc. In a straight stream of these numbers you need some how to know how far to read to the next number, so that information needs to be safed somewhere.
Sample rate can vary. If you've got a sequence of numbers representing an audio signal, then you need to know how long each number lasts for.
mostly sounds are more complex. Instead of a single source, you have stereo, or 5 channel, or whatever, so the system needs to be able to store / decode multiple pieces of information for the sounds you want to hear at a particular time
much of sound is repetitive, and so can often benefit from compression.
So most sounds are stored in a compressed format that includes wrapper information about how to decode it. The wrapper information includes how to decode the different audio channels, what sort of compression was used etc.
The closest you're likely to find are a .wav file (Windows) or .aiff (Mac). But even these include some metadata (sample rate and bit depth to start).
I'm new to Unix, however, I have recently realized that very simple Unix commands can do very simple things to large data set very very quickly. My question is why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all basic R functions, like Unix commands, are written in low-level languages like C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it most first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate data both most obtain the data from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose, I was hoping to discuss this idea in general, rather than focus on a specific example.
Regardless, I'll generate an example of counting the number of rows
First I'll generate a big data set.
row = 1e7
col = 50
df<-matrix(rpois(row*col,1),row,col)
write.csv(df,"df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is how is the I/O different for Unix and R. Why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially if the style of processing we're doing affects the way that's most convenient to do the reading.
Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
Unbuffered large, block-sized reads. We ask the operating for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)
Now let's talk about the way each program processes the data. wc has basically 5 steps:
Read characters one at a time. (I can assure you it uses method 3.)
For each character read, add one to the character count.
If the character read was a newline, add one to the line count.
If the character read was or wasn't a word-separator character, update the word count.
At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.
I'm more or less attempting to determine crypography algoirthms and how they work. I'm a little confused on proving how one is trivial.
For example:
MAC(xbit_key,Message) = xbit_hash(Message) XOR xbit_key
Take a look at this for a general explanation and that for a good example. If it's still not clear, come back with a more specific question.
I would like to think that some of the software I'm writing today will be used in 30 years. But I am also aware that a lot of it is based upon the UNIX tradition of exposing time as the number of seconds since 1970.
#include <stdio.h>
#include <time.h>
#include <limits.h>
void print(time_t rt) {
struct tm * t = gmtime(&rt);
puts(asctime(t));
}
int main() {
print(0);
print(time(0));
print(LONG_MAX);
print(LONG_MAX+1);
}
Execution results in:
Thu Jan 1 00:00:00 1970
Sat Aug 30 18:37:08 2008
Tue Jan 19 03:14:07 2038
Fri Dec 13 20:45:52 1901
The functions ctime(), gmtime(), and localtime() all take as an argument a time value representing the time in seconds since the Epoch (00:00:00 UTC, January 1, 1970; see time(3) ).
I wonder if there is anything proactive to do in this area as a programmer, or are we to trust that all software systems (aka Operating Systems) will some how be magically upgraded in the future?
Update It would seem that indeed 64-bit systems are safe from this:
import java.util.*;
class TimeTest {
public static void main(String[] args) {
print(0);
print(System.currentTimeMillis());
print(Long.MAX_VALUE);
print(Long.MAX_VALUE + 1);
}
static void print(long l) {
System.out.println(new Date(l));
}
}
Wed Dec 31 16:00:00 PST 1969
Sat Aug 30 12:02:40 PDT 2008
Sat Aug 16 23:12:55 PST 292278994
Sun Dec 02 08:47:04 PST 292269055
But what about the year 292278994?
I have written portable replacement for time.h (currently just localtime(), gmtime(), mktime() and timegm()) which uses 64 bit time even on 32 bit machines. It is intended to be dropped into C projects as a replacement for time.h. It is being used in Perl and I intend to fix Ruby and Python's 2038 problems with it as well. This gives you a safe range of +/- 292 million years.
You can find the code at the y2038 project. Please feel free to post any questions to the issue tracker.
As to the "this isn't going to be a problem for another 29 years", peruse this list of standard answers to that. In short, stuff happens in the future and sometimes you need to know when. I also have a presentation on the problem, what is not a solution, and what is.
Oh, and don't forget that many time systems don't handle dates before 1970. Stuff happened before 1970, sometimes you need to know when.
You can always implement RFC 2550 and be safe forever ;-)
The known universe has a finite past and future. The current age of
the universe is estimated in [Zebu] as between 10 ** 10 and 2 * 10 **
10 years. The death of the universe is estimated in [Nigel] to occur
in 10 ** 11 - years and in [Drake] as occurring either in 10 ** 12
years for a closed universe (the big crunch) or 10 ** 14 years for an
open universe (the heat death of the universe).
Y10K compliant programs MAY choose to limit the range of dates they
support to those consistent with the expected life of the universe.
Y10K compliant systems MUST accept Y10K dates from 10 ** 12 years in
the past to 10 ** 20 years into the future. Y10K compliant systems
SHOULD accept dates for at least 10 ** 29 years in the past and
future.
Visual Studio moved to a 64 bit representation of time_t in Visual Studio 2005 (whilst still leaving _time32_t for backwards compatibility).
As long as you are careful to always write code in terms of time_t and don't assume anything about the size then as sysrqb points out the problem will be solved by your compiler.
Given my age, I think I should pay a lot into my pension and pay of all my depts, so someone else will have to fit the software!
Sorry, if you think about the “net present value” of any software you write today, it has no effect what the software does in 2038. A “return on investment” of more than a few years is uncommon for any software project, so you make a lot more money for your employer by getting the software shipped quicker, rather than thinking that far ahead.
The only common exception is software that has to predict future, 2038 is already a problem for mortgage quotation systems.
I work in embedded and I thought I would post our solution here. Our systems are on 32 bits, and what we sell right now has a warantee of 30 years which means that they will encounter the year 2038 bug. Upgrading in the future was not a solution.
To fix this, we set the kernel date 28 years earlier that the current date. It's not a random offset, 28 years is excatly the time it will take for the days of the week to match again. For instance I'm writing this on a thursday and the next time march 7 will be a thursday is in 28 years.
Furthermore, all the applications that interact with dates on our systems will take the system date (time_t) convert it to a custom time64_t and apply the 28 years offset to the right date.
We made a custom library to handle this. The code we're using is based off this: https://github.com/android/platform_bionic
Thus, with this solution you can buy yourself an extra 28 years easily.
I think that we should leave the bug in. Then about 2036 we can start selling consultancy for large sums of money to test everything. After all isn't that how we successfully managed the 1999-2000 rollover.
I'm only joking!
I was sat in a bank in London in 1999 and was quite amazed when I saw a consultant start Y2K testing the coffee machine. I think if we learnt anything from that fiasco, it was that the vast majority of software will just work and most of the rest won't cause a melt down if it fails and can be fixed after the event if needed. As such, I wouldn't take any special precautions until much nearer the time, unless you are dealing with a very critical piece of software.
Keep good documentation, and include a description of your time dependencies. I don't think many people have thought about how hard this transition might be, for example HTTP cookies are going to break on that date.
What should we do to prepare for 2038?
Hide, because the apocalypse is coming.
But seriously, I hope that compilers (or the people who write them, to be precise) can handle this. They've got almost 30 years. I hope that's enough time.
At what point do we start preparing for Y10K? Have any hardware manufacturers / research labs looked into the easiest way to move to whatever new technology we'll have to have because of it?
By 2038, time libraries should all be using 64-bit integers, so this won't actually be that big of a deal (on software that isn't completely unmaintained).
COBOL programs might be fun though.
Operative word being "should".
If you need to ensure futureproofing then you can construct your own date/time class and use that but I'd only do that if you think that what you write will be used on legacy OS'