I'm new to Unix; however, I have recently realized that very simple Unix commands can do very simple things to a large data set very, very quickly. My question is: why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all, basic R functions, like Unix commands, are written in low-level languages such as C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it must first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate data, both must obtain the data from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose, I was hoping to discuss this idea in general, rather than focus on a specific example.
Regardless, I'll generate an example: counting the number of rows of a large CSV file.
First I'll generate a big data set.
row <- 1e7
col <- 50
df <- matrix(rpois(row * col, 1), row, col)
write.csv(df, "df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed (real) > user + system. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is: how is the I/O different between Unix and R, and why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
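To make the streaming idea concrete in R terms, here is a minimal sketch of my own (not anything your scripts necessarily do) that counts rows of a hypothetical "big.csv" in "streaming mode": only one small batch of lines is ever held in memory at a time.

count_rows_streaming <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    batch <- readLines(con, n = chunk_size)   # read a small batch of lines
    if (length(batch) == 0L) break            # end of file
    n <- n + length(batch)
  }
  n                                           # note: includes any header line
}
# count_rows_streaming("big.csv")   # compare with nrow(read.csv("big.csv"))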
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially since the style of processing we're doing affects which way is most convenient.
1. Unbuffered, small, random reads. We ask the operating system for one or a few characters at a time, and process them as we read them.
2. Unbuffered, large, block-sized reads. We ask the operating system for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
3. Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than #2 and #3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using #2 or #3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can squeeze out a little extra efficiency by using #2 instead of #3.)
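If you want to feel the difference between tiny reads and block-sized reads for yourself, here is a rough sketch in R (my own, and only an approximation: R's connections add their own buffering and interpreter overhead, so this exaggerates the effect, but the shape is the same). The file name "big.csv" is hypothetical.

bytes_one_at_a_time <- function(path) {                 # roughly method 1
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  while (length(readBin(con, what = "raw", n = 1L)) == 1L) n <- n + 1
  n
}

bytes_in_blocks <- function(path, block = 65536L) {     # roughly methods 2/3
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = block)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# system.time(bytes_one_at_a_time("big.csv"))   # drastically slower
# system.time(bytes_in_blocks("big.csv"))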
Now let's talk about the way each program processes the data. wc has basically 5 steps:
1. Read characters one at a time. (I can assure you it uses method 3.)
2. For each character read, add one to the character count.
3. If the character read was a newline, add one to the line count.
4. If the character read starts a new word (a non-separator character following a word-separator character), add one to the word count.
5. At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
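To show just how little work that is, here is a hedged sketch of the wc -l / wc -c part of that loop, written in R on top of block-sized reads (word counting, step 4, is left out, and "big.csv" is again a hypothetical file name):

wc_like <- function(path, block = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  chars <- 0; lines <- 0
  newline <- as.raw(10L)                       # the '\n' byte
  repeat {
    chunk <- readBin(con, what = "raw", n = block)
    if (length(chunk) == 0L) break
    chars <- chars + length(chunk)             # step 2: character count
    lines <- lines + sum(chunk == newline)     # step 3: line count
  }
  c(lines = lines, chars = chars)              # step 5: report
}
# wc_like("big.csv")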
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
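You can feel that conversion cost directly with a tiny experiment of my own (nothing from your code):

txt  <- as.character(sample.int(1e7, 1e6, replace = TRUE))  # a million numbers as text
ints <- as.integer(txt)                                     # the same numbers, already parsed
system.time(for (i in 1:20) as.integer(txt))   # repeated text -> integer parsing
system.time(for (i in 1:20) ints + 1L)         # arithmetic on parsed numbers: far cheaper
# as.integer("12345") is doing exactly the positional arithmetic shown above.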
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.
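For what it's worth, here is a hedged sketch of what "telling R the shape up front" might look like with the base reader. It assumes df.csv was produced by the write.csv() call in the question, so it has a header row and a leading row-name column; giving read.csv the row count and column types means it doesn't have to guess types or grow its structures as it reads.

col_types <- c("character", rep("integer", 50))   # row names, then 50 integer columns
df2 <- read.csv("df.csv", nrows = 1e7, colClasses = col_types)
nrow(df2)

data.table::fread() already applies this kind of trick internally, which is part of why it is so much faster than a plain read.csv() with no hints.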
Related
I've noticed that in many games there are a lot of errors involving the number 536870916. For example, in one game that's coded in Lua, the maximum damage you can deal to an enemy is 536870916, which is undocumented. I noticed other errors involving this number when I googled it, for example:
“Random crash "Failed to allocate 536870916 bytes and will now terminate"”
Does anyone happen to know why this is?
There's nothing all that special about 536870916. It just happens to be very close to a power of 2: 2^29 = 536870912.
536870912 bytes is 512MiB, or 0.5GiB. It's a reasonable memory limit to configure for an application, so numbers going slightly above it are bound to appear in crash reports.
If you search numbers 536870912-536870916 on Google you'll see a diminishing number of results:
536870912: 47,500,000 results
536870913: 7,920,000 results
536870914: 36,300 results
536870915: 7,720 results
536870916: 8,380 results
Another source where you might see 536870916 is when numbers are used as bit sets to store flags. Sometimes error codes are stored like this. In binary, 536870916 has only two bits set (2^29 and 2^2), which makes it look like a combination of two flags.
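Both points are easy to check, for example in R:

2^29                   # 536870912
536870912 / 1024^2     # 512, i.e. 512 MiB
n <- 536870916L
which(bitwAnd(n, bitwShiftL(1L, 0:30)) != 0L) - 1L   # bits set: positions 2 and 29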
I read about how sounds are represented with numbers in a computer here.
And I figured out that the usual representation is that we get 44,100 numbers in the range [-32767, 32767] per second.
So, as I imagine it, there's got to be a big one-column matrix, right?
I'm an R user, so speaking in R, the sound data for 3 seconds would be:
s <- 3
sound <- matrix(0, ncol = 1, nrow = 44100 * s)
nrow(sound)
#> [1] 132300
one-column matrix with 132,300 rows.
Is this really the case?
I want some analogous picture in my head. Say, in the case of a 256 x 256 picture,
if that picture is RGB, we get 3 matrices, each 256 x 256.
And in the case of sound, we get one long, long column? As I think about this again, it's not even a matrix after all. It's just a column.
Am I right? I can't find any similar dataset searching the Internet.
Any advice is welcome. Thanks.
The raw format that is created early in that linked question could look a lot like a one-dimensional array. And probably the signal that is sent to the speaker to make the sound could be represented similarly.
But you're unlikely to find a file on your computer that looks like that for several reasons:
Sound can be stored at different bit depths, that is, how many bits are used for each 'number'. CD audio tracks have a 16-bit depth, but you could have 8 or 32 bits, etc. In a straight stream of these numbers you need some way to know how far to read to get to the next number, so that information needs to be saved somewhere.
Sample rate can vary. If you've got a sequence of numbers representing an audio signal, then you need to know how long each number lasts for.
Most sounds are more complex. Instead of a single source, you have stereo, or 5-channel, or whatever, so the system needs to be able to store / decode multiple pieces of information for the sounds you want to hear at a particular time.
Much audio is repetitive, and so can often benefit from compression.
So most sounds are stored in a compressed format that includes wrapper information about how to decode it. The wrapper information includes how to decode the different audio channels, what sort of compression was used etc.
The closest you're likely to find is a .wav file (Windows) or .aiff (Mac). But even these include some metadata (sample rate and bit depth, to start).
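If you want to poke at this from R, one way is sketched below. This assumes the tuneR package (my choice; any WAV reader would do) and a hypothetical file name.

# install.packages("tuneR")
library(tuneR)
w <- readWave("some_sound.wav")   # hypothetical file name
w@samp.rate                       # e.g. 44100 samples per second
w@bit                             # bit depth, e.g. 16
length(w@left)                    # your long column of numbers (left channel)
range(w@left)                     # roughly within [-32768, 32767] for 16-bit audio
# A stereo file has a second, equally long column in w@right.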
I'm running distributed MPI programs on clusters using multiple nodes, where I make use of the MPI FFTs of FFTW. To save time I reuse wisdom from one run to the next. To generate this wisdom, FFTW experiments with a lot of different algorithms and whatnot for the given problem. I am worried that, because I am working on a cluster, the best solution stored as wisdom for one set of CPUs/nodes may not be the best solution for some other set of CPUs/nodes performing the same task, and so I should not reuse wisdom unless I am running on exactly the same CPUs/nodes as the run where the wisdom was gathered.
Is this correct, or is the wisdom somehow completely indifferent to the physical hardware on which it is generated?
If your cluster is homogeneous, the saved FFTW plans likely make sense, though the way the processes are connected may affect the optimal plans for MPI-related operations. But if your cluster is not homogeneous, saving the FFTW plan can be suboptimal, and problems related to load balance could prove hard to solve.
Taking a look at the wisdom files produced by fftw and fftw_mpi for a 2D c2c transform, I can see additional lines likely related to phases like transposition, where MPI communications are required, such as:
(fftw_mpi_transpose_pairwise_register 0 #x1040 #x1040 #x0 #x394c59f5 #xf7d5729e #xe8cf4383 #xce624769)
Indeed, there are different algorithms for transposing the 2D (or 3D) array: in the mpi folder of the FFTW source, the files transpose-pairwise.c, transpose-alltoall.c and transpose-recurse.c implement these algorithms. When the flags FFTW_MEASURE or FFTW_EXHAUSTIVE are set, these algorithms are run to select the fastest, as stated here. The result might depend on the topology of the network of processes (how many processes are on each node? how are the nodes connected?). If the optimal plan depends on where the processes are running and on the network topology, the wisdom utility will not be decisive. Otherwise, using the wisdom feature can save some time as the plan is built.
To test whether the optimal plan changes, you can perform a couple of runs and save the resulting plan in files: a reproducibility test!
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wisdommpi.txt");
/* save the plan on each process ! Depending on the file system of the cluster, performing communications can be required */
char filename[42];
sprintf(filename, "wisdom%d.txt",rank);
fftw_export_wisdom_to_filename(filename);
Finally, to compare the produced wisdom files, try in a bash script:
for filename in wis*.txt; do
  for filename2 in wis*.txt; do
    echo "."
    # grep -Fqvf "$filename" "$filename2" succeeds when some line of
    # $filename2 does not appear anywhere in $filename
    if grep -Fqvf "$filename" "$filename2"; then
      echo "$filename"
      echo "$filename2"
      echo "There are lines in $filename2 that do not occur in $filename."
    fi
  done
done
This script checks that all lines of each file are also present in the other files, following "Check if all lines from one file are present somewhere in another file".
On my personal computer, using mpirun -np 4 main, all wisdom files are identical except for a permutation of lines.
If the files are different from one run to another, it could be attributed to the communication pattern between processes... or to the sequential performance of the DFT on each process. The piece of code above saves the optimal plan for each process. If lines related to sequential operations, without fftw_mpi in them, such as:
(fftw_codelet_n1fv_10_sse2 0 #x1440 #x1440 #x0 #xa9be7eee #x53354c26 #xc32b0044 #xb92f3bfd)
become different, it is a clue that the optimal sequential algorithm changes from one process to another. In this case, the wall-clock time of the sequential operations may also differ from one process to another. Hence, checking the load balance between processes could be instructive. As noted in the FFTW documentation about load balancing:
Load balancing is especially difficult when you are parallelizing over heterogeneous machines; ... FFTW does not deal with this problem, however—it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.
This assumption is consistent with the operation performed by fftw_mpi_gather_wisdom();
(If the plans created for the same problem by different processes are not the same, fftw_mpi_gather_wisdom will arbitrarily choose one of the plans.) Both of these functions may result in suboptimal plans for different processes if the processes are running on non-identical hardware...
The transpose operation in 2D and 3D FFTs requires a lot of communication: one of the implementations is a call to MPI_Alltoall involving almost the whole array. Hence, good connectivity between nodes (InfiniBand, ...) can prove useful.
Let us know if you found different optimal plans from one run to another and how these plans differ!
There are a lot of discussions on the web on the topic of sorting huge files on Unix when the data will not fit into memory. Generally using mergesort and variants.
However, suppose there was enough memory to fit the entire data set in it: what would be the most efficient / fastest way of sorting? The CSV files are ~50 GB (> 1 billion rows) and there is enough memory (5x the size of the data) to hold the entire data set.
I can use Unix sort, but that still takes > 1 hr. I can use any language necessary, but what I am primarily looking for is speed. I understand we can load the data into, say, a columnar-type DB table and sort it, but it's a one-time effort, so I'm looking for something more nimble...
Thanks in advance.
Use parallel sorting algorithms for huge data.
Useful topic:
Which parallel sorting algorithm has the best average case performance?
What about quicksort? Did you try it? std::sort is usually implemented as introsort (a quicksort that switches to heapsort if quicksort's performance would degrade), so you can try that. Quicksort is usually the fastest option: its worst-case complexity is O(n^2), but in typical cases it beats the other comparison-based sorts.
The space complexity of quicksort should not be too bad, it requires log2(N) stack space, which is around 30 stack frames for 1 billion items.
However, it is an unstable sorting algorithm (the order of "equal" items is not preserved), so it depends on whether you are OK with that.
By the way, Unix sort seems to be implemented with merge sort, which usually isn't the fastest option for an in-RAM sort.
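If you are open to doing it in R (a sketch under that assumption, nothing more), the data.table package keeps everything in memory, reads and writes CSV with multiple threads, and sorts with a parallel radix sort. The file names below are hypothetical.

library(data.table)
setDTthreads(0L)                     # use all available cores
dt <- fread("big.csv")               # multi-threaded CSV reader
setorderv(dt, names(dt)[2L])         # in-place parallel radix sort on the second column
fwrite(dt, "big.sorted.csv")         # multi-threaded CSV writer

Whether that beats a well-tuned GNU sort --parallel on your hardware is something you would have to measure.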
I know this is old but I figure I'd chime in with what I just figured out in hopes that it may help someone else in the future.
GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores and a lot of RAM, and when you pass some special flags to GNU sort it becomes extremely fast.
* Pay close attention to the --buffer-size flag; the buffer size is the main reason for this speed-up. I've used the --parallel flag before and it wasn't as fast by itself.
sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.csv $file
I used a for loop to handle all the files in the folder, sorting huge CSV files by the second key, with a comma delimiter, keeping only unique values, with the following results:
for file in $(ls -p | grep -v -E "[0-4/]");
do
time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file;
done
real 0m36.041s
user 1m53.291s
sys 0m32.007s
real 0m52.449s
user 1m52.812s
sys 0m38.202s
real 0m50.133s
user 1m41.124s
sys 0m38.595s
real 0m41.948s
user 1m41.080s
sys 0m35.949s
real 0m47.387s
user 1m39.998s
sys 0m34.076s
The input files are 5.5 GB with ~75,000,000 rows each. The max memory usage I saw while a sort was taking place was a little less than 20 GB. So if it scales proportionally, then a 50 GB file should take a little less than 200 GB of memory. It sorted 27.55 GB of data in under 9 minutes!
I made this function in Octave which plots fractals. Now, it takes a long time to plot all the points I've calculated. I've made my function as efficient as possible; the only way I think I can make it plot faster is to have my CPU focus completely on the function, or to tell it somehow that it should focus on my plot.
Is there a way I can do this or is this really the limit?
To determine how much CPU is being consumed by your plot, run your plot and, in a separate window (assuming you're on Linux/Unix), run the top command. (For Windows, launch the Task Manager, switch to the 'Processes' tab, and click on the CPU header to sort by CPU.)
(The rollover description for Octave on the tag on your question says that Octave is a scripting language. I would expect it's calling gnuplot to create the plots. Look for this as the highest CPU consumer).
You should see that your Octave/gnuplot cmd is near the top of the list, and for top there is a column labeled %CPU (or similar). This will show you how much CPU that process is consuming.
I would expect to see that the process is consuming 95% or more CPU. If you see a significantly lower number, then you need to check the processes below it: are they consuming the remaining CPU (some sort of virus scan (on a PC), or a DB or server)? If a competing program is the problem, then you'll have to decide whether you can wait until it/they are finished, OR whether you can kill them and restart later. (For Linux, use kill -15 pid (SIGTERM) first, and only use kill -9 pid as a last resort. Search here for articles on the correct order of signals to try.)
If there are no competing processes AND octave/gnuplot is using less than 95%, then you'll have to find other tools to see what is holding up the process. (This is unlikely; it's possible some part of your overall plotting process is either disk I/O or network I/O bound.)
So, it depends on the timescale you're currently experiencing versus the time you "want" to experience.
Does your system have multiple CPUs? Then you'll need to study the octave/gnuplot documentation to see if it supports a switch to indicate "use $n available CPUs for processing". (Or find a plotting program that does support using $n multiple CPUs).
Realistically, if your process now takes 10 minutes, and by eliminating competing processes you can go from 60% to 90% CPU, that is a 50% increase in CPU, but it will only bring the run down to roughly 6-7 minutes. Being able to divide the task over 5-10-?? CPUs will be the most certain path to faster turn-around times.
So, to go further with this, you'll need to edit your question with some data points. How long is your plot taking? How big is the file it's processing? Is there something especially math-intensive about the plotting you're doing? Could a pre-processed data file speed up the calculations? Also, if the results of top don't show gnuplot running at 99% CPU, then edit your posting to show the top output; that will help us understand your problem. (Paste in your top output, select it with your mouse, and then use the formatting tool {} at the top of the input box to keep the formatting and avoid having the output wrap in your posting.)
IHTH.
P.S. Note the number of followers for each of the tags you've assigned to your question by rolling over them. You might get more useful "eyes" on your question by including a tag for the OS you're using, and a tag related to performance measurement/testing. (Go to the tags tab and type in various terms to see how many followers they have. One bit of S.O. etiquette is to only specify one programming language (if appropriate), and that may apply to OSes too.)