I have a general question about MPI usage.
Suppose you want to write data to files, for example 100 files in total.
Do you prefer to have each rank (core) write one file, or do you prefer to write them serially, one by one? Which is faster?
If you have every rank write to one shared file, there will be a race condition, so all writing must be done atomically (effectively serialized). This gives a much slower overall write time than having each rank write to its own file.
However, with one file per rank you would then need some sort of reduce step that joins the separate files. You would have to weigh the communication/joining overhead against the wait time of writing serially (unless you can leave the files in parts).
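To make the trade-off concrete, here is a minimal sketch of the file-per-rank approach with a join step on rank 0. It uses mpi4py purely for illustration (the same pattern applies in C or Fortran), and the file names are arbitrary:

# Each rank writes its own part file (no race condition), then rank 0
# optionally joins the parts into a single file.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = f"output of rank {rank}\n"           # whatever this rank produced

with open(f"part_{rank}.dat", "w") as f:    # one file per rank
    f.write(data)

comm.Barrier()                              # wait until all part files exist

if rank == 0:                               # the "join" (reduce) step
    with open("joined.dat", "w") as out:
        for r in range(size):
            with open(f"part_{r}.dat") as part:
                out.write(part.read())

Whether the join step is worth it depends on how large the parts are; if downstream tools can consume the per-rank files directly, skipping it avoids the extra read/write pass entirely.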
Related
I have an Apache Beam based Dataflow job that reads from a single text file (stored in Google Cloud Storage) using the VCF source, transforms the text lines into Datastore entities and writes them to the Datastore sink. The workflow works fine, but the drawbacks I noticed are:
The write speed into datastore is at most around 25-30 entities per second.
I tried --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker: the target number of workers briefly increased to 2 but was reduced back to 1 "based on the ability to parallelize the work in the currently running step".
I did not use ancestor path for the keys; all the entities are the same kind.
The pipeline code looks like this:
def write_to_datastore(project, user_options, pipeline_options):
  """Creates a pipeline that writes entities to Cloud Datastore."""
  with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
     | 'Create my entity' >> beam.ParDo(
         ToEntityFn(), user_options.kind)
     | 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into the datastore, it would take too long to write with a speed of 30 entities/sec.
Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the import faster? Am I missing something in the num_workers setup? Thanks!
I'm not familiar with Apache Beam, so this answer is from a general flow perspective.
Assuming there are no dependencies to be considered between entity data in various input file sections then yes, working with multiple input files should definitely help as all these files could then be processed virtually in parallel (depending, of course, on the max number of available workers).
You might not need to split the huge gzipped file beforehand; it might be possible to simply hand off segments of the single input data stream to separate data segment workers for writing, if the overhead of such a handoff is negligible compared to the actual data segment processing.
The overall performance limitation would then be the speed of reading the input data, splitting it into segments and handing them off to the segment data workers.
A data segment worker would further split the data segment it receives into smaller chunks of up to 500 entities each, the maximum that can be converted to entities and written to the Datastore in a single batch operation. Depending on the Datastore client library used, it may be possible to perform this operation asynchronously, allowing the splitting into chunks and conversion to entities to continue without waiting for the previous Datastore writes to complete.
The performance limitation at the data segment worker would then be the speed at which the data segment can be split into chunks and the chunks converted to entities.
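As an illustration of the chunking idea outside of Beam, here is a sketch using the google-cloud-datastore client library; the project id, kind name and the assumption that each row is a dict of properties are all hypothetical:

# Split a segment of rows into chunks of at most 500 entities and write
# each chunk with a single batch call (500 is Datastore's per-commit limit).
from google.cloud import datastore

def batches(items, size=500):
    for i in range(0, len(items), size):
        yield items[i:i + size]

client = datastore.Client(project="my-project")      # hypothetical project

def write_segment(rows, kind="MyKind"):               # hypothetical kind
    for chunk in batches(rows):
        entities = []
        for row in chunk:
            entity = datastore.Entity(key=client.key(kind))
            entity.update(row)                        # row assumed to be a dict
            entities.append(entity)
        client.put_multi(entities)                    # one batch write per chunk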
If async ops aren't available, or for even higher throughput, yet another handoff of each chunk to a chunk worker could be performed, with the chunk worker performing the conversion to entities and the Datastore batch write.
The performance limitation at the data segment worker level would then be just the speed at which the data segment can be split into chunks and handed over to the chunk workers.
With such an approach, the actual conversion to entities and batch-writing them to the Datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation in your current approach.
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is the limit of _VcfSource combined with a VCF format constraint: the format has a header part that defines how to translate the non-header lines, so each worker that reads the source has to work on an entire file. When I split the single file into 5 separate files that share the same header, I successfully get up to 5 workers (but no more, probably for the same reason).
One thing I don't understand is that the number of workers that read can be limited to 5 (in this case), but why are we then limited to only 5 workers for writing? Anyway, I think I have found an alternative way to trigger multiple workers with the beam Dataflow-Runner (use pre-split VCF files). There is also a related approach in the gcp-variant-transforms project, in which vcfio has been significantly extended; it seems to support multiple workers with a single input VCF file. I wish the changes in that project could be merged into the beam project too.
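For reference, a pre-splitting step like the one described above could look roughly like this; it assumes an uncompressed VCF where every header line starts with '#', and the shard file names are arbitrary:

# Split one VCF into n shards that all repeat the same header, so each
# shard is a valid, independently readable VCF file.
def split_vcf(path, n_shards=5):
    header, records = [], []
    with open(path) as f:
        for line in f:
            (header if line.startswith("#") else records).append(line)

    for i in range(n_shards):
        with open(f"{path}.shard{i}.vcf", "w") as out:
            out.writelines(header)
            out.writelines(records[i::n_shards])      # round-robin the records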
I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, SQLite use fancy data structures like B+ trees and LSMs but seem to get roughly the same performance as this simple solution.
For example, Badger (which is the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes about 150 microseconds with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state-of-the-art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array in that file. Reads and writes are ~150 microseconds each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk-space usage is minimal, since we don't store any extra structure besides the actual values.
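For concreteness, this is roughly the naive store I have in mind (a sketch, not tuned; the hashed file names and flat directory layout are arbitrary choices, and there is no fsync or batching):

# Naive file-per-key store: hash the key to a file name and store the
# value as that file's contents.
import hashlib
import os

class FileKV:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: bytes) -> str:
        return os.path.join(self.root, hashlib.sha256(key).hexdigest())

    def put(self, key: bytes, value: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(value)

    def get(self, key: bytes) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()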
I must be missing something here, because the solutions people actually use are super fancy and optimized using things like bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main points (hard to replicate with a simple read/write of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range get returned, leverages sequential reads (due to the adjacency of sorted keys within files) to read data at very high speeds. For Badger, you mostly get that benefit when doing key-only or small-value range queries, since those are stored in the LSM tree while large values are appended to a not-necessarily-sorted value log. For RocksDB, you'll get fast key-value pair range queries.
The previous answer somewhat addresses the advantage on writes - the use of buffering. If you write many kv pairs, rather than storing each in separate files, LSM approaches hold these in memory and eventually flush them in a file write. There’s no free lunch so asynchronous compaction must be done to remove overwritten data and prevent checking too many files for queries.
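To make the buffering point concrete, here is a toy sketch (not how Badger or RocksDB are actually implemented) of a memtable that absorbs many writes in memory and flushes them to a sorted file in one go; the file naming and text format are purely illustrative:

# Toy memtable: writes go into an in-memory dict and are flushed to a
# sorted file once the buffer is large enough, so many keys share a
# single file write instead of one file per key.
class ToyMemtable:
    def __init__(self, flush_threshold=1000):
        self.buf = {}
        self.flush_threshold = flush_threshold
        self.flushed = 0

    def put(self, key: str, value: bytes) -> None:
        self.buf[key] = value                        # overwrite in memory
        if len(self.buf) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        path = f"sst_{self.flushed:06d}.txt"
        with open(path, "w") as f:
            for k in sorted(self.buf):               # sorted for later lookups
                f.write(f"{k}\t{self.buf[k].hex()}\n")
        self.flushed += 1
        self.buf.clear()

Real LSM engines add a write-ahead log for durability and background compaction to merge these files, which is exactly the cost discussed above.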
Previously answered here. Mostly similar to the other answers provided here, but it makes one important additional point: files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than the typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead. For example, a 100-byte value stored as its own file still consumes a whole 4 KiB block, roughly a 40x overhead.
I have 2 processors (this is just an example), and I want these 2 processors to write to a file. I want them to write at the end of the file, but not in an interleaved pattern like this:
[file content]
proc0
proc1
proc0
proc1
proc0
proc1
(and so on..)
I'd like to make them write following this kind of pattern:
[file content]
proc0
proc0
proc0
proc1
proc1
proc1
(and so on..)
Is it possible? If so, what's the setting to use?
The sequence in which your processes have outputs ready to report is, essentially, unknowable in advance. Even repeated runs of exactly the same MPI program will show differences in the ordering of outputs. So something, somewhere, is going to have to impose an ordering on the writes to the file.
A very common pattern, the one Wesley has already mentioned, is to have all processes send their outputs to one process, often process 0, and let it deal with writing to the file. This master-writer could sort the outputs before writing, but this creates a couple of problems: allocating space to store output before writing it and, more difficult to deal with, determining when a collection of output records can be sorted, written to file, and the output buffers reused. How long does the master-writer wait, and how does it know if a process is still working?
So it's common to have the master-writer write outputs as it gets them and for another program to order the output file as desired after the parallel program has finished. You could tack this on to your parallel program as a step after mpi_finalize or you could use a completely separate program (such as sort on a Linux machine). Of course, for this to work each output record has to contain some sequencing information on which to sort.
Another common pattern is to only have one process which does any writing at all, that is, none of the other processes do any output at all. This completely avoids the non-determinism of the sequencing of the writing.
Another pattern, less common partly because it is more difficult to implement and partly because it depends on underlying mechanisms which are not always available, is to use MPI-IO. With MPI-IO, multiple processes can write to different parts of a file effectively simultaneously. To actually write simultaneously, the program needs to be executing on hardware, a network and an operating system that support parallel I/O. It can be tricky to implement even with the right platform, especially when the volume of output from processes is uncertain.
In my experience here on SO people asking question such as yours are probably at too early a stage in their MPI experience to be tackling parallel i/o, even if they have access to the necessary hardware.
I disagree with High Performance Mark. MPI-IO isn't so tricky in 2014 (as long as you have access to any file system besides NFS -- install PVFS if you need a cheap, easy parallel file system).
If you know how much data each process has, you can use MPI_SCAN to efficiently compute how much data was written by "earlier" processes, then use MPI_FILE_WRITE_AT_ALL to carry out the I/O efficiently. Here's one way you might do this:
/* incr and new_offset are MPI_Offset (or long long) values */
incr = (count * datatype_size);
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
         MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr;   /* MPI_Scan is inclusive: subtract this rank's own
                         contribution to get the offset where it should write */
MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
                      datatype, status);
The answer to your question is no. If you do things that way, you'll end up with jumbled output from all over the place.
However, you can get the same thing by sending your output to a single processor having it do all of the writing itself. For example, at the end of your application, just have everything send to rank 0 and have rank 0 write it all to a file.
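A minimal sketch of that pattern using mpi4py (the same idea applies in C or Fortran; the placeholder output lines just mirror the desired layout above):

# Every rank sends its lines to rank 0, which writes them in rank order,
# so all of proc0's output comes before proc1's, and so on.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

my_lines = [f"proc{rank}\n" for _ in range(3)]       # placeholder output

all_lines = comm.gather(my_lines, root=0)            # list ordered by rank

if rank == 0:
    with open("output.txt", "a") as f:               # append at the end of the file
        for lines in all_lines:                      # rank 0 first, then 1, ...
            f.writelines(lines)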
Preamble:
I have a very large one-dimensional array and need to solve an evolution equation (a wave-like equation). I need to calculate the integral at each value of this array, store the resulting array of integrals, and apply the integration again to this array, and so on (in simple words, I apply the integral on a grid of values, store this new grid, apply the integration again, and so on).
I used MPI-IO to spread the work over all nodes: there is a shared .dat file on disk, each MPI copy reads this file (as a source for integration), performs the integration and writes to this shared file again. This procedure repeats again and again, and it works fine. The most time-consuming part was the integration; the file reading and writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16x64) HPC cluster and I'm facing the opposite problem: the calculation time is negligible compared to the reading and writing!
I tried to reduce the number of MPI processes: I use only 16 MPI processes (to spread over the nodes) plus 64 OpenMP threads to parallelize the computation inside each node.
Reading and writing is still the most time-consuming part.
Question
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of having every rank read and write the file, I can ask rank 0 (the master rank) to send/receive the entire array to/from all nodes (MPI_Bcast). So instead of every node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.
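A rough sketch of that decomposition using mpi4py and NumPy, with one ghost (overlap) cell per side; the local array size, neighbour logic and loop body are illustrative only:

# 1D domain decomposition: each rank keeps only its slice plus ghost
# cells and exchanges the boundaries with its neighbours, instead of
# re-reading the whole array from a shared file at every step.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 1000                                   # interior points per rank
u = np.zeros(n_local + 2)                        # +2 ghost cells

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

def exchange_ghosts(u):
    # my leftmost interior cell goes to the left neighbour; my right ghost
    # cell is filled from the right neighbour
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    # my rightmost interior cell goes to the right neighbour; my left ghost
    # cell is filled from the left neighbour
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

for step in range(100):                          # evolution loop (illustrative)
    exchange_ghosts(u)
    # ... update u[1:-1] here; write checkpoints with MPI-IO only occasionally ...

With this layout each rank touches the file system only for occasional checkpoints, not at every step.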
There are two really simple ways to let one program send a stream of data to another:
A Unix pipe, a TCP socket, or something like that. This requires constant attention by the consumer program, or the producer program will block. Even after increasing the buffers beyond their typically tiny defaults, it's still a huge problem.
Plain files - the producer appends with O_APPEND, the consumer just reads whatever new data becomes available at its convenience. This doesn't require any synchronization (as long as disk space is available), but Unix files only support truncating at the end, not at the beginning, so this will fill up the disk until both programs quit.
Is there a simple way to have it both ways, with data stored on disk until it gets read, and then freed? Obviously the programs could communicate via a database server or something like that and not have this problem, but I'm looking for something that integrates well with normal Unix piping.
A relatively simple hand-rolled solution:
You could have the producer create files and keep writing until it reaches a certain size or number of records, whatever suits your application. The producer then closes the file and starts a new one following an agreed naming scheme.
The consumer reads new records from a file; when the file reaches the agreed maximum size, the consumer closes and unlinks it and then opens the next one.
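A bare-bones sketch of that scheme (the naming convention, size limit and polling interval are arbitrary, and one simple completeness rule is assumed: a spool file is only consumed once the producer has moved on to the next one):

# Producer fills spool.N until a size limit, then rotates to spool.N+1.
# Consumer reads whole spool files in order and unlinks each one after
# consuming it, so disk space is reclaimed as the data flows through.
import os
import time

MAX_BYTES = 1 << 20                              # rotate after ~1 MiB

def produce(records):
    n, written = 0, 0
    out = open(f"spool.{n}", "ab")
    for rec in records:                          # rec is a bytes record
        out.write(rec)
        written += len(rec)
        if written >= MAX_BYTES:
            out.close()
            n, written = n + 1, 0
            out = open(f"spool.{n}", "ab")
    out.close()

def consume(handle):
    n = 0
    while True:
        path = f"spool.{n}"
        if os.path.exists(path) and os.path.exists(f"spool.{n + 1}"):
            with open(path, "rb") as f:          # file is complete, safe to read
                handle(f.read())
            os.unlink(path)                      # free the disk space
            n += 1
        else:
            time.sleep(0.1)                      # wait for the producer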
If your data can be split into blocks or transactions of some sort, you can use the file method for this with a serial number. The data producer would store the first megabyte of data in outfile.1, the next in outfile.2 etc. The consumer can read the files in order and delete them when read. Thus you get something like your second method, with cleanup along the way.
You should probably wrap all this in a library, so that from the application's point of view it behaves like a pipe of some sort.
You should read some documentation on socat. You can use it to bridge the gap between tcp sockets, fifo files, pipes, stdio and others.
If you're feeling lazy, there are some nice examples of useful commands.
I'm not aware of anything, but it shouldn't be too hard to write a small utility that takes a directory as an argument (or uses $TMPDIR) and uses select/poll to multiplex between reading from stdin, paging to a series of temporary files, and writing to stdout.