Classic file system problem - concurrent remote processing on a directory - networking

I have an application that processes files in a directory and moves them to another directory along with the processed output. Nothing special about that. An interesting requirement was introduced:
Implement fault tolerance and processing throughput by allowing multiple remote instances to work on the same file store.
An additional consideration is that we cannot assume a particular file system, as we support both Windows shares and NFS.
Of course the problem is: how do I make sure that the different instances do not try to process the same work, potentially corrupting work or reducing throughput? File locking can be problematic, especially across network shares. We could use a more sophisticated method, such as a simple database or messaging framework (a la JMS or similar), but the entire cluster needs to be fault tolerant. We can't have one database or messaging provider because of the single point of failure that it introduces.
We've implemented a solution that uses multicast messages to self-discover processing instances and elect a supervisor who assigns work. There's a timeout in case the supervisor goes down, after which another election takes place. Our networking library, however, isn't very mature and our implementation of messages is clunky.
My instincts, however, tell me that there is a simpler way.
Thoughts?

I think you can safely assume that rename operations are atomic on all network file systems that you care about. So arrange each unit of work as a single file (or key it to a single file). Each server then lists the directory containing new work, picks a piece of work, and renames that file to a name containing its own server name (say, machine name or IP address). If several instances try this on the same file concurrently, the rename succeeds for exactly one of them, and that instance processes the work. For the others it fails, so they pick a different file from the listing they got.
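The claim-by-rename idea can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `claim_next`, the `.claimed-by-` marker, and the use of the host name as worker id are all assumptions made for illustration, and it relies on the answer's premise that rename is atomic on the shared file system.

```python
import os
import socket

# Hypothetical marker appended to a file name once a worker has claimed it.
CLAIM_MARKER = ".claimed-by-"


def claim_next(work_dir, worker_id=None):
    """Try to claim one unclaimed work file; return its new path or None."""
    worker_id = worker_id or socket.gethostname()
    for name in sorted(os.listdir(work_dir)):
        if CLAIM_MARKER in name:
            continue  # already claimed by some worker
        src = os.path.join(work_dir, name)
        dst = src + CLAIM_MARKER + worker_id
        try:
            # Only one worker can rename a given source file; for the
            # others the source no longer exists and the call fails.
            os.rename(src, dst)
        except OSError:
            continue  # lost the race, try the next file in the listing
        return dst
    return None
```

A worker would call `claim_next` in a loop, process the returned file, and then move it together with its output to the processed directory.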
For creation of new work, assume that directory creation (mkdir) is atomic, but file creation is not (with file creation, the second writer might overwrite the existing file). So if there are also multiple producers of work, create a new directory for each piece of work.
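For the producer side, a rough sketch of the mkdir approach might look like this. The function name `publish_work` and the `payload` file name are hypothetical; the only property relied on is the one stated above, that creating a directory which already exists fails.

```python
import os


def publish_work(incoming_dir, work_name, payload):
    """Create a directory for one piece of work; return its path,
    or None if another producer already created the same name."""
    work_path = os.path.join(incoming_dir, work_name)
    try:
        os.mkdir(work_path)  # fails if the directory already exists
    except FileExistsError:
        return None
    # Only the producer whose mkdir succeeded writes into the directory.
    with open(os.path.join(work_path, "payload"), "wb") as f:
        f.write(payload)
    return work_path
```

Note that in this sketch a consumer could see the directory before the payload is completely written, so consumers would need some way to tell when a work item is ready.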

Related

BizTalk Server: maximum number of receive locations per host

I have more than 900 receive locations associated with the same host.
All receive locations are enabled, but sometimes some of them stop working (while still showing as enabled).
When I disable and re-enable such a receive location, it works again, but then another one runs into trouble.
Are there any known limitations of the number of receive locations that can be associated with the same host in BizTalk 2016?
I don't know if there is a hard limit, but if you associate all the receive locations with the same Host, your problems are probably due to the throttling mechanism.
While there are no hard limits to Receive Locations or Send Ports, there are still practical limits based on available resources.
900 is a lot for a single Host. Even if everything was running perfectly, I would still break that up across ~3 Hosts.
If these are File Receive Locations, there are other techniques to reduce the amount even more. Some options:
1. Use a Windows Scheduler task to move files from various locations to fewer locations, or maybe just one. If 'source' information is necessary, you can add a tag to the file name, which can be extracted in a custom Pipeline Component.
2. Modify the sample File Adapter in the SDK to scan sub-folders as well. You can combine this with option 1 if you cannot modify the filename for some reason.
3. Similar to option 1, the script can write a meta-data file, containing any data you need to preserve, before moving the file. The meta-data can then be read in a Pipeline Component.
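The move-and-tag script from option 1 could look roughly like the sketch below. Everything here is illustrative: the `__` separator, the function names, and the choice of Python for the scheduled script are assumptions, and the tag would actually be extracted in a BizTalk Pipeline Component (written in .NET), for which `split_tag` only shows the parsing logic.

```python
import os
import shutil

# Hypothetical separator between the source tag and the original file name.
TAG_SEPARATOR = "__"


def collect(source_dirs, target_dir):
    """Move files from several folders into one, prefixing each file
    name with a tag that identifies where the file came from.

    source_dirs maps a tag (e.g. a partner name) to a source folder.
    """
    moved = []
    for tag, src_dir in source_dirs.items():
        for name in sorted(os.listdir(src_dir)):
            src = os.path.join(src_dir, name)
            if not os.path.isfile(src):
                continue  # skip sub-folders
            dst = os.path.join(target_dir, tag + TAG_SEPARATOR + name)
            shutil.move(src, dst)
            moved.append(dst)
    return moved


def split_tag(file_name):
    """Recover (tag, original name) from a tagged file name."""
    tag, _, original = file_name.partition(TAG_SEPARATOR)
    return tag, original
```

With this scheme a single receive location watching the target folder replaces one receive location per source folder.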

MPI one-sided file I/O

I have some questions on performing File I/Os using MPI.
A set of files are distributed across different processes.
I want the processes to read the files in the other processes.
For example, in one-sided communication, each process sets a window visible to other processors. I need the exactly same functionality. (Create 'windows' for all files and share them so that any process can read any file from any offset)
Is this possible in MPI? I have read a lot of documentation about MPI, but couldn't find exactly this.
The simple answer is that you can't do that automatically with MPI.
You can convince yourself by seeing that MPI_File_open() is a collective call taking an intra-communicator as its first argument and returning a file handle to the opened file as its last argument. All processes in this communicator open the file, and therefore all of them must be able to see it. So unless a process sees a file, it cannot get an MPI_File handle to access it.
Now, that doesn't mean there's no solution. A possibility could be to do by hand exactly what you described, namely:
Each MPI process individually opens the file it sees and is responsible for; then
Each of these processes reads its local file into a buffer;
These individual buffers are all exposed, using either one global MPI_Win memory window or several individual ones, ready for one-sided read accesses; and finally
All read accesses to any data previously stored in these individual local files are now done through MPI_Get() calls using the memory window(s).
The true limitation of this approach is that it requires reading all of the individual files in full, so you need sufficient memory per node to store each of them. I'm well aware that this is a very big caveat that could make the solution completely impractical. However, if the memory is sufficient, this is an easy approach.
Another even simpler solution would be to store the files into a shared file system, or having them all copied on all local file systems. I imagine this isn't an option since the question wouldn't have been asked otherwise...
Finally, as a last resort, a possibility I see would be to dedicate an MPI process (or an OpenMP thread of an MPI process) per node to serving that node's files. This process would just act as a "file server", answering "read" requests coming from the other MPI processes by reading the requested data from the file and sending it back via MPI. It's a bit lengthy to write, but it should work.

Reduce the BizTalk receive location file input speed

We have a BizTalk 2010 receive location which receives a 70 MB file and then uses an inbound map (in the receive location) and an outbound map (in the send port) to produce a 1 GB file.
While this process runs, it consumes a lot of disk I/O in SQL Server, and the performance of the other receive locations' processes is badly affected.
We have tried reducing the maximum disk I/O threads in the host instance of that receive location, but it still consumes a lot of disk I/O in SQL Server.
In fact, this process has a very low priority. Is there any way to reduce its disk I/O usage so that the other processes can perform normally?
This issue isn't related to the speed of the file input but, as you mentioned in a comment, to the load placed on the MessageBox when trying to persist the 1 GB map output to it. You have a few options here to try to minimize the impact this will have on other processes:
Adjust the throttling settings on the newly created host to something very low. This may or may not work the way you want it to though.
Set a service window on the receive location for these files so that they only run during off hours. This would be ideal if you don't have 24/7 demand on the MessageBox and can afford slow response times in the middle of the night (say 2-3am).
If your requirements can handle this, don't map the file in the receive port, but instead route it to an Orchestration and/or custom pipeline component that will split it into smaller pieces and then map the smaller pieces. This should at least give you more fine-grained control over the speed at which these are processed (have a delay shape in the loop that processes the pieces). There could still be issues when you join them back together, but it shouldn't be as bad as your current process.
It also may be worth looking at your map. If there are lots of slow/processor heavy calls you might be able to refactor it.
Ideally you should debatch the file. Apply the business logic, including the map, to each individual segment and then load the segments into SQL one at a time. Later you can use a pipeline or some other .NET component to pull the data from SQL and rebatch it. Handling big XML (10 times the size of the flat file) in the BizTalk MessageBox is not good practice.
If, however, this is a pure messaging scenario, you can convert the file into a stream and route it to the destination.

MPI, NFS File Writing

I'm having an issue with an MPI program running across a group of Linux nodes. The group is currently set up with NFS, with /home/mpi mounted across all nodes. The problem is that the program requires all of the nodes to open a file in the file system in write mode (using fopen on /home/mpi/file) and write to it while it does calculations. One node will be able to open it, and the others won't and will throw an error. Instead I want each node to have its own file to write to.
I was wondering if there was a way to get around this. I was thinking about making a separate file for each node, with the node's rank appended to the filename, but was wondering if there were simpler ways to get around this issue. Is there a way to set up the group so that all the worker nodes have their own copy of the /home/mpi directory that is automatically updated with any changes that the master node makes to its copy?
Thanks.
As far as I know, the standard way of doing things is to open one file per node, indexed by rank as you described. Depending on what these files are used for (e.g. logging), you then have to write a script to re-combine them at the end of the computation.
If you really need all processes to write to the same file on the filesystem, you'll have to somehow coordinate concurrent outputs from all processes wanting to write to the file.
There is no way to do this at the filesystem level as far as I know, but you can do this within your MPI code. The standard, historical implementation of this is to have all MPI processes send messages to rank 0, which is in charge of actually writing them to the filesystem.
Another option would be to look at the I/O features introduced in MPI-2 (MPI-IO), which allow all processes to work on different parts of the same file.
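The one-file-per-rank approach, together with the re-combining script mentioned above, could be sketched like this. It is only an illustration in Python: the helper names and the zero-padded numeric suffix are assumptions, and in the real program each process would build its file name from the rank it gets from MPI (for example via MPI_Comm_rank) before calling fopen.

```python
import glob
import shutil


def rank_file_name(base, rank):
    """File name used by one rank, e.g. /home/mpi/file.0003 for rank 3."""
    return f"{base}.{rank:04d}"


def merge_rank_files(base, merged_path):
    """Concatenate all per-rank files into merged_path, in rank order."""
    parts = []
    for path in glob.glob(glob.escape(base) + ".*"):
        suffix = path[len(base) + 1:]
        if suffix.isdigit():  # ignore files that are not per-rank outputs
            parts.append((int(suffix), path))
    parts.sort()
    with open(merged_path, "wb") as out:
        for _, path in parts:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    return [path for _, path in parts]
```

The merge step would be run once, after all processes have finished and closed their files.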

What are the options for transferring 60GB+ files over a network?

I'm about to start developing an application to transfer very large files, without any rush but with a need for reliability. I would like people who have coded for this particular case to give me some insight into what I'm about to get into.
The environment will be an intranet FTP server, so far using active FTP on the normal ports, on Windows systems. I might also need to zip up the files before sending. I remember once working with a library that would zip in memory and had a limit on the size... ideas on this would also be appreciated.
Let me know if I need to clarify something else. I'm asking for general, higher-level gotchas, if any, not really detailed help. I've done apps with normal sizes (up to 1GB) before, but for this one it seems I'd need to limit the speed so I don't kill the network, or things like that.
Thanks for any help.
I think you can get some inspiration from torrents.
Torrents generally break up the file into manageable pieces and calculate a hash of each piece. They then transfer the file piece by piece. Each piece is verified against its hash and accepted only if it matches. This is a very effective mechanism: it lets the transfer happen from multiple sources and lets it restart any number of times without worrying about corrupted data.
For transfer from a server to single client, I would suggest that you create a header which includes the metadata about the file so the receiver always knows what to expect and also knows how much has been received and can also check the received data against hashes.
I have practically implemented this idea on a client server application but the data size was much smaller, say 1500k but reliability and redundancy were important factors. This way, you can also effectively control the amount of traffic you want to allow through your application.
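The header-with-hashes idea can be sketched as follows. This is a minimal sketch under assumptions: the manifest layout, the 4 MiB piece size, the choice of SHA-256, and the function names are all made up for illustration and are not taken from the application described above.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical piece size (4 MiB)


def build_manifest(path, chunk_size=CHUNK_SIZE):
    """Sender side: describe a file as its size plus one hash per piece."""
    hashes = []
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).hexdigest())
            size += len(chunk)
    return {"size": size, "chunk_size": chunk_size, "sha256": hashes}


def missing_chunks(manifest, partial_path):
    """Receiver side: indices of pieces that are absent or corrupt."""
    try:
        f = open(partial_path, "rb")
    except FileNotFoundError:
        return list(range(len(manifest["sha256"])))
    missing = []
    with f:
        for index, expected in enumerate(manifest["sha256"]):
            f.seek(index * manifest["chunk_size"])
            chunk = f.read(manifest["chunk_size"])
            if hashlib.sha256(chunk).hexdigest() != expected:
                missing.append(index)
    return missing
```

After an interrupted transfer, the receiver would request only the pieces returned by `missing_chunks`, and the sender can pace how quickly it sends pieces to limit the traffic.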
I think the way to go is to use the rsync utility as an external process to Python -
Quoting from here:
the pieces, using checksums, to possibly existing files in the target
site, and transports only those pieces that are not found from the
target site. In practice this means that if an older or partial
version of a file to be copied already exists in the target site,
rsync transports only the missing parts of the file. In many cases
this makes the data update process much faster as all the files are
not copied each time the source and target site get synchronized.
And you can use the -z switch to have the data compressed on the fly during transfer, transparently, with no need to bottle up either end compressing the whole file.
Also, check the answers here:
https://serverfault.com/questions/154254/for-large-files-compress-first-then-transfer-or-rsync-z-which-would-be-fastest
And from rsync's man page, this might be of interest:
--partial
By default, rsync will delete any partially transferred
file if the transfer is interrupted. In some circumstances
it is more desirable to keep partially transferred files.
Using the --partial option tells rsync to keep the partial
file which should make a subsequent transfer of the rest of
the file much faster
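Calling rsync as an external process from Python, with the -z and --partial options discussed above, might look like this sketch. The function names and the example paths are hypothetical; --bwlimit is rsync's option for capping transfer speed, added here because the question mentions not wanting to saturate the network. It assumes rsync is installed and on the PATH.

```python
import subprocess


def build_rsync_command(source, destination, bwlimit_kbps=None):
    """Build the rsync argument list for a resumable, compressed copy."""
    cmd = ["rsync", "-z", "--partial"]
    if bwlimit_kbps is not None:
        # Optional bandwidth cap so the transfer does not hog the network.
        cmd.append(f"--bwlimit={bwlimit_kbps}")
    cmd += [source, destination]
    return cmd


def transfer(source, destination, bwlimit_kbps=None):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    return subprocess.run(
        build_rsync_command(source, destination, bwlimit_kbps), check=True
    )
```

If a run is interrupted, calling `transfer` again with the same arguments lets rsync reuse the partial file that was kept on the target.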

Resources