For some background:
I am working on some code and trying to debug, on my local system, code that runs on a cluster. The code hung on the cluster without exiting, making for a nice waste of CPU time (my fault for not monitoring it well). Of the 350 nodes, 68 got as far as creating a folder named with their rank number, but did not do any of the computation (which takes about 2 seconds) that creates the file inside their own folders.
I am trying to work out why this happened. The code works on my desktop when I set the number of processes (ranks) to 10. I tried 350, but that gave a pipe error, which seemed to be down to my local configuration.
Then I tried 100 processes. This started to run, but I watched the RAM usage climb until my computer (64 GB) froze.
I assume this is because the broadcast is sending the data to 100 processes all on the same system. However, would this same issue arise on the cluster? Does MPI_Bcast make multiple copies of the same data (since it is a broadcast, it sends the same data to all ranks)?
I read that MPI_Bcast uses a tree algorithm to get the data to all nodes quickly. While this is happening, does each rank read from the same memory address, or does each one end up with its own copy stored on its own node?
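To make the question concrete, here is a minimal sketch of the broadcast pattern I mean (sizes and names are placeholders, not my actual code): every rank allocates its own receive buffer and passes it to MPI_Bcast.

    /* Minimal sketch of the broadcast pattern in question (placeholder size, not the real code). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 100000000;                             /* ~800 MB of doubles, placeholder size */
        double *data = malloc((size_t)n * sizeof(double));   /* each rank allocates its own buffer   */

        if (rank == 0)
            for (int i = 0; i < n; i++) data[i] = (double)i; /* root fills the payload */

        /* every rank passes its own pointer; after the call each rank's buffer holds the data */
        MPI_Bcast(data, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank != 0) printf("rank %d received, data[1] = %f\n", rank, data[1]);

        free(data);
        MPI_Finalize();
        return 0;
    }

So the part I am unsure about is whether, with 100 ranks all on one machine, those 100 buffers are 100 separate physical copies (which would explain exhausting 64 GB), or whether the implementation can somehow share them.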
Related
I've noticed something particularly odd about a simple program I've been running on an HPC system with a fat-tree architecture, and I'm not exactly sure why I'm getting the results I'm getting.
The program I've created simply prints its own runtime for a varying number of MPI processes. I varied the number of processes in powers of two (2^n) from 2 to 256; while the execution time tends to decrease as the number of processes increases from 2 to 8, it jumps dramatically at 64 processes.
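For reference, the structure of the program is roughly the sketch below (the work loop here is only a placeholder, not my actual computation): each run times itself with MPI_Wtime and rank 0 prints the result.

    /* Rough sketch of the timing harness (the "work" loop is a placeholder). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);              /* line all ranks up before timing */
        double t0 = MPI_Wtime();

        double local = 0.0;                       /* placeholder work, split across ranks */
        for (long i = rank; i < 100000000L; i += size)
            local += 1.0 / (double)(i + 1);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%d processes: %.3f s (sum = %f)\n", size, t1 - t0, total);

        MPI_Finalize();
        return 0;
    }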
Could this be because of the architecture itself? I'd expect the execution time to keep decreasing as the number of processes grows, but this doesn't seem to be the case past a certain threshold.
I figured out the issue a while ago after reading the documentation (go figure) and wanted to post the solution here in case anyone has a similar issue. On the HPC system I was using (AFRL's Mustang), I was executing my programs with mpirun on the login node. The documentation clearly states that jobs need to be submitted via a batch script, per section 6 of the user guide:
https://www.afrl.hpc.mil/docs/mustangQuickStartGuide.html#jobSubmit
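In practice that means submitting a small PBS batch script with qsub instead of running mpirun directly, roughly like the sketch below (the job name, project ID, queue, and node geometry are placeholders; take the real values from the guide above):

    #!/bin/bash
    ## Example PBS submission script -- all values are placeholders, check the user guide.
    #PBS -N timing_test
    #PBS -A YOUR_PROJECT_ID
    #PBS -q debug
    #PBS -l select=2:ncpus=48:mpiprocs=48
    #PBS -l walltime=00:30:00

    cd $PBS_O_WORKDIR
    mpirun ./timing_test   # launched by the scheduler on compute nodes, not on the login node

The script is then handed to the scheduler with qsub rather than executed by hand.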
I have a program that uses the Open MPI implementation of MPI for data exchange between processes. Right now I am running this program on a single node, where the data has to be shared from one process to all the others. The total amount of data the master process sends is 130 GB, which is split and sent to 6-8 client processes, but this data transfer takes an awfully long time (about 1 hour).
Knowing that the code runs on the very same node, I would expect the data transfer could be sped up through the settings I pass when launching mpirun. Do you know which settings could help me get a faster data transfer in this scenario? Right now I am using only "--mca btl vader,self" as optional components.
The actual code uses MPI_Send() calls that each transfer an amount of data close to the maximum that can be sent in a single call. After the data has been transferred to one client process over multiple MPI_Send() calls, the master process sends data to the next pending client process.
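To give an idea of the pattern, a stripped-down sketch (the chunk size, element counts, and buffer contents are placeholders, not the real code) looks like this:

    /* Stripped-down sketch of the chunked master-to-client transfer (placeholder sizes). */
    #include <mpi.h>
    #include <stdlib.h>

    #define CHUNK 1000000   /* elements per MPI_Send; the real code uses chunks near the int count limit */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long long per_client = 5000000;                 /* payload per client (placeholder)          */
        double *buf = malloc(CHUNK * sizeof(double));   /* stands in for a slice of the real payload */

        if (rank == 0) {
            for (int client = 1; client < size; client++) {            /* one client at a time */
                for (long long sent = 0; sent < per_client; sent += CHUNK) {
                    int count = (per_client - sent < CHUNK) ? (int)(per_client - sent) : CHUNK;
                    MPI_Send(buf, count, MPI_DOUBLE, client, 0, MPI_COMM_WORLD);
                }
            }
        } else {
            for (long long got = 0; got < per_client; got += CHUNK) {
                int count = (per_client - got < CHUNK) ? (int)(per_client - got) : CHUNK;
                MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }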
I'm running into memory issues using the bnlearn package's structure learning algorithms. Specifically, I notice that the score-based methods (e.g. hc and tabu) use a lot of memory, especially when given a non-empty starting network.
Memory usage wouldn't be an issue except that it repeatedly brings down both my laptop (16 GB RAM) and a VM I'm using (128 GB RAM), even though the data set in question is a discrete BN with 41 nodes and ~250 rows (69 KB in memory). The issue occurs both when running sequentially with 16 GB of RAM and when running in parallel on the VM (32 GB/core).
One last bit of detail: occasionally I can get 100-200 networks with random starts to run successfully, but then one network will randomly grow too big and bring the system down.
My question: I'm fairly new to BNs, so is this just inherent to the method, or is it a memory-management issue with the package?
I'm running a foreach loop with the snow back-end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a Python script, so there is an active Python instance too.
Is there any benefit to not having #workers = #cores, and instead keeping #workers < #cores, so there is always an opening for system processes or the Python instance?
It runs successfully with #workers = #cores, but do I take a performance hit by saturating the cores (the maximum possible threads) with the R worker instances?
It will depend on:
1) your processor (specifically hyper-threading),
2) how much information has to be copied to/from the different images,
3) whether you're running this over multiple boxes (LAN).
For 1), hyper-threading helps. I know my machine supports it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time it would take if I matched the number of workers to the number of cores. It won't improve more than that.
For 2), if you're not forking (if you're using sockets, for instance), you're working as if in a distributed-memory paradigm, which means creating one copy of the data in memory for every worker. This copying can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of space, depending on what you're working on. I often match the number of workers to the number of cores because doubling the workers would make me run out of memory.
This is compounded by 3), network speed over multiple workstations. Between local machines, our switch transfers data at about 20 MB/s, which is 10x faster than my internet download speed at home but a snail's pace compared to making copies within the same box.
You might consider increasing R's nice value (lowering its priority) so that Python gets priority when it needs to do something.
On a 5-node Riak cluster, we have observed very slow writes, about 2 docs per second. Upon investigation, I noticed that some of the nodes were low on disk space. After making more space available and restarting the nodes, we see this error (or something similar) on all of the nodes in console.log:
2015-02-20 16:16:29.694 [info] <0.161.0>#riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 182687704666362864775460604089535377456991567872 was terminated for reason: {shutdown,max_concurrency}
Currently, the cluster is not being written to or read from.
I would appreciate any help in getting the cluster back to good health.
I will add that we are writing documents to an index that is tied to a Solr index.
This is not critical production data, and I could technically wipe everything and start fresh, but I would like to properly diagnose and fix the issue so that I am prepared to handle it if it should happen in a production environment in the future.
Thanks!