Running jobs (scripts) on a Beowulf cluster across multiple machines with memory allocation - mpi

I built a simple Beowulf cluster with one master and 4 nodes (total of 128 cores) following this tutorial.
https://www.youtube.com/watch?v=gvR1eQyxS9I
I successfully ran a "Hello World" program by allocating some of my cluster's cores. This is what I used:
$ mpiexec -n 64 -f hosts ./mpi_hello
Now that I know how to run a multi-process program, I would like to allocate some memory, as I am planning to do some data analysis.
Each node has 16 GB of RAM. How can I allocate 32 GB or 64 GB of RAM for data analysis?
Thank you very much for your help.
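One point worth keeping in mind: MPI does not pool the nodes' RAM into one large address space. Each rank sees only its own node's memory, so using 32 GB or 64 GB in aggregate means partitioning the data so every rank's slice fits on its node. A minimal sketch of that arithmetic, in plain Python (the node size is taken from the question; the one-rank-per-node assumption and the 0.75 headroom factor are illustrative, not measured):

```python
# Sketch: MPI "allocates" aggregate memory by partitioning data across
# ranks, not by exposing one shared 32/64 GB address space.
NODE_RAM_GB = 16      # RAM per node (from the question)
TOTAL_DATA_GB = 64    # dataset we want to hold in aggregate

def ranks_needed(total_gb, per_node_gb, headroom=0.75):
    """Ranks (one per node assumed) needed so each slice fits in a
    node's usable RAM.

    `headroom` leaves room for the OS and the MPI runtime; 0.75 is an
    assumption, not a measured value.
    """
    usable = per_node_gb * headroom
    return int(-(-total_gb // usable))  # ceiling division

n = ranks_needed(TOTAL_DATA_GB, NODE_RAM_GB)
print(n)  # 6 -- more than the 4 nodes available, so 64 GB is tight here
```

So a 4-node, 16 GB-per-node cluster offers roughly 48 GB of usable aggregate memory under this headroom assumption; holding 64 GB would need more nodes or out-of-core processing.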

Checking available cores in R on SLURM

I ran the script below for a SLURM RStudio setup (currently running):
#!/bin/bash
#SBATCH --job-name=nodes
#SBATCH --output=a.log
#SBATCH --ntasks=18
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=7gb
date;hostname;pwd
module load R/4.2
rserver  # launches RStudio Server
This requests 18 tasks with 8 CPUs per task (144 CPUs in total).
However, when I check the number of cores available for parallel processing in the R console, it says 32 instead.
Here's the code for checking.
library(doParallel)
detectCores() # 32
Even worse, another package, parallelly (or future), which takes scheduler settings into account, reports yet another number.
From the parallelly documentation:
For instance, if compute cluster schedulers are used (e.g. TORQUE/PBS and Slurm), they set specific environment variables specifying the number of cores that was allotted to any given job; availableCores() acknowledges these as well.
library(parallelly)
availableCores() # 8
I am wondering whether the current R session is actually running with the scheduler specification above (144 cores), and whether I am missing something important.
Also, could you recommend how to check the resources (cores / memory) that are allocated and usable from R under SLURM?
Thank you very much in advance.
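The difference between the two numbers comes from where each function looks: detectCores() counts the machine's hardware cores (32), while parallelly::availableCores() reads the environment variables SLURM sets for the job, so it returns the --cpus-per-task value (8). A minimal sketch of that lookup, in plain Python standing in for the R packages (the variable names are ones SLURM really exports, but the fallback order is a simplification of what parallelly actually does):

```python
import os

def available_cores(env=os.environ, hardware_cores=32):
    """Scheduler-aware core lookup: prefer SLURM's allocation over the
    hardware core count (a simplified sketch, not parallelly's code)."""
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = env.get(var)
        if value:
            return int(value)
    return hardware_cores  # detectCores()-style fallback

# Inside the job script from the question, SLURM would set:
demo_env = {"SLURM_CPUS_PER_TASK": "8"}
print(available_cores(demo_env))  # 8, like parallelly::availableCores()
print(available_cores({}))        # 32, like parallel::detectCores()
```

For memory, SLURM similarly exports variables such as SLURM_MEM_PER_CPU, which can be inspected from R with Sys.getenv().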

Dask job fails in Jupyter notebook cell with KilledWorker

I am running a join task in a Jupyter notebook which is producing many warnings from Dask about a possible memory leak before finally failing with a killed worker error:
2022-07-26 21:38:05,726 - distributed.worker_memory - WARNING - Worker is at 85% memory usage. Pausing worker. Process memory: 1.59 GiB -- Worker memory limit: 1.86 GiB
2022-07-26 21:38:06,319 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 1.59 GiB -- Worker memory limit: 1.86 GiB
2022-07-26 21:38:07,501 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:46137 (pid=538697) exceeded 95% memory budget. Restarting...
2022-07-26 21:38:07,641 - distributed.nanny - WARNING - Restarting worker
KilledWorker: ("('assign-6881b18750807133ba976bf463a98c23', 0)", <WorkerState 'tcp://127.0.0.1:46137', name: 0, status: closed, memory: 0, processing: 50>)
This happens when I run my code on a laptop with 32GB RAM (Kubuntu 20). Maybe I have not configured Dask correctly for the environment? I can watch the memory usage go up and down in the system monitor but at no point does it consume all the memory. How can I tell Dask to use all the cores and as much memory as it can manage? It seems to be running in single processor mode, maybe because I'm running on a laptop rather than a proper cluster?
For context: I'm joining two datasets, both are text files with sizes 25GB and 5GB. Both files have been read into Dask DataFrame objects using dd.read_fwf(), then I transform a string field on one of the frames, then join (merge) on that field.
There are certain memory constraints for each worker, which you can read about here: https://distributed.dask.org/en/stable/worker-memory.html
Apart from this, you can try increasing the number of workers and threads when initializing the Dask client.
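The log lines above show a per-worker limit of 1.86 GiB, which suggests the default cluster split the machine's memory across many workers. On a 32 GB laptop you can choose the split explicitly; a sketch of the arithmetic (the worker count and 4 GB reservation are illustrative assumptions, and the commented-out call mirrors the real dask.distributed Client parameters n_workers, threads_per_worker, and memory_limit without running Dask here):

```python
# Sketch: pick an explicit worker/memory split for a 32 GB laptop
# instead of relying on Dask's defaults (illustrative numbers).
TOTAL_RAM_GB = 32
RESERVED_GB = 4  # leave some RAM for the OS and the notebook itself

def worker_memory_limit(n_workers, total_gb=TOTAL_RAM_GB,
                        reserved_gb=RESERVED_GB):
    """Per-worker memory_limit string for dask.distributed.Client."""
    per_worker = (total_gb - reserved_gb) / n_workers
    return f"{per_worker:.1f}GB"

limit = worker_memory_limit(n_workers=4)
print(limit)  # "7.0GB"
# The resulting client setup would look like (not executed here):
# from dask.distributed import Client
# client = Client(n_workers=4, threads_per_worker=2, memory_limit=limit)
```

With fewer, larger workers the join's shuffle partitions are less likely to blow past a worker's budget, though a 25 GB join may still need smaller partitions or an on-disk shuffle.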

R error: C stack usage too close to limit on small dataset

I have just done a clean install of R on Ubuntu 18.04, and it does not really work.
I can do stuff like make vectors and dataframes, and get summaries. I can also make a histogram with the hist command just fine. However, the plot command does not work.
The following code, which is very basic and should work just fine:
data(faithful)
plot(faithful$eruptions)
runs for about 30 seconds, before giving the following error:
Error: C stack usage 7970244 is too close to the limit
I have seen many posts from other people hitting the same error, but in their cases it seems to be caused by large datasets or deep recursion. I have this problem even with a dataset of just 3 values. R should definitely be able to handle this without me increasing the limit, and it should not take 30 seconds to run.
Does anybody know what the problem could be?
Edit:
Output of ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 28697
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 28697
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
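The "C stack usage 7970244 is too close to the limit" figure lines up with the 8192 kB stack size in the ulimit output above; that per-process ceiling can also be read programmatically. A minimal sketch using Python's resource module (Unix-only; shown just to illustrate where the ~8 MB number comes from, not as a fix):

```python
import resource

# RLIMIT_STACK is the per-process stack ceiling that `ulimit -s` reports.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

def to_kb(limit):
    """Render a limit in kB, treating RLIM_INFINITY as 'unlimited'."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} kB"

print(to_kb(soft))  # e.g. "8192 kB", matching the ulimit -a output above
```

R's error fires when stack usage approaches this soft limit, so a healthy plot() call staying 7.9 MB deep points at something pathological in the session, not at the dataset.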
Version info:
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
This is plain R in the terminal, not through any IDE.
Edit 2:
I have discovered that it works just fine on other user accounts on the computer. I've also started getting other weird issues (after a reboot I have to login to a Gnome session, then log out before I can log into plasma, but other users don't have that problem, sometimes the terminal can't launch, lots of things are crashing, etc) so I think this has nothing to do with R and is much bigger. Unfortunately, it might be a hardware issue (this computer has had wacky issues before on other operating systems).

How to run an R script on a remote GPU server?

I am trying to run an R script on a GPU server provided by the institute.
Specifications of GPU server are as follows:
Host Name: gpu01.cc.iitk.ac.in,
Configuration: Four Tesla T10 GPUs added to each machine with 8 cores in each
Operating System: Linux
Specific Usage: Parallel Programming under Linux using CUDA with C Language
R code:
setwd("~/Documents/tm dataset")
library(ssh)
session <- ssh_connect("dgaurav@gpu01.cc.iitk.ac.in")
print(session)
out <- ssh_exec_wait(session, command = 'articles1_test.R')
Error:
ksh: articles1_test.R: not found
Your dataset and script are only on your local machine; you need to copy them to the remote server before you can run them.

How to stop h2o from saving massive .ERR, .OUT and other log files to the local drive

I am currently running a script in which several h2o glm and deeplearning models are being generated for several iterations of a Monte-Carlo Cross-Validation. When finished running (which takes about half a day), h2o is saving immense files to the local drive (with sizes up to 8.5 GB). These files are not erased when RStudio or my computer is restarted (as I originally thought). Is there a way to stop h2o from saving these files?
When you start H2O with h2o.init() from R, the stdout and stderr files should be saved to a temporary directory (see R's tempdir() for the path). This temporary directory should be removed when the R session exits. It seems this is not working with RStudio; however, it does work when using R from the command line. I'm not sure if this is a setting that can be changed in RStudio or if it is an RStudio bug.
But you can take more control yourself. You can start H2O by hand using java on the command line and then connect from R using h2o.init().
java -Xmx5g -jar h2o.jar
In this example, I started H2O with 5 GB of Java heap memory, but you should increase that if your data is larger. Then connecting in R will look like this:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 16 hours 34 minutes
H2O cluster version: 3.15.0.99999
H2O cluster version age: 17 hours and 25 minutes
H2O cluster name: H2O_started_from_R_me_exn817
H2O cluster total nodes: 1
H2O cluster total memory: 4.43 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.2 (2016-10-31)
So if you want to silence the logs, you simply append a redirection to the java command that starts the H2O cluster, then connect from R again. To redirect both stdout and stderr to /dev/null, append > /dev/null 2>&1 like this:
java -Xmx5g -jar h2o.jar > /dev/null 2>&1 &
I encountered this in a Spark shell running H2O. The shell had 50 executors connected, and the /tmp directories on those nodes eventually filled up, causing storage issues.
When h2o.init() is called it creates JVMs, and the logging from H2O is handled by those JVMs. But when the shell is shut down those JVMs persist and just log heartbeat errors to /tmp in perpetuity. You will need to find the JVMs associated with H2O and shut them down. In my case the specific process name was water.H2OApp.
I found it easier to take care of the problem by removing those files after running every model.
unlink(list.files(tempdir(), full.names = TRUE), recursive = TRUE)
This helps remove the temporary files when I run multiple models in a loop.
