How does tboot perform static root of trust measurement, and will it change PCR 12–PCR 14 values for different Linux kernels?

I have installed tboot with apt-get install tboot on Ubuntu.
I have a doubt regarding tboot and TrustedGRUB. TrustedGRUB performs SRTM (static root of trust for measurement) and changes the values in PCR 12–PCR 14. tboot performs DRTM (dynamic root of trust for measurement) using Intel's Trusted Execution Technology (TXT) and changes the values in PCR 17–PCR 22. My reasoning is: if tboot supported SRTM, it should also change PCR 12–PCR 14 for different Linux kernel versions. But across kernel versions tboot only changes PCR 17–PCR 22.
Can tboot provide SRTM and DRTM both at the same time?

No.
The SRTM is always your firmware, and tboot itself is not your DRTM either: the DRTM is the SINIT module. tboot is responsible for preparing the late launch; after control returns from the SINIT code, tboot functions as your MLE (Measured Launch Environment), thus extending your dynamic chain of trust.
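You can see this for yourself by dumping the PCRs before and after a kernel change. A rough sketch, assuming a TPM 1.2 exposed through sysfs (with a TPM 2.0 you would use tpm2_pcrread from tpm2-tools instead):

# txt-stat ships with the tboot package and confirms the late launch ran
txt-stat | grep -i 'measured launch'
# static PCRs (0-15) belong to the firmware's SRTM chain; dynamic
# PCRs (17-22) are reset and re-extended by the TXT late launch
cat /sys/class/tpm/tpm0/pcrs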

Related

mpirun error with oneAPI under Slurm (and PBS) on an old cluster

Recently I installed Intel oneAPI, including the C compiler, Fortran compiler, and MPI library, and compiled VASP with it.
Before presenting the question, there are some tricks I should clarify from the installation of VASP:
glibc 2.14: the cluster is an old machine with glibc 2.12, while oneAPI needs 2.14. So I compiled glibc 2.14 and exported the library path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: the ld version on the cluster is 2.20, while a higher version is needed, so I installed binutils 2.24.
The cluster has one master node connected to 30 compute nodes. I run the calculation in three ways:
When I run the calculation on the master, it is totally OK.
When I log in to a node manually with the rsh command, the calculation on that node is also no problem.
But usually I submit the calculation script from the master (with Slurm or PBS), and the calculation then runs on a node. In that case, I get the following error message:
[mpiexec#node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec#node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec#node3.alineos.net] Possible reasons:
[mpiexec#node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec#node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec#node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec#node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only encounter this error with oneAPI-compiled binaries, not with those compiled by Intel® Parallel Studio XE. Do you have any idea about this error? Your response will be highly appreciated.
Best,
Léon
Could it be a permissions or library-path issue, with the Slurm agent not inheriting the correct environment?
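Something to try (a sketch under the assumption that Hydra cannot spawn its proxies over ssh/rsh from inside a batch job): tell Hydra to use the scheduler's native launcher, and export the custom glibc path inside the job script itself. The binary name and node count are illustrative:

#!/bin/bash
#SBATCH --nodes=2
export LD_LIBRARY_PATH="$HOME/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
export I_MPI_HYDRA_BOOTSTRAP=slurm    # under PBS, try pbsdsh instead
mpirun -np 64 ./vasp_std
# equivalently: mpirun -bootstrap slurm -np 64 ./vasp_std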

Why is MKL in parallel not faster than serial in R 3.6?

I am trying to use Intel's MKL with R and adjust the number of threads via the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed some flag or config somewhere.
Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.
require(callr)
#> Loading required package: callr
# MKL_R_LD_LIBRARY_PATH is assumed to be defined earlier in the session
f <- function(i) r(function() crossprod(matrix(1:1e9, ncol = 1000))[1],
                   env = c(rcmd_safe_env(),
                           R_LD_LIBRARY_PATH = MKL_R_LD_LIBRARY_PATH,
                           MKL_NUM_THREADS = as.character(i),
                           OMP_NUM_THREADS = "1"))
system.time(f(1))
#> user system elapsed
#> 14.675 2.945 17.789
system.time(f(4))
#> user system elapsed
#> 54.528 2.920 19.598
system.time(f(8))
#> user system elapsed
#> 115.628 3.181 20.364
system.time(f(32))
#> user system elapsed
#> 787.188 7.249 36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it properly calling LAPACK:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
for f(8), it shows NThr:8
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I am still not getting the expected performance increase from extra cores.
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library rather than in BLAS/LAPACK itself?
Only seeing this now: did you check the obvious one, i.e., whether R on CentOS actually picks up the MKL?
As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the reference BLAS shipped with R. If that is the case, you simply cannot switch in another BLAS as we have done within Debian and Ubuntu for 20+ years, because that requires a different choice when R is compiled.
Edit: Per subsequent discussions (see comments below), we all re-realized that it is important to have the threading libraries / models aligned. MKL is an Intel product and defaults to using Intel's threading library; on Linux the GNU compiler is closer to the system and has its own, and that latter one needs to be selected. In my writeup / script for the MKL on .deb systems I use
echo "MKL_THREADING_LAYER=GNU" >> /etc/environment
to set this "system-wide" on the machine; one can also add it just to the R environment files.
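For a quick experiment without editing /etc/environment, the same variable can be set for a single session (a minimal sketch, assuming your R is actually linked against the MKL):

MKL_THREADING_LAYER=GNU MKL_NUM_THREADS=4 R --vanilla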
I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath, then we should see very good scalability with such inputs. What are the input problem sizes?
MKL supports a verbose mode. This option can help you see a lot of useful runtime information while dgemm is running. Could you try exporting MKL_VERBOSE=1 and checking the log?
Though, I am not quite sure whether R will suppress the output.
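For example, a non-interactive run along these lines (illustrative, not the asker's benchmark) captures the verbose log to a file:

MKL_VERBOSE=1 Rscript -e 'crossprod(matrix(rnorm(1e6), ncol = 1000))' > mkl_verbose.log 2>&1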

Intel MKL/Xeon Phi Offload Runtime Issue - Auto Offload not working

I have set up my Xeon Phi 3120A on Windows 10 Pro, with MPSS 3.8.4 and Parallel XE 2017 (Initial Release). I chose this Parallel XE because it was the last release to support the x100 series, and I installed the MKL version packaged with it.
What I have done / set up:
After setting up MPSS 3.8.4 and following steps such as flashing and pinging, I have checked that micctrl -s shows "mic0 ready" (with a Linux image containing the appropriate KNC name), miccheck produces all passes, and micinfo gives me readings for all the key stats the coprocessor provides.
So it looks to me like the coprocessor is certainly installed and recognised by my computer. I can also see that mic0 is up and running in the micsmc GUI.
I have then set my environment variables to enable automatic offload: MKL_MIC_ENABLE=1, OFFLOAD_DEVICES=0, MKL_MIC_MAX_MEMORY=2GB, MIC_ENV_PREFIX=MIC, MIC_OMP_NUM_THREADS=228, MIC_KMP_AFFINITY=balanced.
The problem: auto offload is not working
When I run some simple code in R-3.4.3 (copied below, designed specifically for automatic offload), it keeps running on my host computer rather than sending anything to the Xeon Phi.
To support this, I cannot see any activity on the Xeon Phi when I look at the micsmc GUI.
Hence, auto offload is not working.
The R code:
require(Matrix)
sink("output.txt")
N <- 16000
cat("Initialization...\n")
a <- matrix(runif(N*N), ncol=N, nrow=N)
b <- matrix(runif(N*N), ncol=N, nrow=N)
cat("Matrix-matrix multiplication of size ", N, "x", N, ":\n")
for (i in 1:5) {
  dt <- system.time(c <- a %*% b)
  gflops <- 2*N*N*N*1e-9/dt[3]
  cat("Trial: ", i, ", time: ", dt[3], " sec, performance: ", gflops, " GFLOP/s\n")
}
Other steps I have tried:
I then set the MKL_MIC_DISABLE_HOST_FALLBACK=1 environment variable and, as expected, when I ran the above code, R terminated.
The article Using Intel® MKL Automatic Offload on Intel Xeon Phi Coprocessors says that if the HOST_FALLBACK flag is active and offload is attempted but fails (because the "offload runtime cannot find a coprocessor or cannot initialize it properly"), the program terminates; this matches what I see, in that R terminates completely. For completeness, the problem also occurs on R-3.5.1, Microsoft R Open 3.5.0, and R-3.2.1.
So my questions are:
1. What am I missing to make the R code run on the Xeon Phi? Can you please advise me on what I need to do to make this work?
2. (Linked to 1) Is there a way to check if the MKL offload runtime can see the Xeon Phi? Or whether it is correctly set up, or what (if any) problem MKL is having initialising the Xeon Phi?
I will sincerely appreciate it if someone out there can help me. I believe I am missing a fundamental or simple step, and I have been tearing my hair out trying to make this work.
Many thanks in advance,
Rash
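One way to check whether the offload runtime engages at all (a hedged sketch for Windows, assuming the Parallel XE 2017 offload runtime; the Rscript path and matmul.R name are illustrative): with OFFLOAD_REPORT set, every offloaded MKL call prints a report, so silence means automatic offload never happened.

REM run from a command prompt that has the Intel redistributables on PATH
set MKL_MIC_ENABLE=1
set OFFLOAD_REPORT=2
"C:\Program Files\R\R-3.4.3\bin\x64\Rscript.exe" matmul.R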

"A request was made to bind to that would result in binding more processes than cpus on a resource" mpirun command (for mpi4py)

I am running OpenAI baselines, specifically the Hindsight Experience Replay code. (However, I think this question is independent of that code and is MPI-related, which is why I'm posting it on Stack Overflow.)
You can see the README there but the point is, the command to run is:
python -m baselines.her.experiment.train --num_cpu 20
where the number of CPUs can vary and is for MPI.
I am successfully running the HER training script with 1-4 CPUs (i.e., --num_cpu x for x=1,2,3,4) on a single machine with:
Ubuntu 16.04
Python 3.5.2
TensorFlow 1.5.0
One TitanX GPU
The number of CPUs seems to be 8, as I have a quad-core Intel i7 processor with hyperthreading, and Python confirms that it sees 8 CPUs.
(py3-tensorflow) daniel@titan:~/baselines$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os, multiprocessing
In [2]: os.cpu_count()
Out[2]: 8
In [3]: multiprocessing.cpu_count()
Out[3]: 8
Unfortunately, when I run with 5 or more CPUs, I get this message blocking the code from running:
(py3-tensorflow) daniel@titan:~/baselines$ python -m baselines.her.experiment.train --num_cpu 5
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: titan
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
And here's where I got lost. There's no error message pointing to a line of code that I need to fix, so I am unsure where I would even add overload-allowed in the code.
The way this code works, at a high level, is that it takes in this argument and uses the Python subprocess module to run an mpirun command. However, checking mpirun --help on the command line doesn't reveal overload-allowed as a valid argument.
Googling this error message leads to questions in the openmpi repository, for instance:
https://github.com/open-mpi/ompi/issues/626 (seems to have died out without resolving the issue)
https://github.com/open-mpi/ompi/issues/2158 (not sure how this relates to my issue; there was no clear resolution)
But I'm not sure if it's an OpenMPI thing or an mpi4py thing?
Here's pip list in my virtual environment if it helps:
(py3.5-mpi-practice) daniel@titan:~$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
decorator (4.2.1)
ipython (6.2.1)
ipython-genutils (0.2.0)
jedi (0.11.1)
line-profiler (2.1.2)
mpi4py (3.0.0)
numpy (1.14.1)
parso (0.1.1)
pexpect (4.4.0)
pickleshare (0.7.4)
pip (9.0.1)
pkg-resources (0.0.0)
pprintpp (0.3.0)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
setuptools (20.7.0)
simplegeneric (0.8.1)
six (1.11.0)
traitlets (4.3.2)
wcwidth (0.1.7)
So, TL;DR:
How do I fix this error in my code?
If I add the "overload-allowed" thing, what happens? Is it safe?
Thanks!
overload-allowed is a qualifier that is passed to the --bind-to parameter of mpirun (source).
mpirun ... --bind-to core:overload-allowed
Beware that hyperthreading is more about marketing than about performance bonuses.
Your i7 actually has four silicon cores and four "logical" ones. The logical ones basically try to use resources of the silicon cores that are currently unused. The problem is that a good HPC program will use 100% of the CPU hardware, so hyperthreading won't have the resources to operate successfully.
So it is safe to "overload" "cores", but it's not your #1 candidate for a performance boost.
Regarding the advice the paper's authors give about reproducing the results: in the best case, fewer CPUs just means slower learning. However, if learning doesn't converge to the expected value no matter how the hyperparameters are tweaked, then that is a reason to look more closely at the proposed algorithm.
While IEEE 754 computations do differ if done in a different order, this difference should not play a crucial role.
The error message suggests that mpi4py is built on top of Open MPI.
By default, a slot is a core; if you want a slot to be a hyperthread instead, then you should run
mpirun --use-hwthread-cpus ...
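To confirm the Open MPI guess and then apply either workaround (a hedged sketch; python train.py stands in for whatever command the baselines script builds internally):

# print which MPI library mpi4py is linked against
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
# then either allow more processes than cores on the resource ...
mpirun -np 5 --bind-to core:overload-allowed python train.py
# ... or count each hyperthread as a slot
mpirun -np 5 --use-hwthread-cpus python train.py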

Error: Maximal number of DLLs reached

I'm writing an R package which depends upon many other packages. When I load too many packages into the session, I frequently get this error:
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/3.2/Resources/library/proxy/libs/proxy.so':
maximal number of DLLs reached...
The post Exceeded maximum number of DLLs in R pointed out that the issue lies in Rdynload.c in the base R code:
#define MAX_NUM_DLLS 100
Is there any way to bypass this issue other than modifying this constant and building R from source?
As of R 3.4, you can set a higher maximum number of DLLs using the environment variable R_MAX_NUM_DLLS. From the release notes:
The maximum number of DLLs that can be loaded into R e.g. via
dyn.load() can now be increased by setting the environment
variable R_MAX_NUM_DLLS before starting R.
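For example (a minimal sketch; 500 is an arbitrary illustrative value), to raise the limit for a single session started from a shell:

export R_MAX_NUM_DLLS=500
R --vanilla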
Increasing that number is of course "possible"... but it also costs a bit (adding to the fixed memory footprint of R).
I did not set that limit, but I'm pretty sure it was also meant as a reminder for the useR to "clean up" a bit in her / his R session, i.e., not load package namespaces unnecessarily. I cannot yet imagine that you need more than 100 packages | namespaces loaded in your R session.
OTOH, some packages nowadays have a host of dependencies, so I agree that this may at least happen accidentally more frequently than in the past.
The real solution of course would be a code improvement that starts with a relatively small number of "DLLinfo" structures (say 32) and then allocates more batches (of size, say, 32) if needed.
Patches to the R sources (development trunk in Subversion at https://svn.r-project.org/R/trunk/) are very welcome!
---- Added Jan. 26, 2017: In the meantime, we've had a public bug report about this and a proposed patch (which was not good enough: there is always an OS-dependent limit on the number of open files), and today that bug report was closed by R Core member Tomas Kalibera, who implemented new code where the maximal number of loaded DLLs is set at
pmax(100, pmin(1000, 0.6 * OS_dependent_getrlimit_or_equivalent()))
so on Windows and Linux (and, not yet tested, but "almost surely", macOS) the limit should be considerably higher than before.
----- Update #2 (written Jan.5, 2018):
In Oct '17, the above change was made more automatic with the following commit to the sources (of the development version of R only!):
r73545 | kalibera | 2017-10-12 14:41:20
Increase the number of DLLs that can be loaded by default. If needed,
increase the soft limit on open files.
On the help page ?dyn.load (https://stat.ethz.ch/R-manual/R-devel/library/base/html/dynload.html), ulimit -n <num_open_files> is now mentioned (section Note, close to the bottom).
So you might consider using R's development version until that becomes mainstream in April.
Alternatively, you can run (in a terminal / shell)
ulimit -n 2048
and then start R from that terminal. Tomas Kalibera mentioned that this works on macOS.
I had this issue with the simpleSingleCell library from Bioconductor.
On macOS you can't exceed 256, so I set this in my .Renviron in my home directory:
R_MAX_NUM_DLLS=150
It's easy:
Go to the environment variables and edit:
variable_name = R_MAX_NUM_DLLS
value = 1000
Restart R.
Worked well for me.
