I am trying to check for memory leaks in my nginx modules with Valgrind. I am running the following command:
valgrind --leak-check=full --tool=memcheck --show-reachable=yes --log-file="/tmp/val.out" -v /usr/local/nginx -c /usr/local/conf/nginx.conf
I am getting the error nginx: [error] failed to initialize Lua VM
I am using nginx-1.6.2 on CentOS 7 with lua 0.9.15.
I had the same problem and it was fixed by upgrading Valgrind and adding some additional flags to LuaJIT compilation.
Look into:
https://github.com/openresty/lua-nginx-module/issues/681
Specifically:
==52538== Using Valgrind-3.10.0 and LibVEX;
Your version of valgrind does not support the MAP_32BIT flag required by LuaJIT's own allocator. You need either an older version of valgrind (like 3.8.1) or a newer version (like 3.11.0).
https://groups.google.com/forum/#!topic/openresty/riEO_YXTwz4
Specifically (translated):
This is because LuaJIT's own memory allocator uses the MAP_32BIT flag when calling mmap() on Linux x86_64.
Starting with Valgrind 3.9.0, the MAP_32BIT flag for mmap is no longer supported, which makes LuaJIT
initialization fail.
The solution is to recompile a special version of LuaJIT and force it to use the system allocator, using a command like this when compiling LuaJIT:
make CCDEBUG=-g Q= XCFLAGS='-DLUAJIT_USE_VALGRIND -DLUAJIT_USE_SYSMALLOC'
The most important thing here is the LUAJIT_USE_SYSMALLOC macro. Of course, for best results you should also specify the following C compiler options:
-DLUA_USE_APICHECK -DLUA_USE_ASSERT
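Putting it together, a single make invocation with all of the flags above might look like this (a sketch based on the quoted advice, run from the LuaJIT source directory; the install step is an assumption for a typical setup):
# rebuild LuaJIT with the system allocator plus the Valgrind/assert options
make clean
make CCDEBUG=-g Q= XCFLAGS='-DLUAJIT_USE_VALGRIND -DLUAJIT_USE_SYSMALLOC -DLUA_USE_APICHECK -DLUA_USE_ASSERT'
sudo make install
After installing the rebuilt LuaJIT you may also need to rebuild nginx against it (pointing LUAJIT_LIB and LUAJIT_INC at the new build) before re-running Valgrind.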
Reaching out to the wider group as I am totally stumped trying to install Intel MPSS 4.x for my Xeon Phi 7220P.
I followed the precise steps in this link: Intel MPSS Linux User Guide Rev 4.4.1, and did it 3 times to make sure I wasn't missing any steps or making mistakes, but I keep getting the following error readout:
modprobe: WARNING: Module mic_x200_dma not found.
modprobe: WARNING: Module scif_bus not found.
modprobe: WARNING: Module vop_bus not found.
modprobe: WARNING: Module cosm_bus not found.
modprobe: WARNING: Module scif not found.
modprobe: WARNING: Module vop not found.
modprobe: WARNING: Module mic_cosm not found.
modprobe: WARNING: Module mic_x200 not found.
As a result, I can't run basic MPSS commands such as micctrl -s, nor use the Xeon Phi at all.
I am running CentOS 7 (the 862 kernel). I know it's not listed in the Intel PDF, but I did not think this should be causing an issue: it looks as though the kernel modules above are simply not being installed by Intel MPSS, though I'm not sure this diagnosis is correct.
Would appreciate your help - many thanks in advance!
It is complaining because at some point your kernel was updated from 3.10.0-514.el7 to a later version (this happens automatically when you run yum update, which is annoying, I know).
Check your kernel version by running
uname -r
When you installed/compiled all the modules, they were placed into /lib/modules/3.10.0-514.el7.x86_64, which is where the source you built exported them to (i.e., they were built for that kernel, not the one you are now running).
You have 2 options:
Recompile the source code to work with your current kernel version (which is a pain and has its own problems)
Revert your host kernel back to 3.10.0-514.el7 via the GRUB config (example here; a minimal sketch follows) and everything will work nicely
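A minimal sketch of the GRUB route on CentOS 7 (the menu index below is only an example, and it relies on GRUB_DEFAULT=saved, the CentOS 7 default):
# list installed kernels and note the index of the 3.10.0-514.el7 entry
awk -F\' '/^menuentry / {print i++ " : " $2}' /etc/grub2.cfg
# make that entry the default and reboot into it
sudo grub2-set-default 1
sudo reboot
# after the reboot, confirm the running kernel matches the modules directory
uname -r
ls /lib/modules/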
I too struggled with this very much in the beginning, and I had to read pretty much every line of source code and spend countless hours debugging until I figured it out. At this point there is nothing I do not know about the Xeon Phi x100/x200.
The documentation is not bad, but it didn't cover this bare essential, which is frustrating.
I am running OpenAI baselines, specifically the Hindsight Experience Replay code. (However, I think this question is independent of the code and is an MPI-related one, hence why I'm posting on StackOverflow.)
You can see the README there but the point is, the command to run is:
python -m baselines.her.experiment.train --num_cpu 20
where the number of CPUs can vary and is for MPI.
I am successfully running the HER training script with 1-4 CPUs (i.e., --num_cpu x for x=1,2,3,4) on a single machine with:
Ubuntu 16.04
Python 3.5.2
TensorFlow 1.5.0
One TitanX GPU
The number of CPUs seems to be 8 as I have a quad-core i7 Intel processor with hyperthreading, and Python confirms that it sees 8 CPUs.
(py3-tensorflow) daniel@titan:~/baselines$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os, multiprocessing
In [2]: os.cpu_count()
Out[2]: 8
In [3]: multiprocessing.cpu_count()
Out[3]: 8
Unfortunately, when I run with 5 or more CPUs, I get this message blocking the code from running:
(py3-tensorflow) daniel@titan:~/baselines$ python -m baselines.her.experiment.train --num_cpu 5
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: titan
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
And here's where I got lost. There's no error message or line of code that I need to fix, so I am unsure where I would even add overload-allowed in the code.
The way this code works at a high level is that it takes in this argument and uses the Python subprocess module to run an mpirun command. However, checking mpirun --help on the command line doesn't reveal overload-allowed as a valid argument.
Googling this error message leads to questions in the openmpi repository, for instance:
https://github.com/open-mpi/ompi/issues/626 (seems to have died out without resolving issue)
https://github.com/open-mpi/ompi/issues/2158 (not sure how this relates to my issue, didn't get clear resolution)
But I'm not sure if it's an OpenMPI thing or an mpi4py thing?
Here's pip list in my virtual environment if it helps:
(py3.5-mpi-practice) daniel@titan:~$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
decorator (4.2.1)
ipython (6.2.1)
ipython-genutils (0.2.0)
jedi (0.11.1)
line-profiler (2.1.2)
mpi4py (3.0.0)
numpy (1.14.1)
parso (0.1.1)
pexpect (4.4.0)
pickleshare (0.7.4)
pip (9.0.1)
pkg-resources (0.0.0)
pprintpp (0.3.0)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
setuptools (20.7.0)
simplegeneric (0.8.1)
six (1.11.0)
traitlets (4.3.2)
wcwidth (0.1.7)
So, TL;DR:
How do I fix this error in my code?
If I add the "overload-allowed" thing, what happens? Is it safe?
Thanks!
overload-allowed is a qualifier that is passed to the --bind-to parameter of mpirun (source).
mpirun ... --bind-to core:overload-allowed
Beware that hyperthreading is more about marketing than about real performance gains.
Your i7 actually has four physical cores and four "logical" ones. The logical ones basically try to use resources of the physical cores that are currently idle. The problem is that a good HPC program will use 100% of the CPU hardware, so hyperthreading won't have spare resources left to operate with.
So it is safe to "overload" the cores, but it's not your number-one candidate for a performance boost.
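If you want to see the physical-versus-logical split for yourself, lscpu reports it directly (a quick check, assuming a Linux machine with util-linux installed):
# Thread(s) per core > 1 means hyperthreading; Core(s) per socket x Socket(s) is the physical core count
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'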
Regarding the advice the paper's authors give about reproducing the results: in the best case, fewer CPUs just means slower learning. However, if learning doesn't converge to the expected value no matter how the hyperparameters are tweaked, that is a reason to look more closely at the proposed algorithm.
While IEEE 754 computations do differ when performed in a different order, this difference should not play a crucial role.
The error message suggests that mpi4py is built on top of Open MPI.
By default, a slot is a core, but if you want a slot to be a hyperthread, then you should run
mpirun --use-hwthread-cpus ...
I have tried running the OpenMDAO paraboloid tutorial as well as the benchmarks, and I consistently receive the same error, which reads as follows:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash
---------------------------------------------------------------------
MPI_abort was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59.
NOTE: invoking MPI_ABORT causes MPI to kill all MPI processes.
you may or may not see output from other processes, depending on exactly when Open MPI kills them.
I don't understand why this error is occurring and what I can do to be able to run OpenMDAO without getting this error. Can you please help me with this?
Something has not gone well with your PETSc install. It's hard to debug that from afar, though. The problem could be in your MPI install, your PETSc install, or your petsc4py install. I suggest not installing PETSc or petsc4py through pip, though; I've had mixed success with that. Both can be installed from source without tremendous difficulty.
However, to run the tutorials you don't need to have PETSc installed at all. You could remove those packages, and the tutorials will run correctly in serial for you.
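A minimal sketch of that suggestion (the pip package names here are assumptions; if you built PETSc from source, remove it however you installed it):
# remove the PETSc stack so OpenMDAO falls back to running in serial
pip uninstall petsc4py petsc
# then re-run the tutorial; the script name is just a placeholder
python paraboloid.py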
Today's (1 Mar 2016) OpenSSL release has caused the following error when running Plone/Zope
.buildout/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/persistent/cPersistence.so: undefined symbol: SSLv2_method
It's hard to see what's going on since it's a binary file. I also tried updating to ZODB3 3.11.0 which yields the following traceback
.buildout/eggs/ZConfig-2.9.0-py2.7.egg/ZConfig/loader.py", line 217, in schemaComponentSource
package=package)
ZConfig.SchemaResourceError: could not load package ZServer:
.buildout/eggs/zope.security-3.7.4-py2.7-linux-x86_64.egg/zope/security/_proxy.so: undefined symbol: SSLv2_method
Package name: 'ZServer'
File name: 'component.xml'
Package path: None
Is there any workaround for this other than reverting OpenSSL?
zope.security is a compiled egg, like all the ones ending with -py2.7-linux-x86_64.egg.
As the traceback says, it can no longer find a symbol.
You probably have to recompile it against the new openssl-dev.
I would try (on a development server first):
back up your compiled egg (mkdir eggs-backup && mv eggs/zope.security-3.7.4-py2.7-linux-x86_64.egg eggs-backup/)
rerun buildout
This will recompile your missing egg.
Hopefully it works and hopefully it is the only one linked to that library.
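Spelled out as shell commands, the two steps above look roughly like this (run from your buildout directory; the egg name is taken from your traceback):
mkdir -p eggs-backup
mv eggs/zope.security-3.7.4-py2.7-linux-x86_64.egg eggs-backup/   # you may need to do the same for the ZODB3 egg from your first traceback
bin/buildout   # recompiles the moved-away egg against the new OpenSSL headers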
Anyway, depending on the way you patched OpenSSL, you may have a lot of other issues (I am thinking of Python, urllib*, curl, wget, ...).
OpenSSL 1.0.2g by default doesn't build with SSLv2 support (because of the recent DROWN attack). You may need to manually build it without the OPENSSL_NO_SSL2 flag, i.e., with SSLv2 re-enabled.
(But in fact you shouldn't do this if you're running anything server-related; there is a serious security reason it was disabled, see https://drownattack.com.)
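If you do go down that road anyway, a hedged sketch of rebuilding OpenSSL 1.0.2g with SSLv2 re-enabled looks like this (the prefix is a placeholder; enable-ssl2 keeps OPENSSL_NO_SSL2 from being defined):
./config enable-ssl2 shared --prefix=/usr/local/openssl-ssl2
make
sudo make install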
I was able to resolve this by upgrading python to 2.7.10+, and then upgrading Pillow and lxml.
I'm running a standalone application using Apache Spark, and when I load all my data into an RDD as a text file I get the following error:
15/02/27 20:34:40 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Exception in thread "stdout writer for python" java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
I thought this was related to the fact that I'm caching the whole RDD to memory with the cache function, but I haven't noticed any change after removing that call from my code, so I keep getting this error.
My RDD is derived from several text files inside a directory located in a Google Cloud Storage bucket.
Could you help me to solve this error?
Spark requires a fair bit of configuration tuning depending on cluster size, shape, and workload, and out of the box it probably won't work for realistically sized workloads.
When using bdutil to deploy, the best way to get Spark is actually to use the officially-supported bdutil plugin, simply with:
./bdutil -e extensions/spark/spark_env.sh deploy
Or equivalently as shorthand:
./bdutil -e spark deploy
This will make sure the gcs-connector and memory settings, etc., are all properly configured in Spark.
You can also theoretically use bdutil to install Spark directly on your existing cluster, though this is less thoroughly-tested:
# After you've already deployed the cluster with ./bdutil deploy:
./bdutil -e spark run_command_group install_spark -t all
./bdutil -e spark run_command_group spark_configure_startup -t all
./bdutil -e spark run_command_group start_spark -t master
This should be the same as if you had just run ./bdutil -e spark deploy originally. If you had deployed with ./bdutil -e my_custom_env.sh deploy then all the above commands need to actually start with ./bdutil -e my_custom_env.sh -e spark run_command_group.
In your case, the relevant Spark memory settings were probably spark.executor.memory and/or SPARK_WORKER_MEMORY and/or SPARK_DAEMON_MEMORY; a hedged example of where these live follows.
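For reference, these settings go in Spark's standard config files (the values below are placeholders, not recommendations):
# in $SPARK_HOME/conf/spark-defaults.conf
spark.executor.memory 4g
# in $SPARK_HOME/conf/spark-env.sh
export SPARK_WORKER_MEMORY=6g
export SPARK_DAEMON_MEMORY=1g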
EDIT: On a related note, we just released bdutil-1.2.0, which defaults to Spark 1.2.1 and also adds improved Spark driver memory settings and YARN support.