I am trying to run "nvprof" from the command line on R. Here is how I am doing it:
./nvprof --print-gpu-trace --devices 0 --analysis-metrics --export-profile /home/xxxxx/%p R
This gives me an R prompt and I write R code. I can do the same with Rscript too.
The problem I see is that when I give the --analysis-metrics option, it prints lots of lines similar to
==44041== Replaying kernel "void ger_kernel(cublasGerParams)"
and the R process never ends. I am not sure what I am missing.
nvprof doesn't modify process exit behavior, so I think you're just suffering from slowness because your app invokes a lot of kernels. You have two options to speed this up.
1. Selectively profiling metrics
The --analysis-metrics option enables collection of a number of metrics, which requires kernels to be replayed - collecting a different set of metrics for each kernel run.
If your application has a lot of kernel invocations, this can take time. I'd suggest you query the available metrics with the nvprof --query-metrics command, and then manually choose the metrics you are interested in.
Once you know which metrics you want, you can query them using nvprof -m metric_1,metric_2,.... This way, the application profiles fewer metrics, hence requires fewer replays and runs faster.
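For example (the two metric names below are only placeholders; substitute whichever ones nvprof --query-metrics lists for your device):
nvprof --query-metrics
nvprof -m achieved_occupancy,ipc ./your_cuda_app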
2. Selectively profiling kernels
Alternatively, you can profile only a specific kernel using the --kernels <context id/name>:<stream id/name>:<kernel name>:<invocation> option.
For example, nvprof --kernels ::foo:2 --analysis-metrics ./your_cuda_app will profile all analysis metrics for the kernel whose name contains the string foo, and only on its second invocation. This option takes regular expressions, and is quite powerful.
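For the ger_kernel replays shown in the question, a sketch along the same lines (the export path is only an example) would be:
nvprof --kernels ::ger_kernel:1 --analysis-metrics --export-profile /tmp/ger.%p R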
You can mix and match the above two approaches to speed up profiling. You will be able to find more help about these and other nvprof options using the command nvprof --help.
I'm running distributed MPI programs on clusters using multiple nodes, where I make use of the MPI FFTs of FFTW. To save time I reuse wisdom from one run to the next. To generate this wisdom, FFTW experiments with a lot of different algorithms and whatnot for the given problem. I am worried that because I am working on a cluster, the best solution stored as wisdom for one set of CPUs/nodes may not be the best solution for some other set of CPUs/nodes performing the same task, and so I should not reuse wisdom unless I am running on exactly the same CPUs/nodes as the run where the wisdom was gathered.
Is this correct, or is the wisdom somehow completely indifferent to the physical hardware on which it is generated?
If your cluster is homogeneous, the saved FFTW plans likely make sense, though the way the processes are connected may affect the optimal plans for MPI-related operations. But if your cluster is not homogeneous, saving the FFTW plan can be suboptimal, and problems related to load balance could prove hard to solve.
Taking a look at wisdom files produced by fftw and fftw_mpi for a 2D c2c transform, I can see additional lines likely related to phases like transposition, where MPI communications are required, such as:
(fftw_mpi_transpose_pairwise_register 0 #x1040 #x1040 #x0 #x394c59f5 #xf7d5729e #xe8cf4383 #xce624769)
Indeed, there are different algorithms for transposing the 2D (or 3D) array: in the mpi folder of the FFTW source, the files transpose-pairwise.c, transpose-alltoall.c and transpose-recurse.c implement these algorithms. When the flags FFTW_MEASURE or FFTW_EXHAUSTIVE are set, these algorithms are run to select the fastest, as stated here. The result might depend on the topology of the network of processes (how many processes on each node? how are these nodes connected?). If the optimal plan depends on where the processes are running and on the network topology, using the wisdom utility will not be decisive. Otherwise, using the wisdom feature can save some time as the plan is built.
To test whether the optimal plan changed, you can perform a couple of runs and save the resulting plan in files: a reproducibility test!
#include <stdio.h>
#include <mpi.h>
#include <fftw3-mpi.h>

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* gather the wisdom of all processes on rank 0 and save it */
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wisdommpi.txt");

/* save the plan on each process! Depending on the file system of the
   cluster, performing communications can be required */
char filename[42];
sprintf(filename, "wisdom%d.txt", rank);
fftw_export_wisdom_to_filename(filename);
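As an aside, reusing the saved wisdom on a later run is the mirror image of the export above; a sketch using FFTW's documented wisdom API:
/* on a subsequent run, before creating any plan: rank 0 reads the
   wisdom file and broadcasts it to all other processes */
if (rank == 0) fftw_import_wisdom_from_filename("wisdommpi.txt");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);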
Finally, to compare the produced wisdom files, try in a bash script:
for filename in wis*.txt; do
  for filename2 in wis*.txt; do
    echo "."
    if grep -Fqvf "$filename" "$filename2"; then
      echo "$filename"
      echo "$filename2"
      echo "There are lines in $filename2 that do not occur in $filename."
    fi
  done
done
This script checks that all lines in each file are also present in the other files, following Check if all lines from one file are present somewhere in another file
On my personal computer, using mpirun -np 4 main, all wisdom files are identical except for a permutation of lines.
If the files are different from one run to another, it could be attributed to the communication pattern between processes... or to the sequential performance of the DFT for each process. The piece of code above saves the optimal plan for each process. If lines related to sequential operations, without fftw_mpi in them, such as:
(fftw_codelet_n1fv_10_sse2 0 #x1440 #x1440 #x0 #xa9be7eee #x53354c26 #xc32b0044 #xb92f3bfd)
become different, it is a clue that the optimal sequential algorithm changes from one process to another. In this case, the wall-clock time of the sequential operations may also differ from one process to another. Hence, checking the load balance between processes could be instructive. As noticed in the documentation of FFTW about load balance:
Load balancing is especially difficult when you are parallelizing over heterogeneous machines; ... FFTW does not deal with this problem, however—it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.
This assumption is consistent with the operation performed by fftw_mpi_gather_wisdom():
(If the plans created for the same problem by different processes are not the same, fftw_mpi_gather_wisdom will arbitrarily choose one of the plans.) Both of these functions may result in suboptimal plans for different processes if the processes are running on non-identical hardware...
The transpose operation in 2D and 3D FFTs requires a lot of communications: one of the implementations is a call to MPI_Alltoall involving almost the whole array. Hence, good connectivity between nodes (InfiniBand...) can prove useful.
Let us know if you found different optimal plans from one run to another and how these plans differ!
I have a Tcl script running inside a Tcl shell (Synopsys PrimeTime, if it makes any difference).
The script is initiated by source <script> from the shell.
The script calls itself recursively: after a specific time interval has passed, it invokes source <script> at the end of the script.
My question is a bit academic: Could there be a stack-overflow issue if the script keeps calling itself in this method?
If I expand the question: what happens when a Tcl script sources another script? Does it fork a child process? If so, then every call forks another child, which would eventually stack up into a pile of processes. But since the source command itself is not parallel, there is no fork (from my understanding).
Hope the question is clear.
Thanks.
Short answer: yes.
If you're using Tcl 8.5 or before, you'll run out of C stack. There's code to try to detect it and throw a soft (catchable) error if you do. There's also a (lower) limit on the number of recursions that can be done, controllable via interp recursionlimit. Note that this is counting recursive entries to the core Tcl script interpreter engine; it's not exactly recursion levels in your script, though it is very close.
# Set the recursion limit for the current interpreter to 2000
interp recursionlimit {} 2000
The default is 1000, which is enough for nearly any non-recursive algorithm.
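A minimal sketch of what hitting the limit looks like (the recurse proc is hypothetical; the point is that the error is an ordinary catchable one):
proc recurse {n} {
    recurse [expr {$n + 1}]
}
if {[catch {recurse 0} msg]} {
    puts "caught: $msg"   ;# caught: too many nested evaluations (infinite loop?)
}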
In Tcl 8.6, a non-recursive execution engine is used for most commands (including source). This lets your code use much greater recursion depths, limited mainly by how much general memory you have. I've successfully run code with recursion depths of over a million on conventional hardware.
You'll still need to raise the interp recursionlimit, though; the default 1000 limit remains because, more often than not, it catches bugs (i.e., unintentional recursions). It's just that you can now meaningfully raise it much further.
The source command doesn't fork a new process. It acts as if the lines of the sourced file were there in place of the invocation of source; they are interpreted by the current interpreter unless you specify otherwise.
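For instance (a hypothetical pair of files), variables set by the sourced file are visible to the caller afterwards, exactly as if its lines had been pasted in:
# inner.tcl (hypothetical) contains the single line:  set x 2
set x 1
source inner.tcl
puts $x    ;# prints 2, as if "set x 2" were written here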
I'm running a simulation in an IPython notebook that is composed of seven functions that depend on each other and requires 13 different parameters. Some of the functions are called within other functions so that a single function can run the entire simulation. The simulation involves manipulating two parameters for a total of >20k iterations. Two simulations can be run asynchronously. Since each iteration takes ~1.5 seconds, I'm investigating parallel processing.
When I first tried ipyparallel, I got a global name not defined error. It makes sense that local objects can't be found on a worker. In an effort to avoid spending quite a bit of time going down a rabbit hole, what would be the easiest way to pass a whole bunch of objects to all of the workers? Are there other gotchas to consider when using ipyparallel in this way?
There is a bit more detail in this related question, but the gist is: interactively defined functions resolve in the interactive namespace (__main__), which is different on the engine and the client. You can send functions to the engines with view.push(dict(func=func, func2=func2)), in which case they will be found. The alternative is to define your functions in a module or package that you ensure is installed on all the engines.
For instance, in a script:
def bar(x):
    return x * x

def foo(y):
    return bar(y)

view.apply(foo, 5)        # NameError on bar
view.push(dict(bar=bar))  # send bar
view.apply(foo, 5)        # 25
Often when using IPython parallel from a notebook or larger script, one of the early steps is seeding the namespace of the engines:
rc[:].push(dict(
    f1=f1,
    f2=f2,
    const=const,
))
If you have more than a few names to push this way, it might be time to consider defining these functions in a module, and distributing that instead.
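A sketch of that module approach (mymod and its functions are hypothetical; the module must be importable on every engine, e.g. installed or on a shared filesystem):
# mymod.py, installed where every engine can import it
def bar(x):
    return x * x

def foo(y):
    return bar(y)

# client side (notebook or script)
import ipyparallel as ipp

rc = ipp.Client()
view = rc[:]

# run the import on the engines as well as locally
with view.sync_imports():
    import mymod

# module-level functions are resolved by import on the engines,
# so nothing needs to be pushed
view.apply_sync(mymod.foo, 5)  # one result per engine: [25, 25, ...]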
It seems that I can duplicate a kernel by getting the program object and kernel name from the kernel, and then creating a new one.
Is this the right way? It doesn't look so good, though.
EDIT: To answer the question properly: yes, it is the correct way; there is no other way in CL 2.0 or earlier versions.
The compilation (and therefore the slow step) of CL code creation happens during "program" creation (clBuildProgram + clLinkProgram).
When you create a kernel, you are just creating an object that packs:
An entry point to a function in the program code
Parameters for input + output to that function
Some memory to remember all the above data between calls
It is a simple task that should be almost free.
That is why it is preferable to have multiple kernels with different input parameters, rather than one single kernel whose parameters change every loop iteration.
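A minimal C sketch of that duplication (error checking omitted; the clone_kernel name is mine, but the three API calls are standard OpenCL):
#include <stdlib.h>
#include <CL/cl.h>

/* create an independent copy of a kernel, with its own argument state */
cl_kernel clone_kernel(cl_kernel kernel)
{
    cl_program program;
    size_t name_len;

    /* recover the program object and the kernel's function name */
    clGetKernelInfo(kernel, CL_KERNEL_PROGRAM, sizeof(program), &program, NULL);
    clGetKernelInfo(kernel, CL_KERNEL_FUNCTION_NAME, 0, NULL, &name_len);

    char *name = malloc(name_len);
    clGetKernelInfo(kernel, CL_KERNEL_FUNCTION_NAME, name_len, name, NULL);

    /* cheap: no recompilation, just a new entry-point object in the program */
    cl_kernel copy = clCreateKernel(program, name, NULL);
    free(name);
    return copy;
}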
I have started using the doMC package for R as the parallel backend for parallelised plyr routines.
The parallelisation itself seems to be working fine (though I have yet to properly benchmark the speedup), but my problem is that the logging is now asynchronous and messages from different cores get mixed in together. I could create different logfiles for each core, but I think a neater solution is to simply add a different label for each core. I am currently using the log4r package for my logging needs.
I remember that when using MPI each processor got a rank, which was a way of distinguishing each process from the others, so is there a way to do this with doMC? I did have the idea of extracting the PID, but this seems messy and will change for every iteration.
I am open to ideas though, so any suggestions are welcome.
EDIT (2011-04-08): Going with the suggestion of one answer, I still have the issue of correctly identifying which subprocess I am currently in. I would either need separate closures for each log() call so that each writes to the correct file, or a single log() function with some logic inside it determining which logfile to append to. In either case, I would still need some way of labelling the current subprocess, but I am not sure how to do this.
Is there an equivalent of MPI's MPI_Comm_rank function?
I think having multiple processes write to the same file is a recipe for disaster (it's just a log though, so maybe "disaster" is a bit strong).
Oftentimes I parallelize work over chromosomes. Here is an example of what I'd do (I've mostly been using foreach/doMC):
foreach(chr=chromosomes, ...) %dopar% {
  cat("+++", chr, "+++\n")
  ## ... some undoubtedly amazing code would then follow ...
}
And it wouldn't be unusual to get output that tramples over each other ... something like (not exactly) this:
+++chr1+++
+++chr2+++
++++chr3++chr4+++
... you get the idea ...
If I were in your shoes, I think I'd split the logs for each process and set their respective filenames to be unique with respect to something happening in that process's loop (like chr in my case above). Collate them later if you must ... i.e., map/reduce your log files :-)
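A minimal R sketch of that idea (the logfile naming is mine, and chromosomes is assumed to be defined as above):
library(foreach)
library(doMC)
registerDoMC(cores = 4)

results <- foreach(chr = chromosomes) %dopar% {
  # one logfile per task, keyed on the loop variable rather than the PID
  logfile <- sprintf("run-%s.log", chr)
  cat("+++", chr, "+++\n", file = logfile, append = TRUE)
  ## ... amazing code, logging to `logfile` as it goes ...
}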