OpenCL: Kernel Code? [closed] - opencl

I am working on optimizing an ADAS algorithm that is written in C++, and I want to optimize it using OpenCL. I have gone through some basic OpenCL documentation and came to know that the kernel code is written in C and that this is what does the optimization. But I want to know: how does the kernel internally split the work into different work-items? How does a single kernel statement do the job of a for loop?
Please share your knowledge of OpenCL with me.
Tr,
Ashwin

First of all, the C code is not doing the optimization; the parallelism is. Optimization with OpenCL only works for algorithms that can heavily exploit parallelism. If you use OpenCL like regular C, you will probably slow your algorithm down, because it takes a lot of time to move data between the host and the device.
Secondly, the kernel does not split the work into different work-items. The programmer splits it, by launching many work-items that each run the same kernel code in parallel. You set how many work-items to launch via the global_work_size argument of clEnqueueNDRangeKernel; inside the kernel, each work-item asks for its own index with get_global_id to find the piece of data it should handle, as in the sketch below.
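To make that concrete, here is a minimal kernel sketch (the kernel name scale and its arguments are invented for illustration, not taken from your code). Each work-item runs the same code but picks out its own element via get_global_id(0):

    // Every launched work-item executes this same function; get_global_id(0)
    // tells each instance which element it is responsible for.
    __kernel void scale(__global const float *in,
                        __global float *out,
                        const float factor)
    {
        size_t i = get_global_id(0);   // ranges over 0 .. global_work_size-1
        out[i] = in[i] * factor;
    }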
If you have a for loop whose iterations do not depend on each other, it can be a good candidate to optimize with OpenCL. It also helps if there is quite a lot of computation in that loop but not much data going into it and out of it. In that case you make the body of the loop into an OpenCL kernel and launch it with a global_work_size equal to the for loop's total iteration count, as in the host-side sketch below.
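For example, a serial loop such as for (i = 0; i < n; i++) out[i] = in[i] * factor; maps onto the kernel sketched above by launching n work-items. A hedged host-side sketch, assuming the context, queue, program, kernel and buffers (in_buf, out_buf) have already been created, with error checking omitted:

    size_t global_work_size = n;   // one work-item per loop iteration

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(kernel, 2, sizeof(float), &factor);

    // The explicit for loop disappears: the OpenCL runtime schedules n
    // instances of the kernel across the device's compute units.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_work_size, NULL, 0, NULL, NULL);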

Related

Effective simulation of large scale Modelica models by automatic translation to Modia [closed]

This is more of a hypothetical question, but it might have great consequences. Many of us in the Modelica community deal with large-scale systems with expensive simulation times. This is usually not an obstacle for bug-fixing and development, but speeding up the simulation might allow better and faster optimizations.
Recently I came across Modia, which claims to have superb numerical solvers, achieving better simulation times than Dymola, a state-of-the-art Modelica compiler. The syntax seems to cover all the important bits. Recreating large-scale component models in Modia is unfeasible, but what about automatically translating the flattened Modelica to Modia? Is that realistic? Would it provide a speed-up? Has anyone tried it before? I have searched for some
This might also, hopefully, improve the integration of Modelica models and post-processing / identification tooling within one language, instead of using FMI or invoking a separate executable.
Thanks for any suggestions.
For those interested, we might as well start developing this.
We in the Modia team agree that the modeling know-how in Modelica libraries must be reused, so we are working on a translator from Modelica to Modia (brief details are given in https://ep.liu.se/ecp/157/060/ecp19157060.pdf). The plan is to initially provide translated versions of Modelica.Blocks, Modelica.Electrical.Analog and Modelica.Mechanics together with Modia.

Single Process (Shared-Memory) Multiple-CPU Parallelism in R [closed]

I have used mclapply quite a bit and love it. It is a memory hog but very convenient. Alas, I now have a different problem that is not simply embarrassingly parallel.
Can R (especially Unix R) employ multiple CPU cores on a single computer, sharing the same memory space, without resorting to copying full OS processes, so that
there is minimal process overhead; and
modifications of global data by one CPU are immediately visible to the other CPUs?
If yes, can R lock parts of memory in the same way it can lock files (flock)?
I suspect that the answer is no, and learning this definitively would be very useful. If the answer is yes, please point me in the right direction.
regards,
/iaw
You can use the Rdsm package for distributed shared-memory parallelism, i.e. multiple R processes using the same memory space.
Besides that, you can employ multi-threaded BLAS/LAPACK (e.g. OpenBLAS or Intel MKL) and you can use C/C++ (and probably Fortran) code together with OpenMP. See assembling a matrix from diagonal slices with mclapply or %dopar%, like Matrix::bandSparse for an example.
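As a hedged sketch of that C-plus-OpenMP route (the function col_sums and the idea of calling it from R through the .C() interface are illustrative assumptions, not part of the linked question):

    /* Sums each column of an n x p matrix stored column-major, the way R
       stores matrices. Compiled with -fopenmp into a shared library, it can
       be loaded with dyn.load() and called via .C("col_sums", ...). */
    void col_sums(double *x, int *nrow, int *ncol, double *out)
    {
        int n = *nrow, p = *ncol;
        /* All threads share the caller's address space, so nothing is
           copied; each thread handles a disjoint set of columns. */
        #pragma omp parallel for
        for (int j = 0; j < p; j++) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[(long)j * n + i];
            out[j] = s;
        }
    }

Because the threads operate on shared memory within one process, this avoids the process-copying overhead of mclapply, though it only helps for the compute-heavy pieces you are willing to push down into C.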
Have you taken a look at Microsoft R Open (available for Linux), with its custom Math Kernel Library (MKL)?
I've seen very good performance improvements without rewriting code.
https://mran.microsoft.com/documents/rro/multithread

Estimating CPU and Memory Requirements for a Big Data Project [closed]

I am working on an analysis of big data, which is based on social network data combined with data on the social network users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
system.time() to measure the current CPU usage of my functions
Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
Rprofmem("Rprofmem.out", threshold = 10485760) to log objects that exceed 10 MB
require(parallel) to give me multicore and parallel functionality for use in my functions
source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences between single-core and parallel modes
sort(sapply(ls(), function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
print(object.size(x=lapply(ls(), get)), units="Mb") to give me the total memory used at the completion of my script
The tools above give me lots of good data points, and I know that many more tools exist to provide related information as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and in CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM and/or storage requirements as the size of my input data increases over time with growing usage of the social network that I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to build a chart or a regression model that shows how many CPU cores I will need as the size of my input data increases, holding CPU speed and the time the scripts should take to complete constant?

Optimizing an SBCL Application Program for Speed [closed]

I've just finished and tested the core of a Common Lisp application and now want to optimize it for speed. It works with SBCL and makes use of CLOS.
Could someone outline the way to optimize my code for speed?
Where will I have to start? Will I just have to provide some global declaration or will I have to blow up my code with type information for each binding? Is there a way to find out which parts of my code could be compiled better with further type information?
The program makes heavy use of a single one-dimensional array (indices 0..119) in which it shifts CLOS instances around.
Thank you in advance!
It's not great to optimize in a vacuum, because there's no limit to the ugliness you can introduce to make things some fraction of a percent faster.
If it's not fast enough, it's helpful to define what success means so you know when to stop.
With that in mind, a good first pass is to run your project under the profiler (sb-sprof) to get an idea of where the time is spent. If it's in generic arithmetic, it can help to judiciously use modular arithmetic in inner loops. If it's in CLOS stuff, it might possibly help to switch to structures for key bits of data. Whatever's the most wasteful will direct where to spend your effort in optimization.
I think it could be helpful if, after profiling, you post a followup question along the lines of "A lot of my program's time is spent in <foo>, how do I make it faster?"

Unix Parent-child process relationship [closed]

I understand the parent-child relationship in Unix process creation well, but I don't understand the rationale behind it. Why do we need to fork from the current process to create a new one, and then overwrite its image with new code (if any)? Cheers.
The rationale is that unix system calls (at least originally) are "elementary" operations done by the kernel.
In practice, applications often do some specific things between fork(2) and execve(2), in particular calls to close(2) and dup2(2), also sigaction(2) to ignore some signals (with perhaps some pipe(2) syscalls done before the fork).
If you wanted a single syscall that handled all of this at once, it would have been very complex and less flexible.
I suggest reading a book like Advanced Linux Programming (it is free and online) or Advanced Unix Programming, in addition to intro(2).
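To illustrate the kind of work that typically happens between fork(2) and execve(2), here is a small, hedged sketch (the choice of running ls -l through a pipe is arbitrary, and error handling is trimmed for brevity):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                   /* child: customize it, then exec */
            close(fds[0]);                /* the child only writes */
            dup2(fds[1], STDOUT_FILENO);  /* redirect stdout into the pipe */
            close(fds[1]);
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);                   /* reached only if exec failed */
        }

        close(fds[1]);                    /* the parent only reads */
        char buf[4096];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }

A single combined "create-and-exec" primitive would have to expose every one of those per-child adjustments as options, which is exactly the complexity mentioned above.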
On the contrary, I find the intent to separate creating a process from executing a program quite natural. I don't really understand why you would want both operations combined.
See also this answer of mine about syscalls.
