Quartus fitter takes a lot of RAM - OpenCL

I am currently working on an Altera OpenCL project on an FPGA. When the compilation moves into quartus_fit, it takes 80+% of the RAM on my PC (I have 32GB), and the fitting crashes after around ten hours. Is fitting supposed to take this much resource? I don't know how to resolve it. Is the Quartus fitter guaranteed to finish if synthesis is successful? Thanks

I have seen quartus_fit consume around 110GB, so when building large designs it will most likely fail due to insufficient RAM.

17.1 is the most stable release of the OpenCL SDK and Quartus Pro tools so far; if it is crashing in other releases, I feel your pain. If you are compiling for an 1150 Arria 10, 'they' recommend 64GB of RAM, though at my company we are able to compile a 50%-full 1150 FPGA in 32GB across multiple projects. 10+ hours used to be normal in 13.1, but a small 250 MHz kernel should compile in under 2 hours. Large memory consumption usually indicates a problem in the code or the BSP. Look at your timing results in TimeQuest after the OpenCL flow completes (it takes just a few minutes): if you have A LOT of violations, this usually indicates that something is wrong.

Related

Visual Studio 2022 slow build time

We are working on an ASP.NET Core 6 project in Visual Studio 2022, and the build process gets stuck at
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\Roslyn\csc.exe
The build completes successfully, but it is slow: about 1 minute and 30 seconds.
How can I reduce the build time?
Any help would be much appreciated.
I am working on Blazor in VS2022, where every change requires recompilation or partial compilation (hot reload), which was painfully slow. I recommend the following changes to speed up build times.
CPU
Get a processor with a high turbo clock rate, around 4GHz-5GHz.
If you are running a laptop, try to get an Intel processor whose model number ends with the letter H, for example the recent 12700H/12900H. These are insanely fast laptop processors that can outperform many desktop CPUs.
Ensure your computer is using the Windows Performance profile or equivalent so that your CPU is not being throttled to save power.
DISK
First prize is a Gen4 NVMe drive paired with a computer that supports Gen4 NVMe. Second prize is any NVMe drive.
ENCRYPTION
First prize is not using disk encryption at all, but if you do need it, opt for hardware encryption, as software encryption consumes CPU resources and leaves less for compiling. Hardware encryption uses the SSD's own internal encryption engine (which is always active) to handle the encryption.
My own testing showed a roughly 40% loss in write performance with software encryption.
RAM
Make sure you have enough RAM that Windows is not swapping memory to disk in order to compile your project. Most often 16GB of RAM is sufficient, but I personally prefer 32GB so that Windows can cache more in memory.
VS2022
Disable Visual Studio analyzers during build; some have reported noticeably faster builds with the analyzers turned off.
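If you want to try this, a minimal sketch: RunAnalyzersDuringBuild is the documented MSBuild property that controls it; add it to any PropertyGroup in your .csproj.

<PropertyGroup>
  <!-- Skip Roslyn analyzers during builds; live analysis in the IDE is controlled separately. -->
  <RunAnalyzersDuringBuild>false</RunAnalyzersDuringBuild>
</PropertyGroup>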

Optimize mathematical library (libm)

Has anyone tried to compile glibc with -march=corei7 to see if there's any performance improvement over the version that ships by default with a Linux x86_64 distribution? GCC is compiled with -march=i686. I think (but am not sure) that the mathematical library is also compiled the same way. Can anybody confirm this?
Most Linux distributions for x86 compile using only i686 instructions, but ask the compiler to schedule them for later processors. I haven't really followed later developments.
A long while back, shipping different versions of system libraries for different processor lines was common, but the performance differences were soon deemed too small to be worth the cost, and machines have since become more uniform in performance.
One thing that always has to be remembered is that today's machines are memory bound: a memory access takes a few hundred times longer than an instruction, and the gap is growing. Not to mention that this machine (an oldish laptop, top-of-the-line some two years back) has 4 cores (8 threads), all battling to get data and instructions from memory. Making the code run a tiny bit faster, so the CPU can wait longer for RAM, isn't very productive.
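If you want to measure this for your own workload, here is a minimal C sketch of a libm-heavy micro-benchmark. Build it against the stock libraries (e.g. gcc -O2 bench.c -lm), then run it again against a glibc you rebuilt with -march=corei7 (e.g. by pointing LD_LIBRARY_PATH at the rebuilt libraries; the mechanics of swapping glibc are fiddly and beyond this sketch) and compare the timings.

/* bench.c: time a loop dominated by libm calls. */
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    double acc = 0.0;
    clock_t t0 = clock();
    for (int i = 1; i <= 10000000; i++)
        acc += sin(i * 1e-7);  /* the sin() call dominates each iteration */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    /* Print acc so the compiler cannot optimize the loop away. */
    printf("acc = %f, time = %.3f s\n", acc, secs);
    return 0;
}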

Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL

The peak GFLOPS of the cores for the desktop i7-4770K @ 4GHz is 4 GHz * 8 (AVX single-precision lanes) * 4 (two FMA units * 2 FLOPs per FMA) * 4 cores = 512 GFLOPS. But the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS. Some algorithms will therefore run even faster on the IGP, and combining the cores with the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 takes up over 30% of the die now. It seems clear which direction Intel desktop processors are headed.
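For reference, that peak-FLOPS arithmetic spelled out as a small Python sketch (the breakdown of the per-lane factor into two FMA units times 2 FLOPs per FMA is my reading of Haswell, not something stated above):

# Theoretical peak single-precision GFLOPS from the usual factors.
def peak_gflops(clock_ghz, simd_lanes, flops_per_lane_per_cycle, cores):
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle * cores

# i7-4770K: 4 GHz, 8 SP lanes per AVX register,
# 2 FMA units * 2 FLOPs per FMA = 4 FLOPs per lane per cycle, 4 cores.
print(peak_gflops(4, 8, 4, 4))  # prints 512 (GFLOPS)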
As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of OpenCL/OpenGL. I'm curious how one can program the Intel HD Graphics hardware for compute (e.g. SGEMM) without OpenCL.
Added comment:
There is no Intel support for OpenCL on HD Graphics under Linux. I found Beignet, an open-source attempt to add support on Linux, at least for Ivy Bridge HD Graphics. I have not tried it. Presumably the people developing Beignet know how to program the HD Graphics hardware without OpenCL, then.
Keep in mind that there is a performance hit to copy the data to the video card and back, so this must be taken into account. AMD is close to releasing APU chips that have unified memory for the CPU and GPU on the same die, which will go a long way towards alleviating this problem.
The way the GPU used to be utilized before CUDA and OpenCL was to represent the memory to be operated on as a texture, using DirectX or OpenGL. Thank goodness we don't have to do that anymore!
AMD is really pushing the APU/OpenCL model, so more programs should take advantage of the GPU via OpenCL, if the performance trade-off is there. Currently, GPU computing is a bit of a niche market relegated to high-performance computing or number crunching that just isn't needed for web browsing and word processing.
It doesn't make sense anymore for vendors to let you program using a low-level ISA.
It's very hard and most programmers won't use it.
It keeps them from adjusting the ISA in future revisions.
So programmers use a language (like C99 in OpenCL) and the runtime does ISA-specific optimizations right on the user's machine.
An example of what this enables: AMD switched from VLIW vector machines to scalar machines and existing kernels still ran (most ran faster). You couldn't do this if you wrote ISA directly.
Programming a coprocessor like Iris without OpenCL is rather like driving a car without a steering wheel.
OpenCL is designed to expose the requisite parallelism that Iris needs to achieve its theoretical performance. You can't just spawn hundreds of threads or processes on it and expect performance. Having blocks of threads doing the same thing, at the same time, on similar memory addresses, is the whole crux of the matter.
Maybe you can think of a better paradigm than OpenCL for achieving that goal, but until you do, I suggest you try learning some OpenCL. If you are into Python, pyopencl is a great place to start.
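To make that concrete, here is a minimal pyopencl sketch, essentially the standard pyopencl vector-add demo: every work-item runs the same kernel at the same time on adjacent addresses, which is exactly the parallelism pattern described above.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()   # pick a device (e.g. an Iris IGP, if a driver exposes one)
queue = cl.CommandQueue(ctx)

a = np.random.rand(50000).astype(np.float32)
b = np.random.rand(50000).astype(np.float32)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# The C99 kernel source is compiled at runtime for whatever device was picked.
prg = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *res) {
    int gid = get_global_id(0);
    res[gid] = a[gid] + b[gid];
}
""").build()

prg.add(queue, a.shape, None, a_g, b_g, res_g)  # one work-item per element

res = np.empty_like(a)
cl.enqueue_copy(queue, res, res_g)              # copy the result back to the host
assert np.allclose(res, a + b)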

R Performance Differential (Solaris vs Windows)

I noticed an interesting problem. If I run the following code in R 2.12.0 (32-bit) on a Windows machine with a 3.00 GHz Core 2 Duo CPU and 2GB of RAM, it runs in less than one second. If I run it on a Unix box with sparc-sun-solaris2.10 (also 32-bit, though the Unix box could run 64-bit), it takes 84 seconds; the processor speed of the Unix box is 2.5 GHz. If I run top while the code is running, I notice that my R process is only taking up to ~3.2% of the available CPU, even when more is available. Could this be part of the problem? I read the install manual, but nothing jumped out at me as the obvious solution. Is the Unix operating system somehow limiting available resources while Windows is not? Or is there some preferable way to compile R from source that was not done? I apologize if I have not given enough information to answer the problem; this is not really my area of expertise.
t0 <- proc.time()[[3]]  # elapsed seconds before the loop
x <- rnorm(10000)
for (i in 1:10000) {
  sd(x)
}
print(proc.time()[[3]] - t0)  # elapsed seconds for 10,000 sd() calls
Processors such as the T1 or T2 have a number of cores, and each core has a number of strands (hardware-level context switching). If you can run a multithreaded application, you'll get high throughput; a typical intended use case would be a Java-based web server processing, say, 20-40 connections at the same time.
The downside of this type of processor is that the single-threaded performance of these SPARC chips is quite low. Oracle appears to be aware of the issue; current development on the T4 focuses on improving single-threaded speed.
The T1 processor exposes 32 logical CPUs to the operating system. If this is your case, and the displayed value is the percentage of total computing power, then 1/32 ≈ 3.125%, which is close to what you saw.
To squeeze all the performance from a T1 processor, you need to make R use multiple CPUs, for example via the multicore package.
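For example, a minimal sketch using mclapply (the multicore package's functionality has since been folded into base R's parallel package; the core count below is just an illustration):

library(parallel)  # 'multicore' later became part of 'parallel'
x <- rnorm(10000)
# Spread the 10,000 sd() calls across 8 strands instead of one.
res <- mclapply(1:10000, function(i) sd(x), mc.cores = 8)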

CPU usage different?

I have a basic question.
If I run an executable (Release build, Visual Studio 2010) on two computers that have the same CPU speed but run two different Windows operating systems, e.g. Windows 7 vs. XP, should I expect to see different CPU usage when I measure it with Task Manager? Is CPU speed the only factor in measuring CPU usage?
Thanks.
Sar
Different OS's? Yes.
Operating systems are the go-between between the programs you run and the bare metal they run on. As OSes change and evolve, they naturally add and remove features that consume resources: things that run in the background, or changes to the manner in which the OS speaks to the hardware.
Also, the measurement of CPU usage is done by the OS. There isn't a tachometer on chips saying "running at 87% of redline", but rather that "tach" is constructed largely by the OS.
After better understanding your situation: I would suggest taking a look at Performance Monitor (perfmon.exe), which ships with both XP and Win7 and gives much finer-grained detail about processor usage levels. Another (very good) option would be to run a profiler on your application on both OSes and compare the results; specifically benchmarking your application on both OSes would likely be the best option.
Even on the same OS you should expect to see different usages, because there are so many factors that determine CPU usage.
The percentage of CPU usage listed in the task manager is not a very good indication of much of anything, except to say that a program either is, or is not using CPU. That particular statistic is derived from task switching statistics, and task switching is very sensitive to basically every single thing that's going on in a computer, from network access to memory speed to CPU temperature.
