I want to know the performance of a C application using MPI on a computing cluster, but I only have one server, whose CPUs are Intel Xeon E5-2692v2 with an accelerator card (Xeon Phi). Are there any tools that can simulate this? I know one called MPI-SIM, but unfortunately it was written for the IBM SP2 and is rather old.
We are working on a project with ASP.NET Core 6 using Visual Studio 2022, and the build process gets stuck at
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\Roslyn\csc.exe
The build completes successfully, but it is slow: it takes about 1 minute and 30 seconds.
How can we reduce the build time?
Any help would be much appreciated.
I am working on Blazor in VS2022, and every change requires recompilation or partial compilation (hot reload), which was painfully slow.
These are the changes I recommend for speeding up build times.
CPU
Get a processor with a high turbo clock rate, around 4-5 GHz.
If you are running a laptop, try to get an Intel processor whose model number ends with the letter H, for example the latest 12700H/12900H. These are extremely fast laptop processors that can outperform many desktop CPUs.
Ensure your computer is using the Windows Performance power profile (or equivalent) so that your CPU is not being throttled to save power.
DISK
First prize is a Gen4 NVMe drive paired with a computer that supports Gen4 NVMe. Second prize is any NVMe drive.
ENCRYPTION
First prize is not to use disk encryption at all, but if you do need it, opt for hardware encryption, as software encryption consumes CPU resources and leaves less for compiling. Hardware encryption uses the SSD's own internal encryption engine (which is always active anyway) to do the work.
My own testing showed roughly a 40% loss in write performance with software encryption.
RAM
Make sure you have enough RAM that Windows does not swap memory to disk while compiling your project. Most often 16 GB of RAM is sufficient, but I personally prefer 32 GB so that Windows can cache more in memory.
VS2022
Disable Visual Studio analyzers during build (for example via the RunAnalyzersDuringBuild MSBuild property). Some have reported noticeably shorter build times with analyzers turned off.
I own an Intel Neural Compute Stick 2 that I intend to use to run object detection networks.
After installing OpenVINO on my machine (Ubuntu 18.04), I tried running the object detection Python demo on a video. When running it on the Intel stick, I get a speed of around 7.5 frames per second, while running it on my laptop's Intel CPU is a lot faster at 44 frames per second.
Even though my laptop is a decent gaming laptop, I was surprised that processing on the Intel stick is so much slower. I plan to use the stick on another device, not my laptop, but I would like to understand why there is such a big difference in performance. Has anyone had a similar experience?
You're getting the expected performance from the Intel® Neural Compute Stick 2.
Check out the following discussions regarding the performance of Intel® Neural Compute Stick 2.
Raspberry Pi and Movidius NCS Face Recognition
Intel Neural Compute Stick 2 related tests
Battle of Edge AI — Nvidia vs Google vs Intel
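If you want to quantify the gap on your own model, a minimal sketch like the one below can help. It assumes the older IECore Python API (OpenVINO 2020/2021-era releases) and placeholder model paths; adjust for your installation.

import time
import numpy as np
from openvino.inference_engine import IECore  # older OpenVINO Python API

ie = IECore()
# "model.xml"/"model.bin" are placeholders for your IR files
net = ie.read_network(model="model.xml", weights="model.bin")
input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
dummy = np.random.rand(*shape).astype(np.float32)

for device in ("CPU", "MYRIAD"):  # MYRIAD = Neural Compute Stick 2
    exec_net = ie.load_network(network=net, device_name=device)
    start = time.perf_counter()
    for _ in range(100):
        exec_net.infer(inputs={input_name: dummy})
    elapsed = time.perf_counter() - start
    print(device, "%.1f inferences/s" % (100 / elapsed))

The CPU plugin benefits from wide SIMD units and high clock speeds, while the stick's Myriad X VPU is built for low power, so a gap of this size is normal.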
The peak GFLOPS of the cores for the desktop i7-4770K @ 4 GHz is 4 GHz * 8 (AVX) * 4 (two FMA units * 2 FLOPs each) * 4 cores = 512 GFLOPS. But the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS. Some algorithms will therefore run even faster on the IGP, and combining the cores with the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 takes up over 30% of the die now. It seems clear which direction Intel desktop processors are headed.
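As a rough sanity check of that arithmetic (the IGP figures below are assumptions based on commonly quoted EU counts and clocks, not measurements):

# Peak single-precision GFLOPS = clock (GHz) * FLOPs per cycle per core * cores
def peak_gflops(clock_ghz, flops_per_cycle, cores):
    return clock_ghz * flops_per_cycle * cores

# i7-4770K: AVX (8 floats) * 2 FMA units * 2 FLOPs per FMA = 32 FLOPs/cycle/core
print(peak_gflops(4.0, 32, 4))    # 512.0
# Iris Pro 5200 (assumed): 16 SP FLOPs per EU per cycle, 40 EUs at ~1.3 GHz
print(peak_gflops(1.3, 16, 40))   # 832.0, i.e. "over 800 GFLOPS"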
As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of OpenCL/OpenGL. I'm curious how one can program the Intel HD Graphics hardware for compute (e.g. SGEMM) without OpenCL.
Added comment:
There is no Intel support for HD Graphics and OpenCL on Linux. I found Beignet, which is an open-source attempt to add that support to Linux, at least for Ivy Bridge HD Graphics. I have not tried it. Presumably the people developing Beignet know how to program the HD Graphics hardware without OpenCL, then.
Keep in mind that there is a performance hit to copy the data to the video card and back, so this must be taken into account. AMD is close to releasing APU chips that have unified memory for the CPU and GPU on the same die, which will go a long way towards alleviating this problem.
The way the GPU used to be utilized before CUDA and OpenCL was to represent the memory to be operated on as a texture, using DirectX or OpenGL. Thank goodness we don't have to do that anymore!
AMD is really pushing the APU / OpenCL model, so more programs should take advantage of the GPU via OpenCL - if the performance trade-off is there. Currently, GPU computing is a bit of a niche market, relegated to high-performance computing or number crunching that just isn't needed for web browsing and word processing.
It doesn't make sense anymore for vendors to let you program using the low-level ISA:
It's very hard and most programmers won't use it.
It keeps them from adjusting the ISA in future revisions.
So programmers use a language (like C99 in OpenCL) and the runtime does ISA-specific optimizations right on the user's machine.
An example of what this enables: AMD switched from VLIW vector machines to scalar machines and existing kernels still ran (most ran faster). You couldn't do this if you wrote ISA directly.
Programming a coprocessor like Iris without OpenCL is rather like driving a car without the steering wheel.
OpenCL is designed to expose the requisite parallelism that Iris needs to achieve its theoretical performance. You can't just spawn hundreds of threads or processes on it and expect performance. Having blocks of threads doing the same thing, at the same time, on similar memory addresses, is the whole crux of the matter.
Maybe you can think of a better paradigm than OpenCL for achieving that goal, but until you do, I suggest learning some OpenCL. If you are into Python, pyopencl is a great place to start.
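For instance, a minimal pyopencl sketch (buffer and kernel names are purely illustrative) that squares a vector on whichever OpenCL device is available might look like this:

import numpy as np
import pyopencl as cl

a = np.random.rand(1_000_000).astype(np.float32)

ctx = cl.create_some_context()   # picks an available OpenCL device, e.g. the Intel IGP
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# Each work-item handles one element: same operation, same time, neighbouring addresses
prg = cl.Program(ctx, """
__kernel void square(__global const float *a, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] * a[gid];
}
""").build()

prg.square(queue, a.shape, None, a_buf, out_buf)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)

The structure is the point: one kernel, many work-items doing identical work on adjacent data, and explicit copies to and from the device.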
I noticed an interesting problem. If I run the following code in R 2.12.0 (32-bit) on a Windows machine with a 3.00 GHz Core 2 Duo CPU and 2 GB of RAM, it runs in less than one second. If I run it on a Unix box with sparc-sun-solaris2.10 (also 32-bit, though the Unix box could run 64-bit), it takes 84 seconds. The processor speed of the Unix box is 2.5 GHz. If I run top while the code is running, I notice that my R process only uses up to ~3.2% of the available CPU, even when more is available. Could this be part of the problem? I read the installation manual, but nothing jumped out at me as the obvious solution. Is the Unix operating system somehow limiting available resources while Windows is not? Or is there some preferable way to compile R from source that was not done? I apologize if I have not given enough information to answer the problem; this is not really my area of expertise.
t0 <- proc.time()[[3]]
x <- rnorm(10000)
for(i in 1:10000){
  sd(x)
}
print(proc.time()[[3]]-t0)
Processors such as the T1 or T2 have a number of cores, and each core has a number of strands (hardware threads with hardware-level context switching). If you can run a multithreaded application, you'll get high throughput. A typical intended use case would be a Java-based web server processing, e.g., 20-40 connections at the same time.
The downside of this type of processor is that the single-threaded performance of these SPARC chips is quite low. It looks like Oracle is aware of the issue; current development on the T4 focuses on improving single-threaded speed.
The T1 processor exposes 32 logical CPUs to the operating system. If this is your case, and the displayed value is the percentage of total computing power, then 1/32 ≈ 3.125%, which is close to what you saw.
To squeeze all the performance from a T1 processor, you need to make R use multiple CPUs, for example via the multicore package.
I have a basic question.
If I run an executable (Release build, Visual Studio 2010) on two computers with the same CPU speed but running two different Windows operating systems, e.g. Windows 7 vs. XP, should I expect to see different CPU usage when I measure it in the Task Manager? Is CPU speed the only factor in measuring CPU usage?
Thanks.
Sar
Different OSes? Yes.
Operating systems are the go-between between the programs you run and the bare metal they run on. As OSes change and evolve, they naturally add and remove features that consume resources - things that run in the background, or changes to the manner in which the OS talks to the hardware.
Also, the measurement of CPU usage is done by the OS. There isn't a tachometer on chips saying "running at 87% of redline", but rather that "tach" is constructed largely by the OS.
Now that I better understand your situation: I would suggest taking a look at Performance Monitor (perfmon.exe), which ships with both XP and Windows 7 and gives you much finer-grained detail about processor usage levels. Another (very good) option would be to run a profiler on your application on both OSes and compare the results. Specifically benchmarking your application on both OSes would likely be the best option.
Even on the same OS you should expect to see different usages, because there are so many factors that determine CPU usage.
The percentage of CPU usage listed in the Task Manager is not a very good indication of much of anything, except to say that a program either is, or is not, using the CPU. That particular statistic is derived from task-switching statistics, and task switching is very sensitive to basically everything that's going on in a computer, from network access to memory speed to CPU temperature.