I saw a similar question here, but it had no useful answer, so let me ask it again.
My application is using too much CPU. As a test case I picked a slow site (https://www.hao123.com/) from a top-10 list of the slowest sites and read it with 100 threads simultaneously.
There is no response processing; one request takes about 5 seconds, so it seems logical to me that my threads should use about 0% CPU, since they spend almost all of their time waiting for a response.
procedure mth.Execute;
var
  HTTP: TIdHTTP;                       // HTTP, ssl and s are assumed to be locals here;
  ssl: TIdSSLIOHandlerSocketOpenSSL;   // StartWork and GlobalTiming are shared with other threads
  s: string;
  tm1: DWORD;
begin
  HTTP := TIdHTTP.Create;
  HTTP.ConnectTimeout := 60000;
  HTTP.ReadTimeout := 60000;
  ssl := TIdSSLIOHandlerSocketOpenSSL.Create;
  HTTP.IOHandler := ssl;
  HTTP.HandleRedirects := True;
  HTTP.ProtocolVersion := pv1_1;
  repeat
    Sleep(5);
    if StartWork then
    begin
      tm1 := TimeGetTime;
      s := HTTP.Get('https://www.hao123.com/');
      GlobalTiming := (GlobalTiming * 9 + (TimeGetTime - tm1)) / 10;
    end;
  until Terminated;
  HTTP.Free;
  ssl.Free;
end;
The test application starts by creating the threads with StartWork=false. As long as I don't set StartWork:=true, CPU load is about 0%.
UPDATE: to answer the comments below: 100 threads running the sleep(5) loop DO NOT load the CPU.
As soon as I start the readers by setting StartWork:=true, I see 10% CPU load on my 16-core Ryzen. When running on a 1-core VDS, this turns into a really painful problem.
The question is: how can a simple operation that should just be waiting use that much CPU, and how can it be "optimized"?
UPDATE 2:
It is hard to convince people that the issue has nothing to do with the sleep(5), so here are 2 more pictures:
I've replaced sleep(5) with sleep(100 + random(100)).
The picture below is from a 2-core VDS:
I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
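For concreteness, the hot loop presumably comes from something like the following hypothetical C++ reconstruction of the disassembly above (not the actual source; the names and exact indexing are guesses):
#include <cstdint>

// Follow 32-bit links through a byte-addressed buffer until the 0xffffffff
// sentinel is read, counting hops:
//   mov (%rdx,%rbx,1),%ebx     -> cur = load from base + cur
//   add $0x1,%rax              -> ++hops
//   cmp $0xffffffff,%ebx / jne -> loop while cur != sentinel
uint64_t chase(const unsigned char* base, uint32_t start) {
    uint64_t hops = 0;
    uint32_t cur = start;
    do {
        cur = *reinterpret_cast<const uint32_t*>(base + cur);  // load-to-load dependency chain
        ++hops;
    } while (cur != 0xFFFFFFFFu);
    return hops;
}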
Remember that the cycles event has to pick an instruction to blame, even if both the mov load and the macro-fused cmp-and-branch uop are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/.)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.
I have the following dummy code to test out the TPL in F#. (Mono 4.5, Xamarin Studio, quad-core MacBook Pro)
To my surprise, all the processes are done on the same thread. There is no parallelism at all.
open System
open System.Threading
open System.Threading.Tasks

let doWork (num:int) (taskId:int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            ()
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))

[<EntryPoint>]
let main argv =
    let t2 = Task.Factory.StartNew(fun () -> doWork 10 2)
    //printfn "launched t2"
    Console.WriteLine("launched t2")
    let t1 = Task.Factory.StartNew(fun () -> doWork 8 1)
    Console.WriteLine("launched t1")
    let t3 = Task.Factory.StartNew(fun () -> doWork 10 3)
    Console.WriteLine("launched t3")
    let t4 = Task.Factory.StartNew(fun () -> doWork 5 4)
    Console.WriteLine("launched t4")
    Task.WaitAll(t1, t2, t3, t4)
    0 // return an integer exit code
However, if I increase the thread sleep time from 10 ms to 100 ms, I can see a little parallelism.
What have I done wrong? What does this mean? I did consider the possibility that the CPU finishes the work before the TPL can start the task on a new thread, but that doesn't make sense to me: I can make the inner dummy loop (for j in 1 .. 1000 do ()) run 1000 times longer, and the result is the same: no parallelism (with Thread.Sleep set to 10 ms).
The same code in C#, on the other hand, produces the desired result: all tasks print their messages in a mixed order (rather than sequentially).
Update:
As suggested, I changed the inner loop to do some 'actual' work, but the result is still execution on a single thread.
Update 2:
I don't quite understand Luaan's comments, but I just ran a test on a friend's PC, and with the same code parallelism works (without the thread sleep). It looks like it has something to do with Mono. But could Luaan explain again what I should expect from the TPL? If I have tasks that I want to run in parallel, taking advantage of a multicore CPU, isn't the TPL the way to go?
Update 3:
I have tried Fyodor Soikin's suggestion again, with dummy code that won't be optimized away. Unfortunately, the workload still does not make Mono's TPL use multiple threads. Currently the only way I can get Mono's TPL to allocate multiple threads is to force the existing thread to sleep for more than 20 ms. I am not qualified to assert that Mono is wrong, but I can confirm that the same code (same benchmark workload) behaves differently under Mono and Windows.
It looks like the Sleeps are ignored completely - see how the Task 2 loop is printed even before the next task is launched, which is just silly - if the thread actually waited for 10 ms, there's no way that could happen.
I'd assume that the cause might be the timer resolution in the OS. The Sleep is far from accurate - it might very well be that Mono (or Mac OS) decides that since they can't reliably make you run again in 10ms, the best choice is to simply let you run right now. This is not how it works on Windows - there you're guaranteed to lose control as long as you don't Sleep(0); you'll always sleep at least as long as you wanted. It seems that on Mono / Mac OS, the idea is the reverse - the OS tries to let you sleep at most the amount of time you specified. If you want to sleep for less time than is the timer precision, too bad - no sleep.
But even if they are not ignored, there's still not a lot of pressure on the thread pool to give you more threads. You're only blocking for less than 100ms, for four tasks in a line - that's not nearly enough for the pool to start creating new threads to handle the requests (on MS.NET, new threads are only spooled after not having any free threads for 200ms, IIRC). You're simply not doing enough work for it to be worth it to spool up new threads!
The point you might be missing is that Task.Factory.StartNew is not actually starting any new threads, ever. Instead, it's scheduling the associated task on the default task scheduler - which just puts it in the thread pool queue, as tasks to execute "at earliest convenience", basically. If there's one free thread in the pool, the first tasks starts running there almost immediately. The second will run when there's another thread free etc. Only if the thread usage is "bad" (i.e. the threads are "blocked" - they're not doing any CPU work, but they're not free either) is the threadpool going to spawn new threads.
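For what it's worth, if the goal really is to give each of these tasks its own dedicated thread instead of relying on the pool's heuristics, you can hint the scheduler with TaskCreationOptions.LongRunning. A minimal sketch, replacing one of the StartNew lines above (with the default Microsoft scheduler this option typically gets a dedicated thread; Mono's behaviour may differ):
// Hint the default scheduler that this task is long-running / mostly blocking,
// so it should not tie up a regular thread-pool thread.
let t2 = Task.Factory.StartNew((fun () -> doWork 10 2), TaskCreationOptions.LongRunning)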
If you look at the IL output from this program, you'll see that the inner loop is optimized away, because it doesn't have any side effects, and its return value is completely ignored.
To make it count, put something non-optimizable there, and also make it heavier: 1000 empty cycles is hardly noticeable compared to the cost of spinning up a new task.
For example:
let doWork (num:int) (taskId:int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            Debug.WriteLine("x")   // needs: open System.Diagnostics
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))
Update:
Adding a pure function, such as your fact, is no good. The compiler is perfectly able to see that fact has no side effects and that you duly ignore its return value, and therefore, it is perfectly cool to optimize it away. You need to do something that the compiler doesn't know how to optimize, such as Debug.WriteLine above.
I've written this little MPI_Allreduce benchmark: bench_mpi.cxx.
It works well with Open MPI 1.8.4 and MPICH 1.4.1.
The results (one column for the number of processes and one column for the corresponding wall-clock time) are here or here.
With MPICH 3.1.4, the wall-clock time increases for 7, 8 or more processes: the results are here.
In a real code (edit: computational fluid dynamics software), but with all three of the MPI implementations above, I observe the same problem for 7, 8 or more processes, while I expect my code to scale to at least 8 or 16 processes.
So I'm trying to understand what could be happening with the little benchmark under MPICH 3.1.4.
edit
Here is a zoom into the figure Rob Latham gives in his answer.
What is the code doing during the green rectangle? The MPI_Allreduce operation starts too late.
edit
I've posted another question about a much simpler code (just the time to execute MPI_Barrier).
It's interesting you don't see this with OpenMPI or with earlier versions of MPICH, but the way your code is set up seems guaranteed to cause problems for any MPI collective.
You've given each process a variable amount of work to do. The problem with that is the introduction of "pseudo-synchronization" -- the time other MPI processes spend waiting for the laggard to catch up and participate in the collective.
With point-to-point messaging the costs are clear, and probably follow a LogP model.
Collectives have an additional cost: sometimes a process is blocked waiting for a participating process to send it some needed information. In Allgather, well, all the processes have a data dependency on one another.
When you have variable-sized work units, no process can make progress until the largest/slowest processor finishes.
If you instrument with MPE and display the trace in Jumpshot, it's easy to see this effect:
I've added (see https://gist.github.com/roblatham00/b981fc875d3852c6b63f) red boxes for work, and the purple boxes are the default allgather color. The second iteration shows this most clearly: rank 0 spends almost no time in allgather. Ranks 2, 3, 4, and 5 have to wait for the slowpokes to catch up.
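You can also see the effect without a tracing tool by timing the wait and the collective separately. Below is a minimal sketch, not the actual bench_mpi.cxx (the variable names and the reduced value are illustrative): the time spent in the barrier is the pseudo-synchronization cost described above, and with equal work per rank it should shrink to almost nothing.
// Sketch: separate "waiting for the slowest rank" from the collective itself.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0, global = 0.0;

    // ... variable amount of per-rank work would go here ...

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);            // time spent here = waiting for the slowest rank
    double t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    std::printf("rank %d of %d: wait %.6f s, allreduce %.6f s\n",
                rank, size, t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
}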
I'm running a kernel on a big array. When I profile the clEnqueueNDRangeKernel command, the execution time (end - start) is 0.001 ms, but the time between submit and start (start - submit) is around 120 ms, and it varies with the size of the input data. What happens from the moment a command is submitted until it starts to execute? Is it reasonable to see such a large time?
OpenCL operates asynchronously. That is to say that when you ask for a piece of work to be done, it may not happen at that time. It will happen at some time in the future. This is a little weird, especially when you start profiling things, but it works like this so that the CPU can queue up lots of work for the OpenCL device, and then go do something else while the work is done.
For example:
clEnqueueWriteBuffer(blah);
clEnqueueNDRangeKernel(blah);
clEnqueueReadBuffer(blah, but blocking_read = CL_TRUE);
Here, the writeBuffer and the NDRange will probably appear to take very small amounts of time. All they'll do is record what needs to be done. The blocking readBuffer will take a long time, because it has to wait for the results: the write and the kernel execution have to complete before the read can even start.
Now the read itself might be very small, but because it's waiting for everything before it to finish, the time it appears to take depends on the amount of work in the commands before it.
I don't quite understand what you're measuring from your question, but I expect what you're seeing is this effect. The time for work is being charged to other functions because they have to wait for previous work to finish.
Knowing which functions cause the CPU to wait on the GPU is one of the big tricks when it comes to writing high performance code. Any time you introduce a wait like this, the CPU stops doing any useful work, and the GPU is likely to go idle whilst the CPU prepares the next lump of work. Sometimes there's no alternative, and you just have to wait.
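For reference, here is one way to pull the four profiling timestamps apart for a single command. This is a sketch, not the asker's code: it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that evt is the cl_event returned by the clEnqueueNDRangeKernel call.
// Break a single enqueued command into queued -> submit -> start -> end.
// Timestamps are device ticks in nanoseconds.
cl_ulong queued, submitted, started, ended;
clWaitForEvents(1, &evt);  // make sure the command has actually completed
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued),    &queued,    NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(started),   &started,   NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(ended),     &ended,     NULL);
printf("queued->submit %.3f ms, submit->start %.3f ms, start->end %.3f ms\n",
       (submitted - queued) * 1e-6, (started - submitted) * 1e-6, (ended - started) * 1e-6);
If queued->submit or submit->start dominates, the time is going not to the kernel itself but to everything that has to happen before it can run: earlier commands in the queue, data transfers, or the runtime getting around to flushing the queue.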
I have a Windows service which needs to execute around 10000 schedules (it needs to send/execute data for all the members).
For one member it takes 3 to 5 seconds; for 10000 schedules it takes around 10 minutes or so.
But I need to execute all these schedules within one minute.
Thanks in advance.
Assuming you need to do parallel processing, you'd better read this doc here to get to know the paradigm and avoid common pitfalls (it's for .NET 4.0, but I suggest you read it regardless because it goes over the basic concepts).
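To give a flavour of that paradigm, here is a minimal sketch using Parallel.ForEach. It is written in F# (like the examples earlier on this page), and Schedule and processSchedule are hypothetical placeholders for your own types and per-member work, not something from your service.
open System
open System.Threading.Tasks

// Hypothetical placeholder for whatever one "schedule" looks like in the service.
type Schedule = { MemberId: int }

let processSchedule (s: Schedule) =
    // the 3-5 second send/execute work for one member would go here
    ()

let runAll (schedules: Schedule seq) =
    // The per-member work is mostly waiting on I/O, so allow more concurrency than CPU cores.
    let options = ParallelOptions(MaxDegreeOfParallelism = Environment.ProcessorCount * 4)
    Parallel.ForEach(schedules, options, (fun s -> processSchedule s)) |> ignore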
If you can push the processing time down to < 2 seconds per task, then I'd suggest you don't mess with parallel processing (it's likely to complicate your life in ways you cannot imagine).