I am working on a monitoring project that monitors different targets (switches, routers, PCs, …) via SNMP, each at its own interval (mostly 60 seconds). Depending on the number of elements being monitored, some monitoring operations take a long time, sometimes more than 30 seconds. The problem appears when there are more than 2000 targets and the majority of the operations are slow: the monitoring operations start to overlap, apparently because there is no available thread to run them, even though the CPU has four cores. I tried the options below.
I found that there is a gap between when a monitor enters the queue and when it actually starts to run. In fact, (startMonitoringOperation − startMonitoring) is a large interval, sometimes more than 2 minutes.
I want to know the best practice for long-running synchronous operations, and how I can decrease the gap mentioned above.
for (int i = 0; i < MonitorsQueue.Count; i++)
{
    var startMonitoring = DateTime.Now;

    // Option 1:
    ThreadPool.QueueUserWorkItem(new System.Threading.WaitCallback(DoMonitor), monitorObject);
    // Option 2:
    var task = new Task(() => DoMonitor(monitorObject), TaskCreationOptions.LongRunning);
    // Option 3:
    var task = new Task(() => DoMonitor(monitorObject));
}

private void DoMonitor(object monitorObject) // this method takes a long time in some cases
{
    var startMonitoringOperation = DateTime.Now;
    // monitoring operation
}
That seems to be a pool-size problem. Increase the number of threads in the ThreadPool class (to, say, 3000) to avoid this. However, I suggest you use the async/await feature of C# instead of the thread pool to run such operations, as they are not CPU-intensive; they are just waiting for I/O to complete their job.
With async/await, no thread is consumed during the wait time, which is exactly what you can benefit from.
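To make the idea concrete outside C#: below is a minimal Python asyncio sketch (the `poll_target` name and timings are made up for illustration) showing that hundreds of concurrent waits do not need hundreds of threads:

```python
import asyncio

async def poll_target(target_id: int) -> str:
    # Stands in for an SNMP round-trip; while the coroutine awaits,
    # no OS thread is tied up on its behalf.
    await asyncio.sleep(0.01)
    return f"target-{target_id}: ok"

async def poll_all(count: int) -> list:
    # All polls run concurrently on a single thread; gather preserves
    # the order of its arguments in the result list.
    return await asyncio.gather(*(poll_target(i) for i in range(count)))

results = asyncio.run(poll_all(200))
print(len(results), "targets polled")
```

All 200 polls finish in roughly one poll's time, because the waits overlap instead of queuing behind a limited pool of threads.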
Good luck,
Ahmad
Related
I am reading a large XML file using XmlReader and am exploring potential performance improvements via Async & pipelining. The following initial foray into the world of Async is showing that the Async version (which for all intents and purposes at this point is the equivalent of the Synchronous version) is much slower. Why would this be? All I've done is wrapped the "normal" code in an Async block and called it with Async.RunSynchronously
Code
open System
open System.IO.Compression // support assembly required + FileSystem
open System.Xml // support assembly required

let readerNormal (reader: XmlReader) =
    let temp = ResizeArray<string>()
    while reader.Read() do
        ()
    temp

let readerAsync1 (reader: XmlReader) =
    async {
        let temp = ResizeArray<string>()
        while reader.Read() do
            ()
        return temp
    }

let readerAsync2 (reader: XmlReader) =
    async {
        while reader.Read() do
            ()
    }

[<EntryPoint>]
let main argv =
    let path = @"C:\Temp\LargeTest1000.xlsx"
    use zipArchive = ZipFile.OpenRead path
    let sheetZipEntry = zipArchive.GetEntry(@"xl/worksheets/sheet1.xml")

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerNormal reader
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerAsync1 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    readerAsync2 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    printfn "DONE"
    System.Console.ReadLine() |> ignore
    0 // return an integer exit code
INFO
I am aware that the above Async code does not do any actual Async work - what I am trying to ascertain here is the overhead of simply making it Async.
I don't expect it to go faster just because I've wrapped it in an Async. My question is the opposite: why the dramatic (IMHO) slowdown.
TIMINGS
A comment below correctly pointed out that I should provide timings for datasets of various sizes, which is what led me to ask this question in the first place.
The following are some times based on small vs large datasets. While the absolute values are not too meaningful, the relativities are interesting:
30 elements (small dataset)
Normal: 00:00:00.0006994
Async1: 00:00:00.0036529
Async2: 00:00:00.0014863
(A lot slower but presumably indicative of Async setup costs - this is as expected)
1.5 million elements
Normal: 00:00:01.5749734
Async1: 00:00:03.3942754
Async2: 00:00:03.3760785
(~ 2x slower. Surprised that the difference in timing is not amortized as the dataset gets bigger. If this is the case, then pipelining/parallelization can only improve performance here if you have more than two cores - to outweigh the overhead that I can't explain...)
There's no asynchronous work to do. In effect, all you get is the overheads and no benefits. async {} doesn't mean "everything in the braces suddenly becomes asynchronous". It simply means you have a simplified way of using asynchronous code - but you never call a single asynchronous function!
Additionally, "asynchronous" doesn't necessarily mean "parallel", and it doesn't necessarily involve multiple threads. For example, when you do an asynchronous request to read a file (which you're not doing here), it means that the OS is told what you want to be done, and how you should be notified when it is done. When you run code like this using RunSynchronously, you're simply blocking one thread while posting asynchronous file requests - a scenario pretty much identical to using synchronous file requests in the first place.
The moment you do RunSynchronously, you throw away any reason whatsoever to use asynchronous code in the first place. You're still using a single thread, you just blocked another thread at the same time - instead of saving on threads, you waste one, and add another to do the real work.
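The point is language-agnostic; a small Python sketch (illustrative only, not the F# code above) shows that wrapping purely synchronous work in a coroutine and running it to completion on the calling thread yields the same result, just with extra bookkeeping and no concurrency:

```python
import asyncio

def count_sync(n: int) -> int:
    total = 0
    for _ in range(n):
        total += 1
    return total

async def count_async(n: int) -> int:
    # The same CPU-bound loop wrapped in a coroutine: nothing in here
    # ever awaits, so the wrapper adds overhead but no concurrency.
    total = 0
    for _ in range(n):
        total += 1
    return total

n = 100_000
plain = count_sync(n)
# asyncio.run blocks the calling thread until the coroutine completes,
# just as RunSynchronously does: same result, same blocking, extra cost.
wrapped = asyncio.run(count_async(n))
```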
EDIT:
Okay, I've investigated with the minimal example, and I've got some observations.
The difference is absolutely brutal with a profiler on - the non-async version is somewhat slower (up to 2x), but the async version is just never ending. It seems as if a huge number of allocations is going on - and yet, when I break the profiler, I can see that the non-async version (running in 4 seconds) makes a hundred thousand allocations (~20 MiB), while the async version (running over 10 minutes) only makes mere thousands. Maybe the memory profiler interacts badly with F# async? The CPU time profiler doesn't have this problem.
The generated IL is very different for the two cases. Most importantly, even though our async code doesn't actually do anything asynchronous, it creates a ton of async builder helpers, sprinkles a ton of (asynchronous) Delay calls through the code, and going into outright absurd territory, each iteration of the loop is an extra method call, including the setup of a helper object.
Apparently, F# automatically translates while into an asynchronous while. Now, given how well compressed xlsx data usually is, very little I/O is involved in those Read operations, so the overhead absolutely dominates - and since every iteration of the "loop" has its own setup cost, the overhead scales with the amount of data.
While this is mostly caused by the while not actually doing anything, it also obviously means that you need to be careful about what you select as async, and you need to avoid using it in a case where CPU time dominates (as in this case - after all, both the async and non-async case are almost 100% CPU tasks in practice). This is further worsened by the fact that Read reads a single node at a time - something that's relatively trivial even in a big, non-compressed xml file. The overheads absolutely dominate. In effect, this is analogous to using Parallel.For with a body like sum += i - the setup cost of each of the iterations absolutely dwarfs any actual work being done.
The CPU profiling makes this rather obvious - the two most work intensive methods are:
XmlReader.Read (expected)
Thread::intermediateThreadProc - also known as "this code runs on a thread pool thread". The overhead from this in a no-op code like this is around 100% - yikes. Apparently, even though there is no real asynchronicity anywhere, the callbacks are never run synchronously. Every iteration of the loop posts work to a new thread pool thread.
The lesson learned? Probably something like "don't use loops in async if the loop body does very little work". The overhead is incurred for each and every iteration of the loop. Ouch.
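One common mitigation, sketched here in Python on the assumption that the same per-step cost applies in any async framework: keep the hot loop synchronous and pay the async machinery once per chunk rather than once per element:

```python
import asyncio

def parse_chunk(chunk: list) -> int:
    # Tight synchronous loop: no async setup cost per element.
    return sum(chunk)

async def process(data: list, chunk_size: int) -> int:
    total = 0
    for start in range(0, len(data), chunk_size):
        total += parse_chunk(data[start:start + chunk_size])
        # Yield to the scheduler between chunks, not between elements,
        # so the per-step overhead is amortized over chunk_size items.
        await asyncio.sleep(0)
    return total

data = list(range(10_000))
result = asyncio.run(process(data, chunk_size=1_000))
```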
Asynchronous code doesn't magically make your code faster. As you've discovered, it'll tend to make isolated code slower, because there's overhead involved with managing the asynchrony.
What it can do is to be more efficient, but that's not the same as being inherently faster. The main purpose of Async is to make Input/Output code more efficient.
If you invoke a 'slow', blocking I/O operation directly, you'll block the thread until the operation returns.
If you instead invoke that slow operation asynchronously, it may free up the thread to do other things. It does require that there's an underlying implementation that's not thread-bound, but uses another mechanism for receiving the response. I/O Completion Ports could be such a mechanism.
Now, if you run a lot of asynchronous code in parallel, it may turn out to be faster than attempting to run the blocking implementation in parallel, because the async versions use fewer resources (fewer threads = less memory).
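That efficiency claim is easy to check in miniature. This Python sketch (delays chosen arbitrarily) runs ten 0.1-second "I/O" waits concurrently on a single thread, so total wall time is close to the longest wait, not the sum of all ten:

```python
import asyncio
import time

async def fake_io(delay: float) -> float:
    # Stands in for a slow, non-blocking I/O call.
    await asyncio.sleep(delay)
    return delay

async def run_all() -> list:
    # Ten 0.1 s waits overlap on one thread instead of queuing up.
    return await asyncio.gather(*(fake_io(0.1) for _ in range(10)))

start = time.perf_counter()
results = asyncio.run(run_all())
elapsed = time.perf_counter() - start
print(f"{len(results)} waits finished in {elapsed:.2f}s")
```

Blocking versions of the same ten waits on one thread would take about a full second; here they complete in roughly a tenth of that, with no extra threads.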
I have the following dummy code to test out TPL in F#. (Mono 4.5, Xamarin studio, quad core MacBook Pro)
To my surprise, all the processes are done on the same thread. There is no parallelism at all.
open System
open System.Threading
open System.Threading.Tasks
let doWork (num: int) (taskId: int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            ()
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))

[<EntryPoint>]
let main argv =
    let t2 = Task.Factory.StartNew(fun () -> doWork 10 2)
    //printfn "launched t2"
    Console.WriteLine("launched t2")
    let t1 = Task.Factory.StartNew(fun () -> doWork 8 1)
    Console.WriteLine("launched t1")
    let t3 = Task.Factory.StartNew(fun () -> doWork 10 3)
    Console.WriteLine("launched t3")
    let t4 = Task.Factory.StartNew(fun () -> doWork 5 4)
    Console.WriteLine("launched t4")
    Task.WaitAll(t1, t2, t3, t4)
    0 // return an integer exit code
However, if I increase the thread sleep time from 10 to 100ms, I can see a little parallelism.
What have I done wrong? What does this mean? I did consider the possibility that the CPU finished the work before the TPL could start the task on a new thread, but this doesn't make sense to me. I can make the inner dummy loop for j in 1 .. 1000 do () loop 1000 times more; the result is the same: no parallelism (Thread.Sleep is set to 10 ms).
The same code in C# on the other hand, produces the desired results: all tasks print the message to the window in a mixed order (rather than sequential order)
Update:
As suggested, I changed the inner loop to do something "actual", but the result is still execution on a single thread.
Update 2:
I don't quite understand Luaan's comments but I just did a test on a friend's PC. And with the same code, parallelism is working (without thread sleep). It looks like something to do with Mono. But can Luaan explain what I should expect from TPL again? If I have tasks that I want to perform in parallel and taking advantage of the multicore CPU, isn't TPL the way to go?
Update 3:
I have tried out @FyodorSoikin's suggestion again with dummy code that won't be optimized away. Unfortunately, the workload is still not able to make Mono's TPL use multiple threads. Currently the only way I can get Mono's TPL to allocate multiple threads is to force a sleep on the existing thread for more than 20 ms. I am not qualified enough to assert that Mono is wrong, but I can confirm that the same code (same benchmark workload) behaves differently under Mono and Windows.
It looks like the Sleeps are ignored completely - see how the Task 2 loop is printed even before launching the next task, that's just silly - if the thread waited for 10ms, there's no way for that to happen.
I'd assume that the cause might be the timer resolution in the OS. The Sleep is far from accurate - it might very well be that Mono (or Mac OS) decides that since they can't reliably make you run again in 10ms, the best choice is to simply let you run right now. This is not how it works on Windows - there you're guaranteed to lose control as long as you don't Sleep(0); you'll always sleep at least as long as you wanted. It seems that on Mono / Mac OS, the idea is the reverse - the OS tries to let you sleep at most the amount of time you specified. If you want to sleep for less time than is the timer precision, too bad - no sleep.
But even if they are not ignored, there's still not a lot of pressure on the thread pool to give you more threads. You're only blocking for less than 100ms, for four tasks in a line - that's not nearly enough for the pool to start creating new threads to handle the requests (on MS.NET, new threads are only spooled after not having any free threads for 200ms, IIRC). You're simply not doing enough work for it to be worth it to spool up new threads!
The point you might be missing is that Task.Factory.StartNew is not actually starting any new threads, ever. Instead, it's scheduling the associated task on the default task scheduler - which just puts it in the thread pool queue, as tasks to execute "at earliest convenience", basically. If there's one free thread in the pool, the first tasks starts running there almost immediately. The second will run when there's another thread free etc. Only if the thread usage is "bad" (i.e. the threads are "blocked" - they're not doing any CPU work, but they're not free either) is the threadpool going to spawn new threads.
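The same scheduling behavior can be observed with any thread pool. In this illustrative Python sketch, eight short tasks are queued to a two-worker pool; they are served by at most two threads rather than one new thread each, because submitting only enqueues the work:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def quick_task(n: int) -> tuple:
    # Short, CPU-light work; report which pool thread ran it.
    return n, threading.get_ident()

# submit() does not start a thread per task: tasks go into the pool's
# queue and are picked up by whichever worker thread is free next.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(quick_task, i) for i in range(8)]
    results = [f.result() for f in futures]

thread_ids = {tid for _, tid in results}
print(f"8 tasks ran on {len(thread_ids)} thread(s)")
```

With work this short, it is common to see a single worker thread handle every task, which is exactly the "no parallelism" symptom described in the question.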
If you look at the IL output from this program, you'll see that the inner loop is optimized away, because it doesn't have any side effects, and its return value is completely ignored.
To make it count, put something non-optimizable there, and also make it heavier: 1000 empty cycles is hardly noticeable compared to the cost of spinning up a new task.
For example:
open System.Diagnostics // required for Debug.WriteLine

let doWork (num: int) (taskId: int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            Debug.WriteLine("x")
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))
Update:
Adding a pure function, such as your fact, is no good. The compiler is perfectly able to see that fact has no side effects and that you duly ignore its return value, and therefore, it is perfectly cool to optimize it away. You need to do something that the compiler doesn't know how to optimize, such as Debug.WriteLine above.
I am using Visual Studio 2010 to test an ASP.NET web service. I want to be able to set a fairly low goal for requests/second (say 20) and achieve it with good precision (say ±2%). I am using the perfmon counter 'Requests/Sec' from the category ASP.NET Apps v4.0.30319.
My WebTests vary a lot in duration, and the load test keeps varying the number of users. When I look at the results graph, the RPS oscillates between 0 and 60.
Is there some way to set VS to fire a constant RPS from the controller's standpoint, or even a fixed number of tests per second?
Maybe a counter that measures requests initiated per second instead of requests completed per second, so that the variation in turnaround time does not make a difference.
The only way to do this is using multiple threads - potentially up to RPS * ConnectionTimeout of them, if all the requests have to wait and you're not using async IO.
You can cheat and use the ThreadPool (but be aware it makes no guarantees re: scheduling). There's also the Task Parallel Library if you're on .Net 4.
For ~20 RPS I doubt the lack of ThreadPool precision is going to be an issue. As you get higher RPS, that will change.
If you want to DIY, be aware it's difficult to do properly. A simple implementation would be to have a main worker thread launching new threads for each connection...
Private ShutdownRequested As Boolean = False

Public Sub StartLoad(RPS As Integer)
    ShutdownRequested = False
    Dim MainThread As New Threading.Thread(Sub() Worker(RPS))
    MainThread.Start()
End Sub

Public Sub StopLoad()
    ShutdownRequested = True
End Sub

Private Sub Worker(RPS As Integer)
    While Not ShutdownRequested
        Threading.Thread.Sleep(CInt(1 / RPS * 1000))
        Dim WorkerThread As New Threading.Thread(Sub() OpenAConnection())
        WorkerThread.Start()
    End While
End Sub
A better way would be to use the async IO mechanism described here as it allows for many connections with a limited number of threads.
This article may also be of interest re: threading
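For completeness, the pacing idea can also be sketched in Python (`fire_request` here is a hypothetical stand-in for opening a connection). Scheduling launches against absolute deadlines, rather than sleeping a fixed interval each iteration, keeps per-iteration overhead from accumulating into drift:

```python
import threading
import time

launch_times: list = []

def fire_request() -> None:
    # Hypothetical stand-in for opening a connection to the server.
    launch_times.append(time.perf_counter())

def paced_load(rps: int, duration: float) -> list:
    # Launch against absolute deadlines: next_due advances by a fixed
    # interval, so loop/thread-start overhead does not slow the rate.
    interval = 1.0 / rps
    next_due = time.perf_counter()
    end = next_due + duration
    threads = []
    while next_due < end:
        t = threading.Thread(target=fire_request)
        t.start()
        threads.append(t)
        next_due += interval
        delay = next_due - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
    return threads

for t in paced_load(rps=20, duration=0.5):
    t.join()
```

At 20 RPS over half a second this fires exactly 10 requests; a naive Sleep(1/RPS) loop would fall short by however long each launch takes.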
To the best of my knowledge, event-driven programs require a main loop such as
while (1) {
}
I am just curious whether this while loop can cost high CPU usage. Is there any other way to implement event-driven programs without using a main loop?
Your example is misleading. Usually, an event loop looks something like this:
Event e;
while ((e = get_next_event()) != E_QUIT)
{
handle(e);
}
The crucial point is that the function call to our fictitious get_next_event() pumping function will be generous and encourage a context switch or whatever scheduling semantics apply to your platform, and if there are no events, the function would probably allow the entire process to sleep until an event arrives.
So in practice there's nothing to worry about, and no, there's not really any alternative to an unbounded loop if you want to process an unbounded amount of information during your program's runtime.
Usually, the problem with a loop like this is that while it's doing one piece of work, it can't be doing anything else (e.g. Windows SDK's old 'cooperative' multitasking). The next naive jump up from this is generally to spawn a thread for each piece of work, but that's incredibly dangerous. Most people would end up with an executor that generally has a thread pool inside. Then, the handle call is actually just enqueueing the work and the next available thread dequeues it and executes it. The number of concurrent threads remains fixed as the total number of worker threads in the pool and when threads don't have anything to do, they are not eating CPU.
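A minimal sketch of that executor pattern, in Python for illustration: the blocking queue plays the role of get_next_event, a None sentinel plays E_QUIT, and idle workers sleep inside get() without burning CPU:

```python
import queue
import threading

events: queue.Queue = queue.Queue()
handled: list = []
lock = threading.Lock()

def worker() -> None:
    # get() blocks without spinning: an idle worker sleeps inside the
    # call until an event arrives, so the "loop" costs no CPU when idle.
    while True:
        event = events.get()
        if event is None:  # sentinel playing the role of E_QUIT
            break
        with lock:
            handled.append(event)  # "handle(e)" - just record it here

pool = [threading.Thread(target=worker) for _ in range(3)]
for t in pool:
    t.start()
for e in range(20):
    events.put(e)          # enqueue work; next free worker dequeues it
for _ in pool:
    events.put(None)       # one quit sentinel per worker
for t in pool:
    t.join()
```

The number of concurrent threads stays fixed at the pool size, and slow handlers no longer stall the dispatch loop itself.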
I've written a Qt console application to try out QSemaphores and noticed some strange behavior. Consider a semaphore with 1 resource and two threads getting and releasing a single resource. Pseudocode:
QSemaphore sem(1); // init with 1 resource available

thread1()
{
    while(1)
    {
        if ( !sem.tryAcquire(1 resource, 1 second timeout) )
        {
            print "thread1 couldn't get a resource";
        }
        else
        {
            sem.release(1);
        }
    }
}

// basically the same thing
thread2()
{
    while(1)
    {
        if ( !sem.tryAcquire(1 resource, 1 second timeout) )
        {
            print "thread2 couldn't get a resource";
        }
        else
        {
            sem.release(1);
        }
    }
}
Seems straightforward, but the threads will often fail to get a resource. A way to fix this is to put the thread to sleep for a bit after sem.release(1). What this tells me is that release() does not give other threads waiting in tryAcquire() access to the semaphore before the current thread loops back to the top of while(1) and grabs the resource again.
This surprises me because similar testing with QMutex showed proper behavior... i.e. another thread hanging out in QMutex::tryLock(timeout) gets notified properly when QMutex::unlock() is called.
Any ideas?
I'm not able to fully test this out or find all of the supporting links at the moment, but here are a few observations...
First, the documentation for QSemaphore.tryAcquire indicates that the timeout value is in milliseconds, not seconds. So your threads are only waiting 1 millisecond for the resource to become free.
Secondly, I recall reading somewhere (I unfortunately can't remember where) a discussion about what happens when multiple threads are trying to acquire the same resource simultaneously. Although the behavior may vary by OS and situation, it seemed that the typical result is that it is a free-for-all with no one thread being given any more priority than another. As such, a thread waiting to acquire a resource would have just as much chance of getting it as would a thread that had just released it and is attempting to immediately reacquire it. I'm unsure if the priority setting of a thread would affect this.
So, why might you get different results for a QSemaphore versus a QMutex? Well, I think a semaphore may be a more complicated system resource that would take more time to acquire and release than a mutex. I did some simple timing recently for mutexes and found that on average it was taking around 15-25 microseconds to lock or unlock one. In the 1 millisecond your threads are waiting, this would be at least 20 cycles of locking and unlocking, and the odds of the same thread always reacquiring the lock in that time are small. The waiting thread is likely to get at least one bite at the apple in the time that it is waiting, so you won't likely see any acquisition failures when using mutexes in your example.
If, however, releasing and acquiring a semaphore takes much longer (I haven't timed them but I'm guessing they might), then it's more likely that you could just by chance get a situation where one thread is able to keep reacquiring the resource repeatedly until the wait condition for the waiting thread runs out.
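The barging behavior described above is easy to reproduce with any counting semaphore. In this Python sketch (timings chosen arbitrarily), a brief pause after release() is what gives the waiting thread a chance, mirroring the sleep-after-release fix from the question:

```python
import threading
import time

sem = threading.Semaphore(1)
acquired_by: list = []
stop = threading.Event()

def contender(name: str) -> None:
    while not stop.is_set():
        if sem.acquire(timeout=1.0):
            acquired_by.append(name)
            sem.release()
            # Without this pause, the releasing thread can loop around
            # and barge back in before the waiter wakes up; a brief
            # yield after release gives the other thread a fair chance.
            time.sleep(0.001)

threads = [threading.Thread(target=contender, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
time.sleep(0.2)
stop.set()
for t in threads:
    t.join()

winners = set(acquired_by)
```

Remove the sleep and one thread can monopolize the semaphore for long stretches: release() makes the resource available, but nothing guarantees the longest-waiting thread wins the subsequent race.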